Andromeda Cluster is hiring a Performance Engineer - AI Infrastructure

Responsibilities

Profile and optimize training workloads end to end, identifying bottlenecks in GPU kernel execution, NCCL communication, and storage I/O.
Work with systems engineering teams to enhance scheduling efficiency, collective communication performance, and kernel-level execution.
Develop and maintain precise monitoring tools to track and visualize model FLOPs utilization (MFU), throughput metrics, and cluster availability.
Design structured technical procedures such as incident response protocols and postmortem analyses to prevent recurring performance issues.

Work Arrangement

Hybrid

Required Skills

Python, C++, Rust, CUDA, PyTorch, TensorFlow, Kubernetes, Linux, Distributed Systems

About the company

Andromeda Cluster gives early-stage startups access to scaled AI infrastructure. It works with leading AI labs, data centers, and cloud providers to deliver compute globally, routing training and inference jobs across global supply. Its long-term vision is to build the liquidity layer for global AI compute.

Job Details

Category: Infrastructure

Posted 4 months ago