San Francisco, USA / Remote (Global)

Andromeda Cluster is hiring a Performance Engineer - AI Infrastructure

Andromeda Cluster is hiring a Performance Engineer to join our AI Infrastructure team. This role focuses on optimizing the efficiency and throughput of our massive-scale AI clusters. You will profile end-to-end training runs to identify bottlenecks in compute, communication, and storage, translating performance data into concrete engineering improvements.

What You'll Do

  • Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O.
  • Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution.
  • Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime.
  • Design technical processes (e.g., postmortem reviews, incident response) to help the team operate effectively and avoid repeating performance regressions.

What We're Looking For

  • Strong systems intuition and a drive to optimize performance, digging into systems to understand how they interact from the training loop down to the hardware.
  • Proven experience running distributed training jobs on multi-GPU systems or HPC clusters.
  • Strong programming skills in Python and C++.
  • Solid understanding of PyTorch, JAX, or TensorFlow, and how large-scale training loops are built.
  • Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code.
  • A passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements.

Nice to Have

  • Experience with Rust or CUDA.
  • Low-level mastery: experience with Linux kernel tuning and eBPF, and an understanding of systems design tradeoffs at the hardware level.
  • Hands-on experience with GPUs, TPUs, or Trainium, and the networking libraries that power them (NCCL, MPI, UCX).
  • Expertise in security best practices for high-scale infrastructure.
  • Familiarity with monitoring tools like Prometheus and Grafana.

Technical Stack

  • Languages: Python, C++, Rust, CUDA
  • Frameworks: PyTorch, JAX, TensorFlow
  • Infrastructure: Kubernetes, Linux
  • Low-Level Tools: eBPF, NCCL, MPI, UCX
  • Monitoring: Prometheus, Grafana

Team & Environment

This role is part of the Growth team. It is a builder’s role with ownership and autonomy to shape how systems run.

Work Mode

This position is open to remote work globally, with optional hubs in San Francisco.

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Required Skills
Python, C++, Rust, CUDA, PyTorch, JAX, TensorFlow, Kubernetes, Linux, eBPF, AI Infrastructure, Performance Engineering, Distributed Systems, Benchmarking, Profiling
About company
Andromeda Cluster
Andromeda Cluster gives early-stage startups access to scaled AI infrastructure. It works with leading AI labs, data centers, and cloud providers to deliver compute globally, routing training and inference jobs across global supply. Its long-term vision is to build the liquidity layer for global AI compute.
Job Details
Category infrastructure
Posted 2 months ago