Responsibilities
- Deploying clusters of 1,000+ GPUs using custom-written playbooks, and modifying these tools as necessary to provide the right solution for each customer.
- Validating correctness and performance of underlying compute, storage, and networking infrastructure, and working with providers to optimize these subsystems.
- Migrating petabytes of data from public cloud platforms to local storage as quickly and cost-effectively as possible.
- Debugging issues anywhere in the stack, from “this server’s fan is blocked by a plastic bag” to “these S3 dataloaders are slow because they read from buckets in different regions”.
- Building internal tooling to decrease deployment time and increase cluster reliability, including automation where the customer benefits clearly outweigh the implementation overhead.
Requirements
- 2+ years of SRE, DevOps, Sysadmin, and/or HPC engineering experience.
- Great verbal and written communication skills in English.
- Experience deploying and operating Kubernetes and/or Slurm clusters.
- Experience writing Go, Python, and Bash.
- Experience using Ansible, Terraform, and other automation or IaC tools.
- Strong engineering background, preferably in Computer Science, Software Engineering, Math, Computer Engineering, or similar fields.
Nice to Have
- You have built and operated an AI workload at 1,000+ GPU scale.
- You have built multi-tenant, hyperscale Kubernetes-based services.
- You have physically deployed infrastructure in a datacenter, or managed bare-metal hardware via MAAS, NetBox, or similar tools.
- You have deployed and managed multi-tenant InfiniBand or RoCE networks.
- You have deployed and managed petabyte-scale all-flash storage systems, including DDN, VAST, and/or Weka; or Ceph, Lustre, or similar open-source tools.
Benefits
- Competitive total compensation package (cash + equity).
- Retirement or pension plan, in line with local norms.
- Health, dental, and vision insurance.
- Generous PTO policy, in line with local norms.
- Fluidstack is remote-first, with offices in London, New York, and SF. For all other locations, we provide access to WeWork.
Work Arrangement
Remote (Worldwide)
Additional Information
- This role includes an on-call rotation of up to one week per month.


