Responsibilities
- Architect, deploy, and manage scalable Kubernetes environments tailored for AI training and inference workloads
- Operate and fine-tune Slurm-managed high-performance computing clusters for distributed large language model training
- Create reliable APIs and workflow orchestration systems supporting training and inference pipelines
- Develop resource allocation and job scheduling frameworks across diverse computing platforms
- Benchmark system performance, diagnose bottlenecks, and apply targeted optimizations to training and inference systems
- Design monitoring, alerting, and observability tooling specific to machine learning operations on Kubernetes and Slurm
- Troubleshoot infrastructure incidents rapidly and coordinate with cross-functional teams to maintain high uptime for critical training runs and inference services
- Improve cluster efficiency through intelligent autoscaling and resource utilization strategies for fluctuating workloads
Compensation
Competitive salary and equity package
Work Arrangement
Hybrid
Team
Collaborative engineering team focused on AI systems infrastructure
Available for qualified candidates