Responsibilities
- Architect, launch, and manage scalable Kubernetes clusters supporting AI model training and inference workloads
- Operate and enhance Slurm-based high-performance computing environments used for distributed training of large language models
- Create reliable APIs and workflow orchestration systems for training pipelines and inference platforms
- Develop job scheduling and resource allocation systems across heterogeneous compute infrastructure
- Evaluate system performance, identify bottlenecks, and apply optimizations in training and inference environments
- Design monitoring, alerting, and observability frameworks specific to machine learning workloads on Kubernetes and Slurm
- Respond promptly to infrastructure failures and coordinate with cross-functional teams to ensure continuous operation of critical AI systems
- Improve cluster efficiency and implement dynamic autoscaling to meet fluctuating workload demands