Architect, launch, and manage scalable Kubernetes clusters supporting AI model training and inference workloads
Operate and enhance Slurm-based high-performance computing environments used for distributed training of large language models
Create reliable APIs and workflow orchestration systems for training pipelines and inference platforms
Develop job scheduling and resource allocation systems across diverse computing infrastructure
Evaluate system performance, identify bottlenecks, and apply optimizations in training and inference environments
Design monitoring, alerting, and observability frameworks specific to machine learning workloads on Kubernetes and Slurm
Respond promptly to infrastructure failures and coordinate with cross-functional teams to ensure continuous operation of critical AI systems
Improve cluster efficiency and implement dynamic autoscaling to meet fluctuating workload demands

Perplexity is hiring a Member of Technical Staff (AI Infrastructure Engineer)