Build and operate the foundational infrastructure that powers AI/ML research and product development in a hybrid environment. Focus on scalability, reliability, and automation across cloud and on-premises systems using modern platform engineering practices.
Responsibilities
- Design and maintain a robust, scalable Kubernetes platform that runs both on AWS and in on-premises environments to support diverse applications and services.
- Implement Infrastructure-as-Code using Terraform to manage and version infrastructure across multiple environments.
- Monitor system performance and reliability using observability tools like Prometheus and Grafana.
- Automate deployment, scaling, and failover of AI/ML workloads using Kubernetes operators and CI/CD pipelines.
- Collaborate with ML engineers to optimize GPU resource allocation and scheduling via Slurm and Kubernetes device plugins.
- Ensure high availability and disaster recovery readiness for critical platform components.
- Develop and maintain internal developer platform tooling to streamline service deployment and configuration.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- 3+ years of experience in site reliability, platform engineering, or systems administration.
- Strong proficiency in Kubernetes, containerization, and cloud infrastructure (AWS).
- Hands-on experience with Infrastructure-as-Code tools such as Terraform or Pulumi.
- Solid understanding of networking, distributed systems, and Linux internals.
- Experience with monitoring, logging, and observability stacks (e.g., Prometheus, Loki, Grafana).
Tech Stack
Kubernetes, Terraform, AWS (EC2, S3, EKS, VPC), Prometheus, Grafana, Slurm, Docker, GitLab CI/CD, ArgoCD
Benefits
- Comprehensive health, dental, and vision insurance
- 401(k) matching program
- Unlimited paid time off
- Flexible work hours and remote-friendly policy
- Annual learning and development stipend
- Onsite and virtual wellness programs
- Company-sponsored tech talks and hackathons
- Parental leave policy
- Employee resource groups and inclusion initiatives
- Free healthy meals and snacks in office
- Commuter benefits program
- Stocked kitchens and game rooms
- Annual retreats and team-building events
- Mental health support and counseling services
- Pet insurance option
- Volunteer time off program
Work Arrangement
Hybrid (remote and on-site options available)
Additional Information
- This role participates in a rotating on-call schedule for incident response.
- Candidates must be located in the U.S. or Canada for tax and compliance reasons.
- We are committed to building a diverse and inclusive team.
- The hiring process includes a technical screening, a system design interview, and a culture-fit discussion.
- Relocation assistance is available for eligible candidates.


