Responsibilities
- Design, deploy, and maintain production-grade Kubernetes clusters, including custom scheduling, GPU allocation, and optimized workload orchestration
- Implement and manage Infrastructure as Code (IaC) for repeatable and reliable environments
- Set up and maintain observability tools like Prometheus and Grafana for performance monitoring and alerting
- Ensure the scalability and reliability of distributed systems across various environments
- Independently access remote systems using SSH and perform operational and debugging tasks
- Collaborate with product and engineering teams to align infrastructure with application needs
Requirements
- Kubernetes Expert – Hands-on experience with cluster management, custom schedulers, and GPU-focused workloads
- Infrastructure as Code – Strong experience using Terraform, Pulumi, or similar tools
- Observability-First Mindset – Proficient with monitoring and logging tools (e.g., Prometheus, Grafana)
- Distributed Systems Knowledge – Understanding of how to scale and maintain resilient infrastructure across services
- Operational Independence – Comfortable working directly on remote systems (SSH) and resolving issues end-to-end
Nice to Have
- Experience with container runtime internals (e.g., containerd, CRI-O)
- Familiarity with edge computing or decentralized architectures
- Contributions to open-source projects
Benefits
- Work on high-impact projects in the open-source ecosystem
- Remote-first with flexibility across time zones
- Lean team, real ownership, and space to innovate
- Chance to shape infrastructure for fast-growing software ventures


