Requirements
- 10+ years of software and infrastructure engineering experience, including significant experience operating infrastructure-as-code platforms in cloud-first organizations.
- Experience designing and operating large-scale Kubernetes platforms and scaling compute services on Kubernetes; experience with related cloud-native technologies such as ArgoCD, Argo Rollouts, and Istio.
- Deep understanding of Kubernetes platform architecture and operations, including workload isolation, autoscaling, networking, service mesh management, ingress patterns, observability, upgrades, and multi-tenant cluster design.
- Experience designing and maintaining CI/CD systems for both infrastructure-as-code deployments and application delivery workflows (e.g., Terragrunt, Atlas, ArgoCD, Octopus Deploy, Travis CI).
- Experience building scalable infrastructure-as-code platforms using Terraform and related tooling, including modular architectures, remote state management, policy enforcement, deployment orchestration, and reusable infrastructure patterns.
- Experience with monitoring and observability tooling and practices (metrics, logs, traces) and managing them at scale, including major observability platforms such as Grafana, Datadog, and Honeycomb.
- Comfortable implementing and securing services in Google Cloud Platform as infrastructure-as-code, including GCP projects, VPC networks, Google Kubernetes Engine, IAM roles, groups, and policies, and secure networking patterns.
- Experience designing secure-by-default infrastructure including least-privilege access controls, workload identity, network segmentation, secret management, auditability, and compliance-oriented platform controls.
- Strong operational instincts and experience debugging complex distributed systems, leading incident response efforts, and improving reliability through automation and observability.
- Experience balancing developer experience, platform governance, operational reliability, and organizational scalability in fast-growing engineering environments.
- Experience with backend languages (e.g., Python, Go, Node.js, Rust).
Nice to Have
- Up to date on industry best practices and tools, and enjoys learning new things.
- Excited about being hands-on while also driving platform direction, architecture decisions, and operational maturity in a fast-moving and supportive environment.
- Willing to pitch in wherever needed; as a fast-moving startup, we need to do good work quickly.
- Demonstrates strong curiosity and a proactive interest in AI, actively exploring and applying emerging technologies.
Work Arrangement
Hybrid — San Francisco, New York, Pittsburgh
Additional Information
- This role is approximately 80% infrastructure focused and 20% application software focused.
- This role has a rotational on-call schedule.
- You will have the opportunity to shape incident response practices, operational standards, and platform reliability strategy for the team and throughout the organization.