Responsibilities
- Operate and scale Kong’s global SaaS platform (Konnect), ensuring reliability, availability, and performance across regions and clouds.
- Build, automate, and maintain Kubernetes-based infrastructure and deployment workflows using Terraform/Terragrunt, Helm, and ArgoCD.
- Design, maintain, and optimize multi-region data and caching layers — including PostgreSQL, Redis, ClickHouse, and Druid — for high availability and low latency.
- Operate and improve Kong Gateway and Kong Mesh environments supporting hybrid and distributed architectures.
- Develop and maintain CI/CD pipelines and GitOps workflows to automate service delivery and ensure consistent infrastructure changes.
- Enhance observability and incident response readiness with tools such as Datadog, Prometheus, Grafana, and Thanos, and define and track SLOs.
- Collaborate closely with development and security teams to ensure smooth operation of SaaS services in compliance with reliability, security, and regulatory standards.
- Participate in a global 24/7 on-call rotation and drive continuous improvement of operational playbooks and postmortem practices.
- Lead and contribute to scaling initiatives that improve elasticity, reliability, and cost-efficiency across the SaaS platform.
Requirements
- BS in Computer Science or equivalent practical experience.
- Proven experience managing SaaS or PaaS systems at enterprise scale (multi-region, multi-tenant, secure environments).
- Deep expertise in Kubernetes, including debugging cluster/networking issues and designing for fault tolerance and scalability.
- Strong proficiency with Infrastructure as Code tools like Terraform or Terragrunt.
- Experience with CI/CD pipelines and GitOps workflows (ArgoCD, Atlantis, Helm).
- Proficiency in one or more programming languages (Go, Python, Bash) for automation and tooling.
- Solid understanding of Linux/Unix systems, networking (DNS, TLS/SSL, HTTP), load balancers, and distributed systems.
- Experience working with API gateway and service mesh technologies.
- Familiarity with streaming systems like Kafka and observability platforms (Datadog, Prometheus, Grafana).
- Experience working in a 24/7/365 production support environment.
Nice to Have
- Hands-on experience with Kong Gateway, Kong Mesh, or similar service connectivity technologies.
- Experience operating ClickHouse, Druid, or other time-series and analytics databases.
- Experience managing PostgreSQL and Redis in multi-region configurations.
- Working knowledge of AWS networking (PrivateLink, Transit Gateway, VPC Peering, Firewalls), Azure VNet, or GCP NCC.
- Strong understanding of disaster recovery, resiliency testing, and compliance-driven reliability practices.
Work Arrangement
Remote (Worldwide)
Additional Information
- Participate in a global 24/7 on-call rotation
- You are encouraged to apply even if you do not meet all of the criteria.