Join a global engineering team focused on building and operating a robust public cloud infrastructure platform. In this role, you will shape the foundation of a distributed system that powers real-time communication services at scale. Your expertise will ensure the platform meets rigorous standards for uptime, security, efficiency, and performance across multiple regions and cloud providers.
Key Responsibilities
- Assess system designs for operational resilience, focusing on availability, security posture, performance efficiency, and cost-effectiveness.
- Collaborate with development teams to forecast infrastructure needs and align capacity planning with product growth trajectories.
- Architect and evolve a multi-account, multi-region AWS environment to support high availability and fault tolerance.
- Automate infrastructure provisioning and management using Terraform, CloudFormation, and Ansible, integrated into CI/CD workflows via GitHub Actions and Atlantis.
- Enforce secure access controls through well-scoped IAM policies, roles, and permission boundaries based on least-privilege principles.
- Build and maintain observability systems including monitoring, alerting, log aggregation, and automated remediation workflows.
- Document system architectures, operational procedures, and incident response runbooks.
- Serve as a technical escalation point for critical issues, coordinating with networking teams and external vendors when needed.
- Lead initiatives to optimize cloud spending through resource right-sizing, reserved instance planning, and lifecycle automation.
What We’re Looking For
- Degree in Computer Science or equivalent practical experience.
- Minimum of five years in cloud infrastructure roles, with deep experience in AWS services such as VPC, EC2, S3, IAM, Lambda, CloudTrail, and Organizations.
- Proven background in Linux system administration, particularly Ubuntu and RHEL, including networking and security hardening.
- Hands-on experience with infrastructure-as-code tools like Terraform and Ansible, and CI/CD pipelines using Git-based systems.
- Proficiency in scripting languages such as Bash or Python for automation and tooling.
- Familiarity with monitoring solutions such as CloudWatch, Prometheus, or Grafana.
- Strong analytical and troubleshooting abilities in complex, distributed systems.
- Excellent communication skills and experience working across time zones in a collaborative, remote-first environment.
Nice to Have
- Experience with Kubernetes, particularly EKS, and containerized workloads.
- Knowledge of large-scale cloud networking concepts like Transit Gateway, peering, and multi-region routing.
- Exposure to cloud cost management tools such as AWS Cost Explorer, AWS Cost and Usage Reports (CUR), or FinOps platforms.
Work Environment
This position is part of a geographically distributed team that collaborates across time zones. You’ll work in an environment that values DevOps principles, shared ownership of systems, and continuous improvement through feedback and iteration.