As a Senior Site Reliability Engineer, you will play a key role in shaping the foundation of our cloud-native systems. You'll design and maintain highly available infrastructure that supports continuous delivery, real-time data processing, and secure, compliant operations at scale.
What You'll Do
- Architect and manage Kubernetes-based environments on AWS, including EKS clusters, ensuring high availability and efficient scaling.
- Develop and enforce Infrastructure as Code (IaC) practices using Terraform to automate provisioning and maintain consistency across environments.
- Build and optimize CI/CD pipelines in GitLab to support reliable, repeatable deployments across services.
- Oversee database infrastructure, including RDS-hosted Postgres and MySQL, managing migrations, replication, and performance.
- Design and maintain event-driven architectures using Kafka and Kinesis for asynchronous service communication and real-time data flow.
- Implement comprehensive monitoring and observability solutions to detect, diagnose, and resolve issues proactively.
- Collaborate with engineering and compliance teams to meet regulatory requirements including HIPAA, PCI DSS, SOC 2, ISO 27001, and HITRUST.
- Use security tools like Wiz to identify and remediate infrastructure risks, maintaining a strong security posture across cloud assets.
What We're Looking For
- Proven experience with AWS cloud services and advanced networking across multiple regions.
- Strong background in containerization with Docker and orchestration via Kubernetes and EKS.
- Hands-on expertise with Terraform for managing infrastructure at scale.
- Experience managing relational databases and leading decentralized migration efforts.
- Familiarity with compliance frameworks and audit processes, particularly in regulated environments.
- Working knowledge of security best practices and tools, including cloud-native security platforms.
- Proficiency with AI-powered engineering tools to enhance development and operations workflows.
Preferred Skills
- Programming experience in Python, Node.js, Golang, or Bash.
- Strong troubleshooting abilities in production environments.
- Experience with GitOps workflows and tools such as ArgoCD.
- Knowledge of managed Kafka services such as Amazon MSK.
- Background in IoT, edge computing, or hardware integration, especially in operational environments.


