Responsibilities
- Design and manage infrastructure as code for AWS services like RDS and EC2 alongside physical Linux servers using Terraform, Terragrunt, and Ansible.
- Create custom automation tools and scripts in Python or Go to handle complex operational workflows not supported by existing solutions.
- Automate the full lifecycle of databases including provisioning, performance tuning, and schema changes in close collaboration with database administrators.
- Implement and maintain detailed monitoring systems for databases and infrastructure health using Prometheus and related technologies.
- Enforce security standards by managing secrets and access controls centrally with HashiCorp Vault.
- Lead knowledge transfer through documentation, mentoring, and active participation in improving team-wide infrastructure practices.
Requirements
- Minimum of three years of experience in DevOps or Site Reliability Engineering with a focus on automating infrastructure.
- Demonstrated ability to operate and maintain both AWS cloud environments and on-premises Linux bare-metal servers.
- Advanced proficiency with configuration management and infrastructure-as-code tools, specifically Ansible and Terraform/Terragrunt.
- Strong programming skills in Python or Go for developing internal tools and automation scripts.
- In-depth knowledge of Linux internals, including disk I/O performance and network configuration.
Nice to Have
- Practical experience managing Kubernetes clusters, either on AWS via EKS or self-hosted deployments.
- Experience or interest in running databases as stateful workloads in Kubernetes using Helm charts or Operators.
- A commitment to mentoring teammates and promoting best practices in automation and container orchestration.
Benefits
- Work on high-scale systems where automation is mandatory and manual intervention is actively eliminated.
- Be part of a collaborative team where your technical decisions directly influence system reliability.
- Access to a modern technology stack and the autonomy to adopt new tools that address real operational challenges.
- Opportunities for professional growth through learning programs and complex technical tasks.
- Relocation support including travel and temporary accommodation for up to two weeks.
- Support for language development, including partial reimbursement for language courses.
- Annual birthday gift as part of company culture celebrations.
Responsibilities
- Design and maintain IaC for both AWS (RDS, EC2) and physical bare-metal servers using Terraform/Terragrunt and Ansible.
- Develop custom tools and scripts (using Python or Go) to automate complex workflows that off-the-shelf tools can't handle.
- Automate provisioning, configuration tuning, and schema migrations. Work closely with DBAs to eliminate manual toil through code.
- Build and maintain high-fidelity monitoring for databases and system health using the Prometheus stack.
- Centralize secret management and access control using HashiCorp Vault.
- Act as a subject matter expert, documentation enthusiast, and mentor for the team as we evolve our infrastructure practices.
Required
- 3+ years in DevOps/SRE with a focus on infrastructure automation.
- Proven experience managing both Cloud (AWS) and On-prem (Linux bare-metal) environments.
- Expert-level knowledge of Ansible and Terraform/Terragrunt.
- Strong proficiency in Python or Go for infrastructure scripting and internal tooling.
- Deep understanding of Linux systems, disk I/O optimization, and networking.
Preferred
- Hands-on experience with K8s (EKS or self-managed).
- Experience (or a strong desire to lead the transition) of running stateful workloads (Databases) inside Kubernetes using Operators or Helm.
- A passion for teaching and sharing K8s/Automation best practices with colleagues to level up the entire team.
Benefits
- The chance to work on complex, high-load systems where "manual" is a forbidden word.
- A collaborative environment where your architectural decisions directly impact the stability of the product.
- Modern stack and the freedom to introduce new tools that solve real problems.
- Learning and development opportunities and interesting challenging tasks
- Relocation package (tickets, staying in a hotel for 2 weeks)
- Opportunity to develop language skills and partial compensation for the cost of language classes
- Birthday celebration present