About the Role
You will design, build, and maintain reliable systems, combining software engineering with operational practice to support large-scale distributed services.
Responsibilities
- Design and deploy scalable infrastructure solutions
- Monitor system performance and respond to incidents
- Implement automated recovery and self-healing mechanisms
- Collaborate with development teams to improve service reliability
- Define and track key reliability metrics
- Troubleshoot complex production issues
- Optimize system availability and latency
- Develop tools for operational efficiency
- Maintain documentation for systems and processes
- Support deployment pipelines and CI/CD workflows
- Enforce security and compliance standards
- Participate in on-call rotations
- Conduct post-incident reviews
- Improve observability through logging and alerting
- Reduce technical debt in production systems
- Evaluate new technologies for operational impact
- Coordinate incident response
- Plan capacity to meet demand
- Promote best practices in reliability engineering
- Integrate feedback loops for continuous improvement
Nice to Have
- Master's degree in a technical field
- Experience with large-scale microservices architectures
- Contributions to open-source projects
- Certifications in cloud or DevOps platforms
- Background in machine learning infrastructure
- Experience with service-level objectives and error budgets
- Knowledge of chaos engineering principles
- Prior work in AI-driven technology environments
- Leadership in cross-functional initiatives
- Published technical content or conference talks
Compensation
Competitive salary and benefits package
Work Arrangement
Remote, based in Brazil
Team
Collaborative engineering team focused on scalable systems
Why This Role Matters
- This position is critical to the stability and performance of core services.
- You will directly influence system design and operational resilience.
What We Expect
- Proactive problem solving and ownership of system health.
- A mindset focused on automation, measurement, and continuous improvement.