About the Role

The role involves designing, implementing, and maintaining reliable systems by combining software engineering and operational practices to support large-scale distributed services.

Responsibilities

Design and deploy scalable infrastructure solutions
Monitor system performance and respond to incidents
Implement automated recovery and self-healing mechanisms
Collaborate with development teams to improve service reliability
Define and track key reliability metrics
Troubleshoot complex production issues
Optimize system availability and latency
Develop tools for operational efficiency
Maintain documentation for systems and processes
Support deployment pipelines and CI/CD workflows
Enforce security and compliance standards
Participate in on-call rotations
Conduct post-incident reviews
Improve observability through logging and alerting
Reduce technical debt in production systems
Evaluate new technologies for operational impact
Drive incident response coordination
Ensure capacity planning keeps pace with demand
Promote best practices in reliability engineering
Integrate feedback loops for continuous improvement

Nice to Have

Master's degree in a technical field
Experience with large-scale microservices architectures
Contributions to open-source projects
Certifications in cloud or DevOps platforms
Background in machine learning infrastructure
Experience with service-level objectives and error budgets
Knowledge of chaos engineering principles
Prior work in AI-driven technology environments
Leadership in cross-functional initiatives
Published technical content or conference talks

Compensation

Competitive salary and benefits package

Work Arrangement

Remote, based in Brazil

Team

Collaborative engineering team focused on scalable systems

Why This Role Matters

This position plays a critical role in ensuring the stability and performance of core services.
You will directly influence system design and operational resilience.

What We Expect

Proactive problem solving and ownership of system health.
A mindset focused on automation, measurement, and continuous improvement.


Articul8 AI is hiring a Senior Site Reliability Engineer (SRE) - (Brazil)