Articul8 AI is hiring a Senior Site Reliability Engineer (SRE) - Brazil

About the Role

The role involves designing, implementing, and maintaining reliable systems by combining software engineering and operational practices to support large-scale distributed services.

Responsibilities

  • Design and deploy scalable infrastructure solutions
  • Monitor system performance and respond to incidents
  • Implement automated recovery and self-healing mechanisms
  • Collaborate with development teams to improve service reliability
  • Define and track key reliability metrics
  • Troubleshoot complex production issues
  • Optimize system availability and latency
  • Develop tools for operational efficiency
  • Maintain documentation for systems and processes
  • Support deployment pipelines and CI/CD workflows
  • Enforce security and compliance standards
  • Participate in on-call rotations
  • Conduct post-incident reviews
  • Improve observability through logging and alerting
  • Reduce technical debt in production systems
  • Evaluate new technologies for operational impact
  • Drive incident response coordination
  • Ensure capacity planning meets demand
  • Promote best practices in reliability engineering
  • Integrate feedback loops for continuous improvement

Nice to Have

  • Master's degree in a technical field
  • Experience with large-scale microservices architectures
  • Contributions to open-source projects
  • Certifications in cloud or DevOps platforms
  • Background in machine learning infrastructure
  • Experience with service-level objectives and error budgets
  • Knowledge of chaos engineering principles
  • Prior work in AI-driven technology environments
  • Leadership in cross-functional initiatives
  • Published technical content or conference talks

Compensation

Competitive salary and benefits package

Work Arrangement

Remote, based in Brazil

Team

Collaborative engineering team focused on scalable systems

Why This Role Matters

  • This position plays a critical role in ensuring the stability and performance of core services.
  • You will directly influence system design and operational resilience.

What We Expect

  • Proactive problem solving and ownership of system health.
  • A mindset focused on automation, measurement, and continuous improvement.

Required Skills

  • AWS
  • GCP
  • Microsoft Azure
  • Python
  • Go
  • Bash
  • Terraform
  • CloudFormation
  • Docker
  • Kubernetes
  • Infrastructure as Code
  • Cloud Infrastructure
  • Monitoring

About the Company
Articul8 AI
Articul8 AI creates exceptional AI products that exceed customer expectations.
Job Details
Category: Infrastructure
Posted: 10 months ago