About the Role
This role focuses on maintaining and improving the reliability of complex distributed systems by applying engineering principles to operations challenges, reducing operational toil, and enhancing system resilience.
Responsibilities
- Design and implement scalable monitoring and alerting systems
- Develop automation tools to reduce manual operational tasks
- Respond to and resolve critical production incidents
- Conduct root cause analysis for system outages
- Collaborate with development teams to improve service reliability
- Define service level indicators (SLIs) and service level objectives (SLOs), and track performance against them
- Participate in on-call rotations that require rapid response
- Optimize system performance and availability
- Implement and maintain disaster recovery procedures
- Contribute to capacity planning and scalability assessments
- Enforce best practices in configuration management
- Integrate reliability into the software development lifecycle
- Lead post-incident reviews and drive follow-up actions
- Evaluate and adopt new technologies to improve system stability
- Support deployment pipelines with reliability checks
- Maintain documentation for system architecture and incident response
- Promote a blameless culture around incidents
- Work across time zones with global teams
- Ensure compliance with security and operational standards
- Drive adoption of observability practices across teams
Nice to Have
- Master’s degree in a technical field
- Experience supporting streaming media platforms
- Knowledge of large-scale data processing systems
- Background in security operations
- Public speaking or conference presentation experience
- Open source contributions in relevant domains
- Experience with machine learning infrastructure
- Familiarity with database reliability engineering
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid work model
Team
Part of a global technology organization supporting digital platforms
What We Do
- We power digital experiences for millions of users worldwide through resilient, scalable infrastructure.
- Our teams work on real-time systems that support content delivery, user engagement, and platform stability.
Why You’ll Love It
- You’ll solve complex technical challenges at scale.
- You’ll see the direct impact of your work on user experience and platform reliability.
Available for qualified candidates


