This role is responsible for maintaining the stability and efficiency of large-scale, web-based services. You will play a central part in building and managing the infrastructure that supports continuous delivery, monitoring, and incident response across multiple platforms.
Key Responsibilities
- Manage the operational lifecycle of growing digital services, focusing on uptime, performance, and a seamless user experience
- Design, implement, and maintain deployment pipelines and architectures to support rapid, reliable releases
- Build comprehensive monitoring solutions, including dashboards, alerting systems, and escalation protocols
- Lead end-to-end incident management, from detection and response through post-mortem analysis and prevention strategies
- Develop and enforce change and configuration management practices to ensure system integrity
- Participate in a rotating on-call schedule, serving as the primary technical contact during operational events
- Conduct root cause analyses and lead initiatives to reduce operational risk and recurring issues
- Collaborate across teams to refine tools, processes, and standards that elevate service reliability company-wide
Impact and Innovation
You'll help shape how services are operated and improved, applying technical insight and operational discipline to deliver consistent, high-quality customer experiences. Your work will directly influence system resilience and the efficiency of engineering practices across the organization.


