Leads proactive support for products to ensure high availability, performance, and operational excellence. Focuses on root cause analysis, automation, incident response, and collaboration with development and operations teams to improve system reliability and efficiency.
Responsibilities
- Design, implement, and manage support systems to maintain high system availability and performance.
- Handle escalated and complex technical issues for the Production Support Operations team.
- Develop and maintain automation solutions for system provisioning, self-healing, recovery, deployment, and monitoring.
- Lead incident response efforts and conduct root cause analysis for critical system outages.
- Monitor system performance and define Service-Level Indicators and Service-Level Objectives.
- Work with Development and Operations teams to integrate reliability practices, including zero-downtime architectures.
- Proactively detect and resolve performance bottlenecks and system inefficiencies.
- Serve as the technical expert during new product development, collaborating with Product T&E, ICE, and Service Architects.
- Coordinate with internal and external stakeholders to enhance service performance and ensure system resilience.
- Ensure operational readiness for new product rollouts and support transitions.
- Be accountable for product availability and performance within the assigned scope.
- Conduct in-depth problem investigations and root cause analyses to prevent recurring incidents.
- Collaborate with Incident Management, PSOs, and Engineering teams to implement long-term fixes.
- Track and report on the effectiveness of problem resolution to drive continuous improvement.
- Maintain and optimize an event catalog with defined events, thresholds, and automated remediation actions.
- Develop and deliver event response procedures and training to improve incident handling efficiency.
- Partner with Customer Success Managers to improve customer satisfaction and retention through service enhancements.
- Produce reports, documentation, and communications on customer metrics, updates, and product changes.
- Identify and execute improvements in internal processes and workflows.
- Support knowledge management by contributing to FAQs, training materials, and internal resources.
- Enforce data governance policies and ensure compliance with established standards.
- Monitor data quality, consistency, and regulatory compliance continuously.
- Act as a Subject Matter Expert for data in the assigned domain, providing guidance and resolving queries.
Requirements
- Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related technical field.
- More than 10 years of experience in IT operations, service management, or infrastructure roles such as Site Reliability Engineer, Problem Manager, or DevOps Manager.
- Demonstrated experience managing highly available systems and ensuring operational reliability.
- Extensive background in root cause analysis, incident management, and resolving recurring service issues.
- Hands-on experience with CI/CD pipelines, automation tools, performance monitoring, and infrastructure as code.
- Proven ability to work with cross-functional teams including Development, Operations, and Engineering to enhance service delivery.
- Experience managing deployments, performing risk assessments, and improving event and problem management workflows.
- Familiarity with cloud platforms, containerization, scalable architectures, and zero-downtime deployment strategies.
- Strong expertise in AKS and on-premises Kubernetes environments.
- Proficiency in scripting with Ansible, Bash, or Python, or a combination of these.
- Experience with automation frameworks and tools.
- Working knowledge of CI/CD pipeline design and management.
- Exposure to Terraform for infrastructure provisioning.
- Proficiency in either the Azure or the AWS cloud platform.
- Basic database administration and query skills.
- Strong problem-solving abilities and adaptability to new technologies.
- A Site Reliability Engineering mindset focused on scalability, automation, and resilience.
Tech Stack
AKS, Kubernetes (On-prem), Ansible, Bash, Python, Terraform, Azure, AWS, CI/CD pipelines, Infrastructure as Code (IaC), Containerization, Cloud technologies, Monitoring tools, Automation tools
Benefits
- Work from home up to 2 days per week based on team requirements.
- Flexible workday scheduling to accommodate personal needs.
- Option to work remotely from anywhere in the world for up to 30 days per year.
- Employee Assistance Program available 24/7, 365 days a year for employees and dependents.
- Personalized wellbeing support through the Champion Health platform.
- Access to professional development resources including LinkedIn Learning, Microsoft's Enterprise Skills Initiative, Pluralsight, Harvard Business Publishing, and Stanford programs.
Compensation
Competitive benefits that align with local market conditions and employment status.
Work Arrangement
Hybrid model allowing up to 2 days per week of working from home and up to 30 days per year of working from any location worldwide.
Team
Collaborates across Development, Operations, Engineering, Product T&E, ICE, Service Architects, PSOs, and Customer Success Managers in a cross-functional structure.
- Operates in 200 countries with employees speaking 60 languages, fostering a diverse and inclusive workplace.
- Employees are empowered, supported, and encouraged to grow professionally.
- 79% of employees rate the company a Great Place to Work®.
- Combines comfortable office environments with flexible work-from-home options.
- Innovation is driven by continuous learning and professional development.
Additional Information
- Committed to equal opportunity employment. Applications are encouraged from women, Indigenous peoples, visible minorities, and individuals with disabilities, and self-identification is supported in the hiring process.


