Responsibilities
- Recruit, hire, and develop a team of senior incident response engineers distributed across AMER and APAC time zones
- Design sustainable on-call models with follow-the-sun coverage
- Provide incident command for high-severity and critical customer-impacting incidents, with your team as the primary rotation and you as the senior escalation point
- Set and enforce standards for how incidents are run: communications cadence, stakeholder engagement, domain expert coordination, and handoffs
- Drive a customer-first posture in every incident to ensure timely, accurate updates and clear ownership from detection through resolution
- Own postmortem quality end-to-end: facilitation, root cause analysis, corrective action definition, and follow-through
- Manage the Customer Root Cause Analysis (CRCA) program, ensuring timely, technically accurate, clearly written documents that restore customer trust
- Coordinate upstream technical inputs from engineering teams; synthesize ambiguity into clear, actionable narratives
- Drive an AI-centric approach to scaling incident operations, using intelligent tooling to improve triage speed, documentation quality, and pattern detection without sacrificing rigor
- Partner with observability, supportability, and resiliency sub-functions within CAR to provide critical inputs into our platform evolution
- Own and evolve the incident management tooling stack with a bias toward agentic assistance
- Analyze incident data to identify recurring patterns and feed learnings back into engineering practices
- When incident load allows, direct your team's capacity toward runbook improvements, automation, and operational hygiene
- Partner with Legal, PR, and Customer Success on customer-facing communications during and after major incidents
- Brief engineering leadership and executives during active incidents with clarity and composure
- Be the person engineering teams proactively seek out when operational standards and incident practices need to improve
Requirements
- 10+ years in SRE, incident management, or reliability engineering, with at least 5 years managing teams in this space
- Proven experience as an incident commander in high-severity, customer-impacting outages at scale. You've personally run incidents that mattered
- Cloud infrastructure experience with at least one of AWS, GCP, or Azure
- Deep understanding of distributed systems failure modes (Kafka/event streaming experience preferred, or demonstrated ability to rapidly master complex systems)
- Strong track record with postmortem facilitation and driving corrective actions to completion
- Excellent written communication with customers on root cause analysis. You are comfortable speaking with conviction to executive audiences
- Experience working with cross-functional stakeholders (legal, PR, customer success) during incident response
- Track record of hiring and developing senior technical talent in a globally distributed, remote-first environment
- Comfort operating with significant autonomy and making high-stakes decisions under pressure
Nice to Have
- Experience with incident response in a multi-cloud context
- Experience building an incident management function or team from scratch
- Familiarity with post-incident review methodologies beyond the standard '5 whys' (e.g., Learning from Incidents, resilience engineering)
- Demonstrated use of AI-assisted tooling to improve operational quality at scale
Work Arrangement
Hybrid
Team
Team size: ~5. Structure: small, specialized group
Additional Information
- Compensation Range: CA$271.6K - CA$319.1K
