Hybrid

Confluent is hiring a Senior Manager - Incident Response Engineering

Responsibilities

  • Recruit, hire, and develop a team of senior incident response engineers distributed across AMER and APAC time zones
  • Design sustainable on-call models with follow-the-sun coverage
  • Provide incident command for high-severity and critical customer-impacting incidents, with your team as the primary rotation and you as the senior escalation point
  • Set and enforce standards for how incidents are run: communications cadence, stakeholder engagement, domain expert coordination, and handoffs
  • Drive a customer-first posture in every incident to ensure timely, accurate updates and clear ownership from detection through resolution
  • Own postmortem quality end-to-end: facilitation, root cause analysis, corrective action definition, and ensuring follow-through
  • Manage the Customer Root Cause Analysis (CRCA) program, ensuring timely, technically accurate, clearly written documents that restore customer trust
  • Coordinate upstream technical inputs from engineering teams; distill ambiguous information into clear, actionable narratives
  • Drive an AI-centric approach to scaling incident operations, using intelligent tooling to improve triage speed, documentation quality, and pattern detection without sacrificing rigor
  • Partner with the observability, supportability, and resiliency sub-functions within CAR to provide critical inputs into our platform evolution
  • Own and evolve the incident management tooling stack with a bias towards agentic assistance
  • Analyze incident data to identify recurring patterns and feed learnings back into engineering practices
  • When incident load allows, direct your team's capacity toward runbook improvements, automation, and operational hygiene
  • Partner with Legal, PR, and Customer Success on customer-facing communications during and after major incidents
  • Brief engineering leadership and executives during active incidents with clarity and composure
  • Be the person engineering teams proactively seek out when operational standards and incident practices need to improve

Requirements

  • 10+ years in SRE, incident management, or reliability engineering, with at least 5 years managing teams in this space
  • Proven experience as an incident commander in high-severity, customer-impacting outages at scale. You've personally run incidents that mattered
  • Cloud infrastructure experience across at least one of AWS, GCP, or Azure
  • Deep understanding of distributed systems failure modes (Kafka/event streaming experience preferred, or demonstrated ability to rapidly master complex systems)
  • Strong track record with postmortem facilitation and driving corrective actions to completion
  • Excellent written communication with customers regarding root-cause analysis. You are comfortable stating things with conviction to executive audiences
  • Experience working with cross-functional stakeholders (legal, PR, customer success) during incident response
  • Track record of hiring and developing senior technical talent in a globally distributed, remote-first environment
  • Comfort operating with significant autonomy and making high-stakes decisions under pressure

Nice to Have

  • Experience with incident response in a multi-cloud context
  • Experience building an incident management function or team from scratch
  • Post-incident review methodologies beyond standard '5 whys' (e.g., Learning from Incidents, resilience engineering)
  • Demonstrated use of AI-assisted tooling to improve operational quality at scale

Work Arrangement

Hybrid

Team

Team size: ~5. Structure: small, specialized group

Additional Information

  • Compensation Range: CA$271.6K - CA$319.1K

About company
Confluent
Confluent provides a data streaming platform that puts information in motion in near real-time across AWS, GCP, and Azure, enabling companies to react faster and build smarter.
Job Details
Category: Management
Posted 3 months ago