This role centers on diagnosing and resolving advanced technical challenges in customer cloud deployments built on OpenStack and Kubernetes. The engineer serves as a key escalation point for high-severity incidents, taking ownership from detection through resolution while ensuring system stability and performance.
Key Responsibilities
- Diagnose and resolve complex issues across OpenStack, Kubernetes, Mirantis Container Runtime (MCR), and associated cloud technologies in private cloud environments
- Lead incident response for critical outages, including facilitating communication, identifying root causes, and coordinating cross-team remediation efforts
- Act as technical lead during assigned shifts, monitoring platform health, responding to alerts, and addressing emerging issues proactively
- Execute cluster upgrades and lifecycle management tasks with minimal service impact, ensuring clear customer communication throughout
- Maintain end-to-end ownership of technical escalations, routing issues appropriately to OpenStack, storage, networking, or hardware teams while tracking resolution
- Reproduce reported issues in lab environments, validate defects, and deliver detailed diagnostic information to development teams
- Collaborate with engineering to analyze recurring problems, recommend improvements, and verify fixes before deployment
- Provide timely, accurate updates to customers during incidents, guiding them through troubleshooting steps and resolution processes
Requirements
- U.S. citizenship required for federal security clearance eligibility
- Availability to work third shift: Monday through Friday, 11:00 PM – 7:00 AM
- Demonstrated experience resolving technical issues in large-scale cloud infrastructure
- Deep technical knowledge of OpenStack and Kubernetes platforms
- Proven ability to lead high-severity incident calls and manage problem resolution workflows
- Strong written and verbal communication skills, especially under pressure
- Experience managing technical escalations from initiation to closure
- Familiarity with cluster lifecycle operations and upgrade procedures
Technology Environment
OpenStack, Kubernetes, Mirantis Container Runtime (MCR), Ceph, storage systems, networking components, and cloud infrastructure platforms.


