Responsibilities
- Lead real-time incident response as the primary point of contact, coordinating team communication, prioritizing issues, and guiding resolution efforts to reduce service outages and business impact.
- Follow established procedures to diagnose and fix live system problems involving cloud infrastructure such as EC2, CloudWatch, IAM, and Kubernetes components including pods, deployments, and auto-scaling systems.
- Work with development teams after incidents to investigate root causes, capture insights, and support the rollout of long-term fixes.
- Promote high operational standards by tracking key performance indicators like mean time to resolution and service level compliance to guide process improvements.
- Maintain and enhance runbooks and operational protocols to reflect changes in technology and business requirements.
- Support strategic projects aimed at strengthening incident response frameworks and overall system stability.
Benefits
- Performance-based bonus opportunity
- Eligibility for restricted stock units, where applicable
- Comprehensive medical and financial benefits package
- 401(k) plan availability
- Paid time off, including parental leave
Compensation
Competitive base salary with potential adjustments based on performance, team outcomes, and market conditions
Work Arrangement
Fully remote
Team
Distributed engineering team supporting scalable systems
Other
- This role is entirely remote with no office requirement.
- Candidates based in the Eastern or Central time zones are preferred.
- The position requires participation in an on-call schedule and occasional after-hours support during system incidents.
- Employment is at-will, meaning either party may terminate the relationship at any time.
- The company reserves the right to adjust base pay and discretionary compensation at any time due to performance or market-related factors.


