About the Role
This role involves designing and maintaining reliable systems, automating operational processes, and collaborating across teams to improve service resilience and incident response.
Responsibilities
- Design and implement scalable infrastructure solutions
- Monitor system performance and respond to incidents
- Develop automation tools to reduce manual operations
- Collaborate with development teams to enhance system reliability
- Troubleshoot and resolve complex technical issues
- Maintain system documentation and runbooks
- Participate in on-call rotations for incident management
- Improve system availability and reduce downtime
- Implement proactive alerting and monitoring systems
- Support cloud infrastructure and migration initiatives
- Enforce security and compliance standards
- Drive post-incident reviews and follow-up actions
- Improve deployment reliability and rollback procedures
- Contribute to capacity planning and performance tuning
- Promote best practices in system design and operations
- Integrate reliability into the software development lifecycle
- Use data to identify and resolve system bottlenecks
- Manage configuration and change control processes
- Support disaster recovery planning and testing
- Ensure systems meet service level objectives
- Work with cross-functional teams to resolve production issues
- Evaluate new technologies for operational improvements
- Mentor junior engineers and share technical knowledge
- Maintain focus on customer impact during outages
- Contribute to engineering standards and operational policies
Nice to Have
- Master’s degree in a technical field
- Certifications in cloud or systems engineering
- Experience with large-scale enterprise systems
- Background in financial or regulated industries
- Knowledge of Kubernetes and service mesh technologies
- Experience with infrastructure as code tools
- Familiarity with observability platforms
- Contributions to open-source projects
- Public speaking or technical writing experience
- Leadership in incident command roles
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid work model with flexible location options
Team
Part of a global engineering team focused on system reliability and performance
Why This Role Matters
- This position plays a critical role in maintaining the stability and performance of systems that support enterprise clients.
- Engineers in this role directly influence uptime, scalability, and the overall customer experience.
What to Expect
- You will work across time zones with global teams.
- Expect a mix of strategic planning and hands-on technical problem solving.
- Opportunities for professional growth and technical leadership are built into the role.
Available for qualified candidates

