About the Role
The role involves designing, implementing, and maintaining systems that support high availability, performance, and automation across complex environments.
Responsibilities
- Design and manage scalable, fault-tolerant systems
- Monitor infrastructure performance and system health
- Respond to and resolve critical production incidents
- Implement automated solutions for operational tasks
- Collaborate with development teams to improve code deployability
- Optimize system reliability and reduce operational toil
- Support incident management and post-mortem analysis
- Maintain comprehensive documentation of systems and processes
- Drive improvements in monitoring and alerting frameworks
- Ensure efficient resource utilization across environments
- Participate in on-call rotations for system support
- Evaluate and integrate new technologies that improve reliability
- Support capacity planning and performance testing
- Enforce security best practices in infrastructure design
- Promote observability across distributed systems
- Troubleshoot complex technical issues across all layers of the stack
- Contribute to disaster recovery planning
- Improve deployment pipelines and CI/CD workflows
- Ensure compliance with operational standards
- Mentor engineers on reliability practices
- Analyze system metrics to guide technical decisions
- Work across time zones with global teams
- Balance short-term fixes with long-term improvements
- Support cloud infrastructure management and optimization
- Foster a culture of ownership and accountability
Nice to Have
- Master’s degree in a technical field
- Experience in high-scale production environments
- Contributions to open-source projects
- Certifications in cloud or systems engineering
- Leadership experience in technical teams
- Public speaking or conference participation
- Knowledge of machine learning infrastructure
- Experience with global, distributed systems
- Background in gaming or real-time services
- Track record of mentoring junior engineers
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid model with flexibility to work remotely or from the office
Team
Collaborative engineering environment focused on system reliability and scalable infrastructure
Why This Role Matters
This position plays a critical role in maintaining the stability and performance of large-scale systems that serve millions of users.
Growth Opportunities
Engineers are encouraged to lead initiatives, explore new technologies, and grow into technical leadership roles.
Visa sponsorship may be available for qualified candidates.


