About the Role
In this role, you will build and maintain reliable systems, combining software engineering with operational expertise to support scalable services.
Responsibilities
- Design and manage scalable infrastructure for high availability
- Implement automated deployment and rollback systems
- Monitor system performance and proactively address issues
- Respond to incidents and lead resolution efforts
- Optimize system reliability and reduce downtime
- Develop tools to streamline operations and reduce manual work
- Collaborate with engineering teams to improve service resilience
- Enforce observability standards across services
- Manage on-call rotations and post-incident reviews
- Improve CI/CD pipelines for faster, safer releases
- Conduct capacity planning and resource forecasting
- Support security and compliance requirements in infrastructure
- Troubleshoot production issues across multiple layers
- Drive adoption of best practices in reliability engineering
- Contribute to disaster recovery planning and testing
- Evaluate and integrate new technologies for operational efficiency
- Maintain documentation for systems and procedures
- Work closely with developers to refine service design
- Ensure systems meet their SLOs and comply with error budget policies
- Automate routine operational tasks
- Analyze system metrics to identify trends and risks
- Promote a blameless culture during incident investigations
- Scale infrastructure in response to product growth
- Integrate feedback loops for continuous improvement
- Support cloud cost optimization initiatives
Nice to Have
- Experience in fast-growing startups or high-traffic environments
- Background in machine learning infrastructure
- Familiarity with edge computing or CDN technologies
- Contributions to open-source projects
- Experience with real-time data processing systems
Compensation
Competitive salary and equity
Work Arrangement
Remote-first with team hubs
Team
Collaborative engineering team focused on infrastructure and product reliability
Our Stack
We use Kubernetes for orchestration, Terraform for infrastructure as code, Prometheus and Grafana for monitoring, and a mix of Python and Go for service development.
Impact
Your work will directly influence the stability and performance of a widely used visual media platform, enabling seamless user experiences at scale.
Available for qualified candidates


