About the Role

As a Site Reliability Engineer, you will build and maintain reliable systems, combining software engineering and operational expertise to support scalable services.

Responsibilities

Design and manage scalable infrastructure for high availability
Implement automated deployment and rollback systems
Monitor system performance and proactively address issues
Respond to incidents and lead resolution efforts
Optimize system reliability and reduce downtime
Develop tools to streamline operations and reduce manual work
Collaborate with engineering teams to improve service resilience
Enforce observability standards across services
Manage on-call rotations and post-incident reviews
Improve CI/CD pipelines for faster, safer releases
Conduct capacity planning and resource forecasting
Support security and compliance requirements in infrastructure
Troubleshoot production issues across multiple layers
Drive adoption of best practices in reliability engineering
Contribute to disaster recovery planning and testing
Evaluate and integrate new technologies for operational efficiency
Maintain documentation for systems and procedures
Work closely with developers to refine service design
Ensure systems meet SLOs and error budget policies
Automate routine operational tasks
Analyze system metrics to identify trends and risks
Promote a blameless culture during incident investigations
Scale infrastructure in response to product growth
Integrate feedback loops for continuous improvement
Support cloud cost optimization initiatives

Nice to Have

Experience in fast-growing startups or high-traffic environments
Background in machine learning infrastructure
Familiarity with edge computing or CDN technologies
Contributions to open-source projects
Experience with real-time data processing systems

Compensation

Competitive salary and equity

Work Arrangement

Remote-first with team hubs

Team

Collaborative engineering team focused on infrastructure and product reliability

Our Stack

We use Kubernetes for orchestration, Terraform for infrastructure provisioning, Prometheus and Grafana for monitoring, and a mix of Python and Go for service development.

Impact

Your work will directly influence the stability and performance of a widely used visual media platform, enabling seamless user experiences at scale.

Available for qualified candidates

Photoroom is hiring a Site Reliability Engineer
