Help maintain and improve the backbone of a top-tier global platform used by millions every day. As a Senior Site Reliability Engineer, you will play a key role in ensuring the stability, scalability, and efficiency of systems that power one of the most accessed websites on the internet. This is a remote-first position with a globally distributed team, operating across time zones with a strong emphasis on asynchronous collaboration.
What You'll Do
- Manage and optimize production infrastructure through deployment, configuration, and ongoing maintenance using tools like Puppet and Kubernetes.
- Drive automation initiatives to streamline service setup, configuration, and long-term upkeep across large-scale environments.
- Partner with development teams to guide architectural decisions that support performance, resilience, and scalability.
- Respond to incidents as part of a shared on-call rotation, leading diagnosis, resolution, and post-mortem analysis to prevent future issues.
- Diagnose complex system problems at the OS and network protocol levels, drawing on deep knowledge of TCP/IP, DNS, HTTP, and TLS.
- Mentor team members and contribute to a culture of operational excellence and continuous learning.
- Work in a fully transparent, open-source environment where all code, configs, and documentation are publicly accessible.
What We're Looking For
- At least six years of experience in site reliability, operations, or DevOps roles within team-oriented environments.
- Proficiency in scripting with Python, Bash, or similar languages, and hands-on experience with configuration management systems like Puppet.
- Solid understanding of Linux system internals, Debian-based package management, and distributed caching technologies.
- Demonstrated ability to automate workflows, identify inefficiencies, and implement lasting improvements.
- Strong written and verbal English skills, with the ability to work independently and collaborate across global teams.
- Experience with incident response, root cause analysis, and implementing corrective actions to enhance system resilience.
Preferred Experience
- Familiarity with high-performance HTTP caching solutions such as Varnish, Nginx, or Envoy.
- Background in Linux kernel optimization under heavy load conditions.
- Experience with monitoring and observability stacks including Prometheus and Grafana.
- Contributions to open-source software projects or active participation in developer communities.
- Knowledge of LAMP stack components, particularly PHP/HHVM and Redis/memcached; MediaWiki experience is a plus.
- Experience establishing and managing service-level objectives (SLOs) across technical teams.
Compensation & Work Environment
Salaries for U.S. hires range from US$113,082 to US$175,725 annually; ranges for international team members are adjusted for location. Pay is determined by experience, skills, and local cost of living, and is never based on prior salary history. The role is remote-first, with team members in more than 40 countries. Occasional travel (1–2 times per year) may be expected for team gatherings. We are an equal opportunity employer committed to equity, inclusion, and global representation in all aspects of our work.


