Requirements
- Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
- Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
- CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, ArgoCD, or similar), and familiarity with progressive delivery approaches such as canary and blue-green deployments
- Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
- SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing, e.g., with Prometheus and OpenTelemetry)
- Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
Nice to Have
- Familiarity with Wikimedia or other open source projects
- Experience managing and troubleshooting event streaming platforms at scale (e.g., Kafka, Kinesis, or similar)
- Hands-on experience with cloud platforms such as AWS and/or GCP, including designing and operating production systems
- Familiarity with data lake architectures and large-scale data processing frameworks (e.g., Iceberg, Flink, Spark)
- Experience with continuous profiling and performance optimization tools to identify bottlenecks and improve system efficiency
- Experience working with or contributing to open source projects, particularly in infrastructure or data ecosystems
- Prior participation in the Wikimedia movement
Work Arrangement
Remote (Worldwide)
Additional Information
- On-call participation required
- Mentoring peers in technical and operational areas
- Work with a globally distributed team that communicates asynchronously