Role Overview
We're seeking a Senior Site Reliability Engineer to lead the development and operation of resilient, high-performance systems. Working remotely from Ireland, you'll play a central role in ensuring the stability, scalability, and security of our production infrastructure across multiple cloud platforms.
Key Responsibilities
- Architect, implement, and maintain production systems with a focus on reliability, observability, and performance under real-world workloads
- Develop automation frameworks to reduce manual operations and improve consistency across environments
- Monitor system health proactively, define intelligent alerting rules, and implement self-healing mechanisms to reduce incident impact
- Lead incident response efforts, document runbooks, and conduct postmortems to strengthen system resilience
- Partner with engineering teams to identify performance bottlenecks and improve system design
- Enhance deployment pipelines to support safe, incremental rollouts with robust rollback capabilities
- Optimize monitoring and observability platforms to ensure full visibility into system behavior
- Coordinate maintenance activities to minimize service disruption, and keep stakeholders clearly informed throughout
- Evaluate and integrate industry best practices in infrastructure management and platform security
- Analyze the open-source technologies in our stack to deepen troubleshooting and speed up issue resolution
- Collaborate with external vendors and support teams when technical escalation is required
Technology Environment
You'll work across a modern stack including Ansible and Terraform for infrastructure automation, Prometheus and Grafana for monitoring, and containerized workloads on Kubernetes and Docker. Our systems span AWS, GCP, and Azure, with CI/CD powered by Jenkins and GitLab.


