Role Overview

We're seeking a Senior Site Reliability Engineer to lead the development and operation of resilient, high-performance systems. Working remotely from Ireland, you'll play a central role in ensuring the stability, scalability, and security of our production infrastructure across multiple cloud platforms.

Key Responsibilities

Architect, implement, and maintain production systems with a focus on reliability, observability, and performance under real-world workloads
Develop automation frameworks to reduce manual operations and improve consistency across environments
Proactively monitor system health, define actionable alerting rules, and implement self-healing mechanisms to reduce incident impact
Lead incident response efforts, document runbooks, and conduct postmortems to strengthen system resilience
Partner with engineering teams to identify performance bottlenecks and improve system design
Enhance deployment pipelines to support safe, incremental rollouts with robust rollback capabilities
Optimize monitoring and observability platforms to ensure full visibility into system behavior
Coordinate maintenance activities with minimal service disruption, ensuring clear communication with stakeholders
Evaluate and adopt industry best practices in infrastructure management and platform security
Analyze open-source technologies in depth to troubleshoot more effectively and resolve issues faster
Collaborate with external vendors and support teams when technical escalation is required

Technology Environment

You'll work across a modern stack that includes Ansible and Terraform for infrastructure automation, Prometheus and Grafana for monitoring, and containerized workloads on Docker and Kubernetes. Our systems span AWS, GCP, and Azure, with CI/CD running on Jenkins and GitLab.