Role Overview

As a Senior Site Reliability Engineer, you will be central to ensuring the stability, scalability, and performance of our cloud-native platform. You'll work closely with engineering teams to build robust systems that support millions of users, focusing on automation, observability, and operational excellence.

Key Responsibilities

Investigate and resolve complex issues across applications and infrastructure, minimizing service disruption
Participate in on-call rotations, sprint planning, and deployment coordination
Lead root cause analyses and guide teams toward preventive improvements
Enhance system observability using tools like Prometheus, Grafana, Splunk, and Datadog
Advocate for secure, scalable, and maintainable architectural patterns
Develop automation scripts, internal tools, and infrastructure-as-code to streamline operations
Document processes, runbooks, and technical standards to support team knowledge sharing
Collaborate across teams to refine CI/CD pipelines and software delivery practices

Required Qualifications

Minimum of 10 years of experience in site reliability or systems engineering roles
Deep expertise in Linux systems, scripting, and troubleshooting
Proven experience with AWS services including EC2, ECS, Fargate, VPC, Route 53, and Elastic Load Balancing
Strong background in infrastructure-as-code using CloudFormation, Terraform, Helm, or Ansible
Familiarity with containerization, Kubernetes, and microservices architectures
Hands-on experience with CI/CD and the full software development lifecycle
Proficiency with observability platforms such as New Relic, Splunk, or Datadog
Excellent written and verbal communication skills

Preferred Skills

Experience with AWS CDK

Technology Environment

The platform is built with Java, Kotlin, and C++, with Postgres for data storage. Infrastructure runs on AWS, using services including ECS, Fargate, and ALB/NLB, with Kubernetes for container orchestration. Automation is driven by CloudFormation, Terraform, Helm, and Ansible, while observability relies on Prometheus, Grafana, and other monitoring tools.

Work Environment

This is a fully remote role with the option to work hybrid from our Berlin office. We support teams across the UK and Germany, fostering a flexible and inclusive working model.

Benefits & Culture

Competitive compensation package
Access to a corporate benefits platform with discounts on travel, fitness, fashion, and more
Opportunities for professional growth through dedicated training resources
Collaborative, international team focused on innovation and customer success
Commitment to diversity, inclusion, and a supportive workplace

Equal Opportunity

We value diversity and ensure equal consideration for employment regardless of race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, genetic information, marital status, or veteran status.

Cephalgo is hiring a Senior Site Reliability Engineer
