Design, build, and maintain streaming systems that process large-scale telemetry data from GPU clouds and global data centers. Enable real-time visibility and reliable operations for AI infrastructure through scalable, observable pipelines.
Responsibilities
- Develop and manage streaming pipelines for logs, metrics, traces, and operational events
- Build real-time data processing systems using platforms like Kafka, Kinesis, or Pub/Sub
- Scale infrastructure to handle high-volume telemetry and variable traffic loads
- Ensure system reliability with monitoring, dashboards, and alerting mechanisms
- Collaborate with SRE and platform teams to integrate data into observability workflows
- Participate in on-call duties, incident response, and post-mortem reviews
- Enhance system stability and developer efficiency through automation and CI/CD practices
- Contribute to technical design and review of new streaming features
Requirements
- Proven experience with distributed systems, particularly real-time data platforms
- Hands-on work with Kafka or comparable streaming technologies
- Proficiency in backend programming languages such as Java, Scala, Go, or Python
- Experience managing services in cloud or large-scale infrastructure environments
- Understanding of observability principles including metrics, logging, tracing, and alerting
- Ability to debug complex production issues across distributed components
- Track record of end-to-end feature ownership, from design through deployment and operations
- Strong collaboration skills and a practical engineering approach
Nice to Have
- Experience developing observability platforms at cloud or data center scale
- Knowledge of stream processing frameworks and data delivery guarantees
- Familiarity with Kubernetes and containerized environments
- Experience with schema management, data contracts, or serialization formats
- Experience working with bare-metal or large-scale data center infrastructure
- Interest in guiding and mentoring junior engineers
Tech Stack
Kafka, Kinesis, Pub/Sub, Flink, Java, Scala, Go, Python, Kubernetes
Benefits
- Competitive salary
- Restricted Stock Units in a growing technology company
- Health insurance options including HDHP and PPO, with vision and dental coverage for dependents
- Employer contributions to Health Savings Accounts
- Paid Parental Leave
- Life insurance and short- and long-term disability coverage
- Teladoc access
- 401(k) plan with 100% match up to 4% of salary
- Generous paid time off and holidays
- Cell phone reimbursement
- Tuition reimbursement
- Subscription to the Calm app
- MetLife Legal services
- Company-paid commuter benefit of $300 per month
Compensation
Compensation ranges from $172,000 to $209,000 annually, plus bonus. Restricted Stock Units are included in all offers.
Team
Observability team within the Cloud Infrastructure organization
- Focuses on solving problems and identifying opportunities
- Operates with urgency and ambition
- Thrives in unstructured, evolving environments
- Collaborative, high-performing team environment
- Driven by a mission to accelerate energy and intelligence abundance
- Vertically integrated AI infrastructure from power source to AI output
- Energy-first philosophy in building AI systems
Additional Information
- On-call participation is required
- Incident response and post-incident reviews are part of the role
- Opportunity to work on large-scale data pipelines and distributed systems
- Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation


