Responsibilities
- Design, develop, and maintain advanced test automation frameworks that incorporate chaos engineering principles
- Create and execute chaos experiments that simulate various failure modes and edge cases in our distributed systems
- Implement monitoring solutions that track system performance, resilience, and failure recovery
- Establish observability practices that provide deep insights into system behavior during chaos experiments
- Collaborate with development teams to build resilience into our applications from the ground up
- Develop metrics and dashboards to visualize system reliability and the impact of chaos experiments
- Lead post-mortem analyses of the system weaknesses uncovered through chaos testing
- Integrate chaos testing into CI/CD pipelines to validate system resilience continuously
- Mentor engineers through code reviews, technical sessions, and hands-on guidance in test automation, chaos engineering, and monitoring best practices
- Contribute to the company's overall testing strategy and quality assurance practices
Benefits
- Make an Impact: Shape the resilience and reliability of AI-driven systems at scale
- Build with Modern Tech: Leverage cutting-edge tools and platforms (multi-cloud, AI-first tooling)
- Ownership & Growth: Take ownership of chaos engineering initiatives and influence engineering culture across teams
- Continuous Learning: Collaborate with top engineers, participate in mentoring, and stay ahead in chaos engineering and SRE practices


