Responsibilities
- Lead the development and enhancement of a multi-tiered evaluation pipeline covering tool call validation, risk heuristics, and LLM-based transcript assessment.
- Improve methods for evaluating end-to-end system performance across coordinated agents, RAG-enhanced responses, and multi-party voice interactions, using provider-agnostic verification.
- Enhance observability tools to expose evaluation metrics, identify performance regressions, and support data-informed quality decisions.
- Develop real-time monitoring systems that detect live interaction issues, apply contextual interventions, and feed insights back into system improvements.
- Collaborate with machine learning, product, and operations teams to convert real-world failures into automated test cases and strengthen evaluation coverage.
- Create and manage test suites focused on adversarial scenarios and edge cases, including resistance to prompt injection and behavior under ambiguous user input.
- Promote early integration of quality standards by embedding evaluation into prompt design, defining behavioral acceptance criteria, and treating quality as a first-class concern throughout development.
- Help design the orchestration of QA workflows, including background tasks, alerting via Slack, and risk data storage, to boost efficiency and developer usability.
Compensation
Not specified
Work Arrangement
Not specified
Team
Cross-functional team collaborating with ML engineers, product managers, and operations leads
Responsibilities
- Own and extend our multi-layered eval pipeline and verification portfolio: deterministic quality checks on tool calls, risk-factor heuristics, and LLM-graded transcript evaluation.
- Advance our capabilities to evaluate end-to-end system performance (across orchestrated agents, RAG-supported responses, multi-party voice conversations) with modular and auditable verification that is independent of any single model provider.
- Drive improvements to our observability stack to surface eval metrics, detect regressions, and enable data-driven quality decisions across the team.
- Build real-time monitoring and verification loops that catch issues in production interactions as they happen, intervene with context, and feed findings back into system refinement.
- Partner with ML engineers, product managers, and operations leads to translate real-world failure modes into automated checks, closing the loop between production incidents and eval coverage.
- Build and maintain adversarial and edge-case test suites, including prompt injection resistance, guardrail robustness, and graceful degradation under ambiguous patient inputs.
- Champion “shift-left” quality practices: embed eval criteria into prompt engineering workflows, define acceptance criteria for new agent behaviors, and make quality a first-class concern in the development cycle.
- Contribute to the design of our QA pipeline orchestration (background processing, Slack notifications, risk assessment persistence) to improve throughput, reliability, and developer experience.


