Own quality for a mission-critical AI system that processes complex freight transactions. In this role, you’ll ensure the reliability and precision of an AI-powered billing platform by building robust quality frameworks and leading deep technical investigations when issues arise.
What You’ll Do
- Establish and refine a quality rubric that defines success and failure across key scenarios and exception types.
- Build and manage golden datasets with real-world inputs, expected outputs, and customer-specific variations to benchmark system performance.
- Conduct regular reviews of system outputs in development and production, identifying patterns, diagnosing root causes, and driving improvements into the product roadmap.
- Design and run regression tests for model updates, logic changes, and new customer integrations.
- Investigate quality incidents by tracing through email ingestion, parsing, prompts, model outputs, normalization, and final audit outcomes.
- Analyze logs, traces, event histories, and data streams to pinpoint failures across distributed workflows and state transitions.
- Produce clear, actionable reports with minimal reproductions, evidence, impact assessments, and recommended fixes.
- Develop a standardized triage process and classification system for recurring quality issues.
- Define and implement monitoring dashboards to track anomalies, error trends, and per-customer performance.
- Collaborate with engineering and AI teams to enhance system observability, including traceability from input to final state.
- Translate customer requirements into testable logic and identify gaps where real-world complexity exceeds current system modeling.
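To make the golden-dataset and regression-testing responsibilities above concrete, here is a minimal sketch of how such a check might look. All names (`run_extraction`, `GOLDEN_CASES`) and the two freight-invoice cases are illustrative, not the platform's actual pipeline:

```python
# Hypothetical golden-dataset regression check: run each input through
# the extraction pipeline and report field-level mismatches against the
# expected outputs. run_extraction is a stand-in for the real system.

GOLDEN_CASES = [
    # (input_text, expected_output)
    ("Carrier: Acme Freight\nTotal: $1,250.00",
     {"carrier": "Acme Freight", "total": 1250.00}),
    ("Carrier: Blue Line\nTotal: $980.50",
     {"carrier": "Blue Line", "total": 980.50}),
]

def run_extraction(text: str) -> dict:
    """Stand-in for the real extraction pipeline (illustrative only)."""
    fields = dict(line.split(": ", 1) for line in text.splitlines())
    return {
        "carrier": fields["Carrier"],
        "total": float(fields["Total"].lstrip("$").replace(",", "")),
    }

def regression_report(cases) -> list:
    """Return (input, field, expected, actual) tuples for every mismatch."""
    failures = []
    for text, expected in cases:
        actual = run_extraction(text)
        for field, want in expected.items():
            got = actual.get(field)
            if got != want:
                failures.append((text, field, want, got))
    return failures

failures = regression_report(GOLDEN_CASES)
print(f"{len(failures)} field mismatches across {len(GOLDEN_CASES)} cases")
```

In practice the golden set would carry customer-specific variations and exception types as additional cases, and a non-empty failure list would block the model update or logic change that produced it.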
What We’re Looking For
- Proven experience in roles combining quality assurance, deep technical investigation, and systems thinking—such as QA in distributed systems, product analysis with debugging focus, or LLM quality evaluation.
- Hands-on experience assessing AI-generated outputs, including structured extraction, classification, tool use, and prompt pipelines.
- Strong skills in debugging production systems using tools like Datadog, ELK, Honeycomb, OpenTelemetry, or Jaeger.
- Proficiency in SQL or Python for data analysis and issue reproduction.
- Familiarity with event-driven architectures, state machines, and distributed workflows—including handling retries, idempotency, and partial failures.
- Ability to define clear requirements and convert ambiguous edge cases into structured test scenarios.
- Comfort working in high-volume, complex environments with frequent edge cases.
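As a flavor of the distributed-workflow debugging described above, the sketch below walks per-transaction event histories to find where failed transactions stalled. The stage names and events are hypothetical, assuming a linear ingestion-to-audit pipeline:

```python
# Illustrative failure-localization pass over workflow event histories:
# for each transaction, find the furthest stage it reached, and count
# how many transactions stalled before the final (audited) stage.

from collections import Counter

EXPECTED_STAGES = ["ingested", "parsed", "normalized", "audited"]

events = [
    {"txn": "A1", "stage": "ingested"},
    {"txn": "A1", "stage": "parsed"},
    {"txn": "B2", "stage": "ingested"},
    {"txn": "A1", "stage": "normalized"},
    {"txn": "A1", "stage": "audited"},
]

def last_stage(events, txn):
    """Furthest pipeline stage recorded for a transaction."""
    stages = [e["stage"] for e in events if e["txn"] == txn]
    return max(stages, key=EXPECTED_STAGES.index, default=None)

def stall_report(events):
    """Count transactions stalled at each non-final stage."""
    stalls = Counter()
    for txn in {e["txn"] for e in events}:
        stage = last_stage(events, txn)
        if stage != EXPECTED_STAGES[-1]:
            stalls[stage] += 1
    return dict(stalls)

print(stall_report(events))  # B2 never progressed past ingestion
```

A real version would read from traces or event streams (e.g. via Datadog or an OpenTelemetry backend) rather than an in-memory list, and would also have to account for retries and out-of-order delivery.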
Nice to Have
- Background in freight, logistics, billing, or audit processes—especially with documents like BOLs, rate confirmations, or carrier invoices.
- Experience designing evaluation metrics such as precision/recall, drift detection, or customer-specific scorecards.
- Knowledge of workflow engines and distributed system failure modes.
- Experience with annotation pipelines, taxonomy design, or human-in-the-loop QA systems.
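For the evaluation-metric design mentioned above, a per-customer precision/recall scorecard might be sketched like this. The customer names and line-item sets are invented for illustration:

```python
# Minimal per-customer scorecard: precision and recall of predicted
# vs. expected line items, assuming labeled field sets per sample.

def precision_recall(predicted: set, expected: set):
    tp = len(predicted & expected)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(expected) if expected else 1.0
    return precision, recall

samples = {
    "customer_a": ({"fuel", "linehaul", "detention"}, {"fuel", "linehaul"}),
    "customer_b": ({"linehaul"}, {"linehaul", "accessorial"}),
}

scorecard = {}
for customer, (pred, exp) in samples.items():
    p, r = precision_recall(pred, exp)
    scorecard[customer] = {"precision": round(p, 2), "recall": round(r, 2)}

print(scorecard)
```

Tracking these numbers per customer over time is one simple way to surface drift: a recall drop confined to one customer points at their document formats rather than the model as a whole.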
Technology Environment
You’ll work with Datadog, ELK, Honeycomb, OpenTelemetry, Jaeger, SQL, Python, event-driven systems, state machines, distributed architectures, LLM inference, RAG, and prompt-based pipelines.
Culture & Expectations
This role thrives on ownership: when something breaks, you follow it to the root cause and drive resolution. You’re systematic in turning ambiguity into clarity and communicate effectively across product, engineering, machine learning, and operations teams.
