Responsibilities
- Design, test, and optimize context strategies and system prompts that shape answer engine behavior across products, features, and use cases.
- Build automated and semi-automated evaluation pipelines that measure model quality, catch regressions, and scale across product surfaces.
- Partner with research and engineering to validate model behavior before and during rollouts, ensuring smooth transitions with no degradation in quality.
- Identify inconsistencies and failure modes in model outputs, for both internal and production-facing systems, through well-designed research projects.
- Work closely with design, product, and research teams to translate product goals into concrete model behavior requirements.
- Help engineers across teams build intuition for prompt design, context engineering, and evaluation best practices.
- Track the latest alignment, evaluation, and prompting techniques from industry and academia, and bring the best ideas back to the team.
Requirements
- Experience designing evaluations, benchmarks, or metrics for AI systems.
- Strong written and verbal communication skills, particularly in explaining complex concepts to diverse stakeholders.
- Ability to manage multiple concurrent projects in a fast-moving environment.
- Strong experience with Perplexity or other frontier AI models in production settings.
- Demonstrated experience with Python, which you'll use to prototype, debug, automate, and build systems at scale.
- 3+ years of experience working with LLMs in a product or research setting.
Nice to Have
- Experience with A/B testing or experimentation frameworks.
- Track record of improving AI system performance through systematic evaluation and iteration.