Responsibilities
- Translate business needs into testable GenAI hypotheses, clear outputs, and measurable success criteria; define scope boundaries (what the system should not attempt), including risks.
- Run feasibility assessments to choose the right approach: prompting vs RAG vs fine-tuning vs classical ML.
- Select and develop models based on task requirements (reasoning vs extraction vs classification), working with AI Engineering to understand latency, cost, and risk profile.
- Design prompting strategies: instruction design, few-shot sets, structured outputs, tool/agent prompts, and robustness patterns. Implement these as an MVP and iterate based on eval results.
- Establish prompt iteration methodology driven by evals (not anecdotal testing): prompt versioning, ablations, and change control.
- Define the evaluation plan for GenAI systems and agentic workflows: design and implement evaluations, including LLM-as-a-judge, thresholds, and metric creation (e.g. recall@k, precision@k). Ensure evaluation includes fairness and bias considerations where applicable.
- Define acceptance thresholds and release (“go/no-go”) gates tied to these metrics.
- Own experimentation and model improvements: run structured experiments across prompts, retrievers, chunking, and models.
- Develop methods for identifying model failures, such as hallucination types, retrieval misses, instruction-following errors, and formatting failures.
- Provide recommendations for improvements grounded in evidence: what to change, expected lift, and tradeoffs.
- Deliver an engineering-ready handoff: prompt packages and versioning approach, RAG configuration, tool schemas (if agentic), evaluation harness, datasets/ground truth, metric definitions, and go/no-go gates.
Requirements
- 7+ years of overall AI/ML experience, including 2+ years with Generative AI solutions.
- Strong background in applied ML / data science with demonstrated GenAI delivery experience.
- Deep expertise in evaluation design, metrics, and dataset curation for LLM systems.
- Proven experience in model selection and prompt engineering, including structured output and tool-use prompting.
- Strong proficiency in Python and major ML frameworks (PyTorch, TensorFlow, Scikit-learn).
- Experience in LLM fine-tuning, prompt engineering, or AI solution integration with enterprise applications.
- Familiarity with RAG design choices (chunking, embeddings, retrieval strategies, reranking) and how to evaluate them.
- Comfortable working with the Azure GenAI ecosystem (Azure OpenAI / Azure AI Foundry) from a consumer/solution perspective.
- Proven ability to build end-to-end GenAI MVPs in Python (RAG/agents + evaluation harness) and prepare them for production handoff.
- Excellent communication and stakeholder management skills with a strategic mindset.


