About the Role
In this role, you will assess how large language models perform in educational contexts: identifying issues, categorizing errors, and providing structured feedback that improves model behavior and reliability.
Responsibilities
- Evaluate outputs from language models for factual correctness and coherence
- Identify harmful, biased, or inappropriate content in AI-generated text
- Classify types of model errors including hallucinations and logical flaws
- Follow detailed guidelines to score model responses consistently
- Provide clear, actionable feedback to improve model training
- Test model behavior across diverse educational prompts and scenarios
- Document patterns in model failures for engineering review
- Collaborate with researchers to refine evaluation criteria
- Maintain high accuracy and attention to detail in assessments
- Adapt quickly to updated instructions and testing protocols
- Contribute to the development of new evaluation frameworks
- Ensure model outputs align with pedagogical goals
- Report edge cases that reveal model limitations
- Participate in calibration sessions with team members
- Track and log evaluation results in shared systems
- Support quality assurance across multiple AI features
- Help prioritize issues based on severity and frequency
- Review model updates for improvements or regressions
- Maintain confidentiality of internal testing data
- Engage in ongoing training to stay current with AI developments
- Communicate findings clearly and concisely
- Work independently while meeting deadlines
- Contribute to a culture of continuous improvement
- Follow ethical guidelines in all evaluations
- Assist in creating realistic educational prompts for testing
Nice to Have
- Master’s degree in education or a related field
- Experience working with large language models
- Background in special education or diverse learning needs
- Familiarity with K–12 curriculum frameworks
- Prior work in AI ethics or content safety
- Experience with annotation or labeling tasks
- Knowledge of prompt engineering techniques
- Exposure to educational technology products
- Research experience in cognitive science or learning theory
- Multilingual abilities
Compensation
$60,000–$80,000 annually, commensurate with experience
Work Arrangement
Remote with flexible hours; some real-time collaboration required
Team
Small, agile team focused on AI-driven educational tools
What You’ll Be Doing
- Review and score AI-generated responses to classroom-related prompts
- Flag content that violates safety or accuracy standards
- Participate in weekly team discussions to align on evaluation standards
Why This Role Matters
- Your work directly improves the reliability of AI tools used by educators and students
- You help ensure AI outputs are safe, factual, and appropriate for learning environments

