Responsibilities
- Create and manage data processing workflows that analyze real-world documents to guide the creation of high-quality synthetic data
- Develop systems that generate documents in various formats, layouts, and subject areas
- Build assessment methods to verify that synthetic data reflects the statistical patterns and variety of real documents
- Investigate and deploy generative modeling approaches tailored for document AI training
- Detect and resolve data quality problems to ensure synthetic outputs effectively train downstream models
- Work with modeling teams to assess how synthetic data influences model accuracy and performance
- Lead the synthetic data pipeline from design through validation, ensuring it meets technical and quality standards
- Make key architectural choices that balance data quality, diversity, scale, and cost efficiency
- Establish and track data quality indicators, and maintain dashboards that monitor generation in real time
- Coordinate with annotation teams to align synthetic data with downstream processing needs
- Support strategic planning in collaboration with senior technical leadership
- Develop high-throughput systems capable of producing millions of synthetic training samples
- Implement filtering, post-processing, and validation steps to eliminate poor-quality synthetic outputs
- Design cost-effective generation workflows that balance compute usage, output quality, and speed
- Build monitoring tools to identify changes in data distribution or declining quality over time
- Partner with platform engineering teams on compute resource management, storage, and job scheduling