Responsibilities

Create and manage data processing workflows that analyze real-world documents to guide the creation of high-quality synthetic data
Develop systems that generate documents in various formats, layouts, and subject areas
Build assessment methods to verify that synthetic data reflects the statistical patterns and variety of real documents
Investigate and deploy generative modeling approaches tailored for document AI training
Detect and resolve data quality problems to ensure synthetic outputs effectively train downstream models
Work with modeling teams to assess how synthetic data influences model accuracy and performance
Lead the synthetic data pipeline from design through validation, ensuring technical and quality standards are met
Make key architectural choices that balance data quality, diversity, scale, and cost efficiency
Establish and track data quality indicators, and maintain real-time dashboards for monitoring generation
Coordinate with annotation teams to align synthetic data with downstream processing needs
Support strategic planning in collaboration with senior technical leadership
Develop high-throughput systems capable of producing millions of synthetic training samples
Implement filtering, post-processing, and validation steps to eliminate poor-quality synthetic outputs
Design cost-effective generation workflows that optimize computational resources, output quality, and speed
Build monitoring tools to identify changes in data distribution or declining quality over time
Partner with platform engineering teams on compute resource management, storage, and job scheduling

ABBYY is hiring a Senior Machine Learning Engineer, Synthetic Data & Document Understanding
