Responsibilities
- Create and manage scalable platforms for machine learning and large language model operations that support data versioning, training, evaluation, deployment, and monitoring.
- Develop continuous integration and continuous delivery pipelines for machine learning models and LLM applications with automated testing, validation, and rollback features.
- Produce infrastructure-as-code solutions to enable reproducible and version-controlled machine learning environments.
- Design model serving systems with auto-scaling, A/B testing, and canary release capabilities.
- Construct platforms that support large language model fine-tuning, prompt management, and large-scale experimentation.
- Implement evaluation systems to measure LLM performance, output quality, safety, and cost efficiency.
- Build and deploy enterprise-level AI agents and copilots with monitoring, safeguards, and operational controls.
- Establish observability practices for LLMs including token tracking, latency metrics, prompt/response logging, and cost analysis.
- Ensure high availability, reliability, and performance of machine learning and LLM services using defined service level indicators (SLIs) and service level objectives (SLOs).
- Set up comprehensive monitoring, alerting, and incident response protocols for ML systems.
- Participate in on-call duties and lead post-incident reviews to strengthen system resilience.
- Develop automation tools to reduce manual effort and increase the speed of ML development cycles.
- Work closely with ML engineers and data scientists to transition research prototypes into production systems.
- Coordinate with platform and infrastructure teams on cloud architecture design and resource efficiency.
- Guide team members in adopting MLOps best practices, production-ready ML patterns, and operational discipline.
- Lead technical decision-making around tools, frameworks, and architectural approaches for ML systems.

We are not able to offer nonimmigrant visa sponsorship for this position.