We are seeking a skilled AI Operations Specialist to join our growing ML/AI team. In this contract position, you'll play a central role in transitioning machine learning models from research to reliable, large-scale production systems. Your work will directly support the expansion of AI-driven learning experiences, ensuring performance, scalability, and observability across services.
Key Responsibilities
- Develop and manage CI/CD pipelines tailored for machine learning workflows, enabling seamless model training, packaging, and deployment across microservices.
- Operate and optimize containerized applications on AWS ECS, balancing efficiency, responsiveness, and uptime.
- Automate infrastructure setup and configuration using Terraform to ensure consistent, reproducible environments.
- Support and scale backend services that integrate with external large language model providers.
- Design and maintain data pipelines that extract, transform, and load data from BigQuery, S3, and DynamoDB into training and inference systems.
- Implement monitoring and tracing solutions using Datadog, OpenTelemetry, and Langfuse to track model behavior and define service-level objectives.
- Partner with machine learning engineers to deploy models using BentoML and FastAPI in containerized production environments.
Required Expertise
- 2–3 years of experience in ML operations, with a foundation of 3–4 years in DevOps, CloudOps, or site reliability engineering.
- Strong programming skills in Python, with hands-on experience in Docker and container orchestration.
- Proven track record building CI/CD systems for machine learning in enterprise settings.
- Familiarity with Infrastructure as Code, particularly Terraform.
- Experience working with AWS services including ECS, ECR, S3, DynamoDB, and CloudWatch, as well as GCP tools like BigQuery and Vertex AI.
- Direct experience integrating and monitoring LLMs using the OpenAI API, Google GenAI, and tracing tools such as Langfuse.
- Background in constructing and maintaining data pipelines for model training and feature engineering.
- Understanding of the full ML lifecycle, including training, evaluation, experiment tracking (e.g., MLflow, Weights & Biases), and model versioning.
- Ability to detect and respond to model drift in production.
- Exposure to NLP frameworks such as Hugging Face Transformers, spaCy, or sentence-transformers.
- Knowledge of vector databases like LanceDB or FAISS and embedding-based retrieval systems.
- Experience deploying deep learning models built with TensorFlow or PyTorch in production environments.
- Familiarity with classical ML libraries including scikit-learn, XGBoost, and LightGBM, along with explainability tools such as SHAP.
- Working knowledge of model serving platforms like BentoML and async Python web frameworks such as FastAPI.