Responsibilities
- Design and manage scalable cloud infrastructure on GCP, using Kubernetes and containerization to support demanding machine learning workloads.
- Develop automated workflows for training, evaluating, and releasing ML models using platforms such as Jenkins, GitHub Actions, or Airflow.
- Set up observability systems to detect model drift, performance drops, accuracy changes, and latency issues in live environments.
- Act as a technical liaison among the data, machine learning, backend, and frontend teams so that deployment and operations run smoothly.
- Establish monitoring solutions that track both system-level metrics, such as uptime and latency, and ML-specific indicators, such as feature drift and shifts in data distribution.
- Enable team-level autonomy by deploying monitoring tools that allow individual groups to oversee their own services.
- Take part in on-call duties and help maintain compliance with security standards such as SOC.
Benefits
- Tackle meaningful customer challenges with direct and visible outcomes.
- Operate within a lean and agile environment where individual initiative is recognized and valued.
- See the tangible results of your work every day.
- Grow your expertise by engaging with emerging technologies and markets in a dynamic, high-growth setting.
- Collaborate with skilled professionals in a culture that prioritizes people.
- Be part of a welcoming and respectful workplace that prohibits discrimination and harassment.