Responsibilities
- Design and manage scalable cloud infrastructure on GCP with Kubernetes to support demanding machine learning workloads.
- Create automated workflows for training, evaluating, and releasing machine learning models using tools such as Jenkins, GitHub Actions, or Airflow.
- Set up monitoring systems to detect model degradation, accuracy loss, latency issues, and data drift in live environments.
- Collaborate across data, machine learning, backend, and frontend teams to ensure seamless integration and operations.
- Establish monitoring solutions that track both system performance and ML-specific indicators like feature drift and prediction consistency.
- Deploy observability platforms that allow individual engineering teams to oversee their own services and pipelines.
- Take part in on-call duties and contribute to maintaining compliance with security standards such as SOC.
What you bring to the table
- 8-10+ years of experience in DevOps/Platform Engineering, including at least 2 years operating production ML workloads.
- Deep hands-on experience with GCP (VPC-SC, IAM, Organization Policies) and GKE (cluster topology, Helm, Kustomize, ArgoCD).
- High proficiency with Istio (VirtualServices, mTLS, sidecar injection) and Kong API Gateway.
- Expert-level Terraform skills using Atlantis/GitOps in large, multi-hundred-file environments.
- Experience managing enterprise identity and secrets with tools like Auth0, Dex, ESO, or SOPS.
- Production experience with Airflow and ML-serving stacks (e.g., Triton, vLLM, MLflow).
- Comfortable managing Cloud SQL (PostgreSQL), BigQuery, and in-cluster stores like Elasticsearch or ClickHouse.
- Upper-intermediate or higher proficiency in spoken and written English.
It would be great if you also had
- Experience monitoring model accuracy and detecting data/concept drift.
- Familiarity with Ansible for cluster bootstrapping and recovery.
- Kubernetes (CKA/CKS) or GCP Professional Cloud Architect/Security Engineer certifications.
- Exposure to Loki, Grafana, or managing ClickHouse at scale.
As part of Point Wild, you will
- Solve real customer problems through targeted cybersecurity solutions.
- See your impact daily in a fast-moving, contributor-focused organization.
- Accelerate your career by learning new technologies and working with talented peers.
- Have the chance to shape the direction and growth of the organization.