Redwood City, California, United States USD 158,000 - 185,000 Yearly

Sumo Logic, Inc. is hiring a Senior Machine Learning Engineer

Sumo Logic is hiring a Senior Machine Learning Engineer to design, build, and scale production-grade infrastructure and platforms that enable the full lifecycle of ML and LLM systems. You will architect robust pipelines for model training, evaluation, deployment, and monitoring while ensuring reliability, observability, and efficiency.

What You'll Do

  • Design and implement scalable MLOps/LLMOps platforms supporting the full ML lifecycle: data versioning, model training, evaluation, deployment, and monitoring
  • Build and maintain CI/CD pipelines for ML models and LLM applications with automated testing, validation, and rollback capabilities
  • Develop infrastructure-as-code (IaC) for reproducible, version-controlled ML environments
  • Architect model serving infrastructure with auto-scaling, A/B testing, and canary deployment capabilities
  • Build platforms for LLM fine-tuning, prompt management, and experimentation at scale
  • Implement evaluation frameworks for LLM performance, quality, safety, and cost optimization
  • Design and deploy enterprise-grade AI agents and copilots with robust monitoring and guardrails
  • Establish LLM observability: token usage tracking, latency monitoring, prompt/response logging, and cost attribution
  • Own uptime, reliability, and performance of ML/LLM services (SLIs/SLOs)
  • Implement comprehensive monitoring, alerting, and incident response for ML systems
  • Participate in on-call rotations and drive post-incident reviews to improve system resilience
  • Build automation and tooling to reduce toil and accelerate ML development velocity
  • Partner with ML Engineers and Data Scientists to translate research into production-ready systems
  • Collaborate with platform and infrastructure teams on cloud architecture and resource optimization
  • Mentor team members on MLOps best practices, production ML patterns, and operational excellence
  • Drive technical decisions on tooling, frameworks, and architectural patterns

What We're Looking For

  • B.S./M.S./Ph.D. in Computer Science, Engineering, or a related technical field
  • 4+ years of software engineering experience, with 2+ years focused on MLOps/LLMOps
  • Production experience with ML model serving frameworks (e.g., TensorFlow Serving, TorchServe, Triton)
  • Hands-on experience with ML experiment tracking and model registry tools (MLflow, Weights & Biases, Kubeflow)
  • Proficiency in workflow orchestration (Airflow, Prefect, Kubeflow Pipelines, Metaflow)
  • Experience with LLM deployment, fine-tuning, and evaluation frameworks (e.g., vLLM, LangChain, LlamaIndex)
  • Knowledge of prompt engineering, RAG architectures, and LLM application patterns
  • Familiarity with LLM observability tools (e.g., LangSmith, Arize, WhyLabs)
  • Strong experience with major cloud providers (AWS, GCP, or Azure) and ML-specific services (SageMaker, Vertex AI, Azure ML, Bedrock)
  • Proficiency in containerization (Docker, Kubernetes) and infrastructure-as-code (Terraform, CloudFormation, Pulumi)
  • Experience with microservices architecture and API development (REST, gRPC)
  • Strong programming skills in Python, Terraform, and Helm
  • Deep understanding of CI/CD practices and tools (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
  • Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK)
  • Track record of managing production systems with defined SLIs/SLOs
  • Experience with on-call rotations, incident management, and reliability engineering practices

Nice to Have

  • Familiarity with Go, Java, or Rust
  • Experience building internal ML platforms or developer tooling used by multiple teams
  • Hands-on experience with distributed training frameworks (Ray, Horovod, DeepSpeed)
  • Knowledge of model optimization techniques (quantization, distillation, pruning)
  • Familiarity with feature stores (Feast, Tecton) and data versioning tools (DVC, lakeFS)
  • Understanding of ML security best practices, model governance, and compliance requirements
  • Experience with cost optimization and resource management for large-scale ML workloads
  • Contributions to open-source MLOps/LLMOps projects
  • Background in applied ML or data science with practical model development experience

Technical Stack

  • Model Serving: TensorFlow Serving, TorchServe, Triton
  • MLOps Tools: MLflow, Weights & Biases, Kubeflow, Airflow, Prefect, Metaflow
  • LLM Frameworks: vLLM, LangChain, LlamaIndex, LangSmith, Arize, WhyLabs
  • Cloud & ML Services: AWS, GCP, Azure, SageMaker, Vertex AI, Azure ML, Bedrock
  • Infrastructure: Docker, Kubernetes, Terraform, CloudFormation, Pulumi
  • APIs: REST, gRPC
  • Languages: Python, Go, Java, Rust
  • CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
  • Observability: Prometheus, Grafana, Datadog, ELK
  • Distributed Training: Ray, Horovod, DeepSpeed
  • Data Management: Feast, Tecton, DVC, lakeFS

Benefits & Compensation

  • Salary range: $158,000 - $185,000 per year (USD)

Sumo Logic Privacy Policy. Employees will be responsible for complying with applicable federal privacy laws and regulations, as well as organizational policies related to data protection.

Required Skills
TensorFlow Serving, TorchServe, Triton, MLflow, Weights & Biases, Kubeflow, Airflow, Prefect, Metaflow, vLLM, MLOps, LLMOps, workflow orchestration, model serving, experiment tracking
About company
Sumo Logic, Inc.
Sumo Logic helps make the digital world secure, fast, and reliable by unifying critical security and operational data through its Intelligent Operations Platform. Built to address the increasing complexity of modern cybersecurity and cloud operations challenges, the company empowers digital teams to move from reaction to readiness—combining agentic AI-powered SIEM and log analytics into a single platform to detect, investigate, and resolve modern challenges. The platform enables organizations to protect against security threats, ensure reliability, and gain powerful insights into their digital environments.
Job Details
Department: Software Development
Category: Data
Posted 3 months ago