United States of America Remote (Global)

Lavendo is hiring a Senior AI/ML Specialist Solutions Architect (AI Infra & Cloud)

About the Role

As a Senior AI/ML Specialist Solutions Architect, you will lead the design and implementation of high-performance AI systems tailored to the needs of AI-driven organizations. Your work will focus on optimizing large-scale distributed training and inference workflows across multi-GPU and multi-node environments, ensuring solutions are production-ready, efficient, and aligned with long-term business objectives.

Key Responsibilities

  • Design and refine scalable AI architectures that leverage modern cloud and HPC infrastructure
  • Guide the evolution of machine learning pipelines from proof-of-concept to robust, production-grade systems
  • Collaborate with engineering and product teams to integrate customer feedback and shape future technology roadmaps
  • Deliver technical insights through whitepapers, presentations, and webinars to support customer success
  • Advise internal teams and clients on best practices in AI deployment, performance tuning, and infrastructure strategy
  • Foster trusted relationships with technical and business stakeholders to ensure alignment with strategic goals

Qualifications

  • Minimum of 5 years in cloud infrastructure, MLOps, or solutions architecture roles with a focus on AI/ML systems
  • Proven experience scaling AI models in production using frameworks such as PyTorch and JAX
  • Strong understanding of NVIDIA’s HPC ecosystem, including CUDA, NCCL, and InfiniBand networking
  • Excellent communication skills with the ability to translate complex technical concepts for diverse audiences
  • Eligibility to work full-time in the U.S. without sponsorship

Preferred Technical Experience

  • Programming: Python, Go, Java, or C++
  • Infrastructure as Code: Terraform, Ansible
  • Orchestration platforms: Kubernetes, Slurm
  • DevOps tools: Git, Docker, Helm
  • Big data technologies: Spark, Kafka, Hadoop
  • Database systems: SQL, NoSQL, and vector databases
  • ML frameworks: TensorFlow, Hugging Face, scikit-learn

Compensation & Benefits

Annual salary ranges from $225,000 to $315,000, negotiable based on experience and location. The role includes stock options and a 4% 401(k) match. Employees receive 100% company-paid medical, dental, and vision coverage for themselves and their families, along with company-paid disability and life insurance.

Additional benefits include 20 weeks of paid parental leave for primary caregivers, 12 weeks for secondary caregivers, and a monthly stipend for internet and mobile expenses. The position supports full remote flexibility for U.S.-based team members.

Technology & Impact

You’ll work with state-of-the-art AI infrastructure, including the latest NVIDIA GPUs and one of the most powerful commercially available supercomputers. The organization is committed to sustainable AI, operating energy-efficient data centers that repurpose waste heat to support local communities.

Company Values

The organization champions equitable access to AI infrastructure, empowers teams to build and scale AI solutions, and simplifies the challenges of AI development. It is dedicated to fostering an inclusive, diverse, and accessible workplace for all employees.

Required Skills

PyTorch, JAX, TensorFlow, Hugging Face, scikit-learn, CUDA, NCCL, InfiniBand, NVIDIA HPC ecosystem, Python, Go, Java, C++, Terraform, MLOps, AI/ML, Cloud Infrastructure, Multi-GPU Optimization, AI Workload Scaling, Distributed Systems
About the Company
Lavendo
Building AI-centric cloud infrastructure that combines large GPU clusters, high-speed networks, and cloud-native tooling into a platform used by enterprises, startups, and research teams. The goal is to enable serious AI and simulation workloads without requiring customers to build their own supercomputers.
Job Details
Category: Infrastructure
Posted: 2 months ago