NVIDIA is hiring a Senior HPC DevOps Engineer

NVIDIA is hiring a Senior HPC DevOps Engineer to help build the supercomputers and HPC clusters of the future. In this role, you will be a key player in groundbreaking advancements in artificial intelligence and GPU computing, driving the latest breakthroughs in at-scale system design and tuning.

What You'll Do

  • Design, implement, and maintain large-scale HPC/AI clusters with state-of-the-art monitoring, logging, and alerting systems.
  • Utilize and develop tools to manage infrastructure as code, ensuring scalable and repeatable deployments.
  • Develop and maintain continuous integration and continuous delivery (CI/CD) pipelines to automate and streamline deployment processes.
  • Develop scripts and tools to automate deployment, configuration management, and operational monitoring.
  • Develop complex network automation.
  • Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
  • Serve as a technical resource, developing and sharing best practices with internal teams.
  • Support R&D activities and engage in proofs of concept (POCs) and proofs of value (POVs) for future improvements.

What We're Looking For

  • B.Sc. in Computer Science, Engineering, or a related field with 5+ years of experience.
  • Deep knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
  • Advanced proficiency in programming and scripting languages, with a solid understanding of object-oriented programming principles.
  • Familiarity with Jenkins, Ansible, Puppet/Chef.
  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu), networking, and OS-level security.
  • Deep understanding of networking protocols such as InfiniBand and Ethernet.
  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes.
  • Background with multiple storage solutions like Lustre, GPFS, ZFS, and XFS.
  • Expertise with virtual systems (VMware, Hyper-V, KVM, Citrix).
  • Familiarity with cloud platforms (AWS, Azure, Google Cloud).

Nice to Have

  • Proven networking experience or strong knowledge through professional networking training.
  • Knowledge of CPU and/or GPU architecture.
  • Understanding of Kubernetes and container-related microservice technologies.
  • Experience with GPU-focused hardware/software (DGX, CUDA).
  • Background with RDMA (InfiniBand or RoCE) fabrics.

Technical Stack

  • Automation & CI/CD: Jenkins, Ansible, Puppet/Chef
  • Operating Systems: Windows, Linux (Red Hat/CentOS, Ubuntu)
  • Networking: InfiniBand, Ethernet
  • Orchestration: Slurm, Kubernetes
  • Storage: Lustre, GPFS, ZFS, XFS
  • Virtualization: VMware, Hyper-V, KVM, Citrix
  • Cloud Platforms: AWS, Azure, Google Cloud
  • GPU Technologies: DGX, CUDA

NVIDIA values diversity and is committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We provide reasonable accommodations to ensure all individuals can participate in the job application or interview process, perform essential job functions, and receive other benefits and privileges of employment.

Required Skills

Jenkins, Ansible, Puppet/Chef, Linux, Red Hat/CentOS, Ubuntu, InfiniBand, Ethernet, Slurm, Kubernetes, Lustre, HPC, DevOps, Windows
About company
NVIDIA
NVIDIA builds accelerated computing platforms and AI technologies that power advancements in areas such as generative AI, data centers, robotics, and digital twins.
Job Details
Category: Infrastructure
Posted: 4 months ago