NVIDIA is hiring an AI Infrastructure Engineer, DGXC Lepton

About the Role

NVIDIA is looking for an AI Infrastructure Engineer to join the DGX Cloud (DGXC) Lepton team. You will design, build, and maintain the AI infrastructure that enables large-scale AI training and inferencing, implementing software and systems engineering practices to ensure high efficiency and availability.

What You'll Do

  • Develop infrastructure software and tools for large-scale AI, LLM, and GenAI infrastructure.
  • Develop and optimize tools to improve infrastructure efficiency and resiliency.
  • Root-cause, analyze, and triage failures from the application level to the hardware level.
  • Enhance infrastructure and products underpinning NVIDIA's AI platforms.
  • Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
  • Define meaningful and actionable reliability metrics to track and improve system and service reliability.
  • Apply strong problem-solving, root cause analysis, and optimization skills.

What We're Looking For

  • 12+ years of experience developing software infrastructure for large-scale AI systems.
  • Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
  • Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level.
  • Proven track record in building and scaling large-scale distributed systems.
  • Experience with AI training, inferencing, and data infrastructure services.
  • Familiarity with operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
  • Proficiency in programming languages such as Python, C/C++, and scripting languages.
  • Excellent communication and collaboration skills.

Nice to Have

  • Experience working with large-scale AI clusters.
  • Strong understanding of NVIDIA GPUs and network technologies (RDMA, IB, NCCL).
  • Good understanding of the internals of DL frameworks such as PyTorch, TensorFlow, JAX, and Ray.
  • Experience with root cause analysis of failures at datacenter scale.
  • Strong background in software design and development.

Technical Stack

  • Languages: Python, C/C++
  • Observability: ELK, Prometheus, Loki
  • Frameworks: PyTorch, TensorFlow, JAX, Ray

Team & Environment

You will be part of the DGX Cloud Team. We cultivate a dynamic and supportive environment that values learning and growth, with a culture of blameless postmortems, iterative improvement, and risk-taking. We value diversity, intellectual curiosity, problem solving, and openness.

Benefits & Compensation

  • Compensation: $224,000 - $356,500 USD for Level 5, and $272,000 - $425,500 USD for Level 6.
  • Equity eligibility.
  • Comprehensive benefits package.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Required Skills
Python, C/C++, ELK, Prometheus, Loki, PyTorch, TensorFlow, JAX, Ray, AI Infrastructure, Distributed Systems, Observability, Performance Optimization, GPU Computing
About company
NVIDIA
NVIDIA builds accelerated computing platforms and AI technologies that power advancements in areas such as generative AI, data centers, robotics, and digital twins.
Job Details
Category: Infrastructure
Posted: 6 months ago