As a Senior HPC and AI Networking Performance Research and Analysis Engineer, you will investigate and improve the performance of AI workloads running on large-scale GPU and CPU systems. Your primary focus will be distributed deep learning applications, particularly large language model training and inference, where communication patterns and network efficiency are critical.

Key Responsibilities

Conduct in-depth profiling and analysis of AI workloads to uncover performance bottlenecks, especially in communication and data transfer layers
Design and execute benchmarking strategies to evaluate system behavior under real-world conditions
Collaborate with hardware and software teams to assess performance across CPUs, GPUs, host channel adapters, and network switches
Develop and apply simulation models, performance tools, and analytical methods to diagnose system limitations
Investigate low-level system interactions to determine root causes of performance issues
Establish performance baselines and define testing strategies for emerging technologies
Guide optimization efforts to achieve maximum system throughput and efficiency

Qualifications

Applicants should hold a Bachelor's degree in Computer Science or Software Engineering and bring at least six years of hands-on experience in high-performance networking. Essential skills include deep familiarity with RDMA, MPI, NCCL, and networking protocols such as RoCE. Proficiency in Python, Bash, and C is required, along with strong Linux system knowledge.

Experience with NVIDIA GPUs, CUDA libraries, and deep learning frameworks like TensorFlow or PyTorch is necessary. Demonstrated ability in performance analysis, problem solving, and cross-team collaboration is essential.

Preferred Background

Proven track record in benchmarking AI workloads, especially for distributed LLM training
Strong understanding of CUDA and NCCL internals
Comprehensive knowledge of system architecture, including CPUs (Intel, AMD, ARM), GPUs, memory, and PCI subsystems
Familiarity with congestion control mechanisms in high-speed networks

NVIDIA is hiring a Senior HPC and AI Networking Performance Research and Analysis Engineer
