Austin, United States of America (On-site)

Dell Technologies is hiring a Senior Systems Development Engineer – AI Technologies

Role Overview

As a Senior Systems Development Engineer, you will drive the design, validation, and performance optimization of advanced computing platforms engineered for artificial intelligence workloads. Based in Austin, Texas, you will ensure system-level readiness for demanding AI applications, from infrastructure bring-up to deployment at scale.

Key Responsibilities

  • Lead the deployment, configuration, and functional validation of high-performance computing systems, including GPU servers, accelerator racks, and high-speed networking fabrics
  • Perform deep co-validation across hardware and software layers, ensuring compatibility and stability of CPUs, GPUs, DPUs, NICs, memory subsystems, and I/O interfaces under AI-intensive workloads
  • Diagnose and resolve complex issues spanning BIOS/UEFI, BMC firmware, kernel subsystems, device drivers, container environments, orchestration frameworks, and AI model runtimes
  • Validate PCIe topology, NUMA alignment, and data-path efficiency critical to model training and inference performance
  • Analyze system telemetry, kernel logs, hardware events, GPU health metrics, and fabric diagnostics to identify root causes of failures or bottlenecks
  • Conduct root-cause analysis on training instability, model divergence, and hardware degradation under sustained AI loads
  • Collaborate with silicon, firmware, operating system, and AI software teams to implement rapid resolutions and drive platform improvements
  • Deploy and manage AI clusters integrating GPU servers, accelerators, InfiniBand or RoCE networking, and scalable storage solutions
  • Verify cluster readiness for distributed training by evaluating bandwidth, latency, network topology, and gradient synchronization efficiency
  • Integrate with orchestration platforms and container runtimes such as Kubernetes, Slurm, Ray, Docker, and Singularity to optimize AI pipeline execution
  • Partner with data center engineering on rack integration, power and thermal planning, and capacity forecasting
  • Run and interpret industry-standard AI benchmarks including MLPerf Training, MLPerf Inference, and SPEC AI suites
  • Develop custom benchmarking tools for transformer models, large language models, computer vision, multimodal systems, and recommendation engines
  • Deliver actionable optimization recommendations across hardware, OS, drivers, and AI frameworks based on benchmark results
  • Document technical findings and lead cross-functional initiatives to enhance platform performance and reliability

Required Qualifications

  • Bachelor’s or Master’s degree in Computer Engineering, Computer Science, Electrical Engineering, or a related technical discipline
  • Minimum of five years of experience in system development, platform engineering, or hardware-software validation
  • Strong grasp of computer architecture, including CPU/GPU/accelerator design, memory hierarchies, and I/O subsystems

Technical Environment

BIOS/UEFI, BMC, firmware, kernel drivers, PCIe, NUMA, InfiniBand, RoCE, Kubernetes, Slurm, Ray, Docker, Singularity, MLPerf Training, MLPerf Inference, SPEC AI Benchmarks

Required Skills
Kubernetes
About the company
Dell Technologies
Dell Technologies helps customers modernize infrastructure and unlock value from AI. The company is a family of businesses that helps individuals and organizations transform how they work, live and play.
Job Details
Category: Embedded
Posted: 3 months ago