Role Overview
As a Senior Systems Development Engineer, you will drive the design, validation, and performance optimization of advanced computing platforms engineered for artificial intelligence workloads. Based in Austin, Texas, you will ensure system-level readiness for demanding AI applications, from infrastructure bring-up to deployment at scale.
Key Responsibilities
- Lead the deployment, configuration, and functional validation of high-performance computing systems, including GPU servers, accelerator racks, and high-speed networking fabrics
- Perform deep co-validation across hardware and software layers, ensuring compatibility and stability of CPUs, GPUs, DPUs, NICs, memory subsystems, and I/O interfaces under AI-intensive workloads
- Diagnose and resolve complex issues spanning BIOS/UEFI, BMC firmware, kernel subsystems, device drivers, container environments, orchestration frameworks, and AI model runtimes
- Validate PCIe topology, NUMA alignment, and data-path efficiency critical to model training and inference performance
- Analyze system telemetry, kernel logs, hardware events, GPU health metrics, and fabric diagnostics to identify root causes of failures or bottlenecks
- Conduct root-cause analysis of training instability, model divergence, and hardware degradation under sustained AI loads
- Collaborate with silicon, firmware, operating system, and AI software teams to implement rapid resolutions and drive platform improvements
- Deploy and manage AI clusters integrating GPU servers, accelerators, InfiniBand or RoCE networking, and scalable storage solutions
- Verify cluster readiness for distributed training by evaluating bandwidth, latency, network topology, and gradient synchronization efficiency
- Integrate with orchestration and scheduling platforms such as Kubernetes, Slurm, and Ray, and with container runtimes such as Docker and Singularity, to optimize AI pipeline execution
- Partner with data center engineering on rack integration, power and thermal planning, and capacity forecasting
- Run and interpret industry-standard AI benchmarks including MLPerf Training, MLPerf Inference, and SPEC AI suites
- Develop custom benchmarking tools for transformer models, large language models, computer vision, multimodal systems, and recommendation engines
- Deliver actionable optimization recommendations across hardware, OS, drivers, and AI frameworks based on benchmark results
- Document technical findings and lead cross-functional initiatives to enhance platform performance and reliability
Required Qualifications
- Bachelor’s or Master’s degree in Computer Engineering, Computer Science, Electrical Engineering, or a related technical discipline
- Minimum of five years of experience in system development, platform engineering, or hardware-software validation
- Strong grasp of computer architecture, including CPU/GPU/accelerator design, memory hierarchies, and I/O subsystems
Technical Environment
BIOS/UEFI, BMC, firmware, kernel drivers, PCIe, NUMA, InfiniBand, RoCE, Kubernetes, Slurm, Ray, Docker, Singularity, MLPerf Training, MLPerf Inference, SPEC AI Benchmarks


