As a Senior Hardware Support Engineer, you will play a central role in maintaining the reliability of large-scale production hardware infrastructure. You'll lead deep-dive investigations into complex hardware and firmware issues, identifying root causes and implementing corrective actions to minimize downtime and improve system resilience.
Key Responsibilities
- Lead end-to-end analysis of critical hardware failures, tracing issues from initial symptoms to resolution
- Identify recurring failure patterns and drive systemic fixes to improve fleet-wide reliability
- Serve as the primary technical escalation point during high-severity hardware incidents
- Collaborate with hardware vendors to coordinate diagnostics, replacements, firmware updates, and long-term remediation
- Work alongside internal engineering teams to validate hardware fixes and prevent future issues
- Perform pre-deployment validation of server hardware and firmware across diverse platforms
- Apply structured problem-solving frameworks to diagnose and document hardware-related outages
- Support on-site operations during critical events with clear technical guidance and coordination
- Enhance monitoring, failure tracking, and reporting systems to improve hardware observability
- Contribute to strategic initiatives aimed at increasing long-term platform stability
Qualifications
Candidates must have extensive experience with server hardware in production environments, including deep knowledge of core components such as CPUs, memory, storage, power systems, and BMCs. You should have a proven ability to analyze telemetry and log data to diagnose failure modes, and experience applying formal incident management methodologies to resolve issues efficiently.
Strong communication skills are essential, as the role involves coordinating across engineering, operations, and vendor teams. You must be comfortable managing multiple concurrent investigations under pressure and delivering clear technical documentation.
Preferred Experience
- Work with GPU-intensive systems, AI workloads, or high-performance computing infrastructure
- Experience managing firmware lifecycles and validating large-scale rollouts
- Familiarity with Linux-based environments and infrastructure automation tools
- Track record of improving hardware reliability metrics across large fleets
Work Environment
This role is remote within the United States, with occasional travel required for on-site support during critical hardware events. The position operates in a fast-paced, innovation-driven culture focused on advancing AI and machine learning technologies.
Compensation & Benefits
Base salary ranges from $125,000 to $180,000 annually, with an annual performance-based bonus. Benefits include comprehensive medical, dental, and vision coverage; a 401(k) plan with company contribution; flexible paid time off; paid parental leave; and support for professional development.