Location: Palo Alto, California, United States
Salary: USD 180,000 - 440,000 yearly

xAI is hiring a Software Engineer

xAI is seeking a Software Engineer for the ML and Data Infrastructure team. This team builds the foundational infrastructure for frontier AI models and truth-seeking agents. You will collaborate with pre-training, multimodal, reasoning, and product teams to tackle ambiguous, high-stakes problems in a fast-paced, meritocratic environment.

What You'll Do

  • Design, build, and operate petabyte-to-exabyte scale distributed systems for data acquisition, web crawling, preprocessing, filtering, classification, and multimodal pipelines.
  • Architect high-performance search and retrieval engines at trillion-document scale, integrated with LLMs and agents for truth-seeking, low-hallucination reasoning, and real-time knowledge access.
  • Develop reliable inference serving infrastructure: load balancing, autoscaling, KV cache, batching, fault-tolerance, monitoring, CI/CD, and benchmarking for 100% uptime and optimal tail latency.
  • Optimize low-level performance: CUDA kernels, Triton and CUTLASS extensions, quantization, distillation, speculative decoding, GPU memory hierarchy, and model-hardware co-design for next-generation architectures.
  • Innovate on compilers, runtimes, distributed profiling and debugging tools, and interconnect fabrics.
  • Manage complex workloads across clouds and clusters: orchestration, data bookkeeping and verifiability, high-speed interconnect validation, failure analysis, and telemetry and automation for production reliability.

What We're Looking For

  • Strong systems engineering skills with proven impact on large-scale distributed infrastructure.
  • Proficiency in Python and at least one compiled language (Rust, C++, Go, or Java); experience building bespoke libraries, optimizing performance, and debugging complex systems.
  • Hands-on experience with at least one key area: petabyte-scale data pipelines and crawling, web-scale search and retrieval, inference optimization, compiler features, or high-speed interconnects.
  • Deep understanding of distributed systems challenges: sustaining high operations-per-second throughput, latency and throughput tradeoffs, fault tolerance, monitoring, and scaling production systems to billions of users or clusters of 100,000+ GPUs.
  • Passion for AI infrastructure: keeping up with state-of-the-art techniques, first-principles problem-solving, meticulous organization and bookkeeping, and delivering rigorous, high-quality results.

Nice to Have

  • Experience with multimodal data, epistemics and truth-seeking in retrieval, or agentic systems.
  • Low-level optimizations: CUDA kernel development, GPU profiling, low-precision numerics, or interconnect pathfinding.
  • Production expertise in inference reliability, CI/CD for ML, or cluster networking.
  • A track record of owning end-to-end projects in hyperscale environments, with strong debugging skills, vendor management experience, or open-source contributions.

Technical Stack

  • Languages: Python, Rust, C++, Go, Java
  • Infrastructure: Spark, Ray, Kubernetes
  • ML/Performance: CUDA, Triton, CUTLASS, JAX, XLA, MLIR
  • Ops & Observability: Prometheus, Grafana, Buildkite, ArgoCD

Team & Environment

You will join a small, highly motivated team within a flat organizational structure, with a culture focused on engineering excellence. All employees are expected to be hands-on and contribute directly to the company’s mission to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge.

Benefits & Compensation

  • Total compensation range: $180,000 - $440,000 USD
  • Equity
  • Comprehensive medical, vision, and dental coverage
  • 401(k) retirement plan
  • Short-term and long-term disability insurance
  • Life insurance
  • Various other discounts and perks

xAI is an equal opportunity employer.

Required Skills
Python, Rust, C++, Go, Java, Spark, Ray, Kubernetes, CUDA, Triton, Distributed Systems, Performance Optimization, Data Pipelines, Inference Optimization, High-Speed Interconnects
About company
xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge.
Job Details
Department: Software Development
Category: Infrastructure
Posted: 2 months ago