San Francisco; New York City; Palo Alto Hybrid Employment

Perplexity is hiring a Member of Technical Staff (AI Inference Engineer)

Responsibilities

  • New model support. Bring up transformer-based retrieval, text-generation, and multimodal models in our inference infrastructure, from weight loading, request scheduling, and KV-cache management through to API Gateway support.
  • GPU kernel migration to CuTe DSL. Port our in-house CUDA kernels to NVIDIA's CuTe DSL so they run on GB200 today and are portable to Vera Rubin racks tomorrow.
  • Rust-native serving runtime. Develop our internal Rust-based inference server to remove the pain points of serving from Python and keep up with rapidly growing traffic.
  • Performance optimization. Profile and fix bottlenecks from network ingress through continuous batching and GPU kernel interleaving.
  • Reliability and observability. Build dashboards, alerts, and automated remediation so we catch regressions before users do. Respond to and learn from production incidents.

Requirements

  • 3+ years of professional software engineering experience with meaningful work on ML inference or high-performance systems.
  • Familiarity with at least one deep learning framework (PyTorch, JAX, TensorFlow).
  • Understanding of GPU architectures (memory hierarchy, warp scheduling, tensor cores).
  • Understanding of common LLM architectures and inference optimization techniques (e.g. quantization, speculative decoding, prefill-decode disaggregation).

Nice to Have

  • Deep experience with GPU programming and performance work (CUDA, Triton, CUTLASS, or similar). Any other deep systems programming experience is a plus.
  • You understand modern LLM architectures and are able to bring them up reliably in a production environment.
  • You've built and operated production distributed systems under real load, ideally performance-critical ones.
  • Comfortable working across languages and layers: Rust for the serving runtime, Python for model code, CUDA/CuTe DSL for kernels.
  • You own problems end-to-end. You can read a research paper on Monday, write a kernel on Wednesday, and debug a production incident on Friday.
  • Self-directed. You do well in fast-moving environments where the path forward isn't laid out for you.
  • A plus: experience with ML compilers and framework internals (PyTorch internals, torch.compile, custom operators).
  • A plus: experience with distributed GPU communication (NCCL, NVLink, InfiniBand, RDMA libraries, model/tensor parallelism).
  • A plus: experience with low-precision inference (INT8/FP8/FP4 quantization, mixed-precision serving).
  • A plus: experience with profiling and debugging tools (Nsight Compute/Systems, CUDA-GDB, PTX/SASS analysis).
  • A plus: experience with container orchestration (Kubernetes, GPU scheduling, autoscaling inference workloads).

About the company
Perplexity
Perplexity is a free AI-powered answer engine that provides accurate, trusted, and real-time answers to any question.
Job Details
Department AI
Category other
Posted 2 months ago