San Francisco, California, United States

Crusoe is hiring a Senior Staff Software Engineer

This role involves leading the creation of a managed platform for the complete application lifecycle centered on Machine Learning models, especially Large Language Models. The engineer will build foundational systems, design training and reinforcement learning pipelines, and manage scalable model lifecycle infrastructure, while shaping key architectural decisions in a sustainability-driven AI environment.

Responsibilities

  • Oversee fine-tuning systems for large foundation models using techniques like SFT, PEFT, LoRA, and adapters, ensuring resilience through checkpointing, failure recovery, and efficient multi-node orchestration.
  • Design and maintain comprehensive training pipelines for Large Language Models from ingestion to deployment.
  • Develop pipelines for model distillation and reinforcement learning, including reward modeling and policy optimization.
  • Create infrastructure to support agent-based execution workflows.
  • Manage datasets, models, and experiments with robust versioning, lineage tracking, evaluation frameworks, and reproducibility at scale.
  • Collaborate with product, business, and platform teams to define core system abstractions and API designs.
  • Shape long-term architecture for training runtimes, scheduling, storage solutions, and model lifecycle management.
  • Engage with and contribute to the open-source LLM community.
  • Design and implement core platform components from the ground up, with full ownership of 0 → 1 development.

Requirements

  • Advanced degree in Computer Science, Engineering, or a related technical field.
  • 8 to 12+ years of professional experience leading high-impact projects in artificial intelligence.
  • Demonstrated ability to deliver early-stage, complex projects under aggressive timelines.
  • Strong proficiency with cloud services including elastic compute, object storage, virtual private clouds (VPC), and managed databases.
  • Hands-on experience in Generative AI, particularly with Large Language Models and multimodal systems.
  • Deep technical knowledge of AI infrastructure covering both training and inference systems.

Nice to Have

  • Proficiency in Golang or Python for developing large-scale, production-grade services.
  • Active contributions to open-source AI projects such as vLLM or similar frameworks.
  • Experience optimizing performance on GPU systems and inference engines.
  • Familiarity with PyTorch for model development and training.
  • Practical experience in training and fine-tuning Large Language Models.

Tech Stack

  • Golang, Python, PyTorch
  • Large Language Models (LLMs), Generative AI, AI Infrastructure
  • Cloud Services including elastic compute, object storage, VPC, and managed databases
  • Distributed Training Systems
  • Model Fine-tuning using SFT, PEFT, LoRA, and adapters
  • Reinforcement Learning, Distillation techniques
  • Agent Execution Infrastructure
  • Dataset and Model Versioning, Experiment Management

Benefits

  • Competitive base salary aligned with industry standards
  • Restricted Stock Units in a rapidly growing, well-funded technology company
  • Comprehensive health benefits including HDHP and PPO options, vision, and dental coverage for employees and dependents
  • Employer contributions to Health Savings Accounts (HSA)
  • Paid Parental Leave
  • Company-paid life insurance and short-term and long-term disability coverage
  • Access to Teladoc for virtual healthcare
  • 401(k) plan with 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Monthly cell phone expense reimbursement
  • Tuition reimbursement program
  • Subscription to the Calm app for wellness
  • MetLife Legal Plan access
  • Company-paid commuter benefit of $300 per month

Compensation

Base salary of $237,600 to $288,000, plus bonus. Restricted Stock Units are included in all offers. The bonus is part of total compensation.

Team

Collaborative team environment with cross-functional engagement across product, business, and platform teams; this role has significant influence on system design and technical direction.

  • Innovation centered on sustainability
  • Development of responsible and transformative cloud infrastructure
  • Commitment to making a meaningful impact in the AI revolution
  • Proactive and collaborative work environment
  • Autonomous work style with emphasis on clear communication
  • Passion for cutting-edge AI product development
  • Focus on scalable innovation

Additional Information

  • Must have a proactive and collaborative mindset with the ability to work independently.
  • Strong interpersonal and communication abilities are essential.
  • Enthusiasm for solving difficult technical challenges and building state-of-the-art AI systems.

Required Skills

Go, Python, PyTorch, Large Language Models (LLMs), Generative AI, Reinforcement Learning

About company

Crusoe is a vertically integrated AI infrastructure company that owns and operates each layer of the stack to power the world's most ambitious AI workloads, solving the power bottleneck with an energy-first approach.

Job Details

  • Department: Software Development
  • Category: Data
  • Posted: 4 months ago