Responsibilities
- Create and manage high-volume batch and streaming data pipelines that enable AI training, product functionality, analytics, and experimentation.
- Develop real-time data systems using technologies like Kafka, Kinesis, or Pub/Sub for ingestion and transformation, alongside batch processing frameworks for historical data and offline workloads.
- Lead the design and implementation of data orchestration platforms using tools such as Airflow or Dagster, ensuring reliable scheduling, dependency resolution, retry logic, SLA adherence, and full observability.
- Ensure data accuracy, timeliness, traceability, and resilience by building systems that scale, tolerate partial outages, and adapt to changing data schemas without impacting downstream applications.
- Develop self-service data infrastructure that enables engineers, data scientists, and analysts to discover datasets, define data contracts, and deploy pipelines with minimal overhead.
- Enhance developer productivity by providing clear abstractions, standardized patterns, and best practices for data modeling, testing, validation, and deployment across teams.
- Influence technical direction for data storage, compute engines, orchestration layers, and data APIs through collaboration with engineering and data science teams.
- Guide and mentor engineers through code and design reviews, documentation, and active collaboration to elevate the quality of data infrastructure.
Compensation
Competitive salary and equity package
Work Arrangement
Hybrid or remote options available
Team
Collaborative environment working across engineering and data science teams
Responsibilities
- Design and operate large-scale batch and streaming data pipelines that directly power Perplexity product features, AI training and evaluation workflows, analytics, and experimentation.
- Build event-driven and streaming systems (Kafka, Kinesis, Pub/Sub, or similar) for real-time ingestion, transformation, and delivery, alongside batch frameworks for backfills, aggregations, and offline computation.
- Lead the architecture of data orchestration using tools like Airflow or Dagster, owning scheduling, dependency management, retries, SLAs, and end-to-end observability for critical data flows.
- Set and enforce guarantees for data correctness, freshness, lineage, and recoverability, designing systems that handle rapid scale growth, partial failures, and evolving schemas without disrupting AI workloads or product experiences.
- Build self-serve data platforms that let engineers, data scientists, and analysts safely discover data, define contracts, and create and operate their own pipelines with minimal friction.
- Improve developer experience through better abstractions, opinionated paved paths, and standards for data modeling, testing, validation, and deployment, treating the data platform as a product used by many teams.
- Drive architectural decisions across storage, compute, orchestration, and data APIs, partnering closely with product engineering and data science to align the data ecosystem with Perplexity’s roadmap.
- Mentor engineers, review designs, and raise the technical bar for data infrastructure through thoughtful feedback, documentation, and hands-on collaboration.
Available for qualified candidates