
Member of Technical Staff, Vision / Language

XDOF

📍San Francisco, California, US

Hybrid · Software Engineering

Posted 2w ago · via ashby


Job Description

About XDOF

Frontier labs are racing to build general-purpose robots, and the bottleneck isn't compute. It's data. At XDOF, we're building the foundation behind the foundation models: the data collection systems, annotation pipelines, exabyte-scale data infrastructure, and software toolchain that enable our partners to push the field forward.

We're hiring a Research Engineer / Scientist to help lead technical efforts at the intersection of vision-language models and robot learning. You will build systems that turn raw egocentric and teleoperation video into high-signal training data for VLA models and, increasingly, contribute to the models themselves.

Beyond pipelines, you will drive research into what makes robot data useful: discovering new metadata (contact events, affordance labels, implicit reward signals, dynamics priors from video) that unlock capabilities current approaches miss. You'll explore how structured annotations can improve cross-embodiment transfer, automatic curriculum generation, and world models that predict what actually matters for manipulation. The data layer isn't downstream of the research. It is the research.

What You'll Do

  • Design and implement vision-language pipelines for egocentric and teleoperation video: structured captioning, temporal grounding, action-conditioned scene understanding, and semantic annotation at scale

  • Develop and evaluate representations that bridge visual perception, language, and low-level robot action — spanning VLAs, video prediction, and world models

  • Build and improve data curation systems that assess quality, diversity, and coverage of large-scale robot demonstration datasets

  • Work hands-on with bimanual and high-DoF manipulation data, including real teleoperation footage and sim-generated rollouts

  • Collaborate directly with partner labs to define data requirements and close the loop between data quality and downstream policy performance

  • Stay current on the research frontier (VLAs, video foundation models, flow matching, DiT architectures, egocentric pretraining) and translate insights into production systems

About You

Required:

  • MS or PhD in Computer Science, Robotics, Machine Learning, or a related field from a top-tier program

  • 3–7 years of research or applied research experience (industry or academic) in one or more of: vision-language models, video understanding, robot learning, or generative modeling

  • Deep fluency in PyTorch; working knowledge of large-scale training infrastructure (distributed training, mixed precision, large batch workflows)

  • Published work or demonstrable impact in VLMs/VLAs, video representation learning, imitation learning, or a closely related area

  • Strong engineering fundamentals — you can design clean systems, not just run experiments

Benefits

  • Competitive compensation and equity

  • Comprehensive health and wellness benefits

  • Flexible work arrangements

  • Collaborative and fast-paced work environment

  • Opportunity to shape the future of robotics and AI alongside an ambitious, values-driven team

Level: Mid Level to Senior Research Scientist (L4–L5 equivalent)

Location: San Mateo

Note: Junior candidates will still be considered

If you’re excited to help build the infrastructure powering tomorrow’s intelligent machines, we’d love to hear from you!

Details

Department
Software Engineering
Work Type
Hybrid
Locations
San Francisco, California, US
Posted
April 3, 2026
Source
ashby