Whole-Body Conditioned
Egocentric Video Prediction

1 UC Berkeley (BAIR), 2 FAIR at Meta, 3 New York University
*Indicates Equal Contribution

†Indicates Equal Advising

A World Model for Embodied Agents

What are the requirements?

To create a World Model for Embodied Agents, we need...

• …a real embodied agent: complex actions.

  • ❌ abstract control signals — ⬆️,⬇️,⬅️,➡️
  • ✅ a real, physically grounded, complex action space

• …an agent that acts in the real world: complex scenarios.

  • ❌ aesthetic scenes, stationary cameras
  • ✅ diverse real-life scenarios, egocentric view

Humans, as the most complex agents in the world:

• Egocentric view -> Intention

Humans routinely look first and act second—our eyes lock onto a goal, the brain runs a brief visual “simulation” of the outcome, and only then does the body move.

At every moment, our egocentric view both serves as input from the environment and reflects the intention/goal behind the next movement.

Generated Sequence

• Whole-body Control -> Physicality

When we consider our body movements, we should consider both the actions of the feet (locomotion and navigation) and the actions of the hands (manipulation), or, more generally, whole-body control.

Actions of the Feet

Actions of the Hand

Whole-body Control

What did we do?


We train a model to Predict Egocentric Video from human Actions (PEVA).
By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
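To make the rollout described above concrete, here is a minimal sketch of an autoregressive, action-conditioned prediction loop: each future frame is sampled from a conditional diffusion model given the frames generated so far and the whole-body action for that step. The class and method names (PEVARollout, sample_next_frame) and the tensor shapes are illustrative assumptions, not the released PEVA interface.

import torch

class PEVARollout:
    # Illustrative autoregressive rollout around an (assumed) conditional diffusion transformer.

    def __init__(self, model, num_context_frames: int = 4):
        self.model = model                      # assumed: exposes sample_next_frame(ctx, action)
        self.num_context_frames = num_context_frames

    @torch.no_grad()
    def rollout(self, context_frames: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # context_frames: (B, T_ctx, C, H, W) observed egocentric frames
        # actions:        (B, T_future, D) flattened whole-body actions, one per future step
        frames = [context_frames[:, i] for i in range(context_frames.shape[1])]
        preds = []
        for t in range(actions.shape[1]):
            # Condition on the most recent frames plus the action taken at step t.
            ctx = torch.stack(frames[-self.num_context_frames:], dim=1)
            next_frame = self.model.sample_next_frame(ctx, actions[:, t])  # hypothetical method
            frames.append(next_frame)
            preds.append(next_frame)
        return torch.stack(preds, dim=1)        # (B, T_future, C, H, W) predicted frames

Each predicted frame is fed back as context for the next step, which is what enables the long-horizon rollouts shown in the results below.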

Why is it hard?

• Action & Vision Are Heavily Context-Dependent

The same view can lead to different movements, and vice versa, because humans act in complex, embodied, goal-directed environments.

• Human Control is High-Dimensional and Structured

Full-body motion spans 48+ DoF with hierarchical, time-dependent dynamics—not synthetic control codes (see the flattening sketch after this list).

• Egocentric View Reveals Intention—But Hides the Body

First-person vision reflects goals, but not motion execution—models must infer consequences from invisible physical actions.

• Perception Lags Behind Action

Visual feedback often comes seconds later, requiring long-horizon prediction and temporal reasoning.
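As a small illustration of the second point above, a whole-body action can be flattened into a single high-dimensional vector of root translation plus per-joint rotations. The joint count and rotation parameterization below are assumptions chosen only to land near the 48-DoF figure, not the paper's exact specification.

import numpy as np

NUM_JOINTS = 15   # assumed joint count; real skeletons may differ
ROT_DIMS = 3      # assumed per-joint rotation parameterization (e.g., Euler angles)

def flatten_action(root_translation: np.ndarray, joint_rotations: np.ndarray) -> np.ndarray:
    # root_translation: (3,) displacement of the pelvis/root for this step
    # joint_rotations:  (NUM_JOINTS, ROT_DIMS) relative joint rotations
    # returns a 1D action vector of size 3 + NUM_JOINTS * ROT_DIMS.
    return np.concatenate([root_translation, joint_rotations.reshape(-1)])

action = flatten_action(np.zeros(3), np.zeros((NUM_JOINTS, ROT_DIMS)))
print(action.shape)   # (48,), on the order of the 48+ DoF mentioned above

The point is that the control signal is a structured, physically meaningful vector, not a handful of discrete navigation codes.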

Results

Following Atomic Actions

We demonstrate samples of PEVA following atomic actions.

Move Forward

Rotate Left

Rotate Right

Move Left Hand Up

Move Left Hand Down

Move Left Hand Left

Move Left Hand Right

Move Right Hand Up

Move Right Hand Down

Move Right Hand Left

Move Right Hand Right

Long Video Generation

Generation over long horizons: PEVA generates coherent 16-second rollouts conditioned on whole-body motion.

More Long Video Generation


Planning with Multiple Action Candidates

We explore PEVA's ability to serve as a world model with a planning example: we simulate multiple candidate action sequences using PEVA and score them by their perceptual similarity to the goal, measured with LPIPS.

PEVA enables us to rule out action sequences that lead us to the sink in the top row, and outdoors in the second row. PEVA allows us to find a reasonable sequence of actions to open the refrigerator in the third row.

PEVA enables us to rule out action sequences that lead us to the plants in the top row, and to the kitchen in the second row. PEVA allows us to find a reasonable sequence of actions to grab the box in the third row.
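A minimal sketch of this planning-by-simulation loop, under stated assumptions: each candidate action sequence is rolled out with the world model and scored by the LPIPS distance between the final predicted frame and a goal image; the lowest distance wins. The world_model.rollout interface is the hypothetical one from the earlier sketch, and lpips refers to the public LPIPS package.

import torch
import lpips

perceptual = lpips.LPIPS(net="alex")  # LPIPS perceptual distance; lower = more similar

@torch.no_grad()
def plan(world_model, context_frames, candidate_actions, goal_frame):
    # context_frames:    (1, T_ctx, 3, H, W) observed egocentric frames, scaled to [-1, 1]
    # candidate_actions: list of (1, T, D) whole-body action sequences to evaluate
    # goal_frame:        (1, 3, H, W) image of the desired outcome, scaled to [-1, 1]
    scores = []
    for actions in candidate_actions:
        pred = world_model.rollout(context_frames, actions)  # (1, T, 3, H, W), hypothetical interface
        final_frame = pred[:, -1]                            # compare the last predicted frame
        scores.append(perceptual(final_frame, goal_frame).item())
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores

Candidates whose rollouts drift toward the wrong part of the scene (the sink, the plants) receive high LPIPS scores and are ruled out, matching the qualitative examples above.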

More Attempts with Planning


Method


Random Timeskips: sampling frames with random time skips lets the model learn both short-term motion dynamics and longer-term activity patterns.

Sequence-Level Training: model the entire motion sequence by applying the loss over each prefix of frames.

Action Embeddings: whole-body motion is high-dimensional, so we concatenate all actions at time t into a 1D tensor and use it to condition each AdaLN layer (sketched below).
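The sketch below illustrates the action-conditioning idea from the last bullet with a simplified transformer block: the flattened action at time t is mapped to per-block scale, shift, and gate parameters that modulate a LayerNorm with no affine parameters of its own (adaptive LayerNorm). Real blocks also contain MLP sublayers, zero-initialized modulation, and temporal attention; this is an illustrative sketch under those assumptions, not the released architecture.

import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim: int, action_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)  # affine comes from the action
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Map the action embedding to scale, shift, and gate for this block.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(action_dim, 3 * dim))

    def forward(self, x: torch.Tensor, action_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) frame tokens; action_emb: (B, action_dim) embedded action at time t
        scale, shift, gate = self.ada(action_emb).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h)
        return x + gate.unsqueeze(1) * attn_out

In practice the same action embedding would condition every block of the transformer, so the whole-body control signal influences the prediction at every layer.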

Quantitative Results

Atomic Action Performance


Comparison of models in generating videos of atomic actions.

Baselines


Baseline Perceptual Metrics.

Video Quality


Video Quality Across Time (FID).

Scaling


PEVA scales well: larger models lead to better performance.

BibTeX

@misc{bai2025wholebodyconditionedegocentricvideo,
      title={Whole-Body Conditioned Egocentric Video Prediction},
      author={Yutong Bai and Danny Tran and Amir Bar and Yann LeCun and Trevor Darrell and Jitendra Malik},
      year={2025},
      eprint={2506.21552},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.21552},
}

Acknowledgements

The authors thank Rithwik Nukala for his help in annotating atomic actions. We thank Katerina Fragkiadaki, Philipp Krähenbühl, Bharath Hariharan, Guanya Shi, Shubham Tulsiani, and Deva Ramanan for their useful suggestions and feedback on improving the paper; Jianbo Shi for the discussion regarding control theory; Yilun Du for the support on Diffusion Forcing; Brent Yi for his help with human-motion-related work; and Alexei Efros for the discussions and debates regarding world models. This work is partially supported by the ONR MURI N00014-21-1-2801.