World-Gymnast: Training Robots with Reinforcement Learning in a World Model

NYU · NYU Shanghai · UC Berkeley
World-Gymnast overview

The policy is trained on tasks specified by an initial frame and a language instruction. During training, the policy outputs actions, which are passed to the world model to generate imagined rollouts; a VLM then scores each rollout with a binary task-completion reward, which is used to update the policy. Once trained, the policy is evaluated on real robots, and the resulting real-world rollouts (frame-action sequences) can be used to further improve the world model on that particular environment.
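A minimal sketch of this loop, assuming hypothetical policy, world-model, and VLM interfaces (none of the names below come from the actual World-Gymnast code):

```python
import random

def train_world_gymnast(policy, world_model, vlm_judge, tasks,
                        num_iters=1000, max_steps=50):
    """RL fine-tuning of a VLA policy using only imagined rollouts.
    `tasks` is a list of (initial frame, language instruction) pairs."""
    for _ in range(num_iters):
        frame, instruction = random.choice(tasks)
        frames, actions = [frame], []

        # Roll the policy out in imagination: the world model predicts the
        # next frame from the current frame and the policy's action.
        for _ in range(max_steps):
            action = policy.act(frames[-1], instruction)
            frames.append(world_model.predict_next_frame(frames[-1], action))
            actions.append(action)

        # The VLM judges the imagined rollout with a binary completion reward.
        reward = vlm_judge.task_completed(frames, instruction)  # 1.0 if done, else 0.0

        # Any policy-gradient-style RL update could go here; the specific
        # algorithm is left abstract in this sketch.
        policy.update(frames, actions, instruction, reward)
    return policy
```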

Abstract

Robot learning from interacting with the physical world is fundamentally bottlenecked by the cost of physical interaction. The two alternatives, supervised finetuning (SFT) from expert demonstrations and reinforcement learning (RL) in a software-based simulator, are limited by the amount of expert data available and by the sim-to-real gap for manipulation, respectively. With the recent emergence of world models learned from real-world video-action data, we ask whether training a policy in a world model can be more effective than supervised learning or software simulation at achieving better real-robot performance. We propose World-Gymnast, which performs RL finetuning of a vision-language-action (VLA) policy by rolling out the policy in an action-conditioned video world model and rewarding the rollouts with a vision-language model (VLM). On the Bridge robot setup, World-Gymnast outperforms SFT by as much as 18x and outperforms RL in a software simulator by as much as 2x. More importantly, World-Gymnast demonstrates intriguing capabilities of RL with a world model, including training on diverse language instructions and novel scenes from the world model, test-time training in a novel scene, and online iterative world model and policy improvement. Our results suggest that learning a world model and training robot policies in the cloud could be the key to bridging the gap between robots that work in demonstrations and robots that can work in anyone's household.

Key Contributions

  • Introduce World-Gymnast, an RL framework that fine-tunes VLA policies inside a learned video world model with VLM-based rewards.
  • Demonstrate improved real-robot performance over SFT and simulator-based RL.
  • Show that training with distractor augmentation, novel language instructions, and additional tasks improves robustness and success rates.
  • Demonstrate test-time training from a novel frame and iterative world model + policy improvement via a Dyna-style loop.

Experiments

World-Gymnast is evaluated on real-robot tasks in the Bridge setup. Below are the reported success rates.

Simulator RL vs World-Gymnast

Real-robot success rates: SIMPLER (simulator RL) vs World-Gymnast.

Supervised Learning vs World-Gymnast

Real-robot success rates: SFT vs Iter-SFT vs World-Gymnast.

Real-robot rollouts (put eggplant into blue sink): World-Gymnast (RL), SIMPLER (simulator RL), and Iter-SFT.

Diverse Training Settings

World-Gymnast can incorporate additional training tasks through distractor augmentation, novel language instructions, and scaling the number of tasks; a sketch of how such a task set could be assembled follows the table below. Reported success rates on the OpenVLA held-out split:

Variant Success Rate
SFT 58 ± 4%
World-Gymnast 74 ± 3%
World-Gymnast-Distract 78 ± 2%
World-Gymnast-Language 81 ± 1%
World-Gymnast-Scaled 81 ± 4%
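The variants above differ only in how the set of (initial frame, instruction) tasks is assembled; a hedged sketch, with every function and argument name hypothetical:

```python
def build_task_set(base_tasks, distractor_tasks=(), language_tasks=(), scaled_tasks=()):
    """Each task is an (initial frame, language instruction) pair; rollouts are
    imagined by the world model, so adding tasks costs no real-robot interaction.

    base_tasks       -- original tasks (World-Gymnast)
    distractor_tasks -- same instructions, initial frames with added clutter (-Distract)
    language_tasks   -- same scenes, novel instructions (-Language)
    scaled_tasks     -- additional scenes and instructions (-Scaled)
    """
    return (list(base_tasks) + list(distractor_tasks)
            + list(language_tasks) + list(scaled_tasks))

# Hypothetical usage with the RL loop sketched in the overview:
# policy = train_world_gymnast(policy, world_model, vlm_judge,
#                              build_task_set(base_tasks, distractor_tasks,
#                                             language_tasks, scaled_tasks))
```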

Example rollouts: distractor training (put yellow corn on pink plate), novel language (put plate on drying rack), scaled tasks (close fridge).

Distractor Robustness: Lift AAA Battery

Task: Lift AAA Battery. Both SFT and World-Gymnast are distracted and pick up the rubber duck, while World-Gymnast-Distract completes the task.

Test-Time Training & Iterative Updates

World-Gymnast can perform test-time RL from a novel frame without real-world rollouts, improving the close-the-drawer task from 62 ± 6% to 100 ± 0%. The framework also supports Dyna-style iterative updates: real-robot rollouts are used to refine the world model, which then yields higher-quality imagined rollouts and further policy improvements.
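A hedged sketch of both procedures, reusing the train_world_gymnast loop from the overview sketch; robot.run and world_model.finetune are assumed interfaces, not the actual API:

```python
def test_time_train(policy, world_model, vlm_judge, novel_frame, instruction):
    """Test-time RL on a single novel scene: only imagined rollouts are used,
    so no real-robot data is collected before deployment."""
    return train_world_gymnast(policy, world_model, vlm_judge,
                               tasks=[(novel_frame, instruction)])

def dyna_iteration(policy, world_model, vlm_judge, tasks, robot, rollouts_per_task=5):
    """One round of Dyna-style iterative world-model and policy improvement."""
    # 1. Deploy the current policy on the real robot and log frame-action sequences.
    real_rollouts = [robot.run(policy, frame, instruction)
                     for frame, instruction in tasks
                     for _ in range(rollouts_per_task)]
    # 2. Fine-tune the world model on the new real data so imagined rollouts
    #    track the target environment more closely.
    world_model.finetune(real_rollouts)
    # 3. Re-run RL inside the improved world model.
    policy = train_world_gymnast(policy, world_model, vlm_judge, tasks)
    return policy, world_model
```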

Real-world rollouts: test-time-trained policy and iteratively improved policy.

World Model Rollout Fidelity

Comparison of rolling out the same action sequence in each of the following environments (a replay sketch follows the list):

  • Real robot
  • SIMPLER (software simulator)
  • WorldGym (pretrained world model)
  • WorldGym with online update
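One way to read this comparison: the same recorded action sequence is replayed open-loop through each environment and the resulting frames are compared. A minimal sketch with a hypothetical environment wrapper:

```python
def replay_actions(env, initial_frame, actions):
    """Open-loop replay: ignore the policy, feed a fixed action sequence, and
    collect the observed or predicted frames for side-by-side comparison."""
    frames = [env.reset(initial_frame)]
    for action in actions:
        frames.append(env.step(action))
    return frames

# `env` could wrap the real robot, SIMPLER, the pretrained WorldGym model, or
# WorldGym after the online update; the resulting frame sequences are then
# compared visually, with the real-robot rollout as the reference.
```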