Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

Cheng, Xiaoyuan; Yuan, Wenxuan; Mu, Zhancun; Zhang, Yuanzhao; Yang, Yiming; Wang, Hai; Sun, Zhuo; Liu, Che

MBDPO: Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

Xiaoyuan Cheng¹,* Wenxuan Yuan²,* Zhancun Mu³ Yuanzhao Zhang⁴ Yiming Yang¹ Hai Wang¹

Zhuo Sun⁵,† Che Liu⁶,†

¹University College London ²Nanyang Technological University

³Peking University ⁴Santa Fe Institute

⁵Shanghai University of Finance and Economics ⁶Imperial College London

*Core contributors †Corresponding authors


[Figure: Overview of MBDPO offline and online performance]

Overview. MBDPO scales world-model reinforcement learning. Left: in multi-task offline pretraining, MBDPO consistently outperforms TD-MPC2 and shows monotonic performance gains as model size increases. Right: in online learning from scratch, MBDPO achieves competitive performance across 121 continuous-control tasks.

Abstract

MBDPO is a model-based reinforcement learning framework that unifies search and policy optimization through a diffusion policy inside a learned latent world model. Instead of placing an explicit planner such as MPPI on top of the world model, MBDPO reformulates policy optimization as a diffusion process over imagined trajectories, where the score field is corrected by model-based returns and anchored to the behavior distribution via an implicit energy function. This eliminates the structural misalignment between search and value learning that limits prior world-model approaches, and yields monotonic scaling of performance with model capacity.

Motivation

Why Does Scaling World-Model RL Remain Hard?

Scaling world-model RL is not limited only by model prediction error. A deeper bottleneck is the mismatch between how actions are selected by search and how values are learned from replay data: the planner may move toward actions where the learned value function is unreliable, causing value overestimation and unstable policy improvement.

World-Model Score Correction

MBDPO uses imagined rollouts in the latent world model to evaluate candidate action sequences and correct the diffusion score toward higher-return behaviors.

Implicit Energy Anchor

MBDPO learns an implicit energy function from replay data to anchor the policy to the behavior distribution, preventing drift toward unreliable high-value regions.

Diffusion Policy

MBDPO represents policy improvement as iterative denoising over future action sequences, turning search into a learnable policy optimization process.
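One way to read the three components above together is as a score identity for a noised, energy-regularized target over action sequences. The notation below is an illustrative assumption and not the paper's exact objective: $a$ is an action sequence, $z$ a latent state, $R_\theta$ the return of an imagined rollout in the world model, $E_\phi$ the learned energy, and $\beta, \tau, \sigma > 0$ are hypothetical weights and noise scales.

```latex
% Assumed notation, introduced for illustration only.
\begin{align}
  q(a \mid z) &\propto \exp\!\Big(\tfrac{1}{\tau}\big[R_\theta(z, a) - \beta\, E_\phi(z, a)\big]\Big), \\
  q_\sigma(\tilde a \mid z) &= \int q(a \mid z)\, \mathcal{N}\big(\tilde a;\, a,\, \sigma^2 I\big)\, \mathrm{d}a, \\
  \nabla_{\tilde a} \log q_\sigma(\tilde a \mid z) &= \frac{\mathbb{E}[a \mid \tilde a, z] - \tilde a}{\sigma^2}.
\end{align}
```

The last line is Tweedie's formula: the score of the noised target points from the current noisy sequence toward the posterior mean of clean sequences. Under these assumptions, the return term supplies the correction toward higher-return behaviors, while the energy term keeps that posterior mean near the behavior distribution.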

[Figure: Cross-TD error and policy drift comparison]

Search-based planners can drift away from the policy distribution used for value learning, leading to larger cross-TD error and relative action drift. **MBDPO** reduces both effects, showing that diffusion policy optimization keeps policy improvement better aligned with value learning.

Method

Model-Based Diffusion Policy Optimization (MBDPO)

[Figure: Core framework of Model-Based Diffusion Policy Optimization]

**MBDPO** starts from Gaussian action-sequence samples and progressively denoises them into high-return behaviors. At each denoising step, candidate action sequences are rolled out through the latent world model, evaluated by their energy-regularized returns, and reweighted to estimate a score direction toward better actions. The learned world model therefore corrects the diffusion score field, while the implicit energy anchor keeps the policy close to the behavior distribution. Together, this transforms generative denoising into model-based policy optimization and unifies search with policy learning.
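The denoising loop described above can be sketched in a few lines of NumPy. This is a minimal toy under stated assumptions, not the authors' implementation: the linear latent dynamics, the quadratic reward, the quadratic energy, and every name and hyperparameter (`imagined_return`, `energy`, `beta`, `temperature`, the linear noise schedule) are hypothetical stand-ins for the learned world model, the learned implicit energy, and the paper's actual settings. It shows only the general pattern: perturb the current action sequence, score each candidate by an energy-regularized imagined return, and move toward the softmax-weighted mean.

```python
import numpy as np

GAMMA = 0.99  # discount factor (assumed value)


def imagined_return(z0, actions):
    """Toy latent world model (assumption): z' = 0.9 * z + a, reward = -||z'||^2."""
    z, total = z0.copy(), 0.0
    for t, a in enumerate(actions):
        z = 0.9 * z + a
        total -= GAMMA**t * np.sum(z**2)
    return total


def energy(actions, behavior_actions):
    """Toy energy anchor (assumption): squared distance to behavior actions."""
    return np.sum((actions - behavior_actions) ** 2)


def objective(z0, actions, behavior_actions, beta=0.1):
    """Energy-regularized return of one candidate action sequence."""
    return imagined_return(z0, actions) - beta * energy(actions, behavior_actions)


def denoise_step(z0, actions, sigma, behavior_actions, rng,
                 n_candidates=64, temperature=1.0):
    """Move `actions` toward the return-weighted mean of noisy candidates."""
    noise = sigma * rng.standard_normal((n_candidates, *actions.shape))
    scores = np.array(
        [objective(z0, actions + eps, behavior_actions) for eps in noise]
    )
    weights = np.exp((scores - scores.max()) / temperature)  # stable softmax
    weights /= weights.sum()
    # Weighted mean perturbation: sigma^2 times a Monte Carlo score estimate.
    return actions + np.tensordot(weights, noise, axes=1)


def optimize_actions(z0, horizon=5, action_dim=2, n_steps=20, seed=0):
    """Anneal from a Gaussian sample to a refined action sequence."""
    rng = np.random.default_rng(seed)
    behavior_actions = np.zeros((horizon, action_dim))  # stand-in for replay data
    initial = rng.standard_normal((horizon, action_dim))
    actions = initial.copy()
    for sigma in np.linspace(1.0, 0.05, n_steps):  # assumed noise schedule
        actions = denoise_step(z0, actions, sigma, behavior_actions, rng)
    return initial, actions, behavior_actions


z0 = np.ones(2)
initial, final, behavior = optimize_actions(z0)
print("initial objective:", objective(z0, initial, behavior))
print("final objective:  ", objective(z0, final, behavior))
```

In this toy, the softmax-weighted mean of candidates is a self-normalized importance-sampling estimate of the denoised action sequence, so no gradients through the model are needed. A learned diffusion policy would presumably amortize this per-step search into a network, which the sketch does not attempt.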

Results

Strong Performance Across Online, Offline, and Fine-tuning

121 control tasks across DMControl, MetaWorld, ManiSkill2, MyoSuite, Locomotion, and Visual RL

340M parameters in the largest world model tested in multi-task offline pretraining

40K online interactions per unseen task in offline-to-online (O2O) fine-tuning

[Figure: Aggregate online performance across benchmarks]

MBDPO achieves strong online-from-scratch performance across diverse control suites. Compared with SAC, DreamerV3, TD-MPC, and TD-MPC2, MBDPO learns faster and reaches higher or competitive final performance across state-based, visual, locomotion, manipulation, and musculoskeletal control tasks.

Monotonic Scaling in Offline Pretraining

MBDPO shows monotonic performance gains as model capacity increases from 1.7M to 340M parameters, outperforming TD-MPC2 across model scales in multi-task offline pretraining.

Offline-to-online fine-tuning transfers a pretrained generalist agent to unseen tasks with limited interaction. With only 40K online steps per task, MBDPO fine-tuning substantially outperforms training from scratch, demonstrating efficient adaptation from pretrained world-model representations.

MBDPO produces structured latent trajectories that align with task dynamics. Cyclic control tasks form closed-loop patterns, while manipulation tasks follow directed paths toward the goal, suggesting that the learned world model supports physically meaningful policy optimization.

Visualization

Structured Latent Trajectories

MBDPO learns latent trajectories that reflect the structure of the underlying control task. Cyclic skills form closed-loop patterns, while goal-directed manipulation tasks produce smooth trajectories from initial states to successful completion.

DMControl

Cheetah Run Front: reward 740.7
Cup Spin: reward 840.4
Reacher Hard: reward 985.0
Walker Run: reward 769.2

MetaWorld

Bin Picking: reward 1585.0, success rate 1.00
Disassemble: reward 1556.2, success rate 1.00
Door Close: reward 1549.1, success rate 1.00
Lever Pull: reward 1664.9, success rate 0.90

BibTeX

@misc{cheng2026scalingworldmodelreinforcementlearning,
      title={Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization},
      author={Xiaoyuan Cheng and Wenxuan Yuan and Zhancun Mu and Yuanzhao Zhang and Yiming Yang and Hai Wang and Zhuo Sun and Che Liu},
      year={2026},
      eprint={2605.26282},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2605.26282}
}