MBDPO: Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

Xiaoyuan Cheng1,* Wenxuan Yuan2,* Zhancun Mu3 Yuanzhao Zhang4 Yiming Yang1 Hai Wang1
Zhuo Sun5,† Che Liu6,†
1University College London 2Nanyang Technological University
3Peking University 4Santa Fe Institute
5Shanghai University of Finance and Economics 6Imperial College London
*Core contributors    †Corresponding authors
Overview of MBDPO offline and online performance

Overview. MBDPO scales world-model reinforcement learning. Left: in multi-task offline pretraining, MBDPO consistently outperforms TD-MPC2 and shows monotonic performance gains as model size increases. Right: in online learning from scratch, MBDPO achieves competitive performance across 121 continuous-control tasks.

Abstract

MBDPO is a model-based reinforcement learning framework that unifies search and policy optimization through a diffusion policy inside a learned latent world model. Instead of placing an explicit planner such as MPPI on top of the world model, MBDPO reformulates policy optimization as a diffusion process over imagined trajectories, where the score field is corrected by model-based returns and anchored to the behavior distribution via an implicit energy function. This eliminates the structural misalignment between search and value learning that limits prior world-model approaches, and yields monotonic scaling of performance with model capacity.

Why Does Scaling World-Model RL Remain Hard?

Scaling world-model RL is limited by more than model prediction error. A deeper bottleneck is the mismatch between how actions are selected by search and how values are learned from replay data: the planner may move toward actions where the learned value function is unreliable, causing value overestimation and unstable policy improvement.
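The overestimation effect can be reproduced in a toy setting. The sketch below is not the MBDPO setup: it assumes 50 discrete actions that all share the same true value, with small estimation error on actions well covered by replay data and large error on the rest. The action counts, noise levels, and the greedy "planner" are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_trials, num_actions = 1000, 50

# Assumption for illustration: every action has the same true value (0.0),
# so any apparent advantage is pure estimation error.
true_values = np.zeros(num_actions)

# Assumption: the first 25 actions are well covered by replay data (small error),
# the last 25 are rarely seen (large error).
noise_std = np.concatenate([np.full(25, 0.1), np.full(25, 1.0)])
estimates = true_values + noise_std * rng.normal(size=(num_trials, num_actions))

# A greedy "planner" picks whichever action the learned value estimate likes best.
chosen = np.argmax(estimates, axis=1)
chosen_estimate = estimates[np.arange(num_trials), chosen]
chosen_true = true_values[chosen]

overestimation = float(np.mean(chosen_estimate - chosen_true))
frac_poorly_covered = float(np.mean(chosen >= 25))

print(f"mean overestimation of the chosen action: {overestimation:.2f}")
print(f"fraction of choices in the poorly covered region: {frac_poorly_covered:.2f}")
```

Because the search maximizes over noisy estimates, it is drawn to the actions whose values are least reliable, and the value it reports for them is biased upward.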

World-Model Score Correction

MBDPO uses imagined rollouts in the latent world model to evaluate candidate action sequences and correct the diffusion score toward higher-return behaviors.
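As a rough illustration of evaluating candidates by imagined rollouts, the sketch below scores a batch of action sequences by their discounted return under a latent model. The `dynamics` and `reward` callables, the toy linear model, and the discount factor are hypothetical stand-ins, not the learned components used by MBDPO.

```python
import numpy as np

def imagined_returns(z0, action_seqs, dynamics, reward, gamma=0.99):
    """Discounted return of each candidate action sequence in the latent model.

    z0:          (latent_dim,) initial latent state
    action_seqs: (num_candidates, horizon, action_dim) candidate sequences
    dynamics:    callable (z, a) -> next latent state, batched (hypothetical)
    reward:      callable (z, a) -> (num_candidates,) rewards (hypothetical)
    """
    num_candidates, horizon, _ = action_seqs.shape
    z = np.repeat(z0[None, :], num_candidates, axis=0)
    returns = np.zeros(num_candidates)
    for t in range(horizon):
        a = action_seqs[:, t, :]
        returns += (gamma ** t) * reward(z, a)
        z = dynamics(z, a)  # step the imagined latent state forward
    return returns

# Toy stand-ins for the learned world model (assumptions for illustration only).
def toy_dynamics(z, a):
    return 0.9 * z + 0.1 * a

def toy_reward(z, a):
    # Reward is highest when the latent state sits at the target (1, 1).
    return -np.sum((z - 1.0) ** 2, axis=1)

rng = np.random.default_rng(0)
z0 = np.zeros(2)
candidates = rng.normal(size=(8, 5, 2))
returns = imagined_returns(z0, candidates, toy_dynamics, toy_reward)
best = candidates[np.argmax(returns)]
```

These returns are the quantity that a score correction can use to favor higher-return candidates. How MBDPO turns them into a score update is described in the framework section.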

Implicit Energy Anchor

MBDPO learns an implicit energy function from replay data to anchor the policy to the behavior distribution, preventing drift toward unreliable high-value regions.
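A minimal sketch of the anchoring idea follows. It assumes a simple quadratic energy fitted to replay actions in place of the learned implicit energy function, and a subtractive penalty with weight `beta`. Both choices, along with the function names, are illustrative assumptions and not the form used in the paper.

```python
import numpy as np

def fit_gaussian_energy(replay_actions, eps=1e-6):
    """Hypothetical stand-in energy: low near replay actions, high far from them."""
    mean = replay_actions.mean(axis=0)
    cov = np.cov(replay_actions, rowvar=False) + eps * np.eye(replay_actions.shape[1])
    precision = np.linalg.inv(cov)

    def energy(actions):
        diff = actions - mean
        # Half the squared Mahalanobis distance to the replay-action distribution.
        return 0.5 * np.einsum("ni,ij,nj->n", diff, precision, diff)

    return energy

def energy_regularized_score(returns, actions, energy, beta=1.0):
    """Penalize model-predicted returns by the energy of the proposed actions."""
    return returns - beta * energy(actions)

rng = np.random.default_rng(0)
replay_actions = rng.normal(loc=0.0, scale=0.2, size=(500, 2))
energy = fit_gaussian_energy(replay_actions)

candidates = np.array([[0.1, 0.0],   # close to the replay data
                       [3.0, 3.0]])  # far from the replay data
raw_returns = np.array([1.0, 2.0])   # the distant action looks better to the model
scores = energy_regularized_score(raw_returns, candidates, energy)
```

In this toy case the distant action has the higher raw return, but the energy penalty ranks the in-distribution action first. This is the kind of drift toward unreliable high-value regions that the anchor is meant to prevent.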

Diffusion Policy

MBDPO represents policy improvement as iterative denoising over future action sequences, turning search into a learnable policy optimization process.

Cross TD error and policy drift comparison
Search-based planners can drift away from the policy distribution used for value learning, leading to larger cross-TD error and relative action drift. MBDPO reduces both effects, showing that diffusion policy optimization keeps policy improvement better aligned with value learning.

Model-Based Diffusion Policy Optimization (MBDPO)

Core framework of Model-Based Diffusion Policy Optimization
MBDPO starts from Gaussian action-sequence samples and progressively denoises them into high-return behaviors. At each denoising step, candidate action sequences are rolled out through the latent world model, evaluated by their energy-regularized returns, and reweighted to estimate a score direction toward better actions. The learned world model therefore corrects the diffusion score field, while the implicit energy anchor keeps the policy close to the behavior distribution. Together, this transforms generative denoising into model-based policy optimization and unifies search with policy learning.
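The loop described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the linear noise schedule, the softmax reweighting with a `temperature`, the quadratic `toy_energy`, the toy linear world model, and all function names are hypothetical. The sketch also uses no learned score network and only shows the sampling-based correction.

```python
import numpy as np

def score_plans(z0, plans, dynamics, reward, energy, beta, gamma):
    """Energy-regularized discounted return of each plan under the latent model."""
    z = np.repeat(z0[None, :], plans.shape[0], axis=0)
    scores = np.zeros(plans.shape[0])
    for t in range(plans.shape[1]):
        a = plans[:, t, :]
        scores += gamma ** t * (reward(z, a) - beta * energy(a))
        z = dynamics(z, a)
    return scores

def denoise_plan(z0, dynamics, reward, energy, horizon=5, action_dim=2,
                 num_steps=20, num_candidates=64, temperature=0.5,
                 beta=0.1, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    plan = rng.normal(size=(horizon, action_dim))    # Gaussian action-sequence sample
    for sigma in np.linspace(1.0, 0.05, num_steps):  # assumed linear noise schedule
        # Candidates are perturbations of the current plan at this noise level.
        noise = rng.normal(size=(num_candidates, horizon, action_dim))
        candidates = plan[None, :, :] + sigma * noise
        # Roll each candidate through the latent model and score it.
        scores = score_plans(z0, candidates, dynamics, reward, energy, beta, gamma)
        # Reweight candidates with a numerically stable softmax over scores.
        weights = np.exp((scores - scores.max()) / temperature)
        weights /= weights.sum()
        # Move to the weighted mean; (new_plan - plan) / sigma**2 plays the role
        # of a Monte Carlo estimate of the score direction toward better actions.
        plan = np.einsum("n,nha->ha", weights, candidates)
    return plan

# Toy stand-ins for the learned components (assumptions for illustration only).
def toy_dynamics(z, a):
    return 0.9 * z + 0.1 * a

def toy_reward(z, a):
    return -np.sum((z - 1.0) ** 2, axis=1)

def toy_energy(a):
    return 0.5 * np.sum(a ** 2, axis=1)

z0 = np.zeros(2)
plan = denoise_plan(z0, toy_dynamics, toy_reward, toy_energy)
bad_plan = np.full((1, 5, 2), -3.0)  # pushes the latent state away from the target
plan_score = score_plans(z0, plan[None], toy_dynamics, toy_reward, toy_energy, 0.1, 0.99)[0]
bad_score = score_plans(z0, bad_plan, toy_dynamics, toy_reward, toy_energy, 0.1, 0.99)[0]
```

The design point the sketch tries to convey is that the search step and the policy update are the same operation: each denoising step is a reweighted average of imagined rollouts, so there is no separate planner whose action distribution can diverge from the one used for value learning.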

Strong Performance Across Online, Offline, and Fine-Tuning Settings

121: control tasks across DMControl, MetaWorld, ManiSkill2, MyoSuite, Locomotion, and Visual RL
340M: largest world-model scale tested in multi-task offline pretraining
40K: online interactions per unseen task for offline-to-online (O2O) fine-tuning
Aggregate online performance across benchmarks

MBDPO achieves strong online-from-scratch performance across diverse control suites. Compared with SAC, DreamerV3, TD-MPC, and TD-MPC2, MBDPO learns faster and reaches higher or competitive final performance across state-based, visual, locomotion, manipulation, and musculoskeletal control tasks.

Multi-task offline scaling results

Monotonic Scaling in Offline Pretraining

MBDPO shows monotonic performance gains as model capacity increases from 1.7M to 340M parameters, outperforming TD-MPC2 across model scales in multi-task offline pretraining.

Offline-to-online performance

Offline-to-online fine-tuning transfers a pretrained generalist agent to unseen tasks with limited interaction. With only 40K online steps per task, MBDPO fine-tuning substantially outperforms training from scratch, demonstrating efficient adaptation from pretrained world-model representations.

Latent trajectory visualization

MBDPO produces structured latent trajectories that align with task dynamics. Cyclic control tasks form closed-loop patterns, while manipulation tasks follow directed paths toward the goal, suggesting that the learned world model supports physically meaningful policy optimization.

Structured Latent Trajectories

MBDPO learns latent trajectories that reflect the structure of the underlying control task. Cyclic skills form closed-loop patterns, while goal-directed manipulation tasks produce smooth trajectories from initial states to successful completion.

DMControl
Cheetah Run Front: Reward 740.7
Cup Spin: Reward 840.4
Reacher Hard: Reward 985.0
Walker Run: Reward 769.2

MetaWorld
Bin Picking: Reward 1585.0, Success Rate 1.00
Disassemble: Reward 1556.2, Success Rate 1.00
Door Close: Reward 1549.1, Success Rate 1.00
Lever Pull: Reward 1664.9, Success Rate 0.90

BibTeX

@misc{cheng2026scalingworldmodelreinforcementlearning,
      title={Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization},
      author={Xiaoyuan Cheng and Wenxuan Yuan and Zhancun Mu and Yuanzhao Zhang and Yiming Yang and Hai Wang and Zhuo Sun and Che Liu},
      year={2026},
      eprint={2605.26282},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2605.26282}
}