D3PO
Preference Conditioned Multi-Objective Reinforcement Learning
Tanmay Ambadkar, Sourav Panda, Shreyash Kale, Jonathan Dodge, Abhinav Verma
A novel algorithm that trains a single, preference-conditioned policy to efficiently discover a diverse and high-quality set of trade-off solutions in multi-objective environments.

Figure: Overview of the D³PO decomposed architecture.
Balancing Conflicting Goals
Many real-world problems, from autonomous driving to logistics, require balancing multiple, often conflicting, objectives—like speed versus safety, or cost versus environmental impact.
Training a single, flexible policy that can adapt to different user preferences is the Holy Grail. However, existing methods often fail due to two critical issues: Gradient Interference and Mode Collapse.
The Multi-Objective Dilemma
Destructive Interference
Gradients from conflicting objectives cancel each other out, stalling learning.
Mode Collapse
The agent ignores diverse preferences and collapses to a single "safe bet" behavior.
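A toy numeric example (the advantage values are hypothetical, purely for illustration) shows why naive scalarization triggers destructive interference: summing equal-and-opposite per-objective advantages leaves no learning signal at all.

```python
import numpy as np

# Hypothetical advantages for three sampled actions under two objectives
# that are in direct conflict (e.g. speed vs. safety).
adv_speed = np.array([1.0, -0.5, 0.8])
adv_safety = np.array([-1.0, 0.5, -0.8])

# Scalarizing first (equal weights) cancels the signal entirely...
scalarized = 0.5 * adv_speed + 0.5 * adv_safety
print(scalarized)  # [0. 0. 0.] -- nothing left to learn from

# ...whereas the decomposed streams each retain their full magnitude.
print(np.abs(adv_speed).sum(), np.abs(adv_safety).sum())
```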
"How can we learn the full Pareto front with a single neural network?"
How D3PO Works
D3PO (Decomposed, Diversity-Driven Policy Optimization) tackles these challenges head-on with a three-pronged approach.
Decomposed Optimization
We compute advantages for each objective independently, preventing signal cancellation before updates.
Late-Stage Weighting
User preferences are applied only at the final stage of loss computation, so the per-objective learning signals stay intact until they are combined.
Diversity Regularizer
A novel loss term forces the policy to produce different behaviors for different preferences.
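To make the first two components concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code; the function name and the PPO-style clipping are assumptions): each objective's advantage stream is normalized independently, and the preference vector `weights` touches the loss only at the final reduction.

```python
import numpy as np

def d3po_style_actor_loss(log_probs, old_log_probs, advantages, weights, clip_eps=0.2):
    """Sketch of decomposed optimization with late-stage weighting.

    advantages: array of shape (batch, n_objectives) -- one stream per
    objective, so conflicting signals cannot cancel during estimation.
    """
    # Decomposition: normalize each objective's advantage stream separately.
    adv = (advantages - advantages.mean(0)) / (advantages.std(0) + 1e-8)
    # PPO-style clipped probability ratio, shape (batch,).
    ratio = np.exp(log_probs - old_log_probs)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Per-objective clipped surrogate, still unweighted: (batch, n_objectives).
    surrogate = np.minimum(ratio[:, None] * adv, clipped[:, None] * adv)
    # Late-stage weighting: preferences enter only at this final reduction.
    return -(surrogate * weights).sum(-1).mean()
```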
Theoretical Foundations
D3PO is grounded in formal analysis that guarantees stable and diverse policy learning.
Key Formulas
Final Actor Objective
Scaled Diversity Regularizer
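The formulas themselves did not survive the page export. As an illustrative reconstruction only (the symbols $\lambda$, $\rho_\theta$, and the divergence $D$ below are my notation, not necessarily the paper's), the described components could combine as:

$$
\mathcal{L}_{\text{actor}}(\theta) \;=\; -\,\mathbb{E}\!\left[\sum_{i=1}^{m} w_i \,\hat{A}_i(s,a)\,\rho_\theta(s,a,w)\right] \;+\; \lambda\,\mathcal{L}_{\text{div}}(\theta),
$$

where $\hat{A}_i$ is the advantage computed independently for objective $i$, the preference vector $w$ enters only at this final weighting, and $\rho_\theta$ is a PPO-style clipped policy ratio. A diversity regularizer matching the description would reward divergence between action distributions conditioned on different preferences:

$$
\mathcal{L}_{\text{div}}(\theta) \;=\; -\,\mathbb{E}_{w \neq w'}\!\left[\,D\!\left(\pi_\theta(\cdot \mid s, w)\;\big\|\;\pi_\theta(\cdot \mid s, w')\right)\right].
$$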
Formal Guarantees
No Advantage Cancellation
We prove that decomposing advantages prevents the loss of gradient information when objectives conflict.
No Mode Collapse
We guarantee that minimizing our objective forces the policy to behave differently for different preferences.
State-of-the-Art Pareto Front Discovery

Figure: Pareto fronts discovered on Hopper-v2, Ant-v2, and Humanoid-v2.
Quantitative Comparison (Hypervolume)
| Environment | Metric | PG-MORL | GPI-LS | C-MORL | D³PO (Ours) |
|---|---|---|---|---|---|
| Hopper-2d | HV | 1.20 | 1.19 | 1.37 | 1.30 |
| | EU | 2.34 | 2.33 | 2.53 | 2.47 |
| | SP | 5.13 | 0.49 | 1.13 | 0.26 |
| Hopper-3d | HV | 1.59 | 1.70 | 2.19 | 2.12 |
| | EU | 1.47 | 1.62 | 1.81 | 1.74 |
| | SP | 0.76 | 0.74 | 0.53 | 0.04 |
| Ant-2d | HV | 0.35 | 1.17 | 1.31 | 1.91 |
| | EU | 0.81 | 4.28 | 2.50 | 3.14 |
| | SP | 2.20 | 3.61 | 2.65 | 0.66 |
| Ant-3d | HV | 0.94 | 0.55 | 2.61 | 2.68 |
| | EU | 1.07 | 2.41 | 2.06 | 1.99 |
| | SP | 0.02 | 1.96 | 0.06 | 0.004 |
| Humanoid-2d | HV | 2.62 | 1.98 | 3.43 | 3.76 |
| | EU | 4.06 | 3.67 | 4.78 | 5.11 |
| | SP | 0.13 | 0* | 2.21 | 0.003 |

HV = hypervolume, EU = expected utility, SP = sparsity.
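Hypervolume (HV) is the volume of objective space dominated by the discovered front, measured against a fixed reference point. A minimal 2-D sketch (a hypothetical helper for intuition, not the paper's evaluation code):

```python
def hypervolume_2d(points, ref):
    """Area dominated by a 2-D maximization front, relative to a reference
    point `ref` that every solution dominates. Illustrative sketch only."""
    hv, prev_y = 0.0, ref[1]
    # Sweep from best to worst first objective; each non-dominated point
    # adds the rectangle between its x and the reference, above the running y.
    for x, y in sorted(points, key=lambda p: -p[0]):
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

# Three trade-off solutions, all dominating the origin:
print(hypervolume_2d([(3, 1), (2, 2), (1, 3)], ref=(0, 0)))  # 6.0
```

A larger dominated area means the front pushes further out along every trade-off, which is why HV is the headline metric in the table above.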
Extreme Model Efficiency
By learning a single unified policy instead of a population, D3PO reduces memory requirements by orders of magnitude while representing an unbounded number of preference solutions.
| Environment | D3PO (MB) | C-MORL (MB) | Reduction |
|---|---|---|---|
| Ant-2D | 0.089 | 14.385 | ~160x |
| Humanoid | 0.212 | 5.060 | ~24x |
| Building-9D | 0.064 | 11.608 | ~180x |
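The Reduction column is simply the ratio of the two model sizes; a quick check of the table's numbers:

```python
# Model sizes in MB, copied from the table above: (D3PO, C-MORL).
sizes = {
    "Ant-2D": (0.089, 14.385),
    "Humanoid": (0.212, 5.060),
    "Building-9D": (0.064, 11.608),
}
for env, (d3po, cmorl) in sizes.items():
    # Prints roughly 160x, 24x, and 180x, matching the reported factors.
    print(f"{env}: {cmorl / d3po:.1f}x smaller")
```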
Rigorous Statistical Validation
Performance improvements are validated using one-sided Welch's t-tests across 5 seeds; D3PO achieves statistically significant gains in 12 of 18 comparisons.
- Ant-2d Dominance: significant gains across all metrics (HV, EU, and SP).
- High-Dim Robustness: on Humanoid-2d, D3PO is the only method to avoid collapse, with significant gains in both HV and EU.
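This testing procedure can be reproduced with SciPy; the per-seed scores below are placeholders for illustration, not values from the paper's runs.

```python
import numpy as np
from scipy import stats

# Hypothetical hypervolume scores over 5 seeds for each method.
d3po_hv = np.array([1.90, 1.93, 1.89, 1.92, 1.91])
baseline_hv = np.array([1.30, 1.34, 1.28, 1.31, 1.33])

# One-sided Welch's t-test (unequal variances): is D3PO's mean HV higher?
t, p = stats.ttest_ind(d3po_hv, baseline_hv, equal_var=False, alternative="greater")
print(f"t = {t:.2f}, p = {p:.2g}")  # a small p here indicates a significant gain
```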
Demonstrated Performance
Achieves state-of-the-art Hypervolume on complex continuous control tasks.
Provably prevents mode collapse, ensuring diverse solutions.
Replaces 200+ discrete policies with one unified model, representing the unbounded Pareto front.