D3PO

Preference Conditioned Multi-Objective Reinforcement Learning

Tanmay Ambadkar, Sourav Panda, Shreyash Kale, Jonathan Dodge, Abhinav Verma


A novel algorithm that trains a single, preference-conditioned policy to efficiently discover a diverse and high-quality set of trade-off solutions in multi-objective environments.

D³PO Algorithm Overview

Overview of the D³PO decomposed architecture.

The Challenge

Balancing Conflicting Goals

Many real-world problems, from autonomous driving to logistics, require balancing multiple, often conflicting, objectives—like speed versus safety, or cost versus environmental impact.

Training a single, flexible policy that can adapt to different user preferences is the Holy Grail. However, existing methods often fail due to two critical issues: Gradient Interference and Mode Collapse.

The Multi-Objective Dilemma

1
Destructive Interference

Gradients from conflicting objectives cancel each other out, stalling learning.

2
Mode Collapse

The agent ignores diverse preferences and collapses to a single "safe bet" behavior.

"How can we learn the full Pareto front with a single neural network?"
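The destructive-interference failure mode above can be seen in a few lines. This is a toy illustration with hypothetical numbers (not from the paper): a softmax policy over two actions, and two objectives whose advantages for the sampled action are equal and opposite.

```python
import numpy as np

# Softmax policy over 2 actions; gradient of log pi(a0|s) w.r.t. the logits.
theta = np.zeros(2)
probs = np.exp(theta) / np.exp(theta).sum()
grad_logp = np.array([1.0, 0.0]) - probs        # = [0.5, -0.5]

adv = np.array([2.0, -2.0])                     # per-objective advantages (toy)
w = np.array([0.5, 0.5])                        # equal preference weights

# Scalarize-first: the weighted advantage is exactly zero, so the policy
# gradient vanishes even though both objectives carry a strong signal.
g_scalarized = (w @ adv) * grad_logp            # zero vector: learning stalls

# Kept decomposed: each per-objective gradient is nonzero and informative.
g_per_objective = np.stack([a * grad_logp for a in adv])
```

Averaging the objectives before computing the gradient throws the information away; keeping them separate preserves it, which is the motivation for the decomposition below.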

Our Framework

How D3PO Works

D3PO (Decomposed, Diversity-Driven Policy Optimization) tackles these challenges head-on with a three-pronged approach.

1

Decomposed Optimization

We compute advantages for each objective independently, preventing signal cancellation before updates.

2

Late-Stage Weighting

User preferences are applied only at the final stage of loss calculation, ensuring stable integration.

3

Diversity Regularizer

A novel loss term forces the policy to produce different behaviors for different preferences.
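The three steps above can be sketched as a single loss function. This is a minimal NumPy reading of the approach, not the authors' implementation; the tensor shapes and default coefficients (`clip_eps`, `lambda_div`, `alpha`) are illustrative assumptions.

```python
import numpy as np

def d3po_actor_loss(ratios, advantages, weights, kl_to_other, pref_dist,
                    clip_eps=0.2, lambda_div=0.1, alpha=1.0):
    """Sketch of the D3PO actor objective (our reading, not the authors' code).

    ratios:      [T]    importance ratios pi_theta / pi_old
    advantages:  [T, d] per-objective advantages
    weights:     [d]    preference vector omega
    kl_to_other: [T]    KL(pi(.|s,w) || pi(.|s,w')) for a second preference w'
    pref_dist:   scalar ||w - w'||_1
    """
    r = ratios[:, None]
    # Step 1 (decomposed optimization): a clipped surrogate per objective,
    # computed independently so conflicting signals cannot cancel early.
    unclipped = r * advantages
    clipped = np.clip(r, 1 - clip_eps, 1 + clip_eps) * advantages
    per_obj = np.minimum(unclipped, clipped).mean(axis=0)       # [d]
    # Step 2 (late-stage weighting): preferences enter only here.
    weighted = float(weights @ per_obj)
    # Step 3 (diversity regularizer): KL gap scaled to preference distance.
    div = float(((kl_to_other - alpha * pref_dist) ** 2).mean())
    return -weighted + lambda_div * div
```

With the diversity term at its target, the loss reduces to the negated preference-weighted sum of per-objective clipped surrogates.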

Theoretical Foundations

D3PO is grounded in formal analysis that guarantees stable and diverse policy learning.

1

Key Formulas

Final Actor Objective
$$\mathcal{L}_{\mathrm{actor}}(\theta) = -\left(\sum_{i=1}^{d} \omega_i \, \mathcal{L}_{\mathrm{clip}}^{(i)}(\theta)\right) + \lambda_{\mathrm{div}} \, \mathcal{L}_{\mathrm{diversity}}(\theta)$$
Scaled Diversity Regularizer
$$\mathcal{L}_{\mathrm{diversity}}(\theta) = \mathbb{E}_t\left[\left(D_{KL}\big(\pi_\theta(\cdot \mid s_t, \omega) \,\|\, \pi_\theta(\cdot \mid s_t, \omega')\big) - \alpha \,\|\omega - \omega'\|_1\right)^2\right]$$
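The diversity regularizer above has a simple closed form when the policy is a diagonal Gaussian, which is an assumption here (the paper's policy class may differ). This sketch pushes the KL between the policies induced by preferences $\omega$ and $\omega'$ toward $\alpha \|\omega - \omega'\|_1$:

```python
import numpy as np

def gauss_kl(mu1, std1, mu2, std2):
    """Closed-form KL between diagonal Gaussians, summed over action dims."""
    return (np.log(std2 / std1)
            + (std1**2 + (mu1 - mu2)**2) / (2 * std2**2) - 0.5).sum(-1)

def diversity_loss(mu1, std1, mu2, std2, w1, w2, alpha=1.0):
    """Sketch of the scaled diversity regularizer, assuming Gaussian policies.
    Inputs are [T, act_dim] policy parameters for preferences w1 and w2."""
    kl = gauss_kl(mu1, std1, mu2, std2)          # per-state KL, shape [T]
    target = alpha * np.abs(w1 - w2).sum()       # alpha * ||w - w'||_1
    return ((kl - target) ** 2).mean()
```

Because the penalty is two-sided, identical behavior under distant preferences is punished (preventing mode collapse), but so is gratuitous divergence under nearby preferences.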
2

Formal Guarantees

No Advantage Cancellation

We prove that decomposing advantages prevents the loss of gradient information when objectives conflict.

No Mode Collapse

We guarantee that minimizing our objective forces the policy to behave differently for different preferences.

State-of-the-Art Pareto Front Discovery

Pareto fronts discovered by D³PO on Hopper-v2, Ant-v2, and Humanoid-v2.

Quantitative Comparison (Hypervolume)

| Environment | Metric | PG-MORL | GPI-LS | C-MORL | D³PO (Ours) |
|---|---|---|---|---|---|
| Hopper-2d | HV (10^5) | 1.20 | 1.19 | 1.37 | 1.30 |
| | EU (10^2) | 2.34 | 2.33 | 2.53 | 2.47 |
| | SP (10^2) ↓ | 5.13 | 0.49 | 1.13 | 0.26 |
| Hopper-3d | HV (10^7) | 1.59 | 1.70 | 2.19 | 2.12 |
| | EU (10^2) | 1.47 | 1.62 | 1.81 | 1.74 |
| | SP (10^2) ↓ | 0.76 | 0.74 | 0.53 | 0.04 |
| Ant-2d | HV (10^5) | 0.35 | 1.17 | 1.31 | 1.91 |
| | EU (10^2) | 0.81 | 4.28 | 2.50 | 3.14 |
| | SP (10^3) ↓ | 2.20 | 3.61 | 2.65 | 0.66 |
| Ant-3d | HV (10^7) | 0.94 | 0.55 | 2.61 | 2.68 |
| | EU (10^2) | 1.07 | 2.41 | 2.06 | 1.99 |
| | SP (10^3) ↓ | 0.02 | 1.96 | 0.06 | 0.004 |
| Humanoid-2d | HV (10^5) | 2.62 | 1.98 | 3.43 | 3.76 |
| | EU (10^2) | 4.06 | 3.67 | 4.78 | 5.11 |
| | SP (10^4) ↓ | 0.13 | 0* | 2.21 | 0.003 |

HV = hypervolume, EU = expected utility, SP = sparsity; ↓ marks metrics where lower is better.
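For intuition on the hypervolume (HV) metric reported above: in two dimensions it is the area dominated by the Pareto front relative to a reference point. A minimal sweep-line sketch (assuming both objectives are maximized and the reference point sits below the front):

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Area dominated by `points` relative to reference point `ref`,
    for a 2-objective maximization problem. A sketch, not a general
    n-dimensional implementation."""
    # Keep only points that strictly dominate the reference point.
    pts = np.array([p for p in points if p[0] > ref[0] and p[1] > ref[1]])
    if len(pts) == 0:
        return 0.0
    # Sweep in order of decreasing first objective, accumulating the
    # rectangle each non-dominated point adds above the previous best y.
    pts = pts[np.argsort(-pts[:, 0])]
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv
```

A larger HV means the front pushes further out along both objectives at once, which is why it is the headline metric for Pareto-front quality.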

Extreme Model Efficiency

By learning a single unified policy instead of a population, D3PO cuts memory requirements by up to two orders of magnitude while representing an unbounded number of preference solutions.

| Env | D3PO (MB) | C-MORL (MB) | Reduction |
|---|---|---|---|
| Ant-2D | 0.089 | 14.385 | ~160x |
| Humanoid | 0.212 | 5.060 | ~24x |
| Building-9D | 0.064 | 11.608 | ~180x |
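The memory gap follows directly from parameter counts: one preference-conditioned network versus hundreds of separate ones. The sketch below uses hypothetical layer sizes and a hypothetical population of 200 policies for illustration (not the paper's actual architectures):

```python
def mlp_params(in_dim, hidden, out_dim):
    """Parameter count (weights + biases) of a plain fully connected MLP."""
    dims = [in_dim] + hidden + [out_dim]
    return sum(dims[i] * dims[i + 1] + dims[i + 1] for i in range(len(dims) - 1))

# Hypothetical sizes: 27-dim state, 8-dim action, 2 objectives, 64x64 hidden.
state_dim, act_dim, n_obj = 27, 8, 2
single = mlp_params(state_dim + n_obj, [64, 64], act_dim)   # one conditioned policy
population = 200 * mlp_params(state_dim, [64, 64], act_dim) # 200 separate policies

# float32 storage in MB; the ratio approaches the population size.
print(f"single: {single * 4 / 1e6:.3f} MB, "
      f"population: {population * 4 / 1e6:.2f} MB, "
      f"ratio: ~{population / single:.0f}x")
```

Appending the preference vector to the input adds only a handful of first-layer weights, so the conditioned policy is barely larger than any single member of the population.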

Rigorous Statistical Validation

Performance improvements are validated using one-sided Welch's t-tests across 5 seeds. D3PO achieves statistically significant gains (p < 0.05) in 12 of 18 comparisons.

  • Ant-2d Dominance: Significant across all metrics: HV (p < 0.001), EU (p = 0.002), SP (p < 0.001).
  • High-Dim Robustness: On Humanoid-2d, D3PO is the only method to avoid collapse, with significant gains in HV (p = 0.002) and EU (p < 0.001).
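Welch's t-test is the unequal-variance variant of the two-sample t-test, appropriate when baselines have different seed-to-seed spread. A dependency-free sketch with made-up seed scores (not the paper's data); `welch_t` is our helper name:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for the one-sided hypothesis H1: mean(a) > mean(b)."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)          # sample variances
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2**2 / ((va / na)**2 / (na - 1) + (vb / nb)**2 / (nb - 1))
    return t, df

# Illustrative per-seed hypervolume scores for two methods (5 seeds each).
d3po  = [1.90, 1.92, 1.88, 1.93, 1.91]
other = [1.30, 1.33, 1.29, 1.32, 1.31]
t, df = welch_t(d3po, other)
# One-sided critical value at the 0.05 level for df near 8 is about 1.86.
print(f"t = {t:.1f}, df = {df:.1f}, significant: {t > 1.86}")
```

In practice `scipy.stats.ttest_ind(a, b, equal_var=False, alternative="greater")` computes the same test with an exact p-value.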

Demonstrated Performance

SOTA
Hypervolume

Achieves state-of-the-art Hypervolume on complex continuous control tasks.

0
Mode Collapse

Provably prevents mode collapse, ensuring diverse solutions.

1
Single Policy

Replaces 200+ discrete policies with one unified model, representing the unbounded Pareto front.