DPO vs RLHF

Medium

LLMs, Anthropic, OpenAI, DeepMind

How does Direct Preference Optimization (DPO) differ from traditional RLHF?

Multiple Choice

DPO uses more human feedback data than RLHF

DPO trains a reward model first, then uses it more efficiently than RLHF

DPO skips the reward model and RL loop, directly optimizing the policy on preference data with a closed-form loss

DPO can only be applied to encoder-only models
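For reference, a closed-form loss on preference data can be sketched as below. This is a minimal illustration, assuming the commonly cited DPO objective: the negative log-sigmoid of a scaled difference between policy-versus-reference log-probability ratios for the preferred and the dispreferred response. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are hypothetical choices made for this example, and the inputs are assumed to be summed token log-probabilities for a single preference pair.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO-style loss (illustrative sketch, not a library API).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # How much more (in log space) the policy likes each response
    # than the reference model does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; beta controls how far the policy may
    # drift from the reference (assumed hyperparameter).
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Example: the policy has shifted probability toward the preferred response.
loss = dpo_loss(policy_chosen_logp=-4.0, policy_rejected_logp=-8.0,
                ref_chosen_logp=-5.0, ref_rejected_logp=-7.0)
```

In this sketch no reward model is trained and no sampling loop is run: the loss depends only on log-probabilities from the policy and a frozen reference model, so it can be minimized with ordinary gradient descent over a dataset of preference pairs.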

