LW - Catastrophic Goodhart in RL with KL penalty by Thomas Kwa

Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Catastrophic Goodhart in RL with KL penalty, published by Thomas Kwa on May 15, 2024 on LessWrong.

TLDR: In the last two posts, we showed that optimizing for a proxy can fail to increase true utility, but only when the error is heavy-tailed. We now show that this also happens in RLHF with a KL penalty. This post builds on our earlier result with a more realistic setting and assumptions:

- Rather than modeling optimization as conditioning on a minimum reward threshold, we study maximization of reward with a KL divergence penalty, as in RLHF.
- We remove the assumption of independence between the error and utility distributions, which we think was the weakest part of the last post.
- When the true utility V is light-tailed, the proxy can be maximized while keeping E[V] at the same level as the prior. We can't guarantee anything about E[V] when V is heavy-tailed; it could even go to minus infinity.

Abstract

When applying KL regularization, the trained model is regularized towards some prior policy π0. One would hope that a KL penalty can produce good outcomes even in the case of reward misspecification; that is, if the reward U is the sum of true utility V and an error term X, we would hope that optimal policies under a KL penalty achieve high V even if the magnitude of X is large. We show that this is not always the case: when X is heavy-tailed, there are arbitrarily well-performing policies π with Eπ[V] ≤ Eπ0[V]; that is, policies that get no higher true utility than the prior. However, when the error is light-tailed and independent of V, the optimal policy under a KL penalty results in V > 0, and V can be made arbitrarily large. Thus, the tails of the error distribution are crucial in determining how much utility will result from optimization towards an imperfect proxy.

Intuitive explanation of catastrophic Goodhart with a KL penalty

Recall that the KL divergence between two distributions P and Q is defined as

DKL(P ∥ Q) = Ex∼P[log(P(x)/Q(x))].

If we have two policies π, π0, we abuse notation to define DKL(π ∥ π0) as the KL divergence between the distributions of actions taken on the states in trajectories reached by π. That is, if Tr(π) is the distribution of trajectories taken by π, we penalize

DKL(π ∥ π0) = Eτ∼Tr(π)[ Σs∈τ DKL(π(⋅|s) ∥ π0(⋅|s)) ].

This strongly penalizes π taking actions that the base policy π0 never takes, but does not force π to take all actions the base policy takes. If our reward model gives reward U, then the optimal policy for RLHF with a KL penalty is

π*(τ) ∝ π0(τ) exp(U(τ)/β).

Suppose we have an RL environment with reward U = X + V, where X is an error term that is heavy-tailed under π0, and V is the "true utility", assumed to be light-tailed under π0. Without loss of generality, we assume that E[U(π0)] = 0. If we optimize for E[U(π)] − βDKL(π ∥ π0), there is no maximum because this expression is unbounded. In fact, it is possible to get E[U(π)] > M and DKL(π ∥ π0) < ϵ for any M, ϵ. That is, we get arbitrarily large proxy reward U and arbitrarily small KL penalty. For such policies π, it is necessarily the case that limϵ→0 E[V(π)] = 0; that is, for policies with low KL penalty, utility goes to zero.

Like in the previous post, we call this catastrophic Goodhart because the utility produced by our optimized policy is as bad as if we hadn't optimized at all. This is a corollary of a property about distributions (Theorems 1 and 3 below), which we apply to the case of RLHF with unbounded rewards (Theorem 2).
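This behavior is easy to see numerically. Below is a minimal sketch, not from the post, under illustrative assumptions: a discrete set of N trajectories, a uniform base policy π0 over them, Pareto-distributed (heavy-tailed) error X, Gaussian (light-tailed) true utility V, and an arbitrarily chosen mixing weight eps. It builds a policy that copies π0 except for a tiny amount of extra probability on the single trajectory with the largest error, then checks that proxy reward E[U] is large while the KL penalty and the gain in E[V] are both negligible.

```python
# Toy numerical sketch (illustrative assumptions, not from the post):
# N discrete trajectories, uniform base policy pi0, heavy-tailed error X,
# light-tailed true utility V, proxy reward U = X + V.
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
X = rng.pareto(1.1, size=N)          # heavy-tailed error under pi0
V = rng.normal(0.0, 1.0, size=N)     # light-tailed true utility under pi0
U = X + V
U -= U.mean()                        # WLOG: E_pi0[U] = 0

pi0 = np.full(N, 1.0 / N)            # base policy over trajectories

# Pathological policy: copy pi0, then move a tiny amount of mass eps onto
# the single trajectory with the largest error term.
eps = 1e-4
pi = (1.0 - eps) * pi0
pi[np.argmax(X)] += eps

kl = float(np.sum(pi * np.log(pi / pi0)))   # D_KL(pi || pi0)
print("E_pi[U]     =", pi @ U)   # large proxy reward, driven by the X tail
print("E_pi[V]     =", pi @ V)   # ~ E_pi0[V] ~ 0: no true utility gained
print("KL(pi||pi0) =", kl)       # tiny penalty
```

In this construction, shrinking eps while growing N drives the KL term toward zero while eps times the sample maximum of the heavy-tailed X keeps growing, mirroring the "E[U(π)] > M and DKL(π ∥ π0) < ϵ for any M, ϵ" claim above.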
The manner in which these pathological policies π achieve high E[U] is also concerning: most of the time they match the reference policy π0, but a tiny fraction of the time they pick trajectories with extremely high reward. Thus, if we only observe actions from the policy π, it could be impossible to tell whether π is Goodharting or identical to the base policy (see the sketch after this excerpt).

Results

All proofs are in the appendix, which will be published shortly after this post.

X heavy-tailed, V light-tailed: E[V] → 0

We'll start by demon...
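To put a rough number on that detection point: in the toy mixture policy sketched earlier, the anomalous trajectory is chosen with probability about eps per episode, so the chance of ever seeing it across k monitored episodes is 1 − (1 − eps)^k. The snippet below, using the same illustrative eps as before rather than anything from the post, shows how many episodes one would have to watch before the Goodharting policy stops looking identical to π0.

```python
# Rough illustration (continues the toy example above, not from the post):
# probability of observing at least one anomalous trajectory in k episodes.
eps = 1e-4   # assumed per-episode probability of the off-distribution trajectory
for k in (10, 100, 1_000, 10_000, 100_000):
    p_detect = 1.0 - (1.0 - eps) ** k
    print(f"episodes watched: {k:>7}   P(see the anomaly) ~ {p_detect:.3f}")
```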