

When? This feed was archived on February 21, 2025 21:08 (
Why? Inactive feed status. Our servers were unable to retrieve a valid podcast feed for a sustained period.
What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check that the publisher's feed link below is valid, then contact support to request that the feed be restored or to raise any other concerns.
Gradient hacking is a hypothesized phenomenon where:
Below I give some potential examples of gradient hacking, divided into those which exploit RL credit assignment and those which exploit gradient descent itself. My concern is that models might use techniques like these either to influence which goals they develop, or to fool our interpretability techniques. Even if those effects don’t last in the long term, they might last until the model is smart enough to misbehave in other ways (e.g. specification gaming, or reward tampering), or until it’s deployed in the real world—especially in the RL examples, since convergence to a global optimum seems unrealistic (and ill-defined) for RL policies trained on real-world data. However, since gradient hacking isn’t very well-understood right now, both the definition above and the examples below should only be considered preliminary.
Source:
https://www.alignmentforum.org/posts/EeAgytDZbDjRznPMA/gradient-hacking-definitions-and-examples
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
---
A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.
85 episodes