Artwork

Indhold leveret af LessWrong. Alt podcastindhold inklusive episoder, grafik og podcastbeskrivelser uploades og leveres direkte af LessWrong eller deres podcastplatformspartner. Hvis du mener, at nogen bruger dit ophavsretligt beskyttede værk uden din tilladelse, kan du følge processen beskrevet her https://da.player.fm/legal.
Player FM - Podcast-app
Gå offline med appen Player FM !

“Self-Other Overlap: A Neglected Approach to AI Alignment” by Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena

23:21
 
Del
 

Manage episode 432958306 series 3364760
Indhold leveret af LessWrong. Alt podcastindhold inklusive episoder, grafik og podcastbeskrivelser uploades og leveres direkte af LessWrong eller deres podcastplatformspartner. Hvis du mener, at nogen bruger dit ophavsretligt beskyttede værk uden din tilladelse, kan du følge processen beskrevet her https://da.player.fm/legal.
Figure 1. Image generated by DALL-3 to represent the concept of self-other overlapMany thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarnacke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work.
Summary
In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others while preserving performance. There is a large body of evidence suggesting that neural self-other overlap is connected to pro-sociality in humans and we argue that there are more fundamental reasons to believe this prior is relevant for AI Alignment. We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability and has low capabilities externalities. We also share an early experiment of how fine-tuning a deceptive policy with self-other overlap reduces deceptive behavior in a simple RL environment. On top of that [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
July 30th, 2024
Source:
https://www.lesswrong.com/posts/hzt9gHpNwA2oHtwKX/self-other-overlap-a-neglected-approach-to-ai-alignment
---
Narrated by TYPE III AUDIO.
---
Images from the article:
undefined
undefined
undefined
undefined
undefined
  continue reading

355 episoder

Artwork
iconDel
 
Manage episode 432958306 series 3364760
Indhold leveret af LessWrong. Alt podcastindhold inklusive episoder, grafik og podcastbeskrivelser uploades og leveres direkte af LessWrong eller deres podcastplatformspartner. Hvis du mener, at nogen bruger dit ophavsretligt beskyttede værk uden din tilladelse, kan du følge processen beskrevet her https://da.player.fm/legal.
Figure 1. Image generated by DALL-3 to represent the concept of self-other overlapMany thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarnacke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work.
Summary
In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others while preserving performance. There is a large body of evidence suggesting that neural self-other overlap is connected to pro-sociality in humans and we argue that there are more fundamental reasons to believe this prior is relevant for AI Alignment. We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability and has low capabilities externalities. We also share an early experiment of how fine-tuning a deceptive policy with self-other overlap reduces deceptive behavior in a simple RL environment. On top of that [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
July 30th, 2024
Source:
https://www.lesswrong.com/posts/hzt9gHpNwA2oHtwKX/self-other-overlap-a-neglected-approach-to-ai-alignment
---
Narrated by TYPE III AUDIO.
---
Images from the article:
undefined
undefined
undefined
undefined
undefined
  continue reading

355 episoder

所有剧集

×
 
Loading …

Velkommen til Player FM!

Player FM is scanning the web for high-quality podcasts for you to enjoy right now. It's the best podcast app and works on Android, iPhone, and the web. Signup to sync subscriptions across devices.

 

Hurtig referencevejledning