Erik Jones on Automatically Auditing Large Language Models

Erik is a PhD student at Berkeley working with Jacob Steinhardt. He is interested in making generative machine learning systems more robust, reliable, and aligned, with a focus on large language models. In this interview we talk about his paper "Automatically Auditing Large Language Models via Discrete Optimization", which he presented at ICML.
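For readers curious what "auditing via discrete optimization" looks like in practice, here is a minimal illustrative sketch. It is not the algorithm from the paper (which jointly optimizes prompts and outputs and uses gradient information to rank candidate tokens); it only does a greedy coordinate search over prompt tokens to raise the probability that a model emits a fixed target string. It assumes Hugging Face transformers, PyTorch, GPT-2, and a placeholder target phrase chosen purely for illustration.

```python
# Simplified sketch of discrete prompt optimization for auditing.
# NOT the paper's method: a greedy, random-candidate coordinate search
# over prompt tokens that tries to make GPT-2 assign high probability
# to a fixed target continuation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

target = tokenizer.encode(" my hypothetical target phrase")  # behavior to elicit
prompt = tokenizer.encode("the the the the")                 # initial prompt tokens

def target_logprob(prompt_ids, target_ids):
    """Log-probability of the target continuation given the prompt."""
    ids = torch.tensor([prompt_ids + target_ids])
    with torch.no_grad():
        logits = model(ids).logits[0]
    logprobs = torch.log_softmax(logits, dim=-1)
    # The token at position i is predicted from the logits at position i - 1.
    score = 0.0
    for i, tok in enumerate(target_ids):
        score += logprobs[len(prompt_ids) + i - 1, tok].item()
    return score

# Greedy coordinate ascent over a discrete token set: at each step, try a
# random candidate pool of replacements for one prompt position and keep
# whichever token most increases the target's log-probability.
candidates = torch.randint(0, tokenizer.vocab_size, (64,)).tolist()
for step in range(20):
    pos = step % len(prompt)
    best, best_score = prompt[pos], target_logprob(prompt, target)
    for tok in candidates:
        trial = prompt[:pos] + [tok] + prompt[pos + 1:]
        s = target_logprob(trial, target)
        if s > best_score:
            best, best_score = tok, s
    prompt[pos] = best

print(tokenizer.decode(prompt), "->", tokenizer.decode(target))
```

The point the interview keeps returning to is that the search runs over the finite vocabulary of tokens rather than over continuous embeddings, so any prompt the audit finds is one a user could actually type.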

YouTube: https://youtu.be/bhE5Zs3Y1n8

Paper: https://arxiv.org/abs/2303.04381

Erik: https://twitter.com/ErikJones313

Host: https://twitter.com/MichaelTrazzi

Patreon: https://www.patreon.com/theinsideview

Outline

00:00 Highlights

00:31 Erik's background and research at Berkeley

01:19 Motivation for doing safety research on language models

02:56 Is it too easy to fool today's language models?

03:31 The goal of adversarial attacks on language models

04:57 Automatically Auditing Large Language Models via Discrete Optimization

06:01 Optimizing over a finite set of tokens rather than continuous embeddings

06:44 Goal is revealing behaviors, not necessarily breaking the AI

07:51 On the feasibility of solving adversarial attacks

09:18 Suppressing dangerous knowledge vs just bypassing safety filters

10:35 Can you really ask a language model to cook meth?

11:48 Optimizing French to English translation example

13:07 Forcing toxic celebrity outputs just to test rare behaviors

13:19 Testing the method on GPT-2 and GPT-J

14:03 Adversarial prompts transferred to GPT-3 as well

14:39 How this auditing research fits into the broader AI safety field

15:49 Need for automated tools to audit failures beyond what humans can find

17:47 Auditing to avoid unsafe deployments, not for existential risk reduction

18:41 Adaptive auditing that updates based on the model's outputs

19:54 Prospects for using these methods to detect model deception

22:26 Preferring safety via alignment over auditing constraints alone; closing thoughts

Patreon supporters:

  • Tassilo Neubauer
  • MonikerEpsilon
  • Alexey Malafeev
  • Jack Seroy
  • JJ Hepburn
  • Max Chiswick
  • William Freire
  • Edward Huff
  • Gunnar Höglund
  • Ryan Coppolo
  • Cameron Holmes
  • Emil Wallner
  • Jesse Hoogland
  • Jacques Thibodeau
  • Vincent Weisser