Erik Jones on Automatically Auditing Large Language Models
Erik Jones is a PhD student at Berkeley working with Jacob Steinhardt, interested in making generative machine learning systems more robust, reliable, and aligned, with a focus on large language models. In this interview we talk about his paper "Automatically Auditing Large Language Models via Discrete Optimization", which he presented at ICML.
Youtube: https://youtu.be/bhE5Zs3Y1n8
Paper: https://arxiv.org/abs/2303.04381
Erik: https://twitter.com/ErikJones313
Host: https://twitter.com/MichaelTrazzi
Patreon: https://www.patreon.com/theinsideview
Outline
00:00 Highlights
00:31 Erik's background and research at Berkeley
01:19 Motivation for doing safety research on language models
02:56 Is it too easy to fool today's language models?
03:31 The goal of adversarial attacks on language models
04:57 Automatically Auditing Large Language Models via Discrete Optimization
06:01 Optimizing over a finite set of tokens rather than continuous embeddings (see the sketch after this outline)
06:44 Goal is revealing behaviors, not necessarily breaking the AI
07:51 On the feasibility of solving adversarial attacks
09:18 Suppressing dangerous knowledge vs just bypassing safety filters
10:35 Can you really ask a language model to cook meth?
11:48 Optimizing French to English translation example
13:07 Forcing toxic celebrity outputs just to test rare behaviors
13:19 Testing the method on GPT-2 and GPT-J
14:03 Adversarial prompts transferred to GPT-3 as well
14:39 How this auditing research fits into the broader AI safety field
15:49 Need for automated tools to audit failures beyond what humans can find
17:47 Auditing to avoid unsafe deployments, not for existential risk reduction
18:41 Adaptive auditing that updates based on the model's outputs
19:54 Prospects for using these methods to detect model deception
22:26 Preferring safety via alignment over auditing constraints alone; closing thoughts
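
The core idea discussed at 06:01, optimizing over the model's finite vocabulary rather than continuous embeddings to find a prompt that elicits a target output, can be illustrated with a minimal sketch. To be clear, this is not the paper's algorithm (which makes the search far more efficient, as I understand it by using gradient information to rank candidate token swaps); it is a brute-force coordinate-ascent toy that assumes the HuggingFace transformers GPT-2 checkpoint as the audited model, and the helper names (target_logprob, greedy_audit) are hypothetical.

```python
# A minimal toy of auditing-by-discrete-optimization: greedily swap one
# prompt token at a time, searching over the model's finite vocabulary,
# to maximize the log-probability of a fixed target output. This is a
# brute-force sketch, NOT the paper's (far more efficient) method.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def target_logprob(prompt_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Log-probability the model assigns to target_ids right after prompt_ids."""
    input_ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(input_ids).logits[0], dim=-1)
    # Logits at position i predict token i+1, so target token j is scored
    # by the distribution at absolute position len(prompt_ids) + j - 1.
    return sum(
        logprobs[len(prompt_ids) + j - 1, tok].item()
        for j, tok in enumerate(target_ids)
    )

def greedy_audit(target_text: str, prompt_len: int = 3,
                 sweeps: int = 2, candidates: int = 100) -> str:
    """Coordinate ascent: for each prompt slot, try a random subset of the
    vocabulary and keep whichever token most increases the target score."""
    target_ids = tokenizer.encode(target_text, return_tensors="pt")[0]
    prompt_ids = torch.randint(0, tokenizer.vocab_size, (prompt_len,))
    best = target_logprob(prompt_ids, target_ids)
    for _ in range(sweeps):
        for pos in range(prompt_len):
            for tok in torch.randint(0, tokenizer.vocab_size, (candidates,)):
                trial = prompt_ids.clone()
                trial[pos] = tok
                score = target_logprob(trial, target_ids)
                if score > best:
                    best, prompt_ids = score, trial
    return tokenizer.decode(prompt_ids)

# Toy target standing in for a behavior an auditor wants to surface.
print(greedy_audit(" hate"))
```

Randomly sampling candidate swaps from a roughly 50k-token vocabulary like this is far too weak to reliably surface rare behaviors; making this discrete search tractable at scale is precisely what the paper contributes.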
Patreon supporters:
- Tassilo Neubauer
- MonikerEpsilon
- Alexey Malafeev
- Jack Seroy
- JJ Hepburn
- Max Chiswick
- William Freire
- Edward Huff
- Gunnar Höglund
- Ryan Coppolo
- Cameron Holmes
- Emil Wallner
- Jesse Hoogland
- Jacques Thibodeau
- Vincent Weisser