
Content provided by The Nonlinear Fund. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described here: https://da.player.fm/legal.

AF - Visualizing neural network planning by Nevan Wichers

8:33
 
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Visualizing neural network planning, published by Nevan Wichers on May 9, 2024 on The AI Alignment Forum.

TLDR
We develop a technique to try to detect whether a NN is doing planning internally. We apply the decoder to the intermediate representations of the network to see if it's representing the states it's planning through internally. We successfully reveal intermediate states in a simple Game of Life model, but find no evidence of planning in an AlphaZero chess model. We think the idea won't work in its current state for real-world NNs, because they use higher-level, abstract representations for planning that our current technique cannot decode. Please comment if you have ideas that may work for detecting more abstract ways the NN could be planning.

Idea and motivation
To make safe ML, it's important to know whether the network is performing mesa optimization, and if so, what optimization process it's using. In this post, I'll focus on a particular form of mesa optimization: internal planning. This involves the model searching through possible future states and selecting the ones that best satisfy an internal goal. If the network is doing internal planning, then it's important that the goal it's planning for is aligned with human values. An interpretability technique that could identify which states it's searching through would be very useful for safety.

If the NN is doing planning, it might represent the states it's considering in that plan. For example, when predicting the next move in chess, it may represent possible moves it's considering in its hidden representations. We assume that the NN is given a representation of the environment as input and that the first layer of the NN encodes that information into a hidden representation. The network then has hidden layers and finally a decoder that computes the final output. The encoder and decoder are trained as an autoencoder, so the decoder can reconstruct the environment state from the encoder output. Language models are an example of this, where the encoder is the embedding lookup.

Our hypothesis is that the NN may use the same representation format for states it's considering in its plan as it does for the encoder's output. Our idea is to apply the decoder to the hidden representations at different layers to decode them. If our hypothesis is correct, this will recover the states it considers in its plan. This is similar to the Logit Lens for LLMs, but we're applying it here to investigate mesa-optimization.

A potential pitfall is that the NN may use a slightly different representation for the states it considers during planning than for the encoder output. In this case, the decoder won't be able to reconstruct the environment states it's considering very well. To overcome this, we train the decoder to output realistic-looking environment states given the hidden representations, by training it like the generator in a GAN. Note that the decoder isn't trained on ground-truth environment states, because we don't know which states the NN is considering in its plan.
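To make the probing procedure concrete, here is a minimal PyTorch-style sketch of applying the decoder to the hidden representation at every layer, in the spirit of the Logit Lens. This is not the post's actual code: the module names (encoder, hidden_layers, decoder, head), the feed-forward shape, and the dimensions are assumptions made for illustration.

```python
# Minimal sketch of decoder probing: apply the state decoder to the hidden
# representation at every layer and inspect the decoded environment states.
# Module names and shapes are illustrative assumptions, not the released code.
import torch
import torch.nn as nn

class PlannerNet(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int, n_layers: int, out_dim: int):
        super().__init__()
        self.encoder = nn.Linear(state_dim, hidden_dim)   # env state -> hidden rep
        self.hidden_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(n_layers)]
        )
        self.decoder = nn.Linear(hidden_dim, state_dim)   # hidden rep -> env state (autoencoder-trained)
        self.head = nn.Linear(hidden_dim, out_dim)        # final task output

    def forward(self, state: torch.Tensor):
        h = torch.relu(self.encoder(state))
        per_layer = []                                    # keep each layer's hidden rep
        for layer in self.hidden_layers:
            h = torch.relu(layer(h))
            per_layer.append(h)
        return self.head(h), per_layer

def decode_hidden_states(model: PlannerNet, state: torch.Tensor):
    """Apply the decoder to every intermediate layer's representation.

    If the network reuses the encoder's representation format while planning,
    the decoded tensors should resemble the environment states it considers.
    """
    with torch.no_grad():
        _, per_layer = model(state)
        return [model.decoder(h) for h in per_layer]

# Usage sketch:
# model = PlannerNet(state_dim=64, hidden_dim=128, n_layers=4, out_dim=10)
# decoded = decode_hidden_states(model, torch.randn(1, 64))
# for i, d in enumerate(decoded):
#     print(f"layer {i}: decoded state shape {tuple(d.shape)}")
```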
Game of Life proof of concept (code)
We consider an NN trained to predict the number of living cells after the Nth time step of the Game of Life (GoL). We chose the GoL because it has simple rules, and the NN will probably have to predict the intermediate states to get the final cell count. This NN won't do planning, but it may represent the intermediate states of the GoL in its hidden states. We use an LSTM architecture with an encoder to encode the initial GoL state, and a "count cells NN" to output the number of living cells from the final LSTM output. Note that training the NN to predict the number of alive cells at the final state makes this more difficult for our method than training the network to predict the final state, since it's less obvious that the network will predict t...
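For concreteness, below is a hedged sketch of how the Game of Life setup described in the excerpt could be wired up. This is not the authors' released code: the component names (encoder, lstm, count_cells, decoder), the board size, and the number of unrolled steps are illustrative assumptions. The probing step applies the decoder to each LSTM hidden state and checks whether the decoded boards resemble the true intermediate GoL states.

```python
# Illustrative sketch of the GoL setup: an encoder embeds the initial board,
# an LSTM cell is unrolled once per GoL step, a "count cells" head predicts
# the number of alive cells from the final step, and the decoder can be
# applied to every LSTM output to look for intermediate GoL states.
import torch
import torch.nn as nn

class GoLCounter(nn.Module):
    def __init__(self, board_cells: int = 64, hidden_dim: int = 256, n_steps: int = 5):
        super().__init__()
        self.n_steps = n_steps
        self.encoder = nn.Linear(board_cells, hidden_dim)   # initial board -> hidden rep
        self.lstm = nn.LSTMCell(hidden_dim, hidden_dim)     # unrolled once per GoL step
        self.decoder = nn.Linear(hidden_dim, board_cells)   # hidden rep -> board (autoencoder-trained)
        self.count_cells = nn.Linear(hidden_dim, 1)         # final hidden rep -> alive-cell count

    def forward(self, board: torch.Tensor):
        x = torch.relu(self.encoder(board))
        h = torch.zeros_like(x)
        c = torch.zeros_like(x)
        hidden_states = []
        for _ in range(self.n_steps):
            h, c = self.lstm(x, (h, c))
            hidden_states.append(h)
        count = self.count_cells(h)                         # predicted number of alive cells
        return count, hidden_states

# Probing sketch: decode each step's hidden state and compare against the
# true intermediate Game of Life boards.
# model = GoLCounter()
# count, states = model(torch.rand(1, 64))
# decoded_boards = [torch.sigmoid(model.decoder(h)) for h in states]
```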


