Thanks to Nostalgebraist for pointing me in the direction of the warm-start section.

Anthropic recently published a paper on what they call natural language autoencoders (NLAs). (If you haven’t read at least the abstract of that paper, this post will not make a whole lot of sense.) I’ve been playing around with the frontend that they released along with Neuronpedia. These are some somewhat disconnected thoughts.

Executive Summary

A Brief Description of How the NLAs Were Trained

I’ll describe this chronologically, which isn’t how they do it in the paper but I think is more enlightening. The “target model” refers to the model they are producing an NLA for.

  1. They get a bunch of texts of a bunch of different kinds, mostly “pretraining-like”.

  2. They truncate these texts in a bunch of different places.

  3. They ask Claude to summarise the text and what it imagines an LLM might be “thinking about” at the point of truncation.

  4. They run the target model on the texts up until the points of truncation, and save the activation vectors at some particular layer l.

  5. They train a version of the target model on the activation/summary pairs. In particular, they give a prompt which looks like:

    [Some summary stuff]
    We will pass a vector enclosed in <concept> tags into your context[…]
    <concept>A</concept>
    Please provide an explanation.

    They then do supervised finetuning (SFT) on these prompts paired with the summaries provided by Claude earlier. They replace the embedding of the special token “A” with the activation vector that was previously saved from the target model’s layer l at the point of truncation, normalised and scaled by some (large) amount. This gets a “warm-started” activation verbaliser (AV).

  6. They train another version of the target model on the summary/activation pairs. In particular, they give exactly this prompt (with explanation replaced with the summary):

    Summary of the following text: <text>explanation</text> <summary>

    And then they do SFT to train it to reproduce the saved activation vector with its value head on the last token of that prompt. (I don’t really understand the structure of this prompt, and I had to check the code to make sure I was reading it correctly.) This gets a “warm-started” activation reconstructor (AR).

  7. Now they have a kind of bad AV and a kind of bad AR. They do RL on this pair to make the output of the AR, given the input from the AV, better match the original activations (using the same prompts as above, just without the supervised summaries; at this point the summaries are generated by the AV).
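As a rough illustration of steps 5 and 7, here is a minimal sketch in plain Python of the two pieces of arithmetic involved: swapping the placeholder token's embedding for the normalised, scaled activation, and scoring a reconstruction against the original. The function names, the toy vectors, the scale value and the use of cosine similarity as the score are all assumptions made for illustration; the paper's code may well normalise and score differently.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def normalise_and_scale(activation, scale):
    """Rescale an activation so that its L2 norm equals `scale`."""
    norm = l2_norm(activation)
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [scale * x / norm for x in activation]


def inject_activation(token_embeddings, placeholder_index, activation, scale):
    """Copy the prompt's embeddings, replacing the placeholder token's row.

    `token_embeddings` is one embedding per prompt token; `placeholder_index`
    is the position of the special token between the <concept> tags.
    """
    injected = [list(row) for row in token_embeddings]
    injected[placeholder_index] = normalise_and_scale(activation, scale)
    return injected


def cosine_similarity(a, b):
    """Hypothetical stand-in for the reconstruction score (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


# Toy usage with made-up two-dimensional "embeddings".
prompt_embeddings = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]
saved_activation = [3.0, 4.0]
av_input = inject_activation(prompt_embeddings, 1, saved_activation, scale=10.0)
print(av_input[1])

# In step 7 the AR's output would be scored against the saved activation.
ar_output = [2.5, 4.5]  # made-up reconstruction
print(cosine_similarity(saved_activation, ar_output))
```

In a real setup the injected sequence would be passed to the model as input embeddings rather than token ids, and the score would feed an RL objective; neither of those is shown here.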

What Are We Asking the Models to Do Here?

The AV and AR have to agree on some language which they can use to communicate the given vector. There is no requirement that words in this language correspond to words in English. They already know English, but the thing that they are communicating has no “natural” mapping to English text. They experience it as an ultra-strong input in the same modality that they usually “see” in, but one which has no straightforward representation (a semantic one, for instance). To the extent that it does have a semantic representation, that representation is not what we “want”: “the layer activations are a combination of the vector for ‘apple’ and the vector for ‘Bru’” is totally useless to us. Without the warm start, they are unable to agree on a language. The warm start gives them a language: the things that were mentioned in Claude’s summaries of mostly pretraining-like text. They seem to basically stick to this language, and to the extent that they innovate, I don’t see a strong reason to think that their innovations will be non-arbitrary.

This means that the most straightforward meaning it is possible to extract from the AV output is “what is the kind of text among the summarised data which produces activations most similar to the given activation vector?”.
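To make that reading concrete, here is a deliberately crude caricature in plain Python: a nearest-neighbour lookup over warm-start activation/summary pairs. This is not what the AV literally computes (it is a finetuned language model, not a lookup table), and the vectors, labels and choice of cosine similarity are made up for illustration. It only shows the shape of the claim: the output tells you which labelled example the activation most resembles, not what the target model "thinks".

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_summary(query, labelled_activations):
    """Return the summary whose paired activation is most similar to `query`.

    `labelled_activations` is a list of (activation, summary) pairs, standing
    in for the warm-start training data.
    """
    best_summary, _ = max(
        (
            (summary, cosine_similarity(query, activation))
            for activation, summary in labelled_activations
        ),
        key=lambda pair: pair[1],
    )
    return best_summary


# Made-up three-dimensional "activations" with made-up summary labels.
warm_start_pairs = [
    ([1.0, 0.0, 0.0], "creative writing"),
    ([0.0, 1.0, 0.0], "technical dialogue"),
    ([0.0, 0.0, 1.0], "jokes on Reddit"),
]

# A new activation gets whichever label it happens to sit closest to.
print(nearest_summary([0.9, 0.1, 0.2], warm_start_pairs))
```

Note that the lookup always returns some label, even for an activation that resembles none of the training examples very well.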

Some Examples

From the Llama 70B AV:

This makes it basically impossible to use AV output to figure out what the LLM “thinks it’s doing”, contra the example given in the persona drift demo. The LLM doesn’t necessarily think that it’s doing creative writing in this example any more than it thinks it’s writing jokes on Reddit whenever it’s confabulating factual information; it’s just that the activations are most similar to those during creative writing, among the activation/summary pairs it was given. It’s also important to note that each model generalises the Claude-summary-language differently; for example, here’s Gemma 27B in a (slightly modified) persona drift sample, and here’s Llama 70B on the same prompts. They are both producing very similar text, for presumably very similar reasons, but the Gemma AV describes what it is doing as “argument building”, and the Llama AV as “technical dialogue” or “exchange”.

More Observations