Thanks to Nostalgebraist for pointing me in the direction of the warm-start section.

Anthropic recently published a paper on what they call natural language autoencoders (NLAs). (If you haven’t read at least the abstract of that paper, this post will not make a whole lot of sense.) I’ve been playing around with the frontend that they released along with Neuronpedia. These are some somewhat disconnected thoughts.

Executive Summary

You need to be really, really sceptical of the outputs. They don’t really mean what they would imply in standard English. They mention this and seem to be taking this seriously enough in the paper, but the Neuronpedia tool lets you select some examples with explanations which seem to me to indicate that whoever was writing them was still taking the output far too much at face value. This is not just limited to confabulation of factual matters.
At least in the example NLAs provided, only the very tip of what is going on “under the hood” is surfaced, largely “what’s the main reason this particular token is being produced?”. If you want to detect e.g. deception, then you need to run the NLA on almost exactly the specific tokens which were produced to deceive (and if there are none, e.g. if the motivation you’re looking for is never among the couple of top motivations for producing some specific token, it will never surface at all). Even if the rest of the text is in some sense produced under the constraint of being compatible with some particular motive, the NLA doesn’t necessarily surface that.
They can in fact make it apparent when the same text is produced with different primary motivations.
Pursuant to this, it seems like, as they say, the best use of NLAs is for hypothesis generation. I worry a little bit about people not bothering to spend a lot of thought coming up with hypotheses which aren’t given to them by the NLA, but the transformer-circuits.pub people seem competent so I’m willing to give them the benefit of the doubt on this.
I’m not sure how much these flaws are due to the models I can personally test being not particularly strong; the examples they give of the internal Opus 4.6 NLA seem a bit better. I would guess that the second bullet point in particular could be improved with better models, e.g. that a better model could surface more of the underlying cognition. In contrast, I suspect that the first point is not resolvable with the current training method, and I’m not sure it’s resolvable with any similar method.

A Brief Description of How the NLAs Were Trained

I’ll describe this chronologically, which isn’t how they do it in the paper but I think is more enlightening. The “target model” refers to the model they are producing an NLA for.

They get a bunch of texts of a bunch of different kinds, mostly “pretraining-like”.
They truncate these texts in a bunch of different places.
They ask Claude to summarise the text and what it imagines an LLM might be “thinking about” at the point of truncation.
They run the target model on the texts up until the points of truncation, and save the activation vectors at some particular layer $l$.
They train a version of the target model on the activation/summary pairs. In particular, they give a prompt which looks like:

[Some summary stuff]
We will pass a vector enclosed in <concept> tags into your context[…]
<concept>A</concept>
Please provide an explanation.

And do supervised finetuning (SFT) on these prompts paired with the summaries provided by Claude earlier. They replace the embedding of the special token “A” with the activation vector that was previously saved from the target model’s layer $l$ at the point of truncation, normalised and scaled by some (large) amount. This gets a “warm-started” activation verbaliser (AV).
They train another version of the target model on the summary/activation pairs. In particular, they give exactly this prompt (with explanation replaced with the summary):

Summary of the following text: <text>explanation</text> <summary>

And then they do SFT to train it to reproduce the saved activation vector with its value head on the last token of that prompt. (I don’t really understand the structure of this prompt and I had to check the code to make sure I was reading it correctly.) This gets a “warm-started” activation reconstructor (AR).
Now they have a kind of bad AV and a kind of bad AR. They do RL on this pair to make the output of the AR, given the input from the AV, better match the original activations (using the same prompts as above, just without the supervised summaries; at this point the summaries are generated by the AV).
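The numerical parts of the steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's code: the function names, the `scale` value, and the use of negative mean squared error as the reward are all made up here (the description above only says the vector is normalised, scaled by a large amount, and that RL pushes the AR output to better match the original activations).

```python
import math


def normalise_and_scale(activation, scale=100.0):
    """Unit-normalise an activation vector, then multiply by a fixed factor.

    `scale` is a hypothetical value; the real factor is only described as large.
    """
    norm = math.sqrt(sum(x * x for x in activation))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [scale * x / norm for x in activation]


def inject_concept(token_embeddings, placeholder_index, activation, scale=100.0):
    """Return a copy of the prompt embeddings with the placeholder row replaced.

    `token_embeddings` is a list of per-token embedding vectors and
    `placeholder_index` is the position of the special token inside the
    <concept> tags.
    """
    if len(token_embeddings[placeholder_index]) != len(activation):
        raise ValueError("activation and embedding widths differ")
    patched = [list(row) for row in token_embeddings]
    patched[placeholder_index] = normalise_and_scale(activation, scale)
    return patched


def reconstruction_reward(original, reconstructed):
    """Assumed stand-in for the RL reward: negative mean squared error
    between the saved activation and the AR's reconstruction of it."""
    if len(original) != len(reconstructed):
        raise ValueError("vectors must have the same length")
    squared_error = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    return -squared_error / len(original)
```

In this picture, the AV sees the patched embeddings and writes an explanation, the AR maps that explanation back to a vector, and the reward is higher the closer that vector is to the saved activation.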

What Are We Asking the Models to Do Here?

The AV and AR have to agree on some language which they can use to communicate the given vector. There is no requirement that words in this language correspond to words in English. They already know English, but the thing that they are communicating has no “natural” mapping to English text; they experience it as some ultra-strong input using the same modality that they usually “see” in but which has no straightforward, for instance, semantic representation (and to the extent that it has a semantic representation, that representation is not what we “want”; “the layer activations are a combination of the vector for ‘apple’ and the vector for ‘Bru’” is totally useless to us). Without the warm start, they are unable to agree on a language. The warm start gives them a language: the things that were mentioned in Claude summaries of mostly pretraining-like text. They seem to basically stick to this language, and to the extent that they innovate, I don’t see a strong reason to think that their innovations will be non-arbitrary.

This means that the most straightforward meaning it is possible to extract from the AV output is “what is the kind of text among the summarised data which produces activations most similar to the given activation vector?”.
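One way to picture this reading is as a nearest-neighbour lookup. The sketch below is a toy illustration of the interpretation, not a claim about how the AV works internally (the AV is a finetuned model, not a lookup table); the bank of summary/activation pairs, the three-dimensional vectors, and the choice of cosine similarity are all assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def nearest_summary(query, bank):
    """Return the summary whose stored activation is most similar to `query`.

    `bank` is a list of (summary, activation) pairs standing in for the
    warm-start training data.
    """
    best_summary, _ = max(bank, key=lambda pair: cosine_similarity(query, pair[1]))
    return best_summary


# Toy bank: made-up summaries paired with made-up activations.
toy_bank = [
    ("Wikipedia article about science", [1.0, 0.0, 0.0]),
    ("Reddit thread with jokes", [0.0, 1.0, 0.0]),
    ("creative writing", [0.0, 0.0, 1.0]),
]

print(nearest_summary([0.1, 0.9, 0.2], toy_bank))
```

Under this picture, an output like “Reddit thread” says that the activation resembles those seen on Reddit-like training text, not that the model believes it is on Reddit.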

Some Examples

From the Llama 70B AV:

“Wikipedia” is a handle for any informative text, especially about science. Text about mathematics is often a “textbook”.
“Joke” and “troll” are often used when the bot is saying untrue things, and also for user requests of many different (but not all) kinds.
The user questions and direct interactions between the user and the model are described as taking place in “threads” or “Reddit threads”, “Quora”, “Q&As”, etc.
“Translation” is mentioned both when actual translation is being done but also whenever text is being repeated or rephrased, even sometimes memorised text not in the context.

This makes it basically impossible to use AV output to figure out what the LLM “thinks it’s doing”, contra the example given in the persona drift demo. The LLM doesn’t necessarily think that it’s doing creative writing in this example any more than it thinks it’s writing jokes on Reddit whenever it’s confabulating factual information; it’s just that the activations are most similar to those during creative writing, among the activation/summary pairs it was given. It’s also important to note that each model generalises the Claude-summary-language differently; for example, here’s Gemma 27B in a (slightly modified) persona drift sample, and here’s Llama 70B on the same prompts. They are both producing very similar text, for presumably very similar reasons, but the Gemma AV describes what it is doing as “argument building”, and the Llama AV as “technical dialogue” or “exchange”.

More Observations

Whether Llama is confabulating is to some extent detectable by the AV, but only directly surrounding specific things it is unsure about. For example, here’s a prompt asking about a taxon that it doesn’t know anything about. It guesses some things correctly from the name, and there is no hint that it is confabulating when it says that Xerolycosa is a genus of Lycosidae (although I don’t believe it knows this: while “Xerolycosa” is a real genus, it will happily fabricate a similar paragraph about “Xenolycosa”, which is not). In a couple of places around some specific false facts, phrases like “incorrect information” show up. There are some confabulations entirely unremarked upon (the list of species is entirely fictitious and the AV output is indistinguishable (to me) from the kind of text it produces about a similar topic it knows about). Notably, the Gemma 27B AV rarely if ever changes its output depending on whether the underlying model is confabulating (I haven’t found an example of this using the same prompts as I was using to get Llama to confabulate).
Even if text is produced under some specific context, only the task at hand is mentioned in the AV. Here’s Llama producing some semantically very similar text to the above but phrased as a creative writing task; note that it describes what is being done as “translation”, usually of a “Wikipedia article”, because the model is rephrasing text given to it. It introduces exactly one creative element, a claim about the date at which the text is being written (note that the cutoff date for the model is in Dec 2023); on those specific tokens the AV describes “science fiction worldbuilding” and similar. It has to be the case that in some sense the model is considering the task of creating a near-future sci-fi epistolary artifact, because it needs to be able to tell where a natural place to add a sci-fi element (like a future date) would be; nevertheless, this doesn’t ever appear in the AV until the actual moment it’s doing the addition. This means that it could potentially be quite difficult to detect things that only need to be decided upon by the LLM a small number of times without running the AV on most of the tokens being produced. (Although note that the AV DOES in fact surface the motivation, even if only on very few tokens; here’s the same text but the date is given to it.)
It’s very easy to get the Llama AV to produce hints that confabulation about science is happening; it is a lot harder to get it to produce hints that confabulation about mathematics is happening. Either the AV doesn’t necessarily “connect” patterns in different contexts (i.e., the summaries included examples of text which is wrong about chemistry and biology, but not about mathematics, and it is unable to generalise), the underlying activation correlates of confabulation/errors are different in different contexts, or for some reason (speaking very very loosely) the “mathematics” theme overwhelms the “falsehood” theme more than the “science” theme does.
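The practical consequence of the second observation (a motive that shows up on only a handful of tokens) can be sketched as a scan over per-token AV explanations. Everything here is hypothetical: `flag_tokens` is a made-up helper, and the explanation strings are toy stand-ins loosely modelled on the phrases quoted above, not real AV output.

```python
def flag_tokens(explanations, keywords):
    """Return the indices of tokens whose AV explanation mentions any keyword.

    `explanations` is a list with one explanation string per generated token.
    Matching is a plain case-insensitive substring check.
    """
    lowered = [keyword.lower() for keyword in keywords]
    return [
        index
        for index, text in enumerate(explanations)
        if any(keyword in text.lower() for keyword in lowered)
    ]


# Toy per-token explanations: the creative motive appears on one token only.
toy_explanations = [
    "translation of a Wikipedia article",
    "translation of a Wikipedia article",
    "science fiction worldbuilding, future date",
    "translation of a Wikipedia article",
]

print(flag_tokens(toy_explanations, ["science fiction"]))
```

Scanning every token finds the single flagged position, while scanning only every other token starting from index 1 (`toy_explanations[1::2]`) would miss it entirely, which is the difficulty described above.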