As all scribblers of doggerel know, rhymes must be paired up before you start a new line. Otherwise you may write yourself into a dead end with an ill-placed “purple” or “orange”. That insight, new research shows, is shared by artificial intelligence (AI). When Claude, a large language model (LLM), is asked to write a rhyming couplet, it begins thinking about the second half of the rhyme as soon as the first word is written. Give it the first line “he saw a carrot and had to grab it”, and the AI begins contemplating rabbits at once, writing the next line so that it ends on the appropriate rhyme.

Such forethought is unexpected, says researcher Josh Batson. Systems like Claude write text one “token”, a word or fragment of a word, at a time, and he expected the approach to be bluntly linear: start writing the next line, and consider possible rhymes only on reaching its end. But when Dr Batson and his team at Anthropic, the AI lab that developed Claude, built a tool that allowed them to peer inside the digital brains of their LLMs, they discovered some unexpected complexity.
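To see why that expectation was reasonable, here is a minimal sketch, in Python, of the token-at-a-time loop. The toy “model” is just a lookup table standing in for a real LLM, and the carrot-and-rabbit completion is illustrative rather than Claude’s actual output; the point is that the decoding loop commits to exactly one token per step, so any rhyme-planning must live inside the model’s hidden state rather than in the loop itself.

```python
# A toy stand-in for an LLM: map a known context (tuple of tokens) to the next token.
# Everything here is illustrative, not Anthropic's code or Claude's real output.
TOY_MODEL = {
    ("he", "saw", "a", "carrot", "and", "had", "to", "grab", "it,"): "his",
    ("grab", "it,", "his"): "hunger",
    ("his", "hunger"): "was",
    ("hunger", "was"): "like",
    ("was", "like"): "a",
    ("like", "a"): "starving",
    ("a", "starving"): "rabbit",
}

def next_token(tokens):
    """Return the next token given everything generated so far."""
    for n in range(len(tokens), 0, -1):       # match the longest known context
        suffix = tuple(tokens[-n:])
        if suffix in TOY_MODEL:
            return TOY_MODEL[suffix]
    return None                               # no known continuation: stop

def generate(prompt):
    tokens = prompt.split()
    while (tok := next_token(tokens)) is not None:
        tokens.append(tok)                    # exactly one token is committed per step
    return " ".join(tokens)

print(generate("he saw a carrot and had to grab it,"))
# -> he saw a carrot and had to grab it, his hunger was like a starving rabbit
```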

Their tool, which the researchers call a digital “microscope”, lets them look at which parts of a neural network are activated as it “thinks”. By tracking when different features of the model are activated, it is possible to build an understanding of what the models do: if a particular area of the LLM lights up whenever it produces words like bunny or rabbit, for instance, then that gets marked as being related to rabbits.
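The sketch below is a cartoon of that idea, not Anthropic’s tooling, which builds on learned sparse features and attribution graphs. It records a toy network’s hidden activations on rabbit-related and unrelated prompts and flags any hidden unit that fires mainly on the former; the network, prompts and threshold are all made up for illustration.

```python
# A cartoon of the "microscope": label a hidden unit as rabbit-related if it
# is much more active on rabbit-ish prompts than on unrelated ones.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                    # one hidden layer with 16 "features"

def embed(text):
    """Toy embedding: hash words into a fixed-size bag-of-words vector."""
    v = np.zeros(8)
    for w in text.lower().split():
        v[sum(ord(c) for c in w) % 8] += 1.0
    return v

def hidden_activations(text):
    return np.maximum(W.T @ embed(text), 0.0)   # ReLU hidden layer

rabbit_prompts = ["the rabbit ate", "a fluffy bunny", "rabbit in a hat"]
other_prompts  = ["the stock market fell", "it rained in brussels", "quantum chips"]

rabbit_mean = np.mean([hidden_activations(p) for p in rabbit_prompts], axis=0)
other_mean  = np.mean([hidden_activations(p) for p in other_prompts], axis=0)

# Units far more active on rabbit prompts get marked as candidate "rabbit" features.
rabbit_units = np.where(rabbit_mean > other_mean + 1.0)[0]
print("candidate 'rabbit' features:", rabbit_units)
```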

This has let the team answer some open questions in AI research. For example: when a chatbot is multilingual, does it in effect hold a separate copy of everything it knows for each language, or does it have some awareness of concepts that transcend language? The answer appears to be the latter. Ask it in English for the opposite of “big”, in French for the opposite of “grand” or in Chinese for the opposite of the Hanzi character for the same concept, and the same feature lights up in every case, before more language-specific circuits kick in to “translate” the concept of smallness into a particular word.
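A crude way to illustrate that kind of test uses an openly available multilingual model (xlm-roberta-base) rather than Claude, whose internals are not public: if representations partly transcend language, the same question asked in different languages should land closer together inside the model than an unrelated sentence does. This is a sentence-level proxy for the idea, not Anthropic’s feature-level analysis; the model choice, pooling and prompts are all assumptions.

```python
# Compare internal representations of the same question across languages
# against an unrelated control sentence, using mean-pooled hidden states.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def represent(text):
    """Mean-pool the final hidden layer into a single vector per sentence."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.last_hidden_state.mean(dim=1).squeeze(0)

prompts = {
    "en": "What is the opposite of big?",
    "fr": "Quel est le contraire de grand ?",   # "What is the opposite of big?"
    "zh": "大的反义词是什么？",                    # "What is the antonym of big?"
    "unrelated": "The train to Brussels leaves at noon.",
}
vecs = {k: represent(v) for k, v in prompts.items()}

cos = torch.nn.functional.cosine_similarity
for key in ("fr", "zh", "unrelated"):
    print(key, float(cos(vecs["en"], vecs[key], dim=0)))
# Expect en-fr and en-zh to score higher than en-unrelated if the
# representation is (partly) language-agnostic.
```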

This suggests that LLMs may be more capable than they are given credit for. The rise of “reasoning” models, which print the chain of thought they took to arrive at a conclusion, means that conventional LLMs are often described as acting on instinct. The microscope, though, shows behaviours that look like planning and reasoning even in those simpler models—and little that looks like simple pattern matching.

Other insights, though, are less encouraging. When Claude itself is asked to reason, printing out the chain of thought it takes to answer maths questions, the microscope suggests that the explanation the model gives for how it reached a conclusion does not always match what it actually did. Ask the LLM a complex maths question that it does not know how to solve and it will “bullshit” its way to an answer: rather than actually trying, it decides to spit out random numbers and move on.

Worse still, ask a leading question—suggesting, for instance, that the answer “might be 4”—and the model still secretly bullshits as part of its answer, but rather than randomly picking numbers, it will specifically insert numbers that ultimately lead it to agree with the question, even if the suggestion is wrong.

But, notes Dr Batson, being able to peer into the mind of an LLM and see when it decides to bullshit provides clues as to how to stop it doing so in future. The goal, after all, is not to have to do brain surgery—digital or otherwise—at all. If you can trust that the model is telling the truth about its thought process, he points out, then knowing what it is thinking should be as simple as reading the transcript. ■
