
One of the stranger and more unsettling things about today's most advanced artificial intelligence (AI) systems is that nobody, not even the people who build them, truly understands how they work.
That is because, unlike traditional computer programs, large language models, the kind of AI system that underpins ChatGPT and other well-known chatbots, are not meticulously crafted by human programmers.
Instead, these systems essentially teach themselves: they ingest vast amounts of text, identify patterns and relationships in language, and use that knowledge to predict the next word in a sequence.
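To make the "predict the next word" idea concrete, here is a minimal sketch, not Anthropic's method, that queries the small open-source GPT-2 model through the Hugging Face transformers library and prints its most likely next words for a prompt. The model choice and prompt are illustrative assumptions, not details from the article.

```python
# Minimal next-word-prediction sketch using GPT-2 (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Which American city has the best"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # The model outputs a score for every token in its vocabulary
    # at every position in the prompt.
    logits = model(**inputs).logits

# Convert the scores at the final position into probabilities and
# show the model's top five guesses for the next word.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id):>12}  {p.item():.3f}")
```

A chatbot is built by running a far larger version of this prediction loop repeatedly, appending each predicted word to the prompt; nowhere in that process does a programmer write explicit rules for what the model should say.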
One drawback of building AI systems this way is that it is difficult to reverse-engineer them or to pinpoint and fix specific flaws in their code. If a user types "Which American city has the best food?" and the chatbot answers "Tokyo," there is currently no clear way to determine why the model made that mistake, or why the next person who asks the same question might get a different answer.
Nor is there a clear explanation when large language models misbehave or act in unexpected ways. This inscrutability is one of the main reasons some researchers worry that powerful AI systems could one day endanger humanity.
After all, if we have no idea what is going on inside these models, how will we know whether they can be used to develop new bioweapons, spread political propaganda, or write malicious code for cyberattacks? And if powerful AI systems start misbehaving or disobeying us, how can we stop them when we cannot identify the root cause of their behavior?
This week, however, a team of researchers at the AI startup Anthropic reported what they consider a significant breakthrough, one they believe could help us understand more about how AI language models actually work and perhaps prevent them from becoming dangerous. The team summarized its findings in a blog post titled "Mapping the Mind of a Large Language Model."