Reading Machine Minds: How Neuroscience Is Unlocking AI Transparency

Somewhere inside Claude, Anthropic's large language model, there is a cluster of artificial neurons that lights up whenever the Golden Gate Bridge enters the conversation. Not just when someone mentions the bridge by name, but when an image of it appears, when the topic of San Francisco landmarks arises, or when someone references the colour International Orange in a context that evokes the famous suspension span. Nearby, in the model's vast internal geography, sit other clusters responding to Alcatraz Island, the Golden State Warriors, and California Governor Gavin Newsom. The organisation of these concepts mirrors something strikingly familiar: the way a human brain might organise related knowledge about the San Francisco Bay Area in neighbouring neural populations.
This discovery, published by Anthropic's interpretability team in May 2024, was not merely a curiosity. It represented what researchers described as “the first ever detailed look inside a modern, production-grade large language model.” And it arrived at a moment when the stakes of understanding these systems could hardly be higher. Large language models now draft legal briefs, assist medical diagnoses, generate code for critical infrastructure, and advise on policy decisions. Yet for all their capability, their internal reasoning remains largely opaque, even to the engineers who built them.
The quest to crack open this opacity has produced a new scientific discipline that sits at the intersection of neuroscience, computer science, and philosophy of mind. Mechanistic interpretability, as the field is known, borrows tools and conceptual frameworks from decades of brain research to reverse-engineer the computational mechanisms hidden inside artificial neural networks. The ambition is extraordinary: to build what amounts to a microscope for AI, capable of revealing not just what these systems say, but how and why they arrive at their outputs.
The question is whether this microscope can be made powerful enough, fast enough, to keep pace with AI systems that are growing more capable by the month. And whether what it reveals can ever translate into the kind of safety guarantees that high-stakes deployment demands.
The Neuroscience Parallel That Launched a Field
The intellectual lineage of mechanistic interpretability traces directly to neuroscience. Chris Olah, co-founder of Anthropic and one of the pioneers of the field, has spent over a decade working to identify internal structures within neural networks, first at Google Brain, then at OpenAI, and now at Anthropic. TIME named him to its TIME100 AI list in 2024, recognising his foundational contributions to the discipline. In an interview with the 80,000 Hours podcast, Olah described his work as fundamentally about understanding what is going on inside neural networks, treating them not as inscrutable black boxes but as systems with discoverable internal structure.
The parallel between studying brains and studying neural networks is more than a convenient metaphor. Both systems consist of vast numbers of interconnected units whose individual behaviour is relatively simple but whose collective activity produces remarkably complex outputs. In neuroscience, researchers have long used techniques like functional magnetic resonance imaging, single-neuron recording, and optogenetics to identify which brain regions and circuits correspond to specific cognitive functions. The interpretability community is attempting something analogous with artificial systems, and the methodological borrowing is increasingly explicit.
A 2024 paper by Adam Davies and Ashkan Khakzar, titled “The Cognitive Revolution in Interpretability,” formalised this connection. The authors argued that mechanistic interpretability methods enable a paradigm shift similar to psychology's historical “cognitive revolution,” which moved the discipline beyond pure behaviourism toward understanding internal mental processes. They proposed a taxonomy organising interpretability into two categories: semantic interpretation, which asks what latent representations a model has learned, and algorithmic interpretation, which examines what operations the system performs over those representations. Davies and Khakzar contended that these two modes of investigation have “divergent goals and objects of study” but suggested they might eventually unify under a common framework, much as cognitive science itself integrated insights from linguistics, psychology, neuroscience, and computer science.
This framework echoes the influential levels of analysis proposed by neuroscientist David Marr in the 1980s, which distinguished between the computational goals of a system, the algorithms it employs, and the physical implementation of those algorithms. The suggestion is not that artificial neural networks are brains, but that the intellectual toolkit developed to study brains offers a surprisingly productive way to study their silicon counterparts.
The analogy has practical teeth. Just as neuroscientists discovered that individual brain regions specialise in particular functions, interpretability researchers have found that language models develop internal specialisations that bear a surface resemblance to the modular organisation of biological cognition. The Golden Gate Bridge feature is one example among millions, but the principle it illustrates is broadly applicable: these models do not store information as undifferentiated numerical soup. They develop structured, organised representations that can be individually identified and experimentally manipulated, much as a neuroscientist might stimulate a specific brain region and observe the resulting behavioural change.
A paper published in Nature Machine Intelligence by researchers Kohitij Kar, Martin Schrimpf, and Evelina Fedorenko at MIT made an important distinction, however. They noted that interpretability means different things to neuroscientists and AI researchers. In AI, interpretability typically focuses on understanding how model components contribute to outputs. In neuroscience, interpretability requires explicit alignment between model components and neuroscientific constructs such as brain areas, recurrence, or top-down feedback. Bridging these two conceptions remains an active challenge, and conflating them risks generating false confidence about how well we truly understand what these systems are doing.
Sparse Autoencoders and the Problem of Polysemanticity
The central technical obstacle in reading the minds of language models is a phenomenon called polysemanticity. Individual neurons in these networks typically respond to many unrelated concepts simultaneously. A single neuron might activate for references to legal contracts, the colour blue, and mentions of 1990s pop music. This makes individual neurons nearly useless as units of analysis, much as recording from a single neuron in the human brain rarely tells you what someone is thinking.
The problem has a name in the interpretability literature: superposition. Chris Olah wrote in a July 2024 update on Transformer Circuits that if you had asked him a year earlier what the key open problems for mechanistic interpretability were, “I would have told you the most important problem was superposition.” The term refers to the way neural networks pack more concepts into fewer neurons than ought to be possible, representing information in overlapping patterns that defy straightforward analysis.
Anthropic's breakthrough came from applying a technique called sparse dictionary learning, borrowed from classical machine learning, to decompose the tangled activity of polysemantic neurons into cleaner units called features. The tool for accomplishing this is the sparse autoencoder, a type of neural network trained to compress and reconstruct the internal activations of a language model while enforcing a sparsity constraint. The sparsity penalty ensures that for any given input, only a small fraction of features have nonzero activations. The result is an approximate decomposition of the model's internal states into a linear combination of feature directions, each ideally corresponding to a single interpretable concept.
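The mechanics of that decomposition can be sketched in a few lines. The following is a minimal numpy illustration, not Anthropic's actual implementation: the dimensions are toy-sized, the weights are random rather than trained, and the loss shown is the standard reconstruction-plus-L1 objective the technique is built on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions only: the SAE maps a model activation of
# size d_model into an overcomplete dictionary of n_features.
d_model, n_features = 64, 512

W_enc = rng.normal(0, 0.05, (n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0, 0.05, (d_model, n_features))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps feature activations non-negative; under training with
    # the L1 penalty below, most of them are driven to exactly zero,
    # so only a handful of features fire for any given input.
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(f):
    # The reconstruction is a linear combination of feature directions
    # (columns of W_dec), ideally one direction per interpretable concept.
    return W_dec @ f + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.sum((x - x_hat) ** 2)   # fidelity to the original activation
    sparsity = np.sum(np.abs(f))       # pressure toward few active features
    return recon + l1_coeff * sparsity

x = rng.normal(size=d_model)           # a stand-in for a model activation
f = encode(x)
```

Trained at scale, the encoder's sparse activations `f` become the "features" discussed throughout this article: directions in activation space that respond to individual concepts rather than the tangled mixtures individual neurons represent.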
In their May 2024 paper, “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet,” Anthropic's team demonstrated that this approach could work on a production-scale model. In October 2023, they had shown the technique could recover monosemantic features from a small one-layer transformer in their earlier paper “Towards Monosemanticity,” but a major concern was whether the method would scale to state-of-the-art systems. It did. The team extracted tens of millions of features from Claude 3 Sonnet's middle layer, identifying responses to concrete entities like cities, people, chemical elements, and programming syntax, as well as abstract concepts like code bugs, gender bias in discussions, and conversations about secrecy.
The features proved to be highly abstract: multilingual, multimodal, and capable of generalising between concrete and abstract references. A feature for the Golden Gate Bridge activated on text about the bridge, images of the bridge, and descriptions in multiple languages. Features neighbouring it in the model's internal space corresponded to related concepts, suggesting that Claude's internal organisation reflects something resembling human notions of conceptual similarity. Anthropic's researchers proposed that this conceptual neighbourhood structure might help explain what they described as Claude's “excellent ability to make analogies and metaphors.”
Perhaps most significant for safety, the researchers identified features linked to harmful behaviours, including scam emails, bias, code backdoors, and sycophancy. When they artificially amplified these features, the model's behaviour changed accordingly, demonstrating a causal relationship between internal representations and outputs. When they boosted the Golden Gate Bridge feature to extreme levels, Claude began dropping references to the bridge into nearly every response and even claimed to be the bridge itself. The team also explored various sparse autoencoder architectures, including TopK, Gated SAEs, and JumpReLU variants, developing quantified autointerpretability methods that measure the extent to which Claude can make accurate predictions about its own feature activations.
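The amplification experiments rest on a simple intervention: add a multiple of a feature's decoder direction back into the model's activation before the forward pass continues. The sketch below illustrates the idea with random toy weights; the feature index, coefficient, and dimensions are all invented for illustration, not taken from Anthropic's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_features = 64, 512

# Stand-in for a trained SAE's decoder: each column is one feature's
# direction in the model's activation space.
W_dec = rng.normal(0, 0.05, (d_model, n_features))

def steer(activation, feature_idx, coeff):
    # Nudge the activation along one feature's decoder direction.
    # A large positive coeff amplifies the concept (the "Golden Gate
    # Claude" effect); a negative coeff suppresses it.
    direction = W_dec[:, feature_idx]
    return activation + coeff * direction

x = rng.normal(size=d_model)            # a stand-in residual-stream activation
x_steered = steer(x, feature_idx=42, coeff=10.0)
```

Applied at every token position during generation, this kind of intervention is what turned a bridge-related feature into a model that claimed to be the bridge, and it is the same causal lever that lets researchers probe safety-relevant features for scams, bias, and sycophancy.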
Yet the researchers were candid about the limitations. The discovered features represent only a small subset of the concepts Claude has learned. Finding a complete set would require computational resources exceeding the cost of training the original model.
Tracing Thoughts Through Attribution Graphs
If sparse autoencoders provided the first lens for viewing individual features, Anthropic's 2025 work on circuit tracing provided the first tool for watching those features interact during reasoning. In two companion papers, “Circuit Tracing: Revealing Computational Graphs in Language Models” and “On the Biology of a Large Language Model,” the team introduced attribution graphs, a technique for tracing the internal flow of information between features during a single forward pass through the model.
The method works by constructing a “replacement model” that substitutes more interpretable components, called cross-layer transcoders, for the original multi-layer perceptrons. This allows researchers to produce graph descriptions of the model's computation on specific prompts, revealing intermediate concepts and reasoning steps that are invisible from outputs alone. Anthropic's CEO Dario Amodei noted that the company's understanding of the inner workings of AI lags far behind the progress being made in AI capabilities, framing interpretability research as a race to close that gap before the consequences of ignorance become catastrophic.
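The core bookkeeping behind an attribution graph can be illustrated with a deliberately simplified toy: assume a purely linear map between two layers of feature activations (the real method uses cross-layer transcoders and handles nonlinearity with care). For a linear map, the edge weight from upstream feature i to downstream feature j on a single forward pass is just the upstream activation times the relevant weight, and the edges into each downstream feature sum exactly to its activation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "replacement model": feature activations at layer 1 feed
# linearly into feature activations at layer 2. Dimensions and
# weights are illustrative only.
n1, n2 = 6, 4
W = rng.normal(size=(n2, n1))               # linear map between feature layers
f1 = np.maximum(0.0, rng.normal(size=n1))   # active upstream features

f2 = W @ f1                                  # downstream feature activations

# Attribution edge weights: the contribution of upstream feature i
# to downstream feature j on this specific forward pass.
edges = W * f1[None, :]

# The decomposition is exact: each row of `edges` sums to the
# downstream activation it explains.
assert np.allclose(edges.sum(axis=1), f2)
```

In the real technique these per-prompt edge weights, pruned to the strongest paths, form the graph that lets researchers watch intermediate concepts like “Texas” light up between a question and its answer.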
One demonstration involved asking Claude 3.5 Haiku, “What is the capital of the state where Dallas is located?” Intuitively, answering this question requires two steps: inferring that Dallas is in Texas, then recalling that the capital of Texas is Austin. The researchers found evidence that the model genuinely performs this two-step reasoning internally, with identifiable intermediate features representing the concept of Texas before the final answer of Austin emerges. Critically, they also found that this genuine multi-step reasoning coexists alongside “shortcut” reasoning pathways, suggesting that the model maintains multiple computational strategies for arriving at the same answer.
The research yielded several other striking findings. When tasked with composing rhyming poetry, the model was found to plan multiple words ahead to meet rhyme and meaning constraints, effectively reverse-engineering entire lines before writing the first word. When researchers examined cases of hallucination, they discovered the counter-intuitive result that Claude's default behaviour is to decline to speculate, and it only produces fabricated information when something actively inhibits this default reluctance. In examining jailbreak attempts, they found that the model recognised it had been asked for dangerous information well before it managed to redirect the conversation to safety.
The attribution graph approach also revealed a subtlety about faithful versus unfaithful reasoning. When asked to compute the square root of 0.64, Claude produced faithful chain-of-thought reasoning with features representing intermediate mathematical steps. But when asked to compute the cosine of a very large number, the model sometimes simply fabricated an answer, and the attribution graph made this difference in computational strategy visible.
Anthropic open-sourced the circuit-tracing tools in May 2025, and a collaborative effort involving researchers from Anthropic, Decode Research, EleutherAI, Goodfire AI, and Google DeepMind has since applied them to open-weight models including Gemma-2-2B, Llama-3.2-1B, and Qwen3-4B through the Neuronpedia platform.
OpenAI's Automated Neuron Explanations and Their Limits
While Anthropic pursued feature-level analysis through sparse autoencoders, OpenAI took a different but complementary approach. In May 2023, a team including Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders published research demonstrating that GPT-4 could be used to automatically write explanations for the behaviour of individual neurons in GPT-2 and to score those explanations for accuracy.
Their methodology consisted of three steps. First, text sequences were run through the model being evaluated to identify cases where a particular neuron activated frequently. Next, GPT-4 was shown these high-activation patterns and asked to generate a natural language explanation of what the neuron responds to. Finally, GPT-4 was asked to predict how the neuron would behave on new text sequences, and these predictions were compared against actual neuron behaviour to produce an accuracy score. The approach was notable for its ambition: rather than relying on human researchers to manually inspect neurons one at a time, it attempted to automate the entire interpretability pipeline.
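The final scoring step amounts to measuring how well the explainer's simulated activations track the neuron's real ones. A common way to operationalise that agreement, shown here as an illustrative sketch rather than OpenAI's exact scoring code, is a correlation between the two activation sequences, with 1.0 meaning the explanation perfectly predicts the neuron's behaviour.

```python
import numpy as np

def explanation_score(actual, simulated):
    # Pearson correlation between a neuron's real activations and the
    # activations GPT-4 simulates from its own natural-language
    # explanation. Higher means the explanation predicts the neuron
    # better; the 0.8 threshold mentioned below would sit on a scale
    # like this one.
    a = np.asarray(actual, dtype=float)
    s = np.asarray(simulated, dtype=float)
    a = a - a.mean()
    s = s - s.mean()
    denom = np.sqrt((a @ a) * (s @ s))
    if denom == 0.0:
        return 0.0
    return float(a @ s / denom)

# Hypothetical activations over five text tokens: the simulation
# tracks the real neuron closely, so the score is near 1.
actual = [0.0, 2.0, 0.0, 5.0, 1.0]
simulated = [0.1, 1.8, 0.0, 4.5, 0.9]
score = explanation_score(actual, simulated)
```

The appeal of the setup is that it is fully automatic: generating the explanation, simulating activations from it, and scoring the match can all run without a human in the loop, one neuron after another.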
The team found over 1,000 neurons with explanations scoring at least 0.8, meaning GPT-4's descriptions accounted for most of the neuron's top-activating behaviour. They identified neurons responding to phrases related to certainty and confidence, neurons for things done correctly, and many others. They released their datasets and visualisation tools for all 307,200 neurons in GPT-2, inviting the research community to develop better techniques. The researchers noted that the average explanation score improved as the explainer model's capabilities increased, suggesting that more powerful future models might produce substantially better explanations.
But the limitations were substantial. As researcher Jeff Wu acknowledged, “Most of the explanations score quite poorly or don't explain that much of the behaviour of the actual neuron.” Many neurons activated on multiple different things with no discernible pattern, and sometimes GPT-4 was unable to find patterns that did exist. The approach focused on short natural language explanations, but neurons may exhibit behaviour too complex to describe succinctly, particularly when they are highly polysemantic or represent concepts that humans lack words for.
The approach also carries a deeper conceptual challenge. Using one language model to explain another creates a circularity: the explanations are only as good as the explainer model's own understanding, which is itself opaque. If GPT-4 cannot correctly interpret certain patterns, those patterns remain hidden regardless of how sophisticated the automated pipeline becomes. The researchers acknowledged this limitation, noting that they would ultimately like to use models to “form, test, and iterate on fully general hypotheses just as an interpretability researcher would.”
OpenAI's broader alignment agenda initially positioned interpretability as central to its work on superalignment, the challenge of ensuring that AI systems much smarter than humans remain aligned with human values. However, in May 2024, the Superalignment team was effectively dissolved following the departures of its co-leads, Ilya Sutskever and Jan Leike. OpenAI has continued interpretability-adjacent research under other organisational structures, publishing work on sparse-autoencoder latent attribution for debugging misalignment in late 2025.
The Scalability Gap Between Understanding and Assurance
The practical limitations of current interpretability methods become starkly apparent when measured against the demands of high-stakes deployment. Understanding that a particular feature in Claude responds to the Golden Gate Bridge is fascinating. Understanding the full computational graph that leads Claude to recommend a specific medical treatment, draft a particular legal argument, or generate code for a safety-critical system is an entirely different proposition.
Leonard Bereska and Efstratios Gavves, in their comprehensive 2024 review “Mechanistic Interpretability for AI Safety,” surveyed the field's methods for causally dissecting model behaviours and assessed their relevance to safety. They emphasised that “understanding and interpreting these complex systems is not merely an academic endeavour; it's a societal imperative to ensure AI remains trustworthy and beneficial.” Yet they also catalogued formidable challenges in scalability, automation, and comprehensive interpretation. Their review further examined the dual-use risks of interpretability research itself, noting that the same tools that help safety researchers detect deceptive behaviours could potentially help malicious actors understand how to circumvent safety measures.
The scalability problem is twofold. First, modern language models contain billions or trillions of parameters, and the number of potential features and circuits grows combinatorially. Anthropic's work on Claude 3 Sonnet extracted tens of millions of features from a single layer, and a complete analysis would require resources exceeding the original training cost. Second, even when individual features or circuits are identified, composing them into a full account of the model's behaviour on any given input remains beyond current capabilities. The field can offer snapshots of computational processes, not comprehensive maps.
Anthropic has publicly stated its goal to “reliably detect most AI model problems by 2027” using interpretability tools. The company took a concrete step toward integrating interpretability into deployment decisions when it used mechanistic interpretability in the pre-deployment safety assessment of Claude Sonnet 4.5. Before releasing the model, researchers examined internal features for dangerous capabilities, deceptive tendencies, or undesired goals. This represented the first known integration of interpretability research into deployment decisions for a production system.
Yet the gap between detecting specific known problems and providing comprehensive safety assurances remains vast. Finding a feature associated with deception does not guarantee that all deceptive pathways have been identified. The absence of evidence for dangerous capabilities is not evidence of absence. And the speed at which new models are trained and deployed vastly outpaces the speed at which they can be thoroughly interpreted.
MIT Technology Review named mechanistic interpretability one of its 10 Breakthrough Technologies for 2026, recognising that “research techniques now provide the best glimpse yet of what happens inside the black box.” The phrasing is telling: a glimpse, not a complete picture.
NeuroAI and the Convergence of Biological and Artificial Understanding
The parallels between neuroscience and AI interpretability are not merely inspirational. A growing body of research suggests that genuine scientific convergence between the two fields could benefit both, and that the emerging discipline of NeuroAI represents a return to the cross-pollination that produced many of AI's foundational breakthroughs.
A 2024 editorial in Nature Machine Intelligence noted that while AI has shifted toward transformers and other complex architectures that seem to have moved away from neural-inspired roots, the field “may still look towards neuroscience for help in understanding complex information processing systems.” The editorial pointed to a coalition of initiatives around “NeuroAI,” a push to identify fresh ideas at the intersection of the two disciplines, including the annual COSYNE conference which has become a focal point for researchers working across both fields.
A paper in Nature Communications argued that the emerging field of NeuroAI “is based on the premise that a better understanding of neural computation will reveal fundamental ingredients of intelligence and catalyse the next revolution in AI.” The authors noted that historically, many key AI advances, including convolutional neural networks and reinforcement learning, were inspired by neuroscience, but that this cross-pollination had become far less common than in the past, representing what they called a missed opportunity.
A 2024 paper in Nature Reviews Neuroscience discussed how NeuroAI has the potential to transform large-scale neural modelling and data-driven neuroscience discovery, though the field must balance exploiting AI's power while maintaining interpretability and biological insight. The paper highlighted that unlike the human brain, which features a variety of morphologically and functionally distinct neurons, artificial neural networks typically rely on a homogeneous neuron model. Incorporating greater diversity of neuron models could address key challenges in AI, including efficiency, interpretability, and memory capacity.
The convergence runs in both directions. Sparse autoencoders, developed for AI interpretability, have found applications in protein language model research, where they uncover biologically interpretable features in protein representations. Representation engineering approaches that track latent neural trajectories when processing different input types draw directly on methods developed for studying neural population dynamics in biological brains.
The Whole Brain Architecture Initiative in Japan has proposed what it calls “brain-based interpretability,” arguing that if an advanced AI system's computational processes can be understood at a cognitive level in terms of corresponding human neural activity, unfavourable intentions or deceptions would be more readily detectable. The premise is that biological neural circuits, refined by millions of years of evolution, provide a reference architecture against which artificial computation can be measured and understood.
Yet researchers at MIT have cautioned that interpretability requires different things in the two domains. Understanding what a particular feature in an AI model represents is not the same as understanding why a biological neuron fires in a particular pattern. The former asks about function within an engineered system; the latter asks about mechanism within an evolved one. Collapsing this distinction risks importing assumptions from one domain that may not hold in the other.
Governance Frameworks and the Trust Translation Problem
The interpretability research emerging from Anthropic, OpenAI, Google DeepMind, and academic institutions arrives against a backdrop of rapidly evolving governance frameworks that increasingly demand transparency from AI systems. The question is whether the scientific progress being made in mechanistic interpretability can translate into the kind of transparency that regulators, deployers, and the public actually need.
The European Union's AI Act, which entered into force on 1 August 2024, provides the most comprehensive regulatory framework. Article 13 requires that high-risk AI systems “shall be designed and developed in such a way as to ensure that their operation is sufficiently transparent to enable deployers to interpret a system's output and use it appropriately.” Non-compliance carries penalties reaching 35 million euros or 7 per cent of global annual turnover. The Act's provisions on prohibited AI practices and AI literacy obligations became applicable from 2 February 2025, with general-purpose AI rules taking effect in August 2025 and the full framework becoming applicable by August 2026.
Yet scholars have identified what they call the “compliance gap” between the Act's transparency requirements and implementation reality. The regulation does not specify what level of interpretability is technically required, creating ambiguity about whether current mechanistic interpretability tools satisfy the legal standard. A feature-level understanding of a model's internal representations is not the same as a human-readable explanation of why the model made a specific decision in a specific case. The former is a scientific achievement; the latter is what a doctor, a judge, or a loan officer needs to justify relying on the system's output.
Proposals to bridge this gap take several forms. A framework from UC Berkeley for “Guaranteed Safe AI” suggests extracting interpretable policies from black-box algorithms via automated mechanistic interpretability and then directly proving safety guarantees about these policies. The approach would offload most of the verification work to AI systems themselves, potentially making the process scalable.
An ICLR 2026 workshop on “Principled Design for Trustworthy AI” has foregrounded topics including mechanistic interpretability and concept-based reasoning, inference-time safety and monitoring, reasoning trace auditing in large language models, and formal verification methods and safety guarantees. The workshop's framing reflects a growing consensus that interpretability must be integrated across the full AI lifecycle, from training and evaluation to inference-time behaviour and deployment.
Some researchers envision a future in which a simpler oversight model reads the internal state of a more complex model to ensure it is safe, a form of scalable oversight that depends on mechanistic interpretability being reliable enough to trust. Bowen Baker at OpenAI has described work on building what the company terms an “AI lie detector” that examines internal representations to determine whether a model's internal state corresponds to truth or contradicts it. “We got it for free,” Baker told reporters, explaining that the interpretability feature emerged unexpectedly from training a reasoning model.
Google DeepMind has contributed its own tools to the ecosystem, releasing Gemma Scope 2 in 2025 as the largest open-source interpretability toolkit, covering all Gemma 3 model sizes from 270 million to 27 billion parameters. The open-source release signals a recognition across the industry that interpretability research cannot remain proprietary if it is to serve as a foundation for trust.
The MATS programme (ML Alignment & Theory Scholars) and SPAR (Supervised Program for Alignment Research) have become training grounds for the next generation of interpretability researchers, with projects spanning AI control, scalable oversight, evaluations, red-teaming, and robustness. Their existence reflects a field that is rapidly professionalising, building institutional infrastructure to match the scale of the challenge.
When the Microscope Meets the Real World
The ultimate test of mechanistic interpretability is not whether it can produce elegant scientific insights about how language models work. It is whether it can tell a hospital administrator that an AI diagnostic tool is safe to deploy, tell a financial regulator that an algorithmic trading system will not precipitate a market crash, or tell a defence ministry that an autonomous weapons targeting system will reliably distinguish combatants from civilians.
By that standard, the field remains in its early stages. Current methods can identify individual features, trace specific circuits, and reveal particular reasoning patterns. They cannot yet provide comprehensive accounts of model behaviour across all possible inputs, guarantee the absence of dangerous capabilities, or produce the kind of formal safety proofs that high-stakes applications demand.
Yet the trajectory is unmistakable. In the space of two years, the field has moved from demonstrating that sparse autoencoders work on toy models to extracting millions of features from production systems, from static feature analysis to dynamic circuit tracing, and from purely academic research to integration into pre-deployment safety assessments. Anthropic's stated goal of reliable problem detection by 2027 may be ambitious, but the pace of progress makes it less implausible than it would have seemed even twelve months ago.
The neuroscience parallel offers both encouragement and caution. Neuroscientists have been studying the brain for over a century and still cannot fully explain how it produces consciousness, language, or complex decision-making. If artificial neural networks prove even a fraction as complex as biological ones, full interpretability may remain a receding horizon. But neuroscience has nonetheless produced enormously useful partial understanding: enough to develop treatments for neurological disorders, design brain-computer interfaces, and guide educational practices. Partial understanding of AI systems, even without complete transparency, may prove similarly valuable.
The governance implications of this partial understanding are profound. If mechanistic interpretability can reliably detect certain categories of problems, such as deceptive reasoning, specific biases, or known dangerous capabilities, then regulatory frameworks can be built around those detectable risks. The EU AI Act's transparency requirements need not demand complete interpretability to be meaningful; they need only demand interpretability sufficient to catch the problems that matter most.
What is needed, and what the field is only beginning to develop, is a rigorous framework for characterising exactly what current interpretability methods can and cannot detect, with quantified confidence levels and explicit acknowledgement of blind spots. Without such a framework, the risk is that interpretability becomes what security researchers call “security theatre”: a reassuring performance of understanding that obscures ongoing ignorance.
The convergence of neuroscience and AI interpretability research offers a path toward that framework. By grounding artificial system analysis in the conceptual vocabulary and methodological rigour of a mature scientific discipline, researchers can avoid the trap of mistaking pattern recognition for genuine understanding. The brain, after all, has taught us that the gap between observing neural activity and comprehending cognition is vast. The same humility should attend our attempts to read the minds of machines.
For now, the microscope is improving. The question that will define the next decade of AI governance is whether it can improve fast enough.
References and Sources
Anthropic. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Transformer Circuits, May 2024. https://transformer-circuits.pub/2024/scaling-monosemanticity/
Anthropic. “Mapping the Mind of a Large Language Model.” Anthropic Research, 2024. https://anthropic.com/research/mapping-mind-language-model
Anthropic. “Circuit Tracing: Revealing Computational Graphs in Language Models.” Transformer Circuits, 2025. https://transformer-circuits.pub/2025/attribution-graphs/methods.html
Anthropic. “On the Biology of a Large Language Model.” Transformer Circuits, 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Anthropic. “Tracing the Thoughts of a Language Model.” Anthropic Research, 2025. https://www.anthropic.com/research/tracing-thoughts-language-model
Anthropic. “Open-Sourcing Circuit-Tracing Tools.” Anthropic Research, May 2025. https://www.anthropic.com/research/open-source-circuit-tracing
Bills, Steven, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. “Language Models Can Explain Neurons in Language Models.” OpenAI, May 2023. https://openai.com/index/language-models-can-explain-neurons-in-language-models/
Davies, Adam, and Ashkan Khakzar. “The Cognitive Revolution in Interpretability: From Explaining Behavior to Interpreting Representations and Algorithms.” arXiv:2408.05859, August 2024. https://arxiv.org/abs/2408.05859
Kar, Kohitij, Martin Schrimpf, and Evelina Fedorenko. “Interpretability of Artificial Neural Network Models in Artificial Intelligence versus Neuroscience.” Nature Machine Intelligence, 2022. https://www.nature.com/articles/s42256-022-00592-3
Bereska, Leonard, and Efstratios Gavves. “Mechanistic Interpretability for AI Safety: A Review.” arXiv:2404.14082, April 2024. https://arxiv.org/abs/2404.14082
European Union. “Regulation (EU) 2024/1689: The Artificial Intelligence Act.” Official Journal of the European Union, 2024. https://artificialintelligenceact.eu/
Vox. “AI Interpretability: OpenAI, Claude, Gemini, and Neuroscience.” Vox Future Perfect, 2024. https://www.vox.com/future-perfect/362759/ai-interpretability-openai-claude-gemini-neuroscience
Nature. “AI Needs to Be Understood to Be Safe.” Nature News Feature, 2024. https://www.nature.com/articles/d41586-024-01314-y
Engineering.fyi. “Language Models Can Explain Neurons in Language Models.” 2023. https://www.engineering.fyi/article/language-models-can-explain-neurons-in-language-models
Nature Communications. “Catalyzing Next-Generation Artificial Intelligence Through NeuroAI.” Nature Communications, 2023. https://www.nature.com/articles/s41467-023-37180-x
Nature Reviews Neuroscience. “The Emergence of NeuroAI: Bridging Neuroscience and Artificial Intelligence.” 2025. https://www.nature.com/articles/s41583-025-00954-x
Nature Machine Intelligence. “The New NeuroAI.” Editorial, 2024. https://www.nature.com/articles/s42256-024-00826-6

Tim Green UK-based Systems Theorist & Independent Technology Writer
Tim explores the intersections of artificial intelligence, decentralised cognition, and posthuman ethics. His work, published at smarterarticles.co.uk, challenges dominant narratives of technological progress while proposing interdisciplinary frameworks for collective intelligence and digital stewardship.
His writing has been featured on Ground News and shared by independent researchers across both academic and technological communities.
ORCID: 0009-0002-0156-9795 Email: tim@smarterarticles.co.uk
