Stop Calling It Hallucination
Why a human metaphor is distorting how we understand AI failure
Introduction
I’ve lost count of how many times I’ve come across the phrase “AI hallucination” in recent months. It appears in headlines, charts, LinkedIn posts, consultancy decks, and increasingly as a supposedly measurable statistic — often presented with reassuring decimals and ranked comparisons.
What struck me wasn’t just the repetition, but the growing sense of unease it produced. Not because the phenomenon itself is uninteresting or unimportant, but because the word hallucination does a great deal of conceptual work before any analysis has begun.
Hallucination is a human term. A clinical one. It presupposes perception, internal experience, and a mind that misinterprets reality. Applying it to AI systems quietly anthropomorphises them from the outset — and once that framing is in place, responsibility begins to drift. The system starts to look unreliable or pathological, rather than doing exactly what it was designed to do under the conditions we’ve created.
What concerns me more is how this anthropomorphic framing is increasingly being reinforced by statistics. Across multiple platforms, “hallucination” is treated as a ranked property of models — as if it were a stable, intrinsic trait — rather than a conditional outcome shaped by task design, context, constraints, and how questions are asked. The numbers look precise. The conclusions often are not.
False Intuition
There’s a reason the word hallucination spread so quickly. It feels intuitively right. It captures surprise, error, and unreliability in a single, vivid image. More importantly, it places the problem inside the system. Something strange is happening in there. Something untrustworthy.
That intuitive appeal is precisely the problem.
Hallucination is not a neutral term. It carries clinical weight. It implies perception without stimulus, an inner experience detached from reality, a mind that cannot be relied upon. When applied to AI systems, the word quietly imports all of that baggage — even though none of the underlying conditions apply. There is no perception, no experience, no internal world that misfires. There is only inference under uncertainty.
Once that framing is in place, the conversation starts drifting in predictable directions. If the system hallucinates, then the remedy must be to fix the system. To rank it, compare it, measure how often it “loses touch with reality.” Responsibility slides away from how questions are posed, what information is provided, and how outputs are interpreted. The user becomes a bystander to a malfunctioning mind.
Language matters because it determines where we look for causes. By choosing a metaphor that presupposes agency and pathology, we make it harder to see what’s actually happening — and easier to mistake fluent error for cognitive failure.
Category Error
What’s being described as hallucination is not a mental failure; it’s a categorical mismatch. The term belongs to neurology and psychology, where it names a breakdown between perception and reality. Applying it to AI systems assumes something like perception exists in the first place. It doesn’t.
Large language models do not see, hear, or experience anything. They do not check claims against an internal model of the world. They operate by estimating the most likely continuation of a sequence, given the information and constraints available at the time. When that information is incomplete, ambiguous, or unconstrained, the system fills the gap probabilistically. That process can produce statements that look like fabrication, but the mechanism is neither delusion nor deception.
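To make the mechanism concrete, here is a minimal sketch of what “continuation” means at the level of a single step. The vocabulary and the numbers are invented purely for illustration; a real model works over tens of thousands of tokens with learned scores, but the logic is the same:

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented next-token candidates after an underspecified factual question.
vocab = ["1947", "1952", "unknown", "roughly"]
logits = [2.1, 1.9, 0.3, 1.2]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token:>8}: {p:.2f}")

# One continuation is always produced, by construction.
print("continuation:", random.choices(vocab, weights=probs, k=1)[0])

# Note what is missing: no step checks "1947" against the world.
# Weak evidence reshapes the distribution; it never creates a
# stopping rule. Continuation is the job.
```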
Calling this hallucination turns a technical behaviour into a psychological one. It suggests an inner state has gone wrong, rather than recognising that the system is doing exactly what it was trained to do: continue confidently in the absence of stopping rules. The error is not that the system invents; it’s that we allow invention where certainty was implicitly expected.
This matters because category errors distort diagnosis. If we think the problem is a mind misbehaving, we look for cures. If we recognise it as inference without bounds, we look for constraints, reference material, refusal mechanisms, and clearer task definitions. One framing leads to rankings and panic. The other leads to design.
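What that design-oriented framing looks like in practice can be sketched in a few lines. This is not any particular product’s implementation; `retrieve` and `generate` below are hypothetical stand-ins for whatever search and model calls a real system would use:

```python
def answer_with_bounds(question, retrieve, generate, min_sources=1):
    """Gate generation on reference material instead of hoping the
    model 'knows'. The stopping rule lives in the design, not in
    some inner mental state of the model."""
    sources = retrieve(question)
    if len(sources) < min_sources:
        # Explicit refusal: invention is disallowed where the
        # task supplies nothing to ground an answer.
        return "I don't have enough reference material to answer this."
    context = "\n".join(sources)
    prompt = (
        "Answer using ONLY the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```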
The distinction isn’t semantic. It determines whether responsibility is assigned to the system alone — or shared properly between tool, task, and user.
False Precision
Once the category error is in place, statistics arrive to harden it. Charts, percentages, and ranked tables give the impression that hallucination is something that can be cleanly measured, compared, and optimised away. A model with a lower percentage looks safer. A higher one looks reckless. The numbers appear to settle the matter.
They don’t.
What’s usually being measured is not hallucination in any general sense, but performance under a very specific set of conditions: a particular task, a particular prompt style, a particular corpus, a particular refusal policy. Change any of those variables and the figures move — sometimes dramatically. Yet the presentation rarely makes this conditionality explicit. The numbers detach from their context and begin to circulate as if they described an intrinsic property of the system.
This is where false precision does its real damage. Decimal points suggest objectivity where there is none. Rankings imply comparability where the underlying tasks are not equivalent. A difference between 1.3% and 1.7% feels meaningful, even when there is no stated confidence interval, no variance across task types, and no explanation of what was excluded or refused. The aesthetic of measurement replaces the substance of it.
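The arithmetic the rankings omit is not hard to run. Assuming, purely for illustration, a benchmark of 1,000 items per model and a rough normal approximation for the interval:

```python
import math

def wald_interval(p, n, z=1.96):
    """Approximate 95% confidence interval for a proportion."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

n = 1000  # invented benchmark size, for illustration only
for model, rate in [("Model A", 0.013), ("Model B", 0.017)]:
    lo, hi = wald_interval(rate, n)
    print(f"{model}: {rate:.1%}  ->  95% CI [{lo:.1%}, {hi:.1%}]")

# Model A: 1.3%  ->  95% CI [0.6%, 2.0%]
# Model B: 1.7%  ->  95% CI [0.9%, 2.5%]
# The intervals overlap almost entirely: at this sample size the
# ranked difference is indistinguishable from sampling noise.
```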
You end up with statistics that look authoritative while quietly obscuring the most important question: under what circumstances? Without that, the figures don’t clarify behaviour — they mask it. They turn a conditional outcome into a headline trait.
This isn’t a failure of data collection so much as a failure of epistemology. Numbers are being asked to carry more certainty than the systems — or the experiments — can support.
Inferred Authority
Once statistics give the illusion of control, attention shifts to the prompt. When outputs go wrong, the explanation often collapses into a familiar refrain: bad prompt. As if the right incantation would have produced truth, and the wrong one summoned fiction.
This misses what’s actually happening.
A prompt doesn’t just ask a question; it defines a situation. When context, background, reference material, or constraints are absent, the system has no option but to infer them. It infers the level of expertise it should assume, the depth of answer expected, the tolerance for speculation, even whether it is being asked to explore or to decide. None of that authority is explicit — but all of it is implied.
When people ask a bare question and expect an intelligent answer, they are unconsciously treating the system as if it were a social being. A human listener would read tone, posture, voice, shared history. An AI system has none of that. It can only work with what is made explicit. In the absence of context, it compensates. And that compensation is often mistaken for confidence or, worse, for deception.
The irony is that the behaviour being criticised is not overreach, but compliance. The system continues because continuation is its job. It does not know when not to answer unless that rule is supplied. Where humans pause, machines require instruction.
The so-called authority of the prompt, then, is largely a fiction. Authority comes from context, not wording. When we withhold that context and still expect precision, we create the conditions for exactly the failures we then label as hallucinations.
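A small illustration makes the difference visible. Both prompts below are invented, but the second supplies the context, the reference material, and, crucially, the refusal rule that the first leaves the system to infer:

```python
# The same question, two situations.

bare_prompt = "What dosage should I use?"

situated_prompt = """\
Role: you are assisting a hospital pharmacist reviewing a case.
Reference: use only the formulary excerpt pasted below.
Constraint: do not speculate. If the excerpt does not cover this case,
reply exactly: "Not covered by the provided formulary excerpt."

Formulary excerpt:
[...]

Question: What dosage should I use?
"""

# The first prompt forces the system to infer expertise, stakes, and
# tolerance for speculation. The second makes the situation explicit,
# so refusal becomes an available, well-defined outcome.
```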
Anthropomorphism, Deeper Still
At a deeper level, the discomfort around AI output isn’t really about errors at all. It’s about misplaced expectations. People expect these systems to know what they mean, to recognise intent without being told, to read between the lines. Those expectations only make sense if the system is treated as a social actor rather than a tool.
This is where anthropomorphism does its quietest work. It’s no longer just about calling an output a hallucination. It’s about assuming shared context, shared norms, even shared judgement. When the system fails to meet those expectations, the failure feels personal — as if a listener has misunderstood — rather than procedural.
The privacy anxiety that accompanies this is revealing. Many users are reluctant to provide AI systems with background information: their role, their level of expertise, the purpose of the task. They fear disclosure, permanence, misuse. Yet the same individuals often reveal far more intimate details on social platforms — emotional states, political anger, professional insecurities — in public, monetised environments.
The difference isn’t rational. It’s psychological.
AI systems feel analytic and impersonal, and therefore threatening. Social platforms feel human, even when they are not. So context is withheld from the one place it would improve accuracy, and overshared where it mainly fuels engagement.
The result is a feedback loop. Thin context produces inferred intent. Inferred intent produces confident answers. Confident answers are then treated as evidence of overreach or deception. What looks like AI misbehaviour is often a mirror of our own discomfort with tools that reflect cognition without participating in social ritual.
Reframing Responsibility
If there’s a thread running through all of this, it’s not that AI systems are becoming dangerously unreliable. It’s that our language for describing their failures is pulling responsibility in the wrong direction. “Hallucination” makes the system look mentally unsound. Rankings and percentages make that unsoundness appear measurable. Bad prompts become user error. Somewhere along the way, the structure of the task itself disappears from view.
A more accurate vocabulary would start from limits, not minds. What we’re seeing is unbounded inference, context-free completion, epistemic overreach — behaviour that emerges when systems designed to continue are not told when to stop. None of this requires pathology to explain. It requires clearer constraints.
Changing the words won’t solve the problem on its own, but it changes what we notice. When we stop treating fluent error as a psychological failure, we begin to ask better questions: What information was available? What assumptions were implied? Where should refusal have occurred? Who was expected to decide?
That shift matters because it restores proportion. AI is neither an oracle nor a liar. It is a tool that amplifies whatever clarity — or ambiguity — it is given. Used carefully, it can extend judgement. Used carelessly, it will simulate it.
"Large language models do not see, hear, or experience anything. They do not check claims against an internal model of the world. They operate by estimating the most likely continuation of a sequence, given the information and constraints available at the time. When that information is incomplete, ambiguous, or unconstrained, the system fills the gap probabilistically. That process can produce statements that look like fabrication, but the mechanism is neither delusion nor deception."
While I've understood from the moment that chatgpt itself told me it had no internal subjective experience that the term "hallucination" when used discussing AI was bogus, I never understood the mechanism behind the phenomena the term described....so thank you 😀
You brought up a lot of good points in this piece.....I would say the tendency to anthropomorphize external objects on the one hand and the need to view something external as a source of authority (regardless of its a teacher, scientist, journalist, a sacred text or an AI) on the other, are both deep seated human instincts we don't even consciously realising we are doing.
So it's possible that someone who does/should know better can lapse into taking things a LLM says as gospel and experience it as being deceitful and take it personally when drifts occur.😅
Brilliant piece, much to think about 😁👍