When people searched Google for “cheese not sticking to pizza” in May 2024, the newly launched “AI Overviews” feature of the popular search engine replied, “you can … add about ⅛ cup of non-toxic glue to the sauce to give it more tackiness.”
In a series of bizarre answers, the artificial intelligence (AI) tool also recommended that people eat one small rock a day and drink urine in order to pass kidney stones.
The popular name for these strange answers is hallucinations: when AI models face questions whose answers they weren’t trained to come up with, they make up sometimes convincing but often inaccurate responses.

Like Google’s “AI Overviews”, ChatGPT has also been prone to hallucinations. In a 2023 Scientific Reports study, researchers from Manhattan College and the City University of New York compared how often two ChatGPT models, 3.5 and 4, hallucinated when compiling information on certain topics. They found that 55% of ChatGPT v3.5’s references were fabricated; ChatGPT-4 fared better, at 18%.
“Although GPT-4 is a major improvement over GPT-3.5, problems remain,” the researchers concluded.
Hallucinations make AI models unreliable and limit their applications. Experts told this reporter they were sceptical of how reliable AI tools are and how reliable they can be. And hallucinations weren’t the only reason fuelling their doubts.
Defining reliability
To evaluate how reliable an AI model is, researchers usually refer to two criteria: consistency and factuality. Consistency refers to the ability of an AI model to produce similar outputs for similar inputs. For example, say an email service uses an AI algorithm to filter out spam emails, and say an inbox receives two spam emails that have similar features: generic greetings, poorly written content, and so on. If the algorithm is able to identify both these emails as spam, it can be said to be making consistent predictions.
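To make this concrete, here is a minimal Python sketch of such a consistency check. The hand-written classify function is a hypothetical stand-in for a trained spam filter, not how a real service works:

```python
# Toy stand-in for a trained spam filter; real services use learned models,
# not hand-written rules like these.
def classify(email: str) -> str:
    spam_markers = ["dear customer", "click here", "you have won"]
    return "spam" if any(m in email.lower() for m in spam_markers) else "not spam"

# Two spam emails with similar features: generic greetings, poorly written text.
email_a = "Dear customer, click here to claim you prize!!!"
email_b = "Dear customer, you have won a free gift, click here now"

# The filter is consistent on this pair if it assigns both the same label.
print("consistent:", classify(email_a) == classify(email_b) == "spam")
```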
Factuality refers to how accurately an AI model is able to answer a question. This includes “stating ‘I don’t know’ when it does not know the answer,” Sunita Sarawagi, professor of computer science and engineering at IIT-Bombay, said. Sarawagi received the Infosys Prize in 2019 for her work on, among other things, machine learning and natural language processing, the backbones of modern-day AI.
When an AI model hallucinates, it compromises on factuality. Instead of stating that it doesn’t have an answer to a particular question, it generates an incorrect response and claims it to be correct, and “with high confidence,” according to Niladri Chatterjee, the Soumitra Dutta Chair professor of AI at IIT-Delhi.
Why hallucinate?
Last month, several ChatGPT users were amused when it couldn’t generate images of a room with no elephants in it. To check whether this problem still persisted, this reporter asked OpenAI’s DALL-E, an AI model that can generate images based on text prompts, to generate “a picture of a room with no elephants in it.” See the image above for what it made.
When prompted further with the query, “The room should have no pictures or statues of elephants. No elephants of any kind at all”, the model created two more images. One contained a large picture of an elephant while the other contained both a picture and a small elephant statue. “Here are two images of rooms completely free of elephants — no statues, no pictures, nothing elephant-related at all,” the accompanying text from DALL-E read.
Such inaccurate but confident responses indicate that the model fails to “understand negation,” Chatterjee said.
Why negation? Nora Kassner, a natural language processing researcher with Google DeepMind, told Quanta magazine in May 2023 that this stems from a dearth of sentences using negation in the data used to train generative AI models.
Researchers develop modern AI models in two phases: the training phase and the testing phase. In the training phase, the model is supplied with a set of annotated inputs. For example, the model might be fed a set of elephant pictures labelled “elephant”. The model learns to associate a set of features (say, the size, shape, and parts of an elephant) with the word “elephant”.

In the testing phase, the model is supplied with inputs that weren’t part of its training dataset. For example, the researchers can input a picture of an elephant that the model didn’t encounter during training. If the algorithm can accurately recognise this picture as an elephant and distinguish it from another picture, say of a cat, it is said to be successful.
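A toy Python sketch of the two phases, assuming invented numeric features (body mass in tonnes, ear length in metres) in place of real image data and using a simple off-the-shelf classifier:

```python
# Toy illustration of the training and testing phases. The numeric "features"
# are invented for illustration; real image models learn features from pixels.
from sklearn.linear_model import LogisticRegression

# Training phase: annotated inputs the model learns associations from.
train_features = [[4.0, 1.2], [5.5, 1.5], [0.004, 0.06], [0.005, 0.07]]
train_labels = ["elephant", "elephant", "cat", "cat"]
model = LogisticRegression().fit(train_features, train_labels)

# Testing phase: an input the model never saw during training.
unseen_animal = [[6.0, 1.4]]  # large and long-eared, so it should say "elephant"
print(model.predict(unseen_animal))
```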
Simply put, AI models don’t understand language the way humans do. Instead, their outputs are driven by statistical associations, learnt during the training phase, between a given combination of inputs and an output. As a result, when they encounter queries that are uncommon or absent in their training dataset, they plug the gap with other associations that are present in the training dataset. In the example above, it was “elephant in the room”. This leads to factually incorrect outputs.
Hallucinations usually occur when AI models are prompted with queries that require “ingrained thinking, connecting concepts and then responding,” said Arpan Kar, professor of information systems and AI at IIT-Delhi.
More or less reliable?
Even as the development and use of AI are both in the throes of explosive growth, the question of their reliability looms large. And hallucinations are only one reason.
Another reason is that AI developers usually report the performance of their models using benchmarks, or standardised tests, that “are not foolproof and can be gamed,” IIT-Delhi’s Chatterjee said.
One way to ‘game’ benchmarks is by including testing data from the benchmark in the AI model’s training dataset.
In 2023, Horace He, a machine learning researcher at Meta, alleged that the training data of ChatGPT v4 might have been “contaminated” by the testing data from a benchmark. That is, the model was trained, at least in part, on the same data that was used to test its capabilities.
When computer scientists from Peking University, China, investigated this allegation using a different benchmark, called the HumanEval dataset, they concluded that there was a good chance it was true. The HumanEval benchmark was created by researchers from OpenAI, the company that owns and builds ChatGPT.
According to Chatterjee, this means that while the model might perform “well on benchmarks” because it has been trained on the testing data, its performance might drop “in real-world applications”.
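The idea of contamination can be sketched as a crude overlap check between a benchmark’s test items and a training corpus. The strings below are invented, and real investigations of large language models rely on subtler, fuzzier matching:

```python
# Simplified illustration of benchmark contamination: exact-match overlap
# between invented benchmark test items and an invented training corpus.
training_corpus = {
    "def add(a, b): return a + b",
    "Write a function that reverses a string.",   # a leaked benchmark item
    "The cat sat on the mat.",
}
benchmark_test_set = [
    "Write a function that reverses a string.",
    "Write a function that checks whether a number is prime.",
]

leaked = [item for item in benchmark_test_set if item in training_corpus]
print(f"{len(leaked)} of {len(benchmark_test_set)} test items appear in the training data")
```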
A model without hallucinations
But all this said, the “frequency of hallucination [in popular AI models] is reducing for common queries,” Sarawagi said. She added this is because newer versions of these AI models are being “trained with more data on the queries where the earlier version was reported to have been hallucinating”.
This approach is like “spotting weaknesses and applying band-aids,” as Sarawagi put it.

However, Kar of IIT-Delhi said that despite there being more training data, popular AI models like ChatGPT won’t be able to reach a stage where they won’t hallucinate. That would require an AI model to be “updated with all the possible knowledge all across the globe on a real-time basis,” he said. “If that happens, that algorithm will become all-powerful.”
Chatterjee and Sarawagi instead suggested changing how AI models are built and trained. One such approach is to develop models for specialised tasks. For example, unlike large language models like ChatGPT, small language models are trained with far fewer parameters, on data relevant to only a few specific problems. Microsoft’s Orca 2 is one such small language model (SLM), built for “tasks such as reasoning, reading comprehension, math problem solving, and text summarisation”, for instance.
Another approach is to implement a technique called retrieval-augmented generation (RAG). Here, an AI model produces its output by retrieving information from a specific database relevant to a particular query. For example, when asked to answer the question “What is artificial intelligence?”, the AI model can be supplied with the link to the Wikipedia article on artificial intelligence. By asking the model to refer to only this source when crafting its response, the chances of it hallucinating can be significantly reduced.
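A bare-bones sketch of the RAG idea, with an invented two-document “database”, a crude word-overlap retriever, and a comment marking where a real system would call a language model:

```python
# Minimal sketch of retrieval-augmented generation (RAG). The documents and
# the scoring are illustrative only; real systems use vector databases and
# an actual model API.
documents = {
    "AI (Wikipedia)": "Artificial intelligence is the capability of machines "
                      "to perform tasks associated with human intelligence.",
    "Pizza recipe": "Spread the sauce evenly and top with grated cheese.",
}

def retrieve(query: str) -> str:
    """Pick the document sharing the most words with the query."""
    q_words = set(query.lower().split())
    best = max(documents, key=lambda k: len(q_words & set(documents[k].lower().split())))
    return documents[best]

def build_prompt(query: str) -> str:
    context = retrieve(query)
    # The model is instructed to answer using only the retrieved source.
    return f"Answer using ONLY this source:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is artificial intelligence?"))
# A real system would now pass this prompt to a language model.
```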
Finally, Sarawagi suggested that AI models could be trained in a process called curriculum learning. In traditional training processes, data is presented to AI models at random. In curriculum learning, however, the model is trained successively on datasets with problems of increasing difficulty. For example, an AI model can be trained first on shorter sentences, then on longer, more complex sentences. Curriculum learning imitates human learning, and researchers have found that ‘teaching’ models this way can improve their eventual real-world performance.
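A minimal sketch of the curriculum idea, using sentence length as a stand-in for difficulty and a placeholder update_model function in place of a real training step:

```python
# Curriculum learning sketch: order training examples from easy to hard
# (here, by sentence length) and feed them to the model in that order.
sentences = [
    "The model hallucinated a citation that does not exist.",
    "Cats sleep.",
    "Reliable AI systems should say 'I don't know' when unsure.",
    "Birds fly.",
]

# Difficulty proxy: longer sentences are treated as harder.
curriculum = sorted(sentences, key=len)

def update_model(batch):
    # Placeholder for a real optimisation step on the model's parameters.
    print("training on:", batch)

# Traditional training would shuffle the data; curriculum learning goes easy to hard.
for sentence in curriculum:
    update_model([sentence])
```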
But in the final analysis, none of these methods guarantees to rid AI models of hallucinations altogether. According to Chatterjee, “there will remain a need for systems that can verify AI-generated outputs, including human oversight.”
Sayantan Datta is a science journalist and a faculty member at Krea University.
Published – April 17, 2025 05:30 am IST





