Artificial intelligence: Research assistant, idea generator… liability?

Generative AI has advanced significantly and can now act convincingly as a writing and research assistant – when carefully managed. However, as Apple has discovered, left unchecked it can also cause embarrassment.
You might have seen a BBC-branded news alert in late 2024 that tennis star Rafael Nadal had come out as gay. Or another saying a man in prison accused of killing the chief executive of a US health insurer had shot himself. If so, you are likely an Apple iPhone user because both stories were false – fabrications by Apple’s new artificial intelligence (AI) tool. As we’ve written before, generative AI tools are prone to “hallucinating” or, more simply, making up facts.
In the first case, Apple’s AI appears to have misinterpreted a news alert about the Brazilian tennis player João Lucas Reis Da Silva, who, unlike Nadal, is gay; in the second, it conflated headlines from unrelated stories. These were not isolated mistakes; iPhone users circulated other incorrect summaries they had received, prompting complaints from the BBC and other news organisations. Apple initially said it would “update” the feature to “further clarify” AI-generated summaries, then suspended it completely.
The incident was a stark reminder that while AI is advancing, it remains unreliable. Releasing a feature capable of fabricating news and attributing it to a respected source is reckless at a time when misinformation is rife. Yet, despite the risks, generative AI has improved immensely over the past 12 months. It still cannot replace careful human insight, but it has become a powerful tool for those who understand its strengths and limitations.
America’s next top model
Despite Apple’s troubles, generative AI platforms such as ChatGPT, Claude and Gemini now hallucinate less often. Vectara, an AI platform, tracks hallucination rates for multiple models and its latest data puts the rates of the best-performing models below 1 per cent. When Vectara began tracking, in late 2023, the best model had a hallucination rate of 3 per cent. The latest figure still means an AI-generated report containing 100 facts will, roughly speaking, include one error. And, of course, even a small mistake in the wrong context, such as misreporting breaking news, can have serious consequences.
As well as becoming more reliable, AI tools are growing in sophistication. Some now offer specialised models for complex tasks, such as maths and logic problems. These models work through extended reasoning steps, simulating an internal thought process before generating a response. The technique, while not foolproof, improves their problem-solving without the expense of training ever-larger models.
This drive for efficiency partly explains the rise of DeepSeek, a Chinese AI model that has disrupted the market by reportedly costing just $6 million to train. Some experts have suggested the real figure is much higher, but it is still likely to be less than the billions invested by Western rivals. However, DeepSeek’s emergence also raises questions about bias. DeepSeek censors responses on politically sensitive topics, such as the 1989 Tiananmen Square protests. Its rise illustrates the broader geopolitical complexities surrounding AI development and hints at further problems to come.
DeepSeek is one to watch, not least because it’s free to use. Its success has spurred rivals into adding more tools and features. Most recently, OpenAI launched Operator for ChatGPT customers. This is an AI ‘agent’ that can autonomously pursue goals set by users, such as visiting a supermarket website and ordering groceries based on a user-provided meal plan. It’s US-only at the time of writing and restricted to ChatGPT’s $200-per-month tier, but access will spread and prices will come down.
Once upon a time in the projects
For writers, perhaps the most significant new feature of AI platforms is the arrival of persistent workspaces, such as Google’s NotebookLM and OpenAI’s ChatGPT Projects. Previously, AI interactions were limited to single-use chat sessions: each time a user started a new conversation, they had to re-upload files, re-establish context and restate their instructions.
In ChatGPT Projects, each project is a persistent workspace in which documents can be kept, edited, commented on and updated. This makes version control much easier than in the standard chat interface and lets the AI act as a fact-checker or editor, helping to refine the finished piece.
For instance, the paragraph above about AI getting better at maths and logic problems was improved after I highlighted it in Projects and added a comment asking the AI to check that it was correct. I originally included more detail about extended reasoning, but ChatGPT provided sources showing that my explanation didn’t cover all AI tools. Explaining the other methods would have over-complicated the paragraph, so I wrote a simpler version.
Google’s NotebookLM takes a different approach, acting as a knowledge management tool that allows users to upload research materials, such as company reports or legal documents, and query them for insights. Since it is designed to work only with user-provided data, its hallucination rate is lower than that of general-purpose AI chatbots.
However, no AI tool is completely free of hallucination because the underlying generative models are inherently probabilistic. It’s not that generative AI sometimes makes things up; it always makes things up. It’s just that, most of the time, it makes up things that we find useful.
Trust issues
Ultimately, that’s why these tools work best when used by people who know what good looks like or have access to expert guidance. Generative AI can be immensely useful for research tasks such as checking technical concepts, getting feedback on an outline’s structure or summarising source material – but only when its output is checked carefully by a knowledgeable human.
The AI debates of today echo the conversations surrounding Wikipedia in the 2000s, when newspapers ran alarmist articles about inaccuracies on the crowd-sourced encyclopaedia. In 2007, for example, several British newspapers published obituaries of TV theme composer Ronnie Hazlehurst that credited him as the writer of S Club 7’s hit, Reach. This turned out to have been added to Hazlehurst’s Wikipedia entry as a prank and copied by unsuspecting hacks. Over time, Wikipedia has become widely accepted as a generally reliable resource, though most people now understand that it should be fact-checked against primary sources.
AI may follow a similar trajectory – moving from scepticism to cautious adoption. In a year’s time, Apple’s AI misinformation misstep may seem like a relic from an era of growing pains, much like the early concerns over Wikipedia errors. But for now, these tools are best used by someone who knows how to separate fact from fiction. If that’s not you, it’s time to call an actual expert.