Deep research is, in my opinion, one of the best use cases of LLMs. Even the most anti-AI of my friends acknowledge that it is a net positive. Because it relies on grounded, verifiable sources to answer queries, hallucinations become uncommon, and those that slip through can be identified by checking the sources. It is a use case of AI where one can have relatively high trust in the results.
So, I thought it would be interesting to document the failure modes that persist in Deep Research.
How does Deep Research work?
Deep research refers to the use of a harness or scaffold around LLMs to perform in-depth web investigations. Unlike simple question-answering, it involves formulating multiple adjacent search queries (similar to “prompt augmentation”), running those queries in a search engine, aggregating the results, and producing a final synthesis.
This approach is a great improvement over simple chat for complex topics, especially regarding hallucinations. By anchoring responses in retrieved sources, the risk of fabricating information is greatly reduced. In addition, citations and search trails make it possible to verify claims. If you need an overview of a complex topic you’re not familiar with, it can save a lot of time.
A few Deep Research implementations:
- Perplexity: Reformulates the user’s query 5–10 times, fetches ~20 sources per query, then synthesizes the results.
- Mistral: Proposes a search plan, gets suggestions and approval from the user, executes the plan, and summarizes findings.
- ChatGPT: Requests clarifications, then starts searching.
These implementations could be refined further, for example with several rounds of new searches. With the development of agents, we might see that kind of capability become more common, though not under the name of Deep Research (but that’s a story for another article).
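To make that shape concrete, here is a minimal sketch of such a multi-round loop. The `llm` and `web_search` helpers are hypothetical stand-ins for a chat-completion call and a search-engine API, and the prompts are only illustrative; the point is the structure (reformulate, search, aggregate, synthesize), not any particular vendor's implementation.

```python
# Minimal sketch of a Deep Research loop. `llm` and `web_search` are
# hypothetical helpers, not real APIs.

def llm(prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion API."""
    raise NotImplementedError

def web_search(query: str, k: int = 20) -> list[str]:
    """Hypothetical wrapper around a search API; returns page texts."""
    raise NotImplementedError

def deep_research(question: str, rounds: int = 2) -> str:
    notes: list[str] = []
    for _ in range(rounds):
        # 1. Reformulate the question into several adjacent search queries,
        #    informed by what has been gathered so far.
        queries = llm(
            "Write 5 web search queries, one per line, to investigate:\n"
            f"{question}\n\nNotes so far:\n" + "\n".join(notes)
        ).splitlines()
        # 2. Run each query and aggregate the retrieved sources.
        for q in (q.strip() for q in queries):
            if q:
                notes.extend(web_search(q))
    # 3. Synthesize a final, source-grounded answer with citations.
    return llm(
        "Answer the question using only these sources and cite them.\n"
        f"Question: {question}\nSources:\n" + "\n".join(notes)
    )
```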
Failure Modes of LLM Deep Research
Hallucinations still exist
Let’s start with the obvious one. Instructing the model to ground its answers in the search results eliminates most hallucinations, but not all. For queries with no clear-cut answer, the LLM tends to hallucinate a solution AND a source for it, sometimes even providing real links to unrelated articles. LLMs might even double down on the mistake when it is pointed out.
This is especially tricky because, at least in my case, questions with no clear-cut answer are exactly the ones I use Deep Research for the most. So the added value of Deep Research lies in gathering, parsing, and filtering many sources, but it doesn’t replace reading the primary sources yourself.
Source selection bias
The other, more general problem of Deep Research concerns which sources are fed to the LLM to generate its answer.
1. robots.txt: unequal access for humans and bots
Most websites have a robots.txt file, which can exclude bots from accessing their content. With the increased load that crawlers put on the web, many websites have added the big AI labs’ crawlers to their robots.txt. Nothing really forces the AI labs to abide by that anti-crawl policy, and one can reasonably doubt they do; but if they did, there would be an asymmetry between the sources a human can see and those a bot can access. Deep Research workflows would not be able to conduct exhaustive web searches and would certainly leave important sources out.
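For a given site, the asymmetry is easy to observe with Python’s standard library: the same page can be allowed for a browser-like user agent and disallowed for a crawler one (GPTBot is used below purely as an example of an AI crawler’s user agent; the site and path are placeholders).

```python
from urllib.robotparser import RobotFileParser

def allowed(site: str, path: str, user_agent: str) -> bool:
    # Fetch and parse the site's robots.txt, then test the rule for this agent.
    rp = RobotFileParser(f"{site}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, f"{site}{path}")

site = "https://example.com"  # any site you want to inspect
print("browser:", allowed(site, "/some-article", "Mozilla/5.0"))
print("GPTBot :", allowed(site, "/some-article", "GPTBot"))
```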
2. Retracted papers: dealing with context
A particular (and, I admit, niche) failure mode of Deep Research concerns retracted academic papers. The standards of scientific publishing require that, if a published article is later deemed unreliable (due to methodological errors, fraud, ethical violations, and so on), the publisher add an Expression of Concern or even a Retraction Notice. Still, a “retracted” article usually stays online, but a big red notice warns readers about the retraction before they download the article’s PDF.
Currently, an AI performing Deep Research completely ignores that context and processes the PDF separately from the text of the website. I ran a few tests with queries based on the content of famously retracted studies: the LLM found the studies and presented their content as proven, established results, without a single mention of the retraction.
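A harness could mitigate this with a simple guard: before handing a paper’s PDF to the model, scan the publisher’s landing page for retraction language and attach a warning to the source if any is found. The sketch below is my own assumption of how this could look, not a feature of any existing tool; the marker list is illustrative and far from exhaustive.

```python
from urllib.request import urlopen

# Phrases that commonly appear in retraction notices (illustrative, non-exhaustive).
RETRACTION_MARKERS = (
    "this article has been retracted",
    "retraction notice",
    "retracted article",
    "expression of concern",
)

def retraction_warning(landing_page_url: str) -> str | None:
    """Return a warning string if the paper's landing page mentions a retraction."""
    html = urlopen(landing_page_url).read().decode("utf-8", errors="ignore").lower()
    if any(marker in html for marker in RETRACTION_MARKERS):
        return ("WARNING: the publisher's page for this source carries a retraction "
                "notice or an expression of concern; treat its results as unreliable.")
    return None
```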
3. Language bias
Most searches yield results in the language of the query; this problem affects classic search engines as well as Deep Research workflows. For example, a query written in English about some events in Georgia (the country 🇬🇪) returns only results in English and none in Georgian. This is obviously a problem whenever the most relevant results are in a different language from the query. It concerns not only those who search for events in less documented regions of the world, but also most non-English speakers’ searches about computing, and many English speakers’ problems whose answer lies in an obscure Japanese forum (that happened to me more than once).
There is ongoing research into eliminating this language bias from semantic search, but it hasn’t reached search engines yet, and there seems to be little interest in the issue. An easy fix is to explicitly instruct the LLM to reformulate search queries in the most relevant language for the topic, but one needs to remember to do it.
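As a rough idea of what that instruction could look like when baked into a harness rather than typed by hand (reusing the same kind of hypothetical `llm` wrapper as in the earlier sketch, passed in as a plain callable):

```python
from typing import Callable

def multilingual_queries(question: str, llm: Callable[[str], str]) -> list[str]:
    # Ask the model to pick the most relevant language(s) itself and to
    # write the search queries directly in those languages.
    prompt = (
        "Decide which language(s) the most relevant sources for the question "
        "below are likely to be written in, then write 5 web search queries "
        "in those languages, one query per line and nothing else.\n\n"
        f"Question: {question}"
    )
    return [q.strip() for q in llm(prompt).splitlines() if q.strip()]
```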
Silent degradation of the product
Companies may quietly degrade the quality of their Deep Research tool to save costs. Some have been accused of reducing the number of searches per query or of lowering the relevance threshold for sources, which leads to noisier results and less exhaustive searches. That is not a problem of Deep Research per se, but rather a general problem of the industry at the moment.
So what?
Deep research is a rare case that makes the best use of LLMs’ strengths (summarizing and synthesizing information), but these pitfalls still need to be addressed. Better harnesses would be enough to solve a good part of these problems without requiring a breakthrough, so I have hope that the quality of our deep searches will keep improving.