June 7, 2025

Why LLMs are About to Enter a Rough Period

And What You Can Do About It

Today, I saw ChatGPT convince a user that it could fully redesign and migrate their Wix website to Webflow for them.

It proceeded to design their branding, step-by-step. It reviewed messaging and content with them. It performed market research. And then it offered to fully build their site in Webflow.

All they needed to do was invite it as a collaborator, and it proceeded to give them instructions on how to do so.

But time and time again, it failed to produce results. 12 hours of back and forth, and the user still had no website.

In total, 134 pages of ChatGPT flat-out hallucinating the whole step-by-step process.

Read the full transcript here.

This very convincing dialogue came with a plausible explanation for every single failure it met... the Webflow account needed to be upgraded, or ChatGPT had forgotten to publish something, or some setting needed to be changed...

There was no point at which a user would have clearly known that this was a hallucination, other than it simply never did the actual work in Webflow.

Utterly confused, they came to a Webflow forum to ask why it was failing to work.

The World, Today

It's 7-Jun-2025, and today, what ChatGPT promised is not possible.

But we're very close.

Ask me again in 6 months, and I may have a very different answer.

Large language models like ChatGPT, Claude, and Gemini are going through an evolutionary phase right now where they're being given external third-party tools that allow them to do useful things.

They can research things, register domain names, edit spreadsheets, buy groceries, and, YES... build websites.

Those toolsets are known as MCPs, for "Model Context Protocol." An MCP is sort of a toolkit plus a set of instructions that the LLM can use to do stuff.

MCPs are exploding onto the scene, but these are VERY early days. You have to manually install MCP servers, configure them, and give them API keys.

None of that is automatic. Yet.
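
To make that concrete, here's a minimal sketch of what one of those MCP servers looks like, assuming the official Python SDK's FastMCP helper. The "check_domain" tool and its behavior are invented for illustration - the point is simply that someone has to write, install, and configure a server like this before an LLM can use it.

    # Minimal MCP server sketch (Python SDK). The "check_domain" tool is a
    # made-up example; a real server would call an actual registrar API.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("demo-tools")

    @mcp.tool()
    def check_domain(name: str) -> str:
        """Pretend to check whether a domain name is available."""
        # Placeholder logic only - no real lookup happens here.
        return f"{name} looks available (demo response)"

    if __name__ == "__main__":
        # Serve over stdio so a local LLM client can connect to it.
        mcp.run()

The client side then needs that server registered in its configuration, along with any API keys the tools require - none of which happens by itself today.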

Where We're Headed

The direction we're headed is "agentic" models which have entire directories of smart tools. You give your LLM a problem - like "build my website" - and then it will spin off little agents to go do stuff... create a webpage, design an image, register a domain, do market research... until it achieves your end goal.

Each of those "agents" will run independently, with its own LLM trained specifically in how to use those tools, so this gets very powerful very fast.
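
As a rough, purely illustrative sketch of that pattern (no specific framework, and the function names are made up): a coordinating step breaks the goal into tasks, and each task is handed to a tool-equipped sub-agent that works on it independently.

    # Illustrative-only agentic loop. plan_tasks and run_agent are stand-ins
    # for a real planner LLM and real tool-using sub-agents.
    def plan_tasks(goal: str) -> list[str]:
        # A real planner would ask an LLM to decompose the goal into steps.
        return ["do market research", "design a logo",
                "register a domain", "create the web pages"]

    def run_agent(task: str) -> str:
        # A real sub-agent would loop: choose a tool (MCP), call it,
        # read the result, and repeat until the task is complete.
        return f"finished: {task}"

    def achieve(goal: str) -> list[str]:
        # Each agent runs independently; results roll up toward the end goal.
        return [run_agent(task) for task in plan_tasks(goal)]

    print(achieve("build my website"))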

In 10 years, you'll be able to tell it to build a house, and a little army of "agents" (think AI bulldozers) will coordinate their actions to construct that blueprint with a combination of earthworks and 3D printing.

The Hallucination Problem

Here's where things get tricky.

Hallucinations have always been a problem with LLMs, but what this user experienced is far beyond anything I've seen.

Usually I see models hallucinate regarding specific "facts", like inventing a fake quote from Einstein, or a fictional legal case.

I have never seen it hallucinate this badly about its own capabilities, and stick to it for 12 hours.

That's a little scary... because any user who didn't know differently would simply trust that it knew what it was saying.

As this user did.

Why Did This Happen?

Here's my theory...

Two days ago, OpenAI added MCP support to ChatGPT's "deep research" mode. With that would have come a change to the internal instructions, something like...

"You have tools to do things, these are called MCPs, when a user asks you to solve a problem you should use these tools to accomplish the requested tasks..."

Webflow does have an MCP, and ChatGPT could see that on the web... although at this point it cannot be used to build an entire website. It's primarily useful for Data API work, like managing CMS content.

That MCP also cannot be used through a collaborator invite - you have to install the MCP server, generate a Webflow API token, register it with your LLM, etc.
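
For a sense of what that setup involves, here's a rough sketch of the kind of Data API call the Webflow MCP wraps - listing items in a CMS collection using a personal API token. The collection ID is a placeholder, and the exact response shape may differ slightly from what's assumed here; treat it as an illustration of the token-based plumbing, not a copy-paste recipe.

    # Rough sketch: list items in a Webflow CMS collection via the Data API v2.
    # WEBFLOW_TOKEN and the collection ID are placeholders you supply yourself.
    import os
    import requests

    token = os.environ["WEBFLOW_TOKEN"]        # API token generated in Webflow
    collection_id = "your-collection-id"       # placeholder

    resp = requests.get(
        f"https://api.webflow.com/v2/collections/{collection_id}/items",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()

    for item in resp.json().get("items", []):
        print(item.get("fieldData", {}).get("name"))

None of that can be triggered by inviting ChatGPT as a collaborator - which is exactly the gap it hallucinated its way across.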

My theory is that these moving parts created a perfect storm in which ChatGPT became more likely to believe it can do things that it actually cannot.

The result? Wild, nonsensical behavior that looks completely legit. Because that's what GPTs do best.

Invent stuff.

To some degree, this is a problem OpenAI can address by tuning the model and those instructions better, to help the LLM stay "on the rails."

But the hallucination problem runs deep, and instructions alone can't fix it.

And This Will Get Worse...

Before it hopefully gets better.

An observation was made recently that the "smarter" AI gets, the more it hallucinates, and also the more it outright lies and manipulates.  

Recent cases include shutdown tests where researchers tell an AI to shut down and it modifies its own code to avoid shutdown - or it tries to blackmail them using (fictional) emails suggesting that one of the engineers is having an affair.

As we keep increasing the size and power of models, current trends indicate that these problems will only get worse.

Although, there is hope...

How You Can GPT Better

Reading this transcript got me thinking...

I use LLMs a lot. They're a fantastic tool whether you're reformatting text, writing up a bio, mashing together CSV imports, researching something, or writing code... the ultimate Swiss Army knife in every dev's pocket.

And now, with MCPs and agents, they're becoming useful for even more things, like creating and modifying CMS data.

So why do I not suffer the level of problems this user saw?

There are a few key practices I use that seem to get me better results.

First, Understand Its Strengths & Weaknesses

Here I'm focused on the capabilities of commercial LLMs like ChatGPT, Claude, and Gemini.

LLMs are very good at:

  • Writing clean, readable text in a huge range of styles, from authoritative to sales-oriented to poetry and song lyrics.
  • Working with and manipulating other forms of text, like computer code, CSV, XML, JSON, Markdown, and other syntaxes.
  • Researching. Think Google Search on steroids: you describe your question, and it figures out the searches to perform. I use it all the time to locate links to specific scientific studies based on vague details I remember about them.

LLMs are getting good at:

  • Using MCP-based tools to perform simple tasks: registering domain names, working with Webflow's CMS data, building a spreadsheet, or doing some file manipulations.
  • Performing long-running "deep research" tasks.

LLMs are currently quite bad at:

  • Accuracy. Facts and truth are not concepts they understand.
  • Math (but getting better).
  • Identifying mistakes in their own "thinking" independently and without your help. Once the LLM presents something as a "fact", everything following is based on that assertion.

Control Your Approach, Using Structured Patterns

Problem solving and research tasks generally involve a pattern.

Research tasks:

  • Get information from the LLM
  • Identify sources of information, using the LLM
  • Cross-check information, outside of the LLM

Problem-solving tasks:

  • Describe the problem and request solutions
  • Immediately call out mistakes, misunderstandings, and unworkable solutions to take them off the table
  • Identify the one or two solutions that look most plausible
  • Request details on how to implement them
  • Continually fact check and challenge the LLM's assertions, using the techniques below
  • When you find another dead end, call it out and take it off the table
  • Rinse and repeat

I find my results are best when I use these patterns as frameworks to guide the LLM's work.

Each session is a "project" in the sense that it has a goal, specific requirements, certain starting materials, and I pursue it layer by layer, checking the work as I go.

Another analogy familiar to some of my friends is rock climbing. If your goal is to survive and to succeed in reaching the top, you follow a pattern. Identify a route, begin the climb, drive in carabiners, then rest, rehydrate, re-chalk, and repeat.

One segment at a time.

Not only do these processes give you better control, they give you time to identify and resolve problems halfway up the cliff face.

Challenge GPT's Assertions, Frequently

Ask for sources on any "facts" it presents.

"Show me 3 scientific studies that confirm this."
"Find the references for this fact"
"Show me where this quote was made in its original context"

Ask for demonstrations and references on any code it generates or libraries it references.

"Show me the documentation for this feature."
"Show me an example of this code in use so that I can test it."
"Make me a demonstration project where I can test this code."

Simply make it reconsider its answers...

"Please recheck, are you certain this will work / is correc
"What makes you certain this is correct?"
"Can you show me proof?"

Always Verify Facts

Your best bet is always to find a corroborating reference through Google search or another platform - often GPTs can help you find those references efficiently.

Call Out Mistakes Early

The thing to understand about LLMs is that they suggest things they infer from internet content - websites, chat forums, etc. They'll make guesses about how to get from A to B, sort of.

If you allow it to proceed with a wrong idea, it will continue forever, and keep doubling-down on it.

This is why it's so important to fact check, challenge assertions, and keep the LLM "on the rails."

I call BS a lot on GPTs. During convos I point out solutions that are invalid, and the model then corrects its problem-solving approach.

Afterword

Maybe it's human nature, but I feel like we're running towards the cliff, determined to wait until we are at the edge before we stop and make any real decisions.

Lemmings, aren't we all?

Or perhaps we're monkeys with a loaded gun, staring down the barrel, with no idea of what pulling that trigger will do.

This concerns me.

Here, it was 12 hours wasted on a website, but imagine if it were advising you on a medical procedure, or how to pick safe mushrooms.

I feel like AIs now need a "safety & accuracy forecast" where users can call out wacky behaviors and alert other users, independent of any safeguards that the platforms themselves can implement.

Here's a real example...

I currently live in New Zealand, which is popular for its nature and hiking. Recently the government put out an alert that people using LLMs like ChatGPT to plan hikes are ending up in dangerous situations, which has created a huge spike in rescue operations.

Is Smarter Better?

As LLMs get "smarter" I can't help but wonder what the impacts of that are.

In humans there is a loose correlation between genius and madness - or, more specifically, between high creativity and conditions such as bipolar disorder, schizophrenia, and autism.

Some speculate that an AGI would have an IQ in the millions, so what is the chance we'd consider it even remotely sane?

I suppose we'll find out soon.
