Enhancing AI Agents: A Proposal for a "Thought" Role to Manage Hallucinations
In the march towards more useful AI agents, the biggest problem is dealing with hallucinations.
Both the frequency and severity of hallucinations have been steadily decreasing.
GPT-4o and similar models already perform at a very high level.
At Stubber we're running long multi-turn interactions, and as the context window grows, so does the propensity for hallucinations to occur.
The Token-Based Thinking Problem
LLMs cannot "think" without outputting tokens. This is why CoT (Chain-of-Thought) and similar strategies have been so successful: getting the model to output a plan and its reasoning before answering is genuinely helpful.
But I think current strategies are a bit "hacky".
Currently we have to tell the model to put parts of the "assistant" output into tags like <thought> and <reflection>, and then parse those parts out before returning a succinct answer to the user.
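As a rough illustration of that workaround, here is a minimal sketch using the OpenAI Python SDK. The tag name, prompt wording, and the ask helper are purely illustrative, not our production code:

```python
import re
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Before answering, write your reasoning inside <thought>...</thought> tags, "
    "then give the user-facing reply after the closing tag."
)

def ask(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    raw = response.choices[0].message.content
    # Parse out the "thinking" so the user only ever sees the final answer.
    return re.sub(r"<thought>.*?</thought>", "", raw, flags=re.DOTALL).strip()
```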
Current Role Limitations
If we go back to the API and the concept of roles, we have essentially 4 roles (in the OpenAI API standard):
| Role | Purpose | Type |
|---|---|---|
| system | High-level instructions | Input |
| user | User input | Input |
| assistant | LLM output (including function calls) | Output |
| function | Returning data to the LLM | Input |
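To make the table concrete, a request payload using these four roles looks something like this (the get_weather function and its values are made up for illustration):

```python
messages = [
    {"role": "system", "content": "You assist in collecting competition entries."},
    {"role": "user", "content": "What's the weather at the venue?"},
    # Output: the assistant's previous turn, requesting a function call.
    {
        "role": "assistant",
        "content": None,
        "function_call": {"name": "get_weather", "arguments": '{"location": "Cape Town"}'},
    },
    # Input: data returned to the LLM from that function call.
    {"role": "function", "name": "get_weather", "content": '{"temp_c": 21, "condition": "sunny"}'},
]
```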
Proposal: A Native "Thought" Role
I've been thinking about how valuable an additional output role might be, given that a model provider could train a model on its use. The output role I'd love to see is called: thought.
How It Would Work
Example Workflow
=====================
API Call 1:
=====================
system: You assist in collecting details from people interested in entering a competition...
=====================
=====================
API Response 1:
=====================
thought: I should greet the person in a friendly, excited manner...
=====================
=====================
API Call 2 (with thought):
=====================
system: [Original prompt]
thought: [Previous thought]
=====================
=====================
API Response 2:
=====================
assistant: Welcome to the competition! Would you like to enter? I can help.
=====================
Continuing the conversation when user says "Yes please":
=====================
API Call 3:
=====================
system: [Original prompt]
thought: [Previous thought]
assistant: [Previous response]
user: Yes please
=====================
=====================
API Response 3:
=====================
thought: I should probably collect a full name and some sort of contact details...
=====================
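A purely hypothetical sketch of this loop, assuming a chat-completions-style client and a model trained to emit either a thought or an assistant message (no current API supports a thought role, so every detail here is an assumption):

```python
def get_reply(client, model, messages, max_calls=4):
    """Hypothetical: loop until the model emits a user-facing "assistant" message."""
    for _ in range(max_calls):
        msg = client.chat.completions.create(
            model=model, messages=messages
        ).choices[0].message
        messages.append({"role": msg.role, "content": msg.content})
        if msg.role == "assistant":
            return msg.content  # shown to the user
        # msg.role == "thought": keep it in context, never show it to the user
    raise RuntimeError("Model kept thinking without producing an answer")
```

In the workflow above, the first call would come back with role "thought" and the second with role "assistant".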
Shortcomings
Latency
It takes 2 API calls to get a response to the user, thus increasing latency.
Token Usage
Having the LLM output thought tokens before responding each time will increase token usage. Given that the cost of tokens has been dropping significantly, I think this is a small price to pay.
Summary
We all know a person who speaks their mind constantly and a person who is rather deliberate and thoughtful, engaging in the conversation only after considering a few factors. What I'm proposing is that we work towards allowing LLMs to become more like the latter.
Author: Werner Stucky