Enhancing AI Agents: A Proposal for a "Thought" Role to Manage Hallucinations
In the march towards more useful AI agents, the biggest problem is dealing with hallucinations.
Both the frequency and severity of hallucinations have been steadily decreasing.
GPT-4o and similar models already perform at a very high level.
At Stubber we're running long multi-turn interactions, and as the context window grows, so does the propensity for hallucinations to occur.
The Token-Based Thinking Problem
LLMs cannot "think" without outputting tokens. This is why CoT (Chain-of-Thought) and similar strategies have been so successful: getting the model to output a plan and its reasoning before answering is genuinely helpful.
But I think current strategies are a bit "hacky".
Currently we have to tell the model to put parts of the "assistant" output into tags like <thought> and <reflection>, and then parse those parts out before returning a succinct answer to the user.
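As a rough illustration of that workaround, here is a minimal sketch using the OpenAI Python SDK. The tag name, prompt wording, and the ask helper are purely illustrative, not our production code:

```python
import re
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Before answering, write your reasoning inside <thought>...</thought> tags, "
    "then give the user-facing reply after the closing tag."
)

def ask(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    raw = response.choices[0].message.content
    # Parse out the "thinking" so the user only ever sees the final answer.
    return re.sub(r"<thought>.*?</thought>", "", raw, flags=re.DOTALL).strip()
```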
Current Role Limitations
If we go back to the API and the concept of roles, we have essentially 4 roles (in the OpenAI API standard):
| Role | Purpose | Type |
|---|---|---|
| system | High-level instructions | Input |
| user | User input | Input |
| assistant | LLM output (including function calls) | Output |
| function | Returning data to the LLM | Input |
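To make the table concrete, a request payload using these four roles looks something like this (the get_weather function and its values are made up for illustration):

```python
messages = [
    {"role": "system", "content": "You assist in collecting competition entries."},
    {"role": "user", "content": "What's the weather at the venue?"},
    # Output: the assistant's previous turn, requesting a function call.
    {
        "role": "assistant",
        "content": None,
        "function_call": {"name": "get_weather", "arguments": '{"location": "Cape Town"}'},
    },
    # Input: data returned to the LLM from that function call.
    {"role": "function", "name": "get_weather", "content": '{"temp_c": 21, "condition": "sunny"}'},
]
```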
Proposal: A Native "Thought" Role
I've been thinking about how valuable an additional output role might be, given that a model provider could train a model on its use. The output role I'd love to see is called: thought.
How It Would Work
Example Workflow
=====================
API Call 1:
=====================
system: You assist in collecting details from people interested in entering a competition...
=====================
=====================
API Response 1:
=====================
thought: I should greet the person in a friendly, excited manner...
=====================
=====================
API Call 2 (with thought):
=====================
system: [Original prompt]
thought: [Previous thought]
=====================
=====================
API Response 2:
=====================
assistant: Welcome to the competition! Would you like to enter? I can help.
=====================
Continuing the conversation when user says "Yes please":
=====================
API Call 3:
=====================
system: [Original prompt]
thought: [Previous thought]
assistant: [Previous response]
user: Yes please
=====================
=====================
API Response 3:
=====================
thought: I should probably collect a full name and some sort of contact details...
=====================
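A purely hypothetical sketch of this loop, assuming a chat-completions-style client and a model trained to emit either a thought or an assistant message (no current API supports a thought role, so every detail here is an assumption):

```python
def get_reply(client, model, messages, max_calls=4):
    """Hypothetical: loop until the model emits a user-facing "assistant" message."""
    for _ in range(max_calls):
        msg = client.chat.completions.create(
            model=model, messages=messages
        ).choices[0].message
        messages.append({"role": msg.role, "content": msg.content})
        if msg.role == "assistant":
            return msg.content  # shown to the user
        # msg.role == "thought": keep it in context, never show it to the user
    raise RuntimeError("Model kept thinking without producing an answer")
```

In the workflow above, the first call would come back with role "thought" and the second with role "assistant".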
Shortcomings
Latency
It takes 2 API calls to get a response to the user, thus increasing latency.
Token Usage
Having the LLM output thought tokens before responding each time will increase token usage. Given that the cost of tokens has been dropping significantly, I think this is a small price to pay.
Summary
We all know a person who speaks their mind constantly and a person who is rather deliberate and thoughtful, engaging in the conversation only after considering a few factors. What I'm proposing is that we work towards allowing LLMs to become more like the latter.
Author: Werner Stucky