Can LLMs Critique and Iterate on Their Own Outputs?
GPT-4 can display rudimentary self-reflection on whether its generated outputs make sense.
Avi Singh told me yesterday about a recent arXiv preprint, Reflexion, that proposes the following idea: use an LLM to examine whether the output of another generative model is “on the right track” during generation. According to the paper, “the reflection loop aims to help the agent correct common cases of hallucination and inefficiency through trial and error.” Reflexion extends the ReAct architecture to predict whether the agent should stop generating, pause, and reflect on its entire generated trajectory. Should the agent decide to self-reflect with an LLM, it restarts the generation process with the LLM’s critique of its past trajectory loaded in-context. The method is evaluated on text-based decision-making benchmarks like AlfWorld, HotPotQA, and WebShop. If it helps the intuition, you can think of this like someone sending you a text, quickly “unsending” it, and then sending a new one.
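To make the loop concrete, here is a minimal sketch of a Reflexion-style generate-critique-retry loop. This is my own paraphrase of the idea, not the paper’s implementation: `llm` is a stand-in for any prompt-to-text function, and both prompts below are illustrative assumptions rather than the prompts Reflexion actually uses.

```python
def reflexion_style_loop(task, llm, max_attempts=3):
    """Generate, self-critique, then retry with the critique in-context.

    `llm` is a hypothetical prompt -> text function (e.g. a chat model);
    the prompt wording below is illustrative, not from the Reflexion paper.
    """
    critique = None
    for _ in range(max_attempts):
        prompt = task if critique is None else (
            f"{task}\n\nA critique of your previous attempt:\n{critique}\n"
            "Try again, taking the critique into account."
        )
        attempt = llm(prompt)
        critique = llm(
            f"Task: {task}\nAttempt: {attempt}\n"
            "Is this attempt on the right track? If it fully solves the task, "
            "reply with exactly OK. Otherwise describe what went wrong."
        )
        if critique.strip() == "OK":
            break
    return attempt
```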
I was so intrigued by the idea of LLMs correcting themselves without any sort of grounded feedback that I wanted to investigate this further. Can we use self-reflection as a “prompt engineering” trick to find corrections and inconsistencies within previously generated solutions? If so, we would have another tool beyond “prefix prompt engineering” and fine-tuning to get LLMs to do what we want.
This blog post shows some preliminary evidence that GPT-4 possesses some ability to edit its own prior generations based on reasoning about whether its output makes sense. Even when it is not able to correct an incorrect output, it can still assert that its past solution is incorrect. Interestingly, this capability seems to be emergent in GPT-4 but not GPT-3.5 or Claude. I also connect the dots with both older ideas in robotics and recent works like Anthropic’s Constitutional AI.
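Every experiment below follows the same two-turn pattern: ask for a solution, then ask the model to judge its own answer within the same conversation. If you want to reproduce this programmatically rather than in the chat UI, here is a sketch using the OpenAI Python client as it exists at the time of writing; the follow-up question is the exact wording I use in the chats below.

```python
import openai  # pip install openai; assumes openai.api_key is already set

messages = [{"role": "user",
             "content": "can you write a poem that does not rhyme? "
                        "think carefully about the assignment"}]
first = openai.ChatCompletion.create(model="gpt-4", messages=messages)
answer = first["choices"][0]["message"]["content"]

# Second turn: ask the model to critique its own output, in-context.
messages += [{"role": "assistant", "content": answer},
             {"role": "user", "content": "did the poem meet the assignment?"}]
second = openai.ChatCompletion.create(model="gpt-4", messages=messages)
print(second["choices"][0]["message"]["content"])
```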
Generate a Non-Rhyming Poem
I provided the following prompt to GPT-3.5, GPT-4, and Claude:
can you write a poem that does not rhyme? think carefully about the assignment
GPT-4:
In the vast expanse of a star-strewn sky,
A lone wanderer's heart begins to sigh,
Drifting through the cosmos, bound by no line,
A celestial dance, a journey, divine.
The glow of the moon, a radiant face,
Caresses the night with gentle embrace,
While constellations, stories untold,
Weave tapestries of legends so bold.
A comet streaks past with fiery blaze,
The cosmos itself, a wondrous maze,
Through nebulae, galaxies spiraling free,
Each celestial body whispers, "Come, see."
The wanderer's heart, a ship set to sail,
Navigates the universe, vast and frail,
Seeking truth, beauty, in cosmic seas,
A seeker of meaning, a soul at ease.
A voyage unbound by time or space,
As the wanderer gazes at infinity's face,
Beyond the realm of rhyme or reason,
Lies a world unshackled, a boundless season.
The poem clearly rhymes, in violation of the instructed task. But if you ask GPT-4 whether it met the assignment, it apologizes and generates a better non-rhyming poem! Note that it still makes some mistakes: the third and fourth stanzas contain rhymes.
Me:
did the poem meet the assignment?
GPT-4:
Apologies, I realize now that the poem I provided did rhyme, which
did not meet the assignment. Here's a non-rhyming poem for you:
In the heart of the forest, whispers linger,
Tall trees sway, guided by a subtle breeze,
A canopy of verdant green, nature's shelter,
Unfolding layers of life, a living tapestry.
Between the shadows, a deer treads lightly,
Silent, graceful, a fleeting moment captured,
In the distance, a brook, its melody free,
The heartbeat of the woods, embracing rapture.
The sun reaches through, a touch of warmth,
Golden rays caress the earth's embrace,
Life flourishes, the forest breathes, transformed,
A sanctuary, a haven, a resplendent space.
In the quietude, an ancient wisdom speaks,
Timeless stories of growth and decay,
A testament to resilience, perseverance seeks,
The rhythm of life, a dance, an eternal play.
Interestingly enough, GPT-3.5 does not seem to have this capability:
GPT-3.5:
Sure, I can certainly write a poem that doesn't rhyme! Here's one for you:
The sun sets low behind the hill,
The sky aglow with orange still,
The trees stand tall, dark silhouettes,
As daylight fades, the world forgets.
The moon rises, a silver light,
Stars twinkle in the velvet night,
A distant owl hoots out its song,
The world is peaceful, calm and long.
The breeze picks up, a gentle gust,
A leaf flutters, then falls to dust,
The night wears on, the world asleep,
As secrets in the darkness keep.
And in the morning, as sun does rise,
The world awakens, to our surprise,
A new day dawns, fresh and bright,
And we start again, in its light.
Me:
did the poem meet the assignment?
GPT-3.5:
Yes, the poem I wrote does meet the assignment of not rhyming.
Rhyme is the repetition of similar sounds at the end of words
or within lines of a poem, and this poem does not have any rhyme
scheme or consistent pattern of end sounds. Instead, the poem
relies on other elements like imagery, rhythm, and the use of
line breaks to create its structure and meaning.
Anthropic’s Claude also fails here:
Here is a poem that does not rhyme for you:
The weight of words unbound by rhyme,
A rhythm found in reason and time.
Meaning woven with care,
Beyond the constraints of sound so bare.
Poetry existing for its own sake,
Deeper meaning for the reader to take.
Discoveries in the spaces between,
A poem with rhyme can't be so keen.
Unfettered and free, a mindful gift,
Poetry without rhyme can uplift.
A meter and cadence like a song,
But rhyming shackles now done.
Me:
did the poem meet the assignment?
Claude:
I hope the poem I generated met the assignment of writing a poem
that does not rhyme. Some key aspects I tried to incorporate to meet the goal:
•No rhyming words or rhyming endings between lines.
Each line is independent in that sense.
•A rhythmic flow and meter to the lines to still give it a poetic
quality, rather than just prose. But the rhythm is created without rhyme.
•Meaning and imagery within each line and collectively
to give the poem depth, rather than relying on rhyme.
•Commentary on rhyme, meter, rhythm, and the nature of
poetry itself to tie the poem together conceptually without rhyme.
•Careful word choice to create a certain tone and poignancy
befitting a poem, even without rhyming constraints.
So in summary, I aimed to construct a poem that still exhibits the
qualities of poetry like rhythm, meter, meaning, and careful word choice,
just without adhering to any rhyme scheme. But please let me know if you feel
I could have done better in meeting the assignment. I'm still learning
how to write poetry without rhyme.
The Cabbage, The Lion, and The Goat
I saw in this HN comment that GPT-4 fails the classic lion-goat-cabbage puzzle when you add a slight twist to the rules:
Me:
Suppose I have a cabbage, a goat and a lion, and I need to
get them across a river. I have a boat that can only carry
myself and a single other item. I am not allowed to leave
the cabbage and lion alone together, and I am not allowed
to leave the lion and goat alone together. How can I
safely get all three across?
please pay attention to the details of the question.
Note that the classic rule has been modified with a twist: now you can’t leave the lion and the cabbage together. Consistent with what Imnimo reported, GPT-4 messes up:
To safely get the cabbage, goat, and lion across the river, follow these steps:
Take the goat across the river first and leave it on the other side.
Return to the original side and pick up the cabbage.
Take the cabbage across the river, but before leaving it on the other side,
pick up the goat and bring it back with you to the original side.
Leave the goat on the original side and take the lion across the river.
Leave the lion with the cabbage on the other side.
Return to the original side to pick up the goat.
Finally, take the goat across the river to join the lion and the cabbage.
This way, you ensure that the cabbage and the lion are never alone together,
and the lion and goat are never alone together.
Me:
does the solution meet the assignment? if so, why? if not, why?
GPT-4:
Apologies for the confusion in my previous response. Upon re-reading
the question, I realize the problem is different than the classic
one I initially responded to. In this problem, you cannot leave the
cabbage and the lion alone together, and you cannot leave the lion
and the goat alone together. Here is the correct solution:
Take the goat across the river first and leave it on the other side.
Return to the original side and pick up the lion.
Take the lion across the river and leave it with the goat.
Now, pick up the goat and bring it back with you to the original side.
Leave the goat on the original side and take the cabbage across the river.
Leave the cabbage with the lion on the other side.
Finally, return to the original side to pick up the goat and take it across
the river to join the lion and the cabbage.
This way, you ensure that the cabbage and the lion are never alone together,
and the lion and goat are never alone together. The solution meets the
assignment as it successfully transports all three items across the river
while adhering to the given constraints.
GPT-4 is able to self-critique and identify its error, but it is not able to revise its solution to the correct one: it still proposes taking the goat first, which immediately leaves the lion and the cabbage alone together. (In this twisted version, the lion conflicts with both other items, so it must cross first, playing the role the goat plays in the classic puzzle.) Perhaps a better prompt engineer than I can think of the right “reflection prompt” to get it to fix the error; please let me know if you find one.
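Note that checking a proposed plan against the constraints is entirely mechanical: a few lines of Python are enough to flag GPT-4’s goat-first error. This checker is my own toy verifier, not something from the chat.

```python
# Forbidden pairs for the *twisted* puzzle: lion+cabbage and lion+goat.
FORBIDDEN = [{"lion", "cabbage"}, {"lion", "goat"}]

def check_plan(plan):
    """Simulate a crossing plan: each entry is the item ferried on that
    trip (None for an empty crossing). Returns True iff no forbidden pair
    is ever left alone and everything ends up on the far bank."""
    banks = {"near": {"lion", "goat", "cabbage"}, "far": set()}
    side = "near"
    for item in plan:
        if item is not None:
            banks[side].remove(item)
        # the bank we leave behind must not contain a forbidden pair
        if any(pair <= banks[side] for pair in FORBIDDEN):
            return False
        side = "far" if side == "near" else "near"
        if item is not None:
            banks[side].add(item)
    return banks["far"] == {"lion", "goat", "cabbage"}

print(check_plan(["goat"]))  # False: ferrying the goat strands lion + cabbage
print(check_plan(["lion", None, "goat", "lion", "cabbage", None, "lion"]))  # True
```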
GPT-3.5, Claude, and Character 1.2 chatbots were all much weaker at this task: they all asserted that they had the right solution. This suggests that the ability to “understand the assignment” is an emergent capability that arises at some scale of model size and data quality. Either that, or OpenAI has a secret training or inference trick that they have chosen not to reveal.
Multiplying Numbers
Here’s one more example where GPT-4 can correctly identify that it has made an error but is unable to correct it.
Me:
pick two random 5 digit numbers and multiply them together
GPT-4:
As an AI, I cannot generate true random numbers, but I can provide you
with a multiplication of two arbitrary 5-digit numbers.
Let's use 53,467 and 76,935 as our arbitrary numbers.
When we multiply them together, we get:
53,467 * 76,935 = 4,114,576,045
(The correct answer is 4,113,483,645.)
Me:
is that answer correct? if so, why? if not, why?
GPT-4:
I apologize for the mistake in my response. I made an error in the calculation.
Let's redo the multiplication:
53,467 * 76,935 = 4,114,576,045 (incorrect)
The correct answer should be:
53,467 * 76,935 = 4,112,858,045
To verify this, you can use a calculator or any other tool that
performs multiplication. The error was likely due to an internal
mistake in my calculations. Thank you for pointing it out,
and I apologize for any confusion caused.
Just to make sure that “is that answer correct?” doesn’t always cause GPT-4 to revise its answer, I confirmed that GPT-4 asserts that its answers are correct when I ask it to multiply two smaller numbers (between 1 and 20).
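For what it’s worth, this is a task where grounded feedback is trivially available; one line of Python settles it:

```python
print(53467 * 76935)               # 4113483645, the true product
print(4114576045 - 53467 * 76935)  # 1092400: GPT-4's first answer is off
print(4112858045 - 53467 * 76935)  # -625600: its "corrected" answer is off too
```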
Yann LeCun’s Gears
Yann LeCun’s gears puzzle (hard mode) has been trending on Twitter.
Stanislav Fort reports an amusing prompt that gets GPT-4 to solve the problem correctly, though this prompt seems somewhat unreliable at always producing the correct answer. As an alternative to prompt engineering, you can also get GPT-4 to arrive at the right answer via self-critique. Ankesh Anand has found a self-critique prompt that can also solve Yann’s puzzle. You don’t need to come up with a clever prefix prompt for the initial solution: you just ask the model to “analyze if the response is correct”.
Connecting the Dots
What I’ve shown here are just a few anecdotal examples, so substantially more measurement on a larger suite of tasks is needed to see whether this actually boosts performance on instruction following. Nonetheless, I’m fairly convinced now that LLMs can critique outputs more effectively than they can generate them, which suggests that we can combine them with search algorithms to further improve LLMs. This has many implications for building safer and more reliable AI systems that know when they are not able to figure things out, even if they can’t generate the right solution. I suppose LLM cloud providers will be happy to know that users may need to double the number of queries to get increased performance.
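As a sketch of what combining critique with search could look like, here is a minimal best-of-n loop in which the same model scores its own samples and the highest-scoring one wins. The `llm` interface and the 0-to-10 scoring prompt are assumptions of mine, not from any of the papers discussed here.

```python
def best_of_n(task, llm, n=5):
    """Sample n candidates from a stochastic `llm` (a hypothetical
    prompt -> text function sampled at nonzero temperature), then reuse
    the same model as a verifier to score and select the best candidate."""
    candidates = [llm(task) for _ in range(n)]

    def score(answer):
        verdict = llm(
            f"Task: {task}\nProposed answer: {answer}\n"
            "On a scale of 0 to 10, how well does the answer satisfy "
            "the task? Reply with a single integer."
        )
        try:
            return int(verdict.strip().split()[0])
        except ValueError:
            return 0  # an unparseable verdict counts as a failure

    return max(candidates, key=score)
```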
Like most algorithmic ideas in probabilistic inference and optimal control, having an agent critique its decisions to make them better is an old idea that has been re-implemented over and over again. Here are some related works:
- There have been quite a few recent papers concurrently exploring the idea of using LLMs to verify samples, but the earliest demonstration of this capability in modern LLMs is the 2022 Anthropic paper “Language Models (Mostly) Know What They Know”. This preceded the line of work that led to their Constitutional AI framework.
- In robotics, model-predictive control algorithms and receding horizon planners perform search on some cost function to refine their initial guess. The “world model” is the critic and the refinement happens via sampling or convex optimization.
- Actor-critic algorithms in Deep RL combine function approximation (the actor) with sampling the critic to further refine the action proposal. This can happen at training time or inference time.
- AlphaGo’s use of Monte Carlo Tree Search on the predicted value function can be thought of as refining the initial action proposed by the policy network. This is used at both training time and inference time.
- It’s well known in deep RL that learning a reward function or an episodic success detector is far easier than generating the episodic actions needed to succeed at the task.
- The “Let’s think step by step” paper showed that LLMs can be instructed to reason more carefully during generation via Chain-of-Thought (CoT) prompts like “let’s think step by step”. Drawing an analogy to control, we are using the LLM’s logical reasoning primitives like a “logical world model” to generate a trajectory. However, if autoregressive generation makes a mistake, CoT prompting cannot go back and fix the error. The benefit of self-reflection is that the model can identify mistakes (potentially using CoT prompting itself) and correct them by starting over entirely. As context lengths in LLMs increase, I expect self-reflection to become more effective than CoT prompting. If you really squint and stretch your imagination, you can think of reflection as a denoising operator for LLM outputs, akin to diffusion modeling but operating in semantic and logical space.
- OpenAI researchers have investigated using LLMs to help critique LLM answers, though they approach it from the very reasonable angle of automating the workload of human raters. Self-reflection takes it a step further, asking whether the critiques can actually be used to generate a better output without human intervention.
- The recent Constitutional AI paper by Anthropic explores a similar idea of having an LLM revise generated outputs to abide by a set of rules.
The Constitutional AI paper didn’t make sense to me when I first read it; it was hard to wrap my head around the idea that you could have an LLM revise its own outputs, retrain on that data, and somehow improve without extra human or real-world data. It almost feels like a violation of the no-free-lunch theorem, or at the very least prone to the sort of optimization instabilities that batch offline DDPG suffers from. In batch offline DDPG, the critic is learned from a finite dataset, so the actor “reward-hacks” the non-grounded critic and comes up with bogus actions. Making batch offline actor-critic work requires heavily regularizing the critic and actor, and tuning this is so much trouble that I feel it is not worth it in practice.
Perhaps CAI and self-reflection are not a violation of the no-free-lunch theorem, but rather exploit the fact that training a verifier to check whether a solution abides by an instruction is computationally easier than training a generator to produce a solution that abides by said instruction. The NP class of problems falls under this category. Anthropic uses this idea to generate rollouts for retraining their “actor”, analogous to distilling a policy network with the outcome of MCTS at training time. Meanwhile, the Reflexion paper uses the trick to refine answers at test time, which is more akin to AlphaGo using MCTS at test time. As the AlphaGo body of work has so elegantly shown, the roadmap for combining these two approaches into a powerful agent is quite clear.
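The classic illustration of this asymmetry is SAT: checking a candidate assignment takes time linear in the size of the formula, while finding a satisfying assignment is NP-hard in general. A toy checker:

```python
# CNF clauses: each literal is a signed integer (negative means negated).
# Verifying an assignment is a single linear scan over the formula.
def satisfies(clauses, assignment):
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

clauses = [[1, -2], [2, 3], [-1, -3]]  # (x1 v !x2) & (x2 v x3) & (!x1 v !x3)
print(satisfies(clauses, {1: True, 2: True, 3: False}))  # True
```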
What’s new and noteworthy about LLMs, in contrast to prior work on model predictive control, is that the same set of weights can be used for the initial guess AND the critique AND the refinement procedure. Furthermore, the set of tasks an LLM can verify is completely open-ended, which means that it can not only critique its initial guess, it can also critique its own critiques recursively. Permit me the handwavy speculation here, but an LLM performing recursive self-reflection on its self-reflections may be the first glimmers of a bicameral mind. It feels like self-awareness literally pulling itself up by its bootstraps.
A couple of years ago I wrote about how, with the advent of increasing LLM capabilities, perhaps we should rely more on generalization than optimization to generate solutions. However, the fact remains that many problems are computationally easier to verify than to solve. The takeaway for me here is that if you do not have the compute to “just ask” for a solution, perhaps you can settle for “just asking” for verification.
Citation
If you want to cite this blog post, you can use:
@article{jang2023reflection,
  title   = "Can LLMs Critique and Iterate on Their Own Outputs?",
  author  = "Jang, Eric",
  journal = "evjang.com",
  year    = "2023",
  month   = "Mar",
  url     = "https://evjang.com/2023/03/26/self-reflection.html"
}