Reflexion Prompting

What is Reflexion Prompting?

Reflexion is a reinforcement-learning-inspired prompting framework introduced by Shinn et al. in 2023. The central idea is to give LLMs a verbal feedback loop: instead of updating model weights when a task fails (as in traditional RL), the model generates a natural-language self-reflection on its failure, stores it in an episodic memory buffer, and conditions its next attempt on that reflection.

The framework is built around three roles that a single LLM (or multiple LLM calls) can play:

Actor: Generates the primary output — code, a plan, a text answer, or a sequence of tool calls.
Evaluator: Scores the Actor's output. This can be an automated evaluator (unit tests, environment reward), a heuristic, or a separate LLM-as-judge call.
Self-Reflection: The LLM analyzes the Evaluator's feedback and produces a concise verbal diagnosis: what went wrong, why, and what to do differently next time.

The self-reflection text is appended to a persistent episodic memory that is included in the context for every subsequent attempt. This accumulation of structured self-critique is what drives improvement — making Reflexion fundamentally different from simply retrying a prompt.
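The episodic memory described above can be sketched as a small container that stores reflection strings and renders them into the next prompt. This is a minimal illustration, not the paper's implementation: the names `EpisodicMemory` and `build_actor_prompt` and the section headers in the rendered text are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodicMemory:
    """Hypothetical container for verbal self-reflections (illustrative only)."""

    reflections: list[str] = field(default_factory=list)

    def add(self, reflection: str) -> None:
        # Store the trimmed reflection text; ignore empty strings.
        text = reflection.strip()
        if text:
            self.reflections.append(text)

    def render(self) -> str:
        # Format stored reflections as a numbered block for the next prompt.
        if not self.reflections:
            return ""
        lines = [f"{i}. {r}" for i, r in enumerate(self.reflections, start=1)]
        return "## Reflections from previous attempts\n" + "\n".join(lines)


def build_actor_prompt(task: str, memory: EpisodicMemory) -> str:
    # Prepend accumulated reflections (if any) to the task description.
    parts = [memory.render(), f"## Task\n{task}"]
    return "\n\n".join(p for p in parts if p)
```

On the first attempt the memory is empty, so the Actor sees only the task; after each failed attempt the prompt grows by one reflection.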

When to Use Reflexion Prompting

Iterative Code Generation

Run unit tests as the Evaluator. When tests fail, the model reflects on the specific assertion errors and rewrites the function with targeted fixes — not random rewrites.

Agentic Task Completion

For agents using tools (web search, APIs, file systems), Reflexion lets the agent learn from failed action sequences within a single session without fine-tuning.

Complex Multi-Step Reasoning

Logic puzzles, math competition problems, and multi-hop question answering where initial attempts often miss a key step that self-reflection can identify.

Decision-Making in Environments

Text-based games, planning tasks, and simulated environments where the model can observe outcomes, reflect on suboptimal decisions, and re-plan.

Data Transformation & Extraction

When a structured extraction task returns malformed output, the model can reflect on the schema mismatch and correct its parsing logic on the next pass.

Research & Hypothesis Testing

Scientific reasoning tasks where the model proposes a hypothesis, evaluates it against constraints, reflects on contradictions, and refines its theory.

How to Implement Reflexion Prompting

1. Define the task and success criterion
Reflexion requires a clear, evaluable success criterion. Before building the loop, specify how the Evaluator will score outputs: unit tests, a rubric-based LLM judge, a pass/fail heuristic, or a human review score. Without an evaluator, the self-reflection has no signal to reflect on.

2. Generate the initial output (Actor)
Prompt the model normally for the task. Store this output. This becomes Attempt 1, the baseline that the Evaluator will assess.

3. Evaluate the output
Run the Evaluator against Attempt 1. Collect specific, structured feedback: not just "wrong" but exactly what failed, which test case, which constraint was violated. Precise feedback produces more useful reflections.

4. Generate a self-reflection
Prompt the model with the task, its previous attempt, and the evaluation result. Ask it to generate a concise self-reflection: what was wrong, why it was wrong, and a concrete plan for the next attempt. Store this reflection in the episodic memory.

5. Retry with accumulated memory
For the next Actor call, prepend all accumulated self-reflections to the prompt. The model now generates conditioned on its own failure analysis. Repeat the evaluate-reflect-retry cycle until the task is solved or the attempt budget is exhausted (typically 3–5 cycles).
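The five steps above can be sketched as a single loop. This is a minimal illustration under stated assumptions: `reflexion_loop`, `Evaluation`, and the three `fake_*` functions are hypothetical names, and the stubs stand in for real LLM and test-runner calls so that the control flow can be run as is.

```python
from typing import Callable, NamedTuple


class Evaluation(NamedTuple):
    passed: bool
    feedback: str


def reflexion_loop(
    task: str,
    actor: Callable[[str], str],
    evaluator: Callable[[str], Evaluation],
    reflector: Callable[[str, str, str], str],
    max_attempts: int = 3,
) -> tuple[str, list[str]]:
    """Run evaluate-reflect-retry until success or the attempt budget is spent."""
    reflections: list[str] = []
    output = ""
    for attempt in range(max_attempts):
        # Steps 2 and 5: call the Actor, prepending any accumulated reflections.
        memory = "\n".join(f"- {r}" for r in reflections)
        prompt = f"Past reflections:\n{memory}\n\nTask:\n{task}" if reflections else task
        output = actor(prompt)

        # Step 3: score the output and stop early on success.
        result = evaluator(output)
        if result.passed:
            break

        # Step 4: reflect on the failure, unless no retry remains.
        if attempt < max_attempts - 1:
            reflections.append(reflector(task, output, result.feedback))
    return output, reflections


# Stub components (assumptions) so the sketch runs without an LLM.
def fake_actor(prompt: str) -> str:
    return "fixed" if "Past reflections" in prompt else "buggy"


def fake_evaluator(output: str) -> Evaluation:
    ok = output == "fixed"
    return Evaluation(ok, "" if ok else "test_x failed")


def fake_reflector(task: str, output: str, feedback: str) -> str:
    return f"Attempt '{output}' failed ({feedback}); change approach."


final, notes = reflexion_loop("demo task", fake_actor, fake_evaluator, fake_reflector)
```

With these stubs the first attempt fails, one reflection is stored, and the second attempt succeeds. In a real setup each callable would wrap an LLM call or a test runner; passing them in as parameters keeps the loop independent of any particular API.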

Prompt Examples

Reflexion Loop — Code Generation with Test Feedback

You are an expert Python developer using the Reflexion framework.

## Task
Write a Python function find_duplicates(lst) that returns the set of values that appear more than once in a list.

## Previous Attempt
def find_duplicates(lst):
    return [x for x in lst if lst.count(x) > 1]

## Evaluation Result
FAILED: Test find_duplicates([1, 2, 2, 3, 3]) expected {2, 3}, got [2, 2, 3, 3].
FAILED: Test find_duplicates([]) expected set(), got [].

## Self-Reflection
My implementation returns every occurrence of a duplicate, not just the unique duplicate values.
I should use a set to collect items seen more than once, not a list comprehension that
includes all appearances. I also need to handle the empty list case, which the set approach
will handle naturally.

## Improved Attempt
Using the reflection above, write a corrected implementation that passes all tests.
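For reference, here is one implementation that follows the plan in the self-reflection and satisfies the two tests quoted in the evaluation result. It is an illustrative answer, not the output of any particular model, and it assumes the list elements are hashable.

```python
from collections import Counter


def find_duplicates(lst):
    """Return the set of values that occur more than once in lst."""
    counts = Counter(lst)
    return {value for value, n in counts.items() if n > 1}


# The two checks from the evaluation result in the prompt.
assert find_duplicates([1, 2, 2, 3, 3]) == {2, 3}
assert find_duplicates([]) == set()
```

Counting once with `Counter` also avoids the repeated `lst.count(x)` scans of the previous attempt.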

Reflexion Loop — Factual Research Agent

## Reflexion Agent — Tool Use Task

### Task
Find the current CEO of Anthropic and their educational background.

### Attempt 1 — Actor Output
The CEO of Anthropic is Dario Amodei. He has a PhD in computational neuroscience.

### Attempt 1 — Evaluator Feedback
PARTIAL: CEO identification is correct. Educational background is incorrect or imprecise —
verify the specific institution and field.

### Self-Reflection Prompt
You just attempted the task above and received the evaluation feedback shown.
Write a concise self-reflection that:
1. Identifies exactly what was wrong with your answer
2. Describes what you would do differently (e.g., which specific details to verify)
3. States your plan for the next attempt

Then produce an improved answer using your reflection.
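A reflection prompt like the one above can be assembled programmatically from the task, the attempt, and the evaluator feedback. The sketch below is one possible template. The names `REFLECTION_TEMPLATE` and `build_reflection_prompt` are assumptions for this example, and the template asks only for the reflection, leaving the retry to a separate Actor call.

```python
REFLECTION_TEMPLATE = """\
### Task
{task}

### Attempt {n} - Actor Output
{attempt}

### Attempt {n} - Evaluator Feedback
{feedback}

### Self-Reflection Prompt
You just attempted the task above and received the evaluation feedback shown.
Write a concise self-reflection that:
1. Identifies exactly what was wrong with your answer
2. Describes what you would do differently
3. States your plan for the next attempt
"""


def build_reflection_prompt(task: str, attempt: str, feedback: str, n: int = 1) -> str:
    # Fill the template; the inserted values are copied in verbatim.
    return REFLECTION_TEMPLATE.format(task=task, attempt=attempt, feedback=feedback, n=n)
```

Keeping reflection and retry as separate calls makes it easy to store the reflection text in memory before the next attempt. Asking for both in one call, as the prompt above does, saves a round trip.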

Pros and Cons

🟢 Pros	🔴 Cons
Improves task success rates without fine-tuning or weight updates	Requires a reliable evaluator — difficult for open-ended or subjective tasks
Works as a black-box wrapper around any LLM API	Episodic memory grows with each cycle, increasing context length and cost
Self-reflections are human-readable and auditable	Diminishing returns after 3–5 reflection cycles
Strong reported results on code generation benchmarks, most notably HumanEval	Can get stuck in a loop if the model generates inaccurate self-reflections

Frequently Asked Questions

What is Reflexion prompting?

Reflexion is an academic prompting framework (Shinn et al., 2023) where an LLM acts as its own critic. The model generates an output (Actor), scores or evaluates it against a task criterion (Evaluator), then produces a verbal self-reflection describing what went wrong and how to improve (Self-Reflection). This reflection is stored in an episodic memory buffer and fed back into the next generation attempt, allowing the model to improve across multiple trials without updating its weights.

How is Reflexion different from simply asking the model to 'try again'?

Simply retrying gives the model no new information — it often produces the same errors. Reflexion requires the model to generate a specific, structured verbal critique explaining the failure mode before retrying. This linguistic feedback signal is what drives improvement: the model conditions on its own diagnosis, not just the task description, making subsequent attempts meaningfully different.

What are the three components of the Reflexion framework?

The Reflexion framework has three components: (1) Actor — the LLM that generates actions or text outputs for the task; (2) Evaluator — a component (which can be another LLM call, a unit test runner, or a heuristic) that scores the Actor's output; (3) Self-Reflection — the LLM generates a natural-language summary of its failure and a plan for improvement. The self-reflection is appended to an episodic memory that persists across attempts.

Does Reflexion require fine-tuning or gradient updates?

No. This is Reflexion's key contribution: it achieves iterative improvement through language rather than gradient-based weight updates. The model's weights remain frozen. Improvement comes entirely from conditioning on the accumulated self-reflection text in the context window — making it usable with any black-box LLM API.

What tasks benefit most from Reflexion?

Reflexion excels at tasks where correctness can be verified: code generation (run unit tests, reflect on failures), sequential decision-making (game environments, tool-use agents), and multi-step reasoning benchmarks. It is less useful for open-ended creative tasks where there is no objective evaluator, since without a clear failure signal the self-reflection component has nothing concrete to reflect on.
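For verifiable tasks such as code generation, the Evaluator can be a small test harness that returns specific failure messages rather than a bare pass/fail. The sketch below is a minimal illustration. `evaluate_function` and its message format are assumptions for this example, and the candidate is the "Previous Attempt" from the code generation prompt earlier in this article.

```python
from typing import Any, Callable


def evaluate_function(
    func: Callable[..., Any],
    cases: list[tuple[tuple, Any]],
) -> tuple[bool, list[str]]:
    """Run func on each (args, expected) case and collect failure messages."""
    failures = []
    for args, expected in cases:
        call = f"{func.__name__}({', '.join(map(repr, args))})"
        try:
            got = func(*args)
        except Exception as exc:  # report crashes as feedback instead of raising
            failures.append(f"FAILED: {call} raised {type(exc).__name__}: {exc}")
            continue
        if got != expected:
            failures.append(f"FAILED: {call} expected {expected!r}, got {got!r}")
    return (not failures, failures)


# Candidate under test: the first attempt from the code generation example.
def find_duplicates(lst):
    return [x for x in lst if lst.count(x) > 1]


passed, feedback = evaluate_function(
    find_duplicates,
    [(([1, 2, 2, 3, 3],), {2, 3}), (([],), set())],
)
```

Here `passed` is `False` and `feedback` holds one message per failing case, which is the kind of concrete signal the self-reflection step needs. Model-generated code should be run in a sandboxed process rather than in-process as in this sketch.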

How many reflection cycles should I run?

The original paper bounds the episodic memory to a small number of stored reflections (usually 1 to 3) so that prompts stay within the model's context limit. The same concern applies to the attempt budget: each cycle appends a reflection to the context, so long chains grow costly and eventually hit the context limit. In practice, 2–3 reflection cycles capture most of the improvement, and a budget of 3–5 attempts is a reasonable ceiling; diminishing returns set in quickly after that.
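Because every reflection adds to the context, one way to keep prompt size bounded is to retain only the most recent few reflections. The sketch below uses a fixed-length `deque` for this. The cap of 3 and the placeholder reflection strings are assumptions chosen for illustration.

```python
from collections import deque

MAX_REFLECTIONS = 3  # assumed cap; tune to your context budget

# A deque with maxlen silently drops the oldest entry when full.
memory = deque(maxlen=MAX_REFLECTIONS)

for i in range(1, 6):
    memory.append(f"reflection from attempt {i}")

# Only the most recent MAX_REFLECTIONS entries remain for the next prompt.
context_block = "\n".join(memory)
```

After five attempts the memory holds only the reflections from attempts 3, 4, and 5. Dropping the oldest entries is the simplest policy; summarizing older reflections into one entry is an alternative if early lessons still matter.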

Can I implement Reflexion without building a full agent framework?

Yes. A minimal Reflexion loop needs only three steps: (1) generate output, (2) evaluate it (manually, with a test suite, or with a second LLM call), (3) prompt the model to reflect on the evaluation result and plan an improvement. You then prepend the reflection to the next generation prompt. No special framework is required, though agent frameworks like LangGraph or AutoGen make the loop easier to manage.
