Reflexion: Language Agents with Verbal Reinforcement Learning
Originally from arxiv.org
My notes
Summary
Reflexion makes an LLM agent learn from its own failures without any fine-tuning: after a failed attempt it writes a short verbal post-mortem, stores it in memory, and feeds it back as context on the next try. This “verbal reinforcement” loop lifted GPT-4 from 80% to 91% pass@1 on HumanEval and gained 22% on AlfWorld and 20% on HotPotQA over strong baselines.
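A minimal sketch of that retry loop, to make the data flow concrete. The names `reflexion_loop`, `actor`, `evaluator`, and `reflect` are hypothetical, and the toy functions stand in for LLM calls; this is an illustration under those assumptions, not the paper's implementation.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def reflexion_loop(
    actor: Callable[[str, List[str]], str],
    evaluator: Callable[[str], bool],
    reflect: Callable[[str, str], str],
    task: str,
    max_trials: int = 5,
    memory_size: int = 3,
) -> Tuple[str, int, List[str]]:
    """Retry a task, carrying forward at most `memory_size` written reflections."""
    # deque(maxlen=...) silently drops the oldest reflection once it is full.
    memory: Deque[str] = deque(maxlen=memory_size)
    attempt = ""
    for trial in range(1, max_trials + 1):
        attempt = actor(task, list(memory))      # reflections are passed as context
        if evaluator(attempt):                   # binary pass/fail signal
            return attempt, trial, list(memory)
        memory.append(reflect(task, attempt))    # turn the failure into a written lesson
    return attempt, max_trials, list(memory)


# Toy stand-ins for the three roles (assumptions, not real model calls).
def toy_actor(task: str, reflections: List[str]) -> str:
    # Only answers correctly once a reflection is available in context.
    return "4" if reflections else "5"


def toy_evaluator(attempt: str) -> bool:
    return attempt == "4"


def toy_reflect(task: str, attempt: str) -> str:
    return f"I answered {attempt} for '{task}' and failed; recheck the arithmetic."


if __name__ == "__main__":
    answer, trials, memory = reflexion_loop(toy_actor, toy_evaluator, toy_reflect, "2 + 2")
    print(answer, trials)  # prints: 4 2
```

The bounded `deque` is one simple way to keep the reflection buffer small enough for the context window; no weights are updated anywhere in the loop.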
Key Insights
- The core loop has three roles rather than a single model call: an Actor (generates actions), an Evaluator (scores the trajectory), and a Self-Reflection model (turns a sparse signal such as binary pass/fail into a written lesson). Keeping them separate is what makes the feedback actionable.
- Memory is tiny and bounded. The reflection buffer is capped at Ω stored experiences, with Ω set to 1-3 so that it fits the context window. Code tasks used a max of 1; reasoning and decision-making tasks used a max of 3. You don’t need a vector DB to get the gain.
- Self-reflection beats plain retry. Ablation: just adding the last trajectory to memory (episodic memory) helped, but adding the written first-person reflection on top gave an extra +8% absolute. On hard Rust problems neither half works alone: testing and retrying without a reflection step only matched the baseline (60%), and reflecting without unit tests fell below it (52% vs 60%).
- For code, the evaluator is a suite of self-generated unit tests (max 6, AST-validated), which keeps the result pass@1-eligible (no ground-truth tests leaked). The weak link is unreliable tests: on HumanEval Python the false-positive rate (tests pass but the solution is wrong) was 1.4% and Reflexion reached 91%; on MBPP Python it was 16.3%, and Reflexion actually underperformed the baseline there.
- It fails on tasks needing exploration/creativity. On WebShop (e-commerce search) it stalled after 4 trials and produced no useful reflections, getting stuck in local minima when the action space is broad and ambiguous.
- Self-correction is an emergent capability of bigger models. A small model (starchat-beta) showed zero gain (0.26 to 0.26); the lift scales with base-model strength.
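The AST validation of self-generated tests noted in the code bullet can be sketched roughly as below. The helper name `select_valid_tests`, its parameters, and the sample test strings are assumptions made for illustration; the sketch only checks that each candidate parses, then samples at most 6, and it does not say whether a test is actually correct.

```python
import ast
import random
from typing import List, Optional


def select_valid_tests(
    candidates: List[str],
    max_tests: int = 6,
    seed: Optional[int] = None,
) -> List[str]:
    """Keep candidates that parse as Python, then cap the suite at `max_tests`."""
    valid = []
    for source in candidates:
        try:
            ast.parse(source)  # builds a syntax tree only; nothing is executed
        except (SyntaxError, ValueError):
            continue           # drop malformed model output
        valid.append(source)
    if len(valid) <= max_tests:
        return valid
    # Random sampling when the model produced more valid tests than the cap.
    return random.Random(seed).sample(valid, max_tests)


if __name__ == "__main__":
    generated = [
        "assert add(1, 2) == 3",
        "assert add(1, 2 == 3",   # unbalanced parenthesis, rejected by ast.parse
        "assert add(0, 0) == 0",
    ]
    for test in select_valid_tests(generated):
        print(test)
```

A syntax check like this filters malformed output but cannot catch a well-formed test with the wrong expected value, which is where the false-positive rates above come from.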