The Error Detective: A Debugging Prompt Flow

What is The Error Detective?

The Error Detective is a three-step debugging flow that applies the scientific method to error diagnosis. It chains three techniques: Chain-of-Thought generates an exhaustive hypothesis space before any fix is attempted, Tree of Thoughts stress-tests each hypothesis against evidence and prunes implausible causes, and Reflexion audits the fix to confirm it addresses the root cause rather than the symptom.

The flow guards against one of the most expensive debugging failures: premature convergence. When you ask an AI to fix a bug directly, it tends to anchor on the first plausible explanation and generate code that fixes the visible symptom while leaving the underlying cause intact. The Error Detective forces the model through hypothesis generation, evidence evaluation, and fix validation, in that order.

When to Use The Error Detective

🐛

Mysterious Production Bugs

Bugs that only appear in production, intermittently, or under specific load conditions where the cause isn't immediately apparent.

🔄

Regression Bugs

Bugs introduced by a recent change where the relationship between the change and the failure isn't obvious.

🗄️

Data Corruption Issues

Data integrity problems where the source of corruption is unclear and multiple systems are potential culprits.

⚡

Performance Degradation

Unexplained slowdowns where the bottleneck could be in database, application code, infrastructure, or their interaction.

🌐

Integration Failures

Third-party API or service integration failures where the fault could be in your code, the API, or the network layer.

🔐

Auth & Session Bugs

Authentication and authorization failures where the security implications require root cause certainty, not just symptom resolution.

The Flow Algorithm

Chain-of-Thought — Exhaust the Hypothesis Space

Present the error with full context and instruct the model to reason through every possible cause systematically. Categorize hypotheses by layer: environment (OS, runtime, configuration), inputs (request payload, user state, data format), application logic (business rules, state management, conditional branches), dependencies (libraries, services, infrastructure), and timing (race conditions, timeouts, clock drift). Explicitly instruct the model not to skip causes that seem unlikely — exhaustiveness is the goal. Do not generate any fixes yet.

Produces:

A numbered list of causal hypotheses organized by layer, with reasoning for why each could cause the observed behavior. Nothing is ruled out yet.
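The Step 1 output can be kept as structured data so that nothing is lost before pruning. The following is a minimal sketch, not part of the flow itself: the `Hypothesis` class, the `render_hypotheses` helper, and the two sample entries are hypothetical names and illustrative values. Only the five layer names come from the step description.

```python
from dataclasses import dataclass

# Layer names taken from the Step 1 description.
LAYERS = ("environment", "inputs", "application logic", "dependencies", "timing")


@dataclass
class Hypothesis:
    layer: str      # must be one of LAYERS
    cause: str      # what might be wrong
    reasoning: str  # why it could produce the observed behavior

    def __post_init__(self):
        # Reject unknown layers so no hypothesis is silently dropped later.
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer: {self.layer!r}")


def render_hypotheses(hypotheses):
    """Return a numbered list grouped by layer; nothing is ruled out here."""
    lines = []
    number = 1
    for layer in LAYERS:
        in_layer = [h for h in hypotheses if h.layer == layer]
        if not in_layer:
            continue
        lines.append(layer.capitalize())
        for h in in_layer:
            lines.append(f"  {number}. {h.cause}: {h.reasoning}")
            number += 1
    return "\n".join(lines)


# Illustrative entries only, not findings from a real investigation.
example = [
    Hypothesis("dependencies", "Session lookup returns nothing",
               "the session object would be undefined"),
    Hypothesis("timing", "Session expires mid-request",
               "would explain intermittent failures"),
]
print(render_hypotheses(example))
```

Numbering runs continuously across layers, so later steps can refer to a hypothesis by number alone.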

Tree of Thoughts — Prune to Root Cause

Take the top three or four hypotheses from Step 1 and branch each into three sub-investigations: (a) evidence that supports this hypothesis: what about the error description, stack trace, or behavior points to this cause? (b) evidence that argues against it: what would we expect to see if this were the cause, but do not? (c) the specific test to confirm or rule it out: not a generic "add logging" but the exact assertion, query, or experiment. Evaluate each branch and prune hypotheses where the against-evidence is strong.

Produces:

A prioritized, evidence-weighted shortlist of testable hypotheses with specific diagnostic tests. The most probable root cause is now identifiable.
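The branch-and-prune step can be sketched as a small data structure. This is an assumption-heavy illustration: the `Branch` class and `prune_and_rank` function are hypothetical, the evidence strings are made up for the example, and "strong against-evidence" is approximated by a simple count, which a real investigation would replace with judgment.

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    hypothesis: str
    evidence_for: list = field(default_factory=list)
    evidence_against: list = field(default_factory=list)
    test: str = ""  # the exact assertion, query, or experiment


def prune_and_rank(branches, max_against=1):
    """Drop branches with strong against-evidence, then rank the rest.

    "Strong" here means more than `max_against` counter-points. That
    threshold is an assumption of this sketch, not a rule from the flow.
    """
    kept = [b for b in branches if len(b.evidence_against) <= max_against]
    # Rank by net evidence count, most supported first.
    return sorted(
        kept,
        key=lambda b: len(b.evidence_for) - len(b.evidence_against),
        reverse=True,
    )


# Illustrative values only.
branches = [
    Branch("Session cache miss",
           ["appeared after the caching change", "failures are intermittent"],
           [],
           "log the session ID on every failing request"),
    Branch("Malformed request payload",
           ["only POST requests fail"],
           ["counter-point 1", "counter-point 2"],
           "replay a captured failing payload"),
    Branch("Missing null check",
           ["error is a read on undefined"],
           ["does not explain why the value is undefined"],
           "assert the session exists before the failing line"),
]

for b in prune_and_rank(branches):
    print(b.hypothesis, "->", b.test)
```

Each surviving branch carries its own diagnostic test, so the ranked list doubles as the order in which to run experiments.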

Reflexion — Validate the Fix

After implementing the fix, apply Reflexion: "Review this fix critically. Does it address the root cause identified in Step 2, or only the symptom? What adjacent code could this change break? Are there edge cases this fix doesn't handle? What tests must be written to verify the fix is complete and hasn't introduced regressions? If the fix only addresses a symptom, what would a root-cause fix look like?" Treat this as a mandatory audit, not an optional review.

Produces:

A root-cause-verified fix with an explicit audit trail, regression risk assessment, and a test checklist. The fix is defensible, not just functional.
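One way to make the audit mandatory rather than optional is to record its answers and block the fix until they are satisfactory. The sketch below is hypothetical throughout: `FixAudit` and its `blockers` method are illustrative names, and the blocking rules are one possible reading of the audit questions.

```python
from dataclasses import dataclass, field


@dataclass
class FixAudit:
    addresses_root_cause: bool
    regression_risks: list = field(default_factory=list)      # adjacent code the change could break
    unhandled_edge_cases: list = field(default_factory=list)  # cases the fix does not cover
    required_tests: list = field(default_factory=list)        # tests that must exist before shipping

    def blockers(self):
        """Return the reasons this fix should not ship yet."""
        problems = []
        if not self.addresses_root_cause:
            problems.append("fix is symptomatic only; describe a root-cause fix")
        if not self.required_tests:
            problems.append("no tests listed to verify the fix")
        if self.unhandled_edge_cases:
            problems.append(f"{len(self.unhandled_edge_cases)} edge case(s) not handled")
        return problems


# A fix that only silences the symptom and has no tests is blocked twice.
print(FixAudit(addresses_root_cause=False).blockers())

# A root-cause fix with a test checklist and no open edge cases passes.
print(FixAudit(addresses_root_cause=True, required_tests=["happy path"]).blockers())
```

Regression risks are recorded but do not block on their own in this sketch; they feed the test checklist instead.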

Example Prompt Sequence

Step 1 — Chain-of-Thought Hypothesis Generation

Think step by step through every possible cause of this error. Do not generate fixes yet.

Error: "TypeError: Cannot read properties of undefined (reading 'user')" in a Node.js Express API
Observed behavior: Error occurs on POST /api/orders, but only for about 3% of requests. The remaining 97% succeed.
Last change: Added Redis-based session caching 48 hours ago
Stack trace: at OrderController.create (src/controllers/orders.js:47)
Environment: Node 18, Express 4.18, Redis 7, deployed on AWS ECS

Generate hypotheses in these categories:
- Environment (configuration, infrastructure, deployment)
- Inputs (request payload, user state, data format)
- Application logic (null checks, state management, conditional branches)
- Dependencies (Redis, session middleware, request lifecycle)
- Timing (race conditions, timeouts, clock drift)

Do not skip hypotheses that seem unlikely. Number each one.

Step 2 — Tree of Thoughts Pruning

Take the top 4 hypotheses from the list below and apply Tree of Thoughts to each:

For each hypothesis, branch into:
(a) Evidence FOR: What in the error description, stack trace, or 3% failure rate points to this cause?
(b) Evidence AGAINST: What would we expect to observe if this were the cause, but do not see in the evidence?
(c) Specific test: The exact diagnostic step (a query, an assertion, a log line) that confirms or rules out this hypothesis.

After evaluating all 4, rank them by likelihood. Prune any with strong against-evidence.

Hypotheses from Step 1: [PASTE TOP 4 HYPOTHESES HERE]

Step 3 — Reflexion Fix Audit

I implemented the following fix for the TypeError issue:

[PASTE YOUR FIX HERE]

Apply Reflexion:
1. Does this fix address the root cause identified in the investigation, or does it only handle the symptom?
2. What adjacent code could this change break? List specific functions or flows.
3. What edge cases does this fix not handle? Be specific.
4. Write a Jest test suite for this fix: (a) the happy path, (b) the edge case that caused the original 3% failure, (c) the regression cases from question 2.
5. If this fix is symptomatic only: describe what a root-cause fix would require.
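The three prompts can be chained in code. The sketch below assumes a hypothetical `ask` callable that sends a prompt to whatever model client you use and returns its reply. The `select_top` and `implement_fix` callables stand in for the human steps between prompts, and the prompt wording is abbreviated from the examples above.

```python
from typing import Callable, Dict


def step1_prompt(error_report: str) -> str:
    return (
        "Think step by step through every possible cause of this error. "
        "Do not generate fixes yet.\n\n"
        f"{error_report}\n\n"
        "Do not skip hypotheses that seem unlikely. Number each one."
    )


def step2_prompt(top_hypotheses: str) -> str:
    return (
        "Apply Tree of Thoughts to each hypothesis below. For each, give "
        "(a) evidence FOR, (b) evidence AGAINST, (c) a specific test. "
        "Then rank them and prune any with strong against-evidence.\n\n"
        f"Hypotheses from Step 1: {top_hypotheses}"
    )


def step3_prompt(fix: str) -> str:
    return (
        "I implemented the following fix:\n\n"
        f"{fix}\n\n"
        "Apply Reflexion: does it address the root cause or only the symptom? "
        "What could it break? Which edge cases and tests are missing?"
    )


def run_flow(
    error_report: str,
    ask: Callable[[str], str],            # hypothetical model call
    select_top: Callable[[str], str],     # human picks the top hypotheses
    implement_fix: Callable[[str], str],  # human writes the fix from the shortlist
) -> Dict[str, str]:
    """Run the steps strictly in order: hypotheses, pruning, fix, audit."""
    hypotheses = ask(step1_prompt(error_report))
    shortlist = ask(step2_prompt(select_top(hypotheses)))
    fix = implement_fix(shortlist)
    audit = ask(step3_prompt(fix))
    return {
        "hypotheses": hypotheses,
        "shortlist": shortlist,
        "fix": fix,
        "audit": audit,
    }
```

Keeping the human steps as explicit callables makes the ordering hard to bypass: no fix text exists until Step 2 has produced a shortlist, and the returned dictionary serves as the investigation log.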

Pros and Cons

Strengths

Prevents premature convergence on the wrong cause
Tree of Thoughts forces evidence-based hypothesis pruning
Reflexion audit prevents symptom-only fixes from shipping
Produces an investigation log as a by-product
Works for any system type — not just code

Trade-offs

High upfront investment for simple, obvious bugs
CoT hypothesis list can be long — requires judgment to filter
Reflexion audit may be overly thorough for low-risk fixes
Does not replace running actual tests and diagnostics

Frequently Asked Questions

What is The Error Detective prompt flow?

The Error Detective chains Chain-of-Thought, Tree of Thoughts, and Reflexion to diagnose and resolve complex errors systematically. CoT generates an exhaustive hypothesis space, Tree of Thoughts stress-tests each hypothesis against evidence, and Reflexion audits the fix to ensure it addresses the root cause rather than the symptom.

Why not just ask the AI to debug the code directly?

A direct 'fix this bug' prompt causes the model to anchor on the first plausible explanation — typically the most obvious surface symptom. The Error Detective forces exhaustive hypothesis generation before any fix is attempted, exactly as a scientist would enumerate hypotheses before testing. This prevents the 'fixed the symptom, not the cause' failure pattern.

When does Tree of Thoughts add value over just ranking CoT hypotheses?

Ranking hypotheses is linear: it produces a priority order but doesn't stress-test each one. Tree of Thoughts branches each hypothesis into evidence for, evidence against, and a specific test. This forces the model to argue against each hypothesis as vigorously as it argues for it, which makes it far more likely that false positives are pruned before you invest in debugging them.

What does Reflexion catch that code review doesn't?

Standard code review asks "does this code look correct?" Reflexion asks "does this fix solve the root cause, what could it break elsewhere, and what tests should validate the repair?" It's a post-fix audit that explicitly questions whether the fix is complete, not just whether the code is syntactically correct.

Can this flow be used for non-code bugs?

Yes. The flow works for any system error: database query failures, infrastructure incidents, process breakdowns, data quality issues, and logical errors in business workflows. The hypothesis layers carry over (environment, inputs, logic, dependencies, and timing apply to any system), and the three-step reasoning structure is domain-agnostic.

What context should I provide in the initial prompt?

For best results, provide: the exact error message or observed behavior, the expected behavior, the last change made before the error appeared, the environment details (OS, runtime version, relevant dependencies), and any previous attempts to fix it (and why they didn't work). More context produces better hypothesis coverage in Step 1.
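A quick completeness check before sending the Step 1 prompt can catch missing context. This is a minimal sketch: the `ErrorContext` class and `missing_context` helper are hypothetical names, and the field names simply mirror the list above. The sample values are taken from the example prompt sequence.

```python
from dataclasses import dataclass, fields


@dataclass
class ErrorContext:
    error_message: str = ""      # exact error message or observed behavior
    expected_behavior: str = ""
    last_change: str = ""        # last change made before the error appeared
    environment: str = ""        # OS, runtime version, relevant dependencies
    previous_attempts: str = ""  # what was tried and why it didn't work


def missing_context(ctx: ErrorContext) -> list:
    """Return the names of fields that are still empty, in declaration order."""
    return [f.name for f in fields(ctx) if not getattr(ctx, f.name).strip()]


ctx = ErrorContext(
    error_message="TypeError: Cannot read properties of undefined (reading 'user')",
    last_change="Added Redis-based session caching 48 hours ago",
    environment="Node 18, Express 4.18, Redis 7, deployed on AWS ECS",
)
print(missing_context(ctx))  # → ['expected_behavior', 'previous_attempts']
```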