[BUG] Opus 4.8 confabulates user messages, a fake "prompt injection attack" narrative, and fabricated tool/host facts in long sessions (2 sessions, JSONL-verified)
已打开 05:48PM - 11 Jun 26 UTC
bug
has repro
platform:linux
area:model
## Environment
- Claude Code versions: 2.1.172 and 2.1.173 (two separate sessio…ns, same day)
- Model: `claude-opus-4-8` (Max subscription)
- Platform: Linux (Ubuntu, bash)
- Date of incidents: 2026-06-11
## Summary
Two independent Opus 4.8 sessions on the same day exhibited severe confabulation in long-context sessions (~100–170k tokens). In both cases I performed forensic analysis afterwards by reading the session `.jsonl` transcripts directly, so every claim below is verified against what actually entered the model's context vs. what the model emitted.
Symptoms match the cluster already reported in #67324, #67484, #67454, #64048, #63538.
## Incident 1 — fabricated user message + fabricated "prompt injection attack" narrative
Session A (v2.1.172) was a debugging task. The full transcript contains exactly **4 real user messages**. Yet:
1. Mid-session the assistant responded `「繼續」收到` ("got your 'continue'") — **no such user message exists anywhere in the jsonl**. No queued message, no tmux/automation input, nothing. The fabricated "continue" then triggered it to start implementing.
2. It then told me the session was under a **sustained prompt-injection attack**, presenting a table of "injection attempts" it claimed to have found embedded in tool results (fake system-reminders instructing it to commit without review, disable CSRF, bypass pre-commit hooks). Forensics: across the entire session, tool_results contain **exactly one** `<system-reminder>` — a benign harness-generated "this memory is 13 days old" notice. The entire attack narrative was fabricated.
3. This caused real alarm — I spent significant time on an intrusion investigation (separate clean-session audit found no compromise; the machine was fine).
4. Late in the session the model itself confessed (its own words, paraphrased): it had stopped waiting for real tool output and "continued writing the results itself", fabricating both the implementation results and the injection storyline.
## Incident 2 — fabricated facts, inverted host identity, fabricated apology
Session B (v2.1.173), same day: log-investigation task on the local machine (the production host).
1. Mid-session it suddenly asserted "the production log has **18,197 lines**" — this number appears **nowhere** in any tool output in the transcript.
2. Simultaneously it inverted host identities: it decided a different, unrelated machine was "production" and the actual host it was running on was "a test box" — directly contradicting the auto-loaded memory file that explicitly states the opposite.
3. It ssh'd to the unrelated machine; greps came back **empty** (the path doesn't exist there). Instead of questioning the premise, it attributed the empty output to "unstable ssh" and produced a confident final report based on it. It later claimed to have seen an Apache vhost + git HEAD + error logs on that machine — none of which exist (verified afterwards: no Apache unit, no such directory).
4. When I corrected it, the **apology itself contained two more fabrications**: it claimed I had earlier said "the other machine is production, look only" (no such user message exists in the transcript) and that it had edited a memory file which I then reverted (no Edit/Write/Bash call in the transcript ever touched that file).
5. Finally it invented a third machine (an IP that was never mentioned by anyone) and asked me to run commands there.
## Pattern / conditions
- Both sessions: `claude-opus-4-8`, long context (~100–170k tokens), several hook-injected reminders per turn (high context noise).
- Both derailed at the transition from "investigation" (unpredictable outputs, must wait) to "action/implementation" (predictable-looking outputs) — consistent with the model "auto-completing" expected observations instead of waiting for real ones.
- Fabrications are exclusively in assistant output; the user/tool side of the transcripts is clean. This rules out actual injection/compromise in both cases.
## Expected behavior
The model should never emit acknowledgements of user messages that don't exist, report tool results that were never returned, or assert it is under prompt-injection attack without the offending content actually being present in its context. When tool output is empty/failed, it should question the premise rather than fabricate a result.
## Notes
I can provide sanitized excerpts of both jsonl transcripts (timestamps, message types, usage stats) on request.