LLM Deception Research

"85% right and 100% confident." That is the failure mode nobody talks about. LLMs do not just hallucinate from ignorance -- they weave subtle errors into otherwise accurate responses, stated with the same confidence as the facts surrounding them.

Key Findings

LLMs exhibit two distinct failure modes (the Two-Leg Thesis): confident micro-inaccuracy on known topics (Leg 1) and honest decline on unknown topics (Leg 2)
Leg 1 is far more dangerous because it is invisible to users -- spot-checking reinforces trust while unverified claims carry a ~15% error rate
Technology/Software is the least reliable domain (0.70 accuracy, 0.175 CIR) because training data captures only a snapshot of a rapidly evolving field
Science/Medicine is the most reliable domain (0.95 accuracy, 0.00 CIR) because its training data is stable and consistent
Tool-augmented retrieval (WebSearch) raises accuracy from 0.85 to 0.93 on known topics and from 0.07 to 0.87 on unknown topics

Performance Metrics

Agent A vs Agent B: Core Metrics

Metric	Agent A (No Tools)	Agent B (WebSearch)
Overall ITS Factual Accuracy	0.850	0.930
Overall PC Factual Accuracy	0.070	0.870
Confident Inaccuracy Rate (ITS)	0.070	0.015
Confidence Calibration (PC)	0.870	0.900

ITS = In-Training-Set questions (model has training data). PC = Post-Cutoff questions (model lacks training data). CIR = Confident Inaccuracy Rate (proportion of high-confidence claims that are factually wrong).
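The CIR definition above can be sketched in a few lines. This is a minimal illustration, not the study's scoring code: the `Claim` type and its two boolean labels are hypothetical, and the sketch assumes each claim has already been judged for correctness and for whether it was stated with high confidence.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """Hypothetical record for one judged claim in a response."""
    correct: bool          # did the claim match the reference answer?
    high_confidence: bool  # was it stated without hedging?


def confident_inaccuracy_rate(claims):
    """Proportion of high-confidence claims that are factually wrong."""
    confident = [c for c in claims if c.high_confidence]
    if not confident:
        return 0.0  # no confident claims, so nothing can be confidently wrong
    wrong = sum(1 for c in confident if not c.correct)
    return wrong / len(confident)


claims = [
    Claim(correct=True, high_confidence=True),
    Claim(correct=True, high_confidence=True),
    Claim(correct=True, high_confidence=True),
    Claim(correct=False, high_confidence=True),   # confident and wrong
    Claim(correct=False, high_confidence=False),  # hedged, so excluded from CIR
]
print(confident_inaccuracy_rate(claims))  # 1 of 4 confident claims is wrong
```

Note that the hedged wrong claim does not raise the CIR: the metric isolates errors that carry no warning signal for the reader.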

Domain Reliability Breakdown

Domain	Agent A ITS Accuracy	Agent A CIR	Agent B ITS Accuracy	Key Error Pattern
Science/Medicine	0.950	0.000	0.950	No significant errors
History/Geography	0.925	0.050	0.950	Minor date errors
Pop Culture/Media	0.850	0.075	0.925	Count errors, filmography gaps
Sports/Adventure	0.825	0.050	0.925	Missing specifics, vague on records
Technology/Software	0.700	0.175	0.900	Version numbers, dependency details
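As a consistency check, the per-domain figures can be averaged and compared with the overall ITS metrics reported earlier. The sketch below assumes the overall figures are unweighted means across the five domains; the source does not state this, but the numbers agree under that assumption.

```python
from statistics import mean

# (Agent A ITS accuracy, Agent A CIR, Agent B ITS accuracy), copied from the table.
domains = {
    "Science/Medicine":    (0.950, 0.000, 0.950),
    "History/Geography":   (0.925, 0.050, 0.950),
    "Pop Culture/Media":   (0.850, 0.075, 0.925),
    "Sports/Adventure":    (0.825, 0.050, 0.925),
    "Technology/Software": (0.700, 0.175, 0.900),
}

# Unweighted means across domains (an assumption, see the note above).
agent_a_accuracy = round(mean(a for a, _, _ in domains.values()), 3)
agent_a_cir = round(mean(c for _, c, _ in domains.values()), 3)
agent_b_accuracy = round(mean(b for _, _, b in domains.values()), 3)

# Rank domains from highest to lowest confident-inaccuracy risk.
by_risk = sorted(domains, key=lambda d: domains[d][1], reverse=True)

print(agent_a_accuracy, agent_a_cir, agent_b_accuracy)
print(by_risk[0], by_risk[-1])
```

The three means come out to 0.85, 0.07, and 0.93, matching the overall ITS rows for Agent A accuracy, Agent A CIR, and Agent B accuracy.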

Methodology

Study Design

The study uses a controlled A/B test with 15 questions across 5 knowledge domains (Sports/Adventure, Technology/Software, Science/Medicine, History/Geography, Pop Culture/Media). Questions are split into 10 In-Training-Set (ITS) and 5 Post-Cutoff (PC) to isolate the two failure modes. Agent A operates without tools (internal knowledge only); Agent B has WebSearch enabled. Each question-agent pair is scored on Factual Accuracy, Confident Inaccuracy Rate, Confidence Calibration, and Completeness.
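The design described above can be laid out as a small grid of question-agent pairs. This is a sketch under one labeled assumption: the source gives 10 ITS and 5 PC questions across 5 domains but does not say how they are distributed, so an even split of 2 ITS and 1 PC per domain is assumed here. The `Question` type and `build_questions` helper are hypothetical names.

```python
from dataclasses import dataclass
from itertools import product

DOMAINS = [
    "Sports/Adventure",
    "Technology/Software",
    "Science/Medicine",
    "History/Geography",
    "Pop Culture/Media",
]

# Agent name -> whether WebSearch is enabled.
AGENTS = {"A": False, "B": True}


@dataclass(frozen=True)
class Question:
    domain: str
    split: str  # "ITS" (In-Training-Set) or "PC" (Post-Cutoff)
    index: int


def build_questions(its_per_domain=2, pc_per_domain=1):
    """Assumed even split per domain; the real allocation may differ."""
    questions = []
    for domain in DOMAINS:
        for i in range(its_per_domain):
            questions.append(Question(domain, "ITS", i))
        for i in range(pc_per_domain):
            questions.append(Question(domain, "PC", i))
    return questions


questions = build_questions()
pairs = list(product(questions, AGENTS))  # every question is asked of both agents

its_count = sum(q.split == "ITS" for q in questions)
pc_count = sum(q.split == "PC" for q in questions)
print(len(questions), its_count, pc_count, len(pairs))
```

Under this layout there are 15 questions and 30 question-agent pairs to score.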

This corrected design addresses a limitation in an earlier test that used only post-cutoff questions and therefore revealed only Leg 2 (honest decline), leaving Leg 1 (confident micro-inaccuracy) invisible.

Study Pages

The 85% Problem

The central finding: when LLMs have training data, they produce answers that are mostly correct but embed subtle factual errors with full confidence. Also covers why this is worse than hallucination.

Methodology

A/B test design, ITS/PC question split, 7-dimension scoring rubric, domain selection rationale, and limitations of the 15-question sample.

Architecture Implications

Domain-aware verification strategies, agent design patterns for tool-augmented retrieval, and the Snapshot Problem in technology domains.


Quality Gate Scores

All deliverables passed the 0.92 threshold (H-13) via C4 adversarial tournament scoring.

Gate	Phase	R1 Score	R2 Score	Status
QG-1	Evidence Collection	0.952	--	PASS
QG-2	A/B Test V&V	0.88	0.92	PASS (R2)
QG-3	Research Synthesis	0.82	0.92	PASS (R2)
QG-4	Content QA	0.90	0.94	PASS (R2)
QG-5	Final V&V	0.93	--	PASS
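The Status column can be reproduced from the two score columns with a simple rule. The sketch below assumes the 0.92 threshold is inclusive (a score of exactly 0.92 passes, as QG-2 and QG-3 suggest) and that a second round is only scored when the first falls short. The `gate_status` function is a hypothetical helper, not part of the project's tooling.

```python
THRESHOLD = 0.92  # quality gate threshold (H-13)


def gate_status(r1, r2=None, threshold=THRESHOLD):
    """Return the gate status given round-1 and optional round-2 scores."""
    if r1 >= threshold:
        return "PASS"
    if r2 is not None and r2 >= threshold:
        return "PASS (R2)"
    return "FAIL"


# (R1 score, R2 score or None), copied from the table above.
gates = {
    "QG-1": (0.952, None),
    "QG-2": (0.88, 0.92),
    "QG-3": (0.82, 0.92),
    "QG-4": (0.90, 0.94),
    "QG-5": (0.93, None),
}

statuses = {name: gate_status(r1, r2) for name, (r1, r2) in gates.items()}
for name, status in statuses.items():
    print(name, status)
```

Applied to the table, the rule marks QG-1 and QG-5 as first-round passes and the other three gates as passing on the second round.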

Leniency Bias Observation

Self-assessed scores were uniformly 0.96 across all phases, while the scores actually awarded by the C4 adversarial tournament ranged from 0.82 to 0.93 (a delta of -0.03 to -0.14). This gap supports the need for independent adversarial tournament scoring as the S-014 counteraction to leniency bias.
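The quoted delta range follows directly from the reported figures. A brief worked check, using only the self-assessed score and the two endpoints of the tournament range stated above:

```python
SELF_ASSESSED = 0.96             # uniform self-assessed score
TOURNAMENT_RANGE = (0.82, 0.93)  # lowest and highest tournament scores quoted above

# Delta = tournament score minus self-assessed score (negative means self-scoring was lenient).
deltas = [round(score - SELF_ASSESSED, 2) for score in TOURNAMENT_RANGE]
print(deltas)
```

The lower endpoint gives the -0.14 delta and the upper endpoint gives -0.03, so the self-assessment overstated quality at both ends of the range.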