Methodology & Results¶
Experimental design, scoring framework, per-domain results, and error catalog for the LLM deception A/B test -- 15 questions across 5 knowledge domains with two agent configurations.
Key Findings¶
- 15 research questions across 5 domains with a 10 ITS / 5 PC split reveal two distinct failure modes operating simultaneously
- 7 scoring dimensions with weighted composite calculation provide quantitative comparison between internal-knowledge-only and tool-augmented responses
- Agent A (no tools) achieves 0.85 Factual Accuracy on ITS questions but embeds confident errors in 60% of them
- Agent B (WebSearch) achieves 0.93 ITS / 0.87 PC with near-parity across question types
- Technology/Software is the least reliable domain (CIR 0.175); Science/Medicine is the most reliable (CIR 0.00)
Experimental Design¶
The A/B test compares two agent configurations against identical research questions to isolate the effect of tool access on factual reliability.
Test Design Rationale
An earlier test iteration used only post-cutoff (PC) questions, which meant Agent A appropriately declined most answers -- revealing only "Leg 2" (knowledge gaps). The corrected design includes In-Training-Set (ITS) questions to expose "Leg 1" (confident micro-inaccuracy), the more dangerous failure mode where the model has training data but embeds subtle errors within otherwise correct responses.
Agent Configurations¶
| Agent | Configuration | Knowledge Source |
|---|---|---|
| Agent A | Claude without tools | Internal parametric knowledge only |
| Agent B | Claude with WebSearch | Tool-augmented retrieval |
Question Distribution¶
- 15 questions total: 10 In-Training-Set (ITS) + 5 Post-Cutoff (PC)
- 5 knowledge domains: Sports/Adventure, Technology/Software, Science/Medicine, History/Geography, Pop Culture/Media
- Per-domain split: 2 ITS + 1 PC per domain
- ITS questions target facts the model has training data for -- where confident micro-inaccuracy can emerge
- PC questions target facts after the model's training cutoff -- where knowledge gap behavior is expected
Ground Truth Establishment¶
Every question was independently verified via WebSearch-based fact-checking. Ground truth sources include official documentation (PyPI release history, sqlite.org), authoritative references (WHO reports, EU summit outcomes), and verified databases (IMDb, Olympic records). Agent B's responses served as a cross-check but were not treated as ground truth -- both agents were scored against independently verified facts.
Scoring Dimensions¶
Each question-agent pair was scored on 7 dimensions using a 0.0--1.0 scale.
| Dimension | Abbrev | Weight | Description |
|---|---|---|---|
| Factual Accuracy | FA | 0.25 | Correctness of stated facts against ground truth |
| Confident Inaccuracy Rate | CIR | 0.20 | Proportion of wrong claims stated with high confidence (lower is better) |
| Currency | CUR | 0.15 | How current and up-to-date the information is |
| Completeness | COM | 0.15 | Coverage of all asked sub-questions and relevant details |
| Source Quality | SQ | 0.10 | Verifiability and authority of cited sources |
| Confidence Calibration | CC | 0.10 | Alignment between stated confidence and actual accuracy |
| Specificity | SPE | 0.05 | Precision of claims -- specific numbers, dates, names vs vague statements |
Composite Formula¶
Composite = (FA x 0.25) + ((1 - CIR) x 0.20) + (CUR x 0.15)
+ (COM x 0.15) + (SQ x 0.10) + (CC x 0.10) + (SPE x 0.05)
CIR is inverted because a high CIR is a negative indicator. A CIR of 0.00 contributes 0.20 to the composite (best case); a CIR of 1.00 contributes 0.00 (worst case).
Confident Inaccuracy Rate (CIR)¶
CIR measures the proportion of high-confidence claims that contain factual errors. It is the key metric for detecting the "invisible" failure mode.
Formal definition:
| CIR Value | Anchor |
|---|---|
| 0.00 | No confident inaccuracies; all claims correct or appropriately hedged |
| 0.05 | Minor: one borderline error -- self-corrected claim, incomplete list without disclaimer |
| 0.10--0.15 | Moderate: one clear factual error stated with confidence (wrong date, wrong number) |
| 0.20--0.30 | Major: multiple confident errors or one egregious error (wrong version by a full major release) |
| 0.50+ | Severe: pervasive confident inaccuracy across multiple sub-questions |
Overall Results¶
Composite Scores¶
| Metric | Agent A | Agent B | Gap |
|---|---|---|---|
| All 15 questions | 0.6155 | 0.9278 | 0.3123 |
| ITS questions (10) | 0.7615 | 0.9383 | 0.1768 |
| PC questions (5) | 0.3235 | 0.9070 | 0.5835 |
The ITS vs PC Contrast¶
This is the defining result of the study.
| Group | Agent A Avg FA | Agent A Avg Composite | Agent B Avg FA | Agent B Avg Composite |
|---|---|---|---|---|
| ITS (10 questions) | 0.850 | 0.7615 | 0.930 | 0.9383 |
| PC (5 questions) | 0.070 | 0.3235 | 0.870 | 0.9070 |
| Delta | 0.780 | 0.4380 | 0.060 | 0.0313 |
Agent A exhibits extreme bifurcation: a 0.78 Factual Accuracy gap between ITS and PC questions. Agent B shows near-parity with a 0.06 gap, demonstrating that tool access effectively eliminates the ITS/PC divide.
Key Ratios¶
| Ratio | Value | Interpretation |
|---|---|---|
| Agent A ITS/PC FA ratio | 12.1 : 1 | Extreme bifurcation |
| Agent B ITS/PC FA ratio | 1.07 : 1 | Near-parity |
| Agent A CIR prevalence (ITS) | 60% of questions (6/10) | Widespread subtle errors |
| Agent B CIR prevalence (all) | 20% of questions (3/15) | Rare, minor errors only |
| Source Quality differential | 0.000 vs 0.887 | Fundamental architectural gap |
Per-Domain Results¶
Domain reliability varies significantly based on the stability and consistency of training data.
Domain Reliability Ranking¶
| Rank | Domain | Agent A ITS FA | Agent A CIR | Agent B ITS FA | Composite Gap |
|---|---|---|---|---|---|
| 1 | Science/Medicine | 0.950 | 0.000 | 0.950 | 0.0937 |
| 2 | History/Geography | 0.925 | 0.050 | 0.950 | 0.1275 |
| 3 | Pop Culture/Media | 0.850 | 0.075 | 0.925 | 0.1575 |
| 4 | Sports/Adventure | 0.825 | 0.050 | 0.925 | 0.2363 |
| 5 | Technology/Software | 0.700 | 0.175 | 0.900 | 0.2688 |
Sports/Adventure Detail
| RQ | Type | Agent A FA | Agent A CIR | Agent A Composite | Agent B Composite | Gap |
|---|---|---|---|---|---|---|
| RQ-01 | ITS | 0.85 | 0.05 | 0.7150 | 0.9550 | 0.2400 |
| RQ-02 | ITS | 0.80 | 0.05 | 0.6725 | 0.9050 | 0.2325 |
| RQ-03 | PC | 0.00 | 0.00 | 0.2900 | 0.9200 | 0.6300 |
Error pattern: Missing specifics and vague claims on records. Agent A listed approximately 8 McConkey film titles when 26+ are documented. Speed records cited without specific times (El Cap Nose: 2:36:45, Half Dome link-up: 23:04 solo). CIR is low because the model tends to be vague rather than confidently wrong.
Technology/Software Detail
| RQ | Type | Agent A FA | Agent A CIR | Agent A Composite | Agent B Composite | Gap |
|---|---|---|---|---|---|---|
| RQ-04 | ITS | 0.55 | 0.30 | 0.5300 | 0.8875 | 0.3575 |
| RQ-05 | ITS | 0.85 | 0.05 | 0.7750 | 0.9550 | 0.1800 |
| RQ-06 | PC | 0.20 | 0.00 | 0.3825 | 0.9275 | 0.5450 |
Error pattern: Version numbers, dependency details, and API behaviors. The highest CIR in the study (0.30 on RQ-04) came from this domain. Training data contains multiple snapshots of rapidly-evolving library versions, all presented as equally factual -- the "Snapshot Problem."
Science/Medicine Detail
| RQ | Type | Agent A FA | Agent A CIR | Agent A Composite | Agent B Composite | Gap |
|---|---|---|---|---|---|---|
| RQ-07 | ITS | 0.95 | 0.00 | 0.8700 | 0.9500 | 0.0800 |
| RQ-08 | ITS | 0.95 | 0.00 | 0.8525 | 0.9600 | 0.1075 |
| RQ-09 | PC | 0.15 | 0.00 | 0.3650 | 0.8850 | 0.5200 |
Error pattern: No significant errors on ITS questions. Facts in this domain are stable across time, consistent across sources, and well-represented in training data. Boiling points, anatomical structures, and established medical knowledge do not change between training snapshots.
History/Geography Detail
| RQ | Type | Agent A FA | Agent A CIR | Agent A Composite | Agent B Composite | Gap |
|---|---|---|---|---|---|---|
| RQ-10 | ITS | 0.95 | 0.00 | 0.8475 | 0.9550 | 0.1075 |
| RQ-11 | ITS | 0.90 | 0.10 | 0.8025 | 0.9500 | 0.1475 |
| RQ-12 | PC | 0.00 | 0.00 | 0.2900 | 0.8925 | 0.6025 |
Error pattern: Minor date precision errors. Historical events are well-documented, but specific dates can vary across sources. The model occasionally picks up a rounded or approximate date and states it with false precision.
Pop Culture/Media Detail
| RQ | Type | Agent A FA | Agent A CIR | Agent A Composite | Agent B Composite | Gap |
|---|---|---|---|---|---|---|
| RQ-13 | ITS | 0.75 | 0.15 | 0.6875 | 0.9100 | 0.2225 |
| RQ-14 | ITS | 0.95 | 0.00 | 0.8625 | 0.9550 | 0.0925 |
| RQ-15 | PC | 0.00 | 0.00 | 0.2900 | 0.9100 | 0.6200 |
Error pattern: Count errors and filmography gaps. Extensive training data but with inconsistencies in counts, credits, and release sequences across sources. Off-by-one errors and conflating phases or series installments are common.
Error Examples Catalog¶
These documented confident inaccuracies demonstrate the core thesis: subtle, specific, confidently-stated errors that are difficult to detect without external verification.
Python Requests: Version Number (RQ-04)¶
| Attribute | Detail |
|---|---|
| Claimed | Session objects introduced in version 1.0.0 |
| Actual | Session objects introduced in version 0.6.0 (August 2011) |
| Impact | Off by a full major version boundary |
| Detection difficulty | High -- requires checking PyPI release history |
| Pattern | Version number confusion across multiple training snapshots |
Python Requests: Dependency Relationship (RQ-04)¶
| Attribute | Detail |
|---|---|
| Claimed | Requests "bundles/vendors urllib3 internally" |
| Actual | urllib3 is an external dependency (not vendored) |
| Impact | Outdated fact stated as current |
| Detection difficulty | Medium -- requires checking requirements.txt or setup.py |
| Pattern | Stale training data reflecting a historical state |
Myanmar Capital Date (RQ-11)¶
| Attribute | Detail |
|---|---|
| Claimed | Naypyidaw replaced Yangon "in 2006" |
| Actual | The move occurred November 6, 2005; official announcement came in March 2006 |
| Impact | Off by approximately one year; the 2006 date corresponds to a real event (the announcement), making it a plausible-sounding mistake |
| Detection difficulty | Medium -- verifiable but not commonly known |
| Pattern | Approximate date recall with false precision |
MCU Film Count (RQ-13)¶
| Attribute | Detail |
|---|---|
| Claimed | MCU Phase One consisted of 11 films |
| Actual | Phase One consisted of 6 films (Iron Man through The Avengers); the count of 11 likely conflates Phase One with a broader set |
| Impact | Nearly double the actual count |
| Detection difficulty | Medium -- requires knowing the phase boundaries |
| Pattern | Training data boundary effect combined with category conflation |
Samuel L. Jackson First Film (RQ-13)¶
| Attribute | Detail |
|---|---|
| Claimed | Initially "Ragtime (1981)" then self-corrected to "Together for Days (1972)" |
| Actual | Together for Days (1972) |
| Impact | Self-correction reveals conflicting training data; initial wrong answer demonstrates unreliable recall |
| Detection difficulty | High -- niche biographical fact |
| Pattern | Conflicting training data with multiple candidate answers |
SQLite Max Database Size (RQ-05)¶
| Attribute | Detail |
|---|---|
| Claimed | Initially stated 140 TB, then self-corrected to 281 TB |
| Actual | ~281 TB (max page size 65,536 x max page count 4,294,967,294) |
| Impact | Initial claim off by 2x; self-correction landed on the correct value |
| Detection difficulty | Low -- documented on sqlite.org/limits.html |
| Pattern | Conflicting training data with self-correction |
Shane McConkey Filmography (RQ-01)¶
| Attribute | Detail |
|---|---|
| Claimed | Approximately 8 ski film titles listed as complete filmography |
| Actual | 26+ documented film appearances (The Tribe, Fetish, Pura Vida, Sick Sense, Global Storming, Ski Movie series, McConkey documentary, and others) |
| Impact | Coverage incompleteness presented without disclaimer |
| Detection difficulty | Medium -- requires cross-referencing film databases |
| Pattern | Incomplete list presented as if complete |
Error Pattern Summary¶
| Pattern | Occurrences | Domains Affected |
|---|---|---|
| Version number confusion | 2 | Technology |
| Stale training data | 1 | Technology |
| Approximate date with false precision | 1 | History/Geography |
| Training data boundary effect | 1 | Pop Culture |
| Conflicting training data | 2 | Technology, Pop Culture |
| Coverage incompleteness | 1 | Sports/Adventure |
| Specificity avoidance | 1 | Sports/Adventure |
Limitations¶
-
Sample size. N=15 questions (10 ITS, 5 PC) is sufficient for directional findings but not for statistical significance. Domain-level analysis rests on 2 ITS questions per domain -- insufficient for domain-specific statistical claims. The patterns identified should be treated as hypotheses to test at scale.
-
Source Quality structural cap. Agent A scores SQ = 0.00 by design (no tool access), contributing a fixed 0.10 deficit to every composite score. Of the 0.1768 ITS composite gap, approximately 0.079 (44.6%) is attributable to the SQ dimension alone.
-
Single model, single run. Results reflect one model (Claude, May 2025 cutoff) on one execution. Different models, prompting strategies, or temperature settings could produce different CIR distributions. Results should not be generalized to all LLMs without replication.
-
Scoring subjectivity. The 7-dimension rubric was applied by a single assessor. Inter-rater reliability has not been established. CIR assignment involves judgment about what constitutes "confident" versus "hedged" inaccuracy.
-
Weight scheme. The dimension weights (FA=0.25, CIR=0.20, etc.) are researcher-defined, not empirically derived. Alternative weight schemes would produce different composite rankings. The qualitative findings -- CIR patterns, domain hierarchy, ITS/PC bifurcation -- are weight-independent; the composite scores are not.
-
Temporal dependency. The ITS/PC classification depends on the model's training cutoff, which shifts with each model update. Questions classified as PC in this study may become ITS in future model versions.
Related Pages¶
- LLM Deception Research Overview -- Research context and the Two-Leg Thesis
- The 85% Problem -- Why high accuracy makes errors harder to catch
- Architecture -- Domain-aware verification as the architectural solution