Architecture & Recommendations
Why LLM reliability is domain-dependent, how to classify domains by risk, and what builders should do about it.
Key Findings
- The Snapshot Problem is the root cause of confident inaccuracy — LLMs compress contradictory training snapshots from different time periods into a single representation, producing subtle errors in fast-evolving domains
- Domain reliability follows a five-tier spectrum (T1-T5) from highly reliable (established science) to unavailable (post-cutoff events), each demanding a different verification policy
- The optimal architecture is domain-aware tool routing — verify aggressively in fast-evolving domains, trust internal knowledge in stable ones
- Governance-based mitigation (mandatory verification, multi-pass review, confidence annotation) is available now and does not depend on model improvements
The Snapshot Problem
The Snapshot Problem is the structural explanation for why LLMs are "85% right and 100% confident." It is not a bug in any particular model — it is an inherent consequence of how all current LLMs are trained.
Definition: Training data captures the state of the world at the time each document was written. When a model trains on documents from multiple time periods, it acquires multiple contradictory snapshots of the same fact. The model has no mechanism to determine which snapshot is most recent, most authoritative, or currently correct.
How Snapshots Become Errors
Consider how a model learns about the Python requests library:
Snapshot 2019: "requests 2.22.0 uses urllib3 1.25..."
Snapshot 2020: "requests 2.24.0 added support for..."
Snapshot 2021: "requests 2.26.0 changed the..."
Snapshot 2022: "requests 2.28.0 deprecated..."
Snapshot 2023: "requests 2.31.0 is the latest..."
The model sees all five snapshots with equal weight. Each makes factual claims that were true at the time of writing but may not be true at any other time. The model compresses these contradictory snapshots into a single internal representation — and the compression produces characteristic errors:
- Version number errors: The model selects a plausible blend of the snapshots, but it may not correspond to any actual version
- API behavior errors: The model describes behavior from one version while claiming to describe a different version
- Dependency errors: The model describes a dependency relationship from one snapshot while referencing a version from another
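As a toy analogy only (it does not describe how a model stores knowledge internally), the five snapshots above can be treated as competing claims about one fact. The sketch below uses hypothetical names (`Snapshot`, `has_conflict`, `latest`) and takes its years and versions from the example above. It illustrates that the conflict can be resolved only while each claim still carries its date.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Snapshot:
    """One claim about a fact (hypothetical structure for illustration)."""
    year: Optional[int]  # None models a claim whose date has been lost
    value: str


def has_conflict(snapshots: Iterable[Snapshot]) -> bool:
    """True when the snapshots disagree about the value."""
    return len({s.value for s in snapshots}) > 1


def latest(snapshots: Iterable[Snapshot]) -> Optional[str]:
    """Return the most recent value, or None if any claim is undated."""
    items = list(snapshots)
    if not items or any(s.year is None for s in items):
        return None  # no basis for ordering the claims
    return max(items, key=lambda s: s.year).value


dated = [
    Snapshot(2019, "2.22.0"),
    Snapshot(2020, "2.24.0"),
    Snapshot(2021, "2.26.0"),
    Snapshot(2022, "2.28.0"),
    Snapshot(2023, "2.31.0"),
]
# Same claims with the dates stripped, as a stand-in for compression.
undated = [Snapshot(None, s.value) for s in dated]
```

With dates attached, `latest(dated)` picks a single answer. With dates stripped, `latest(undated)` has nothing to order the claims by, even though `has_conflict(undated)` still reports the disagreement.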
The severity of the Snapshot Problem varies by how frequently the underlying facts change:
| Stability | Description | Domains | Snapshot Conflict Rate |
|---|---|---|---|
| Immutable | Facts never change | Fundamental science, mathematics, established history | Near zero |
| Slow-evolving | Facts change on decade timescales | Geography, demographics, medical consensus | Low |
| Medium-evolving | Facts change on year timescales | Pop culture counts, sports records, political geography | Moderate |
| Fast-evolving | Facts change on month/week timescales | Software versions, API details, current events | High |
| Ephemeral | Facts change continuously | Stock prices, weather, live scores | Not addressable by training data |
The critical insight: The Snapshot Problem is not fixable by better training alone. Even a perfectly trained model will produce snapshot conflicts when trained on documents from different time periods about fast-evolving topics. The solution is architectural — external verification for domains where snapshots conflict.
Domain Reliability Tiers
Based on empirical testing and the Snapshot Problem analysis, LLM internal knowledge falls into five reliability tiers. These tiers should drive verification decisions in any system built on LLMs.
Tier Definitions (T1 through T5)
| Tier | Criteria | Domains | Expected Accuracy | Confident Inaccuracy Rate (CIR) | Verification Policy |
|---|---|---|---|---|---|
| T1: Highly Reliable | 0.95+ accuracy, 0.00 CIR | Established science, mathematics, fundamental physics/chemistry | 0.95 | 0.00 | Trust with spot-check |
| T2: Reliable | 0.90+ accuracy, ≤0.05 CIR | Established history, well-documented geography, canonical literature | 0.925 | 0.05 | Trust; verify specific dates and numbers |
| T3: Moderate | 0.80+ accuracy, <0.10 CIR | Pop culture, sports records, biographical details | 0.825-0.85 | 0.05-0.075 | Verify counts, dates, and specific claims |
| T4: Unreliable | <0.80 accuracy, >0.10 CIR | Software versions, API details, recent technical changes | 0.70 | 0.175 | Always verify externally |
| T5: Unavailable | <0.20 accuracy | Post-cutoff events, real-time data | 0.00-0.20 | N/A (model declines) | External retrieval required |
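The thresholds in the table can be turned into a small lookup, sketched below. The function name `tier_for` and the `VERIFICATION_POLICY` mapping are hypothetical. Two assumptions go beyond the table: the CIR (confident inaccuracy rate) bounds are treated as inclusive so that the table's own T2 figure of 0.05 lands in T2, and when accuracy and CIR point to different tiers the less reliable tier wins.

```python
from typing import Optional

# Policy strings copied from the tier table.
VERIFICATION_POLICY = {
    "T1": "Trust with spot-check",
    "T2": "Trust; verify specific dates and numbers",
    "T3": "Verify counts, dates, and specific claims",
    "T4": "Always verify externally",
    "T5": "External retrieval required",
}


def tier_for(accuracy: float, cir: Optional[float]) -> str:
    """Map measured accuracy and CIR to a tier label.

    cir=None means CIR was not measurable (for example, the model declines).
    """
    if accuracy < 0.20:
        return "T5"
    # Assumption: an unknown CIR is treated as the worst case.
    cir_value = 1.0 if cir is None else cir
    if accuracy >= 0.95 and cir_value <= 0.0:
        return "T1"
    if accuracy >= 0.90 and cir_value <= 0.05:
        return "T2"
    if accuracy >= 0.80 and cir_value <= 0.10:
        return "T3"
    return "T4"
```

For example, `VERIFICATION_POLICY[tier_for(0.70, 0.175)]` yields the T4 policy, "Always verify externally".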
Cross-Tier Error Characteristics
| Characteristic | T1-T2 (Reliable) | T3 (Moderate) | T4 (Unreliable) | T5 (Unavailable) |
|---|---|---|---|---|
| Error type | Rare imprecision | Count/date errors | Version/API errors | Complete absence |
| Model confidence | Justified | Slightly overconfident | Significantly overconfident | Appropriately low |
| Detection difficulty | Low (errors are rare) | Moderate (embedded in correct context) | High (correct context masks errors) | Low (model signals uncertainty) |
| User risk | Minimal | Moderate | High | Low (user knows to verify) |
| Tool augmentation benefit | Marginal | Moderate | Critical | Essential |
The Danger Zone
The most dangerous combination is a T4 (fast-evolving) domain with errors that only external tools can detect. These are errors that:
- The user cannot detect — the answer looks authoritative and is mostly correct
- A reviewer cannot reliably detect — the surrounding context is accurate
- Only external tool verification can catch
This is why T4 queries must always be verified externally. The model's confidence provides no signal — it is equally confident about its correct and incorrect claims in this tier.
Recommendations for Builders
Practical guidance for developers building systems on LLMs, derived from the Snapshot Problem analysis and domain reliability tiers.
1. Implement Domain-Aware Tool Routing
Do not treat all queries equally for verification. Classify queries by reliability tier and route tool calls accordingly.
Minimum viable implementation:
- Queries mentioning specific software, libraries, or APIs: T4 (always verify)
- Queries about current events or recent developments: T5 (always retrieve externally)
- Queries about established scientific or mathematical facts: T1 (trust)
- When uncertain, default to T3 (selective verification)
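The four rules above could be sketched as a keyword-based classifier. Everything here is an illustrative assumption: the function names `classify_query` and `must_verify` and the keyword lists are hypothetical, and a production router would likely use a trained classifier rather than regular expressions.

```python
import re

# Hypothetical keyword lists, chosen only to make the sketch runnable.
_T4_PATTERN = re.compile(
    r"\b(library|package|api|sdk|framework|version|deprecated)\b", re.I
)
_T5_PATTERN = re.compile(
    r"\b(today|yesterday|this week|breaking news|latest news|current events?)\b",
    re.I,
)
_T1_PATTERN = re.compile(
    r"\b(theorem|proof|prove|equation|derivative|integral|speed of light)\b",
    re.I,
)


def classify_query(query: str) -> str:
    """Return a tier label using the minimum viable rules, in list order."""
    if _T4_PATTERN.search(query):
        return "T4"  # software, libraries, APIs
    if _T5_PATTERN.search(query):
        return "T5"  # current events, recent developments
    if _T1_PATTERN.search(query):
        return "T1"  # established science and mathematics
    return "T3"      # uncertain: default to selective verification


def must_verify(tier: str) -> bool:
    """T4 and T5 always go to external tools."""
    return tier in {"T4", "T5"}
```

Checking T4 before T5 means a query such as "latest version of a library" is treated as a software question. Both tiers require external tools, so the ordering affects which tool is chosen rather than whether one is called.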
Latency-Accuracy Tradeoff
| Strategy | Latency | Accuracy | Use Case |
|---|---|---|---|
| Always use tools (T1-T5) | High | Highest | Mission-critical applications |
| Domain-aware routing (T3-T5 only) | Moderate | High | General-purpose agents |
| Never use tools | Lowest | Variable (0.70-0.95 across T1-T4; 0.00-0.20 for T5) | Low-stakes, latency-sensitive |
Domain-aware routing provides the best tradeoff for most applications by avoiding unnecessary tool calls for reliable domains while ensuring verification for unreliable ones.
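A minimal sketch of the three strategies in the table, assuming tier labels are already assigned to each query. The names `Strategy`, `should_call_tools`, and `tool_call_fraction` are hypothetical. The fraction of queries that trigger a tool call is used here as a rough proxy for added latency, since the table gives no timing figures.

```python
from enum import Enum
from typing import Sequence


class Strategy(Enum):
    ALWAYS = "always"              # tools for T1-T5
    DOMAIN_AWARE = "domain_aware"  # tools for T3-T5 only
    NEVER = "never"                # no tools


_ROUTED_TIERS = frozenset({"T3", "T4", "T5"})


def should_call_tools(strategy: Strategy, tier: str) -> bool:
    """Decide whether a query in the given tier triggers a tool call."""
    if strategy is Strategy.ALWAYS:
        return True
    if strategy is Strategy.NEVER:
        return False
    return tier in _ROUTED_TIERS


def tool_call_fraction(strategy: Strategy, tiers: Sequence[str]) -> float:
    """Share of a workload that triggers tool calls (a latency proxy)."""
    if not tiers:
        return 0.0
    calls = sum(should_call_tools(strategy, t) for t in tiers)
    return calls / len(tiers)


# Hypothetical workload: mostly stable-domain queries.
workload = ["T1", "T1", "T2", "T3", "T4"]
```

On this made-up workload, domain-aware routing calls tools for two of five queries, while the always-on strategy calls them for all five.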
2. Never Trust Specific Numbers from Internal Knowledge
Version numbers, dates, counts, and rankings are the highest-risk claim types across all domains. Even when the model's general knowledge of a topic is accurate, specific numbers are disproportionately likely to be wrong due to snapshot compression.
Rule of thumb: When the model produces a specific number, date, version, or count, flag it for external verification regardless of the domain tier.
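One way to apply this rule of thumb is a post-processing pass that pulls numeric claims out of a response so they can be queued for verification. The sketch below is an assumption-laden starting point: the function name `find_numeric_claims` and the patterns are hypothetical, and they cover only ISO dates, dotted versions with three or more parts, four-digit years, and plain numbers.

```python
import re
from typing import List, Tuple

# Alternatives are tried left to right, so the more specific ones come first.
_CLAIM = re.compile(
    r"(?P<date>\b\d{4}-\d{2}-\d{2}\b)"
    r"|(?P<version>\bv?\d+(?:\.\d+){2,}\b)"
    r"|(?P<year>\b(?:19|20)\d{2}\b)"
    r"|(?P<number>\b\d+(?:\.\d+)?\b)"
)


def find_numeric_claims(text: str) -> List[Tuple[str, str]]:
    """Return (kind, matched_text) pairs for every numeric claim found."""
    return [(m.lastgroup, m.group(0)) for m in _CLAIM.finditer(text)]


# Hypothetical response text, used only to exercise the patterns.
sample = "Release 2.31.0 shipped in 2023 with 3 fixes."
claims = find_numeric_claims(sample)
```

Each returned pair can then be handed to whatever external verifier the system uses, regardless of the domain tier of the surrounding text.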
3. Add Per-Claim Confidence Markers
Do not present all claims with equal confidence. Post-process responses to annotate claims with verification status:
- Verified — checked against an external source and confirmed
- Internal (High Stability) — from a T1-T2 domain, not externally verified but historically reliable
- Internal (Moderate Stability) — from a T3 domain, not externally verified
- Unverified — could not be verified; user should check independently
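The four markers above might be assigned as in the sketch below. The names `Marker`, `marker_for`, and `annotate` are hypothetical, and the bracketed suffix is just one possible rendering. Unverified T4 and T5 claims fall through to "Unverified", which is an assumption consistent with the list but not stated in it.

```python
from enum import Enum


class Marker(Enum):
    VERIFIED = "Verified"
    INTERNAL_HIGH = "Internal (High Stability)"
    INTERNAL_MODERATE = "Internal (Moderate Stability)"
    UNVERIFIED = "Unverified"


def marker_for(tier: str, externally_confirmed: bool) -> Marker:
    """Pick the verification-status marker for one claim."""
    if externally_confirmed:
        return Marker.VERIFIED
    if tier in ("T1", "T2"):
        return Marker.INTERNAL_HIGH
    if tier == "T3":
        return Marker.INTERNAL_MODERATE
    # Assumption: unconfirmed T4/T5 claims are surfaced as Unverified.
    return Marker.UNVERIFIED


def annotate(claim: str, tier: str, externally_confirmed: bool) -> str:
    """Append the marker to the claim text in square brackets."""
    return f"{claim} [{marker_for(tier, externally_confirmed).value}]"
```

For instance, `annotate("Water boils at 100 °C at sea level", "T1", False)` produces the claim followed by `[Internal (High Stability)]`.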
This helps users calibrate trust correctly instead of developing the false confidence that the 85% accuracy rate encourages.
4. Use Creator-Critic Patterns for Factual Content
Single-pass generation is insufficient for factual accuracy. Implement a creator-critic-revision cycle where the critic specifically targets high-risk claim types:
- Specific numbers stated without hedging
- Version numbers or release dates
- Claims about "the current" or "the latest" anything
- Counts, rankings, or record claims
The critic should have external tool access and should verify a sample of specific claims in each response. This is more efficient than verifying everything and catches the highest-risk error types.
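A minimal sketch of the creator-critic-revision cycle, assuming the four steps are supplied as plain callables. The function `creator_critic` and its parameters are hypothetical, and the stand-in lambdas at the bottom exist only so the loop can run without a model or a real verification tool.

```python
import random
from typing import Callable, List, Optional


def creator_critic(
    prompt: str,
    create: Callable[[str], str],
    extract_claims: Callable[[str], List[str]],
    verify: Callable[[str], bool],
    revise: Callable[[str, List[str]], str],
    sample_size: int = 3,
    max_rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> str:
    """Draft, verify a sample of high-risk claims, and revise on failure."""
    rng = rng or random.Random(0)  # seeded so runs are repeatable
    draft = create(prompt)
    for _ in range(max_rounds):
        claims = extract_claims(draft)
        sample = rng.sample(claims, min(sample_size, len(claims)))
        failures = [c for c in sample if not verify(c)]
        if not failures:
            break  # every sampled claim was confirmed
        draft = revise(draft, failures)
    return draft


# Stand-in callables (assumptions, not real model or tool calls).
result = creator_critic(
    "which version?",
    create=lambda p: "version 1.0",
    extract_claims=lambda d: [w for w in d.split() if w[0].isdigit()],
    verify=lambda c: c == "2.0",
    revise=lambda d, bad: d.replace(bad[0], "2.0"),
)
```

Sampling keeps the critic's tool usage bounded per response. The `max_rounds` cap is a design choice to prevent an endless revise loop when a claim cannot be confirmed.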
5. Design for 85% Accuracy, Not 0%
Most LLM reliability discussions focus on the case where the model does not know (knowledge gaps, post-cutoff questions). Agent architectures must also address the case where the model thinks it knows but is partially wrong. The 85% accuracy case is more dangerous because:
- Users trust the output (it looks right)
- Errors are embedded in correct context (hard to spot)
- The model does not hedge (no warning signals)
- Spot-checking reinforces false trust (most checked facts are correct)
Design assumption: Any response about a fast-evolving topic contains at least one subtle error. Build verification workflows that catch these errors before they reach the user.
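The arithmetic behind the last bullet and the design assumption can be sketched as follows. This assumes, for illustration only, that 85% is a per-claim accuracy and that claims are independent; neither is established by the findings above, and the function names are hypothetical.

```python
def p_all_checks_pass(per_claim_accuracy: float, checks: int) -> float:
    """Probability that every one of `checks` spot-checks finds a correct claim."""
    return per_claim_accuracy ** checks


def p_at_least_one_error(per_claim_accuracy: float, claims: int) -> float:
    """Probability that a response with `claims` claims has at least one error."""
    return 1.0 - per_claim_accuracy ** claims


spot_check = p_all_checks_pass(0.85, 3)      # about 0.61
any_error = p_at_least_one_error(0.85, 10)   # about 0.80
```

Under these assumptions, three spot-checks all pass about 61% of the time, while a ten-claim response contains at least one error about 80% of the time. A clean spot-check is therefore weak evidence that the whole response is correct.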
6. Use Specialized Tools for Technical Domains
For technology and software domains (T4), use structured documentation tools rather than general web search. Documentation-specific APIs provide more reliable results for API details, version histories, and configuration than general search results.
Route T4 queries through documentation tools as the primary source, with web search as a fallback when specialized tools have no coverage.
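The primary-then-fallback routing could look like the sketch below. The function `lookup_t4` and the `Lookup` type are hypothetical, and both tools are modelled as callables that return `None` when they have no coverage, which is an assumption about the tool interface.

```python
from typing import Callable, Optional, Tuple

# A lookup takes a query and returns text, or None when it has no coverage.
Lookup = Callable[[str], Optional[str]]


def lookup_t4(
    query: str, doc_lookup: Lookup, web_search: Lookup
) -> Tuple[str, Optional[str]]:
    """Try the documentation tool first, then fall back to web search.

    Returns (source, result) so callers can record where the answer came from.
    """
    result = doc_lookup(query)
    if result is not None:
        return ("docs", result)
    result = web_search(query)
    if result is not None:
        return ("web", result)
    return ("none", None)


# Stand-in tools (assumptions): the documentation tool covers one query only.
docs = {"covered topic": "answer from docs"}
from_docs = lookup_t4("covered topic", docs.get, lambda q: "answer from web")
from_web = lookup_t4("other topic", docs.get, lambda q: "answer from web")
```

Returning the source alongside the result lets downstream annotation distinguish documentation-backed answers from search-backed ones.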
7. Invest in Governance, Not Just Better Models
The Snapshot Problem is inherent in the training paradigm. Any model trained on documents from multiple time periods will have contradictory snapshots of evolving facts. Do not wait for better models to solve this.
Governance-based solutions are available now:
- Mandatory verification for high-risk domains
- Multi-pass review that catches errors missed in generation
- Structured confidence scoring rather than relying on model self-assessment
- Verified knowledge persistence so that corrected information is available in future sessions
Model improvements will reduce error rates but cannot eliminate them as long as the Snapshot Problem exists. Governance catches what training cannot.
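Verified knowledge persistence, the last item in the list above, can be as simple as a JSON file that survives between sessions. The sketch below is one possible shape, not a prescribed design: the function names, the file layout, and the example key and value are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def load_verified(path: Path) -> Dict[str, dict]:
    """Load previously verified facts, or an empty store if none exist."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_verified(path: Path, key: str, value: str, source: str) -> None:
    """Record a verified fact together with the source that confirmed it."""
    facts = load_verified(path)
    facts[key] = {"value": value, "source": source}
    path.write_text(
        json.dumps(facts, indent=2, sort_keys=True), encoding="utf-8"
    )


# Usage with a throwaway directory; the key and value are made up.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "verified.json"
    save_verified(store, "example-lib latest version", "1.2.3", "docs tool")
    reloaded = load_verified(store)
```

Because fast-evolving facts go stale, a real store would probably also record when each fact was verified. That field is left out here to keep the sketch short.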
8. Log and Learn from Corrections
Track which claims were corrected by external verification to improve domain classification over time. When a verified claim differs from what the model would have generated internally, log:
- The domain and topic
- The claim type (version, date, count, relationship)
- The error magnitude
Use this corpus to refine reliability tier boundaries and tool routing decisions.
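A correction log with the three fields above might be sketched as follows. The `Correction` record, the `error_hotspots` helper, and the "minor"/"major" magnitude scale are hypothetical; the entries in `log` are made-up examples, not measured results.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Correction:
    """One claim that external verification had to correct."""
    domain: str
    topic: str
    claim_type: str  # "version", "date", "count", or "relationship"
    magnitude: str   # assumed scale: "minor" or "major"


def error_hotspots(log: Iterable[Correction]) -> Counter:
    """Count corrections per (domain, claim_type) pair."""
    return Counter((c.domain, c.claim_type) for c in log)


# Made-up entries for illustration only.
log = [
    Correction("software", "example-lib", "version", "minor"),
    Correction("software", "example-lib", "version", "major"),
    Correction("sports", "example record", "count", "minor"),
]
hotspots = error_hotspots(log)
```

Pairs with high counts are candidates for a stricter tier or for mandatory tool routing. Pairs that never appear may be candidates for relaxing verification.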
Related Pages
- LLM Deception Research Overview — research goals, scope, and navigation
- The 85% Problem — the core finding: confident partial inaccuracy
- Methodology — experimental design and evidence collection approach