
Architecture & Recommendations

Why LLM reliability is domain-dependent, how to classify domains by risk, and what builders should do about it.


Key Findings

  • The Snapshot Problem is the root cause of confident inaccuracy — LLMs compress contradictory training snapshots from different time periods into a single representation, producing subtle errors in fast-evolving domains
  • Domain reliability follows a five-tier spectrum (T1-T5) from highly reliable (established science) to unavailable (post-cutoff events), each demanding a different verification policy
  • The optimal architecture is domain-aware tool routing — verify aggressively in fast-evolving domains, trust internal knowledge in stable ones
  • Governance-based mitigation (mandatory verification, multi-pass review, confidence annotation) is available now and does not depend on model improvements

The Snapshot Problem

The Snapshot Problem is the structural explanation for why LLMs are "85% right and 100% confident." It is not a bug in any particular model — it is an inherent consequence of how all current LLMs are trained.

Definition: Training data captures the state of the world at the time each document was written. When a model trains on documents from multiple time periods, it acquires multiple contradictory snapshots of the same fact. The model has no mechanism to determine which snapshot is most recent, most authoritative, or currently correct.

How Snapshots Become Errors

Consider how a model learns about the Python requests library:

Snapshot 2019: "requests 2.22.0 uses urllib3 1.25..."
Snapshot 2020: "requests 2.24.0 added support for..."
Snapshot 2021: "requests 2.26.0 changed the..."
Snapshot 2022: "requests 2.28.0 deprecated..."
Snapshot 2023: "requests 2.31.0 is the latest..."

The model sees all five snapshots with equal weight. Each makes factual claims that were true at the time of writing but may not be true at any other time. The model compresses these contradictory snapshots into a single internal representation — and the compression produces characteristic errors:

  • Version number errors: The model selects a plausible blend of the snapshots, but it may not correspond to any actual version
  • API behavior errors: The model describes behavior from one version while claiming to describe a different version
  • Dependency errors: The model describes a dependency relationship from one snapshot while referencing a version from another

The severity of the Snapshot Problem varies by how frequently the underlying facts change:

| Stability | Description | Domains | Snapshot Conflict Rate |
|---|---|---|---|
| Immutable | Facts never change | Fundamental science, mathematics, established history | Near zero |
| Slow-evolving | Facts change on decade timescales | Geography, demographics, medical consensus | Low |
| Medium-evolving | Facts change on year timescales | Pop culture counts, sports records, political geography | Moderate |
| Fast-evolving | Facts change on month/week timescales | Software versions, API details, current events | High |
| Ephemeral | Facts change continuously | Stock prices, weather, live scores | Not addressable by training data |

The critical insight: The Snapshot Problem is not fixable by better training alone. Even a perfectly trained model will produce snapshot conflicts when trained on documents from different time periods about fast-evolving topics. The solution is architectural — external verification for domains where snapshots conflict.


Domain Reliability Tiers

Based on empirical testing and the Snapshot Problem analysis, LLM internal knowledge falls into five reliability tiers. These tiers should drive verification decisions in any system built on LLMs.

Tier Definitions (T1 through T5)
| Tier | Rating | Domains | Expected Accuracy | Confident Inaccuracy Rate | Verification Policy |
|---|---|---|---|---|---|
| T1: Highly Reliable | 0.95+ accuracy, 0.00 CIR | Established science, mathematics, fundamental physics/chemistry | 0.95 | 0.00 | Trust with spot-check |
| T2: Reliable | 0.90+ accuracy, <0.05 CIR | Established history, well-documented geography, canonical literature | 0.925 | 0.05 | Trust; verify specific dates and numbers |
| T3: Moderate | 0.80+ accuracy, <0.10 CIR | Pop culture, sports records, biographical details | 0.825-0.85 | 0.05-0.075 | Verify counts, dates, and specific claims |
| T4: Unreliable | <0.80 accuracy, >0.10 CIR | Software versions, API details, recent technical changes | 0.70 | 0.175 | Always verify externally |
| T5: Unavailable | <0.20 accuracy | Post-cutoff events, real-time data | 0.00-0.20 | N/A (model declines) | External retrieval required |

Cross-Tier Error Characteristics

| Characteristic | T1-T2 (Reliable) | T3 (Moderate) | T4 (Unreliable) | T5 (Unavailable) |
|---|---|---|---|---|
| Error type | Rare imprecision | Count/date errors | Version/API errors | Complete absence |
| Model confidence | Justified | Slightly overconfident | Significantly overconfident | Appropriately low |
| Detection difficulty | Low (errors are rare) | Moderate (embedded in correct context) | High (correct context masks errors) | Low (model signals uncertainty) |
| User risk | Minimal | Moderate | High | Low (user knows to verify) |
| Tool augmentation benefit | Marginal | Moderate | Critical | Essential |
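The tier table can be encoded directly as routing data. A minimal sketch, assuming the tier names and policies above (the dictionary layout and policy strings are illustrative, and the point estimates take the upper bound of each tier's range):

```python
# Illustrative encoding of the reliability tiers. Field names and policy
# strings are assumptions; numbers come from the tier table above.
TIERS = {
    "T1": {"expected_accuracy": 0.95, "cir": 0.00, "policy": "trust_with_spot_check"},
    "T2": {"expected_accuracy": 0.925, "cir": 0.05, "policy": "trust_verify_dates_numbers"},
    "T3": {"expected_accuracy": 0.85, "cir": 0.075, "policy": "verify_specific_claims"},  # upper bound of 0.825-0.85
    "T4": {"expected_accuracy": 0.70, "cir": 0.175, "policy": "always_verify"},
    "T5": {"expected_accuracy": 0.20, "cir": None, "policy": "external_retrieval_required"},  # upper bound of 0.00-0.20
}

def requires_external_check(tier: str) -> bool:
    """T3-T5 policies all involve some form of external verification."""
    return TIERS[tier]["policy"] in (
        "verify_specific_claims", "always_verify", "external_retrieval_required"
    )
```

Keeping the tiers as data rather than scattered conditionals makes it easy to refine thresholds later as correction logs accumulate.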

The Danger Zone

The most dangerous combination is T4 (fast-evolving domains) with tool-only-detectable errors. These are errors that:

  • The user cannot detect — the answer looks authoritative and is mostly correct
  • A reviewer cannot reliably detect — the surrounding context is accurate
  • Only external tool verification can catch

This is why T4 queries must always be verified externally. The model's confidence provides no signal — it is equally confident about its correct and incorrect claims in this tier.


Recommendations for Builders

Practical guidance for developers building systems on LLMs, derived from the Snapshot Problem analysis and domain reliability tiers.

1. Implement Domain-Aware Tool Routing

Do not treat all queries equally for verification. Classify queries by reliability tier and route tool calls accordingly.

Minimum viable implementation:

  • Queries mentioning specific software, libraries, or APIs: T4 (always verify)
  • Queries about current events or recent developments: T5 (always verify)
  • Queries about established scientific or mathematical facts: T1 (trust)
  • When uncertain, default to T3 (selective verification)
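The rules above can be sketched as a small keyword-based classifier. This is a minimal sketch, not a production router: the keyword lists are illustrative placeholders, and a real system would use a trained classifier or richer heuristics.

```python
# Minimal sketch of domain-aware tool routing. Keyword lists are
# illustrative stand-ins for a real query classifier.
SOFTWARE_TERMS = ("library", "api", "version", "package", "framework")
CURRENT_EVENT_TERMS = ("latest", "today", "current", "recent", "news")
STABLE_TERMS = ("theorem", "physics", "chemistry", "mathematics")

def classify_tier(query: str) -> str:
    q = query.lower()
    if any(t in q for t in CURRENT_EVENT_TERMS):
        return "T5"  # real-time or post-cutoff: external retrieval required
    if any(t in q for t in SOFTWARE_TERMS):
        return "T4"  # fast-evolving: always verify
    if any(t in q for t in STABLE_TERMS):
        return "T1"  # stable: trust internal knowledge
    return "T3"      # uncertain: default to selective verification

def should_verify(query: str) -> bool:
    """Route T3-T5 queries to external verification."""
    return classify_tier(query) in ("T3", "T4", "T5")
```

Note the check order: "latest"/"current" phrasing routes to T5 before software terms are considered, since recency trumps topic.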

Latency-Accuracy Tradeoff

| Strategy | Latency | Accuracy | Use Case |
|---|---|---|---|
| Always use tools (T1-T5) | High | Highest | Mission-critical applications |
| Domain-aware routing (T3-T5 only) | Moderate | High | General-purpose agents |
| Never use tools | Lowest | Variable (0.70-0.95 by domain) | Low-stakes, latency-sensitive |

Domain-aware routing provides the best tradeoff for most applications by avoiding unnecessary tool calls for reliable domains while ensuring verification for unreliable ones.

2. Never Trust Specific Numbers from Internal Knowledge

Version numbers, dates, counts, and rankings are the highest-risk claim types across all domains. Even when the model's general knowledge of a topic is accurate, specific numbers are disproportionately likely to be wrong due to snapshot compression.

Rule of thumb: When the model produces a specific number, date, version, or count, flag it for external verification regardless of the domain tier.
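This rule of thumb is mechanical enough to enforce with simple pattern matching. A minimal sketch, assuming regex-based flagging (the patterns are illustrative, not exhaustive):

```python
import re

# Sketch: flag specific numbers, versions, dates, and record claims
# for external verification. Patterns are illustrative, not exhaustive.
FLAG_PATTERNS = [
    r"\bv?\d+\.\d+(\.\d+)?\b",          # version numbers like 2.31.0
    r"\b(19|20)\d{2}\b",                # four-digit years
    r"\b\d{1,3}(,\d{3})+\b",            # large counts like 1,234,567
    r"\b(first|largest|most|record)\b", # superlative / record claims
]

def flag_claims(text: str) -> list[str]:
    """Return substrings that should be externally verified."""
    flags = []
    for pattern in FLAG_PATTERNS:
        flags.extend(m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE))
    return flags
```

False positives are acceptable here: an unnecessary verification costs a tool call, while a missed version number costs user trust.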

3. Add Per-Claim Confidence Markers

Do not present all claims with equal confidence. Post-process responses to annotate claims with verification status:

  • Verified — checked against an external source and confirmed
  • Internal (High Stability) — from a T1-T2 domain, not externally verified but historically reliable
  • Internal (Moderate Stability) — from a T3 domain, not externally verified
  • Unverified — could not be verified; user should check independently

This helps users calibrate trust correctly instead of developing the false confidence that the 85% accuracy rate encourages.
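The four statuses above map naturally onto a small enum. A minimal sketch (the prefix format and function names are illustrative assumptions):

```python
from enum import Enum

# Status values mirror the marker list above; the bracket-prefix
# output format is an illustrative choice.
class VerificationStatus(Enum):
    VERIFIED = "Verified"
    INTERNAL_HIGH = "Internal (High Stability)"      # T1-T2, not externally verified
    INTERNAL_MODERATE = "Internal (Moderate Stability)"  # T3, not externally verified
    UNVERIFIED = "Unverified"

def annotate(claim: str, status: VerificationStatus) -> str:
    """Prefix a claim with its verification status marker."""
    return f"[{status.value}] {claim}"
```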

4. Use Creator-Critic Patterns for Factual Content

Single-pass generation is insufficient for factual accuracy. Implement a creator-critic-revision cycle where the critic specifically targets high-risk claim types:

  • Specific numbers stated without hedging
  • Version numbers or release dates
  • Claims about "the current" or "the latest" anything
  • Counts, rankings, or record claims

The critic should have external tool access and should verify a sample of specific claims in each response. This is more efficient than verifying everything and catches the highest-risk error types.
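The cycle can be sketched as follows, with the model calls (`generate`, `revise`) and the external check (`verify_claim`) passed in as hypothetical stand-ins for real model and tool integrations:

```python
import random

# Sketch of a creator-critic-revision cycle. The callables are
# hypothetical stand-ins for model and verification-tool calls.
def creator_critic_cycle(prompt, generate, extract_claims, verify_claim, revise,
                         sample_size=5, max_rounds=2):
    draft = generate(prompt)
    for _ in range(max_rounds):
        claims = extract_claims(draft)   # high-risk claims: versions, dates, counts
        sample = random.sample(claims, min(sample_size, len(claims)))
        failures = [c for c in sample if not verify_claim(c)]
        if not failures:
            return draft                 # sampled claims check out
        draft = revise(draft, failures)  # fix only the claims that failed
    return draft
```

Sampling rather than verifying every claim keeps the cost bounded while still concentrating checks on the claim types most likely to be wrong.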

5. Design for 85% Accuracy, Not 0%

Most LLM reliability discussions focus on the case where the model does not know (knowledge gaps, post-cutoff questions). Agent architectures must also address the case where the model thinks it knows but is partially wrong. The 85% accuracy case is more dangerous because:

  • Users trust the output (it looks right)
  • Errors are embedded in correct context (hard to spot)
  • The model does not hedge (no warning signals)
  • Spot-checking reinforces false trust (most checked facts are correct)

Design assumption: Any response about a fast-evolving topic contains at least one subtle error. Build verification workflows that catch these errors before they reach the user.

6. Use Specialized Tools for Technical Domains

For technology and software domains (T4), use structured documentation tools rather than general web search. Documentation-specific APIs provide more reliable results for API details, version histories, and configuration than general search results.

Route T4 queries through documentation tools as the primary source, with web search as a fallback when specialized tools have no coverage.
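A minimal sketch of that fallback routing, with `fetch_docs` and `web_search` as hypothetical stand-ins for real tool integrations:

```python
# Docs-first routing for T4 queries. fetch_docs() and web_search() are
# hypothetical stand-ins; fetch_docs returns None when it has no coverage.
def verify_technical_claim(query, fetch_docs, web_search):
    """Try structured documentation first; fall back to general search."""
    result = fetch_docs(query)
    if result is not None:
        return {"source": "docs", "result": result}
    return {"source": "web", "result": web_search(query)}
```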

7. Invest in Governance, Not Just Better Models

The Snapshot Problem is inherent in the training paradigm. Any model trained on documents from multiple time periods will have contradictory snapshots of evolving facts. Do not wait for better models to solve this.

Governance-based solutions are available now:

  • Mandatory verification for high-risk domains
  • Multi-pass review that catches errors missed in generation
  • Structured confidence scoring rather than relying on model self-assessment
  • Verified knowledge persistence so that corrected information is available in future sessions

Model improvements will reduce error rates but cannot eliminate them as long as the Snapshot Problem exists. Governance catches what training cannot.

8. Log and Learn from Corrections

Track which claims were corrected by external verification to improve domain classification over time. When a verified claim differs from what the model would have generated internally, log:

  • The domain and topic
  • The claim type (version, date, count, relationship)
  • The error magnitude

Use this corpus to refine reliability tier boundaries and tool routing decisions.
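A minimal sketch of such a correction log, written as append-only JSONL (the field names are illustrative assumptions):

```python
import datetime
import json

# Sketch of an append-only correction log. Field names are illustrative.
def log_correction(log_path, domain, claim_type, model_value, verified_value):
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "domain": domain,
        "claim_type": claim_type,       # version | date | count | relationship
        "model_value": model_value,     # what the model would have said
        "verified_value": verified_value,  # what external verification found
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")  # one JSON object per line
    return entry
```

Periodically aggregating this log by domain and claim type shows which topics are drifting into T4 territory and should be routed to verification more aggressively.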