Research

Jerry's development is driven by evidence-based research. Every architectural decision, quality framework component, and workflow pattern is backed by structured research artifacts with citations, formal methodologies, and multi-perspective analysis.

This section curates the research produced during Jerry's development. Each page presents key findings inline, with expandable methodology details and links to the full research artifacts on GitHub.


Research Domains

Domain                             Pages   Description
Adversarial Strategies & Quality   3       How we selected and implemented 10 adversarial review strategies from 36 candidates
Agent Architecture                 1       Evidence-based analysis of single vs. multi-agent orchestration
Context Management                 1       Solving context rot in LLM sessions
Skill Architecture                 1       Compliance framework for Claude Code skill patterns
Governance                         1       Constitutional AI governance for LLM agents
OSS Release Methodology            1       FMEA, root cause analysis, and best practices for open-sourcing
Software Architecture              1       Hexagonal, DDD, and CQRS patterns in Python
Claude Code Ecosystem              1       Plugin, skill, and CLI patterns
LLM Reliability                    4       Empirical A/B study of LLM deception patterns and domain reliability tiers
Negative Prompting                 1       Structured negation vs. positive framing in LLM constraint enforcement

Total: 52 research artifacts across 10 domains, ~29,000 lines of documented research. (Full catalog)


Adversarial Strategies & Quality

The adversarial quality framework was built through a rigorous research pipeline: 36 strategies cataloged from academic, industry, and emerging sources, narrowed to 15 through deduplication, then scored and selected to a final 10 using NASA SE trade study methodology.

  • Adversarial Strategy Catalog


    Unified catalog of 15 adversarial review strategies synthesized from academic literature (CIA, DoD, Hegelian dialectics), industry practices, and emerging AI patterns. 117+ citations.

    Strategy Catalog

  • Strategy Selection & Enforcement (ADRs)


    Two architecture decision records: selection of the final 10 strategies with composite scoring, and design of the 5-layer enforcement architecture with token budget feasibility analysis.

    ADRs

  • Adversarial Quality Deep Dives


    Trade studies, risk registers (FMEA), scoring methodology, and the strategy selection decision tree. Supporting research behind the headline decisions.

    Deep Dives


Agent Architecture

  • Single vs. Multi-Agent Orchestration


    Evidence-based analysis drawing on 20 peer-reviewed sources (ACL, ICLR, ICML, EMNLP, NeurIPS). Quantitative findings on when multi-agent helps vs. hurts, with context rot as the primary mechanism.

    Agent Analysis


Context Management

  • Context Management & LLM Performance


    Research on context rot thresholds, CLAUDE.md optimization (50-70% token savings quantified), decomposition patterns, and cross-platform sync strategies.

    Context Management


Skill Architecture

  • Skill Compliance Framework


    2,646-line compliance framework with 117 checkpoints across 8 orchestration patterns. Reusable for any Claude Code skill. Includes 4-phase remediation roadmap.

    Skill Framework


Governance

  • Governance & Constitutional AI


    The Jerry Constitution implements Constitutional AI for LLM agent governance, drawing on Anthropic, OpenAI, and DeepMind prior art. Progressive enforcement from advisory to hard constraints.

    Governance


OSS Release Methodology

  • OSS Preparation & Methodology


    FMEA risk analysis (21 risks scored), 5 Whys root cause analysis, gap analysis, and best practices research from Google, Microsoft, and Apache Foundation sources.

    OSS Methodology


Software Architecture

  • Software Architecture & Patterns


    Python architecture standards (Hexagonal + DDD + CQRS), teaching-edition walkthroughs, and domain-specific playbooks.

    Architecture


Claude Code Ecosystem

  • Claude Code Ecosystem


    Research on Claude Code CLI patterns, plugin architecture and distribution, skill structure and agent design patterns.

    Ecosystem


LLM Reliability

The first empirical research study in Jerry: a controlled A/B test of LLM deception patterns across 5 knowledge domains. The "Two-Leg Thesis" (that LLMs are roughly 85% accurate yet close to 100% confident) has direct implications for any system built on LLM outputs.
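The gap the thesis describes can be stated as simple arithmetic. A minimal, hypothetical illustration (the `confidence_accuracy_gap` helper and the point estimates are illustrative, not taken from the study's data):

```python
# Hypothetical illustration of the "Two-Leg Thesis": a model that answers
# correctly ~85% of the time while expressing near-total confidence.
def confidence_accuracy_gap(accuracy: float, mean_confidence: float) -> float:
    """Positive values indicate overconfidence; zero is perfect calibration."""
    return mean_confidence - accuracy

gap = confidence_accuracy_gap(accuracy=0.85, mean_confidence=1.0)
# gap is ~0.15: the model is overconfident by roughly 15 points
```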

  • LLM Deception Study


    Controlled A/B test across 15 questions and 5 domains reveals systematic confidence-accuracy gaps in LLM outputs. Introduces domain reliability tiers (T1-T5) and the Composite Integrity Ratio metric. Quality-gated at 0.93 average across 5 C4 tournaments.

    Study Overview


Negative Prompting

The second empirical study in Jerry: a controlled A/B test of constraint framing across 270 matched trials on three Claude models. NPT-013 structured negation (NEVER + consequence + alternative) achieved 100% compliance versus 92.2% for positive-only framing (McNemar exact p=0.016). The research produced a 14-pattern taxonomy and the /prompt-engineering skill.
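The reported p-value is consistent with an exact McNemar test on discordant pairs. A minimal sketch, assuming 90 matched pairs per model with 7 discordant pairs, all favoring structured negation (these counts are inferred from the reported rates, not stated in the study):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact McNemar test: two-sided binomial test on discordant pairs.

    b: pairs where only condition A complied; c: pairs where only B did.
    Under H0 the discordant pairs split 50/50, so the p-value is a
    two-sided binomial test with p = 0.5 on n = b + c trials.
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Assumed counts: 7 discordant pairs, all where structured negation complied
# and positive-only framing did not (83/90 = 92.2% vs 90/90 = 100%).
p = mcnemar_exact(b=0, c=7)
# p == 0.015625, matching the reported p = 0.016
```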

  • Negative Prompting & Constraint Enforcement


    Controlled A/B test across 270 blind trials and 3 Claude models. 14-pattern NPT taxonomy. 4 ADRs. New /prompt-engineering skill. CONDITIONAL GO via PG-003 with 23 C4 quality gates averaging 0.952.

    Research Findings


How Research is Conducted

Jerry uses a structured research pipeline with specialized agents:

  1. ps-researcher gathers information with source citations (L0/L1/L2 structure)
  2. ps-analyst applies formal methodologies (5 Whys, FMEA, gap analysis, trade studies)
  3. ps-architect produces Architecture Decision Records (ADRs) in Nygard format
  4. ps-synthesizer extracts cross-cutting patterns from multiple research documents
  5. ps-critic reviews deliverables against a 6-dimension quality rubric (>= 0.92 threshold)
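The final gate in the pipeline above can be sketched as a threshold check over rubric scores. A minimal sketch, assuming an unweighted average; the dimension names are placeholders, not the actual rubric:

```python
# Hypothetical sketch of the ps-critic quality gate: average six rubric
# dimension scores and pass only at >= 0.92. Dimension names are assumed.
RUBRIC_DIMENSIONS = ("completeness", "accuracy", "clarity",
                     "methodology", "citations", "actionability")
THRESHOLD = 0.92

def passes_quality_gate(scores: dict[str, float]) -> bool:
    missing = set(RUBRIC_DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"unscored dimensions: {sorted(missing)}")
    mean = sum(scores[d] for d in RUBRIC_DIMENSIONS) / len(RUBRIC_DIMENSIONS)
    return mean >= THRESHOLD
```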

All research artifacts are persisted to the filesystem and survive context compaction, building a cumulative knowledge base across sessions.

Full research catalog on GitHub