# Research
Jerry's development is driven by evidence-based research. Every architectural decision, quality framework component, and workflow pattern is backed by structured research artifacts with citations, formal methodologies, and multi-perspective analysis.
This section curates that research. Each page presents key findings inline, with expandable methodology details and links to the full research artifacts on GitHub.
## Research Domains
| Domain | Pages | Description |
|---|---|---|
| Adversarial Strategies & Quality | 3 | How we selected and implemented 10 adversarial review strategies from 36 candidates |
| Agent Architecture | 1 | Evidence-based analysis of single vs. multi-agent orchestration |
| Context Management | 1 | Solving context rot in LLM sessions |
| Skill Architecture | 1 | Compliance framework for Claude Code skill patterns |
| Governance | 1 | Constitutional AI governance for LLM agents |
| OSS Release Methodology | 1 | FMEA, root cause analysis, and best practices for open-sourcing |
| Software Architecture | 1 | Hexagonal, DDD, and CQRS patterns in Python |
| Claude Code Ecosystem | 1 | Plugin, skill, and CLI patterns |
| LLM Reliability | 4 | Empirical A/B study of LLM deception patterns and domain reliability tiers |
| Negative Prompting | 1 | Structured negation vs. positive framing in LLM constraint enforcement |
Total: 52 research artifacts across 10 domains, ~29,000 lines of documented research. (Full catalog)
## Adversarial Strategies & Quality
The adversarial quality framework was built through a rigorous research pipeline: 36 strategies cataloged from academic, industry, and emerging sources, narrowed to 15 through deduplication, then scored and selected to a final 10 using NASA SE trade study methodology.
- **Adversarial Strategy Catalog**
  Unified catalog of 15 adversarial review strategies synthesized from academic literature (CIA, DoD, Hegelian dialectics), industry practices, and emerging AI patterns. 117+ citations.
- **Strategy Selection & Enforcement (ADRs)**
  Two architecture decision records: selection of the final 10 strategies with composite scoring, and design of the 5-layer enforcement architecture with token-budget feasibility analysis.
- **Adversarial Quality Deep Dives**
  Trade studies, risk registers (FMEA), scoring methodology, and the strategy selection decision tree. Supporting research behind the headline decisions.
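The composite scoring behind the trade-study selection reduces each candidate strategy to a weighted sum over evaluation criteria. A minimal sketch in Python; the criteria names, weights, and scores below are hypothetical stand-ins, not the values from the actual NASA SE trade study:

```python
def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted composite score for one candidate (weights sum to 1.0)."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)

# Hypothetical candidates and criteria, for illustration only.
candidates = {
    "red-team-review": {"effectiveness": 0.9, "cost": 0.6, "coverage": 0.8},
    "devils-advocate": {"effectiveness": 0.7, "cost": 0.9, "coverage": 0.6},
}
weights = {"effectiveness": 0.5, "cost": 0.2, "coverage": 0.3}

# Rank candidates by composite score, highest first.
ranked = sorted(candidates, key=lambda c: composite(candidates[c], weights), reverse=True)
```

Selecting a final 10 from 15 ranked candidates is then just a slice of the ranked list, with the ADR recording the scores and rationale.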
## Agent Architecture
- **Single vs. Multi-Agent Orchestration**
  Evidence-based analysis drawing on 20 peer-reviewed sources (ACL, ICLR, ICML, EMNLP, NeurIPS). Quantitative findings on when multi-agent helps vs. hurts, with context rot as the primary mechanism.
## Context Management
- **Context Management & LLM Performance**
  Research on context rot thresholds, `CLAUDE.md` optimization (50-70% token savings quantified), decomposition patterns, and cross-platform sync strategies.
## Skill Architecture
- **Skill Compliance Framework**
  2,646-line compliance framework with 117 checkpoints across 8 orchestration patterns. Reusable for any Claude Code skill. Includes a 4-phase remediation roadmap.
## Governance
- **Governance & Constitutional AI**
  The Jerry Constitution implements Constitutional AI for LLM agent governance, drawing on Anthropic, OpenAI, and DeepMind prior art. Progressive enforcement from advisory to hard constraints.
## OSS Release Methodology
- **OSS Preparation & Methodology**
  FMEA risk analysis (21 risks scored), 5 Whys root cause analysis, gap analysis, and best-practices research from Google, Microsoft, and Apache Foundation sources.
## Software Architecture
- **Software Architecture & Patterns**
  Python architecture standards (Hexagonal + DDD + CQRS), teaching-edition walkthroughs, and domain-specific playbooks.
## Claude Code Ecosystem
- **Claude Code Ecosystem**
  Research on Claude Code CLI patterns, plugin architecture and distribution, and skill structure and agent design patterns.
## LLM Reliability
The first empirical research study in Jerry: a controlled A/B test of LLM deception patterns across 5 knowledge domains. The "Two-Leg Thesis" (LLMs are roughly 85% accurate yet close to 100% confident) has direct implications for any system built on LLM outputs.
- **LLM Deception Study**
  Controlled A/B test across 15 questions and 5 domains revealing systematic confidence-accuracy gaps in LLM outputs. Introduces domain reliability tiers (T1-T5) and the Composite Integrity Ratio metric. Quality-gated at a 0.93 average across 5 C4 tournaments.
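The thesis is at bottom a claim about miscalibration: accuracy and stated confidence diverge. A toy illustration of how such a gap is measured; every number below is fabricated for illustration and does not come from the study's data:

```python
# Each trial pairs a correctness flag with the model's stated confidence (0-1).
# Fabricated data, for illustration only.
trials = [
    (True, 1.0), (True, 0.95), (True, 1.0), (False, 1.0),
    (True, 0.9), (True, 1.0), (False, 0.95), (True, 1.0),
]

accuracy = sum(correct for correct, _ in trials) / len(trials)
mean_confidence = sum(conf for _, conf in trials) / len(trials)
overconfidence_gap = mean_confidence - accuracy  # > 0 means overconfident
```

A per-domain version of the same computation is what makes reliability tiers possible: domains are grouped by how large the gap is, not just by raw accuracy.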
## Negative Prompting
The second empirical study in Jerry: a controlled A/B test of constraint framing across 270 matched trials on three Claude models. NPT-013 structured negation (NEVER + consequence + alternative) achieved 100% compliance versus 92.2% for positive-only framing (McNemar exact p=0.016). The research produced a 14-pattern taxonomy and the `/prompt-engineering` skill.
- **Negative Prompting & Constraint Enforcement**
  Controlled A/B test across 270 blind trials and 3 Claude models. 14-pattern NPT taxonomy. 4 ADRs. New `/prompt-engineering` skill. CONDITIONAL GO via PG-003, with 23 C4 quality gates averaging 0.952.
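A reported "McNemar exact" p-value comes from a test that compares two paired binary outcomes using only the discordant pairs (trials where exactly one framing complied). A minimal sketch of the standard two-sided exact form; assuming this is the variant the study used:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: pairs where only condition A succeeded
    c: pairs where only condition B succeeded
    Concordant pairs (both succeed or both fail) carry no information here.
    """
    n = b + c
    if n == 0:
        return 1.0
    # Binomial tail under H0 (discordances split 50/50), doubled for two sides.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) * 0.5 ** n
    return min(2.0 * tail, 1.0)

# For example, seven discordant pairs all favoring one condition:
p = mcnemar_exact(0, 7)  # 2 * 0.5**7 = 0.015625
```

Because the test conditions on discordant pairs, a large pool of matched trials can still yield a modest effective sample; the exact binomial form avoids the chi-square approximation's problems at small counts.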
## How Research is Conducted
Jerry uses a structured research pipeline with specialized agents:
- `ps-researcher` gathers information with source citations (L0/L1/L2 structure)
- `ps-analyst` applies formal methodologies (5 Whys, FMEA, gap analysis, trade studies)
- `ps-architect` produces Architecture Decision Records (ADRs) in Nygard format
- `ps-synthesizer` extracts cross-cutting patterns from multiple research documents
- `ps-critic` reviews deliverables against a 6-dimension quality rubric (>= 0.92 threshold)
All research artifacts are persisted to the filesystem and survive context compaction, building a cumulative knowledge base across sessions.
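The >= 0.92 quality gate reduces to a simple predicate over rubric scores. A sketch, assuming the threshold applies to the unweighted average across dimensions; the dimension names below are hypothetical, not the rubric's actual ones:

```python
THRESHOLD = 0.92  # minimum average rubric score required to pass the gate

def passes_gate(scores: dict[str, float], threshold: float = THRESHOLD) -> bool:
    """True when the average across all rubric dimensions meets the threshold."""
    return sum(scores.values()) / len(scores) >= threshold

# Hypothetical 6-dimension review of one deliverable.
review = {
    "accuracy": 0.95, "completeness": 0.93, "citations": 0.90,
    "clarity": 0.94, "methodology": 0.92, "actionability": 0.91,
}
verdict = passes_gate(review)  # average is 0.925, so the gate passes
```

A deliverable that fails the gate loops back through the pipeline for revision rather than being published, which is what makes the per-study averages (0.93, 0.952) meaningful floors rather than aspirations.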