# Research
Jerry's development is driven by evidence-based research. Every architectural decision, quality framework component, and workflow pattern is backed by structured research artifacts with citations, formal methodologies, and multi-perspective analysis.
This section curates that research. Each page presents key findings inline, with expandable methodology details and links to the full research artifacts on GitHub.
## Research Domains
| Domain | Pages | Description |
|---|---|---|
| Adversarial Strategies & Quality | 3 | How we selected and implemented 10 adversarial review strategies from 36 candidates |
| Agent Architecture | 1 | Evidence-based analysis of single vs. multi-agent orchestration |
| Context Management | 1 | Solving context rot in LLM sessions |
| Skill Architecture | 1 | Compliance framework for Claude Code skill patterns |
| Governance | 1 | Constitutional AI governance for LLM agents |
| OSS Release Methodology | 1 | FMEA, root cause analysis, and best practices for open-sourcing |
| Software Architecture | 1 | Hexagonal, DDD, and CQRS patterns in Python |
| Claude Code Ecosystem | 1 | Plugin, skill, and CLI patterns |
| LLM Reliability | 4 | Empirical A/B study of LLM deception patterns and domain reliability tiers |
| Negative Prompting | 1 | Structured negation vs. positive framing in LLM constraint enforcement |
Total: 52 research artifacts across 10 domains, ~29,000 lines of documented research. (Full catalog)
## Adversarial Strategies & Quality
The adversarial quality framework was built through a rigorous research pipeline: 36 strategies cataloged from academic, industry, and emerging sources, narrowed to 15 through deduplication, then scored and selected to a final 10 using NASA SE trade study methodology.
- **Adversarial Strategy Catalog**
  Unified catalog of 15 adversarial review strategies synthesized from academic literature (CIA, DoD, Hegelian dialectics), industry practices, and emerging AI patterns. 117+ citations.
- **Strategy Selection & Enforcement (ADRs)**
  Two architecture decision records: selection of the final 10 strategies with composite scoring, and design of the 5-layer enforcement architecture with token-budget feasibility analysis.
- **Adversarial Quality Deep Dives**
  Trade studies, risk registers (FMEA), scoring methodology, and the strategy selection decision tree. Supporting research behind the headline decisions.
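The composite scoring behind the trade-study selection reduces each candidate strategy to a weighted sum over evaluation criteria. A minimal sketch in Python; the criteria names, weights, and scores below are hypothetical stand-ins, not the values from the actual NASA SE trade study:

```python
def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted composite score for one candidate (weights sum to 1.0)."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)

# Hypothetical candidates and criteria, for illustration only.
candidates = {
    "red-team-review": {"effectiveness": 0.9, "cost": 0.6, "coverage": 0.8},
    "devils-advocate": {"effectiveness": 0.7, "cost": 0.9, "coverage": 0.6},
}
weights = {"effectiveness": 0.5, "cost": 0.2, "coverage": 0.3}

# Rank candidates by composite score, highest first.
ranked = sorted(candidates, key=lambda c: composite(candidates[c], weights), reverse=True)
```

Selecting a final 10 from 15 ranked candidates is then just a slice of the ranked list, with the ADR recording the scores and rationale.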
## Agent Architecture
- **Single vs. Multi-Agent Orchestration**
  Evidence-based analysis drawing on 20 peer-reviewed sources (ACL, ICLR, ICML, EMNLP, NeurIPS). Quantitative findings on when multi-agent helps vs. hurts, with context rot as the primary mechanism.
## Context Management
- **Context Management & LLM Performance**
  Research on context rot thresholds, `CLAUDE.md` optimization (50-70% token savings quantified), decomposition patterns, and cross-platform sync strategies.
## Skill Architecture
- **Skill Compliance Framework**
  2,646-line compliance framework with 117 checkpoints across 8 orchestration patterns. Reusable for any Claude Code skill. Includes a 4-phase remediation roadmap.
## Governance
- **Governance & Constitutional AI**
  The Jerry Constitution implements Constitutional AI for LLM agent governance, drawing on Anthropic, OpenAI, and DeepMind prior art. Progressive enforcement from advisory to hard constraints.
## OSS Release Methodology
- **OSS Preparation & Methodology**
  FMEA risk analysis (21 risks scored), 5 Whys root cause analysis, gap analysis, and best-practices research from Google, Microsoft, and Apache Foundation sources.
## Software Architecture
- **Software Architecture & Patterns**
  Python architecture standards (Hexagonal + DDD + CQRS), teaching-edition walkthroughs, and domain-specific playbooks.
## Claude Code Ecosystem
- **Claude Code Ecosystem**
  Research on Claude Code CLI patterns, plugin architecture and distribution, and skill structure and agent design patterns.
## LLM Reliability
The first empirical research study in Jerry: a controlled A/B test of LLM deception patterns across 5 knowledge domains. The "Two-Leg Thesis" (LLMs are roughly 85% accurate yet close to 100% confident) has direct implications for any system built on LLM outputs.
- **LLM Deception Study**
  Controlled A/B test across 15 questions and 5 domains revealing systematic confidence-accuracy gaps in LLM outputs. Introduces domain reliability tiers (T1-T5) and the Composite Integrity Ratio metric. Quality-gated at a 0.93 average across 5 C4 tournaments.
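The thesis is at bottom a claim about miscalibration: accuracy and stated confidence diverge. A toy illustration of how such a gap is measured; every number below is fabricated for illustration and does not come from the study's data:

```python
# Each trial pairs a correctness flag with the model's stated confidence (0-1).
# Fabricated data, for illustration only.
trials = [
    (True, 1.0), (True, 0.95), (True, 1.0), (False, 1.0),
    (True, 0.9), (True, 1.0), (False, 0.95), (True, 1.0),
]

accuracy = sum(correct for correct, _ in trials) / len(trials)
mean_confidence = sum(conf for _, conf in trials) / len(trials)
overconfidence_gap = mean_confidence - accuracy  # > 0 means overconfident
```

A per-domain version of the same computation is what makes reliability tiers possible: domains are grouped by how large the gap is, not just by raw accuracy.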
## Negative Prompting
The second empirical study in Jerry: a controlled A/B test of constraint framing across 270 matched trials on three Claude models. NPT-013 structured negation (NEVER + consequence + alternative) achieved 100% compliance versus 92.2% for positive-only framing (McNemar exact p=0.016). The research produced a 14-pattern taxonomy and the `/prompt-engineering` skill.
- **Negative Prompting & Constraint Enforcement**
  Controlled A/B test across 270 blind trials and 3 Claude models. 14-pattern NPT taxonomy. 4 ADRs. New `/prompt-engineering` skill. CONDITIONAL GO via PG-003, with 23 C4 quality gates averaging 0.952.
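A reported "McNemar exact" p-value comes from a test that compares two paired binary outcomes using only the discordant pairs (trials where exactly one framing complied). A minimal sketch of the standard two-sided exact form; assuming this is the variant the study used:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: pairs where only condition A succeeded
    c: pairs where only condition B succeeded
    Concordant pairs (both succeed or both fail) carry no information here.
    """
    n = b + c
    if n == 0:
        return 1.0
    # Binomial tail under H0 (discordances split 50/50), doubled for two sides.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) * 0.5 ** n
    return min(2.0 * tail, 1.0)

# For example, seven discordant pairs all favoring one condition:
p = mcnemar_exact(0, 7)  # 2 * 0.5**7 = 0.015625
```

Because the test conditions on discordant pairs, a large pool of matched trials can still yield a modest effective sample; the exact binomial form avoids the chi-square approximation's problems at small counts.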
## How Research is Conducted
Jerry uses a structured research pipeline with specialized agents:
- `ps-researcher` gathers information with source citations (L0/L1/L2 structure)
- `ps-analyst` applies formal methodologies (5 Whys, FMEA, gap analysis, trade studies)
- `ps-architect` produces Architecture Decision Records (ADRs) in Nygard format
- `ps-synthesizer` extracts cross-cutting patterns from multiple research documents
- `ps-critic` reviews deliverables against a 6-dimension quality rubric (>= 0.92 threshold)
All research artifacts are persisted to the filesystem and survive context compaction, building a cumulative knowledge base across sessions.
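The >= 0.92 quality gate reduces to a simple predicate over rubric scores. A sketch, assuming the threshold applies to the unweighted average across dimensions; the dimension names below are hypothetical, not the rubric's actual ones:

```python
THRESHOLD = 0.92  # minimum average rubric score required to pass the gate

def passes_gate(scores: dict[str, float], threshold: float = THRESHOLD) -> bool:
    """True when the average across all rubric dimensions meets the threshold."""
    return sum(scores.values()) / len(scores) >= threshold

# Hypothetical 6-dimension review of one deliverable.
review = {
    "accuracy": 0.95, "completeness": 0.93, "citations": 0.90,
    "clarity": 0.94, "methodology": 0.92, "actionability": 0.91,
}
verdict = passes_gate(review)  # average is 0.925, so the gate passes
```

A deliverable that fails the gate loops back through the pipeline for revision rather than being published, which is what makes the per-study averages (0.93, 0.952) meaningful floors rather than aspirations.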