# Single vs. Multi-Agent Orchestration
Evidence-based analysis of when multi-agent orchestration outperforms single-agent approaches, drawing on 20 sources (15 peer-reviewed) from ACL, ICLR, ICML, EMNLP, and NeurIPS.
## Key Findings
- Context length alone degrades LLM performance by 14-85%, even with perfect retrieval — the architecturally decisive finding (Du & Tian, EMNLP 2025)
- Multi-agent advantage narrows with stronger models: gains dropped from ~10% to ~1-3% across one generation of model improvement (Xie et al., 2025 meta-study)
- Self-refinement fails for logical reasoning (+0.2% on math) but succeeds for stylistic tasks (+20-30%) — a separate critic overcomes this blind spot (Madaan et al., NeurIPS 2023)
- Pipeline architectures (MetaGPT, ChatDev) show 2.8-3.9x quality improvement over single-agent in software development tasks (ICLR/ACL 2024)
- The real differentiator is verifiable process: adversarial review, persistent artifacts, and structured quality scoring — not raw answer quality
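The critic-separate-from-generator pattern and structured quality scoring in the findings above can be sketched as a minimal review loop. Everything here is an illustrative assumption, not taken from the cited studies: the rubric dimensions, `RubricScore`, the 0.8 threshold, and the stub `critique`/`refine` functions (which stand in for LLM calls).

```python
from dataclasses import dataclass

# Hypothetical rubric dimensions -- illustrative, not from the cited studies.
DIMENSIONS = ("correctness", "completeness", "clarity")

@dataclass
class RubricScore:
    scores: dict  # dimension -> 0.0..1.0

    @property
    def overall(self) -> float:
        return sum(self.scores.values()) / len(self.scores)

def critique(artifact: str) -> RubricScore:
    """Stand-in for a *separate* critic agent: score the artifact on each
    rubric dimension. In a real system this would be a second LLM call,
    which is what overcomes the self-refinement blind spot."""
    scores = {d: (0.9 if "proof" in artifact else 0.5) for d in DIMENSIONS}
    return RubricScore(scores)

def refine(artifact: str, score: RubricScore) -> str:
    """Stand-in for the generator revising against critic feedback."""
    return artifact + " [revised: add proof]"

def review_loop(artifact: str, threshold: float = 0.8, max_rounds: int = 3):
    """Adversarial review: alternate generator and critic until the
    structured score clears the threshold or rounds run out. The loop
    also yields a persistent, auditable score, not just an answer."""
    for round_no in range(max_rounds):
        score = critique(artifact)
        if score.overall >= threshold:
            return artifact, score, round_no
        artifact = refine(artifact, score)
    return artifact, critique(artifact), max_rounds
```

The design point the sketch makes is that acceptance is decided by a structured score from a distinct critic, not by the generator's own judgment.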
## Single Agent vs. Multi-Agent Orchestration: Evidence-Based Analysis
This 320-line research artifact synthesizes findings from three independent research clusters: context window degradation mechanics, head-to-head single vs. multi-agent benchmarks, and quality assurance mechanism evaluation. The analysis provides a nuanced three-layer conclusion rather than a binary recommendation.
### Methodology
The analysis draws on 20 sources spanning three independent research clusters:
- Context window degradation (7 sources): ACL, ICLR, EMNLP, Chroma Research, arXiv
- Single vs. multi-agent benchmarks (6 studies): ICML, ICLR, ACL, PMC
- Quality assurance mechanisms (6 studies): NeurIPS, Anthropic, domain-specific venues
Of these, 15 are peer-reviewed (ACL, ICLR, ICML, EMNLP, NeurIPS, PMC), 4 are arXiv preprints or technical reports, and 1 is a practitioner blog post. The clusters provide independent lines of evidence converging on the same conclusion.
### Key Data: Context Rot Evidence
| Mechanism | Finding | Source |
|---|---|---|
| Lost in the Middle | U-shaped attention pattern — LLMs miss information in the middle of long contexts | Liu et al., ACL 2024 |
| Context Rot | Performance unreliable as input length grows across all 18 models tested | Chroma Research, 2025 |
| Length Alone Hurts | 13.9-85% degradation even when models are forced to attend only to relevant tokens | Du & Tian, EMNLP 2025 |
| Multi-Turn Drift | 39% average accuracy drop from single-turn to multi-turn conversations | Levy et al., 2025 |
| Effective Window | Open-source LLMs use <50% of their claimed context length effectively | Ding et al., ICLR 2025 |
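To see how position-sensitivity findings like "Lost in the Middle" are typically measured, a needle-at-depth probe places a known fact at a controlled relative position inside filler context and sweeps that position. The harness below is a sketch under assumptions: function names are hypothetical, and where the cited studies query real LLMs at each depth, this only constructs the prompts.

```python
def build_probe(needle: str, filler: str, depth: float, total_chars: int) -> str:
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end)
    inside filler text padded to roughly `total_chars` characters."""
    pad = (filler * (total_chars // len(filler) + 1))[:total_chars]
    cut = int(depth * len(pad))
    return pad[:cut] + needle + pad[cut:]

def probe_positions(needle: str, depths=(0.0, 0.5, 1.0), total_chars=2000):
    """Return (depth, prompt) pairs for a position sweep. A real harness
    would send each prompt to the model and score retrieval accuracy per
    depth; a U-shaped accuracy curve is the Lost-in-the-Middle signature."""
    filler = "Lorem ipsum dolor sit amet. "
    return [(d, build_probe(needle, filler, d, total_chars)) for d in depths]
```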
### Key Data: Multi-Agent Performance Benchmarks

**Software Development (Pipeline Paradigm — directly applicable to Jerry):**
| Study | Single Agent | Multi-Agent | Improvement |
|---|---|---|---|
| MetaGPT vs AutoGPT (ICLR 2024) | 1.0 | 3.9 | 3.9x quality |
| ChatDev vs GPT-Engineer (ACL 2024) | 0.14 | 0.40 | 2.8x quality |
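The pipeline paradigm behind these numbers can be sketched as role agents that each consume the previous stage's artifact and emit a new, persistent one. The role names and stub outputs below are illustrative assumptions, not the actual MetaGPT or ChatDev APIs; the stubs stand in for LLM calls.

```python
# Minimal pipeline sketch: each role is a function from the prior
# artifact to a new artifact. Stubs stand in for role-prompted LLM calls.
def architect(requirements: str) -> str:
    return f"design for: {requirements}"

def coder(design: str) -> str:
    return f"code implementing ({design})"

def reviewer(code: str) -> str:
    return f"reviewed [{code}]"

PIPELINE = (architect, coder, reviewer)

def run_pipeline(requirements: str) -> list:
    """Run roles in sequence, keeping every intermediate artifact so
    later stages (and audits) can inspect the full provenance chain --
    the 'persistent artifacts' property from the key findings."""
    artifacts = [requirements]
    for role in PIPELINE:
        artifacts.append(role(artifacts[-1]))
    return artifacts
```

Each stage also starts from a compact artifact rather than the full upstream transcript, which is how pipelines sidestep the context-length degradation documented above.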
**Reasoning and Factuality (Debate Paradigm):**
| Study | Single Agent | Multi-Agent | Improvement |
|---|---|---|---|
| Du et al., ICML 2024 (Arithmetic) | 67.0% | 81.8% | +14.8pp |
| Du et al., ICML 2024 (Biography facts) | 60.0% | 74.0% | +14.0pp |
| A-HMAD, 2025 (Biography facts) | 60.0% | 80.6% | +20.6pp |
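The debate paradigm these rows measure can be sketched as agents answering independently, then revising after seeing peers' answers, with a majority vote at the end. This is a simplified sketch, not the protocol of Du et al. or A-HMAD: the agents here are deterministic stubs, where the studies use LLM calls and free-text critiques.

```python
from collections import Counter

def make_agent(initial: str):
    """Stub debater: answers `initial` in round one, then revises toward
    the current majority -- the mechanism by which debate corrects
    isolated factual errors."""
    def agent(question: str, peers: list) -> str:
        if peers:
            return Counter(peers).most_common(1)[0][0]
        return initial
    return agent

def debate(agents, question: str, rounds: int = 2) -> str:
    """Multi-agent debate sketch: independent first answers, then
    `rounds` of revision with visibility into peers' answers, then a
    majority vote over the final positions."""
    answers = [a(question, []) for a in agents]
    for _ in range(rounds):
        answers = [a(question, answers) for a in agents]
    return Counter(answers).most_common(1)[0][0]
```

With two agents starting correct and one starting wrong, the wrong agent converges to the majority within one round, which is the error-correction effect the accuracy gains above reflect.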
**Diminishing Returns with Stronger Models:**
| Benchmark | Early Models | Frontier Models | Trend |
|---|---|---|---|
| MetaGPT-HumanEval | +10.7% | +3.0% | Narrowing |
| MathDebate-GSM8K | +9.0% | +0.8% | Narrowing |