# Single vs. Multi-Agent Orchestration
Evidence-based analysis of when multi-agent orchestration outperforms single-agent approaches, drawing on 20 sources (15 peer-reviewed) from ACL, ICLR, ICML, EMNLP, and NeurIPS.
## Key Findings
- Context length alone degrades LLM performance by 14-85%, even with perfect retrieval — the architecturally decisive finding (Du & Tian, EMNLP 2025)
- Multi-agent advantage narrows with stronger models: gains dropped from ~10% to ~1-3% across one generation of model improvement (Xie et al., 2025 meta-study)
- Self-refinement fails for logical reasoning (+0.2% on math) but succeeds for stylistic tasks (+20-30%) — a separate critic overcomes this blind spot (Madaan et al., NeurIPS 2023)
- Pipeline architectures (MetaGPT, ChatDev) show 2.8-3.9x quality improvement over single-agent in software development tasks (ICLR/ACL 2024)
- The real differentiator is verifiable process: adversarial review, persistent artifacts, and structured quality scoring — not raw answer quality
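The critic-separate-from-generator pattern and structured quality scoring in the findings above can be sketched as a minimal review loop. Everything here is an illustrative assumption, not taken from the cited studies: the rubric dimensions, `RubricScore`, the 0.8 threshold, and the stub `critique`/`refine` functions (which stand in for LLM calls).

```python
from dataclasses import dataclass

# Hypothetical rubric dimensions -- illustrative, not from the cited studies.
DIMENSIONS = ("correctness", "completeness", "clarity")

@dataclass
class RubricScore:
    scores: dict  # dimension -> 0.0..1.0

    @property
    def overall(self) -> float:
        return sum(self.scores.values()) / len(self.scores)

def critique(artifact: str) -> RubricScore:
    """Stand-in for a *separate* critic agent: score the artifact on each
    rubric dimension. In a real system this would be a second LLM call,
    which is what overcomes the self-refinement blind spot."""
    scores = {d: (0.9 if "proof" in artifact else 0.5) for d in DIMENSIONS}
    return RubricScore(scores)

def refine(artifact: str, score: RubricScore) -> str:
    """Stand-in for the generator revising against critic feedback."""
    return artifact + " [revised: add proof]"

def review_loop(artifact: str, threshold: float = 0.8, max_rounds: int = 3):
    """Adversarial review: alternate generator and critic until the
    structured score clears the threshold or rounds run out. The loop
    also yields a persistent, auditable score, not just an answer."""
    for round_no in range(max_rounds):
        score = critique(artifact)
        if score.overall >= threshold:
            return artifact, score, round_no
        artifact = refine(artifact, score)
    return artifact, critique(artifact), max_rounds
```

The design point the sketch makes is that acceptance is decided by a structured score from a distinct critic, not by the generator's own judgment.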
## Single Agent vs. Multi-Agent Orchestration: Evidence-Based Analysis
This 320-line research artifact synthesizes findings from three independent research clusters: context window degradation mechanics, head-to-head single vs. multi-agent benchmarks, and quality assurance mechanism evaluation. The analysis provides a nuanced three-layer conclusion rather than a binary recommendation.
### Methodology
The analysis draws on 20 sources spanning three independent research clusters:
- Context window degradation (7 sources): ACL, ICLR, EMNLP, Chroma Research, arXiv
- Single vs. multi-agent benchmarks (6 studies): ICML, ICLR, ACL, PMC
- Quality assurance mechanisms (6 studies): NeurIPS, Anthropic, domain-specific venues
Of these, 15 are peer-reviewed (ACL, ICLR, ICML, EMNLP, NeurIPS, PMC), 4 are arXiv preprints or technical reports, and 1 is a practitioner blog post. The clusters provide independent lines of evidence converging on the same conclusion.
### Key Data: Context Rot Evidence
| Mechanism | Finding | Source |
|---|---|---|
| Lost in the Middle | U-shaped attention pattern — LLMs miss information in the middle of long contexts | Liu et al., ACL 2024 |
| Context Rot | Performance unreliable as input length grows across all 18 models tested | Chroma Research, 2025 |
| Length Alone Hurts | 13.9-85% degradation even when models are forced to attend only to relevant tokens | Du & Tian, EMNLP 2025 |
| Multi-Turn Drift | 39% average accuracy drop from single-turn to multi-turn conversations | Levy et al., 2025 |
| Effective Window | Open-source LLMs use <50% of their claimed context length effectively | Ding et al., ICLR 2025 |
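To see how position-sensitivity findings like "Lost in the Middle" are typically measured, a needle-at-depth probe places a known fact at a controlled relative position inside filler context and sweeps that position. The harness below is a sketch under assumptions: function names are hypothetical, and where the cited studies query real LLMs at each depth, this only constructs the prompts.

```python
def build_probe(needle: str, filler: str, depth: float, total_chars: int) -> str:
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end)
    inside filler text padded to roughly `total_chars` characters."""
    pad = (filler * (total_chars // len(filler) + 1))[:total_chars]
    cut = int(depth * len(pad))
    return pad[:cut] + needle + pad[cut:]

def probe_positions(needle: str, depths=(0.0, 0.5, 1.0), total_chars=2000):
    """Return (depth, prompt) pairs for a position sweep. A real harness
    would send each prompt to the model and score retrieval accuracy per
    depth; a U-shaped accuracy curve is the Lost-in-the-Middle signature."""
    filler = "Lorem ipsum dolor sit amet. "
    return [(d, build_probe(needle, filler, d, total_chars)) for d in depths]
```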
### Key Data: Multi-Agent Performance Benchmarks

**Software Development (Pipeline Paradigm — directly applicable to Jerry):**
| Study | Single Agent | Multi-Agent | Improvement |
|---|---|---|---|
| MetaGPT vs AutoGPT (ICLR 2024) | 1.0 | 3.9 | 3.9x quality |
| ChatDev vs GPT-Engineer (ACL 2024) | 0.14 | 0.40 | 2.8x quality |
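The pipeline paradigm behind these numbers can be sketched as role agents that each consume the previous stage's artifact and emit a new, persistent one. The role names and stub outputs below are illustrative assumptions, not the actual MetaGPT or ChatDev APIs; the stubs stand in for LLM calls.

```python
# Minimal pipeline sketch: each role is a function from the prior
# artifact to a new artifact. Stubs stand in for role-prompted LLM calls.
def architect(requirements: str) -> str:
    return f"design for: {requirements}"

def coder(design: str) -> str:
    return f"code implementing ({design})"

def reviewer(code: str) -> str:
    return f"reviewed [{code}]"

PIPELINE = (architect, coder, reviewer)

def run_pipeline(requirements: str) -> list:
    """Run roles in sequence, keeping every intermediate artifact so
    later stages (and audits) can inspect the full provenance chain --
    the 'persistent artifacts' property from the key findings."""
    artifacts = [requirements]
    for role in PIPELINE:
        artifacts.append(role(artifacts[-1]))
    return artifacts
```

Each stage also starts from a compact artifact rather than the full upstream transcript, which is how pipelines sidestep the context-length degradation documented above.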
**Reasoning and Factuality (Debate Paradigm):**
| Study | Single Agent | Multi-Agent | Improvement |
|---|---|---|---|
| Du et al., ICML 2024 (Arithmetic) | 67.0% | 81.8% | +14.8pp |
| Du et al., ICML 2024 (Biography facts) | 60.0% | 74.0% | +14.0pp |
| A-HMAD, 2025 (Biography facts) | 60.0% | 80.6% | +20.6pp |
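The debate paradigm these rows measure can be sketched as agents answering independently, then revising after seeing peers' answers, with a majority vote at the end. This is a simplified sketch, not the protocol of Du et al. or A-HMAD: the agents here are deterministic stubs, where the studies use LLM calls and free-text critiques.

```python
from collections import Counter

def make_agent(initial: str):
    """Stub debater: answers `initial` in round one, then revises toward
    the current majority -- the mechanism by which debate corrects
    isolated factual errors."""
    def agent(question: str, peers: list) -> str:
        if peers:
            return Counter(peers).most_common(1)[0][0]
        return initial
    return agent

def debate(agents, question: str, rounds: int = 2) -> str:
    """Multi-agent debate sketch: independent first answers, then
    `rounds` of revision with visibility into peers' answers, then a
    majority vote over the final positions."""
    answers = [a(question, []) for a in agents]
    for _ in range(rounds):
        answers = [a(question, answers) for a in agents]
    return Counter(answers).most_common(1)[0][0]
```

With two agents starting correct and one starting wrong, the wrong agent converges to the majority within one round, which is the error-correction effect the accuracy gains above reflect.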
**Diminishing Returns with Stronger Models:**
| Benchmark | Early Models | Frontier Models | Trend |
|---|---|---|---|
| MetaGPT-HumanEval | +10.7% | +3.0% | Narrowing |
| MathDebate-GSM8K | +9.0% | +0.8% | Narrowing |