Multiagent Debate is the open-source reference implementation accompanying the paper 'Improving Factuality and Reasoning in Language Models through Multiagent Debate' (Du et al., published at ICML 2024). The repository lives at composable-models/llm_multiagent_debate on GitHub.

The core idea is to treat multiple instances of the same language model as a 'society of minds': each instance independently generates a candidate answer, then reads and critiques every other instance's response, and the group repeats this debate over several rounds so that the agents' answers converge on a single consensus answer. The approach is motivated by two failure modes of single-LLM inference, hallucination and reasoning errors, which a lone model often cannot correct on its own. Cross-agent critique surfaces and resolves inconsistencies that would persist in a single-pass generation. Notably, the paper reports cases in which all agents start with the wrong answer and the debate still drives the group to the correct one as rounds progress.

The repository provides concrete implementations for four benchmark tasks: arithmetic reasoning, GSM8K (grade school math word problems), biography generation (factuality evaluation), and MMLU (multi-domain multiple-choice). Each task has separate generation and evaluation scripts, making it straightforward to reproduce the paper's results or adapt the pipeline to new domains.

The work was authored by Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch, and appeared in the Proceedings of the 41st ICML (July 2024). It has become a widely cited reference point for multi-agent reasoning research, with numerous follow-on papers in 2025-2026 examining its scalability, its failure modes under adversarial conditions, and adaptive variants that invoke debate only when needed.
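
The debate loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the repository's code: it assumes each agent is a plain callable from prompt to response (in practice a wrapper around an LLM API call), the names `Agent`, `build_debate_prompt`, `run_debate`, `majority_answer`, and `make_stub_agent` are hypothetical, the prompt wording is invented for the example, and taking a majority vote over the final round is an assumption about how a single answer is selected.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical type: an agent maps a prompt string to a response string.
# A real agent would wrap a chat-completion call to an LLM.
Agent = Callable[[str], str]

PEER_PREFIX = "Agent response: "


def build_debate_prompt(question: str, other_answers: List[str]) -> str:
    """Ask an agent to reconsider the question in light of its peers' answers."""
    peers = "\n".join(PEER_PREFIX + answer for answer in other_answers)
    return (
        f"{question}\n\n"
        f"These are responses from other agents:\n{peers}\n\n"
        "Using these responses as additional information, give an updated answer."
    )


def run_debate(question: str, agents: List[Agent], rounds: int = 2) -> List[List[str]]:
    """Return every agent's answer for every round (index 0 = independent answers)."""
    # Round 0: each agent answers on its own.
    history = [[agent(question) for agent in agents]]
    for _ in range(rounds):
        previous = history[-1]
        current = []
        for i, agent in enumerate(agents):
            # Each agent sees every answer from the previous round except its own.
            others = previous[:i] + previous[i + 1:]
            current.append(agent(build_debate_prompt(question, others)))
        history.append(current)
    return history


def majority_answer(answers: List[str]) -> str:
    """Pick the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def make_stub_agent(initial: str) -> Agent:
    """Toy stand-in for an LLM: it changes its answer only when all peers agree."""
    def agent(prompt: str) -> str:
        peers = {
            line[len(PEER_PREFIX):]
            for line in prompt.splitlines()
            if line.startswith(PEER_PREFIX)
        }
        # No peers (round 0) or disagreeing peers: keep the initial answer.
        return peers.pop() if len(peers) == 1 else initial
    return agent


if __name__ == "__main__":
    agents = [make_stub_agent("12"), make_stub_agent("12"), make_stub_agent("15")]
    history = run_debate("What is 7 + 5?", agents, rounds=2)
    for round_index, answers in enumerate(history):
        print(round_index, answers)
    print("final:", majority_answer(history[-1]))
```

The stub agents exist only to make the sketch runnable without API access; swapping in real model calls leaves `run_debate` unchanged, which is the sense in which the design is model-agnostic.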
Key features:
- Multi-round structured debate: each LLM agent reads all other agents' responses and updates its own answer iteratively
- Model-agnostic design: works with any instruction-following LLM; original experiments used GPT-3.5/GPT-4
- Four benchmark task implementations: arithmetic, GSM8K math, biography factuality, and MMLU
- Reduces hallucinations and reasoning errors compared to single-model inference
- Consensus emergence: agents that all start wrong can converge to the correct answer through debate
- Open-source Python codebase with separate gen/eval scripts per task for reproducibility
Pricing: not published (open-source research code; underlying LLM API costs apply)