🔬 ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry

Tianze Xu*, Pengrui Lu*, Lyumanshan Ye, Xiangkun Hu, Pengfei Liu†
Shanghai Jiao Tong University, SII, GAIR
*Equal contribution, †Corresponding author
Figure: Framework overview of ResearcherBench

🚀 Brief Introduction

The emergence of deep research systems brings significant problem-solving capabilities, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced agentic systems, which we refer to as Deep AI Research Systems (DARS), on frontier AI scientific questions.

Figure: Dataset distribution
  • Novel Task Collection: We present 65 high-quality research questions sourced from authentic frontier scenarios across 35 distinct AI research subjects, categorized into three types: Technical Details, Literature Review, and Open Consulting.
  • Dual Evaluation Framework: Our assessment methodology combines expert-designed qualitative evaluation criteria (Rubric Assessment) with quantitative factual evaluation metrics (Faithfulness and Groundedness scores).
  • Comprehensive Analysis: We conduct extensive evaluation of leading commercial DARS platforms, revealing both their current capabilities and fundamental limitations in frontier research assistance.
  • Open-Source Contribution: We release ResearcherBench as an open-source resource to establish a standardized platform for advancing AI research assistance capabilities.

Through ResearcherBench, we aim to foster a new perspective in AI research evaluation, focusing on the depth of understanding and insight generation rather than the breadth of information coverage.

🏆 Leaderboard

We evaluate various Deep AI Research Systems (DARS), including leading commercial platforms and baseline systems with web search capabilities. Our evaluation covers three question types (Technical Details, Literature Review, and Open Consulting) across 35 AI research subjects.

The evaluation employs our dual assessment framework: Rubric Assessment measures insight quality using expert-designed criteria, while Factual Assessment evaluates citation accuracy (Faithfulness) and coverage (Groundedness). All experiments were conducted between March and June 2025 to ensure temporal consistency.

Note: Results show that DARS systems demonstrate particular strength in Open Consulting questions, suggesting their potential as innovative research ideation partners rather than as guides for precise technical implementation.

| Model | Type | Date | Coverage | Faithfulness | Groundedness |
|-------|------|------|----------|--------------|--------------|
| OpenAI Deep Research | DARS | 2025-03-24 | 70.32 | 84.0 | 34.0 |
| Gemini Deep Research | DARS | 2025-04-15 | 69.29 | 86.0 | 59.0 |
| Perplexity: Sonar Reasoning Pro | LLM+Search | | 46.63 | 62.0 | 68.0 |
| Perplexity Deep Research | DARS | 2025-03-24 | 48.46 | 85.0 | 56.0 |
| Grok3 DeepSearch | DARS | 2025-03-25 | 45.23 | 69.0 | 32.0 |
| Grok3 DeeperSearch | DARS | 2025-04-18 | 44.01 | 80.0 | 31.0 |
| GPT-4o Search Preview | LLM+Search | | 35.76 | 86.0 | 39.0 |

📊 Question Type Performance

Figure: Question type performance chart (static/category.png)

Key Findings: OpenAI Deep Research and Gemini Deep Research significantly outperform the other systems, with 20-30% advantages. All systems perform best on Open Consulting questions, achieving coverage rates above 76%, which validates their strength in exploratory reasoning and innovative research ideation.

Interestingly, high groundedness (citation coverage) does not necessarily correlate with research quality on frontier questions: valuable insights often emerge from creative synthesis rather than explicit source attribution.

If you wish to have your model evaluated on ResearcherBench, please contact the authors for submission guidelines.

📊 Data Explorer

Explore the detailed responses from different DARS systems. Select a model and question ID to view the complete question-response pairs from our evaluation dataset.
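
For scripted exploration (instead of the interactive viewer), a minimal sketch along these lines can load an exported set of question-response pairs and pull out a single record. The file name responses.jsonl and the field names model, question_id, question, and response are assumptions for illustration, not the dataset's actual schema.

```python
import json

# Minimal sketch: load exported question-response pairs and look one up.
# Assumed layout: one JSON object per line with "model", "question_id",
# "question", and "response" fields (hypothetical, not the actual schema).

def load_responses(path="responses.jsonl"):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def find_response(records, model, question_id):
    for record in records:
        if record["model"] == model and record["question_id"] == question_id:
            return record
    return None

if __name__ == "__main__":
    records = load_responses()
    pair = find_response(records, model="OpenAI Deep Research", question_id=1)
    if pair is not None:
        print(pair["question"])
        print(pair["response"][:500])  # preview the first 500 characters
```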

🔬 Evaluation Methodology

ResearcherBench introduces a dual evaluation framework that comprehensively assesses DARS performance through both qualitative insight evaluation and quantitative factual assessment.

Rubric Assessment

  • Expert-designed criteria for each question
  • Insight quality evaluation using domain expertise
  • Weighted coverage scoring based on importance (see the sketch below)
  • Judge model validation (F1=0.8 agreement with experts)
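
To make the weighted coverage scoring concrete, here is a minimal sketch that assumes each expert rubric item carries an importance weight and a binary judge verdict; the data layout and example criteria are illustrative, not the benchmark's actual format.

```python
# Minimal sketch of weighted rubric coverage, assuming each expert criterion
# has an importance weight and a binary "covered" verdict from the judge model.
# The field names and example criteria below are hypothetical.

def rubric_coverage(items):
    """Return the weight-normalized share of rubric criteria the response covers."""
    total_weight = sum(item["weight"] for item in items)
    covered_weight = sum(item["weight"] for item in items if item["covered"])
    return covered_weight / total_weight if total_weight else 0.0

example_rubric = [
    {"criterion": "Identifies the key technical bottleneck", "weight": 3.0, "covered": True},
    {"criterion": "Surveys the most relevant recent literature", "weight": 2.0, "covered": False},
    {"criterion": "Proposes a concrete experimental design", "weight": 1.0, "covered": True},
]
print(f"Coverage: {rubric_coverage(example_rubric):.2%}")  # Coverage: 66.67%
```

In the actual pipeline, the covered verdicts would come from the judge model, which is validated against expert annotations (F1 = 0.8).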

Factual Assessment

  • Automated claim extraction from responses
  • Citation support verification using web retrieval
  • Faithfulness score: accuracy of citations
  • Groundedness score: overall citation coverage (see the sketch below)
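
The sketch below illustrates one plausible reading of the two scores, assuming each extracted claim is tagged with whether it carries a citation and, if cited, whether web retrieval confirmed that the source supports it. It is an illustration of the definitions above, not the benchmark's exact implementation.

```python
# Minimal sketch of the two factual scores, under the assumption that each
# extracted claim records whether it is cited and whether its cited source
# was verified as supporting it (hypothetical fields, for illustration only).

def factual_scores(claims):
    """Return (faithfulness, groundedness) for a list of extracted claims."""
    cited = [claim for claim in claims if claim["cited"]]
    # Faithfulness: of the cited claims, how many are actually supported.
    faithfulness = sum(claim["supported"] for claim in cited) / len(cited) if cited else 0.0
    # Groundedness: of all claims, how many are backed by a citation.
    groundedness = len(cited) / len(claims) if claims else 0.0
    return faithfulness, groundedness

example_claims = [
    {"cited": True, "supported": True},
    {"cited": True, "supported": False},
    {"cited": False, "supported": False},
    {"cited": True, "supported": True},
]
faithfulness, groundedness = factual_scores(example_claims)
print(f"Faithfulness: {faithfulness:.2f}, Groundedness: {groundedness:.2f}")  # 0.67, 0.75
```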

📬 Contact

If you have any questions regarding ResearcherBench, feel free to reach out via email at lupengrui@sjtu.edu.cn or open a GitHub issue.

BibTeX

@misc{xu2025researcherbenchevaluatingdeepai,
  title={ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry},
  author={Tianze Xu and Pengrui Lu and Lyumanshan Ye and Xiangkun Hu and Pengfei Liu},
  year={2025},
  eprint={2507.16280},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2507.16280},
}