🔬 ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry

Tianze Xu*, Pengrui Lu*, Lyumanshan Ye, Xiangkun Hu, Pengfei Liu†
Shanghai Jiao Tong University, SII, GAIR
*Equal contribution, †Corresponding author
Figure: Framework overview of ResearcherBench

🚀 Brief Introduction

The emergence of deep research systems brings significant problem-solving capabilities, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced agentic systems, which we refer to as Deep AI Research Systems (DARS), on frontier AI scientific questions.

Figure: Dataset distribution
  • Novel Task Collection: We present 65 high-quality research questions sourced from authentic frontier scenarios across 35 distinct AI research subjects, categorized into three types: Technical Details, Literature Review, and Open Consulting.
  • Dual Evaluation Framework: Our assessment methodology combines expert-designed qualitative evaluation criteria (Rubric Assessment) with quantitative factual evaluation metrics (Faithfulness and Groundedness scores).
  • Comprehensive Analysis: We conduct extensive evaluation of leading commercial DARS platforms, revealing both their current capabilities and fundamental limitations in frontier research assistance.
  • Open-Source Contribution: We release ResearcherBench as an open-source resource to establish a standardized platform for advancing AI research assistance capabilities.

Through ResearcherBench, we aim to foster a new perspective in AI research evaluation, focusing on the depth of understanding and insight generation rather than the breadth of information coverage.

🏆 Leaderboard

We evaluate various Deep AI Research Systems (DARS), including leading commercial platforms and baseline systems with web search capabilities. Our evaluation covers three question types (Technical Details, Literature Review, and Open Consulting) across 35 AI research subjects.

The evaluation employs our dual assessment framework: Rubric Assessment measures insight quality using expert-designed criteria, while Factual Assessment evaluates citation accuracy (Faithfulness) and coverage (Groundedness). All experiments were conducted between March and June 2025 to ensure temporal consistency.

Note: Results show that DARS systems demonstrate particular strength in Open Consulting questions, suggesting their potential as innovative research ideation partners rather than as guides for precise technical implementation.

| Model | Type | Date | Coverage | Faithfulness | Groundedness |
|-------|------|------|----------|--------------|--------------|
| OpenAI Deep Research | DARS | 2025-03-24 | 70.32 | 84.0 | 34.0 |
| Gemini Deep Research | DARS | 2025-04-15 | 69.29 | 86.0 | 59.0 |
| Perplexity: Sonar Reasoning Pro | LLM+Search | | 46.63 | 62.0 | 68.0 |
| Perplexity Deep Research | DARS | 2025-03-24 | 48.46 | 85.0 | 56.0 |
| Grok3 DeepSearch | DARS | 2025-03-25 | 45.23 | 69.0 | 32.0 |
| Grok3 DeeperSearch | DARS | 2025-04-18 | 44.01 | 80.0 | 31.0 |
| GPT-4o Search Preview | LLM+Search | | 35.76 | 86.0 | 39.0 |

📊 Question Type Performance

Figure: Question type performance chart (static/category.png)

Key Findings: OpenAI Deep Research and Gemini Deep Research significantly outperform the other systems, with 20-30% advantages. All systems perform best on Open Consulting questions, achieving coverage rates above 76%, which validates their strength in exploratory reasoning and innovative research ideation.

Interestingly, high groundedness (citation coverage) does not necessarily correlate with research quality on frontier questions: valuable insights often emerge from creative synthesis rather than explicit source attribution.

If you wish to have your model evaluated on ResearcherBench, please contact the authors for submission guidelines.

📊 Data Explorer

Explore the detailed responses from different DARS systems. Select a model and question ID to view the complete question-response pairs from our evaluation dataset.
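
For scripted exploration (instead of the interactive viewer), a minimal sketch along these lines can load an exported set of question-response pairs and pull out a single record. The file name responses.jsonl and the field names model, question_id, question, and response are assumptions for illustration, not the dataset's actual schema.

```python
import json

# Minimal sketch: load exported question-response pairs and look one up.
# Assumed layout: one JSON object per line with "model", "question_id",
# "question", and "response" fields (hypothetical, not the actual schema).

def load_responses(path="responses.jsonl"):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def find_response(records, model, question_id):
    for record in records:
        if record["model"] == model and record["question_id"] == question_id:
            return record
    return None

if __name__ == "__main__":
    records = load_responses()
    pair = find_response(records, model="OpenAI Deep Research", question_id=1)
    if pair is not None:
        print(pair["question"])
        print(pair["response"][:500])  # preview the first 500 characters
```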

🔬 Evaluation Methodology

ResearcherBench introduces a dual evaluation framework that comprehensively assesses DARS performance through both qualitative insight evaluation and quantitative factual assessment.

Rubric Assessment

  • Expert-designed criteria for each question
  • Insight quality evaluation using domain expertise
  • Weighted coverage scoring based on importance (see the sketch below)
  • Judge model validation (F1=0.8 agreement with experts)
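
To make the weighted coverage scoring concrete, here is a minimal sketch that assumes each expert rubric item carries an importance weight and a binary judge verdict; the data layout and example criteria are illustrative, not the benchmark's actual format.

```python
# Minimal sketch of weighted rubric coverage, assuming each expert criterion
# has an importance weight and a binary "covered" verdict from the judge model.
# The field names and example criteria below are hypothetical.

def rubric_coverage(items):
    """Return the weight-normalized share of rubric criteria the response covers."""
    total_weight = sum(item["weight"] for item in items)
    covered_weight = sum(item["weight"] for item in items if item["covered"])
    return covered_weight / total_weight if total_weight else 0.0

example_rubric = [
    {"criterion": "Identifies the key technical bottleneck", "weight": 3.0, "covered": True},
    {"criterion": "Surveys the most relevant recent literature", "weight": 2.0, "covered": False},
    {"criterion": "Proposes a concrete experimental design", "weight": 1.0, "covered": True},
]
print(f"Coverage: {rubric_coverage(example_rubric):.2%}")  # Coverage: 66.67%
```

In the actual pipeline, the covered verdicts would come from the judge model, which is validated against expert annotations (F1 = 0.8).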

Factual Assessment

  • Automated claim extraction from responses
  • Citation support verification using web retrieval
  • Faithfulness score: accuracy of citations
  • Groundedness score: overall citation coverage (see the sketch below)
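
The sketch below illustrates one plausible reading of the two scores, assuming each extracted claim is tagged with whether it carries a citation and, if cited, whether web retrieval confirmed that the source supports it. It is an illustration of the definitions above, not the benchmark's exact implementation.

```python
# Minimal sketch of the two factual scores, under the assumption that each
# extracted claim records whether it is cited and whether its cited source
# was verified as supporting it (hypothetical fields, for illustration only).

def factual_scores(claims):
    """Return (faithfulness, groundedness) for a list of extracted claims."""
    cited = [claim for claim in claims if claim["cited"]]
    # Faithfulness: of the cited claims, how many are actually supported.
    faithfulness = sum(claim["supported"] for claim in cited) / len(cited) if cited else 0.0
    # Groundedness: of all claims, how many are backed by a citation.
    groundedness = len(cited) / len(claims) if claims else 0.0
    return faithfulness, groundedness

example_claims = [
    {"cited": True, "supported": True},
    {"cited": True, "supported": False},
    {"cited": False, "supported": False},
    {"cited": True, "supported": True},
]
faithfulness, groundedness = factual_scores(example_claims)
print(f"Faithfulness: {faithfulness:.2f}, Groundedness: {groundedness:.2f}")  # 0.67, 0.75
```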

📬 Contact

If you have any questions regarding ResearcherBench, feel free to reach out via email at lupengrui@sjtu.edu.cn or open a GitHub issue.

BibTeX

@misc{xu2025researcherbenchevaluatingdeepai,
  title={ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry},
  author={Tianze Xu and Pengrui Lu and Lyumanshan Ye and Xiangkun Hu and Pengfei Liu},
  year={2025},
  eprint={2507.16280},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2507.16280},
}