Using Elicit AI research assistant for data extraction in systematic reviews: a feasibility study across environmental and life sciences


Abstract

Data extraction in systematic reviews, maps and meta-analyses is time-consuming and prone to human error or subjective judgment. Large language models offer potential for automating this process, yet their performance has been evaluated on only a limited range of platforms, disciplines, and review types. We assessed the performance of the Elicit platform across diverse data extraction tasks using journal articles from seven systematic-like reviews in the life and environmental sciences. Human-extracted data served as the gold standard. For each review, we used eight articles for prompt development and another eight for testing. Initial prompts were iteratively refined until they exceeded 87% accuracy or for up to five rounds. We then tested extraction accuracy, reproducibility across user accounts, and the effect of Elicit's high-accuracy mode. Of 90 considered prompts, 70 exceeded the 87% accuracy threshold when compared against gold-standard values, but accuracy tended to be lower when the prompts were applied to a new set of articles. Repeating data extractions with different Elicit user accounts resulted in 90% agreement on extracted values, though supporting quotes and reasoning matched in only 46% and 30% of cases, respectively. In high-accuracy mode, value matches dropped to 77%, with just 10% quote matches and 0% reasoning matches. Extraction accuracy did not differ across data types. Elicit also helped identify eight (<1%) errors in the gold-standard data. Our results show that Elicit can complement, but not replace, human data extractors. Elicit may be best used as a secondary reviewer and to evaluate the clarity of data extraction protocols. Prompts must be fine-tuned and independently validated.
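As an illustration of the kind of scoring described above, the following is a minimal sketch (not the authors' code) of how per-field agreement between Elicit extractions and human gold-standard values could be computed; the field names, the normalisation rule, and the example records are assumptions for illustration only.

    # Sketch: percentage agreement between Elicit output and gold-standard data.
    # All field names and the normalisation step are illustrative assumptions.

    def normalise(value: str) -> str:
        """Lower-case and collapse whitespace so trivial formatting differences do not count as mismatches."""
        return " ".join(str(value).lower().split())

    def agreement(elicit_rows: list[dict], gold_rows: list[dict], field: str) -> float:
        """Percentage of articles where the Elicit value matches the gold-standard value for one field."""
        matches = sum(
            normalise(e[field]) == normalise(g[field])
            for e, g in zip(elicit_rows, gold_rows)
        )
        return 100 * matches / len(gold_rows)

    # Example: score the extracted value, supporting quote, and reasoning separately,
    # mirroring the abstract's style of reporting (e.g. 90% / 46% / 30%).
    elicit = [{"value": "12.5", "quote": "mean was 12.5", "reasoning": "reported in Table 2"}]
    gold   = [{"value": "12.5", "quote": "the mean was 12.5", "reasoning": "Table 2"}]
    for field in ("value", "quote", "reasoning"):
        print(field, agreement(elicit, gold, field))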
