Corpus linguistics with AI copilots: Large language models as first-pass filters in data annotation

Vítor Míguez

Published on Jul 23, 2025

Abstract

Manual annotation of corpus examples is usually the most time-consuming phase of a corpus-based linguistic study. Recent advances in AI open up the possibility of automating corpus annotation through accessible, easy-to-use large language models. Research on applications of large language models to corpus linguistics is still limited, and methodological guidelines have not yet been established. This paper focuses on the use of large language models as first-pass filters in corpus annotation, using as a case study the semantic disambiguation of the Galician noun pobo ‘people/village’. A total of 300 examples were annotated by three human coders and two large language models (Claude 4 Sonnet and Claude 4 Opus), using a static, single-phase prompting approach. Since the goal of a first-pass filter should be to capture as many actual occurrences of the target linguistic phenomenon as possible, priority was given to recall over precision. Accordingly, the paper argues for validating models with F2, a recall-focused metric, rather than the commonly used F1 and Matthews correlation coefficient. Human inter-annotator agreement was substantial (Fleiss’ κ = 0.656). Claude 4 Opus with pretraining examples achieved the best performance against the human consensus (F2 = 0.857, recall = 97.3%), resulting in a workload reduction of nearly 80% with minimal information loss. The study demonstrates the viability of employing large language models as first-pass filters in corpus linguistics without sacrificing scientific rigor. It also highlights the importance of transparent methodological practices and the need for broader empirical validation.
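The preference for F2 over F1 stated in the abstract can be illustrated with a minimal sketch. It uses the standard F-beta definition, F_β = (1 + β²)·TP / ((1 + β²)·TP + β²·FN + FP); the function name `f_beta` and the confusion counts are hypothetical and chosen only for illustration, not taken from the paper's data.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from confusion counts.

    beta > 1 weights recall more heavily than precision;
    beta = 1 gives the familiar F1, beta = 2 gives F2.
    """
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts (NOT from the study): two filters that find the same
# number of true positives but make opposite kinds of errors.
high_recall = dict(tp=9, fp=6, fn=1)     # recall 0.90, precision 0.60
high_precision = dict(tp=9, fp=1, fn=6)  # recall 0.60, precision 0.90

# F1 treats the two filters as equally good.
print(f_beta(**high_recall, beta=1.0))     # 0.72
print(f_beta(**high_precision, beta=1.0))  # 0.72

# F2 rewards the filter that misses fewer true occurrences.
print(round(f_beta(**high_recall), 3))     # 0.818
print(round(f_beta(**high_precision), 3))  # 0.643
```

In this toy case F1 cannot separate the two filters, whereas F2 ranks the high-recall filter first, which matches the stated goal of a first-pass filter: false positives can be discarded in later manual review, but missed occurrences are lost.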
