Comparing Large Language Models for Text Classification: Model Selection Across Tasks, Texts, and Languages

Abstract

Large-scale text analysis has grown rapidly as an analytic method in the social sciences and beyond in recent years, and recent advances in large language models (LLMs) have made automated text annotation increasingly viable. This paper focuses on the comparative viability of closed-source and open-source LLMs for text annotation, testing the performance of 32 different LLMs in text classification across a range of tasks, text types, and languages. Using data in seven languages across 10 country contexts, the results show considerable variation in model performance, highlighting that researchers should carefully consider model selection as part of their LLM-centered classification strategy. In general, the closed-source GPT-4 exhibits relatively strong performance across all classification tasks, while open-source alternatives such as Llama3 and Qwen2.5 show similar or even superior performance on select tasks. Many smaller open-source models, however, perform relatively poorly on more complex and non-English coding tasks. The trade-offs inherent in the use of each model are therefore highlighted to allow researchers to make informed decisions about model selection on a task-by-task basis.
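To make the benchmarking setup described above concrete, the sketch below shows one minimal way such a model comparison can be run: each model receives a zero-shot classification prompt for every text and is scored with macro F1 against gold labels. The dataset, label set, prompt wording, and model list are hypothetical placeholders, not the paper's actual data, prompts, or models; the harness assumes an OpenAI-compatible chat-completions endpoint.

```python
# Hypothetical sketch of a zero-shot LLM classification benchmark.
# Assumes the `openai` and `scikit-learn` packages and an OPENAI_API_KEY
# in the environment; labels, texts, and models are illustrative only.
from collections import defaultdict

from openai import OpenAI
from sklearn.metrics import f1_score

client = OpenAI()

LABELS = ["positive", "negative", "neutral"]   # hypothetical label set
MODELS = ["gpt-4", "gpt-3.5-turbo"]            # any chat-completions models

# Hypothetical annotated examples: (text, gold label).
DATA = [
    ("The new policy was widely praised.", "positive"),
    ("Critics called the decision a failure.", "negative"),
    ("The committee will meet again next week.", "neutral"),
]

def classify(model: str, text: str) -> str:
    """Ask the model for a single label; fall back to 'neutral' on odd output."""
    prompt = (
        f"Classify the following text as one of {LABELS}. "
        f"Answer with the label only.\n\nText: {text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "neutral"

scores = defaultdict(float)
for model in MODELS:
    gold = [label for _, label in DATA]
    pred = [classify(model, text) for text, _ in DATA]
    scores[model] = f1_score(gold, pred, average="macro")  # macro F1 per model

for model, f1 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{model}: macro F1 = {f1:.2f}")
```

Because many open-source models such as Llama3 and Qwen2.5 can be served behind OpenAI-compatible endpoints (for example via vLLM or Ollama), the same harness can compare closed-source and open-source models on a task-by-task basis by pointing the client at a different base URL.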
