LLM-Based Persona-Driven Text Data Augmentation
Abstract
The rise of drug-related crime in South Korea, especially over online messengers, exposes the limits of keyword-based and network-tracking detection methods. To address this in a low-resource setting, we propose a persona-driven data-augmentation framework built on a large language model (LLM). Buyer and seller personas reproduce authentic linguistic patterns, slang, and delivery practices, yielding realistic, context-rich dialogues. Evaluating with text-embedding similarity, type–token ratio (TTR), perplexity, dialogue coherence, and ROUGE, we show that 15,000 augmented dialogues closely mirror 87 real conversations while increasing lexical variety and contextual consistency. The results confirm that persona-driven augmentation mitigates data scarcity and improves illicit-dialogue detectors, offering a transferable strategy for other sensitive, low-data domains such as voice phishing or fraudulent trade.
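Of the evaluation metrics named above, the type–token ratio is the simplest to state: the number of distinct tokens divided by the total number of tokens. The following is a minimal sketch of that definition, not the authors' implementation. The function name `type_token_ratio` and the regex-based tokenizer are assumptions made for illustration; Korean text would more plausibly be tokenized with a morphological analyzer, which the abstract does not specify.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct tokens / total tokens for a piece of text.

    Tokenization here is an assumption: lowercase the text and take
    runs of word characters (Python's \\w is Unicode-aware, so Hangul
    syllables count as word characters).
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        # Avoid division by zero on empty or punctuation-only input.
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Five tokens, four distinct ("the" repeats).
    print(type_token_ratio("the cat saw the dog"))
```

A higher value means less repetition. Because TTR tends to fall as texts get longer, comparing a large augmented corpus against a small real one is usually done on samples of similar length.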