Step-Wise Approximation of CBOW Reduces Hallucinations in Tail Cases

Abstract

This paper introduces a cognitively inspired approach to word representation called step-wise approximation of embeddings, which bridges the gap between static word embeddings and fully contextualized language-model outputs. Traditional embedding models such as Word2Vec assign a single vector to each word, failing to account for polysemy and context-dependent meaning. In contrast, large language models produce a distinct embedding for every token instance, but at the cost of interpretability and computational efficiency. We propose modeling embeddings as piecewise-constant approximations that evolve in discrete semantic steps across contexts, so that a word is represented by a finite set of context-sensitive vectors capturing its different senses or usage patterns. We formalize the approximation process using entropy-minimizing segmentation and demonstrate its application in a continuous Word2Vec setting that handles context shifts smoothly. Our experiments show that this method improves representation quality for tail entities (words with limited training frequency), yielding up to a 5% improvement on question-answering tasks within a retrieval-augmented generation (RAG) framework. These results suggest that step-wise approximation offers a computationally efficient and interpretable alternative to contextual embeddings, with particular benefits for underrepresented vocabulary.
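The abstract describes the core mechanism, representing a word by a small set of piecewise-constant vectors obtained by segmenting its stream of contexts, but gives no implementation details. The sketch below is a minimal, hypothetical illustration in Python/NumPy, not the authors' code: it substitutes within-segment scatter (squared deviation from the segment mean) for the paper's unspecified entropy criterion, assumes each word occurrence is summarized by one context vector (e.g., the mean of its CBOW window embeddings), and uses illustrative names such as `stepwise_approximation`.

```python
# Sketch of step-wise (piecewise-constant) embedding approximation.
# Assumptions (not from the paper): contexts[i] is the CBOW-style context
# vector of the word's i-th occurrence, and the segmentation objective is
# within-segment scatter as a stand-in for entropy minimization.
import numpy as np

def segment_cost(contexts: np.ndarray) -> float:
    """Within-segment scatter: total squared deviation from the segment mean."""
    centered = contexts - contexts.mean(axis=0)
    return float((centered ** 2).sum())

def stepwise_approximation(contexts: np.ndarray, k: int):
    """Split a sequence of context vectors into k contiguous segments that
    minimize total within-segment scatter (exact dynamic programming), and
    return one constant "sense" vector per segment."""
    n = len(contexts)
    # cost[(i, j)] = scatter of contexts[i:j] (j exclusive).
    cost = {(i, j): segment_cost(contexts[i:j])
            for i in range(n) for j in range(i + 1, n + 1)}
    # dp[m][j] = best cost of covering contexts[:j] with m segments.
    dp = np.full((k + 1, n + 1), np.inf)
    dp[0][0] = 0.0
    back = np.zeros((k + 1, n + 1), dtype=int)
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = dp[m - 1][i] + cost[(i, j)]
                if c < dp[m][j]:
                    dp[m][j] = c
                    back[m][j] = i
    # Recover segment boundaries and the piecewise-constant representatives.
    bounds, j = [], n
    for m in range(k, 0, -1):
        i = back[m][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    vectors = [contexts[i:j].mean(axis=0) for i, j in bounds]
    return bounds, vectors

# Toy usage: 2-D context vectors drifting between two synthetic "senses".
rng = np.random.default_rng(0)
contexts = np.vstack([rng.normal([0, 0], 0.1, (20, 2)),
                      rng.normal([3, 3], 0.1, (20, 2))])
bounds, vectors = stepwise_approximation(contexts, k=2)
print(bounds)  # expect a boundary at index 20, one vector per sense
```

Under these assumptions, the result is exactly the "finite set of context-sensitive vectors" the abstract refers to: each segment's mean acts as one step of the piecewise-constant embedding, and a lookup by segment replaces a full contextual encoder at inference time.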
