Two-Stage Fine-tuning CLIP by Introducing Structure Knowledge for Few-shot Classification
Abstract
Abstract concepts that encode structural information, such as tangrams, are often used in cognitive psychology to study spatial reasoning and visual cognition. Inspired by this, we propose a simple yet effective fine-tuning method called two-stage fine-tuning of contrastive language-image pretraining (TSF-CLIP). In stage I, both CLIP encoders are fine-tuned on an image-text matching task over a tangram dataset, so that structural prior knowledge is captured during fine-tuning. In stage II, to further improve accuracy, a linear head is aligned to the domain of the specific downstream task. The proposed TSF-CLIP not only dynamically integrates structural prior knowledge with semantic information, but also avoids two shortcomings of large models, namely poor spatial reasoning ability and an excessive number of fine-tuned parameters, and it enhances the model's adaptability to different downstream tasks. Experimental results demonstrate that TSF-CLIP substantially boosts the performance of the target model and outperforms existing few-shot image classification approaches. Compared with the original CLIP, the average accuracy of TSF-CLIP improves by 16.1% across 10 image recognition datasets. The code and related datasets can be found at https://github.com/Patrickeroo/TSF-CLIP.
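To make the two-stage procedure concrete, the following is a minimal sketch of how such a pipeline could be implemented, not the authors' released code. It assumes OpenAI's `clip` package and two hypothetical data loaders: a `tangram_loader` yielding (image, caption) pairs for stage I and a `fewshot_loader` yielding (image, label) pairs for stage II.

```python
# Minimal sketch of the two-stage fine-tuning described in the abstract.
# Assumptions (not from the paper): OpenAI's `clip` package, a ViT-B/32 backbone,
# and hypothetical loaders `tangram_loader` (image, caption) / `fewshot_loader` (image, label).
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()  # train in fp32 to avoid fp16 gradient issues

def stage_one(model, tangram_loader, epochs=1, lr=1e-6):
    """Stage I: fine-tune both CLIP encoders on tangram image-text matching."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, captions in tangram_loader:          # (B, 3, H, W), list[str]
            images = images.to(device)
            texts = clip.tokenize(captions).to(device)
            img_feat = F.normalize(model.encode_image(images), dim=-1)
            txt_feat = F.normalize(model.encode_text(texts), dim=-1)
            logits = model.logit_scale.exp() * img_feat @ txt_feat.t()
            labels = torch.arange(len(images), device=device)
            # symmetric contrastive loss over matched image-text pairs
            loss = (F.cross_entropy(logits, labels) +
                    F.cross_entropy(logits.t(), labels)) / 2
            opt.zero_grad(); loss.backward(); opt.step()

def stage_two(model, fewshot_loader, num_classes, epochs=10, lr=1e-3):
    """Stage II: freeze the encoders and fit a linear head on the downstream task."""
    for p in model.parameters():
        p.requires_grad_(False)
    head = nn.Linear(model.visual.output_dim, num_classes).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in fewshot_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feat = F.normalize(model.encode_image(images), dim=-1)
            loss = F.cross_entropy(head(feat), labels)
            opt.zero_grad(); loss.backward(); opt.step()
    return head
```

In this sketch, stage I updates all encoder parameters with the standard symmetric contrastive objective on the tangram pairs, while stage II keeps the encoders frozen and only trains the task-specific linear head; the exact losses, schedules, and hyperparameters used by TSF-CLIP are given in the paper and its repository.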