Two-Stage Fine-tuning CLIP by Introducing Structure Knowledge for Few-shot Classification
Abstract
Abstract concepts that encode structural information, such as tangrams, are often used in cognitive psychology to study spatial reasoning and visual cognition. Inspired by this, we propose a simple yet effective fine-tuning method called two-stage fine-tuning of contrastive language-image pretraining (TSF-CLIP). In stage I, both CLIP encoders are fine-tuned on an image-text matching task over a tangram dataset, so that structural prior knowledge is captured during fine-tuning. In stage II, to further improve accuracy, a linear head is aligned to the domain of the specific downstream task. The proposed TSF-CLIP not only dynamically integrates structural prior knowledge with semantic information, but also avoids two shortcomings of large models, namely poor spatial reasoning ability and an excessive number of fine-tuned parameters, and it enhances the model's adaptability to different downstream tasks. Experimental results demonstrate that TSF-CLIP substantially boosts the performance of the target model and outperforms existing few-shot image classification approaches. Compared with the original CLIP, the average accuracy of TSF-CLIP improves by 16.1% across 10 image recognition datasets. The code and related datasets can be found at https://github.com/Patrickeroo/TSF-CLIP.
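To make the two-stage procedure concrete, the following is a minimal sketch of how such a pipeline could be implemented, not the authors' released code. It assumes OpenAI's `clip` package and two hypothetical data loaders: a `tangram_loader` yielding (image, caption) pairs for stage I and a `fewshot_loader` yielding (image, label) pairs for stage II.

```python
# Minimal sketch of the two-stage fine-tuning described in the abstract.
# Assumptions (not from the paper): OpenAI's `clip` package, a ViT-B/32 backbone,
# and hypothetical loaders `tangram_loader` (image, caption) / `fewshot_loader` (image, label).
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()  # train in fp32 to avoid fp16 gradient issues

def stage_one(model, tangram_loader, epochs=1, lr=1e-6):
    """Stage I: fine-tune both CLIP encoders on tangram image-text matching."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, captions in tangram_loader:          # (B, 3, H, W), list[str]
            images = images.to(device)
            texts = clip.tokenize(captions).to(device)
            img_feat = F.normalize(model.encode_image(images), dim=-1)
            txt_feat = F.normalize(model.encode_text(texts), dim=-1)
            logits = model.logit_scale.exp() * img_feat @ txt_feat.t()
            labels = torch.arange(len(images), device=device)
            # symmetric contrastive loss over matched image-text pairs
            loss = (F.cross_entropy(logits, labels) +
                    F.cross_entropy(logits.t(), labels)) / 2
            opt.zero_grad(); loss.backward(); opt.step()

def stage_two(model, fewshot_loader, num_classes, epochs=10, lr=1e-3):
    """Stage II: freeze the encoders and fit a linear head on the downstream task."""
    for p in model.parameters():
        p.requires_grad_(False)
    head = nn.Linear(model.visual.output_dim, num_classes).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in fewshot_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feat = F.normalize(model.encode_image(images), dim=-1)
            loss = F.cross_entropy(head(feat), labels)
            opt.zero_grad(); loss.backward(); opt.step()
    return head
```

In this sketch, stage I updates all encoder parameters with the standard symmetric contrastive objective on the tangram pairs, while stage II keeps the encoders frozen and only trains the task-specific linear head; the exact losses, schedules, and hyperparameters used by TSF-CLIP are given in the paper and its repository.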