MS-Adapter: Multi-scaled Adapter for Efficient DeepFake Detection
Abstract
Existing deepfake detection methods rely heavily on low-level forgery patterns, leading to poor performance on unseen forgery types or low-quality images. Recently, Vision Transformers (ViTs) pretrained on large-scale datasets have demonstrated strong generalization across a range of downstream image tasks. However, parameter-efficient fine-tuning methods for ViTs have shown limited effectiveness in deepfake detection, mainly because ViTs rely on high-level semantics while struggling to capture fine-grained local details. To address this issue, this paper proposes MS-Adapter, a multi-scale adapter network for efficient deepfake detection. By embedding multi-scale adapter modules within the pretrained ViT, MS-Adapter progressively extracts and fuses features across multiple scales, from low-level forgery artifacts to high-level semantic forgery patterns. In parallel, a Temporal Aggregation Transformer receives the frame-level features extracted by the multi-scale adapters and models their temporal dependencies to further improve detection performance. Experimental results demonstrate that MS-Adapter achieves superior performance on multiple datasets, including FF++, Celeb-DFv2, and DFDC, while requiring only a small number of trainable parameters.
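To make the architecture described above concrete, the following PyTorch sketch shows one plausible realization of the two components named in the abstract: a multi-scale adapter inserted residually into a frozen ViT block, and a Temporal Aggregation Transformer over per-frame features. The abstract does not specify internals, so the bottleneck width, the parallel depthwise-convolution kernel sizes used to realize "multi-scale", the CLS-token pooling, and all module names here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiScaleAdapter(nn.Module):
    """Hypothetical bottleneck adapter with parallel multi-scale depthwise
    convolutions over the token sequence; kernel sizes and dimensions are
    illustrative, since the paper's abstract does not specify them."""
    def __init__(self, dim=768, bottleneck=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # One depthwise 1-D conv per scale, applied along the token axis.
        self.convs = nn.ModuleList(
            nn.Conv1d(bottleneck, bottleneck, k, padding=k // 2,
                      groups=bottleneck)
            for k in kernel_sizes
        )
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):                    # x: (B, N, dim) ViT tokens
        h = self.act(self.down(x))           # (B, N, bottleneck)
        h = h.transpose(1, 2)                # (B, bottleneck, N) for Conv1d
        h = sum(conv(h) for conv in self.convs) / len(self.convs)  # fuse scales
        h = h.transpose(1, 2)                # (B, N, bottleneck)
        return x + self.up(self.act(h))      # residual adapter update

class TemporalAggregationTransformer(nn.Module):
    """Assumed structure: a lightweight transformer encoder that models
    temporal dependencies across frame-level features, then classifies."""
    def __init__(self, dim=768, depth=2, heads=8, num_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):          # (B, T, dim) per-frame features
        h = self.encoder(frame_feats)
        return self.cls(h.mean(dim=1))       # average over time, classify

if __name__ == "__main__":
    tokens = torch.randn(2, 197, 768)        # token output of one ViT block
    frame_feat = MultiScaleAdapter()(tokens)[:, 0]   # e.g. keep CLS token
    clip = frame_feat.unsqueeze(1).repeat(1, 8, 1)   # dummy 8-frame clip
    logits = TemporalAggregationTransformer()(clip)
    print(logits.shape)                      # torch.Size([2, 2])
```

In a full pipeline, the ViT backbone would be frozen and one such adapter would be attached per transformer block, so only the adapter and temporal-head parameters are trained; this is consistent with the small trainable-parameter budget the abstract reports, though the exact placement is an assumption here.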