SCAT: The Self-Correcting Aesthetic Transformer for Explainable Facial Beauty Prediction
Abstract
Modeling human aesthetic perception is a fundamental challenge in computer vision. While deep learning has significantly advanced Facial Beauty Prediction (FBP), state-of-the-art models suffer from two critical, interlinked limitations: a performance plateau with Pearson Correlation (PC) coefficients seldom exceeding 0.90, and a "black box" nature that offers no insight into their reasoning. We posit that these limitations stem from a failure to emulate the hierarchical, part-based reasoning inherent to human aesthetic judgment. In this work, we propose the Self-Correcting Aesthetic Transformer (SCAT), a novel, explainable-by-design framework that overcomes these challenges. SCAT introduces a two-stage architecture featuring a Semantic Parser to disentangle the face into explicit part embeddings (e.g., eyes, mouth) and a Corrector Aggregator to reason about their harmonious interplay. The model is trained with a novel self-correcting loss that enforces internal consistency between its part-based and holistic evaluations. To facilitate this, we present FBP5500-Subscores, a large-scale dataset with granular part-level aesthetic annotations. Extensive experiments demonstrate that SCAT achieves a new state-of-the-art Pearson Correlation of 0.935, thereby breaking the long-standing performance barrier, while simultaneously providing transparent, human-intelligible predictions. Our work bridges the critical gap between predictive power and interpretability in FBP and suggests a structured reasoning paradigm for other subjective visual assessment tasks.
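The abstract does not spell out the self-correcting loss, but its description (enforcing agreement between the part-based and holistic evaluations) suggests a simple consistency-regularized regression objective. The sketch below is an illustrative reading under stated assumptions, not the authors' formulation: the function name `self_correcting_loss`, the MSE form of both terms, the mean-pooling of part scores, and the weight `alpha` are all hypothetical choices introduced here for clarity.

```python
import torch
import torch.nn.functional as F

def self_correcting_loss(part_scores: torch.Tensor,
                         holistic_score: torch.Tensor,
                         target: torch.Tensor,
                         alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical consistency-regularized objective.

    part_scores:    (B, P) per-part aesthetic scores (e.g., eyes, mouth)
    holistic_score: (B,)   whole-face prediction from the aggregator
    target:         (B,)   ground-truth mean opinion score
    alpha:          weight of the consistency term (assumed hyperparameter)
    """
    # Standard regression term: the holistic prediction should match the label.
    regression = F.mse_loss(holistic_score, target)
    # Consistency term: the aggregate of part-level scores should agree
    # with the holistic prediction, coupling the two evaluation paths.
    consistency = F.mse_loss(part_scores.mean(dim=1), holistic_score)
    return regression + alpha * consistency
```

Under this reading, the consistency term is what makes the model "self-correcting": gradients flow whenever the part-based and holistic branches disagree, pushing both toward a mutually coherent explanation of the final score.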