Sequence Modeling Is Not Evolutionary Reasoning
Abstract
Protein language models (PLMs) are commonly assumed to capture evolutionary information by training on large protein sequence datasets. However, it remains unclear whether PLMs can reason about evolution, that is, infer evolutionary relationships between protein sequences. To test this capability, we introduce a benchmark for evolutionary reasoning and find that existing PLMs consistently fail to recover phylogenetic structure, despite strong performance on standard tasks such as masked token prediction and contact prediction. To address this limitation, we present Phyla. Phyla introduces a hybrid state-space and transformer architecture that jointly processes multiple sequences and is trained using a tree-based objective over 3,000 phylogenies spanning diverse protein families. Phyla achieves state-of-the-art performance in evolutionary reasoning, outperforming the next-best model by 13% on tree reconstruction and 10% on taxonomic clustering. Beyond synthetic benchmarks, Phyla applies to real-world settings: it reconstructs biologically accurate branches of the tree of life and infers whole-genome evolutionary relationships from Mycobacterium tuberculosis genomes. These findings suggest that evolutionary reasoning is not an emergent property of large-scale sequence modeling. Instead, Phyla shows that models trained with phylogenetic supervision can reason about evolution more effectively, offering a biologically grounded path toward evolutionary foundation models.
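The abstract describes the architecture only at a high level. As a rough illustration of the general idea, the sketch below shows one way a hybrid state-space and transformer block might jointly process multiple sequences: a simplified state-space layer models residues within each sequence, and an attention layer then operates across the pooled sequence embeddings. All class and parameter names here are hypothetical and are not taken from the Phyla implementation.

# A minimal, hypothetical sketch of a hybrid state-space + transformer
# block in the spirit of the description above, NOT the authors' code.
import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    """Toy diagonal linear state-space layer applied along the residue axis."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_model, d_state) * 0.1)  # state decay
        self.B = nn.Parameter(torch.randn(d_model, d_state) * 0.1)  # input projection
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.1)  # output projection

    def forward(self, x):  # x: (batch, length, d_model)
        batch, length, d_model = x.shape
        h = torch.zeros(batch, d_model, self.A.shape[1], device=x.device)
        decay = torch.sigmoid(self.A)  # keep the recurrence stable in (0, 1)
        ys = []
        for t in range(length):
            h = decay * h + self.B * x[:, t, :, None]  # recurrent state update
            ys.append((h * self.C).sum(-1))            # read state back out to d_model
        return torch.stack(ys, dim=1)                  # (batch, length, d_model)

class HybridBlock(nn.Module):
    """SSM over residues within each sequence, attention across sequences."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.ssm = SimpleSSM(d_model)
        self.attn = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, seqs):  # seqs: (n_seqs, length, d_model)
        per_residue = self.ssm(seqs)      # within-sequence modeling
        pooled = per_residue.mean(dim=1)  # one embedding per sequence
        # Treat the set of sequences as one token sequence for cross-sequence attention.
        return self.attn(pooled.unsqueeze(0)).squeeze(0)  # (n_seqs, d_model)

# Usage: embed 8 sequences of length 32. In a setup like the one described,
# the resulting per-sequence embeddings could feed a tree-based objective,
# e.g. matching pairwise embedding distances to phylogenetic distances.
embeddings = HybridBlock()(torch.randn(8, 32, 64))
print(embeddings.shape)  # torch.Size([8, 64])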