Movi Color: fast and accurate long-read classification with the move structure

Abstract

The number of reference genomes is rapidly increasing, thanks to advances in long-read sequencing and assembly. While these collections can improve the sensitivity and specificity of classification methods, exploiting them requires highly efficient compressed indexes. K-mer-based approaches like Kraken 2 are efficient but limit the analysis to a fixed k-mer length. The right length is hard for the user to choose ahead of time, and suboptimal settings can harm sensitivity and specificity. Methods that use compressed full-text indexes, like SPUMONI2 and Cliffy, lift this constraint but are less efficient than k-mer-based tools. Further, these methods either cannot report a full listing of the genomes where a match occurs or cannot scale to large reference databases.

We propose new methods and algorithms that use compressed full-text indexes to enable multi-class and taxonomic classification. Unlike past compressed-indexing methods for classification, ours uses the move structure, which is extremely fast thanks to its locality of reference. Our method, called Movi Color, augments the main table of the Movi index. Specifically, Movi Color assigns a “color” to each run of the Burrows-Wheeler Transform according to the subset of genomes from which the run's suffixes originated. When the reference is highly repetitive, as is typical when indexing pangenomes or reference databases, only certain colors occur, creating opportunities to compress the index. For species-level classification, Movi Color achieves over 1.6× higher precision and 2× higher recall than Kraken 2 and Metabuli. At the genus level, it achieves 70% higher precision and 80% higher recall. Movi Color’s read processing time is 7-20× faster than Metabuli's and is comparable to that of Kraken 2. Although Movi Color uses more memory than both Kraken 2 and Metabuli, its speed-accuracy trade-off makes it well-suited for real-time or high-throughput scenarios.
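
The run-coloring idea can be illustrated with a toy sketch. This is not the Movi Color implementation: it builds the suffix array by naive sorting, uses "$" as an assumed terminator for each genome, and takes a run's color to be the set of genomes owning the suffixes at the run's BWT positions, which is one reading of the description above. The function name `color_bwt_runs` and all variable names are hypothetical.

```python
def color_bwt_runs(genomes):
    """Toy illustration: color each BWT run by the set of genomes it touches.

    Assumptions (not from the paper): genomes are plain strings without "$",
    each genome is terminated by "$", and the suffix array is built naively.
    """
    # Concatenate the genomes and remember which genome owns each text position.
    text = "".join(g + "$" for g in genomes)
    doc_of = [i for i, g in enumerate(genomes) for _ in range(len(g) + 1)]
    n = len(text)

    # Naive suffix array; fine for a toy input, far too slow for real genomes.
    sa = sorted(range(n), key=lambda i: text[i:])

    # BWT character = character preceding each suffix (index -1 wraps around).
    bwt = [text[i - 1] for i in sa]

    runs = []        # (character, run length, color id)
    color_ids = {}   # frozenset of genome ids -> color id
    start = 0
    for j in range(1, n + 1):
        # Close the current run at the end of the BWT or when the character changes.
        if j == n or bwt[j] != bwt[start]:
            docs = frozenset(doc_of[sa[k]] for k in range(start, j))
            color = color_ids.setdefault(docs, len(color_ids))
            runs.append((bwt[start], j - start, color))
            start = j
    return runs, color_ids


if __name__ == "__main__":
    runs, colors = color_bwt_runs(["ACGT", "ACGT", "ACGA"])
    print(f"{len(runs)} runs, {len(colors)} distinct colors")
```

In this sketch, the number of distinct colors can never exceed the number of runs, and near-identical genomes tend to share colors, which is the kind of redundancy the abstract describes as an opportunity for compression.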
