Cleanifier: Contamination removal from microbial sequences using spaced seeds of a human pangenome index
Abstract
Motivation: The first step when working with DNA sequence data from human-derived microbiomes is usually to remove human contamination, for two reasons. First, many countries have strict privacy and data protection guidelines for human sequence data, so microbiome data that still contains human sequences cannot easily be processed further or published. Second, human contamination may cause problems in downstream analysis steps, such as genome assembly and binning. Fast and accurate removal of human contamination is therefore critical for large-scale metagenomics projects.
Results: We introduce Cleanifier, a fast, memory-frugal, alignment-free tool for detecting and removing human contamination based on gapped k-mers, or spaced seeds. Cleanifier uses a pangenome index of all human gapped k-mers, but alternative references can also be created and used. Reads are filtered according to which of their gapped k-mers are present in the index. Cleanifier supports two filtering modes: one that queries all gapped k-mers of a read and one that queries only a sample of them. A comparison with other state-of-the-art tools shows that the sampling mode makes Cleanifier the fastest method while achieving comparable accuracy. Because the gapped k-mers are stored in a probabilistic Cuckoo filter, Cleanifier has memory requirements similar to those of methods that use a minimizer index. At the same time, Cleanifier is more flexible, because it can apply different sampling methods to the same index.
Availability and Implementation: Cleanifier is available via GitLab (https://gitlab.com/rahmannlab/cleanifier), PyPI (https://pypi.org/project/cleanifier/) and Bioconda (https://anaconda.org/bioconda/cleanifier). The pre-computed human pangenome index is available for download at https://doi.org/10.5281/zenodo.15639519.
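To illustrate the idea of filtering reads by gapped k-mers, the following is a minimal sketch and not Cleanifier's actual implementation. The mask `##_##_##`, the function names, the decision threshold, and the `step` parameter used for sampling are all assumptions made for illustration. A plain Python `set` stands in for the probabilistic Cuckoo filter, and reverse complements are ignored for brevity.

```python
from typing import Iterator, Set

# Hypothetical mask: '#' marks a significant position, '_' an ignored one.
MASK = "##_##_##"


def gapped_kmers(seq: str, mask: str = MASK, step: int = 1) -> Iterator[str]:
    """Yield the gapped k-mer of every `step`-th window of `seq`."""
    keep = [i for i, c in enumerate(mask) if c == "#"]
    for start in range(0, len(seq) - len(mask) + 1, step):
        window = seq[start:start + len(mask)]
        yield "".join(window[i] for i in keep)


def build_index(reference: str, mask: str = MASK) -> Set[str]:
    """Collect all gapped k-mers of a reference (exact set, not a Cuckoo filter)."""
    return set(gapped_kmers(reference, mask))


def is_contaminated(read: str, index: Set[str], mask: str = MASK,
                    step: int = 1, threshold: float = 0.5) -> bool:
    """Flag a read if at least `threshold` of its queried gapped k-mers hit the index.

    step=1 queries every gapped k-mer; step>1 queries only a sample of them.
    The threshold rule is an assumption for this sketch.
    """
    hits = total = 0
    for kmer in gapped_kmers(read, mask, step):
        total += 1
        hits += kmer in index
    return total > 0 and hits / total >= threshold


reference = "ACGTTGCAAGCTTAGGCATC"  # toy stand-in for a human reference
index = build_index(reference)
```

In this sketch, a read such as `ACATTGCA` differs from the reference only at a position the mask ignores, so it yields the same gapped k-mer as the reference window `ACGTTGCA` and is still flagged. Setting `step` above 1 trades some sensitivity for fewer index lookups, which is the general idea behind a sampling mode.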