Ultra-deep long-read metagenomics captures diverse taxonomic and biosynthetic potential of soil microbes
Abstract
Background
Soil ecosystems have long been recognised as hotspots of microbial diversity, but most estimates of their microbial and functional complexity remain speculative despite decades of study, in part because conventional sequencing campaigns lack the depth and contiguity required to recover low-abundance and repetitive genomes. Here, we revisit this question using one of the deepest metagenomic sequencing efforts to date, applying 148 billion base pairs of Nanopore long-read and 122 billion base pairs of Illumina short-read data to a single forest soil sample.
Results
Our hybrid assembly reconstructed 837 metagenome-assembled genomes, including 466 that meet high- and medium-quality standards, nearly all lacking close relatives among cultivated taxa. Rarefaction and k-mer analyses reveal that, even at this depth, we capture only a fraction of the extant diversity: non-parametric models project that more than ten trillion base pairs of sequencing data would be required to approach saturation. These findings offer a quantitative, technology-enabled update to long-standing diversity estimates and demonstrate that conventional metagenomic sequencing efforts likely miss the majority of microbial and biosynthetic potential in soil. We further identify more than 11 000 biosynthetic gene clusters, over 99% of which have no match in current databases, underscoring the breadth of unexplored metabolic capacity.
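The abstract does not name the non-parametric models it uses, so the following is only a minimal sketch of two textbook non-parametric estimators that convey the idea behind such projections: Good's sample coverage and the bias-corrected Chao1 richness estimate. The function names and the toy abundance vector are assumptions made for illustration; they are not taken from the study and do not reproduce its analysis.

```python
def good_coverage(abundances):
    """Good's sample coverage: C = 1 - f1 / n.

    f1 is the number of items (e.g. taxa or k-mers) seen exactly once,
    n is the total number of observations.
    """
    n = sum(abundances)
    if n == 0:
        raise ValueError("at least one observation is required")
    f1 = sum(1 for a in abundances if a == 1)
    return 1.0 - f1 / n


def chao1(abundances):
    """Bias-corrected Chao1: S_obs + f1 * (f1 - 1) / (2 * (f2 + 1))."""
    s_obs = sum(1 for a in abundances if a > 0)
    f1 = sum(1 for a in abundances if a == 1)  # singletons
    f2 = sum(1 for a in abundances if a == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical toy counts, NOT data from the study.
toy_counts = [1, 1, 1, 2, 2, 5, 10]
print(round(good_coverage(toy_counts), 3))  # 0.864
print(chao1(toy_counts))                    # 8.0
```

A sample dominated by singletons gives low coverage and a richness estimate well above the observed count, which is the qualitative pattern the Results describe for soil even at 270 Gbp.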
Conclusions
Taken together, our results emphasise both the power and the present limitations of metagenomics in resolving natural microbial complexity, and they provide a new baseline for evaluating future advances in microbial genome recovery, taxonomic classification, and natural product discovery.
Data Description
To quantify how much taxonomic and biosynthetic novelty present-day sequencing can recover from soil, we extracted high-molecular-weight DNA from a single Cambisol forest soil sample (Schönbuch, Germany) and generated an ultra-deep 270 Gbp dataset: 148 Gbp of Oxford Nanopore PromethION long reads (read-length N50 = 12.2 kb) plus 122 Gbp of Illumina NovaSeq reads. Hybrid metagenome assembly yielded a 10.5 Gbp assembly, from which multi-tool binning and subsequent refinement recovered 837 MAGs, and antiSMASH/BiG-SCAPE annotation revealed more than 11 000 largely novel biosynthetic gene clusters. Together, these data form a resource for benchmarking assembly or binning pipelines, modelling diversity-coverage relationships, and mining natural products. All raw reads are archived under ENA BioProject PRJEB89893 [1], and the polished assembly, MAG set and BGC catalogue are available via Zenodo for unrestricted reuse [2, 3].
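For readers unfamiliar with the read-length N50 quoted above, the following minimal sketch shows how the statistic is conventionally computed: it is the length L such that reads of length L or longer contain at least half of all sequenced bases. The function name and the toy read lengths are illustrative assumptions, not values from the dataset.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths in bp."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest reads first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first read that pushes the running sum past half of all
        # bases defines the N50.
        if running >= half_total:
            return length


# Hypothetical read lengths in bp, NOT taken from the study.
toy_reads = [2_000, 4_000, 6_000, 12_000, 20_000]
print(n50(toy_reads))  # 12000
```

Because N50 weights reads by the bases they contribute, it is typically larger than the median read length for long-read data with a skewed length distribution.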
Key Points
- Ultra-deep hybrid sequencing (148 Gbp Nanopore + 122 Gbp Illumina) of a single forest soil sample yielded 837 metagenome-assembled genomes, all lacking cultured counterparts.
- Despite this unprecedented 270 Gbp depth, rarefaction and coverage modelling indicate that more than 10 Tbp of data would still be needed to reach saturation in soil.
- Only 0.7% and 16.7% of all assembled contigs can be assigned to a species and genus, respectively, with at least one cultured representative, highlighting an unprecedented level of novelty in soil.
- The assembly uncovers 11 381 biosynthetic gene clusters forming over 10 000 mostly novel families, spotlighting an immense, untapped reservoir of microbial natural-product potential.