Breaking Through Biology's Data Wall: Expanding the Known Tree of Life by Over 10x using a Global Biodiscovery Pipeline

Oliver Vince
Phoebe Oldach
Valerio Pereno
Marcus H Y Leung
Carla Greco
Gus Minto-Cowcher
Saif Ur-Rehman
Keith Y K Kam
William Chow
Emma Bolton
Bupe R Mwambingu
Nadine L Greenhalgh
Ineke E Knot
Leif Christoffersen
Marlon Clark
Robert Pecoraro
Aaron W Kollasch
Tanggis Bohnuud
Matthew Bakalar
Philipp Lorenz
Glen Gowers

1 evaluation. Published on Jun 14, 2025

This article on Sciety

Abstract

Advancements in the life sciences have always been built upon our collective understanding of life on Earth. Now, the rise of generative biology - the use of AI foundation models to design, generate, and annotate proteins, pathways and therapeutics - is creating unprecedented demand for large, diverse biological sequence datasets. While a limited subset of such data can be generated in clinical or laboratory settings, the vast majority of the training data for unsupervised models must be sourced from the natural world - the product of nearly four billion years of evolutionary history. However, the public databases that currently supply this data, while foundational to research, were established to aggregate results from academic experiments, not as training datasets for machine learning. Their human-centric data structure limits model performance due to redundancy, taxonomic and geographic bias, limited biological context, and inconsistent provenance. With 68% of all sequence data in the SRA database coming from just 5 species, this is one of the most severe class imbalance problems ever encountered in AI. Legal and infrastructural constraints further exacerbate this bottleneck. To address these limitations and support scalable model training, we introduce BaseData™: the largest and fastest-growing biological sequence database ever built, and the first purpose-built for training foundation models. As of late 2024, BaseData™ contained 9.8 billion novel genes across more than 1 million newly discovered species, representing more than a 10-fold expansion in known protein diversity after accounting for redundancy. Its partnership-driven data supply chain across 26 countries and autonomous regions enables growth of over 2 billion novel genes per month, far exceeding public repositories. All data is collected under benefit sharing agreements using standardized protocols and structured using graph-based, ontology-rich metadata that preserves evolutionary context. BaseData™ represents a new, ethically grounded infrastructure for training biological foundation models, complementing public efforts and enabling the next era of generative biology.
