Automated Annotation of Plant Gene Regions Using Supervised Machine Learning


Abstract

Rapid sequencing now enables genome analysis across many species, including plants, but repeat-rich, redundant plant genomes still hinder precise annotation. Precise gene calls are vital for understanding plant biology, for comparative genomics, and for discovering traits such as improved nutrition and disease resistance. We present GeAnno, a supervised machine learning method for plant gene detection. GeAnno uses an XGBoost classifier trained on curated plant annotations and applies a sliding-window scheme whose features capture sequence redundancy, base composition, and start/stop-codon spacing, among others. At inference, each window is scored as genic or intergenic, and the window-level predictions are lightly smoothed. The output is a standard GFF3 file with strand-specific genic regions. In benchmarks against ab initio predictors, under matched training on 11 cassava (Manihot esculenta) genomes and in cross-species evaluation, GeAnno achieved higher nucleotide-level precision and F1-score. For example, on cassava it reached 77.13% precision and a 72.90% F1-score. Moreover, performance transfers to divergent species, and the parameters (window, step, smoothing, thresholds) are tunable. By improving accuracy and portability on complex plant genomes, GeAnno supports downstream functional studies and breeding, advancing food security and ecological sustainability.
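
The inference steps described in the abstract (score sliding windows, smooth, threshold, write GFF3) can be illustrated with a minimal Python sketch. This is not GeAnno's implementation: every function name, the default parameter values, the moving-average smoothing, and the GFF3 source and feature labels are assumptions made for illustration, and a simple GC-fraction scorer stands in for the trained XGBoost classifier.

```python
from typing import Callable, List, Tuple

Window = Tuple[int, int]  # 0-based, half-open (start, end)


def sliding_windows(seq_len: int, window: int, step: int) -> List[Window]:
    """Return windows of fixed size covering a sequence of length seq_len."""
    if seq_len <= window:
        return [(0, seq_len)]
    starts = list(range(0, seq_len - window + 1, step))
    if starts[-1] + window < seq_len:
        starts.append(seq_len - window)  # extra window flush with the end
    return [(s, s + window) for s in starts]


def smooth(scores: List[float], radius: int) -> List[float]:
    """Moving average over neighbouring windows (radius 0 = unchanged)."""
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - radius), min(len(scores), i + radius + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def merge_genic(windows: List[Window], scores: List[float],
                threshold: float) -> List[Window]:
    """Merge touching or overlapping windows whose score meets the threshold."""
    regions: List[List[int]] = []
    for (start, end), score in zip(windows, scores):
        if score < threshold:
            continue
        if regions and start <= regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [(s, e) for s, e in regions]


def to_gff3(seqid: str, regions: List[Window], strand: str) -> List[str]:
    """Format regions as GFF3 lines (1-based, inclusive coordinates)."""
    lines = ["##gff-version 3"]
    for i, (start, end) in enumerate(regions, 1):
        # "sketch" and "region" are placeholder source/type labels.
        fields = [seqid, "sketch", "region", str(start + 1), str(end),
                  ".", strand, ".", f"ID=genic_{i}"]
        lines.append("\t".join(fields))
    return lines


def gc_fraction(window_seq: str) -> float:
    """Placeholder scorer; a trained classifier would go here instead."""
    return sum(base in "GC" for base in window_seq) / max(1, len(window_seq))


def annotate(seq: str, score_fn: Callable[[str], float], window: int = 6,
             step: int = 3, radius: int = 0, threshold: float = 0.75,
             seqid: str = "chr1", strand: str = "+") -> List[str]:
    """Score windows, smooth the scores, and emit merged genic regions."""
    wins = sliding_windows(len(seq), window, step)
    raw = [score_fn(seq[s:e]) for s, e in wins]
    return to_gff3(seqid, merge_genic(wins, smooth(raw, radius), threshold), strand)
```

Calling `annotate(sequence, gc_fraction)` returns the GFF3 lines for one strand. In this sketch the window, step, smoothing radius, and threshold are plain keyword arguments, mirroring the abstract's statement that these parameters are tunable.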
