Benchmarking Large Language Models for Replication of Guideline-Based PGx Recommendations
Abstract
We evaluated the ability of large language models (LLMs) to generate clinically accurate pharmacogenomic (PGx) recommendations aligned with CPIC guidelines. Using a benchmark of 599 curated gene–drug–phenotype scenarios, we compared five leading models, including GPT-4o and fine-tuned LLaMA variants, using both standard lexical metrics and a novel semantic evaluation framework (LLM Score) validated by expert review. General-purpose models frequently produced incomplete or unsafe outputs, whereas our domain-adapted model achieved superior performance, with an LLM Score of 0.92 and significantly faster inference. These results highlight the importance of fine-tuning and structured prompting over model scale alone. This work establishes a robust framework for evaluating PGx-specific LLMs and demonstrates the feasibility of safer, AI-driven personalized medicine.
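The abstract describes the LLM Score only at a high level. As a rough illustration of how such a semantic, LLM-as-judge metric is commonly implemented, the sketch below grades a candidate PGx recommendation against a CPIC reference on a 0–1 scale. The prompt wording, the scoring scale, the `call_judge` interface, and the CYP2C19/clopidogrel example are illustrative assumptions, not the authors' published implementation.

```python
# Minimal sketch of an LLM-as-judge semantic score for PGx recommendations.
# All names and the prompt are assumptions for illustration, not the paper's code.
import re
from typing import Callable

JUDGE_PROMPT = """\
You are a clinical pharmacogenomics expert. Compare the candidate
recommendation to the CPIC reference recommendation.

Gene/phenotype: {phenotype}
Drug: {drug}
Reference (CPIC): {reference}
Candidate: {candidate}

Rate semantic agreement on clinical action, dosing, and safety caveats.
Reply with a single number between 0.0 and 1.0."""

def llm_score(
    phenotype: str,
    drug: str,
    reference: str,
    candidate: str,
    call_judge: Callable[[str], str],  # wrapper around any chat-completion API
) -> float:
    """Return a semantic agreement score in [0, 1] for one scenario."""
    reply = call_judge(JUDGE_PROMPT.format(
        phenotype=phenotype, drug=drug, reference=reference, candidate=candidate
    ))
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        raise ValueError(f"Judge reply had no parseable score: {reply!r}")
    # Clamp to [0, 1] in case the judge drifts outside the requested range.
    return min(max(float(match.group()), 0.0), 1.0)

if __name__ == "__main__":
    # Stub judge for demonstration; substitute a real model call in practice.
    score = llm_score(
        phenotype="CYP2C19 poor metabolizer",
        drug="clopidogrel",
        reference="Avoid clopidogrel; use an alternative antiplatelet "
                  "agent such as prasugrel or ticagrelor.",
        candidate="Recommend prasugrel or ticagrelor instead of clopidogrel.",
        call_judge=lambda prompt: "0.9",
    )
    print(f"LLM Score: {score:.2f}")
```

In this framing, the benchmark-level LLM Score would be the mean of per-scenario scores across the 599 gene–drug–phenotype cases; that aggregation choice is likewise an assumption.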