First, We Need to Generate the Right Data. Then AI Will Shine

By Alice Zhang, co-founder and CEO, Verge Genomics, and Victor Hanson-Smith, Ph.D., director and head of computational biology

ChatGPT is a hot topic across many industries. Some say the technology underpinning it – called generative AI – has created an “A.I. arms race.” However, relatively little attention is given to what is needed to fully leverage the promise of generative AI in healthcare, and specifically how it may help accelerate drug discovery and development. 

That’s a mistake. 

Recently, David Shaywitz offered a thoughtful opinion on why he sees generative AI as a profound technology with implications across the entire value chain. We agree with many of David’s views but want to offer additional perspective. 

In short, our belief is that AI will identify better targets, thus reducing clinical failures in drug development and leading to new medicines. Generative AI will play a role. However, the fundamental challenge in making better medicines a reality comes down to closing the massive data gaps that remain in drug development today. 

And when it comes to the most complex diseases that still lack meaningful medicines, where the data comes from is essential. Today, the source of data that powers generative AI has substantial gaps. Over the long term, generative AI will enable the creation of meaningful medicines, but it will not offer a panacea for what ails all of drug discovery. 

A Primer on ML Classification and Generative AI

To start, it’s necessary to have a grounding in machine learning (ML) classification. As the name implies, ML classification predicts whether things are or are not in a class. 

Email spam filters are a great example. They ask, “Is this spam, or is it not?” 

They work because they’ve been “trained” on thousands of previous data points (i.e., emails and the text within). Generative AI, by contrast, uses a class of algorithms called autoencoders, among other approaches, to generate new data that look like the input training data. It’s why a tool like ChatGPT is great at writing a birthday card. There are thousands, maybe even millions, of examples of birthday cards that it can pull from. 
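For readers who want to see what “trained on thousands of previous data points” means in practice, here is a minimal sketch of a spam classifier in Python using scikit-learn. The example emails and labels are invented for illustration; a real filter would train on vastly more data.

```python
# Minimal sketch of ML classification in the spam-filter sense.
# The tiny set of example emails and labels below is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Win a free prize now",           # spam
    "Meeting moved to 3pm tomorrow",  # not spam
    "Claim your reward, click here",  # spam
    "Here are the slides from today", # not spam
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Train on labeled examples, then predict the class of a new, unseen email.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)
print(model.predict(["Free reward waiting, click now"]))  # likely [1]
```

The point of the sketch is the shape of the problem: given labeled past examples, the model answers a yes/no question about a new one. Everything hinges on what the labeled examples contain.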

There are limitations though. 

Ask ChatGPT to summarize a novel that was published this week, and it will give you the wrong answer, or maybe no answer. That’s because the book isn’t yet in the training data. 

What does this have to do with drug discovery? 

The above example illustrates a foundational point in drug discovery: input data – especially its provenance and quality – is essential for training models. Input data is the biggest bottleneck in drug development, especially for complex diseases where few or no therapies exist. Our worldview is that the sophistication of the AI/ML approach is irrelevant if the training data underpinning it is insufficient in the first place. 

So, what kind of biological input data does generative AI need in biology? It depends on the task. For optimizing chemical structures, generative AI mainly relies on vast databases of publicly available protein structures and sequences. This is powerful. We expect generative AI will have a massive impact on small molecule drug design when there is already a target in mind, a known mechanism of action, and the goal is to optimize the structure of a chemical. The wealth of available protein structure and chemistry data means a model can be well trained to craft an optimized small molecule candidate. 

But a different problem – finding new therapeutic drug targets – requires different types of input data. This includes genomic, transcriptomic, and epigenomic sequence data from human tissue. What happens when this type of training data is unavailable? That’s what we’re solving for at Verge. We first fill a fundamental gap by generating the right kind of training data, and then use ML classification to ask and answer the question, “Is this a good target or a bad target?”
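To make the framing concrete, here is a hypothetical sketch of how “good target vs. bad target” could be posed as a binary classification problem. The feature names, data, and model choice are assumptions made up for illustration; they do not describe Verge’s actual pipeline.

```python
# Hypothetical sketch: framing target identification as binary classification.
# All features and labels below are synthetic placeholders for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Rows: candidate genes. Columns: illustrative omics-derived features,
# e.g. differential expression in patient tissue, epigenomic signal,
# strength of genetic association.
X_train = rng.normal(size=(200, 3))
# Labels: 1 = previously validated ("good") target, 0 = not.
y_train = rng.integers(0, 2, size=200)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Score new candidate genes by their predicted probability of being a good target.
X_new = rng.normal(size=(5, 3))
print(clf.predict_proba(X_new)[:, 1])
```

The sketch only shows the structure of the question; the hard part, as argued above, is generating human tissue data rich enough to make the training set meaningful.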

Building a Bridge from Genetic Drivers to Disease Symptoms

Take amyotrophic lateral sclerosis (ALS) as an example. At least 56 genes drive the development of ALS. Looking at one of those genes in isolation will tell you something about certain people with ALS, but nothing about the shared mechanisms that impact ALS in all patients. This genetic association data – from genome-wide association studies, or GWAS – alone is insufficient to find treatments that are widely applicable to broad ALS populations. That theme repeats itself for other complex disease areas we’ve evaluated, including neurodegeneration, neuropsychiatry, and peripheral inflammation. 

The existing drug therapies for ALS treat symptoms of the disease, rather than the underlying causes. It is likely that if a generative AI approach were applied to ALS, it could predict more symptom-modifying treatments, but it would fail to identify fundamentally new disease-modifying treatments. Although AI can be excellent at pattern-matching to create additional examples of a thing, it can struggle to create the first example of one. 

This is precisely the problem the field of biotech faces for a wide range of diseases with no effective drug treatments. We don’t know what causes the diseases and haven’t collected the right kind of underlying data to even begin to lead us to the right answers.

Our approach in ALS is to use layers of omics data – sourced from human, not animal, tissue – to fill gaps in available training data. This enables us to discover molecular mechanisms that cause ALS. When human omics data form the input for a training set, the output is insight into disease-modifying therapies for what we believe will be a wide range of ALS patients. Using this approach, we build a bridge from diverse genetic drivers to shared disease symptoms, from genotype to phenotype. For Verge, this approach has been pivotal in identifying a new target for ALS and starting clinical trials with a small molecule drug candidate against that target in just 4.5 years.

Back to the Value Chain

AI could affect the entire biopharmaceutical and healthcare value chains, but studies like this one have shown that “a striking contrast” has run through R&D in the last 60 years. The authors write that while “huge scientific and technological gains” should have improved R&D efficiency, “inflation-adjusted industrial R&D costs per novel drug increased nearly 100-fold between 1950 and 2010.” Worse, “drugs are more likely to fail in clinical development today than in the 1970s.”

AI today is being used to test more drugs faster, but it hasn’t fundamentally changed the probability of success. The biggest driver of rising R&D costs is the cost of failure. While using AI to optimize design is appealing, it won’t mean much until it can better predict the effectiveness of targets or drugs in humans. Today’s disease models (cells and animal models) are not great predictors of whether drugs work, so increases in efficiency in these models just provide larger quantities of poor-quality data. When models are poor, the outcomes will be, too. As the old saying goes: garbage in, garbage out. 

Concluding Thoughts

No single type of training data will solve the complexities of discovering and developing new medicines. It will take multiple data types. But a relentless focus on finding the best types of data for the scientific problem, and on generating lots of that data in a high-quality manner, is what will truly pave the way for AI to fulfill its potential in drug discovery. 


This article first appeared in the Timmerman Report.
