The Language of Biology: Natural Language Models Being Used to Predict Viral Mutations

January 19, 2021

The new variants of the Covid-19 virus that emerged in the UK and South Africa are top of mind right now.  The fear is that vaccines may not be as effective against them.

When a variant or mutation evades antibodies, it’s artfully called viral escape.  That’s not a good thing.

In a serendipitous development, a recent study published in Science found that natural-language processing (NLP) can help predict virus mutations.

Grammar and semantics
Per MIT Technology Review, the key insight from the study is that many properties of biological systems can be interpreted in terms of words and sentences.

Grammar.  The genetic fitness of a virus can be interpreted in terms of grammatical correctness, i.e., if a virus successfully infects, it is grammatically correct.

Semantics.  Similarly, mutations of a virus can be interpreted in terms of semantics, or meaning. Mutations such as changes in a virus’ surface proteins that make it invisible to certain antibodies can be analogized to changes in semantics. If semantics change, then a new antibody may be needed to fight it.

Old school NLP.  The NLP tech used was not like GPT-3 but rather an older type of neural network known as an LSTM (long short-term memory).  These older networks can be trained on less data than models like GPT-3 but still perform well in many cases.
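
To make the grammar-and-semantics analogy concrete, here is a toy Python sketch.  It is not the study’s code: the mutation labels, the scores and the rank-sum rule (with its beta weight) are all hypothetical, chosen only for illustration.  The idea it shows is that a plausible escape mutation needs both high “grammaticality” (the virus still works) and a big change in “semantics” (antibodies no longer recognize it).

```python
from dataclasses import dataclass


@dataclass
class Mutation:
    name: str               # hypothetical label, not a real mutation
    grammaticality: float   # assumed model score: how viable the mutated virus looks
    semantic_change: float  # assumed model score: how different it looks to antibodies


def rank_positions(values):
    """Return the rank of each value (1 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def rank_escape_candidates(mutations, beta=1.0):
    """Order mutations from most to least escape-like.

    Assumption for this sketch: the two ranks are simply added,
    with beta weighting the grammaticality rank.
    """
    g_ranks = rank_positions([m.grammaticality for m in mutations])
    s_ranks = rank_positions([m.semantic_change for m in mutations])
    scored = [
        (s + beta * g, m.name)
        for m, g, s in zip(mutations, g_ranks, s_ranks)
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Made-up scores for four made-up mutations.
candidates = [
    Mutation("A", grammaticality=0.90, semantic_change=0.02),  # viable, looks the same
    Mutation("B", grammaticality=0.05, semantic_change=0.95),  # looks new, likely not viable
    Mutation("C", grammaticality=0.80, semantic_change=0.85),  # viable and looks new
    Mutation("D", grammaticality=0.10, semantic_change=0.05),  # neither
]

ranking = rank_escape_candidates(candidates)
print(ranking[0])  # the candidate that scores well on both counts
```

In this toy data, only mutation C ranks well on both measures, so it comes out on top.  A and B each do well on just one measure and land in the middle.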

Training the model to “read” viruses
The researchers trained the NLP model on thousands of genetic sequences taken from:

  • Influenza: 45,000 unique sequences;

  • HIV: 60,000 for a strain of HIV; and

  • Covid-19: between 3,000 and 4,000 for a strain of SARS-CoV-2

The model’s predictions were scored on a scale from 0.5 (no better than chance) to 1 (perfect).

To check the model’s accuracy, the researchers took the top mutations identified by the LSTM and cross-checked them in a lab to see how many were actual escape mutations.

Their results ranged from 0.69 for HIV to 0.85 for one coronavirus strain. That’s better than the results of other state-of-the-art models, according to the study.  Score.
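
For a feel of how a score can run from 0.5 (chance) to 1 (perfect), here is a small Python sketch of one common metric with that property, the area under the ROC curve (AUC).  This is an assumption: the text above does not name the metric, and the scores and labels below are invented.  AUC asks: if you pick one real escape mutation and one non-escape mutation at random, how often does the model rank the real one higher?

```python
def auc(scores, labels):
    """Share of (escape, non-escape) pairs the model orders correctly.

    scores: model scores, higher = more escape-like (hypothetical values)
    labels: 1 if the lab confirmed escape, 0 otherwise (hypothetical values)
    Ties count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one escape and one non-escape mutation")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Invented example: six mutations, three confirmed as escape in the lab.
model_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
lab_confirmed = [1, 1, 0, 1, 0, 0]

print(round(auc(model_scores, lab_confirmed), 2))  # 8 of 9 pairs ordered correctly
```

A model that ranks every real escape mutation above every non-escape one gets 1.0, while a model that cannot tell them apart hovers around 0.5.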

Why do this?
To save time.  What would take weeks in a lab can be done by the NLP model almost instantly, which focuses the lab work and speeds it up.  It’s not ready for use in the wild yet but could be someday.  For now, it’s a conceptual breakthrough.

It goes both ways
Advances in NLP can now lead to advances in biology. But Bryson, Berger and Hie, the study’s researchers, believe this can also go the other way, with new NLP algorithms inspired by concepts in biology.
