From Words to Nucleotides: AI Language Models Trained on DNA

Reading time: 4 minutes

Anthony Tao

Since the public introduction of ChatGPT in 2022, large language models (LLMs) have become an emerging cornerstone of modern technology. In healthcare, LLMs have been employed for tasks such as scribing, medical education, prior authorizations, and bridging communication gaps. The power of these models lies in their sheer versatility, especially considering how nonspecific the training process is: most LLMs are trained by feeding them hundreds of gigabytes of unstructured text, without any task-specific goal or direction. Yet despite this, LLMs can respond to questions and interact with prompts in a manner arguably reflective of cognition. Of course, written language is only one type of information from which neural networks can learn. Another is the language of our genome. DNA, like the written word, is composed of individual nucleotides, denoted by the letters A, T, C, and G. With this in mind, researchers have recently sought to develop neural networks that can gain a latent understanding of the genomic code underlying all of life.

To understand the potential of this endeavor, it is first important to understand how LLMs are trained and developed. Most LLMs these days are based on a neural network architecture known as the transformer. During training, the transformer is shown blocks of text with some words hidden and attempts to predict the missing words from the surrounding context. Its predictions are scored against the true text, and the network is tuned in a way that improves its performance on this task. By repeating this process millions of times across trillions of words of text, the LLM can achieve a foundational "understanding" of language. Such a foundation model can then be quickly retrained on a much smaller amount of data for specific tasks (a process called fine-tuning), such as answering medical questions, reasoning through math problems, or generating a song. In this way, the foundation model is said to have developed a latent representation of language, allowing it to reveal properties and make predictions beyond what it was specifically trained to do.
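To make the training objective concrete, here is a deliberately tiny sketch of the fill-in-the-blank idea. This is not a transformer; it stands in for one with a simple counter that learns which word tends to sit between two neighbors. Real LLMs replace the counter with a deep network whose billions of parameters are adjusted to minimize the same kind of prediction error. The corpus and function names are illustrative only.

```python
from collections import Counter, defaultdict

# A toy corpus; a real model would see trillions of words.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# "Training": count which word appears between each pair of neighbors.
context_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words) - 1):
        context = (words[i - 1], words[i + 1])  # the words around the blank
        context_counts[context][words[i]] += 1

def predict_masked(left, right):
    """Guess a hidden word from its immediate neighbors."""
    counts = context_counts.get((left, right))
    return counts.most_common(1)[0][0] if counts else None

# "Evaluation": mask the middle word of "the cat ___ on" and predict it.
print(predict_masked("cat", "on"))
```

Scaling this idea up, with richer context and a learnable network instead of a lookup table, is essentially what produces a foundation model.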

The same training process, using transformer-like neural networks, can be applied to genomic information. With the advent of fast, cost-efficient sequencing technologies, gigabytes of genomic data are publicly available across millions of samples and species. Running a similar training scheme on this data is therefore feasible and could produce a genomic LLM with a latent understanding of the code of life (Figure 1). Such a model could yield insights or predictions on cancer mutations, genetic engineering, and the regulation of cell function.

Figure 1. Flow diagram of genomic LLM training and fine-tuning 
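Treating the genome as text begins with tokenization: just as an LLM splits prose into word or subword tokens, a DNA sequence can be split into per-nucleotide tokens. The sketch below shows one hypothetical mapping; the specific token IDs are arbitrary, and real genomic models may use longer "k-mer" tokens instead of single bases.

```python
# Hypothetical vocabulary: one integer token ID per nucleotide.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def tokenize_dna(sequence):
    """Convert a DNA string into a list of integer token IDs."""
    return [VOCAB[base] for base in sequence.upper()]

print(tokenize_dna("ATCG"))  # -> [0, 3, 1, 2]
```

Once a genome is a sequence of tokens, the same fill-in-the-missing-token training used for text applies unchanged.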

In pursuit of this goal, a group of researchers at Stanford introduced a model called Evo, trained on millions of bacterial and viral genomes. As with text-based LLMs, Evo was trained to fill in missing nucleotides within random snippets of the genome. Excitingly, the resulting model was able to predict the effects of certain mutations on proteins, to identify genome locations critical for gene regulation, and even to generate new variations of Cas9, a protein that is a crucial mediator of CRISPR genome editing. In fact, when the researchers tested these AI-designed Cas9 proteins, many of them successfully performed CRISPR-mediated functions in real-life experimental assays.

As an update to Evo, the same group later introduced Evo 2, a similar model trained on far more genomic data spanning all domains of life, including animal, plant, and human genomes. Among many other tasks, this model was able to accurately predict disease-causing mutations in BRCA1, a gene strongly implicated in the development of breast and ovarian cancer. However, not all BRCA1 mutations are cancer-promoting. In fact, the human population carries over 2,000 BRCA1 variants whose effects are unknown, which complicates clinical management for the people who carry them. Without any further training, Evo 2 outperformed the latest specialized models at predicting the cancer risk of these variants of unknown significance. Such a tool could substantially improve how genetic risk is evaluated in medical practice, especially given the hundreds of genes that each harbor thousands of variants of unknown cancer risk.
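One intuitive way a trained genomic model can score a variant without further training is to compare how "plausible" it finds the mutant sequence versus the reference sequence. The sketch below illustrates that log-likelihood-ratio idea; the `log_prob` function here is a crude stand-in (scoring bases against an assumed genome-wide composition) purely so the example runs, and none of these names reflect Evo 2's actual API.

```python
import math

TRAINED_GC = 0.6  # assumption: pretend the "model" learned a 60% GC content

def log_prob(sequence):
    """Stand-in for a trained model's log-likelihood of a sequence."""
    return sum(
        math.log(TRAINED_GC / 2 if base in "GC" else (1 - TRAINED_GC) / 2)
        for base in sequence
    )

def variant_score(reference, mutant):
    """Log-likelihood ratio: negative values flag implausible mutants."""
    return log_prob(mutant) - log_prob(reference)

ref = "ATGGCGTACGGC"
mut = "ATGGCGTACGAC"  # a single G -> A substitution
print(variant_score(ref, mut))  # negative: the mutant looks less likely
```

A real genomic LLM supplies a far richer, context-aware `log_prob`, which is what allows it to flag deleterious variants it was never explicitly trained to classify.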

Ultimately, genomic LLMs pave the way for exploring biological and medical questions that may be difficult to answer through conventional analysis and reasoning. Of course, such models will continue to face limitations as well as controversy regarding the reliability of their predictions. For instance, whether these models simply recognize statistical patterns or genuinely "learn" biology is still debated. And the idea of relying on a computational black box to guide clinical decisions about cancer care can be a difficult pill to swallow. With the right guardrails, however, these models have the potential to advance cancer genomics, for example by predicting responsiveness to treatment or the likelihood of metastasis. Continued collaboration between computational scientists, biologists, and clinicians will be essential to ensure these tools are sound and used effectively.

Header Image Source: generated by author with Midjourney

Edited by Dr Celia Snyman

References

  1. Nguyen, E. et al. Sequence modeling and design from molecular to genome scale with Evo. Science 386, eado9336, doi:10.1126/science.ado9336 (2024).
  2. Brixi, G. et al. Genome modelling and design across all domains of life with Evo 2. Nature, doi:10.1038/s41586-026-10176-5 (2026).
