Recent advances have enabled researchers to map the DNA sequence of the human genome. However, it is still not fully understood how the genetic information encoded in DNA is read and expressed in the human body.
A recent study by DeepMind introduces Enformer, a novel neural network architecture for predicting gene expression from DNA sequence.
Previous work in this domain used convolutional neural networks, but their accuracy and applicability were limited by the amount of DNA sequence context they could take into account. The proposed approach instead uses Transformers, whose self-attention mechanisms can integrate much greater DNA context. Inspired by the use of Transformers in natural language processing, the researchers adapted them to “read” vastly extended DNA sequences.
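To illustrate why self-attention can integrate distal context, here is a minimal sketch of single-head self-attention over a one-hot-encoded DNA sequence. This is illustrative only: the dimensions, random projection matrices, and helper names (`one_hot`, `self_attention`, `d_model`) are arbitrary assumptions for the example, not Enformer's actual architecture or weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot matrix (columns A, C, G, T)."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        out[i, idx[base]] = 1.0
    return out

def self_attention(x, d_model=8):
    """Single-head self-attention: every position attends to every other,
    so distal positions can influence each other in a single layer
    (unlike a convolution, whose receptive field grows only with depth).
    Projection weights here are random placeholders."""
    L, d_in = x.shape
    wq = rng.normal(size=(d_in, d_model))
    wk = rng.normal(size=(d_in, d_model))
    wv = rng.normal(size=(d_in, d_model))
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d_model)             # (L, L) pairwise interactions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all positions
    return weights @ v                               # (L, d_model) contextual output

x = one_hot("ACGTACGTAACCGGTT")
out = self_attention(x)
print(out.shape)  # (16, 8)
```

The key point is the (L, L) score matrix: its cost grows with sequence length, but it lets position 1 and position L interact directly, which is what allows a Transformer-based model to use regulatory elements far from the gene itself.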
Enformer is significantly more accurate at predicting the effects of variants on gene expression, for both natural and synthetic variants. These results can inform further research into gene regulation and the causal factors of disease.
How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequences through the use of a deep learning architecture, called Enformer, that is able to integrate information from long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Furthermore, Enformer learned to predict enhancer–promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of human disease associations and provide a framework to interpret cis-regulatory evolution.