Setting the bar higher – the complete human genome sequence

The Telomere to Telomere (T2T) consortium has published a ‘complete’ human genome sequence, filling in gaps that had stubbornly persisted for over 20 years since the publication of the original Human Genome Project.

The newly determined sequence covers previously inaccessible regions of chromosomes, including all of the repeat-heavy centres and short ‘arms’. For the first time, scientists will be able to delve into the functions and variations of all 3.055 billion letters of DNA that code for a human.


DNA – artistic impression. Image credit: Caroline Davis2010 via Flickr, CC BY 2.0

The team behind the research describes a ‘new era for genomics, where no region of the genome is beyond reach.’ The sequence contains previously undiscovered genes, and previously inaccessible repetitive regions can now be detangled. The work opens the door to complete, end-to-end genome sequencing for all species on Earth.

The Human Genome Project

The Sanger Institute was founded in 1993 to contribute to the Human Genome Project, the 13-year mission to map our species’ DNA. The international consortium announced the ‘complete’ sequence in 2003, and the Sanger was responsible for sequencing a third of the genome – the most significant single contribution. It was a monumental landmark for science, providing the foundations for research into biology, evolution and medicine.

The sequence formed the basis of the reference genome – an open-access resource used by the scientific community worldwide as the basis of nearly all genomics applications in research and clinical settings.

The Genome Reference Consortium (GRC), including scientists at the Sanger Institute, has been maintaining and updating the reference human genome sequence since 2007. They have been chipping away at the sequence, adding to it and correcting errors. The current version, number 38, still has about eight per cent of the sequence missing, beyond the reach of previous sequencing approaches.

The missing millions of DNA letters are primarily in repeat-dense regions of the genome. At the time of the Human Genome Project, there was no way to determine the order of these letters, mainly because only short fragments of the genome could be sequenced. In regions full of repeats, fitting the puzzle together was impossible because all the pieces looked the same.

In 2018, the T2T consortium was formed to get to those uncharted regions. Led by researchers in the USA, they utilised advances in sequencing technology from two companies, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Their long-read technologies can determine the order of many thousands of DNA letters in a row, far more than was previously possible. The pieces of the puzzle got bigger, and so became easier to fit together.
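The puzzle analogy can be made concrete with a toy example. The sketch below is illustrative only and is not the consortium's method: the sequences, the read lengths and the `reads` helper are invented for this example, and it assumes error-free reads and ignores how often each read is seen. It builds two made-up 'genomes' that differ only in how many copies of a repeat they contain, then compares the reads each one would produce.

```python
def reads(sequence, length):
    """Return the set of every substring ('read') of the given length."""
    return {sequence[i:i + length] for i in range(len(sequence) - length + 1)}

# Hypothetical toy sequences: unique flanks around a repeated unit.
LEFT, UNIT, RIGHT = "ACGTTGCA", "GATTACA", "TTCGGACT"
genome_a = LEFT + UNIT * 5 + RIGHT   # five copies of the repeat
genome_b = LEFT + UNIT * 8 + RIGHT   # eight copies of the repeat

SHORT = 10   # shorter than the repeat array: reads cannot span it
LONG = 45    # long enough to reach from one flank, across five copies, into the other

short_reads_identical = reads(genome_a, SHORT) == reads(genome_b, SHORT)
long_reads_identical = reads(genome_a, LONG) == reads(genome_b, LONG)

print("Short reads identical:", short_reads_identical)
print("Long reads identical:", long_reads_identical)
```

With short reads the two genomes yield exactly the same set of pieces, so the number of repeat copies cannot be recovered from them. With long reads the sets differ, because a single read can anchor in the unique sequence on both sides of the repeat.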

Over 100 scientists working over several years have developed new algorithms, combining sequence data from all of the available technologies to get to the finished sequence.

Dr Kerstin Howe, a Head of Production Genomics at the Sanger Institute, and member of the T2T consortium, describes the implications: “For the first time, we have insight into what previously escaped our sequencing and assembling technologies despite our best efforts: the ‘genomic dark matter’ of highly repetitive regions like centromeres, and expanded gene families.”

Diverse genomes

Another issue with the current reference sequence is that it is a single, linear genome based on a composite sequence, originally from a handful of different people. This creates biases and errors and, importantly, it does not represent human genetic diversity. And while subsequent studies have sequenced more and more individuals, these are primarily people of European ancestry, creating inequity in genomic research.

Kerstin is also involved in the Human Pangenome Reference Consortium. Their aim is to create a complete reference to human genomic diversity. This pangenome will be made from telomere-to-telomere (end-to-end) sequences of 350 individuals of diverse ancestry. Kerstin describes it as a ‘web’: “The pangenome will move away from a linear genome to one which branches out where there is variation, and the branches come back together when there is none. We expect the new pangenome reference to better capture human genetic diversity and improve gene-disease association studies among as yet underrepresented populations.”
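The 'web' idea can be sketched as a small graph. This is a minimal illustration, not the consortium's data model: the node names, the sequences and the `haplotypes` helper are all hypothetical. Shared sequence is stored once, the graph branches at a single variable position, and the branches rejoin afterwards.

```python
# Toy sequence graph: node name -> (sequence, names of the nodes that follow).
# All names and sequences here are invented for illustration.
GRAPH = {
    "start": ("ACGT", ["ref", "alt"]),  # shared sequence, then a branch
    "ref": ("A", ["end"]),              # one version of the variable site
    "alt": ("G", ["end"]),              # another version of the same site
    "end": ("TTCA", []),                # branches rejoin in shared sequence
}

def haplotypes(graph, node="start"):
    """List every sequence spelled out by a path from `node` to the end."""
    sequence, next_nodes = graph[node]
    if not next_nodes:
        return [sequence]
    return [sequence + rest
            for nxt in next_nodes
            for rest in haplotypes(graph, nxt)]

for haplotype in haplotypes(GRAPH):
    print(haplotype)
```

Each path through the graph spells out one possible version of the region, so a single structure can hold several individuals' sequences without favouring one of them as the line that everything else is compared against.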

The Human Pangenome Reference Consortium also develops methods, software, tools, and data systems to visualise, use and disseminate the sequence.

New standards

It will be a while before T2T sequencing is the norm. The next hurdle to overcome is the diploid genome, present in normal human cells. The T2T consortium worked on a haploid human cell line, with only one copy of each chromosome. But for Kerstin, T2T represents what will be achievable.

“We have started a massive project to sequence the genomes of all the species on Earth, and the bar is now higher. We know what’s possible. What we are talking about when we say ‘reference genome’ just got pushed up a notch.”

Kerstin reflects on what’s next. “There are some chromosomes that are only around during development in certain cell types, and then they disappear in adult cells. They are going to be of interest. And then, of course, you have the somatic mutation, how our genomes change as we age, and that will be looked at in the context of T2T. Single-cell T2T sequencing might be the next thing.”

The Human Genome Project transformed biology. As the data is used to inform the next generation of research into biology, evolution, and personalised medicine, the more accurate and representative the reference sequence is, the better.

Source: Sanger Institute