The complete human genome deciphered for the first time
Evan Eichler has always been drawn to the most complex regions of humanity’s genome – those with oddly long repeated stretches of DNA or with extra copies of genes. He suspected that these regions could play a crucial role in evolution and disease. That’s why, more than 20 years ago, he joined the Human Genome Project, the $3 billion effort to read every letter of a person’s DNA.
But after the project’s victory in 2003, Eichler was only one step closer to his scientific goal. The sequencing effort had failed to read many large chunks of DNA – more than eight percent of the genome. Scientists knew these missing chunks contained highly repetitive sequences and largely dismissed them as junk. That’s not the case, says Eichler, a researcher at the Howard Hughes Medical Institute (HHMI) at the University of Washington. “A lot of the regions I was interested in turned out to be in the gaps.” He was committed to finishing the job – reading the whole genome, tricky bits and all.
Now he and a team of about 100 scientists, led by Adam Phillippy of the National Human Genome Research Institute (NHGRI) and Karen Miga of the University of California, Santa Cruz (UCSC), have finally succeeded. In new work first posted as a preprint on bioRxiv.org and published March 31, 2022, in the journal Science, they describe the first-ever sequencing of an entire human genome, adding an entire chromosome’s worth of previously hidden DNA – the missing eight percent. In the genetic book of life, “we see chapters that have never been read before,” says Eichler.
Or as University of Washington geneticist Robert Waterston puts it: “There are no more hidden or unknown bits.”
“I think it’s psychologically a great thing,” adds Waterston, a leader of the original Human Genome Project who was not involved in the new effort. “I just admire these scientists for sticking with it.”
A complex puzzle
The human genome is made up of just over six billion individual letters of DNA – about the same number as other primates like chimpanzees – spread over 23 pairs of chromosomes. To read a genome, scientists first cut all that DNA into pieces of hundreds to thousands of letters. Sequencing machines then read the individual letters from each piece, and scientists try to put the pieces together in the correct order, like putting together a complex puzzle.
One challenge is that certain regions of the genome repeat the same letters over and over. Repetitive regions include centromeres, the parts that hold chromosome strands together and play a crucial role in cell division, and ribosomal DNA, which provides instructions for the cell’s protein factories. Still other repetitive regions include new genes that can help species adapt. In the past, all this repetition made it impossible to assemble certain fragments in the correct order. It’s like having identical puzzle pieces – scientists didn’t know which went where, leaving big gaps in the genomic picture.
Another problem: most cells contain two genomes, one from the father and one from the mother. When researchers try to put all the pieces together, the sequences from each parent can get mixed up, obscuring the real variation within each individual genome.
In the mid-2000s, as scientists tried to figure out how to overcome these obstacles, “we had the idea of obtaining a complete genome by sequencing only one of the genomes instead of solving two of them at the same time,” recalls Eichler. He knew exactly where to find one – in a set of cell lines studied by University of Pittsburgh reproductive geneticist Urvashi Surti. Because of a rare glitch in normal development, these cells ended up with two copies of the father’s DNA and none from the mother.
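The identical-puzzle-piece problem can be illustrated with a toy sketch. The Python snippet below is not the consortium's method; the sequences, the names `REPEAT`, `reads` and `distinguishable`, and the read lengths are all invented for illustration. It builds two different miniature "genomes" that contain the same repeat three times and differ only in the order of the short unique stretches between the copies. When the idealized, error-free reads are shorter than the repeat, both genomes yield exactly the same set of reads, so no assembler could tell them apart. Reads long enough to span a repeat and its neighbors on both sides resolve the ambiguity.

```python
from collections import Counter

def reads(genome, k):
    """Count every length-k substring: idealized, error-free reads."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

# Hypothetical toy sequences, chosen only for illustration.
REPEAT = "GATTACAG"                      # an 8-letter repeat
A, B, C, D = "TTG", "CCT", "AAG", "CGA"  # short unique stretches

# Same pieces, but the two middle stretches (B and C) are swapped.
genome_1 = A + REPEAT + B + REPEAT + C + REPEAT + D
genome_2 = A + REPEAT + C + REPEAT + B + REPEAT + D

def distinguishable(k):
    """True if reads of length k differ between the two toy genomes."""
    return reads(genome_1, k) != reads(genome_2, k)

print(distinguishable(6))   # False: reads shorter than the repeat look identical
print(distinguishable(12))  # True: reads spanning a whole repeat tell them apart
```

Real sequencing reads contain errors and cover the genome unevenly, so this sketch captures only the core logic: a repeat longer than the reads hides the order of whatever lies between its copies.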
Such a cell line, with a single genome, “is what made this genome assembly possible,” says HHMI researcher Erich Jarvis, a neurogeneticist at Rockefeller University who collaborated on the new work.
Other major advances include rapid improvements in gene sequencing machines made by Oxford Nanopore Technologies and Pacific Biosciences. In 2017, NHGRI’s Phillippy and UCSC’s Miga realized that the ability of a new Nanopore machine to accurately read a million letters of DNA at once had opened the door to finally tackling the hard parts of the genome. They created the Telomere-to-Telomere (T2T) consortium to sequence each chromosome from one end, or telomere, to the other. Around the same time, Eichler’s team had shown the value of using Pacific Biosciences technology to resolve more complex forms of genetic variation.
There was no guarantee of success. But “we were blessed with a youthful optimism and were excited about the promise of these new technologies,” Phillippy recalls. The team ran their Nanopore machines non-stop for six months and brought in dozens of scientists to put the pieces together and analyze the results. At the same time, other team members and Pacific Biosciences were generating sequencing data on the company’s long-read sequencing platform. In particular, the project received a boost when Pacific Biosciences introduced a new sequencing machine that generated long reads that were over 99% accurate. “That was the last piece of the puzzle – like putting on a new pair of glasses,” says Phillippy. Pacific Biosciences’ technology could not cover all parts of the genome equally, but scientists realized that by combining its long reads with data from Oxford Nanopore, they could fill in all the gaps.
By summer 2020, the consortium had assembled two chromosomes and planned what Phillippy calls a hackathon to get the other 21, working remotely on Zoom and Slack during the pandemic. A key aha moment came when the team attempted to assemble the most difficult regions of the genome – the highly repetitive DNA in the centromeres. The researchers realized that the algorithms for assembling the pieces couldn’t handle the repetition, but the human eye could. On the computer screen, the scientists saw where the different repeating sequences had become entangled. Then they untangled them manually, “like untangling a string in your yo-yo,” says Jarvis. By the end of the summer, the team had sequenced each chromosome.
Earthquake of genetic changes
As each new chapter in our genetic book of life emerged, researchers dove in to search for biological meaning. Their results appear in six articles in Science and more than a dozen articles elsewhere. For example, the team found surprisingly high levels of genetic variation in centromeres and other regions – “a whole new treasure chest of variants that we can study to see if they have functional significance,” says Phillippy.
The data offers “the basis for a new era” in the study of centromeres, says Miga, who co-led the T2T centromere satellite working group. Scientists will now be able to explore how this newly discovered variation contributes to disease and how centromere DNA changes over time, she says.
The T2T results also point to more complex patterns of gene variation that may have helped create the human species – and could explain our rapid evolution. The full genome sequence reveals that some genes associated with larger brains are highly variable, Eichler says. One person may have 10 copies of a particular gene, while others may have only one or two. This variation can cause problems during reproduction, when mom and dad’s chromosomes line up and swap pieces. Mismatched genes can lead to “an earthquake” of genetic alterations, says Eichler. As a result, “these regions become a melting pot for both rapid evolutionary change and disease susceptibility, both within and between species,” he says.
The successful completion of a single genome is not the last word. Members of the consortium are already working on sequencing a genome with different chromosomes inherited from each parent. They are also beginning a pan-genome effort to read the entire DNA sequences of hundreds of people around the world. “The goal is to create as complete a human genome as possible, representing much more human diversity,” says Jarvis, co-lead of the pan-genome effort.
But the new sequence is the necessary first step, says Eichler. “We now have a Rosetta Stone to examine full variation in hundreds of thousands more genomes in the future.”
Sergei Nurk et al. “The complete sequence of a human genome.” Posted on bioRxiv.org May 27, 2021. doi: 10.1101/2021.05.26.445798. Published in Science March 31, 2022. doi: 10.1126/science.abj6987.
Mitchell Vollger et al. “Segmental duplications and their variation in a complete human genome.” Posted on bioRxiv.org May 26, 2021. doi: 10.1101/2021.05.26.445678. Published in Science March 31, 2022. doi: 10.1126/science.abj6965.