When the sequencing of the human genome was announced two decades ago by the Human Genome Project and biotech firm Celera Genomics, the sequence was not truly complete. About 15% was missing: technological limitations left researchers unable to work out how certain stretches of DNA fitted together, especially those where there were many repeating letters (or base pairs). Scientists solved some of the puzzle over time, but the most recent human genome, which geneticists have used as a reference since 2013, still lacks 8% of the full sequence.
Now, researchers in the Telomere-to-Telomere (T2T) Consortium, an international collaboration that comprises around 30 institutions, have filled in those gaps. In a 27 May preprint (ref. 1) entitled ‘The complete sequence of a human genome’, genomics researcher Karen Miga at the University of California, Santa Cruz, and her colleagues report that they’ve sequenced the remainder, in the process discovering about 115 new genes that code for proteins, for a total of 19,969.
“It’s exciting to have some resolution to the problem areas,” says Kim Pruitt, a bioinformatician at the US National Center for Biotechnology Information in Bethesda, Maryland, who calls the result a “significant milestone”.
New sequencing technology
The newly sequenced genome — dubbed T2T-CHM13 — adds nearly 200 million base pairs to the 2013 version of the human genome sequence.
This time, instead of taking DNA from a living person, the researchers used a cell line derived from what’s known as a complete hydatidiform mole, a type of tissue that forms in humans when a sperm fertilizes an egg with no nucleus. The resulting cell contains chromosomes only from the father, so the researchers don’t have to distinguish between two sets of chromosomes from different people.
Miga says the feat probably wouldn’t have been possible without new sequencing technology from Pacific Biosciences in Menlo Park, California, which uses lasers to scan long stretches of DNA isolated from cells — up to 20,000 base pairs at a time. Conventional sequencing methods read DNA in chunks of only a few hundred base pairs at a time, and researchers reassemble these stretches like puzzle pieces. The larger pieces are much easier to put together, because they are more likely to contain sequences that overlap.
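A rough way to see why read length matters in repetitive regions is to count how many reads could sit in more than one place. The sketch below is a toy illustration only, not the consortium's method: the sequence `toy_genome`, the function `ambiguous_reads` and the read lengths of 4 and 16 are all invented for the example, and the reads are assumed to be error-free.

```python
from collections import Counter


def ambiguous_reads(genome: str, read_len: int) -> int:
    """Count error-free reads of length read_len that match more than one position."""
    # Take every possible read of the given length, one per start position.
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    # A read is ambiguous if an identical read starts somewhere else too.
    return sum(1 for r in reads if counts[r] > 1)


# Hypothetical toy sequence: unique flank, a 12-letter "TA" repeat, unique flank.
toy_genome = "ACGT" + "TA" * 6 + "GGCA"

short_ambiguous = ambiguous_reads(toy_genome, 4)   # reads shorter than the repeat
long_ambiguous = ambiguous_reads(toy_genome, 16)   # reads longer than the repeat

print(short_ambiguous, long_ambiguous)  # prints "9 0"
```

In this toy case, 9 of the 17 possible four-letter reads fall entirely inside the repeat and look identical to at least one other read, so their position cannot be pinned down. Every 16-letter read spans the whole repeat plus some unique flanking sequence, so each has exactly one possible placement.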
T2T-CHM13 is not the last word on the human genome, however. The T2T team had trouble resolving a few regions on the chromosomes, and estimates that about 0.3% of the genome might contain errors. There are no gaps, but Miga says quality-control checks have proved difficult in those areas. And the sperm cell that formed the hydatidiform mole carried an X chromosome, so the researchers have not yet sequenced a Y chromosome, which typically triggers male biological development.
Hundreds of genomes to follow
T2T-CHM13 represents only one person’s genome. But the T2T Consortium has teamed up with a group called the Human Pangenome Reference Consortium, which aims over the next 3 years to sequence more than 300 genomes from people all over the world. Miga says that the teams will be able to use T2T-CHM13 as a reference to understand which parts of the genome tend to differ between individuals. They also plan to sequence an entire genome that contains chromosomes from both parents, and Miga’s group has been working on sequencing the Y chromosome, using the same new methods to help fill gaps.
Miga expects that genetics researchers will quickly find out whether any of the newly sequenced areas and possible genes are associated with human diseases. “When the human genome came out, we didn’t have the tools poised and ready to go,” she says, but information about the function of the newly sequenced genes should come much faster now, because “we’ve built up a ton of resources”.
She hopes that future human genome sequences will cover everything, including the newly sequenced sections — not just the parts that are easy to read. This should be easier now that the reference genome has been completed and some of the technical snags have been worked out. “We need to reach a new standard in genomics where this isn’t special, but routine,” she says.
Nature 594, 158-159 (2021)
1. Nurk, S. et al. Preprint at bioRxiv https://doi.org/10.1101/2021.05.26.445798 (2021).