For Jun Wang at the Beijing Genomics Institute-Shenzhen and his international collaborators, the idea of a human pan-genome was triggered by the development of a new short-read sequence assembler. Wang recalls that while testing SOAPdenovo, part of the short oligonucleotide analysis package (SOAP), they assembled the genomes of an African and an Asian individual but discovered that 5 megabases (Mb) of the approximately 3 gigabases of total assembled sequence for each genome did not align to the current human reference.

The current human reference genome, GRCh37, which stands for 'genome reference consortium human build 37', is a representation of a haploid human genome derived from the sequences of multiple individuals. The assembly is divided into a linear primary assembly and a series of alternate loci in regions with great diversity.

Instead of just trying to fit the 5 Mb of diverse sequence into a single linear reference assembly as alternate loci, Wang and colleagues borrowed a concept from microbiologists, who are used to dealing with multiple different genomes in bacterial communities. In the microbial world, a pan-genome describes the diversity of genomes in a population. Accordingly, Wang and his collaborators defined a human pan-genome as a nonredundant set of human DNA sequences that includes all genetic information of human populations.

To ascertain that the diverse sequences were indeed authentic, the researchers confirmed that a substantial fraction of the 5 Mb aligned to other mammalian sequences stored at the National Center for Biotechnology Information (NCBI). When they compared these sequences between the African and Asian individuals' genomes, they saw that the sequences were polymorphic and not simply a rearrangement of repeat regions. “We started to guess,” says Wang, “that they could be individual-specific or even population-specific sequences, and it would be interesting to see whether there are any potential functional elements in them.”

Wang and his colleagues estimated that these individual-specific regions will encompass a total of up to 40 Mb, around 1.3% of the entire human genome. As these sequences vary greatly between individuals, Wang speculates that they can only be identified by de novo assembly of human genomes. To get a firm handle on the function of these regions, genomes of many more individuals will need to be sequenced.

Completing the entire human pan-genome will most likely never be feasible, because everyone would need to have their genome sequenced and assembled. Currently, whole-genome sequencing is extremely expensive if done by a commercial provider. In addition, there are ethical and legal issues that need to be worked out, not to speak of a person's right not to know.

Realizing this, Wang presents an alternative solution: “We could start with over a hundred individuals to get the common alleles with a frequency higher than 1% in a population.” Population-based pan-genomes would already be a great help in finding genomic locations associated with phenotypic traits in a given population.
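
To give a rough feel for why a sample on the order of a hundred individuals relates to a 1% frequency threshold, the following is a minimal back-of-the-envelope sketch, not the researchers' own method. It assumes that each individual contributes two independently sampled chromosomes and that an allele only needs to be seen once to count as observed; the function name `prob_allele_observed` is hypothetical and chosen for illustration.

```python
def prob_allele_observed(freq: float, n_individuals: int, ploidy: int = 2) -> float:
    """Chance that an allele with population frequency `freq` shows up at
    least once among the sampled chromosomes.

    Assumption (not from the article): chromosomes are independent draws
    from the population, so the allele is missed with probability
    (1 - freq) ** (number of chromosomes sampled).
    """
    n_chromosomes = n_individuals * ploidy
    return 1.0 - (1.0 - freq) ** n_chromosomes


# Compare a few sample sizes for an allele at exactly 1% frequency.
for n in (50, 100, 200):
    p = prob_allele_observed(0.01, n)
    print(f"{n} individuals: probability of seeing the allele = {p:.3f}")
```

Under these simplifying assumptions, the chance of capturing a 1% allele rises steeply with sample size, whereas rarer, individual-specific sequences would be missed far more often by the same sample.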

But ultimately Wang thinks that only the individual-specific sequences, rather than the population-specific ones, provide insight into biological functions and disease mechanisms. Despite all the challenges, he thus summarizes their work in a simple message: “Everyone should have their genome assembled to get complete information.”