Introduction

South Asia is home to over 1.5 billion people, representing almost one-quarter of the world population. Early migration to this region from Africa occurred 50 000–70 000 years before present. In recent years, genomic markers have been used to study the migration patterns and relationships among different Asian ethnic groups. These efforts provided clues to two major waves of migration to South Asia from the Middle East. One wave followed a southern coastal route, around the rim of the Indian subcontinent, and continued across Malaysia, Indonesia and the Philippines, whereas a distinct wave of immigrants traveled east across the Eurasian plains and turned south through the Asian mainland.1 A recent comprehensive study carried out by the HUGO Pan-Asian single-nucleotide polymorphism (SNP) Consortium2 concluded that the southern route made a more important contribution to East and Southeast Asian populations than the northern route. Several subsequent migrations and invasions, mainly from the west, resulted in the considerable genetic diversity observed in modern South Asian populations.1 Pakistan constitutes the north-western part of South Asia and is situated at the crossroads of the Indian subcontinent, Central Asia and the Middle East. Thus, Pakistan is located along the southern migration route.3

With an ethnically and linguistically diverse population of >170 million (2011 estimate: http://www.census.gov.pk), Pakistan is the sixth most populous country in the world. Most of the Pakistani population has an ancestral north Indian (ANI) origin, genetically close to Middle Easterners, Central Asians and Europeans.4 During the last decade, DNA variation among different Pakistani ethnic groups has been analyzed and represented in the Human Genome Diversity Project.1, 5, 6, 7, 8, 9, 10 Y-chromosomal lineage analyses and related studies have linked the Hazara and Pathan ethnic groups of Pakistan to Genghis Khan or his male ancestors11 and to Europeans,5 respectively.

Next-generation DNA sequencing (NGS) technologies represent a practical way to identify and evaluate rare and previously unidentified genetic variants.12, 13, 14 These technologies have made it possible to develop a comprehensive catalog of genetic variation in human population samples, thereby creating a foundation for understanding human ancestry and evolution.15, 16 Large-scale studies aimed at cataloging variation, such as the 1000 Genomes Project, are currently underway. The number of genomes sequenced has grown dramatically over the last few years.17, 18 The discovery of millions of SNPs and insertion–deletion (indel) polymorphisms indicated the necessity of whole-genome sequencing from diverse global populations to build a truly comprehensive catalog of human variation. Human genetic variation contributes a substantial fraction of disease susceptibility. The characterization of both universal and population-specific genome variation will contribute to the development of personalized medicine in the near future.12

Here, we report the first complete genome sequence of a Pakistani individual (designated as PK1) generated using NGS technology. Pakistan has so far been underrepresented in genome-wide surveys of human variation. The 1000 Genomes Project plans to sequence the genomes of Pakistani individuals at 2–4x coverage17 (http://www.1000genomes.org). We sequenced the PK1 genome at >25x coverage, that is, significantly more deeply than the coverage planned by the 1000 Genomes Project. The resulting genome sequence information represents an important contribution to our knowledge of the genetic diversity of South Asia.

Materials and Methods

Study subject and ethical statement

The study subject (designated as PK1) was a 69-year-old Pakistani male living in Karachi, Pakistan. The subject gave written informed consent to publicly disclose the entire content of his genome. The Institutional Review Board of BGI approved this project after obtaining consent from the donor. Genomic DNA was isolated from a peripheral blood sample using the Genomic DNA isolation kit (Fermentas). The quality of the DNA was checked using a 2100 Bioanalyzer (Agilent Technologies Inc. USA) and agarose gel electrophoresis. The DNA concentration was measured with a NanoDrop spectrophotometer (Thermo Inc. USA) and a Qubit Fluorimeter (Life Technologies Inc. USA).

Genomic DNA library construction and genome sequencing

DNA library preparation was carried out according to the manufacturer’s instructions for sequencing on the HiSeq 2000 (Illumina Inc., San Diego, USA). Five micrograms of genomic DNA were used for library preparation. Two paired-end libraries with insert sizes of 750 and 700 base pairs were generated for deep sequencing of the PK1 genome on the HiSeq 2000 (Illumina Inc.).

Data processing and read alignment

The fluorescence images were processed into sequences using the Illumina base-calling pipeline (SolexaPipeline-0.2.2.6). The human reference genome (hg19), together with the annotation of genes and repeats, was downloaded from the UCSC database (http://genome.ucsc.edu/). The SNP set of the Indian genome19 was downloaded from http://krishna.gs.washington.edu/indianGenome/, and the SNP set of the 1000 Genomes Project was downloaded from http://www.1000genomes.org/. We used SOAP (SOAPaligner version 2.20)20 to align all short reads onto the human reference genome (hg19). To avoid misalignment, PE clusters with 4 pairs were discarded.

SNP calling

We used SOAPsnp, with a statistical model based on Bayesian theory, to call SNPs, and the Illumina quality system to calculate the value of each possible genotype at every site. Each site was assigned the genotype (allele types) with the highest value. The final consensus values were transformed to quality scores on the Phred scale by the Illumina quality system. We used six steps to filter out unreliable portions of the consensus sequence: (1) we used a Q20 quality cutoff; (2) we required at least four reads; (3) the overall depth, including randomly placed repetitive hits, had to be <100; (4) the approximate copy number of flanking sequences had to be <2; (5) there had to be at least one paired-end read; and (6) the SNPs had to be at least 5 bp away from each other.
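The six filtering steps can be illustrated with a short Python sketch. This is not the SOAPsnp implementation: the `SiteCall` record, its field names and the order in which the spacing rule is applied (after criteria 1 to 5) are assumptions made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SiteCall:
    """Per-site summary of a candidate SNP (hypothetical field names)."""
    quality: int            # Phred-scaled consensus quality
    read_support: int       # number of reads supporting the site
    total_depth: int        # overall depth, including repetitive hits
    copy_number: float      # approximate copy number of flanking sequence
    paired_end_reads: int   # number of supporting paired-end reads
    position: int           # coordinate on the chromosome


def passes_site_filters(site: SiteCall) -> bool:
    """Criteria (1) to (5) from the text, applied to a single site."""
    return (
        site.quality >= 20
        and site.read_support >= 4
        and site.total_depth < 100
        and site.copy_number < 2
        and site.paired_end_reads >= 1
    )


def filter_snps(sites: list[SiteCall], min_spacing: int = 5) -> list[SiteCall]:
    """Apply criteria (1) to (5), then criterion (6): SNPs >= min_spacing bp apart.

    Assumption: spacing is evaluated among the sites that survive (1) to (5),
    and all sites come from the same chromosome.
    """
    kept = sorted((s for s in sites if passes_site_filters(s)),
                  key=lambda s: s.position)
    result = []
    for i, site in enumerate(kept):
        close_prev = i > 0 and site.position - kept[i - 1].position < min_spacing
        close_next = (i + 1 < len(kept)
                      and kept[i + 1].position - site.position < min_spacing)
        if not (close_prev or close_next):
            result.append(site)
    return result
```

In this sketch two SNPs that lie closer than 5 bp to each other are both discarded; whether the original pipeline drops both or only one of them is not stated in the text.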

Indel calling

The principle of the indel calling method we used is similar to that of alternative splice calling in transcriptome analysis (such as Tophat; http://tophat.cbcb.umd.edu/). First, we selected the unmapped 75 bp reads and took the first 30 bp (head) and the last 30 bp (tail) of each to generate PE30 reads. Then, we aligned those PE30 reads to the human reference genome (hg19) with no mismatch and no gap tolerance. If the coordinate distance between the two ends of a PE30 read was 1–5 bp larger or smaller than the insert size (in our case 15 bp, that is, 75 − 30 − 30), we considered the 75 bp read to be unmapped due to an indel. We filtered the final results by requiring at least four hits (four unmapped 75 bp reads with the indel).
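The split-read logic described above can be sketched in Python as follows. The function names are hypothetical, the exact-match alignment step is assumed to have been done elsewhere (only the resulting reference coordinates are used), and the way reads are grouped into one candidate locus is not specified in the text, so the caller is assumed to pass placements belonging to a single locus.

```python
from __future__ import annotations

from collections import Counter

READ_LEN = 75
END_LEN = 30
EXPECTED_GAP = READ_LEN - 2 * END_LEN  # 15 bp between head and tail


def split_read(read: str) -> tuple[str, str]:
    """Return the head 30 bp and tail 30 bp of a 75 bp read."""
    return read[:END_LEN], read[-END_LEN:]


def classify_gap(head_start: int, tail_start: int, max_indel: int = 5):
    """Classify one read from the exact-match positions of its two ends.

    Returns ("deletion", size), ("insertion", size) or None.
    A gap on the reference that is wider than expected means the sample
    lacks bases (deletion); a narrower gap means extra bases (insertion).
    """
    observed_gap = tail_start - (head_start + END_LEN)
    shift = observed_gap - EXPECTED_GAP
    if 1 <= abs(shift) <= max_indel:
        return ("deletion" if shift > 0 else "insertion", abs(shift))
    return None


def supported_indels(placements, min_hits: int = 4) -> dict:
    """Keep indel events supported by at least min_hits reads.

    placements: iterable of (head_start, tail_start) pairs for reads
    assumed to cover the same candidate locus.
    """
    events = (classify_gap(h, t) for h, t in placements)
    counts = Counter(e for e in events if e is not None)
    return {event: n for event, n in counts.items() if n >= min_hits}
```

For example, a read whose head maps at position 100 and whose tail maps at position 148 shows an 18 bp gap on the reference instead of 15 bp, which this sketch reports as a 3 bp deletion.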

Structural variant calling

Our two libraries were constructed with insert sizes of 750 bp and 700 bp, respectively. Paired-end reads that mapped to hg19 with a coordinate distance exceeding the average insert size by more than 3 times the insert-size s.d. (here, 14) were considered abnormal. We grouped these reads into diagnostic paired-end (PE) clusters. To avoid misalignment, PE clusters with 5 pairs were discarded. Structural variations including deletions, translocations, duplications and inversions were examined and summarized into alignment models. Reads were assembled to verify the specific coordinates of structural variation elements.
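As a minimal sketch of the abnormal-pair criterion, the following Python function flags read pairs whose mapped distance exceeds the mean insert size by more than three standard deviations. The function and parameter names are hypothetical. The optional `two_sided` flag, which also flags pairs that map too close together, is an assumption that goes beyond what the text states.

```python
from __future__ import annotations


def is_abnormal_pair(distance: int, mean_insert: float,
                     insert_sd: float = 14.0, n_sd: float = 3.0,
                     two_sided: bool = False) -> bool:
    """Flag a read pair whose mapped distance is inconsistent with the library.

    By default only pairs mapping too far apart are flagged, following the
    wording of the text; two_sided=True also flags pairs mapping too close.
    """
    deviation = distance - mean_insert
    if two_sided:
        deviation = abs(deviation)
    return deviation > n_sd * insert_sd


def flag_pairs(distances: list[int], mean_insert: float, **kwargs) -> list[int]:
    """Return the mapped distances that would be treated as abnormal."""
    return [d for d in distances if is_abnormal_pair(d, mean_insert, **kwargs)]
```

With an s.d. of 14, the cutoff lies 42 bp above the mean, so for the 750 bp library a pair mapping 800 bp apart would be flagged while one mapping 780 bp apart would not.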

Comparative genomics and annotation of genomic variation

The entire set of genomic variation found in the PK1 genome was compared with the Single Nucleotide Polymorphism database (dbSNP), the 1000 Genomes Project data set, OMIM and DGV (Database of Genomic Variants). The gene annotation and genomic loci were derived from RefSeq mappings on hg19. The variants identified in gene regions were classified as (a) exonic or intronic and (b) homozygous or heterozygous. The exonic variants were further analyzed for potential functional effects.

Analysis of potential functional consequences of genomic variations

The non-synonymous SNPs (nsSNPs) found in the PK1 genome were screened for potentially damaging effects of missense mutations using the SIFT (Sorting Intolerant From Tolerant) program.21 SIFT is a popular sequence-based amino-acid substitution prediction method available at: http://blocks.fhcrc.org/sift/SIFT.html. This program uses sequence-based predictive features to determine whether amino-acid exchanges are likely to be damaging or not. GO term enrichment analysis was carried out using the GOrilla program.22 GOrilla identifies enriched GO terms in lists of genes. It employs a flexible threshold statistical approach to discover GO terms that are significantly enriched at the top of a ranked gene list and computes an exact P-value according to the mHG or HG model.22 For GO term analysis, gene lists were submitted as inputs to the GOrilla server at: http://cbl-gorilla.cs.technion.ac.il with default running parameters.
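To make the HG model concrete, the sketch below computes the standard hypergeometric tail probability for one GO term in Python. It illustrates the statistic only and is not GOrilla's code: the function and argument names are hypothetical, and neither the minimum-hypergeometric (mHG) search over ranking cutoffs nor the multiple-testing correction is reproduced.

```python
from math import comb


def hypergeom_tail(total_genes: int, term_genes: int,
                   target_genes: int, overlap: int) -> float:
    """P(X >= overlap) under the hypergeometric (HG) model.

    total_genes  : N, size of the background gene set
    term_genes   : B, background genes annotated with the GO term
    target_genes : n, size of the target gene list
    overlap      : b, target genes annotated with the GO term
    """
    upper = min(target_genes, term_genes)
    denominator = comb(total_genes, target_genes)
    numerator = sum(
        comb(term_genes, i) * comb(total_genes - term_genes, target_genes - i)
        for i in range(overlap, upper + 1)
    )
    return numerator / denominator
```

In a toy background of 10 genes, 4 of which carry the term, drawing a target list of 3 genes that all carry the term has probability C(4,3)/C(10,3) = 4/120 under this model.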

Results

Genome sequencing and mapping to reference genome

The individual whose genome is described in this report is Prof. Atta-ur-Rahman, a 69-year-old Pakistani male. The donor has no apparent genetic disorders, and his family lives in Karachi, Pakistan. Genomic DNA was subjected to sequencing using a HiSeq 2000 Genome Analyzer (Illumina Inc. San Diego, USA). Two paired-end libraries were constructed with insert sizes of 750 and 700 base pairs. A total of 78.98 Gb of sequence data were generated from the two libraries (Supplementary Table 1). Using the SOAPaligner software,20 74.4 Gb (94%) of the sequence data were aligned to the human reference genome (NCBI37/hg19). This resulted in a sequence depth of 25.5×, with about 99.5% of the human reference genome covered by at least one read. The sequence depth of individual chromosomes is shown in Supplementary Figure 1.

Identification of SNPs and analysis of variants

During SNP detection, we applied Bayesian inference to calculate the probability and accuracy of genotypes. At each locus, the genotype with the highest probability was selected as the PK1 genotype, and a quality score was assigned as a measure of SNP call accuracy. Those loci in the PK1 consensus sequences that are polymorphic relative to the NCBI reference genome (hg19) were selected and filtered under specific criteria: a quality value of 20 and support of the polymorphic site by at least four reads. Using the SOAPsnp software,23 a total of 3 224 311 SNPs were detected in PK1 at an average density of 0.1%. Of these, 1 266 738 (39.2%) SNPs were identified as homozygous, whereas the remaining 1 957 573 (60.7%) were heterozygous. The chromosomal distribution of these SNPs is shown in Supplementary Table 2. Approximately one-third of all SNPs (1 031 979) were located within genes; of those, 12 896 SNPs were located in coding exon sequences. We found 876 370 SNPs in introns, of which 153 were in splice sites. Of the 12 896 SNPs in coding exons, 6905 were synonymous and 5991 were non-synonymous substitutions; 8123 were heterozygous and the remaining 4773 were homozygous. Among the 5991 nsSNPs (0.18% of total SNPs), 463 were novel coding variants (that is, not present in dbSNP or the 1000 Genomes Project data set). The PK1 genome has a fraction of nsSNPs similar to that of the Chinese (7062; 0.23%), Watson (7319; 0.20%) and Venter (6889; 0.22%) genomes.
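The idea of choosing the most probable genotype and attaching a Phred-scaled quality can be illustrated with a deliberately simplified Python sketch. It is not the SOAPsnp model: the uniform genotype prior, the single fixed `error_rate` for every base and the cap at quality 99 are all assumptions made for illustration.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def call_genotype(observed: str, error_rate: float = 0.01):
    """Return (genotype, phred_quality) for the bases observed at one locus.

    Simplified diploid model: each read base is drawn from one of the two
    alleles with equal probability and is misread with probability
    error_rate (split evenly among the three other bases). Priors are uniform.
    """
    log_like = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):  # 10 genotypes
        total = 0.0
        for base in observed:
            p1 = 1 - error_rate if base == a1 else error_rate / 3
            p2 = 1 - error_rate if base == a2 else error_rate / 3
            total += math.log(0.5 * p1 + 0.5 * p2)
        log_like[(a1, a2)] = total

    # With a uniform prior the posterior is the normalised likelihood.
    max_ll = max(log_like.values())
    weights = {g: math.exp(ll - max_ll) for g, ll in log_like.items()}
    norm = sum(weights.values())
    posteriors = {g: w / norm for g, w in weights.items()}

    best = max(posteriors, key=posteriors.get)
    # Probability that the call is wrong = posterior mass on other genotypes.
    p_error = sum(p for g, p in posteriors.items() if g != best)
    quality = 99 if p_error <= 0 else min(99, round(-10 * math.log10(p_error)))
    return "".join(best), quality
```

Under these assumptions, ten reads that all show `A` yield the homozygous call `AA`, whereas five `A` and five `C` reads yield the heterozygous call `AC`; both calls clear the quality cutoff of 20 used for filtering.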

The nsSNPs found in the PK1 genome were screened for potentially damaging effects of missense mutations using the SIFT program.21 nsSNPs that lead to an amino-acid change in the protein product are of interest because of their role in the protein structure–function relationship. Among the 5991 nsSNPs, 917 (15.3%) were potentially deleterious coding variants; 655 of these were heterozygous and 174 were homozygous. Examination of genes with deleterious SNPs using the GOrilla program22 identified ‘retinoic acid signaling pathway’ and ‘regulation of transcription’ as the GO terms enriched in this gene set (corrected P=3.22 × 10^−4 and 4.27 × 10^−3, respectively). Scanning of the 5991 nsSNPs against the OMIM database24 identified 117 (1.9%) disease-associated coding variants in the PK1 genome. GOrilla identified ‘humoral immune response’ as a marginally enriched GO term in the disease-associated gene list (corrected P=1). The genes involved were mannan-binding lectin serine peptidase, NOTCH2, C8A and C8B (complement component 8, α and β polypeptides).

We identified 388 532 SNPs (12% of the total PK1 SNPs) that are novel, that is, not present in dbSNP or the 1000 Genomes Project data. These novel SNPs were distributed across all chromosomes (Table 1). Of these novel SNPs, 277 859 (71.5%) were heterozygous, whereas 110 673 (28.5%) were homozygous. Further analyses revealed that 100 298 (26%) of the novel SNPs were located in gene regions, including 1706 (0.44%) in coding exons. Among these 1706 novel coding SNPs, 731 were synonymous and 975 were non-synonymous; 1402 were heterozygous and the remaining 304 were homozygous. However, GOrilla22 analysis of the genes containing the 975 novel nsSNPs did not show significant enrichment.

Table 1 The novel SNPs in the PK1 genome. Chromosomal and gene region distribution of PK1 SNPs not present in the dbSNP database or the 1000 Genomes Project data set

SNPs shared between individuals of Pakistani and Indian origin

Kitzman et al.19 recently reported the haplotype-resolved genome sequence of a Gujarati-Indian individual. The state of Gujarat is located in the north-western part of India, bordering Pakistan. Like the Pakistani population, Gujaratis have an ancestral north Indian (ANI) origin. Several recent studies have examined the effects and causes of positive selection in the human genome. The availability of a number of entirely sequenced human genomes provides an opportunity to explore features contributing to positive selection in unprecedented detail. Therefore, a comparative analysis of the PK1 and Gujarati genome sequences was carried out. Comparison between the genomes of that individual and PK1 revealed 1 825 213 shared SNPs (56% of total PK1 SNPs), of which 586 700 (32% of the PK1-Indian shared SNPs) were annotated by the refGene database in 14 007 gene regions. Of the shared SNPs, 101 803 were not present in the 1000 Genomes Project data. Of those novel SNPs, 24 524 are annotated in the refGene database, and 166 are non-synonymous (Tables 2 and 3).
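A comparison of this kind reduces to set operations on variant keys, as the Python sketch below illustrates. The key format (chromosome, position, alternative allele) and the function name are assumptions; the text does not state how SNPs from the two genomes were matched or whether genotypes had to agree.

```python
from __future__ import annotations

# A SNP is keyed here by (chromosome, position, alternative allele).
SnpKey = tuple[str, int, str]


def compare_call_sets(sample_a: set[SnpKey], sample_b: set[SnpKey],
                      known: set[SnpKey]) -> dict[str, float]:
    """Summarise SNPs shared by two genomes and those absent from a reference set."""
    shared = sample_a & sample_b          # present in both genomes
    shared_novel = shared - known         # shared, but not in the reference set
    fraction = len(shared) / len(sample_a) if sample_a else 0.0
    return {
        "shared": len(shared),
        "shared_fraction_of_a": fraction,
        "shared_novel": len(shared_novel),
    }


# Toy usage with made-up coordinates
pk1 = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G"), ("chr3", 7, "C")}
other = {("chr1", 100, "A"), ("chr2", 50, "G"), ("chr9", 9, "T")}
catalog = {("chr1", 100, "A")}
summary = compare_call_sets(pk1, other, catalog)
```

In the toy data, two of the four SNPs in the first genome are shared with the second, and one of those two is absent from the reference catalog.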

Table 2 Chromosomal and gene regions distribution of shared SNPs between PK1 and Gujarati-Indian genome sequences19
Table 3 Statistics of SNPs shared between PK1 and Indian19 genome sequences

Examination of the 14 007 genes containing PK1-Indian shared SNPs using the GOrilla program22 revealed interesting correlations. The results identified seven GO terms with corrected P-values in the range of 10^−3–10^−8, representing an array of biochemical and cellular processes (Table 4). Among these GO terms, ‘response to jasmonic acid stimulus’, ‘aminoglycoside antibiotic metabolic process’ and ‘glycoside metabolic process’ showed the strongest enrichment in this gene set (corrected P=1.02 × 10^−8, 2.3 × 10^−6 and 4.18 × 10^−6, respectively). The next most significantly enriched GO terms included ‘steroid metabolic process’, ‘dimethylallyl diphosphate metabolic process’ and ‘isoprenoid metabolic process’ (corrected P=4.88 × 10^−4, 6.09 × 10^−4 and 4.76 × 10^−3, respectively). Interestingly, four genes encoding aldo-keto reductase family enzymes (that is, AKR1C1, AKR1C2, AKR1C3 and AKR1C4) were involved in all of these GO terms. Two isopentenyl-diphosphate delta isomerase genes (IDI1 and IDI2) were also found to be involved in some of these processes.

Table 4 Gene ontology (GO) term enrichment of genes (n=14,007) with SNVs found in both PK1 and Indian genomes. The genes AKR1C1, AKR1C2, AKR1C3 and AKR1C4 are attributed to all of the GO terms in the table

Moreover, ‘cellular response to hydrogen peroxide’ (14 genes), ‘detection of chemical stimulus involved in sensory perception of smell’ (four olfactory receptor genes OR6A2, OR51B2, OR51B5 and OR51E2), glycolysis (5 genes) and ‘regulation of insulin signal’ (5 genes) were identified as enriched GO terms. Inference of these observations at population level requires more genome level studies from this region. As several of the enriched GO terms mentioned above are involved in drug metabolism, further studies may help to identify potential pharmacogenetic incompatibilities of certain drugs.

Identification of short indels

During the process of indel identification, we considered gapped alignments containing insertions or deletions of 1–5 bp. Short indels were confirmed when they were identified on both strands with a minimum of four reads. From this analysis, we identified a total of 59 558 indels in the PK1 sequences, of which 32 890 were deletions and 26 668 were insertions. According to functional classification, approximately one-third of these indels (33.15%, that is, 19 746) were located within gene regions; of those, 16 609 were found in introns and 12 in splice sites. Thirty-seven indels were located in coding exons and were homozygous (Supplementary Table 3).

Identification of structural variants

Structural variations include deletions, insertions, inversions and other DNA sequence rearrangements. Paired-end sequencing is important for the identification of large structural variants (SVs) in individual genomes relative to a reference.25, 26 We identified a total of 16 063 SVs in PK1, ranging in size from 0.1 to 100 kb with an average length of 2 kb. The total length of all SVs was >20 Mb (that is, 20 111 213 bp), and the majority of SVs (90%) were 500–1500 bp long (Supplementary Figure 2). Of these, 8572 SVs (53% of total SVs) were not present in the DGV database (http://projects.tcag.ca/variation/), indicating the presence of novel SVs in the PK1 genome. Of the 16 063 SVs, 98.4% were large insertions or deletions. The remaining SVs included tandem duplications (0.67%), inversions (0.062%), dispersed duplications (0.21%) and complex structures (0.60%) (Supplementary Table 4). A total of 5312 SVs were located in 2938 genes (refGene database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/); of these, 478 were in coding exons (located within coding exons or overlapping exons). Of these SVs in coding exons, 17 were classified as tandem duplications; the remainder were large insertions and deletions. We selected 11 SV regions with a length of <1 kb for PCR validation. The fragment sizes for 8 SV regions were validated; the remaining three were inconclusive because the experiments were not successful (Supplementary Figure 3).

Discussion

We identified 3.22 million SNPs in the PK1 genome, of which over 0.38 million (12%) were novel. About one-fourth of the novel PK1 SNPs were also present in the Indian genome (101 803 novel PK1-Indian shared SNPs out of 388 532 novel SNPs in PK1), indicating a close relationship between these individuals. Using the Markovian coalescent model applied to Chinese, European, Korean and Yoruban genome sequences, Li and Durbin16 inferred that European and Chinese populations experienced a severe bottleneck 10 000–60 000 years before present, whereas African populations experienced a milder bottleneck from which they recovered earlier. Moreover, analyses of genome-wide SNP data sets from the CEPH Human Genome Diversity Panel samples and the International HapMap Project classified the population groups studied into three genetic groups, namely Africans, Eurasians (Europeans, Middle Easterners and Central Asians, including present-day Pakistan) and East Asians (also including Americans and Oceanians).27 The amount of variation (that is, the number of SNPs and heterozygosity) we found in the PK1 genome is comparable to that of a European genome (CEU). Therefore, the present data are consistent with the previous observation27 and suggest that the PK1 genome reflects a bottleneck similar to that experienced by Europeans.

The study subject is elderly (70 years) and apparently in good health. Therefore, the novel coding variants identified in this study can be linked to health status and phenotypes over the whole lifetime. As some of the PK1 coding alleles have been reported to be associated with disease, the current results may help to re-evaluate those previous reports. Moreover, our analysis showed that SVs are a major type of variation in the genome. The large number of SVs identified in this study may have functional roles equivalent to or greater than those of SNPs.

Conclusions

We carried out whole-genome sequencing of a Pakistani individual at 25× coverage. The present genomic data will be an important reference to add to the current set of deeply sequenced genomes from different ethnic groups. Our analysis revealed a sizeable number of unreported SNVs, short indels and structural variations. As expected, deleterious non-synonymous mutations have a lower frequency than neutral variations, probably due to negative selection. Human genomics can identify unknown variations associated with complex diseases that are widespread in the South Asian subcontinent, such as diabetes and cardiovascular disorders.