Population-specific long-range linkage disequilibrium in the human genome and its influence on identifying common disease variants

Despite the availability of large-scale sequencing data, long-range linkage disequilibrium (LRLD) has not been extensively studied. The theoretical aspects of LRLD estimates were studied to determine the best estimation method for the sequencing data of three populations of African (AFR), European (EUR), and East Asian (EAS) descent from the 1000 Genomes Project. Genome-wide LRLDs excluding centromeric regions revealed clear population specificity, with substantially more population-specific LRLDs than coincident LRLDs. Clear relationships between the functionalities of the regions in LRLDs pointed to long-range interactions in the genome. The proportions of gene regions were increased in LRLD variants, and the coding sequence (CDS)-CDS LRLDs showed obvious functional similarities between genes in LRLDs. Application to theoretical case-control associations confirmed that LRLDs in genome-wide association studies (GWASs) could contribute to false signals, although the impact might not be severe in most cases. LRLDs involving variants with functional similarity exist in the human genome, indicating possible gene-gene interactions, and they differ among populations. Based on the current study, LRLDs should be examined in GWASs to identify true signals. More importantly, population specificity in LRLDs should be examined in relevant studies.

When LRLDs surrounding the centromere were excluded, the proportion of functional variants among the total calculated CDS variants increased slightly, to 0.662 without NMD variants and 0.725 with NMD variants. When LRLDs on and near the centromere were excluded, the proportion decreased to 0.593 without NMD variants and 0.691 with NMD variants. Similarly, 97% and 98% of functional variants with and without NMD variants, respectively, were observed more than once in LRLDs excluding those surrounding the centromere; for LRLDs excluding those on and near the centromeres, the corresponding percentages were 95% and 96%. When only missense and frameshift variants were considered, 100% of the functional variants were observed more than once in LRLDs excluding those either surrounding or on and near the centromere.
As shown in Figs. 3A and 3B, the proportion of variants located in the 5' UTR was the lowest for all LRLD variants and increased when LRLD variants on and near the centromere were excluded, yielding a higher proportion than that among the total estimated variants.
Similar trends were observed in the proportions of variants in noncoding regions and in introns. The proportion of variants located in the 3' UTR followed a similar trend to that in the 5' UTR; however, the proportion of variants in the 3' UTR was the highest among the total estimated variants. Interestingly, no LRLD variants were detected in the ±5000-bp gene regions (coded as 6, 7, 8, and 9). Surprisingly, gene regions made up an even smaller proportion of LRLD variants. The proportions of variants in nongenic regions showed opposite trends to the proportions in any gene regions.
The total number of calculations to detect LRLD was 300,036,014,321, and an LRLD was counted as detected if it was observed in at least one population. The proportion of variants in genic regions coded 1 through 5 was 0.4587 (Supplementary Material, Table S1).
The proportion in any gene region decreased slightly to 0.4431 for LRLD variants, giving an expected number of LRLDs with both LD positions in genic regions of 3,124,330. The observed number was 3,335,296, slightly higher than expected. The proportion in any gene region increased to 0.5215 and 0.5623 for LRLDs excluding those surrounding centromeres and excluding those on and near centromeres, respectively. For LRLDs with both LD positions in genic regions, the observed number of LRLDs excluding variants on and near the centromere was 38,817, again slightly higher than the expected value of 38,145; however, the observed number of LRLDs excluding those surrounding the centromere was 151,009, less than half the expected number of 330,708. This result indicates that LRLDs excluding those surrounding centromeres contain many more LRLDs between a genic region and a nongenic region, most of which were from chromosome 9 (81.4% of genic-nongenic LRLDs), as shown in Fig. 2. After all regions on and near the centromere were excluded, the same trend of slightly more observed genic-genic LRLDs than expected was found, confirming that most of the genic-nongenic LRLDs in chromosome 9 were removed.
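The comparison of observed and expected genic-genic LRLD counts can be illustrated with a short Python sketch. It uses only the observed and expected counts quoted above. The helper `expected_genic_genic` is a hypothetical illustration that assumes the two positions of an LRLD fall in genic regions independently, so the expectation is the number of LRLDs times the squared genic proportion; this is one natural reading, not necessarily the exact calculation used in the study, and the total LRLD count it would need is not given here.

```python
# Observed and expected genic-genic LRLD counts as quoted in the text.
COUNTS = {
    "all LRLDs": (3_335_296, 3_124_330),
    "excluding LRLDs surrounding the centromere": (151_009, 330_708),
    "excluding LRLDs on and near the centromere": (38_817, 38_145),
}


def expected_genic_genic(n_lrld: int, p_genic: float) -> float:
    """Expected number of LRLDs with both positions in genic regions.

    Assumption (not confirmed by the source): each of the two positions
    is genic independently with probability p_genic.
    """
    return n_lrld * p_genic ** 2


def observed_to_expected(observed: int, expected: float) -> float:
    """Ratio above 1 means more genic-genic LRLDs than expected."""
    return observed / expected


for label, (obs, exp) in COUNTS.items():
    ratio = observed_to_expected(obs, exp)
    print(f"{label}: observed/expected = {ratio:.3f}")
```

A ratio slightly above 1 corresponds to the "slightly higher than expected" cases, whereas a ratio below 0.5 corresponds to the deficit of genic-genic LRLDs seen when only LRLDs surrounding the centromere are excluded.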
There was a slight increase in the proportions of CDS and 5' UTR variants among LRLD hotspot variants, defined as variants involved in LRLD at least 100 times, as presented in Fig. 3A; however, other gene regions showed slightly decreased proportions, with an increasing proportion of nongenic regions (Supplementary Material, Table S2). The gene proportions including repeated variants in LRLD were different from those of unique variants in LRLD. As shown in Supplementary Material, Table S3B, the proportion of CDS variants was almost twice as large as the proportion of unique CDS variants in all of the populations, indicating that CDS variants were repeatedly involved in several LRLDs. These results are consistent with the finding that 92% of the functional CDS variants (frameshift and/or missense) were involved in LRLD more than once and were predominantly observed as LRLD hotspots. When centromeric regions were excluded, 100% of the functional CDS variants were involved in LRLD more than once. These results indicate the possibility of functional LRLD due to long-range gene interactions.
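The distinction between proportions that include repeated variants and proportions of unique variants, and the hotspot threshold of 100 involvements, can be made concrete with a minimal Python sketch. The function names, the variant identifiers, and the toy data are hypothetical and chosen only for illustration; they are not the pipeline or data of the study.

```python
from collections import Counter

HOTSPOT_MIN = 100  # threshold stated in the text: involved in LRLD at least 100 times


def involvement_counts(pairs):
    """Count how many LRLD pairs each variant takes part in."""
    counts = Counter()
    for first, second in pairs:
        counts[first] += 1
        counts[second] += 1
    return counts


def hotspots(counts, min_count=HOTSPOT_MIN):
    """Variants involved in LRLD at least min_count times."""
    return {variant for variant, n in counts.items() if n >= min_count}


def region_proportion(counts, region_of, region, repeated):
    """Proportion of LRLD variants annotated with `region`.

    repeated=True weights each variant by its number of involvements;
    repeated=False counts each variant once (unique variants).
    """
    total = 0
    hits = 0
    for variant, n in counts.items():
        weight = n if repeated else 1
        total += weight
        if region_of[variant] == region:
            hits += weight
    return hits / total if total else 0.0


# Hypothetical toy data: one CDS variant paired with three intronic variants.
toy_pairs = [("v1", "v2"), ("v1", "v3"), ("v1", "v4")]
toy_regions = {"v1": "CDS", "v2": "intron", "v3": "intron", "v4": "intron"}
toy_counts = involvement_counts(toy_pairs)

unique_cds = region_proportion(toy_counts, toy_regions, "CDS", repeated=False)
repeated_cds = region_proportion(toy_counts, toy_regions, "CDS", repeated=True)
```

In this toy case the CDS proportion is larger when repeated involvements are counted than when each variant is counted once, which is the pattern expected when CDS variants take part in several LRLDs.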
As shown in Fig. 3B and Supplementary Material, Table S3, the results differed slightly among populations. The EUR population showed a slightly smaller proportion of CDS variants and much larger proportions of 5' UTR and 3' UTR variants than the other populations.
However, when LRLDs with variants on and near the centromere were excluded, the proportion of CDS variants increased in the EUR population (Supplementary Material, Table S3D). The proportions of other regions also showed slight yet clear population differences, as shown in Supplementary Material, Table S3. These population differences indicate the possibility of population-specific long-range interactions.