Genome maps across 26 human populations reveal population-specific patterns of structural variation

Large structural variants (SVs) in the human genome are difficult to detect and study by conventional sequencing technologies. With long-range genome analysis platforms, such as optical mapping, one can identify large SVs (>2 kb) across the genome in one experiment. Analyzing optical genome maps of 154 individuals from the 26 populations sequenced in the 1000 Genomes Project, we find that phylogenetic population patterns of large SVs are similar to those of single nucleotide variations in 86% of the human genome, while ~2% of the genome has high structural complexity. We are able to characterize SVs in many intractable regions of the genome, including segmental duplications and subtelomeric, pericentromeric, and acrocentric areas. In addition, we discover ~60 Mb of non-redundant genome content missing in the reference genome sequence assembly. Our results highlight the need for a comprehensive set of alternate haplotypes from different populations to represent SV patterns in the genome.


Filtering of SV lists
To produce the final SV lists, we integrated the SVs from the three modules of OMSV.
Overlapping SVs of the same type were merged, with the size of the resulting SV estimated by the mean size of the merged SVs, and its span (i.e., the possible occurrence location) set as the union of the spans of the merged SVs.
Since some analyses were sensitive to false positives on the SV list, we performed additional filtering to the high-confidence list by removing the following SVs: • SVs that overlapped complex regions at which alignments/assemblies could be unreliable. We compiled a list of 55 complex regions (Supplementary Data 3).
• SVs that overlapped unknown sequences on the reference (N-gaps) since these SVs were less reliable. Considering chromosomes 1-22, X and Y of hg38, there were 879 N-gaps, with a total size of 149,690,620 bp.
• SVs that overlapped fragile sites, DNA regions with close nicking sites on opposite strands, which could cause DNA double-strand breakage and wrongly detected as signals of SV.
• SVs that overlapped the pseudo-autosomal regions (PARs) on the X and Y chromosomes, since optical maps from these regions had a high chance of getting aligned incorrectly. In some recent large-scale population genomic studies, PARs were handled by excluding them 3 or masking them on one of the sex chromosomes 4 .
We followed the former strategy in this study.
• SVs that overlapped regions with super high density of nicking sites. There were three such genomic regions based on the in silico reference map, namely chr10:39687138-39935168, chr12:34820121-37139973, and chr17:22754856-23194891. The average density of nicking sites in each of these regions was larger than one site per 3 kb, which increased the difficulty of alignment, assembly and subsequently SV detection.
For both the high-confidence list and the full list, we kept only SVs larger than 2 kb to focus on large SVs in this study.

Definition of populations and super-populations
We adopted the same definitions of populations and super-populations from the 1000 Genomes Project 2,5 , which included 26 populations grouped into 5 super-populations, namely AFR (Africans), AMR (Admixed Americans), EAS (East Asians), EUR (Europeans) and SAS (South Asians).

Validating SVs by 10x Genomics linked sequencing data
We produced 10x Genomics linked sequencing data for 13 samples evenly picked from 5 super-populations. SVs identified from optical mapping were validated by these linked sequencing data if it overlapped an SV called from the sequencing data, or if the flanking sequences supported the SV size.
Specifically, the linked sequencing data from each sample were assembled using Supernova 6 . The resulting contigs were compared to the reference sequence using nucmer from the Mummer package 7 . Segments of changes larger than 1 kb were collected as SVs.
An SV identified from optical mapping was considered supported by an SV from linked sequencing if they overlapped and had the same type.
In the flanking sequence analysis (Supplementary Figure S7), for each SV identified from genome mapping, we extracted its flanking sequences on both sides in the reference genome and aligned them to the 10x Genomics contigs. If both flanking sequences could be aligned to the same contig, we compared the distance between them on the contig and the reference to determine whether the 10x Genomics data supported the SV. An SV was considered validated if the distance on the contig was at least 500 bp larger (resp. smaller) than the reference for an insertion (resp. deletion). If one or both sequences could not be aligned to 10x Genomics contigs, or they could only be aligned to different contigs, we considered the SV not verifiable and did not consider it in the calculation of the supporting rate.

Analysis of SVs with different sizes in different populations
We performed an analysis of variance (ANOVA), taking each super-population as a group of samples, of the sizes of all N detected SVs. We selected SVs with a resulting p-value < 0.05/N (Bonferroni correction of threshold) and being called in at least 10 samples. To study the SVs detected in Sudmant et al. (2015) that did not overlap the SVs on our highconfidence list, we classified them into the following categories: • SVs found on our full list: These SVs could be identified by using less stringent settings of our pipeline.
• Remaining SVs that overlapped one of our filtered regions: Due to the filtering, these SVs could not be detected by our pipeline.
• Remaining SVs having sufficient (≥20) aligned optical maps that overlapped their loci in at least 75% of the samples: Due to the good depth of coverage, a legitimate SV could likely have been detected by our pipeline, and therefore the 1000 Genome calls could be false positives.

Analysis of specific and common SVs in different super-populations
We studied the ratios of SVs specific to a single super-population and shoes shared by multiple or all super-populations. To handle the issue of having different numbers of samples among different super-populations (e.g. 42 African samples and only 24 American samples), we sub-sampled each super-population to the number of samples of the super-population with the fewest samples. The sub-sampling procedure was repeated 100 times with different random subsets of samples, and then we took the average of their results.

Principal component analysis of SV occurrence matrix
We constructed an SV occurrence matrix in which each row was a sample, each column was a high-confidence SV, and each entry was the allele count of an SV in a sample. Rare SVs that appeared in less than 5% of samples were removed. Samples with an SV count three standard deviations or more from the mean were also removed. To further eliminate the effect of SV count, we projected the samples onto the hyperplane orthogonal to the SV count. We performed this by adding the SV count vector as extra columns to the matrix before applying principal component analysis (PCA) and ignored the first principal component (PC). The second and third PCs were then reported as the real first and second PCs of the PCA.

Phylogenetic analysis of the 26 populations
In the phylogenetic analysis, we again removed rare SVs that appeared in less than 5% of samples from the high-confidence SV list. We then supplied these SVs and their zygosity in each sample to EIGENSOFT 8 to estimate the FST statistic between each pair of populations.
The resulting matrix of FST values was then used to reconstruct the phylogenetic tree by Neighbor Joining.

Saturation analysis of SVs
Based on the SVs identified from our samples, we performed a saturation analysis 9,10 to predict the ultimate number of large SVs in human populations. We used the following saturation curve: where denotes the ultimate number of SVs, b and c are constants to be determined from curve fitting, and x is the number of samples. The derivative ′( ) is the number of novel SVs that can be found by including one more sample when there are already samples.

Comprehensive analyses of inversions
For the inversions, we identified 338 (out of 380) in the low-complexity regions, 31 of which were also identified by the 1000 Genomes Project. In addition, 72 of 99 inversions in the Sanders callset 11 that can be lifted over to hg38 overlap with our study, suggesting the specificity of our study (Supplementary Table 8 -9). Among the remaining 27 inversions not found in our study, 12 of them were found to be close to N-gaps in hg38, making these inversions hard to call confidently, and the other 15 inversions have relatively low allele frequencies in the Sanders callset (average 0.25 for these inversions as compared to average 0.36 for the others) and a lower proportion of them can be found in the DGV database (33% for these inversions as compared to 81% for the others). These findings suggest them some of these inversions are rare or false positives.
We also explored possible population structures based on the complex structural variations.
From the PCA based on the inversions, the African samples were separated from the others based on the first two PCs and showed a clear cluster in the heat map (Supplementary Figure S26). However, the other four super populations were not well separated, suggesting that few novel inversions have emerged in those populations. We also found some inversions identified from at least three samples to be specific to a single super-population, including 7 identified in African samples, 4 in East Asian samples and 1 in European samples.

Y chromosome analysis
2.1. Y chromosome assembly Consensus maps generated from the Bionano IrysSolve de novo assembly pipeline were aligned to the in-silico labeled reference map. Coordinates of the maps aligning to the Y chromosome were extracted and the overall sample coverage was computed along the chromosome. The Illumina callable regions (Supplementary Figure S22) were obtained from Poznik, et al. 12 and the genome coordinates were converted to hg38 using liftOver.
Coordinates corresponding to the segmental duplications were downloaded using the UCSC Table Browser 13 and repeats with greater than 95% match fraction were used for plotting.

Identifying SV candidates in the Y chromosome
A list of SV candidates was manually curated using the top ten samples with the longest molecule N50s and the highest genome-wide coverages. These samples have long assembled contigs that can be used to confidently locate non-reference alleles on the Y chromosome for downstream analysis. SV candidates from these ten samples were initially verified using single molecules. Insertions, deletions, and inversions that were located inside segmental duplications were tolerated as long as the SV candidates were within 150 kb to a unique anchor on at least one end. After determining the genomic locus of each SV, we genotyped all samples using one of the following two strategies depending on the SV type.

Insertion, deletion, and inversion analysis
To detect insertions, deletions, and inversions among all male samples, an in silico labeled representation of each haplotype was created in CMAP format using custom scripts that combined the flanking areas of the reference chromosome with areas of representative assembled contigs observed in our samples. For all haplotypes at a given locus, their CMAP representations were kept as consistent between one another as possible, i.e. containing the same flanking areas. For each locus, single molecules from each sample were used to determine which haplotype(s) the sample contained, as follows: 1) Using outputs from the standard Bionano de novo genome assembly pipeline, all of the single molecules that aligned to the local area of the reference genome were obtained.
3) The resulting alignment files were post-processed. First, the top two alignments for each molecule were compared to one another; if they received the same confidence or score, they were discarded. Otherwise, the best hit for each molecule was evaluated to see whether its alignment spanned the entire "critical region" (CR). These regions were defined as those that were unique to the target haplotype when compared to other haplotypes at the same locus, as well as being anchored in the flanking area(s) by at least four labels or 40 kb, whichever was longer. For inversions inside palindromes, two CRs are defined, one on each end. Molecules would only be required to span one CR. Due to the inherent noise of single molecules, indels were permitted in the alignment overlapping the CR as long as the size change due to the indel was smaller than 50 kb. 4) Molecule alignments were manually verified if a haplotype for a given sample was supported by only one or two molecules, or if less than 70% of the molecules supported the chosen haplotype. For these flagged cases, the alignment between the molecules and haplotypes were manually inspected in OMTools OMView.

Copy Number Variation analysis
1) For each sample, we identified and extracted local molecules that aligned to the CNV candidate locus based on the initial outputs from the standard Bionano de novo genome assembly pipeline.
2) We made a CMAP of only the unique region immediately adjacent to but not including the CNV. We then realigned all Y chromosome molecules to this CMAP with OMBlastMapper from the OMTools package to gather as many informative molecules as possible. The non-redundant molecules were appended to the initial list of aligned molecules from step 1.
3) All molecules from the list were aligned to the corresponding CMAP containing the CNV using OMBlastMapper with the following parameters: -alignmentjoinmode 1filtermode 1 -maxaligneditem 1 -trimmode 1 -overlapmergemode 0. Alignment output was filtered to keep molecules that were anchored to both ends of the CNV, and at least one anchor had to be unique. 4) Filtered alignment output was passed to the SVDetection program of the OMTools package. A minimum support of 1 molecule was used to identify the overall size change between the flanking anchors. If more than one size change was reported, the result with the highest number of molecule support was used for the copy number calculations. Since the repeat unit is 23 kb for both CNVs, we divided the overall size change by this number to obtain the final copy number.

5)
We manually determined the copy number if the division ended within the range of 0.3 -0.7. Otherwise, the numbers were rounded to the closest integers. If no molecule could anchor to both ends of a CNV site, the particular sample would be removed from downstream analysis.
3. Identification of novel genome content not found in the hg38 reference Non-aligned contigs were gathered from all 154 genomes by comparing contig IDs that appeared in the final hg38-aligned XMAP file to the total assembled contigs. Contigs not appearing in the XMAP file were denoted as non-aligned contigs and de novo assembled using the Bionano IrysSolve assembly pipeline with default settings (pipeline version 4618).
Assembly resulted in 42 contigs for a total summed length of 16 Mb. Contigs not participating in this assembly were all-against-all aligned with Bionano RefAligner using default parameters. Alignments were filtered by confidence score of 1e-11. The alignment output XMAP file was loaded into a python pandas dataframe (python version 3.6.0, pandas version 0.22.0) and grouped by query contig ID (column 2). All collapsed groups were intersected by shared contig IDs. Groups with one unique ID -in other words, groups with no additional alignment to other contigs -were removed from consideration. The remaining unique groups totaled ~46 Mb in summed length. the chromosome with the highest alignment score was identified and all alignments for that contig to other chromosomes were discarded. Using Bedtools 15 and custom Python scripts, reference N-gap regions were identified that were either completely spanned by a single alignment, or that were flanked on both sides by non-overlapping, adjacent, same-strand alignments from the same contig. To calculate gap size, the two labels flanking a given gap were identified, and gaps with alignments that did not involve those two labels were filtered out. To avoid ambiguous placement, contigs were filtered out if either of their gap-flanking labels were involved in more than one alignment. Contigs were also filtered if either of the gap-flanking labels were the last label of an alignment, since a single label was not sufficient to confidently anchor the alignment across the gap. Observed gap size was calculated as: where label_distance is the distance between the two labels flanking the gap area in either the assembled contig or the reference.

10x Genomics linked sequencing data
For every sample with 10x Genomics linked read data, samples were assembled using Supernova version 1.1 with default parameters, and both pseudohaplotypes were output.
Each pseudohaplotype was aligned to the hg38 reference genome using nucmer from the MUMmer package 7 with settings -maxmatch -l 100 -c 500, filtered using delta-filter -q, and further filtered and converted to a more parsable format using show-coords -Tcrodl -I 90.
Using a combination of Bedtools commands and custom Python scripts, scaffolds were identified that aligned to at least 50% of both 5 kb regions flanking a given reference gap.
Ambiguous alignments, e.g. cases where multiple areas of the scaffold aligned to the same area flanking a reference gap, were filtered out. Alignments to both flanking regions were required to be on the same strand of the scaffold and to be in orientation-appropriate order, i.e. for plus-strand alignments, the scaffold region that aligned to the upstream reference region must be upstream of the scaffold region that aligned to the downstream reference region, although some overlap was allowed, as in the case of deletions involving repetitive content around the gap area. For cases where two separate alignments corresponded to the two flanking regions, the 10xG gap length was calculated as follows, for plus-strand alignments on the scaffold: where U and D are the upstream and downstream alignments with respect to the gap on the reference, respectively, q indicates a scaffold-based coordinate, r indicates a referencebased coordinate, and s and e indicate the start and end, i.e. the smallest and largest coordinates with respect to the reference, of the associated alignments.
For single alignments that spanned the entire gap on both sides, in order to precisely define the coordinates on the scaffold that corresponded to the gap region, the entire scaffold on which the alignment was found was re-aligned to the two 5 kb regions flanking the gap on the reference genome, using lastz 16 with the parameters --seed=match15 --exact=50 --nogapped --notransition --gfextend --chain --filter=nmatch:100 --filter=identity:95. Gap lengths were calculated as above, where Gre=0 and Grs=5000.
For all alignments, in cases where the gap length was positive, the scaffold sequence that corresponded to the reference gap region was extracted and processed with RepeatMasker 17 with parameters -species human -xm.

SV calls using linked-read sequence data
Sequence-based SV calling was done for 13 samples using two approaches: A) Linked reads were aligned to hg38 for phasing and variant calling using the 10x Genomics Long Ranger pipeline (v2.1, WGS analysis, using FreeBayes). The SV calls produced by this pipeline included mid-scale deletions (50 bp -30 kb) and large-scale SVs (≥ 30 kb). B) De novo assemblies produced using the 10xG Supernova software 6 were aligned to hg38 using nucmer (MUMmer v3.23, -maxmatch -l 100 -c 500) of the Mummer package 7 . Assemblies were initially generated using Supernova v1.1. In the course of this study, Supernova v2.0 was released, and we used it to generate new assemblies for these samples. As we aimed to use this sequence data to analyze as many of the optical mapping SVs as possible, we chose to use both set of assemblies for this particular analysis. For each of the 13 samples, nucmer alignment delta files of both of the de novo pseudohaplotypes (outputs designated by Supernova as 2.1 and 2.2) were input to Assemblytics 18 for SV calling, using minimum alignment lengths of 5 kb, 10 kb, 50 kb, 100 kb, 250 kb, and 1 Mb. Scaffolds were used rather than contigs because gaps in Supernova assemblies consist of a sequence of Ns roughly approximating the size of the gap.

Filtering the list of SVs found by optical mapping
10X Genomics compiled a 'blacklist' for SV calls, representing a set of regions in which SV calls are more likely to represent false positive or otherwise inaccurate SV calls. The regions in the suggested blacklist were derived from: 1) the UCSC browser gap track, including short arm gaps, heterochromatin gaps, telomere gaps, gaps between contigs in scaffolds and gaps between scaffolds in chromosome assemblies; 2) segmental duplications of >= 1Kb and >=90% sequence identity between copies; and 3) regions with new sequences introduced in GRCh38 (hg19 diff track, UCSC browser) 13 .
SVs that were identified by optical mapping but were within 20 kb of a region on the blacklist were filtered out and not included in downstream analysis.

Locating SV breakpoints and their associated sequences
Deletion breakpoints identified by optical mapping were compared to deletion breakpoints identified in the same sample by Long Ranger and Assemblytics (including deletions, repeat contractions, and tandem contractions). Insertion breakpoints identified by optical mapping were compared to large duplications identified in the same sample by Long Ranger and to insertions, repeat expansions, and tandem expansions identified in the same sample by Assemblytics. Comparison and interval intersection was done using the bedtools 15 package.
Matching records were retained when at least a third of the SV size was supported by the complementary method. Up to a 20 kb difference between breakpoint positions was permitted in order to account for the lower resolution and potential local misalignments of the optical maps. Next, the SV call lists of the 13 samples were merged into a unified list such that each of the SVs that were found by optical mapping had only one entry, annotated by the best matching sequence-based SV call. Ranking of matched sequence-based SVs was based on SV size and type as classified by Long Ranger and Assemblytics, where deletions and insertions were ranked higher than tandem contractions and expansions, which were themselves ranked higher than repeat contractions and expansions.
After optical map SVs were annotated with more accurate breakpoints based on the sequence-based SV calls, the corresponding sequence was extracted from either the reference or de novo assemblies, in the case of deletions or insertions, respectively. If the sequence contained mainly Ns, it was recorded as 'N-gap'. To determine the repetitive content of the SVs, we used RepeatMasker 17 v4.0.7 (--species human --xsmall). Based on the results, SVs were assigned to content classes (SINEs, LINEs, LTR, DNA, satellites, simple, unclassified, or low) when at least 50% of the sequence was of the same repeat class. SVs were classified as 'combined' when the repeat content was > 50% but no single class contributed 50% or more to the sequence content. SVs with repeat content < 50% were classified as 'none' unless they were classified as tandem repeats by Assemblytics or as pseudogenes in downstream analysis (see below).
Sequences flanking the breakpoints of SVs were also processed to identify the involvement of repetitive elements. The regions 500 bp before and after each SV were analyzed using RepeatMasker. The flanking regions were assigned to the main repetitive classes listed above, with a minimum requirement of 100 bp per repeat type.

Identification of pseudogenes involved in SVs
To identify SVs that originated from retrotransposition activity but could not be classified by RepeatMasker, we also looked for processed pseudogenes 19 . We used the Pseudogenes annotation of the Retroposed Genes V9 UCSC browser track 13 to identify pseudogene sequences from the reference that were deleted in the 13 samples. To identify insertions that were likely a result of a processed pseudogene retrotransposition event, we created a set of 40 bp chimeric sequences for every gene in the genome, composed of the last 20 bp of exon n and the first 20 bp of exon n+1 of the same gene. We aligned these sequences to the de novo assemblies using Bowtie 2 20 with a minimum alignment threshold of 17 matches out of 20 bp. To filter out hits from pseudogenes present in the reference genome, lastz 16 was used to perform local alignment of the candidate pseudogene and its flanking sequence (spanning 2x the size of the pseudogene) to the reference. The EMBOSS 21 stretcher global alignment tool was used to align all candidate pseudogenes to their corresponding cDNA sequences to determine the involved exons and removed introns, and candidates with low quality alignments (match sites <200 or similarity <90%) or without clear loss of introns (minimum of 90% of intron lost) were discarded. For additional filtering, anchor sequences (typically 30 kb upstream and 30 kb downstream, or 100 kb each when 30 kb was insufficient) flanking each of the candidate pseudogenes were aligned to the reference genome using BLASTn 22 . Any alignments larger than 500 bp were recorded and adjacent alignments of both anchors (distance <10 kb + size of pseudogene) were combined into one alignment. Candidate pseudogenes were filtered out if they were found to be present in the reference genome. The size distribution of deletions (orange) and insertions (blue) with minor allele frequency of at list 0.1 demonstrates a peak at ~6-7 kb, corresponding to LINE1 elements. The peak at around 2 kb is the result of the lowest size threshold used for SV calling (2 kb). Figure 6. The procedure for detecting complex SVs. Candidate complex SVs are identified from split-alignments. Optical maps that support each SV candidate and those that do not support it are collected to determine whether the SVs should be called or not. The red, yellow and gray horizontal bars represent the reference, individual molecules and a contig, respectively. The small black and pink boxes represent nicking sites, the gray dotted lines represent their alignments, and the black dashed-dotted lines indicate the rough location of an SV break point. Figure 7. Flanking sequence analysis based on 10xG data. For each indel, flanking sequences on the reference were extracted and aligned to the contigs assembled from linked sequencing data. If both sequences were aligned to the same contig, the distance between them would be compared with the distance on the reference to evaluate whether the indel is confirmed by the 10xG data. The indel loci would not covered by 10xG contigs if the two flanking sequences were not aligned to the same contig. In the contig alignments, red horizontal bars show the reference, with the nicking sites marked in black vertical lines. Each yellow horizontal bar represents a contig, with the two aligned nicking site labels defining the SVs in blue, other aligned labels in pink, and unaligned labels in black. "I1" and "I2" are different insertion alleles, "D1" and "D2" are different deletion alleles, and "WT" is the wild type (reference).   Figure 16. Shared non-reference-aligned content with an hg38 alternative sequence and two primate sequences. In (A), non-aligned content (yellow) aligned well with in silico nicked chr6_GL000251v2_alt hg38 sequence, which was an alternative sequence of polymorphic MHC class II genes. The chr6_GL000251v2_alt sequence track indicated instances of sequence identity differences (white gaps) compared to the chromosome 6 hg38 reference sequence. Non-aligned content aligned with the alternative sequence, indicating polymorphism with the reference. In (B), in silico nicked Pan paniscus and Pan troglodytes chromosome 13 sequences aligned with non-reference-aligned content. Non-reference content may be contained within non-aligned segments of partiallyaligned maps in other genomes, allowing confirmation of maps which should be present in the reference. Hg38 chromosome 13 sequence contained two 50-kb N-gaps, while primate chromosome 13 maps (blue) contained labels in the gaps (red box). The NA19238 partial map, coupled with primate maps, which are highly identical to the hg38 non-gap sequence, indicated our non-reference content as localizing to hg38 chromosome 13 reference gaps.

Supplementary Figure 17. Comparison between lengths of reference gaps closed in genome map assemblies (y-axis) and their estimated lengths in the reference genome (x-axis).
Each point represents a gap, with size and color representing the number of contigs in which that gap was closed. The diagonal line depicts a linear regression (R 2 =0.3).

Supplementary Figure 18. Comparison between median lengths of reference gaps closed in 10X assemblies (y-axis) and genome map assemblies (x-axis).
Each point represents a gap, with the color representing the estimated length of the gap in the reference genome. The diagonal line depicts a linear regression (R 2 =0.56). Figure 19. Main transposable elements and other sequence repeats found in SVs. SVs were partitioned into sub groups based on their sequence content, when at least 50% of the sequence is of the same repeat class ('none' -SVs with repeat content < 50%, 'combined' -repeat content > 50% but no single class contribute 50% or more to the sequence content)