PopDel identifies medium-size deletions simultaneously in tens of thousands of genomes

Thousands of genomic structural variants (SVs) segregate in the human population and can impact phenotypic traits and diseases. Their identification in whole-genome sequence data of large cohorts is a major computational challenge. Most current approaches identify SVs in single genomes and afterwards merge the identified variants into a joint call set across many genomes. We describe the approach PopDel, which directly identifies deletions of about 500 to at least 10,000 bp in length in data of many genomes jointly, eliminating the need for subsequent variant merging. PopDel scales to tens of thousands of genomes as we demonstrate in evaluations on up to 49,962 genomes. We show that PopDel reliably reports common, rare and de novo deletions. On genomes with available high-confidence reference call sets PopDel shows excellent recall and precision. Genotype inheritance patterns in up to 6794 trios indicate that genotypes predicted by PopDel are more reliable than those of previous SV callers. Furthermore, PopDel’s running time is competitive with the fastest tested previous tools. The demonstrated scalability and accuracy of PopDel enables routine scans for deletions in large-scale sequencing studies.

PopDel's profile format is divided into windows of 256 bp. The start position of each window is given in column one and two. The information on read pairs that fall in the respective window is given in column three, divided by ','. The number before the ':' indicates the offset of the start position from the start of the window and the value after the ':' is the deviation from the median insert size µ. The start position of a pair refers to the rightmost aligned base of the forward read of a pair. Each read pair is only listed in the window that contains its start position.
Supplementary The maximum value of each row is highlighted in bold font.
Supplementary Table 4   The maximum value of each row is highlighted in bold font.
Supplementary Table 7  The minimum values of each row are highlighted in bold font.
Supplementary The minimum value of each row is highlighted in bold font.
Supplementary The median profile size is 0.9 GB, which is 1.78% of the median BAM file size of 53.1 GB.
The mean relative size of the profiles is 1.76%.

Supplementary Notes
Supplementary Note 1: Implementation details of PopDel

Binary profile format
The binary format, defined in Supplementary Table 9, stores insert sizes efficiently on disk.
One important factor is avoiding the suboptimal representation of big integers as strings of chars (assuming 8 bits per char, a number consisting of e.g. 7 digits, such as genomic   The first column describes the name of the field, whose content is described in the second column. Column 3 indicates the datatype and the size of the entry. Except for the magic string, which is used for identifying the profile format and version, no field has default value as indicated by column 4. A line saying "List of …(n=n_x)" indicates that the following block is repeated n times, potentially with different content.

Buffering and length limitations during profile creation
PopDel creates the profile by checking the reads of the BAM file one after the other. If a read is in the forward orientation and its corresponding reverse read maps at most max + base pairs downstream (with max as the maximum deletion length, which defaults to 10,000 bp, and as the median of the sample's insert sizes), the read is added to a buffer. If a read is in reverse orientation, the corresponding forward read is removed from the buffer and the pair's insert size is added to the profile. This buffering ensures that both reads fulfil the requirements listed in Supplementary Table 1 without jumping in the BAM file. The default value of 10,000 bp for max has been selected to keep the buffering to a minimum and because we argue that deletions above a size of 10,000 bp can be easily found using read depth methods. This value can be increased by the user.

Sampling regions for creating the insert size histogram
The sampling uses a set of 22 default regions, one 1,000,000 bp region from each autosome.  For the calculation of the likelihoods, we obtain:

Handling of multiple read groups in a sample
with ℛ as the set of read groups of sample S, Δ R is the set of insert size deviations of read group in the current window and as the prior (default 10 −4 ). We further have: The sample and genotype specific weights become read group and genotype specific weights: For the allele frequency we obtain: We update the deletion length estimate as: The means of all clusters are subsequently used as initial deletion lengths init . For each init an initial allele frequency is calculated as: where | | denotes the total number of samples.

Transformation of insert size histogram
The histogram abs , which has been created during the profiling step, stores absolute counts of observed insert sizes for sample . Let abs ( ) return the number of counts for an insert size . We compute a transformed histogram: trans ( ) ≔ abs ( + ) ⋅ + max( + µ , 0) Since PopDel subsequently only uses the deviation of insert sizes from the mean and not the insert sizes themselves, trans takes the deviation of the insert size as an argument.
The probability density function for the distribution is then calculated as follows: In the cases where the sampled distribution would return zero a pseudo-count of default As we expect only small biases, we limit the maximum value for to the standard deviation of the sample's insert size distribution. Larger values may originate from homozygous deletions and interfere with the size estimations. Therefore, if becomes greater than , it is set to 0.
Setting to instead, can distort the predicted length of homozygous deletions by this value.

Termination of iterative parameter estimation
The iteration of PopDel's window-wise calling stops if at least one of the following conditions is met: The two latter cases result in no call being made for the respective window for that specific deletion. In case two pairs of estimates for frequency and deletion length alternate during the iteration, PopDel assumes the pair yielding the higher likelihood ratio to be the correct one.

Deletion start position estimation
During

Conditions for the combination of consecutive windows
Let be the first of windows for which the null hypothesis can be rejected and let + , 0 < < be another such window. Let and + be the deletion length estimates for and + respectively. Let ∶= (delStart( ), delEnd( )) denote the range spanned by the leftmost and rightmost alignment positions of the read pairs whose insert sizes support and let + be defined analogously. When considering one window after the other, we test if the following conditions for and + are met: where | | denotes the length of the spanned range. If a) and c) are true, the deletion end position in the combined window is updated to match that of + .

Additional filters during joint calling
PopDel applies additional filters on the final variant calls to improve the precision of the results.
The first filter is applied during the iterative parameter estimation. If in a window a sample (over all read groups) has a coverage below two, no deletion will be initialized from the data of this sample and PopDel will not attempt to genotype the sample later on in the respective window.
We assume that this window is unreliable if more than 90% of the samples in the window are lacking coverage and filter the window accordingly. In the case of too high coverage (default at least 100 read pairs of a single read group overlapping one window), the read pairs of this read group are ignored in all calculations. Samples consisting solely of read groups with high coverage in one window will also not be assigned genotypes for the respective window.
Another filter, the relative window coverage, is the value obtained by dividing the product of window length and number of combined windows by the estimated deletion length. By default, PopDel requires Lastly, the variant is filtered if all samples are genotyped as non-carriers.

Supplementary Note 2: Evaluation on simulated and real data
To assess the performance of PopDel, Delly 1 , Lumpy 2 , Manta 3 and GRIDSS 4 the tools were compared on the different simulated and real data sets described in the Methods. A more detailed description of the setup of the callers and the evaluation is described in the following sections.

Simulated data with random deletions
The simulation of random variants was performed as described in the Methods. Scripts for the reproduction of the workflow can be found in the GitHub repository (see URL's of the Main text). The setup of the callers is described in the following sections.

Running PopDel on simulated data with random deletions
The PopDel workflow for all tests consisted of the following steps 1. Apply popdel profile on each sample individually 2. Apply popdel call jointly on all profiles The option -r chr21 for limiting the processing to chromosome 21 was applied. No further filtering of the calls was performed.

Running Delly on simulated data with random deletions
Delly was applied jointly on all simulated samples using default parameters, except for the -n option for disabling the small indel realignment. All non-deletion variants were removed prior to evaluation. No filtering for "QUAL==PASS" was applied, because doing so had a negative impact on the performance of Delly on the simulated data. 4. Joining the calls of all samples using smoove paste.

Running Lumpy on simulated data with random deletions
All non-deletion variants were removed prior to evaluation. No further filtering was performed.

Running Manta on simulated data with random deletions
Manta's workflow consisted of two steps: 1. Configuration via configManta.py for all bam files together (--region chr21) 2. Joint calling via runWorklow.py with -m local -j 1 (local machine and one thread) All non-deletion variants were removed prior to evaluation. No filtering for "QUAL==PASS" was performed, because doing so had a negative impact on the performance of Manta on the simulated data. Because the additional variants in the "candidate" file had a very high false positive rate, they were not added to the final output.

Running GRIDSS on simulated data with random deletions
GRIDSS was applied on all samples jointly with an 8 GB heap size for the JVM and one thread.

Alternative evaluation with genotype consideration
In addition to the above approach, we applied an alternative evaluation metric that also takes

Assessment of running time and memory consumption
Resource consumption of all tools for all tests on simulated data was measured using /usr/bin/time. CPU time was calculated as the sum of CPU seconds the process spent in user mode (%U) and the number of CPU seconds the process spent in kernel mode (%S).
Memory consumption was measured as maximum resident set size of the process during its lifetime (%M). Wall clock time was measure as the elapsed time (%e).

Tradeoff between running time and memory consumption for PopDel
PopDel offers the option for individually setting the maximum number of buffered windows. This

Simulated 1000 Genomes Project data
The simulation of variants taken from the 1000 Genomes Project 6 was performed as described

Mapping of real data
Reads of NA12878 were mapped to GRCh38 and reads of HG002 and his parents to GRCh37 using bwa mem with its '-M' and '-R' options. SAM to BAM conversion, sorting and indexing was performed using Samtools 7 . Duplicates were marked using Picard tools 8 .  Figure 11

Evaluation on the Polaris Diversity and Kids cohort
We applied PopDel, Delly and Lumpy on all 150 samples of the Polaris Diversity cohort and 49 trios of the Polaris Kids cohort as described in the following sections. As we could not produce a workflow for Manta that finished on that many whole genomes, we do not provide a workflow here.

Running PopDel on the Polaris Diversity and Kids cohort
The same default approach as described for NA12878 in section 2.3.3 was applied.
Chromosomes 1 to 22 and the X chromosome were considered for the Diversity cohort and chromosomes 1 to 22 were considered for the Kids cohort. This made the resulting BCF files impossible to sort and index with BCFtools, which was necessary for merging and applying the germline filter.

Running Lumpy on the Polaris Diversity and Kids cohort
The same default approach as for the simulated data sets (section 2.1.3) was applied, except that chromosomes 1 to 22 and the X chromosome were considered for the Diversity cohort and chromosomes 1 to 22 for the Kids cohort. Further, all non-deletion variants were removed from the call sets after single-sample calling with smoove call.

Principal Component Analysis (PCA)
The variants were converted into a sample by variant count matrix M.

Genotype quality filtering for Mendelian error rate and transmission rate
For assessing the effect of filtering on the Mendelian inheritance error rate, we applied an increasing genotype quality (GQ) threshold. Further, all deletions that were not in Hardy-Weinberg equilibrium (P value threshold 0.01) were removed. All genotypes below the required GQ were excluded. Because most tools calculate the GQ in a different way, we determined at which point the Mendelian inheritance error rate of the tools dropped below 0.3% to calibrate the filter for the tools in a fair fashion. The GQ values for the tools were 26 (PopDel), 28 (Delly) and 78 (Lumpy). Those filter settings were also used during the calculation of the transmission rates (Table 1).

Transmission rate pattern observed for Lumpy
The deletion genotypes predicted by Lumpy in the Polaris Kids cohort do not reach the expected transmission rate of 50% in our analysis ( Figure 5) but stay below 48.3% independent of the genotype quality threshold we apply. A transmission rate below 50% can be explained by two scenarios: 1. False negative calls in the child when a heterozygous genotype was correctly called in one parent.
2. False positive calls in one of the parents when the child was correctly determined as a non-carrier.
We think that false positive calls in one of the parents are more likely than genotyping errors in the child when one parent is correctly called heterozygous.

Evaluation on Icelandic data
We evaluated PopDel on different datasets of deCODE Genetics at several stages of PopDel's development.

Transmission and Mendelian inheritance error on Icelandic data
PopDel v1.0.6 was applied on WGS data of 6,794 Icelandic trios. For all variants with a genotype quality threshold above 26, the Mendelian inheritance error rate and transmission rate were calculated considering only variants occurring in a single family with one parent being heterozygous for the variant and the other parent carrying only the reference allele.