Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios

Building a population-specific catalogue of single nucleotide variants (SNVs), indels and structural variants (SVs) with frequencies, termed a national pan-genome, is critical for further advancing clinical and public health genetics in large cohorts. Here we report a Danish pan-genome obtained from sequencing 10 trios to high depth (50 × ). We report 536k novel SNVs and 283k novel short indels from mapping approaches and develop a population-wide de novo assembly approach to identify 132k novel indels larger than 10 nucleotides with low false discovery rates. We identify a higher proportion of indels and SVs than previous efforts showing the merits of high coverage and de novo assembly approaches. In addition, we use trio information to identify de novo mutations and use a probabilistic method to provide direct estimates of 1.27e−8 and 1.5e−9 per nucleotide per generation for SNVs and indels, respectively.


Supplementary Figure 3: Allele balance in high confidence heterozygous variants
Supplementary Figure 3. The allele balance in children (n=10) with one parent that is homozygous for the reference allele and one parent that is homozygous for the alternative allele. Furthermore all individuals pass the quality filter used when calling de novo variants, except the allele balance filter. 98.0% of the variants are in the interval between 30% and 70%.  (n=30) showing N10-N90 of the raw contigs, scaffolds and scaftigs. Raw contig refers to the unambiguous path in the de Bruijn graph before scaffolding; Scaftig: contigs that are cut out from the scaffolds at Ns. N50: after sorting the raw contig/scaffold/scaftig based on the length of raw contig/scaffold/scaftig respectively, N50 refers to the minimum length that the assembled sequences with the length above that comprise 50% of the contig/scaffold/scaftig. The error bars indicataes standard deviation (s.d.).   Mapping of novel sequences identified in the parents (n=20) against other human and primate genome sequence. x-axis: Total length of the novel sequences (>=100bp) in the individual de novo assemblies that are not present in the NCBI b37 (upper panel) and NCBI b37 plus decoy sequence (lower panel); y-axis: Fraction of the novel sequences that are unambiguously aligned to different human and primate DNA resources with > 95% identity and 95% aligned ratio. If one novel sequence can be aligned to multiple resources the priority is: HuRef, NA12878 and YH and african2 over primate sequences: chimpanzee (panTro4), gorilla (gorGor3), orangutan (ponAbe2) and macaque (rheMac3)

Supplementary Figure 9: Genomic distribution of non-ref sequence
Supplementary Figure 9. The distribution, mechanism, ancestral state, functional annotation of the 1.2M novel sequence that can be localized to NCBI b37.

Supplementary Figure 10. SoapAsmVar pipeline
Supplementary Figure 10. SoapAsmVar pipeline to discover and genotype structural variants from de novo assemblies. SVD module: A pair of bars connected by dashed line represent an alignment block; Solid black line indicates different alignment blocks; Red bar indicates reference sequence; Green bar indicates assembly sequence which is unambiguously aligned to the reference (misalignment probability <0.01); Gray bar indicates assembly sequence that cannot be unambiguously aligned to the reference. AGE realignment module: Fresh green bar indicates 500bp conservative sequence around the variant breakpoints. SV verified module: Top shows the short-read vs reference alignment; Bottom shows the short-read vs assembly alignment. Genotype module: A Gaussian mixture model with linear constraints is fitted using the PEP ratio, which is the proportion of proper aligned reads around the variant loci compared with the expected number of proper aligned reads; Solid bar: real data; Dashed line: the Gaussian distribution of the three genotype components in the Gaussian mixture model. De-duplicate module: Intensity of the color represents allele frequency where darker represents higher allele frequency. Whenever there are multiple different non-reference alleles observed within 50bp, the alleles with the highest frequency with lowest gap ratio in the 200bp window around the variant region will be selected. See methods for details.

Supplementary Figure 11: Concordance between GATK and SoapAsmVar
Supplementary Figure 11. Left: Number of indel calls binned by length by GATK haplotype caller (blue), SoapAsmVar (green) and overlapping calls determined using 50% reciprocal overlap (magenta). Average number of calls per individual is 795k and 1140k for GATK and SoapAsmVar, respectively. Right: Fraction of call-set that are also called in the other call-set binned by length. GATK indels also called by SoapAsmVar (blue) and SoapAsmVar indels also called by GATK (green). The concordance in genotype in overlapping calls is shown in orange.

Supplementary Figure 12: Callability given sequencing depth
Supplementary Figure 12. The callability (ie. the probability that a true variant passes the conservative quality filters) given the sequencing depth. DP: read depth. The callability of heterozygous variants was calculated based on 3.2 million trio-variant combinations where one parent of the trio was homozygous for the reference allele and the other parent was homozygous for the alternative allele. The callability of homozygous variants was calculated based on 27 million trio-variant combinations where both parents of the trio were homozygous for the reference allele.  We look at three criteria: i) the inferred mutation rate, ii) the fraction of mutations that are in dbSNP and iii) the fraction of mutations that are transitions. All three criteria are stable once the minGQ cutoff is higher than 50. Using the most conservative cutoff (minGQ=90) we find 365 de novo mutations and with the least conservative cutoff (minGQ=10) we find 606 de novo mutations. The error bars represent standard errors.