Extended Data Figure 1: UK10K-cohorts, sequence and sample quality and variation metrics.

From: The UK10K project identifies rare variants in health and disease


a–e, Sample quality metrics for UK10K-cohorts (n = 3,781), where n = 1–1,927 corresponds to ALSPAC and n = 1,928–3,781 to TwinsUK. This sample includes all individuals passing sample quality control, including related pairs and non-European individuals that were later removed from association tests. A subset of 3,621 individuals was included in association analyses. Samples sequenced at BGI are coloured in blue and samples sequenced at Sanger are coloured in grey. a, Number of singletons (AC = 1) by sample (×10³). b, Number of INDELs by sample (×10⁵). c, Read depth (sequence coverage) by sample. d, Ratio of heterozygous to homozygous non-reference (= homozygous alternative) SNV genotypes (mean for females = 1.54, mean for males = 1.47). e, Transition to transversion ratio (Ts/Tv) by sample. f–i, Sequence variation metrics for UK10K-cohorts. f, Types of substitution (×10⁶). g, Number of SNVs (×10⁶), INDELs (×10⁵) and large deletions (×10³) by non-overlapping non-reference allele frequency (AF) bins. h, Size distribution of INDELs. Negative INDEL lengths represent deletions and positive INDEL lengths represent insertions. i, Large deletion size distribution in unequal bin sizes, where the smallest deletions were 200 bp to 1 kb long and the largest deletions 100 kb to 1 Mb. In total, 18,739 deletions were called with GenomeSTRiP¹⁴. The average deletion size was ~13 kb and the median size was ~3.7 kb. j, Total number of SNVs and INDELs by AF bin (based on 3,781 samples); multi-allelic variants are treated as separate variants. k, Sequence quality and variation metrics for UK10K-cohorts. For 61 overlapping TwinsUK individuals we compared the variant sites and genotypes of the low-coverage sequences with high-coverage exome data by non-overlapping AF bins (WGS versus Exomes). We considered 74,621 shared sites in non-overlapping AF bins.
We calculated the fraction of concordant over total sites, the number of non-reference genotypes and the non-reference genotype discordance (NRD, in %) between WGS and Exomes; the false discovery rate (FDR = FP/(FP + TP); TP, true positive; FP, false positive), where we consider the exomes as the truth set; the number of false positives (FP) and the FDR for sites that are or are not shared with the 1000 Genomes Project, phase I (1000GP); and the false negative rate (FNR = FN/(FN + TP); FN, false negative; TP, true positive), where AF bins were defined based on the 61 exomes. Furthermore, we compared 22 monozygotic twin pairs at 880,280 bi-allelic SNV sites on chromosome 20, reporting the percentage of concordant genotypes, the number of non-reference genotypes and the NRD. AFs are from the set of 3,621 samples, which contains at most one of the two monozygotic twins from each pair. We note that discrepancies can be caused by errors in either twin, so the expected NRD relative to the truth would be half the NRD value given.
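
As an illustration of the metrics reported in panel k, the following is a minimal Python sketch. The FDR and FNR functions follow the formulas given in the legend. The legend does not spell out the NRD formula, so the sketch assumes the commonly used definition: discordant genotype pairs divided by all pairs in which at least one call is non-reference. The function names and the 0/1/2 alternate-allele-count genotype encoding are hypothetical choices made for this example, not taken from the study's pipeline.

```python
from collections import Counter


def non_reference_discordance(truth, test):
    """Return NRD in percent for two equal-length genotype lists.

    Genotypes are coded as alternate-allele counts (0, 1 or 2); this
    encoding is an assumption of the sketch. Pairs where both calls are
    homozygous reference (0, 0) are excluded from the denominator.
    """
    if len(truth) != len(test):
        raise ValueError("genotype lists must have the same length")
    pairs = Counter(zip(truth, test))
    # Concordant non-reference calls: both heterozygous or both hom-alt.
    concordant_nonref = pairs[(1, 1)] + pairs[(2, 2)]
    # Any pair of differing calls counts as discordant.
    discordant = sum(n for (t, c), n in pairs.items() if t != c)
    denominator = concordant_nonref + discordant
    if denominator == 0:
        return float("nan")
    return 100.0 * discordant / denominator


def false_discovery_rate(fp, tp):
    """FDR = FP / (FP + TP), as defined in the legend."""
    return fp / (fp + tp) if (fp + tp) else float("nan")


def false_negative_rate(fn, tp):
    """FNR = FN / (FN + TP), as defined in the legend."""
    return fn / (fn + tp) if (fn + tp) else float("nan")


# Toy data, invented for illustration only.
exome_calls = [0, 0, 1, 1, 2, 2, 1, 0]
wgs_calls = [0, 1, 1, 1, 2, 1, 0, 0]
nrd_percent = non_reference_discordance(exome_calls, wgs_calls)
```

For the monozygotic twin comparison, where either twin may carry the error, the legend's reasoning corresponds to halving the returned value (for example `nrd_percent / 2`) to estimate the expected NRD relative to the truth.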