FastGT: from raw sequence reads to 30 million genotypes in less than an hour

We have developed a computational method that counts the frequencies of unique k-mers in FASTQ-formatted genome data and uses this information to infer the genotypes of known variants. FastGT can detect the variants in a 30x genome in less than 1 hour using ordinary low-cost server hardware. The overall concordance with the genotypes of two Illumina “Platinum” genomes is 99.96%, and the concordance with the genotypes of the Illumina HumanOmniExpress microarray is 99.82%. Our method provides a k-mer database that can be used for the simultaneous genotyping of approximately 30 million single nucleotide variants (SNVs), including >23,000 SNVs from the Y chromosome.

bi-allelic SNVs remained usable by FastGT. We also used a subset of autosomal SNV markers present

The accuracy of FastGT genotype calls was analyzed by comparing the results to genotypes reported in their likelihoods can be shown in gmer_caller optional output.

We also compared the genotypes obtained by the FastGT method with the data from the Illumina HumanOmniExpress microarray. We used 504,173 autosomal markers that overlap our whole-genome dataset (Table S1), and the comparison included ten individuals from the Estonian Genome Center for whom both microarray data and Illumina NGS data were available.

In these 10 individuals, the concordance between the genotypes from the FastGT method and microarray genotypes was 99.82% (Table 2), and the concordance of non-reference alleles was 99.69%. The fraction of mono-allelic and tri-allelic genotypes (no-call genotypes) in 10 test individuals is rather low (<0.01% of all markers), indicating that our conservative filtering procedure is able to remove most of the error-prone SNVs.
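As an illustration of how such concordance figures can be computed, the following minimal sketch compares two sets of genotype calls keyed by marker ID. The function name, the `AA`/`AB`/`BB` genotype coding, and the choice to define non-reference concordance over markers whose comparison genotype is not `AA` are assumptions made for this example; they are not taken from the FastGT code.

```python
def concordance(calls_a, calls_b, ref_genotype="AA"):
    """Return (overall, non_reference) concordance for markers present in both call sets.

    calls_a, calls_b: dicts mapping marker ID -> genotype string (e.g. "AA", "AB", "BB").
    Non-reference concordance is computed over markers whose genotype in calls_b
    is not the reference homozygote (an assumed definition).
    """
    shared = [m for m in calls_a if m in calls_b]
    matches = sum(calls_a[m] == calls_b[m] for m in shared)
    non_ref = [m for m in shared if calls_b[m] != ref_genotype]
    non_ref_matches = sum(calls_a[m] == calls_b[m] for m in non_ref)
    overall = matches / len(shared) if shared else float("nan")
    non_ref_rate = non_ref_matches / len(non_ref) if non_ref else float("nan")
    return overall, non_ref_rate


# Toy data with hypothetical marker IDs (not real results).
kmer_calls = {"rs1": "AA", "rs2": "AB", "rs3": "BB", "rs4": "AA"}
array_calls = {"rs1": "AA", "rs2": "AB", "rs3": "AB", "rs4": "AA"}
overall, non_ref = concordance(kmer_calls, array_calls)
```

In this toy input, three of four shared markers agree, and one of the two non-reference markers agrees.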

FastGT is able to call genotypes from the Y chromosome (chrY) for 23,832 markers that remain in the whole-genome dataset after all filtering steps. The genotypes on chrY cannot be directly compared with the Platinum genotypes because chrY calls were not provided in the VCF file of the Platinum individuals. To assess the performance of chrY genotyping, we compared our results to the genotypes haploid genotype calls of FastGT and the genotype calls in these VCF files was 99.97%. The fraction of not occur within one generation. Only one marker (rs199503278) showed conflicting genotypes in any of these father-son pairs. A visual inspection revealed problems with the reference genome assembly in this region, which resulted in conflicting k-mer counts and conflicting genotypes from different k-mer pairs of the same SNV. This marker was removed from the dataset because it had a high likelihood of causing similar problems in other individuals.
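The father-son consistency check described above can be sketched as a comparison of haploid chrY calls between two related individuals. The function name, marker IDs, and data layout below are hypothetical and serve only to illustrate the idea.

```python
def discordant_chry_markers(father, son):
    """Return sorted marker IDs called in both individuals whose haploid alleles differ.

    father, son: dicts mapping marker ID -> single-allele call (e.g. "A").
    """
    shared = father.keys() & son.keys()
    return sorted(m for m in shared if father[m] != son[m])


# Toy calls with hypothetical marker IDs (not real data).
father_calls = {"m1": "A", "m2": "G", "m3": "T"}
son_calls = {"m1": "A", "m2": "C", "m4": "T"}
conflicts = discordant_chry_markers(father_calls, son_calls)
```

Markers that are reported as conflicting in such a check are candidates for visual inspection and possible removal from the marker set.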

Effect of genome coverage on FastGT performance

We also studied how the genome sequencing depth affects the performance of FastGT. The Platinum genomes have a coverage depth of approximately 50x, but in most study scenarios, sequencing to a subsets of FASTQ sequences from the Platinum individual NA12878 and measured the concordance between called genotypes and genotypes from the Platinum dataset. We observed that the concordance rate of non-reference genotypes (AB and BB) declines significantly as the coverage drops below 20x (Figure 3).
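One simple way to simulate lower coverage is to keep each read with a fixed probability. The sketch below is an assumption about how such a subset could be drawn, not a description of the procedure used in the study; it also assumes the common four-lines-per-record FASTQ layout.

```python
import random


def subsample_fastq(lines, fraction, seed=0):
    """Keep each 4-line FASTQ record with probability `fraction`.

    lines: iterable of text lines from a FASTQ file (4 lines per record assumed).
    Returns the kept lines as a list; a trailing incomplete record is dropped.
    """
    rng = random.Random(seed)
    kept, record = [], []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            if rng.random() < fraction:
                kept.extend(record)
            record = []
    return kept


# Toy input: three records with made-up reads.
toy = []
for i, seq in enumerate(["ACGT", "GGCA", "TTAG"]):
    toy += [f"@read{i}", seq, "+", "IIII"]

half = subsample_fastq(toy, 0.5, seed=1)
```

For example, reducing a 50x data set to roughly 20x would correspond to `fraction=0.4`. If two paired-end files contain the same reads in the same order, using the same seed for both keeps the mates in sync.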

Time and memory usage

The entire process of detecting 30 million SNV genotypes from the sequencing data of a single individual (30x coverage, 2 FASTQ files, 115 GB each) takes approximately 40 minutes on a server with 32 CPU cores. Most of this time is allocated to counting k-mer frequencies by gmer_counter.

The running time of gmer_counter is proportional to the size of the FASTQ files because the speed-limiting step of gmer_counter is reading the sequence data from a FASTQ file. However, the

The minimum amount of required RAM is determined by the size of the data structure stored in memory by gmer_counter. We have tested gmer_counter on a Linux computer with 8 GB of relevant k-mers avoids the so-called "curse of deep sequencing," in which a higher coverage genome can overwhelm the memory or disk requirements of the software [30]. The disk and memory requirements of FastGT are not directly affected by the coverage or the amount of sequencing data.
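The reason memory use does not grow with coverage can be illustrated with a minimal sketch that counts only k-mers from a predefined set. This is not the gmer_counter implementation; the function name is hypothetical, and the sketch ignores reverse-complement handling and any compact data structures a real tool would use.

```python
def count_known_kmers(reads, kmers, k):
    """Count occurrences of a fixed set of k-mers in a stream of reads.

    The counter has exactly one entry per k-mer in `kmers`, so its size depends
    on the k-mer list and not on how many reads are processed.
    """
    counts = dict.fromkeys(kmers, 0)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return counts


# Toy reads and a toy k-mer list (k = 3).
toy_reads = ["ACGTAC", "CGTACG"]
toy_counts = count_known_kmers(toy_reads, {"ACG", "TAC", "GGG"}, 3)
```

Processing more reads increases the stored counts but never adds entries to the table, which is the property the paragraph above describes.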

Our analysis focuses on genotyping SNVs. However, FastGT is not limited to identifying SNVs. Any known variant that can be associated with a unique and variant-specific k-mer can be detected with FastGT. For example, short indels could be easily detected by using pairs of indel-specific k-mers. In principle, large indels, pseudogene insertions, polymorphic Alu-elements, and other structural variants could also be detected by k-mer pairs designed over the breakpoints. However, the detection of structural variants relies on the assumption that these variants are stable in the genome and have the same breakpoint sequences in all individuals, which is not always true for large structural variants. The applicability of FastGT for detecting structural variants requires further investigation and testing.
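To make the idea of variant-specific k-mer pairs concrete, the sketch below derives a reference k-mer and an alternative k-mer for an SNV or a short indel from a reference string. The function, its parameters, and the toy deletion are hypothetical; actual database construction involves additional filtering that is not shown here.

```python
def kmer_pair(ref_seq, pos, ref_allele, alt_allele, k, left_flank):
    """Return (ref_kmer, alt_kmer), each of length k, for a variant at 0-based `pos`.

    Both k-mers start `left_flank` bases upstream of the variant, so they share
    a prefix and differ from the variant position onward.
    """
    if ref_seq[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("ref_allele does not match ref_seq at pos")
    start = pos - left_flank
    if not 0 <= left_flank < k or start < 0:
        raise ValueError("k-mer window must start inside the sequence and span pos")
    # Build the alternative haplotype by substituting the allele.
    alt_seq = ref_seq[:pos] + alt_allele + ref_seq[pos + len(ref_allele):]
    ref_kmer = ref_seq[start:start + k]
    alt_kmer = alt_seq[start:start + k]
    if len(ref_kmer) < k or len(alt_kmer) < k or ref_kmer == alt_kmer:
        raise ValueError("window does not yield two distinct k-mers of length k")
    return ref_kmer, alt_kmer


# Toy example: a hypothetical 1-bp deletion and a hypothetical SNV.
toy_ref = "TAGGCAACGTTAG"
deletion_pair = kmer_pair(toy_ref, 5, "A", "", 7, 3)
snv_pair = kmer_pair(toy_ref, 6, "A", "G", 7, 3)
```

For an indel, the alternative k-mer spans the breakpoint, which is why the approach depends on the breakpoint sequence being the same in all carriers.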

This software has only been used with Illumina sequencing data, which raises the question of whether

[Flowchart: k-mer database filtering]
Are there any k-mers covering this SNV, but not containing any other known SNVs or indels? NO: Remove SNV from the database.
Step 2. Test the uniqueness of all k-mer pairs for a given SNV in the expanded reference genome. Remove k-mer pairs if at least one k-mer in the pair is not unique. Keep up to 3 k-mer pairs.
gmer_caller uses only one pair, which is the pair with a total k-mer frequency count that is closest to the median k-mer frequency in a given individual.
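The uniqueness test in Step 2 can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the function name is hypothetical, the genome is a short single-strand string rather than an expanded reference, and a k-mer is treated as unique if it occurs at most once in that string.

```python
from collections import Counter


def filter_unique_pairs(genome, pairs, k, max_pairs=3):
    """Keep k-mer pairs in which neither k-mer occurs more than once in `genome`.

    pairs: list of (kmer_a, kmer_b) tuples for one SNV.
    Returns at most `max_pairs` pairs, in their original order.
    """
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    kept = [p for p in pairs if all(counts[kmer] <= 1 for kmer in p)]
    return kept[:max_pairs]


# Toy genome in which "ACGT" occurs twice, so any pair containing it is dropped.
toy_genome = "ACGTACGTTT"
toy_pairs = [("ACGT", "ACGA"), ("CGTT", "CGAT")]
usable = filter_unique_pairs(toy_genome, toy_pairs, 4)
```

A production version would also need to consider both strands and sequences that arise from other known variants, which this sketch leaves out.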

[Figure panel annotations]
Reference sequence: ...TAGGCAACGTTAG...
For each SNV (shown in green), three k-mer pairs located as far away from each other as possible are selected.
For rare mutations in the neighborhood of the SNV (shown in red), certain k-mer pairs show abnormal frequencies. In this situation, at least one k-mer pair should still be usable.
The k-mer pair with the total frequency closest to the median total frequency of all k-mer pairs in the entire genome is selected for genotype calling.
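The selection rule described above can be sketched in a few lines. The function name and the toy counts are assumptions for illustration; the sketch shows only the "closest to the median" choice and not the genotype call itself.

```python
from statistics import median


def select_pair(pair_counts, median_total):
    """Return the index of the k-mer pair whose total count is closest to `median_total`.

    pair_counts: list of (count_allele_a, count_allele_b) tuples for one SNV.
    Ties are resolved in favor of the earliest pair.
    """
    return min(
        range(len(pair_counts)),
        key=lambda i: abs(sum(pair_counts[i]) - median_total),
    )


# Hypothetical genome-wide pair totals and three pairs for one SNV.
genome_median = median([28, 30, 31, 29, 33])
snv_pairs = [(14, 15), (2, 3), (31, 30)]
best = select_pair(snv_pairs, genome_median)
```

Here the second pair has an unusually low total and the third an unusually high one, as might happen when a nearby rare mutation or a repeated sequence disturbs the counts, so the first pair is chosen.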