AIVariant: a deep learning-based somatic variant detector for highly contaminated tumor samples

The detection of somatic DNA variants in tumor samples with low tumor purity or sequencing depth remains a daunting challenge despite numerous attempts to address this problem. In this study, we constructed a substantially extended set of actual positive variants originating from a wide range of tumor purities and sequencing depths, as well as actual negative variants derived from sequencer-specific sequencing errors. A deep learning model named AIVariant, trained on this extended dataset, outperforms previously reported methods when tested under various tumor purities and sequencing depths, especially low tumor purity and sequencing depth.


AIVariant preprocessing
The inputs for preprocessing were aligned sequencing read BAM files of paired tumor and normal samples.As the initial preprocessing step, the 'Picard MarkDuplicates (v2.18.1)' was applied to the input BAM files.Low-quality reads, including reads that were unmapped, PCR or optically duplicated, had secondary alignments, and had a mapping quality of 0 were then filtered out.The final preprocessing step was the base quality normalization, in which the base quality distribution of the BAM file was made to follow the reference base quality distribution (Supplementary Fig. 5a).To remove potential base quality bias derived from specific research group (Supplementary Fig. 5b and c), AIVariant uses normalized base qualities in AIVariant somatic variant detection.This was achieved by randomly scanning reads, totaling ten million bases, in the BAM file to obtain the base quality distribution of the sequencing data of the target sample.The base quality of each read in the BAM file was then converted to exactly match the reference base quality distribution.To do so, the reference base quality distribution and the target sample base quality distribution were first sorted by base quality.The highest entry in the target sample's base quality distribution was then assigned the value of the highest base quality in the reference distribution, followed by the next highest base quality, until the target sample distribution was aligned with the reference distribution.The resultant correspondence was used to convert the base quality of all reads in the target sample.The reference base quality distribution utilized for base quality normalization was derived by averaging the base quality distribution of 81 ICGC tumor BAM files (Supplementary Fig. 5c), and sorted by base quality.

AIVariant candidate search
A candidate position should satisfy the following: , and see AIVariant inputs for the definition of LODn.
For train eWGS, we applied a slightly different condition of ℎ  () ≥ 10, since an insufficient number of reads can lead to unstable training.

AIVariant inputs
AIVariant inputs consisted of two types of three-dimensional matrices, AIVariant images and AIVariant stat images, as well as four log odds (LODs) values.
The three-dimensional matrices, hereafter images, were constructed from the read collections.
AIVariant has four types of read collections: the raw tumor read, raw normal read, processed tumor read, and processed normal read collections.For a given candidate position, we collected reads aligned to its 50 bases flanking region.Reads collected from the tumor sequencing data were called the raw tumor read collection, while reads from normal sequencing data were called the raw normal read collection.The processed tumor/normal read collection was a subcollection of tumor/raw normal read collections that held high-quality reads that satisfied specific criteria.The processed normal read collection consisted of reads in the raw normal read collection that satisfied the following criteria: 1) a read did not have a deletion at the candidate position; 2) a read was aligned to the candidate position with a base quality ≥ 5.The processed tumor read collection consisted of reads in the raw tumor read collection that satisfied the above criteria and also satisfied the following: 1) a read had mapping quality > 0; 2) a ratio of soft-clipped length over read length was < 0.3; 3) the sum of base qualities for mismatches was < 100.
An AIVariant image was constructed from each read collection; therefore, we constructed four AIVariant images for a given candidate position.Each collection was sorted according to the tuple key consisting of Boolean allele, mapping quality, and base quality at the candidate position, in this order, where the Boolean allele was 1 if the aligned read base at the candidate position was the same as "max alternate allele" and 0 otherwise.We used only the first 100 reads from each sorted collection to construct an AIVariant image, i.e., a three-dimensional matrix with dimensions of 100 × 101 × 16.One hundred rows represented the first 100 reads from a sorted collection, and if a collection had less than 100 reads, then the remaining rows were padded with zeroes.A total of 101 columns represented alignment columns, which were An AIVariant stat image was constructed from each processed tumor/normal read collection.
For a given candidate position, we generated two AIVariant stat images of dimension 5 × 101 × 9, and an AIVariant stat image was constructed from the whole reads of a read collection, different from AIVariant images, which were constructed from only 100 reads in a read collection.Five rows represented A/C/G/T/-bases and "-" was a gap base.As in the AIVariant images, 101 columns represented 101 alignment columns.Nine channels represented statistical values relevant to the somatic variants.The first channel held reference base information; for instance, if an alignment column represented the actual reference position, i.e., not a gap position, then a row representing this reference base took the value of 1 and the other rows in this column and the channel took a value of 0. If an alignment column represented a gap base, there was no reference base information for this column so only the row representing "-", a gap base, took a value of 1 and other rows held a value of 0. The second channel was for the read base frequency; we counted the number of occurrences of A/C/G/T/-bases at this column position in aligned reads, where the gap base represents deletion in aligned reads, and divided each count by the number of reads that covered this column position.The remaining channels represented the averages of mapping quality, original base quality, normalized base quality, read strand, and distance to read ends, as well as epigenetic features.Averages of base quality and mapping quality were taken over reads covering a column position, and averages of other features were taken over each A/C/G/T/-base.
The LOD values are the likelihood values introduced by Mutect 3 .In AIVariant, we calculated a simplified version of Mutect LOD values to capture features relevant to somatic variants.We calculated four LOD values, namely LODt, LODn, unnormalized LODt, and unnormalized LODn, from the processed tumor/normal read collections.Similar to AIVariant stat images, LOD values were calculated using the whole reads in a collection.LODt and LODn were calculated using normalized base qualities, and unnormalized LODt and unnormalized LODn were calculated using the original base qualities.For a given candidate position, () ≔  ∈  10 (; ) ,  is a set of read bases at the candidate position,  is a likelihood , where  ≔ 10 −()/10 , where () is a base quality for , so  is an error probability for a base ,  is a reference base at the candidate position, and  is a max alternate allele at the candidate position.Finally,  ≔ (   ) − (0) , where   is the frequency of max alternate allele and  ≔ (0) − (0.5) . is a simple measure of the log likelihood of variants supported by tumor sequencing data, and  is a simple measure of the log likelihood of non-variants supported by normal data.LODt values were min-max normalized with a maximum of 100, and LODn values were min-max normalized with a maximum of 50.

AIVariant architecture
The AIVariant deep neural network model consisted of two types of encoders and a merge layer followed by fully connected layers.The model input consisted of four AIVariant images, raw tumor, processed tumor, raw normal, and processed normal images; two AIVariant stat images, processed tumor and processed normal stat images; and four LOD values, LODt, LODn, unnormalized LODt, and unnormalized LODn.Both AIVariant encoders had the same residual network architecture, except for the input shape.One type of encoder encoded each AIVariant image into a vector of size 16, and another encoded each AIVariant stat image into a vector of size 16.Each LOD pair, LODt-LODn pair, and unnormalized LODt-unnormalized LODn pair were combined and extended to a vector of size 16 through a fully connected layer.The resulting eight vectors of size 16 were merged to obtain a vector of size 128 that went through fully connected layers.Finally, the merged vector was activated with the softmax function to output the posterior probability of the somatic variant at a given position.Overall, the AIVariant is a residual network model with multiple three-dimensional matrix inputs with approximately three million trainable parameters.

AIVariant training
For AIVariant training, we generated images from the train eWGS.As described above, four AIVariant images, two AIVariant stat images, and four LOD values were generated for each candidate position (see AIVariant inputs).The candidate positions for training were identified using the candidate search step, as previously described (see AIVariant candidate search).
To ensure stability in the AIVariant training, we balanced the number of actual positive(AP) variants and actual negative(AN) variants from the ten different types (9 cancer types from ICGC and one from CHM1/CHM13) represented in the train eWGS for each SD and TP.
Balancing was performed for stable AIVariant training; however, we did not exactly match the numbers during balancing because this resulted in only 38 000 remaining training data, which was too few to train the model well.Therefore, we balanced the training data as much as possible while retaining > 100 000 training data.The final balanced training dataset contained 138 000 data points (Supplementary Table 1).We used 20% of the training data for validation with a batch size of 64 and Adam optimizer with an initial learning rate of 3*1e-3.The epoch was 100, and we only kept the weights of the best-performing model for the validation data.

Extrapolation of PR curve
A somatic variant detection method outputs scores only for the variants it deemed highly probable somatic variants.These scored variants only include a subset of AP somatic variants, and thus the maximum recall value a method can reach is limited.Hence, the PR curve, based on the output scores of a method, suddenly stops at this recall limit, middle of the precisionrecall plane.Moreover, each somatic variant detection method has a different recall limit value due to different definitions of "highly probable" somatic variants.To overcome this complication when comparing PR-AUC, we extrapolated each PR curve by drawing a linear line between the recall limit point (the recall limit value and the corresponding precision value) and the bottom-right corner of the precision-recall plane (recall 1.0 and precision 0.0).
The extrapolation is achieved by adding enough dummy true positives and false positives with the specific score, which is less than the minimum score provided by each method.The number of dummy true positives and false positives was determined as follows: the number of AP variants -the number of true positives = the number of dummy true positives.99 × the number of AP variants -the number of false positives = the number of dummy false positives.

Supplementary Fig. 3. Somatic variant detection accuracy on test eWGS
Evaluation of somatic variant detection accuracy on test eWGS for cases not shown in Fig.               Supplementary Fig. 5 the extended reference genomic positions that included insertion information.Each alignment column could represent the reference position or insertion in alignment, and we extended the reference genomic position by taking insertions in at least one of the aligned reads in four collections when constructing alignment column insertions with lengths > 1, they were left aligned for a consistent extension.From this extended genomic position or alignment columns, we took a window of length 101 centered at the candidate position, which was represented by 101 columns in the AIVariant image.Sixteen channels represented features relevant to somatic variants.The first four channels held one-hot encoded vectors for the reference A/C/G/T bases, and the next four channels held one-hot encoded vectors for the read A/C/G/T bases.The remaining channels represented mapping quality, original base quality, normalized base quality, read strand, soft-clip, distance from read ends, and epigenetic features.The mapping quality was min-max normalized to the [-1, 1] range by taking 0 as the minimum and 60 as the maximum.The base quality was min-max normalized to the [-1, 1] range by taking 0 as the minimum and 41 as the maximum.The distance from the read ends was the minimum distance of the given column position from the 5' read end and 3' read end, and the minimum was divided by half of the read length for normalization.Epigenetic features were obtained from the peak calling score and the p-value of H3K9me3 modification peak calling data generated by the Roadmap Epigenomics project 1 .We used H3K9me3 modification because it is reportedly one of the most relevant epigenetic features 2 .

. Trinucleotide distribution of actual positive and actual negative variants
4rinucleotide distribution of actual positive (AP) and actual negative (AN) variants used to train the AIVariant deep neural network (Supplementary Table1).A trinucleotide was determined as nucleotides from one nucleotide upstream to one nucleotide downstream from an AP variant or AN variant position.The distribution is displayed as in a previous study4.Somatic variant detection process of AIVariant and images used for somatic variant calling.During the four key steps of the detection process (preprocessing, candidate search, image generation, and somatic variant calling steps), sequencing reads in tumor/normal sample BAM files are preprocessed and somatic variant candidates are identified.Images are then generated for the identified candidates and somatic variants are called based on the generated images.
(b) AIVariant inputs and architecture.The inputs consist of AIVariant images, AIVariant stat images, and LOD values.AIVariant image represents the complete picture of alignments around variant with raw features of sequencing reads, but the number of reads it can include is limited by the size of the image.AIVariant stat image summarized features of sequencing reads around a variant and compensate for the limitation of the AIVariant image by statistically summarizing features of all the reads.LOD values are log likelihoods of a variant and help stable training.Each image input is encoded by a residual network and LOD values are encoded by fully connected networks.Encoded vectors are merged and go through another fully connected network.The output is activated with softmax and this value represents the posterior probability for SNV.