## Introduction

In forensic genetics and clinical diagnostics, high certainty of SNP allele calling is essential. Particularly in criminal cases with small amounts of DNA from crime scenes, the interpretation of DNA results is of utmost importance. The presently used and published SNP allele calling methods are reasonably reliable for optimal amounts of DNA. However, the existing SNP calling methods can be improved and adapted to handle results from suboptimal DNA amounts.

The Illumina method calls one or two alleles that, somewhat confusingly, are called A/B alleles2,3. Simplified, the A/B system tells whether the SNP position has homozygous reference alleles, homozygous alternative alleles, or heterozygous alleles. The A/B alleles must be converted to the standard DNA bases A, T, G, and C using a manifest file4 created by Illumina.

Selecting the correct colour channel for each probe and converting the A/B allele system to the standard DNA bases is a tedious bookkeeping task that involves many different rules. We have developed open-source software that helps with these tasks. The software is published as an R5 package called snpbeadchip6, which selects the correct colour channels and converts the A/B alleles to plus/minus alleles, and an accompanying R package called omni54manifest7, which provides easy access to information about the probes such as the manifest4 and mapping information8. snpbeadchip uses illuminaio9 to read idat files.

Once the signal intensities of the reference and alternative alleles are obtained, the SNP can be called. We propose a method referred to as the “butterfly method” (cf. Fig. 2).

We tested the butterfly method against high-quality PCR-free (shotgun) whole-genome sequencing (WGS) data, which are considered the gold-standard for “concordance”. Furthermore, we compared the butterfly method with SNP calls from the Illumina GenomeStudio software10.

The aims of this paper and the purpose of the method are to provide an open description of a SNP calling method with transparent handling of no-calls that others can use, replicate, and improve. The description of the method used by GenomeStudio, called GenTrain 3.011, is not publicly available12. GenTrain uses the data of the other samples to analyse the sample in question. This can be problematic because, for this kind of analysis, the SNP calling of a sample should in principle not be influenced by other samples. One such problematic situation may arise if the samples analysed together are of varying quality, e.g., a combination of samples with high-quality DNA and samples with partly degraded DNA, as may be the case in a forensic setting. Also, GenTrain does not provide a posteriori genotype probabilities for each SNP of a sample. Hence, there is no obvious way to adjust the no-call algorithm. The GenTrain Score is given for each SNP and is a combined score for all samples.

## Materials and methods

All statistical analyses were made using R5 version 4.1.2 and tidyverse13.

### Ethics

The study was approved by the Committees on Health Research Ethics in the Capital Region of Denmark (H-2-2012-017). The biobank where the samples are held is registered at the University of Copenhagen’s joint records of processing of personal data in research projects and biobanks (514-0725/22-3000) and complies with the rules of the General Data Protection Regulation (Regulation (EU) 2016/679). According to The Danish National Committee on Health Research Ethics, informed consent is not necessary for the samples used in this study (H-2-2012-017). The study was performed in accordance with the ethical standards as laid down in the 1964 Declaration of Helsinki and its later amendments and comparable ethical standards.

### Blood samples and DNA extraction

Peripheral blood samples from three individuals were collected and stored at −20 °C until DNA extraction. DNA extraction was carried out using the DNeasy Blood & Tissue Kit (Qiagen), following the manufacturer’s recommendations for purification of total DNA from whole blood.

### SNP typing using the Illumina Infinium Omni5-4 kit

All samples were analysed using the Illumina Infinium Omni5-4 Kit following the manufacturer’s recommendations with varying DNA amounts. The DNA concentration was measured using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific). Two-fold serial dilutions of DNA from the three samples were performed using nuclease-free water to obtain samples with the following DNA amounts: 400 ng, 200 ng, 100 ng, 50 ng, and 25 ng. Briefly, the DNA was hybridised to the probes attached to the BeadChips. Hereafter, the attached probes were subject to single-base extension and stained. The BeadChips were scanned using the iScan system (Illumina) following the manufacturer’s recommendations.

### PCR-free (Shotgun) whole genome sequencing

Samples were sequenced with the NextSeq500 platform (Illumina, USA) using paired-end sequencing (2 $$\times$$ 150 bases). PCR-free WGS and variant detection were carried out as described in14. AdapterRemoval version 2.1.315 identified and removed adapter sequences from the reads using the collapsed option. Consecutive stretches of low-quality bases ($$Q<30$$) were removed from the 5’ and 3’ termini, and reads shorter than 30 bases were discarded. Phred+33 quality score encoding was used. For alignment, we used BWA-MEM version 0.7.10-r78916 and accepted only properly aligned reads (samtools flag -f 0x2). GATK version 4.0.0.0 with HaplotypeCaller17 with standard settings was used for variant calling. The read depth was approximately normally distributed with a mean coverage of 37, and we only used biallelic SNPs with at least 25 reads.

### The butterfly method

“The butterfly method” is based on a finite mixture of bivariate normal distributions18 with three mixture components: one for each SNP/genotype, i.e., AA, AB, or BB in the A/B allele system2.

Let A and B be the mean signal intensities for alleles A and B in the A/B allele system, respectively. We transformed the mean signal intensities with the natural logarithm after adding one, because a mean signal intensity can be 0 and $$\ln (0) = - \infty$$. Thus, we used $$A' = \ln (A+1)$$ instead of A in the models.
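The transformation can be illustrated with a minimal Python sketch (the analyses in this paper were made in R; the function name and the example intensities below are hypothetical). It uses `math.log1p`, which computes $$\ln (x+1)$$ and returns 0 for an intensity of 0 instead of failing.

```python
import math

def log_intensity(mean_intensity: float) -> float:
    """Return ln(x + 1) for a non-negative mean signal intensity."""
    return math.log1p(mean_intensity)

# Hypothetical mean signal intensities for alleles A and B of one SNP.
a_raw, b_raw = 0.0, 12500.0
a_log, b_log = log_intensity(a_raw), log_intensity(b_raw)
print(a_log, b_log)  # a_log is exactly 0.0 because ln(0 + 1) = 0
```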

Using $$A' = \ln (A+1)$$ and $$B' = \ln (B+1)$$, the model specifies a joint probability density function by

$$\begin{aligned} f(A', B') = \sum _{i=1}^3 \tau _i \phi _i(A', B') \end{aligned} \tag{1}$$

where $$i \in \{1, 2, 3\}$$ indicates the SNP group (e.g., $$i = 1$$ means AA, $$i = 2$$ means AB, and $$i = 3$$ means BB), $$\phi _i$$ is a probability density function for a bivariate normal distribution, and $$\tau _i = P(i)$$ is the a priori (without taking intensities $$A'$$ and $$B'$$ into account) probability that the SNP has type i (and $$\sum _{i=1}^3 \tau _i = 1$$).

In other words, we model the signal intensities as a three-component mixture of bivariate normal distributions. The signal intensities $$A'$$ and $$B'$$ can either come from SNP group 1 (AA), 2 (AB) or 3 (BB). The likelihoods of observing $$A'$$ and $$B'$$ in each SNP group are weighted by $$\tau _i$$.
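As an illustration of Eq. (1), the following Python sketch evaluates a three-component mixture density at a point $$(A', B')$$. The mixing proportions, means, and covariance matrices are hypothetical values chosen only to resemble three genotype clusters; they are not estimates from this study.

```python
import math

def bivariate_normal_pdf(x, y, mean, cov):
    """Density of a bivariate normal with mean (mx, my) and 2x2 covariance."""
    (sxx, sxy), (syx, syy) = cov
    det = sxx * syy - sxy * syx
    dx, dy = x - mean[0], y - mean[1]
    # Quadratic form d^T Sigma^{-1} d, using the closed-form 2x2 inverse.
    quad = (syy * dx * dx - (sxy + syx) * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def mixture_density(x, y, components):
    """Eq. (1): sum over components of tau_i * phi_i(x, y)."""
    return sum(tau * bivariate_normal_pdf(x, y, mean, cov)
               for tau, mean, cov in components)

# Hypothetical (tau, mean, covariance) for the groups AA, AB, and BB.
COMPONENTS = [
    (0.4, (9.0, 5.0), ((0.3, 0.0), (0.0, 0.5))),  # AA: high A', low B'
    (0.2, (8.5, 8.5), ((0.3, 0.2), (0.2, 0.3))),  # AB: both high
    (0.4, (5.0, 9.0), ((0.5, 0.0), (0.0, 0.3))),  # BB: low A', high B'
]

print(mixture_density(8.6, 8.4, COMPONENTS))
```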

The unknown parameters in the model are the a priori probabilities, $$\tau _i$$, and the mean vectors and covariance matrices of the bivariate normal distributions (estimates not shown). The parameters were estimated using the R package mclust18. We chose to model the mixture components as bivariate normal with any shape and orientation; in the mclust terminology, this is called a VVV model.
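To make the estimation step concrete, the sketch below runs a plain expectation-maximisation (EM) loop for a mixture with one unconstrained covariance matrix per component, which corresponds to the idea behind the VVV model. It is a from-scratch Python illustration on simulated data and not mclust's implementation: the starting means, the fixed number of iterations, and the small `ridge` term added to the variances are assumptions made here, and mclust's own initialisation and convergence checks are not reproduced.

```python
import math
import random

def pdf(p, mean, cov):
    """Bivariate normal density; cov is the tuple (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def em_fit(points, means, n_iter=30, ridge=1e-6):
    """EM for a mixture with a free covariance matrix per component."""
    k, n = len(means), len(points)
    taus = [1.0 / k] * k
    covs = [(1.0, 0.0, 1.0)] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for p in points:
            w = [taus[i] * pdf(p, means[i], covs[i]) for i in range(k)]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: weighted proportions, means, and covariances.
        taus, means, covs = [], [], []
        for i in range(k):
            r = [row[i] for row in resp]
            n_i = sum(r)
            mx = sum(ri * p[0] for ri, p in zip(r, points)) / n_i
            my = sum(ri * p[1] for ri, p in zip(r, points)) / n_i
            sxx = sum(ri * (p[0] - mx) ** 2 for ri, p in zip(r, points)) / n_i
            syy = sum(ri * (p[1] - my) ** 2 for ri, p in zip(r, points)) / n_i
            sxy = sum(ri * (p[0] - mx) * (p[1] - my)
                      for ri, p in zip(r, points)) / n_i
            taus.append(n_i / n)
            means.append((mx, my))
            covs.append((sxx + ridge, sxy, syy + ridge))
    return taus, means, covs

# Simulated (A', B') values for three hypothetical genotype clusters.
random.seed(1)
TRUE_CLUSTERS = [((9.0, 5.0), 400), ((8.5, 8.5), 200), ((5.0, 9.0), 400)]
points = [(random.gauss(cx, 0.3), random.gauss(cy, 0.3))
          for (cx, cy), size in TRUE_CLUSTERS for _ in range(size)]

taus, means, covs = em_fit(points, means=[(8.0, 6.0), (8.0, 8.0), (6.0, 8.0)])
print(taus)
print(means)
```

With real data, the estimated $$\tau _i$$, means, and covariance matrices would then be used to calculate the a posteriori probabilities.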

When calling SNPs, we wanted to calculate the a posteriori probability of SNPs belonging to SNP group k, given the signal intensities $$A'$$ and $$B'$$, which is given by

$$\begin{aligned} P(k \mid A', B') = \frac{P(k , A', B')}{P(A', B')} = \frac{P(A', B' \mid k) P(k)}{\sum _{i=1}^3 P(A', B' \mid i) P(i)} = \frac{\tau _k \phi _k(A', B')}{ \sum _{i=1}^3 \tau _i \phi _i(A', B') } . \end{aligned} \tag{2}$$

The a posteriori probability, $$P(k \mid A', B')$$, is the a priori probability of being in group k, $$\tau _k$$, regardless of the allele intensities $$A'$$ and $$B'$$, multiplied by the likelihood of the allele intensities in SNP group k, $$\phi _k(A', B')$$, and normalised (denominator) so that the sum of the a posteriori probabilities of the three SNP groups is 1.
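The normalisation in Eq. (2) can be sketched in a few lines of Python. The function takes the a priori probabilities $$\tau _i$$ and the component densities $$\phi _i(A', B')$$ already evaluated at one SNP; the numerical values are hypothetical.

```python
def posterior_probabilities(taus, densities):
    """Eq. (2): tau_k * phi_k divided by the sum over all components."""
    weighted = [tau * phi for tau, phi in zip(taus, densities)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical a priori probabilities and densities for AA, AB, and BB.
taus = [0.4, 0.2, 0.4]
densities = [0.001, 0.5, 0.002]

post = posterior_probabilities(taus, densities)
best = max(range(len(post)), key=lambda i: post[i])
print(post, ("AA", "AB", "BB")[best])
```

In this example the AB group receives almost all of the probability mass even though its a priori probability is the smallest, because its density at the observed intensities is much higher than that of the other groups.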

We present three variants of the butterfly method. Different data sets were used to train (estimate the parameters of) the three-component mixture model: 1) each sample was its own reference using all SNPs simultaneously, 2) like 1, except using separate models for the two probe types (I/II), and 3) an ensemble model using all samples to estimate a single model.

#### SNP calling

We used the recommended settings in GenomeStudio10 and the Genotyping Project Wizard to import sample data into Genome Studio based on sample intensities (not based on existing cluster files) and the “Cluster All SNPs” function under the “Analysis” tab in GenomeStudio to generate SNP clusters.

For the butterfly method, we called the SNP genotype with the maximal a posteriori probability, except for situations with no-calls (NC). We chose always to make a NC if the mean signal intensities for both alleles A and B were 0.

We investigated two ways of making a NC. Firstly, if the maximal a posteriori probability was below a certain threshold, we made a NC. This was done for a range of thresholds (from 0.5 to 0.999). Secondly, we chose to consider the number of beads with which the SNPs had been investigated, and on which the mean signal intensities were based. If the number of beads was below five, we made a NC. We also used a threshold of zero beads.

For probe type II, the same beads capture both alleles, so there is only one number of beads for each investigated position. For probe type I, different beads capture each allele, so there is a certain number of beads for allele A and another one for allele B. For probe type I, both numbers of beads must be above the threshold.
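The no-call rules described above can be combined as in the following Python sketch. The function name, argument layout, and default thresholds are hypothetical. The sketch assumes that a bead count equal to the threshold is accepted (a NC is made only when a count is below the threshold), so a threshold of zero beads disables the bead criterion.

```python
GENOTYPES = ("AA", "AB", "BB")

def call_genotype(posteriors, mean_a, mean_b, n_beads,
                  prob_threshold=0.8, bead_threshold=5):
    """Return 'AA', 'AB', 'BB', or 'NC' for one SNP.

    posteriors: a posteriori probabilities for AA, AB, and BB.
    n_beads: one bead count for probe type II, or two counts
             (allele A, allele B) for probe type I.
    """
    # Always a no-call when both mean signal intensities are 0.
    if mean_a == 0 and mean_b == 0:
        return "NC"
    # Every bead count must reach the threshold (assumption: >= passes).
    if any(n < bead_threshold for n in n_beads):
        return "NC"
    # Call the group with the maximal a posteriori probability,
    # unless that probability is below the threshold.
    best = max(range(len(GENOTYPES)), key=lambda i: posteriors[i])
    if posteriors[best] < prob_threshold:
        return "NC"
    return GENOTYPES[best]

print(call_genotype((0.01, 0.98, 0.01), 3000.0, 2800.0, (12,)))    # type II
print(call_genotype((0.01, 0.98, 0.01), 3000.0, 2800.0, (12, 3)))  # type I
```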

Imposing such NC thresholds results in calling fewer alleles but with higher confidence in the alleles called.

#### Other methods

The GenoSNP method introduced in19 is a within-sample method that uses a four-component mixture of t-distributions (like the normal distribution, but with heavier tails), where the fourth cluster is a “null class” for capturing outliers. The calls are made by identifying the cluster with the maximal a posteriori probability. Hence, the no-calls are selected when the null class has the highest a posteriori probability. The classification of the outliers in a single “null class” is problematic as the outliers do not behave in the same way. Outliers are not expected to be distributed according to a t-distribution and grouped in the same cluster in the $$(A', B')$$ space.

The M3 method introduced in20 is similar to that in19, except that a four-component mixture of normal distributions is used and the focus is on calling rare variants. In20, the a posteriori probability is mentioned, but only in connection with calculating the average a posteriori probability for each SNP.

To summarise, our paper contributes the following novel work: a) analysis of data obtained with the Illumina Infinium Omni5-4 Kit by comparing SNP calls made by GenomeStudio and GenTrain 3.010,11; b) a demonstration of how the a posteriori probabilities and the numbers of beads can be used to categorise NC (instead of including an unrealistic null-cluster), with an analysis of how they impact the concordance with WGS calls; c) a comparison of how a non-sample-specific (general) model and a sample- and probe-type-specific model perform. Our method is available as the R software packages snpbeadchip6 and omni54manifest7, used together with the existing R software package mclust18, which estimates the mixture model with the function Mclust(..., G = 3, modelNames = "VVV") and calculates a posteriori probabilities with the function predict().

We chose not to include all the above methods because the main focus of this paper was to explore the possibility of adjusting the NC rate and offering open-source software for this purpose.

## Results

The Omni5-4 manifest7 has 4,327,108 SNPs. We removed 271,680 SNPs (details in Table 1) and ended up with 4,055,428 autosomal SNPs (93.7% of the original). Of the 4,055,428 SNPs included, 135,419 SNPs were typed with type I probes (ambiguous), and 3,920,009 were typed with type II probes (unambiguous).

Fitting the VVV model with mclust18 on the object x (i.e., Mclust(x, G = 3, modelNames = "VVV")) took about 3 minutes per sample (4,055,428 SNPs) on an AMD EPYC 7351 16-Core Processor. The x object contained the $$\log$$ transformed signal intensities for all 4,055,428 SNPs in its two columns, and it occupied 62 MB of memory in R.
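The reported memory use agrees with a back-of-the-envelope calculation, sketched here in Python under the assumption that x is stored as a two-column matrix of 8-byte double-precision numbers and that object overhead is negligible.

```python
N_SNPS = 4_055_428
N_COLUMNS = 2          # log transformed intensities A' and B'
BYTES_PER_DOUBLE = 8   # assumption: values stored as double precision

size_bytes = N_SNPS * N_COLUMNS * BYTES_PER_DOUBLE
size_mib = size_bytes / 2**20
print(size_bytes, round(size_mib, 1))
```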

The densities of the numbers of beads for both probe types for the three individuals are given in Fig. 1.

The mean signal intensities with an illustration of the models are shown in Fig. 2.

We only used biallelic SNPs and alleles that were called with a read depth of at least 25 with WGS. This resulted in 3,139,554 SNPs called (77.42% of 4,055,428) for Sample 1; 3,875,844 SNPs called (95.57% of 4,055,428) for Sample 2; and 3,972,024 SNPs called (97.94% of 4,055,428) for Sample 3. We considered these SNP calls reliable.

Focusing on only SNPs reliably called with WGS, we calculated the concordance rates as described below.

The SNP calling is illustrated in Fig. 3 for 400 ng DNA with the “Butterfly (sample)” method and an a posteriori probability threshold for NC of 0.8.

Of the SNPs called by WGS, the numbers of SNPs called by the other methods are given in Fig. 4.

For the SNPs with WGS-based SNP calls (i.e., not NC) and SNP calls with other methods (i.e., not NC), the concordances between the WGS call and the methods were calculated (Fig. 5). The figure shows how reliable a method’s calls are when the NCs are excluded.
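The two quantities can be sketched in Python as follows: the call rate among SNPs called by WGS, and the concordance among SNPs called by both WGS and the method. The data layout (dictionaries from SNP identifier to a genotype string or "NC") and the example calls are hypothetical.

```python
def call_rate_and_concordance(wgs_calls, method_calls):
    """Return (call rate, concordance) relative to the WGS calls.

    Both arguments map a SNP identifier to a genotype string or 'NC'.
    Assumes at least one SNP is called by both WGS and the method.
    """
    # Only SNPs with a WGS call serve as the reference.
    reference = {snp: g for snp, g in wgs_calls.items() if g != "NC"}
    # Of those, keep the SNPs that the method also called.
    called = {snp: method_calls[snp] for snp in reference
              if method_calls.get(snp, "NC") != "NC"}
    call_rate = len(called) / len(reference)
    concordant = sum(1 for snp, g in called.items() if g == reference[snp])
    return call_rate, concordant / len(called)

# Hypothetical calls for five SNPs.
wgs = {"snp1": "AA", "snp2": "AG", "snp3": "GG", "snp4": "CT", "snp5": "NC"}
method = {"snp1": "AA", "snp2": "AG", "snp3": "AG", "snp4": "NC", "snp5": "CC"}

rate, concordance = call_rate_and_concordance(wgs, method)
print(rate, concordance)
```

Here snp5 is ignored because WGS made no call, snp4 lowers the call rate, and snp3 is a heterozygous call where WGS called homozygous, which lowers the concordance.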

If one accepts fewer calls by increasing the a posteriori probability threshold and the number of beads threshold, the calls will be more reliable (Figs. 4 and 5).

The importance of the DNA amount for choosing an a posteriori probability threshold can be seen in Fig. 6. For a fixed concordance, the smaller the DNA amount, the higher the a posteriori threshold must be. The lines did not follow the DNA amount ordering, possibly due to saturation and/or quantification errors, but the 25 ng and 50 ng DNA lines were often below those of 200 ng and 400 ng DNA.

An overview of the discordant calls (excluding no-calls for both WGS and the methods) for 400 ng DNA is shown in Fig. 7, and homo- and heterozygous allele calls are summarised in Fig. 8.

Based on Figs. 7 and 8, it seems likely that the butterfly method’s discordances are due to heterozygous calls where WGS called homozygous, e.g., AG instead of GG, CT instead of CC, CT instead of TT, and AG instead of AA. A similar pattern was seen with GenomeStudio, but not to the same degree. GenomeStudio made more homozygous discordances, e.g., AA instead of TT, TT instead of AA, CC instead of GG, etc.

Figure 9 shows the no-call distribution with GenomeStudio of samples with 400 ng DNA.

## Discussion

Of the SNPs called with WGS, the butterfly method called more than 99.5% unless high thresholds for the a posteriori probability and number of beads were used (cf. Fig. 4). The SNP calling concordance between the butterfly method and WGS was 99.0–99.5% (Fig. 5).

We began with 4,055,428 SNPs (Table 1). Extrapolating from the call rate of 99.5%, with the uncertainty that this involves, we expected the butterfly method to call approximately 4,035,151 SNPs and to make no-calls for the remaining 20,277 SNPs. With a concordance of 99.0–99.5%, 3,994,799–4,014,975 of the called SNPs would be reliable calls and 20,176–40,352 would be discordant calls. This gives a concordant call-rate of all SNPs of around $$0.99^2$$ = 98%, not taking the uncertainty into account. This emphasises that the numbers are adjustable by the two proposed thresholds (a posteriori probability and number of beads). The adjustment is easily done using the R packages mclust18 and snpbeadchip6.
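The extrapolation can be retraced with the following Python sketch, which applies the call rate of 99.5% and the concordance range of 99.0–99.5% from the text to the 4,055,428 SNPs. Rounding to whole SNPs is an assumption made here.

```python
TOTAL_SNPS = 4_055_428
CALL_RATE = 0.995
CONCORDANCE_RANGE = (0.990, 0.995)

called = round(TOTAL_SNPS * CALL_RATE)   # SNPs expected to be called
no_calls = TOTAL_SNPS - called           # SNPs expected to be no-calls

# Called SNPs that agree, or disagree, with WGS at each end of the range.
concordant = [round(called * c) for c in CONCORDANCE_RANGE]
discordant = [called - n for n in concordant]

print(called, no_calls)
print(concordant, discordant)
print(round(0.99 ** 2, 2))  # approximate concordant call rate of all SNPs
```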

The importance of the DNA amount and the choice of the a posteriori probability threshold can be seen in Fig. 6, which shows that for a fixed concordance, the a posteriori threshold should generally be increased for smaller DNA amounts.

We did not inspect the SNP calls from GenomeStudio and the butterfly method individually, except for excluding some SNP data at the preliminary filtering stage (Table 1). However, we believe the butterfly method’s posterior probabilities together with signal standard deviation, number of beads, etc. will be helpful in individual inspection.

Improving the SNP calling is a topic of future research. There are many ways to improve the SNP calling method proposed here. Using the WGS calls as a reference, a natural next step of modelling is discriminant analysis based on Gaussianity in a supervised learning setting. Including more explanatory variables would also enable more advanced statistical learning methods. In this study, we did not include signal variance information, which may improve SNP calling. Another option is to use probe information like base composition, colour channel, neighbour bases, etc., as explanatory variables/features. This may enable statistical learning methods like multinomial logistic regression, random forests, and deep learning.

## Conclusion

We introduced the “butterfly method” for SNP allele calling with the Illumina Infinium Omni5-4 Kit1 without using Illumina’s GenomeStudio software10. The method is a within-sample method and does not use other samples or population frequencies to call the SNP alleles. The butterfly method is based on a three-component mixture of normal distributions, whose parameters are easily estimated using the open-source statistical software R. The method is transparent, the parameters are straightforward to change according to the user’s needs, and the data are easy to analyse within R after SNP calling. We have published two open-source R packages, omni54manifest7 and snpbeadchip6, that make SNP calling easy by helping with bookkeeping and giving easy access to meta-information about the SNPs typed with the Illumina Infinium Omni5-4 Kit (including chromosome, probe type, and SNP bases).

We tested our method on more than 4 million SNPs and compared the results with those obtained with the GenTrain method used by Illumina GenomeStudio as well as SNPs obtained by PCR-free (shotgun) WGS. We demonstrated two variants of our method: one that takes potential probe type bias into account by estimating a separate model for each probe type (type I and type II), and another that uses a general model such that the model’s parameter estimates do not depend on the sample that is being analysed. We focused on varying the no-call rate and showed how it changed the concordance with WGS. This was done by using a threshold on the a posteriori probability of belonging to a SNP cluster and by using the number of beads to adjust the stringency of the no-call mechanism.

With the butterfly method, we achieved a SNP call rate of around 99% and a SNP concordance with the WGS data of around 99%. By lowering the a posteriori probability threshold for no-calls, we obtained a higher call rate than GenomeStudio, and by increasing the a posteriori probability threshold, we achieved a higher concordance with the WGS data than GenomeStudio.