Verification of Arabidopsis stock collections using SNPmatch - an algorithm for genotyping high-plexed samples

Large-scale studies such as the Arabidopsis thaliana 1001 Genomes Project aim to understand genetic variation in populations and link it to phenotypic variation. Such studies require routine genotyping of stocks to avoid sample contamination and mix-ups. To genotype samples efficiently and economically, sequencing must be inexpensive and data processing simple. Here we present SNPmatch, a tool which identifies the most likely strain (inbred line, or “accession”) from a SNP database. We tested the tool by performing low-coverage sequencing of over 2000 strains. SNPmatch could readily genotype samples correctly from 1-fold coverage sequencing data, and could also identify the parents of F1 or F2 individuals. SNPmatch can be run either on the command line or through AraGeno (https://arageno.gmi.oeaw.ac.at), a web interface that permits sample genotyping from a user-uploaded VCF or BED file. Availability and implementation: https://github.com/Gregor-Mendel-Institute/SNPmatch.git


Background & Summary
Sample contamination is an unavoidable problem when large-scale experiments are performed. For example, cell lines are frequently misidentified or contaminated, and the need for validation has long been underappreciated 1,2. The same problem applies to any germplasm collection. With an increasing number of experiments utilizing natural variation, plant seed resources are growing rapidly. In Arabidopsis thaliana, the genomes of 1135 strains have recently been sequenced 3 and this panel is now widely used. The need for verifying seed stocks is clear 4. Routine quality checks of seed stock genotypes can guard against common mistakes such as tube mislabeling, seed contamination during harvesting, or sample mix-ups. In principle, genotyping can easily be performed by short-read sequencing, which has high throughput and low error rates. Sequencing and library preparation costs are dropping rapidly, whether for reduced-representation methods like restriction-site-associated DNA sequencing (RAD-seq) or for whole-genome sequencing 5,6. On the other hand, user-friendly tools for the analysis of sequencing data are only starting to become available 3. Here we present SNPmatch, a simple tool for efficiently identifying the most likely strain from a database of A. thaliana strains based on a likelihood model for the given markers (SNPs) in each sample.
We validated SNPmatch using the sequenced 1135 genomes of A. thaliana. SNPmatch readily identified correct genotypes with only a few thousand random SNP markers, far fewer than any sequencing effort would be expected to yield. We then used SNPmatch to perform a quality check of a lab seed stock collection comprising most of the "1001 Genomes" 3 and RegMap 7 panels, a total of ~2000 strains. We performed inexpensive (

The modified sequencing protocol of the second round was validated by comparing its SNPmatch results with those of the first round of sequencing (Fig. 2). Even with lower coverage for the second-round samples due to the higher multiplexing, they gave unambiguous matches to strains that were as good as those for the first-round samples, indicating that coverage and library preparation were sufficient to genotype the samples using SNPmatch.

Sequencing data analysis
All sequencing reads for all samples were processed according to a standard pipeline, outlined in Fig. 1.

(i) Calculate the probability of a match to each accession (a) in the database. This is calculated using the genotype probability scores (PL) from GATK, where j is the SNP position and a_j is the genotype of accession a at position j.

(ii) Likelihoods are calculated for each accession based on the binomial distribution,

ln(L(p ; )) ln(L(~; )) L a = l a n a − 1 n a

where n_a is the number of informative sites between a and the sample.

Users can download the results and run as many analyses as they like. By using our HPC system we can easily process many analyses concurrently. Furthermore, AraGeno also provides a RESTful API for programmatic access to the SNPmatch pipeline. AraGeno is hosted on our on-premises cloud infrastructure.
In the future we plan to extend the service to additionally allow researchers to mail us samples for genotyping and identification.
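The idea behind steps (i) and (ii) under "Sequencing data analysis" can be illustrated with a minimal Python sketch. It is not the SNPmatch implementation. It assumes hard genotype calls rather than GATK PL scores, a hypothetical expected match probability `P_EXPECTED` for a true match, and toy counts. The function names are illustrative only. For each accession, the sketch takes the number of informative sites and the number of matching sites, computes a binomial log-likelihood-ratio statistic against the expected match probability, and ranks accessions so that the smallest statistic (the best match) comes first.

```python
import math

# Assumed probability that a true match agrees at an informative site
# (hypothetical value; leaves room for a small genotyping error rate).
P_EXPECTED = 0.999


def match_probability(n_match, n_info):
    """Fraction of informative sites at which sample and accession agree."""
    return n_match / n_info


def binomial_lrt(n_match, n_info, p_expected=P_EXPECTED):
    """Binomial log-likelihood-ratio statistic, observed vs. expected match rate.

    Returns 0.0 when the observed rate is at least the expected rate,
    i.e. the accession fits as well as a true match would.
    """
    p_obs = match_probability(n_match, n_info)
    if p_obs >= p_expected:
        return 0.0
    n_mismatch = n_info - n_match
    stat = n_mismatch * math.log((1 - p_obs) / (1 - p_expected))
    if n_match > 0:  # skip the 0 * log(0) term
        stat += n_match * math.log(p_obs / p_expected)
    return stat


def rank_accessions(counts):
    """counts: {accession: (n_match, n_info)} -> [(accession, stat)], best first."""
    scored = [
        (acc, binomial_lrt(n_match, n_info))
        for acc, (n_match, n_info) in counts.items()
        if n_info > 0  # accessions without informative sites cannot be scored
    ]
    return sorted(scored, key=lambda item: item[1])


# Toy data (hypothetical): matching sites out of 5000 informative sites.
toy_counts = {
    "acc_A": (4990, 5000),
    "acc_B": (4200, 5000),
    "acc_C": (3100, 5000),
}
ranking = rank_accessions(toy_counts)
for accession, stat in ranking:
    print(accession, round(stat, 2))
```

In this sketch a statistic close to zero means the observed match rate is compatible with a true match, while large values indicate a poor fit. Dividing each statistic by that of the top hit would give a relative score per accession.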

Data Records
Resequencing data for all lines are available in the NCBI SRA database under BioProject accession number PRJNA374784.

Technical Validation
We validated SNPmatch using the raw data from the published 1001 genomes of A. thaliana 3, thinning the data to one, three and six million reads for each sample to test the effect of coverage (one million reads roughly correspond to multiplexing 192 A. thaliana samples in a single Illumina HiSeq 2500 lane, or roughly 1x coverage).
The results were essentially unaffected by this level of thinning, indicating that even 192-fold multiplexing yielded more than enough SNPs to distinguish the strains.
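The relation between read count and coverage used above can be checked with a short calculation. The sketch below is an illustration under assumptions that are not stated in the text: a genome size of about 135 Mb for A. thaliana and a hypothetical read length of 125 bp. Expected coverage is simply the number of sequenced bases divided by the genome size.

```python
# Approximate A. thaliana genome size in base pairs (assumption).
GENOME_SIZE_BP = 135_000_000
# Hypothetical read length; the read length is not given in the text.
READ_LENGTH_BP = 125


def expected_coverage(n_reads, read_length_bp=READ_LENGTH_BP,
                      genome_size_bp=GENOME_SIZE_BP):
    """Expected fold coverage = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


# The three thinning levels used for validation.
for n_reads in (1_000_000, 3_000_000, 6_000_000):
    print(f"{n_reads:>9,d} reads -> {expected_coverage(n_reads):.2f}x")
```

Under these assumptions, one million reads come out at slightly below 1x, in line with the "roughly 1x" figure quoted for 192-fold multiplexing.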
To investigate the required number of SNPs further, we randomly selected subsets of SNPs and ran SNPmatch using only those subsets. As illustrated in Fig. 4, the number of strains identified uniquely is largely independent of the number of SNPs, provided that the number exceeds a few thousand and as long as we do not try to distinguish very closely related strains. In the 1001 Genomes panel, there are 78 North American strains and 40 other pairs of strains that differ by fewer than 1k SNPs, and there are 60 pairs of strains that differ by fewer than 50k SNPs 12. These strains are difficult to distinguish from each other even with millions of SNPs.

Next, we used SNPmatch to validate our own seed stocks. We resequenced almost complete sets of the "1001 Genomes" 3 and "RegMap" 7 collections using the protocols described in Methods. Of a total of 1998 sequenced strains, 1991 yielded sufficiently good SNP data for analysis. Of these, 1797 were assigned to the correct strain (or set of strains differing by fewer than 5000 SNPs in the databases). Of the remaining 194 strains, 82 (30 of the "1001 Genomes" and 52 of the "RegMap" collections) were unambiguously assigned to the wrong strain (or set of closely related strains), indicating sample or strain mix-up (suppl. Table 1). Finally, the remaining 112 (44 of the "1001 Genomes" and 68 of the "RegMap" collections) did not match any strain in the databases (suppl. Table 1). These samples could represent unknown strains, DNA contamination, or outcrossed individuals. However, DNA contamination and outcrossing should both result in an unusually large number of heterozygous calls, which we do not observe (Fig. 5). Therefore, we conclude that the ambiguous calls are most likely due to sample mix-up with unknown strains. We are currently collecting and verifying the 194 incorrect strains from different sources in order to make sure that the stock center has the right germplasm.
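The SNP-subsampling experiment (Fig. 4) can be mimicked on toy data. The sketch below is not the analysis performed for the paper. It assumes simulated, error-free biallelic genotypes, exact matching instead of a likelihood model, and hypothetical panel sizes. It draws random subsets of SNPs of increasing size and counts how many strains remain uniquely identifiable, with one deliberately added close relative that differs from strain 0 at a single SNP.

```python
import random


def simulate_strains(n_strains, n_snps, rng):
    """Random biallelic genotypes (0/1) for each strain (toy data)."""
    return [[rng.randint(0, 1) for _ in range(n_snps)] for _ in range(n_strains)]


def count_unique(strains, snp_idx):
    """Number of strains whose genotype at snp_idx differs from all other strains."""
    keys = [tuple(strain[i] for i in snp_idx) for strain in strains]
    occurrences = {}
    for key in keys:
        occurrences[key] = occurrences.get(key, 0) + 1
    return sum(1 for key in keys if occurrences[key] == 1)


N_SNPS = 2000  # hypothetical number of SNPs in the toy panel
rng = random.Random(1)  # fixed seed so the run is repeatable
strains = simulate_strains(50, N_SNPS, rng)

# Add a close relative of strain 0 that differs at a single SNP (position 0).
relative = list(strains[0])
relative[0] ^= 1
strains.append(relative)

for n_subset in (1, 5, 50, 500, N_SNPS):
    subset = rng.sample(range(N_SNPS), n_subset)
    print(n_subset, "SNPs ->", count_unique(strains, subset), "unique strains")
```

Unrelated toy strains separate with only a handful of SNPs, whereas strain 0 and its relative are told apart only when the single distinguishing SNP is part of the subset. This mirrors why very closely related strains remain hard to distinguish regardless of marker number.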
At least when the parent strains are in the database, SNPmatch can be used to identify hybrid individuals. Such individuals will not have an unambiguous match in the database, but will be equally closely related to both parents (Fig. 6A). For such individuals, running SNPmatch in windows across the genome will quickly reveal their hybrid nature (Fig. 6B). 14
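The windowed comparison can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the SNPmatch code. The genotypes are toy data, heterozygous calls are ignored, the sample is a mosaic of homozygous segments from two parents, and the function name and window size are hypothetical. Genome-wide, the sample matches both parents equally well, while per-window match fractions show which parent each segment comes from.

```python
def window_match_fractions(sample, accession, window_size):
    """Fraction of matching SNP calls in consecutive, non-overlapping windows."""
    fractions = []
    for start in range(0, len(sample), window_size):
        sample_win = sample[start:start + window_size]
        acc_win = accession[start:start + window_size]
        n_match = sum(s == a for s, a in zip(sample_win, acc_win))
        fractions.append(n_match / len(sample_win))
    return fractions


# Toy parents that differ at every second SNP (hypothetical data).
parent_a = [0] * 1000
parent_b = [i % 2 for i in range(1000)]

# Mosaic sample: first half inherited from parent A, second half from parent B.
sample = parent_a[:500] + parent_b[500:]

# Genome-wide match fractions are identical for both parents.
genome_a = window_match_fractions(sample, parent_a, len(sample))[0]
genome_b = window_match_fractions(sample, parent_b, len(sample))[0]

# Windowed match fractions reveal the switch between parents.
frac_a = window_match_fractions(sample, parent_a, 250)
frac_b = window_match_fractions(sample, parent_b, 250)
print("genome-wide:", genome_a, genome_b)
print("parent A windows:", frac_a)
print("parent B windows:", frac_b)
```

In a real analysis each window would be scored against every accession in the database, and the best-matching accession per window would be reported.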