A compendium of genome-wide sequence reads from NBS (nucleotide binding site) domains of resistance genes in the common potato

SolariX is a compendium of DNA sequence tags from the nucleotide binding site (NBS) domain of disease resistance genes of the common potato, Solanum tuberosum Group Tuberosum. The sequences, which we call NBS tags, for nearly all NBS domains from 91 genomes—representing a wide range of historical and contemporary potato cultivars, 24 breeding programs and 200 years—were generated using just 16 amplification primers and high-throughput sequencing. The NBS tags were mapped to 587 NBS domains on the draft potato genome DM, where we detected an average, over all the samples, of 26 nucleotide polymorphisms on each locus. The total number of NBS domains observed, differed between potato cultivars. However, both modern and old cultivars possessed comparable levels of variability, and neither the individual breeder or country nor the generation or time appeared to correlate with the NBS domain frequencies. Our attempts to detect haplotypes (i.e., sets of linked nucleotide polymorphisms) frequently yielded more than the possible 4 alleles per domain indicating potential locus intermixing during the mapping of NBS tags to the DM reference genome. Mapping inaccuracies were likely a consequence of the differences of each cultivar to the reference genome used, coupled with high levels of NBS domain sequence similarity. We illustrate that the SolariX database is useful to search for polymorphism linked with NBS-LRR R gene alleles conferring specific disease resistance and to develop molecular markers for selection.

Scientific RepoRtS | (2020) 10:11392 | https://doi.org/10.1038/s41598-020-67848-z www.nature.com/scientificreports/ The defensive response of plants to pathogens is mediated through the protein products of resistance genes (R genes). To prevent yield losses resulting from disease susceptibility, it is essential that cultivars used in potato crop production contain disease resistance alleles. The European potato's disease resistance is the result of at least 110 years of (pre-) breeding work to introgress R genes from wild and native cultivated genetic resources. One systematic approach at introgressing new resistances was made by R.N. Salaman, who added resistance to late blight 5 . This marked the start of a series of introgressions 6 and the development of new, resistant cultivars until present 7,8 . A common breeding scheme has been to cross a Group Tuberosum parental cultivar with a genebank accession of tuberous Solanum, either wild or domesticated, selected for a specific disease resistance. This is then followed by several rounds of backcrossing and selecting for individual, resistant progenies that also show other desirable characteristics of the Group Tuberosum parent. In potato, the total number of successful introgressions via crossing is not known, there may be from dozens up to over one hundred. Once stably introgressed, an R gene will be maintained in the breeding pool, and the number of different resistance alleles accumulated would increase with every new generation of potato cultivars. Potato breeding and selection has lagged behind other crops due to the complexity of this tetraploid, heterozygous plant. As such, characterizing the repertoire of R genes to facilitate development of disease resistant varieties is of great significance. R1 was the first R gene characterized for resistance to late blight 9 resulting from Salaman's introgression work dating back to 1909 5 . R1 resides on an insertional, introgressed locus of the Group Tuberosum potato 10 . The R1 resistance allele has been found to be present in the genomes of multiple Group Tuberosum clones, which include some contemporary commercial potato varieties 11 .
As most R genes contain both nucleotide-binding site (NBS) and leucine-rich repeat (LRR) domains, conserved motifs from both NBS and LRR domains have been efficiently employed to identify the majority of plant NBS-LRR R genes [12][13][14][15][16] . Jupe et al. 13 , using close to 50,000, 120-mer biotinylated RNA oligos, enriched fragments putatively containing NBS-LRR domains from the genomic DNA of one Group Tuberosum cultivar and detected a total of 755 R genes containing NBS and LRR domains. They allocated 704 of these to the 12 established potato chromosomes and 51 to unanchored superscaffolds.
Jupe and colleagues' 13 755 R genes refer to the potato's current, publicly available sequenced genome (version 4.03) of the diploid (doubled monoploid) S. tuberosum Group Phureja clone (clonal cultivar) DM 1-3 516 R44 15 . The homozygosity of this clone provided ease in sequencing of its genome. Group Phureja has been claimed to be a close relative of the tetraploid potato not only by taxonomy 17 but also by genomic fine structure (Andean; Group Andigena and European; Group Tuberosum 13,15 ). Nonetheless, despite their close taxonomic relationship, the Group Phureja genome still differs somewhat from the Group Tuberosum genome. The individual used for sequencing, clone DM 1-3 516, may represent a rather small section of the whole genetic variability within the ancient cultivated potatoes comprising Group Phureja. S. phureja was, in fact, considered in pre-molecular taxonomy treatments as a separate species 4 .
Gene duplication has been considered to play an important role in the expansion of the R gene family. R genes are largely organized in clusters, also known as multiple-copy R loci that contain variable numbers of gene copies. Copy number variations differ on the cluster as well as the species level 12,18 . The majority of NBS-LRR R genes in the DM genome occur stacked in clusters, where complete (putatively) functional genes alternate with incomplete ones. This is seen as a source of variation and a plant's reservoir for producing new functional R alleles via frameshift recombination and DNA repair processes 19,20 . Consequently, not only the number of R alleles, but the number of R genes may vary, even at cultivar level within a species 21 .
Recent high-throughput DNA sequencing methods open workable ways to elucidate the genetic diversity within the genepool of a single species. We employed Illumina HiSeq-2000 technology to mass-sequence highly conserved, characteristic 200 to 480-base pair (bp) fragments of the NBS domains of R genes, which we denominate NBS tags. These NBS tags were enriched from genomic fragments in a method similar to the NBS profiling of van der Linden et al. 16 -i.e. PCR amplification using a small number of priming oligonucleotides that target extremely conserved motifs of the NBS domain of R genes. These conserved motifs together with nearby highly polymorphic sequences, were used to efficiently survey the R gene pool for 91 genomes comprising both historical and contemporary potato cultivars. We detected in total 587 distinct NBS domains on the DM genome, 576 of these corresponded to NBS-LRR loci as laid out by Jupe et al. 13 and 11 are additional. We provide the collection of the identified NBS domains of R genes on our SolariX website. This expands the current knowledge of the variability of potato R genes to a cultivar level resolution.
Our experimental questions were: 1. Is it possible to detect and (with Illumina HiSeq-2000 technology) distinguish individual R genes carrying NBS and LRR domains by inspecting NBS tags enriched via a small set of PCR primers that target conserved sections of the NBS domain? If this question can be answered positively, then: 2. How many NBS-LRR R genes can be detected, by this method, within the genepool of S. tuberosum Group Tuberosum? 3. Can individual alleles of R genes be distinguished through variations in sequences within their corresponding NBS tags and can we determine how many alleles there are? 4. Can specific patterns be discerned from how these R alleles are distributed across the groups of germplasm that are represented in the experimental materials (clones of different ages, generations and breeders)? 5. Can the NBS tagging method be used to efficiently discover R alleles responsible for specific resistances and can it be used to create selection markers for these alleles?

Results
capture of nBS domains of the R gene pool using a limited number of pcR primers. Design of PCR primers for the P-loop, Kinase-2 and GLPL motifs of the NBS domain. The goal was to define a minimal set of primers that would amplify, out of genomic DNA, the maximum number of individual R gene alleles. The primers were chosen to be complementary to one of the highly conserved P-loop, Kinase-2 and GLPL motifs within the NBS domain of R genes (Fig. 1a). To design degenerate primers that could cover, as broadly as possible, the sequence diversity of the NBS domains of R genes present in the potato genome, tblastn searches across all 438 NBS-LRR R genes (the most current annotation at that point) mapped by Jupe et al. 12 on the 12 established DM v 3.4 chromosomes for sequences of the NBS domain was performed. This returned 435 distinct nucleotide sequences, which were aligned, grouped by the similarity of their sequences and visually compared to pinpoint polymorphic positions within each of the three NBS-motifs. The highly conserved GLPL motif neighbors the highly polymorphic LRR domain of an R gene. Our primers targeting GLPL were chosen to allow for extension of the newly formed nucleotide chain from the NBS into the LRR domain (minimum 60 nucleotides) in order to include parts of the variable sequence, consequently increasing the individuality of the amplified fragment and the chance of unambiguous read mapping. Originally, we designed 24 PCR primers using degeneracy at individual nucleotides to match the sequence diversity across all 435 NBS-LRR loci on DM v 3.4. The functionality of these 24 primers was then verified by PCR on genomic DNA as a template. Fifteen of these 24 primers, as well as the NBS5 primer of van der Linden et al. 16 that yielded abundant amplicons, were finally selected for the experiment ( Table 1). As will be demonstrated further down, these in total 16 primers covered virtually all (complete and incomplete) R genes of the DM genome that carry at least one of the three NBS domain-specific motifs. www.nature.com/scientificreports/ populations segregating for disease resistance, and the wild non-tuberous Solanum caripense (CRP, for details see "Materials and methods"). The samples were obtained from a variety of European countries (Fig. 1b). Care was taken to include samples of the majority of original European potato breeding stocks, as is indicated in Table 2. Live samples of potato cultivars and breeding clones were kindly provided by European breeders and a genebank. Five libraries, out of the 96, represent bulked DNA of disease resistant or susceptible segregants from our mapping populations. These were the progeny libraries in the following: For the tetraploid potato population MT 22 ( M; cv. MF-II, T; cv. TPS67) segregating for R gene resistance to late blight, three libraries of progenies pooled by their distinct phenotype of resistance were included. As an outgroup, four libraries of diploid CRP were included, parental genotypes C1 (late blight resistant) and K4 (susceptible) and progenies Crp_pool_R and Crp_pool_S comprising either 10 late blight resistant (by the phenotype) or 10 susceptible individuals of a segregating CK progeny 23 .
Targeted amplification and sequencing of genome-wide fragments containing NBS domain motifs. The 16 NBS-motif primers were then used for targeted amplification in all 96 libraries after digestion with a 4-cutter restrictase and ligation of adapters as indicated in van der Linden et al. 16 (see "Materials and methods" and Fig. 1c). In an initial linear PCR the NBS-motif primers were applied exclusively, allowing for enrichment of fragments containing NBS domain motifs. For subsequent exponential PCR the same NBS-motif primers were applied together with restriction site-specific adapter primers covering all restriction enzymes. PCR with all 16 primers was successful on all 96 DNA libraries, and per library, a single pooled sample containing the amplicons Table 1. Degenerate primers for selective PCR amplification of sequence tags from NB-LRR (Nucleotide Binding site and Leucine Rich Repeat) domains of R (Resistance) genes in the potato genome Primer nucleotide sequence and individual annealing temperature for optimal amplification results. I *-the Illumina  adapter for MiSeq sequencing: TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG. The universal adapter  primer for Illumina MiSeq was: GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GAC TCG ATT (Fig. 2b), confirming the specificity of the NBS-capture protocol. Within R genes, NBS domains had substantially higher length normalized read coverage than the non-NBS domains. The non-NBS domains did have some marginal coverage as compared to non-R genes, indicating that short stretches of the adjacent LRR domain were also sequenced. When we evaluated the fraction of bases with read coverage in the different regions, we observed substantially higher sequenced fractions in NBS domains of R genes (Fig. 2c).    13 , indicated in Supplementary Tables S1 and S2 as NB_GTP_1 through NB_GTP_11 (where NB refers to Nucleotide Binding site and GTP refers to Group Tuberosum Pool) followed by the PGSC gene number. Each of these NB_GTP harboring loci, located in genes not previously designated as resistance genes, was represented by a substantial number of mapped reads that were found to contain NBS domains with our DM-specific NB-ARC HMM search. Further confirmation was obtained with a tblastn search for the P-loop, GLPL, and Kinase-2 motifs (from Jupe et al. 12 ) in these regions. To determine if the genes had partial or complete NBS-LRR sequences, mast 24 was used to detect all NBS-LRR motifs (published in Jupe et al. 12 ), and NLR-Parser 25 was used for classification of the gene. We found only 1 out of the 11 genes to be complete. The remaining 10 were partial NBS-LRR genes. In addition, 8 were classified as CC-NBS-LRR (CNL) proteins (Supplementary Table S2). polymorphisms. This appears to be associated with the total sequencing depth, as was also seen with the total number of detected loci. For a comparison of the extent of polymorphisms present across all cultivars, without the influence of the effects of sequencing depth, a list of sites within NBS domains that had at least tenfold coverage in all 96 libraries (i.e., no library possesses < 10 sequence reads at the genomic position) was used (Fig. 3a). The average number of NBS domain polymorphisms per cultivar was 4,637. The cultivar with the least number of polymorphisms was Celtiane with 4,328 while the sample with the highest number of polymorphisms was Crp_pool_S with 6,247. The number of polymorphisms among the tetraploid potato cultivars ranged from 4,328 to 5,100 and did not differ substantially from the average. The wild diploid C1 and K4 and pools of their cross progenies, Crp_pool_R and S, had a noticeably higher number of polymorphisms detected (5,948-6,247). Figure 3b shows the distributions of number of polymorphisms per 100 bp (at positions within NBS domains that had at least tenfold coverage in all libraries) over the different NBS-LRR R gene clusters for the 96 libraries (We used the numbering of R gene clusters according to Table S2a in Jupe et al. 13 ). Details on each cluster's number of NBS domains, total NBS domain lengths as well as the average number of polymorphisms per 100 bp, can be found in the Supplementary Table S3. NBS domains of genes that do not belong to a cluster, singletons, ("-" in SolariX website. The comprehensive resource of variants detected in the 96 cultivars within the NBS domains has been made available on the SolariX website at www.cibiv .at/Solar iX (Fig. 4). After registration, one may download the IUPAC code DNA sequences representing the variants found in a cultivar of interest for any annotated NBS domain (Fig. 4a). In addition, polymorphic sites in NBS domains can be visually compared across all 96 cultivars, together with information of read coverage support for each nucleotide (Fig. 4b). We expect that this may facilitate the search for particular resistance alleles in individual cultivars (we include some examples in this manuscript).

Variation detected in the samples (cultivars).
Testing the possibility to detect individual NBS alleles. An attempt was undertaken to identify individual biological alleles (haplotypes) per cultivar. With variant calling using FreeBayes, multi-nucleotide polymorphisms (MNPs) or complex polymorphisms could be identified whenever proximal alleles were on the same read. To extend the analysis beyond individual reads, the variants determined by FreeBayes were used for phasing, i.e. to determine the linkage of variants in assembled reads, within NBS domains using HapCompass 27 . We found that increasing the specified ploidy number beyond 4 resulted in increased haplotypes found. NBS domains of single genes within the genomic library of a single cultivar showed from four up to twelve alleles, thus exceeding the maximum number of four possible in a tetraploid. In Fig. 5, we show the average read coverage (Fig. 5a) and the average number of unique haplotypes found (Fig. 5b) for 13 cultivars at each of the 38 NBS domains on chromosome XII when ploidy was set at 12. No clear trend appears to be present between the read coverage and number of unique haplotypes found.
This made us suspect that we had an issue of "mixture loci", where the regions of interest based on the reference genome actually contained reads from different loci. There were two potential causes of this. Firstly, the DM reference genome differs to an unknown extent from the Group Tuberosum genome of the cultivated varieties we sequenced. It is possible that reads obtained from regions present in the cultivars, but not represented in the reference, map to alternative best hit positions in the DM genome, thus artificially increasing the ploidy at the mapped regions. Secondly, the substantial sequence similarity of the NBS domains across the R genes, a consequence of multiple duplication events in the expansion of NBS-LRR R genes 28,29 , could cause mis-mapping of reads when sequenced sections of NBS domains are not unique. The issue of read mis-mapping is exacerbated in NBS-LRR clusters with high copy numbers and high sequence similarity. Therefore, it is possible that a locus is assigned reads originating from other loci.
Scientific RepoRtS | (2020) 10:11392 | https://doi.org/10.1038/s41598-020-67848-z www.nature.com/scientificreports/ Mis-mapping that results from a combination of high NBS domain sequence similarity and short read lengths (the read length of HiSeq-2000 was 100 bp) could be addressed with longer sequence reads (such as MiSeq; read length, 300 bp), from which sufficiently long sections of NBS domains can be identified, that are distinguishable from other NBS domains of high sequence similarity. Re-sequencing of 6 cultivars or pools of progenies, TPS67, MF-II, poolM, poolT, poolS and clone R2 (late blight resistance differential 9 ), was conducted specifically at a region on chromosome IV using a MiSeq sequencer (Illumina) to obtain paired-end reads of 300 bp each. Sequencing depths ranged from 1.18 to 9.47 M. The MiSeq reads were mapped with NextGenMap 30 at default parameters, as was with our HiSeq reads. This involves local alignment of the read with a required minimum mapped length of 0.5 fraction of the read-length and a required minimum 0.65 fraction identity between the mapped read portion and the DM genome. The mapping criteria are based on fractions of the read length and therefore account for proportional increases in number of polymorphic positions when evaluating longer genomic regions with the MiSeq reads. Despite longer read lengths, the percentage of reads mapped ranged   www.nature.com/scientificreports/ from 60.6% to 64.8%, which were much lower than those obtained in our HiSeq data. We did observe that the sequencing base quality scores at the ends of the MiSeq reads dropped drastically, especially so for the 2nd read of the pair, which could have contributed to the lower mapping rates. Although we anticipated that longer reads would allow for better distinction between highly similar NBS domains, we found an increase in the percentage of reads that mapped to multiple locations from an average of 22.23% of the mapped reads in the HiSeq libraries of the 6 cultivars to an average of 33.56% of the mapped reads in the MiSeq libraries. In addition, mapped reads having a mapping quality of greater than 20 decreased from an average of 27.03% of the mapped reads in the HiSeq libraries to an average of 10.03% of the mapped reads in the MiSeq libraries. The DM reference represents a rather small section of the whole genetic variability within the cultivated potatoes, and the greater percentage of unmapped reads as well as decrease in mapping qualities in our MiSeq data suggests considerable sequence divergence between the Group Tuberosum cultivars to the reference. Further examination of the unmapped reads, for a single MiSeq library (MF-II), using a blastn search to the NCBI nucleotide database revealed that 62.88% of the 648,960 reads had a significant blast hit (e value smaller than 1e-5) in either the Solanum genus (7.72%) or the tetraploid species Triticum turgidum (emmer wheat, 44.24%) to mostly nucleotide sequences representing R genes (Supplementary Figure S1). The unmapped reads therefore could contain novel R gene content that is currently absent from the DM genome. In fact the DM genome remains incomplete, with improvements still ongoing. This provides strong evidence that insufficient representation on the reference genome was a greater issue than short read lengths and high sequence similarity.
Inferring genetic similarity between cultivars. If certain genomic regions in our sequenced varieties lack representation on the DM genome, and reads from these regions map to alternative best hit positions, the variation of coverage of sequence reads obtained in the NBS domain per R gene (read coverage frequency, RCF) may be useful as a measure of divergence of a certain variety to the reference genome, DM. We hypothesized that the percentage at a single domain, out of all reads obtained for a cultivar could be informative, to some extent, about the genetic diversity. This approach implicitly allows us to utilize the copy number variations of R genes within clusters, which we expected to differ even on the cultivar level. Therefore, the absolute numbers of reads at the individual NBS domains of a cultivar were represented as a percentage of the total reads mapped to all NBS domains obtained for this cultivar (i.e. the sum of RCF values of the 587 NBS domains equals 1 in each cultivar library). These normalized figures, we termed NBS domain RCF vectors, were then used to calculate the Euclidean distance between all pairs of cultivars. A dendrogram of hierarchical clustering using the Euclidean distance and a compressed heat map of the read frequencies by NBS domains of R genes of every cultivar is shown in Fig. 6, details can be seen in the Supplementary Fig. S2. The four libraries of our outgroup taxon, S. caripense, consisting of parental C1 and K4 and two pools of cross progenies (Group Y in the Figure), form a clearly distinct cluster distant to all potatoes, as was expected. The five libraries representing parents and some progenies of population MT (Group X) also clustered together showing only minimum distances. Overall, all potatoes were distributed across two major branches, one containing 21 cultivars (Fig. 6, Reichskanzler-Performer) and the other 71 (Sinora-Ibis). The first branch containing 21 cultivars shows two distinct subbranches, one containing Reichskanzler, Irga, Bryza, and the other the remaining 18 cultivars. The second, rightmost potato branch in the dendrogram holding 71 cultivars shows two sub-branches that branch off further into several less distant groups. We selected several cultivars whose age, purpose of creation, or country of origin was distinct, to elucidate various attributes of the potato's pool of R genes. These groups of cultivars (Figs. 6 and S2, Table S4) were 'a': 6 individuals registered before 1930 (1,810-1,929), 'b': 6 late blight resistance differentials (ca 1950-1970), 'c': 4 cultivars released 1972-1975 in Germany and Poland, 'd': 9 contemporary cultivars registered by seven breeders between 1999 and 2011 in four countries, and 'e': 2 cultivars (Shepody and Magnum Bonum) of (partially) American descent. Magnum Bonum was selected by J. Clarke in the UK from a cross of (the American) Early Rose × Paterson's Victoria, and Shepody was selected in Canada (www.europ otato .org/). As can be seen from Figs. 6 and S2 and Table S4, the old cultivars of group 'a' fall into several distant clades indicating their large diversity. An exception are Ackersegen and Jubel at a minor distance, they originate from the same breeding program (Böhm, Germany), although their pedigrees (as indicated in the European Cultivated Potato Database; www.europ otato .org/) are quite different.
The peculiar distribution of the late blight differentials (group 'b') across the dendrogram seems to reconstruct the lines how these clones were created; R1 was found first, followed by R2 and R3, while R5, R8 and R9 were developed in later years when new sources of late blight resistance had become accessible 9,31 .
The modern cultivars of groups 'c' and 'd' mainly do not share the same subgroups in the dendrogram indicating that certain levels of diversity may have been maintained over the generations and programs of potato breeding.
Magnum Bonum and Shepody (group 'e' , in part American descent), are interspersed among the European cultivars confirming the common origins of these breeding pools. One hundred years lie between the two cultivars-Magnum Bonum was registered in 1876 and Shepody in 1980-but nonetheless, the relative read frequencies at their NBS domains do not seem to divert very much (Fig. 6). Shepody is in a small cluster together with the UK clones R2 and R3, and Magnum Bonum forms the base of a neighboring cluster holding the modern cultivars Eurotango, Nicola, Coquine, MF-II and TPS67. MF-II is of Indian descent, which originated in breeding stocks created by the UK based W. Black and co-workers in the early 1900s and which may therefore share the same specific genetic background.
Using the SolariX database to efficiently discover R alleles responsible for specific resistances: can it be used to create selection markers for these alleles?. In    Verification by Sanger sequencing. PCR with the above described primers was applied on template genomic DNA of the parents, the two pools, R and S, and an additional 71 individual CK progenies of our mapping CRP population (30 possessing a resistant and 41 a susceptible phenotype as observed in multiple bioassays with virulent isolates of P. infestans, from Nakitandwe et al. 23 ). The fragments were subjected to Sanger sequencing (LGC Genomics, Berlin, Germany) and yielded analyzable 596-bp sequences. All fragments shared the same  T T  T  C/T  T  C/T  13  004; C, T   100  A/T A  A/T  A/T  A/T  A  14  095; A, G, T   109  A/T T  T  A/T  T  A/T  7  -198 Scientific RepoRtS | (2020) 10:11392 | https://doi.org/10.1038/s41598-020-67848-z www.nature.com/scientificreports/ sequence containing 15 polymorphic single nucleotides that segregated in association with the resistance phenotype. The pattern of these SNPs relative to the status of late blight resistance of the individual parents and progenies indicated good but not perfect association (Table 3). Frequently, the resistant C1 was heterozygous and the susceptible K4 homozygous, whereas this pattern was unexpectedly reversed in the progenies, the resistant offspring having the homozygous haplotype corresponding to the susceptible parent. One exception was the SNP at the fragment's position 397. Parent C1 had genotype C, the pool of resistant progenies and all 30 resistant CK individuals tested were heterozygous CT, whereas parent K4, Crp_pool_S and all 41 susceptible CK progenies were homozygous T. Hence, even for this SNP the peculiar genotype of C1 (homozygous C instead of an expected heterozygous CT) was not in accordance with the suggested by the 1:1 segregation of all cross progeny for resistant/susceptible phenotype. Moreover, the SNPs detected on the Illumina sequence data differed from those of the Sanger data (Table 3). We aligned all fragments whose sequence was determined by the Sanger method via a blastn search onto the DM v 4.03 genome. This returned nine separate R gene loci on chromosome IX with (considerable) sequence similarity of 88 to 92% along uninterrupted stretches of 473-593 bp of our fragments. Therefore, we examined all nine regions for the occurrence of SNP positions in the mapped NBS-tags of C1, K4, Crp_pool_R and Crp_ pool_S. This was done by comparisons of the 15 individual SNP positions on our Sanger-sequenced fragments with SNP positions as detected on the mass sequencing data. Unfortunately, none of the nine candidate sites on the genome models perfectly matched the structural patterns of the SNP positions as detected by Sangersequencing (Table 3, rightmost column).
Search for markers of two R genes conferring resistance to late blight in the MT population. The Illumina sequences obtained for the five libraries of the MT population parents and three groups of MT progenies representing distinct resistance phenotypes were subjected to the same approach used for population CK data (described above). Two unknown major dominant genes for resistance to late blight segregate in the MT population, one originates in MF-II, chromosome XI, and the other in TPS67, chromosome IV 32 and their phenotype of resistance can be distinguished upon the interaction with differential isolates of P. infestans. Many clustered SNPs within NBS domains of R genes on chromosomes IV and XI were detected. The largest number of clustered SNPs displaying the segregation pattern expected for a dominant R gene originating in TPS67 was found in several NBS domains on cluster 16, which also holds the closest DM sequence relative to the R2 late blight resistance gene of the Black et al. 9 series. As an example, seven adjacent SNPs were found in the NBS domain of PGSC0003DMG400011525 that display the haplotypes MF-II; 0, TPS67; 0, 1, poolM; 0, poolT; 0, 1, and poolS; 0 (where 0 represents the reference nucleotide/allele of the DM genome and 1 represents an alternative nucleotide) and hence suggested the segregation of a single factor (denominated '1') from parent TPS67 in accordance with the phenotype of resistance. A total of 86 SNPs across 45 NBS domains of R genes in 13 clusters on chromosome IV displaying this pattern of segregation and potential association with the resistance gene of this parent were detected. Cluster 16, the cluster of NBS-LLR R genes that harbors the original gene R2, alone had 36 SNPs across 18 NBS domains, one of these within PGSC0003DMG400032572, the closest DM sequence relative to the R2 gene. Likewise for chromosome XI, 171 clustered SNPs across 61 NBS domains of R genes in 13 clusters or singleton NBS R genes were found in the NBS-capture data that suggested co-segregation with a resistance gene from MF-II. The experimental elaboration of these candidate SNPs and loci is in progress.

Development of molecular markers for a PVY (Potato Virus Y) resistance gene.
An in-depth analysis was conducted for the parents of the AB mapping population comprising 250 individuals from the cross: Alegria x Baltica. Cultivar Alegria inherits resistance to PVY in a qualitative, dominant fashion suggesting the action of a single R gene, and selection for this resistance with molecular markers could thus facilitate breeding for PVY resistance. From framework genetic mapping (Supplementary Methods S2), the PVY resistance was found to be in linkage with all four chromosome IX markers used; STI057, STI002, STI014 33 and STM1021 34 and not with any marker on any other chromosome (Fig. 7). Hence, we examined our Illumina sequence data of both parental cultivars, Alegria and Baltica, for polymorphisms on chromosome IX that could segregate in association with the PVY resistance of Alegria. We expected the Alegria PVY resistance allele to be in simplex state, according to the 1:1 segregation pattern of resistant and susceptible progenies in the AB population. We looked for polymorphic sites within every NBS domain that were covered by at least 20 reads and that contained SNP alleles that occurred in parent Alegria at a frequency of 0.25 (verified by a Chi-square test) but were absent in Baltica. PCR primers (containing a mismatch nucleotide at position 3′-1, see "Materials and methods") for four of the polymorphic sites that corresponded to these conditions were designed and applied on the parental and progeny DNA. The results were integrated into the genetic map of cv. Alegria. With these additional markers, it was possible to assess the position of the PVY resistance to be in the vicinity of, or within clusters 64 (containing the prominent locus Tm-2 for tomato mosaic virus resistance) to 66 (holding the potato homologous locus of tomato Sw-5 for resistance to tomato spotted wilt virus). Consequently, the specific NBS domains of the tomato Tm-2 and Sw-5 homologous potato loci in the DM v 4.03 reference genome were used as models to design 14 PCR primers for the selective amplification of these specific NBS domains (Supplementary Table S5 Table S5; primers with restriction enzyme information). We applied these primers to the parents and all progenies of the AB population and treated the amplicons with site-specific restrictases to search for informative CAPS markers. All seven primers produced segregating markers and we included these into the genetic map of Alegria. Among them, the CAPS marker chr09_11PagI was of all markers closest to the map position of the PVY resistance (Fig. 7). This marker originated in cv. Alegria and was in simplex state and in coupling phase, 6 cM away from the putative, simplex PVY resistance. The marker is located in RDC0001NLR0230 of cluster 66. The map location of the PVY resistance was deduced from the performance of all individual AB progenies in replicated resistance trials. The semi-quantitative nature of these trials rendered the interpretation of results regarding the presence or absence of a single, qualitative resistance factor somewhat dependable. Therefore, chr09_11PagI was proposed as a marker to select for the PVY resistance inherited by cv. Alegria.

Discussion
Possibility to detect and count NBS-LRR haplotypes using NBS locus tags. We were able to detect NBS domains of up to 587 R genes using selective amplification by PCR and high-throughput sequencing. Thus, a total of just 16 primers were sufficient to develop a comprehensive inventory of virtually all NBS domains www.nature.com/scientificreports/ of the NBS-LRR R genes in the DM genome. Altogether, three hundred million reads were obtained from 91 distinct cultivars and were mapped to 587 NBS domains of the reference genome. The amount of information that can be currently drawn from this data, however, is limited due to apparent mis-mapping and consequently, intermixing of the individual fragments to an unknown extent. This was evident from the numbers of haplotypes inferred for the NBS domains evaluated that frequently exceeded the maximum of four individual haplotypes (alleles) for a single domain and cultivar. A likely source for the multiple haplotypes observed per domain is the copy number variations of R genes in clusters that can differ even at accession level 21 . As proposed by Michelmore and Meyers 35 , tandemly repeated genes within a cluster are unstable because of a high degree of sequence similarity resulting in relatively frequent unequal crossing-over. This leads to further duplications and deletions. The observed differences in read mapping percentages are likely due to the varying divergences between the sequenced cultivar or sample and the reference DM genome. Correct mapping of R gene fragments (that are largely similar due to the high levels of sequence conservation among their NBS domains) to a genome that may be considerable different remains challenging. Our attempt to increase the level of individuality of sequenced fragments via obtaining fragments that include larger parts of variable, non-conserved sequence of the target genes and generating longer sequence reads proved to be ineffective. Thus, the most important means for achieving accurate allocation of sequence fragments should be the provision of high-quality reference genome(s) of the common potato, S. tuberosum Group Tuberosum.
What are the numbers of potato NBS-LRR R genes?. Improvements to the potato's reference genome, . Although we did not detect an NBS domain in 128 of the 704 R genes, 703 genes had read coverage in at least one cultivar, indicating that they were not completely absent from all our 96 genome libraries. Additionally, we found 11 previously unestablished regions on the DM genome containing an NBS domain. Thus, the number of putative NBS-LRR R genes on the 12 chromosome models of DM v 4.03 has been increased by 11 and the number of NBS domains detected by our NB-tagging method was 587. The high level of sequence similarity among the NBS-LRR domains was a challenge in our project. Our approach was based on mapping of the individual reads to a reference genome, DM v 4.03. Mismapping of reads can occur when the reference is too different from the genome that was sequenced. Reads that originate from genomic sequences that are not represented on the reference are incorrectly mapped to the next most similar location in the reference. In addition, ambiguity in mapping comes about from multiple equally divergent (with respect to the read being mapped) locations on the reference, which render the distinction of the true location impossible. We believe the DM v 4.03 reference genome may not adequately represent the genomes of our cultivars and this is based on three observations-First, DM is a (mono) haploid of a genotype that has a rather limited allelic diversity and it is derived from potato Group Phureja which may not share all features with Group Tuberosum genotypes. Second, we observed that on average 23% of the mapped reads were not properly paired; this were pairs that, when mapped on the genome, either had an incorrect orientation, were not within a reasonable distance or the pair members even mapped to separate chromosomes. Thirdly, libraries sequenced with 300 bp paired-end reads (obtained on an Illumina MiSeq) showed substantially lower mapping rates and confidence (mapping quality scores) than the libraries with 100 bp paired-end reads (sequenced on a HiSeq-2000). The unmapped reads were found to have significant best blastn hits in resistance genes in the Solanum genus and the wheat Triticum turgidum. This suggests that novel R gene content is currently absent from the DM genome.
Genetic similarity between cultivars. A relatively useful approach presented in Fig. 3a, where the total numbers of polymorphic sites are compared among all 96 DNA libraries, provides a measure of estimating genetic similarity. Unrelated sequence fragments, although sharing roughly similar sequences, will have polymorphic sites at different positions. When fewer non-related sequence reads are present (for example in an inbred, less heterozygous genotype derived from a narrow genepool), fewer such reads can be accidentally assigned together on the same loci and the resulting number of polymorphic sites tends to be reduced. The group of seven libraries showing the smallest numbers of polymorphisms and hence indirectly presenting reduced genetic variability (Fig. 1c, left tail, 4,328-4,376 SNPs per library) contains four modern cultivars (registered between 1997 and 2011), two concurrent (after 2010) breeding clones and only one historical late blight differential clone (R1, published in 1953 9 ). Today, potato breeding still depends on pedigree schemes where a relatively small group of elite cultivars are intercrossed. The continued repetition of this scheme over decades, where the genomes of the same parents and grandparents are repeatedly reshuffled, might lead to an enrichment of advantageous alleles while undesired alleles are steadily selected against, thereby leading to an overall reduction of genetic variability. Another approach at indirectly estimating the genetic variability was the analysis of Illumina read coverage frequencies (RCF) per individual NBS domain of an R gene across the genome libraries (Figs. 6 and S2). This approach allows us to utilize the copy number variations of R genes within clusters. RCF (a numerical criterion) is certainly less precise for assessing the extent of genetic variability of R genes within a genepool compared with the frequencies of true biological alleles. However, the genetically closely related individuals in our data Scientific RepoRtS | (2020) 10:11392 | https://doi.org/10.1038/s41598-020-67848-z www.nature.com/scientificreports/ set (the CRP complex and the MT parents and progenies, respectively) clustered closely together by their RCF patterns. CRP, the outgroup species distant to the potato, formed a separate group by its RCF features (Figs. 6 and S2), similar to its distinct level of polymorphic sites (Fig. 1c). MF-II, a clone of Indian descent, and TPS67, Neotuberosum, are within the same narrow group of 20 cultivars (including Magnum Bonum, registered in the UK in 1876, to Baltica, Germany, 1997, and three pools of MT progenies, Figs. 6 and S2), this underlines further, the relative narrowness of the Group Tuberosum R genepool. The continued introduction of individual R alleles and additional R loci (compare R1, 10 ) did not change the frequencies of particular features within the genetic code of NBS R genes. The Irish cultivar Lumper that survived the 1845-47 late blight calamity, which wiped out significant parts of the early potato's phenotypic diversity shares the frequencies of major conserved R gene motifs with Innovator (bred at HZPC, NL, registered in 1999), Amanda (Solana, DE, 2006), Irys (IHAR, PL, 1975) or Fabiana (Germicopa, FR, 2008). Therefore, we postulate the potato's NBS R gene family remained relatively homogeneous over two centuries throughout a wide range of cultivars from many countries and breeders. This is consistent with findings of high synteny, even on the genus level, of the sub-families of NB-LRR R genes between the genomes of tomato (Solanum lycopersicum) Heinz 1706 and the DM potato reference, with differences only in smaller cluster sizes in the tomato 36 . The presence of self-incompatibility and the polyploid nature may have supported the maintenance of the common potato's genetic plasticity. Genetic variation underlying specific disease resistance traits concerning individual biological alleles at individual R loci, such as those comparatively few that have been introgressed from wild and native cultivated sources over the past one hundred years, could not be made visible with the NBS tagging method presented.
Using the SolariX database to create selection markers for R alleles responsible for specific resistances. We tested two approaches-in the first, SNPs associated with a resistance were determined by matching expected ratios for allele frequencies in both parents and progenies within mapping populations (CK, MT) and in the second, potential SNPs within the region where a resistance maps to were determined by allele frequencies solely in the parents (population AB).
For the CK population of CRP segregating for a single late blight resistance factor, we found markers that have considerable predictive power to assess a progeny's resistance status. However, the individual locus and allele encoding the late blight resistance could not be detected. CRP is a relative of the potato and tomato, it has a larger genome size (1.2 billion bp, as determined by flow cytometry) and it may have structural differences. Therefore, mapping the NBS-LRR gene fragments from CRP to the DM genome, based on best similarity fit, would inevitably produce inaccuracies. Information on the haplotype responsible for the resistance is certainly intermixed with information on other, unrelated loci that share similarity by their sequence.
For the potato population MT (such as for CK) segregating for two independent, dominant late blight resistance factors, we detected many loci at which groups of SNPs indicated potential co-segregation with the resistance phenotype, based on their allele frequencies. This was not unexpected, as the two parental genotypes M and T (such as parents C1 and K4 of the CK population) should possess many more distinct R alleles than merely the two responsible for their resistance to late blight. The challenge is to precisely locate those polymorphisms that are truly associated with the late blight resistance. This may be achieved by designing PCR primers specific for every individual SNP and testing MT parents and progenies for the presence or absence of markers or by fine-mapping the resistance using many more MT progenies.
Further refinement of our method to combine high-throughput sequencing of longer (several thousand instead of several hundred nucleotides) DNA fragments generated with targeted amplification to capture the true biological alleles of the NBS-LRR R genes with mapping to advanced versions of appropriate reference genomes or, alternatively, reference genome-independent methods, as has been done for an R gene in the wild diploid potato relative Solanum americanum 37 , might however provide more straightforward ways to investigate the diversity of plant resistance genes.
The methodology to retrieve molecular markers for the PVY resistance segregating in the AB population differed from the CK method described above inasmuch as it relied on direct information about individual SNP sites retrieved from the Illumina reads of this population's parents, cultivars Alegria and Baltica. In a search that involved both AB parental information and the DM reference genome, individual SNPs were found that could be used to design CAPS markers for selection of resistant segregants. Again, the markers obtained were linked to, but not directly located within, the responsible putative R locus. This apparent imprecise targeting of the resistance locus was owed to the high redundancy of the NBS R genes in the potato genome and to the dependence of the method on the DM reference genome.

conclusion
The SolariX compendium (https ://www.cibiv .at/Solar iX/, European Nucleotide Archive Study ID PRJEB83917) provides a useful resource of the common potato's R genepool, for nearly all NBS domains on the DM genome. The presented variability may exceed the true genetic diversity due to inadvertent mis-mapping of highly homogeneous NBS tag reads because of the extreme levels of NBS sequence similarity and insufficient representation on the DM genome. For high-resolution analysis of individual cultivar-specific genomes more expensive methods such as SMRT RenSeq 37 may be appropriate.
Molecular markers for selecting specific resistances can be developed based on the SolariX database if methods allowing for more precise results are less economical or unavailable.  12 . The sequences of these three NBS motifs were grouped by similarity across all 438 loci. For each high-similarity group, specific degenerate primers were designed. Thus, according to their ability to amplify from genomic DNA in tests, a total of fifteen primer pairs, seven specific for the P-loop, three for the Kinase-2, and five for the GLPL motifs (Table 1), and in addition the NBS5 primer from Van der Linden et al. 16 , were finally chosen for our PCR enrichment of NBS containing genomic fragments across all 96 varietal DNA libraries ( Table 1).

Isolation of R gene fragments for Illumina HiSeq-2000 sequencing and profiling of NBS
domains. Total genomic DNA was extracted according to van der Beek et al. 45 with modifications by Park et al. 46 . After digestion using TaqI, MseI, RsaI, HaeIII and AluI restrictases, adapters for PCR 16 were ligated to the fragments. All five digestion-ligation products per varietal DNA library were pooled and 1 µl of a pool was used for each of the 16 primer-specific PCRs. For the initial linear, target-specific amplification, the reaction contained 4 µl of 5 × HOT FIREPol Blend Master Mix (Solis Biodyne) amended with 7.5 mM MgCl 2 and 2 µl of 0.1 µM NBS domain-specific primer (Table 1) in a total 20-µl volume. Conditions for the linear PCR followed van der Linden et al. 16 and annealing temperature settings were specific for each primer. Subsequently, for the exponential PCR 20 µl of a master mix containing 2 µl of 10 µM NBS domain-specific primer, 2 µl of 10 µM adapter primer and 4 µl of 5 × HOT FIREPol Blend Master Mix (Solis Biodyne) with 7.5 mM MgCl 2 and 12 µl water were added to the linear PCR product to give a total volume of 40 µl. All PCR products were cleaned using the NucleoSpin 96 PCR Clean-up kit (Macherey Nagel www.nature.com/scientificreports/ specific hidden Markov model (HMM) downloaded from Pfam and a mimimum score criterion of 100. The ARC domain is present in a wide array of resistance proteins including APAF-1 (apoptotic protease-activating factor-1), R proteins, and CED-4 (Caenorhabditis elegans death-4 protein) 49 . The detected NB-ARC aa sequences in the NBS-LRR R genes were aligned using ClustalW 50 to build a DM R gene specific NB-ARC HMM with hmmbuild. The DM specific HMM was further used to re-detect NBS domains within the translated resistance genes. The coordinates of the detected NBS domains in the aa sequences were then converted to coordinates in the resistance genes. NBS domains that had genomic overlap (resulting from NBS domains detected on different open reading frames of a gene) were merged, and a final number of 576 NBS domains were obtained for this genome. Thus, using a DM R gene specific NB-ARC HMM, we confirmed 576 of Jupe et al. 's 13 total 704 established R genes to carry an NBS domain.
Analysis of next generation sequencing (NGS) data. Read mapping. The 100 bp-long paired-end reads from the HiSeq-2000 sequencing were trimmed for the Illumina TruSeq universal adaptor sequences (AGA TCG GAA GAG CAC ACG TCT GAA CTC CAG TCAC and AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT AGA TCT CGG TGG TCG CCG TAT CATT) using cutadapt 51 from either the 5′ or 3′ end of the read. Further trimming of sequences matching the primer regions was also performed to prevent introduction of false sequence variations from degeneracy in the NBS profiling primers (compare Table 1). The trimmed pairedend reads were mapped to the Potato Genome Sequencing Consortium Public Data Release of the sequenced doubled-monoploid (DM) potato clone v 4.03 genome assembly 15 using NextGenMap 30 v 0.4.10 at default parameters. This involves local alignment of the read (ends of reads might get clipped) with a required minimum mapped length of 0.5 fraction of the read-length and 0.65 fraction identity between the mapped read portion and DM genome to be considered mapped. The 300 bp-long paired-end reads from MiSeq sequencing were trimmed for R2 gene-specific PCR degenerate primers sequences and mapped with NextGenMap 0.4.12 at default parameters, as described above.
Variant calling. Freebayes 26 v 0.9.9 was used to detect polymorphisms within NBS domains (see next section), with respect to the DM genome with parameters of ploidy 4, min alternate allele fraction of 0.1 and minimum 10 reads. Ploidy 2 was used for the CRP (diploid) libraries.
Haplotype phasing. HapCompass 27 v 0.7.7 was used for phasing of variants into haplotypes at NBS domains with parameter ploidy 12.
Annotation and classification of novel NBS domains. We observed that a substantial number of reads were mapped to regions other than NB-LRR R genes. Therefore, we performed a search with the DM R gene-specific NB-ARC HMM in regions where we obtained at least 10 × coverage (number of reads) across a minimum length of 150 bp occurring in at least 10 cultivars. In addition, we used a tblastn search for the P-loop, GLPL, and Kinase-2 motifs (from 12 ) in these regions. With this method, we detected 11 additional, novel NBS domains that had significant results given by the HMM and tblastn search. These additional domains were located in genes not previously designated as NBS-LRR R genes by Jupe et al. 13 . To determine the completeness of the putative NBS-LRR R genes, the nucleotide sequences for the full gene were translated into all 6 possible reading frames using TranslateSequence.jar from NLR-Parser 25 and mast from meme suite v 4.9.1 24 was used to detect all NBS-LRR motifs published in Jupe et al. 12 . The detected motifs in the genes were then used for classification of completeness and type, using NLR-Parser 25 .
Scientific RepoRtS | (2020) 10:11392 | https://doi.org/10.1038/s41598-020-67848-z www.nature.com/scientificreports/ 2. Paired unmapped reads (both reads do not map to the DM genome): a. The reads were first merged using PEAR v0.9.6 (paired end read merger tool) 52 . We found 14% of these had no blast hit while 9% of these had a blast hit to a Solanum species and 76% to Triticum turgidum. b. If pair was not merged in (a), the reads were concatenated one after another to obtain a longer sequence.
The second read was reverse complemented before concatenation. We found 31% of these had no blast hit while 6% of these had a blast hit to a Solanum species and 47% to Triticum turgidum. c. Some of the reads were 'discarded' by PEAR when they had too many Ns. Reads that were made up of a long string of Ns were filtered from these discarded reads (52%), and the remaining reads were used individually for a blastn. We found 11% had no blast hit while 3% had a blast hit to a Solanum species and 13% to Triticum turgidum.

Accession numbers
European Nucleotide Archive ENA, accession no. ERP086266 and study id PRJEB83917. The SolariX project can be found at www.cibiv .at/Solar iX.