Targeted enrichment by solution-based hybrid capture to identify genetic sequence variants in barley

In barley and other cereal crops, phenological diversity drives adaptation to different cultivation areas. Improvement of barley yield and quality traits requires adaptation to specific production areas with introgression of favorable alleles dependent upon precise identification of the underlying genes. Combining targeted sequence capture systems with next-generation sequencing provides an efficient approach to explore target genetic regions at high resolution, and allows rapid discovery of thousands of genetic polymorphisms. Here, we apply a versatile target-capture method to detect genome-wide polymorphisms in 174 flowering time-related genes, chosen based on prior knowledge from barley, rice, and Arabidopsis thaliana. Sequences were generated across a phenologically diverse panel of 895 barley varieties, resulting a high mean depth coverage of ~25x allowing reliable discovery and calling of insertion-deletion (InDel) and single nucleotide polymorphisms (SNPs). Sequences of InDel and SNPs from the targeted enrichment were utilized to develop 67 Kompetitive Allele Specific PCR (KASP) markers for validation. This work provides researchers and breeders a comprehensive molecular toolkit for the selection of phenology-related traits in barley.


Background & Summary
Phenological diversity drives adaptation to different areas of cultivation in barley and other cereal crops 1 . The triggering of flowering is influenced by endogenous genetic and exogenous environmental cues, including day length and temperature 2 . The timing of flowering directly influences grain yield, as it needs optimisation to occur during specific seasons to avoid environmental stresses such as frost, heat, and drought. Extensive research in the model plant Arabidopsis thaliana have revealed key genetic mechanisms that control flowering. Although many of the circadian clock genes are conserved between Arabidopsis and barley, there are many differences specific to cereal grasses 3 . Improvement of barley yield and quality traits, enhancing adaptation to specific production areas, and introgression of favorable alleles into commercial varieties all depend upon the precise identification of the underlying genes 2 . Differences in many alleles that modify phenology are based on genetic variants relative to the barley reference genome. The recent availability of a barley genome assembly with high-confidence sequences 4 allows the identification of phenology-associated gene orthologs.
Advances in second-generation sequencing technologies and bioinformatics techniques in the last decade now provide methodologies to interrogate genome diversity for many plant accessions. However, for plants with large and highly repetitive genomes including many cereal crops, technical challenges and economic barriers can prevent genome re-sequencing on a population scale to a sufficient depth. Reduced-representation sequencing approaches that access a focused subset of loci within a genome, including exome capture, RNA sequencing (RNA-seq), and target capture approaches, can be applied to any species for which a draft or complete reference genome sequence is available 5 . Exome capture approaches can identify sequence variants only in protein-coding genome regions, and RNA-seq can only be used to interrogate transcriptome diversity of genes expressed in the sampled tissue. Target capture based on in-solution hybridization allows for a cost-effective identification of sequence variants in coding-and noncoding regions of very large genomes. It involves custom design of capture probes targeting specific chromosome regions harbouring loci or candidate genes for traits of interest, which enable extremely flexible scaling of resequencing experiments of few to many genes at low cost for large plant populations.
The aim of this project is to provide gene sequences of ca. 174 flowering time regulatory genes and gene orthologs across a large barley population of 895 accessions for the purpose of identifying phenology gene sequence variations across current barley breeding germplasms. To achieve this, we used target capture based on in-solution hybridization, a method that focuses only a subset of the genome and thus significantly reduces the sequencing space and cost. The myBaits ® technology (Arbor Biosciences, Ann Arbour, MI, USA) which is based on in-solution biotinylated RNA probes was used for target enrichment. The resulting sequence capture libraries were subjected to Illumina sequencing in paired-end mode. Baits were designed for flowering time regulatory genes and gene homologs, and used for production of 80mer capture oligonucleotides for hybridization. Sequencing was performed for 895 diverse barley accessions selected from a large collection of landrace, cultivated barley, and research varieties of diverse origin. This set represents the whole variety of cultivated barley lifeforms, namely two-and six-row genotypes with winter, spring, and facultative growth habits. The results show abundant genetic variation in the form of SNPs and InDels in the different barley varieties. Sequences of InDels and SNPs from the targeted enrichment were utilized to develop 67 Kompetitive Allele Specific PCR (KASP) markers to detect and distinguish cultivar-specific alleles for candidate phenology genes in a subset of the re-sequenced barley varieties.
The ability to fine-tune flowering to the growing season could give significant advantages for both breeders and growers. Marker-assisted selection (MAS) and targeted genetic modification of flowering behaviour combined with faster breeding cycles have great potential to optimise the timing of flowering to specific environments and maximise grain yield 6 . The availability of this dataset will serve as a valuable resource to the cereal research community, and will help identify valuable traits to assist in breeding programmes.
An accompanying publication describes the biological implications of the data 7 . We have deposited the relevant raw data to public sequence data archives in Array Express 8 .

Methods
Plant material. As described in ref. 7 , the barley accessions of the phenology diversity panel of core barley varieties were selected from a large collection of over 4,000 landrace, cultivated, and research barley varieties of diverse origin preserved at the Western Barley Genetics Alliance at Murdoch University (Perth, Australia). The selection of barley accessions included domesticated, landrace, and breeding accessions from 41 countries, and was based on diversity in phenology development, and diverse geographic origins to represent European, Asian, North and South American, and Australian breeding programs (Germplasm.xlsx 9 ). The selection of domesticated barley originated from various breeding programs and represented the whole variety of cultivated barley lifeforms, including two-and six-row genotypes with winter, spring, and facultative growth habits.
Design of the targeted enrichment assay. We designed a custom target enrichment sequencing assay that included the loci implicated in the flowering pathways in barley and related plant species (described in ref. 7  . Scientific literature was mined for the genes implicated in flowering time pathways in barley and the corresponding nucleotide sequences were extracted from NCBI GenBank. In addition, barley homologs of flowering genes from the other grass species, such as bread wheat and rice, were identified through BLASTN search (e-value < 1e −10 ) and included. Complete genomic DNA (gDNA) sequences were selected for probe design. If the complete gDNA was absent, partial cDNA and cDNA sequences were included instead.
Probe design. Targeted enrichment of genomic DNA regions was performed by solution-based hybrid capture using a synthetic library consisting of a final set of 13,588 probes (myBaits ® , Arbor Biosciences, Ann Arbour, MI, USA; http://www.arborbiosci.com) following the manufacturer's protocol version 3.01 (http://www.mycroarray.com/pdf/MYbaits-manual-v3.pdf; accessed 14/02/2018). As described in ref. 7 , the capture oligonucleotides were 80 nt long and were designed to target approximately 600 kb of repeat-and organelle-masked genetic sequence (190 phenology-related genes, derived from Hordeum vulgare L. genome assembly 082214v1) in with a 2x tiling density. Probes were designed from the FASTA sequences of each gene. For probe design, the sequence of the gene was repeat-masked based on organelle, transposable elements 10 (TREP), and high copy number in the barley reference genome (genome assembly 082214v1). Probes based on intron-flanking exons were designed www.nature.com/scientificdata www.nature.com/scientificdata/ as they would still capture fragments covering the entire intron for genes with small introns (<500 bp) if probes based on intron sequence failed to capture fragments.

Library Preparation and target enrichment.
Genomic DNA was extracted from leaves of a single two-week old barley plant per variety using the cetyl-trimethyl-ammonium bromide (CTAB) method 11 . DNA concentration was determined using a nanodrop spectrophotometer (Thermo Scientific) according to the manufacturer's protocol. DNA quantity and purity was further checked on 0.5% agarose gel (100 V, 0.5 × TBE, 60 min). Physical shearing of genomic DNA was performed with an average size of 275 bp using a Covaris S2 Focused-Ultrasonicator. DNA was cleaned using the Agencourt AMPure XP SPRI beads (Beckman Coulter, Australia), followed by end repair and A-tailing of the fragments using unique Illumina 9 nt p5 and p7 barcode adaptors (IDT × Gen Custom Universal Blocking Oligos) to generate dual-indexed libraries. Library preparation was conducted with KAPA Hyper Prep Kit (KAPA Biosystems, Wilmington, MA). We performed preliminary tests to compare different scales of KAPA sample library reactions for target capture efficiency. Half-scale KAPA sample library reactions were the best compromise in terms of target capture efficiency and coverage of target genes as the proportion of on-target reads for half-scale libraries was comparable to the full-scale libraries. Equimolar amounts of each amplified library were pooled in reactions of twelve indexed samples each (total of 500 ng DNA per 12-plex pool) using a Biomek FX P laboratory automation workstation liquid handling system (Beckman Coulter, Australia). Hybridization of customised RNA baits with capture pools was performed at 65 °C for 24 hours. Dynabeads ® MyOne ™ Streptavidin C1 magnetic beads (Thermo Fisher Scientific, Australia) were used to isolate biotinylated DNA, and amplification of bead-bound enriched libraries (9-10 cycles) was performed using KAPA HiFi HotStart Ready Mix (KAPA Biosystems, Wilmington, MA, USA) and adapter-specific primers. Concentration of all enriched libraries was determined with a Qubit fluorometer (Thermo Fisher Scientific, Australia), and library size with Tapestation (Agilent, Santa Clara, CA, USA) for subsequent library titration and next-generation sequencing. Libraries were titrated with KAPA Library Quantification Kit (KAPA Biosystems, Wilmington, MA). These methods are expanded versions of descriptions in our related work (ref. 7 ). targeted resequencing, sequence alignment and data analysis. Each pooled library was sequenced on an Illumina HiSeq3000 instrument using three lanes to generate about 0.5 million 2 × 150 bp paired-end reads per sample 9 . In addition to methods described in our related work (ref. 7 ), here we include the discovery and genotyping procedures of InDels and a detailed and expanded description of the bioinformatics pipeline. A schematic overview of our data analysis pipeline for filtering, alignment, variant discovery, and genotype calling of target gene sequences is provided in Fig. 1.
Fastq sequence files were post-run filtered using nuclear software v3.6.16 (GYDLE Inc., Montreal, Canada) filtering sequences based on a minimum q-score of 20 and a minimum length of 50 bp. The resulting high-quality (HQ) reads were aligned to the target molecules, including 5 kb flanking regions on either side of the genes used in the RNA probe design, using nuclear software with low stringency high-scoring segment pairs (HSPs) with minimum read length 30 bp, with sensitivity 13 (consecutive identities), and up to 3 mismatches per window of 30 bases (~90% identity). Reads aligning to genomic regions targeted in the capture assay were retained as on-target reads. These on-target reads were re-aligned using nuclear software with higher stringency (HSPs with minimum read length 50 bp, sensitivity 25 and at most 3 mismatches per window of 50 bases) to the latest release of the barley reference genome assembly 4 (IBSC v2). Genetic coordinates were converted to match the latest assembly 4 (IBSC v2).
After quality filtering 895 barley genotypes with sufficient HQ reads were retained and used for alignment, variant discovery and genotype calling. The final dataset consisted of 1.12 billion 150-bp paired-end reads (203 Gb of raw data). The median filtered read number was 1 M reads per sample. Only 13% of the samples had less than 0.5 M reads while 8% of the samples had more than 2 M reads (see Table 1).

Specificity and sensitivity.
In total, the captured reads aligned to 193 distinct regions of the Hordeum vulgare assembly ASM32608v1 reference genome corresponding to all intended target genes. The average target sensitivity, interpreted as the percentage of target bases covered by sequence reads, was 88% (Table 2). Two target gene copies (DQ492699 and DQ492698) had insufficient coverage across samples but were retained in the target list. Targeted enrichment values (mean target coverage over the mean genome-wide coverage) for each sample were calculated from Illumina sequencing data using the formula from 12 (see equation 1). The overall mean enrichment factor was 1,389.4.

Number of reads that map to the target region
Total number of reads Target region size Haploid genome size = Variant discovery and genotype calling. The capture assay was designed to capture 600 kb of target sequence which corresponded to 1.9 Mbp of target gene space in the current barley genome assembly 4 (IBSC v2). SNP variant discovery as well as genotype calling was performed using custom Perl scripts to produce a VCF version 4.2 genotype file based on the alignment files (described in ref. 7 ). To eliminate population-specific bias due to the composition of the SNP discovery panel, the entire population was selected to discover SNPs.
The SNP discovery pipeline identified a total of 467,339 SNPs, of which 29,822 were within gene-coding regions, including 6,030 SNPs which were captured within the 174 targeted phenology gene regions (VCF_SNP. vcf.gz 9 ). On a per sample basis, about 75% of the SNP detected within the captured regions had a mean coverage depth of 25 reads per SNP. As expected for an inbreeding population, the rate of heterozygosity for each variety www.nature.com/scientificdata www.nature.com/scientificdata/ was very low (approximately 2% on average). Out of the 174 phenology genes, 168 genes had at least one SNP.   Variant prediction. Using the target enrichment strategy, we identified regulatory (3′-UTR/5′-UTR/intron/ upstream/downstream) as well as coding (frameshift/splice variant/missense/synonymous) novel SNP allelic Total reads (raw) Total reads (filtered) Total reads (aligned) Aligned reads (%)   (Fig. 2a). Approximately 26% and 34% of all SNP effects fell within exonic and intronic sequences, respectively, based on the predictions of the Variant Effect predictor (VEP) tool 13 . Of the SNP variants discovered within the target gene regions, we found an impact (deleteriousness) distribution of approximately 7% (low), 87% (modifier), 5.5% (moderate), and 0.2% (high). Out of 174 targeted genes involved in flowering time, we found 132 genes with mutations having moderate effects, and 30 genes having large effects, including many known genes (such as ELF3, PPD-H1, LFY1, and AGLG1).

Mean
Approximately 22% and 34% of all InDel effects fell within exonic and intronic sequences, respectively. Similar as for the detected SNPs, about a third of all InDels were detected in intronic regions (Fig. 2b). For all InDel variants, we found an impact (deleteriousness) distribution of 0.5% (low), 94% (modifier), 2.5% (moderate), and 3% (high). Out of 174 targeted genes involved in flowering time, we found 53 genes with mutations having moderate effects, and 69 genes having large effects, including many known flowering time-related genes (such as PHYC, FD, CO2, and ELF3).
Thus, the assembled sequences and the extensive variation dataset obtained through targeted enrichment genome sequencing should support studies using forward and reverse genetic approaches.

Data Records
Metadata for the germplasm collection and phenology gene targets is available in files Germplasm.xlsx, and Phenology_Genes.xlsx, respectively 9 . The SNP and InDel genotype matrices are available in VCF (Variant Call File format) (files VCF_SNP.vcf.gz, and VCF_INDEL.vcf.gz, respectively), and KASP primer sequences and KASP genotyping results are made available in files KASP_Primers.xlsx and KASP_Genotyping.xlsx, respectively (also 9 ). The raw sequence data (.fastq files) are stored at ArrayExpress 8 .

technical Validation
Genotypic assessment of biological samples. To support the technical quality of the dataset, sequences of InDel and SNPs from the targeted enrichment were utilized to develop and test genetic molecular markers (KASP_Primers.xlsx 9 . Uniplex Kompetitive Allele Specific PCR (KASP) technology was used to develop genetic markers, which offers a low-cost medium throughput genotyping platform suitable for fast and accurate screening of thousands of genotypes 14,15 . This makes KASP suitable for MAS in many breeding programs. Marker selection ensured that a vast range of allelic variation was captured for major phenology genes in barley inlcuding PPD-H1, PhyC, and FT1. Genomic DNA of Australian barley cultivars Vlamingh, Buloke, Commander and Fleet, as well as 60 other barley varieties from the phenology set were extracted as described previously.
For KASP assays developed in this study, complete coding sequences were obtained from the latest version of the barley reference genome assembly 4 (IBSC v2). Sequences of 64 SNPs (described in ref. 7 ) and 3 InDels present in candidate genes were converted into 67 Kompetitive Allele Specific PCR (KASP) assays to detect the specific parental allele for MAS in barley breeding. Allele-specific and common primers for each KASP marker were designed using Geneious ® software version 10.2.3 (Biomatters Ltd., NZ) following standard KASP guidelines. The allele-specific primers (AS1 and AS2) were designed carrying the standard FAM (5′-GAAGGTGACCAAGTTCATGCT-3′) and VIC (5′-GAAGGTCGGAGTCAACGGATT-3′) tails, and with the targeted SNP or Indel at the 3′ end. A common primer (C) was designed so that the total amplicon length was between 100 and 150 bp. Assays were tested in 384-well formats and set up as ~5 µl reactions (15 ng/µl dry DNA, 2.5 µl of 1 × KASP master mixture, and 0.1 µl of primer mixture). The primer mixture comprised of a final concentration of 30 µM of the common primer, and 10 µM of each tailed primer. PCR cycling was performed using www.nature.com/scientificdata www.nature.com/scientificdata/ the following protocol: hot start at 94 °C for 15 min, followed by ten touchdown cycles (94 °C for 20 s; touchdown at an initial temperature of 61 °C and decreasing by −0.6 °C per cycle for 60 s) achieving a final annealing temperature of 55 °C, followed by 26 additional cycles of annealing (94 °C for 20 s; 55 °C for 60 s). Fluorescence readings were performed at 37 °C. Amplification was carried out using the ABI ViiA7 instrument (Applied Biosystems, Foster City, CA, USA), and plants were genotyped using KASP markers according to the manufacturer's instructions (LGC Genomics, Hoddeson, UK). End-point genotyping was done using the ABI QuantStudio ™ Real-Time PCR software v1. 3.
In total, we designed and tested 67 KASP assays that distinguished between alleles for 31 key phenology genes. KASP markers were subsequently validated using subsets of the barley variety panel (6-47 varieties tested and confirmed per marker), including Australian cultivars Hindmarsh, La Trobe, Baudin, Scope, Commander, and Fleet. Individual genotyping using the KASP markers for barley accessions with contrasting alleles at the locus (allele 1 corresponds to the reference allele, allele 2 to the alternative allele, allele 0 to a heterozygous locus, and ". " to a missing data point) showed an agreement ("Y") of 80-100% between the estimated allele from the target enrichment and sequencing (column "Resequencing") and absolute values based on the KASP genotyping (column "KASP", KASP_Genotyping.xlsx 9 ). On average, more than 90% of all alleles were scored identically between the re-sequenced data and KASP genotyping.

Usage Notes
The plant material described in this paper is publically available and can be made accessible under a Standard Material Transfer Agreement (SMTA).