QTL mapping and candidate gene analysis of cadmium accumulation in polished rice by genome-wide association study

Cadmium (Cd) accumulation in rice is a serious threat to food safety and human health. Breeding rice varieties with low Cd accumulation is one of the most effective approaches to reducing health risks from Cd-polluted rice. However, the genetic basis of Cd accumulation in grains, especially in indica rice varieties, has not been fully elucidated. The evaluation of Cd-accumulation capacity was conducted among 338 diverse rice accessions grown in Cd-contaminated soils with different Cd contents. Thirteen rice lines with relatively low Cd accumulation, including six indica rice lines, were identified. Then, 35 QTLs significantly associated with Cd accumulation were identified through sequencing-based SNP discovery and a genome-wide association study (GWAS) in the two experimental years, and only qCd8-1 was detected in both years. Among of them, nine QTLs were co-localized with identified genes or QTLs. A novel QTL, qCd1-3, with the lowest P value was selected for further LD decay analysis and candidate gene prediction. We found differential expression of OsABCB24 between high-Cd-accumulative and low-Cd-accumulative accessions, suggesting it may be a candidate gene for qCd1-3 associated with low Cd accumulation. These results may be helpful for further exploiting novel functional genes related to Cd accumulation and developing rice variety with low Cd accumulation through marker-assisted breeding.

www.nature.com/scientificreports/ from crosses between indica and japonica varieties, many quantitative trait loci (QTLs) for Cd accumulation have been identified [11][12][13][14][15] . Ueno et al. 7 detected a major QTL (qCd11) controlling the Cd concentration in shoots using a F 2 population. Liu et al. 16 identified seven QTLs and validated qCd2 associated with grain Cd content using a recombinant inbred line (RIL) population. Another major QTL (qCd7) was repeatedly detected using different genetic populations and finally cloned [17][18][19][20] . The candidate gene of qCd7, OsHMA3, encodes a tonoplast-localized P 1B -type ATPase. OsHMA3 is involved in Cd sequestration from the cytoplasm into the vacuoles of root cells and its dysfunction promotes root-to-shoot Cd translocation and consequently increases Cd accumulation in the shoots and grains in some varieties 18,19 . Several dysfunctional alleles of OsHMA3 have been identified in different japonica rice accessions 21,22 . However, one SNP in the promoter of OsHMA3, which alters the normal expression of OsHMA3, is responsible for the differential Cd accumulation between indica and japonica rice varieties 23 . Like the function of OsHMA3, ATP-binding cassette (ABC) proteins mediate Cd sequestration and confer Cd tolerance in Arabidopsis thaliana 24,25 .
Most of the Cd-related QTLs were identified using bi-parental populations. However, the progress of QTL mapping is hindered due to limited allele diversity and less recombination in bi-parental populations 26 . Genomewide association study (GWAS) can overcome these two limitations and is a powerful tool to identify genome regions associating with complex traits 27,28 . Through a GWAS of 276 diverse accessions, 60 QTLs were detected for accumulation of arsenic, cadmium, and lead in rice 29 . Zhao et al. 30 identified 14 QTLs associated with Cd accumulation in rice by GWAS and predicted OsNRAMP2 as the candidate gene of qCd3-2. Using a composite method combining GWAS and other analyses, Yan et al. 10 found that a missense mutation in OsCd1 resulted in the indica-japonica differentiations of Cd accumulation in rice grain. These reports clearly demonstrate that GWAS is an effective approach to elucidate the genetic mechanism underlying Cd accumulation. Unfortunately, for HMA3 and most of these Cd-related QTLs, the favorable alleles for reducing Cd accumulation were basically derived from japonica rice varieties and therefore limited their breeding application in indica rice.
To identify rice germplasms with low Cd accumulation and the responsible loci for Cd accumulation, we selected 338 accessions mainly composed of indica rice to evaluate their Cd accumulation in different Cd-polluted paddy fields. 13 rice lines, including six indica rice lines, were identified as low Cd-accumulative germplasms. Based on the specific-locus amplified fragment sequencing (SLAF-seq) method 31 , genome-wide SNP discovery and a GWAS strategy were used to identify QTLs associated with Cd concentration in polished rice. 35 QTLs significantly associated with Cd accumulation were identified in two experimental years, but only qCd8-1 was detected in both years. Through a combined analysis of LD decay and gene expression, we predicted OsABCB24 as the candidate gene of qCd1-3. These results will be helpful to elucidate the genetic mechanism of Cd accumulation and provide a good basis for breeding low Cd-accumulative indica rice varieties.

SLAF-based SNPs discovery among rice accessions.
After sequencing and quality control, a total of 688,782 SLAF tags were obtained for each of the 338 accessions, and 515,447 polymorphic SLAFs were identified by conducting sequence alignment with the 93-11 reference genome (Supplementary Table 1). The average sequence depth was 14.3× , ranging from 8.7 to 25.5 × among different accessions. Using the GATK and Samtools software packages, 3,960,919 SNPs were called from the SLAFs for the 338 genotypes. Based on the criterion of having MAF larger than 0.05 and missing genotype rate less than 0.2, only 123,865 SNPs (3.11%) passed filters from the SNP dataset and were used for subsequent analysis. These high-quality SNPs were evenly distributed on 12 chromosomes with average density of approximately one SNP per 3.02 Kb (Supplementary Table 2). The highest maker density was detected on chromosome 7 (one SNP per 2.63 Kb), while the lowest density was detected on chromosome 8 (one SNP per 3.55 Kb).
Population structure and relative kinship. Population structure analysis can provide information on the origin and composition of individuals. Based on the filtered high-quality SNPs, the optimal number of ancestors (K) was estimated using the STRU CTU RE software. The ∆K value was lowest when K was set to 2, suggesting that the whole group of rice lines could be divided into two subgroups (Fig. 1a). Consistently, the Principal Component Analysis (PCA) results showed that two clusters clearly separated along the eigenvector of PC1, which accounted for 32.9% of the genetic variation (Fig. 1b). Based on the neighbor-joining algorithm, a phylogenetic tree for each sample was constructed with two subgroups (1 and 2) (Fig. 1c). Subgroup_1 contained 268 rice lines, which were in accordance with the indica subspecies. Subgroup_2 contained 70 lines, including a portion of the foreign germplasm and landraces, which were in accordance with the japonica subspecies. Most of the breeding accessions and landraces belonged to subgroup_1. Interestingly, glutinous lines of landraces genetically diverged significantly, with 15 and 20 lines classified to subgroup_1 and subgroup_2, respectively. Moreover, there was differentiation within the indica rice subgroup. Most of the bred varieties had close relationships with foreign indica germplasms but were distantly related to the landraces, indicating that these landraces were rarely used in modern rice breeding in China.
Variation of Cd accumulation in polished rice among 323 lines. In order to reduce confounding effects from variable growth durations among accessions, only 323 lines with moderate durations of growth were selected for Cd determination and GWAS analysis. The Cd concentrations of polished rice collected from Cd-polluted fields were determined using atomic-absorption spectrometry. The Cd accumulation varied significantly among different rice accessions in both years (Supplementary Table 3). In 2016, Cd concentrations in polished rice ranged from 0.57 mg/kg to 4.03 mg/kg, with an average of 1.61 mg/kg and a median of 1.53 mg/ kg (Fig. 2a). All the rice lines showed higher Cd accumulation over 0.2 mg/kg, which is the allowable concentration for human consumption as stipulated by the National Food Hygiene Standard of China. In 2017, Cd www.nature.com/scientificreports/ concentrations ranged from 0.06 mg/kg to 2.24 mg/kg, with an average of 0.43 mg/kg and a median of 0.33 mg/ kg (Fig. 2b). Only 58 lines displayed Cd concentrations less than 0.2 mg/kg, accounting for 18.0% of all rice lines. Overall, rice Cd accumulation in 2017 was significantly lower than that in 2016, indicating that Cd concentrations of soil was a key factor determining Cd accumulation in grains (Fig. 2c). Despite large differences between two years, thirteen rice lines were identified with relatively low Cd accumulation (Cd2016 < 0.8 mg/kg and Cd2017 < 0.2 mg/kg) in both years (Table 1). For example, breeding material BS114 showed the lowest Cd accumulation (0.57 mg/kg) in 2016 and also extremely low Cd (0.07 mg/kg) in 2017, respectively. Additionally, many landraces from Hunan province, such as "Shenshuinuo" and "Daganzaogu, " were found in this low-Cdaccumulative group and could be used as potential donors in future low-Cd rice breeding.
Considering the presence of distinct population structure, Cd accumulations were compared between different subgroups in both years. As shown in Fig. 2d, Cd accumulation in the indica subgroup was significantly higher than that in the japonica subgroup (P < 0.001). In 2016 and 2017, the mean Cd accumulations in the indica subgroup were 1.75 mg/kg and 0.47 mg/kg respectively, while in the japonica subgroup were 1.04 mg/ kg and 0.25 mg/kg, respectively. These results clearly indicated that population structure had effect on the Cd accumulation in these rice lines.

GWAS for Cd accumulation.
To investigate the genotypic basis underlying Cd accumulation in polished rice, we performed GWAS to identify the associated SNP loci in the selected 323 rice lines. Considering the effect of population structure on Cd accumulation, the mixed linear model (MLM) model was adopted with kinship matrix and PC matrix as covariates. According to Lv et al. 32 , a region was considered as one QTL when more than two significant SNPs (P < 0.001) were detected within a 200-Kb window. In total, 35 QTLs with 203 SNPs significantly associated with Cd accumulation were identified in the two experimental years with a well- www.nature.com/scientificreports/ fitted quantile-quantile (Q-Q) plots (Table 2, Supplementary Table 4, Fig. 3a,b). These QTLs were distributed on all chromosomes except chromosome 10. The comparison of QTLs identified in two years indicated that only qCd8-1 was detected in both years, suggesting that environmental factors might have great influence on the GWAS results of Cd accumulation. To verify the accuracy of the GWAS results, the identified QTLs in this study were further compared with previous reports. We found 9 QTLs were co-localized with previous mapped QTLs, associated markers and characterized genes ( Table 2), indicating that GWAS results were reliable in this study. Among the co-localized QTLs, qCd6-2 on chromosome 6 was located close to OsLCT1 33 , and qCd7-1 on chromosome 7 was identified in the genome interval of the well-characterized gene HMA3, which is involved in Cd transport into rice grains. The remaining QTLs have not been reported previously and were considered as novel QTLs.

Identification of candidate genes responsible for Cd accumulation.
Around the interval of qCd1-3, 14 consecutive SNPs were significantly associated with Cd accumulation in 2017 (Fig. 4a), among which the lead SNP rs1_43287290 (P = 2.78E−06) was selected as the representative of this loci. According to the alleles of the lead SNP, all samples were divided into two groups of a favorable allele G (designated as group G) and an unfavorable allele T (designated as group T) respectively. Cadmium accumulation in group G was significantly lower than that in group T. An average of 0.78 mg/kg and a median of 0.62 mg/kg were observed in group T while an average of 0.36 mg/kg and a median of 0.30 mg/kg were observed in group G. (Fig. 4b). In order to accurately estimate the target interval, LD decay analysis was performed for the region around qCd1-3. With r 2 = 0.8 as the threshold, a 136-Kb block containing the lead SNP was identified as the candidate region (Fig. 4c).
Based on the annotation of the reference genome, 22 genes were identified in this block including 17 functional protein-coding genes and five lncRNA-encoding genes. One of these genes (OsABCB24) located approximately 71 Kb from the lead SNP was annotated as an ATP-binding cassette (ABC) transporter. Since the ABC transporter had been reported to mediate vacuolar compartmentation of Cd in root tissue, OsABCB24 was regarded as the primary candidate gene of qCd1-3 associated with Cd accumulation. Then we analyzed the expression of OsABCB24 in different tissues at the vegetative stage. Although OsABCB24 showed the highest expression in leaf, its expression in root was higher than those in many other tissues such as stem, leaf sheath and panicle ( Supplementary Fig. 1). Based on the genotype of qCd1-3, twelve accessions with contrasting Cd accumulation, including six high-Cd-accumulative accessions and six low-Cd-accumulative accessions, were selected for further expression analysis. As shown in Fig. 4d, OsABCB24 showed lower transcript levels in the high-Cd-accumulative rice accessions than those in the low-Cd-accumulative accessions under normal growth conditions. A similar trend was also observed under Cd-treatment conditions, even though Cd treatment slightly reduced the expression of OsABCB24 in low-Cd-accumulative accessions. These results suggest that OsABCB24 might be a good candidate gene for qCd1-3.

Discussion
Natural genetic variation is a powerful resource not only for rice breeding but also for investigating the genetic mechanism of complex traits. Cadmium accumulation varied considerably among different rice accessions, suggesting that it is feasible to breed low Cd-accumulative rice varieties. In order to accurately evaluate the Cdaccumulation capacity among different genetic rice lines, we cultivated 338 rice lines in paddy fields that naturally contained different concentrations of cadmium, and the experiment was conducted in two consecutive years. The difference of soil Cd concentration and Cd accumulation of polished rice in 2017 was significantly lower than those in 2016, respectively. These results were consistent with the previous study that the grain Cd accumulation was largely affected by the soil Cd concentration 9 . Zhao et al. 30 planted 312 rice accessions in a field with a pH value of 5.5 and soil Cd level of 1.4 mg/kg (similar to the Cd level in 2016 of this study), while they recorded relatively lower Cd accumulation in the grain. Because pH is probably the most important influencing factor of Cd uptake in rice plants 34 , it is reasonable to ascribe this difference in grain Cd content to the difference in pH. Consistent with previous studies [7][8][9] , there are wide variations in Cd accumulation levels among different rice accessions. In addition, our results also indicted that indica rice accessions tended to accumulate more Cd than japonica accessions. Nevertheless, several indica rice lines with relatively low Cd accumulation were identified in both years, most of which are landraces from Hunan province. After long-term evolution under natural and artificial selection, rice landraces show high genetic diversity and outstanding environmental adaptability 27 . The results showed that most of the landraces were genetically distant from the bred varieties, and they could be ideal donors for breeding low-Cd-accumulative rice varieties, especially for indica rice varieties. Unfortunately, most of the rice lines showed higher Cd accumulation over 0.2 mg/kg in the two-year experiment. Thus, identifying www.nature.com/scientificreports/ accessions with relatively low Cd accumulation might be an important step towards reducing Cd risks to rice consumers.
Identifying the loci associated with complex traits in rice is challenging due to high population differentiation 26 . In this study, the population structure of the rice accessions likely affected Cd accumulation in the polished rice. MLM model has been widely adopted due to its effectiveness in controlling confounding factors and reducing the number of false positives 35 . Results of the GWAS for the indica subgroup were basically consistent with those for the whole group, demonstrating that the use of relatedness matrixes as covariates in GEMMA could eliminate confounding effects of population structure. The slight difference between the subgroup and whole group might be caused by two reasons: (1) a smaller number of SNPs were used in the GWAS for the indica subgroup; and (2) the criterion of requiring more than two significant SNPs existing within a 200-Kb window to identify a QTL may have been too strict. Another common problem encountered in the GWAS was the large effect of environment, especially when mapping isonomic traits 36 . Among the identified QTLs, only qCd8-1 was detected under different soil Cd concentration in both years. Except for the inaccurate phenotypic identification caused by environmental factors, it's quite likely that the genetic mechanisms underlying Cd accumulation The transfer of Cd from soil to grain is controlled by at least four steps: transport of Cd from soil into root cells, sequestration of Cd into the vacuoles, xylem loading, and phloem-mediated Cd transport to grains 37 . Nramp5 is a major transporter responsible for the Cd uptake from soil, and its knock-out resulted in a significant reduction of Cd accumulation in different genetic backgrounds of rice 38 . However, no natural allelic variation of OsNramp5 have been reported among different genetic resources. Through GWAS analysis, Zhao et al. 30 predicted that another homolog OsNRAMP2 might be involved in Cd uptake in rice. OsHMA3, which specifically sequesters Cd into the vacuole of root cells and prevents its upward transport, was identified as the candidate gene for qCd7-1 in this study. Interestingly, qCd7-1 was also detected in the indica subgroup, implying that there might be natural variations of OsHMA3 among different indica rice varieties. Unlike the function of OsHMA3, another homolog OsHMA2 functions in the transport of both zinc (Zn) and Cd between root and shoot tissues through xylem loading 39 . The last Cd translocation step from shoot to grain involves a low-affinity cation transporter, OsLCT1, the encoding gene of which was co-localized with qCd6-2 (identified in the present study).
Our bioinformatics and gene expression analyses suggest that OsABCB24 was the candidate gene of qCd1-3. The ABC transport family is one of the largest protein families and conserved in all organisms. There are more than 125 ABC transporters in the rice genome, and their functions have yet to be elucidated 40 . Several studies have shown that ABC transporters may play important roles in Cd tolerance in plants 24,25 . Among the genes www.nature.com/scientificreports/ within the block of qCd1-3, only OsABCB24 was identified to encode a transporter. Reportedly, ABC transporters contribute to detoxifying cadmium by pumping it into vacuoles in yeast 41 . In Arabidopsis, AtABCC3 functions as a transporter of phytochelatin-Cd complexes into vacuoles 42 , similar to the function of OsHMA3 in Cd vacuolar sequestration. Our results showed that the expression of OsABCB24 was relatively high in root, and was significantly lower in high-Cd-accumulative rice accessions than that in low Cd-accumulative accessions. Because overexpression of OsHMA3 decreased Cd concentration in shoots, we proposed that the strong expression of OsABCB24 might contribute to enhancing vacuolar compartmentation of Cd in roots, thereby reducing Cd accumulation in rice grains of low Cd-accumulative accessions. Future work will apply functional genomics methodologies such as genetic transformation and CRISPR-cas9 technology to verify the role of OsABCB24 in regulating Cd accumulation.  Table 3). Young leaves from each of the 338 accessions were collected, frozen in liquid nitrogen, and used for DNA extraction. Genomic DNA was isolated using the cetyltrimethyl annonium bromide (CTAB) protocol 43 . The SLAF libraries were constructed for each accession following the method proposed by Sun et al. 31 , and sequencing was performed on a HiSeq 2500 system (Illumina, CA, USA). The library construction and sequencing were carried out at Biomarker Technologies Corporation (Beijing, China). Because most of the rice lines belong to the subspecies indica, the pair-end reads were aligned to the reference genome of indica rice 93-11 (https ://rice.genom ics.org.cn/) using the MEM algorithm of Burrows-Wheeler Aligner (BWA) software (version 0.7.10) 44 . After alignment, SNP calling was conducted by the combined use of GATK (version 3.7) 45 and Samtools (version 1.9) 46 . The identified SNPs were further filtered by the Plink software (version www.nature.com/scientificreports/ 1.90) 47 . Only SNPs with minor allele frequencies (MAF) > 0.05 and missing genotype rates < 0.2 were retained for GWAS analysis.

Methods
Field and pot experiments. Two years of field experiments were conducted in two separate Cd-polluted paddy fields in Beishan, Hunan province, China. The soil Cd concentration in 2016 was 1.25 mg/kg with a pH value of 5.2, while the Cd concentration in 2017 was 0.69 mg/kg with a pH value of 5.3. To reduce potentially unexpected effects of differential growth duration among accessions, sowing dates were staggered in May based on the days to maturity of each accession to ensure most lines heading at approximately the same time. The 25-dold seedlings were transplanted in a randomized complete block design with two replications for each line. Each replication contained 16 rice plants grown in two rows with an in-row spacing of 16.7 cm and a between-row distance of 20 cm. Flooded condition was maintained in the field until mid-August. In order to increase bioavailable Cd concentration in the soil, rain-fed irrigation was mainly adopted during the grain filling stage, and flush irrigation was applied when necessary to avoid drought stress. Other field management, including fertilizer application and disease and pest control, was conducted according to standard rice farming practice.
Sampling and cd determination. Rice grain was harvested 35 days after heading and dried in an oven at 40℃ for three days. The Cd concentration was determined according to the Chinese National Standard (GB 5009. . Because polished rice is the main edible part of rice, Cd accumulation in polished rice was investigated in this study. Rice grains were polished and then ground into powder. Approximately 0.3-0.5 g samples were digested with a solution of nitric acid and perchloric acid (9:1 v/v). The Cd concentration in the digest solution was measured by atomic absorption spectrometry (Solaar S4; Thermo, USA).
Phylogenetic tree construction, population structure, and principal component analysis. A phylogenetic tree of 338 lines was constructed by MEGA 5.0 48 using the neighbor-joining method with 1000 bootstrap replicates. The population structure was analyzed using STRU CTU RE software 49 . The following parameters were used for the analysis: K = 2 to 10, burn in 5000, MCMC repeat 50,000 and three replicates for each K. Then we calculated ΔK to determine the optimal K value. The software Clumpp 50 and Pophelper 51 were used to visualize the population structure. The Q matrix of population structure was analyzed by ADMIXTURE software 52 . Principal components analysis was carried out using EIGENSOFT 53 .
Genome-wide association study. A software toolkit of GEMMA was used to perform association mapping according to Zhou and Stephens 54 . The standard linear mixed model was expressed as y = Wα + xβ + u + e, where y represents the phenotypic observation, W = (w1, … wc) is an n × c matrix of covariates, α is the vector of the corresponding coefficients including the intercept, x is an n-vector of marker genotypes, β is the effect size of marker, u and e represent random effects and errors, respectively. To minimize the effect of population structure, PCA matrix and kinship matrix were used as covariates in this study. P values of ≤ 0.001 were used as the threshold to identify significantly associated SNPs. The SNP with the minimum P value in a locus was considered as the lead SNP. The allele contributing to reduction of cadmium content was regarded as the favorable allele.
Gene prediction and expression analysis. The LD heatmap around the peak SNP in GWAS was constructed using HaploView software 55 and the candidate region was estimated using r 2 > 0.8. The local Manhattan plot was produced using R package qqman 56 . The reference sequences of a candidate region were downloaded for gene annotation. Based on the annotations, genes related to transport of metal ion were selected as candidate genes. For the expression analysis, the seedlings were grown in quarter-strength Hoagland solution in a growth chamber. Ten-days-old seedlings were then transferred to a nutrient solution with a cadmium concentration of 1.0 mg/kg, while the control group of seedlings continued to grow normally in the same nutrient solution without cadmium. After one week of treatment, roots were sampled and immediately frozen in liquid nitrogen. Total RNAs were isolated using Trizol Reagent (TransGen, Beijing, China) and were used for cDNA synthesis using RT SuperMix (Vazyme, Nanjing, China). Quantitative PCR was performed on a LightCycler 96 system (Roche, Rotkreuz, Switzerland) using SYBR qPCR Master Mix (Vazyme, Nanjing, China). Gene-specific primers for OsABCB24 were 5′-TCT TTA CGA GTG ACC CTG ACC-3′ and 5′-CTC CAT ACT ACC GAC CCG TT-3′. Actin was used as an internal control with primers 5′-CAT TGG TGC TGA GCG TTT CC-3′ and 5′-AGA AAC AAG CAG GAG GAC GG-3' .

Data availability
The raw reads of 338 rice accessions generated in this study have been deposited in the Sequence Read Archive (SRA) database (https ://www.ncbi.nlm.nih.gov/sra) under the accession number of PRJNA629658.