Background & Summary

Soybean [Glycine max (L.) Merr.], considered as the source of high-quality sugar, protein, and oil, is one of the most important crops worldwide1. Soybean aphid (SBA), Aphis glycines Matsumura (Hemiptera: Aphididae) and soybean cyst nematode (SCN), Heterodera glycines Ichinohe (Tylenchida: Heteroderidae) are the two most economically important pests of soybean in the Midwestern United States2,3. Soybean aphid, an aboveground herbivore (pest), feeds on phloem sap whereas SCN, a belowground pest, infests the soybean roots. These infestations can co-occur and amplify further reduction in soybean yield4,5. In the United States, annual economic losses due to the SBA and SCN have been estimated to be approximately $4 billion and $1.3 billion, respectively6,7,8. To counteract these devastating pests, farmers rely on various management strategies that include host plant resistance and chemical measures9,10,11. For SBA, dependency on the use of chemical management has resulted in pyrethroid resistance in SBA populations in Iowa, Minnesota, North Dakota and South Dakota as well as the impacts on non-target beneficial organisms12,13. In addition, the long-term use of SCN resistance has resulted in SCN populations that are capable of overcoming the resistance genes (i.e., HG types)14. Although host plant resistance has not been implemented on a large scale for SBA management, multiple virulent SBA biotypes have been discovered in the U.S. Virulent SBA biotypes and SCN races threaten the sustainability of host plant resistance for these two pests14,15,16,17. Thus, genetic data generated from greenhouse experiments on the effects of SBA and SCN on soybean cultivars are of tremendous importance for unraveling resistance genes and regulatory networks that can potentially be used for developing durable resistance in soybean to both pests.

Although above- and below- ground herbivores are spatially segregated, they both share the host plant through systemic tissues and are able to influence each other18. Previously, the influence of SCN on soybean aphid infestation or vice versa has been studied on soybean using demographic datasets4,5,19,20,21. McCarville, et al.4 conducted experiments on various soybean cultivars [SCN susceptible (DK 28–52, IA 3018, IA 3041) and SCN resistant (DK 27–52, AG 2821 V, IA 3028)] to understand the effect of SBA, SCN, and fungus Cadophora gregata (Allington & Chamberlain) Harrington & McNew on soybean. Their study showed 5.24 times increase in SCN reproduction in the presence of soybean aphid and the fungus. In contrast, the aphid population decreased by 26.4% in the presence of SCN and C. gregata and the aphid exposure reduced by 19.8% in SCN resistant cultivars. Later, McCarville, et al.5 demonstrated the relationship between the aboveground feeding of soybean aphid and belowground reproduction of SCN in the SCN resistant Dekalb 27–52 (PI 88788 derived) cultivar, and SCN susceptible Kenwood 94 cultivar. In 30 days, both SCN eggs and the number of females increased by 33% in SCN-resistant cultivar and reduced by 50% in the SCN-susceptible cultivar. In 60 days, the number of SCN eggs and female count remained unaffected in the resistant cultivar but decreased in the susceptible cultivar. The authors concluded that soybean aphid feeding improved the quality of soybean as a host for SCN but this result was varied significantly with the cultivar and length of the experiment. Apart from these demographic studies, molecular characterization of SBA-SCN-soybean interaction has not been reported previously.

RNA-Sequencing (RNA-Seq) has been a standard tool for studying qualitative and quantitative gene expression assays that provide information on transcript abundance with their variation22,23. The major objective of this study was to evaluate differential gene expression of soybean plants that are infested with SCN in the presence or absence of SBA. To achieve the objective, we conducted experiments on two genotypes of G. max [H. glycines susceptible Williams 82 (PI518671), and H. glycines resistant MN1806CN] that were infested with biotype 1 SBA and HG Type 0 SCN for RNA-sequencing. More than 1.1 billion reads (61.4 GB) of transcriptomic data were obtained from 47 samples derived from the experiment using whole roots of G. max. An overview of the experimental design, methods and transcriptome analysis pipeline is shown in Fig. 1a–c, respectively. A comprehensive understanding of these transcriptome data will enhance our understanding of interactions among soybean, SBA, and SCN at the molecular level. The rapid advancement of bioinformatics tools is facilitating the search of candidate genes and their function that might play a crucial role in various pathways for host resistance against both herbivores.

Fig. 1
figure 1

An overview of greenhouse experiments and transcriptomic data analysis pipeline. (a) A randomized complete block design (RCBD) using two water baths (Water bath I and Water bath II), (b) A flow chart representing experimental methods used for soybean cyst nematode and soybean aphid interaction using two cultivars of soybean, and (c) A flow chart showing RNA-seq data analysis pipeline.


Plant material, soybean aphid, and SCN

Two cultivars of soybean– Williams 82 (PI518671) and MN1806CN were used in this experiment. Williams 82 is susceptible to both HG Type 0 (race 3) of the SCN and SBA. MN1806CN is resistant to HG Type 0 (race 3) of the SCN but susceptible to SBA. Soybean aphid biotype 1 populations were originally obtained from the Ohio State University and were reared on susceptible cultivar LD12-15838R at South Dakota State University. This biotype is defined by an avirulent response to all known SBA resistance (Rag) genes and was first identified in Illinois24. The SCN population used was HG type 0, which is defined by having less than 10% reproduction documented by studies of SCN resistance and is avirulent to all SCN resistance genes in soybean.

Experimental design and sample collection

A greenhouse experiment was designed using a randomized complete block design (RCBD) with eight treatments (four treatments per cultivars) with eight experimental units (plants) in six blocks. The treatments were factors of soybean genotype, SBA infestation, and SCN infestation. For examples, each of the soybean genotypes received one of the following combinations: SCN:no SBA, no SCN:SBA, SCN:SBA, or no SBA:no SCN (control).

For this experiment, the soil-sand mixture was prepared by adding construction sand and clay soil including SCN infested clay soil in the ratio of 3:1. The 125 cc of the mixture was distributed in cone-tainers (diameter of 3.8 cm, a depth of 21 cm and a volume of 164 cc; Greenhouse Megastore, USA). For SCN included treatments, each cone-tainer received approximately 2,000 SCN eggs. The cone-tainers with three soybean seeds were arranged in a 2.0 U.S. gallon (7.57 liter) plastic buckets (Leaktite, USA) filled with construction sand (Quikrete, GA). These buckets were kept in a water bath for maintaining soil temperature between 26.7 °C and 28.9 °C to ensure the reproduction of SCN (i.e. ~30 days)5. The temperature of the water bathes were regularly monitored using thermometers. The plants were grown under 16:8 (L:D) in a greenhouse with a temperature of 28 °C and 45% relative humidity. The plants were thinned down to one plant per cone-tainer upon reaching the second vegetative growth stage (V2). The V2-staged plants with the SBA included treatments were infested with 15 mixed age (i.e., fourth instar nymphs and adults) biotype 1 SBA using a 000 fine tip paintbrush (Winsor & Newton, England). The SBA were applied on the abaxial surface of the first trifoliate of V2-staged plants. All plants in each bucket were covered with a large no-see-um mesh net (Quest Outfitters, Sarasota, FL) to prevent inter-bucket movement of aphids. After SBA infestation, soybean plants were regularly checked to confirm the successful establishment of soybean aphids. Soybean aphid populations were counted at 5, 15, and 30 days post infestation (dpi). For SBA only treatment, the populations on the two soybean varieties were not significantly different, indicating that both lines were susceptible to SBA. SCN eggs were sampled at 30 dpi. The whole roots were collected on 5 and 30 dpi by snap freezing in liquid Nitrogen and stored at −80 °C for further analysis. The 5 dpi and 30 dpi root samples treated with each treatment were collected from Water bath I and Water bath II, respectively, representing each plant from three blocks (three biological replicates). The SCN soil and SCN infested roots were used for SCN cysts collection (except root samples collected for transcriptomic study) and the soil was examined for SCN counts.

RNA extraction, library construction, and RNA sequencing

RNA was extracted from all samples representing three biological replicates of each treatment that constituted 24 samples collected at 5 and 30 dpi each. As the major foci of the project were to determine whether the gene expression differed between SCN resistant and SCN susceptible soybeans, and to evaluate the gene expression of soybeans that were dual infested with SCN and SBA, we selected two timepoints (5 and 30 dpi). We selected 30 dpi to observe gene expression of treatment effects on a single generation of SCN reproduction keeping 5 dpi as a reference in the presence or absence of SBA. Frozen root samples from each treatment were grounded in liquid nitrogen with a mortar and pestle to a fine powder followed by total RNA extraction using PureLink RNA mini kit (Invitrogen, USA). RNA samples were treated with TURBOTM DNase (Invitrogen, USA) to remove any DNA contamination following the manufacturer’s instructions. Assessment of the isolated RNA integrity was performed by 1% agarose gel electrophoresis, and RNA concentration was measured by Nanodrop 2000 (Thermo Fisher Scientific, USA). The cDNA libraries were constructed using NEBNext Ultra II RNA library 96 single index prep kit and sequenced using Illumina HiSeq. 3000 (single read end utilizing 100 bases read length) at Iowa State University Sequencing Facilities.

Pre-processing of sequencing data

Quality control of reads was assessed using FastQC program (version 0.11.3) ( The FastQC results were visualized using MultiQC v1.326, and low quality bases (QC value < 20; 5-bp window size) were removed by trimming in the program Btrim64 (version 0.2.0)27. High-quality single-end reads were mapped against the primary coding sequences of G. max. The coding sequences (Gmax: Gmax_275_Wm82.a2.v1.transcript_primaryTranscriptOnly.fa.gz) were obtained from the Phytozome database and aligned using Salmon ver.0.9.128 accessed from Bioconda29. Downstream analyses of the quantified transcript reads were performed using integrated Differential Expression and Pathway analysis (iDEP 0.81, R/Bioconductor packages)30. The quantified transcript reads were filtered with 0.5 counts per million (CPM) in at least one sample and transformed using regularized log (rlog), which is implemented in the DESeq. 231 package.

Data Records

All sequence reads were deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (accession SRR8427366-SRR8427408) under Bioproject PRJNA514200 (Project ID: SRP178193)33 (Table 1). The raw transcript abundance counts for all the samples was deposited at the Gene Expression Omnibus (GEO) database, GSE12510334. The transformed transcript abundance counts, hierarchical clustering, correlation matrices, and clusters are available in figshare at:

Table 1 Statistics of the transcriptomic data using RNA-seq pipeline used in this study.

Technical Validation

Quality control

Forty-eight RNA libraries were prepared and sequenced with the sequencing depth ranging from 9,847,269 to 57,520,568 reads (see Table 1). Sequencing of one of the replicates of control resistant cultivar (MN1806CN collected at 30 dpi) came in error. Therefore, total reads of more than 1.1 billion from 47 libraries were subjected to FastQC analysis, which helped determine the data quality using various quality metrics such as mean quality scores (Fig. 2a), per sequence quality scores (Fig. 2b), per sequence GC content (Fig. 2c), and sequence length distribution (Fig. 2d). Phred quality scores per-base for all samples were higher than 30 and GC content ranged from 43 to 45%, following a normal distribution. After trimming, more than 99% of the reads were retained as clean and good quality reads. Upon mapping these reads, a high mapping rate of 73.8% to 94.3% was obtained. Among these, 67.1% to 87.6% reads were uniquely mapped.

Fig. 2
figure 2

Quality metrics of G. max sequencing data. (a) Mean quality scores per position, (b) Per sequence quality scores, (c) GC content distribution, and (d) Read length distribution.

Assessment of transcriptomic data

The 43,122 genes passed the filter upon filtering with 0.5 CPM in at least one sample. To reduce the mean dependent variance, the quantified transcript reads were transformed as shown in Fig. 3a–c and available in Figshare32 (the transformed transcript abundance count for all the samples). The transformed data were subjected to hierarchical clustering and principal component analysis (PCA) followed by visualization using t-SNE map35 in order to assess the global transcriptomic data. The hierarchical clustering of top 6000 variable genes based on two time points (5 dpi and 30 dpi) showed distinct clustering except for some samples [Fig. 4a; Figshare32 (the hierarchical clustering of top 6,000 variable genes)]. Figure 4b represents the standard deviation (SD) distribution of the top variable 6,000 genes. Figure 4c represents the Pearson’s correlation between the samples using the top 75% genes and available in Figshare32 (the correlation between the samples using the top 75% genes). The t-SNE map revealed four clusters (A, B, C, and D) for 6,000 variable genes [Fig. 4d; Figshare32 (the four clusters for 6,000 variable genes)]. Regarding the PCA, PC1 is correlated with time (P = 1.16e-06) with 28% variance, and PC2 is correlated with Treatment (P = 2.02e-08) with 15% variance (Fig. 4e).

Fig. 3
figure 3

Pre-processing of transcriptomic data. (a) Distribution of transformed data, (b) Density plot of transformed data, and (c) Scatter plot of the first two samples (SCNS5d_1 vs SCNS5d_2).

Fig. 4
figure 4

Assessment of transcriptomic data. (a) Heatmap of top 6,000 variable genes, (b) Gene SD distribution, (c) Correlation matrix, (d) Visualization of top 6,000 genes shown in the t-SNE map, and (e) A principal components analysis (PCA) plot.

Usage Notes

This data represents the first publicly available transcriptomic data for soybean roots from the three-way interaction among G. max, H. glycines, and A. glycines. The raw compressed fastq files (fastq.gz) were submitted to the National Center for Biotechnology Information (NCBI) and are available with accession numbers (SRR8427366-SRR8427408; The data could be retrieved using fastq-dump tool SRA toolkit ( There are various tools such as Trimmomatic36, cutadapt37, Fastq_clean38 that could be used for trimming purpose. Apart from the Salmon tool for the alignment and quantification of reads, other tools such as STAR aligner (, Bowtie39, HISAT240, TopHat241, Cufflinks with HTSeq can be employed, which requires reference genome of G. max and annotation file in gff3 format. For differential gene expression analysis, EdgeR42 and limma43 could be used instead of DEseq. 231. Apart from the standalone tools like iDEP30, Galaxy (, CyVerse (, MeV (, and integrated RNA-seq interpretation system for gene expression data analysis tool ( could also be used for both analysis and visualization of RNA-seq data.