Transcriptome sequencing and marker development in winged bean (Psophocarpus tetragonolobus; Leguminosae)

Winged bean, Psophocarpus tetragonolobus (L.) DC., is similar to soybean in yield and nutritional value but more viable in tropical conditions. Here, we strengthen genetic resources for this orphan crop by producing a de novo transcriptome assembly and annotation of two Sri Lankan accessions (denoted herein as CPP34 [PI 491423] and CPP37 [PI 639033]), developing simple sequence repeat (SSR) markers, and identifying single nucleotide polymorphisms (SNPs) between geographically separated genotypes. A combined assembly based on 804,757 reads from two accessions produced 16,115 contigs with an N50 of 889 bp, over 90% of which has significant sequence similarity to other legumes. Combining contigs with singletons produced 97,241 transcripts. We identified 12,956 SSRs, including 2,594 repeats for which primers were designed and 5,190 high-confidence SNPs between Sri Lankan and Nigerian genotypes. The transcriptomic data sets generated here provide new resources for gene discovery and marker development in this orphan crop, and will be vital for future plant breeding efforts. We also analyzed the soybean trypsin inhibitor (STI) gene family, important plant defense genes, in the context of related legumes and found evidence for radiation of the Kunitz trypsin inhibitor (KTI) gene family within winged bean.


Independent Transcriptome Assemblies
Most assemblers currently used are created for short-read (i.e. Illumina) data. To explore the impact of assembler on our transcriptome assembly, preliminary analyses were conducted to test assembly strategies using four different open source de novo assemblers: MIRA v.3.4.0.1 [2]; Velvet v. 1.2.03 [3]; Trinity [4]; and GS de novo Assembler v 2.6 (formerly Newbler [5]) (Roche, Branford CT, USA). MIRA is a whole genome assembler/mapper and can be used across several different sequencing platforms. Velvet is a set of algorithms designed to make use of short read data to build contigs. Trinity is now the gold standard for Illumina-based assemblies, was initially developed specifically for transcriptome assembly, and is optimized for short-read technologies. The GS de novo assembler is specifically designed to assemble sequence data generated on the 454 pyrosequencing platform. Both MIRA and GS de novo assembler are based on overlap-layout-consensus (OLC) algorithms and are commonly suggested for use with longer-read NGS technologies, while Velvet and Trinity are based on de Bruijn graph algorithms and commonly used for shorter-read technologies. CAP3 is a basic sequence assembly program and is used here to refine contigs from MIRA.
MIRA was called using --job=denovo, est, accurate, 454 -GE:not=4 -AS:nop=4,sep=1 -SK:nrr=20 454_SETTINGS -AL:mrs=95. GS de novo Assembler was run using the cdna switch and 4 multiple threads under default conditions. Trinity was called using parameters --seqType fq --kmer_method inchworm --single seq_selected.pl --CPU 4. Velvet was performed at k-mer lengths from 49 to 69, with the optimal k-mer value of k=68 chosen by VelvetOptimiser [6]. We compared each method via assembly metrics, including the number, average length, maximum length, N50 and N90 value of the assembly. Finally, the reads were mapped back to the final assemblies using Bowtie2 [7] in order to assess the quality of all assemblies generated. All preliminary assemblies for comparative purposes were first run on CPP34 alone.

GC content analysis
Assessing the GC content of a transcriptome assembly can aid in checking for possible contamination as different species have different genomic GC content. Multiple peaks in a GC content histogram may suggest contamination by other species. GC content analysis was carried out on assembled, annotated contigs and as reference, on four other unigene datasets downloaded from NCBI; Glycine max, Medicago truncatula, Vigna unguiculata and Lotus japonicus. GC content was calculated using in house perl scripts.

Read Filtering
All reads with lengths shorter than 30 bases were discarded (Table S1). The final dataset after processing for genotype CPP34 had a total of 371,271 single-end reads comprising 140,095,060 bases with an average read length of 377 bases and genotype CPP37 had a total of 433,486 reads comprising 158,348,141 bases with an average read length of 365.  Table S1: Summary of data generated for winged bean transcriptomes. Ns are the number of ambiguous nucleotide sites in the sequence.

Preliminary independent assemblies
The goal of a de novo transcriptome assembly is to assemble as many true transcripts as possible, avoiding chimaeras or split contigs. As estimated from Arabidopsis, it is believed that plants have ~27,000 non-redundant genes, although this number will vary depending on ploidy [8]. However, transcriptome assemblies will often produce much higher numbers, due to the presence of alternatively transcribed genes, chimaeras, or split contigs. Assembly results from the four assemblers varied greatly (Table S2), as did the mapping statistics (Table S3). Taken together, these statistics helped to determine the best assembler for our data.
Velvet generated the highest number of contigs, but more than a quarter of them were less than 100 bp in length. The number of contigs may be an overestimate of the true number of gene assemblies due to split contigs or chimaeras. Mapping reads back onto the assembled contigs can help determine the validity of gene assemblies and detect chimaeric transcripts. Velvet had less than 10% of reads map to the contigs, which may suggest a high degree of chimaeras. Velvet utilized the least amount of data and also had the shortest maximum contig length, average length, N50, and N90 of all five assembly strategies. Taken together, these stats suggest that Velvet generated the poorest assembly.   MIRA produced the second highest number of contigs. Even though MIRA+CAP3 produced fewer contigs than MIRA alone, the number is reasonable and MIRA+CAP3 produced a better overall assembly than MIRA as evidenced by the increase in maximum contig length, average length, N50, and N90. MIRA+CAP3 produced the assembly with the best N50 and N90, as well as the longest average contig length, and also had the highest overall alignment rate, mapping 76-77% of reads back onto contigs. However, the majority of reads mapped more than once, suggesting the presence of chimaeric or fractured contig assemblies.

Assembler
Trinity produced a median number of contigs, coupled with relatively high maximum contig length, average contig length, and median N90. Of all five assemblers, Trinity had the highest N50. Although it had a lower overall alignment rate than the MIRA assemblies, Trinity had less than half the number of reads that mapped more than once, with almost 80% of reads mapping uniquely to the contig assemblies. This may indicate relatively few chimaeras.
Although GS de novo assembler produced the fewest contigs, it boasted the longest maximum contig length, a relatively high N50, and a median average contig length. However, GS de novo assembler had a median overall alignment rate and a relatively low N90. That said, GS de novo assembler had the highest number of uniquely mapped reads, suggesting that few chimaeras were created.
Comparing assembly algorithms, de Bruijn graphs vs OLC algorithms, the latter wins out for use with our 454 data: MIRA+CAP3 produced a reasonable number of contigs, had the highest maximum and average contig lengths, the best N90, and the second best N50, whereas GS de novo assembler had the highest number of read map uniquely, suggesting the purest contig assembly, even though fewer contigs were created. The suggestion that OLC algorithms are more appropriate for longer reads such as 454 data is supported here.
Ultimately, we chose to use GS de novo Assembler over the other programs because of the reliable output (fewer chimaeras), comparable contig length, the fact that it considers alternative splicing [9], and that it is a program specifically designed for 454 data. It should be noted that assembly metrics for GS de novo Assembler differ here as compared to the analyses in the main text, which were run on a newer version of Newbler, which is packaged into the GS de novo Assembler.

GC content analysis
GC content analysis showed that all legume references had comparable GC content profiles ( Figure S1). Genotypes CPP34 and CPP37 had a mean GC% of 45.16% and 42.44%, respectively. The GC content profiles showed a narrower range of distribution when compared to the references, but followed the overall pattern. CPP34 had a higher abundance of contigs within 45-50% GC content, while CPP37 showed higher abundance of contigs in the range of 35-40% GC content. However, such variation could be due to an incomplete transcriptome. Deeper sequencing might reveal the true GC content profile for winged bean.