Fingerprinting of Proteins that Mediate Quagga Mussel Adhesion using a De Novo Assembled Foot Transcriptome

The European freshwater mollusk Dreissena bugensis (quagga mussel), an invasive species in North America, adheres to surfaces underwater via the byssus: a non-living protein ‘anchor’. Despite its importance as a biofouling species, the sequences of the majority of byssal proteins responsible for adhesion are not known, and little genomic data is available. To determine protein sequence information, we used next-generation RNA sequencing and de novo assembly to construct a cDNA library of the quagga mussel foot transcriptome, which contains over 200,000 transcripts. Quagga mussel byssal proteins were extracted from freshly induced secretions and analyzed using LC-MS/MS; peptide spectra were matched to the transcriptome to fingerprint the entire primary sequences of the proteins. We present the full sequences of fourteen novel quagga mussel byssal proteins, named Dreissena bugensis foot proteins 4 to 17 (Dbfp4–Dbfp17), and new sequence data for two previously observed byssal proteins, Dbfp1 and Dbfp2. Theoretical masses of the newly discovered proteins range from 4.3 kDa to 21.6 kDa. These protein sequences are unique but contain features similar to glue proteins from other species, including a high degree of polymorphism, repeated peptide motifs, disordered protein structure, and block structures.


Quagga mussel foot transcriptome library construction
RNA was extracted from three quagga mussels to account for allelic variation in the local Lake Ontario population, and each sample was sequenced with its own unique barcodes. A sequencing depth of over 10 gigabases of data per sample was achieved, and over 90% of bases were identified with a Phred score higher than 30, indicating a high-quality raw data set. Following trimming and removal of low-quality reads, data from all three samples were pooled for de novo assembly using Trinity. At least 90% of reads from each sample were successfully mapped to the assembled library, suggesting a high-quality assembly. The RNA-sequencing results and de novo assembly are summarized below in Table S1. For all transcripts, the median contig length was 386 bp, and the N30 and N50 contig lengths were 2396 bp and 1481 bp, respectively. The assembly of long continuous contigs further suggests that the assembly is of high quality. The distribution of contig lengths is shown below in

Quagga mussel gel bands removed for LC-MS/MS
Because of the extensive cross-linking in the mature quagga mussel byssus, mussels were induced to secrete fresh byssal material, which was isolated and analyzed. Soluble byssal proteins were separated by Tris-Bis SDS-PAGE (Figure S2). Four gel bands (~6, ~7, ~14, and ~28 kDa) were clipped from each of the two lanes and pooled for LC-MS/MS analysis.

Dbfp9 - Manual assembly using de novo-only spectra
The Dbfp9 start (signal peptide and start codon) was observed with full coverage in the whole TP extract and in the 6 kDa and 7 kDa bands; it is shown below with the signal peptide removed. The largest C-terminus transcript, Dbfp9-fb, which had the second-highest coverage, is also shown below. The alignment contains mismatched asparagine (N) and lysine (K) residues. However, because a single base substitution can replace asparagine with lysine and vice versa, these mismatches could be explained by base substitution errors introduced when creating the transcriptome cDNA library from the mussel mRNA. De novo-only spectra strongly suggest that the alignment is correct and that the sequence actually contains either a lysine or an asparagine residue at these positions. There are 8 high-quality (>80% ALC) de novo sequences (labeled DN1 through DN8) that span the junction between the Dbfp9 start and end transcripts, shown below in Table S5. Aspartic acid (D) and asparagine (N) have similar side chains with a mass difference of only 0.98 Da; therefore, a residue labelled N+0.98 is most likely an aspartic acid residue. Asparagines that were identified as N+0.98 are underlined as D to make the alignment clearer.
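The codon and mass arguments above can be checked numerically. The following is a minimal sketch, assuming the standard genetic code (Asn: AAU/AAC; Lys: AAA/AAG) and standard monoisotopic residue masses; the function and variable names are illustrative and are not part of the original analysis pipeline.

```python
# Codons for asparagine (N) and lysine (K) in the standard genetic code.
ASN_CODONS = ("AAU", "AAC")
LYS_CODONS = ("AAA", "AAG")

# Standard monoisotopic residue masses in Da (assumed reference values).
MONO_RESIDUE_MASS = {"N": 114.042927, "D": 115.026943}


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length codons differ."""
    if len(a) != len(b):
        raise ValueError("codons must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_substitutions(codons_a, codons_b) -> int:
    """Smallest number of base changes converting any codon in A to any in B."""
    return min(hamming(a, b) for a in codons_a for b in codons_b)


# One base change is enough to swap asparagine and lysine.
n_to_k = min_substitutions(ASN_CODONS, LYS_CODONS)

# Mass shift that turns an asparagine residue into an aspartic acid residue.
deamidation_shift = MONO_RESIDUE_MASS["D"] - MONO_RESIDUE_MASS["N"]

print(f"Minimum N<->K base substitutions: {n_to_k}")
print(f"D - N residue mass difference: {deamidation_shift:.4f} Da")
```

Under these assumptions, every asparagine codon differs from a lysine codon only at the third base, and the D minus N mass difference rounds to the 0.98 Da shift used to relabel N+0.98 residues.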

[Table columns: Fragment | Peptide sequence]
This C-terminus component was used to assemble Dbfp9. Three transcripts from two other components were also aligned with Dbfp9-fb (Dbfp9-fa, Dbfp9-fg, Dbfp9-fh).
Other sequences of interest with similarity to Dbfp9 include a variant from the component containing the C-terminus of Dbfp9, which contains the sequence NYGYPGYGG repeated 5 times (highlighted in blue and green below). This variant was fingerprinted by only one spectrum in the 7 kDa band data and was incomplete (no start or stop codon), so it was not included in the variant list. Another sequence mined from the library, similar to the Dbfp9 start sequence (the transcript containing the start codon), contains the shorter sequence GGNY repeated 9 times (highlighted in yellow and pink). No LC-MS/MS spectra were observed to confirm the existence of this sequence at the protein level, so it was excluded from the analysis. However, RSEM analysis suggests that these transcripts are expressed at a level similar to Dbfp16. Because of their similarity to Dbfp9, the sequences are included below.

MCLLVAAAVLLAVANARSVPYYGDYGYGGNYGGNYGGNYGGNYGGNYGGNYGGNYGGNYGGNY
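Repeat counts such as those quoted for NYGYPGYGG and GGNY can be checked directly from a sequence string. The sketch below is a hypothetical helper, not the method used in the study: it scans every start position and counts back-to-back copies of a motif. The sequence constant is copied from the line above.

```python
def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of consecutive copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    step = len(motif)
    best = 0
    for start in range(len(sequence)):
        count = 0
        pos = start
        # Walk forward one motif length at a time while copies keep matching.
        while sequence.startswith(motif, pos):
            count += 1
            pos += step
        best = max(best, count)
    return best


# GGNY-containing sequence as printed in the text above.
GGNY_VARIANT = "MCLLVAAAVLLAVANARSVPYYGDYGYGGNYGGNYGGNYGGNYGGNYGGNYGGNYGGNYGGNY"

print("Longest GGNY tandem run:", longest_tandem_run(GGNY_VARIANT, "GGNY"))
```

Scanning every start position is slower than a regular-expression search, but it stays correct even for motifs that overlap themselves, which keeps the helper safe to reuse on other repeats.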
Many additional Dbfp9 C-terminus fragments were fingerprinted by LC-MS/MS spectra in both the whole TP and gel band data. Dbfp9-fa and Dbfp9-fb overlapped with the Dbfp9 transcript containing the start codon and could be successfully assembled. The additional Dbfp9 fragments did not overlap with the transcript containing the start codon. However, these sequences were all identified by multiple LC-MS/MS spectra, and RSEM analysis suggests they are significantly expressed. Furthermore, the additional transcripts demonstrate sequence variation in the C-terminus of Dbfp9, so we believe they warrant inclusion. Using Dbfp9-fb as a reference, the fragments were aligned with Clustal Omega to assemble additional Dbfp9 variants, demonstrating the polymorphism of Dbfp9 and the possible mass range of the protein variants (Figure S3). Based on the variants assembled with this method, the mass of Dbfp9 could range from 6.6 to 9.0 kDa.
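A theoretical mass range like the one above follows from summing residue masses over each assembled variant. The following is a minimal sketch, assuming standard average residue masses, an unmodified linear chain, and a sequence with the signal peptide already removed; `average_mass` and `mass_range_kda` are illustrative names, and the example sequences are placeholders built from a motif in the text, not actual Dbfp9 variants.

```python
# Standard average residue masses (Da) for the 20 common amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_AVERAGE_MASS = 18.01528  # one water per chain for the free termini


def average_mass(sequence: str) -> float:
    """Average mass in Da of an unmodified linear peptide or protein."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    try:
        residues = sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence.upper())
    except KeyError as exc:
        raise ValueError(f"unknown residue: {exc.args[0]}") from None
    return residues + WATER_AVERAGE_MASS


def mass_range_kda(variants):
    """Return (lowest, highest) average mass in kDa across a set of variants."""
    masses = [average_mass(seq) / 1000.0 for seq in variants]
    return min(masses), max(masses)


# Placeholder sequences for illustration only.
example_variants = ["YPTYPEKK", "YPTYPEKKYPTYPEKK"]
low, high = mass_range_kda(example_variants)
print(f"Mass range: {low:.2f} to {high:.2f} kDa")
```

This sketch ignores post-translational modifications and cross-linking, so it only approximates the masses of secreted byssal proteins.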

Dbfp2 Assembly
Dbfp2 has a highly repetitive block structure. The Dbfp2-f2 transcript contains the octapeptide sequence YPTYPEKK repeated consecutively five times, shown below in Table S6. De novo assembly during transcriptome construction likely could not fully assemble the protein because the highly repetitive YPTYPEKK domain at the DNA level is beyond the resolution of 150 bp paired-end Illumina sequencing. Dbfp2-f1 contains the signal peptide and start codon and ends with a single YPTYPEKK sequence. Dbfp2-f3 contains the same YPTYPEKK motif four times, highlighted in green and blue in Table S6. The Dbfp2-f3 C-terminus contains a triple tandem repeat of the sequence YPDYPEKK, highlighted in pink and yellow. A consensus sequence combining both of these motifs is YP(T/D)Y(P/T)EKK, in which the tyrosine, lysine, and glutamic acid residues are highly conserved. Tyrosine positioning throughout the central region of Dbfp2 is highly conserved, with the pattern YxxYxxxxY, where x represents any amino acid.
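The consensus motif and the tyrosine spacing described above can be expressed as a regular expression and a gap calculation. The sketch below is illustrative only: the test sequence is a toy concatenation of the two octapeptides named in the text, not the actual Dbfp2 sequence, and the helper names are assumptions.

```python
import re

# Consensus YP(T/D)Y(P/T)EKK written as a regular expression.
CONSENSUS = re.compile(r"YP[TD]Y[PT]EKK")


def consensus_hits(sequence: str):
    """Return (start index, matched octapeptide) for each non-overlapping hit."""
    return [(m.start(), m.group(0)) for m in CONSENSUS.finditer(sequence)]


def tyrosine_gaps(sequence: str):
    """Return the distances between consecutive tyrosine (Y) positions."""
    positions = [i for i, aa in enumerate(sequence) if aa == "Y"]
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy block: five YPTYPEKK repeats followed by three YPDYPEKK repeats.
toy_block = "YPTYPEKK" * 5 + "YPDYPEKK" * 3

hits = consensus_hits(toy_block)
gaps = tyrosine_gaps(toy_block)

print("Consensus hits:", len(hits))
print("Tyrosine gaps:", gaps)
```

In the YxxYxxxxY pattern, consecutive tyrosines are separated by 3 and then 5 positions, so a perfectly repeated block yields alternating gaps of 3 and 5.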

TYPEKKYPSYPEKKYPAYPPKNSYPGRYPWRR-stop
Note: Tandem repeats are highlighted in green/blue and yellow/magenta, respectively. Tyrosine residues have been bolded to emphasize their highly conserved positions. The underlined region of Dbfp2-f1 is the signal peptide predicted to be cleaved. Transcripts are staggered to demonstrate the overlap of the transcripts used to assemble Dbfp2.
Sequence properties were calculated after removing the predicted signal peptide sequence.
b PEAKS scoring method: a -10LogP score cut-off of 20 is equivalent to a P-value of 0.01.
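The relationship between the score cut-off and the P-value in the note can be written out explicitly. This is a minimal sketch, assuming the score is defined as -10 times the base-10 logarithm of the P-value, as the note implies; the function name is illustrative.

```python
def p_value_from_score(score: float) -> float:
    """Convert a -10*log10(P) score back to the corresponding P-value."""
    return 10 ** (-score / 10)


cutoff_p = p_value_from_score(20)
print(f"-10LogP = 20 corresponds to P = {cutoff_p:.2f}")
```

Under this definition, each 10-point increase in the score corresponds to a tenfold decrease in the P-value.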