Dictyocaulus viviparus genome, variome and transcriptome elucidate lungworm biology and support future intervention

The bovine lungworm, Dictyocaulus viviparus (order Strongylida), is an important parasite of livestock that causes substantial economic and production losses worldwide. Here we report the draft genome, variome, and developmental transcriptome of D. viviparus. The genome (161 Mb) is smaller than those of related bursate nematodes and encodes fewer proteins (14,171 total). In the first genome-wide assessment of genomic variation in any parasitic nematode, we found a high degree of sequence variability in proteins predicted to be involved host-parasite interactions. Next, we used extensive RNA sequence data to track gene transcription across the life cycle of D. viviparus, and identified genes that might be important in nematode development and parasitism. Finally, we predicted genes that could be vital in host-parasite interactions, genes that could serve as drug targets, and putative RNAi effectors with a view to developing functional genomic tools. This extensive, well-curated dataset should provide a basis for developing new anthelmintics, vaccines, and improved diagnostic tests and serve as a platform for future investigations of drug resistance and epidemiology of the bovine lungworm and related nematodes.


Supplementary Figures
Supplementary Figure S1. Co-transcription of predicted operon and non-operon genes The mean Pearson's correlation coefficient for every pair of genes in an operon in D. viviparus are plotted (green) against that of a pseudo-operon of the same size, consisting of randomly selected genes (red). The correlation coefficients for genes in predicted operons are significantly higher than for background.

Supplementary Figure S2. Maximum expression (FPKM) of Wolbachia-like and other genes throughout all life cycle stages of D. viviparus
The maximum expression level (given in fragments per kilobase per million reads mapped, FPKM) for each gene was determined and potted to compare the expression of Wolbachia-like and non-Wolbachia-like genes. Wolbachia-like genes were transcribed at low levels compared to other genes (average peak expression of 77.7 FPKM compared to 3,356.6 FPKM for other genes; P < 10 -10 , T-test using log-scale FPKM values).   Table S10). Following analytical processing 6 , reads were assembled using Newbler, invoking the cDNA-specific assembly option as previously described 7 .
cDNA sequencing on the Illumina platform  Table S10). Adapter trimming, sequence quality trimming, length filtering, complexity filtering, and contaminant filtering were performed as previously described 9 . The remaining high-quality RNAseq reads were aligned to the genome assembly using Tophat2 10 (version 2.0.8, default parameters) and assembled into full-length transcripts with Cufflinks (version 2.1.1 11 ) using an established protocol 12 .

Genome sequencing, assembly, and annotation
Whole genome shotgun fragment and paired-end libraries (3kb and 8kb insert) were constructed according to standard methods and sequenced on the Roche/454 platform 13  viviparus orthologs with the expected (background) percentage being the proportion of genes from the genome belonging to the OPF (averaged for each other nematode in the OPF), the number of "successes" being the number of D. viviparus genes in the OPF, and the number of "trials" being the total number of D. viviparus genes (14,171). This statistic is non-parametric, and the P values were corrected using FDR population correction 36 .

Prediction of D. viviparus operons
The known spliced leader sequences from clade V nematodes (3 SL1 and 36 SL2 sequences 37 ) were used to find related trans-spliced genes in D. viviparus as previously described 23 . The RNAseq reads that satisfied the following criteria for similarity to known SL1 and SL2 sequences were considered to be sourced from a gene trans-spliced with a D. viviparus spliced leader sequence: 1. A hit was reported by blat, using the options '-tileSize=6 -oneOff=1 -minScore=12'. Matches on either strand were considered hits.
2. The match on the target sequence (RNA-Seq read) started, at most, 0 (for SL1) or 8 (for SL2) bases from either end.
3. The match on the query (the known SL sequence) started at most 2 (for SL1) or 8 (for SL2) bases from the 5' end.
5a. For SL1, no gaps were allowed within the aligned part of the sequence.
5b. For SL2, the aligned part was to be at most 8 bases shorted than the SL2 sequence length (e.g. for a 22 base SL2, any read with aligned region smalled than 14 bases is rejected).
6. For SL2, at most a single gap of length 1 base was allowed on either the query or the target.
The reads thus identified were then aligned to the Cufflinks-assembled consensus transcripts and only those cases where the corresponding gene model is on the same strand as the putative SL sequence identified were considered to be potentially SL-trans-spliced genes.
Reciprocal best BLAST hits (using WU-BLAST with cutoff of 30% identity and 35 bits) between D.
viviparus genes and 3,677 C. elegans operon genes (WS230) 38 were used to infer D. viviparus operons as previously described 23 . Operons with at least two D. viviparus homologs that are adjacent to each other or are separated by one neighbor were counted. For every pair of genes in every inferred operon in D. viviparus, Pearson's correlation coefficient was calculated for FPKM values determined from our RNAseq data. This was compared to a "background" set of non-operon neighboring gene pairs. 5000 pairs of genes belonging to same operon were selected at random (with replacement) and compared to 5000 randomly selected neighboring gene pairs from the set of non-operon genes. This was also tested with 10 randomly selected instances of the background set for each operon with even more significantly different distributions.

Functional annotation of protein coding genes
Inferred protein sequences were compared to known protein sequences by BLASTP 39 against the GenBank non-redundant protein database (NR, downloaded April 15, 2014). Transmembrane domains and classical secretion peptides were predicted using Phobius 40,41 , and non-classical secretion signals were predicted using SecretomeP 42 . Putative proteases and inhibitors were identified and classified using the online MEROPS peptidase database server (release 9.11) 43 . Proteins were assigned to KEGG orthologous groups, pathways and pathway modules using KEGGscan 44 with KEGG release 68 45 .
Associations with InterPro protein domains and Gene Ontology (GO) classifications were inferred using InterProScan [46][47][48] . InterPro domain enrichment for gene sets was determined (for each domain represented in at least 5 genes) using a non-parametric binomial distribution test with the expected (background) percentage being the proportion of genes in the genome containing the InterPro domain, the number of "successes" being the number of genes in the target set containing the InterPro domain, and the number of "trials" being the total number of genes in the target set. A 0.01 p value cutoff was used for significance (after Benjamini-Hochberg false-discovery-rate (FDR) population correction for the total number of domains) 36 . Functional enrichment of GO terms related to particular subsets of proteins was calculated using FUNC 49 with an adjusted p-value cutoff of 0.01. C. elegans RNAi effector proteins were reported previously 50 . The longest isoform of each C. elegans RNAi effector protein was taken from WormBase release WS230 and matched to D. viviparus orthologs by inParanoid or by best bi-directional BLASTP match (e-value < 1e -05 ).
C. elegans proteins were screened against the collection of kinase domain models in Kinomer 51,52 . Custom score thresholds were applied for each kinase group and adjusted until an hmmpfam search 53 came as close as possible to identifying the known C. elegans kinases. The same procedure and cutoffs were applied to the D. viviparus proteins to identify kinases as previously described 23,30 .
High confidence sets of enzymes available to the 5 species were obtained by using local version of KAAS server 54 , with the bit-score threshold set at 35. modDFS was then used to find which of the nematode-relevant modules were strictly complete in the species 55 . Chokepoints of KEGG metabolic pathways were defined as a reaction that either consumes a unique substrate or produces a unique product 56 . The reaction database from KEGG v65 45 was used to identify chokepoints as previously described 57 . D. viviparus proteins were assigned EC numbers based on the output of KEGGscan (described above). Protein Data Bank 58 and DrugBank 59 were identified by WU-BLAST 60 against database sequences using a cutoff of 30% sequence identity over 75% of the length of the query.

Genomic variation analysis
Genomic DNA was isolated from four male worms and five female worms of the DvHannover 2010 strain obtained from a single host, and paired end libraries were generated and sequenced on the Illumina HiSeq 2500 platform. Raw reads were deposited in the SRA under BioProject ID PRJNA72587 (Supplementary Table S11). Data available from the SRA representing a strain from Cameroon were also included in our analysis (SRA accession ERX364141) 61 . Relevant barcodes and adapters were trimmed, and reads of less than 60bp in length and/or with uncalled bases were discarded using Flexbar 62 .
Remaining reads were aligned to the DvHannover2000 reference genome using BWA-MEM (version bwa0.7.5a, default parameters 63 ). Duplicate reads were marked for removal using Picard tools Contigs predicted to represent the X chromosome (i.e., hemizygous loci) were identified based on contig-wise mean homozygosity and total median depth of coverage in male samples ( Supplementary   Fig. S4). Using SNPs located on putatively autosomal contigs, sample relationships were examined by multidimensional scaling (MDS) analysis based on pairwise identity-by-state (IBS) distance in PLINK 66 . In order to minimize the effect of excess heterozygosity in female samples (due to embryonic DNA of nonmaternal origin) on MDS clustering, variants were called using HaplotypeCaller with the "-contamination" option that invokes removal of bases supporting putative variants. The contamination fraction parameter was set at 0.25, based on the restoration to the expected heterozygosity under Hardy-Weinberg equilibrium, as judged by F IS (inbreeding coefficient) ( Supplementary Fig. S5). F-statistics and nucleotide diversity were computed using VCFtools (v0.1.12b) 67 . Mitochondrial SNPs were identified using an identical procedure to that outlined above for the nuclear genomic variants except that the reference alignment was generated using a previously published mitochondrial genome (GenBank accession NC_019810

Gene expression, alternative splicing and differential expression analyses
Pre-processed, paired-end, Illumina RNAseq reads were mapped onto the D. viviparus genome assembly with Tophat2 10 using the gff annotation file to guide alignments. Refcov 71 was used to assess the genes' breadth of coverage based on all available RNAseq datasets, and genes showing ≥50% breadth of coverage were characterized as expressed. The number of reads associated with each feature was determined using HTSeq-Count 72 . Mapped read counts and fragments per kilobase per million reads mapped (FPKM) values are available through GEO (Series accession GSE73863). Differentially expressed genes were predicted using DESeq2 (version 1.4.5 73 ) with an adjusted p-value cutoff of 0.1 according to established protocols 74 . Statistically enriched InterPro domains and GO terms were determined as previously described. Over-representation of genes from each orthologous group classification (e.g., nematode-conserved, D. viviparus-specific, etc.) among sets of over-expressed genes was tested using a Binomial distribution test, with the expected (background) percentage being the proportion of genes from the genome belonging to the orthologous group classification, the number of "successes" being the number of overexpressed genes of interest belonging to the orthologous group classification, and the number of "trials" being the total number of overexpressed genes of interest. This statistic is non-parametric, and the P values were corrected using FDR population correction 36 . This test was also used in the same way to test enrichment for Wolbachia-like genes and genes grouped based on Tajima's D and π N /π S values.