Crowdsourcing and the feasibility of manual gene annotation: A pilot study in the nematode Pristionchus pacificus

Rödelsperger, Christian; Athanasouli, Marina; Lenuzzi, Maša; Theska, Tobias; Sun, Shuai; Dardiry, Mohannad; Wighard, Sara; Hu, Wen; Sharma, Devansh Raj; Han, Ziduan

doi:10.1038/s41598-019-55359-5

Download PDF

Article
Open access
Published: 11 December 2019

Crowdsourcing and the feasibility of manual gene annotation: A pilot study in the nematode Pristionchus pacificus

Christian Rödelsperger¹,
Marina Athanasouli¹,
Maša Lenuzzi¹,
Tobias Theska¹,
Shuai Sun¹,
Mohannad Dardiry¹,
Sara Wighard¹,
Wen Hu¹,
Devansh Raj Sharma¹ &
…
Ziduan Han¹

Scientific Reports volume 9, Article number: 18789 (2019) Cite this article

1903 Accesses
18 Citations
4 Altmetric
Metrics details

Subjects

Abstract

Nematodes such as Caenorhabditis elegans are powerful systems to study basically all aspects of biology. Their species richness together with tremendous genetic knowledge from C. elegans facilitate the evolutionary study of biological functions using reverse genetics. However, the ability to identify orthologs of candidate genes in other species can be hampered by erroneous gene annotations. To improve gene annotation in the nematode model organism Pristionchus pacificus, we performed a genome-wide screen for C. elegans genes with potentially incorrectly annotated P. pacificus orthologs. We initiated a community-based project to manually inspect more than two thousand candidate loci and to propose new gene models based on recently generated Iso-seq and RNA-seq data. In most cases, misannotation of C. elegans orthologs was due to artificially fused gene predictions and completely missing gene models. The community-based curation raised the gene count from 25,517 to 28,036 and increased the single copy ortholog completeness level from 86% to 97%. This pilot study demonstrates how even small-scale crowdsourcing can drastically improve gene annotations. In future, similar approaches can be used for other species, gene sets, and even larger communities thus making manual annotation of large parts of the genome feasible.

Quality assessment of gene repertoire annotations with OMArk

Article Open access 21 February 2024

ContScout: sensitive detection and removal of contamination from annotated genomes

Article Open access 31 January 2024

Gene Updater: a web tool that autocorrects and updates for Excel misidentified gene names

Article Open access 26 July 2022

Introduction

How well can biological knowledge be transferred across species? Are biological functions carried out by the same genes in different organisms? How fast do regulatory networks diverge? In order to address these fundamental questions, more than 20 years ago, the nematode Pristionchus pacificus has been introduced as a so-called “satellite” model organism to one of the most successful animal model systems, Caenorhabditis elegans^1,2. Since then, several comparative studies in developmental and ecological contexts have highlighted the importance of developmental system drift as a concept in evolution³ and have demonstrated that the divergence between Pristionchus and Caenorhabditis was accompanied by extensive chemical^4,5,6, genic^7,8,9, and morphological^10,11,12 innovations. The establishment of multiple genetic^13,14 and genomic tools and resources^15,16 by Sommer and colleagues motivated an increasing number of independent groups to adapt P. pacificus as a model system for comparative studies at a mechanistic level^{17,18,19,20,21}. However, reverse genetic approaches based on candidate genes with known functions in C. elegans^22,23 have been hampered not only by the huge amount of lineage-specific duplications^23,24,25,26, but also by missing and incorrect gene annotations. Traditionally, protein-coding genes are annotated by gene prediction algorithms that model general sequence features of transcription and translation start and end sites, as well as splicing signals^27,28,29. This can be complemented with evidence based approaches using transcriptomic and protein homology data^30,31. While automated annotation pipelines perform reasonably well to be useful for genetic screens^32,33,34 and evolutionary genomic analyses^35,36,37, their outcomes by far do not meet the standards of the gene annotations from classical model organisms such as C. elegans, Drosophila melanogaster, and Mus musculus that have been curated over decades by a large research community³⁸. In order to make the P. pacificus system more tractable for researchers without extensive genomic and phylogenetic expertise, we need to minimize the discrepancy in gene annotation quality between C. elegans and P. pacificus. To this end, we employed an integrative approach using comparative genomic and transcriptomic data combined with crowdsourcing to improve the P. pacificus annotations of C. elegans homologs and orthologs. First, we carry out a comparative assessment of 22 nematode genomes and demonstrate that P. pacificus has one of the best available nematode genomes. Second, we perform a genomewide screen for C. elegans genes where homologs and orthologs are not or incorrectly annotated in the P. pacificus genome. Third, a community-based manual curation of suspicious gene models reveals thousands of hidden orthologs and missing homologs. This pilot study can be extended to even larger gene sets and communities possibly employing citizen scientists, which would raise the quality of gene annotations to the next level³⁸.

Results

The quality of nematode draft genomes is highly heterogeneous

To obtain a general overview of the current status of nematode genome quality, we analyzed assemblies and gene annotations of 22 species (Fig. 1). The species were arbitrarily selected to span the diversity of the nematode phylum³⁹. We will further use this taxon sampling to perform an analysis of gene age, i.e. phylostratigraphic analysis where each phylostratum is defined by at least two outgroup species to minimize the effect of species-specific gene loss. Nematode genomes range in size between 43 and 320 Mb and contain between 11 and 37 thousand annotated protein-coding genes (Fig. 1). Analyses of assembly features and gene annotations indicate a wide range of qualitative variability. Some genomes are assembled and scaffolded to the level of chromosomes with high degrees of contiguity (the N50 value which is a measure of genome assembly contiguity is up to 29 Mb) whereas others are largely fragmented into up to 33 thousand scaffolds with N50 values below 0.1 Mb (Fig. 1). Similarly, analyses of completeness levels based on benchmarking univeral single copy orthologs (BUSCO⁴⁰) reveal substantial amount of either missing or duplicated genes and it is not totally clear to what extent these differences are of biological or technical nature⁴¹. In the case of Diploscapter coronatus, the apparent high fraction of duplicated genes could either be explained by hybridization of two divergent lineages or a whole genome duplication⁴². The genome of P. pacificus, which was generated by assembly from single-molecule, long-read sequencing data and scaffolding with the help of a genetic linkage map¹⁵, shows one of the highest levels of contiguity (47 scaffolds, N50 = 24 Mb). Gene annotations were generated by the MAKER2 pipeline^30,31 which combined gene prediction algorithms, transcriptome data, and protein homology data from other Pristionchus species^11,15,43. The completeness level of gene annotations (BUSCO completeness: 84%) is in the upper range when compared to most other nematode genomes (median 78%, interquartile range (IQR): 68–85%, Fig. 1). This demonstrates the relatively high quality of the current P. pacificus assembly and gene annotations.

Complementary genome and transcriptomes reveal potentially missing gene models

The completeness analysis as implemented in the software BUSCO⁴⁰ can also be applied to the raw genome assembly of P. pacificus. This yielded a combined completeness value of 93% (complete single copy and duplicates) as compared to 86% for the P. pacificus gene annotations and indicates towards the presence of incorrectly annotated or missing C. elegans orthologs in the genome of P. pacificus. Moreover, the fact that a recent de novo transcriptome assembly that was based on a strand-specific RNA-seq data set exhibited an even higher combined completeness level of 97% (Table 1) demonstrates even further room for improvement¹⁶. Finally, single-molecule, long-read transcriptome sequencing data were recently generated for P. pacificus which allows a much more accurate definition of gene structures from reference alignments of single reads⁴⁴. However, neither transcriptomic data set was available when the existing gene annotations (version: El Paco annotation v1/WormBase release: WS268) were generated and they could still be used for further improvement.

Table 1 Completeness analysis of different P. pacificus data set.

Full size table

To systematically identify potentially missing genes in the P. pacificus genome, we searched for C. elegans genes lacking homologs in the current P. pacificus gene annotations (BLASTP e-value < 10⁻⁵) but having a matching open reading frame in the de novo transcriptome assembly (Fig. 2a). While 12,504 (62%) C. elegans genes had BLASTP hits in both data sets, 634 (3%) C. elegans genes showed only BLASTP hits against the current gene annotations suggesting that these genes are properly annotated but are expressed so weakly that they were not captured in the transcriptome assembly of mixed-stage cultures⁴⁵.

Similarly, we identified 526 (3%) C. elegans genes that were only found in the transcriptome assembly and therefore represent candidates for missing gene annotations.

Community-based curation identifies missing genes in the P. pacificus genome

In order to improve the existing gene annotations, we chose to manually inspect and classify all 526 missing gene candidates in the P. pacificus genome browser (http://www.pristionchus.org). Thereby, we recruited and trained colleagues as community annotators, who would be capable to classify a genomic locus and to propose a correction to the existing gene models (see Methods). Lists of missing gene candidates were shared in online spreadsheets and documents, which allowed multiple annotators to inspect and correct candidate loci in parallel. 119 (25%) of the 486 non-redundant P. pacificus loci were classified as missing genes in predicted UTRs of annotated genes (Fig. 2b). We would speculate that this is caused by the fact that nematode genomes are compact and UTR regions can frequently overlap⁴⁵. This can cause artificial fusion of transcripts during the assembly of RNA-seq data. Consequently, only the largest ORF of such a gene is annotated as protein-coding and the rest is classified as 3′ and 5′ UTR. Alternatively, this problem could arise when a fused gene prediction from the sister species is used as homology information but MAKER2 fails to generate a complete gene model out of it. The C. elegans gene C29H12.2 is one example of a missing gene model residing in the UTR of a P. pacificus rars-2 homolog (Fig. 2c). The corresponding P. pacificus locus is spanned by two assembled transcripts that are homologous two C29H12.2 and rars-2, respectively. Both transcripts are also well supported by Iso-seq data and exhibit different expression levels^44,46. In such a case, we would propose a replacement of the old P. pacificus gene model by the two distinct transcripts.

After manual inspection of all 526 missing gene candidates, 201 (41%) of the 486 non-redundant P. pacificus loci were classified as missing genes (Fig. 2b). Presumably this kind of error could arise when the gene annotation pipeline is mostly dependent on gene prediction algorithms which fail to predict all genes in gene dense regions (e.g. operon structures) as the intergenic distances might span only a few hundred nucleotides, which could be too small for triggering the initiation of a new gene model. The C. elegans gene apn-1 is one example of a missed gene model in a gene dense region (Fig. 2d). Given that the P. pacificus homolog of apn-1 has good transcriptomic support, the correction in this case would simply add the transcript to the existing gene models. Other instances of missing homologs are due to borderline cases in the BLASTP searches where one search resulted in an e-value slightly below the e-value threshold (10⁻⁵) and the result of the other BLASTP search was slightly above the threshold. In total, we encountered 87 of such cases which we termed ‘weak similarity’ (Fig. 2b). For such cases no correction was proposed. In summary, we compiled corrections for 280 P. pacificus genes which were replaced by 714 new gene models. All these changes were submitted to WormBase and were incorporated in the release WS272.

Artificial gene fusions mask thousands of hidden orthologs

A small number of C. elegans genes with missing homologs in the current gene annotations (version: El Paco v1/WS268) of P. pacificus were classified as located in fused gene models (Fig. 2b). One potential explanation could be that an artificially fused gene prediction from the sister species is taken as homology data to annotate the orthologous locus in P. pacificus, but small errors cause parts of the gene model to be either incompletely or incorrectly annotated in P. pacificus resulting in a loss of detectable homology (Fig. 2c). Even if the homolog of a C. elegans gene is incorporated in the correct ORF within an artificially fused gene model, this could still cause a loss of one-to-one orthology as the corresponding P. pacificus gene can only be identified as one-to-one ortholog of a single C. elegans gene. Thus, we performed a second screen for C. elegans genes that had a predicted one-to-one ortholog (best-reciprocal hit) in the transcriptome assembly but not in current gene annotations (Fig. 3a). In total, 6075 (93%) of C. elegans genes with a predicted one-to-one ortholog (based on best-reciprocal hits) in current gene annotations, also had a predicted one-to-one ortholog against the de novo transcriptome assembly (Fig. 3a). Nevertheless, we found 2075 C. elegans genes that only had predicted one-to-one orthologs in the de novo transcriptome assembly. Excluding C. elegans genes that were identified already in the previous screen for missing homologs, this resulted in 1692 C. elegans genes with predicted one-to-one orthologs in the de novo transcriptome assembly but not in the current set of gene annotations (version: El Paco v1/WS268). Community-based classification and curation of the 1281 corresponding P. pacificus loci classified 912 (71%) cases as artificial gene fusions (Fig. 3b). One such an example is the C. elegans gene D1053.3. Its putative ortholog is fused with the P. pacificus mvb-12 ortholog (Fig. 3c). Apart from being orthologous to two different C. elegans genes, both P. pacificus genes are supported as non-overlapping transcripts by RNA-seq and Iso-seq, and are expressed at different levels. This confirmed the interpretation of an artificially fused annotation. The proposed correction in this case would be a replacement of the old gene model by the two non-overlapping transcripts. In total, we updated 1241 P. pacificus gene models and replaced them with 3305 new models. These updates were submitted to WormBase and will be released following curation. The new P. pacifcus gene annotation (version: El Paco v2) with 28,036 gene models is also available on http://www.pristionchus.org/download. The results of the BUSCO analysis (Complete and Single Copy: 95.4%, Duplicated: 1.7%, Fragmented: 2.0%, Missing: 0.9%) indicate that the new annotation represents a substantial improvement over the previous annotations¹⁵ (Table 1).

Improved gene annotations facilitate the establishment of a catalog of C. elegans homologs and orthologs in the P. pacificus genome

Since our primary focus was to improve the annotation of C. elegans orthologs in the P. pacificus genome, we wanted to use the updated gene annotation to generate a catalog of predicted orthologs between C. elegans and P. pacificus. As the identification of orthologs typically requires sufficient genomic and phylogenetic knowledge to retrieve relevant protein data sets and to perform reconstruction of gene trees^24,45,46, a genome-wide catalog of orthologs would be highly useful as a starting point for researchers without sufficient expertise. Previous comparisons between C. elegans and P. pacificus identified putative one-to-one orthologs for roughly 6000–8000 genes^44,46. To further characterize C. elegans genes without orthologs in P. pacificus, we additionally carried out a phylostratigraphic analysis⁴⁷ to estimate their relative age. Basically, phylostratigraphy uses absence-presence patterns of a gene to map its origin to an internal branch in a species tree⁴⁷. Our analysis revealed that 5258 (26%) of C. elegans genes do not have BLAST hits in Pristionchus or more distantly related species (Phylostrata I–IV, Supplemental Table 1). This strongly suggests that they are younger than the common ancestor between C. elegans and P. pacificus and consequently have no orthologs. Next, we applied two different approaches to predict orthologs between C. elegans and P. pacificus: best reciprocal hits and Markov clustering as implemented in the software orthAgogue⁴⁸. Computation of best reciprocal hits is a standard approach for predicting one-to-one orthologs across species^49,50. In order to capture more complex orthology relationships (e.g. many-to-many), more general approaches such as Markov clustering have been widely applied^48,51. Based on best reciprocal hits, we identified 8348 predicted one-to-one orthologs between both species (Supplemental Table 1) whereas the orthAgogue pipeline identified 7643 orthologous clusters, of which only 3345 corresponded to one-to-one orthologs. The large majority (98%) of these predicted one-to-one orthologs was also identified as best reciprocal hits and in 3260 (99%) cases, the same P. pacificus gene was predicted as one-to-one ortholog. The large discrepancy between the total number of best reciprocal hits and one-to-one orthologs defined by orthAgogue could be explained by the fact that best reciprocal hits do not take inparalogs into account⁴⁹. However, only 1049 (21%) of C. elegans genes that were not identified as one-to-one orthologs by orthAgogue could be explained by the presence of lineage-specific inparalogs, suggesting that orthAgogue with default settings might be too conservative for this analysis. This is further supported by the reanalysis of 57 one-to-one orthologous pairs that were previously confirmed by phylogenetic analysis⁴⁶. While 53 of the previously confirmed one-to-one orthologs were captured as best reciprocal hits, only 33 were also identified by orthAgogue. Taken together, the improved gene annotation facilitated the prediction of substantially more one-to-one orthologs (Fig. 3a, Supplemental Table 1). This resource can be taken as a starting point to identify candidate genes in P. pacificus.

Discussion

With C. elegans, C. briggsae, and P. pacificus, three genetically tractable and free living nematode model organisms have been well established and can be used to study the evolution of gene function at various time-scales^2,3,52. For example, recent reverse genetic approaches in P. pacificus have revealed functional divergence of genes with known roles in C. elegans dauer formation^22,23,53. In addition, mutant screens in P. pacificus for social behaviours have uncovered multiple orthologous C. elegans genes for which a behavioral phenotype had been overlooked previously^33,54. Together with complementary studies of the functional importance of novel genes^7,32,55, this makes nematodes an extremely powerful system to study genome evolution and gene function at a mechanistic level.

In order to facilitate fruitful functional studies across multiple model organisms, it is crucial to generate genomic resources (e.g. assemblies, annotations) and experimental genetic toolkits (e.g. forward and reverse genetics) of comparable quality. The chromosome-scale assembly of the P. pacificus genome¹⁵ was a major step towards making this species more tractable for other groups. In our study, we aimed to minimize the discrepancy between automatically generated gene annotations for P. pacificus and heavily curated annotations for C. elegans. To this end, we incorporated recently generated Iso-seq and RNA-seq data into current gene annotations by manual curation of suspicious candidate loci that were identified by comparative genomic analysis. While application of alternative annotation pipelines can generate overall better gene annotations^29,41, they cannot guarantee that gene annotations will only improve. In certain cases, new annotation pipelines will also cause new errors. In contrast, during manual inspection, each community curator has the choice to not propose any change of gene models in case of uncertainty. Thus, manual inspection should only lead to removal of errors and thus improve annotation quality without introducing biases elsewhere. While manual annotation is an incredibly tedious task that is probably not scalable to complete genomes³⁸, we minimized the workload by focusing on a small gene set of C. elegans orthologs, recruiting colleagues as community curators, and restricting the task just to the selection of alternative gene models that were generated from transcriptomic data^16,44 or previous rounds of gene prediction^56,57. In our opinion, the most crucial aspect of this community project is a good training of new annotators. We achieved this by personal training sessions between experienced and new annotators and the possibility to always discuss cases of uncertainty with other curators. For larger projects, initial training could be achieved by comprehensive online tutorials and communication via email, but this will likely be less efficient. In the case of the P. pacificus gene annotations, our study raised the gene count from 25,517 to 28,036 and increased the single copy ortholog completeness level from 86% to 97%. In the P. pacificus genome, the greatest source of error was the artificial fusion of neighboring genes. This type of error might be more prevalent in nematodes where genomes are compact⁹ and genes frequently overlap^37,45. Consequently, manual annotation of restricted gene sets has been proposed and applied previously to circumvent this problem⁵⁸. Given that nematode genomes tend to be pretty compact (Fig. 1), we anticipate that misannotation due to overlapping gene models should be much less pronounced in large vertebrate or plant genomes. Nevertheless, it would be interesting to apply similar screens for gene annotation artifacts to other systems and eventually this could reveal some incorrect annotations in the genomes of classical model organisms.

While this study was restricted to P. pacificus genes with putative orthologs in C. elegans, we cannot reliably estimate the fraction of erroneous gene models across the whole genome. Our results would suggest that the fraction of missing genes is around one percent (Fig. 2a,b) and the amount of gene models affected by artificial fusions may be up to 15% (Fig. 3a). However, as the P. pacificus genome has a higher gene density and a higher concentration of old genes at the chromosome centers^8,15, we hypothesize that errors due to artificial gene fusions should be much less pronounced at chromosome arms. To test this, an unbiased quantification of error rates across genomic segments would be needed. In future, we also plan to focus on large gene families and lineage-specific orphan genes⁵⁵ that were not explicit subjects of this study. Artificial fusions in these classes of genes could be identified by screens for unexpectedly long gene models or unusual protein domain content. For orphan genes abundant RNA-seq studies of different developmental stages^22,46, tissues^10,46, environmental conditions⁵⁹, sexes¹⁶, and genetic backgrounds^60,61 could be used to detect non-overlapping transcripts that exhibit anticorrelated expression within a single locus. Thus, while our study has demonstrated that community-based curation of gene annotations is feasible and can lead to substantial improvements, continued effort is needed to lift its quality to a level that would be similar to classical model organisms.

Methods

Comparative assessment of nematode genomes

We downloaded 22 nematode genomes and corresponding protein sequences from WormBase ParaSite (release WBPS13). For Steinernema carpocapsae, the latest version at WBPS14 was used. In case of multiple isoforms, we selected the longest isoform for further analysis. We ran BUSCO (version 3.0.1) in protein mode (option: -m prot) against the nematode_odb9 data set (N = 982 genes) to evaluate the completeness level of available protein sequences.

Genome browser integration of transcriptomic resources

To allow community annotators to propose alternative gene models, we integrated recent raw read alignments and reference guided transcript assemblies of Iso-seq data⁴⁴ and a de novo assembly of strand-specific RNA-seq data¹⁶ into our genome browser (implemented in jbrowse⁶²) on our webserver (http://www.pristionchus.org). Genomic coordinates for the de novo transcriptome assembly were generated by alignment to the P. pacificus reference genome (version El Paco) with the program exonerate⁶³ (version: 2.2.0, options: -m est2genome – dnawordlen 20 – maxintron 20000). To reduce the complexity of this data set, a condensed version of the de novo transcriptome assembly (selection of the isoform with the longest ORF as single representative isoform per gene, minimum peptide length of 60 amino acids, removal of single exon transcripts) with annotated best-reciprocal hits and best hits (BLASTP, e-value < 10⁻⁵) in C. elegans was also incorporated into our jbrowse instance. In addition, predicted protein sequences of previous versions of P. pacificus annotations (Hybrid1⁵⁶ and TAU2011⁵⁷) were mapped against the P. pacificus assembly by exonerate (version: 2.2.0, options: -m protein2genome – dnawordlen 20 – maxintron 20000). All data sets are available under the gene annotation track of the El Paco reference assembly in our genome browser. To evaluate the quality of the two recent transcriptome assemblies, we ran BUSCO (version 3.0.1, options -m trans) against the nematode_odb9 data set (N = 982 genes) for completeness assessment (Table 1).

Identification of missing and fused gene models in current gene annotations

We ran bidirectional BLASTP (e-value < 10⁻⁵) searches between C. elegans (version: WS260, longest isoform per gene) and two different P. pacificus data sets: the annotated proteins (version: El Paco v1, WS268) and the de novo transcriptome assembly¹⁶. For the de novo transcriptome, we reduced the redundancy resulting from different isoforms by selecting the longest ORFs per gene. Based on the different BLASTP searches, we first screened for C. elegans proteins with BLASTP hits against ORFs in the de novo transcriptome assembly but not against the currently annotated proteins. This yielded 526 candidate genes. In a second phase, we screened for C. elegans proteins with putative orthologs, defined by best-reciprocal BLASTP relationships, in the de novo transcriptome assembly but not in the annotated proteins, resulting in 2075 candidate genes.

Community-based manual curation of candidate loci

All C. elegans genes together with their candidate homologs and orthologs in the P. pacificus de novo transcriptome assembly were stored in a shared online spreadsheet. Community annotators were trained to find the corresponding locus in the genome browser by entering the transcript identifier and to manually inspect the surrounding regions that were defined by the encompassing P. pacificus gene model. The candidate locus was then classified as untranslated region (UTR) (the query transcript overlapped exons that were annotated as UTR), missing gene (the query transcript did not overlap any annotated exon), gene fusion (the query transcript did overlap protein-coding exons and homology was detected by BLASTP), misannotation (the query transcript did overlap protein-coding exons but no BLASTP hit was found due incorrect reading frame annotation or minimal overlap) or inconclusive. After classification, a correction was proposed that either added new genes (identifiers could be selected from the de novo assembled transcripts, Iso-seq assemblies, or previous versions of gene annotations) or replaced an existing gene model by one or more new genes. In such a case the objective was to lose as little annotated coding sequence as possible. Thus, new genes were selected from the above mentioned data set in order to cover as much coding sequence of the initial gene model as possible. If parts of the old gene model were not covered, BLAST searches against C. elegans and other Pristionchus species were used to split the old gene model into several parts with sequence matches to distinct C. elegans genes, or to extract partial protein sequences of the old gene model that were not covered. Such protein sequence stretches were given a pseudo identifier and were stored in a shared online document. All these sequences were later automatically reannotated by mapping them against the reference genome with the help of exonerate. In case that an existing gene model was replaced by multiple new gene models, we additionally selected one of the new gene models to inherit the WormBase identifier of the old gene model to allow WormBase to record the history of a given gene model. Usually, either the most conserved or the longest new gene model was chosen. Due to the fact that a single artificially fused gene could cause missing homologs and orthologs for multiple C. elegans genes, some loci were curated multiple times. We randomly picked some of these cases to compare the classifications and the corresponding corrections from multiple curators, which turned out to be largely consistent. In case of redundant curations, one out of many possible curations for a given locus was chosen based on the following criteria: preference towards higher number of new models, experience of the curator (number of curated loci), and transcriptional evidence over gene prediction.

Phylostratigraphy and orthology predictions

Outgroup data sets were defined by concatenating all protein sequence data from different species in the ladder-like phylogeny leading to C. elegans (Fig. 1). More precisely, we pooled all data from species in an induced subtree defined by branches with roman numbers in Fig. 1. We then ran a BLASTP search (e-value < 0.001) of C. elegans proteins (longest isoform per gene) against each of the outgroup data sets. Starting from the C. elegans genes with homologs in the most distant outgroup set (VIII), we iteratively defined phylostrata by comparison with the next, more closely related outgroup set. The results of this analysis are summarized in Supplemental Table 1. C. elegans specific genes are assigned to phylostratum I, whereas genes that are present in the most divergent outgroups are assigned to phylostratum VIII. Orthologs were defined after performing all pairwise BLASTP searches including self-searches (e-value < 10⁻⁵) between C. elegans and P. pacificus and extracting best reciprocal hits from the BLAST output files. Simultaneously, the program orthAgogue was run with default setting on the same input files⁴⁸.

Data availability

The strand-specific de novo transcriptome was submitted to the European Nucleotide Archive under the accession HAKN01000001¹⁶ and the Iso-seq data was submitted to the European Nucleotide Archive under the accessions ERX2315712 and ERX2315713⁴⁴. All data sets are also available at http://www.pristionchus.org/download. The initial set of P. pacificus gene annotations corresponds to WormBase WS268. Corrections from this study were submitted to WormBase and will be released following curation.

References

Sommer, R. J., Carta, L. K., Kim, S.-Y. & Sternberg, P. W. Morphological, genetic and molecular description of Pristionchus pacificus sp. n. (Nematoda: Neodiplogastridae). Fundam. Appl. Nematol. 19, 511–521 (1996).
Google Scholar
Sternberg, P. W. Why Caenorhabditis elegans is great and Pristionchus pacificus might be better. In Pristionchus pacificus (ed. Sommer, R. J.) 11, 1–17 (BRILL).
Sommer, R. J. Evolution of regulatory networks: nematode vulva induction as an example of developmental systems drift. Adv. Exp. Med. Biol. 751, 79–91 (2012).
Article CAS PubMed Google Scholar
Bose, N. et al. Complex small-molecule architectures regulate phenotypic plasticity in a nematode. Angew. Chem. Int. Ed Engl. 51, 12438–12443 (2012).
Article CAS PubMed PubMed Central Google Scholar
Yim, J. J., Bose, N., Meyer, J. M., Sommer, R. J. & Schroeder, F. C. Nematode signaling molecules derived from multimodular assembly of primary metabolic building blocks. Org. Lett. 17, 1648–1651 (2015).
Article CAS PubMed PubMed Central Google Scholar
Falcke, J. M. et al. Linking genomic and metabolomic natural variation uncovers nematode pheromone biosynthesis. Cell Chem Biol 25, 787–796.e12 (2018).
Article CAS PubMed PubMed Central Google Scholar
Mayer, M. G., Rödelsperger, C., Witte, H., Riebesell, M. & Sommer, R. J. The orphan gene dauerless regulates dauer development and intraspecific competition in nematodes by copy number variation. PLoS Genet. 11, e1005146 (2015).
Article PubMed PubMed Central CAS Google Scholar
Prabh, N. et al. Deep taxon sampling reveals the evolutionary dynamics of novel gene families in Pristionchus nematodes. Genome Res. 28, 1664–1674 (2018).
Article CAS PubMed PubMed Central Google Scholar
Rödelsperger, C., Streit, A. & Sommer, R. J. Structure, function and evolution of the nematode genome. In: eLS. John Wiley & Sons, Ltd: Chichester, https://doi.org/10.1002/9780470015902.a0024603 (2013).
Lightfoot, J. W., Chauhan, V. M., Aylott, J. W. & Rödelsperger, C. Comparative transcriptomics of the nematode gut identifies global shifts in feeding mode and pathogen susceptibility. BMC Res. Notes 9, 142 (2016).
Article PubMed PubMed Central CAS Google Scholar
Susoy, V. et al. Large-scale diversification without genetic isolation in nematode symbionts of figs. Sci Adv 2, e1501031 (2016).
Article ADS PubMed PubMed Central CAS Google Scholar
Bumbarger, D. J., Riebesell, M., Rödelsperger, C. & Sommer, R. J. System-wide rewiring underlies behavioral differences in predatory and bacterial-feeding nematodes. Cell 152, 109–119 (2013).
Article CAS PubMed Google Scholar
Witte, H. et al. Gene inactivation using the CRISPR/Cas9 system in the nematode Pristionchus pacificus. Dev. Genes Evol. 225, 55–62 (2015).
Article CAS PubMed Google Scholar
Srinivasan, J. et al. A bacterial artificial chromosome-based genetic linkage map of the nematode Pristionchus pacificus. Genetics 162, 129–134 (2002).
CAS PubMed PubMed Central Google Scholar
Rödelsperger, C. et al. Single-molecule sequencing reveals the chromosome-scale genomic architecture of the nematode model organism Pristionchus pacificus. Cell Rep. 21, 834–844 (2017).
Article PubMed CAS Google Scholar
Rödelsperger, C. et al. Phylotranscriptomics of Pristionchus nematodes reveals parallel gene loss in six hermaphroditic lineages. Curr. Biol. 28, 3123–3127.e5 (2018).
Article PubMed CAS Google Scholar
Namai, S. & Sugimoto, A. Transgenesis by microparticle bombardment for live imaging of fluorescent proteins in Pristionchus pacificus germline and early embryos. Dev. Genes Evol. 228, 75–82 (2018).
Article CAS PubMed Google Scholar
Lo, T.-W. et al. Precise and heritable genome editing in evolutionarily diverse nematodes using TALENs and CRISPR/Cas9 to engineer insertions and deletions. Genetics 195, 331–348 (2013).
Article CAS PubMed PubMed Central Google Scholar
Bui, L. T. & Ragsdale, E. J. Multiple plasticity regulators reveal targets specifying an induced predatory form in nematodes. Mol. Biol. Evol, https://doi.org/10.1093/molbev/msz171 (2019).
Article Google Scholar
Ishita, Y., Chihara, T. & Okumura, M. Serotonergic modulation of feeding behavior in Caenorhabditis elegans and other related nematodes. Neurosci. Res, https://doi.org/10.1016/j.neures.2019.04.006 (2019).
Liu, Z. et al. Predator-secreted sulfolipids induce defensive responses in C. elegans. Nature Communications 9 (2018).
Moreno, E. et al. DAF-19/RFX controls ciliogenesis and influences oxygen-induced social behaviors in Pristionchus pacificus. Evol. Dev. 20, 233–243 (2018).
Article CAS PubMed Google Scholar
Markov, G. V. et al. Functional conservation and divergence of daf-22 paralogs in Pristionchus pacificus dauer development. Mol. Biol. Evol. 33, 2506–2514 (2016).
Article CAS PubMed PubMed Central Google Scholar
Markov, G. V., Baskaran, P. & Sommer, R. J. The same or not the same: lineage-specific gene expansions and homology relationships in multigene families in nematodes. J. Mol. Evol. 80, 18–36 (2015).
Article ADS CAS PubMed Google Scholar
Namdeo, S. et al. Two independent sulfation processes regulate mouth-form plasticity in the nematode. Development 145 (2018).
Article PubMed CAS Google Scholar
Rödelsperger, C. Comparative genomics of gene loss and gain in Caenorhabditis and Other Nematodes. In Methods in Molecular Biology 419–432 (2018).
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
Article PubMed PubMed Central Google Scholar
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435–9 (2006).
Article CAS PubMed PubMed Central Google Scholar
Hoff, K. J., Lomsadze, A., Borodovsky, M. & Stanke, M. Whole-genome annotation with BRAKER. Methods Mol. Biol. 1962, 65–95 (2019).
Article PubMed PubMed Central Google Scholar
Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188–196 (2008).
Article CAS PubMed PubMed Central Google Scholar
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).
Article PubMed PubMed Central Google Scholar
Lightfoot, J. W. et al. Small peptide-mediated self-recognition prevents cannibalism in predatory nematodes. Science 364, 86–89 (2019).
Article ADS CAS PubMed Google Scholar
Moreno, E. et al. Regulation of hyperoxia-induced social behaviour in Pristionchus pacificus nematodes requires a novel cilia-mediated environmental input. Sci. Rep. 7, 17550 (2017).
Article ADS PubMed PubMed Central CAS Google Scholar
Kieninger, M. R. et al. The nuclear hormone receptor NHR-40 acts downstream of the sulfatase EUD-1 as part of a developmental plasticity switch in Pristionchus. Curr. Biol. 26, 2174–2179 (2016).
Article CAS PubMed Google Scholar
Baskaran, P. & Rödelsperger, C. Microevolution of duplications and deletions and their impact on gene expression in the Nematode Pristionchus pacificus. PLoS One 10, e0131136 (2015).
Article PubMed PubMed Central CAS Google Scholar
Weller, A. M., Rödelsperger, C., Eberhardt, G., Molnar, R. I. & Sommer, R. J. Opposing forces of A/T-biased mutations and G/C-biased gene conversions shape the genome of the nematode Pristionchus pacificus. Genetics 196, 1145–1152 (2014).
Article CAS PubMed PubMed Central Google Scholar
Prabh, N. & Rödelsperger, C. Divergence, and mixed origin contribute to the emergence of orphan genes in nematodes. G3 9, 2277–2286 (2019).
Salzberg, S. L. Next-generation genome annotation: we still struggle to get it right. Genome Biol. 20, 92 (2019).
Article PubMed PubMed Central Google Scholar
Smythe, A. B., Holovachov, O. & Kocot, K. M. Improved phylogenomic sampling of free-living nematodes enhances resolution of higher-level nematode phylogeny. BMC Evolutionary Biology 19 (2019).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Article PubMed CAS Google Scholar
McLean, F., Berger, D., Laetsch, D. R., Schwartz, H. T. & Blaxter, M. Improving the annotation of the Heterorhabditis bacteriophora genome. Gigascience 7 (2018).
Hiraki, H. et al. Genome analysis of Diploscapter coronatus: insights into molecular peculiarities of a nematode with parthenogenetic reproduction. BMC Genomics 18, 478 (2017).
Article PubMed PubMed Central CAS Google Scholar
Rödelsperger, C. et al. Characterization of genetic diversity in the nematode Pristionchus pacificus from population-scale resequencing data. Genetics 196, 1153–1165 (2014).
Article PubMed PubMed Central CAS Google Scholar
Werner, M. S. et al. Young genes have distinct gene structure, epigenetic profiles, and transcriptional regulation. Genome Res. 28, 1675–1687 (2018).
Article CAS PubMed PubMed Central Google Scholar
Rödelsperger, C., Menden, K., Serobyan, V., Witte, H. & Baskaran, P. First insights into the nature and evolution of antisense transcription in nematodes. BMC Evol. Biol. 16, 165 (2016).
Article PubMed PubMed Central CAS Google Scholar
Baskaran, P. et al. Ancient gene duplications have shaped developmental stage-specific expression in Pristionchus pacificus. BMC Evol. Biol. 15, 185 (2015).
Article PubMed PubMed Central CAS Google Scholar
Domazet-Loso, T., Brajković, J. & Tautz, D. A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages. Trends Genet. 23, 533–539 (2007).
Article CAS PubMed Google Scholar
Ekseth, O. K., Kuiper, M. & Mironov, V. orthAgogue: an agile tool for the rapid prediction of orthology relations. Bioinformatics 30, 734–736 (2014).
Article CAS PubMed Google Scholar
Remm, M., Storm, C. E. & Sonnhammer, E. L. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041–1052 (2001).
Article CAS PubMed Google Scholar
Tatusov, R. L. A Genomic Perspective on Protein Families. Science 278, 631–637 (1997).
Article ADS CAS PubMed Google Scholar
Li, L., Stoeckert, C. J. Jr & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003).
Article CAS PubMed PubMed Central Google Scholar
Verster, A. J., Ramani, A. K., McKay, S. J. & Fraser, A. G. Comparative RNAi Screens in C. elegans and C. briggsae Reveal the Impact of Developmental System Drift on Gene Function. PLoS Genetics 10, e1004077 (2014).
Article PubMed PubMed Central CAS Google Scholar
Sieriebriennikov, B., Markov, G. V., Witte, H. & Sommer, R. J. The Role of DAF-21/Hsp90 in Mouth-Form Plasticity in Pristionchus pacificus. Mol. Biol. Evol. 34, 1644–1653 (2017).
Article CAS PubMed PubMed Central Google Scholar
Moreno, E. & Sommer, R. J. A cilia-mediated environmental input induces solitary behaviour in Caenorhabditis elegans and Pristionchus pacificus nematodes. Nematology 20, 201–209 (2018).
Article CAS Google Scholar
Prabh, N. & Rödelsperger, C. Are orphan genes protein-coding, prediction artifacts, or non-coding RNAs? BMC Bioinformatics 17, 226 (2016).
Article PubMed PubMed Central CAS Google Scholar
Borchert, N. et al. Proteogenomics of Pristionchus pacificus reveals distinct proteome structure of nematode models. Genome Res. 20, 837–846 (2010).
Article CAS PubMed PubMed Central Google Scholar
Sinha, A., Sommer, R. J. & Dieterich, C. Divergent gene expression in the conserved dauer stage of the nematodes Pristionchus pacificus and Caenorhabditis elegans. BMC Genomics 13, 254 (2012).
Article CAS PubMed PubMed Central Google Scholar
Stoltzfus, J. D., Minot, S., Berriman, M., Nolan, T. J. & Lok, J. B. RNAseq analysis of the parasitic nematode Strongyloides stercoralis reveals divergent regulation of canonical dauer pathways. PLoS Negl. Trop. Dis. 6, e1854 (2012).
Article CAS PubMed PubMed Central Google Scholar
Sanghvi, G. V. et al. Life history responses and gene expression profiles of the nematode Pristionchus pacificus cultured on Cryptococcus yeasts. PLoS One 11, e0164881 (2016).
Article PubMed PubMed Central CAS Google Scholar
Serobyan, V. et al. Chromatin remodelling and antisense-mediated up-regulation of the developmental switch gene eud-1 control predatory feeding plasticity. Nat. Commun. 7, 12337 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Moreno, E., McGaughran, A., Rödelsperger, C., Zimmer, M. & Sommer, R. J. Oxygen-induced social behaviours in Pristionchus pacificus have a distinct evolutionary history and genetic regulation from Caenorhabditis elegans. Proc. Biol. Sci. 283, 20152263 (2016).
Article PubMed PubMed Central CAS Google Scholar
Buels, R. et al. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol. 17, 66 (2016).
Article PubMed PubMed Central CAS Google Scholar
Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005).
Article PubMed PubMed Central CAS Google Scholar

Download references

Acknowledgements

The authors would like to thank the complete Pristionchus community for their long-term interest in studying P. pacificus and thus motivating this work. Further thanks to Bogdan Sieriebriennikov for providing additional manual curations and to all members of the Sommer lab for general discussions. Finally, special thanks to Michael Paulini for incorporating the updated gene models into WormBase. This work was funded by the Max Planck Society.

Author information

Authors and Affiliations

Max Planck Institute for Developmental Biology, Department for Integrative Evolutionary Biology, Max-Planck-Ring 9, 72076, Tübingen, Germany
Christian Rödelsperger, Marina Athanasouli, Maša Lenuzzi, Tobias Theska, Shuai Sun, Mohannad Dardiry, Sara Wighard, Wen Hu, Devansh Raj Sharma & Ziduan Han

Authors

Christian Rödelsperger
View author publications
You can also search for this author in PubMed Google Scholar
Marina Athanasouli
View author publications
You can also search for this author in PubMed Google Scholar
Maša Lenuzzi
View author publications
You can also search for this author in PubMed Google Scholar
Tobias Theska
View author publications
You can also search for this author in PubMed Google Scholar
Shuai Sun
View author publications
You can also search for this author in PubMed Google Scholar
Mohannad Dardiry
View author publications
You can also search for this author in PubMed Google Scholar
Sara Wighard
View author publications
You can also search for this author in PubMed Google Scholar
Wen Hu
View author publications
You can also search for this author in PubMed Google Scholar
Devansh Raj Sharma
View author publications
You can also search for this author in PubMed Google Scholar
Ziduan Han
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Conceptualization, C.R.; Investigation, C.R., M.A., M.L., T.T., S.S., M.D., S.W., W.H., D.R.S. and Z.H.; Writing – Original Draft, C.R.; Writing – Review & Editing, C.R., M.A., M.L., T.T., S.S., M.D., S.W., W.H., D.R.S. and Z.H.; Supervision, C.R.

Corresponding author

Correspondence to Christian Rödelsperger.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Rödelsperger, C., Athanasouli, M., Lenuzzi, M. et al. Crowdsourcing and the feasibility of manual gene annotation: A pilot study in the nematode Pristionchus pacificus. Sci Rep 9, 18789 (2019). https://doi.org/10.1038/s41598-019-55359-5

Download citation

Received: 01 October 2019
Accepted: 20 November 2019
Published: 11 December 2019
DOI: https://doi.org/10.1038/s41598-019-55359-5

This article is cited by

Novel and improved Caenorhabditis briggsae gene models generated by community curation
- Nicolas D. Moya
- Lewis Stevens
- Erik C. Andersen
BMC Genomics (2023)
Analysis of repeat elements in the Pristionchus pacificus genome reveals an ancient invasion by horizontally transferred transposons
- Marina Athanasouli
- Christian Rödelsperger
BMC Genomics (2022)
The community-curated Pristionchus pacificus genome facilitates automated gene annotation improvement in related nematodes
- Christian Rödelsperger
BMC Genomics (2021)
Comparative genomics and community curation further improve gene annotations in the nematode Pristionchus pacificus
- Marina Athanasouli
- Hanh Witte
- Christian Rödelsperger
BMC Genomics (2020)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.