Distinct interacting core taxa in co-occurrence networks enable discrimination of polymicrobial oral diseases with similar symptoms

Polymicrobial diseases, which can be life threatening, are caused by the presence and interactions of multiple microbes. Peri-implantitis and periodontitis are representative polymicrobial diseases that show similar clinical symptoms. To establish a means of differentiating between them, we compared microbial species and functional genes in situ by performing metatranscriptomic analyses of peri-implantitis and periodontitis samples obtained from the same subjects (n = 12 each). Although the two diseases differed in terms of 16S rRNA-based taxonomic profiles, they showed similarities with respect to functional genes and taxonomic and virulence factor mRNA profiles. The latter—defined as microbial virulence types—differed from those of healthy periodontal sites. We also showed that networks based on co-occurrence relationships of taxonomic mRNA abundance (co-occurrence networks) were dissimilar between the two diseases. Remarkably, these networks consisted mainly of taxa with a high relative mRNA-to-rRNA ratio, with some showing significant co-occurrence defined as interacting core taxa, highlighting differences between the two groups. Thus, peri-implantitis and periodontitis have shared as well as distinct microbiological characteristics. Our findings provide insight into microbial interactions in polymicrobial diseases with unknown etiologies.

sequencing reads were processed with Trimmomatic to generate a mix of unfiltered paired and 86 unpaired reads obtained by filtering one of the two reads in each pair. Sequences that were presumed 87 to be of human origin (a total of 3 980 260 reads; 9.56% on average) were removed using DeconSeq 88 v.0.4.3 software with default parameters 6 . Paired and unpaired reads in the processed data were 89 divided using cmpfastq software (http://compbio.brc.iop.kcl.ac.uk/software/cmpfastq.php). Paired-end 90 reads were processed with fastq-join using default parameters to combine each read pair 7 , except in 91 cases where subsequent analyses required a paired-end format. 92 We used only paired reads and submitted each read separately because MG-RAST had no option for 119 accepting paired and unpaired reads in the same job. In the pipeline, hits with ≥ 70% identity, ≥ 50-bp 120 alignment length, and E values ≤ 1e−10 were considered significant. We therefore used a more 121 stringent threshold than was used previously 10 . All abundance values were normalised by conversion 122 to a reads per million reads value. Bar plots were used to illustrate the composition of mRNA profiles 123 by assignments made with level-1 SEED subsystems. Active pathways in the Kyoto Encyclopedia of 124 Genes and Genomes database were visualised using iPath2 11 . 125 126

Formation of mRNA clusters and functional annotation 127
The NCBI nr database (as of October 31, 2014) was used to assign protein functions to mRNA 128 clusters. Representative sequences of mRNA clusters were subjected to amino acid similarity searches 129 using BLASTX against the NCBI nr; hits with scores of ≥ 50 bit and E values ≤ 1e−10 were 130 considered significant-which is more stringent than previously used criteria 12 -and the function of 131 the best hit was assigned to the cluster. Abundance values of all mRNA clusters were normalised by 132 conversion to RPKM values. We determined the taxonomic origins of each mRNA, which were 133 available in the descriptions for corresponding subjects in the NCBI nr database. These RPKM values 134 were also used to determine the abundance of taxonomic origins. 135 significant, and the function of the best hit was used to assign the read; the thresholds used were more 141 stringent than those previously applied 15 . the ARB-SILVA using BLASTN; functional assignment against the VFDB and MvirDB using 170

Statistical analysis 174
Two-tailed paired t tests were used to test for significant differences between datasets of both disease 175 groups by comparing the following: the number of raw and original reads; the presence or absence of 176 sampled sites (assigning values of 1 or 0, respectively, for sampled sites in the maxillary anterior, 177 maxillary posterior, mandibular anterior, and mandibular posterior areas); five clinical parameters; 178 two alpha diversity indices; and the proportion of mRNAs considered as virulence genes. ANOSIMs 179 were used to evaluate the significance of dissimilarity between the disease groups by applying the 180 dissimilarity matrix value of 1 − Spearman's coefficient. Each test provided an R value between the 181 two disease groups (Supplemental Figure S2) within the range of −1 to 1 (typically between 0 and 1). 182 R = 0 indicated that similarities between and within groups were the same on average, while R = 1 183 indicated that samples in a group were more similar to each other than to those from different 184 group(s) 19 . In the ANOSIM, the P value was obtained from a permutation test, which was used to 185 evaluate the statistical significance of the calculated R value. Wilcoxon's signed-rank test was used to 186 test for significance differences in expression between the disease groups for each taxon or functional 187 gene (Supplemental Figure S2). 188 In all the statistical tests, P values < 0.05 were considered as statistically significant, except for 189 cases of multiple comparisons. For these, P values were converted to Q values using the  Hochberg method as a measure of false discovery rate; significance levels were P < 0.05 and Q < 0.10.

EMIRGE pipeline MG-RAST pipeline
Functional assignment with the SEED and KEGG

Assignment with NCBI nr, VFDB and MvirDB
Cluster formation by CD-HIT Removal of clusters exhibiting ≥97% nucleotide similarity to 16S rRNA in SILVA     Coordinate 2 (8.5%)