Advances in imaging techniques have led to a drastic increase in the detection of thyroid nodules; however, ~90% of these nodules are benign [1]. Due to diagnostic limitations, patients with benign thyroid disease are frequently subjected to unnecessary surgeries [1]. In addition, only a small fraction of papillary thyroid carcinoma (PTC), the most prevalent thyroid malignancy, is associated with poor outcome [2]. Hence, there is an urgent need to expand the repertoire of specific PTC markers, which can potentially aid in the management of thyroid cancer [1, 3, 4].

MicroRNAs (miRNAs) are frequently disrupted in many cancer types and are suitable biomarker candidates [5]. Recent large-scale sequencing has allowed for the discovery of novel miRNAs, which may have greater clinical utility than their well-characterized and widely expressed counterparts [6, 7, 8]. That these miRNAs were previously overlooked is explained by their lack of stable expression across different tissues and by the low-coverage techniques used in early studies [6, 7, 8]. Thus, we sought to identify previously unannotated miRNAs expressed in and specific to thyroid tissues.

We applied our custom discovery pipeline to identify previously unannotated miRNAs from 504 PTC and 59 adjacent non-malignant thyroid tissues (NT) processed by The Cancer Genome Atlas (TCGA; https://portal.gdc.cancer.gov/) (Figure S1 and Table S1) [7]. Using an algorithm analogous to miRDeep2 [9], publicly available through the miRMaster online tool [10], we generated a broad list of novel miRNA candidates (Supplementary Methods and Table S2). We curated this list based on the expression (RPM ≥ 1 in at least 10% of samples) and molecular features of these transcripts, resulting in a robust set of 234 previously unannotated miRNAs in thyroid samples (herein referred to as Tnov-miRs). Of these, 92 sequences were expressed exclusively in non-malignant samples, 17 exclusively in tumors, and 125 in both sample types.
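The expression criterion described above can be illustrated with a minimal Python sketch. It reads the threshold as "RPM ≥ 1 in at least 10% of samples"; the function names, the dictionary layout, and the toy values are assumptions made for illustration and are not taken from the published pipeline.

```python
from typing import Dict, List


def passes_expression_filter(rpm_values: List[float],
                             min_rpm: float = 1.0,
                             min_fraction: float = 0.10) -> bool:
    """Return True if at least `min_fraction` of samples reach `min_rpm`."""
    if not rpm_values:
        return False
    n_expressed = sum(1 for v in rpm_values if v >= min_rpm)
    return n_expressed / len(rpm_values) >= min_fraction


def filter_candidates(rpm_by_candidate: Dict[str, List[float]]) -> List[str]:
    """Keep the names of candidates that pass the expression filter."""
    return [name for name, values in rpm_by_candidate.items()
            if passes_expression_filter(values)]


# Hypothetical toy data: 10 samples per candidate (not real TCGA values).
toy = {
    "cand_A": [0.0] * 9 + [2.5],        # 1 of 10 samples at RPM >= 1: kept
    "cand_B": [0.2] * 10,               # never reaches RPM 1: dropped
    "cand_C": [1.0, 3.0] + [0.0] * 8,   # 2 of 10 samples: kept
}
kept = filter_candidates(toy)
print(kept)
```

Applying the same function separately to the tumor and NT groups would give the kind of "exclusive" versus "shared" split reported above, under the assumption that the threshold was evaluated per group.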

The expression range of the Tnov-miRs differs significantly (p < 0.001) from that of currently annotated sequences (Figure S2). Despite the differences in expression levels, the Tnov-miRs share genomic features characteristic of annotated miRNAs, such as their widespread genomic localization and distribution (Figure S3), seed sequence nucleotide composition (positions 2–7), GC content, secondary structure, and canonical processing by DICER (Figures S4 and S5, Table S3). To further exemplify the miRNA-like characteristics of these novel sequences, we found eight Tnov-miRs that share seed sequences with miRNAs of the miR-17 family, which is frequently deregulated in numerous cancers [11, 12]. As expected, these Tnov-miRs were also predicted to regulate a similar set of target genes (Figure S5). Collectively, these results support the characterization of these novel transcripts as miRNAs.
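The seed comparison can be sketched as follows: the seed is taken as nucleotides 2–7 of the mature sequence (1-based), and candidates are grouped by whether that 6-mer matches a reference seed. The reference seed AAAGUG is the 6-mer usually given for the miR-17 family and should be checked against miRBase before use; both candidate sequences are hypothetical and invented for this example.

```python
def seed(mature_seq: str) -> str:
    """Return the 6-nt seed (1-based positions 2-7) of a mature miRNA."""
    seq = mature_seq.upper().replace("T", "U")  # accept DNA or RNA alphabet
    if len(seq) < 7:
        raise ValueError("sequence too short to contain a seed")
    return seq[1:7]  # 0-based slice covering positions 2..7


# Assumption: 6-mer seed commonly attributed to the miR-17 family.
REFERENCE_SEED = "AAAGUG"

# Hypothetical candidate sequences (not actual Tnov-miR sequences).
candidates = {
    "toy_cand_1": "CAAAGUGCUAACGGUGCAGGUA",
    "toy_cand_2": "UCCGAUUGACCGUAGGCAUUGA",
}

shared = [name for name, s in candidates.items() if seed(s) == REFERENCE_SEED]
print(shared)
```

Because target prediction tools weight seed complementarity heavily, candidates that share a seed are expected to share many predicted targets, which is consistent with the overlap described above.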

Tnov-miRs were observed to be thyroid-specific, as evidenced by t-Distributed Stochastic Neighbor Embedding (t-SNE) analysis. Their combined expression pattern in each tissue was able to clearly distinguish non-malignant thyroid samples from other tissue types (Fig. 1). This tissue-specific nature highlights their potential relevance to thyroid biology.

Fig. 1

t-Distributed Stochastic Neighbor Embedding (t-SNE) analysis of the novel miRNAs in non-neoplastic tissues from TCGA. A suggestive thyroid-specific expression pattern of the 217 Tnov-miRs was detected in non-malignant thyroid tissues (RPM > 1 in more than 10% of the NT group). The distribution of the samples in the three-dimensional chart demonstrates a cluster of thyroid samples (red), clearly separated from 12 other non-malignant tissue types from TCGA (bile duct, bladder, brain, cervix, colon, head and neck, kidney, liver, lung, pancreas, prostate and stomach)

Interestingly, 125 of the 234 novel miRNAs detected were differentially expressed (BH-p < 0.001, fold-change > 1.5) between matched tumor and non-malignant thyroid samples (n = 59 pairs). Most sequences (n = 115) were found to be significantly under-expressed in PTC, while only a fraction (n = 10) showed over-expression (Table S4). Although miRNAs are known to have either tumor-suppressive or oncogenic functions, an overall downregulation of their expression has been described as a common feature of tumourigenesis through the impairment of miRNA biogenesis [13, 14]. A recent study demonstrated that restoration of DICER1 expression inhibits PTC invasion and metastasis, suggesting that the loss of functional miRNAs may be important in the development and progression of PTC [15].
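The selection rule used here (BH-adjusted p < 0.001 and fold-change > 1.5) can be illustrated with a small, dependency-free sketch. The statistical test that produced the raw p-values is not specified in this section, so they are treated as given inputs. Expressing fold-change as a tumor/NT ratio and applying the threshold symmetrically (ratio > 1.5 or < 1/1.5) are assumptions, as are the function names and toy numbers.

```python
from typing import List


def bh_adjust(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_differential(fold_changes: List[float], pvalues: List[float],
                      alpha: float = 0.001, min_fc: float = 1.5) -> List[str]:
    """Label each feature as 'over', 'under', or 'ns' (not significant)."""
    calls = []
    for fc, q in zip(fold_changes, bh_adjust(pvalues)):
        if q < alpha and fc > min_fc:
            calls.append("over")
        elif q < alpha and fc < 1.0 / min_fc:
            calls.append("under")
        else:
            calls.append("ns")
    return calls


# Hypothetical tumor/NT expression ratios and raw p-values for four miRNAs.
toy_fc = [0.4, 2.0, 1.1, 0.9]
toy_p = [1e-6, 2e-5, 1e-5, 0.5]
calls = call_differential(toy_fc, toy_p)
print(calls)
```

In this toy case the third feature is statistically significant after correction but is still labeled "ns" because its fold-change does not pass the 1.5 cutoff, which shows why both criteria are applied together.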

To validate the expression of these Tnov-miRs, we analyzed an independent dataset of 17 PTC and eight NT samples using the same criteria (GSE63511). Despite the limited sample size of this dataset, 16 of our Tnov-miRs were predicted and nine met the expression threshold criteria, confirming the existence of these sequences. Among them, Tnov-mir-74 and Tnov-mir-161 were over-expressed and Tnov-mir-231 was under-expressed in PTC (Fig. 2a and Figure S6). Further, the combined expression pattern of these three Tnov-miRs distinguished tumors from NT (TCGA cohort) using a Diagonal Linear Discriminant Analysis classifier with 91% accuracy (94.1% sensitivity, 66.1% specificity; Fig. 2b). The three-miRNA model outperformed each of the single markers (Figure S7). These findings point to a potential role for these Tnov-miRs in the diagnosis of thyroid lesions.
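For readers unfamiliar with Diagonal Linear Discriminant Analysis, the following is a minimal sketch of the general technique, not the authors' implementation. It assumes equal class priors and a pooled per-feature variance (a diagonal covariance matrix), and assigns a sample to the class with the smallest variance-scaled squared distance to the class mean. The three-feature toy values stand in for the expression of three miRNAs and are invented.

```python
from typing import Dict, List, Sequence, Tuple

Model = Tuple[Dict[str, List[float]], List[float]]


def fit_dlda(X: Sequence[Sequence[float]], y: Sequence[str]) -> Model:
    """Estimate per-class feature means and pooled per-feature variances."""
    classes = sorted(set(y))
    n_features = len(X[0])
    means: Dict[str, List[float]] = {}
    sum_sq = [0.0] * n_features
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        mu = [sum(col) / len(rows) for col in zip(*rows)]
        means[c] = mu
        for row in rows:
            for j, v in enumerate(row):
                sum_sq[j] += (v - mu[j]) ** 2
    dof = len(X) - len(classes)  # pooled within-class degrees of freedom
    variances = [s / dof for s in sum_sq]
    return means, variances


def predict_dlda(model: Model, x: Sequence[float]) -> str:
    """Assign x to the class with the smallest scaled squared distance."""
    means, variances = model

    def distance(c: str) -> float:
        return sum((v - m) ** 2 / s
                   for v, m, s in zip(x, means[c], variances))

    return min(means, key=distance)


# Hypothetical training data: three features per sample.
X_train = [
    [1.0, 1.0, 5.0], [1.2, 0.8, 5.5], [0.8, 1.2, 4.5],   # NT-like
    [3.0, 3.0, 1.0], [3.4, 2.6, 1.5], [2.6, 3.4, 0.5],   # PTC-like
]
y_train = ["NT", "NT", "NT", "PTC", "PTC", "PTC"]

model = fit_dlda(X_train, y_train)
print(predict_dlda(model, [1.1, 0.9, 4.8]), predict_dlda(model, [2.9, 3.2, 1.2]))
```

Ignoring covariances between features keeps the number of estimated parameters small, which is one reason this classifier is often chosen for expression data with few markers.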

Fig. 2

Potential clinical applications of the disrupted novel miRNAs. a Visual representation of the structure of the novel miRNA precursors generated by the novoMiRank tool and expression of Tnov-mir-74, Tnov-mir-161, and Tnov-mir-231 among non-malignant (blue), papillary thyroid carcinoma negative for BRAF mutation (gray and green) and with BRAF mutation (red) samples from the TCGA cohort. b Three-dimensional scatter-plot comprising the expression of the three novel miRNAs, demonstrating their potential diagnostic value due to the separation of PTC (red) from NT samples (blue). c Kaplan–Meier representation (ranging from 50 to 100% survival probability) of overall and disease-free survival from PTC patients according to Tnov-mir-231, demonstrating a worse prognosis in patients presenting lower levels of the predicted miRNA (P from log rank test). NS not significant; ***P < 0.001 (ANOVA followed by Tukey post-hoc); PTC papillary thyroid carcinoma, NT non-malignant thyroid tissue

We found that increased Tnov-mir-74 expression was observed exclusively in BRAFV600E-PTC (Fig. 2a); BRAFV600E is a frequent genetic event associated with more aggressive PTC cases [2, 16, 17]. The Tnov-mir-74 coding region was hypomethylated in PTC compared to NT samples (Δβ = −0.31; BH-p < 0.001), and its methylation level correlated negatively with miRNA expression (r = −0.605; p < 0.001). The same genomic region (probe-ID: cg25571414) had been previously reported to be hypomethylated in BRAFV600E compared to BRAFWT-PTC [18]. In fact, global hypomethylation has been shown to be induced by BRAF mutation [18], which can increase the expression of specific miRNAs [19]. In addition, decreased expression of Tnov-mir-231 was associated with lower overall and disease-free survival (Fig. 2c). We observed that the region encoding Tnov-mir-234 (22q13.2) presented hemizygous deletion in 18% of the PTC samples; however, no association was observed with its decreased expression (p = 0.715).
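The two methylation statistics quoted above can be sketched in a few lines. Δβ is taken here as the mean β-value in tumors minus the mean in NT, and r is computed as a Pearson coefficient; whether the authors used Pearson or a rank-based coefficient is not stated in this section, so that choice is an assumption, as are the function names and the toy values.

```python
import math
from typing import Sequence


def delta_beta(tumor_beta: Sequence[float], normal_beta: Sequence[float]) -> float:
    """Mean methylation difference (tumor minus normal) for one probe."""
    return sum(tumor_beta) / len(tumor_beta) - sum(normal_beta) / len(normal_beta)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-sample probe beta-values and miRNA expression (log scale).
beta = [0.9, 0.8, 0.6, 0.5, 0.3]
expression = [1.0, 1.5, 3.0, 3.2, 5.0]

r = pearson_r(beta, expression)
print(round(r, 3))
```

A negative r of this kind indicates that samples with lower methylation at the probe tend to show higher miRNA expression, which is the direction of the relationship reported for Tnov-mir-74.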

To predict their possible biological roles, we carried out a miRNA–mRNA integrative analysis of the Tnov-miRs and their predicted mRNA targets (miRanda algorithm). Negative correlations (p < 0.05) were observed between Tnov-mir-74, Tnov-mir-161, and Tnov-mir-234 and 36, 69, and 212 mRNAs, respectively (Table S5). A subsequent in-silico analysis (pathDIP, BH-p < 0.05) revealed an enrichment for cancer-related pathways (Table S6). These results highlight a potential biological role of the Tnov-miRs in thyroid cancer onset.

In conclusion, we identified 234 Tnov-miRs expressed in thyroid tissues with potential relevance to PTC biology. While our study was performed on a predictive platform and further validation will be required, we provide an additional resource for further exploration of these transcripts. Our results suggest that the incorporation of previously unannotated miRNAs in the development of diagnostic and prognostic stratification panels may increase their accuracy.