Identification of putative master regulators in rheumatoid arthritis synovial fibroblasts using gene expression data and network inference

Rheumatoid arthritis (RA) is a systemic autoimmune disease that affects the synovial joints of the body. Rheumatoid arthritis fibroblast-like synoviocytes (RA FLS) are central players in the disease pathogenesis, as they are involved in the secretion of cytokines and proteolytic enzymes, exhibit invasive traits, high rate of self-proliferation and an apoptosis-resistant phenotype. We aim at characterizing transcription factors (TFs) that are master regulators in RA FLS and could potentially explain phenotypic traits. We make use of differentially expressed genes in synovial tissue from patients suffering from RA and osteoarthritis (OA) to infer a TF co-regulatory network, using dedicated software. The co-regulatory network serves as a reference to analyze microarray and single-cell RNA-seq data from isolated RA FLS. We identified five master regulators specific to RA FLS, namely BATF, POU2AF1, STAT1, LEF1 and IRF4. TF activity of the identified master regulators was also estimated with the use of two additional, independent software. The identified TFs contribute to the regulation of inflammation, proliferation and apoptosis, as indicated by the comparison of their differentially expressed target genes with hallmark molecular signatures derived from the Molecular Signatures Database (MSigDB). Our results show that TFs influence could be used to identify putative master regulators of phenotypic traits and suggest novel, druggable targets for experimental validation.


Scientific RepoRtS
| (2020) 10:16236 | https://doi.org/10.1038/s41598-020-73147-4 www.nature.com/scientificreports/ regulatory interactions. The TF-TF co-regulatory network is composed of 989 TFs pairs which are linked together by co-regulatory interactions ( Fig. 2A). With CoRegNet we can identify the most influential TFs (Table 1) based on their influence score, and we can also retrieve information about the co-regulatory pairs in our network that share common targets (Table S1).
To identify which of the TFs have a strong influence in the RA and which in the OA synovium, we used the reference network to visualize the TF influences for the RA and OA samples separately (Fig. 2B,C). This way, we were able to identify the TFs that are active only in the RA synovium (Fig. 2B). POU2AF1 and BATF appear as the most influential TFs on the gene expression in synovial tissue of RA patients followed by IRF4, STAT1 and LEF1 (Table 1, Table S2). It is worth noting that the network consists of two distinct clusters connected by one TF, SOX5.
Estimation of TFs influence in synovial fibroblasts using microarray gene expression. Next, we analyzed the dataset GSE107105 comprising seven RA FLS subpopulations and five OA FLS subpopulations, clustered according to their protein surface markers (CD34, THY1 and CDH11). For this dataset we calculated TF influence in two levels: (a) disease level, analyzing all RA FLS samples against all OA FLS used as a control, and (b) for each fibroblast subpopulation, comparing between similar subsets of RA and OA FLS, where possible.
For the analysis, and the TF influence estimation we used the whole expression matrix. Using the reference network, we calculated TF activities, and we were able to identify four out of the five TF considered as master regulators in RA synovial tissue data (Tables 2). NFKB2 was estimated as the fifth most influential TF for the RA samples. POU2AF1 was not identified, but POU2F2, its target, was found in the top 20 of most active TFs (Table S3).
Next, we analyzed the TF activities of the different subpopulations depending on the surface markers, separately. We analyzed data from seven RA and five OA FLS subpopulations (Table 3). We were not able to retrieve expression data for the C and G OA FLS. Corresponding co-regulatory networks for each FLS subpopulation are available in the Supplementary material (Figures S1, S2, S3, S4, S5). Comparing TF activities between the subpopulations revealed a dominant RA FLS TF activity profile where all five TF previously identified as master regulators in RA appear in the top 5 for most of the RA FLS types, with some slight variations. However, populations E and F exhibit an entirely different profile that is highly similar to the ones obtained for OA FLS. POU2AF1 is absent from all subpopulations of RA FLS. Concerning OA FLS it is worth noting that the dominant TFs in all subpopulations are TFs of the HOX family and SIX1 and 3.
Calculating TFs influence from scRNA-seq data of isolated fibroblasts. Finally, we analysed scRNA-seq data of isolated RA and OA FLS from two RA and two OA patients (GSE109449). First, we performed a dimensionality reduction using the tSNE algorithm 31 (Fig. 3A). Then, we projected the patients' disease state (OA/RA) on the plot, and we observed the clustering between OA and RA patients. As seen in Fig. 3B, the patients are separated into three clusters, based on their disease state. RA is the dominant case in one of the three clusters and OA in a second, however all three clusters have both cases. Subsequently, we calculated the TFs influence for the whole dataset using CoRegNet 24 .
The last step was to project on the tSNE clustering result, the TFs influence calculated using the whole dataset and not just DEGs. The aim was to see if at the single-cell level, the influence of the TFs identified as master regulators in RA synovium and RA FLS, would be considerably stronger in RA versus OA samples. Indeed, as shown in Figs. 3C-F, there is a concordance between the patients' disease state (OA/RA) and TFs influence. In essence, each of the identified TF has a strong influence on the gene expression of FLS from RA patients (coloured in green) while it has a weak influence on the gene expression of FLS from OA patients (coloured in purple). The five TFs previously identified as master regulators (STAT1, BATF, IRF4, LEF1 and POU2AF1) appear in the top 15 of the most influential TFs for RA FLS based on TF influence estimation using CoRegNet (Table S4). STAT1 and BATF are ranked as the second and the fifth most influential TFs, respectively.
Comparison of the top 5 of the most influential TFs in different FLS subpopulations revealed GATA3 as omnipresent in both RA and OA samples. We also observed the presence of STAT1 and BATF in most RA samples. Lastly, we obtained two very different TF activity profiles for the subpopulation C of the OA FLS and the subpopulation E for the RA FLS (Table 4). We did not observe any apparent clustering according to the surface markers independently of the disease state (Fig. 3F).
Estimation of transcriptional activities of the master regulators identified by CoRegNet, using ISMARA and DoRothEA. In order to see if the TFs identified as master regulators in the RA synovium and FLS samples using the TF influence and the reference network obtained from CoRegNet 24 could be retrieved using an independent approach, we analyzed the datasets with two additional software (Table 5): (a) ISMARA 21 , which estimates the TFs activities by identifying the promoters, quantifying their genome-wide expression and then fitting a linear model, and (b) DoRothEA 20 a framework to estimate TF activities from gene expression data and consensus TF-target binding networks. www.nature.com/scientificreports/ Microarray data of synovial tissue samples. Using ISMARA, BATF was identified as the top TF with the highest Z-score. IRF4, STAT1 and LEF1 were also retrieved in various rankings. POU2AF1 was not identified but we were able to retrieve the target of POU2AF1, POU2F1 (Table S5). With DoRothEA, STAT1, BATF and IRF4 were found in the top 15 of the most active TFs in the RA synovium. POU2AF1 was not identified but its targets, POU2F1 and POU2F2 could be retrieved in the analysis in various rankings (Table S6).
Microarray data of FLS. Using ISMARA and DoRothEA, BATF, STAT1, IRF4 and LEF1 were retrieved in various rankings. Similarly to previous analyses, POU2AF1 was not identified, but its targets POU2F1 and POU2F2 were retrieved in the analysis in various rankings (Tables S7, S8).
scRNA-seq data of FLS. ISMARA analysis was not possible as the FASTQ files were not available. Analysis with DoRothEA identified STAT1, IRF4 and BATF in the top 25 (Table S9).

Involvement of the master regulators in primary RA FLS phenotypic outcomes.
To study the implication of the five identified master regulators to the major functional phenotypic outcomes of RA FLS, namely apoptosis, inflammation, proliferation and migration/invasion, we retrieved hallmark molecular signatures from MSig database 26 . We compared the signatures with the differentially expressed genes in the scRNAseq data set (GSE109449). We kept only the genes having a p-value lower than 0.05 and a log fold change (logFC) higher than 1.5 in absolute value (Tables S10-S13). Next, we searched to see which of the genes identified as contributing to hallmark signatures were under the control of the five TFs characterized as master regulators (Fig. 4). We identified 12 DEGs that belong to the pro-apoptotic signature, 13 DEGs that belong to the proliferation signature and 18 DEGs that belong to the inflammatory signature. Concerning the migration markers we identified THBS1 and also MMP-14, an enzyme   Table 3. FLS subpopulations according to their surface proteins and the corresponding top 5 TFs identified in samples of OA and RA FLS (microarray data). In bold are highlighted the 5 identified TFs and in italic the TFs that differ considerably in the respected subpopulation. www.nature.com/scientificreports/  The master regulators can cause downregulation of four DEGs belonging to the apoptotic signature, namely CLU, SFRP1, GPX3 and GSTM1 (Fig. 4A), activate the expression of one proliferation marker, CCNB1 and downregulate another, IGF1 (Fig. 4B), and activate eleven genes that contribute to inflammation including IL6, a major inflammatory cytokine in RA (Fig. 4C). Table 4. FLS subpopulations according to their surface proteins and the corresponding top 5 TFs identified in samples of OA and RA FLS (scRNA-seq data). In bold are highlighted the 5 identified TFs and in italic the TFs that differ considerably in the respected subpopulation. A  −  −  +  SP140, GATA3, BCL11B, HMGA1, SPIB GATA3, STAT1, SAP30, PRDM1, BATF   B  −  −  −  GATA3, BCL11B, TCF7, SPIB, HMGA1  GATA3, PML, STAT1, HMGA1,  TNFAIP3   C  −  +  −  TNFAIP3, RELB, NFKB2, NFKBIA,  POU2F2  GATA3, STAT1, HMGA1, SP140, BATF   D  −  +  +  PRDM1, GATA3, SAP30, IRF4, TOX  GATA3, STAT1, RELB, PML, BCL11A   E  +  −  +  GATA3, STAT1, FOXM1, LEF1, BATF  PER2, SAP30, ZNF44, ZFP2, E2F8   F  +  +  +  GATA3, FOXM1, LEF1, TOX,    www.nature.com/scientificreports/

Discussion
Synovial fibroblasts are the main stromal cells in the synovial tissue of the joints. In healthy joints, they are found in the synovial lining and sublining layer (intima and subintima), forming a layer with a thickness of one to two cells, intercalated with tissue-resident macrophages 10,33 . In RA, synovial fibroblasts start to form thicker layers (15-20 cells thick) mainly due to a higher rate of proliferation and also due to a characteristic apoptosis-resistant phenotype. Immune cells infiltrate the sublining layer of the synovium, and the lining layer thickens, even more, resulting in the hallmark pannus formation. Activated synovial fibroblasts produce a variety of ECM remodelling components such as MMPs, and also cytokines and chemokines, contributing actively to cell recruitment and infiltration of the joint, and sustained and prolonged inflammation of the joint. Activated synovial fibroblasts can also invade and destroy the adjacent cartilage, contributing to cartilage damage and subsequent bone erosion, two major debilitating symptoms of RA. In summary, RA FLS differ from healthy synovial fibroblasts or even, OA FLS in the four following phenotypic traits: (a) resistance to apoptosis, (b) migratory and invasive capacities, (c) inflammatory responses, (d) higher proliferation rate.
In this study, we tried to characterize transcription factors that could potentially control and regulate these distinct phenotypic traits of RA FLS based on the assumption that these traits are the result of a distinct TF activity profile in comparison to control.
While in synovial tissue, other cells, like macrophages, leukocytes, plasma cells and mast cells are present, RA FLS is the dominant cell population 34 . Transcriptomic data of synovial tissue were used to infer a global, reference network of tissue-specific transcriptional activity. As a control, we used synovial tissue from OA. OA is a degenerative, non-autoimmune and non-inflammatory disease of the joints that also causes cartilage damage and bone erosion 35 . Inflammatory mediators observed in OA joints are thought to be the downstream effectors of the pathogenesis of the disease. Even though it's called non-inflammatory arthritis, OA can still result in some inflammation of the joints. The difference is that this inflammation probably results from wear and tear, as a secondary effect of the disease and not as a characteristic feature. Moreover, OA is a degenerative disease, while RA involves autoimmunity. While it is hard to distinguish between a characteristic feature and a secondary effect, the fact that inflammation in RA stems from the activated autoimmune pathways, allows us to hypothesize that there should exist different mechanisms that drive the onset, progression and pathogenesis of the two diseases. For the analysis, we used CoRegNet 24 , a tool for network inference based on TF influence estimation. The tool CoRegNet has successfully been used to identify driver regulators for bladder cancer 36 , master regulators in fatty liver disease 37 and hepatocellular carcinoma 38 and sex-specific gene regulation in drosophila 39 , among other studies. Two additional software was used for calculating TF activities of the samples, for seeing if we could retrieve the TFs already identified with CoRegNet, independently of the reference network.
Two opposite TF influence profiles were obtained for the OA patients and the RA patients using the global network inferred from the synovial tissue gene expression. This result corresponds to the fact that we used only the differentially expressed genes between RA and OA patients to perform the network inference so that only autoimmune changes specific to RA are revealed.
Five TFs having a strong influence on RA gene expression were detected. BATF, a member of the AP-1 family transcription factor that controls the differentiation of lineage-specific cells in the immune system. BATF mediates the differentiation of T-helper 17 cells (Th17), follicular T-helper cells (TfH), CD8 + dendritic cells and class-switch recombination (CSR) in B-cells 40 . In chondrocytes, overexpression of BATF leads to increased mRNA and protein levels of the matrix-degrading enzymes MMP3, MMP13 and ADAMTS5, which play crucial roles in OA cartilage destruction 41 . BATF overexpression in joint tissues causes synovial inflammation suggesting that BATF could contribute to inflammatory arthritis. BATF has been shown to regulate synovitis, synovial hyperplasia, angiogenesis, cartilage destruction, and bone erosion in the joint tissues 42 .
STAT1, a member of the STAT protein family that can be activated by various ligands including interferonalpha, interferon-gamma, EGF, PDGF and IL6. High expression of STAT1 is intrinsic to RA fibroblast-like synoviocytes in the intimal lining layer, whereas activation of the pathway by phosphorylation is an active process 43 . Raised levels of total STAT1 protein and both its activated tyrosine and serine phosphorylated forms were also seen in RA synovium as compared with controls. STAT1 is predominantly abundant in T and B lymphocytes in focal inflammatory infiltrates and in fibroblast-like synoviocytes in the intimal lining layer 44 . POU2AF1 (POU Class 2 Homeobox Associating Factor 1) is a transcriptional coactivator that associates specifically with either POU2F1 or POU2F2. It boosts the POU2F1 mediated promoter activity and to a lesser extent, that of POU2F2. POU2AF1 was identified when comparing arthritis versus healthy or ACPA + individuals, suggesting that the downregulation of such genes starts after the onset of symptoms in RA patients. Also, a significant correlation was identified for POU2AF1 and disease progression with a downward trend for those with established RA 19 .
IRF4 is a TF is required for proper maturation and differentiation of immune cells. It is expressed in cells relevant for IFN signature in autoimmunity such as dendritic cells, monocytes/macrophages, granulocytes and B-cells. Moreover, IRF4 loci are associated with genetic susceptibility to systemic autoimmune diseases 45 . IRF4 has been related to NFkB pathway and can also interact with MyD88, an adaptor protein crucial for the activation of IRGs 45 .
The last identified TF is LEF1 for Lymphoid enhancer-binding factor-1 (LEF1), a 48-kD nuclear protein that is expressed in pre-B and T cells. LEF1 participates in the Wnt signalling pathway which regulates fibronectin and metalloproteinase expression in RA FLS 46 .
An interesting observation concerning the reference network is that it consists of two distinct clusters united by a single TF, SOX5. While SOX5 did not appear to have a strong influence regarding the target genes, it can be considered as a structural hub of the reference network. SOX5 was reported recently to contribute to the www.nature.com/scientificreports/ migration and invasion of FLS in collagen-induced arthritis 47 . SOX5 has also been suggested as an important regulator of IL-6-induced RANKL expression in RA FLS 48 . The five TFs identified as master regulators can be considered disease-related for RA, as for the inference of the reference network we used the DEGs between samples of synovial tissue from RA and OA patients.
To verify that these TFs also play a role in RA FLS and they were not the result of the contribution of other cells present in the synovial tissue samples, the second step of our study involved the analysis of transcriptomic data from RA and OA FLS subpopulations. These datasets are related to Mizoguchi et al. study 49 where seven fibroblast subsets with distinct surface protein (CD34, THY1 and CDH11) phenotypes were isolated. Using the reference network obtained from the synovial tissue as a basis, we calculated the TF influence of the FLS transcriptome. We performed this step twice, one for comparing TF profiles in RA FLS versus OA FLS, and subsequently for comparing TF activities between subpopulations of RA and OA FLS sharing the same surface markers.
The first part of the analysis revealed that four out of the five identified TFs as master regulators for the RA synovium were also characteristic for the RA FLS samples. POU2AF1 was not retrieved in the microarray dataset for FLS. Instead, the TF NFKB1 appeared in the top 5. Analysis comparing RA and OA FLS according to their surface markers revealed two RA FLS subpopulations, E and F respectively, with a significantly different TF activity profile in comparison to the other RA FLS subpopulations.
The analysis of the scRNA seq data revealed STAT1, BATF, IRF4, LEF1 and POU2AF1 in the top 15 of the most influential TFs for RA FLS, with STAT1 and BATF ranking as the second and the fifth most influential TF, respectively. Regarding the subpopulation-specific TF activity profiles, STAT1 and BATF were dominant in RA samples. The transcription profiles of the subpopulation C of the OA FLS and the subpopulation E of the RA FLS were very different in comparison to the other subpopulations. Lastly, we did not observe any apparent clustering according to surface markers.
Transcription factor activity of the master regulators was also estimated with two additional software, ISMARA and DoRothEA. Unlike these tools, CoRegNet identifies cooperative regulators of genes. It considers, rather than the expression of regulators, their influence to detect co-regulators sharing a certain number of common targets. The results are not directly comparable since the three software use different algorithms to assess TFs activity, however most of the TFs identified with CoRegNet as having a strong influence in RA FLS samples were also retrieved from the ISMARA and DoRothEA analysis, confirming in an independent manner that these TFs do play an active role in RA FLS.
Analysis of the DEGs in the scRNA-seq data revealed genes implicated in hallmark molecular signatures related to major functional phenotypic outcomes of RA FLS, namely apoptosis, inflammation, proliferation and migration/invasion. Further analysis using the co-regulatory network obtained with CoRegNet highlighted genes under the control of the master regulators that are involved in apoptosis (suppressing pro-apoptotic markers), proliferation (suppressing proliferation markers) and inflammation (activating inflammatory markers). Very few characteristic migration markers were differentially expressed and none under the control of the master regulators identified.
In conclusion, this study contributes to a better understanding of RA FLS regulation by identifying diseaseand cell-specific master regulators, some highlighted for the first time as playing a pivotal role in fibroblasts. In an effort to limit misleading results and misinterpretations, statistical tests were performed-where feasible-to ensure homogeneity of the datasets used for sex and age (see Methods section and supplementary Tables 14-16). It should be noted, though, that besides age and sex, medical treatments and disease duration may potentially affect the TF levels in the patients. However, there are technical difficulties in accessing this type of data (often missing from original studies, not published along the datasets in public repositories, not available for all selected datasets, supplementary data not accessible) that do not allow for further adjustments using appropriate statistical tests. To ensure biological relevance of our results we have used, besides our main methodology, two independent software for cross validation, and finally we have cross checked our findings with experimental results in peer-reviewed RA studies. Experimental validation of the TFs identified as putative master regulators using in vitro settings and additional analyses are required to understand how different TF activity profiles coordinate gene expression in RA synovium and RA FLS subsets, determining disease and cell-specific phenotypic traits, cellular destiny and disease severity.

Methods
Data description and analysis for the inference of the co-regulatory reference network. For the inference of the co-regulatory reference network, we used three genome-wide transcriptomic datasets from the Affymetrix HG-U133 A/B array comprising a total of 79 samples of synovial tissue, including 20 Controls (C), 26 OA and 33 RA patients. The first dataset (Berlin dataset, GSE55235) includes 30 samples: 10 RA, 10 OA and 10 C. The second dataset (Jena dataset, GSE55457) includes 33 samples: 10 C, 13 RA and 10 OA. The third dataset (Leipzig, GSE55584) includes 16 samples: 6 OA and 10 RA. We used R version 3.5.3 to perform the exploratory data analysis and the differential expression analysis, with an eight-core processor, with 32Go of RAM running on Windows 10 (64 bits) and the Affy package version 1.62 50 . We normalised the genes expression using the mas5 function from the Affy package. Then we deleted the weakly expressed genes with the function mas5calls which performs the Wilcoxon signed rank-based gene expression presence/absence detection algorithm. Next, we performed differential expression analysis to identify genes that are differentially expressed between RA and OA patients using as a significance threshold an adjusted p-value (FDR) equal to 0,05. As the number of controls was lower than the number of OA samples, we decided to use OA samples as reference. OA remains a reasonable control for RA since OA is a non-inflammatory, non-autoimmune disease. We annotated the lists of DEGs using the file provided by Affymetrix. The annotation allowed us to retrieve the GenBank accession number and Gene Symbol for each probe. To address the problems of several probes corresponding to www.nature.com/scientificreports/ the same gene and of several genes having several symbols (aliases) we averaged the expression of each gene by calculating the mean of all the gene's probes. After that, for each gene, we kept all the possible aliases. The differential expressed gene list contained 2865 gene symbols, and it was used for the inference of the co-regulatory network.
Inference of the co-regulatory reference network. CoRegNet is a Bioconductor R-package. Based on a transcriptomic dataset, it can be used to reconstruct a co-regulatory network and integrate regulation evidence. The package uses the hLICORN algorithm 51 , which identifies cooperative regulators of genes. One key breakthrough in the exploration of networks was to consider, rather than the expression of regulators, their influence by evaluating the expression of target genes to detect master regulators. We used the CoRegNet package 24 version 1.20.0 to infer the network. We gave it as input the annotated expression matrix with the gene symbol as row names and the samples (OA/RA) as columns. Then, we run in parallel on 48 cores, the function hLICORN to obtain the co-regulatory network. Regulatory and cooperative evidence were used for gene regulatory network and co-regulatory network enrichment.
Calculating TF activities with ISMARA and DoRothEA. To validate the results and see if the TFs identified as master regulators in the reference network with CoRegNet could be retrieved with a different approach, we employed in our analysis two additional software which calculates TFs activity: ISMARA 21 and DoRothEA 20 . ISMARA is an online tool that models genome-wide expression or ChIP-seq data, in terms of computationally predicted regulatory sites for transcription factors. The input required for running ISMARA is either expression data (microarray CEL files) or ChIP-seq data to identify the key transcription factors driving the observed expression. Results include, for each regulatory motif, inferred activities across the input samples, predicted genome-wide targets, enriched pathways and functional classes of genes, and direct interactions between regulators. This tool uses a linear model and performs a Bayesian inference of the motif activities to obtain both bestfit activities and error-bars on the activities. The second tool, DoRothEA, is an R package containing a list of transcription factors and their transcriptional targets. The target genes are organized into regulons coming from different types of evidence like literature curated resources and interactions inferred directly from gene expression. Based on the expression of each regulon, the tool estimates the activity of its corresponding TF.
TFs influence generation using fibroblasts gene expression microarray dataset. We used microarray data of freshly isolated synovial fibroblast subsets in patients with RA and OA to refine the previously obtained results. The dataset GSE107105 includes 34 samples from three OA and three RA donors. First, we generated TF influence profiles for RA and OA patients and then, for each fibroblast subpopulation. Next, we used the co-regulatory network inferred at the previous step and the influence of the co-regulators to detect possible variations.
TFs influence generation using scRNA-seq dataset. We used scRNA-seq data we got from the GEO  Tables 3 and 4). The first step of the analysis was to preprocess the data. We used Single Cell Experiment R package 52 . It is specifically designed for storing data from sc experiments. Then we filtered low abundance genes because most of them are non-informative and are not representative of the biological variance of the data. We only kept genes having an average expression level higher than one. We also filtered genes that are in a small number of cells by keeping only the ones expressed in at least five cells. A complementary way to filter genes is to keep only proteincoding genes. scRNASeq profiles not only protein-coding genes but also long non-coding RNA and pseudogenes. These genes are often much noisier than protein-coding and might be interesting to remove them. To do that, we annotated the expression matrix using Biomart R package and kept the genes having "protein-coding" as biotype. After that, we performed quality control using isOutlier function from Scater R package to filter lowquality cells and low-quality features.
The following step of our analysis was to normalize the data using a size factor approach. Once the quality control finished, we identified the highly variable genes (HVG). To do that, we calculated the variance and the mean expression for each gene. Then, we performed a dimensionality reduction using the tSNE algorithm 31 . In addition to the tSNE clustering, we used hierarchical clustering and SC3 clustering, an unsupervised clustering method for scRNA-seq data 53 . Finally, we used scRNA-seq data and CoRegNet to calculate the influence of the transcription factors present in this dataset on the gene expression of FLS from RA and OA patients. To validate the results we got from this dataset, we applied ISMARA and DoRothEA for calculating the TFs activity.
Proliferation, apoptosis, migration/ invasion and inflammation markers identification :. First, we retrieved hallmark gene sets from MSig database 26 . Two hundred genes were retrieved for the inflammatory pathway, 171 plus 3 fibroblast specific genes for the apoptosis pathway, 40 fibroblast specific genes for the migration/invasion pathway and 85 fibroblast specific genes for the proliferation pathway.The expression profiles of these markers come from: Genotype-Tissue expression (GTEx) project data 54 , Human Tissue compendium data 55 , The NCI-60 Human Tumor Cell Lines Screen data 56 and the Global Cancer Map from Broad Institute 57 . The fibroblast specific gene sets were contributed by the Gene Ontology consortium 58 .
We performed a differential expression analysis using the Limma package 59 to the scRNA-seq data (GSE109449) to see which of the corresponding markers have differential expression between RA and OA Scientific RepoRtS | (2020) 10:16236 | https://doi.org/10.1038/s41598-020-73147-4 www.nature.com/scientificreports/ patients. We kept only the genes having a p-value lower than 0.05 and a log fold change higher than 1.5 in absolute value.Then, for the apoptosis pathway, we kept all the overexpressed antiapoptotic markers and all the underexpressed proapoptotic ones. For the inflammation, proliferation and migration, we only kept the overexpressed markers having a positive regulation and the underexpressed markers having a negative regulation on these pathways. Finally, we searched to see which of these genes are under the control of the five identified master regulators based on the co-regulatory network inferred by CoRegNet.
Correction for the age and sex of the patients. The distribution of the sex between RA and OA patients is similar in the synovial tissue samples (p-value Fisher test = 0.52) and in the fibroblasts microarray samples (p-value Fisher test = 1). In the fibroblast single cell samples, all individuals were females. Therefore, we did not adjust for sex. Concerning the age, we did not have information for each individual in the synovial tissue samples (only the mean age at onset for RA and OA). Therefore, we did not adjust for age.
A pairwise comparison was also performed for the sex and age between the OA patients of the 3 datasets and the RA patients of the 3 datasets (for age, the synovial tissue samples were not considered as explained above). The sex and age were homogeneously distributed between OA patients (p-value Fisher test > 0. 45  www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.