Introduction

The bone extracellular matrix (ECM) consists of both organic compounds, which make up approximately 40% of the matrix, and inorganic compounds, which account for the remaining 60%.1 The organic fraction of the ECM is mostly composed of collagen (90%) and hundreds of non-collagenous proteins2 that play structural and regulatory roles.3,4 Liquid chromatography tandem mass spectrometry (LC-MS-MS) proteomics has eased the elucidation of regulatory [.1] mechanisms, therapeutic strategies, and biomarkers for bone regeneration and osteoporosis research.5 Several studies of bone proteomics have reported using secreted proteins from cells, including mesenchymal stem cells, osteoblasts, and osteoclasts.6 However, since the bone ECM contains minerals deposited on highly crosslinked collagen fibrils, it is highly challenging to solubilize for proteomics analyses.7 Usually, a combination of decalcification and chemical treatments is necessary for the extraction of proteins and LC-MS-MS-based proteomics.8 Thus, the availability of bioinformatics resources is necessary to facilitate comparative studies of bone extracellular matrix proteomes, both within human populations and across different model organisms. Although several non-collagenous proteins of the bone ECM proteome have been identified, the list and roles of most proteins,1 as well as individual and cross-species variations, have not been fully explicated. Therefore, the identification of non-collagenous proteins in humans and model organisms will significantly advance the field of bone regeneration and osteoporosis.

In this article, we introduce the most comprehensive [.1] database of putative bone ECM proteins in 39 species, including Homo sapiens and the most common animal models (e.g., Danio rerio, Mus musculus, and Xenopus laevis), useful for the study of osteoporosis.9 Due to deer antlers being a speculated model of bone regeneration and osteoporosis,10,11,12,13,14 the database includes proteins from six species of the Cerevidae family. The Phylobone database, which includes 255 (28 collagenous and 227 non-collagenous) proteins, presents information on protein sequences, functional characterizations, and potential drugs that interact with bone ECM proteins. The database provides a robust tool for the study of bone weakness and regeneration, and will be a suitable resource for the identification of novel target proteins and therapeutic peptides for the treatment and prevention of osteoporosis. Thus, Phylobone will support emerging therapies targeting novel disease mechanisms to provide a powerful strategy for osteoporosis management in the future.15

Materials and methods

Preliminary list of bone ECM proteins

A list of 255 seed proteins previously identified in the literature as proteins present in the ECM was gathered from the UniProt database.16 These seed proteins were previously identified in the proteome of the bone ECM of D. rerio (n = 243), H. sapiens (n = 255), and Cervus nippon (n = 57)17,18 (Table S1). UniProt codes of D. rerio (Zebrafish) proteins that were obsolete (or fractions of protein sequences) were updated with current UniProt codes. Additional putative bone ECM proteins from H. sapiens were identified from the literature 11,18,19,20,21 and gathered from the UniProt database.16 Each protein group was initially identified based on the human bone ECM proteome, with a unique code ranging from PB0001 to PB0255.

Selection of species of interest

We selected 31 representative species of vertebrates to cover a wide range of phylogenetic groups (Table S2). The list includes a wide coverage of vertebrates, including most common model organisms and other vertebrates that are potentially important as model organisms of bone regeneration and osteoporosis (e.g., six members of the Cervidae family, including Cervus hanglu yarkandensis, Odocoileus virginianus texanus, Cervus canadensis, Cervus elaphus, Muntiacus muntjak, and Muntiacus reevesiveus). Moreover, we included in the search some species of invertebrates (n = 8) and other taxonomic groups that are phylogenetically distant to vertebrates (e.g., Arabidopsis, Archaea, Bacteria, Choanoflagellata, Cnidaria, Ctenophora, Fungi, and Porifera).

Gathering ECM proteins from public databases

Orthologous proteins were gathered from public repositories of the National Center for Biotechnology Information (NCBI). Phylobone draws its results from NCBI’s Eukaryotic Genome Annotation pipeline22 and NCBI Gene dataset23 to identify orthologous [.1] sequences of some vertebrates (e.g., M. musculus, Rattus norvegicus, Gallus gallus, and Bos taurus). Additional protein sequences of vertebrates were gathered from protein BLAST searches (blastp)24 with a threshold of 10-6 on the NCBI’s non-redundant protein sequences (nr). Synthetic construct sequences (taxid: 32630) and annotated partial proteins were not included in the Phylobone database.

Phylogenetic analysis and reconstruction of phyletic patterns

Phylogenetic analyses were performed with NGPhylogeny web server.25 We utilized the One Click Workflow option to obtain multiple sequence alignments (MAFFT algorithm26), clean alignments (BMGE algorithm27) and phylogenetic trees (PhyML algorithm28) of the 255 protein. Phyletic patterns of each protein group were built with custom-made perl scripts and information from NCBI’s Taxonomy.29 We selected various species of different taxonomic groups, including 5 primates, 1 rabbit, 2 rodents, 2 carnivores, 10 even-toed ungulates (including 5 members of the family Cervidae), 2 reptiles, 3 birds, 2 frogs, 4 bony fishes, and 8 invertebrates (Table S3).

Functional analysis

Clusters of Orthologous Groups (COG) functional categories and Kyoto Encyclopedia of Genes and Genomes (KEGG) annotations of all the selected proteins in the database are retrieved from eggnog.30,31 The dataset of 255 protein groups was mapped onto eggNOG-mapper30,31 to identify the COG functional categories involved in bone ECM proteins. COGs are classified into 26 functional groups.32 Gene Ontology (GO) enrichment data for human and zebrafish proteins were collected from GO33 with the Panther resource. Benjamini–Hochberg false discovery rate (FDR) correction was used for the searches. Pfam34 was used to retrieve data about the domains of each human protein in the dataset. Pfam has structural and functional information for each domain as well as links to other databases, such as InterPro.35 In addition, information on the functional domains of all protein sequences in the dataset was annotated with CD-Search (with default parameters: E-value cut-off of 0.01; composition-corrected scoring: applied; low-complexity regions: not filtered; maximum aligns: 500).36

Protein–protein interaction network

Protein–protein interactions of each human protein in the database were retrieved from European Bioinformatics Institute’s Intact [.1] webserver37 and visualized with Cytoscape.38 Betweenness centrality was used to evaluate the centrality of each protein in the network. This parameter gives a value to a node calculated according to Eq. (1):

$${C}_{n}={\sum }_{s\ne t\ne n}\frac{{\sigma }_{{st}/n}}{{\sigma }_{{st}}}$$
(1)

where \({C}_{n}\) is the betweenness centrality value of the node n, \({\sigma }_{{st}}\) is the number of shortest paths from s to t and \({\sigma }_{{st}/n}\) is the number of shortest paths from s to t that n lies in between.

Conservation between zebrafish and human proteins

The conservation level between zebrafish and human proteins has been assessed by calculating the percentage of amino acid matches in pairwise alignments of orthologous proteins.

Targets by existing drugs

To facilitate the research on repurposing drugs for osteoporosis treatment, we used information from the Drug Bank39 and KEGG40 databases to identify bone ECM proteins that are targets of currently available drugs on the market.

Results

Phyletic patterns

The phyletic distribution of proteins in the selected group of species shows a high conservation of the number of bone ECM proteins within the vertebrate species (Fig. 1 and Table S3). Seed proteins to build the Phylobone database were obtained from humans and zebrafish, which were present in 255 (100%) and 251 (98%) protein groups. Most vertebrates were present in ~90% of the protein groups. This finding underscores the significance of the presence of these proteins in the bone ECM, as they exhibit a high degree of conservation across multiple species. Such conservation may have important implications for the function and regulation of bone tissue across vertebrates. Moreover, certain proteins, such as amelogenin (PB0250), are only present in mammals, amphibians, and reptiles because of potential evolutionary adaptations in tetrapods.41 Note that there are exceptional gaps in certain members of the Cervidae family (e.g., C. elaphus, M. reeversi, and M. muntjak have fewer known proteins than the rest of mammals) due to incomplete annotation of the genomes. As expected, only around one-third of the proteins found in invertebrates have homologs in vertebrate species, and in many cases, the levels of similarity were found to be quite low. None of the protein sequences have homologs in lower taxonomic groups, such as bacteria.

Fig. 1
figure 1

General parameters of the selected species in the Phylobone database. Tree visualization was made with iTOL v577 and shows the following datasets: G + C percentage (yellow), genome size (blue), number of proteins (green), number of putative bone extracellular matrix proteins (purple), and taxonomic information

Functional analysis

Protein functional domains

As expected, the most abundant protein functional domain in the Phylobone dataset is collagen, which is present in 12% of human proteins, covering 90% of the organic fraction of the bone ECM.2 Collagen is essential for homeostasis maintenance, and it serves as a scaffold to many other macromolecules and hydroxyapatite, enabling cell attachment and bone resistance to mechanical forces.1,42,43 Furthermore, 214 non-collagenous functional domains are present in human sequences of the dataset (Fig. 2 and Table S4). They are mostly common in bone formation, resorption, cell attachment, or as intermediaries in a variety of metabolic pathways. Leucine-rich repeats (LRRs) are the second most common functional domains due to the large abundance of leucine-rich proteoglycans (SLRPs) in bone ECM.1 LRRs play an important role in bone formation and the maintenance of bone homeostasis. Moreover, laminins, von Willebrand factor (VWF), epidermal growth factor (EGF), and trypsin domains are also frequent. Laminins, VWF, and EGF have significant roles in bone regeneration and stability—laminins are involved in cell adhesion, proliferation, and differentiation;44 VWF enhances the inhibition of osteoclastogenesis;45 and EGF can stimulate bone resorption.46 Trypsin is an important domain in the ECM because of its ability to cleave proteins. Several matrix-degrading enzymes, such as cathepsin K and matrix metalloproteinases, also contribute to the efficient degradation of bone.47

Fig. 2
figure 2

Distribution of protein functional domains in the Phylobone database. Relation between the human proteins of the dataset (purple) and their domains (blue). Dots filled in orange are the most common domains in the database (notice that only domains present in at least 5 proteins are shown). The number of proteins that each domain has is annotated in green (n)

Analysis of gene ontology

We compared GO enrichment analyses of biological process, molecular function, and cellular component categories between humans and zebrafish (Fig. 3 and Fig. S16)). GO annotations are more abundant in humans, but there is a certain functional overlap with zebrafish that can be utilized to evaluate the use of zebrafish as a model organism of osteoporosis.48 A common category in biological processes is bone mineralization, which is essential for bone regeneration.49 In terms of molecular functions, several binding activities are abundant in both humans and zebrafish. As expected, collagen is frequent in the GO cellular component categories of both species, as it is the main component of bone ECM.1

Fig. 3
figure 3

Venn diagram of the number of GO categories in the bone extracellular matrix proteins of Homo sapiens and Danio rerio. a Biological process categories. b Molecular function categories. c Cellular component categories

Clusters of orthologous groups

The Phylobone database is a compilation of ECM proteins; thus, several proteins are categorized in the COG database under the functions of extracellular structures (W) and signal transduction (T) (Fig. 4a). Signal transduction is an indispensable function of the ECM for controlling homeostasis and transmission of molecular signals into the cell.50 Furthermore, proteins involved in bone signaling regulate bone formation and resorption49 (Fig. 5). The third and fourth most common COG functional categories are function unknown (S) (Phylobone may shed some light on the identification of the functions of these proteins when studied in suitable animal models), as well [.1] as post-translational modification, protein turnover, and chaperones (O). Proteins of the ECM are expected to be highly integrated to enable signaling functions and to control homeostasis.

Fig. 4
figure 4

Protein–protein interaction network and functional analysis of human proteins in the bone extracellular matrix. a Percentage of proteins that have each COG functional group. The sequences of all selected species are represented. Signal transduction mechanisms (T), extracellular structures (W), function unknown (S), post-translational modification, protein turnover, chaperones (O), defense mechanisms (V), cytoskeleton (Z), intracellular trafficking, secretion, and vesicular transport (U), inorganic ion transport and metabolism (P), carbohydrate transport and metabolism (G), cell wall/membrane/envelope biogenesis (M), secondary metabolite biosynthesis, transport, and catabolism (Q), amino acid transport and metabolism (E), transcription (K), lipid transport and metabolism (I), and energy proSignal transduction mechanisms (T). b Tree of interactions between proteins. Orange dots are human proteins in the dataset, and blue dots are for other proteins with which they interact. c Gaussian representation of proteins’ betweenness centrality on a logarithmic scale (ln). Green lines represent all 5871 proteins that interact with each other, and orange lines represent the 255 human proteins in the dataset. d Human proteins that have a higher betweenness centrality value in our database

Fig. 5
figure 5

Schematic representation of the state of the art in bone formation and remodeling mediated by extracellular matrix (ECM) proteins. a Simplified representation of processes involving osteocytes, osteoblasts, and osteoclast formation in the bone. ECM proteins can modulate them via activation or inhibition through signaling pathways. These mechanisms allow bone homeostasis to be controlled. b Percentage of proteins with putative structural or regulatory roles in the bone ECM based on annotations in the UniProt database.16

Protein interaction network of bone ECM proteins in human

We built a protein interaction network of 5 781 human proteins, including putative bone ECM proteins and those that potentially interact (independently of body tissue) with them (Fig. 4b). The parameter of betweenness centrality showed higher values in bone ECM proteins compared to those that interacted with them (Fig. 4c). This result is in agreement with the central role of bone ECM proteins in maintaining tissue homeostasis and regulatory processes. Interactions with cellular components also allow signaling (e.g., cell–matrix interactions permit osteoblast differentiation4). The amyloid beta precursor protein, a transmembrane protein that can be cleaved by a secretase and released in the ECM,51 showed the highest value of betweenness centrality (Fig. 4d). This protein regulates osteoclast function (affecting bone remodeling), is present in both bone and brain, and can link osteoporosis and neurodegenerative diseases.52,53 Fibronectin also had one of the largest values of betweenness centrality. This glycoprotein of the ECM extracellular matrix has an important role in osteogenesis because it promotes pre-osteoblast mineralization and differentiation.54 Furthermore, fibronectin acts as a scaffold for the cleavage of procollagen (interacting with both bone morphogenic protein 1 and procollagen).55 This ability to bind to other ECM proteins gives fibronectin its high centrality value and an important role in bone formation and regeneration with a single mechanism.

Candidate proteins to be used in future studies

The objective of the phylobone database is to serve as a valuable resource for further investigations in the areas of bone regeneration, osteoporosis, and related fields. Through phylogenetic analysis, it has been observed that several ECM proteins are conserved between zebrafish and humans (Table S5). This conservation potentially suggests their significant regulatory role or importance in maintaining homeostasis. Moreover, the database includes 36 proteins out of 255 proteins from the bone ECM that have existing drugs and are putative candidates for repurposing existing drugs (Table 1 and Table S6), including 10 drugs that interact with more than one protein (Table 2).

Table 1 List of the 36 bone extracellular matrix (ECM) proteins that are the target of at least one available drug on the market
Table 2 Drugs that interact with more than one protein of the bone extracellular matrix (ECM)

Discussion

How to use the Phylobone database to study bone regeneration and osteoporosis

Osteoporosis is one of the most common bone problems in the middle age and elderly population worldwide.56,57 Approximately 9 million fractures per year (one every three seconds) are caused by osteoporosis, which contributes significantly to morbidity and mortality in developed countries.15 Given that life expectancy is increasing globally, osteoporosis has become an emerging topic, as it significantly affects the quality of life of individuals in most countries.15 Osteoporosis is a metabolic disease caused by an imbalance between bone anabolism and catabolism.48 In general, the recommendations to reduce the risk of osteoporosis are intake of adequate calcium, exposure to sunlight, intake of vitamin D, and engaging in weight-bearing exercises. It is known that a reduction in mechanical loading due to prolonged bed rest or long-term exposure to microgravity can lead to a reduction in bone mass.58,59 The mechanisms of weight-bearing exercises to prevent osteoporosis are not fully understood; yet, during weight-bearing exercise, collagen fibrils, which represent 90% of the organic components of the bone ECM, generate electricity that is stored in the inorganic components of the ECM.60 Since bone ECM has both structural and regulatory roles, non-collagenous organic components have a key role in bone regulation by mechanical loading.3,4

Several non-collagenous proteins of the bone ECM proteome have been identified previously, including osteocalcin, osteonectin, and R-spondins; however, the list and roles of bone ECM proteins in humans (and model organisms) are not fully elucidated.1 Some ECM proteins have been reported to play an important role in promoting bone formation. Type I collagen, the most abundant protein in the bone ECM, plays a structural role in providing mechanical support, such as bone strength and fragility,61 and regulates integrin receptors for osteoblasts.62 Moreover, the bone ECM proteome comprises hundreds of non-collagenous proteins with structural, regulatory, and homeostatic roles63 (Fig. 5). Osteopontin is a non-collagenous protein in bone ECM related to mechanical stress with multifunctional functions.64 This protein is produced by osteoblasts as well as osteoclasts,65 and it inhibits the activities of osteoblasts while promoting osteoclast activities.66 There are other proteins responsible for maintaining bone homeostasis, such as bone morphogenetic proteins (BMPs).67

Phylobone database to study mechanobiology

Outcomes from the Phylobone project will also have an impact on the field of mechanobiology. Several publications have determined the role of some cell types, cellular membrane receptors, and bone ECM proteins in mechanical loading.64,68,69 For instance, osteocytes are known as mechanosensing cells in bone tissue, as they transduce mechanical signals to biochemical responses.68 PIEZO1 is a mechanosensitive ion channel component that became a topic of discussion because its discoverer won the Nobel Prize in 2021.70 PIEZO1 regulates homeostasis via crosstalk of osteoblasts (bone-making cells) and osteoclasts (bone-eating cells).69 Further, mechanical properties, especially fluid shear stress in bone ECM, have significant effects on cells and their interactions.71 However, only a few non-collagenous proteins of the ECM (e.g., osteopontin) have been related to mechanical stress.64

Potential repurposing of existing drugs

The Phylobone database is a key resource for advancing research on the prevention of osteoporosis and the development of new treatments for bone regeneration. The most common treatments for osteoporosis are bisphosphonates, monoclonal receptor activator of nuclear factor-kB ligand (RANKL) antibodies, monoclonal sclerostin antibodies, and a parathyroid hormone peptide.72 The parathyroid hormone peptide increases osteoblast activity and inhibits osteoclast recruitment, whereas the targets of the other treatments inhibit osteoclast resorption.73 However, these treatments have limitations and problems with side effects and serious risks. For instance, bisphosphonates have been reported to have higher fracture rates after long-term use, characterized as more than six years and requiring a “bisphosphonate holiday”.74 RANKL antibodies have been reported to have some side effects, such as skin eczema, flatulence, cellulitis, and osteonecrosis of the jaw.75 Thus, there is still a need for osteoporosis treatment. However, finding new drugs is expensive; thus, repurposing existing drugs, such as those in Table 1, is a main goal for the identification of novel treatments and preventive methods. Given that several proteins of the bone ECM proteome have been identified as playing a regulatory role in osteogenesis and bone degradation,1 we hypothesize that key proteins of the ECM involved in mechanical loading could be potential drug targets for the treatment and prevention of osteoporosis.

Conclusions and future developments

The aim of the Phylobone research project is to provide a platform to study bone regeneration and osteoporosis in light of (biological) evolution. The current database includes the most comprehensive repository of bone ECM proteins in human and animal models. The phylobone database provides a bioinformatics resource that includes sequences, phylogenetics, and functionals. Thus, this resource will be helpful in addressing a major technical challenge in bone biology: the identification of non-collagenous proteins involved in bone formation, regeneration, and progression of osteoporosis disease. In the future, we anticipate an update of the Phylobone database with more experimental data on the regulatory role of bone ECM proteins in bone regeneration and osteoporosis.