Introduction

Throughout their evolution, cells differentiate and specialize into different subtypes, a process that is often controlled by underlying molecular-level mechanisms1,2,3. This process is commonly pictured with the famous metaphor of a ball rolling down a hill, called the Waddington Landscape4. Just as a ball rolling down a hill may change direction because of obstacles in its way, lose kinetic energy, slow down, and eventually settle at a stable point, cells may change their trajectories and differentiate into different subtypes in response to regulatory or evolutionary triggers while they are maturing. Similarly, due to abnormalities, stochasticity, or other unknown reasons, they may diverge from their trajectories and become cancerous cells5. Moreover, cancerous cells may also evolve and differentiate into other subtypes6,7,8. Developing effective treatments for cancer has therefore been a challenge because of the heterogeneous cell subpopulations that appear within a tumor. Genetic or non-genetic mechanisms can drive the cancerous cell subpopulations via plasticity, drug-induced selection, or state transitions between the subtypes, allowing them to escape the treatment or recur with resistance to it9,10,11, as is the case in multiple cancer types such as breast cancer12,13, melanoma14, and small cell lung cancer (SCLC)15,16,17,18,19,20.

SCLC is an extremely aggressive disease with a low survival rate21,22,23,24,25 (7% 5-year survival as of 202226). Although it was characterized as molecularly homogeneous due to loss of TP53 and RB1, and neuroendocrine/epithelial differentiation27,28, SCLC was shown to be heterogeneous29,30,31,32,33,34,35,36,37 by the identification of its mixtures of transcriptional subtypes, such as the neuroendocrine (NE) stem-cell-like subtype centered on the expression of the transcription factors ASCL1 and NEUROD135 and the non-neuroendocrine (NON-NE) subtype centered on the expression of the transcription factor YAP136. Overall, the SCLC subtypes have been classified into four classes: SCLC-A (also labeled as NE), SCLC-N (also labeled as NEv1), SCLC-Y (also labeled as NON-NE), and SCLC-P, defined by the expression of the transcription factors ASCL1 (A), NEUROD1 (N), YAP1 (Y), and POU2F3 (P), respectively29,30,31,32,33,34,35,36,37. Recently, a fifth subtype named SCLC-A2 (also labeled as NEv2) has also been proposed, which is driven by ASCL1 but distinct from the SCLC-A neuroendocrine subtype38. At the early stages of the disease, the cancerous cell population contains the NE type cells, and then over time the population begins to include the NON-NE subtype, which is more treatment-resistant34,39,40, indicating that a subtype transition is happening. In addition to the presence of various subtypes with different levels of resistance to treatment, such transitions between the subtypes further complicate the treatment of the disease. Therefore, understanding molecular heterogeneity in SCLC is essential for developing more precise, tailored treatments to cure the pathology.

Transcription factor (TF) networks have been the focus of studies aiming to understand the mechanism of the disease and to identify different SCLC subtypes, as the subtypes are associated with the overexpression of different transcription factors30,34,37,38,41. These networks have been mechanistically analyzed at the systems level, which led to the identification of regulators and destabilizers of different subtypes30,34,38 and has contributed to our understanding of the underlying gene regulatory system. However, the structures of these networks have barely been studied, apart from work about a decade ago42. Many studies have shown that the structure of a network can be as informative as its dynamical features and that structural analysis may help to identify key components associated with fundamental functional behaviors43,44,45. Specifically, hubs (Box 1) of the networks are shown to have key functional properties46,47,48,49,50,51.

In this study, we analyze the topology of the SCLC TF network (Fig. 1), which was provided in34,38 and has been key in the identification of different SCLC subtypes. It comprises literature-based connections that are verified from ChEA, a database of ChIP-seq-derived interactions52. Overall, the network consists of 35 TFs connected through 239 activatory and inhibitory interactions (green and red arrows in Fig. 1, respectively). Combinational ON–OFF states of the TFs in this network have been shown to drive cells toward different subtypes34. Here, one of our goals is to identify the hubs of the SCLC TF network, which are the special nodes that interconnect several key pathways and play an important role in collecting, processing, and distributing key signals throughout the signaling mechanism. We hypothesize that the hubs might be important for the overall network dynamics and may help to identify specific TFs that regulate SCLC subtypes. Furthermore, although the earlier studies elucidate regulators of different SCLC subtypes, they do not address the mechanisms of subtype transitions, whose understanding is critical to controlling disease progression. We also hypothesize that the pathways connecting the functionally distinct hubs may have roles in the subtype transitions.

Fig. 1: Small cell lung cancer transcription factor network reproduced from34,38.

The hexagonal nodes represent the individual transcription factors, the red edges represent the inhibitory interactions, and the green edges represent the activatory interactions.

To identify the hubs of the SCLC TF network, we implement a graph theory concept called Dense Spanning Tree (DST, see Box 1), which can be found by solving an optimization problem (Methods Section Dense spanning trees of the unbiased SCLC TF network)53,54,55. We initially analyze a relatively unbiased network structure by considering the undirected and unweighted network. In other words, we only consider whether two nodes are interacting and do not consider the type and direction of interaction. Later, we integrate previously published data into our analysis, namely the probability of each interaction occurring as estimated from RNA-seq data34,38, which we assign to each interaction as a weight. To identify the hubs of the weighted network graph, we extend the DST concept into the Minimum Dense Spanning Tree (MDST, see Box 1) concept, for which the DST optimization problem is extended into a multi-objective optimization problem (Methods Section Integrating data into the structural analysis: minimum dense spanning trees). Interestingly, all the identified hubs are either regulators or destabilizers of the previously identified SCLC subtypes, as elaborated in the Results section. Next, we test a pathway connecting two functionally distinct hubs via simulations and observe a transition from the NON-NE to NE subtype. Furthermore, running and tracking several asynchronous NON-NE to NE transition simulations suggests additional TFs, other than the hubs, that may have a role in this transition.

The paper is organized as follows. First, we present the results of the DST and MDST analyses of the SCLC TF network in the Results Sections Structural analysis of the unbiased SCLC TF network identifies some of the known SCLC subtype regulators and destabilizers and Data-driven structural analysis of the SCLC TF network highlights MYC as a hub in addition to those previously identified as subtype regulators and destabilizers. Then, we present the results of the asynchronous subtype transition simulations in the Results Section The pathways connecting the SCLC TF network hubs may have a role in SCLC subtype transitions: NON-NE to NE transition occurs when FLI1–ASCL1–MITF pathway is active. Next, we provide the mathematical details of the DST and MDST analyses as well as the details of the transition simulations in Methods Sections Dense spanning trees of the unbiased SCLC TF network, Integrating data into the structural analysis: minimum dense spanning trees, and SCLC TF network subtype transition simulations, respectively. In addition, we compare the DST and MDST analysis results in the Supplementary Information. Finally, we close the paper with concluding remarks.

Results

In our analyses, given the SCLC TF network (Fig. 1), we search for hubs of the network by finding the substructure DSTs (Box 1). The DST of a given network contains hubs that are known to be structurally important nodes interconnecting several pathways. Due to their high and strategic connectedness, they are very likely to have functional importance as well. This concept has many applications in different areas such as telecommunications networks, social networks, resource allocation, and biological networks55.

In biological networks, the DSTs of the network are substructures that preserve the shortest pathways between the nodes (TFs) and hence preserve the maximum influence among the individual components while highlighting a few nodes as the hubs. Since the identified hubs connect several pathways, they receive many signals from their peripherals, process them, and distribute them to multiple other nodes. Therefore, in general, they have functional importance as well46,47,48,49,50,51. Also, depending on the size of the initial network, the identified DSTs may contain multiple hubs. Given the individual importance of the hubs, the pathways connecting them might also be important, as they carry complex signaling between the hubs. In this section, we show that the hubs of the SCLC TF network are relevant to the SCLC subtypes. Additionally, we test a pathway connecting two identified hubs via network simulations. All the results are elaborated in the following subsections.

Structural analysis of the unbiased SCLC TF network identifies some of the known SCLC subtype regulators and destabilizers

We start our analysis by converting the SCLC TF network (Fig. 1) into an undirected, unweighted network (see Methods Section Dense spanning trees of the unbiased SCLC TF network). In this way, we focus only on whether an interaction between two nodes exists, without considering interaction types, directionality, or weights (i.e., probabilities), which allows us to minimize bias on the network structure. Then, we searched for the DSTs of the SCLC TF network following the approach of Ref. 55. Upon solving the global optimization problem in Eq. (1) (Methods Section Dense spanning trees of the unbiased SCLC TF network), we observed 146,143 DSTs, all having the same optimum total distance between the TFs. Examples of the found DSTs are presented in Fig. 2. In one of the DSTs, FLI1 and MITF are identified as the hubs (Fig. 2a), while in the other DST, FLI1, ASCL1, and FOXA1 are identified as the hubs (Fig. 2b). Since different DSTs may highlight different TFs as the hubs, we computed the average node degrees (Box 1) of the nodes among all 146,143 DSTs, which are collectively presented in Fig. 3. As seen in the figure, FLI1 is a major hub with about 20 connections on average among all the found DSTs. In addition, MITF, ASCL1, NR0B1, and FOXA1 are the other hubs, with relatively high average node degrees in some DSTs.
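The averaging step described above can be illustrated with a minimal sketch. It assumes each tree is stored as a list of undirected edges; the two toy trees below are hypothetical and are not DSTs of the actual SCLC TF network.

```python
from collections import Counter

def average_node_degrees(trees):
    """Average degree of every node over a collection of spanning trees.

    Each tree is a list of undirected edges (u, v).
    """
    totals = Counter()
    for tree in trees:
        for u, v in tree:
            totals[u] += 1
            totals[v] += 1
    return {node: count / len(trees) for node, count in totals.items()}

# Hypothetical toy trees on four TFs (illustration only).
toy_trees = [
    [("FLI1", "ASCL1"), ("FLI1", "MITF"), ("FLI1", "FOXA1")],
    [("FLI1", "ASCL1"), ("ASCL1", "MITF"), ("ASCL1", "FOXA1")],
]
avg = average_node_degrees(toy_trees)
print(avg["FLI1"], avg["ASCL1"], avg["MITF"])  # prints: 2.0 2.0 1.0
```

Nodes whose average degree stands out from the rest would be the hub candidates in this picture.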

Fig. 2: Examples of the found DSTs of SCLC TF network.

a An example DST in which FLI1 and MITF are the two hubs. b An example of found DSTs in which FLI1, ASCL1, and FOXA1 are the three hubs.

Fig. 3: Average node degrees of each TF among the found DSTs.

FLI1 is the major hub with about 20 connections on average in the found DSTs. The other hubs are MITF, ASCL1, NR0B1, and FOXA1 with relatively high connectedness on average.

The identified major and side hubs are not only structurally important but have also been shown to be biologically important to the identified SCLC subtypes. For instance, FLI1, the major hub in Fig. 3, is shown to be one of the regulators of the SCLC NE subtype34,56,57. Similarly, ASCL1, NR0B1, and FOXA1 are reported as regulators of the SCLC NE and NEv2 subtypes, and MITF is reported as one of the regulators of the SCLC NON-NE subtype34, which shows the specificity of the hubs of the SCLC TF network.

Data-driven structural analysis of the SCLC TF network highlights MYC as a hub in addition to those previously identified as subtype regulators and destabilizers

Next, we repeat our hub search by integrating experimental data into the analysis. The data are the individual probabilities of each interaction between the TFs in the SCLC TF network (Fig. 1), extracted from RNA-seq data34. The probabilities are integrated into the network structure as weights assigned to the associated edges. Then, to identify the hubs of the weighted SCLC TF network, we extend the DST concept into MDST (Box 1), for which we solve an extended multi-objective optimization problem (Methods Section Integrating data into the structural analysis: minimum dense spanning trees). Unlike DSTs, MDSTs allow us to highlight the hubs while also preserving the maximum likelihood of the interactions.

Upon solving the optimization, we observed only 46 MDSTs, which is drastically lower than the number of DSTs (146,143) found with the unbiased network structure. This means that an analysis guided by prior knowledge, i.e., the experimental data, constrains the search space more effectively. Once we compute the average node degrees among the found MDSTs, we observe that FLI1 is still the major hub (Fig. 4). Similarly, ASCL1 and MITF are still identified as hubs, but this time with higher average node degrees than in the unbiased network analysis (Fig. 4). In other words, they become more prominent hubs, which is consistent with their biological importance in SCLC as reported in the literature30,31,34,38,40,58,59,60. Interestingly, the data-driven structural analysis further reveals MYC as another hub (Fig. 4), which does not appear in the unbiased network analysis (Fig. 3). Recently, MYC was shown to be one of the key TFs for SCLC32,61,62,63; it initiates Notch signaling to reprogram neuroendocrine fate from NE to NEv1 to NEv2 to NON-NE states40. Overall, our observations support the view that structurally important nodes are very likely to be functionally significant as well. Therefore, such structural analyses could be an initial step in the analysis of complex intracellular networked processes because of their potential to pinpoint important network components, which would guide experimental target discovery.

Fig. 4: Average node degrees of each TF among the found MDSTs.

FLI1 is the major hub with about 14 connections on average in the found MDSTs. The other hubs are ASCL1, MITF, and MYC with relatively high connectedness on average. We note that MYC emerges as a hub after data integration, which was not the case with the pure structural analysis in Fig. 3.

The pathways connecting the SCLC TF network hubs may have a role in SCLC subtype transitions: NON-NE to NE transition occurs when FLI1 – ASCL1 – MITF pathway is active

SCLC TF network contains multiple hubs with varying average node degrees. These hubs are shown to have distinct functional features in terms of SCLC subtypes, as elaborated in the previous sections, which leads us to a question: Do the pathways connecting different hubs that are identified as regulators of different SCLC subtypes have any role in subtype transition? For instance, FLI1 and MITF are the two major hubs identified in both unbiased (Fig. 3) and data-driven structural analyses (Fig. 4). One of the pathways connecting these two hubs is through FLI1–ASCL1–MITF. FLI1 being a regulator of the SCLC NE subtype, MITF being a regulator of the NON-NE subtype, and ASCL1 being a destabilizer of the NON-NE subtype and regulator of the NE subtypes34 suggest that this pathway has a potential role in NON-NE to NE subtype transition. One can also identify such structurally important pathways by checking the interactions remaining in the found DSTs and MDSTs with high probability, as exemplified in Supplementary Information.

To test the possible role of this pathway in the NON-NE to NE subtype transition, here we simulate the SCLC TF network using a tool called BooleaBayes34 that automatically infers gene regulatory mechanisms, based on Boolean logic models, and links inputs and output states tailored to -omics datasets such as those from RNA-seq data. Upon setting the network’s initial state to NON-NE subtype based on previously identified combinational ON-OFF states of the TFs34, keeping the FLI1–ASCL1–MITF pathway active, and running asynchronous network simulation (i.e., one TF is randomly picked and updated at each iteration) using the extracted logic rules (Methods section SCLC TF network subtype transition simulations), we observe a transition from NON-NE to NE subtype (Fig. 5).

Fig. 5: SCLC subtype transition from NON-NE to NE subtype.

The network was initially set to NON-NE subtype attractor. After running several asynchronous iterations by keeping the FLI1–ASCL1–MITF pathway active, the system converges to an NE subtype attractor shown in the last iteration. Running further iterations does not move the network to another state. This pathway was identified based on the hubs observed from both unbiased and data-driven network structure analyses. The details of network simulation are provided in Methods Section SCLC TF network subtype transition simulations. The red color means TF is ON and cream color means TF is OFF.

Dynamic analysis of asynchronous NON-NE to NE subtype transition simulations

Although the NON-NE to NE subtype transition was observed by keeping the FLI1–ASCL1–MITF pathway active, there are possibly other TFs and dominant pathways that contribute to the transition. Identifying those TFs and dominant pathways may reveal how the system mechanistically executes such transitions and allow us to identify potential other TFs playing a role in the transition. Therefore, as the next step, we run 700 asynchronous NON-NE to NE subtype transition simulations and keep track of all the iterations. Then, we compute the Longest Common Sequence (LCS) based distance (Methods section Distance measure between instantaneous network state and NE subtype) between the target SCLC Boolean NE state and the instantaneous network state at each iteration (Methods section SCLC TF network subtype transition simulations). As seen in Fig. 6, throughout the NON-NE to NE transition, the network state dynamically alternates between NON-NE and NE subtypes through many distance-increasing and -decreasing patterns until it finally converges to the NE state. This means that some reaction patterns drive the cells toward the NE subtype (distance-decreasing patterns in Fig. 7) whereas some other reaction patterns drive the cells toward the NON-NE subtype (distance-increasing patterns in Fig. 7).
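To make the distance tracking concrete, the sketch below computes one possible LCS-based distance between two Boolean network states. The exact definition used in this study is given in the Methods section referenced above and is not reproduced here; the form assumed in this sketch (length of the target state minus the length of the longest common subsequence, with states written as 0/1 strings in a fixed TF order) is an illustrative assumption only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_distance(state, target):
    """Assumed distance: target length minus LCS length (0 when the states are identical)."""
    return len(target) - lcs_length(state, target)

# Hypothetical 4-TF Boolean states written as ON/OFF strings.
print(lcs_distance("1100", "1100"))  # prints: 0
print(lcs_distance("1100", "0011"))  # prints: 2
```

Evaluating such a distance after every asynchronous update yields a trajectory like the one plotted against iterations in Fig. 6.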

Fig. 6: Longest Common Sequence-based distance between NE subtype and the instantaneous network state versus asynchronous iterations.

Starting from NON-NE state, the system converges to and diverges from NE state multiple times throughout the iterations until finally it fully converges.

Fig. 7: Examples of distance-increasing and distance-decreasing patterns between the network’s instantaneous state and the SCLC NE subtype.

In the zoomed-in panels, the x-axis tick labels are the names of the TFs that are updated during that iteration window of the simulation and that cause the increase/decrease in the distance between the network’s instantaneous state and the target state.

Overall, the 700 asynchronous NON-NE to NE subtype transition simulations, in which the transition occurs in the order of \(10^{5}\) asynchronous iterations, contain about \(7\times 10^{5}\) distance-increasing and \(5\times 10^{5}\) distance-decreasing patterns. To see which TFs appear most in the distance-increasing and -decreasing patterns, we compute their frequencies (Fig. 8). Interestingly, four TFs, namely ASCL1, FLI1, NR0B1, and CEBPD, appear more than the other TFs in the distance-decreasing patterns (Fig. 8a), whereas the same four TFs appear less than the others in the distance-increasing patterns (Fig. 8b). This means that, in addition to ASCL1 and FLI1, which are part of the identified NON-NE to NE transition pathway, NR0B1 and CEBPD may have a regulatory involvement in this transition as well. Moreover, across all the asynchronous iterations of the 700 NON-NE to NE transitions, we compute, for each TF, the number of iterations in which an update of that TF causes an increase in the distance between the network’s instantaneous state and the NE subtype. As seen in Fig. 9a, in addition to ASCL1 and FLI1, which never drive the cells toward the NON-NE subtype, NR0B1 and CEBPD are the two TFs that have a lower effect on the increase in the distance between the network state and the NE subtype compared to the others, which further supports their possible regulatory involvement in the NON-NE to NE subtype transition. Furthermore, we compute the probability of TFs being ON in the network state at the initiation of distance-decreasing patterns (Fig. 9b). With about 0.2 probability of being ON, NR0B1 seems to drive the cells toward the NE subtype by mostly being OFF, whereas the activity status of CEBPD seems not very important, as its probability of being ON is very close to 0.5. Additionally, Fig. 9b suggests that whenever ISL1 and FOXA2 appear in the distance-decreasing patterns, which is very likely as seen in Fig. 8a, they are mostly ON with relatively high probabilities, which implies that they may have a role in the NON-NE to NE transition.
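The per-TF count of distance-increasing updates (the quantity summarized in Fig. 9a) can be sketched as follows. The trajectory format is an assumption made for illustration: `updated_tfs[k]` is the TF updated at iteration k, and `distances` holds the distance to the NE state before the first update and after each update. The TF names and values are hypothetical.

```python
from collections import Counter

def count_distance_increases(updated_tfs, distances):
    """Count, per TF, the updates after which the distance to the target grew.

    distances has one more entry than updated_tfs: distances[0] is the
    distance of the initial state, distances[k + 1] the distance after update k.
    """
    if len(distances) != len(updated_tfs) + 1:
        raise ValueError("distances must have exactly one more entry than updated_tfs")
    counts = Counter()
    for k, tf in enumerate(updated_tfs):
        if distances[k + 1] > distances[k]:
            counts[tf] += 1
    return counts

# Hypothetical trajectory fragment (illustration only).
updated = ["TF_X", "TF_Y", "TF_X", "TF_Z"]
dist = [5, 4, 6, 6, 7]
counts = count_distance_increases(updated, dist)
print(counts["TF_X"], counts["TF_Y"], counts["TF_Z"])  # prints: 0 1 1
```

Summing such counters over all simulated transitions gives one bar per TF; TFs with a count of zero never moved the state away from the target in the recorded runs.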

Fig. 8: Frequencies of TFs in the distance decreasing and increasing patterns.

a Appearance of TFs in the distance decreasing patterns. The shaded TFs, i.e., ASCL1, FLI1, NR0B1, and CEBPD, appear more than the other TFs in the distance-decreasing patterns. b Appearance of TFs in the distance increasing patterns. The shaded TFs, i.e., ASCL1, FLI1, NR0B1, and CEBPD, appear less than the other TFs in the distance-increasing patterns.

Fig. 9: Effect of TFs in distance increase and decrease between network state and NE subtype.

a Number of iterations in which an update of the TF causes an increase in the distance between the network state and the NE subtype. b Probability of the TF being ON in the network state initiating distance-decreasing patterns.

Overall, the presented results suggest that structural analysis of biological networks may guide the identification of functionally important molecules. More specifically, the DST concept, extended here to MDST by integrating data, can identify hubs of a network, which can be potential experimental targets due to their involvement in complex biological processes. For the SCLC TF network analyzed in this work, all the hubs identified in both the unbiased and data-driven analyses show biological importance in terms of SCLC subtype regulation and destabilization, as supported by the literature. Moreover, integrating data into the structural analysis highlights MYC as another hub whose importance in SCLC subtypes has recently been discovered32,61,62,63. This observation further supports those previously reported results. Furthermore, the ability to identify multiple hubs that have distinct functional roles in SCLC subtypes lets us scrutinize the pathways connecting the hubs. Upon asynchronously simulating the network while keeping the pathway connecting FLI1 and MITF, the two major hubs, active, we observed a transition from the NON-NE to NE subtype. In addition, analysis of 700 asynchronous NON-NE to NE transition simulations suggests other TFs that may play a role in this transition. As a result, starting from the pure network structure, the analysis leads toward an understanding of the underlying mechanism of a complex biological system, which is noteworthy.

Methods

Dense spanning trees of the unbiased SCLC TF network

Given the SCLC TF network (Fig. 1), to analyze its structure and identify the hubs (Box 1) that are potentially fundamental in terms of their roles in complex biological processes, we search for the substructures called dense spanning trees (DSTs, Box 1). Suppose G is a graph that represents the SCLC TF network, V(G) is the set of nodes that represent the TFs in the network, and E(G) is the set of edges that represent the interactions between the TFs in the network. Then, the DST of G is a substructure that minimizes the total distance between the TFs and contains all the TFs in V(G) with a minimum number of interactions while highlighting some nodes with high connectedness, i.e., the hubs. In other words, the DSTs are the subnetworks of the SCLC TF network that comprise the hubs and the shortest pathways from the hubs to all other TFs, preserving the maximum biological influence.

To identify the hubs of the SCLC TF network, we minimize possible bias in the network structure by removing all the edge directions, i.e., the information on which node influences the other, and the edge types, i.e., the information on activating and inhibitory interactions, and by not using any data on the strength of the connections, i.e., the probabilities of the interactions (Supplementary Figure 1). Then, the DSTs of the network are obtained by solving the following optimization55:

For the graph G with vertex set \(V(G)=\left\{{v}_{1},{v}_{2},\ldots ,{v}_{N}\right\}\) where \(N=\left|V\right|\), and edge set \(E(G)=\left\{{e}_{1},{e}_{2},\ldots ,{e}_{M}\right\}\) where \(M=\left|E\right|\),

$$\begin{array}{ll}\min\limits_{\vec{h}} & \sum\limits_{i,j=1,\,i\ne j}^{N} d\left({v}_{i},{v}_{j}\mid {\vec{h}}^{*}\right)\\ \text{subject to} & {h}_{i}\in \{1,2,\ldots ,M\}\subset {\mathbb{Z}}^{+},\quad i=1,\ldots ,|\vec{h}|\\ & {h}_{i}\ne {h}_{j},\quad \forall i\ne j\\ & \vec{h}\ \text{contains at least one edge adjacent to}\ {v}_{i}\in V,\quad \forall i=1,\ldots ,N\\ & {\vec{h}}^{*}=\mathrm{Kruskal}(\vec{h})\end{array}$$
(1)

in which \({\vec{h}}^{* }\) denotes the minimum spanning tree obtained from \(\vec{h}\) that is a subset of E(G), and \(d\left({v}_{i},{v}_{j}\right)\) is the distance between nodes \({v}_{i}\) and \({v}_{j}\) defined as the total number of edges in the shortest pathway between \({v}_{i}\) and \({v}_{j}\). The main idea here is to find the optimal subset(s) of edges E(G) from which the constructed DST has the optimal objective value which is the total distances between the individual nodes. For more mathematical details and possible applications of this approach, we refer the reader to54,55. Upon solving the optimization problem (1) via Genetic Algorithm (GA), which is a metaheuristic optimization method that attempts to find the global optimum or at least its good approximation64, we observed 146,143 DSTs with the same objective value.
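To illustrate how a single candidate edge subset \(\vec{h}\) is scored in Eq. (1), the sketch below builds a spanning tree with Kruskal's algorithm and sums the hop distances over all ordered node pairs. It is a minimal sketch, not the implementation used in this study: the function names, the integer node labels, and the toy edge subsets are assumptions, and the Genetic Algorithm search over subsets is not shown. With unweighted edges, Kruskal's algorithm simply accepts edges in the order given, so the resulting tree depends on that order.

```python
from collections import deque

def kruskal_unweighted(n, edges):
    """Spanning forest of an unweighted edge list via Kruskal with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so keep it
            parent[ru] = rv
            tree.append((u, v))
    return tree

def total_distance(n, tree):
    """Sum of hop distances over all ordered node pairs (infinite if disconnected)."""
    adj = [[] for _ in range(n)]
    for u, v in tree:
        adj[u].append(v)
        adj[v].append(u)
    total = 0
    for source in range(n):
        dist = [-1] * n
        dist[source] = 0
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            x = queue.popleft()
            for y in adj[x]:
                if dist[y] < 0:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        if min(dist) < 0:
            return float("inf")
        total += sum(dist)
    return total

def dst_objective(n, h):
    """Objective of Eq. (1) for one candidate edge subset h."""
    return total_distance(n, kruskal_unweighted(n, h))

# Toy 5-node example: a star-shaped subset versus a path-shaped subset.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
path = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(dst_objective(5, star), dst_objective(5, path))  # prints: 32 40
```

The star, in which node 0 acts as a hub, attains the smaller total distance, which is why minimizing this objective tends to highlight highly connected nodes.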

Integrating data into the structural analysis: minimum dense spanning trees

As the next step, we blend this pure structural analysis with data, namely the probabilities of the interactions, i.e., the strengths of the connections estimated from RNA-seq data via the probabilistic Boolean rules of Wooten et al.34. Each probability is the difference of means for a particular node when the parent node is on versus off. To elaborate, suppose FLI1 regulates ASCL1. Then, the value for the edge between FLI1 and ASCL1 is the mean probability of ASCL1 turning on when FLI1 is on minus the mean probability of ASCL1 turning on when FLI1 is off across the samples, i.e., P(ASCL1 = 1 | FLI1 = 1) - P(ASCL1 = 1 | FLI1 = 0). So, if ASCL1 is always on when FLI1 is on, and always off when FLI1 is off, then this value is 1. These probabilities are integrated into the network structure through the weights assigned to the associated edges (\({w}_{i}=1-{p}_{i}\) in Eq. (2)). The source codes for computing these probability values were provided in Wooten et al.34 (see their BooleaBayes source codes on GitHub).
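The difference-of-means calculation can be illustrated with a simplified sketch. It assumes already binarized ON/OFF samples for a parent and a child TF, which is a simplification: the values used in this study come from the probabilistic Boolean rules of BooleaBayes, whose code is not reproduced here. The function name and the sample vectors are hypothetical.

```python
def interaction_probability(parent_states, child_states):
    """P(child = 1 | parent = 1) - P(child = 1 | parent = 0) from paired binary samples."""
    child_when_on = [c for p, c in zip(parent_states, child_states) if p == 1]
    child_when_off = [c for p, c in zip(parent_states, child_states) if p == 0]
    if not child_when_on or not child_when_off:
        raise ValueError("need samples with the parent both ON and OFF")
    return sum(child_when_on) / len(child_when_on) - sum(child_when_off) / len(child_when_off)

# Hypothetical binarized samples (illustration only).
fli1 = [1, 1, 0, 0]
ascl1_follows = [1, 1, 0, 0]      # ASCL1 ON exactly when FLI1 is ON
ascl1_unrelated = [1, 0, 1, 0]    # ASCL1 ON half of the time either way
print(interaction_probability(fli1, ascl1_follows))    # prints: 1.0
print(interaction_probability(fli1, ascl1_unrelated))  # prints: 0.0
```

The first case reproduces the limiting example in the text, where the child is always on when the parent is on and always off otherwise.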

To identify the hubs of the weighted SCLC TF network, we reformulate the optimization problem constructed to find DSTs in Eq. (1) as a multi-objective optimization problem given in Eq. (2) and call the resulting optimal trees minimum dense spanning trees (MDSTs, Box 1). MDSTs add another layer of information to the found trees by preserving the maximum likelihood of the interactions in addition to the minimum total distance between the nodes, while highlighting the hubs of the network. More precisely, MDSTs of the SCLC TF network are the subnetworks that preserve the most probable interactions as well as the maximum biological influence between the TFs via the shortest pathways through the hubs. Note that one can assign different weights to the interactions by different means, such as the mutual information between the TFs extracted from experimental data. In this case, the MDSTs will be the substructures that preserve the highest mutual information in addition to the shortest pathways through the hubs.

To find the MDSTs of the SCLC TF network, we extend Eq. (1) as follows: Suppose that for each interaction \(i\), we are given a probability \({p}_{i}\), that is, the probability of the existence of the \({i}^{{th}}\) interaction. Then, for the graph G with vertex set \(V(G)=\left\{{v}_{1},{v}_{2},\ldots ,{v}_{N}\right\}\) where \(N=\left|V\right|\), and edge set \(E(G)=\left\{{e}_{1},{e}_{2},\ldots ,{e}_{M}\right\}\) where \(M=\left|E\right|\) with associated weights \({w}_{i},{i}=1,\ldots ,{M}\):

$$\begin{array}{ll}\min\limits_{\vec{h}} & \left\{\sum\limits_{i,j=1,\,i\ne j}^{N} d\left({v}_{i},{v}_{j}\mid {\vec{h}}^{*}\right),\ \sum\limits_{i=1}^{M} 1\left({e}_{i}\in \vec{h}\right)\times {w}_{i}\right\}\\ \text{subject to} & {h}_{i}\in \{1,2,\ldots ,M\}\subset {\mathbb{Z}}^{+},\quad i=1,\ldots ,|\vec{h}|\\ & {h}_{i}\ne {h}_{j},\quad \forall i\ne j\\ & \vec{h}\ \text{contains at least one edge adjacent to}\ {v}_{i}\in V,\quad \forall i=1,\ldots ,N\\ & {\vec{h}}^{*}=\mathrm{Kruskal}(\vec{h})\end{array}$$
(2)

in which the weight \({w}_{i}=1-{p}_{i}\), \({\vec{h}}^{* }\) denotes the minimum spanning tree obtained from \(\vec{h}\), which is a subset of E(G), \(d\left({v}_{i},{v}_{j}\right)\) is the distance between nodes \({v}_{i}\) and \({v}_{j}\), and \(1\left({e}_{i}\in \vec{h}\right)\) equals 1 if the edge \({e}_{i}\) is in \(\vec{h}\) and 0 otherwise. Here, the first objective is the minimization of the total sum of distances between the nodes, whereas the second objective is the minimization of the sum of weights assigned to the selected edges, which, by the definition of the weights, is the same as maximizing the sum of the probabilities that the selected interactions exist. Once we solved the multi-objective optimization problem (2) by GA, we observed 46 MDSTs, all having the same objective value, which shows the effect of prior knowledge on narrowing down the search space.
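The two objectives of Eq. (2) can be evaluated for one candidate edge subset as in the following sketch. It is an illustration under stated assumptions, not the code used in this study: the function names, the integer node labels, and the toy probabilities are hypothetical, and the multi-objective GA search itself is not shown. Following Eq. (2), the weight sum runs over the edges in \(\vec{h}\), while the distances are measured on the Kruskal tree \({\vec{h}}^{*}\).

```python
from collections import deque

def kruskal(n, weighted_edges):
    """Minimum spanning forest; weighted_edges is a list of (u, v, w) tuples."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

def hop_distance_sum(n, tree):
    """Sum of hop distances over all ordered node pairs (infinite if disconnected)."""
    adj = [[] for _ in range(n)]
    for u, v in tree:
        adj[u].append(v)
        adj[v].append(u)
    total = 0
    for source in range(n):
        dist = [-1] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if dist[y] < 0:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        if min(dist) < 0:
            return float("inf")
        total += sum(dist)
    return total

def mdst_objectives(n, h, prob):
    """Return (total distance on Kruskal(h), sum of w_i = 1 - p_i over edges in h)."""
    weighted = [(u, v, 1.0 - prob[(u, v)]) for u, v in h]
    tree = kruskal(n, weighted)
    return hop_distance_sum(n, tree), sum(w for _, _, w in weighted)

# Hypothetical interaction probabilities on a 4-node toy network.
prob = {(0, 1): 0.9, (0, 2): 0.8, (0, 3): 0.5, (1, 2): 0.2}
all_edges = mdst_objectives(4, list(prob), prob)
star_only = mdst_objectives(4, [(0, 1), (0, 2), (0, 3)], prob)
print(all_edges[0], star_only[0])  # prints: 18 18
```

In this toy case both subsets give the same tree and hence the same total distance, but the three-edge subset has the smaller weight sum (about 0.8 versus 1.6), so it dominates in the second objective.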

SCLC TF network subtype transition simulations

To see how important the pathways connecting the hubs having distinct functional features are, we simulate the SCLC TF network using a tool called BooleaBayes34. BooleaBayes is a Boolean rule-fitting algorithm that infers local regulatory mechanisms near stable cell subtypes from gene expression data. The approach has previously been applied to the SCLC TF network (Fig. 1) to identify and rank master regulators and master destabilizers of SCLC subtypes assuming binary, i.e., ON and OFF, activity states of each transcription factor (Supplementary Figure 2). Further details of BooleaBayes and how it infers the logic rules can be found in34.

Using the Boolean rules extracted via BooleaBayes, we test the role of the FLI1–ASCL1–MITF pathway, in which FLI1 and MITF are the two major hubs found by both the DST and MDST approaches, in the NON-NE to NE subtype transition. This role is hypothesized because FLI1 is a regulator of the SCLC NE subtype, MITF is a regulator of the NON-NE subtype, and ASCL1 is both a destabilizer of the NON-NE subtype and a regulator of the NE subtype34. In other words, FLI1 and MITF are two functionally distinct hubs identified by the DST/MDST analyses, and ASCL1 connects these hubs. Note that FLI1–ASCL1–MITF is only one of the candidate pathways connecting these two hubs. We picked this pathway based on prior knowledge from the literature. Nevertheless, if one performs the same analysis without any prior knowledge and tries all possible candidates, the FLI1–ASCL1–MITF pathway will still be identified as one of the candidate pathways that result in a subtype transition.

First, we set the initial state of the network to the NON-NE subtype using the logic TF states in Supplementary Figure 2. Then, we simulate the network using a general asynchronous update scheme with the inferred Boolean rules, keeping the FLI1–ASCL1–MITF pathway active by setting ASCL1 and FLI1 always ON. At each iteration, we randomly select a node and fetch its probability of being ON, given its parent nodes’ instantaneous states, from the Boolean lookup tables generated by BooleaBayes. Then, based on this probability, we flip a weighted coin to set the selected node’s state to ON or OFF. After updating the selected node’s state, we compare the overall network state to the target state. After several asynchronous update/compare iterations (usually on the order of \(10^{5}\)), the network converges to one of the NE subtype Boolean states (Supplementary Figure 2). The simulation stops when either the network state equals the target state or the maximum number of iterations is reached, which we set to \(10^{6}\) (three times more than the typical number of iterations needed for such a transition based on our experience).
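The update scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the BooleaBayes implementation: the four-node network, its node names (A to D), and the probability tables are hypothetical stand-ins for the SCLC TF network and the lookup tables inferred from data.

```python
import random

# Hypothetical toy rules: node -> (parents, table mapping parent states to P(ON)).
RULES = {
    "A": ((), {(): 0.5}),
    "B": ((), {(): 0.5}),
    "C": (("A", "B"), {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}),
    "D": (("C",), {(0,): 0.9, (1,): 0.1}),
}

def simulate(initial, target, rules, clamped, max_iters, rng):
    """Asynchronous probabilistic updates; returns the iteration at which the
    target state is first reached, or None if max_iters is exhausted."""
    state = dict(initial)
    state.update(clamped)          # clamped nodes are held at a fixed value
    nodes = sorted(rules)
    for iteration in range(1, max_iters + 1):
        node = rng.choice(nodes)   # pick one node at random
        if node not in clamped:
            parents, table = rules[node]
            p_on = table[tuple(state[p] for p in parents)]
            state[node] = 1 if rng.random() < p_on else 0  # weighted coin flip
        if state == target:        # compare with the target state
            return iteration
    return None

start = {"A": 0, "B": 0, "C": 0, "D": 1}   # stand-in for the initial subtype
goal = {"A": 1, "B": 1, "C": 1, "D": 0}    # stand-in for the target subtype
both_on = simulate(start, goal, RULES, {"A": 1, "B": 1}, 10_000, random.Random(0))
both_off = simulate(start, goal, RULES, {"A": 0, "B": 0}, 10_000, random.Random(0))
```

In this toy setting, holding A and B ON lets the network reach the target within the iteration budget, whereas holding them OFF can never reach it because the target itself requires both to be ON.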

We also tested various activity statuses of this pathway to see under which conditions such a transition occurs. Keeping FLI1 and ASCL1 always inactive does not result in a NON-NE to NE transition, which is intuitive because the target NE state requires them to be active while they are forced to be inactive. Similarly, keeping FLI1 active and ASCL1 inactive, or vice versa, does not result in a transition either. Keeping one of them active and not forcing the other into any state resulted in a NON-NE to NE transition in a few instances (5% of the simulations). We believe this is due to the random nature of the update scheme, which occasionally produced the “right” conditions for such a transition. On the other hand, keeping FLI1 and ASCL1 always active results in this transition in every single run (100% of the simulations). Note that, due to the nature of the asynchronous update scheme, the system may converge to the NE subtype in a different number of iterations and with different update patterns in each run.

Distance measure between instantaneous network state and NE subtype

To track the network state and understand its dynamic behavior throughout the NON-NE to NE transition, we compute the distance between the network’s instantaneous state at each iteration and the target NE subtype. We chose the longest common subsequence (LCS) metric65 because it is sensitive to order differences, assigning a larger distance value to the difference between the network state and the target state, and because it can be applied to vectors of the same or different lengths. Overall, the LCS-based distance measures the difference between two sequences as the cost of the insertion and deletion operations required to transform one sequence into the other. Given two vectors \({v}_{1}\) and \({v}_{2}\) of length \(m\), which in our case represent the network state and the target state, respectively, the LCS-based distance \({d}_{{LCS}}\) is defined as follows:

$${d}_{{LCS}}\left({v}_{1},{v}_{2}\right)=A\left({v}_{1},{v}_{1}\right)+A\left({v}_{2},{v}_{2}\right)-2A\left({v}_{1},{v}_{2}\right)$$
(3)

where \(A\left({v}_{1},{v}_{2}\right)\) is the number of elements in \({v}_{1}\) that uniquely match the elements of \({v}_{2}\) in the same order (not necessarily contiguously). Note that one can use other distance metrics, such as the Hamming distance, to perform the same analysis if the vectors are of equal length.
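A minimal sketch of Eq. (3) is given below, assuming that \(A\) is computed with the standard dynamic-programming recurrence for the longest common subsequence; the function names and the example vectors are illustrative only. Since every sequence matches itself completely, \(A(v,v)\) equals the length of \(v\), so for two vectors of length \(m\) the distance reduces to \(2m-2A({v}_{1},{v}_{2})\).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)      # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_distance(v1, v2):
    """LCS-based distance of Eq. (3); A(v, v) is simply len(v)."""
    return len(v1) + len(v2) - 2 * lcs_length(v1, v2)

# Hypothetical binary state vectors (1 = ON, 0 = OFF).
network_state = [1, 0, 1, 1, 0]
target_state = [1, 1, 0, 1, 0]
distance = lcs_distance(network_state, target_state)
```

Here the longest common subsequence has length 4, so the distance is 5 + 5 - 8 = 2, and the distance is 0 only when the two vectors are identical.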

Computing the LCS-based distance between the instantaneous network state and the NE subtype throughout the asynchronous transition simulations shows how the network approaches and moves away from the NE subtype when starting from the NON-NE subtype. Furthermore, this allows us to identify patterns that cause the distance between the two network states to increase or decrease and, hence, to identify other TFs that may contribute to this transition.

Discussion

Small cell lung cancer (SCLC) is an aggressive disease with mixtures of transcriptional subtypes such as neuroendocrine (NE) and non-neuroendocrine (NON-NE), the latter being more treatment-resistant, regulated by the expression of different transcription factors (TFs). In addition to the heterogeneity in cancerous cell types, transitions between the subtypes make the disease even harder to treat. To date, SCLC TF networks have been broadly studied via systems approaches to reveal regulators and destabilizers of different subtypes. Yet these studies do not address the mechanisms of subtype transitions, an understanding of which is critical for controlling disease progression and perhaps for developing a permanent cure. In this work, we hypothesize that analysis of the SCLC TF network structure (Fig. 1), which, to the best of our knowledge, has barely been investigated, can provide clues on distinct subtype drivers and further reveal pathways controlling subtype transitions. To test this hypothesis, we use a graph theory concept called Dense Spanning Trees and its extension, Minimum Dense Spanning Trees (DSTs and MDSTs; see Box 1 and Methods sections Dense Spanning Trees of the unbiased SCLC TF network and Integrating data into the structural analysis: Minimum Dense Spanning Trees). DSTs and MDSTs are special subnetworks of the initial TF network that feature strategic nodes called hubs and the pathways connecting the hubs. Hubs are critical nodes because they interconnect several key pathways and collect, process, and distribute key signals throughout the signaling mechanism. Moreover, the pathways connecting the hubs are also important, as they are potential probes for controlling complex signaling across hubs. Therefore, given two hubs regulating different SCLC subtypes, we hypothesize that the pathways connecting these hubs could be targets to control subtype transitions.

First, with DSTs, we analyze a relatively unbiased network structure by removing all the edge directions, i.e., the information on activating and inhibitory interactions, and not using any data on the strength of the connections (Fig. 3). Next, we integrate data into this purely structural analysis by assigning each edge a weight based on the probability that the interaction exists, i.e., the strength of the connection estimated from RNA-seq data34. Then, we extend the DST into the MDST (Methods section Integrating data into the structural analysis: Minimum Dense Spanning Trees) to identify the hubs of the weighted network structure (Fig. 4). Interestingly, all the hubs identified in both the unbiased and data-driven structural analyses, such as ASCL1, FLI1, and MITF, are either regulators or destabilizers of different SCLC subtypes as reported in the literature, which confirms our hypothesis on the importance of hubs. Additionally, the data-driven structural analysis highlights MYC as another hub in addition to those identified in the unbiased analysis (Fig. 4), which supports its importance in SCLC subtypes as shown in recent studies32,61,62,63. To test the roles of pathways connecting functionally distinct hubs, we asynchronously simulate the SCLC TF network using a Boolean modeling framework extracted by a tool called BooleaBayes34 (Methods section SCLC TF network subtype transition simulations). After several asynchronous iterations in which the pathway connecting FLI1 and MITF, the two major hubs in both the unbiased and data-driven analyses, is kept active, we observe a transition from the NON-NE to the NE subtype (Fig. 5), confirming our hypothesis on the importance of hub-connecting pathways. Furthermore, after analyzing increasing and decreasing patterns in the distance between the network state and the NE subtype (Figs. 6 and 7) in 700 asynchronous NON-NE to NE transition simulations, we conclude that the TFs NR0B1 and CEBPD may also play a role in this transition in addition to FLI1 and ASCL1 (Figs. 8 and 9).

Note that one can integrate different data into this analysis by assigning them as the weights of the edges. For instance, instead of assigning probabilities of interactions extracted from experimental data, the mutual information between each pair of nodes can be used. In this case, the resulting MDSTs would contain the hubs while preserving the highest mutual information and the maximum influence among the nodes. Similarly, one can assign the weights manually, guided by prior knowledge, to keep the preferred interactions in the resulting substructures. Also, one can apply the tools presented here to any network type, such as protein–protein interaction networks (PPINs), gene regulatory networks (GRNs), cell signaling networks, and metabolic networks. In addition, they can be applied to any network structure, whether directed or undirected and weighted or unweighted. Note that although preserving the directedness of interactions would integrate more information into the structural analysis, it would also require adding new constraints to the optimization problems (1) and (2), which may become harder to solve due to the increased complexity, leaving room for potential improvement to the DSTs and MDSTs found for the SCLC network. Moreover, as this is a structural network analysis, the results will be sensitive to the given network structure. Here, we analyzed the SCLC TF network provided in refs. 34,38. Given different SCLC TF networks with different sets of nodes and interactions, the observations might change.

There are ways to define and identify the hubs of a given network other than ours. One can define the hub as the node that has the most connections (highest node degree) or as the node whose connections make it the most central in the network (see Supplementary Information for the application of different hub definitions and their results on the SCLC TF network). However, we believe these definitions are not well suited for biological applications, as they are purely structural concepts and do not take into account closeness, i.e., the influence of the nodes on each other. Moreover, such hubs are expected to occur only in scale-free networks, i.e., networks whose degree distribution follows a power law66. On the other hand, the concept of DSTs and MDSTs can identify hubs for any given network because, in DSTs and MDSTs, hubs are defined as the central nodes that minimize the total distance between every pair of nodes, and such substructures can be found for any random network. Additionally, there are other ways to find DSTs of a given network, such as the edge-swap heuristic algorithms presented in refs. 53,54. However, we have previously shown that optimization-based approaches outperform such edge-swap heuristic algorithms55 both in accuracy and in how their computational complexity changes with network size. Lastly, to identify the DSTs and MDSTs here, we solve the optimization problems (1) and (2) using a genetic algorithm (GA), a metaheuristic optimization method that attempts to find a globally optimal solution. GA does not guarantee a global solution, because it does not guarantee exploration of the entire search space and because the solution quality and optimality depend on several parameters that need to be properly selected by the user, including the population size and the rates of mutation and crossover64. However, GA is well suited to problems that are discrete and combinatorial in nature, for which it provides at least a good approximation of the global solution. Alternatively, one can solve these optimization problems via other algorithms such as particle swarm optimization.

Overall, the presented results show that the hubs of the SCLC TF network identified via DSTs and MDSTs are either regulators or destabilizers of different SCLC subtypes. This implies that structural analysis of a network can be advantageous as an initial analysis step, since its results can guide the generation of hypotheses to be tested in experiments. Moreover, the pathways connecting the functionally distinct hubs may have major roles in SCLC subtype transitions, as shown by the simulations, which may allow the control of such transitions and help develop better treatment strategies by driving the cancerous cells toward more sensitive states. Furthermore, targeting those pathways in experiments may lead to the identification of other dominant components in such transitions and hence help to understand the underlying mechanism of this complex signaling process. As a result, pure as well as data-driven structural analyses of networked processes could be a plausible first step and may yield important biological observations in complex systems as well as help generate hypotheses to be tested.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.