Pediatric Sarcoma Data Forms a Unique Cluster Measured via the Earth Mover’s Distance

In this note, we combined pediatric sarcoma data from Columbia University with adult sarcoma data collected from TCGA, in order to see if one can automatically discern a unique pediatric cluster in the combined data set. Using a novel clustering pipeline based on optimal transport theory, this turned out to be the case. The overall methodology may find uses for the classification of data from other biological networking problems.


Results
Data. The gene expression data sets used in the present work, consist of two parts. The first part includes the gene expression of 27 patients diagnosed with pediatric-associated sarcoma and treated at Columbia University Medical Center (CUMC). Informed and signed consent for clinical and research sequencing was obtained in the context of the pediatric precision medicine program (PIPseq) established at CUMC and under the CUMC Institutional Review Board (IRB)-approved protocol AAAN8404 10 . The second part was downloaded from The Cancer Genome Atlas (TCGA) database, covering the gene expression data of 265 adult patients. We have one sample per patient for both of them, so 292 samples in total. The data sets were normalized utilizing one of the standard methods for treating RNA-Seq counts data via the variance-stabilizing transformation (VST) in the DESeq2 package for R 11 . This normalization was done amongst all of the 292 sarcoma samples.
The network topology (graph adjacency matrix) was constructed using interaction information from the Human Protein Reference Database (HPRD) 12 . Specifically, we took the intersection of the genes that appear in both HPRD data and the gene expression data, and then kept the largest connected component. After discarding the redundant genes, we arrived at a gene regulatory network with 8844 nodes (genes) and 34926 edges (interactions). The average and median degrees are 7.9 and 4, respectively.
Weighted graph and invariant measure. We constructed a weighted graph for each sample using the mass action principle 13 . In particular, for given gene expression {x i > 0|1 ≤ i ≤ n} the weight p ij on the edge (i, j) is defined as for any j∈N(i). Here n = 8844 is the number of nodes and N(i) denotes the set of neighbors of the node i. Note by construction the matrix = = P p [ ] ij i j n , 1 is a stochastic matrix and satisfies that p ij = 0 if the edge (i, j) doesn't exist. Biologically speaking, mass action is similar to well established methods of differential gene co-expression 14 utilized to develop specific profile for metastatic states 13 . Here, similar to differential co-expression or correlation for analyzing co-regulation patterns in cellular pathways, mass action is based on the assumption that intensity of the interaction between two interactive genes is likely to be larger if both of them have higher expression level. For example, often in drug studies 15,16 , one studies co-regulation patterns via the differential expression of genes that are induced through a knockdown of separate gene (e.g., PI3K inhibition of BYL719 induces expression of estrogen receptor function in breast cancer 15 ). Here, the same underlying principles are used when employing mass action with the added advantage that one can construct patient/sample specific networks without the usage of multiple samples needed for correlation.
The stochastic matrix P defines a Markov chain 9 on the gene regulatory network. Different properties such as entropy and curvature have been considered for this object to study robustness of cancer network 6,7,17 . Here we consider the invariant measure (stationary distribution) of this Markov chain. The Markov chain describes the information flow between genes. When the underlying network is connected, the system will eventually reach an equilibrium and this equilibrium is described by the invariant measure. Mathematically, it is a probability vector π satisfying π π = . P (2) Thus π is a left eigenvector of P with non-negative entries that sum to 1. The value π i at node i reflects the portion of contribution of that node to the entire network. In other words, the invariant measure π is a centrality measure of the significance of different genes.
In general, to obtain the invariant measure, one needs to solve the linear equation (2). However, for the specific stochastic matrix in (1), π has the explicit structure where Z is a normalization factor (partition function) forcing π to be a probability vector.
The expression (3) is very interesting. Note that the value of π i at node i reflects the significance of gene i in the gene regulatory network. It consists of two components: the gene expression level x i of gene i and the total gene expression of its neighbors ∑ ∈ x j N i j ( ) . In other words, the invariant measure captures the key property that a gene is important if its expression level is high and it interacts with many other genes.
Optimal transport on graphs. Consider a connected undirected graph = ( , ) G V E with n nodes in  and m edges in . Given two densities  ρ ρ ∈ , n 0 1 on the graph, the original formulation of the optimal transport problem seeks a joint distribution  µ ∈ × n n of ρ 0 and ρ 1 minimizing the total cost µ ∑ c ij ij , that is, Here c ij is the cost of moving unit mass from node i to node j and is taken to be the minimum of the number of steps to go from i to j, namely, c is the ground metric on the graph. For example, if the edge (i, j) exists, then c ij = 1. The minimum of this optimization problem defines a metric W 1 (the Earth Mover's Distance) on the space probability densities on . An alternative formulation (see Methods) is defined on the fluxes  ∈ u m given on the edges.
n m is defined by associating an orientation to each edge e k = (i, j) = (j, i) of the graph: one of the nodes i, j is defined to be the head and the other the tail, and then we set d ik = +1(−1) if i is the head (tail) of e k and 0 otherwise. Compared to (4), which has n 2 variables, the above formulation has only m variables. It may greatly reduce the computational load when the graph  is sparse, i.e.,  m n 2 . This is the case in our data sets, where n = 8844 and m = 34926. In implementation, we used the standard convex optimization package CVX 18 written in Matlab, in order to numerically solve (5). We should also note that there is some very nice recent work on the fast computation of the Earth Mover's Distance 19 based on (5).
Clustering of sarcoma data. We define a distance function between different gene expression data sets using optimal transport theory on graphs. More specifically, we define the distance between two gene expression data sets to be the W 1 optimal mass transport distance between the two invariant measures induced by the gene expressions as in (3). This distance W 1 can be computed through convex programming 1,8 . We computed the W 1 distances between each pair of all the 292 samples (27 pediatric sarcoma and 265 adult cancer). The heat map of the resulting distance matrix is as shown in Fig. 1. The samples clearly split into two clusters; one cluster for the 27 pediatric sarcoma samples and one cluster for the 265 adult cancer patients.
To visualize more clearly the two clusters, we truncate the distances using some threshold: set the value to be zero if the distance is less than the given threshold and one otherwise. The results with threshold value 0.075 and 0.1 are depicted in Figs 2 and 3, respectively. Note that there is a small gap between these two clusters, which indicates that the last sample in the pediatric sarcoma is an outlier. Figure 4 is a 3D plot of the distance matrix, from which we can see an obvious difference that distinguishes this outlier from the rest of the sarcoma samples. The clusters and the outlier can be also seen based on the histograms. Figures 5, 6 and 7 are the histograms of the distances within the pediatric sarcomas, within the adult sarcomas, and between these two age groups, respectively. Apparently the distances within the two groups (pediatric, adult) are smaller than the distances between them. In particular, the average distances within the two groups are 0.0891, 0.0665 while the average distance between them is 0.1366. The distance between the outlier and the other samples is shown in Fig. 8, with mean value 0.2424, which is significantly larger than the average. See our discussion in the next section for further analysis of these results.

Discussion
Sarcomas represent a heterogeneous group of malignant solid tumors of connective tissue. Sarcomas comprise approximately 1.5% of all malignant tumors diagnosed in adults and over 7% of cancers in children 20 . Although the diversity of sarcoma subtypes can be encountered across the age spectrum, there exists a pattern of sarcoma subtypes that significantly distributes between adults and children. For example, osteosarcoma and Ewing sarcoma (malignant bone tumors) are predominant in children and early adults, whereas undifferentiated pleomorphic sarcoma (previously called malignant fibrous histiocytoma), liposarcoma and leiomyosarcoma are extremely rare in children 21,22 .
In addition to the observation that particular sarcoma subtypes predominate in either childhood or in adulthood, there are also differences in the clinical outcomes of adult and childhood sarcoma patients that extend beyond the differences in treatment regimens between adult and childhood sarcomas 21, 23-25 . With the emergence of next-generation sequencing technologies, we are afforded the opportunity to evaluate the biologic differences between pediatric-associated and adult-associated sarcomas.
In our analysis of 27 sarcoma cases treated at CUMC, only 26 of the 27 original cases would be categorized as a pediatric-associated sarcoma. Interestingly, one case originally included in the pediatric set segregated as   an outlier. This case represents a 25 year old female with a history of multiply relapsed, metastatic alveolar soft part sarcoma (ASPS). ASPS is a rare sarcoma subtype comprising 0.2-0.9% of all soft tissue sarcomas 26 . ASPS is extremely rare in childhood, and is more commonly diagnosed in adolescence and young adulthood (15-35 years of age) 20 .
A second adult case included in the pediatric cohort is from a 38 year old male with metastatic synovial sarcoma. In contrast to the previous adult cases of ASPS, this case segregated with the pediatric cohort. Synovial sarcoma is a soft tissue sarcoma with a peak incidence in the 3rd decade of life, and with about 1/3 of cases occurring within the pediatric age range 27 . Synovial sarcoma is more common than ASPS and is the most frequent   non-rhabdomyosarcomatous soft tissue sarcoma in adolescents and young adults 28 . Although historical differences in the approach to therapy between pediatric and adult oncologists have existed for the treatment of sarcomas and other tumors, there has been acknowledgement in the adult oncology community of the clinical utility of pediatric-based regimens for the treatment of sarcomas occurring in adulthood 29,30 . However, despite use of more dose-intense chemotherapeutic approaches to the treatment of sarcomas in adulthood, pediatric-associated sarcomas diagnosed and treated in adulthood continue to have inferior outcomes compared to treatment in childhood 31,32 .
These observations suggest that there may exist age-dependent differences in the biology of sarcomas. However, it is unclear what the thresholds for age may be that would contribute to differential responses to treatment and clinical outcome as the cutoffs for age and the definition of "adult age" has varied in the literature. The results from this analysis suggest that the sarcoma subtype may supersede, in this instance, the contribution of age to the biologic behavior and genomic signature. So from this classification scheme, it seems that there are indeed biologic differences between sarcoma subtypes that are generally associated with childhood (such as synovial sarcoma) versus those more commonly associated with adulthood (such as ASPS), and provides a rationale for the use of pediatric regimens for the treatment of these diseases regardless of the patient's age.
Genomic characterization of a larger cohort of pediatric-associated and adult-associated sarcomas will be imperative in specifically clarifying the genomic lesions that result in the clinical differences in behavior of sarcomas across the age spectrum. In any event, we did manage to cluster 26 out of the 27 CUMC cases from the TCGA data using our methodology.
In our Supplemental Information file, we have two other examples. The first illustrates the methodology applied to clustering breast cancer data (triple negative and normal). The second using synthetic "gene expression" networks shows the importance of topology in clustering. EMD has the nice feature of explicitly utilizing the topology of the network under consideration.
We should finally note that the pipeline sketched in Fig. 9 is quite general and may be quite useful in clustering various biological networks. These typically may be represented as weighted graphs, and thus after suitable normalization as Markov chains for which there exist the corresponding stationary measures. Optimal mass transport theory realized by the Earth Mover's Distance seems to be an ideal tool for capturing distances among these measures, and thus leads to a natural clustering/classification framework. Several interesting biological graphs as suggested by one of the reviewers could include those based on evolutionary distance between genes,  structural similarity in within same fold family, percent of shared functional sites, even predicted, and percent of shared protein domains.

Methods
Overall sketch. Figure 9 illustrates the overall pipeline of the clustering methodology described in the previous sections. The basic idea is that once one has defined the network topology (in this case via the Human Protein Reference Database), and the weights connecting the nodes (derived here from the mass action principle), one can use in a straightforward manner an invariant of each network, and then compute the distance matrix defined by the EMD or Wasserstein 1-metric. In the next section, we will review the definition and properties of this central mathematical object underpinning our analysis.
Earth Mover's Distance. In this section, we briefly review the mathematics of the Earth's Mover's Distance (EMD) from optimal mass transport theory, the key method on which all the previous results were based. The classical Earth Mover's Distance was formulated by Monge in 1781 to solve the problem of moving a pile of soil to a excavation site with the least amount of work relative to some cost. This is illustrated in Fig. 10. For full details as well as long lists of references, see the monographs 1-3 .
Mathematically, we let ρ 0 and ρ 1 denote two probability densities on  m . This means that   ρ → : The above "dual of the dual" method can be applied to transport problems on graphs to show the equivalence between (4) and (5) by replacing the metric ⋅ by c and the divergence operator ∇⋅ by D. In so doing, one gets a tremendous saving in computational burden since equation (4) involves solving systems on the order of the square of the number of nodes, while equation (5) is of the order of the number of edges. In our specific case, we had 8844 nodes (genes) and 34926 edges (interactions), that is, we save in this manner ∼ × 8844 /34926 2 10