Segmentation of biological multivariate time-series data

Time-series data from multicomponent systems capture the dynamics of the ongoing processes and reflect the interactions between the components. The progression of processes in such systems usually involves check-points and events at which the relationships between the components are altered in response to stimuli. Detecting these events together with the implicated components can help understand the temporal aspects of complex biological systems. Here we propose a regularized regression-based approach for identifying breakpoints and corresponding segments from multivariate time-series data. In combination with techniques from clustering, the approach also allows estimating the significance of the determined breakpoints as well as the key components implicated in the emergence of the breakpoints. Comparative analysis with the existing alternatives demonstrates the power of the approach to identify biologically meaningful breakpoints in diverse time-resolved transcriptomics data sets from the yeast Saccharomyces cerevisiae and the diatom Thalassiosira pseudonana.


Synthetic data
In this section, we elaborate on the results obtained by applying the different approaches to the synthetic data, described in the main text in Omranian et al. (2013). The results are summarized in Table S1, and Figure S1 shows the comparison between the obtained segmentations. The actual breakpoints are 7, 12, and 21.
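To make the notion of recovering planted breakpoints concrete, the following minimal sketch builds a toy piecewise-constant multivariate series with breakpoints at 7, 12, and 21 and segments it. It is not the regularized regression approach of Algorithm 1, nor any of the contending methods; it swaps in a plain least-squares segmentation solved by dynamic programming. The series length, number of components, segment means, noise level, and the convention that a breakpoint is the last (1-based) time point of a segment are all assumptions made for illustration.

```python
import numpy as np

def segment_cost_table(X):
    """cost[i, j] = squared error of fitting one mean vector to rows i..j-1 of X."""
    T, p = X.shape
    # Prefix sums let each segment cost be computed without re-scanning the rows.
    s1 = np.vstack([np.zeros(p), np.cumsum(X, axis=0)])
    s2 = np.concatenate([[0.0], np.cumsum((X ** 2).sum(axis=1))])
    cost = np.full((T + 1, T + 1), np.inf)
    for i in range(T):
        for j in range(i + 1, T + 1):
            seg_sum = s1[j] - s1[i]
            cost[i, j] = (s2[j] - s2[i]) - (seg_sum @ seg_sum) / (j - i)
    return cost

def best_segmentation(X, k):
    """Return the last time point (1-based) of each of the first k-1 segments."""
    T = X.shape[0]
    cost = segment_cost_table(X)
    # best[m, j]: minimal cost of splitting the first j rows into m segments.
    best = np.full((k + 1, T + 1), np.inf)
    prev = np.zeros((k + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, T + 1):
            # Candidate start indices i of the m-th segment run from m-1 to j-1.
            cands = best[m - 1, m - 1:j] + cost[m - 1:j, j]
            a = int(np.argmin(cands))
            best[m, j] = cands[a]
            prev[m, j] = a + (m - 1)
    # Backtrack from the full series to recover the segment boundaries.
    bps, j = [], T
    for m in range(k, 0, -1):
        j = int(prev[m, j])
        bps.append(j)
    return sorted(bps[:-1])  # drop the trailing 0 (start of the first segment)

# Toy data (all values below are hypothetical, chosen only for illustration).
rng = np.random.default_rng(0)
T, p = 30, 5                      # assumed series length and number of components
true_bps = [7, 12, 21]            # breakpoints stated for the synthetic data
levels = [0.0, 2.0, -1.0, 1.5]    # assumed mean level of each segment
bounds = [0] + true_bps + [T]
X = np.vstack([
    lvl + 0.1 * rng.standard_normal((b - a, p))
    for lvl, a, b in zip(levels, bounds[:-1], bounds[1:])
])

found = best_segmentation(X, k=len(true_bps) + 1)
print(found)
```

Because the toy segment means are well separated relative to the noise, the sketch is expected to return the planted breakpoints. Real expression data would not be this clean, and the number of segments k would have to be chosen rather than given.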

Yeast's metabolic cycle
In this section, we elaborate on the results obtained by applying the different approaches to the yeast's metabolic cycle (YMC) data, described in the main text. The results are summarized in Table S2, which shows the comparison between the obtained segmentations.

Yeast's cell cycle
In this section, we elaborate on the results obtained by applying the different approaches to the yeast's cell cycle (YCC) data. With the filtering step, the number of genes was reduced from 6076 to 2071. The characteristics of the resulting segmentations are summarized in Table S3 and Figure S3. Based on the work of Spellman et al. (1998), the data should be divided into 5 segments representing 5 cycles, where each cycle includes the following phases: M/G1, G1, S, G2, and M. The M/G1, G1, and S phases each last 2 time points, while the G2 phase lasts only one time point, as described in Ramakrishnan et al. (2010). Accordingly, as shown in Table S3, our method revealed the cell cycles in the YCC data.

Oxidative stress and yeast's cell cycle
In this section, we elaborate on the results obtained by applying the different approaches to the data capturing the effect of oxidative stress, induced by hydrogen peroxide (HP), on the yeast's cell cycle. With the filtering step, the number of genes was reduced from 4771 to 1189.
The characteristics of the resulting segmentations are summarized in Table S4 and Figure S4. Based on the work of Shapira et al. (2004), we could capture all phases in the system, which correspond to the G1, S, G2, and G2/M phases of the cell cycle.

Tables
Table S1. Optimal segmentation for synthetic data. The first part of the table comprises the result of the optimal segmentation for synthetic data based on the regularized regression approach implemented in Algorithm 1. The second part includes the result based on the approach of Bleakley et al. (2011). The third and fourth parts show the results based on the method of Omranian et al. (2013), the penalized longest path algorithm, using the number of segments and the distribution of segment lengths, respectively, to calculate the penalty of a path. The lower part contains the result based on the method of Ramakrishnan et al. (2010). The upper part of the table includes the number of segments k, the breakpoints, and the tuning parameters corresponding to the fused LASSO regularization parameters, λ_1 and λ_2. In the third part of the table, the first and second columns show the name and the type of network properties used to determine the distances: G stands for global, L for local, and LG for local-global. The third column includes the number k for each of the three methods, and the resulting segments are given in the fourth column. The fifth and sixth columns in the third and fourth parts present the values of the lower (ν_min) and upper (ν_max) bounds of the tuning parameter ν in the dynamic programming approach. The lower part also includes the minimum and maximum length of the segments, i.e., l_min and l_max, as parameters of the contending method.

Table S2. Optimal segmentation for yeast's metabolic cycle (YMC) data, with the same preprocessing as applied in Ramakrishnan et al. The upper part of the table includes the number of segments k, the breakpoints, and the tuning parameters corresponding to the fused LASSO regularization parameters, λ_1 and λ_2. In the third part of the table, the first and second columns show the name and the type of network properties used to determine the distances: G stands for global, L for local, and LG for local-global. The third column includes the number k for each of the three methods, and the resulting segments are given in the fourth column. The fifth and sixth columns in the third and fourth parts present the values of the lower (ν_min) and upper (ν_max) bounds of the tuning parameter ν in the dynamic programming approach. The lower part also includes the minimum and maximum length of the segments, i.e., l_min and l_max, as parameters of the contending method.
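
For readers unfamiliar with the two fused LASSO tuning parameters named in the table captions, the generic fused LASSO objective in its standard textbook form is reproduced below. This is a reference sketch only: the symbols y, x, β, n, and p do not appear in this supplement and are assumptions, as is the assignment of λ_1 to the sparsity term and λ_2 to the fusion term. The exact formulation used in Algorithm 1 is given in the main text and may differ.

```latex
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{p}}
  \frac{1}{2} \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \Bigr)^{2}
  + \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert
  + \lambda_2 \sum_{j=2}^{p} \lvert \beta_j - \beta_{j-1} \rvert
```

In this form, a larger first penalty shrinks individual coefficients toward zero, while a larger second penalty shrinks differences between consecutive coefficients, so that the positions where consecutive coefficients still differ are the natural candidates for breakpoints.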

Figure Legends