Supplementary Figures

Supplementary Figure 1: Examples of changes in linear trend and CUSUM. (a) The existence of an abrupt jump and (b) of a slope change in the trace a(t). The traces are fitted with straight lines a_fit(t) (grey lines). (c) and (d) are the difference time series d(t) = a(t) − a_fit(t) of the traces in (a) and (b), respectively. (e) and (f) are the CUSUM curves of (c) and (d), respectively. The total fluctuation of each CUSUM curve is denoted as D_data.

The deviation of the histogram from the fitted normal distribution is too small to indicate an angular increment at the catalytic dwell. The statistics of consecutive dwell times, i.e., pause 1 to 2, pause 2 to 3, and pause 3 to 1, are plotted separately from left to right, respectively. The quantity P(τ, τ′) − P(τ)P(τ′) has small values fluctuating randomly around zero, indicating that consecutive dwell times τ and τ′ are not correlated. The lack of correlation between τ and τ′ is further confirmed by a permutation test for the Pearson correlation coefficient (PCC) of τ and τ′ as follows. Let {(τ_1, τ′_1), (τ_2, τ′_2), ...} be the set of consecutive dwell-time pairs obtained from the experimental trace and C_data be the PCC of this set. We permute the positions of the τ′_i in the set {(τ_1, τ′_1), (τ_2, τ′_2), ...} to remove the correlation (if there is any) between τ and τ′, and compute the permuted PCC. This permutation procedure is repeated many times to generate a set of permuted PCCs. The original C_data is then compared to the set of permuted PCCs to obtain the two-sided p-value. For all experimental traces considered in this work, the two-sided p-values are always larger than the significance threshold (e.g., 5%), indicating that no correlation between τ and τ′ can be detected with statistical significance.
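The permutation test described above can be sketched as follows (a minimal NumPy implementation under our own naming; this is an illustration, not the original analysis code):

```python
import numpy as np

def permutation_test_pcc(pairs, n_perm=1000, seed=0):
    """Two-sided permutation test for the Pearson correlation coefficient
    (PCC) of consecutive dwell-time pairs (tau_i, tau'_i)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(pairs, dtype=float).T
    c_data = np.corrcoef(x, y)[0, 1]          # PCC of the experimental pairs
    perm = np.empty(n_perm)
    for k in range(n_perm):
        # permuting the tau'_i destroys any correlation between tau and tau'
        perm[k] = np.corrcoef(x, rng.permutation(y))[0, 1]
    # two-sided p-value: fraction of permuted PCCs at least as extreme
    p_value = float(np.mean(np.abs(perm) >= abs(c_data)))
    return c_data, p_value
```

A p-value above the chosen significance level (e.g., 5%) means no correlation is detected between consecutive dwell times.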

Supplementary Notes
Supplementary Note 1: Testing the existence of change points by permutation method

We generalize the change point algorithm developed by Taylor [1], which detects changes in mean values, to detect both sudden jumps and changes in the linear trend. Here we first describe how to detect the existence of a single change point; the detection of multiple change points is discussed in Supplementary Note 3.
Suppose we have a segment of time series a(t) as shown in Supplementary Figure 1a. The CUSUM is defined as the cumulative sum of the difference time series, C(t) = Σ_{t'=1}^{t} d(t'), where d(t) = a(t) − a_fit(t). The CUSUM curves for the two traces in Supplementary Figures 1a-b are shown in Supplementary Figures 1e-f. For a given time t* (1 < t* < T), the squared error (SE) is defined as

SE(t*) = Σ_{t=1}^{t*} (a(t) − a_fit^L(t))^2 + Σ_{t=t*+1}^{T} (a(t) − a_fit^R(t))^2,

where T is the number of data points, and a_fit^L(t) and a_fit^R(t) are, respectively, straight lines fitted to the left and right segments of the time series separated at t = t* (grey lines in Supplementary Figures 3a-b).

Next we provide a simple scheme based on the bootstrapping method [2] to estimate the error bar associated with the determined change point location. The error bar represents the uncertainty in pinpointing the change point location due to the sampling error in evaluating the SE. Suppose the change point location is determined to be t* = t_ch. We first estimate the uncertainty in SE(t_ch) = Σ_{t=1}^{t_ch} (a(t) − a_fit^L(t))^2 + Σ_{t=t_ch+1}^{T} (a(t) − a_fit^R(t))^2 by the bootstrapping method as follows. The segment of data {a(1), ..., a(t_ch)} under the first summation in SE(t_ch) is resampled with replacement (i.e., bootstrapped). Similarly, a bootstrap resampling is performed for the second segment {a(t_ch + 1), ..., a(T)}. As before, we fit each of these two bootstrapped segments with a straight line, and the bootstrapped squared error SE_1^boot(t_ch) is evaluated. This process is repeated many times (usually ~1000 times) to generate an ensemble of bootstrapped SEs, {SE_1^boot(t_ch), ..., SE_1000^boot(t_ch)}, which gives the bootstrap distribution of SE(t_ch). The error bar (shown in Supplementary Figures 3c-d) associated with the value of SE(t_ch) corresponds to the bootstrap 68% confidence interval, i.e., the interval from the 16th percentile to the 84th percentile of the bootstrap distribution.
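The squared-error search and the bootstrap error estimate can be prototyped as follows (an illustrative NumPy sketch with our own function names; the original implementation is not shown in this note):

```python
import numpy as np

def line_sse(seg):
    """Sum of squared residuals of a least-squares straight line through seg."""
    t = np.arange(len(seg))
    coef = np.polyfit(t, seg, 1)
    return float(np.sum((seg - np.polyval(coef, t)) ** 2))

def squared_error(a, t_star):
    """SE(t*): separate straight lines fitted left and right of the split t*."""
    return line_sse(a[:t_star]) + line_sse(a[t_star:])

def best_change_point(a, margin=3):
    """Candidate change point: the split t* minimising SE(t*)."""
    return min(range(margin, len(a) - margin),
               key=lambda t: squared_error(a, t))

def bootstrap_se_ci(a, t_ch, n_boot=1000, seed=0):
    """Bootstrap 68% confidence interval for SE(t_ch): each side of the
    split is resampled with replacement and refitted, as in the text."""
    rng = np.random.default_rng(seed)
    vals = [line_sse(rng.choice(a[:t_ch], size=t_ch))
            + line_sse(rng.choice(a[t_ch:], size=len(a) - t_ch))
            for _ in range(n_boot)]
    return np.percentile(vals, [16, 84])
```

For a noiseless trace with an abrupt jump, `best_change_point` recovers the jump location exactly, since both sides of the correct split are fitted with zero residual.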
Finally, we note that the change points resulting from the above binary segmentation procedure may contain errors, both in the hypothesis tests for their existence and in their location estimates. This is because change points on the left and right hand sides of a segmentation point, which are only detected in later stages of the binary segmentation (e.g., the change points not yet detected in Supplementary Figure 4a), can affect the hypothesis test and the location estimate at the current stage of the segmentation. Therefore, we perform a final clean-up procedure as follows: let the locations of the multiple change points resulting from the binary segmentation be t_i for the ith change point, with i = 1, 2, 3, ... and t_1 < t_2 < t_3 < .... The hypothesis test and location estimation are carried out again for each change point t_i using only the segment of the time series from t_{i−1} to t_{i+1}. In this way, the existence and location of each change point can be evaluated more precisely, free from the effect of as-yet-undetected change points.
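The clean-up pass can be organised as below (a schematic skeleton under assumed names; `refine` stands in for whatever hypothesis test plus location estimator is applied to a single segment):

```python
def clean_up(n, change_points, refine):
    """Re-examine each change point using only the stretch of the trace
    between its two neighbouring change points (trace length n).

    refine(lo, hi) should re-run the hypothesis test and location
    estimate on the segment [lo, hi) and return the refined location,
    or None if the change point is rejected."""
    cps = sorted(change_points)
    bounds = [0] + cps + [n]
    cleaned = []
    for i, _ in enumerate(cps):
        lo, hi = bounds[i], bounds[i + 2]   # neighbours of the ith point
        loc = refine(lo, hi)
        if loc is not None:
            cleaned.append(loc)
    return cleaned
```

Each change point is thus re-evaluated on a window that excludes all other detected change points, mirroring the procedure in the text.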

Supplementary Note 4: Clustering to assign change point intervals to catalytic dwells
We first introduce the concept of "soft" clustering as follows. Given N_e elements, {e_1, e_2, ..., e_{N_e}}, a "hard" clustering algorithm assigns each element to exactly one cluster (or group) out of N_c clusters, {C_1, C_2, ..., C_{N_c}}, with N_c ≤ N_e. Soft clustering, on the other hand, allows the elements to belong to more than one cluster with a certain "membership". These memberships are specified by the conditional probability P(C_α|e_i) (with Σ_{α=1}^{N_c} P(C_α|e_i) = 1) that the given element e_i belongs to the cluster C_α. Hard clustering is the special case of soft clustering in which P(C_α|e_i) equals either zero or one.
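As a concrete illustration (toy numbers of our own choosing), the memberships can be stored as a row-stochastic matrix, with one row per element and one column per cluster:

```python
import numpy as np

# Rows are elements e_i, columns are clusters C_a; entries are P(C_a | e_i).
hard = np.array([[1.0, 0.0],    # e_1 belongs entirely to C_1
                 [1.0, 0.0],    # e_2 belongs entirely to C_1
                 [0.0, 1.0]])   # e_3 belongs entirely to C_2
soft = np.array([[0.9, 0.1],    # e_1 mostly in C_1 ...
                 [0.6, 0.4],
                 [0.2, 0.8]])   # ... e_3 mostly in C_2
# In both cases each row sums to one: sum_a P(C_a | e_i) = 1.
assert np.allclose(hard.sum(axis=1), 1.0)
assert np.allclose(soft.sum(axis=1), 1.0)
```

Hard clustering corresponds to every row being a one-hot vector; soft clustering allows fractional memberships.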
Clustering procedures assign dissimilar elements to distinct clusters, and so one needs to provide a distance (or dissimilarity measure) d(e_i, e_j) (with d(e_i, e_j) = d(e_j, e_i) and d(e_i, e_i) = 0) between the elements. In our study of the rotary time series of F_1-ATPase, the distance between two change point intervals (the elements) is chosen to be the difference between the mean angles of the intervals. From the element-to-element distance d(e_i, e_j), one can obtain the element-to-cluster distance as the weighted average

d(e_i, C_α) = Σ_{j=1}^{N_e} P(e_j|C_α) d(e_i, e_j),   (Eq. 3)

where P(e_i|C_α) = P(C_α|e_i)P(e_i)/P(C_α) is the probability of finding the element e_i in the cluster C_α. The distortion of the cluster description is the element-to-cluster distance averaged over the joint distribution,

⟨d(e_i, C_α)⟩_{P(e_i,C_α)} = Σ_{i,α} P(e_i, C_α) d(e_i, C_α) = Σ_α P(C_α) [ Σ_{i,j} P(e_i|C_α) P(e_j|C_α) d(e_i, e_j) ].   (Eq. 4)

The second equality in Eq. 4 tells us that the distortion is the intra-cluster distance (the term inside the square brackets) averaged over the clusters. It can easily be checked that 0 ≤ ⟨d(e_i, C_α)⟩_{P(e_i,C_α)} ≤ Σ_{i,j=1}^{N_e} P(e_i) P(e_j) d(e_i, e_j). The distortion attains its minimum value of zero when there are N_c = N_e clusters and each element is itself a cluster, i.e., no compression.
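Eqs. 3 and 4 translate directly into array operations, as in this sketch (variable names are ours):

```python
import numpy as np

def distortion(memberships, p_e, d):
    """Average element-to-cluster distance <d(e_i, C_a)>_{P(e_i, C_a)}.

    memberships: (Ne, Nc) array of P(C_a | e_i)
    p_e:         (Ne,) element probabilities P(e_i)
    d:           (Ne, Ne) symmetric element-to-element distances
    """
    joint = memberships * p_e[:, None]        # P(C_a, e_i)
    p_c = joint.sum(axis=0)                   # P(C_a)
    p_e_given_c = joint / p_c                 # P(e_i | C_a), Bayes' rule
    d_ec = d @ p_e_given_c                    # Eq. 3: d(e_i, C_a)
    return float(np.sum(joint * d_ec))        # Eq. 4

# No compression: each element is its own cluster => zero distortion.
d = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
p_e = np.full(3, 1.0 / 3.0)
assert distortion(np.eye(3), p_e, d) == 0.0

# Maximum compression: a single cluster holding every element gives the
# upper bound sum_{i,j} P(e_i) P(e_j) d(e_i, e_j).
assert np.isclose(distortion(np.ones((3, 1)), p_e, d), float(p_e @ d @ p_e))
```

The two assertions check the two limiting cases discussed in the text.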
In this case, we simply have P(C_α|e_i) = P(e_i|C_α) = δ_{α,i} (δ_{α,i} = 1 if α = i and δ_{α,i} = 0 otherwise) and P(C_α) = P(e_α). The maximum distortion is reached when there is only one cluster (N_c = 1 and P(C_1) = 1) and all elements are assigned to this cluster, i.e., P(C_1|e_i) = 1 for i = 1, ..., N_e. This corresponds to the maximally compressed case.
On the other hand, the mutual information [3] between the elements and the clusters,

I(C, e) = Σ_{i,α} P(C_α, e_i) log [ P(C_α, e_i) / (P(C_α) P(e_i)) ],   (Eq. 5)

provides a measure of the degree of compression (or clustering) described by the membership P(C_α|e_i) with N_c clusters. Here P(C_α, e_i) = P(C_α|e_i)P(e_i). In the least compressed case, when there are N_c = N_e clusters (i.e., each element is itself a cluster), I(C, e) = −Σ_i P(e_i) log P(e_i), which is just the information content (entropy) of the elements. In the maximally compressed case, when there is only one cluster (N_c = 1; all elements are assigned to a single cluster), I(C, e) = 0 and the cluster carries no information about the elements.
Rate distortion theory [3][4][5], developed by Claude Shannon in his foundational work on information theory, formulates the tradeoff between compression and distortion to find the most compressed description of the elements for a given degree of distortion. Suppose there are N_c clusters; the tradeoff corresponds to minimizing the mutual information I(C, e) with respect to P(C_α|e_i) subject to the constraint ⟨d(e_i, C_α)⟩_{P(e_i,C_α)} = D, where D is the desired value of the distortion. The solution to this constrained optimization problem can be obtained by the method of Lagrange multipliers, in which we minimize the Lagrange function

L = I(C, e) + β ⟨d(e_i, C_α)⟩_{P(e_i,C_α)}   (Eq. 6)

with respect to P(C_α|e_i), with the Lagrange multiplier β ≥ 0. The formal expression for the P(C_α|e_i) that minimizes Eq. 6 (i.e., obtained by setting ∂L/∂P(C_α|e_i) = 0) is given by [3]

P(C_α|e_i) = P(C_α) exp[−β d(e_i, C_α)] / Z(e_i, β),   (Eq. 7)

where

Z(e_i, β) = Σ_{α'=1}^{N_c} P(C_α') exp[−β d(e_i, C_α')]   (Eq. 8)

is the "generalized" partition function ensuring correct normalization (i.e., Σ_{α=1}^{N_c} P(C_α|e_i) = 1). Eq. 7 serves only as a formal solution, since its right hand side also depends on P(C_α|e_i) through

P(C_α) = Σ_{j=1}^{N_e} P(C_α|e_j) P(e_j)   (Eq. 9)

and

P(e_i|C_α) = P(C_α|e_i) P(e_i) / Σ_{j=1}^{N_e} P(C_α|e_j) P(e_j).   (Eq. 10)
Moreover, the actual value of the Lagrange multiplier β has to be determined by requiring ⟨d(e_i, C_α)⟩_{P(e_i,C_α)} = D, where P(C_α|e_i) has the form of Eq. 7. In practice, the determination of P(C_α|e_i) in the optimization problem is carried out numerically by an iterative procedure, the Blahut-Arimoto algorithm [3], in terms of the self-consistent equations Eq. 7, 9 and 10. For given N_c (N_c = 3 in the current study, for the 3 catalytic dwells) and β, the iterative procedure is as follows:

[Step 1] The memberships P(C_α|e_i) (α = 1, ..., N_c; i = 1, ..., N_e) are randomly generated with Σ_{α=1}^{N_c} P(C_α|e_i) = 1.

[Step 2] P(C_α) and P(e_i|C_α) are evaluated using Eq. 9 and Eq. 10, respectively. The element-to-cluster distances d(e_i, C_α) are then evaluated by Eq. 3, and the partition functions Z(e_i, β) are obtained using Eq. 8.

[Step 3] The memberships are updated from P(C_α|e_i) to P′(C_α|e_i) using Eq. 7.

[Step 4] The procedure stops if the updated memberships P′(C_α|e_i) agree with the old ones, P(C_α|e_i), up to a chosen precision. If convergence is not yet reached, Steps 2 to 4 are repeated with the memberships replaced by the updated ones, i.e., setting P(C_α|e_i) = P′(C_α|e_i).
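Steps 1 to 4 can be prototyped as follows (a compact NumPy sketch of the iteration described in the text, from a single random start; names and the toy distances in the test are ours):

```python
import numpy as np

def blahut_arimoto(d, p_e, n_clusters, beta, max_iter=500, tol=1e-8, seed=0):
    """Self-consistent iteration for the memberships P(C_a | e_i)."""
    rng = np.random.default_rng(seed)
    # Step 1: random memberships, normalised so each row sums to one
    P = rng.random((len(p_e), n_clusters))
    P /= P.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        joint = P * p_e[:, None]                       # P(C_a, e_i)
        p_c = joint.sum(axis=0)                        # Eq. 9: P(C_a)
        p_e_given_c = joint / np.maximum(p_c, 1e-300)  # Eq. 10
        d_ec = d @ p_e_given_c                         # Eq. 3: d(e_i, C_a)
        # Step 3: Eq. 7 update; the row sums play the role of Z(e_i, beta)
        new = p_c * np.exp(-beta * d_ec)
        new /= new.sum(axis=1, keepdims=True)
        if np.max(np.abs(new - P)) < tol:              # Step 4: convergence
            return new
        P = new
    return P
```

With well-separated elements and a large β, the memberships converge to (near) one-hot rows, recovering hard clustering; the multiple restarts mentioned below are omitted here for brevity.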
We note that, in general, the above procedure may not reach the global minimum of the Lagrange function; therefore multiple runs (> 20 in this work) with different initial conditions, i.e., different sets of P(C_α|e_i) in Step 1 above, are performed, and the run yielding the minimum value of the Lagrange function is used.
Before addressing how we fix the value of β required as input to the iterative procedure for a given number of clusters N_c, we first give some intuition about the meaning of β and its relation to the "softness" of the clustering. First consider the case where β is a very large positive number. For a given element e_i, the partition function in Eq. 8 is then dominated by the cluster C_α* having the smallest element-to-cluster distance to e_i, i.e., d(e_i, C_α*) is the smallest among all N_c clusters, so that Z(e_i, β) → P(C_α*) exp[−β d(e_i, C_α*)]. Substituting this into Eq. 7, one obtains P(C_α|e_i) = δ_{α,α*} as β → ∞. Therefore, the large-β case corresponds to hard clustering, in which the element e_i is assigned to the cluster C_α* having the smallest element-to-cluster distance to e_i. Moreover, one can see that in this case the Lagrange function (Eq. 6) is dominated by its second term, β⟨d(e_i, C_α)⟩, and so the minimization problem reduces to the minimization of the distortion only.
On the other hand, when β = 0, P(C_α|e_i) (Eq. 7) becomes independent of e_i: every element receives the same membership, P(C_α|e_i) = P(C_α). This means that the elements are assigned to the clusters identically, i.e., the softest clustering case.
In this case, even for N_c > 1, there is no need to distinguish how elements are assigned: effectively all elements belong to a single cluster, i.e., the maximally compressed case. This can also be seen from the Lagrange function (Eq. 6): when β = 0, the minimization reduces to the minimization of the compression (the mutual information) only. In general, the Lagrange multiplier β characterizes both the degree of softness and the tradeoff between compression and distortion in the clustering.
This tradeoff can be visualized with the information curve shown in Supplementary Figure 5, in which each point on the curve represents the best compression (i.e., the lowest mutual information) achievable at the corresponding distortion value.
We now move on to the determination of the degree of softness. We adopt an error-based algorithm [6] to determine the appropriate softness (i.e., the value of β) for the clustering as follows. We first perform hard clustering of the elements with a large β, corresponding to the black dot on the information curve in Supplementary Figure 5, and denote the corresponding value of the distortion by D_hard = ⟨d(e_i, C_α)⟩|_hard. The obtained hard clustering can be thought of as the case in which there is no error in evaluating the element-to-element distances d(e_i, e_j), so that each change point interval can be assigned to a particular cluster unambiguously. In practice, however, several sources of error affect the evaluation of d(e_i, e_j): the uncertainty in the change point locations, which affects the range of data points belonging to each change point interval, and the sampling error in evaluating the mean angle of each change point interval from a finite number of data points. As a result, the assignment of the elements to the clusters, especially for elements located near cluster boundaries, can be fuzzy (i.e., soft).
To determine the degree of softness originating from the errors in evaluating d(e_i, e_j), we estimate the error in D_hard associated with the sampling error and the change point location uncertainties using the bootstrapping method [2], similarly to the change point detection. The idea is to evaluate a bootstrapped mean angle by resampling with replacement the data points inside each change point interval, whose boundaries are randomly chosen according to the error bars of the change point locations. From the bootstrapped mean angles of all the change point intervals, the bootstrapped distances d_boot(e_i, e_j) (i, j = 1, ..., N_e) are calculated, and the hard clustering procedure is performed to obtain the bootstrapped distortion D_hard^boot. This bootstrapping procedure is repeated many times (usually 1000 times) to generate an ensemble of bootstrapped distortions, {D_hard,1^boot, D_hard,2^boot, ..., D_hard,1000^boot}, which gives the bootstrap distribution of D_hard^boot, representing the possible variation of D_hard due to the sampling and change point location errors. The error bar (shown in Supplementary Figure 5) associated with the value of D_hard corresponds to the bootstrap 68% confidence interval, i.e., the interval from the 16th percentile to the 84th percentile of the bootstrap distribution. Let β* be the largest value of β whose distortion falls outside the error bar of D_hard (Supplementary Figure 5); any β > β* then represents a clustering whose distortion is smaller than the sampling and change point location errors can justify. We therefore choose β* as the desired value of β, which also fixes the degree of softness of the clustering.
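The selection rule can be expressed compactly as below (a sketch under our reading of the rule; `betas` and `distortions` would come from scanning the information curve, and `ci_upper` from the bootstrap distribution of D_hard):

```python
def choose_beta_star(betas, distortions, ci_upper):
    """Largest beta whose distortion still lies outside (above) the
    bootstrap 68% confidence interval of D_hard; larger beta values give
    distortions indistinguishable from D_hard within the errors."""
    outside = [b for b, dist in zip(betas, distortions) if dist > ci_upper]
    return max(outside) if outside else None

# Toy scan (made-up numbers): distortion decreases as beta grows.
betas = [0.1, 1.0, 10.0, 100.0]
distortions = [5.0, 3.0, 1.2, 1.0]
assert choose_beta_star(betas, distortions, ci_upper=1.1) == 10.0
```

This assumes the distortion decreases monotonically with β along the information curve, as in Supplementary Figure 5.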
One can easily see that β* is smaller, and the clustering softer, when the sampling and change point errors are larger, simply reflecting the fact that the assignment of the elements to the clusters becomes more ambiguous.