Detecting the impact of subject characteristics on machine learning-based diagnostic applications

Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using training and test sets generated from multiple repeated measures collected across a set of individuals. However, the inclusion of repeated measurements is not always appropriately taken into account in the analytical evaluations of predictive performance. The assignment of repeated measurements from each individual to both the training and the test sets (“record-wise” data split) is a common practice and can lead to massive underestimation of the prediction error due to the presence of “identity confounding.” In essence, these models learn to identify subjects, in addition to diagnostic signal. Here, we present a method that can be used to effectively calculate the amount of identity confounding learned by classifiers developed using a record-wise data split. By applying this method to several real datasets, we demonstrate that identity confounding is a serious issue in digital health studies and that record-wise data splits for machine learning-based applications need to be avoided.


The proposed permutation tests
Before describing the permutation tests, we first introduce some notation. Throughout the text we let $s \in \{1, 2, \ldots, S\}$ index the subjects, $r_s \in \{1, 2, \ldots, R_s\}$ index the records of subject $s$, and $\{1, 2, \ldots, f\}$ index the features. $X$ represents the $\left(\sum_s R_s\right) \times f$ matrix of feature data, with columns indexing the features and rows indexing the records across all $S$ subjects, while $y$ represents the corresponding vector of disease labels, with dimension $\left(\sum_s R_s\right) \times 1$. We let $m$ represent an arbitrary classification performance metric, and reserve $p$ to represent the number of permutations employed in our tests. We let $A$ and $B$ represent the events,

$$A \equiv \text{the classifier is performing disease recognition}, \qquad (1)$$

$$B \equiv \text{the classifier is performing subject identification}, \qquad (2)$$

and adopt standard set theory notation where: $E^c$ represents the complement of event $E$; $\emptyset$ represents the empty set; $\Omega$ represents the universe; and the symbols $\cup$ and $\cap$ represent the union and intersection of events, respectively.

Assessing whether the classifier is performing disease recognition
In order to evaluate whether the classifier is performing disease recognition (even when identity confounding is present) we need to generate a permutation null distribution where the association between the disease labels and the features is destroyed while the association between the features and the subject identities is still preserved. To this end, we generate a permutation distribution by shuffling the disease labels in a subject-wise fashion, as described in Algorithms 1 and 2.

Algorithm 1 Permutation null distribution for disease recognition
1: Input: feature data, $X$; label vector, $y$; subject label vector, $z$; training and test index sets, $i_{train}$ and $i_{test}$; number of permutations, $p$
2: Split $X$ into training and test sets, $X_{train}$ and $X_{test}$
3: for $i = 1, 2, \ldots, p$ do
4:   Generate a subject-wise shuffled version of the label data, $y^* \leftarrow$ SubjectWiseShuffle($y$, $z$) (see Algorithm 2 for details)
5:   Split $y^*$ into training and test label sets, $y^*_{train} \leftarrow y^*[i_{train}]$, $y^*_{test} \leftarrow y^*[i_{test}]$
6:   Train a classifier on the $X_{train}$ and $y^*_{train}$ data
7:   Evaluate the classifier performance on the $X_{test}$ and $y^*_{test}$ data
8:   Record the value of the performance metric, $m^*_i$, on the shuffled data
9: end for
10: Output: $m^*_1, m^*_2, \ldots, m^*_p$

Algorithm 2 Subject-wise label permutation (SubjectWiseShuffle)
1: Input: Label vector, $y$; subject label vector, $z$ (i.e., a vector of length equal to the number of subjects where each entry corresponds to 1 if the subject has the disease, or to 0 if the subject is a control)
2: Generate a permutation, $z^*$, of the subject label vector
3: Initialize the permuted label vector, $y^* \leftarrow y$
4: for $s = 1, 2, \ldots, S$ do
5:   Get the index set $i_s$ (namely, the positions, in vector $y$, of all records provided by subject $s$)
6:   Replace the disease label by the permuted label across all records contributed by subject $s$ in the permuted label vector, $y^*[i_s] \leftarrow z^*[s]$
7: end for
8: Output: $y^*$

Note that because we are using the performance metric, $m$, as the test statistic of the permutation test (and $m$ is able to capture the contributions of both disease recognition and subject identification to the classification performance), the alternative hypothesis $H^*_1$ is equivalent to "the classifier is performing disease recognition irrespective of whether it is also doing subject identification or not".
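The subject-wise shuffle can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `subject_wise_shuffle` and the argument `subject_of_record` (the subject index of each record, used here in place of the index sets $i_s$) are assumptions introduced for the example.

```python
import random


def subject_wise_shuffle(subject_of_record, z, rng):
    """Permute subject-level labels and copy them to every record.

    subject_of_record[k] is the subject index (0..S-1) of record k (assumed input);
    z[s] is 1 if subject s is a case and 0 if subject s is a control.
    """
    z_star = list(z)
    rng.shuffle(z_star)  # permutation of the subject label vector
    # every record inherits the permuted label of the subject that produced it
    return [z_star[s] for s in subject_of_record]


rng = random.Random(0)
subject_of_record = [0, 0, 0, 1, 1, 2, 2, 2, 2]  # 3 subjects, 9 records
z = [1, 0, 0]
y_star = subject_wise_shuffle(subject_of_record, z, rng)
```

Because labels move between subjects as whole blocks, all records of a subject always share one label, which is what preserves the link between features and subject identity while breaking the link between features and disease.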
A permutation p-value for testing $H^*_0$ is computed as the proportion of times that the performance metric computed in shuffled label data was better than the performance metric computed with the original (un-shuffled) labels, $m_o$. For instance, for a performance metric such as the area under the receiver operating characteristic curve (AUC), where larger values indicate better classification performance, the permutation p-value can be estimated as

$$\frac{1}{p} \sum_{i=1}^{p} \mathbb{1}\{m^*_i \geq m_o\},$$

where $\mathbb{1}$ represents the indicator function, assuming value 1 when $m^*_i \geq m_o$, and 0 otherwise. (For performance metrics where smaller values indicate better classification performance, such as misclassification rate, the permutation p-value is given by $\sum_{i=1}^{p} \mathbb{1}\{m^*_i \leq m_o\}/p$.) Note that when the data is split in a record-wise fashion, and identity confounding is present, the disease recognition null distribution will be centered away from the baseline random guess value (e.g., 0.5 when adopting the AUC) since in the generation of the null distribution the label data is split into training and test sets only after the labels are shuffled (see steps 4 to 7 of Algorithm 1) so that both training and test sets have the same shuffled labels (as illustrated in Figure 1 in the main text). As a consequence, the presence of identity confounding will show up as an ability to classify the permuted labels with an accuracy that is better than a random guess (since even though the shuffling of labels prevents the algorithm from performing disease recognition, it is still able to perform subject identification). For example, when adopting the AUC metric, and in the presence of strong identity confounding, the disease recognition null will be centered at AUC values closer to 1 than to 0.5.
Hence, the disease recognition null distribution not only allows us to test whether the classifier is performing disease recognition, but also informally allows us to infer the presence of identity confounding by simply inspecting how far it is centered from the baseline random guess value. Finally, note that the proposed disease recognition permutation test assumes that the disease labels are exchangeable across subjects (in a subject-wise fashion) under the null $H^*_0$.
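As a small illustration of the permutation p-value, the sketch below counts how often the shuffled-label metric is at least as good as the observed one. The function name and the eight null values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import statistics


def permutation_p_value(null_values, observed, larger_is_better=True):
    """Proportion of null metric values at least as good as the observed value."""
    if larger_is_better:
        hits = sum(1 for m in null_values if m >= observed)
    else:
        hits = sum(1 for m in null_values if m <= observed)
    return hits / len(null_values)


# hypothetical AUC values obtained from subject-wise label shuffles
null_auc = [0.74, 0.79, 0.81, 0.77, 0.76, 0.83, 0.78, 0.75]
p_value = permutation_p_value(null_auc, observed=0.80)
null_median = statistics.median(null_auc)
```

In this toy case two of the eight null values reach the observed AUC of 0.80, so the p-value is 0.25, while the null median sits far above 0.5: the pattern described in the text for a classifier that identifies subjects without recognizing disease.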

Detecting identity confounding
As described above, the presence of identity confounding will shift the disease recognition null distribution away from the baseline random guess value. Hence, the median of the disease recognition null distribution,

$$\tilde{m}^* = \mathrm{median}(m^*_1, m^*_2, \ldots, m^*_p), \qquad (4)$$

represents a natural statistic to quantify identity confounding alone, as it measures the contribution of identity confounding to the classifier's predictive ability, after the algorithm's ability to recognize disease has been neutralized by the subject-wise shuffling of disease labels. (Note that while other statistics such as the mean of the disease recognition null distribution could be used to quantify identity confounding, we prefer the more robust median metric since the distribution can be asymmetric.) In order to test whether the classifier is performing subject identification, we need to generate a permutation null distribution (for the $\tilde{m}^*$ statistic) where the association between the subject identities and the features is broken. To this end, we shuffle the feature data in a record-wise fashion, before computing the $\tilde{m}^*$ statistic, as described in Algorithm 3. Note that because the record-wise shuffling of the feature data also breaks the association between the features and the disease labels, the identity confounding null distribution will always be centered around the baseline random guess value (0.5 for the AUC metric).

Algorithm 3 Permutation null distribution for identity confounding
1: Input: feature data, $X$; label vector, $y$; subject label vector, $z$; training and test index sets, $i_{train}$ and $i_{test}$; number of record-wise feature permutations, $p$; number of subject-wise label permutations, $p_l$
2: for $i = 1, 2, \ldots, p$ do
3:   Generate a record-wise shuffled version of the feature data, $X^*$
4:   Split $X^*$ into training and test sets, $X^*_{train}$ and $X^*_{test}$
5:   for $j = 1, 2, \ldots, p_l$ do
6:     Generate a subject-wise shuffled version of the label data, $y^*$ (see Algorithm 2 for details)
7:     Split $y^*$ into training and test label sets, $y^*_{train}$ and $y^*_{test}$
8:     Train a classifier on the $X^*_{train}$ and $y^*_{train}$ data
9:     Evaluate the classifier performance on the $X^*_{test}$ and $y^*_{test}$ data
10:     Compute the value of the performance metric, $m^{**}_j$, on the shuffled data
11:   end for
12:   Record the median of the performance metric distribution, $\tilde{m}^{**}_i = \mathrm{median}(m^{**}_1, m^{**}_2, \ldots, m^{**}_{p_l})$
13: end for
14: Output: $\tilde{m}^{**}_1, \tilde{m}^{**}_2, \ldots, \tilde{m}^{**}_p$

For performance metrics where larger values indicate better classification performance, the permutation p-value for testing the hypotheses,

$H^{**}_0$: the classifier is not performing subject identification,
$H^{**}_1$: the classifier is performing subject identification,

is estimated as

$$\frac{1}{p} \sum_{i=1}^{p} \mathbb{1}\{\tilde{m}^{**}_i \geq \tilde{m}^*\},$$

where $\tilde{m}^*$ represents the observed value of the test statistic, computed on the original (un-shuffled) feature data, and defined in equation (4), whereas $\tilde{m}^{**}_i$ represents the statistic computed on the shuffled feature data, as described in step 12 of Algorithm 3. (Otherwise, it is computed as $\sum_{i=1}^{p} \mathbb{1}\{\tilde{m}^{**}_i \leq \tilde{m}^*\}/p$.) Note that when we shuffle the feature data in a record-wise fashion we, of course, break the connection between the feature data and both the subject identities and the disease labels. We point out, nonetheless, that the null hypothesis $H^{**}_0$ is given by $H^{**}_0 : B^c$ and not by $H^{**}_0 : B^c \cap A^c$ because, by construction, our test statistic only captures the contribution of the subject identification to the classifier's performance (where events $A$ and $B$ are defined in eqs 1 and 2). Observe, as well, that the identity confounding permutation test assumes the exchangeability of the feature data (in a record-wise fashion) under the null $H^{**}_0$. One important drawback of this permutation test is the high amount of computation required to generate the null distribution.
We point out, however, that for the particular case where the AUC metric is adopted to evaluate the classifier performance, it is sometimes possible to infer the presence of identity confounding using an analytical approach without having to run the permutation test.
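The nested ("double-shuffling") structure of the identity confounding null can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: a record-wise feature shuffle is read here as a random permutation of the rows of $X$, the helper `nearest_mean_accuracy` is a toy stand-in for "train a classifier and evaluate it" (it is not a random forest and it reports accuracy rather than AUC), and all names and the tiny data set are hypothetical.

```python
import random
import statistics


def nearest_mean_accuracy(X_train, y_train, X_test, y_test):
    """Toy scorer (assumption): nearest-class-mean rule on the first feature."""
    means = {}
    for label in (0, 1):
        vals = [row[0] for row, lab in zip(X_train, y_train) if lab == label]
        means[label] = sum(vals) / len(vals) if vals else 0.0
    predictions = [min((0, 1), key=lambda c: abs(row[0] - means[c])) for row in X_test]
    return sum(int(a == b) for a, b in zip(predictions, y_test)) / len(y_test)


def identity_confounding_null(X, subject_of_record, z, train_idx, test_idx,
                              score_fn, p, p_l, rng):
    """Return p medians, one per record-wise feature shuffle."""
    medians = []
    for _ in range(p):                      # external loop over feature shuffles
        X_star = list(X)
        rng.shuffle(X_star)                 # record-wise shuffle: permute rows of X
        X_train = [X_star[k] for k in train_idx]
        X_test = [X_star[k] for k in test_idx]
        scores = []
        for _ in range(p_l):                # internal loop over label shuffles
            z_star = list(z)
            rng.shuffle(z_star)             # subject-wise label shuffle
            y_star = [z_star[s] for s in subject_of_record]
            y_train = [y_star[k] for k in train_idx]
            y_test = [y_star[k] for k in test_idx]
            scores.append(score_fn(X_train, y_train, X_test, y_test))
        medians.append(statistics.median(scores))
    return medians


rng = random.Random(1)
subject_of_record = [s for s in range(4) for _ in range(6)]  # 4 subjects, 6 records each
z = [1, 1, 0, 0]
X = [[float(s) + rng.gauss(0.0, 0.3)] for s in subject_of_record]
train_idx = list(range(0, 24, 2))   # record-wise split: every subject in both sets
test_idx = list(range(1, 24, 2))
null_medians = identity_confounding_null(X, subject_of_record, z, train_idx, test_idx,
                                         nearest_mean_accuracy, p=20, p_l=15, rng=rng)
```

The cost grows with the product of the two loop counts, since one classifier is trained per inner iteration, which is the computational drawback noted in the text.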

An analytical shortcut for detecting identity confounding, based on the AUC metric
Here, we describe an alternative approach to detect identity confounding. While it is not a proper statistical test, it provides a statistic similar to a p-value (denoted a "pseudo p-value") which can still be useful in practice. But, before presenting the pseudo p-value statistic, we first describe an additional statistical test that helps understand the pseudo p-value approach.
Consider a third permutation test, where we adopt the AUC as the test statistic, and generate a null distribution by shuffling the disease labels in a record-wise fashion. In this case, because the AUC metric is able to capture the contributions of both disease recognition and subject identification to the classification performance, and the record-wise shuffling of labels breaks the connection between the disease labels and both the feature data and the subject identities, the permutation procedure is now testing the null hypothesis $A^c \cap B^c$ against the alternative $A \cup B$ (where events $A$ and $B$ are defined in eqs 1 and 2). Hence, the permutation p-value for testing this third set of hypotheses,

$H^{***}_0$: the classifier is not performing disease recognition and subject identification,
$H^{***}_1$: the classifier is performing disease recognition or subject identification,

is computed as

$$\frac{1}{p} \sum_{i=1}^{p} \mathbb{1}\{\mathrm{auc}^{***}_i \geq \mathrm{auc}_o\},$$

where $\mathrm{auc}^{***}_i$ represents the AUC value computed on the $i$-th record-wise shuffled version of the labels, and $\mathrm{auc}_o$ represents the AUC value computed from the original data. For this statistical test, nonetheless, an analytical solution is available. Explicitly, it has been shown [5] that, when there are no ties in the predicted class probabilities used for the computation of the AUC, the test statistic of the Mann-Whitney U test ($U$) is related to the AUC statistic by

$$U = n_n \, n_p \, (1 - \mathrm{AUC}),$$

where $n_n$ and $n_p$ represent the number of negative and positive labels in the test set (see section 2 of reference [6] for details).
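The relation between $U$ and the AUC can be checked numerically on a toy example. The sketch below assumes the convention in which $U$ counts the (negative, positive) pairs where the negative record receives the higher score, which is the convention under which $U = n_n n_p (1 - \mathrm{AUC})$ holds without ties; the scores and labels are hypothetical.

```python
def auc_pairwise(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


def u_negative_over_positive(scores, labels):
    """U under the assumed convention: pairs where a negative outscores a positive."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(1 for sn in neg for sp in pos if sn > sp)


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]   # no ties
labels = [1, 1, 0, 1, 0, 1, 0, 0]
n_pos, n_neg = labels.count(1), labels.count(0)
auc = auc_pairwise(scores, labels)
u = u_negative_over_positive(scores, labels)
```

Here 13 of the 16 positive-negative pairs are ordered correctly, so the AUC is 0.8125 and $U = 16 \times (1 - 0.8125) = 3$.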
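In the presence of ties, the p-value can be computed as the left tail of the asymptotic approximate null,

$$U \approx N\left(\frac{n_n n_p}{2}, \; \frac{n_n n_p (n + 1)}{12} - \frac{n_n n_p \sum_{j=1}^{\tau} t_j (t_j - 1)(t_j + 1)}{12 \, n (n - 1)}\right),$$

where $n = n_n + n_p$, $\tau$ is the number of groups of ties, and $t_j$ is the number of ties in group $j$ [6]. Alternatively, we can get the p-value as the right tail probability of the corresponding AUC null, $\mathrm{AUC} \approx N(0.5, \phi^2)$,

$$1 - \Phi\left(\frac{\mathrm{auc}_o - 0.5}{\phi}\right),$$

where $\Phi()$ represents the cumulative distribution function of a standard normal distribution, and

$$\phi^2 = \frac{n + 1}{12 \, n_n n_p} - \frac{\sum_{j=1}^{\tau} t_j (t_j - 1)(t_j + 1)}{12 \, n_n n_p \, n (n - 1)}.$$

Here, we propose to use the pseudo p-value statistic,

$$1 - \Phi\left(\frac{\widetilde{\mathrm{auc}}^*_0 - 0.5}{\phi}\right),$$

as an alternative measure for identity confounding, where $\widetilde{\mathrm{auc}}^*_0$ corresponds to the median of the disease recognition null distribution computed using the AUC metric.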
Note that this procedure does not correspond to a proper statistical test, since it compares the value of one test statistic against the null distribution of a different test statistic, effectively performing an "apples-to-bananas" comparison (namely, it compares the median of the AUC metric under the disease recognition null distribution against the distribution of the AUC metric under $H^{***}_0$). (The correct comparison is the one described in the previous section, where we compare the observed median statistic $\widetilde{\mathrm{auc}}^*_0$ to the null distribution for the same statistic, generated according to Algorithm 3.) In any case, the pseudo p-value might still be useful in practice, because the null distribution of the AUC statistic under $H^{***}_0$ will always have a larger spread than the null distribution of the median AUC statistic under $H^{**}_0$ (see footnote 1). Hence, if the pseudo p-value is already very small, there is no need to perform the computationally expensive permutation test for detecting identity confounding. We point out, however, that a non-significant pseudo p-value does not necessarily mean lack of identity confounding, and that, in this case, it is necessary to compute the permutation p-value since identity confounding can still be present when the pseudo p-value suggests it is not.
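A minimal sketch of the analytical shortcut is given below. It assumes that the pseudo p-value is the right-tail probability of the normal approximation $N(0.5, \phi^2)$ evaluated at the median of the disease recognition null, with $\phi^2$ taken from the standard tie-corrected Mann-Whitney variance; the function names and the test-set sizes are hypothetical.

```python
import math


def auc_null_sd(n_neg, n_pos, tie_group_sizes=()):
    """Standard deviation phi of the approximate AUC null N(0.5, phi^2)."""
    n = n_neg + n_pos
    tie_term = sum(t * (t - 1) * (t + 1) for t in tie_group_sizes)
    var = ((n + 1) / (12 * n_neg * n_pos)
           - tie_term / (12 * n_neg * n_pos * n * (n - 1)))
    return math.sqrt(var)


def right_tail_p(auc_value, n_neg, n_pos, tie_group_sizes=()):
    """1 - Phi((auc - 0.5) / phi), computed with the complementary error function."""
    z = (auc_value - 0.5) / auc_null_sd(n_neg, n_pos, tie_group_sizes)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# hypothetical test set with 50 negative and 50 positive records, no ties
pseudo_p_strong = right_tail_p(0.77, n_neg=50, n_pos=50)  # null median far from 0.5
pseudo_p_weak = right_tail_p(0.55, n_neg=50, n_pos=50)    # null median close to 0.5
```

With these sizes a null median of 0.77 gives a pseudo p-value far below conventional thresholds, so the expensive permutation test could be skipped, whereas a median of 0.55 gives a non-significant value, which by itself does not rule out identity confounding.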

Illustrative examples
We illustrate the application of our permutation tests using synthetic and real data-sets, split into training and test sets according to both the record-wise and subject-wise data split strategies. (Note that we apply the identity confounding permutation test to both record-wise and subject-wise data splits for illustrative purposes. In practice, however, it is not necessary to test for identity confounding when adopting subject-wise data splits, as the influence of individual characteristics is automatically neutralized by the subject-wise strategy.) In all illustrations, we adopt the AUC as the performance metric, and employ the random forest classifier [7] implemented in the randomForest R package [8], using the default tuning parameter specifications.
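The two data split strategies can be contrasted with a short sketch. The function names, the `subject_of_record` vector, and the 50% test fraction are illustrative assumptions; the paper's analyses were run in R.

```python
import random


def record_wise_split(n_records, test_fraction, rng):
    """Assign individual records at random, ignoring which subject they come from."""
    idx = list(range(n_records))
    rng.shuffle(idx)
    n_test = int(round(test_fraction * n_records))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def subject_wise_split(subject_of_record, test_fraction, rng):
    """Assign whole subjects, so no subject contributes to both sets."""
    subjects = sorted(set(subject_of_record))
    rng.shuffle(subjects)
    n_test = max(1, int(round(test_fraction * len(subjects))))
    test_subjects = set(subjects[:n_test])
    train = [k for k, s in enumerate(subject_of_record) if s not in test_subjects]
    test = [k for k, s in enumerate(subject_of_record) if s in test_subjects]
    return train, test


rng = random.Random(0)
subject_of_record = [s for s in range(4) for _ in range(5)]  # 4 subjects, 5 records each
rw_train, rw_test = record_wise_split(len(subject_of_record), 0.5, rng)
sw_train, sw_test = subject_wise_split(subject_of_record, 0.5, rng)
```

Only the subject-wise split guarantees that the subjects seen in training are disjoint from those used for testing, which is why it removes the opportunity for subject identification.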

Synthetic data-sets
In this section, we illustrate the presence of identity confounding using synthetic data examples. Our goal is to show that many different sources of heterogeneity in the feature data can easily give rise to identity confounding (and that our permutation tests are able to detect it). We do not make any claims that the particular model adopted to simulate the feature data is able to closely reproduce the dependency/distributional structures observed in real mobile data (as any simulated example will always be artificial to some extent).
When the training and test sets are obtained by record-wise data split, identity confounding can arise due to statistical dependencies in the feature data of each participant (e.g., serial association in longitudinal data), as well as due to differences in the average or variance of the feature distributions across the distinct subjects (even when the longitudinal feature measurements are independent of each other). Here, we illustrate the application of our permutation tests to synthetic data simulated: (i) in the presence of identity confounding, and absence of disease effect; (ii) in the presence of both identity confounding and disease effect; and (iii) in the absence of both identity confounding and disease effect. We simulate feature data using a model similar to the one employed by Saeb et al [1], except that we generate the data using matrix-normal distributions in order to model the correlation structure across the records and across the features. (The matrix-normal distribution, over the space of matrices with $r$ rows and $c$ columns, is parametrized as

$$W \sim MN_{r \times c}(M, \Sigma, \Psi),$$

where the $r \times c$ matrix $M$ represents the mean of the matrix-normal variable $W$; the $r \times r$ matrix $\Sigma$ models the covariance structure across the rows of $W$; and the $c \times c$ matrix $\Psi$ models the covariance structure across the columns of $W$. As an illustration, Figure 1 shows the correlations across rows and columns for data matrices simulated from 4 distinct matrix-normal distributions.)

Footnote 1. The following rationale shows why the spread of the permutation null distribution generated under $H^{**}_0$ is smaller than the spread of the permutation null generated under $H^{***}_0$. Note that in order to generate a permutation distribution under the null hypothesis $H^{**}_0$ (i.e., the classifier is not performing subject identification) we need to employ Algorithm 3, which performs a "double-shuffling" of the data (implemented with an internal for-loop nested inside an external for-loop). Observe that for each of the $p$ iterations of the external for-loop (step 2 of Algorithm 3) the feature data is first shuffled in a record-wise manner (step 3). Observe, as well, that for each shuffled version of the feature data, the internal for-loop (step 5) performs $p_l$ subject-wise shuffles of the label data, and for each of these $p_l$ iterations the AUC is computed using the record-wise shuffled features and subject-wise shuffled labels (step 10). Finally, the median of the $p_l$ AUC scores generated in the internal for-loop is computed at each of the $p$ iterations of the external for-loop (step 12 of Algorithm 3). Now, note that because the record-wise shuffle of the feature data already breaks the association between the disease labels and the subject identity, the distribution of the AUC scores generated by the internal for-loop (steps 5 to 11) will be centered around 0.5. Hence, the median of these $p_l$ AUC scores (step 12) will always be very close to 0.5. Therefore, it follows that the distribution of the median AUC scores (generated from the $p$ iterations of the external for-loop) will always show a smaller spread around 0.5 than each of the AUC distributions generated using steps 5 to 11. Now, observe that in order to generate a permutation distribution under the null hypothesis $H^{***}_0$ (i.e., the classifier is not performing disease recognition and subject identification) we simply need to shuffle the disease labels in a record-wise fashion. Again, because the record-wise shuffle of the disease labels destroys the associations between labels and features and between labels and subject identities, this permutation null will also be centered around 0.5. Furthermore, because a permutation distribution generated by shuffling the labels in a record-wise fashion is actually equivalent to a permutation distribution generated by shuffling the features in a record-wise manner (and to a permutation null where both the labels and the features are independently shuffled in a record-wise manner), the spread of the permutation null generated under $H^{***}_0$ will be similar to the spread of the AUC scores computed by the internal for-loop (steps 5 to 11 in Algorithm 3). Hence, it follows that the spread of the permutation null under $H^{**}_0$ will always be smaller than the spread of a permutation null generated under $H^{***}_0$.
In our simulations, the feature data matrix from each subject $s$, $X_s$, was generated independently from all other subjects according to the model in equation (12), where $a$, $b$, $c$, and $d$ are scalars; $\mu_s$ and $\sigma^2_s$ represent, respectively, subject specific mean and variance values (shared by all features); $\mathbf{1}$ corresponds to a matrix with $R_s$ rows and $f$ columns filled with 1s; $y_s$ represents the subject's disease label, assuming the value $-1$ if the subject is a control and $1$ if he/she is a case; and the matrices $U_s$, $V_s$, and $E_s$ are sampled from the matrix-normal distributions,

$$U_s \sim MN_{R_s \times f}(0, \Sigma, I), \qquad V_s \sim MN_{R_s \times f}(0, I, I), \qquad E_s \sim MN_{R_s \times f}(0, I, \Psi),$$

where we adopt an autoregressive correlation structure for $\Sigma$ (where each of its elements is given by $\Sigma_{ij} = \rho_r^{|i - j|}$, with the fixed scalar $\rho_r$ representing a correlation coefficient), but a simpler correlation structure for $\Psi$ (where the diagonal elements are given by 1, and off-diagonal elements are given by the correlation scalar $\rho_f$).
The component $U_s$ induces a correlation structure across the records, while $E_s$ generates correlation across the features. The component $V_s$, on the other hand, only contributes white noise, since by adopting identity matrices for the covariances across the records and the features, sampling from $MN_{R_s \times f}(0, I, I)$ is equivalent to independently sampling each entry of $V_s$ from a standard normal distribution. Also, because $U_s$, $V_s$, and $E_s$ are centered at 0, the mean of $X_s$ is given by $\mu_s \mathbf{1} + a \, y_s \mathbf{1}$.
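The three noise components can be drawn without a matrix-normal sampler, as sketched below. The sketch rests on two assumptions that go beyond the text: the autoregressive structure for $\Sigma$ is taken to be first order, with entries $\rho_r^{|i-j|}$, and $\rho_f$ is taken to be non-negative. It only draws $U_s$, $V_s$, and $E_s$ and does not attempt to reproduce how the simulation model scales and combines them. All function names are hypothetical.

```python
import math
import random


def ar1_rows(n_rows, n_cols, rho, rng):
    """U_s: each column is a stationary AR(1) series over records (rows)."""
    U = [[0.0] * n_cols for _ in range(n_rows)]
    innov_sd = math.sqrt(1.0 - rho * rho)
    for j in range(n_cols):
        U[0][j] = rng.gauss(0.0, 1.0)
        for i in range(1, n_rows):
            U[i][j] = rho * U[i - 1][j] + innov_sd * rng.gauss(0.0, 1.0)
    return U


def exchangeable_cols(n_rows, n_cols, rho, rng):
    """E_s: independent rows; features have unit variance and pairwise correlation rho."""
    E = []
    for _ in range(n_rows):
        shared = rng.gauss(0.0, 1.0)  # component shared by all features of a record
        E.append([math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
                  for _ in range(n_cols)])
    return E


def white_noise(n_rows, n_cols, rng):
    """V_s: every entry is an independent standard normal draw."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


rng = random.Random(0)
U_s = ar1_rows(15, 10, rho=0.95, rng=rng)          # 15 records, 10 features
E_s = exchangeable_cols(15, 10, rho=0.5, rng=rng)
V_s = white_noise(15, 10, rng)
```

The AR(1) recursion gives unit-variance entries whose correlation across records decays as a power of the lag, and the shared-component construction gives unit-variance features with a common pairwise correlation, matching the two covariance patterns described above.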
By varying the scalar values $\mu_s$, $\sigma_s$, $a$, $b$, $c$, and $d$ in equation (12), we can simulate data under the null and alternative hypotheses for disease recognition and identity confounding. Here, we illustrate the application of our permutation tests in 6 distinct scenarios presented in Figures 2, 3, 4, 5, 6, and 7. Table 1 provides a brief description of the mechanisms giving rise to identity confounding, the hypotheses under which the data was simulated, and the models used to generate the feature data. Our goal is to illustrate how identity confounding can arise for many different reasons, and how the proposed permutation tests are able to detect it.
In all examples, we simulated data for 10 correlated features ($\rho_f = 0.5$), across 20 subjects (13 cases and 7 controls), with the number of records varying from 10 to 20 records per subject, and adopting $\rho_r = 0.95$. We employed 10,000 permutations in the generation of the disease recognition null distribution. The generation of the identity confounding null distribution, on the other hand, was based on only 1,000 record-wise feature permutations, with 300 subject-wise label permutations per record-wise feature permutation, due to computational constraints. (Figure 8 provides a justification for our choice of 300 subject-wise label shufflings. It shows that the sampling variability associated with the adoption of 1,000 record-wise feature permutations overwhelms the small differences in the shape of the identity confounding null achieved by increasing the number of subject-wise label shufflings beyond 300 permutations.)

Table 1: Identity confounding generation mechanism, hypotheses ($H_0$ or $H_1$) consistent with the data, and model used to simulate the feature data, $X_s$, in each synthetic example.
Example 1 (Fig. 2): serial dependency across records ($U_s$); disease recog. null ($H^*_0$), identity conf. alter. ($H^{**}_1$); model in equation (16).
Example 2 (Fig. 3): serial dependency across records ($U_s$) and location shifts due to disease label ($y_s$); disease recog. alter. ($H^*_1$), identity conf. alter. ($H^{**}_1$); model of Example 1 plus a disease effect.
Example 3 (Fig. 4): location shifts due to subject specific means ($\mu_s$); disease recog. null ($H^*_0$), identity conf. alter. ($H^{**}_1$); $\mu_s \mathbf{1} + V_s$.
Example 4 (Fig. 5): location shifts due to disease label ($y_s$); disease recog. alter. ($H^*_1$), identity conf. alter. ($H^{**}_1$); model (19), with i.i.d. $N(1, 1)$ features for cases and $N(-1, 1)$ for controls.
Example 5 (Fig. 6): differences in subject specific variances ($\sigma^2_s$); disease recog. null ($H^*_0$), identity conf. alter. ($H^{**}_1$); i.i.d. $N(0, \sigma^2_s)$ features.
Example 6 (Fig. 7): no generation mechanism; disease recog. null ($H^*_0$), identity conf. null ($H^{**}_0$); i.i.d. $N(0, 1)$ features.

In our first synthetic data example, we simulated feature data from the model in equation (16), where identity confounding arises due to the serial correlation across the records induced by the $U_s$ component. Since the model does not include a disease effect ($a$ was set to 0), the data is consistent with the null hypothesis of the disease recognition permutation test, $H^*_0$, and with the alternative hypothesis of the identity confounding permutation test, $H^{**}_1$. Figure 2a shows the ROC curves for data split in a record-wise (brown) and subject-wise (black) fashion. Note that the subject-wise split leads to a much smaller AUC value (0.57) than the record-wise data split (AUC = 0.8). Figure 2b shows boxplots of the feature data across all 20 subjects, with cases and controls shown in orange and green, respectively. In this case, we don't see mean differences between cases and controls, since we simulated feature data centered at 0 (both $\mu_s$ and $a$ are set to 0 in the model shown in equation 16). Figure 2c shows the disease recognition null distribution (blue), and the observed AUC value, $\mathrm{auc}_o$, (brown line) for the record-wise data split. The null distribution is centered away from 0.5 (median = 0.77), indicating the presence of identity confounding.
The disease recognition permutation p-value (= 0.2984, given by the tail probability of the blue distribution to the right of the brown line), nonetheless, indicates that the classifier is still unable to perform disease recognition, even though the observed AUC value (0.8) is high. This observation suggests that the high classification performance is entirely due to the classifier's ability to identify subjects. Figure 2d shows the identity confounding null distribution (red), and the observed value of the $\widetilde{\mathrm{auc}}^*$ statistic (blue line), which corresponds to the median of the disease recognition null distribution shown in panel c. The identity confounding permutation p-value (computed as the proportion of the red distribution lying at or to the right of the blue line) is 0, and confirms the presence of identity confounding when the data is split in a record-wise fashion. The grey density represents the analytical approximation for the $H^{***}_0$ null, used in the computation of the pseudo p-value for detecting identity confounding (given by the area under the grey density to the right of the blue line, which, in this example, is $6 \times 10^{-8}$). Figure 2e shows the disease recognition null distribution for the subject-wise data split (blue) and the observed AUC value (black line). Because the subject-wise split (by construction) eliminates the classifier's ability to perform subject identification, this null distribution will always be centered around 0.5. In this example, again, the disease recognition permutation p-value (0.2781) indicates that the random forest classifier is not performing disease recognition. Figure 2f shows the identity confounding null distribution (red) for the subject-wise data split, and further confirms that this data split avoids identity confounding issues (permutation p-value = 0.719, pseudo p-value = 0.516).
In our second synthetic data example, we simulate data from a model that adds a disease effect to the model used in the first example. Hence, the data is now simulated under the alternative hypothesis for disease recognition, $H^*_1$, and identity confounding, $H^{**}_1$. Figure 3a shows much higher AUC values for both record-wise (0.97) and subject-wise (0.95) data splits. Figure 3b shows clear mean differences between cases and controls (as one would expect, since we now include a disease effect). Figure 3c shows that, even though identity confounding is playing a role in the random forest's classification performance when the data is split in a record-wise manner (note the shift away from 0.5 in the blue distribution), the algorithm is, nonetheless, still doing disease recognition (permutation p-value = 0). Figure 3e shows that the random forest is performing disease recognition with the subject-wise data split, as well (permutation p-value = 0). Figure 3d confirms the presence of identity confounding (permutation p-value = 0, pseudo p-value = 0.0024) with record-wise data split, while Figure 3f confirms its absence (permutation p-value = 0.318, pseudo p-value = 0.487) with subject-wise split.
In our third example, we illustrate how differences in the mean feature values across the subjects can lead to identity confounding, even when the longitudinal data of each individual has no serial association structure. To this end, we simulate data from the model $X_s = \mu_s \mathbf{1} + V_s$, where the feature data of each subject is independent and identically distributed (i.i.d.) according to $N(\mu_s, 1)$. Note that in this case the data is simulated under the null for disease recognition, $H^*_0$, and under the alternative for identity confounding, $H^{**}_1$. (Observe, as well, that this model corresponds to the i.i.d. mixture model for data clustered by subject, discussed in reference [2].) Figure 4b shows accentuated mean differences in feature values across the subjects, and Figures 4c and d illustrate how, even though the record-wise data split leads to a high AUC value (0.89), the random forest classifier is performing subject identification only (disease recognition permutation p-value = 0.474, identity confounding permutation p-value = 0).
Our fourth example illustrates how the disease effect can itself induce a certain amount of identity confounding when the data is split in a record-wise manner. To this end, we simulated data from model (19), where the feature data from case and control subjects is i.i.d. $N(1, 1)$ and $N(-1, 1)$, respectively. Because the disease label contributes to the mean of the feature values, and identity confounding can arise because of mean differences across the subjects (as illustrated in the previous example), the data in model (19) is generated under the alternative for disease recognition, $H^*_1$, as well as for identity confounding, $H^{**}_1$. (Observe that, at least for data simulated according to model (12), it is not possible to generate data under the alternative for disease recognition and, concomitantly, under the null for identity confounding.) Figure 5b shows clear mean differences between cases and controls, while Figure 5c shows that the classifier is performing disease recognition with the record-wise data split (permutation p-value = 0). But, more interestingly, it also illustrates that the disease recognition null is slightly shifted away from 0.5 (median = 0.55), showing that the difference in means is enough to generate identity confounding for record-wise data splits. Furthermore, Figure 5d shows that this small shift is statistically significant (identity confounding permutation p-value = 0). Observe, as well, that this example illustrates the case where the pseudo p-value (= 0.162) fails to detect identity confounding, while the permutation p-value does not.
In our fifth example, we illustrate how differences in the variance of the feature values across the subjects can lead to identity confounding, even when the longitudinal data of each individual has no serial association structure. To this end, we simulate data from a model where the feature data of each subject is i.i.d. according to $N(0, \sigma^2_s)$, and the $\sigma^2_s$ values were sampled from a uniform distribution in the interval (1, 10). Note that in this case the data is again simulated under the null for disease recognition, $H^*_0$, but under the alternative for identity confounding, $H^{**}_1$. Figure 6b shows clear spread differences between the subjects. Note that although Figure 6c shows that the classifier is not performing disease recognition with the record-wise data split (permutation p-value = 0.3088), it illustrates that the difference in variance is enough to generate identity confounding for record-wise data splits. Furthermore, Figure 6d shows that this shift is statistically significant (identity confounding permutation p-value = 0). Note, as well, that this example illustrates the case where the pseudo p-value (= 0.069) is marginally significant, while the permutation p-value clearly indicates the presence of identity confounding.
In our sixth example, we consider the null model, where all the feature data within and across all participants is i.i.d. $N(0, 1)$, so that the data is simulated under the null hypothesis for disease recognition, $H^*_0$, and identity confounding, $H^{**}_0$. As expected, Figure 7 shows that the random forest classifier is not performing disease recognition or subject identification, no matter how the data is split.
Finally, as a sanity check, we performed a simulation study using 500 distinct data-sets simulated under the null for both disease recognition and identity confounding. Figure 9 reports the results and describes the simulation experiment in more detail. As expected, the distribution of the p-values is approximately uniform for the permutation tests, indicating that the type I error rates are being controlled at the nominal significance levels. The pseudo p-value distributions, on the other hand, are bell shaped and centered around 0.5, indicating that the pseudo p-value approach is conservative at small nominal significance levels (that is, the pseudo p-values tend to be larger than they should be). This illustrates, once again, that a small pseudo p-value implies the presence of identity confounding, whereas a non-significant pseudo p-value does not imply the absence of digital fingerprints.
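The logic of this sanity check can be sketched with a much simpler stand-in for the tests used in the paper. The example below simulates data with no label effect, computes a permutation p-value for each data-set, and records how often the p-value falls at or below 0.05. The absolute difference in group means replaces the classifier-based AUC statistic, the add-one correction in the p-value is a choice made here, and all sizes and names are assumptions for illustration.

```python
import random

def perm_pvalue(x, labels, n_perm, rng):
    """Permutation p-value for |mean(x | label 1) - mean(x | label 0)|."""
    def stat(lab):
        g1 = [v for v, l in zip(x, lab) if l == 1]
        g0 = [v for v, l in zip(x, lab) if l == 0]
        return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

    observed = stat(labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(shuffled) >= observed:
            exceed += 1
    # Add-one correction (an assumption of this sketch) keeps p > 0.
    return (exceed + 1) / (n_perm + 1)

def null_pvalues(n_datasets=200, n=20, n_perm=99, seed=2):
    """P-values from data-sets simulated with no association to the labels."""
    rng = random.Random(seed)
    labels = [1] * (n // 2) + [0] * (n // 2)
    pvals = []
    for _ in range(n_datasets):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no label effect
        pvals.append(perm_pvalue(x, labels, n_perm, rng))
    return pvals

pvals = null_pvalues()
rejection_rate = sum(p <= 0.05 for p in pvals) / len(pvals)
```

If the test controls the type I error rate, `rejection_rate` should be close to the nominal 0.05 and a histogram of `pvals` should look roughly flat, which is the behavior the text reports for the permutation tests.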

A more detailed analysis of the mPower data
We illustrate the application of our proposed tests to the diagnosis of Parkinson's disease (PD), using iPhone sensor data collected during the first six months of the mPower study [4,3]. We focus our analyses on data collected from the voice and tapping activity tasks (details about the activity tasks are provided in reference [3]).
Our analyses were based on a matched sample of 11 PD patients and 11 control participants, each of whom provided at least 100 records over the six-month period. All case and control participants were male, and were first matched by age (using exact matching), and then by education level (using nearest neighbor matching [9]) whenever there were multiple cases with the same age as a control, or vice versa. (In the event that ties persisted after the application of this second matching criterion, we randomly selected one participant to perform the matching.) We trained separate random forest classifiers for the voice and tapping data, based on 13 voice features and 41 tapping features proposed in the literature [10,11].
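The matching procedure can be sketched roughly as follows. This is a simplified greedy version written for illustration, not the nearest neighbor matching implementation of reference [9]: each case is paired with an unused control of exactly the same age, the control with the closest education level is preferred, and remaining ties are broken at random. The dictionary keys, the numeric coding of education, and the function name are all hypothetical.

```python
import random

def match_controls(cases, controls, seed=0):
    """Greedily match each case to one unused control of the same age.

    cases / controls: lists of dicts with hypothetical keys
    'id', 'age', and 'education' (education coded as a number).
    """
    rng = random.Random(seed)
    available = list(controls)
    pairs = []
    for case in cases:
        same_age = [c for c in available if c["age"] == case["age"]]
        if not same_age:
            continue  # no exact age match: the case is left unmatched
        best = min(abs(c["education"] - case["education"]) for c in same_age)
        closest = [c for c in same_age
                   if abs(c["education"] - case["education"]) == best]
        chosen = rng.choice(closest)  # random tie-breaking
        available.remove(chosen)
        pairs.append((case["id"], chosen["id"]))
    return pairs

cases = [{"id": "P1", "age": 60, "education": 3},
         {"id": "P2", "age": 65, "education": 2}]
controls = [{"id": "C1", "age": 60, "education": 1},
            {"id": "C2", "age": 60, "education": 3},
            {"id": "C3", "age": 65, "education": 4}]
pairs = match_controls(cases, controls)
```

In the toy data, case P1 has two controls of the same age, so the education criterion decides the match, while P2 has a single candidate and is matched on age alone.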
Due to the small number of subjects, and in order to maintain balance in the age distributions of cases and controls in the training and test sets, we performed an age-balanced version of the subject-wise data split (whenever the training set contained a case of a given age, the test set contained the matched control of the same age, and vice versa). For the record-wise data split, on the other hand, we simply selected the records at random. (Note that because each subject has a large number of records (at least one hundred), the distribution of the ages tends to be well balanced between cases and controls in both the training and test sets.)
Figure 10 shows the results for the voice data. From panel a we observe a massive amount of identity confounding for the record-wise data split, with the disease recognition null distribution (blue) concentrated at very high AUC values (median AUC = 0.954). Both the identity confounding permutation p-value and the pseudo p-value are 0 in this example. The disease recognition p-value (= 0.5843), on the other hand, shows that the random forest classifier is not performing disease recognition, even though the observed AUC (brown line) is 0.952. Panel b reports the results for the subject-wise data split, and shows that, once again, the classifier is not learning about the disease labels (disease recognition p-value = 0.7243).
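The two splitting strategies described above can be contrasted with a small sketch. The record layout (dictionaries with `subject` and `label` keys), the function names, and the rule that alternates which member of each matched pair goes to the training set are assumptions made for this example; the text only states that a case and its matched control end up in opposite sets.

```python
import random

def record_wise_split(records, train_frac=0.5, seed=0):
    """Randomly assign individual records to train/test, ignoring subjects."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def age_balanced_subject_wise_split(records, matched_pairs, seed=0):
    """Send one member of each matched pair to train and the other to test.

    Alternating over the shuffled pairs (an assumption of this sketch)
    keeps both classes represented in both sets.
    """
    rng = random.Random(seed)
    pairs = list(matched_pairs)
    rng.shuffle(pairs)
    train_ids, test_ids = set(), set()
    for i, (case_id, control_id) in enumerate(pairs):
        if i % 2 == 0:
            train_ids.add(case_id)
            test_ids.add(control_id)
        else:
            train_ids.add(control_id)
            test_ids.add(case_id)
    train = [r for r in records if r["subject"] in train_ids]
    test = [r for r in records if r["subject"] in test_ids]
    return train, test

pairs = [("P1", "C1"), ("P2", "C2"), ("P3", "C3"), ("P4", "C4")]
records = [{"subject": sid, "label": int(sid.startswith("P")), "record": k}
           for pair in pairs for sid in pair for k in range(5)]
rw_train, rw_test = record_wise_split(records)
sw_train, sw_test = age_balanced_subject_wise_split(records, pairs)
```

With the subject-wise split no subject contributes records to both sets, whereas the record-wise split will typically place records from the same subject on both sides, which is what allows a classifier to benefit from recognizing subjects.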
In order to assess the robustness of these results with respect to the training/test data split, we show in panel c the disease recognition null distributions (blue boxplots) for 30 distinct record-wise data splits, as well as the respective AUC values (brown dots) and the distribution of the disease recognition permutation p-values across the 30 data splits (brown boxplot). The results show that in none of the 30 data splits did we observe a permutation p-value smaller than 0.05 (dashed red line), showing that the very high AUC values (brown dots) achieved by the classifier are due largely to the classifier's ability to identify subjects. Finally, panel d reports the results for the subject-wise data split, confirming that the classifier is indeed unable to perform disease recognition.
Figure 11 shows the results for the tapping data. Again, from panel a we observe high levels of identity confounding for the record-wise data split (the identity confounding permutation p-value and the pseudo p-value are again 0) but, contrary to the voice data, the random forest classifier is now able to perform disease recognition with the tapping features (disease recognition p-value = 0.014). Panel b shows that the classifier trained with the subject-wise data split is still able to perform disease recognition, although the statistical significance is weaker (permutation p-value = 0.055). Panel c shows that the results presented in panel a are robust across distinct record-wise data splits. Panel d assesses the robustness of the findings observed in panel b, and shows that the results tend to be more variable and less significant in this case when compared to the record-wise data splits (note how the distribution of the disease recognition p-values has a larger spread than in the record-wise case).
We point out, nonetheless, that this larger variability is likely due to the very small number of subjects and the heterogeneous nature of the data, and suggests that a certain amount of model under-fitting is going on in the tapping data.
In order to assess whether a larger number of subjects would improve the power to perform disease recognition (especially when adopting the subject-wise data split strategy), we repeated our analysis using a larger number of subjects (who, nonetheless, provided a smaller number of records). Specifically, for the voice data we analyzed two additional (age-matched and male) cohorts containing 42 subjects (who provided at least 50 records) and 240 subjects (who contributed at least 10 records), whereas for the tapping data we analyzed two additional cohorts containing 48 and 290 subjects (who contributed at least 50 and 10 records, respectively). As before, we ran the analyses on 30 distinct data splits, with half of the subjects and (approximately) half of the records assigned to the training set in the subject-wise and record-wise data splits, respectively.
Figure 12 reports the results for the voice data. Panels a, b, and c show the disease recognition null distributions and observed AUC values for the record-wise data splits across increasing numbers of subjects. In all three panels, the disease recognition null distributions are concentrated around very high AUC values, suggesting that identity confounding is playing an important role in all three cohorts. It is interesting to note that the AUC values tended to decrease as the amount of data used to train the random forests increased. This drop in AUC is, nevertheless, not surprising, given that the classifier is mostly performing subject identification and that a larger number of subjects makes it harder to identify the individual subjects. Observe, as well, that the random forest only started to detect the disease signal in panel c, where a larger amount of data was available to train the classifier (note how the brown dots started to approach the upper tail of the blue boxplots). Panels d, e, and f show analogous results for the subject-wise data splits.
Note that, as expected, the disease recognition null distributions were always centered around 0.5, and their spread tended to decrease as the number of subjects increased. Observe, nonetheless, that the random forest was unable to perform disease recognition even when the training set contained data from 120 subjects (panel f). Panel g shows the distributions of the disease recognition permutation p-values across the 30 random data splits shown in panels a to f.
Figure 13 shows the results for the tapping task. Overall, we observe the same patterns as for the voice data, except that the tapping features seem to be less prone to subject identification (comparison of panels a, b, and c in Figures 13 and 12 shows that both the observed AUC values and the disease recognition null distributions tend to be located at lower AUC values for the tapping data than for the voice data), and are better able to perform disease recognition in both record-wise and subject-wise data splits (compare Figures 13g and 12g).
Our analyses of the voice and tapping data collected by the mPower study showed undeniable evidence of a high degree of identity confounding for classifiers built using record-wise data splits. Although it is true that our results also showed evidence of model under-fitting for classifiers trained with subject-wise data splits, we point out that model under-fitting can be ameliorated by simply increasing the number of subjects used in the analyses. (The rationale is that, as the number of subjects used to train the classifier increases, the chance that the training set is missing a critical part of the pattern that relates features to disease labels decreases, so that the classifier has a better chance to generalize to new unseen cases, even when the data is fairly heterogeneous.) Most importantly, our data does not support the hypothesis raised by the first author of Little et al. [2] that model under-fitting alone, rather than identity confounding, could explain the discrepancy in classification performance between the record-wise and subject-wise strategies.
Interestingly, the voice data seems to require a much larger number of subjects in order to perform disease recognition. The fact that the random forest classifier was unable to perform disease recognition with the subject-wise data splits, and only started to detect the disease signal with record-wise data splits when the training set contained over 100 subjects, combined with the fact that the voice-based classifier achieved higher AUC values than the tapping-based one in the record-wise data splits, suggests that our voice features are more vulnerable to identity confounding and less suited to disease recognition than the tapping features. A possible explanation is that voice features are intrinsically better able to capture personal and physiological characteristics of a subject than "game"-based tapping features. An alternative explanation is that the weaker disease recognition signal might be due to suboptimal processing and cleaning of the raw voice data. To see why, note that the digital fingerprint of a subject may arise not only from biological characteristics, such as the pitch of a subject's voice, but also from non-biological/environmental artifacts, such as the amount of background noise and the distance of the subject's mouth from the phone's microphone. These non-biological artifacts can certainly help to identify subjects when the voice tasks are performed under different conditions across the subjects, but under consistent conditions within each subject. (Note that while this issue can be somewhat controlled in laboratory experiments, this is not generally the case in mobile health studies, such as mPower, that are conducted at home under uncontrolled conditions.) This problem is likely exacerbated by the use of highly sensitive sensors, capable of detecting quite subtle variations in sound. The tapping task, on the other hand, seems to provide less room for environmental artifacts.
Finally, while all analyses of the mPower data presented so far were based on data splits assigning 50% of the data to the training set and 50% to the test set (denoted, for short, as the 50/50 split), Figures 14 and 15 compare the results based on 50/50 record-wise data splits to 90/10 splits (i.e., record-wise data splits assigning 90% of the records to the training set and 10% to the test set) for the tapping and voice data, respectively. In all panels, the blue boxplots show the permutation null distributions, the brown dots show the observed AUC scores, and the blue and brown horizontal lines show the medians (across all 30 distinct data splits) of the permutation null distributions and observed AUC scores, respectively. Comparison of panels a versus b, c versus d, and e versus f shows that for both the tapping (Figure 14) and voice (Figure 15) datasets the 90/10 split allowed the classifiers to better identify the subjects. (Note that in all comparisons the horizontal blue lines are higher for the 90/10 splits than for the 50/50 splits. Similarly, the horizontal brown lines are also higher for the 90/10 splits.) On the other hand, because the test sets were considerably smaller for the 90/10 splits, the spreads of the permutation null distributions tended to be larger, and the results also tended to be more variable for the 90/10 splits (in the sense that the blue boxplots were more widely spread around the blue horizontal line, and the brown dots were more widely spread around the brown horizontal line). Furthermore, for the tapping data (Figure 14), note how the brown dots tended to lie further out in the tail of the permutation nulls for the 50/50 splits than for the 90/10 splits. This suggests that, as expected, the ability of the classifiers to perform disease recognition tended to decrease in the 90/10 splits.
(For the voice data, on the other hand, the results tended to be less clear cut, since the classifiers were already having difficulty learning the disease signal with the 50/50 split.)
In all examples, we adopted ρ_r = 0.95 and ρ_f = 0.5 (see Section 3.1). Panels a and e illustrate the case where the data was generated without correlation structures across the rows and columns. Panels b and f depict the case with serial correlation structure across the rows, but no correlation across the columns. Panels c and g illustrate the case with correlation across the columns, but not across the rows. Finally, panels d and h depict the case with correlation structure across both rows and columns.
Figure 8: Determination of a sufficient number of subject-wise label permutations in the generation of the identity confounding null distribution. As described in Algorithm 2, in the main text, the generation of the identity confounding null distribution requires the computation of the disease recognition null distribution (which is based on p_l subject-wise label shufflings) for each one of the p record-wise feature data permutations, so that the total number of computations is given by p × p_l. Note, nonetheless, that because the test statistic (m*) corresponds to the median of a performance metric (m*), we do not need to use a large number of label permutations, p_l (since it is reasonable to expect that the median of the disease recognition null distribution computed with, say, 1,000 subject-wise label permutations should be close to the median computed with 10,000 permutations). This rationale suggests that we might be able to determine a sufficient number of label permutations by computing the identity confounding null distribution using increasing p_l values, and inspecting when the null distribution "stabilizes". Panel a shows the standard deviation of the identity confounding null distribution (red dots) computed with p = 10,000 record-wise feature permutations for p_l varying from 10 to 500 subject-wise label permutations. The black boxplots show the distributions of the standard deviations across 10,000 sub-samples of size 1,000 of the respective identity confounding null distributions, and provide estimates of the amount of sampling variability we should expect when we adopt 1,000 record-wise feature permutations, rather than 10,000. The panel shows that, while the standard deviation estimates based on p = 10,000 permutations (red dots) are still monotonically decreasing as p_l increases, the sampling variability that we see when we adopt p = 1,000 by far overwhelms the small difference in standard deviation that is achieved by increasing p_l from 300 to 400 (or 500). Similarly, panel b shows that the identity confounding null distribution tends to stabilize as the number of label permutations increases, and panel c shows that the null distributions are already very similar when we adopt p_l = {300, 400, 500}. All the results were based on data simulated according to synthetic example 1 (equation 16), in the main text.
Figure 9: Assessing type I error rate control with a simulation study. We simulated 500 distinct data-sets under the null hypothesis of both the disease recognition and the identity confounding tests. For each synthetic data-set, the feature data of each subject was independently simulated according to the model X_s = c V_s + d E_s, in order to allow for correlation across the features (see main text for further details). Each simulated data-set was generated with a unique combination of simulation parameters, with the scalars c and d varying from 0.1 to 2, the number of cases and controls varying from 5 to 10, and the number of records per participant varying from 10 to 20. In order to select parameter values spread as uniformly as possible over the entire parameter ranges, we adopted a Latin hypercube space-filling design in the determination of the parameter values used to generate each data-set. Due to computational constraints, the permutation tests were based on 100 permutations. The top panels show the results from tests applied to data-sets split in a record-wise fashion, while the bottom panels report the results based on subject-wise data splits. Panels a and e show the p-value distributions from the disease recognition permutation test, while panels b and f show the p-value distributions from the identity confounding permutation test. As expected, the distributions are approximately uniform, showing that the type I error rates of the permutation tests are being controlled at the nominal significance levels. Panels c and g show the distributions of the pseudo p-values. As expected, the distributions are bell-shaped, since a central limit theorem for the sample median of a random variable is kicking in (note that because the pseudo p-value is computed using the median of the AUC metric, and the tail probability computed with the median of a sample of test statistics is equal to the median of the tail probabilities computed for each test statistic in the sample, the pseudo p-value distributions actually correspond to the distributions of the sample median of disease recognition p-value distributions). Finally, panels d and h report the p-value distributions from the analytical test for the presence of disease recognition and/or identity confounding (H***_0), calculated according to expression (8), in the main text. As expected, the p-value distributions are also approximately uniform.
Figure 11: Tapping data example. Panel a shows the disease recognition permutation null (blue), the observed AUC value (brown line), the identity confounding permutation null distribution (red), and the density of the analytical null distribution used for the computation of the pseudo p-values, for the record-wise data split. Panel b shows the analogous objects for the subject-wise data split. Panel c reports the disease recognition null distributions (blue boxplots) for 30 distinct record-wise data splits, as well as the observed AUC values (brown dots) and the distribution of the disease recognition permutation p-values across the 30 data splits (black boxplot). Panel d shows the analogous objects for the subject-wise data split. The disease recognition nulls were generated with 10,000 permutations in panels a and b, and 1,000 permutations in panels c and d. The identity confounding nulls were generated with 1,000 feature permutations (and 300 label permutations per feature permutation).
Figure 13: Comparison of the disease recognition performance in the tapping data, across increasing numbers of subjects. Panels a, b, and c show the disease recognition null distributions, across 30 distinct record-wise data splits, for classifiers built with 22 subjects (with at least 100 records per subject), 48 subjects (with at least 50 records per subject), and 290 subjects (with at least 10 records), respectively. The brown dots represent the observed AUC values. Observe the decrease in AUC as the number of subjects increases. Panels d, e, and f show the analogous results for 30 distinct subject-wise data splits (with black dots representing the observed AUC values). Note the decrease in the spread of the disease recognition null distributions as the number of subjects increases. Panel g shows the distributions of the disease recognition permutation p-values across the 30 random data splits shown in panels a to f. Note how the p-values tend to decrease as the number of subjects increases. In particular, the accentuated improvement in power for the subject-wise split strategy (black boxplots) illustrates how the increase in the number of subjects reduced to a great extent the model under-fitting issue observed in panel d.