Machine learning to reveal hidden risk combinations for the trajectory of posttraumatic stress disorder symptoms

The nature of the recovery process of posttraumatic stress disorder (PTSD) symptoms is multifactorial. The Massive Parallel Limitless-Arity Multiple-testing Procedure (MP-LAMP), which was developed to detect significant combinational risk factors comprehensively, was utilized to reveal hidden combinational risk factors to explain the long-term trajectory of the PTSD symptoms. In 624 population-based subjects severely affected by the Great East Japan Earthquake, 61 potential risk factors encompassing sociodemographics, lifestyle, and traumatic experiences were analyzed by MP-LAMP regarding combinational associations with the trajectory of PTSD symptoms, as evaluated by the Impact of Event Scale-Revised score after eight years adjusted by the baseline score. The comprehensive combinational analysis detected 56 significant combinational risk factors, including 15 independent variables, although the conventional bivariate analysis between single risk factors and the trajectory detected no significant risk factors. The strongest association was observed with the combination of short resting time, short walking time, unemployment, and evacuation without preparation (adjusted P value = 2.2 × 10⁻⁴, and raw P value = 3.1 × 10⁻⁹). Although short resting time had no association with the poor trajectory, it had a significant interaction with short walking time (P value = 1.2 × 10⁻³), which was further strengthened by the other two components (P value = 9.7 × 10⁻⁵). Likewise, components that were not associated with a poor trajectory in bivariate analysis were included in every observed significant risk combination due to their interactions with other components. Comprehensive combination detection by MP-LAMP is essential for explaining multifactorial psychiatric symptoms by revealing the hidden combinations of risk factors.


Selecting testable combinations
Suppose that we have N subjects and that we know their PTSD trajectory score ranks. Given a risk combination, let J be the set of subjects who have that combination. The N subjects are then classified into x = |J| risk subjects and N − x nonrisk subjects. The P-value of the Mann-Whitney U test is defined as the probability that the ranks are at least as biased as those observed in J.
The P-value takes its smallest value when the PTSD trajectory scores of all subjects in J are larger than those of all the others, or smaller than those of all the others. The probability of this case appearing is

$$\frac{1}{\binom{N}{x}} \qquad (1)$$

This is the minimum P-value of the Mann-Whitney U test.
This value decreases with increasing x for 1 ≤ x ≤ N/2 and takes its minimum, which is nonzero, when x = N/2.
Therefore, the smallest P-value depends on the number of all subjects and the number of subjects who have the risk combination. When x is small enough for the minimum P-value (equation (1)) to be larger than the significance level, the risk combination cannot be significant regardless of the PTSD trajectory scores of subjects.
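The testability criterion above can be illustrated with a minimal sketch. It assumes the minimum P-value is the reciprocal of the number of ways to choose x of N subjects, as described in the text. The function names and the significance level of 1e-6 are hypothetical choices made for illustration; they are not the corrected level used in the study.

```python
from math import comb
from typing import Optional


def min_p_value(n_subjects: int, x: int) -> float:
    """Smallest attainable P-value, 1 / C(N, x), for a combination carried by x of N subjects."""
    if not 0 < x < n_subjects:
        raise ValueError("x must satisfy 0 < x < N")
    return 1 / comb(n_subjects, x)


def smallest_testable_x(n_subjects: int, alpha: float) -> Optional[int]:
    """Smallest x in 1..N/2 whose minimum P-value does not exceed alpha, or None."""
    for x in range(1, n_subjects // 2 + 1):
        if min_p_value(n_subjects, x) <= alpha:
            return x
    return None


N = 624          # number of subjects reported in the study
ALPHA = 1e-6     # hypothetical significance level, for illustration only

print(min_p_value(N, 1))              # 1/624: a combination carried by one subject
print(smallest_testable_x(N, ALPHA))  # 3
```

Under these assumptions, a combination carried by only one or two of the 624 subjects could never reach the illustrative level, whatever the trajectory scores are, so it can be excluded before any test is run.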

Checking the validity of imputation process
We checked the validity of the imputation as follows according to the guidelines. 23 First, we assessed the missing data before imputation by the following two steps.
1. The missing rates were calculated. The missing rates among IES-R items and potential risk factors were 0.5% and 2.9%, respectively. The highest missing rate among the potential risk factors was 9.2%, for the item on current smoking.
2. We compared summary statistics (i.e., mean, median, the first quartile, the third quartile, minimum, maximum, and SD) of IES-R scores and potential risk factors between incomplete subjects and complete subjects, and we checked that there were no significant differences.
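The two pre-imputation steps above can be sketched as follows. This is only an illustration on made-up toy data: the representation of missing answers as None, the function names, and the values are all assumptions, and the sketch prints the summary statistics side by side without implementing any particular significance test, since the text does not name one.

```python
import statistics
from typing import Dict, Optional, Sequence


def missing_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)


def summary(values: Sequence[float]) -> Dict[str, float]:
    """Mean, median, first and third quartiles, minimum, maximum, and SD."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "min": min(values),
        "max": max(values),
        "sd": statistics.stdev(values),
    }


# Hypothetical toy data: each row is (IES-R score, answer to one risk-factor item).
rows = [(12, 1), (30, None), (8, 0), (25, 1), (17, None), (40, 0), (5, 1), (22, 0)]

risk_factor = [answer for _, answer in rows]
complete = [score for score, answer in rows if answer is not None]
incomplete = [score for score, answer in rows if answer is None]

print(missing_rate(risk_factor))  # 2 of 8 answers are missing
print(summary(complete))          # subjects with the item answered
print(summary(incomplete))        # subjects with the item missing
```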
Second, after imputation, we assessed the validity of imputation as follows.
1. We performed an internal check for imputation.
   • The summary statistics (i.e., mean, median, the first quartile, the third quartile, minimum, maximum, and SD) of IES-R scores and potential risk factors were compared between imputed datasets and complete datasets.
   • The graphs of the distributions (i.e., histogram, density plot, quantile-quantile plot, and cumulative distribution plot) of imputed and complete datasets were compared.
2. We checked the imputation externally.
   • The distributions of imputed IES-R scores and other potential risk factors were compared with external Japanese datasets. 24 These external datasets were based on another Japanese disaster cohort study using a questionnaire similar to that of the current study. 24
Based on the abovementioned procedure, we concluded that there would be no significant bias resulting from the imputation.

Converting the ordinal or continuous variables into binary variables
Among the 61 potential risk factors, 53 were binary variables in the questionnaire and were not converted.
Three scales (K6, AIS, and LSNS) were standard scales, and we followed previous studies for their cutoff scores. 25-28
The remaining 5 risk factors were converted from ordinal (or continuous) scales into binary scales as follows.
First, the walking time and resting time had three choices in the questionnaire. The two groups (highest and lowest categories) were analyzed as possible risk groups.

b This column indicates whether the combinations were included in the significant risk combinations in the main analysis, which is not adjusted for sex and age.
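The conversion of a three-choice item such as walking time or resting time into binary risk indicators could look like the sketch below. The coding of answers as 1 (lowest), 2 (middle), and 3 (highest), the function name, and the indicator names are assumptions made for illustration; the questionnaire's actual coding is not given in the text.

```python
from typing import Dict, Optional


def binarize_three_choice(name: str, answer: Optional[int]) -> Dict[str, Optional[int]]:
    """Map a three-choice answer to two binary indicators (lowest and highest category).

    Assumed coding: 1 = lowest category, 2 = middle, 3 = highest.
    A missing answer (None) stays missing in both indicators.
    """
    if answer is None:
        return {f"{name}_low": None, f"{name}_high": None}
    if answer not in (1, 2, 3):
        raise ValueError("answer must be 1, 2, or 3")
    return {f"{name}_low": int(answer == 1), f"{name}_high": int(answer == 3)}


print(binarize_three_choice("walking_time", 1))  # lowest category flagged
print(binarize_three_choice("resting_time", 2))  # middle category: neither flagged
```

Keeping the lowest and highest categories as separate indicators lets each be examined as a possible risk group on its own, with the middle category serving as the reference.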
Supplementary Figure S1