Comparing supervised and unsupervised approaches to emotion categorization in the human brain, body, and subjective experience

Machine learning methods provide powerful tools to map physical measurements to scientific categories. But are such methods suitable for discovering the ground truth about psychological categories? We use the science of emotion as a test case to explore this question. In studies of emotion, researchers use supervised classifiers, guided by emotion labels, to attempt to discover biomarkers in the brain or body for the corresponding emotion categories. This practice relies on the assumption that the labels refer to objective categories that can be discovered. Here, we critically examine this approach across three distinct datasets collected during emotional episodes—measuring the human brain, body, and subjective experience—and compare supervised classification solutions with those from unsupervised clustering in which no labels are assigned to the data. We conclude with a set of recommendations to guide researchers towards meaningful, data-driven discoveries in the science of emotion and beyond.


Details of Model Performance Evaluation using Synthetic Data
As described in the main text, we generated synthetic data to test the ability of our clustering model to capture the signal induced by stimuli embedded in noise. We generated the synthetic data using the neurosim MATLAB package (https://github.com/ContextLab/neurosim) with the generative process proposed in [7]. The synthetic data were designed to imitate BOLD data with clear, discoverable categories. Given a signal-to-noise ratio (SNR), an experimental design matrix, and a set of brain voxel locations, the package generates synthesized voxel activations. For each emotion category, we used 20 randomly chosen spherical regions in the brain with varying BOLD amplitude during each trial. Each region was assigned a single radial basis function whose spatial center and width were chosen uniformly at random, restricted to remain within the limits of a standard human brain. The synthesized "brain image" for each trial t was a weighted combination of the 20 basis functions for the emotion category active during that trial. Zero-mean Gaussian noise was added as a final step to simulate measurement noise; the standard deviation of the noise was computed from the supplied SNR. We varied the SNR over the values [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100] to assess the performance of our model under varying levels of noise. In a Monte Carlo simulation, for each SNR value we performed 500 random realizations of synthesized brain images and calculated the accuracy of our model at clustering the emotion categories, reporting the average accuracy across all realizations. To evaluate supervised performance on the same synthetic data, we used the same CNN structure as for the actual fMRI BOLD data, described above.
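The generative process can be sketched as follows. This is a minimal numpy illustration, not the neurosim package itself; the grid size, region widths, and the SNR definition (ratio of signal variance to noise variance) are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_image(grid, centers, widths, weights):
    """Weighted sum of Gaussian radial basis functions evaluated on a voxel grid."""
    img = np.zeros(len(grid))
    for c, w, a in zip(centers, widths, weights):
        img += a * np.exp(-np.sum((grid - c) ** 2, axis=1) / (2 * w ** 2))
    return img

def synthesize_trial(grid, centers, widths, weights, snr):
    """Signal plus zero-mean Gaussian noise whose std is set from the SNR (assumed variance-ratio definition)."""
    signal = rbf_image(grid, centers, widths, weights)
    noise_sd = signal.std() / np.sqrt(snr)
    return signal + rng.normal(0.0, noise_sd, size=signal.shape)

# Toy 1000-voxel "brain": 20 RBF regions for one emotion category, as in the text.
grid = rng.uniform(-1, 1, size=(1000, 3))     # voxel coordinates (placeholder brain)
centers = rng.uniform(-1, 1, size=(20, 3))    # region centers, uniform within the volume
widths = rng.uniform(0.1, 0.3, size=20)       # region widths (illustrative range)
weights = rng.uniform(0.5, 1.5, size=20)      # per-trial BOLD amplitudes

trial = synthesize_trial(grid, centers, widths, weights, snr=1.0)
```

Repeating `synthesize_trial` with different amplitude vectors per category yields the labeled trials on which clustering accuracy can be scored.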

Neural Network Details
The details of the neural network configuration are as follows:
• Three fully-connected layers of size 5, 4, and 3 were used.
• After each fully-connected layer, a batch normalization layer was used to speed up training of the network and reduce the sensitivity to network initialization.
• A rectified linear unit (ReLU) activation layer was applied after each batch normalization layer.
• Finally, a softmax layer generated the classification probabilities.
• The batch size was 10, the maximum number of epochs was set to 50, and the learning rate was chosen as a hyperparameter equal to 0.001.
• The Adam optimization algorithm (Kingma and Ba [6]) was used to iteratively update the network weights during training.
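The layer stack above can be sketched as a forward pass in plain numpy. This is an illustration of the architecture only, not the trained model; the 16 input features, the weight initialization, and the inference-style batch normalization are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def batchnorm(x, eps=1e-5):
    # Normalize over the batch dimension (inference-style sketch, no learned scale/shift)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

# Fully-connected layers of size 5, 4, and 3, as listed above;
# 16 input features is an arbitrary placeholder.
sizes = [16, 5, 4, 3]
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    # FC -> batch norm -> ReLU after each layer, then a final softmax
    for W, b in zip(weights, biases):
        x = relu(batchnorm(x @ W + b))
    return softmax(x)

probs = forward(rng.normal(size=(10, 16)))  # one batch of 10, matching the batch size
```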

Dataset 3: Self-Report Data
Overview of Cowen & Keltner 2017 split-half canonical correlation analysis
The experimenters did not provide emotion labels for the film clips, but the clips were chosen with those 34 specific labels in mind (from Cowen & Keltner: "The videos were gathered by querying search engines and content aggregation websites with contextual phrases targeting 34 emotion categories"; pg. 2 of [2]).
However, it can be debated whether the initial analyses reported in [2] were free from strong assumptions about the existence of certain emotion categories. The researchers devised a split-half canonical correlation analysis (SH-CCA) in which they correlated the mean emotion category ratings for each clip from one half of the participant subsample with the mean ratings from the other half. They reported that the results of the SH-CCA indicated that the data contained between 24 and 26 categories of emotional experience. Because the categories in the SH-CCA were pre-specified rather than discovered (i.e., one half of the categorical ratings constrained the analysis of data from the other half), this method is more constrained than is optimal for a fully unsupervised analysis. Indeed, it might be argued that the categorical nature of the ratings, and the use of the same labels for video selection and for participant ratings, makes the SH-CCA more likely to reveal a solution consistent with a supervised classification than with data-driven clustering [1].

Latent Dirichlet Allocation
The LDA model is shown as a probabilistic graphical model below. The interpretation of the symbols in the graph is the same as described in the Methods section for Dataset 1. Specifically, we have a collection of D video clips, each of which is a mixture over K topics (i.e., video clip d ∈ {1, ..., D} has an associated K-dimensional vector θ_d that gives its distribution over topics). Each topic k is characterized by a 34-dimensional vector φ_k, which is a distribution over the predefined emotion categories. For each video clip, N is the total number of 'yes' answers over all of the predefined emotion categories, W is a categorical variable indicating the 'yes' answer for a specific predefined emotion category, and Z is the corresponding topic index. Finally, α and β are the parameters of the Dirichlet priors for the topic mixtures θ and topics φ, respectively. The corresponding probabilistic generative model is:

θ_d ∼ Dirichlet(α), d = 1, ..., D
φ_k ∼ Dirichlet(β), k = 1, ..., K
Z_{d,n} ∼ Categorical(θ_d), n = 1, ..., N_d
W_{d,n} ∼ Categorical(φ_{Z_{d,n}}),

where the Dirichlet distribution is given by

Dirichlet(θ | α) = (1 / B(α)) ∏_{i=1}^{K} θ_i^{α_i − 1},

and B(α) denotes the multivariate Beta function.
We adapted a collapsed Gibbs sampling method for the LDA model, based on [4] and implemented in MATLAB's Text Analytics Toolbox, to solve for the unknown parameters of the model: the topics φ_k and the per-clip distribution over topics θ_d.
To decide on a suitable number of topics for LDA, we compared the goodness-of-fit of LDA models across varying numbers of topics. We evaluated the goodness-of-fit of an LDA model after training on all videos except a held-out subset by calculating the perplexity on the held-out subset:

perplexity(D_test) = exp( − ( Σ_{d=1}^{M} log p(w_d) ) / ( Σ_{d=1}^{M} N_d ) ),

where D_test is the held-out collection of M videos, w_d denotes the responses for video d, and N_d is the number of 'yes' answers for video d. The perplexity indicates how well the model describes the data, with lower perplexity suggesting a better fit. We used 10-fold cross-validation, randomly partitioning the original dataset into ten equal-sized subsets; in each fold, nine subsets were used for training and the remaining subset for validation, so that each subset served exactly once as the validation set. The average perplexity across the 10 folds was used for choosing the number of topics (see Fig. 3 in the main document).
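The cross-validated perplexity procedure can be sketched with scikit-learn's LDA implementation, used here as a stand-in for the MATLAB Text Analytics Toolbox; the count matrix below is a synthetic placeholder for the 'yes'-answer counts over the 34 predefined categories.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Hypothetical stand-in: counts of 'yes' answers over 34 predefined
# emotion categories for 400 video clips.
X = rng.poisson(1.0, size=(400, 34))

def cv_perplexity(X, n_topics, n_folds=10):
    """Average held-out perplexity across folds (lower suggests a better fit)."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        lda.fit(X[train_idx])                      # train on all but the held-out fold
        scores.append(lda.perplexity(X[test_idx])) # evaluate on the held-out fold
    return float(np.mean(scores))

score = cv_perplexity(X, n_topics=5, n_folds=3)  # 3 folds here just to keep the sketch fast
```

Sweeping `n_topics` and plotting the averaged perplexities reproduces the model-selection curve described above.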

Neural Network Details
The details of the neural network configuration are as follows:
• Three fully-connected layers of size 10, 8, and 6 were used.
• After each fully-connected layer, a batch normalization layer was used to speed up training of the network and reduce the sensitivity to network initialization.
• A rectified linear unit (ReLU) activation layer was applied after each batch normalization layer.
• Finally, a softmax layer generated the classification probabilities.
• The batch size was 20, the maximum number of epochs was set to 10, and the learning rate was chosen as a hyperparameter equal to 0.001.
• The Adam optimization algorithm (Kingma and Ba [6]) was used to iteratively update the network weights during training.

Permutation Test
To evaluate the statistical significance of our classification findings, we conducted a permutation test based on the definition from [3]. The details of the procedure are as follows. Given the original data set D = {(X_i, y_i)}_{i=1}^{n} and a permutation function π for n elements, one can permute the labels y of D to produce a new data set D′ = {(X_i, π(y)_i)}_{i=1}^{n} whose marginal distributions of features p(X) and labels p(y) are the same as those of the original data set, while the dependence between features and labels is broken. The new data set D′ therefore reflects chance performance. Following the definition by Good (2000), we defined D̂ as a set of k randomized permutations D′ of the original data set D, sampled from the null distribution. We then calculated the p-value as

p = |{D′ ∈ D̂ : a(f, D′) ≥ a(f, D)}| / k,

where a(f, D′) is the training accuracy computed using the permuted labels in D′ and | · | denotes the cardinality (size) of a set. The calculated p-value is the fraction of permutations for which the classifier's accuracy on the permuted data was at least as high as on the original data, reflecting the probability that the classification accuracy for our data was obtained by chance.

In Fig. 3, we cannot decide in favor of a specific number of clusters because there is no clear minimum. To confirm that no statistically significant minimum is attained, we ran a paired t-test for each pair consisting of the smallest value, at 31, and one of the adjacent values at 30 or 32, using the perplexity values across the 10 validation folds as inputs. According to the t-test, the perplexity at 31 was not significantly different from that at 30 or 32 (p > 0.1); therefore, we cannot identify 31 as the minimum-perplexity solution.
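The permutation procedure can be sketched as follows. The nearest-centroid "classifier" and the two-cluster toy data are hypothetical stand-ins for the CNN and the fMRI features; only the label-shuffling logic and the p-value computation mirror the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_pvalue(accuracy_fn, X, y, k=1000):
    """Fraction of label permutations whose accuracy >= the observed accuracy."""
    observed = accuracy_fn(X, y)
    null = np.array([accuracy_fn(X, rng.permutation(y)) for _ in range(k)])
    return float(np.mean(null >= observed))

def centroid_accuracy(X, y):
    """Training accuracy of a simple nearest-centroid rule (placeholder statistic)."""
    centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    preds = [min(centroids, key=lambda c: np.linalg.norm(x - centroids[c])) for x in X]
    return float(np.mean(np.array(preds) == y))

# Toy data: two well-separated classes, so the observed accuracy should beat chance.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)
p = permutation_pvalue(centroid_accuracy, X, y, k=200)
```

Because permuting labels preserves both marginals while destroying the feature–label dependence, the null accuracies hover near chance and a small `p` indicates genuine structure.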

Supplementary Figures
Supplementary Figure 2: and three (right) discovered clusters, respectively. Bars represent the proportion of trials from each emotion category found within each cluster. Blue bars represent trials labeled as fear, orange bars happiness, and yellow bars sadness. The proportions of categories in each cluster sum to 1. The mixing proportion π_k reported below each cluster is the probability that an observation comes from that cluster, which reflects the size of the cluster.

Different methods indicate different numbers of components to retain. The dashed line indicates the retention criterion of eigenvalues greater than one, which indicates eight components. Alternatively, a parallel analysis [5] to determine the number of components suggested five components, indicated by the 95% and 5% quantiles of the distribution of shuffled data.

participants for every permutation. The average classifier accuracy across subjects was 47.1%, which fell within the 5% tail of the chance distribution, indicating statistical significance.