LISA improves statistical analysis for fMRI

One of the principal goals in functional magnetic resonance imaging (fMRI) is the detection of local activation in the human brain. However, lack of statistical power and inflated false positive rates have recently been identified as major problems in this regard. Here, we propose a non-parametric and threshold-free framework called LISA to address this demand. It uses a non-linear filter for incorporating spatial context without sacrificing spatial precision. Multiple comparison correction is achieved by controlling the false discovery rate in the filtered maps. Compared to widely used other methods, it shows a boost in statistical power and allows to find small activation areas that have previously evaded detection. The spatial sensitivity of LISA makes it especially suitable for the analysis of high-resolution fMRI data acquired at ultrahigh field (≥7 Tesla).

F dr ← GetFdr(H orig , H 1 , ..., H K ) M ← Voxelwise threshold applied to image B z such that F dr < α (returnM ) To make sure that LISA does not produce inflated false positive rates, we subjected it to the test described by Eklund et al. [1]. We used the same four experimental designs (B1,B2,E1,E2) and the same 198 resting state data sets of the Beijing sample as in [1], as well as the same preprocessing pipeline which consisted of a correction for motion, an alignment with the MNI template, and four different levels of spatial smoothing. After preprocessing, we used SPM [2] to compute GLM-based contrast maps for each subject, each of the four "fake" designs and the four levels of spatial smoothing resulting in a total of 4x4x198=3168 contrast maps. We then randomly drew 1000 samples of size twenty from each of the 4x4 different sets of maps, and applied a group-level onesample t-test using LISA on each of these samples. By definition, every activation found in this scenario is a false positive. We found that the false positive rates were well within the acceptable range of 5%, see supplementary table S1 below.  Table 1: Results of the Eklund test applied to 3 Tesla group level data. The numbers denote the percentage of tests in which at least one voxel was a false positive. The tests were performed using two different preprocessing regimes in which the data were either resampled to (2mm) 3 or to (3mm) 3 resolution.
Supplementary Figure 1: Distribution of false positives of table 1. This figure shows the spatial distribution of the false positives across all experimental designs and all test subjects totalling 8 × 1000 data sets for the (2mm) 3 (left) and the (3mm) 3 preprocessing regime (right). The color code indicates the total number of false positives per voxel. In the (2mm) 3 preprocessing regime, the largest number of false positives per voxel is 42 out of 8000. In the (3mm) 3 preprocessing regime, it is 49.

Supplementary Methods 2: False positive rates in single subject 7 Tesla data
To check whether LISA produces inflated false positive rates when applied to single subject data at ultrahigh fields we modified the tests proposed by Eklund et al. [1,3]. Specifically, we used resting state 7T fMRI data of the Human Connectome Project (HCP) of 25 subjects [4,5]. All data sets were acquired with the following parameters: TR=1000ms, 1.6 mm isotropic voxel size. Data were acquired in four runs. Here we only used the first two runs. Each run had a duration of 15 minutes and contained 900 volumes. The preprocessing protocol is described in [6]. 1 Artefacts due to motion, scanner noise, and other nuisance sources were removed using FSL-Fix [8,9]. No spatial smoothing was applied.
We randomly generated 40 different experimental "fake" designs of four types each, namely two types of block designs (B1, B2) and two types of event-related designs (E1, E2). All designs simulated two different experimental conditions separated by intertrial "resting" periods. The designs of type B1 had trials of length 10 seconds, and the B2 designs had trials of length 20 seconds. The intertrial duration was 6 seconds. The event-related designs E1, E2 had variable intertrial durations with a mean duration of 6 seconds and standard deviation of 2 seconds. The trials of the E1 design had a duration of 2 seconds, those of E2 were 1 second long. We applied the canonical hemodynamic response function [10] and computed the contrast between the two experimental conditions. We performed 4x40 tests for each subject so that for each type of design we had a total of 1000 tests per condition. We performed two sets of tests using either one run with a duration of 15 minutes, or two runs with a total duration of 30 minutes. If two runs were used, to ensure exchangeability random permutations of task labels were only done within the same run. We recorded the false positive rate, i.e. the number of cases in which at least one voxel passed the significance threshold, see table 2 below.  one run   two runs  subject ID  B1  B2  E1  E2  B1  B2  E1  E2  100610  0  3  0  0  0  2  1  1  102311  1  0  1  0  0  2  0  1  104416  2  2  0  2  0  0  0  0  105923  2  7  1  1  1  2  0  0  111312  2  1  1  0  0  2  0  0  111514  0  1  1  2  1  1  0  0  114823  0  1  0  1  2  1  1  0  115017  0  0  1  1  0  0  1  2  118225  0  1

Supplementary Methods 3: Human Connectome Project data (3 Tesla)
Experimental design. In the motor task, participants were cued visually to tap their left or right fingers, squeeze their left or right toes, or move their tongue. Each block lasted 12 seconds (10 movements), and was preceded by a 3 second cue. Here we investigated only the left hand fingertapping condition. In the emotion experiment, participants were cued to decide which of two faces matched the face shown on top of the screen. In a second experimental condition, an analogous task was done using shapes instead of faces. The faces had either angry or fearful expressions. Here we investigated the contrast "faces minus shapes". For further details see [11].
Imaging parameters. We used data of the left-right phase encoding runs. All data sets were acquired with the following parameters: TR=720ms, TE=33.1ms, 2 mm isotropic voxel size, multiband factor 8. The preprocessing protocol is described in [6]. In addition, FSL-Fix [12][13][14] was used for data denoising, removing physiological nuisance effects as well as motion and multi-band artifacts. To account for intersubject variations, we applied a Gaussian spatial filter with a kernel size of 6 mm. Using FSL-FEAT [12][13][14], we performed individual one-sample t-tests resulting in 400 single-subject input maps of contrast values.

Supplementary Methods 4: Finger maps at 7 Tesla (single subject data)
Experimental design. A healthy subject (female, aged 25 years) was stimulated in the MR scanner at four fingers of her right hand using an MR-compatible tactile stimulator. The stimulator consisted of four stimulation belts, which stimulated one finger pad at a time. Hand and fingers did not move during stimulation. Surfaces of six different roughness levels were mounted on the belts. During stimulation, the belts moved and thus stimulated the finger pads that rested passively on the belts. After each stimulation, the subject was asked to make a roughness judgment (choosing between level 1=smooth to 6=very rough) about the previous stimulation. Each finger stimulation lasted for 9 seconds (blocked design), and was repeated 13 times per run resulting in 52 trials per run. The stimulation consisted of two runs that were separated by a short break, one forward scan (stimulating first the index finger, followed by the middle finger, ring finger, small finger, back to index, and so forth), and a reverse scan (stimulating first the small finger, followed by the ring finger, the middle finger, the index finger, and so forth). At the start and end of each run, a 15 second fixation screen was presented. Functional imaging data were used to calculate the representation of the fingers in contralateral primary somatosensory cortex, for details see [15].
Imaging parameters. Task-based fMRI data were acquired of a single subject at a 7 Tesla MRI scanner (Magnetom 7T, Siemens Healthcare Sector, Erlangen, Germany) using an EPI sequence, TR=2000ms, TE=18ms, with a spatial resolution of (1.5mm) 3 with 43 axial slices covering primary motor cortex and primary somatosensory cortex bilaterally.
Data analysis. We used a standard preprocessing pipeline which included motion correction and a removal of baseline drifts with a highpass filter (cutoff 1/100 Hz). For the SPM analysis, we applied a spatial Gaussian filter with FWHM=4mm during preprocessing. For the LISA analysis, no spatial smoothing was applied. Four contrasts were computed. In each contrast, one finger was contrasted against the other three fingers.

HCP Motor task
Supplementary Figure 21: Difference between reproducibility maps (motor task, sample size 20), Gaussfilter applied before versus after tests. The figure shows the effect of using a Gaussfilter applied before versus after the statistical test using various kernel sizes, see Fig 1 of the main manuscript. All data sets were preprocessed using a Gaussian filter with a smaller kernel size of FWHM=4mm (previously fwhm=6mm) to make room for the additional pre/post-test filtering. As a reference, the figure at the bottom row shows the reproducibility map obtained using FSL-TFCE (sample size 80). Note that the posttest Gaussfilter shows better reproducibility compared to the pre-test Gaussfilter particularly inside the main activation areas.
Supplementary Figure 22: Difference between reproducibility maps (emotion task, sample size 20), Gaussfilter applied before versus after tests. The figure shows the effect of using a Gaussfilter applied before versus after the statistical test using various kernel sizes, see Fig 1 of the main manuscript. All data sets were preprocessed using a Gaussian filter with a smaller kernel size of FWHM=4mm (previously fwhm=6mm) to make room for the additional pre/post-test filtering. As a reference, the figure at the bottom row shows the reproducibility map obtained using FSL-TFCE (sample size 80). Note that the posttest Gaussfilter shows better reproducibility compared to the pre-test Gaussfilter particularly inside the main activation areas.

HCP Motor task
Supplementary Figure 23: Difference in reproducibility, bilateral filter minus no filter (motor task). The image shows the difference between two reproducibility maps across 100 tests based on randomly drawn samples of size twenty. Reproducibility was assessed either with or without filtering. The reproducibility maps were computed in the same way as in Fig 2 of the manuscript. The colors represent differences in reproducibility scores. For example, a value of 20 means that this voxel survived FDR thresholding in 20 more tests with bilateral filtering than without. Thus, the bilateral filter produces much better reproducibility.

HCP Emotion task
Supplementary Figure 24: Difference in reproducibility, bilateral filter minus no filter (emotion task). The image shows the difference between two reproducibility maps across 100 tests based on randomly drawn samples of size twenty. Reproducibility was assessed either with or without filtering. The reproducibility maps were computed in the same way as in Fig 2 of the manuscript. The colors represent differences in reproducibility scores. For example, a value of 20 means that this voxel survived FDR thresholding in 20 more tests with bilateral filtering than without. Thus, the bilateral filter produces much better reproducibility.

Supplementary Methods 5: Simulations for power estimation
Here we describe a method for generating synthetic data to be used for power calculations. The first step is to fill an image with spatially uncorrelated random Gaussian noise. Next, this image is spatially convolved with the autocorrelation function (ACF) that was recently proposed by Cox et al. [16, p. 156] It is defined as ACF (r) = a exp(−r 2 /2b 2 ) + (1 − a) exp(−r/c) with 0 ≤ a ≤ 1, b > 0, c > 0, and r the Euclidean distance between voxels. For the simulations, we have used four sets of parameters, see Supplementary Table 3 152-171). The top image (S0) was generated by filling a brain template with spatially uncorrelated random Gaussian noise. The other images (S1,...S4) were generated by convolving the top image with the ACF using the parameters shown on the left. Note that spatial smoothness increases from S1 to S4. An arbitrary number of such noise fields can be generated by varying the seed of the random number generator used for generating S0. The underlying brain template has 221336 voxels and simulates a spatial resolution of (2mm) 3 .

Results of power estimation (spheres)
Results obtained from small spheres are shown in the top row, medium sized spheres in the middle row, large spheres in the bottom row. Four levels of spatial smoothness (S1,...S4) as shown in Supplementary Figure 26 were used. The signal to noise ratio varied from SNR=0.4 to SNR=1.0. Statistical inference was performed using four methods: SPM-GRF, AFNI, FSL-TFCE, and LISA. SPM and AFNI require initial cluster defining thresholds. They were set to CDT=0.001 (SPM) and CDT=0.01 (AFNI). As expected, sensitivity increased for all methods with increasing sphere size and increasing SNR. All methods performed poorly at SNR=0.4. Compared to the other methods, LISA shows higher power for small spheres, and for medium sized spheres added to smooth backgrounds (S3,S4). The advantage is less pronounced for large spheres presented against smooth backgrounds (S2,S3,S4). The data presented here are an average across three runs. The results of each single run are shown in Supplementary Figure 31.  Figure 30: Power estimation based on simulations using sticks as signals. Results obtained from thin sticks are shown in the top row, medium sticks in the middle row, thick sticks in the bottom row. Four levels of spatial smoothness (S1,..,S4) as shown in Supplementary Figure 26 were used. The signal to noise ratio varied from SNR=0.4 to SNR=1.0. Statistical inference was performed using four methods: SPM-GRF, AFNI, FSL-TFCE, and LISA. SPM and AFNI require initial cluster defining thresholds. They were set to CDT=0.001 (SPM) and CDT=0.01 (AFNI). As with the sphere shapes, sensitivity increased for all methods with increasing shape size and increasing SNR, and all methods performed poorly at SNR=0.4. Compared to the other methods, LISA shows higher power for thin sticks, and medium sized sticks added to smooth background noise (S3,S4). The advantage is less pronounced for large shapes presented against non-smooth backgrounds (S1,S2). The data presented here are an average across three runs. The results of each single run are shown in Supplementary Figure 31.