Massive online data annotation, crowdsourcing to generate high quality sleep spindle annotations from EEG data

Spindle event detection is a key component in analyzing human sleep. However, detection of these oscillatory patterns by experts is time consuming and costly. Automated detection algorithms are cost efficient and reproducible but require robust datasets to be trained and validated. Using the MODA (Massive Online Data Annotation) platform, we used crowdsourcing to produce a large open-source dataset of high quality, human-scored sleep spindles (5342 spindles, from 180 subjects). We evaluated the performance of three subtype scorers: “experts, researchers and non-experts”, as well as 7 previously published spindle detection algorithms. Our findings show that only two algorithms had performance scores similar to human experts. Furthermore, the human scorers agreed on the average spindle characteristics (density, duration and amplitude), but there were significant age and sex differences (also observed in the set of detected spindles). This study demonstrates how the MODA platform can be used to generate a highly valid open source standardized dataset for researchers to train, validate and compare automated detectors of biological signals such as the EEG.


Introduction
Sleep spindles are brief 10-16 Hz bursts of brain activity during stage N2 and N3 sleep. They are typically recorded from cortical surfaces by electroencephalography (EEG) and are markers of sleep dependent cognition 1 , early indicators of mental disorders 2 or brain deterioration due to age 3 . Spindles follow a characteristic waxing and waning profile, and generally last 0.5 to 1.0 seconds in duration. These characteristics are predominately trait-like, and remain remarkably stable night after night within an individual, but vary between individuals 4 . A small but consistently observed decrease of the spindle density, amplitude and duration occurs with age [5][6][7][8][9] . Sex differences of spindle activity linked to memory or aging have been reported [10][11][12][13] , where women tend to be less affected by aging 6,10 resulting a greater spindle activity (peak-to-peak amplitude 14 and density 4,7 ) in women than men, particularly in the elderly. Characteristics of spindles may index the underlying neuroanatomy involved in normal brain function, particularly in the processing of learning and memory, and have been related to intelligence [15][16][17][18][19][20][21] .
As well as their relation to biological processes, the detection of spindles is a key component in analyzing human sleep, as spindles are used to indicate the transition from stage N1 to N2 sleep during sleep scoring. However, detection and quantification of these oscillatory patterns by highly trained experts is time consuming and costly. Further, the definition of sleep spindles(A train of distinct waves with frequency 11-16 Hz with a duration > = 0.5 seconds, usually maximal in amplitude using central derivations) 22 is not entirely precise, and experts disagree on variations of sleep spindles. As well, the EEG signal may be obscured by other signal phenomena, thereby limited human detection. Critical for the advancement of sleep science is the development of automated feature detection tools. Recent years have highlighted the power of machine learning methods in the biosciences to augment expert clinical judgment. For example, cardiologist level arrhythmia detection 23 , or seizure diagnosis 24 . Automated methods do not fatigue, are cost efficient, remain consistent, and are readily deployable. However, previous studies have suggested that there are important differences between human and algorithm detected spindles 14 , leading to conflicting results depending on how spindles were detected 25,26 . For instance, a Results Spindle dataset collection. Polysomnographic data from 180 subjects was sourced from the Montreal Archive of Sleep Studies (MASS) 37 . The dataset was split into two "phases", where phase 1 consisted of 100 younger subjects (mean age of 24.1 years old) and phase 2 consisted of 80 older subjects (mean age of 62.0 years old). A subset of N2 stage sleep from the C3 channel was sampled from each subject (see methods for details). 25 sec epochs of this single channel EEG were presented to expert PSG technologists, researchers, and non-expert scorers via a custom web based scoring platform. Users identified the start and stop of candidate spindles, and indicated their confidence (high, med, low) for each spindle marked. In total, 47 PSG technologists, 18 researchers and 695 non-experts viewed 10,453, 6,636 and 37,467 epochs respectively in Phase 1. Phase 2 was viewed by 31 PSG technologists (7,941 epochs viewed). No scorers viewed the whole dataset, and the histogram of the number of scorer views per epoch image is shown in Fig. 1. A minimum number of scorers per epoch was crucial to compile a reliable gold standard (GS): the median number of scorers per epoch is 5 for the PSG technologists ( Fig. 1a,b), 4 for researchers (Fig. 1c) and 18 for non-experts (Fig. 1d). More than 95% of all the epochs have been seen by at least 3 PSG technologists. Table 1 presents the number of scorers and amount of data scored for each user subtype and phase. Almost 100,000 candidate spindles were identified by all scorers combined.
Human group consensus. The collected scores include many candidate spindles, and some of them showed low agreement across scorers (an event scored as a spindle by some can be scored as "not a spindle" by others). To create our GS (dataset of the highest quality spindles from the Group Consensus (GC) of experts) we averaged Performance of the human group consensus and automated detectors. A rigorous evaluation of spindle results from clinical and academic sleep studies hinges on quantifying the accuracy and biases of the spindle detection method used. Therefore, to inform future work, we evaluate the spindle detection performance of  www.nature.com/scientificdata www.nature.com/scientificdata/ experts, researchers and non-experts. Human detection of spindles is still considered the highest standard; however, many recent publications have utilized automated methods to save time and cost. Therefore, along with evaluating the performance of humans, seven popular and previously published spindle detection algorithms 6,34,[39][40][41][42][43] were run on the EEG data (see Methods for details on the algorithms). We compared the by-event performance of each automated detector or human group consensus (GC re and GC ne ) against the GS, and the individual experts were evaluated against the leave-one-out GS to avoid reporting bias.
The mean individual expert f1 was higher in phase 1 (0.76) than phase 2 (0.65), suggesting that spindles are easier to score in the younger cohort. A mean individual expert f1 of 0.67 has previously been reported 14 for a cohort similar to our phase 2. The f1 of the GC re and GC ne was ~0.8, suggesting that the group consensus performs better than individual experts, on average (Figs. 2a, 3d). It is noteworthy that individuals (including individual experts, non-experts and researchers) that have very high or low f1 scores tend to be scorers that did not score much data (indicated by lighter colored markers in Fig. 3). Scoring a small amount of data and thereby not encountering the full variety of epochs could have resulted in artificially high/low individual scores.
Similar to human scores, the f1 of the detectors were slightly reduced in the older cohort compared to the younger cohort, except for a9 43 which remained the same (Fig. 2a,b and Supplementary Table 2). Top performance (based on f1 score) on the younger cohort (phase 1) was the GC re followed closely by the GC ne . The a7 42 detector had the highest f1 in the younger cohort, closely matching performance of the average human expert (Figs. 2a, 3d). The highest f1 in the older cohort was reached by a9. Interestingly, a9 was the method most sensitive to the overlap threshold, as its performance decreases more rapidly than other methods as the threshold becomes more stringent (see methods). Therefore, spindles detected by the a9 algorithm and matching GS spindles are less perfectly temporally aligned (i.e. the start/stop and duration of spindles is less accurate) compared to the other methods. Detector a9 performance was followed closely by a7. We also evaluated the detectors performance against the GC re (see Supplementary Fig. 2a) or the GC ne (see Supplementary Fig. 2b). The performance of the automated methods remained essentially the same (for more details see Supplementary Table 3).
Automated detectors had their own specific tradeoff between precision (how many detected spindles were matching GS spindles) and recall (how many GS spindles were detected), the most balanced algorithms were a4 The phase2 (older cohort of 80 subjects). An overlap threshold of 0.2 (dotted vertical line) is used for this study, as it is the strictest threshold that does not decrease performance of the detectors. As the threshold increases (i.e. spindle events must overlap with a higher percentage of the overall length of a gold standard event in order to be a match), the overall performance (f1) decreases.   Table 2). The highest f1 on the whole cohort (phase 1 & 2, 180 subjects) was reached by a7 (0.72 against the GS) which is the same as the average individual expert f1. This performance is followed closely by a9 with a f1 = 0.71, a9 showed a higher recall (0.8) but a lower precision (0.65) (Fig. 3d). Figure 3(b,c) shows the Precision-Recall plot of the individual re or ne and their GC (GC re and GC ne respectively). Note that the majority of the individual researchers showed a high precision to the detriment of the recall (i.e. are overly conservative when marking spindles), and the resulting GC re is perfectly balanced with a GCt = 0. The performance evaluation of the detectors against the three different human references (GS, GC re , GC ne ) provided similar results (for more information see Supplementary Table 3). The number of spindles, and detailed performance metrics (True positives, False positives, False Negatives) for the GS, GC re , GC ne and each automated algorithm are reported in Supplementary Table 4. The performance (as quantified by the precision, recall and f1-score) of the seven tested detectors were essentially the same as reported previously 14,34,42,43 . Note that the performance of a9 was slightly more balanced in the original publication 43 than in the current study.
Spindle characteristics by-subject as a function of age and sex. Spindle activity decreases with age, and sex differences have also been reported [3][4][5][6][7][8][9][10][11][12][13] . We evaluated the age group difference between 100 subjects 18-35 years old and 80 subjects 50-76 years old, and sex difference between the 88 females and 92 males. We tested the spindle density measured as spindle per minute (spm), average maximum peak-to-peak amplitude (µV), average duration (s) and average dominant oscillation frequency (Hz) by-subject on the spindle dataset included in the www.nature.com/scientificdata www.nature.com/scientificdata/ GS (see Methods). A 2 × 2 ANOVA showed main effect for age and sex but no interaction on both for spindle density (age p = 0.0001 and sex p = 0.001) and average amplitude (age p = 1.5e-6 and sex p = 3e-8). The difference on the average spindle duration was significant only for age (p = 0.01). No significant effect was found for the dominant oscillation frequency of the spindle. Further analyses of the age and sex differences were performed with the non-parametric Mann-Whitney test (Fig. 4) since the spindle characteristics distributions were not all normally distributed based on the Shapiro-Wilk test. The spindle density in the GS was higher (p = 0.0002), average duration was longer (p = 0.008) and average amplitude was higher (p = 2e-06) in younger compared to older subjects (Fig. 4). The spindle density (p = 0.0008) and the average spindle amplitude (p = 1e-06) in the GS were also higher in females compared to males (Fig. 4). Supplementary Tables 2 and 3 contain detailed analysis of each detector's ability to capture the sex and age trends present in the GS.
The average spindle activity reported in the previous crowdsourcing project 14 was similar to our phase 2 (older cohort) despite a relatively high standard deviation across subjects. Warby et al. 14 reported 2.3 ± 2 spm with an average duration of 0.75 ± 0.27 s, a maximum peak-to-peak amplitude of 27 ± 11 μV and an oscillation frequency mean of 13.3 ± 1 Hz. We measured a by-subject dominant oscillation frequency of 13.1 ± 0.8 Hz (see Supplementary Table 5).
Comparison of detection methods. When considering which method to use to detect spindles, automated or otherwise, it is important to understand which spindle properties are best captured by each. To this end, we computed the correlation of the spindle density and spindle characteristics between the GS spindles and automatically detected spindles for each algorithm (a2-a9) as well as GC re and GC ne . The correlations for the spindle density in phase 1 (younger cohort, 100 subjects) are reported in Table 3. For phase 1, the correlation is higher for human GC than automated detectors. The GC ne is slightly more correlated (r 2 = 0.91) than the GC re (r 2 = 0.88). The correlation for the detectors is low for the spindle density (r 2 average across detectors is 0.37) and spindle duration (r 2 = 0.32), but very high for spindle amplitude (r 2 = 0.90) and high for spindle frequency (r 2 = 0.69). The detectors a7 and a9 performed better than the average of the detectors, especially for the spindle density which their r 2 were 0.73 and 0.85 respectively. The correlation coefficients for the detectors in phase 2 are reported in the Supplementary Table 6. Briefly, the correlation was higher for the spindle density but lower for all the other characteristics compared to the phase 1. Again, the detectors a7 and a9 outperformed the other detectors for the correlation with the GS spindle density with a r 2 = 0.83 and 0.88 respectively. phase 2) subjects, females (f) and males (m). Each dot in the plot represents one subject; darker points indicate multiple subjects at the same position. The '−' marker is the mean, the 'X' marker is the median across subjects, and the white box shows the mean + −1 × standard deviation. The *markers show significant difference with the Mann-Whitney test: "*"p values < = 0.05, "**"p value < = 0.01, and "***"p value < = 0.001.  Table 3. Correlation coefficient r² between Gold Standard from experts (GS exp ) and automated detectors (a2-a9) or group consensus of researchers (GC re ) or non-experts (GC ne ) for the spindle density, average duration and amplitude by-subject. The mean r 2 across detectors is also reported (Auto avg). The correlation coefficient p-value was significant (<0.05) for each detector or human scoring except for the spindle density of a8. Only the phase 1 is shown (younger 100 subjects). (2020) 7:190 | https://doi.org/10.1038/s41597-020-0533-4 www.nature.com/scientificdata www.nature.com/scientificdata/ We compared the spindle characteristics by-subject distribution of each detector (a2-a9) and human group consensus (GC re and GC ne ) to the GS for the whole cohort except for GC re and GC ne using a Mann-Whitney test. The variance in spindle characteristics was much larger across detectors than across the three human subtypes (PSG technologists, researchers and non-experts) ( Fig. 5 and Supplementary Table 7). The spindle density of a2 was much lower (0.9 spm, p = 9e-38) than the GS (3.8 spm), a3 (7 spm, p = 3.6e-25) and a8 (6.9 spm, p = 2.3e-34) were much higher than the GS. The average duration was much higher for a2 (1.15 s, p = 1.6e-33) and a9 (1.15 s, p = 2e-49) compared to the GS (0.78 s), but a3 (0.56 s, p = 4.7e-43), a4 (0.67 s, p = 1.1e-15) and a5 (0.5 s, p = 1.2e-48) were much lower. The average amplitude and oscillation frequency were about the same for all the detectors except a2 which showed spindles with greater amplitude (43 µV, p = 9.5e-30) than the GS (30 µV). The histogram at the cohort level (by-subject analysis) of the dominant oscillation frequency of spindles of the GS spindles or any of the automated detectors is unimodal, and does not support the hypothesis of decomposing the spindles into fast and slow spindles (Fig. 5d). Note that the slightly higher spindle density, duration and amplitude for the re and ne spindle dataset (Fig. 5) are biased due to the fact that only the younger cohort (phase 1) was scored by these groups (see Table 2 for the true comparison for the phase 1, "Phase 1 -Younger" column).
How many scorers are needed for crowdsourcing sleep spindle annotations?. Obtaining quality spindle scoring is costly and time consuming; knowing the number of scorers per epoch to achieve reliable results is worthwhile and may help to create future GS datasets. We identified that aggregating the scoring from two to four experts or researchers per epoch is optimum (Fig. 6a). However, three to ten non-experts were needed for similar performance (Fig. 6b). Zhao et al. 35 reported the need for at least six non-experts to score spindles in N2 sleep stage, but the plateau of the non-experts group consensus performance (f1 < 0.8) was reached around 10 non-experts. Figure 6 shows the f1-score-by-event of five "partial" GCs, each based on different number of scorers. We evaluated these partial GC's against the GC from another user subtype to avoid positive reporting bias. Using the leave-one-out GS was not sufficient since only few epochs include more than five experts per www.nature.com/scientificdata www.nature.com/scientificdata/ epoch. Therefore, partial GCs of experts (pGC exp ) were evaluated against the GC made from the scoring of all the researchers (GC re ), and partial GCs of researchers (pGC re ) and non-experts (pGC ne ) were evaluated against the formal GS made from the scoring of all the experts. Three random selections of the scorers per epoch were performed to see the inter-scorers/inter-epochs variation shown as a gray area. The Group Consensus Thresholds (GCt) used depended of the number of scorers per epoch and the user subtypes, from 0.4 for one scorer/epoch to the optimum GCt for each user subtype.

Discussion
In this study, we describe the use of the MODA platform and crowdsourcing to generate the group consensus of a large number of human scorers for sleep spindle detection in EEG data. The group consensus of human PSG technologists (experts) is used to form the gold standard (GS), and we outline a method to evaluate the performance of the different groups of scorers, including previously reported spindle detection algorithms. The group consensus of experts and non-experts produced a high-quality spindle dataset, and the automated detectors performed, on average, worse than human scorers. Our current study (specifically phase 2, 80 older subjects from MASS 37 ) is consistent with the results from our first crowdsourcing project 14 (110 old subjects from Wisconsin Sleep Cohort 44 ). The lower performance of spindle detection algorithms does not appear to be due to the age of the sleeping subjects, as we initially hypothesized 14 , as the finding is now similarly reproduced in a group of younger adults (phase 1 of MODA). The current study additionally included evaluation against a Group Consensus (GC) made from researchers scoring, and the analysis of spindle activity as a function of age and sex. Furthermore, two additional spindle detectors tested (a7 42 and a9 43 ) yielded performance equivalent to an average individual expert. To our knowledge, the MODA dataset is now the largest and most comprehensively scored sleep spindles GS available for validation of spindle detection algorithms.
The average spindle activity (such as the density, duration and amplitude) of the MODA GS for the phase 2 (80 old subjects from MASS 37 ) were surprisingly consistent with the expert GS from the previous crowdsourcing project 14 suggesting a high agreement between experts in an older cohort even across datasets. This agreement was also observed between the experts, re and ne (phase 1, 100 subjects) in the younger cohort of the current study. The high validity of our scoring allowed us to conclude the average spindle density for a young cohort was 4.2 spm with an average duration of 0.8 s, average maximum peak-to-peak amplitude of 33 µV and average dominant oscillation frequency of 13.3 Hz (activity when considering all the scorers of MODA, across all 100 subjects). The aggregated average spindle activity for older sleepers was 2.5 spm with an average duration of 0.75 s, average amplitude of 27 µV and average frequency of 13.2 Hz (MODA phase 2, 80 subjects 50-76 years old, and Warby et al. 14 110 subjects 42-72 years old). The agreement for the average spindle activity between automated algorithms was poorer than the human scoring. Only the a7 detector showed similar descriptive statistics to human scorers; i.e. average density of 3.9 spm, duration of 0.85 s, amplitude of 29 µV and frequency of 13.26 Hz for the whole cohort (phase 1 & 2). Spindles detected by a9 showed similarities with the GS spindles but the average duration was significantly longer (1.15 s). One caveat of the algorithmic performance evaluation is that the detectors were not tuned for the current dataset (instead using the default parameters suggested in their original publications). While many researchers do not tune these algorithms, the performance with tuning is potentially higher than reported here. We did not differentiate slow and fast spindles in our analysis because the oscillation frequency histogram of the spindles at the group level is clearly unimodal for the GS, GCre, GCne and each automated detector. The existence of slow and fast spindles could have been more obvious in our database with the analysis of additional channels, such as a frontal channel for slow spindles and a parietal channel for fast spindles 6,45,46 .
Most of the detectors tested in our study showed the same significant age and sex differences as the experts, which, in-turn, matches the literature 3-13 . However, algorithms a7 and a9 detected an additional significant sex difference: the spindles were on average longer in females, a finding which until now has only been seen at a www.nature.com/scientificdata www.nature.com/scientificdata/ trending significance level (p0.05 < p < 0.1) 4,7 . We did not detect this effect in our own GS (only a trend in correct direction was observed, with p = 0.2), and this potentially points to a7 and a9 detectors being more discriminatory than human scoring. The a8 34 detector was alone in showing an opposite age and sex effect for the spindle density. Not all detectors performed equally: the correlation of the by-subject spindle density between the GS and the detectors was generally low (an average r 2 across detectors of 0.37) compared to the human group consensus (GCre r 2 = 0.88 and GCne r 2 = 0.91) in the younger cohort, however the detectors a7 (r 2 = 0.73) and a9 (r 2 = 0.85) performed well. Algorithm performance was slightly better in the older cohort (average r 2 across detectors = 0.47), and again a7 (r 2 = 0.83) and a9 (r 2 = 0.88) performed well. Spindle density and amplitude was more accurately captured compared to spindle duration: correlations between GS and detectors/humans were generally lower for duration than for density (even for the human Group Consensus, GC re r 2 = 0.59 and GC ne r 2 = 0.73), and all detectors and human GC had a high correlation with the GS for the average spindle maximum peak-to-peak amplitude.
Creating an optimal GS is central to maximizing dataset validity. Obtaining the highest number of scorers with the highest level of expertise possible is, of course, the best scenario to create this optimal GS. However, our study suggests that collecting the scores of three researchers (re) or 10 non-experts (ne) provided a GC f1 of 0.8 against the expert's GS, providing a performance similar to the average individual expert (f1 = 0.76) (only observed in phase 1 since the phase 2 was not scored by re or ne). Comparing the spindle detectors to the GC of re (GC re ) or ne (GC ne ) allowed the same conclusion about their performance as the experts' comparison. The GC re proved to be a valid standard reference despite the high precision and the moderate recall of the researchers. Creating a GC where the f1-score is maximized effectively forced the GC to be balanced between the recall and the precision. We also identified that aggregating the scoring from two to four experts/researchers or three to ten non-experts is sufficient, and after this point, the performance of the GC begins to plateau.
Throughout the analysis, we have used a relaxed overlap threshold (only 20% overlap between a potential detection and a spindle in the GS was required to be a true positive). Clearly a higher threshold is desirable in practice, but we wanted to present the best performance possible for the automated detectors. All automated detectors decay in performance with increasing overlap threshold faster than human scorers, meaning that automated detectors do not predict the start/stop and duration of spindles similarly to humans. Using a stricter threshold such as 80% would produce an even larger difference between human and automated scoring performance. In this regard, the a9 detector, which had some of the highest performance scores with an overlap of 20%, was unique in that it had the most rapid decline of performance with increasing overlap threshold requirements (Fig. 2). The preference of the a9 detector to find very long spindles may be an area of potential improvement for this particular algorithm.
It should be clearly noted that the use of human-scored spindles as the gold standard is open for debate. In our study, the scoring performance reported and descriptive statistics of the MODA GS spindles are "true" only in the sense that many human experts agree on them. The lesser performance of spindle algorithms is only relative to human scoring. It remains unclear whether the algorithmically detected "hidden spindles" that are missed by humans are mechanistically identical to human detected spindles, spurious, or perhaps separate and biologically meaningful phenomena. Individual spindle detection algorithms may prove to be superior for specific uses, such as disease biomarkers, markers of cognition and intelligence, or in cases of co-recorded EEG and fMRI, where the signal-to-noise ratio becomes more challenging. What remains clear however, is that individual spindle detection algorithms find different sets of spindles relative to human scoring, and different than other spindle algorithms. Since the different algorithms are not entirely consistent with each other, it is difficult to use any one detector as the gold standard. Therefore, if you are designing an automated detector to match human scoring, then validation against the MODA GS is the best choice.
The variance in automated detectors means choosing one is not a one-size-fits-all process. Some detectors may be better in characterizing the spindle activity of unhealthy subjects, subjects under different conditions or to reveal specific features of spindles. For example, out of the seven detectors tested in the current study a2 39 was the best in separating Parkinson's disease patients from controls (unpublished results conducted in our lab). Furthermore, a8 showed poor results in the current study, but performed well when compared to an expert (f1 = 0.71) or a group consensus of non-experts (f1 = 0.73) who scored on a band-pass filtered 11-16 Hz EEG signal 34 . The a7 detector was the most similar to the human scoring, which is not surprising considering it was designed to emulate expert human scoring, and it has been trained on a human GS 42 . The detector a9 showed high performances in the current study. It is based on a non-linear model to separate the transients from the sustained rhythmic oscillations of the spindle 43 . The detector a9 is also proposed to pre-process the raw EEG prior to the spindle detection (possibly combined to another automated detector) 47 , a design possibly of interest for noisy EEG signals such as those recorded in fMRI. Instead of choosing the top performing algorithm defined here (e.g. a7), researchers might consider testing multiple or even a combination of detectors. For example, testing multiple published detectors initially (ideally on pilot data) to establish which detector is the most useful for their application, they could then use that method consistently for all future work, thereby allowing valid comparison between versions of their work. Specific research areas may focus on specific properties of the spindle signal (e.g. require amplitude sensitivity rather than frequency sensitivity), and, as shown here, some detectors are more sensitive specific signal properties. Therefore, automated algorithms to detect spindles may also be chosen based on the specific field of inquiry and their history in answering specific research questions. Overall, choosing appropriate spindle detection requires efforts from the researchers to standardize the evaluation of the detectors. A common set of spindles to compare with, e.g. the MODA GS, is one important step of this standardization.
There are some limitations to the current work. Producing a higher quality GS might be achieved with more experts (although see our recommendations for a sufficient number of scorers), but also by improvements to the MODA web interface. An interface which better replicates the PSG technologist work environment, such as presenting a complete montage of channels, the possibility to go back and forth between epochs, and displaying a (2020) 7:190 | https://doi.org/10.1038/s41597-020-0533-4 www.nature.com/scientificdata www.nature.com/scientificdata/ whole night per subject, and may yield higher validity expert annotations. Furthermore, the current GS includes only healthy subjects from 18-76 years old (distributed non-uniformly), and focuses on spindles in stage 2 and the C3 channel; different GS could be created from other populations, channels or stages.
With the release of the MODA annotations dataset, we hope to spur development of reliable, generalizable automatic sleep analysis tools. Complex models with many parameters (such as those in machine learning) are prone to overfitting (i.e. fitting dataset specific noise), and therefore, the reported accuracy of detectors may be inflated, and results may not generalize to new, unseen data. We suggest that developers should train, validate and test their algorithm with a nested cross-validation on the MODA GS.
In conclusion, our study demonstrates that crowdsourcing with experts, researchers and non-experts replicates well, and is a viable method for generating a large dataset of EEG events. We trust that the MODA interface and the GS dataset generated from it will prove a popular tool for researchers to collect data, train and validate automated detectors, and act as a standardized benchmark for selecting the most appropriate algorithm for specific research goals. The MODA dataset was a concerted effort, and highlights the importance of open, transparent and collaborative research. In this vein, we encourage all developed algorithms to be open source so that these tools may help us understand sleep further, including how spindles play a role in memory and mental disorders.

Methods
EEG data. Polysomnographic data used came from the Montreal Archive of Sleep Studies (MASS) 37 ; 180 subjects were sampled from the SS1-SS5 subsets. The dataset was split into "phase 1" and "phase 2"; 100 younger subjects (mean age of 24.1 years old) and 80 older subjects (mean age of 62.0 years old) respectively. "Blocks" of 115 s were randomly extracted from artifact free Stage N2 sleep. Three blocks (~6 mins) were extracted in 85 subjects in phase 1 and 65 subjects in phase 2; and 10 blocks (~20 mins) were extracted in the remaining 15 subjects of each phase. Almost 24 h of EEG time series was extracted to be scored. Table 4 presents the demographic information of subjects sampled and the amount of EEG data extracted. C3 channel was reformatted to C3-A2 when possible otherwise the original reference "C3-Linked Ear" (C3-LE) was kept. We band-pass filtered the signal between 0.3-30 Hz as suggested by AASM 22 and down sampled it to 100 Hz to reduce processing time.
Signal processing. Band-pass filter 0.3-30 Hz is implemented in MATLAB 2016b (MathWorks, Inc., Natick, MA, USA). The filter characteristics are Butterworth IIR 10th order. The filter is constructed with zero-pole-gain form converted into a Second Order Section (SOS) and the non-linear phase is corrected by the "filtfilt" function. The EEG down sampling to 100 Hz is also implemented in MATLAB 2016b (MathWorks, Inc., Natick, MA, USA) with the function "resample" which has been called to use a polyphase antialiasing filter.
MODa Web interface developed to collect spindle scoring. We developed a custom JavaScript web interface, called MODA, to collect the annotations of a large number of scorers. Signals to be scored on MODA must be encoded as images, therefore the extracted data blocks of 115 s (C3 EEG channel) were converted into 5 epoch images of 25 s (overlap of 2.5 s between consecutive epochs). Images were 10" wide per 1" high in five resolutions from 80 dpi to 125 dpi to suit the most common monitors. Negative voltages (+100 µV to −100 µV) were displayed upward to present data time series in a familiar way to experts. The scorers were first asked to register and complete a simple profile about their experience in sleep scoring (if any). A short description of how the interface works, and how to score spindles, was presented. The American Academy of sleep medicine's (AASM's) 22 spindle criteria were used to develop the instructions to score spindles. All the scorers underwent 10 practice trails with feedback; they were asked to draw boxes around each spindle they saw and rate the confidence (as "high", "medium" or "low") that each box contains a spindle ( Supplementary Fig. 3). After the completion of the practice session, they were allowed to score spindles (possibly in multiple short sessions) for the MODA dataset (Fig. 7). Phase 1 dataset (younger cohort) was presented first. Images were displayed as a "set" of 2 blocks (i.e. 10 epochs) to scorers. The same "set" was presented to different scorers until the desired number of views was reached. The number of sets scored was shown, but the total number of sets left to score was unknown for each scorer. Epochs may contain no spindles and there was no limit on the number of spindles that could be present.

MODa scorers. Scorers consisted of PSG Technologists (registered as Polysomnographic Technologists on
www.brpt.org/rpsgt), designed the experts (exp) in our study, Researchers (re) with experience in scoring sleep, and Non-Expert (ne) "MTurkers" recruited from Amazon Mechanical Turk. PSG technologists and researchers were recruited through on-line announcements, scientific conferences, word of mouth, and from the authors' personal database.  www.nature.com/scientificdata www.nature.com/scientificdata/ Creating the MODa group consensus. The visual scoring of spindles needs practice since other signal features also mimic spindles and they can be partially hidden or deformed; therefore, only spindles that have been marked with a certain agreement between scorers should be kept to form a high-quality set of spindles noted Group Consensus (GC). To increase the scoring quality, we asked to scorers to rate their confidence (low, med, high) for each spindle marked. Specifically, each sample of the EEG time series had a score weighted by the confidence rate given by the scorer; 1 for high, 0.75 for medium, 0.5 for low confidence and 0 for no spindle at all. Then, sample by sample, the scores were averaged across scorers, and if they exceed the chosen Group Consensus Threshold (GCt) then these samples were identified to be part of the GC spindle dataset. In this way, either some scorers must be certain, or many scorers must be moderately confident for a location to be marked as a GC spindle. The three subtypes of users who scored on MODA: "exp, re and ne" allowed the creation of different GC. The GC of experts was considered the highest-quality set of spindles of MODA, and therefore was designated as the formal "gold standard" GS of MODA. The GCt used to create our GS was chosen to maximize the average individual performance across experts. Each individual expert was evaluated against a GS which did not include its own scores (leave-one-out GS) to avoid a positive reporting bias. The GCt used to form the GC of re (GCre) or ne (GCne) were chosen to maximize the GC performance against the GS made from all the experts. These thresholds are arbitrary, and others may want to use a different aggregation method or thresholds to create their own GC. Additional clean-ups, on the created GC, were made to increase their validity. A spindle shown on two consecutive epochs (during the 2.5 s overlap of epochs) may be detected more easily on either epoch. Therefore, for the set of samples that occur on two epochs, we consider the highest score for each scorer. Too short (<0.3 s) adjacent (<0.1 s apart) spindles were merged, and spindle longer than 2.5 s or shorter than 0.3 s were filtered out of the GC.
Performance evaluation. The performance evaluation followed the strategy described in the previous spindle crowdsourcing project 14 . The primary performance evaluation was approached 'by-event' , meaning that spindles are considered to be variable length events. An overlap rule must therefore be applied to determine if two variable length and partially overlapping events (estimated spindle and GS spindle) can be considered a match. The recall (fraction of GS spindles found: (where TP is the number of True Positive, FP the False Positive and FN the False Negative) were used since spindles are relatively rare events. To consider an estimated spindle (detection) as correctly matching a GS spindle (event), the detection must overlap the event above a certain overlap threshold. The overlap is computed as the intersection (the part of event detected) over the union (sum of the length of the event and the detection) between the event and the detection. Only one detection can match an event, the one with the greatest overlap, other detections overlapping the same event are considered FP. The overlap threshold chosen was the strictest threshold that did not penalize any of the human group consensuses or automated algorithms. A low overlap threshold (0.2 was previously reported 14 ) allows detections to be shorter or longer than the GS spindle or being not perfectly aligned with the GS spindle. In addition, the performance evaluation was done at the 'by-subject' level. The multiple detections or measurements that belong to the same individual EEG recording (sleeping night of one subject) were aggregated into a single average for that subject. These characteristics are the spindle density measured as the number of spindles per minute (spm), the average spindle duration (s), amplitude (µV) and frequency (Hz). In detail, the amplitude was computed as the maximum peak-to-peak amplitude of the spindle band-pass filtered 11-16 Hz. The frequency was computed as the dominant oscillation frequency of the spindle through FFT (Fast Fourier Transform). An FFT with five seconds zero-padding was performed on the EEG signal of the spindle band-pass filtered 10-16 Hz, and the frequency with the maximum energy was extracted. The frequency histogram at the cohort level was generated to evaluate the opportunity of breaking down spindles into fast and slow. The by-subject analysis allowed looking at the correlation of the spindle density or characteristics with the GS. The by-subject performance can be high compared to the by-event performance if the detection bias (such as recall or precision) is constant across subjects (ex. detections are consistently 0.5 s longer or delayed by 0.2 s compared to the GS spindles). automated spindle detectors tested. To provide a framework of how to test automated algorithms on the MODA GS, we evaluated the performance of seven previously published spindle detectors 6,34,[39][40][41][42][43] (for more details about their respective design see Table 5). These detectors were selected because of the prevalence of their use, the requirement that they only need one EEG channel to perform the analysis, and the availability of www.nature.com/scientificdata www.nature.com/scientificdata/ open source Matlab code to facilitate their implementation. Detectors were run "out-of-the-box" with the default parameters suggested in their corresponding publications. The detectors were evaluated first by-event against the MODA GS and secondly against the GC made from the researchers (GC re ) or non-expert scoring (GC ne ). The by-subject analysis was also performed in order to compare their spindle density and average spindle characteristics to the human scoring. Age and sex differences for the spindle activity were also tested for each detector. Reported performances are valid "out-of-sample" performance since none of these detectors have been developed or trained on the MODA GS. Even if the EEG data for MODA comes from the open source MASS 37 dataset, only 15 subjects (out of the 180 subjects used for MODA) have existing spindles scored (and by only 2 experts compared to an average of 5 experts per epoch in our data). Furthermore, one of the previous experts from the MASS spindle dataset did not score in the same manner as MODA (looking at the band-pass filtered signal 11-16 Hz instead of looking only at the broad-band (0.3-30 Hz) C3 channel).

Data availability
The dataset generated in the current study is described on the Open Science Framework (OSF) 38 , it includes links to the spindle annotations and instructions on how to obtain the PSG data used (MASS 37 dataset). See the wiki on the OSF site, and Readme on linked Github repository for more information on how to download the data. The PSG files can be requested as described on the MASS web page (http://www.ceams-carsm.ca/mass). Sharing occurs after the requirements of the MASS databank application are met.

Code availability
The JavaScript code of the MODA interface developed to collect the annotations is open source 48 . The Matlab code to manage the PSG files and generate the GS from the spindle scoring files is also open source 38 .