U-Sleep: resilient high-frequency sleep staging

Sleep disorders affect a large portion of the global population and are strong predictors of morbidity and all-cause mortality. Sleep staging segments a period of sleep into a sequence of phases, providing the basis for most clinical decisions in sleep medicine. Manual sleep staging is difficult and time-consuming as experts must evaluate hours of polysomnography (PSG) recordings with electroencephalography (EEG) and electrooculography (EOG) data for each patient. Here, we present U-Sleep, a publicly available, ready-to-use deep-learning-based system for automated sleep staging (sleep.ai.ku.dk). U-Sleep is a fully convolutional neural network, which was trained and evaluated on PSG recordings from 15,660 participants of 16 clinical studies. It provides accurate segmentations across a wide range of patient cohorts and PSG protocols not considered when building the system. U-Sleep works for arbitrary combinations of typical EEG and EOG channels, and its special deep learning architecture can label sleep stages at shorter intervals than the typical 30 s periods used during training. We show that these labels can provide additional diagnostic information and lead to new ways of analyzing sleep. U-Sleep performs on par with state-of-the-art automatic sleep staging systems on multiple clinical datasets, even if the other systems were built specifically for the particular data. A comparison with consensus scores from a previously unseen clinic shows that U-Sleep performs as accurately as the best of the human experts. U-Sleep can support the sleep staging workflow of medical experts, which decreases healthcare costs, and can provide highly accurate segmentations when human expertise is lacking.

not selected on the basis of sleep disorders or cognitive impairment 1,9,10. Subjects were enrolled from 6 different US clinical sites in Alabama, Minnesota, Pennsylvania, Oregon, and California. Between 2003-2005, 3135 subjects were enrolled, of which 2909 underwent in-home overnight polysomnography (PSG). In this study, we considered a total of 3926 PSG records (2900 from visit 1 and 1026 from visit 2) from 2903 subjects. We excluded a total of 7 records (IDs aa2180, aa3370, aa1367, aa1715, aa1900, aa3903, aa3411, all from visit 1) due to missing EOG channels and/or sleep stage annotation files. EEG and EOG signals were recorded at 256 Hz and hardware high-pass filtered at 0.15 Hz. Hypnograms were scored according to the AASM criteria. For more information on the MROS dataset and studies, please refer to https://doi.org/10.25822/kc27-0425.

PHYS
The overnight PSG data from the 2018 PhysioNet/CinC Challenge were contributed by the Massachusetts General Hospital's Computational Clinical Neurophysiology Laboratory and the Clinical Data Animation Laboratory. The full dataset spans 1,985 patients who were monitored for the diagnosis of sleep disorders. The original challenge task was automatic arousal detection, but sleep stages were also annotated by clinical staff. The dataset was split into two equal-sized halves for training and testing. In our study, we considered the 994 subjects publicly available in the training subset. EEG and EOG signals were recorded at 200 Hz. Hypnograms were scored according to the AASM criteria by a total of 7 annotators (1 scoring per PSG). For more information, we refer to https://physionet.org/content/challenge-2018 and 11,12.

SEDF-SC & SEDF-ST
The Sleep-EDF Database (Expanded) consists of 197 whole-night PSG recordings. In the Sleep Cassette (SEDF-SC) sub-study, 153 PSGs were collected between 1987-1991 from healthy Caucasians aged 25-101 not taking sleep-related medication. The Sleep Telemetry (SEDF-ST) sub-study investigated the effect of temazepam intake on sleep in 22 Caucasian males and females. Participants took no other medication and were generally healthy but had mild difficulty falling asleep. Two recordings were collected from each individual on two nights at the hospital, one after temazepam intake and the other after placebo intake. EEG and EOG signals were recorded at 100 Hz. Hypnograms were scored according to the Rechtschaffen and Kales criteria, which we aligned to AASM as described in the Methods section. The SEDF-SC database has been regularly used for benchmarking automatic sleep stage classification algorithms. For more information on either sub-study, we refer to https://doi.org/10.13026/C2C30J and 12,13.
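
The alignment of Rechtschaffen and Kales (R&K) labels to AASM labels is described in the Methods section, which is not reproduced here. As a minimal sketch, assuming the conventional correspondence in which R&K stages 3 and 4 are merged into N3, the mapping could look as follows. The label strings and the handling of any other label (for example movement time) as unknown are assumptions made for illustration only.

```python
# Hypothetical label strings; real annotation files may encode stages differently.
RK_TO_AASM = {
    "W": "W",
    "S1": "N1",
    "S2": "N2",
    "S3": "N3",  # R&K stages 3 and 4 both map to AASM N3
    "S4": "N3",
    "REM": "R",
}


def align_rk_to_aasm(hypnogram, unknown="UNKNOWN"):
    """Map a list of R&K stage labels (one per 30 s epoch) to AASM labels.

    Labels without an entry in RK_TO_AASM are replaced by `unknown`
    (an assumption of this sketch, not a documented rule).
    """
    return [RK_TO_AASM.get(stage, unknown) for stage in hypnogram]


aligned = align_rk_to_aasm(["W", "S1", "S2", "S3", "S4", "REM", "MT"])
print(aligned)
```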

SHHS
The Sleep Heart Health Study (SHHS) was a large, prospective cohort study investigating sleep-disordered breathing, such as obstructive sleep apnea (OSA), as a risk factor for the development of cardiovascular disease 1,14. A total of 6441 subjects were recruited from 6 already ongoing National Heart, Lung, and Blood Institute studies (see https://clinicaltrials.gov/ct2/show/NCT00005275 for details). Adults of age 40 or older, who were able and willing to undergo home PSG, were enrolled between 1995-1998. Between 2001-2003, a second PSG was obtained for 3295 of the participants. For our study, we had a total of 8444 PSG records available (5793 from visit 1; 2651 from visit 2) collected from 5797 individuals. EEG and EOG signals were recorded at 125 Hz and 50 Hz, respectively, and hardware high-pass filtered at 0.15 Hz. Hypnograms were scored according to the Rechtschaffen and Kales criteria, which we aligned to AASM as described in the Methods section. For more information on SHHS, we refer to https://doi.org/10.25822/ghy8-ks59.
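
The datasets were recorded at different sampling rates (for SHHS, 125 Hz for EEG and 50 Hz for EOG), while the hyperparameter table lists a common re-sampling rate of 128 Hz. The re-sampling method itself is not stated in this section. The sketch below only derives the integer up/down factors that a rational (for example polyphase) resampler would need; the function name and the choice of a rational resampler are assumptions.

```python
from fractions import Fraction

TARGET_HZ = 128  # re-sampling rate (S) listed in the hyperparameter table


def resampling_factors(source_hz, target_hz=TARGET_HZ):
    """Return integers (up, down) with source_hz * up / down == target_hz."""
    ratio = Fraction(target_hz, source_hz)  # reduced to lowest terms automatically
    return ratio.numerator, ratio.denominator


# Sampling rates quoted in the dataset descriptions of this section.
rates_hz = {"SHHS EEG": 125, "SHHS EOG": 50, "MROS": 256, "PHYS": 200, "SEDF": 100, "SOF": 128}
for name, hz in rates_hz.items():
    up, down = resampling_factors(hz)
    print(f"{name}: {hz} Hz -> {TARGET_HZ} Hz (up={up}, down={down})")
```

Such factors could, for instance, be passed to a routine like `scipy.signal.resample_poly(x, up, down)`; whether the authors used this or another method is not specified here.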

SOF
A sub-study of the larger Study of Osteoporotic Fractures (SOF) investigated the association between sleep-disordered breathing and cognitive impairment in community-dwelling Caucasian women aged 65 and above 1,15,16. Subjects were enrolled from four US cities between 1986-1988. An additional cohort of African-American women was recruited between 1997-1998. In our study, we considered the unattended, whole-night, in-home PSG data collected between 2002-2004 from 461 participants at SOF visit 8 (subjects originally enrolled from the US metropolitan areas Minneapolis, Minnesota and Pittsburgh, Pennsylvania between 1986-1988). EEG and EOG signals were recorded at 128 Hz and hardware high-pass filtered at 0.15 Hz. Hypnograms were scored according to the Rechtschaffen and Kales criteria, which we aligned to AASM as described in the Methods section. For more information, we refer to https://doi.org/10.13026/C2X676.

DCSM
This new dataset was collected and prepared by the Danish Centre for Sleep Medicine (DCSM) and comprises 255 whole-night PSG recordings of patients visiting the center for diagnosis of non-specific sleep-related disorders. The records are fully anonymized and were selected randomly. The included subjects thus likely vary in demographic characteristics, diagnostic background and sleep/non-sleep related medication usage. The PSGs were collected between 2015-2018. EEG and EOG signals were recorded at 256 Hz and band-pass filtered to the interval 0.3-70 Hz (3 dB limits). Hypnograms were scored according to the AASM criteria. This dataset serves as an unbiased, random sample from the distribution of data generated by the DCSM. The DCSM dataset is publicly available at https://sid.erda.dk/wsgi-bin/ls.py?share_id=fUH3xbOXv8. This repository will be frozen and issued a DOI for persistent access following the review process.

ISRUC
The ISRUC dataset consists of randomly selected all-night PSG recordings acquired by the Sleep Medicine Centre of the Hospital of Coimbra University, Portugal 17. It covers both healthy subjects and patients with sleep disorders under the effect of sleep medication. It is divided into three sub-groups (ISRUC-SG1, -SG2, -SG3) with 100 sleep-disordered adults, 8 sleep-disordered adults with PSGs acquired twice on different nights, and 10 healthy control subjects, respectively. Data were acquired between 2009-2013. All records were scored by two experts. We considered hypnograms from annotator 1 for all records except subject_2_visit_2 of ISRUC-SG2, for which we used the hypnogram of annotator 2 due to missing data. EEG and EOG signals were recorded at 200 Hz and filtered using a band-pass Butterworth filter with lower and upper cutoff frequencies of 0.3 Hz and 35 Hz, respectively. Hypnograms were scored according to the AASM criteria. For more information on the ISRUC dataset, we refer to https://sleeptight.isr.uc.pt.

MASS-C1 & MASS-C3
The Montreal Archive of Sleep Studies (MASS) pooled 200 whole-night recordings from three different hospital-based sleep laboratories of the Center for Advanced Research in Sleep Medicine, Montreal, Canada 18. Subjects were aged 18-76 years at the time of recording, which occurred in the period 2001-2013. The subjects were organized into five subsets (C1-C5) according to the research protocols used for data acquisition. All included subjects were healthy controls, although an apnea-hypopnea index of up to 20 (moderate sleep apnea) was allowed for subjects in C1. In this study, we considered PSG recordings from subsets C1 (53 subjects) and C3 (62 subjects), for which the experts annotated 30-second intervals in line with the other datasets. EEG and EOG signals were recorded at 256 Hz and hardware high-pass filtered at 0.10 Hz (EOG) or 0.30 Hz (EEG) and low-pass filtered at 30 Hz (EOG) or 100 Hz (EEG). Hypnograms were scored according to the AASM criteria. For more information, we refer to http://ceams-carsm.ca/en/MASS.

SVUH
The St. Vincent's University Hospital / University College Dublin Sleep Apnea Database (SVUH) contains 25 full overnight PSG records of randomly selected individuals under diagnosis for either obstructive sleep apnea, central sleep apnea or primary snoring 12. Subjects were enrolled over a 6-month period. We considered data from the revised database of 2001. Subjects were at least 18 years old, had no known cardiac disease or autonomic dysfunction, and took no medication known to interfere with heart rate. EEG and EOG signals were recorded at 128 Hz. Hypnograms were scored according to the Rechtschaffen and Kales criteria, which we aligned to AASM as described in the Methods section. For more information, we refer to https://doi.org/10.13026/C26C7D.

DOD-H & DOD-O
The Dreem Open Dataset - Healthy (DOD-H) was collected from 25 volunteers at the French Armed Forces Biomedical Research Institute's Fatigue and Vigilance Unit in France. Subjects were without sleep complaints, aged 18-65 and locally recruited without regard to gender or ethnicity. The Dreem Open Dataset - Obstructive (DOD-O) was collected at the Stanford Sleep Medicine Center, California, US from 55 patients (clinical trial NCT03657329) with clinical suspicion of a sleep-related breathing disorder. Individuals clinically diagnosed with sleep disorders other than OSA, suffering from morbid obesity, taking sleep medications or with certain cardiopulmonary or neurological comorbidities were excluded from the study. EEG and EOG signals from both DOD-H and DOD-O were sampled at 256 Hz and each PSG was scored by 5 individual experts from 3 different sleep clinics. All experts were registered Sleep Technologists with at least 5 years of clinical scoring experience. For more information on the DOD datasets and consensus scoring, we refer to the recent publications 19-21.
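
The consensus-scoring procedure for the five expert annotations is defined in the cited publications and is not reproduced here. Purely to illustrate what a per-epoch consensus over several scorers can look like, the sketch below uses a plain majority vote and leaves tied epochs unresolved. Both the voting rule and the tie handling are assumptions and may differ from the procedure used for DOD-H and DOD-O.

```python
from collections import Counter


def majority_vote(scorings):
    """Per-epoch majority label across scorers.

    scorings: list of hypnograms (one list of stage labels per scorer),
    all of equal length. Epochs where the highest count is shared by two
    or more labels are returned as None (unresolved in this sketch).
    """
    consensus = []
    for labels in zip(*scorings):
        counts = Counter(labels).most_common()
        tied = len(counts) > 1 and counts[0][1] == counts[1][1]
        consensus.append(None if tied else counts[0][0])
    return consensus


# Hypothetical annotations from five scorers for three 30 s epochs.
scorers = [
    ["W", "N1", "N2"],
    ["W", "N1", "N2"],
    ["W", "N2", "N3"],
    ["N1", "N2", "N3"],
    ["W", "W", "N2"],
]
consensus = majority_vote(scorers)
print(consensus)
```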

Demographic Bias
We conducted an analysis of potential demographic bias in the average U-Sleep performance. We considered the variables age, sex and BMI and accounted for dataset origin. We could not evaluate important variables such as disease state and ethnicity, because the required information was missing for several datasets. Supplementary Figure 1 shows graphical representations of the test-set distribution of F1 scores as a function of age, sex, BMI and general disease stage, respectively. Note that these plots show only correlations, not causal relations.
We fitted a beta regression model (using the betareg 22 v3.1-3 package in R 23 v3.6.1) on 532 records from the test sets of datasets ABC, CCSHS, CFS, CHAT, HPAP, MROS, SHHS, SOF and SVUH. The 532 records represent all available test-set records for which age, sex and BMI information is available. The regression model predicts the mean F1 score as a function of those parameters along with variables encoding the dataset origin of each sample, giving a total of 11 covariates. The estimated coefficients of the model were −0.004 ± 0.007 for BMI (95% CI, z = −1.273, p = 0.203), −0.012 ± 0.004 for age (95% CI, z = −6.141, p < 0.001), and −0.102 ± 0.102 for sex (difference if subject is male, 95% CI, z = −1.954, p = 0.051). Coefficients were tested using two-sided Z-tests. The interpretation of the coefficients is that the expected F1 performance drops with increasing BMI and increasing age as well as for male subjects. However, only the effect of age was significant (p < 0.05). It is likely that this observation is confounded by the general worsening of health with age.

It took a total of ≈ 4,500,000 gradient updates (processed batches of data) to train the model to convergence, equivalent to observing ≈ 9,582 years of (non-unique) annotated PSG data. The total training set length is ≈ 19.4 years. The long training time needed to obtain the final model results both from the highly challenging task (learning sleep staging across clinical cohorts with varying and noisy labels, randomly varying input channels, and augmentation) and from our choice to train U-Sleep using a very small learning rate (please refer to the Methods section). As we were interested only in a single, final version of the U-Sleep model, the long training time is only an issue because of the energy consumption. We estimate that training U-Sleep consumed up to a total of 1,121 kWh (96.1 kg CO2 eq.) using the CarbonTracker tool 25.

2: U-Sleep model topology for input window size i = 3840 (30 seconds of 128 Hz signal), number of input channels C = 2, sequence length T = 35, number of output classes K = 5 and complexity factor scaling value α = 1.67. The complexity scaling modifies the number of filters in each block or layer, as described in the Methods section, to c′ = c · √α, where c is the original number of filters. Note that T · i = 134400. Each encoder block performs the following operations: convolution → activation function → batch normalization → (zero-padding if input length is odd) → max pooling (kernel width 2, stride 2). Each encoder block also outputs a residual connection (the output of the layer immediately before max pooling), which is passed to its corresponding decoder block. Each decoder block performs the following operations on its input: nearest-neighbour up-sampling (kernel width 2) → convolution → activation function → batch normalization → (crop if needed to match residual connection input) → concatenation with residual connection → convolution → activation function → batch normalization. The average pooling layer (ID=28) has a stride of i = 3840.
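
The quantities quoted above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the input length T · i, estimates the amount of PSG data seen during training, and traces how the signal length shrinks through successive encoder blocks. It assumes that every gradient update processes a full batch of 64 sequences of 35 epochs of 30 s each, and that a year has 365.25 days. The encoder depth passed to `encoder_lengths` is an illustrative value, not the actual number of blocks in U-Sleep.

```python
SAMPLE_RATE_HZ = 128          # re-sampled signal rate
EPOCH_SECONDS = 30            # duration of one sleep-staging epoch
T = 35                        # sequence length in epochs
BATCH_SIZE = 64               # batch size (B) from the hyperparameter table
GRADIENT_UPDATES = 4_500_000  # approximate number of processed batches
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year

i = SAMPLE_RATE_HZ * EPOCH_SECONDS   # samples per 30 s window
input_length = T * i                 # samples per input sequence

# Assumption: each update sees BATCH_SIZE sequences of T epochs.
seconds_per_batch = BATCH_SIZE * T * EPOCH_SECONDS
years_observed = GRADIENT_UPDATES * seconds_per_batch / SECONDS_PER_YEAR


def encoder_lengths(length, depth):
    """Signal length after each encoder block.

    Odd lengths are zero-padded by one sample before max pooling with
    kernel width 2 and stride 2, as described in the caption.
    """
    lengths = []
    for _ in range(depth):
        length = (length + length % 2) // 2
        lengths.append(length)
    return lengths


print("i =", i, "and T * i =", input_length)
print("years of PSG data observed:", round(years_observed))
print("lengths after 4 encoder blocks:", encoder_lengths(input_length, depth=4))
```

Under these assumptions the estimate comes out at roughly 9,582 years, in line with the figure quoted in the text.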
Post-processing: None
Re-sampling (S): 128 Hz
Batch size (B): 64. For each element in a batch, a class from the label set {W, N1, N2, N3, R} is determined by uniform sampling. A random PSG record containing this class is sampled, from which the input window is sampled randomly so that the selected class is in the window.
Steps per epoch: 443
Early stopping criteria: Validation F1. Mean per-class F1 scores computed over random subsets of up to 20 validation records from each dataset.
Model selection criteria: Validation F1
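
The class-balanced batch sampling described for the batch size entry can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the record layout (a dictionary from record ID to a list of per-epoch labels), the function name and the example records are hypothetical.

```python
import random

STAGES = ["W", "N1", "N2", "N3", "R"]


def sample_window(records, seq_len, rng=random):
    """Draw one training window using class-balanced sampling.

    records: dict mapping a record ID to its list of per-epoch stage labels
             (hypothetical layout).
    Returns (record_id, start_epoch, target_class); the window covers epochs
    start_epoch .. start_epoch + seq_len - 1 and contains target_class.
    """
    target = rng.choice(STAGES)  # 1) uniform draw over the five classes
    candidates = [rid for rid, hyp in records.items()
                  if target in hyp and len(hyp) >= seq_len]
    if not candidates:
        raise ValueError(f"no record of sufficient length contains {target}")
    rid = rng.choice(candidates)  # 2) random record containing that class
    hyp = records[rid]
    anchor = rng.choice([k for k, s in enumerate(hyp) if s == target])
    # 3) random window position such that the anchor epoch lies inside it
    first_start = max(0, anchor - seq_len + 1)
    last_start = min(anchor, len(hyp) - seq_len)
    start = rng.randint(first_start, last_start)
    return rid, start, target


# Hypothetical toy records with 50 epochs each.
records = {
    "rec_a": ["W"] * 10 + ["N1"] * 5 + ["N2"] * 20 + ["N3"] * 10 + ["R"] * 5,
    "rec_b": ["W"] * 20 + ["N2"] * 30,
}
rng = random.Random(0)
batch = [sample_window(records, seq_len=35, rng=rng) for _ in range(64)]
print(batch[:3])
```

Drawing the class first and the record second means that rare stages such as N1 appear in training batches about as often as common ones, regardless of how frequent they are in the recordings.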