Since its origin in the late 1950s, polysomnography (PSG) has been at the center of sleep medicine testing with the main aim of standardizing and simplifying the scoring procedure. A common methodology has fostered clinical research and improved sleep disorder classification and comprehension. A PSG typically involves a whole-night recording of bio-signals. Brain activity, eye movements, muscle activity, body position, heart rhythm, breathing functions and other vital parameters are monitored overnight. PSG scoring is the procedure of extracting information from the recorded signals. Sleep stages, arousals, respiratory events, movements and cardiac events have to be correctly identified. Wakefulness and sleep stages, i.e., stages 1, 2, 3 and rapid eye movement (REM), can be mainly described by three bio-signals: electroencephalography (EEG), electrooculography (EOG) and electromyography (EMG). Clinical sleep scoring involves a visual analysis of overnight PSG by a human expert and may require up to two hours of tedious repetitive work. The scoring is done worldwide according to official standards, e.g., the American Academy of Sleep Medicine (AASM) scoring manual1.

Artificial intelligence (AI) is a powerful technique that has the potential to simplify and accelerate the sleep scoring procedure. Over the last two decades, a wide variety of machine learning (ML) and deep learning (DL) based algorithms have been proposed in the literature to solve the sleep scoring task2,3,4,5,6,7. DL based scoring algorithms have shown higher performance than the traditional ML approaches. Autoencoders8, deep neural networks (DNNs)9, U-Net inspired architectures10,11, convolutional neural networks (CNNs) and fully-CNNs12,13,14,15,16,17,18,19,20,21, recurrent neural networks (RNNs)22,23 and several combinations of them24,25,26,27,28,29,30,31,32 have recently been proposed for sleep scoring. The possibility of extracting complex information from a large amount of data is one of the main reasons to apply DL techniques in PSG classification. Another significant advantage is the ability to learn features directly from raw data, while also taking into account the temporal dependency among the sleep stages.

The literature offers many examples of how clinical guidelines have been exploited to support ML and DL based algorithms. The older Rechtschaffen and Kales (R&K)33 and the updated AASM1 scoring manuals have been designed to cover all the aspects of the PSG: from the technical/digital specifications (e.g., assessment protocols, data filtering, recommended EEG derivations) to the scoring rules (e.g., sleep scoring rules for adults, children and infants, movement rules, respiratory rules) and the final interpretation of the results. All the sleep scoring algorithms, both ML and DL based, are trained on sleep recordings annotated by sleep physicians according to these manuals. In some of these studies the sleep recordings are pre-filtered, as indicated in the AASM guidelines, before being fed to the scoring system. Almost all of the algorithms mentioned above are trained using recommended channel derivations and fixed-length (i.e., 30-second) sleep epochs. However, it still remains unknown whether a DL based sleep scoring algorithm actually needs to be trained by following these guidelines. More than a decade ago, it was already highlighted that sleep is not just a global phenomenon affecting the whole brain at the same time, but that sleep patterns such as slow waves and spindle oscillations often occur out-of-phase in different brain regions34. Hence, it may be that DL-based scoring algorithms could retrieve the needed information from brain regions that are not necessarily the ones indicated in the AASM guidelines, reaching equally high performance. Indeed, in the growing field of mobile sleep monitoring with wearable devices, many studies are attempting to tackle the automated sleep scoring task by using unconventional channels, not necessarily placed on the scalp, e.g., in-ear EEG35,36,37.
Furthermore, in the AASM manual and in previous studies38,39, age has been addressed as one of the demographic factors that mainly change sleep characteristics (e.g., sleep latency, sleep cycle structure, EEG amplitude etc.). To the best of our knowledge, it has never been attempted before to incorporate this information within a sleep scoring system: it could reasonably improve its performance.

To date, all the efforts have focused on optimizing a sleep scoring algorithm in order to be ready to score any kind of subject. Data heterogeneity is one of the biggest challenges to address. A common objective among researchers is to increase the model generalizability, i.e., the ability of the model to make accurate predictions over different or never seen data domains. The performance of a sleep scoring algorithm on a PSG from an unseen data distribution (e.g., different data domains/centers) usually decreases drastically11,30,40,41,42. This drop in performance can be due to a variety of well-known reasons: high inter-scorer variability; hardware variability, e.g., channels/derivations; high data variability from different sleep centers, e.g., subject distributions with different sleep disorders. In recent studies, Phan et al. and Guillot et al.30,40 propose to adapt a sleep scoring architecture to a new data domain via transfer learning techniques. They demonstrate the efficiency of their approaches in addressing the variability between the source and target data domains. Perslev et al., Olesen et al. and Vallat et al.11,41,42 propose to train their sleep scoring architectures on tens of thousands of PSGs from different large-scale heterogeneous cohorts. They demonstrate that using data from many different sleep centers improves the performance of their model, even on never seen data domains. In particular, Olesen et al.41 show that models trained on a single data domain fail to generalize to a new data domain or data center.

In our study we do several experiments to evaluate the resilience of an existing DL based algorithm against the AASM guidelines. In particular we focus on the following questions:

  (i) can a sleep scoring algorithm successfully encode sleep patterns, from clinically non-recommended or non-conventional electrode derivations?

  (ii) can a single sleep center large dataset contain enough heterogeneity (i.e., different demographic groups, different sleep disorders) to allow the algorithm to generalize on multiple data centers?

  (iii) whenever we train an algorithm on a dataset with subjects with a large age range, should we exploit the information about their age, conditioning the training of the model on it?

We run all of our experiments on U-Sleep, a state-of-the-art sleep scoring architecture recently proposed by Perslev et al.11. U-Sleep has been chosen mainly for the following reasons: it has been evaluated on recordings from 15,660 participants of 16 different clinical studies (four of them never seen by the architecture); it processes inputs of arbitrary length, from any arbitrary EEG and EOG electrode positions, from any hardware and software filtering; it predicts the sleep stages for an entire PSG recording in a single forward pass; it outputs sleep stage labels at any temporal frequency, up to the signal sampling rate, i.e., it can label sleep stages at shorter intervals than the standard 30 s, up to one sleep stage per sampled time point.

In the original implementation of U-Sleep we found an extremely interesting bug: the data sampling procedure was not extracting the channel derivations recommended in the AASM guidelines, as stated by the authors in11. Instead, atypical or non-conventional channel derivations were randomly extracted. This insight triggered the above mentioned question (i).

Our contributions can be summarized as follows: (1) we find that a DL sleep scoring algorithm is still able to solve the scoring task, with high performance, even when trained with clinically non-conventional channel derivations; (2) we show that a DL sleep scoring model, even if trained on a single large and heterogeneous sleep center, fails to generalize on new recordings from different data centers; (3) we show that the conditional training based on the chronological age of the subjects does not improve the performance of a DL sleep scoring architecture.


Datasets and model experiments

We train and evaluate U-Sleep on 19,578 recordings from 15,322 subjects of 12 publicly available clinical studies, as done previously11.

In this study we also exploit the Bern Sleep Data Base (BSDB) registry, the sleep disorder patient cohort of the Inselspital, University hospital Bern. The recordings have been collected from 2000 to 2021 at the Department of Neurology, at the University hospital Bern. Secondary usage was approved by the cantonal ethics committee (KEK-Nr. 2020-01094). The dataset consists of 8950 recordings from patients and healthy subjects aged 0–91 years. In our experiments we consider 8884 recordings, given the low signal quality of the remaining recordings. The strength of this dataset is that, unlike the ones available online, it contains patients covering the full spectrum of sleep disorders, many of whom were diagnosed with multiple sleep disorders and non-sleep related comorbidities43; thus providing an exceptionally heterogeneous PSG data set.

An overview of the BSDB and the open access (OA) datasets along with demographic statistics is reported in Table 1. In Supplementary notes: Datasets, we also report a detailed description of all the datasets used in this study.

Table 1 Datasets overview with demographic statistics.

The data pre-processing and the data selection/sampling across all the datasets are implemented as described in11 (see subsection U-Sleep architecture). In contrast with the recommendation of the AASM manual, no filtering was applied to the EEG and the EOG signals during the pre-processing procedure. Most importantly, we found that in the original implementation of U-Sleep11 atypical or non-conventional channel derivations were erroneously extracted. In fact, the data extraction and the resulting sampling procedure were creating random derivations (see Supplementary Table 6), which differ from those recommended in the AASM guidelines. In this study, we examine the resilience of U-Sleep with respect to the official AASM guidelines. To this aim, we extract the channel derivations following the guidelines (as was originally meant to be done in11), to better understand the impact of channel selection on the overall performance. Below we summarize all the experiments performed in our work on U-Sleep:

  (i) We pre-train U-Sleep on all the OA datasets using both the original implementation selecting the atypical channel derivations (U-Sleep-v0), and our adaptation following AASM guidelines (U-Sleep-v1). We split each dataset in training (75%), validation (up to 10%, at most 50 subjects) and test set (up to 15%, at most 100 subjects). The split of the PSG recordings is done per-subject or per-family, i.e., recordings from the same subject or members of the same family appear in the same data split. In Supplementary Table 7 we summarize the data split on each OA dataset. We evaluate both U-Sleep-v0 and U-Sleep-v1 on the test set of the BSDB dataset. We also evaluate the models on the whole BSDB(100%) dataset, to test on a higher number of subjects, with a higher heterogeneity of sleep disorders and a wider age range. A model pre-trained on the OA datasets and evaluated directly on the BSDB dataset is what we will refer to as direct transfer (DT) on BSDB.

  (ii) We exploit the BSDB dataset to evaluate whether a DL-based scoring architecture, trained with a large and a highly heterogeneous database, is able to generalize on the OA datasets from different data centers. We split the BSDB recordings in training (75%), validation (10%) and test set (15%). We run two different experiments on U-Sleep-v1: we train the model from scratch (S) on the BSDB dataset; we fine-tune (FT) the model pre-trained in (i) on the BSDB dataset, by using the transfer learning approach (see subsection Transfer learning). Then, we evaluate both (S) and (FT) on the test set of all the OA datasets and the test set of the BSDB dataset.

  (iii) We exploit the BSDB dataset to investigate whether U-Sleep needs to be trained by also having access to chronological age-related information. We split the BSDB dataset in seven groups, according to the age categories of the subjects38, resulting in G = 7 sub-datasets, see Supplementary notes: Age analysis. We further split the recordings of each sub-dataset in training (75%), validation (10%, at most 50 subjects) and test set (15%, at most 100 subjects). We run three different experiments on U-Sleep-v1: we fine-tune the model by using all the training sets of the seven groups (FT); we fine-tune seven independent models by using the training set of each group independently (FT-I); we fine-tune a single sandwich batch normalization model (exploiting the batch normalization layers, see subsection Conditional learning), to add the condition on the age-group-index G for each recording (FT-SaBN). These last two experiments are replicated considering only two age groups, i.e., babies/children and adults, as recommended in1, resulting in two additional sets of fine-tuned models (FT-I and FT-SaBN for G = 2). We then evaluate all of the fine-tuned models on the independent test set of each age group.

In Supplementary Table 8 we summarize the two different data split sets, in experiment (ii) and experiment (iii), on the BSDB dataset.
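The per-subject (or per-family) splitting rule described in the experiments above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name `split_by_group`, the mapping from recording to subject/family identifier, and the choice to send every group not drawn for validation or test to the training set are all hypothetical.

```python
import random
from typing import Dict, List


def split_by_group(
    recording_to_group: Dict[str, str],
    rng: random.Random,
    val_frac: float = 0.10,
    test_frac: float = 0.15,
    max_val: int = 50,
    max_test: int = 100,
) -> Dict[str, List[str]]:
    """Assign whole groups (subjects or families) to train/val/test.

    No group ever spans two splits. Validation and test sizes are capped
    at max_val and max_test groups; the remainder goes to training
    (an assumption of this sketch).
    """
    groups = sorted(set(recording_to_group.values()))
    rng.shuffle(groups)
    n_val = min(max_val, int(len(groups) * val_frac))
    n_test = min(max_test, int(len(groups) * test_frac))
    val_groups = set(groups[:n_val])
    test_groups = set(groups[n_val:n_val + n_test])

    splits: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for recording, group in sorted(recording_to_group.items()):
        if group in val_groups:
            splits["val"].append(recording)
        elif group in test_groups:
            splits["test"].append(recording)
        else:
            splits["train"].append(recording)
    return splits


if __name__ == "__main__":
    # Toy cohort: 1000 hypothetical subjects with two recordings each.
    recordings = {f"s{i}_r{j}": f"s{i}" for i in range(1000) for j in range(2)}
    result = split_by_group(recordings, random.Random(0))
    print({name: len(items) for name, items in result.items()})
```

With the caps in place, a cohort of 1000 subjects contributes 50 subjects to validation and 100 to test, so the training share exceeds 75%; for small cohorts the percentages, rather than the caps, determine the sizes.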

Performance overview

  (i) Clinically non-recommended channel derivations. In Table 2 we compare the performance of U-Sleep pre-trained on all the OA datasets, with (U-Sleep-v0) and without (U-Sleep-v1) using randomly ordered channel derivations. There is no statistically significant difference between the two differently trained architectures evaluated on the test set of the BSDB dataset (two-sided paired t-test p-value > 0.05). Most importantly, we also find no difference in performance with the direct transfer on the whole BSDB(100%) dataset (two-sided paired t-test p-value > 0.05). These results show that the architecture is able to generalize regardless of the channel derivations used during the training procedure, even on a never seen, highly heterogeneous dataset. In Supplementary Table 9 we also compare the performance of U-Sleep-v0 and U-Sleep-v1 per sleep stage. The results suggest that there are statistically significant differences between the two differently trained architectures for each of the classes (two-sided paired t-test p-value < 0.001). U-Sleep-v0 better recognizes N1 and N3 sleep stages, at the expense of awake, N2, and REM sleep stages.

    Table 2 (i) Clinically non-recommended channel derivations.
  (ii) Generalizability on different data centers with a heterogeneous dataset. In Table 3 we report the results obtained on U-Sleep-v1 pre-trained (i) on the OA datasets, and evaluated on all the test sets of the OA datasets and on the test set of the BSDB dataset. We also show the results obtained on U-Sleep-v1 trained from scratch (S) on the BSDB dataset, and the results obtained on the model pre-trained in (i) on OA and then fine-tuned (FT) on the BSDB dataset. Contrary to our expectations, both the models (S) and (FT), trained with a large and highly heterogeneous database, are not able to generalize on the OA datasets from the different data centers. The average performance achieved on the OA datasets with the (S) and (FT) models is significantly lower than the performance of the model pre-trained on OA (two-sided paired t-tests p-value < 0.001). By contrast, both (S) and (FT) show a significant increase in performance compared to the direct transfer (DT) on the test set of the BSDB dataset (two-sided paired t-tests p-value < 0.001). We also find that training from scratch results in significantly higher performance (two-sided paired t-test p-value < 0.001) on the BSDB dataset, compared to the performance of the fine-tuned model. No significant difference (two-sided paired t-test p-value > 0.05) occurs between (S) and (FT) in the average performance on the OA datasets. The pre-training on the OA datasets is not beneficial for the model fine-tuned on the BSDB dataset. With a large number of highly heterogeneous subjects, we can directly train the model from scratch on the dataset. However, we have to mention that the main advantage of using the fine-tuned model is that it reaches the same performance in less computational time, i.e., in fewer iterations (number of iterations: FT = 382 < S = 533).

    Table 3 (ii) Generalizability on different data centers with a heterogeneous dataset.
  (iii) Training conditioned by age. In Table 4 we first show the performance of U-Sleep-v1 fine-tuned on all the training sets of the seven BSDB groups, i.e., single model (FT-G1). We also report the performance achieved using the training set of each group independently (FT-I) with G = 7 and G = 2 respectively (i.e., seven and two models), and the performance achieved using the training set of the seven/two BSDB groups conditioned (FT-SaBN) by G = 7 and by G = 2 groups respectively (i.e., single model). The mean and the standard deviation of the F1-score (%F1) are computed across the recordings of the test set of each of the seven BSDB age groups. Comparing both the experiments (FT-I and FT-SaBN) and types of grouping (G = 2 and G = 7) with the baseline (FT), we do not find a statistically significant increase in performance in any of the subgroups (one-sided paired t-test p-value > 0.05). Despite the lack of significant performance differences in our age-conditioned models, REM sleep seems to be less accurately predicted for small children if the training data set only consists of data from adults (see Supplementary Fig. 13, confusion matrix for test {CH} against Model 1b). This is an interesting finding since small children exhibit more REM sleep (see Supplementary Fig. 11). Visual scoring guidelines for small children differ from the guidelines for adults, with REM sleep scoring strongly relying on irregular respiration44. However, overall these results show that, despite the age-related differences, the DL algorithm is able to deal with different age subgroups at the same time, without needing to have access to chronological age-related information during the training procedure.

    Table 4 (iii) Training conditioned by age.


In this paper, we demonstrate the resilience of a DL network when trained on a large and heterogeneous dataset. We focus on the three most significant influencing factors: channel derivation selection, multi-center heterogeneity needs, and age-conditioned fine-tuning. Channel derivations do carry complementary information, and a DL-based model proved resilient enough to extract sleep patterns also from atypical and clinically non-recommended derivations. We show that the variability among different sleep data centers (e.g., hardware, subjective interpretation of the scoring rules, etc.) needs to be taken into account more than the variability inside one single sleep center. A large database such as the BSDB (sleep disorder patient cohort of the Inselspital, with patients covering the full spectrum of sleep disorders) does not have enough heterogeneity to strengthen the performance of the DL-based model on unseen data centers. Lastly, we show that a state-of-the-art DL network is able to deal with different age groups simultaneously, mitigating the need to add chronological age-related information during training. In summary, what seems to be essential for visual scoring (e.g., specific channel derivations, or specific scoring rules that also consider the age of the individuals) is not necessary for the DL based automatic procedure, which follows other analysis principles.

The resilience of the DL-based model to the atypical or non-conventional channel derivations is fascinating. The model still learns relevant sleep patterns while solving the scoring tasks with state-of-the-art performance on multiple large-scale heterogeneous data cohorts. This result supports the feasibility of exploiting channels other than the AASM standard ones (e.g., in wearable applications). Although this is a remarkable finding, it would be useful to further investigate the reasons why the DL model is still able to encode clinically valid information. DL has been criticized for its non-interpretability and its black-box behavior, factors that may actually limit its implementation in sleep centers. Future work, strongly linked to the hot topic of explainable AI, should focus on the following open questions: which sleep patterns/features are our DL algorithms encoding/highlighting from the typical/atypical channel derivations? How does each individual channel affect the performance of the DL algorithms?

AASM scoring rules have been widely criticized over the years, for various reasons. The scoring manual has been designed to consider the sleep stages almost as discrete entities. However, it is well known that sleep should be viewed as a continuum, with gradual transitions from one stage to another. A growing consensus suggests that we should reconsider the AASM scoring rules and the entire scoring procedure. Given the high variability among the individual scorers and different sleep centers, more efforts should be made by the scientific community to improve the standardization of the scoring procedure. Perhaps the introduction, even partial, of an automated procedure could help.

The inter-scorer variability inevitably affects the performance of any kind of algorithm, since all algorithms learn from noisy, variable labels. A very relevant finding of this paper is that the heterogeneity given by data coming from different sleep data centers (e.g., different sleep scorers) is much more relevant than the variability coming from patients affected by different sleep disorders. These latter insights raise research questions yet to be answered: how could we define and quantify the heterogeneity of a sleep database? To what extent could we consider a database heterogeneous enough to allow the algorithm to generalize across different data domains/centers?

The age-related findings drive another important observation: the DL algorithm is intrinsically encoding age-related features, which may not be categorized into discrete age-subgroups. As sleep should be considered as a continuous physiological process, the hyperspace of features associated with the respective age-subgroups should be considered a continuum as well. We are forcing the algorithm to learn sleep patterns based on the chronological age of the subjects, but there are many other factors that the DL model is taking into account. Certainly, biological age has an effect on sleep characteristics. Although the DL algorithm does not need to be guided with the chronological age information during its learning procedure, it may be that with a less optimal DL-based approach (e.g., architecture, number of input channel derivations) age would still be useful information to provide as input.

To our knowledge, our study on the automatic sleep scoring task is the largest in terms of the number of polysomnography recordings and of diversity with respect to both patient clinical pathology and age spectrum.

Considering the previous study findings and our present results, the strong resilience and the generalization capability of a DL-based architecture are undeniable. DL algorithms are now reaching better performance than the feature-based approach. DL is definitely able to extract feature representations that are extremely useful to generalize across datasets from different sleep data centers. These hidden feature representations seem to better decode the unconscious analytical evaluation process of the human scorer. To conclude, given that the AASM rules are so widely criticized, the sleep labels so noisy (e.g., high inter- and intra-scorer variability), and sleep so complex: could an unsupervised DL-based sleep scoring algorithm, that does not need to learn from the labels, be the solution?


U-Sleep architecture

U-Sleep11, an optimized version of its predecessor U-Time10, is inspired by the popular U-Net architecture for image segmentation45,46,47. Below we briefly describe the U-Sleep architecture; for further details we refer the reader to11.

U-Sleep is a fully convolutional deep neural network. It takes as input a sequence of length L of 30-second epochs and outputs the predicted sleep stage for each epoch. The peculiarity of this architecture is that it defines the general function \(f(\mathbf{X};\theta ):\mathbb{R}^{L\cdot i\times C}\to \mathbb{R}^{L\times K}\), where θ are the learnable parameters, L > 0 is the number of fixed-length windows with i > 0 sampled points each, C is the number of PSG channels and K is the number of sleep stages. Hence, U-Sleep takes as input any temporal section of a PSG (even the whole PSG) and outputs one label for each fixed-length window of i points. Ideally L · i > 4096, because U-Sleep contains 12 pooling operations, each downsampling the signal by a factor of 2 (i.e., by \(2^{12} = 4096\) overall). The architecture requires at least C = 2, one EEG and one EOG channel, sampled/resampled at 128 Hz, with K = 5, i.e., awake, N1, N2, N3, R.
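As a quick check of the length constraint above, the following sketch computes the number of input samples L · i for a few sequence lengths and compares it with the total downsampling factor of the 12 pooling operations. The function and constant names are ours, chosen for illustration; they are not part of the U-Sleep code base.

```python
FS_HZ = 128        # sampling rate stated in the text
EPOCH_SEC = 30     # standard scoring window, in seconds
N_POOLING = 12     # pooling operations, each halving the temporal length


def input_samples(n_epochs: int, fs_hz: int = FS_HZ, epoch_sec: int = EPOCH_SEC) -> int:
    """Total number of sampled points L * i for n_epochs windows."""
    return n_epochs * epoch_sec * fs_hz


def total_downsampling(n_pooling: int = N_POOLING) -> int:
    """Overall temporal reduction after n_pooling halvings."""
    return 2 ** n_pooling


def long_enough(n_epochs: int) -> bool:
    """True if the input is longer than the total downsampling factor."""
    return input_samples(n_epochs) > total_downsampling()


if __name__ == "__main__":
    for n_epochs in (1, 2, 35):
        print(n_epochs, input_samples(n_epochs), long_enough(n_epochs))
```

With i = 30 s × 128 Hz = 3840 points per window, a single epoch (3840 points) is shorter than 4096, whereas two epochs (7680 points) and the training length L = 35 (134,400 points) satisfy the constraint.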

U-Sleep architecture consists of three learning modules as shown in Fig. 1.

  • The encoder module is designed to extract feature maps from the input signals, each resulting in a lower temporal resolution compared to its input. It includes 12 encoder blocks. Each block consists of a 1D convolutional layer, one layer of activation function - i.e., exponential linear unit (ELU), a batch normalization (BN) layer and one max-pooling layer.

  • The decoder module is designed to up-scale the feature maps to match the temporal resolution of the input signals. We can interpret the output of the decoder as a high-frequency representation of the sleep stages at the same sampling frequency fs as the input signal (e.g., with fs = 128 Hz, one sleep stage is output every 1/128 s). The module includes 12 decoder blocks. Each block consists of a nearest neighbor up-sampling layer (e.g., with kernel_size = 2, the length of the input feature map is doubled), a 1D convolutional layer, one layer of ELU activation function and a BN layer. Then, a skip connection layer combines the up-scaled input with the output of the BN layer of the corresponding encoder block. Finally, a 1D convolution, an ELU non-linearity and a BN are applied to the stacked feature maps. The output has the same temporal resolution as the input signal.

  • The segment classifier module is designed to segment the high-frequency representation output by the decoder into the desired sleep stage prediction frequency. The module consists of a dense segmentation layer (i.e., a 1D convolutional layer with a hyperbolic tangent activation function), an average-pooling layer (e.g., with kernel_size = stride_size = 30 s × fs when matching the prediction frequency of a sleep scorer) and two 1D convolutional layers (the first using an ELU activation function, and the latter using a softmax activation function). The output of the segment classifier is an L × K matrix, where L is the number of segments and K = 5 is the number of sleep stages.
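To illustrate how the average-pooling step of the segment classifier sets the prediction frequency, here is a simplified sketch in plain Python. It is an assumption-laden toy, not the U-Sleep implementation: it omits the learned convolutions and the softmax, takes per-sample class scores as given, and picks the stage with the highest pooled score. The names `average_pool` and `predict_stages` are hypothetical.

```python
from typing import List, Sequence


def average_pool(scores: Sequence[Sequence[float]], window: int) -> List[List[float]]:
    """Average per-sample class scores over non-overlapping windows.

    Equivalent to pooling with kernel_size = stride_size = window.
    Trailing samples that do not fill a whole window are dropped.
    """
    n_segments = len(scores) // window
    n_classes = len(scores[0])
    pooled = []
    for s in range(n_segments):
        chunk = scores[s * window:(s + 1) * window]
        pooled.append([sum(row[k] for row in chunk) / window for k in range(n_classes)])
    return pooled


def predict_stages(scores: Sequence[Sequence[float]], window: int) -> List[int]:
    """One class index per segment: arg-max of the pooled scores."""
    return [max(range(len(p)), key=p.__getitem__) for p in average_pool(scores, window)]


if __name__ == "__main__":
    # Toy input: 8 samples, 2 classes (the real model uses K = 5 and fs = 128 Hz).
    toy = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 3 + [[1.0, 0.0]]
    print(predict_stages(toy, window=4))  # one label per 4 samples
    print(predict_stages(toy, window=2))  # finer resolution: one label per 2 samples
```

Changing `window` is all that is needed to move from the standard 30-second resolution (window = 30 × fs samples) to a finer one, which is the property the text attributes to the segment classifier.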

Fig. 1: U-Sleep overall architecture.

U-Sleep is a fully convolutional deep neural network. It takes as input a sequence of length L of 30-second sleep epochs and it outputs the predicted sleep stage for each epoch. We slightly modified the original figure (see Figure 2: Model architecture in11) reporting the additional SaBN layers exploited in the conditional learning procedure (see subsection Conditional learning). Please refer to11 for details on the U-Sleep model architecture and training parameters.

The sequence length L, the number of filters, the kernel and the stride sizes are specified in Fig. 1. The softmax function, together with the cross-entropy loss function, is used to train the model to output the probabilities for the K = 5 mutually exclusive classes that correspond to the five sleep stages. The architecture is trained end-to-end via backpropagation, using the sequence-to-sequence learning approach. The model is trained using the mini-batch Adam gradient-based optimizer48 with a learning rate lr. The training procedure runs up to a maximum number of iterations, unless the early stopping condition is met first.

Unlike11, we consider early stopping and data augmentation as regularization techniques. As stated in49, “regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error”. Early stopping and data augmentation do so in different ways, but both decrease the generalization error. With early stopping, the training procedure is stopped once the performance (i.e., F1-score) on the validation set has not improved over a fixed number of consecutive iterations, set by the so-called patience parameter. With data augmentation, the input signals are randomly modified during the training procedure to improve model generalization. Variable-length portions of the input sequences are replaced with Gaussian noise. For each sample in a batch, with 0.1 probability, a fraction of the sequence is replaced with \(N(\mu =\hat{\mu },{\sigma }^{2}=0.01)\), where \(\hat{\mu }\) is the mean of the sample’s signals. The fraction is sampled from a log-uniform distribution {min = 0.001; max = 0.33}. With a 0.1 probability at most one channel is entirely replaced by noise.
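The noise-replacement augmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the original code: we assume the replaced fraction is one contiguous span with a uniformly random start, and that the noise mean is computed per channel; the function names are hypothetical.

```python
import math
import random
from typing import List


def sample_fraction(rng: random.Random, lo: float = 0.001, hi: float = 0.33) -> float:
    """Draw a fraction from a log-uniform distribution on [lo, hi]."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))


def augment_channel(signal: List[float], rng: random.Random,
                    p: float = 0.1, variance: float = 0.01) -> List[float]:
    """With probability p, replace a contiguous fraction of the signal with N(mean, variance) noise."""
    if rng.random() >= p:
        return list(signal)
    mean = sum(signal) / len(signal)
    sigma = math.sqrt(variance)
    n_replace = max(1, int(sample_fraction(rng) * len(signal)))
    start = rng.randrange(0, len(signal) - n_replace + 1)  # assumption: random span position
    out = list(signal)
    for t in range(start, start + n_replace):
        out[t] = rng.gauss(mean, sigma)
    return out


def replace_one_channel(channels: List[List[float]], rng: random.Random,
                        p: float = 0.1, variance: float = 0.01) -> List[List[float]]:
    """With probability p, replace one randomly chosen channel entirely with noise."""
    out = [list(ch) for ch in channels]
    if rng.random() < p:
        c = rng.randrange(len(out))
        mean = sum(out[c]) / len(out[c])
        out[c] = [rng.gauss(mean, math.sqrt(variance)) for _ in out[c]]
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    eeg = [0.0] * 3840  # one 30-second epoch at 128 Hz
    augmented = augment_channel(eeg, rng, p=1.0)  # p=1.0 forces the augmentation for the demo
    print(sum(1 for a, b in zip(eeg, augmented) if a != b), "samples replaced")
```

Sampling the fraction log-uniformly makes short corrupted spans much more frequent than long ones, while still occasionally masking up to a third of the sequence.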

The training parameters (e.g., Adam-optimizer parameters beta1 and beta2, mini-batch size etc.) are all set as stated in11. The learning rate, the early stopping patience parameter and the maximum number of iterations have been changed to \(10^{-5}\), 100, and 1000 respectively, to let U-Sleep converge faster. The architecture has several hyperparameters (e.g., number of layers, number/sizes of filters, regularization parameters, training parameters, etc.) which could be optimized to tune its performance on any dataset. We decide not to systematically tune all these parameters, as this is out of our scope, but to fix them for all the experiments, as done in the original network.
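The interplay between the maximum number of iterations and the patience parameter can be sketched with a generic early-stopping loop. The callbacks `train_step` and `validate` are hypothetical placeholders for one training iteration and one validation pass returning an F1-score; the loop itself is a common pattern and is not taken from the U-Sleep code.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    train_step: Callable[[int], None],
    validate: Callable[[int], float],
    max_iterations: int = 1000,
    patience: int = 100,
) -> Tuple[int, float]:
    """Train for at most max_iterations, stopping early when the validation
    score has not improved for `patience` consecutive iterations.

    Returns the iteration with the best validation score and that score.
    """
    best_score = float("-inf")
    best_iter = -1
    for it in range(max_iterations):
        train_step(it)
        score = validate(it)
        if score > best_score:
            best_score, best_iter = score, it
        elif it - best_iter >= patience:
            break  # no improvement for `patience` iterations
    return best_iter, best_score


if __name__ == "__main__":
    # Toy validation curve peaking at iteration 10.
    history = []
    result = train_with_early_stopping(
        train_step=history.append,
        validate=lambda it: 1.0 - abs(it - 10) / 100,
        patience=5,
    )
    print(result, "after", len(history), "iterations")
```

In practice the model weights at `best_iter` would be checkpointed and restored; that bookkeeping is left out here to keep the sketch short.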

Data pre-processing

The signals are resampled to 128 Hz and rescaled (per channel and per subject), so that each channel has median 0 and interquartile range (IQR) 1. Values with an absolute deviation from the median above 20 × IQR are clipped. The signals outside the range of the scored hypnogram are trimmed. The recordings scored according to the Rechtschaffen and Kales rules result in six scoring classes, i.e., awake, N1, N2, N3, N4, and REM. In order to use the AASM standard, we merge the N3 and N4 stages into a single stage N3. The loss function for stages such as MOVEMENT and UNKNOWN is masked during the training procedure.
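A minimal sketch of the scaling, clipping and label-merging steps is given below (resampling and trimming are omitted). It rests on our own assumptions: the signal is scaled first and then clipped at ±20 in scaled units, quartiles are computed with the "inclusive" method of Python's `statistics.quantiles`, and masked labels are represented by `None`. The names `robust_scale`, `to_aasm` and `RK_TO_AASM` are hypothetical.

```python
import statistics
from typing import List, Optional, Sequence

# R&K labels mapped to AASM labels; N4 is merged into N3.
RK_TO_AASM = {"W": "W", "N1": "N1", "N2": "N2", "N3": "N3", "N4": "N3", "REM": "REM"}


def robust_scale(signal: Sequence[float], clip_iqr: float = 20.0) -> List[float]:
    """Scale one channel to median 0 and IQR 1, then clip outliers.

    After scaling the IQR equals 1, so clipping at +/- clip_iqr removes
    values deviating from the median by more than clip_iqr * IQR.
    """
    median = statistics.median(signal)
    q1, _, q3 = statistics.quantiles(signal, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("flat signal: IQR is zero")
    scaled = [(x - median) / iqr for x in signal]
    return [max(-clip_iqr, min(clip_iqr, x)) for x in scaled]


def to_aasm(labels: Sequence[str]) -> List[Optional[str]]:
    """Merge N4 into N3; any other label (e.g. MOVEMENT, UNKNOWN) becomes None, i.e. masked."""
    return [RK_TO_AASM.get(label) for label in labels]


if __name__ == "__main__":
    print(robust_scale([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]))
    print(to_aasm(["W", "N3", "N4", "MOVEMENT", "REM"]))
```

Median and IQR are used instead of mean and standard deviation because they are barely affected by large artifacts, which is why the outlier in the toy signal ends up clipped rather than distorting the scale of the remaining samples.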

Data sampling

U-Sleep is trained using mini-batch Adam gradient-based optimizer. Each element in the batch is a sequence/segment of L = 35 EEG and EOG 30-second signals/epochs from a single subject. Each sequence/element is sampled from the training data as follows. (1) dataset sampling: one dataset is selected randomly. The probability that a dataset D is selected is given by P(D) = αP1(D) + (1 − α)P2(D), where P1(D) is the probability that a dataset is sampled with a uniform distribution 1/ND, where ND is the number of available datasets, and P2(D) is the probability of sampling a dataset according to its size. The parameter α is set to 0.5 to equally weight P1(D) and P2(D); (2) subject sampling: a recording SD is uniformly sampled from D; (3) channel sampling: one EEG and one EOG are uniformly sampled from the available combinations of channels in SD (e.g., if 2 EEG and 2 EOG channels are available, four combinations are possible); (4) segment sampling: a segment of EEG signal and a segment of EOG signal, both of length L = 35, are selected as follows: first a class from W, N1, N2, N3, R is uniformly sampled, then a 30-second epoch scored with the sampled class is selected randomly from the whole night recording, the chosen epoch is shifted into a random position of the segment of length L and finally the sequence is extracted.
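The sampling steps (1), (3) and (4) above can be sketched as follows. This is an illustrative reading of the description, not the original code: we assume dataset "size" means the number of recordings, that classes absent from a recording are skipped, and that a window falling outside the recording is shifted back inside it. All function names and the example channel labels are hypothetical.

```python
import random
from itertools import product
from typing import Dict, List, Sequence, Tuple

STAGES = ("W", "N1", "N2", "N3", "R")


def dataset_probabilities(sizes: Dict[str, int], alpha: float = 0.5) -> Dict[str, float]:
    """P(D) = alpha * P1(D) + (1 - alpha) * P2(D), with P1 uniform and P2 size-proportional."""
    n_datasets = len(sizes)
    total = sum(sizes.values())
    return {name: alpha / n_datasets + (1 - alpha) * size / total
            for name, size in sizes.items()}


def channel_pairs(eeg: Sequence[str], eog: Sequence[str]) -> List[Tuple[str, str]]:
    """All (EEG, EOG) combinations available in a recording."""
    return list(product(eeg, eog))


def sample_segment(hypnogram: Sequence[str], rng: random.Random,
                   length: int = 35) -> Tuple[int, int]:
    """Return (start, anchor): the window [start, start + length) contains
    the anchor epoch, whose class was drawn uniformly."""
    if len(hypnogram) < length:
        raise ValueError("recording shorter than the segment length")
    present = [s for s in STAGES if s in hypnogram]  # assumption: skip absent classes
    stage = rng.choice(present)
    anchor = rng.choice([i for i, s in enumerate(hypnogram) if s == stage])
    offset = rng.randrange(length)  # random position of the anchor inside the segment
    start = min(max(anchor - offset, 0), len(hypnogram) - length)  # keep window in bounds
    return start, anchor


if __name__ == "__main__":
    rng = random.Random(0)
    probs = dataset_probabilities({"A": 100, "B": 300})
    dataset = rng.choices(list(probs), weights=list(probs.values()))[0]
    pair = rng.choice(channel_pairs(["C3-M2", "C4-M1"], ["E1-M2", "E2-M2"]))
    hypnogram = ["W"] * 20 + ["N1"] * 5 + ["N2"] * 40 + ["N3"] * 20 + ["R"] * 15
    print(dataset, pair, sample_segment(hypnogram, rng))
```

With α = 0.5, a dataset holding 25% of the recordings out of two datasets is drawn with probability 0.5 × 0.5 + 0.5 × 0.25 = 0.375, so small datasets are over-sampled relative to their size, and drawing the anchor class uniformly over-samples rare stages such as N1.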

Transfer learning

We adopt the following two definitions of transfer learning:

“Transfer learning and domain adaptation refer to the situation where what has been learned in one setting (e.g., distribution P1) is exploited to improve generalization in another setting (say, distribution P2)”49;

“Given a source domain DS and learning task TS, a target domain DT and learning task TT, transfer learning aims to help improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where DS ≠ DT and TS ≠ TT”50.

In our study the source and the target tasks are the same, i.e., TS ≡ TT: the task is always sleep staging with the same set of sleep stages. We want to transfer the knowledge learned from the previously seen sleep recordings (e.g., different hardware, different subject distributions with different sleep disorders) and the knowledge about the sleep scoring rules (i.e., inter-scorer variability across the different data centers). The process generally involves updating the knowledge acquired from a large database (the result of a long training process) with knowledge from a small database. One major concern is to avoid what data scientists call catastrophic forgetting: “Also known as catastrophic interference, it is the tendency of an artificial neural network to completely and abruptly forget previously learned information upon learning new information”, as defined in51. Although the concept is easy to understand, avoiding its occurrence is not trivial. To partially mitigate this phenomenon, we fine-tune the architecture on the target domain using a smaller learning rate.

In our experiments we first pre-train the architecture on the data-source domain S (e.g., a set of different domains/databases \(\{S_{\mathrm{DB}_1}, S_{\mathrm{DB}_2}, \ldots, S_{\mathrm{DB}_n}\}\)), then we fine-tune the model on the data-target domain T. Formally, we first minimize the loss function LS, resulting in the learned parameters θ:

$$\underset{\theta}{\arg\min} \sum_{(\mathbf{x},\mathbf{y})\in D_{\mathrm{S}}} L_{\mathrm{S}}\left(\mathbf{x}, P(\mathbf{y}\mid\mathbf{x}), P_{\theta}(\mathbf{y},\mathbf{x})\right)$$

The parameters θ of the pre-trained model are used as the starting point on the data-target domain T. To transfer the learning to the new domain T, we fine-tune all the pre-trained parameters, initialized as \(\theta^{\prime} = \theta\) (i.e., the entire network is further trained on the new data domain T):

$$\underset{\theta^{\prime}=\theta}{\arg\min} \sum_{(\mathbf{x},\mathbf{y})\in D_{\mathrm{T}}} L_{\mathrm{T}}\left(\mathbf{x}, P(\mathbf{y}\mid\mathbf{x}), P_{\theta^{\prime}}(\mathbf{y},\mathbf{x})\right)$$
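The two-step procedure (pre-training on the source domain, then fine-tuning all parameters on the target domain with a smaller learning rate) can be illustrated on a toy problem. The sketch below is only a stand-in: it replaces U-Sleep with a two-parameter logistic regression trained by plain gradient descent instead of Adam, and the synthetic data, learning rates and step counts are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, x, y):
    """Mean binary cross-entropy of a logistic model with parameters theta."""
    p = np.clip(sigmoid(x @ theta), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def train(theta_init, x, y, learning_rate, steps):
    """Gradient descent starting from theta_init; all parameters are updated."""
    theta = theta_init.copy()
    for _ in range(steps):
        grad = x.T @ (sigmoid(x @ theta) - y) / len(y)
        theta -= learning_rate * grad
    return theta

# Synthetic source (large) and target (small, shifted) domains: assumptions.
rng = np.random.default_rng(0)
x_source = rng.normal(size=(500, 2))
y_source = (x_source[:, 0] > 0).astype(float)
x_target = rng.normal(size=(100, 2))
y_target = (x_target[:, 0] + 0.5 * x_target[:, 1] > 0).astype(float)

# Pre-training on the source domain yields theta.
theta = train(np.zeros(2), x_source, y_source, learning_rate=0.5, steps=200)
# Fine-tuning starts from theta' = theta and uses a smaller learning rate.
theta_prime = train(theta, x_target, y_target, learning_rate=0.05, steps=50)
```

Comparing `loss(theta, x_target, y_target)` with `loss(theta_prime, x_target, y_target)` shows the effect of the fine-tuning step on the target domain; evaluating both on the source data gives a rough view of how much source knowledge is retained.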

Conditional learning

Essentially all sleep scoring architectures learn in a conditional way. The aim is to maximize the conditional probability distribution P(Y | X), where X are the input sequences of biosignals and Y are the corresponding ground-truth labels. For each input epoch xt, the models aim to maximize the conditional probability distribution P(yt | xt), where yt is the t-th one-hot encoded vector of the ground-truth label. Hence, the model is trained to minimize the prediction error conditioned only on the knowledge of X. We know that the sleep data X often come from different sources or data domains. Even in the same cohort, subjects with different demographics and sleep disorders may occur, resulting in significant shifts in the distributions of their sleep data X. Consider a data cohort containing G different groups of subjects \(\{g_1, g_2, \ldots, g_{\mathrm{G}}\}\), with \(g_1 = \{healthy\}\), g2 = {sleep_apnea} and so on. This additional information about the group to which the subject belongs (i.e., the sleep disorder group gi) can be given as input to the model. So, we can either train G fully separate models, each maximizing its own P(Y | X) function, or train a single model maximizing the conditional probability distribution P(Y | X, gi). The latter, i.e., training the joint model with the additional condition gi, is the preferable approach, since the tasks are similar enough to benefit from sharing the parameters and the extracted features.

We exploit the batch normalization (BN) layers to insert the additional knowledge into the training of our model. In the literature, different normalization variants have been proposed by modulating the parameters of the vanilla BN layer52,53,54,55,56. We adopt the sandwich batch normalization (SaBN) approach recently proposed in57.

The vanilla BN58 normalizes the input samples in a mini-batch by using the mean μ and the standard deviation σ, and then re-scales them with the γ and β parameters. So, given the input feature \(f\in \mathbb{R}^{B\times C\times H\times W}\), where B is the batch size, C is the number of channels and H and W are the height and width respectively, the vanilla BN computes:

$$h=\gamma (\frac{f-\mu (f)}{\sigma (f)})+\beta$$

where μ(f) and σ(f) are the running estimates of the mean and standard deviation (batch statistics, i.e., derived from the moving mean and moving variance) computed on f along the (B, H, W) dimensions; γ and β are the learnable re-scaling parameters of the BN affine layer, with shape C. Clearly, the vanilla BN has only a single re-scaling transform, implicitly assuming that all features come from a single data distribution. In55, to tackle the data heterogeneity issue (i.e., images from different data domains/distributions), the authors propose the Categorical Conditional BN (CCBN), which improves the quality of the generated images. The CCBN layer computes the following operation:

$$h=\gamma_{\mathrm{g}}\left(\frac{f-\mu (f)}{\sigma (f)}\right)+\beta_{\mathrm{g}}\qquad g=1,\ldots,G$$

where γg and βg are the learnable re-scaling parameters of the g-th affine layer, and g is the domain index associated with the input. The parameters of each affine layer are learned to capture the domain/distribution-specific information. In57, instead, the authors propose the SaBN layer, an improved variant of the CCBN. They claim that different individual affine layers might cause imbalanced learning across the different domains/distributions. They factorize the BN affine layer into one shared “sandwich” BN layer followed by a set of independent BN affine layers, computed as follows:

$$h=\gamma_{\mathrm{g}}\left(\gamma_{\mathrm{sa}}\left(\frac{f-\mu (f)}{\sigma (f)}\right)+\beta_{\mathrm{sa}}\right)+\beta_{\mathrm{g}}\qquad g=1,\ldots,G$$

where γsa and βsa are the learnable re-scaling parameters of the shared “sandwich” affine BN layer, while, as above, γg and βg are the learnable re-scaling parameters of the g-th affine layer, conditioned on the categorical input g. The SaBN enables the conditional fine-tuning of a pre-trained U-Sleep architecture, conditioned on the categorical input index g.
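The SaBN forward pass can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the reference implementation of57: the function name `sandwich_batch_norm` is hypothetical, the statistics are computed on the current mini-batch (training mode) instead of using running estimates, and a small constant `eps` is added for numerical stability.

```python
import numpy as np

def sandwich_batch_norm(f, g, gamma_sa, beta_sa, gamma_g, beta_g, eps=1e-5):
    """SaBN forward pass.

    f:        features, shape (B, C, H, W)
    g:        integer group index of each sample, shape (B,)
    gamma_sa: shared scale, shape (C,);  beta_sa: shared shift, shape (C,)
    gamma_g:  per-group scales, shape (G, C);  beta_g: per-group shifts, (G, C)
    """
    # Per-channel statistics over the (B, H, W) dimensions.
    mu = f.mean(axis=(0, 2, 3), keepdims=True)
    sigma = np.sqrt(f.var(axis=(0, 2, 3), keepdims=True) + eps)
    normalized = (f - mu) / sigma
    # Shared "sandwich" affine layer.
    shared = gamma_sa[None, :, None, None] * normalized + beta_sa[None, :, None, None]
    # Group-specific affine layer, selected per sample by its index g.
    return gamma_g[g][:, :, None, None] * shared + beta_g[g][:, :, None, None]
```

Setting `gamma_sa` to ones and `beta_sa` to zeros reduces the sketch to the CCBN operation, and additionally using a single group recovers the vanilla BN.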


In all our experiments we evaluate U-Sleep as stated in11. The model scores the full PSG, but the predicted class is not considered on segments whose label is not one of the five sleep stages (e.g., segments labeled as 'UNKNOWN' or as 'MOVEMENT'). The final prediction is obtained from all the possible combinations of the available EEG and EOG channels for each PSG. Specifically, we use majority voting, i.e., the ensemble of the predictions given by the multiple combinations of input channels.
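Majority voting over the channel combinations can be sketched as below. The sketch is illustrative only: the function name `majority_vote` and the channel names are hypothetical, each inner list is assumed to hold the per-epoch stages predicted from one EEG/EOG combination, and ties are assumed to be resolved in favor of the label that appears first among the combinations.

```python
from collections import Counter
from itertools import product

def majority_vote(predictions):
    """Combine per-combination predictions epoch by epoch.

    predictions: list of label sequences of equal length, one per
    EEG/EOG channel combination.
    """
    voted = []
    for epoch_labels in zip(*predictions):
        # most_common(1) returns the most frequent label; on ties, the
        # label encountered first is returned.
        voted.append(Counter(epoch_labels).most_common(1)[0][0])
    return voted

# Hypothetical channel names: 2 EEG x 2 EOG channels give 4 combinations.
combinations = list(product(["EEG_1", "EEG_2"], ["EOG_1", "EOG_2"]))
```

For example, the three predictions `["W", "N2", "R"]`, `["W", "N1", "R"]` and `["N1", "N2", "R"]` are combined into `["W", "N2", "R"]`.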

The unweighted F1-score metric59 is computed on all the testing sets to evaluate the performance of the model in all the experiments. We compute the F1-score for each of the five classes and then combine them by calculating the unweighted mean. Note that the unweighted F1-score yields lower absolute scores, because of the lower performance on less abundant classes such as sleep stage N1. For this reason, we also report in Supplementary Table 10, Supplementary Table 11, and Supplementary Table 12 the results achieved in terms of weighted F1-score, i.e., the metric is weighted by the number of true instances for each label, so as to account for the high imbalance between the sleep stages. In that case, the absolute scores increase significantly in all the experiments. In Supplementary Table 10, Supplementary Table 11, and Supplementary Table 12 we also report the Cohen’s kappa metric, given its valuable property of correcting for chance agreement between the automatic sleep scoring algorithm, i.e., the overall predicted sleep stages, and the ground truth, i.e., the sleep labels given by the physicians.
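The three metrics can be written out explicitly as in the sketch below, which uses only the standard definitions. The function names are hypothetical, and the F1-score of a class with no true and no predicted instances is assumed to be 0.

```python
from collections import Counter

STAGES = ["W", "N1", "N2", "N3", "R"]

def per_class_f1(y_true, y_pred, classes=STAGES):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0  # assumption: 0 if undefined
    return scores

def unweighted_f1(y_true, y_pred, classes=STAGES):
    """Plain mean of the per-class F1-scores."""
    scores = per_class_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)

def weighted_f1(y_true, y_pred, classes=STAGES):
    """Mean of the per-class F1-scores weighted by the number of true instances."""
    scores = per_class_f1(y_true, y_pred, classes)
    support = Counter(y_true)
    total = sum(support[c] for c in classes)
    return sum(scores[c] * support[c] for c in classes) / total

def cohen_kappa(y_true, y_pred):
    """kappa = (p_o - p_e) / (1 - p_e), with p_e the chance agreement."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

As a small worked case with three classes, take the true labels W, W, N2, N2, N2, N1 and the predictions W, W, N2, N2, N2, N2. The per-class F1-scores are 1 (W), 0 (N1) and 6/7 (N2), so the unweighted F1-score is 13/21 ≈ 0.62, the weighted F1-score is 16/21 ≈ 0.76, and Cohen's kappa is 0.7. The single missed N1 epoch lowers the unweighted score much more than the weighted one.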

* The Bern Sleep Data Base (BSDB) registry usage was ethically approved in the framework of the E12034 - SPAS (Sleep Physician Assistant System) Eurostar-Horizon 2020 program (Kantonale Ethikkommission Bern, 2020-01094).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.