Automated scoring of pre-REM sleep in mice with deep learning

Reliable automation of the labor-intensive manual task of scoring animal sleep can facilitate the analysis of long-term sleep studies. In recent years, deep-learning-based systems, which learn optimal features from the data, increased scoring accuracies for the classical sleep stages of Wake, REM, and Non-REM. Meanwhile, it has been recognized that the statistics of transitional stages such as pre-REM, found between Non-REM and REM, may hold additional insight into the physiology of sleep and are now under vivid investigation. We propose a classification system based on a simple neural network architecture that scores the classical stages as well as pre-REM sleep in mice. When restricted to the classical stages, the optimized network showed state-of-the-art classification performance with an out-of-sample F1 score of 0.95 in male C57BL/6J mice. When unrestricted, the network showed lower F1 scores on pre-REM (0.5) compared to the classical stages. The result is comparable to previous attempts to score transitional stages in other species such as transition sleep in rats or N1 sleep in humans. Nevertheless, we observed that the sequence of predictions including pre-REM typically transitioned from Non-REM to REM reflecting sleep dynamics observed by human scorers. Our findings provide further evidence for the difficulty of scoring transitional sleep stages, likely because such stages of sleep are under-represented in typical data sets or show large inter-scorer variability. We further provide our source code and an online platform to run predictions with our trained network.


Materials and methods
Data acquisition. The dataset consists of polysomnographic recordings of 18 mice with a total recording duration of 52 days (corresponding to 454,301 windows (segments) of 10 seconds each, see Table 1). Each mouse was recorded for 3 days, except for one mouse whose recording spanned 1 day because it unexpectedly passed away. The data were acquired during a study at the University of Tübingen which tested the influence of dietary variations on sleep.
Animal nurturing and treatment. All mice (age about 10 weeks, male, C57BL/6J) were kept at the animal facility FORS at the University of Tübingen, Germany. All applicable local and federal regulations of animal welfare in research (Directive 86/609, 1986 European Community) and ARRIVE guidelines were followed. Experiments were approved by the ethics committee of the Regional Council Tübingen, Germany, permit number MPV 3/15. Mice were housed under SPV conditions in IVC-cages with ceddar bedding and cellulose cloth and experiments were conducted at controlled temperature ( 20 ± 2 C • ), humidity, and 12h light/dark cycle at all times (light on at 06:00 or 7:00 summer/winter time). Out of the 18 mice, 9 were fed with a different diet (sucrose solution instead of water) to test an experimental paradigm to be published elsewhere. We did not consider the dietary groups as relevant for the present study and did not attempt a statistical comparison thereof. The mouse that passed away unexpectedly was part of the altered-diet group.
Polysomnographic recordings. The mice were chronically implanted with a synchronous recording device of one EMG and four EEG electrodes seven days before the first recording. Surgeries were conducted in the morning between 8am and 12pm. Four stainless steel screws were placed on top of the cortex using the following coordinates: frontal electrode (AP: +1.5 mm, L: +1.0 mm, relative to Bregma), two electrodes parietal left and right Table 1. Characteristics of the dataset. Data from 18 mice were split into three disjunct sets (training, validation, and test set). To allow for robust assessment of generalization performance of our networks outof-sample, data of each mouse was assigned to only one of these sets. Each recording was further split into segments of 10 seconds for which sleep stages (labels) were predicted. www.nature.com/scientificreports/ (AP: 2.0 mm, L: ±2.5 mm), and the reference electrode occipital (AP: -1.0 mm, L: 0 mm, relative to Lambda) according to the atlas of Franklin and Paxinos 39 . This setup was used to sample the electrical activity of the frontal, parietal-left, and parietal-right lobes. Two flexible stainless steel wires were inserted into the neck muscles to measure the EMG signal. During recordings, mice were plugged with a cable attached to a swiveling commutator, which was connected to the amplifier (Model 15A54, Grass Technologies, USA). The amplified (x1000) signals (EEG and EMG) were digitalized, filtered (EEG: 0.01-300 Hz; EMG: 30-300 Hz), and sampled at a rate of 992.06 Hz. The commutator units were built in-house. The software Spike2, version 6.07, was used for data acquisition and subsequent annotation.
Manual sleep-stage classification. Our expert (SW) manually scored sleep by visual inspection of EMG and EEG time series. The parietal right EEG was chosen for scoring decisions because it had the best signal-to-noise ratio across all mice. Recordings were divided into non-overlapping consecutive windows (segments) of 10 seconds. Three consecutive segments of the EMG and the parietal right EEG signal were visually presented to the expert who scored the middle segment. The expert was not required to score the recordings in sequential order but could jump to other parts of the recordings at all times. Epochs containing features of multiple sleep stages were assigned the stage that made up the majority of the epoch. Segments were scored as stage "Wake", "NREM", "pre-REM", or "REM" sleep (see Fig. 1 and Table 2). Artifact contaminated segments were scored as "artifact". The Wake state is characterized by elevated activity in the EMG signals when compared with NREM, pre-REM, and REM sleep. NREM sleep is characterized by slow waves between 0.5-4 Hz (delta band) while REM sleep is associated with rhythmic activity between 7-8 Hz (theta band). The pre-REM stage is characterized by successive short bouts ( < 10 s) of high amplitudes similar to spindles and the appearance of a theta rhythm 13,17 . Wake was the most frequent stage in our dataset ( 55.08 % , see Table 2), followed by NREM ( 37.7 % ) and REM sleep ( 5.08 % ). The intermediate stage pre-REM occurred only rarely ( 1.88 % ) and only 0.26 % of all segments were contaminated with artifacts.
Dataset preparation for machine learning. Train-, validation-, and test-set splits. The scored time series were prepared for training and evaluation of the neural network for automated sleep scoring as follows: To mimic the manual scoring process and make our automated scoring system more flexible with respect to different experimental setups (in which only one EEG derivation may be available), we chose to base the training of the neural network on the parietal right EEG only. The parietal right EEG time series were low-pass filtered using a 4-th order Butterworth low-pass filter (critical frequency: 25.6 Hz) in order to prevent aliasing effects due to subsequent downsampling. The low-pass filter was applied to each time series with a forward-and backward pass to avoid any phase shifts. Subsequently, the low-pass filtered time series were downsampled to 64 Hz by linear interpolation.
The recordings were assigned to a training, a validation, and a test set (see Tables 1 and 2). Subjects were distributed such that each set consisted of recordings from mice of both dietary groups. Validation and test set each consisted of recordings from two mice one of which was from the altered-diet group. The mouse that passed Table 2. Statistics of labeled segments of the whole dataset ("all data"), of the training set ("train. set"), the rebalanced training sets ("reb. train. set 1 and 2"), the validation set ("validation set"), and the test set ("test set"). NREM: Non REM sleep, pre-REM: Pre REM sleep, Wake: Wakefulness. perc: percentage, # segments: number of segments. www.nature.com/scientificreports/ away early was assigned to the test set so that later evaluation would allow us to test the trained network on the most novel subject in our dataset. The network parameters were trained on the training set, whereas the validation set allowed us to optimize hyperparameters of our network, of the training process and of regularization as well as data augmentation strategies. Finally, the test set was used to assess the performance of our network to predict sleep stages out-ofsample (i.e., on data from mice that were not available during training nor validation). We stress the importance of out-of-sample testing since it mimics the situation in sleep laboratories where a trained network predicts sleep stages on data from mice that are unknown to the network. To simulate this situation, we made sure that all data from each mouse was assigned to one and only one of the three sets (see Table 1).
We note that, when the network was trained to predict only the standard set of target stages (Wake, NREM, REM), the training, validation, and test set were created by reinterpreting pre-REM stages as NREM, while segments scored as artifact were removed.
Class rebalancing. The sleep stages identified by the expert occur with widely differing frequencies (e.g., Wake: 55.08 % vs pre-REM sleep: 1.88 % , see first row of Table 2). Such differing frequencies of classes (class imbalance) can pose non-trivial challenges when creating data-driven systems due to the tendency of systems to over-classify majority classes and misclassify infrequent classes 40,41 . To address this issue, we created rebalanced training sets that differed in their class frequencies (see rows 3 and 4 of Table 2) while maintaining the total number of segments in the set. The rebalanced training sets were created before each training epoch by random sampling with replacement from the respective classes of the original training set so that predefined class frequencies were reached. In the rebalanced training set #1, class frequencies were heuristically chosen such that infrequent classes were oversampled and frequent classes undersampled while class frequencies still reflected some of the class imbalance of the original training set. The rebalanced training set #2 was created to investigate the performance of our network to distinguish between the three main stages (Wake, NREM, REM) only.
Data augmentation. Creating additional synthetic training data is known as data augmentation and has repeatedly been demonstrated to reduce overfitting and to tackle class imbalance [42][43][44] . Moreover, if prior knowledge about the data cannot be easily incorporated into the design of a network architecture, the knowledge can often be utilized to create synthetic training data. This way prior knowledge can indirectly be induced into the network during training. A prime example is the creation of rotation invariance in image recognition systems by supplying rotated images (derived from the original data) during the training process 30 .
We expect variability in our data that does not reflect sleep dynamics but can be introduced by the data acquisition, the scoring process, or unrelated physiological processes. Our data augmentation strategies are aimed at inducing invariance in our network with respect to this variability.
(i) Amplitude variability. Different acquisition systems can use different signal amplification settings, thereby leading to a variability in the scaling of signal amplitudes. Let s denote the part of the time series that is presented to the input layer of the neural network during training. Let s t denote the sample of this time series at time step where u is a random variable drawn from the continuous uniform distribution in the interval [−1, 1) , and where a a ∈ [0, 1] is the strength of this type of augmentation. u is newly drawn for each time series entering the network.
(ii) Frequency variability. We expect that the frequency content of EEG signals varies to a certain extent independently of the actual sleep stage, due to different noise sources (e.g., caused by movements of mice or by the measurement device itself). We mimicked such a variability in frequency content by using window warping 45 as an augmentation technique in which a time series is stretched or contracted. Let T denote the length of the time series that enters the network. To create an augmented time series, in a first step, we determined a temporary length of a temporary time series, where a f ∈ [0, 1] denotes the strength of the stretching or contracting. The central sample point of the temporary time series is identical to the one of the original time series. In a second step, the temporary time series was resampled by linear interpolation such that its new length equals T. If T * > T , the temporary time series includes sample points prior to and after the original time series. In this case, the resampling operation contracts the time domain and thus frequencies are shifted towards higher values. If T * < T , the resampling operation stretches the time domain and frequencies are shifted towards lower values.
(iii) EEG montage variability (sign flip). Since each EEG signal represents a difference between the electrical potentials at two (or more) measurement sites, the sign of the amplitudes of an EEG signal depends on the chosen montage (i.e., the configuration of electrical potential differences between electrodes that is used to represent EEG signals). A change in montages can lead to a flipped sign of the amplitudes of the signal which, however, still reflects the same brain dynamics. The augmented time series � s * reads where a s ∈ [0, 1] is the probability of flipping the sign of the time series amplitudes.
(iv) Time shift variability. Sleep scoring associates a sleep stage to each of the consecutive EEG segments. Thus, the segment length introduces an artificial timescale on which sleep is characterized. Actual sleep stages do not necessarily change at segment boundaries but can change within a segment. To account for this variability, we created augmented data by shifting the time series segments with respect to the sleep scores by a random time shift t, where a t ∈ [0, 1] controls the extent of the shift and thus the strength of this type of augmentation. With � s = (s 1 , . . . , s T ) T denoting the original time series, the augmented time series reads Time series were augmented in the order (i), (iii), (ii), (iv) as described above using newly drawn random variables every time the time series entered the neural network for forward propagation. The strength of the augmentations were controlled by the parameters (a a , a f , a s , a t ).
Machine learning. Network architecture. Mimicking the manual scoring process, the input of the neural network consists of three consecutive segments of the parietal right EEG time series, while the outputs of the network are the probabilities of the middle segment to belong to Wake, NREM, pre-REM, REM sleep, or to contain artifacts (class artifact). We used the score with the highest probability as the predicted score by our network.
The network architecture, which draws upon experience gained in previous work by some of the authors 46 , consists of a feature extractor and a classifier (see Fig. 2). Before entering the feature extractor, the time series is batch normalized ("Batchnorm", see Fig. 2). The feature extractor consists of 8 convolutional layers, each of which is composed of 96 kernels of size d × 1 × 5 , where d denotes the depth of the output volume of the previous layer. Since one time series enters the network, d = 1 for the first convolutional layer, while d = 96 for all other convolutional layers. To downsample the time series information along the feature extractor, every other convolutional layer uses a stride of 2 with the other layers having a stride of 1. The resulting convolved signals of each layer are nonlinearly transformed by rectified linear units (ReLUs) and subsequently batch normalized (Batchnorm). Furthermore, we apply Dropout regularization 47 to every other convolutional layer. Finally, the output (also called features) of the feature extractor is concatenated into a vector ("Flatten", see Fig. 2), and Dropout regularization is applied before these features enter the classifier.
The classifier is composed of 2 fully connected (FC) layers with 113 · 96 = 10848 and 80 neurons, respectively. The first FC layer is equipped with rectified linear units (ReLUs) as nonlinearity and Dropout regularization, while the second FC layer uses a softmax activation function to determine the score probabilities p which represent the activations of the neurons of the output layer. The output layer consisted of 3 output neurons when the network was trained to distinguish between major sleep stages (Wake, REM, NREM). When the network was trained to also score pre-REM sleep and segments containing artifacts, the output layer consisted of 5 neurons. www.nature.com/scientificreports/ Training. Loss function. The training objective is implemented by the loss function L that becomes small when the classifier assigns a large probability to the correct class. Let x i denote a time series segment i and let k be the index of the middle segment of the input time series for which the network predicts the score probability vector p k . With c k ∈ {1, . . . , 5} we encoded the correct score for segment k as determined by our human expert. Thus, p k,c k is the predicted probability of the middle time series segment to belong to the sleep stage that was determined by our human expert. The loss is the negative log-likelihood function with an L2 regularization penalty term L 2 , where N b is the size of the minibatch, ≥ 0 is the L2 regularization strength, and where the sum of the L2 penalty is over all the weights w in the network (except the bias weights). Gradient descent. Since the neural network is differentiable, the loss function can be minimized by minibatch gradient descent. The network was trained by Adam 48 , a variant of stochastic gradient descent which uses exponential moving averaging of the gradient (equation (8)) and the squared gradient (equation (9)) to allow for adaptive learning rates. Let w t denote a weight (i.e., a free parameter) of the network and let g t = ∂L ∂w denote the gradient of the loss function with respect to the weight obtained for the minibatch at step t of the training. The weight is updated by where η denotes the learning rate, and m t and v t are defined as and respectively. We used standard values for the exponential decay rates β 1 = 0.9 and β 2 = 0.999 and set ǫ = 10 −8 to prevent division by zero in equation (7). Furthermore, to counteract the exploding gradient problem, we employed a common gradient clipping strategy 49 that ensured by rescaling that the norm of each gradient vector does not exceed a threshold θ = 0.1. www.nature.com/scientificreports/ Learning rate protocol. To speed up the learning process, we used large minibatches ( N b = 256 ) and adjusted the learning rate η according to the following protocol: The learning rate was linearly increased 50 from 10 −7 N b to 10 −6 N b over the course of the first 12 training epochs (warm-up period). This type of linear scaling of the learning rate with minibatch size in combination with a warm-up period was observed to facilitate the use of large minibatch sizes, thereby shortening training times 50 . A single training epoch was completed when all minibatches of the training set were used once for the gradient updates. The warm-up period was followed by a cool-down period during which the learning rate was exponentially decreased, η i = 10 −6 N b · e −α(i−12) , where i > 12 denotes the training epoch, and where the exponential rate was set to α = 0.06 . The cool-down period lasted for the remaining training process.
Regularization. We used four mechanisms to prevent our network from overfitting the training data. (i) L2 regularization, also known as weight decay (see second term in equation (6)), favors neural networks whose weights (parameters) take on low values, thereby effectively restricting the model space of networks where the loss function obtains low values 51 . We set the L2 regularization strength to = 10 −4 . (ii) We used Dropout regularization with a dropout probability of p dropout = 0.2 in the feature extractor as well as in the classifier of the network (see Fig. 2). Dropout regularization has been reported to prevent complex co-adaptation of neurons and thus drive them to create features on their own 47 . (iii) Limiting the search in parameter space during training (called early stopping) has a regularizing effect since it prevents the network from overfitting the training data 51 : After each training epoch, the network is evaluated on the validation set by calculating the F1 score (see Sect. "Evaluation measures"). If the F1 score does not improve over 5 consecutive training epochs, the training is stopped. The training is carried out for a minimum of 12 and a maximum of 50 epochs. At the end of the training, the network parameters are returned at that point of the training for which it achieved the best F1 score on the validation set. (iv) Using prior knowledge to create additional training instances is called data augmentation. Data augmentation is known to decrease the generalization error of the trained network and thus can be interpreted as a regularizing mechanism. We detail this approach in the section on data augmentation.
We note that we chose regularization parameters for all mechanisms after extensive hyperparameter exploration on the validation set.
Evaluation measures. To investigate and assess the accuracy of our classifiers to score sleep, we used the following evaluation measures.
Confusion matrix. We use confusion matrices to summarize the predictive performance of our networks. The matrix elements describe the number of one of the actual labels (vertical dimension) that have been assigned one of the target labels (horizontal dimension) by the network. Both the actual label frequency (vertical axis) and the predicted label frequency (horizontal axis) can be used to normalize the matrix elements (see Fig. 4 for an example) 52 . When normalized by the actual label frequency, matrix elements show the percentage of segments of a true class being labeled by the network as a predicted class. Each diagonal element corresponds to the network's recall of the respective class. Likewise, when normalized by the predicted label frequency, matrix elements show the percentage of all segments of a class predicted by a network to actually belong to the true class. Each diagonal element corresponds to the network's precision of the respective class.
F1 score. In a binary classification setting (e.g., Wake vs Non Wake; pre-REM vs Non pre-REM) the F1 score, also called F-measure, is defined as the harmonic mean of precision ( p precision ) and recall ( p recall ) 53 . Estimates of these values ( p precision , p recall ) are extracted from the normalized confusion matrices: The score takes on values between 0 and 1, with high values indicating better classification performances. Although this metric is not invariant to changes in the data distribution, it is commonly used to compare different algorithms against a consistent dataset 53 . Note that rebalancing of the training set affects the training score in unpredictable ways; however, we compare our networks based on the F1 scores calculated on the validation set which is not rebalanced. We expanded the definition given in equation (10) for our multiple class classification problem by calculating F1 , the average of the F1 scores of each sleep stage.
Markov transition matrix. Markov transition matrices summarize the empirical transition probabilities of a sequence (e.g., a sequence of sleep stages scored by a human expert or a model) interpreted as a Markov process. They are two-dimensional matrices whose elements describe the number of transitions from one label to another in either absolute numbers or, after normalization, percentages (see Fig. 5 for an example). We created Markov transition matrices for the actual labels and the labels predicted by our networks to measure how well the networks perceived the structure of the sleep architecture.

Results
We investigated how accurate our network predicts sleep stages under various training configurations. We extensively modified the network capacity, the training schedule, class rebalancing weights, and data augmentation. Our experiments were conducted on two sets of target stages; a standard set containing Wake, NREM, and REM stages, and an extended set also including the stages pre-REM and artifact. Our approach was to optimize hyperparameters on the extended set of sleep stages. Afterwards, we translated these parameters to the standard set to investigate if the predictions reproduced the known state-of-the-art performance.
In preliminary experiments, we also considered methods of cross-validation in order to reduce the variance of predictions on the validation set. We discovered, however, that the reduction in variance did not yield additional information that would influence our parameter search. We therefore abandoned this compute-intensive Influence of class rebalancing. To analyze class rebalancing, we compared the epoch-wise improvements in F1 score on the unbalanced and rebalanced dataset (see Fig. 3 panels (a,d) and (b,e) respectively). When trained on the unbalanced training set, F1 scores increased rapidly for the majority classes during the first 10 training epochs, while the F1 scores of the pre-REM and artifact classes only slowly increased. A similar behavior was observed for the F1 scores on the validation set. Meanwhile, the F1 scores of the pre-REM and artifact classes obtained when training on the rebalanced training set increased more quickly and reached higher optimal values. The optimal F1 scores on the validation set were comparable to the F1 scores obtained on the validation set by the network trained on the unbalanced training set.
While the optimal F1 scores on the validation set were comparable between trainings on the rebalanced and unbalanced training sets, we observed that training on the unbalanced dataset led to unstable results: minorityclass F1 scores were sometimes not recognized in lapses of several epochs. Such an event is visible for the F1 score of the artifact class in Fig. 3(a, d). Because of this instability, we continued to optimize the training configuration using only rebalanced training sets.
Classification on the extended set of sleep stages. We studied the classification performance under different training configurations when targeting pre-REM and artifact in addition to the standard classes Wake, Non-REM, and REM (see Table 3). Without regularization and augmentation, we observed a mean F1 score of F1 = 0.99 on the training set, and F1 = 0.76 on the validation set. We optimized this F1 score by changing the dropout probability p dropout ∈ [0.1, 0.2  Results of the trainings with data augmentation are reported in Table 3. We observed a slight decrease of mean F1 scores on the training set with augmentation (e.g., F1 = 0.92 for all augmentations combined) in comparison to training without augmentation ( F1 = 0.98 ). We interpret this decrease to reflect the regularizing effect of data augmentation that can generally be observed in a network with fixed model capacity when fitting data of increasing variability. Interestingly, this decrease was most pronounced for the pre-REM class (see Fig. 3(c,f)), in contrast to the artifact class that could be fitted perfectly by the network on the training set. Mean F1 scores on the validation set show that none of the data augmentation algorithms could improve on the F1 score achieved by the network trained only with the best regularization parameters. Consequently, we evaluated this network on the test set where we found a mean F1 score of F1 = 0.78.
Comparing the regularized with the un-regularized results indicates that regularization induced increases of the F1 scores in the minority classes. None of the augmentation techniques were able to improve the F1 scores of the individual classes, except time-shift augmentation, which increased the validation F1 score of pre-REM from 0.50 to 0.52.
The test set consisted of one mouse each from both dietary groups. We explored their individual F1 scores and did not find any deviations exceeding 2 %. In particular, the pre-REM F1 score for the typical mouse was 0.48; for the altered-diet mouse we found 0.48.
Reduction of the network to the standard stages. Next, we studied the classification performance of our optimized network architecture in distinguishing between the three standard sleep stages Wake, REM, and NREM (see Table 4). With neither regularization nor augmentation, we again observed our network to accurately approximate the training data; on the validation data, we found a mean F1 score of F1 = 0.94 . Using the best regularization from the optimization on five stages -p dropout = 0.2 , and = 10 −4 -increased the mean F1 score on the validation set to F1 = 0.96 . We accepted this network and found on the test set, reduced to 3 stages, a mean F1 score of F1 = 0.95 . A detailed analysis revealed that improvements were made in the REM class, the most under-represented class in the reduced datasets.
These classification performances are comparable with state-of-the-art results obtained in other studies 32,33,36 and demonstrate that the proposed network architecture is suitable for sleep scoring purposes.

Detailed analysis of the best network. Confusion matrices (see Sect. Evaluation measures) obtained
for the network trained without data augmentation (see Fig. 4) show that more than 98% of the Wake segments, Table 3. F1 scores for the training, validation, and test set at various training configurations. First, the network was trained without L2 regularization and without dropout (row 1). Next, the network was trained with both regularizations (row 2). Last, the network was trained using different types of data augmentation (rows 3-6; the parameters in column 1 resulted in the best validation F1 scores among the explored parameters), or with all data augmentations combined (row 7). F1 scores averaged over all classes (column 3) characterize how well the networks can predict sleep stages on all five classes. F1 scores in columns 4 and 5 present the capabilities of the networks to predict the two minority classes pre-REM and artifact, respectively. Only the network with the best validation F1 scores (best network) was evaluated on the test set. www.nature.com/scientificreports/ more than 94% of the REM segments, and close to 92% of the NREM segments in the test set were classified correctly (panel b). The network was also able to score 71% of the artifact segments correctly. The pre-REM stage was scored in 58% of the segments correctly but was confused in about 20% of cases with REM or in 20% of cases with NREM sleep. The confusion matrices also show that more than 96% of the segments in the test set predicted as Wake, more than 87% of the segments predicted as REM, and 97% of the segments predicted as NREM had been assigned the predicted stage by the human expert (panel a). Furthermore, more than 50% of the segments that were predicted as artifact by our network had been assigned artifact as the true stage. Predictions of pre-REM had pre-REM as the true stage in nearly 41% of cases, while in about 50% of cases, predictions had NREM and about 9% had REM as true stage.
To investigate whether the network was able to capture basic properties of sleep dynamics, we determined Markov transition matrices (see Sect. Evaluation measures) on the test set based on scores provided by our expert (Fig. 5a) and on scores as predicted by our network (Fig. 5b). The probabilities of transitions between the different stages obtained from the network indeed reflect typical sleep cycles, where Wake is followed by NREM sleep (5.4%), which is followed by either Wake (4.5%), pre-REM (4.1%), or more NREM sleep (91.1%). We highlight that there was only a small transition probability from NREM directly to the REM stage as determined by our network (0.3%), which was close to the transition probability obtained by our expert (0.1%). NREM transitioned via pre-REM to REM stages before the cycle closed with the transition to either Wake (17.7%) or NREM (4.2%). While transition probabilities of our network closely follow those as obtained from the expert, we find notable differences for transitions into and out of the pre-REM stage. pre-REM stages as determined by our network tended to last longer (pre-REM to pre-REM transition probability of 31.6%) than those determined by the scoring expert (17.9%). Table 4. F1 scores for the training, validation, and test set for predicting the three main stages Wake, REM, and NREM sleep at various configurations of the hyperparameters. The network was trained on the rebalanced training set #2 (see Table 2). Only the best configuration (best reduced network) was evaluated on the test set.  Figure 4. Confusion matrices of the best network evaluated on the test set (see Table 3): (a) Confusion matrix averaged by the predicted label frequency; diagonal elements correspond to precision. (b) Confusion matrix averaged by true label frequency; diagonal elements correspond to recall. A prediction of pre-REM is about 50% skewed towards NREM being the true stage, whereas a true pre-REM stage has equal probability to be mistaken for REM and NREM by the network.

Discussion
We proposed a neural network architecture that is able to distinguish between the three main stages Wake, REM and NREM as well as the infrequent stages pre-REM and artifact. When trained on these five stages, we observed our network to classify the majority sleep stages (Wake, REM, NREM) with high precision and recall (see Fig. 4), while the infrequent class pre-REM obtained lower precision and recall and a lower F1 score ( F1 = 0.48 , see Table 3) on out-of-sample data. Transition probabilities between sleep stages predicted by the network were largely consistent with those determined by the human expert. However, pre-REM phases predicted by the network tended to last longer than those by the expert. When the network was trained to distinguish between the three main stages only, classification performance as quantified by the average F1 score on out-of-sample data ( F1 = 0.95 , see Table 4) was comparable to state-of-the-art classifiers 32,33 . L2 and Dropout regularization as well as data augmentation increased F1 scores of the minority classes (pre-REM, artifact), while class rebalancing did not. We observed, however, the latter to stabilize the training process. The F1 score of pre-REM indicates that pre-REM as a transitional sleep stage is particularly challenging to score. This challenge is not confined to scoring sleep in mice but can also be encountered in other species such as rats or humans where transitional sleep stages can also be observed. In rats, systems for sleep scoring obtained consistently lower scoring accuracies for the transition sleep stage compared to the main sleep stages 20 with F1 scores ranging from 0.3 to 0.7 19,21 . In humans, automated scoring systems achieved F1 scores between 0.1 to 0.6 for N1 sleep 54,55 , a transitional stage between wakefulness and sleep. In mice, we cannot comparatively assess the scoring performance of our network for pre-REM since, to the best of our knowledge, no prior approaches exist at the time of this writing. We note, however, that the obtained F1 score for pre-REM in mice ( F1 = 0.48 ) is in the middle of the aforementioned ranges for F1 scores of transitional stages.
We hypothesize that pre-REM sleep is difficult to delineate from NREM and REM sleep for experts and automated systems alike. Indeed, Fig. 1 (bottom row) shows a transition from NREM over pre-REM to REM sleep, where the predicted class probability of pre-REM already increases in the third segment that was labeled as NREM sleep. We observed many transitions like this where the class probabilities of NREM, pre-REM, and REM slowly decreased or increased along many segments. If human scorers had more difficulties in identifying pre-REM compared to other sleep stages, we would expect an increased intra-rater variance (i.e., a lower probability of labeling the same segments as pre-REM when the dataset would be labeled several times by the same person). Moreover, we would also expect larger inter-rater variance for pre-REM than for other stages (i.e., a lower agreement between different human scorers on scoring pre-REM) as has been observed for the transition sleep stage in rats 20 . Such uncertainties in sleep scores (also called label noise) can limit the classification performance achievable by automated scoring systems trained on that data. Besides label noise, another hypothesis that may explain lower classification accuracies for pre-REM is class imbalance 40 . However, since we did not observe increased F1 scores when rebalancing classes (see Figs. 3d and e), we reject this hypothesis for the learning problem and dataset studied here.
One of the limitations of this study is the lack of separate annotations by several experts from which a group consensus for each sleep stage could have been derived. Such annotations would have also allowed to assess the extent of inter-rater variance for each class, thereby allowing to judge the network's performance in classifying pre-REM sleep compared to human scorers. We note, however, that the scoring performance of the network when restricted to Wake, REM, and NREM is comparable to values reported for inter-rater agreement in the literature 33 . Another limitation is that we could not investigate whether achievable F1 scores change due to age, Figure 5. Transition probabilities from one stage (row) to another stage (column) obtained from the validation set with scores provided by our scoring expert (SW) (a) and obtained with scores predicted by the best network (b). The network parameters were trained on the rebalanced training set without augmentation. www.nature.com/scientificreports/ behavior, disorders, dietary constraints, or genetic variations of mice since the dataset was too small. We expect some variability in F1 scores as indicated by previous studies (e.g., for genetic variations 32,56 ) and consider future studies into this direction as desirable. We consider several strategies as promising to improve upon existing data-driven systems for rodent sleep scoring like the neural network presented in this paper. (i) Statistical approaches such as confident learning [57][58][59] aim at identifying mislabeled training instances and have been demonstrated to successfully increase classification performance in tasks such as image recognition 57 by removing or correcting wrong labels, thereby reducing label noise. This approach might prove particularly fruitful for infrequent stages such as pre-REM sleep as we expect such stages to be more affected by label noise. (ii) Constraining the sequence of predicted sleep stages to transition probabilities as observed in the training data 32 may improve network predictions for infrequent stages such as pre-REM. We note, however, that such an approach can introduce bias in the frequencies of sleep stages when data was obtained under different experimental conditions. The reliable detection of such changes in sleep stage frequencies is, however, the matter of inquiry in many sleep studies. (iii) When label noise can be reduced or when datasets scored by a group of experts (and thus with known inter-rater variabilities) become available, neural network architectures that could model long temporal successions 60,61 of sleep stages might be particularly promising candidates to achieve new state-of-the-art classification performances for sleep scoring tasks. (iv) Finally, due to the lack of consensus in rodent sleep scoring 8 , we consider the creation of a public dataset annotated by a committee of experts to be particularly helpful to advance the creation of systems to automate scoring of animal sleep. Such a dataset should include data from different labs (to assess cross-lab variability of system predictions). Labels by a group of experts would establish a gold standard and would allow for an assessment of inter-rater variability, even for pre-REM sleep. Such approaches have been successfully pursued in other areas of sleep research, for instance for the challenge of sleep spindle detection 62 .
We are confident that deep-learning based systems, such as introduced here, will facilitate sleep studies with large amounts of data, including long-term studies and studies with thousands of subjects. The automation of the manual sleep scoring process, a cognitive repetitive task, will likely support the reproducibility of studies and allow researchers and lab personnel to focus on productive tasks.

Data availability
The datasets analyzed during the current study are available from the corresponding author on reasonable request.