Non-contact physiological monitoring of preterm infants in the Neonatal Intensive Care Unit

The implementation of video-based non-contact technologies to monitor the vital signs of preterm infants in the hospital presents several challenges, such as the detection of the presence or the absence of a patient in the video frame, robustness to changes in lighting conditions, automated identification of suitable time periods and regions of interest from which vital signs can be estimated. We carried out a clinical study to evaluate the accuracy and the proportion of time that heart rate and respiratory rate can be estimated from preterm infants using only a video camera in a clinical environment, without interfering with regular patient care. A total of 426.6 h of video and reference vital signs were recorded for 90 sessions from 30 preterm infants in the Neonatal Intensive Care Unit (NICU) of the John Radcliffe Hospital in Oxford. Each preterm infant was recorded under regular ambient light during daytime for up to four consecutive days. We developed multi-task deep learning algorithms to automatically segment skin areas and to estimate vital signs only when the infant was present in the field of view of the video camera and no clinical interventions were undertaken. We propose signal quality assessment algorithms for both heart rate and respiratory rate to discriminate between clinically acceptable and noisy signals. The mean absolute error between the reference and camera-derived heart rates was 2.3 beats/min for over 76% of the time for which the reference and camera data were valid. The mean absolute error between the reference and camera-derived respiratory rate was 3.5 breaths/min for over 82% of the time. Accurate estimates of heart rate and respiratory rate could be derived for at least 90% of the time, if gaps of up to 30 seconds with no estimates were allowed.


Supplementary method 1: Patient detection and skin segmentation performance
The proposed CNN networks for patient detection and skin segmentation were developed and evaluated with a two-fold cross validation procedure using images extracted only from the 15 preterm infants dataset labelled as "training" in table 2 of the main paper. The proposed CNN models were compared with the performance of three colour-based skin classifiers 1, 2 based on Naive Bayes 3 , Random Forests 4 and Gaussian Mixture Models (GMMs) 5 . The skin filters classify each pixel as a skin pixel based solely on skin colours and provide a skin probability map, which can be thresholded to a binary label. The skin models were trained on images that were converted to the Hue-Saturation-Lighting (HSL) colour space 6 with white balance correction applied 7,8 . Patient detection was performed using the ratio of skin to non-skin pixels and the average probability of predicted skin pixels to make a decision, as in the method described in 3 . Supplementary table 1 shows the results for the patient detection network compared with the baseline skin filters. The dataset without data augmentation consisted of a total of 3,436 images divided in 1,718 positive images (with an infant in the video frame) and 1,718 negative images (without an infant in the video frame). As explained in the "Methods" in the main paper, multiple variations of each image were generated using three data augmentation techniques: rotational, mirroring and lighting augmentation. The total number of the dataset with data augmentation was 44,668 divided in 22,334 positive and 22,334 negative images. The datasets were split equally between the training and test sets. The Naive Bayes classifier achieved the highest accuracy in the patient detection task among the baseline skin filters. It achieved 1.0% and 1.6% higher accuracy than Random Forests and GMMs respectively. In term of the area under the receiving operating curve (AUC), GMMs obtained the highest score (98.8%) followed by Naive Bayes (98.1%) and Random Forests (97.2%) respectively. Skin segmentation was performed considering only the positive images with an infant present in the video frame. Supplementary table 2 shows the results for the proposed skin segmentation networks. The dataset without data augmentation consisted of 1,718 images, the dataset with data augmentation consisted of 22,334 images. Random Forests achieved the best performance for skin segmentation, with 4.7% and 14.5% improvements in intersection-over-union (IOU) with respect to GMMs and Naive Bayes respectively.

Supplementary
The multi-task CNN model trained with data augmentation outperformed the other models for the majority of the metrics in both patient detection and segmentation tasks. For patient detection, the model achieved an accuracy of 98.8% and an AUC score of 98.2%. For skin segmentation, the network yielded an IOU score of 88.6% and a pixel accuracy of 98.1%.

2/9
Supplementary method 2: Intervention detection performance Time periods of clinical intervention were detected by combining information processed in the patient detection and skin segmentation network with temporal information computed from the optical flow between images over a sliding time window of length T and step size τ. For each T -second sliding window, L optical flows were computed from L + 1 video frames, each one extracted every second. The horizontal and vertical components of each optical flow vector were stacked together across input channels, as suggested by Simonyan and Zisserman 9 . The proposed clinical intervention network was developed and evaluated with a two-fold cross validation procedure using only the 15 preterm infants dataset labelled as "training" in table 2 of the main paper, as stated by the clinical study protocol.

Supplementary table 3.
Baseline performance of the two-stream architecture proposed by 9 evaluated using a window length T = 10 seconds and a step size τ = 1 second.

Model
AUC Accuracy Precision Recall Specificity Supplementary table 3 shows the performance of the baseline implementation compared with reference methods developed based on the two-stream convolutional architecture for action recognition proposed by Simonyan and Zisserman 9 and implemented using the VGG-M-2048 model 10 . As suggested in 9 , the number of optical flow stacks L was set to 10. To be comparable with the original implementation, we used a window length T = 10 seconds and a step size τ = 1 second. There were a total of 129,608 windows, split equally on windows during which a clinical intervention occurred and windows during which there were no clinical interventions. The results were consistent with those reported in 9 . The temporal network yielded higher accuracy (87.8%) than the spatial network (76.4%). The fusion of both networks, using either averaging or a linear Support Vector Machine (SVM), led to further improvement in the accuracy since they provided complementary information to support the classification task. The SVM fusion network resulted in the highest accuracy of 92.4%, a 4.7% improvement over the temporal network. Supplementary table 4 summarises the performance effects of using different sliding window configurations on the optical flow network. The number of windows were different according to each configuration. If more than half a time window was labelled as intervention by the annotators, the whole time window was marked as intervention. Therefore, there was more training data available for longer windows even though the step size is the same (for example T = 1 sec, τ = 1 sec compared with T = 10 sec, τ = 1 sec). The dataset was split equally on windows during which a clinical intervention occurred and windows during which there were no clinical interventions. The configuration of a 5-second sliding window with 1-second step size led to 92.2% accuracy and outperformed the other configurations. Increasing the size of the window length from 5 to 10 decreased performance.

Supplementary
Supplementary table 5 reports the performance of the local temporal network and the context network that were trained individually and used for constructing the multi-resolution and temporal context fusion networks. These networks were evaluated using a dataset containing a total of 121,426 time windows of length 5 seconds and a step size of 1 second, which was the best performing configuration for the optical flow network, as reported in supplementary table 4.

3/9
Supplementary table 5. Performance of the local optical flow network and context network evaluated using a window length T = 5 seconds and a step size τ = 1 second. Supplementary table 6 reports the classification performance for different fusion strategies evaluated using a dataset containing a total of 121,426 time windows of length 5 seconds and a step size of 1 second. The temporal context fusion method yielded the highest performance with an accuracy of 94.5%, a 2.3% improvement with respect to the optical flow network alone. Multi-resolution temporal fusion, with either fit or centre cropping, gave marginal performance improvements. In contrast, the spatio-temporal fusion method was unable to make effective use of spatial information extracted through the patient detection and skin segmentation network.

Supplementary table 6.
Performance of the different fusion approaches evaluated using a window length T = 5 seconds and a step size τ = 1 second.

Supplementary method 3: Typical clinical interventions and nursing activities in the NICU
Preterm infants experience routine clinical interventions several times a day in the Neonatal Intensive Care Unit (NICU). For example: checking the normal functionality of medical equipment, changing a nappy, taking temperature readings, administering medications or withdrawing blood from the heel for a blood gas test. During these events, clinical staff or parents actively interact with the infant, causing motion artefacts that pose challenges to the estimation of vital signs from video camera data. Supplementary table 7 summarises the activities carried out by the nurses in the care of the pre-term infants in the NICU. Parents also visit their newborn baby regularly and often take the infant from the incubator for kangaroo care (skin-to-skin contact with the parent). Every 4−12 hours -Give intravenous (IV) line medication.

5/9
Supplementary method 4: Signal Quality Index for heart rate estimation The assessment of the quality of the PPGi signal is of high importance as data corruption by subject movements and changes in the lighting conditions presented considerable challenges for video analysis. The assessment of the quality of the PPGi signal was extended from that described in 11,12 and further described in 13 . As an initial step, an activity index was computed based on changes in the segmented skin area over consecutive frames, corresponding to the movement of the subject. A Bayesian change point detection algorithm was then applied to identify step changes, often caused by sudden lighting condition changes, in the PPGi signal. The pulses occurring during the periods of high subject motion and step changes were flagged as invalid. In order to further identify whether each detected beat was of good quality, the algorithm performed a beat-by-beat quality assessment by combining multiple analysis methods: frequency bounding, clipping detection, amplitude thresholding, and multi-scale dynamic time warping. Finally, the signal quality index (SQI) of each detected beat was obtained as a combination of all these individual metrics. Suitable time periods for estimating heart rate from the video camera were defined when the movement of the infant was minimal. Changes in the segmented skin area over time can be used as an indicator of the degree of subject motion. The centroid (C x , C y ) of skin regions was defined as the average location of the predicted skin pixels in the horizontal and vertical directions. Motion M(i) at frame i was defined as the Euclidean distance between centroids for two successive frames: The SQI act of the kth beat was taken to be 0 if the Euclidean distance between the kth beat and two beats both before and after (5 beats in total) was higher than a threshold of 20 pixels, defined as: where b k−2 and b k+2 denote the location of the entire (k − 2)th and (k + 2)th beat respectively. The distance threshold was set to 20 pixels, corresponding to a distance of approximately 1 cm measured by the ruler in the colour chart placed near the subject (see figure 1c in the main paper). The camera was positioned approximately 30 cm away from the subject for all recording sessions.
Abrupt changes in the PPGi signal occurred due to changes in subject posture or sudden changes in the lighting conditions, for example: when the overhead light over the incubator was turned on or off; window blinds were opened or closed; or clinical staff walked pass by the incubator. In order to detect the location of these step changes, a Bayesian change point detection algorithms was applied to the PPGi signal 14 . Change point detection was performed on a window-by-window basis for different window sizes of 5, 10 and 15 seconds and a step size of 5 seconds. All the change points detected were then merged together. Given P all (m) is the probability of a change point at m merged from the detections at multiple window sizes, the SQI cp of the kth beat was defined as: where b k−2 and b k+2 denote the location of the (k − 2)th and (k + 2)th beats respectively. If a change point was detected at the location of the kth beat, the SQI cp values of this beat and the two beats before and after were set to zero.
Frequency bounding determined whether the instantaneous HR fell within the physiological range of typical preterm infants, taken to be within a range of 90 and 270 beats/min. Given that HR inst is the instantaneous heart rate of the kth beat, the SQI freq was taken to be 0 if the HR inst fell outside the valid physiological range: Clipping generally occurred as a result of motion artefacts. Signal clipping can be detected when the derivative of the signal crosses a given threshold 15 . Given that N length (k) is the length of the kth beat and N clipped (k) is the proportion of the derivative 6/9 of the kth beat that crosses a clipping threshold of 0.1, the SQI clip of the kth beat was set to 0 when more than one-third of the derivative was clipped: Amplitude thresholding was performed to determine whether the amplitude of each beat remained within three standard deviations σ w from the mean µ w of the window w. The statistics were calculated locally for each 15-second moving window w. The SQI amp of the kth beat at location b k was set to 0 if part of the beat was outside the valid range: Another quality metric was defined by measuring the similarity of the cardiac beats in the PPGi signal. Dynamic time warping (DTW) is a time series technique used to determine a distance (or a degree of similarity) between two given time series based on the best possible alignment between the two 16 . The DTW technique is suitable for the time series whose characteristics may vary in time. For example, similarities in each cardiac cycle could be measured using the DTW technique, even if heart rate was increasing or decreasing during the course of an observation. Each cardiac beat pulse can be warped in the time domain to determine a degree of similarity, independent of temporal variations. The classical DTW algorithm is computationally intensive as it needs to evaluate every possible warping path in order to obtain an optimal alignment. Fitriani 17 and Salvador 18 extended the algorithm to perform multi-scale warping by refining the search space for the optimal alignment between the two time series from a coarse to a finer resolution. The multi-scale DTW technique was extended for assessing the quality of pulsatile signals by determining the distance of the optimal alignment between each beat and a running beat template computed over a time window.
To compute the signal quality based on the DTW method (SQI dtw ), the PPGi signal was first divided into 15-second moving windows with a step size of 5 seconds. Each window was assessed independently of each other. Multi-scale DTW is described in more detail in 11 . The DTW distance was computed between each individual beat (X k ) and the average beat within the window Y k . The SQI dtw was defined as: SQI dtw ranges from 0 to 1, where a high value relates to a good-quality beat. On the training set, DTW values for goodquality beats ranged between 5 and 20 with SQI dtw values greater than 0.80. On the contrary, poor-quality beats had much higher DTYW distances and corresponding lower SQI dtw values.
Once all the individual SQI metrics for each beat had been calculated, the combined beat SQI (SQI beat ) for the kth beat was derived by simply multiplying all the SQI metrics together:

7/9
Supplementary method 5: Signal Quality Index for respiratory rate estimation The signal quality assessment employed a series of different measures to assign a signal quality index to each breath in the respiratory signal. The algorithms presented in this appendix were extended from that described in 11,19 and further described in 13 . Four signal quality indices were computed based on the analysis of the patient activity (SQI act ), a valid physiological breathing range (SQI freq ), the agreement between peak detectors (SQI peak ) and multi-scale dynamic time warping (SQI dtw ). The calculation of the first SQI followed what was described in the previous section for heart rate estimation. Frequency bounding determined whether the instantaneous RR fell within the physiological range of typical preterm infants, taken to be within a range of 18 and 120 breaths/min. Given that RR inst is the instantaneous respiratory rate of the kth breath, the SQI freq was taken to be 0 if the RR inst fell outside the valid physiological range: Unlike the estimation pipeline used for heart rate estimation, the assessment of quality of the respiratory signals was based on the agreement between two peak and onset detection algorithms 20 . Peak agreement is a measure of how much the peaks identified by the two peak-and-onset detectors agreed with each other over a given time window. Both detectors usually agreed with each other when the respiratory signal was clean and disagreed in the presence of noise and artefacts. Agreement was considered valid when the peaks identified by the two detectors were not located away from each other by more than 5 samples (or 0.25 seconds -half of the duration of the highest frequency rate that could be estimated). The measure of peak agreement SQI peak for the kth breath was calculated as the ratio of the number of peaks in agreement over the total number of peaks detected in a 10-second window, centred around the kth breath: The multi-scale dynamic time warping technique, used for heart rate estimation, was adapted and used to determine the optimal alignment between each peak in the respiratory signal and the template calculated by averaging the nearby peaks over a time window. Several modifications were needed to make it suitable for respiratory signals. Unlike PPGi signals, the morphology of the respiratory signal varies greatly according to the subject's breathing patterns. Preterm infants are known to have spontaneous breathing patterns 21 . Some infants, for example, may have a short inspiration phase followed by a prolonged expiratory phase; others may have a period of hold expiration followed by multiple expiratory flow peaks. The amplitude of each breath in the respiratory signal mainly depends on the depth of breathing or the volume of air inspired into the lungs. Hence, the criteria that were originally used for constructing a peak template from PPGi signals were too strict and not appropriate for respiratory signals.
Multi-scale dynamic time warping was carried out using a 15-second moving window w with a step size of 5 seconds. In order to measure the signal quality of each breath in the window, an average breath template was constructed. Once the template was calculated, the DTW distance was computed for each breath in the time window w. Let DTW be the distance between each breath and the window template, the SQI dtw was defined as: When the window template could not be calculated, the SQI dtw of all breaths in the window w was set to 0 (poor quality). This modification was needed as variations in respiratory rate were much greater than variations in heart rate.
Once all the signal quality measures were calculated, a combined signal quality index (SQI breath ) was computed for the kth breath as: SQI breath (k) = SQI act (k) · SQI freq (k) · SQI peak (k) · SQI dtw (k).
SQI breath was taken to be 0 (poor quality) during the periods of high subject motion (SQI act = 0) and abnormal instantaneous respiratory rate (SQI freq = 0). During a quiet and stable period, SQI breath relied mainly on SQI dtw .