Recurrent Neural Networks for Multivariate Time Series with Missing Values

Multivariate time series data in practical applications, such as health care, geoscience, and biology, are characterized by a variety of missing values. In time series prediction and other related tasks, it has been noted that missing values and their missing patterns are often correlated with the target labels, a.k.a., informative missingness. There is very limited work on exploiting the missing patterns for effective imputation and improving prediction performance. In this paper, we develop novel deep learning models, namely GRU-D, as one of the early attempts. GRU-D is based on Gated Recurrent Unit (GRU), a state-of-the-art recurrent neural network. It takes two representations of missing patterns, i.e., masking and time interval, and effectively incorporates them into a deep model architecture so that it not only captures the long-term temporal dependencies in time series, but also utilizes the missing patterns to achieve better prediction results. Experiments of time series classification tasks on real-world clinical datasets (MIMIC-III, PhysioNet) and synthetic datasets demonstrate that our models achieve state-of-the-art performance and provide useful insights for better understanding and utilization of missing values in time series analysis.


Introduction
Multivariate time series data are ubiquitous in many practical applications ranging from health care, geoscience, and astronomy to biology and others. They often inevitably carry missing observations due to various reasons, such as medical events, cost saving, anomalies, inconvenience, and so on. It has been noted that these missing values are usually informative missingness [22], i.e., the missing values and patterns provide rich information about target labels in supervised learning tasks (e.g., time series classification). To illustrate the idea, we show some examples from two real-world health care datasets (MIMIC-III, PhysioNet) in Figure 1. We plot the Pearson correlation coefficient between variable missing rates and the labels (mortality and ICD-9 diagnoses). We observe that the missing rate is correlated with the labels, and the variables with low missing rates are usually highly correlated with the labels, demonstrating the usefulness of missingness patterns in solving a prediction task.
In the past decades, various approaches have been developed to address missing values [23]. A simple solution is to omit the missing data and perform analysis only on the observed data. A variety of solutions have been developed to fill in the missing values, such as smoothing or interpolation [13], spectral analysis [17], kernel methods [20], multiple imputation [28], and the EM algorithm [7]. [23] and the references therein provide a good review of missing data solutions. However, existing solutions often result in a two-step process in which imputation is disparate from the prediction model and missing patterns are not effectively explored, thus leading to suboptimal performance in prediction and analysis [27].
Meanwhile, Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) [9] and Gated Recurrent Unit (GRU) [4], have been shown to achieve state-of-the-art results in many applications with time series or sequential data, including machine translation [1, 25] and speech recognition [8]. RNNs enjoy several desirable properties, such as strong prediction performance and the ability to capture long-term temporal dependencies and variable-length observations. RNNs for missing data have been studied in earlier works [3, 26, 18] and applied to speech recognition and blood-glucose prediction. However, there has been no work that directly models missing patterns in an RNN for time series classification problems. Exploiting the power of RNNs along with the informativeness of missing patterns is a promising new venue to effectively model multivariate time series, and it is the main motivation behind our work.
In this paper, we develop novel deep learning models based on the Gated Recurrent Unit (GRU) to effectively exploit two types of informative missingness patterns, i.e., masking and time interval.
Masking informs the model which inputs are observed (or missing), while the time interval encapsulates the input observation patterns. Our model captures the observations and their dependencies by applying masking and time intervals (using a decay term) to the inputs and network states of the GRU, and can be jointly trained using backpropagation. Thus, our models not only capture the long-term temporal dependencies of time series observations but also utilize the missing patterns to improve the prediction results. Empirical experiments on real-world clinical datasets as well as synthetic datasets demonstrate that our proposed models outperform strong deep learning models built on GRU with imputation as well as other strong baselines. These experiments show that our proposed method is suitable for many time series classification problems with missing data, and in particular is readily applicable to predictive tasks in emerging health care applications. Moreover, our method also provides useful insights into more general research challenges of time series analysis with missing data beyond classification tasks, including: (1) effective solutions to characterize the missing patterns of missing-not-at-random time series, such as masking and time interval modeling; (2) a general deep learning framework to handle time series with missing data; (3) an analysis of the relationship between missingness and outcomes for the proposed models on a variety of datasets.
2 RNN models for time series with missing variables

We denote a multivariate time series with D variables of length T as X = (x_1, x_2, ..., x_T)^T ∈ R^{T×D}, where x_t ∈ R^D represents the t-th observation/measurement of all variables and x_t^d denotes the measurement of the d-th variable of x_t. Let s_t ∈ R denote the time stamp when the t-th observation is obtained, and we assume that the first observation is made at time 0 (s_1 = 0). A time series X could have missing values. We introduce a mask vector m_t ∈ {0, 1}^D to denote which variables are missing at time step t. The mask vector for x_t is given by

    m_t^d = 1 if x_t^d is observed, and m_t^d = 0 otherwise.

For each variable d, we also maintain the time interval δ_t^d ∈ R since its last observation, given by

    δ_t^d = s_t − s_{t−1} + δ_{t−1}^d   if t > 1 and m_{t−1}^d = 0,
    δ_t^d = s_t − s_{t−1}               if t > 1 and m_{t−1}^d = 1,
    δ_t^d = 0                           if t = 1.

An example of these notations is illustrated in Figure 2. We also denote the missing rate of a variable d as p_X^d, calculated as p_X^d = (1/T) Σ_{t=1}^T (1 − m_t^d). In this paper, we are interested in the time series classification problem, where we predict the label l_n given the time series data D = {(X_n, s_n, l_n)}_{n=1}^N, where X_n ∈ R^{T_n×D}, s_n ∈ R^{T_n}, and l_n ∈ {1, ..., L}.
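As a concrete illustration of these notations, the following sketch computes the mask vectors m_t and time intervals δ_t from a toy time-stamped series, where NaN marks a missing value. The numbers are invented for illustration and are not taken from any dataset used in the paper.

```python
import numpy as np

# Toy series: T = 4 time steps, D = 2 variables; NaN marks a missing value.
X = np.array([[1.0, np.nan],
              [np.nan, 0.5],
              [np.nan, np.nan],
              [2.0, 0.7]])
s = np.array([0.0, 0.9, 1.5, 2.6])  # time stamps s_t, with s_1 = 0

# Mask: m_t^d = 1 if x_t^d is observed, 0 otherwise.
M = (~np.isnan(X)).astype(float)

# Time interval delta_t^d since the last observation of variable d.
T, D = X.shape
delta = np.zeros((T, D))
for t in range(1, T):
    for d in range(D):
        gap = s[t] - s[t - 1]
        # If variable d was missing at step t-1, accumulate the interval.
        delta[t, d] = gap + (0.0 if M[t - 1, d] == 1 else delta[t - 1, d])
```

For instance, variable 1 is observed at t = 2 but missing at t = 3, so δ_4^1 covers the whole span since step 2.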

Recurrent neural networks for time-series classification
In this paper, we investigate the use of recurrent neural networks (RNNs) for time series classification, as their recursive formulation allows them to handle variable-length sequences naturally. Moreover, an RNN shares the same parameters across all time steps, which greatly reduces the total number of parameters we need to learn. Among different variants of the RNN, we specifically consider an RNN with gated recurrent units [4, 6].
The structure of GRU is shown in Figure 3(a). GRU has a reset gate r_t^j and an update gate z_t^j to control each of the hidden states h_t^j. At each time t, the update functions are as follows:

    z_t = σ(W_z x_t + U_z h_{t−1} + b_z)
    r_t = σ(W_r x_t + U_r h_{t−1} + b_r)
    h̃_t = tanh(W x_t + U (r_t ⊙ h_{t−1}) + b)
    h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

where the matrices W_z, W_r, W, U_z, U_r, U and the vectors b_z, b_r, b are model parameters, σ is the element-wise sigmoid function, and ⊙ denotes element-wise multiplication. This formulation assumes that all the variables are observed.
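For reference, a minimal NumPy sketch of one GRU update step under the formulation above; the hidden size, parameter initialization, and toy inputs are our own choices for illustration, not values from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU update; p holds W_z, W_r, W, U_z, U_r, U, b_z, b_r, b."""
    z = sigmoid(p['W_z'] @ x + p['U_z'] @ h_prev + p['b_z'])  # update gate
    r = sigmoid(p['W_r'] @ x + p['U_r'] @ h_prev + p['b_r'])  # reset gate
    h_tilde = np.tanh(p['W'] @ x + p['U'] @ (r * h_prev) + p['b'])
    # Convex combination of the previous state and the candidate state.
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
D, H = 3, 5  # input and hidden dimensions (arbitrary toy sizes)
p = {k: rng.normal(size=(H, D)) for k in ('W_z', 'W_r', 'W')}
p.update({k: rng.normal(size=(H, H)) for k in ('U_z', 'U_r', 'U')})
p.update({k: np.zeros(H) for k in ('b_z', 'b_r', 'b')})

h = np.zeros(H)
for x in rng.normal(size=(4, D)):  # a length-4 fully observed sequence
    h = gru_step(x, h, p)
```

Since h_t is a convex combination of h_{t−1} and a tanh output, each hidden unit stays within [−1, 1].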

RNN baseline approaches
There are two straightforward ways of using an RNN for time series data with missing variables. The first, and perhaps most naive, approach is to preprocess the time series so that it does not have any missing variables when presented to the RNN; the second is to feed information about the missingness itself to the RNN. We describe three such baseline approaches in this section.

GRU-0
We may simply replace each missing observation with the mean of the variable across the training examples, i.e.,

    x_t^d ← m_t^d x_t^d + (1 − m_t^d) x̃^d,    (1)

where x̃^d is the empirical mean of the d-th variable over the training set.

GRU-f
We can instead exploit the temporal structure in each time series. That is, we assume that any missing measurement is the same as the last measurement, i.e.,

    x_t^d ← m_t^d x_t^d + (1 − m_t^d) x_{t'}^d,    (2)

where t' < t is the last time the d-th variable was observed. This is the forward imputation model.
Our preliminary experiments showed that forward imputation works better than imputing missing values based on interpolation. If the first few measurements are missing, we do backward imputation. If all the measurements of a variable are missing, we impute with the empirical mean.
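The forward/backward/mean imputation procedure above can be sketched with pandas as follows. The variable names (`hr`, `gluc`), the toy readings, and the training-set means are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy example: rows are time steps, columns are variables; NaN = missing.
df = pd.DataFrame({'hr':   [np.nan, 80.0, np.nan, 90.0],
                   'gluc': [np.nan] * 4})

# Empirical means computed over the training set (hypothetical values).
train_mean = {'hr': 82.0, 'gluc': 120.0}

imputed = (df.ffill()                 # carry the last observation forward
             .bfill()                 # leading gaps: copy the first observation back
             .fillna(value=train_mean))  # all-missing variables: empirical mean
```

Here `hr` is filled by backward then forward imputation, while `gluc`, which is never observed, falls back to its training mean.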

GRU-xmd
Instead of explicitly imputing missing values, we may simply indicate which variables are missing as part of the input to a GRU-RNN. We do this by concatenating the measurement, mask and time interval vectors: x_t ← [x_t^(n); m_t; δ_t], where x_t^(n) is the imputed measurement vector from Equation 1 or 2.

Trainable decay models
Let us begin by relaxing the assumption made for the forward imputation model. Instead of using the last observation as it is, we may decay it over time toward the empirical mean (which we take as a default configuration). There are two things to consider when decaying variables. First, we want the rate at which each variable decays to differ from the other variables, based on the underlying nature of the variable. Second, the decay rate should be indicative of missingness patterns, which are informative (as we have shown earlier). In short, we aim to model decay rates that are learned from the missingness patterns rather than fixed a priori. We model the vector of decay rates γ_t as

    γ_t = exp(−max(0, W_γ δ_t + b_γ)),

where W_γ and b_γ are model parameters that we train jointly with all the other parameters of the RNN. We chose the exponentiated negative rectifier in order to keep each decay rate monotonically decreasing in a reasonable range between 0 and 1. We note, however, that it is possible to use other formulations, such as a sigmoid function, instead of the exponentiated negative rectifier, as long as the resulting decay rate stays between 0 and 1.

GRU-DM: Input decay This trainable decay scheme can be readily applied to the measurement vector by

    x_t^d ← m_t^d x_t^d + (1 − m_t^d) (γ_t^d x_{t'}^d + (1 − γ_t^d) x̃^d),

where x_{t'}^d is the last observation of the d-th variable (t' < t) and x̃^d is the empirical mean of the d-th variable. When decaying the input variable directly, we constrain W_γ to be diagonal, which effectively makes the decay rate of each variable independent of the others.
GRU-DS: Hidden decay As the decay rates are trained together with the whole GRU-RNN, we can instead decay the hidden state of the RNN. Intuitively, this has the effect of decaying the features rather than the raw input variables. This is implemented by decaying the previous hidden state h_{t−1} before computing the new hidden state h_t:

    h_{t−1} ← γ_t ⊙ h_{t−1},

in which case we do not constrain W_γ to be diagonal.
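A minimal sketch of the decay computation γ_t = exp(−max(0, W_γ δ_t + b_γ)) together with the GRU-DM input-decay rule. The weights here are fixed toy values rather than trained parameters, and all numbers are invented for illustration.

```python
import numpy as np

def decay_rate(delta_t, W_gamma, b_gamma):
    """gamma_t = exp(-max(0, W_gamma @ delta_t + b_gamma)), in (0, 1]."""
    return np.exp(-np.maximum(0.0, W_gamma @ delta_t + b_gamma))

D = 3
delta_t = np.array([0.0, 2.0, 10.0])   # hours since each variable's last observation
W_gamma = np.diag([0.5, 0.5, 0.5])     # diagonal constraint for input decay (GRU-DM)
b_gamma = np.zeros(D)
gamma_t = decay_rate(delta_t, W_gamma, b_gamma)

# GRU-DM: decay the last observation toward the empirical mean.
x_last = np.array([1.0, 1.0, 1.0])     # last observed values (toy numbers)
x_mean = np.zeros(D)                   # empirical means (toy numbers)
m_t = np.array([1.0, 0.0, 0.0])        # only variable 0 is observed at this step
x_obs = np.array([0.3, 0.0, 0.0])      # current measurements (missing entries unused)
x_hat = m_t * x_obs + (1 - m_t) * (gamma_t * x_last + (1 - gamma_t) * x_mean)
```

A recently observed variable (δ = 0) keeps its last value (γ = 1), while a long-unobserved one is pulled toward its empirical mean.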

GRU-DI: Goal-oriented imputation model
We may alternatively let the GRU-RNN predict the missing values at the next time step on its own. When missing values occur only at test time, we can simply train the GRU-RNN to predict the measurement vector of the next time step, as in a language model [16], and use it to fill in the missing values at test time. This is unfortunately not applicable to some time series applications, such as those in the health care domain, which also have missing data during training.
Instead, we propose here to view missing values as latent variables in a probabilistic graphical model.Given a timeseries X, we denote all the missing variables by M X and all the observed ones by O X .Then, training a time-series classifier with missing variables becomes equivalent to maximizing the marginalized log-conditional probability of a correct label l, i.e., log p(l|O X ).
The exact marginalized log-conditional probability is, however, intractable to compute, and we instead maximize its lower bound:

    log p(l | O_X) ≥ E_{M_X ∼ p(M_X | O_X)} [log p(l | O_X, M_X)],

where we assume the distribution over the missing variables at each time step is conditioned only on all the previous observations:

    p(M_X | O_X) = Π_t p(x̃_t | x_1, ..., x_{t−1}).

Although this lower bound is still intractable to compute exactly, we can approximate it by a Monte Carlo method, which amounts to sampling the missing variables at each time step as the RNN reads the input sequence from the beginning to the end:

    x̃_t ∼ p(x̃_t | x_1, ..., x_{t−1}).

By further assuming that x̃_t ∼ N(µ_t, σ_t^2), where µ_t = γ_t ⊙ (W_x h_{t−1} + b_x) and σ_t = 1, we can use a reparametrization technique widely used in stochastic variational inference [12, 21] to estimate the gradient of the lower bound efficiently. At test time, we simply use the mean of the missing variable, i.e., x̃_t = µ_t, as we have not seen any improvement from the Monte Carlo approximation in our preliminary experiments. We view this approach as a goal-oriented imputation method and refer to it as GRU-DI. The whole model is trained to minimize the classification cross-entropy error, and we take the negative log-likelihood of the observed values as a regularizer.
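The reparametrized sampling step of GRU-DI can be sketched as follows. The function name `sample_missing` and all parameter values are hypothetical toy choices; in the real model the gradient would flow through µ_t during training.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_missing(h_prev, W_x, b_x, gamma_t, train=True):
    """Draw x_t ~ N(mu_t, 1) with mu_t = gamma_t * (W_x @ h_prev + b_x).

    Reparametrized as mu + eps (with sigma_t = 1) so the gradient of the
    lower bound can be estimated through mu. At test time, return the mean.
    """
    mu = gamma_t * (W_x @ h_prev + b_x)
    if not train:
        return mu                      # test time: use the mean directly
    eps = rng.standard_normal(mu.shape)
    return mu + eps                    # sigma_t = 1

# Toy usage: deterministic test-time imputation.
H, D = 4, 2
h_prev = np.ones(H)
W_x = np.zeros((D, H))
b_x = np.array([0.5, -0.5])
gamma_t = np.array([1.0, 0.5])
x_tilde = sample_missing(h_prev, W_x, b_x, gamma_t, train=False)
```

At test time the sample collapses to µ_t, matching the deterministic imputation described above.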
3 Experiments

We demonstrate the performance of our proposed models on one synthetic and two real-world health care datasets (MIMIC-III, PhysioNet) and compare them to several strong machine learning and deep learning approaches on classification tasks. We study the impact of informative missingness on model performance. We also evaluate our models in different settings, such as early prediction and different dataset sizes.

Gesture phase segmentation data This UCI dataset [15] has multivariate time series features with 5 different gesticulations. It is regularly sampled and has no missing values. We extracted 378 time series and randomly introduced missing values to generate 4 synthetic datasets. The missing rates in these synthetic datasets are the same (around 50%) but have different correlations with the ground-truth labels. We use these datasets to study the impact of modeling missingness patterns in our models.

Dataset descriptions and experimental design
PhysioNet Challenge 2012 data This dataset, from the PhysioNet Challenge 2012 [24], is a publicly available collection of multivariate clinical time series from 8000 ICU records. Each record is a multivariate time series of roughly 48 hours and contains 33 variables such as albumin, heart rate, glucose, etc. We used the Training Set A subset in our experiments, since outcomes (such as in-hospital mortality labels) are publicly available only for this subset. We conduct the following two prediction tasks on this dataset.
• Mortality (Phy-Mor) task - Predict whether the patient dies in the hospital. There are 554 patients with a positive mortality label. We treat this as a binary classification problem.
• All 4 (Phy-4tasks) tasks - Predict 4 tasks: in-hospital mortality (mortality), length of stay less than 3 days (los<3), whether the patient had a cardiac condition (cardiac), and whether the patient was recovering from surgery (surgery). We consider this a multi-task prediction problem.
MIMIC-III data This public dataset [10] has deidentified clinical care data collected at Beth Israel Deaconess Medical Center from 2001 to 2012. It contains over 58,000 hospital admission records of 38,645 adults and 7,875 neonates. For our work, we extracted 99 time series features from 19714 admission records for 4 event types, including input-events (fluids into the patient, e.g., insulin), output-events (fluids out of the patient, e.g., urine), lab-events (lab test results, e.g., pH values and platelet count), and prescription-events (drugs prescribed by doctors, e.g., aspirin and potassium chloride). These events are known to be extremely useful for studying intensive care unit patients. All the time series are longer than 48 hours, and only the first 48 hours (after admission) of time series data are used for training and testing our models. We perform the following two predictive tasks on MIMIC-III.
• Mortality (MIMIC-III-Mor) task - Predict whether the patient dies in the hospital. There are 1716 patients with a positive mortality label, and we perform binary classification.

• ICD-9 Code Prediction (MIMIC-III-ICD9) task - Predict the ICD-9 diagnosis codes for each admission. There are 20 diagnosis categories (e.g., respiratory system diagnosis) in our dataset. We treat it as a multi-task prediction problem.

Methods and implementation details
We categorize all evaluated methods into four groups:
1. Non-RNN Baselines (Non-RNN): We evaluate logistic regression (LR), support vector machines (SVM) and random forest (RF), which are widely used models in health care applications.
2. RNN Baselines (RNN): We evaluate the GRU-0, GRU-f and GRU-xmd models described in Section 2.
3. Proposed Methods (Proposed): We evaluate our proposed GRU-DM, GRU-DS and GRU-DI models.
4. Ensemble Methods (Ensemble): We evaluate several ensembles of the proposed and baseline methods.
The non-RNN baselines cannot handle missing data directly. We carefully design experiments for the non-RNN models to capture the informative missingness as much as possible, to ensure a fair comparison with the RNN methods. Similar to the RNN baselines, we can concatenate the mask vector with the measurements and feed it to the non-RNN models. However, the time interval vector cannot be concatenated, since non-RNN models only work with fixed-length inputs. We therefore regularly sample the time series data to get a fixed-length input and perform imputation to fill in missing values. For the PhysioNet dataset, we sample the time series on an hourly basis and propagate measurements forward (or backward) in time to fill gaps. For the MIMIC-III dataset, we take two-hourly samples (in the first 48 hours) and do forward (or backward) imputation. Our preliminary experiments showed that two-hourly samples obtain better performance than one-hourly samples for MIMIC-III. We report results both for the concatenation of input and mask vectors (e.g., SVM-xm, LR-xm, and RF-xm) and for the input vector only, without the mask (e.g., SVM-f, LR-f, and RF-f). We use the sklearn Python library for the non-RNN model implementations and tune the parameters by cross-validation. For SVM, we choose the RBF kernel, since it performs better than other kernels.
For the RNN models, we use a binary logistic regressor on top of the last hidden state h_T to perform classification. We use 100 and 64 hidden units in GRU-0 for the MIMIC-III and PhysioNet datasets, respectively. All the other RNN models were constructed to have a comparable number of parameters.
For GRU-xmd, we use mean imputation for the input (Equation 1). We train all the RNN models with the Adam optimization method [11] and use early stopping to find the best weights on the validation dataset. All the input variables are normalized to have zero mean and unit standard deviation. We report the results of 5-fold cross-validation for all the methods. For the ensemble methods, we average the soft labels of several classifiers and treat this as the ensemble prediction.
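The ensemble rule described above (averaging soft labels) is straightforward; the probabilities below are invented toy values from three hypothetical classifiers, not results from the paper.

```python
import numpy as np

# Soft labels (predicted positive-class probabilities) from three
# hypothetical classifiers on four test examples.
p_svm = np.array([0.9, 0.2, 0.6, 0.1])
p_rf  = np.array([0.8, 0.3, 0.5, 0.2])
p_gru = np.array([0.7, 0.1, 0.7, 0.3])

# Ensemble prediction: average the soft labels, then threshold at 0.5.
p_ens = np.mean([p_svm, p_rf, p_gru], axis=0)
y_hat = (p_ens >= 0.5).astype(int)
```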
Recently, RNN models have been explored for modeling diseases and patient diagnoses in the health care domain [14, 5, 19] using doctor notes, but these models are not readily applicable for comparison in our time series classification tasks, since they do not handle missing data.
We will release our code to maximize reproducibility and to create a new benchmark for studying time series classification with missing data.

Impact of missingness and label correlation on synthetic dataset
To evaluate the impact of modeling missingness, we conduct experiments on the synthetic Gesture datasets. Figure 4 shows the AUC score comparison of our proposed models (GRU-DM, GRU-DS, and GRU-DI) and two baseline GRU models (GRU-0 and GRU-xmd), given different correlations between the missing rate and the label. The missing rate is the same across all the settings, but a higher correlation means the missingness is more informative. Since GRU-0 does not utilize masking or time intervals, it performs similarly across all 4 correlation settings. All other models benefit from the missingness, especially when the correlation is high, and our proposed methods beat the baselines in all settings. GRU-xmd, another baseline, performs well when the correlation is high, but performs even worse than GRU-0 when the correlation is low. This demonstrates that by simply concatenating the masking and time interval vectors to the input, GRU-xmd cannot distinguish whether the missingness is useful or not. The results on the synthetic datasets provide an insightful way to understand how our proposed models behave under different data properties.
Evaluation on real datasets

Table 2 shows the prediction performance of the models listed in Section 3.2 on the mortality task for the MIMIC-III and PhysioNet datasets. We observe the following: all models improve their performance when they are fed the missingness patterns along with the inputs. Our proposed models achieve the best AUC scores on both datasets. Our ensemble model, based on the proposed GRU models (GRU-DM, GRU-DS, and GRU-DI) and two non-RNN baseline models (SVM-xm and RF-xm), achieves the best performance with a significant improvement. This implies that our models exploit some knowledge which the baseline models do not capture. Figure 5 and Figure 6 respectively show the AUC scores of all RNN models for all 20 ICD-9 diagnosis category predictions on the MIMIC-III dataset and all 4 tasks on the PhysioNet dataset. Our proposed models perform best in most of the tasks.

Discussions
Decay analysis Figure 7 shows the γ_t plots of all the variables for our GRU-DM model in the Phy-Mor experiments. We observe that the decay rate is almost constant for variables that correspond to vital signs and a few lab measurements. Since these variables have lower missing rates, our models do not decay them over time. On the other hand, the variables with large decay mainly correspond to lab test variables, which have long time intervals between observations. Among these, variables such as weight, cholesterol, pH, lactate, PaO2, etc. are known to be very important for clinical outcome prediction [29, 2], and thus our model decays them appropriately so that their more recent observations are used for the mortality prediction task.
Per time step prediction Although our model is trained to make a prediction at the last time step, it can be used directly to make predictions before it has seen the entire time series. This is very useful in applications such as health care, where early decision making is beneficial for patient care. Figure 8 shows the online prediction results for the MIMIC-III mortality task, where the model makes predictions before it sees the entire time series. First, all three proposed models beat the RNN baselines consistently from the very beginning. Second, with only part of the time series, the proposed methods achieve the same or better performance than SVM and RF. Our models achieve similar prediction performance (i.e., the same AUC) 11 hours earlier than SVM and 6 hours earlier than RF.
Performance with different training data sizes In many practical applications, model scalability with growing dataset size is very important. To evaluate model performance with varying training dataset sizes, we generate three smaller datasets (2k, 5k, 10k admissions) from MIMIC-III, keeping the ratio of mortality labels to dataset size similar to the original dataset. We compare our proposed models with the three most competitive baseline models (SVM-xm, RF-xm, GRU-xmd) on these smaller datasets. We observe that all models achieve improved performance with more training samples. However, the improvements of the non-RNN baselines are quite limited compared with those of our models. This result indicates that the gap in performance between our models and the non-RNN baselines will continue to grow as more data becomes available.

Summary
In this paper, we proposed novel GRU-based models to effectively model multivariate time series with missing data. Our models capture the informative missingness by incorporating masking and time intervals directly into the GRU architecture. Empirical experiments on real-world health care datasets showed promising results for our models. For future work, we are interested in exploring deep learning approaches to characterize missing-not-at-random data and in conducting theoretical analysis to understand the behaviors of existing solutions for handling missing values.

A Supplementary A.1 MIMIC-III preprocessing details
Here, we describe the preprocessing details for the MIMIC-III dataset. To avoid/reduce ambiguity and noisy observations, we ensured that all the measurements for a particular variable have only one unit of measurement. We also aggregated multiple readings of a feature at a single time stamp based on the feature type. For instance, some input-events features should be averaged, while others need to be summed up. This resulted in 99 variables being extracted from the four tables for 19714 patient admission records. For each of the admission records, we collected both the variable values x_t and the time stamps of observation s_t. In addition, for each admission record we queried the database tables to get the ICD-9 diagnosis codes. One admission record can be associated with multiple ICD-9 codes. We also queried the discharge time and death time from the Admissions table of MIMIC-III to find the mortality label for each admission record. The ICD-9 diagnosis codes, shown in Table 3, were grouped into 20 categories according to the information from the Thomson Reuters webpage. The class distribution of the ICD-9 codes is shown in Figure 10.

A.2 Descriptions for Figure 1

In many time series applications, the pattern of missing variables in the time series is often informative and useful for prediction tasks. Here, we empirically confirm this claim on two health care datasets by investigating the correlation between the missingness and the prediction label (mortality prediction task). We compute the Pearson correlation coefficient between p_X^d and the label across the training time series. As shown in Figure 1, we observe that the variables with low missing rates are highly correlated with the target label, demonstrating the usefulness of missingness patterns in solving a prediction task. Note that p_X^d depends on the mask vectors m_t^d and the number of time steps T.
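The per-feature-type aggregation described in A.1 (averaging some features recorded multiple times at one time stamp, summing others) can be sketched with pandas. The feature names, aggregation rules, and readings below are illustrative only, not from the actual MIMIC-III tables.

```python
import pandas as pd

# Toy readings: several records of two features, some sharing a time stamp.
readings = pd.DataFrame({
    'time':    [0, 0, 0, 1],
    'feature': ['heart_rate', 'heart_rate', 'insulin', 'insulin'],
    'value':   [80.0, 84.0, 2.0, 3.0],
})

# Aggregation rule per feature type: rates are averaged, intakes summed.
agg_rule = {'heart_rate': 'mean', 'insulin': 'sum'}

# Apply each feature's own rule, then align the results on the time index.
parts = []
for feat, rule in agg_rule.items():
    g = (readings[readings['feature'] == feat]
         .groupby('time')['value']
         .agg(rule)
         .rename(feat))
    parts.append(g)
table = pd.concat(parts, axis=1)  # one row per time step, one column per feature
```

Time steps at which a feature has no reading simply end up as NaN, i.e., missing values in the resulting series.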

Figure 1:
Figure 1: Correlations between missing rates and labels in real-world health care datasets. Left: MIMIC-III mortality label; middle: MIMIC-III ICD-9 diagnosis labels; right: PhysioNet 2012 mortality label. x-axis: target label(s); y-axis: input variables; color: correlation value. Please refer to the supplementary materials for more details about this figure.

Figure 2:
Figure 2: An example sequence of measurement vectors x_t, time stamps s_t, mask vectors m_t and time intervals δ_t.

Figure 3:
Figure 3: Graphical illustrations of (a) the original GRU, (b, c) GRU with trainable decay rates, and (d) GRU with dynamic imputation.

Figure 4:
Figure 4: Performance on the Gesture synthetic datasets with different correlations between missingness and labels. x-axis: average Pearson correlation between each variable's missing rate and the target label in that synthetic dataset; y-axis: AUC score.

Figure 5:
Figure 5: AUC scores of all RNN models for the 20 ICD-9 diagnosis category predictions on the MIMIC-III dataset.

Figure 7:
Figure 7: Decay (γ_t) plots of all 33 variables for the Phy-Mor task in the GRU-DM model. Variables in green: lab measurements; in red: vital signs. Variables are sorted in decreasing order of missing rate. mr: missing rate. x-axis: time interval δ_t^d, in range [0, 24 hours]; y-axis: decay value γ_t^d, in range [0, 1].

Figure 10:
Figure 10: MIMIC-III ICD-9 diagnosis code class distribution. x-axis: ICD-9 diagnosis category id; y-axis: ratio of admission records with the diagnosis code.

Table 1:
Table 1: Dataset statistics. To evaluate our proposed framework, we ran a series of prediction experiments on three datasets. Statistics of these datasets are shown in Table 1. For each dataset, we only consider time steps at which at least one measurement is available.

Table 2:
Table 2: Model performance measured by the Area Under the ROC curve (AUC) score for predicting in-hospital mortality.
MIMIC-III provides several relational database tables containing data relating to patients who stayed within the intensive care units (ICUs) at Beth Israel Deaconess Medical Center. The admission table contains over 58,000 hospital admission records of 38,645 adults and 7,875 neonates. We chose four tables, namely inputevents-mv (fluids into the patient, e.g., insulin), outputevents (fluids out of the patient, e.g., urine), labevents (lab test results, e.g., pH and platelet count) and prescription events (drugs prescribed by doctors, e.g., aspirin and potassium chloride), to collect the patient data recorded in critical care units and hospital record systems. The inputevents-mv table collects the intake of patients monitored using the iMDSoft Metavision system. For our work, we use 19714 admission records collected during 2008-2012 by the Metavision data management system, which is still employed at the hospital. The data collection and organization in the Metavision system is much neater than in the earlier Philips CareVue system (2001-2008). From each of the four tables, we chose the top 50 items (i.e., features/variables), since these items are present in many of the patients' records.