Recurrent Neural Networks for Multivariate Time Series with Missing Values on ShortScience.org

Recurrent Neural Networks for Multivariate Time Series with Missing Values
Zhengping Che and Sanjay Purushotham and Kyunghyun Cho and David Sontag and Yan Liu
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, cs.NE, stat.ML


Summary by Tiago Vinhoza 7 years ago

#### Motivation:
+ Clinical time series are full of missing values, because a measurement is only recorded when a medical event occurs (when a blood test is performed, for example).
+ The pattern of missing values can itself be very informative about the label (*informative missingness*).
+ The paper proposes a deep learning model that **exploits the missingness patterns** to improve prediction performance.

#### Time series notation:

Multivariate time series with $D$ variables of length $T$: 
+ ${\bf X} = ({\bf x}_1, {\bf x}_2, \ldots, {\bf x}_T)^T \in \mathbb{R}^{T \times D}$.
+  ${\bf x}_t  \in \mathbb{R}^{D}$ is the $t$-th measurement of all variables. 
+ $x_t^d$ is the $d$-th component of  ${\bf x}_t$.

Missing value information is incorporated through two representations, *masking* and *time intervals*.
 + Masking: indicates which entries are missing.
  + Masking vector ${\bf m}_t \in \{0, 1\}^D$, with $m_t^d = 1$ if $x_t^d$ is observed and $m_t^d = 0$ if $x_t^d$ is missing.
 + Time interval: captures the temporal pattern of the non-missing observations. Each measurement ${\bf x}_t$ has a time-stamp $s_t$, and $\delta_t^d$ is the time elapsed since variable $d$ was last observed.

Example: 
${\bf X}$: input time series with 2 variables and 7 time steps (shown with one row per variable and one column per time step),
$$ {\bf X} = \begin{pmatrix}
47 & 49 & NA & 40 & NA & 43 & 55 \\ NA & 15 & 14 & NA & NA & NA & 15 
\end{pmatrix}
$$
with time-stamps
$${\bf s} =  \begin{pmatrix}
0 & 0.1 & 0.6 & 1.6 & 2.2 & 2.5 & 3.1
\end{pmatrix}
$$
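The masking vectors ${\bf m}_t$ and time intervals ${\delta}_t$ are computed for each variable and stacked to form the masking matrix ${\bf M}$ and the time interval matrix ${\bf \Delta}$. The interval of a variable keeps accumulating while that variable stays missing and restarts once it has been observed. For the second variable, nothing is observed between $s = 0.6$ and $s = 3.1$, so its interval grows from $1.0$ to $2.5$: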
$$ {\bf M} = \begin{pmatrix}
1 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 0 & 1 
\end{pmatrix}
$$
$$ {\bf \Delta} = \begin{pmatrix}
0 & 0.1 & 0.5 & 1.5 & 0.6 & 0.9 & 0.6 \\ 0 & 0.1 & 0.5 & 1.0 & 1.6 & 1.9 & 2.5 
\end{pmatrix}
$$
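
The construction of ${\bf M}$ and ${\bf \Delta}$ can be sketched in a few lines of plain Python. This is only an illustration of the rule described above, not the authors' code: the function name `masking_and_intervals` and the use of `None` as the missing-value marker are assumptions of this sketch.

```python
NA = None  # marker for a missing entry (an assumption of this sketch)


def masking_and_intervals(X, s):
    """Compute the masking and time-interval matrices.

    X: list of D rows (one per variable), each with T entries, NA if missing.
    s: list of T time-stamps.
    Returns (M, Delta), both laid out like X.
    """
    T = len(s)
    M, Delta = [], []
    for row in X:
        m = [0 if v is NA else 1 for v in row]
        delta = [0.0] * T  # the interval at the first time step is 0
        for t in range(1, T):
            gap = s[t] - s[t - 1]
            # Restart the interval if the previous step was observed,
            # otherwise keep accumulating the elapsed time.
            delta[t] = gap if m[t - 1] == 1 else gap + delta[t - 1]
        M.append(m)
        Delta.append(delta)
    return M, Delta


X = [[47, 49, NA, 40, NA, 43, 55],
     [NA, 15, 14, NA, NA, NA, 15]]
s = [0, 0.1, 0.6, 1.6, 2.2, 2.5, 3.1]

M, Delta = masking_and_intervals(X, s)
for m_row, d_row in zip(M, Delta):
    print(m_row, [round(d, 1) for d in d_row])
```

Rounding is only applied for display, since the subtractions of time-stamps are done in floating point.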

#### Proposed Architecture:
+ GRU (Gated Recurrent Unit) with trainable decays:
 + Input decay: a missing variable is not simply filled with its last observed value; instead its value decays over time from the last observation towards the empirical mean of that variable. The decay of each input variable is treated independently.
 + Hidden state decay: attempts to capture richer information from the missing patterns. Here the hidden state of the network at the previous time step is decayed before it is used to compute the new hidden state.
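
A minimal sketch of the two decays, assuming the exponential decay rate $\gamma_t = \exp(-\max(0, W_\gamma \delta_t + b_\gamma))$ given in the paper (the summary above does not state the formula). The function names, the scalar per-variable weights `w` and `b`, and the numbers in the usage lines are illustrative assumptions, not the authors' implementation.

```python
import math


def decay_rate(delta, w, b):
    """gamma = exp(-max(0, w * delta + b)); always in (0, 1]."""
    return math.exp(-max(0.0, w * delta + b))


def decayed_input(x, m, delta, x_last, x_mean, w, b):
    """Input decay for one variable at one time step.

    If the value is observed (m == 1) it is used as is. Otherwise the
    estimate moves from the last observation x_last towards the
    empirical mean x_mean as the interval delta grows.
    """
    if m == 1:
        return x
    g = decay_rate(delta, w, b)
    return g * x_last + (1.0 - g) * x_mean


def decayed_hidden(h_prev, delta, W, b):
    """Hidden state decay: scale each unit of h_{t-1} by its own gamma.

    W has one row per hidden unit and one column per input variable,
    b has one entry per hidden unit.
    """
    out = []
    for w_row, b_j, h_j in zip(W, b, h_prev):
        z = sum(w_jd * d for w_jd, d in zip(w_row, delta)) + b_j
        out.append(math.exp(-max(0.0, z)) * h_j)
    return out


# Hypothetical numbers: last observation 49, training mean 46.8.
short_gap = decayed_input(None, 0, 0.5, x_last=49.0, x_mean=46.8, w=1.0, b=0.0)
long_gap = decayed_input(None, 0, 100.0, x_last=49.0, x_mean=46.8, w=1.0, b=0.0)
h_new = decayed_hidden([1.0, -2.0], [0.5, 0.5], [[1.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

With a short gap the estimate stays close to the last observation, and with a long gap it is essentially the mean. In the model the weights are learned jointly with the GRU parameters; here they are fixed by hand.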


#### Dataset:
+ MIMIC III v1.4: https://mimic.physionet.org/
 + Input events, Output events, Lab events, Prescription events
+ PhysioNet Challenge 2012: https://physionet.org/challenge/2012/

| | MIMIC III | PhysioNet 2012 |
|---|---|---|
| Number of samples ($N$) | 19714 | 4000 |
| Number of variables ($D$) | 99 | 33 |
| Mean number of time steps | 35.89 | 68.91 |
| Maximum number of time steps | 150 | 155 |
| Mean of variable missing rate | 0.9621 | 0.8225 |


#### Experiments and Results:
**Methodology**
+ Baselines:
 + Logistic Regression, SVM, Random Forest. The time series are sampled at regular intervals (every 1 h for PhysioNet, every 2 h for MIMIC III) and missing values are imputed by forward / backward filling. The masking vector is concatenated to the input to tell the models which inputs were imputed.
 + LSTM with mean imputation.
 + Variations of the proposed GRU model:
  + GRU-mean: impute average of the training set.
  + GRU-forward: impute last value.
  + GRU-simple: masking vectors and time interval are inputs. There is no imputation.
  + GRU-D: proposed model.
+ Batch normalization and dropout (p = 0.5) are applied to the regression layer.
+ Inputs are normalized to have a mean of 0 and a standard deviation of 1.
+ Parameter optimization: early stopping on validation set.
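
The imputation strategies used by the baselines can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the training mean is computed here from the toy series itself, and falling back to that mean before the first observation in forward imputation is an assumption of this sketch.

```python
NA = None  # marker for a missing entry (an assumption of this sketch)


def impute_mean(series, train_mean):
    """Mean imputation: replace every missing entry by the training mean."""
    return [train_mean if v is NA else v for v in series]


def impute_forward(series, train_mean):
    """Forward imputation: carry the last observed value forward.

    Entries before the first observation fall back to train_mean
    (an assumption of this sketch).
    """
    out, last = [], train_mean
    for v in series:
        if v is not NA:
            last = v
        out.append(last)
    return out


def with_mask(values, series):
    """Concatenate the masking vector so a model can tell imputed from observed."""
    mask = [0 if v is NA else 1 for v in series]
    return list(values) + mask


x1 = [47, 49, NA, 40, NA, 43, 55]
observed = [v for v in x1 if v is not NA]
x1_mean = sum(observed) / len(observed)

x1_mean_filled = impute_mean(x1, x1_mean)
x1_forward_filled = impute_forward(x1, x1_mean)
x1_with_mask = with_mask(x1_forward_filled, x1)
```

Feeding only the imputed values hides which entries were actually measured, which is why the non-RNN baselines and GRU-simple also receive the masking information.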

**Results**

Mortality Prediction (results in terms of AUC):
+ Proposed GRU-D outperforms other models on both datasets: 
 + AUC = 0.8527 $\pm$ 0.003 for MIMIC-III and 0.8424 $\pm$ 0.012 for PhysioNet
+ Random Forest and SVM are the best non-RNN baselines.
+ GRU-simple was the best of the other RNN variants.

Multitask Prediction (results in terms of AUC):
+ PhysioNet: 4 tasks (mortality, length of stay under 3 days, surgery, cardiac condition).
+ MIMIC III: 20 diagnostic categories.
+ The proposed GRU-D outperforms other baseline models.

#### Positive Aspects:
+ Instead of performing simple mean imputation or using indicator functions, the paper exploits missing values and missingness patterns in a novel way.
+ The paper performs extensive comparisons against baselines.

#### Caveats:
+ Clinical mortality datasets usually have a very high imbalance between classes. In such cases AUC alone can look good even when most of the positive predictions are false alarms, so it is not sufficient as the only evaluation metric. It would have been interesting to see the results in terms of precision/recall as well.
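
To make this caveat concrete, here is a small sketch on made-up data (2 positives among 100 samples; none of these numbers come from the paper). The helper names `roc_auc` and `precision_recall` are hypothetical and implemented directly from the definitions.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(labels, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    pred = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, pred) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, pred) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, pred) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 2 positives, 10 confidently wrong negatives, 88 easy negatives.
labels = [1, 1] + [0] * 10 + [0] * 88
scores = [0.9, 0.8] + [0.85] * 10 + [0.1] * 88

auc = roc_auc(labels, scores)
precision, recall = precision_recall(labels, scores, threshold=0.8)
print(round(auc, 3), round(precision, 3), recall)
```

On this toy set the AUC is about 0.95, yet only 2 of the 12 predicted positives are correct (precision of about 0.17 at full recall), which is the kind of gap a precision/recall report would expose.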
