Results for in-hospital mortality, decompensation, length-of-stay, and phenotype prediction tasks. A subfigure corresponding to a benchmark task has three parts. The first part is a table that lists the values of the metrics for all models along with 95% confidence intervals obtained by resampling the test set K times with replacement (K = 10000 for in-hospital mortality and phenotype prediction task, while for decompensation and length-of-stay prediction tasks K = 1000). For all metrics except MAD, larger values are better. The second part visualizes the confidence intervals for the main metric of the corresponding task. The black circle corresponds to the mean value of K iterations. The thick black line shows standard deviation and narrow grey line shows 95% confidence interval. The third part shows the significance of the difference between the models. We count the number of resampled tests sets on which the i-th model performed better than the j-th model (denoted by ci,j). The cell at the i-th row and the j-th column of the table shows the percentage of ci,j in K. We say that the i-th model is significantly better than the j-th model if ci,j/K > 0.95 and highlight the corresponding cell of the table.