Weakly supervised classification of rare aortic valve malformations using unlabeled cardiac MRI sequences

Population-scale biomedical repositories such as the UK Biobank provide unprecedented access to prospectively collected cardiac imaging data; however, the majority of these data are unlabeled, creating barriers to their use in supervised machine learning. We developed a weakly supervised deep learning model for Bicuspid Aortic Valve (BAV) classification using up to 4,000 unlabeled cardiac MRI sequences. Instead of requiring curated, hand-labeled training data, weak supervision relies on noisy heuristics defined by domain experts to programmatically generate large-scale, imperfect training labels. For BAV classification, training models using these imperfect labels substantially outperformed a traditional supervised model trained on hand-labeled MRIs. In a validation experiment using long-term outcome data from the UK Biobank, our classification model identified a subset of individuals with a 1.8-fold increase in risk of a major adverse cardiac event. This work formalizes the first deep learning baseline for aortic valve classification and outlines a general strategy for using weak supervision to train machine learning models using large collections of unlabeled medical images.

Weakly supervised machine learning methods are promising for cardiac medical imaging, a speciality that poses many computational challenges. The heart is a dynamic anatomical structure moving in 3 dimensions with periodicity that may range from 1 to 3 Hz depending on age and health status. Cardiac imaging entails complex manual alignment to cardiac structures and the capture of multiple sequences coordinated to cardiac cycle and patient respiration. Due to the complexity of imaging output and need for human interpretation, studies utilizing cardiac MRIs are mostly limited to single centers relying on human readers of clinically obtained data for functional information. For these reasons, obtaining large-scale labeled data for the space of possible cardiac pathologies is especially challenging.
We build on recent weak supervision techniques to train a state-of-the-art hybrid Convolutional Neural Network / Long Short Term Memory (CNN-LSTM) model for BAV classification. Our pipeline closely matches a realistic application setting, where we have access to a large repository of unlabeled MRI sequences and a small, hand-labeled dataset for model verification. Weak supervision allows us to train deep learning models without manually constructing massive labeled datasets, substantially lowering the time and cost required to construct state-of-the-art imaging models. Finally, to assess the real-world relevance of our image classification model, we applied the CNN-LSTM to a cohort of 9,230 new patients with long-term outcome and MRI data from the UK Biobank. In patients identified by our classifier as having BAV, we found a 1.8-fold increase in risk of a major adverse cardiac event. Our findings demonstrate how weakly supervised methods help mitigate the lack of expert-labeled training data in cardiac imaging settings, and how real-world health outcomes can be learned directly from large-scale, unlabeled medical imaging data.

MRI acquisition

Each patient's phase-contrast MRI study includes three series: CINE, MAG, and velocity encoded (VENC) [14]. Video examples are available in S1 Videos. In MAG and VENC series, pixel intensity directly maps to velocity of blood flow. This is performed by exploiting the difference in phase of the transverse magnetism of protons within blood when flowing parallel to a gradient magnetic field, where the phase difference is proportional to velocity. CINE images encode anatomical information without capturing blood flow.
All raw phase contrast MRI sequences are 30 frames, 12-bit grayscale, and 192 x 192 pixels.

MRI preprocessing
All MRIs were preprocessed to: (1) localize the aortic valve to a 32x32 pixel crop; and (2) align all image frames by peak blood flow in the cardiac cycle. Since the MAG series directly captures blood flow (and the aorta typically has the most blood flow), both of these steps are straightforward using standard threshold-based image processing techniques when the series is localized to a cross-sectional plane at the sinotubular junction. Selecting the pixel region with maximum standard deviation across all frames localized the aorta, and selecting the frame with maximum z-score identified peak blood flow. See S1 Appendix for implementation details. Both heuristics were very accurate (>95% as evaluated on the development set), and selecting a ±7 frame window around the peak frame f_peak captured 99.5% of all pixel variation for the aorta. All three MRI sequences were aligned to this peak before classification.
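The two heuristics can be sketched in a few lines of numpy. The function names and the brute-force window search below are our own illustration, not the paper's implementation (which is detailed in S1 Appendix):

```python
import numpy as np

def localize_aorta(series, crop=32):
    """Sketch of the localization heuristic: find the crop-sized window
    with the largest total per-pixel standard deviation across frames."""
    std_map = series.std(axis=0)          # per-pixel variation over time
    best, best_score = (0, 0), -1.0
    h, w = std_map.shape
    for r in range(h - crop + 1):         # exhaustive search; slow but clear
        for c in range(w - crop + 1):
            score = std_map[r:r + crop, c:c + crop].sum()
            if score > best_score:
                best, best_score = (r, c), score
    r, c = best
    return series[:, r:r + crop, c:c + crop]

def peak_flow_frame(series):
    """Select the frame whose mean-intensity z-score is largest, a proxy
    for peak blood flow in the MAG series."""
    means = series.reshape(series.shape[0], -1).mean(axis=1)
    z = (means - means.mean()) / means.std()
    return int(z.argmax())
```

A ±7 frame window around the returned peak index would then be extracted for each of the three aligned series.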

Weak supervision
There is considerable research on using indirect or weak supervision to train machine learning models for image and natural language tasks without relying entirely on manually labeled data [9,15,16]. One longstanding approach is distant supervision [17,18], where indirect sources of labels are used to generate noisy training instances from unlabeled data. For example, in the ChestX-ray8 dataset [19], disorder labels were extracted from clinical assessments found in radiology reports. Unfortunately, we often lack access to indirect labeling resources or, as in the case of BAV, the class of interest itself may be rare and underdiagnosed in existing medical records.
Another strategy is to generate noisy labels via crowdsourcing [20,21], which in some medical imaging tasks can perform as well as trained experts [22,23]. In practice, however, crowdsourcing is logistically difficult when working with protected health information such as MRIs. A significant challenge in all weakly supervised approaches is correcting for label noise, which can negatively impact end model performance. Noise is commonly addressed using rule-based and generative modeling strategies for estimating the accuracy of label sources [24,25].
In this work, we use the recently proposed data programming [9] method, a generalization of distant supervision that uses a generative model to learn both the unobserved accuracies of labeling sources and statistical dependencies between those sources [11,12]. In this approach, source accuracy and dependencies are estimated without requiring labeled data, enabling the use of weaker forms of supervision to generate training data, such as using noisy heuristics from clinical experts. For example, in BAV patients the phase-contrast imaging of flow through the aortic valve has a distinct ellipse or asymmetrical triangle appearance compared to the more circular aorta in TAV patients. This is the reasoning a human might apply when directly examining an MRI. In data programming these types of broad, often imperfect domain insights are encoded into functions that vote on the potential class label of unlabeled data points. This allows us to weakly supervise tasks where indirect label sources, such as patient notes with assessments of BAV, are not available.
The idea of encoding domain insights is formalized as labeling functions: black box functions which vote on unlabeled data points. The only restriction on labeling functions is that they vote correctly with probability better than random chance. In images, labeling functions are defined over a set of domain features or primitives, semantic abstractions over raw pixel data that enable experts to more easily encode domain heuristics. Primitives encompass a wide range of abstractions, from simple shape features to complex semantic objects such as anatomical segmentation masks. Labeling function output is used to learn a generative model of the underlying annotation process, where each labeling function is weighted by its estimated accuracy to generate probabilistic training labels.

The relationship between the unobserved true labels y and the label matrix Λ is modeled using a factor graph [26]. We learn a probabilistic model that best explains Λ, the empirical matrix observed by applying labeling functions to unlabeled data. In the basic data programming model, this consists of n accuracy factors between λ_1, ..., λ_n and y:

φ_i^Acc(Λ, y) = 1{λ_i = y}

Other dependencies among labeling functions (e.g., pairwise similarities) can be learned by defining additional factors. These factors may be specified manually or inferred directly from unlabeled data. The generative model's factor weights θ are estimated by minimizing the negative log likelihood of p_θ(Λ) using contrastive divergence [27].
Optimization is done using standard stochastic gradient descent with Gibbs sampling for gradient estimation.
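As a rough intuition for what the generative model computes, the toy sketch below estimates each labeling function's accuracy from its agreement with a weighted-vote consensus and converts those accuracies into probabilistic labels. This fixed-point heuristic is only an illustration of the idea; the actual model fits factor weights by contrastive divergence with Gibbs sampling, as described above:

```python
import numpy as np

def probabilistic_labels(L, n_iter=10):
    """Toy stand-in for the data programming generative model.
    L is an (n_points, n_lfs) matrix over {-1, 0, +1}, 0 meaning abstain.
    Returns P(y = +1 | L) for each data point."""
    n, m = L.shape
    w = np.ones(m)                       # log-odds weight per labeling function
    for _ in range(n_iter):
        score = L @ w                    # weighted vote per data point
        y_hat = np.where(score >= 0, 1, -1)
        # Accuracy of each LF on its non-abstain votes vs. the consensus.
        acc = np.array([
            (L[:, j][L[:, j] != 0] == y_hat[L[:, j] != 0]).mean()
            if (L[:, j] != 0).any() else 0.5
            for j in range(m)
        ]).clip(0.05, 0.95)
        w = np.log(acc / (1 - acc))      # better-than-chance LFs get weight > 0
    return 1.0 / (1.0 + np.exp(-2 * (L @ w)))
```

Note that an always-abstaining function receives accuracy 0.5 and therefore weight 0, so it contributes nothing to the final probabilistic labels.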
Learning dependencies automatically from unlabeled data is critical in imaging tasks, where labeling functions are dependent on a complex space of domain primitives. We use the generative model enhancements proposed in Varma et al. [11] to infer higher order dependency structure between labeling functions based on their interactions with primitives. This approach requires defining a space of feature primitives (e.g., the area of a binarization mask) that serves as an additional input to the generative model.
The final weak supervision pipeline requires two inputs: (1) a primitive feature matrix; and (2) the observed label matrix Λ. To generate Λ, we take each patient's frame sequence x_i = {x_1i, ..., x_30i} and apply labeling functions to each frame.

Extracting domain primitives

Primitives are generated using existing models or methods for extracting features from image data. In our setting, we restricted primitives to unsupervised shape statistics and pixel intensity features provided by off-the-shelf image analysis tools such as scikit-image [28]. Primitives are generated using a binarized mask of the aortic valve for each frame in a patient's MAG series. Since the generative model accounts for noise in labeling functions and primitives, we can use imperfect thresholding techniques such as Otsu's method [29] to generate binary masks. All masks were used to compute primitives for: (1) area; (2) perimeter; (3) eccentricity (a [0,1) measure comparing the mask shape to an ellipse, where 0 indicates a perfect circle); (4) pixel intensity (the mean pixel value for the entire mask); and (5) ratio (the ratio of area over perimeter squared). Since the size of the heart and anatomical structures correlate strongly with patient sex, we normalized these features by two population means stratified by sex in the unlabeled set.
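The five primitives can be approximated with plain numpy. The sketch below (a minimal Otsu threshold, a boundary-pixel perimeter estimate, and a moment-based eccentricity) is a self-contained stand-in for the scikit-image features named above, not the paper's exact code:

```python
import numpy as np

def otsu_threshold(img, bins=64):
    """Minimal Otsu's method: pick the threshold maximizing the
    between-class variance of the intensity histogram."""
    hist, edges = np.histogram(img, bins=bins)
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for k in range(1, bins):
        w0, w1 = p[:k].sum(), p[k:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (p[:k] * centers[:k]).sum() / w0
        mu1 = (p[k:] * centers[k:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, centers[k - 1]
    return best_t

def frame_primitives(frame):
    """Area, perimeter, eccentricity, mean intensity, and area/perimeter^2
    ratio of the Otsu-binarized valve mask."""
    mask = frame > otsu_threshold(frame)
    area = int(mask.sum())
    # Perimeter: mask pixels with at least one 4-connected background neighbor.
    pad = np.pad(mask, 1)
    boundary = mask & ~(pad[:-2, 1:-1] & pad[2:, 1:-1] & pad[1:-1, :-2] & pad[1:-1, 2:])
    perimeter = int(boundary.sum())
    # Eccentricity from eigenvalues of the pixel-coordinate covariance.
    ys, xs = np.nonzero(mask)
    cov = np.cov(np.vstack([ys, xs])) if area > 1 else np.eye(2)
    l1, l2 = sorted(np.linalg.eigvalsh(cov), reverse=True)
    ecc = float(np.sqrt(1 - l2 / l1)) if l1 > 0 else 0.0
    return dict(area=area, perimeter=perimeter, eccentricity=ecc,
                intensity=float(frame[mask].mean()) if area else 0.0,
                ratio=area / perimeter ** 2 if perimeter else 0.0)
```

A near-circular TAV orifice yields eccentricity near 0, while an elongated BAV-like orifice yields a value approaching 1, which is the behavior the labeling functions rely on.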

Designing labeling functions
We designed 5 labeling functions using the primitives described above. For model simplicity, labeling functions were restricted to using threshold-based, frame-level information for voting. All labeling function thresholds were selected manually using distributional statistics computed over all primitives for the expert-labeled development set. (See S1 Table for complete labeling function implementations.)
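For illustration, two such labeling functions might look like the sketch below. The votes follow the {-1, +1} convention with 0 for abstain, but the thresholds shown are hypothetical placeholders; the real cutoffs came from development-set statistics and are listed in S1 Table:

```python
ABSTAIN, TAV, BAV = 0, -1, 1

def lf_eccentricity(p):
    """Vote BAV when the valve mask is elliptical, TAV when near-circular.
    The 0.8 / 0.4 cutoffs are illustrative only."""
    if p["eccentricity"] > 0.8:
        return BAV
    if p["eccentricity"] < 0.4:
        return TAV
    return ABSTAIN

def lf_area(p):
    """Vote BAV when the sex-normalized valve area is unusually small
    (again, an illustrative threshold)."""
    return BAV if p["area"] < 0.6 else ABSTAIN
```

Each function votes per-frame on a dictionary of primitives, and the resulting votes populate one column of the label matrix Λ.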

Discriminative model
The discriminative model classifies BAV/TAV status using t contiguous MRI frames (5 ≤ t ≤ 30, where t is a hyperparameter) and a single probabilistic label per patient. This model consists of two components: a frame encoder for learning frame-level features and a sequence encoder for combining individual frames into a single feature vector. For the frame encoder, we use a Dense Convolutional Network (DenseNet) [30] with 40 layers and a growth rate of 12, pretrained on 50,000 images from CIFAR-10 [31]. We tested two other common pretrained image neural networks (VGG16 [32], ResNet-50 [33]), but found that a DenseNet40-12 model performed best overall, aligning with previous reports [30]. The DenseNet architecture takes advantage of low-level feature maps at all layers, making it well-suited for medical imaging applications where low-level features (e.g., edge detectors) often carry substantial explanatory power.
For the sequence encoder, we used a Bidirectional Long Short-term Memory (LSTM) [34] sequence model with soft attention [35] to combine all MRI frame features. The soft attention layer optimizes the weighted mean of frame features, allowing the network to automatically give more weight to the most informative frames in an MRI sequence. We explored simpler feature pooling architectures (e.g., mean/max pooling), but each of these methods was outperformed by the LSTM in our experiments. The final hybrid CNN-LSTM architecture aligns with recent methods for state-of-the-art video classification [36,37] and 3D medical imaging [38].
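A minimal version of the attention-weighted mean over frame features is shown below; here the scoring vector w is given rather than learned, and the names are ours:

```python
import numpy as np

def soft_attention_pool(frames, w):
    """Soft attention pooling: score each frame feature vector with w,
    softmax the scores into weights, and return the weighted mean."""
    scores = frames @ w
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ frames, weights
```

In the full model, w is trained jointly with the LSTM, so informative frames (e.g., near peak flow) receive larger attention weights.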
The CNN-LSTM model is trained using a noise-aware binary cross entropy loss:

L = (1/N) Σ_{i=1}^{N} E_{ŷ∼Ŷ_i}[ −ŷ log h_θ(x_i) − (1 − ŷ) log(1 − h_θ(x_i)) ]

This is analogous to a standard supervised learning loss, except we are now minimizing the expected value with respect to Ŷ [9]. This loss enables the discriminative model to take advantage of the more informative probabilistic labels produced by the generative model, i.e., training instances with higher probability have more impact on the learned model. Fig 3 shows the complete discriminative model pipeline.
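Concretely, the expectation over a binary probabilistic label reduces to binary cross entropy with a soft target, as in this numpy sketch (the paper's model implements the equivalent loss in PyTorch):

```python
import numpy as np

def noise_aware_bce(pred, y_prob, eps=1e-7):
    """Expected binary cross entropy under probabilistic labels y_prob:
    E_{y ~ y_prob}[BCE(pred, y)] = -(y_prob*log(pred) + (1-y_prob)*log(1-pred)).
    With hard labels (y_prob in {0, 1}) this reduces to standard BCE."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_prob * np.log(pred)
                           + (1 - y_prob) * np.log(1 - pred))))
```

Because the target is the label probability itself, confident labels (near 0 or 1) pull the model harder than uncertain ones near 0.5.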
Training and hyperparameter tuning

We trained models with and without data augmentation, a standard technique for improving the performance of state-of-the-art image classifiers [39]. We tested models using all 3 MRI series (CINE, MAG, VENC with a single series per channel) and models using only the MAG series. The MAG series performed best, so we only report those results here.
Hyperparameters were tuned for L2 penalty, dropout, learning rate, and the feature vector size of our last hidden layer before classification. Augmentation hyperparameters were tuned to determine final translation, rotation, and scaling ranges. All models use validation-based early stopping with F1 score as the stopping criterion.
The probability threshold for classification was tuned using the validation set for each run to address known calibration issues when using deep learning models [40]. Architectures were tuned using a random grid search over 10 models for non-augmented data and 24 for augmented data.

Code Availability
All code used in this study was written in Python v2.7. Deep learning models were implemented using PyTorch.

Evaluation metrics
Classification models were evaluated using positive predictive value (precision), sensitivity (recall), F1 score (i.e., the harmonic mean of precision and recall), and area under the ROC curve (AUROC). Due to the extreme class imbalance of this task we also report discounted cumulative gain (DCG) to capture the overall ranking quality of model predictions [41]. Each BAV or TAV case was assigned a relevance weight r of 1 or 0, respectively, and test set patients were ranked by their predicted probabilities. DCG is computed as Σ_{i=1}^{n} r_i / log2(i + 1), where n is the total number of instances and i is the corresponding rank. This score is normalized by the DCG score of a perfect ranking (i.e., all true BAV cases in the top ranked results) to compute normalized DCG (NDCG) in the range [0.0, 1.0]. Higher NDCG scores indicate that the model does a better job of ranking BAV cases higher than TAV cases. All scores were computed using test set data, using the best performing models found during grid search, and reported as the mean and 95% confidence intervals of 5 different random model weight initializations.
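The NDCG metric can be implemented directly from this definition (a sketch; the function name is ours):

```python
import numpy as np

def ndcg(y_true, scores):
    """Normalized discounted cumulative gain with binary relevance:
    DCG = sum_i r_i / log2(i + 1) over the score-ranked list, divided
    by the DCG of a perfect ranking."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    gains = np.asarray(y_true, dtype=float)[order]
    discounts = np.log2(np.arange(2, len(gains) + 2))  # log2(i + 1), i >= 1
    dcg = (gains / discounts).sum()
    ideal = (np.sort(np.asarray(y_true, dtype=float))[::-1] / discounts).sum()
    return float(dcg / ideal) if ideal > 0 else 0.0
```

A model that ranks every true BAV case above every TAV case scores 1.0; pushing BAV cases down the ranking discounts their contribution logarithmically.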
For labeling functions, we report two additional metrics: coverage (Eq. 3), a measure of how many data points a labeling function votes {−1, 1} on; and conflict (Eq. 4), the percentage of data points where a labeling function disagrees with one or more other labeling functions.
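Since Eq. 3 and Eq. 4 are not reproduced in this excerpt, the sketch below reflects our reading of those definitions on a label matrix over {-1, 0, +1}:

```python
import numpy as np

def coverage(L):
    """Fraction of data points on which each labeling function casts a
    non-abstain {-1, +1} vote."""
    return (L != 0).mean(axis=0)

def conflict(L):
    """Fraction of data points where each labeling function's non-abstain
    vote disagrees with at least one other non-abstain vote."""
    n, m = L.shape
    out = np.zeros(m)
    for j in range(m):
        others = np.delete(L, j, axis=1)
        disagrees = ((others != 0) & (others != L[:, [j]])).any(axis=1)
        out[j] = ((L[:, j] != 0) & disagrees).mean()
    return out
```

High coverage with low conflict suggests a reliable labeling function; high conflict flags functions whose votes the generative model must down-weight.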

Model evaluation with clinical outcomes data
To construct a real-world validation strategy dependent upon the accuracy of image classification but completely independent of the imaging data input, we used model-derived classifications (TAV vs. BAV) as a predictor of validated cardiovascular outcomes using standard epidemiological methods. For 9,230 patients with prospectively obtained MRI who were excluded from any aspect of model construction, validation, or testing, we performed an ensemble classification with the best performing CNN-LSTM model.
For evaluation we assembled a standard composite outcome of major adverse cardiovascular events (MACE).
Phenotypes for MACE were inclusive of the first occurrence of coronary artery disease (myocardial infarction, percutaneous coronary intervention, coronary artery bypass grafting), ischemic stroke (inclusive of transient ischemic attack), heart failure, or atrial fibrillation. These were defined using ICD-9, ICD-10, and OPCS-4 codes from available hospital encounter, death registry, and self-reported survey data of all 500,000 participants of the UK Biobank at enrollment similar to previously reported methods [42].
Starting 10 years prior to enrollment in the study, median follow-up time for the participants with MRI data included in the analysis was 19 years, with a maximum of 22 years. For survival analysis, we employed the "survival" and "survminer" packages in R version 3.4.3, using aortic valve classification as the predictor and time-to-MACE as the outcome, with model evaluation by a simple log-rank test.
To verify the accuracy of the CNN-LSTM's predicted labels, 36 MRI sequences (18 TAV and 18 BAV patients) were selected randomly for review by a single annotator (JRP). The output of the last hidden layer was visualized using a t-distributed stochastic neighbor embedding (t-SNE) [43] plot to assist error analysis.

Related Work
In medical imaging, weak supervision refers to a broad range of techniques using limited, indirect, or noisy labels.
Multiple instance learning (MIL) is one common weak supervision approach in medical images [44]. MIL approaches assume a label is defined over a bag of unlabeled instances, such as an image-level label being used to supervise a segmentation task. Xu et al. [45] simultaneously perform binary classification and segmentation for histopathology images using a variant of MIL, where image-level labels are used to supervise both image classification and a segmentation subtask. ChestX-ray8 [19] was used in Li et al. [46] to jointly perform classification and localization using a small number of weakly labeled examples. Patient radiology reports and other medical record data are frequently used to generate noisy labels for imaging tasks [19,47-49].
Weak supervision shares similarities with semi-supervised learning [50], which enables training models using a small labeled dataset combined with large, unlabeled data. The primary difference is how the structure of unlabeled data is specified in the model. In semi-supervised learning, we make smoothness assumptions and extract insights on structure directly from unlabeled data using task-agnostic properties such as distance metrics and entropy constraints [51]. Weak supervision, in contrast, relies on directly injecting domain knowledge into the model to incorporate the underlying structure of unlabeled data. In many cases, these sources of domain knowledge are readily available in existing knowledge bases, indirectly-labeled data like patient notes, or weak classification models and heuristics.

Baseline models
We compare our weakly supervised models against two traditionally supervised baselines using identical CNN-LSTM architectures: (1) expert labels alone and (2) expert labels with data augmentation. In these experiments, the expert labeled development set was used as the training set. Due to class imbalance (6:100), training data was rebalanced by oversampling BAV cases with replacement.
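The oversampling step might look like the following (a sketch; the function name and seeding are ours):

```python
import numpy as np

def rebalance(X, y, seed=0):
    """Oversample the minority class with replacement until both classes
    have equal counts, mirroring the rebalancing used for the baselines."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    idx = np.concatenate(
        [majority, rng.choice(minority, size=len(majority), replace=True)])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

With a 6:100 imbalance, each BAV case appears many times per epoch, which increases its effective weight in the loss without discarding TAV examples.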

Weak supervision performance at scale
We evaluate the impact of training set size on weak supervision performance. These models are trained using only weakly labeled training data, i.e., no hand-labeled MRIs. All probabilistic labels are split into positive and negative bins using a threshold of 0.5 and sampled uniformly at random with replacement to create balanced training sets, e.g., sampling 50 BAV and 50 TAV for a training set size of 100. We evaluated a range of balanced training set sizes up to 4,239 weak labels.

Models trained with 4,239 weak labels and augmentation performed best overall, matching or exceeding all metrics compared to the best performing baseline model (expert labels with augmentation). The best weak supervision model had a 64% improvement in mean F1 score (37.8 to 61.4) and 128% higher mean precision (30.7 to 70.0). This model also had higher mean area under the ROC curve (AUROC) (+13%) and normalized discounted cumulative gain (NDCG) (+57%) scores. In Table 1, we report baseline model performance and the best weakly supervised model performance.

Labeling Function Scores

Table 2 shows individual labeling function performance on test data, where metrics were computed per-frame.
Precision, recall, and F1 scores were calculated by counting abstain votes as TAV labels, reflecting a strong prior on TAV cases. Individually, each function was a very weak classifier with poor precision (0 to 25.0) and recall (0 to 69.1), as well as mixed coverage (9.8% to 90%) and substantial conflict with other labeling functions (8% to 41.7%).
Note that labeling functions provide both negative and positive class supervision, and sometimes performed best with a specific class, e.g., LF Intensity targets negative (TAV) cases while LF Perimeter targets positive (BAV) cases.

Individuals classified as BAV by our model showed an increased risk of a major adverse cardiac event (Fig 5), consistent with prior knowledge of the co-incidence of BAV with comorbid cardiovascular disease [52,53]. In a linear model adjusted for age, sex, smoking, hyperlipidemia, diabetes, and BMI, individuals with model-classified BAV displayed a 2.5 mmHg increase in systolic blood pressure (p < 0.001).

Discussion
In this work we present the first deep learning model for classifying BAV from phase-contrast MRI sequences.
These results were obtained using models requiring only a small amount of labeled data, combined with a large, imperfectly labeled training set generated via weak supervision. The success of this weak supervision paradigm, especially for a classification task with substantial class-imbalance such as BAV, represents a critical first step in the larger goal of automatically labeling unstructured medical imaging from large datasets like the UK Biobank.
For medical applications of machine learning as described here, we propose an additional standard of validation: that the model not only captures abnormal valve morphology, but, more importantly, that the captured information is of real-world medical relevance. In our model, BAV individuals showed a 1.8-fold increase in risk for comorbid cardiovascular disease.
The current availability of large unstructured medical imaging datasets is unprecedented in the history of biomedical research, but the use of data on cardiac morphology derived from medical imaging depends upon their integration into genetic and epidemiological studies. For most aspects of cardiac structure and function, the computational tools used to perform clinical measurements require the input or supervision of an experienced user, typically a cardiologist, radiologist, or technician. Large datasets exploring cardiovascular health such as MESA and GenTAC which both include imaging data have been limited by the scarcity of expert clinical input in labeling and extracting relevant information [54,55]. Our approach provides a scalable method to accurately and automatically label such high value datasets.
Automated classification of imaging data represents the future of imaging research. Weakly supervised deep learning tools may allow imaging datasets from different institutions, interpreted by different clinicians, to be uniformly ascertained, combined, and analyzed at unprecedented scale, a process termed harmonization. Independent of any specific research or clinical application, new machine learning tools for analyzing and harmonizing imaging data collected for different purposes will be the critical link that enables large-scale studies to connect anatomical and phenotypic data to genomic information and health-related outcomes. For the purposes of research, such as genome-wide association studies, higher precision (positive predictive value) is more important for identifying cases. Conversely, in a clinical application, the flagging of all possible cases of BAV for manual review by a clinician would be facilitated by a more sensitive threshold. The model presented here can be tuned to target either application setting.
Our analytical framework and models have limitations. Estimation of the true prevalence of uncommon conditions such as BAV and ascertainment of outcomes within a given population is complicated by classical biases in population health science. Registries of BAV typically enroll patients only with clinically apparent manifestations or treatment for disease which may not account for patients who do not come to medical attention.
Estimates derived from population-based surveillance are usually limited to relatively small numbers of participants due to the cost and difficulty of prospective imaging, and small cohort sizes impede accurate estimates for rare conditions such as BAV. Age and predisposition to research participation may also affect estimates of disease prevalence, a documented phenomenon within the UK Biobank [56]. Mortality from BAV is accrued cumulatively over time, thus studies of older participants are missing individuals with severe disease who may have died or been unable to participate.
Conversely, calcific aortic valve disease, which increases in incidence with age, may result in an acquired form of aortic stenosis difficult to distinguish from BAV by cardiac flow imaging [57]. Given that the 6.2% of individuals receiving a model classification of BAV exceeds previous population estimates of BAV prevalence (0.5 to 2%), some proportion of BAV-classified individuals almost certainly have age-related calcific aortic valve disease. Additional scrutiny of model-classified BAV cases shows that the model fails to differentiate regurgitation of the aortic valve from turbulent blood flow through an aortic valve with a normal circular or symmetrically triangular appearing orifice (Fig. 6). Thus, even for the current best-performing model, which displays good predictive characteristics for a class-imbalanced problem, misclassification events attributable to discrete failure modes are evident and should be addressed in subsequent iterations of the model.

Conclusion
This work demonstrates how weak supervision can be used to train a state-of-the-art deep learning model for BAV classification using unlabeled MRI sequences. Using domain heuristics encoded as functions to programmatically generate large-scale, imperfect training data provided substantial improvements in classification performance over models trained on hand-labeled data alone. Transforming domain insights into labeling functions instead of static labels mitigates some of the challenges inherent in the domain of medical imaging, such as extreme class imbalance, limited training data, and scarcity of expert input. Most importantly, our BAV classifier successfully identified individuals at long-term risk for cardiovascular disease, demonstrating the real-world relevance of imaging models built using weak supervision techniques.
Supporting information

S1 Videos. Example MRI videos. BAV and TAV subject videos in CINE, MAG, and VENC encodings.

S1 Appendix. Aorta localization and cardiac cycle alignment. Detailed overview of MRI preprocessing.

S1 Table. Complete Labeling Function Implementations. Labeling functions are restricted to less ambiguous features near the peak frame.