A deep learning approach to automatic gingivitis screening based on classification and localization in RGB photos

Routine dental visits are the most common way to detect gingivitis. However, such diagnosis can be unavailable in areas with limited medical resources and too costly for low-income populations. This study proposes to screen for gingivitis and its irritants, i.e., dental calculus and soft deposits, from oral photos with a novel Multi-Task Learning convolutional neural network (CNN) model. The study can be meaningful for promoting public dental health, since it points to a cost-effective and ubiquitous solution for the early detection of dental issues. With 625 patients included in this study, the classification Area Under the Curve (AUC) for detecting gingivitis, dental calculus, and soft deposits was 87.11%, 80.11%, and 78.57%, respectively. Meanwhile, according to our experiments, the model can also localize the three types of findings on oral photos with moderate accuracy, which enables the model to explain the screening results. By comparing with general-purpose CNNs, we showed that our model significantly outperformed them on both classification and localization tasks, which indicates the effectiveness of Multi-Task Learning for dental disease detection. Overall, the study shows the potential of deep learning for enabling the screening of dental diseases among large populations.

a model taking pupil images as input 9 . Despite those works, the use of DL models for screening oral conditions remains largely under-explored.
There are also continuing efforts to enable automatic dental diagnosis with DL algorithms. For example, Joachim Krois 10 applied CNNs to detect periodontal bone loss (PBL) on panoramic dental radiographs, while Casalegno et al. performed caries segmentation from Near-Infrared-Light Transillumination (NILT) images 11 . Jae-Hong Lee evaluated the efficacy of deep CNN algorithms for the detection and diagnosis of dental caries on periapical radiographs 12 . Yu et al. also evaluated the performance of CNNs for skeletal classification with lateral cephalometry 13 . Unlike this previous work, our task takes photographic images as input, whose distribution can be less standardized than that of medical imaging.
In this work, we initiate the study of applying DL to oral photos for screening gingivitis, dental calculus, and soft deposits. We formulated the task as a mixture of dental condition classification and localization by considering the nature of the conditions, and developed a Multi-Task Learning model that solves the two types of tasks in an integrated way. We demonstrated the effectiveness of our model by comparing it with general-purpose CNNs and carrying out ablation tests. With the designed system, we hope to open the discussion on integrating deep learning into tools for improving public dental health.

Materials and methods
In this section, we first introduce the data collection protocols and the resulting dataset for the data-driven study, followed by a description of the data annotation process. Next, we present the problem formulation for the detection of gingivitis (and its irritants), as well as the proposed DL model architecture. Details on the model implementation and training are then provided. Finally, we describe the metrics and statistical analysis methods we applied for validating and justifying our model.

Acquisition of data.
A total of 3932 oral photos were captured from 625 patients admitted to the Department of Periodontics, Orthodontics and Endodontics, Nanjing Stomatological Hospital, Nanjing University, between January 2018 and December 2019. All the photos were captured by postgraduate dental students and dentists, and the patients' ages ranged from 14 to 60. The project was approved by the Ethical Review Board at Nanjing University (approval 2019NL-065(KS)). The methods were conducted in accordance with the approved guidelines, and written informed consent was obtained from each participant. For participants under the age of 18, written informed consent was obtained from their parents/guardians. To approximate the image quality of practical scenarios, the photos were collected with various equipment, including iPhone 8, iPhone 7, Samsung Galaxy S8, and Canon 6D. No specific inclusion or exclusion criteria for images, e.g. lighting and resolution, were applied. Three dental diseases were considered in this study: gingivitis, dental calculus, and soft deposits. In the dataset, 3175 photos show gingivitis, 921 show dental calculus, and 746 show soft deposits. Note that each photo can show none, one, or several types of conditions. All photos were pseudonymized, and no other image processing steps were performed.
We split the data into training, validation and testing subsets by randomly splitting the patients into three independent groups. All photos of each patient only existed in one of the three subsets. Table 1 shows the patient and photo split. Table 2 shows the distribution of photos with positive findings among the dataset and different subsets.
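The patient-level split described above can be illustrated with a minimal sketch. The function name `split_by_patient`, the subset fractions, and the random seed are assumptions made for illustration only; the actual subset sizes are those reported in Table 1.

```python
import random

def split_by_patient(photo_to_patient, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign photos to train/val/test so that no patient spans two subsets.

    photo_to_patient maps a photo identifier to a patient identifier.
    The fractions are hypothetical and apply to patients, not photos.
    """
    patients = sorted(set(photo_to_patient.values()))
    rng = random.Random(seed)
    rng.shuffle(patients)

    n_train = int(fractions[0] * len(patients))
    n_val = int(fractions[1] * len(patients))

    # Decide the subset once per patient.
    subset_of = {}
    for i, patient in enumerate(patients):
        if i < n_train:
            subset_of[patient] = "train"
        elif i < n_train + n_val:
            subset_of[patient] = "val"
        else:
            subset_of[patient] = "test"

    # Every photo follows its patient into that subset.
    split = {"train": [], "val": [], "test": []}
    for photo, patient in photo_to_patient.items():
        split[subset_of[patient]].append(photo)
    return split
```

Splitting on patients rather than photos prevents photos of the same mouth from appearing in both the training and testing subsets, which would otherwise inflate the measured performance.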
Ground truth annotations.
We collected reference annotations of the three dental conditions for all the photos from three board-certified dentists. Specifically, the dataset was evenly split and assigned to the three dentists. Each image was independently labeled by one of the dentists, who was given the clinical report, using the labeling software LabelBox (Labelbox, Inc, CA). For gingivitis and dental calculus, we collected bounding-box annotations indicating the locations of the diseases. Note that since diseases may have no well-defined boundaries in some cases, we followed a common approach 14 by instructing the dentists to focus on the correctness of box centers. Meanwhile, for soft deposits, we only collected image-level classification labels, since this condition mostly appears all over the cavity and its locations are labor-intensive to label.
Problem formulation and model architecture.
We formulated the problem as a mixture of object localization and image classification. Specifically, we developed a CNN with Multi-Task Learning (MTL) 15,16 to solve both tasks with a unified model in order to increase the model's generalization 17 . [...] LNet (Fig. 1c) regresses over the feature maps derived from FNet for a set of location vectors, where each vector y encodes one bounding box by its coordinates, height, width, and probability for gingivitis and dental calculus. Similar to Liu et al. 18 , the proposed bounding boxes are matched to the nearest ground-truth boxes during training to approximate localizations, and are filtered with Non-maximum Suppression (NMS) 19 during inference to reduce overlapping findings. CNet (Fig. 1d) performs fully connected operations over the extracted feature maps to output a length-1 vector, whose value represents the probability of the existence of soft deposits.
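For readers unfamiliar with the filtering step, the following is a minimal sketch of greedy Non-maximum Suppression. It is a generic textbook version, not the authors' implementation; the corner-based box format `(x1, y1, x2, y2)` and the IoU threshold of 0.5 are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, in order of decreasing score.

    A box is dropped when it overlaps an already kept, higher-scoring
    box by more than iou_threshold (an assumed value).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, two heavily overlapping boxes around the same gum region collapse into the higher-scoring one, while a distant box is kept as a separate finding.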
By optimizing the whole model, consisting of FNet, CNet, and LNet, end-to-end, we force FNet to learn representations that are effective for both classification and localization. Such a constraint may improve the generalization of the representations and reduce overfitting.
To help users comprehensively understand the diagnosis results, we aim to highlight the spatial locations of the detected dental conditions. For gingivitis and dental calculus, the bounding boxes from the model already localize the regions of interest (ROIs). However, for soft deposits, the model only produces classification results, since ground-truth location maps are labor-intensive to annotate. Thus, gradient-based class activation maps 20 were used to identify the image areas that are most indicative of the classification. In an example result from our system, the detected gingivitis and dental calculus are pinpointed with boxes, and soft deposits are indicated with a heat map, where a higher temperature indicates stronger relevance of a region. The whole model can be optimized end-to-end during training and produces both the classification and localization results in a single run during inference.
Implementation and training strategy.
To train the model, we defined the loss as an equally weighted sum of the smooth L1 loss for bounding box regression and the cross-entropy loss for classification 21 . We applied intensive augmentations to the input images 22 , including random shifts, crops, rotations, scaling, and color channel shifts (random changes of hue, saturation, and exposure). These augmentations are intended to increase the robustness of the model for in-the-wild application. Moreover, we employed transfer learning by initializing our FNet from VGG-16 23 , which was pre-trained on large-scale image recognition tasks, to speed up the training process 24 .
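The equally weighted loss described above can be sketched in plain Python for a single box and a single image-level label. This is an illustrative simplification: the function names, the transition point of 1.0 in the smooth L1 loss, the averaging over box coordinates, and the use of a single binary cross-entropy term are all assumptions, since the paper does not give these details.

```python
import math

def smooth_l1(pred, target):
    """Smooth L1 loss averaged over the coordinates of one box.

    Quadratic for small errors (|d| < 1), linear otherwise.
    """
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total / len(pred)

def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy between a predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def multitask_loss(box_pred, box_target, cls_prob, cls_label):
    """Equally weighted sum of the localization and classification terms."""
    return smooth_l1(box_pred, box_target) + binary_cross_entropy(cls_prob, cls_label)
```

Because both terms are backpropagated through the shared feature extractor, a gradient step has to reduce the box error and the classification error at the same time, which is the constraint the multi-task formulation relies on.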
The CNN model was developed using the PyTorch framework. The model was trained using a mini-batch size of 16 per GPU on three Nvidia 1080 Ti GPUs. The validation set was used to determine the early stopping of training.
Evaluation metrics and statistical analysis.
We evaluated the model from two aspects: (i) classification performance for telling whether a condition exists, and (ii) localization performance for indicating the image regions that are related to a diagnosis. For classification performance, we used the Receiver Operating Characteristic (ROC) curve, which plots the true-positive rate (TPR), or sensitivity, against the false-positive rate (FPR), or 1 − specificity, as a function of varying discrimination thresholds. The ROC curve illustrates the diagnostic ability of a binary classifier. For gingivitis and dental calculus, we took the highest probability among the detected bounding boxes as the classification probability of an image; for soft deposits, the output of the classification model is taken as the probability. To compare different models, the Area Under the Curve (AUC) was used as a numeric measure of class separability, where a higher value indicates better model performance.
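The image-level scoring rule and the AUC described above can be sketched as follows. The AUC is computed here through its rank interpretation (the probability that a randomly chosen positive image scores higher than a randomly chosen negative one), which equals the area under the ROC curve. The function names and the default score of 0.0 for an image without detected boxes are assumptions made for illustration.

```python
def image_score(box_probs):
    """Image-level probability: the highest box probability.

    Returns 0.0 when no box was detected (an assumed convention).
    """
    return max(box_probs, default=0.0)

def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For instance, with per-image box probabilities `[[0.9, 0.2], [0.4], [], [0.6]]` and labels `[1, 1, 0, 0]`, the image scores are 0.9, 0.4, 0.0, and 0.6, and three of the four positive-negative pairs are ranked correctly.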
For localization performance, we used the Free-Response ROC (FROC) curve, a commonly used graphical measure for medical anomaly detection [25][26][27][28][29] . In the FROC paradigm, a model is free to mark any number of clinically suspicious regions; a mark is a true positive if it is sufficiently close to an actual anomaly, and otherwise it is scored as a location-wise false positive. FROC measures the location-wise TPR against the average number of false-positive (FP) locations per image as a function of varying thresholds for box probabilities. Moreover, following the practice of van Ginneken 24 , a predicted box was counted as a hit if its center falls within a ground-truth box. To compare different models numerically, we followed Setio 28 and van Ginneken 24 and defined a Localization Performance Metric (LPM) as the average sensitivity at 1/2, 1, 2, and 3 false positives per image.
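The hit criterion and the LPM defined above can be sketched as follows. This is a simplified illustration rather than the authors' evaluation code: the data layout, the function names, and the rule of reading the FROC curve at each target rate as the best sensitivity reachable without exceeding that rate are assumptions.

```python
def center_hit(pred_box, gt_box):
    """True if the center of pred_box lies inside gt_box (x1, y1, x2, y2)."""
    cx = (pred_box[0] + pred_box[2]) / 2
    cy = (pred_box[1] + pred_box[3]) / 2
    return gt_box[0] <= cx <= gt_box[2] and gt_box[1] <= cy <= gt_box[3]

def froc_points(images):
    """Compute (false positives per image, sensitivity) per threshold.

    images is a list of (preds, gts) pairs, where preds is a list of
    (box, probability) and gts is a list of ground-truth boxes.
    """
    total_gt = sum(len(gts) for _, gts in images)
    if total_gt == 0:
        raise ValueError("FROC needs at least one ground-truth box")
    thresholds = sorted({p for preds, _ in images for _, p in preds}, reverse=True)
    points = []
    for t in thresholds:
        tp = fp = 0
        for preds, gts in images:
            found = set()  # ground-truth boxes hit at this threshold
            for box, prob in preds:
                if prob < t:
                    continue
                hits = [i for i, gt in enumerate(gts) if center_hit(box, gt)]
                if hits:
                    found.update(hits)
                else:
                    fp += 1
            tp += len(found)
        points.append((fp / len(images), tp / total_gt))
    return points

def lpm(points, fp_rates=(0.5, 1, 2, 3)):
    """Average sensitivity at the given false-positive rates per image."""
    sens = [max((s for f, s in points if f <= rate), default=0.0)
            for rate in fp_rates]
    return sum(sens) / len(sens)
```

Averaging over several false-positive rates summarizes the whole FROC curve in one number, so two models can be compared without choosing a single operating threshold.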
To measure the quality of soft deposit localization, we followed Selvaraju 20 by collecting agreement ratings from three board-certified dentists for each localization heat map of the testing images. Specifically, we showed the dentists the images in which soft deposits were detected together with the localization heat maps, visualized as in Fig. 1e. A rating was then given on a scale from 1 (strongly disagree) to 5 (strongly agree) according to whether, in the dentist's opinion, the heat map highlights the regions of the condition.

Results
Tables 3 and 4 show the classification and localization performance of different models, respectively. Compared to the general-purpose classification CNNs (VGG-16 21 and Residual-50 27 ) and localization CNNs (SSD 20 ), our model has the advantage of handling both types of tasks. The model achieved classification AUC (95% CI) of 87.11% (82.27% to 91.49%) for gingivitis, 80.11% (75.99% to 84.45%) for dental calculus, and 78.57% (74.32% to 82.78%) for soft deposits; meanwhile, the model achieved LPM (95% CI) of 58.19% (56.15% to 60.20%) and 49.39% (44.40% to 54.69%) for localizing gingivitis and dental calculus, respectively. All these scores were the highest among the compared methods, suggesting the effectiveness of the proposed system.
Table 3. Summary of classification performance of different models. Classification AUC (95% CI, in percentage), sensitivity (Sens.), and specificity (Specif.) are measured for gingivitis (GI), dental calculus (DC), and soft deposits (SD).
Additionally, we [...] Fig. 2a,b. The designed operating points of the model are shown with black diamonds and red dots.
Two types of operating points were designed following 30,31 : the high-specificity operating point, with a higher discrimination threshold that aims to reduce false positives, and the high-sensitivity operating point, with a lower discrimination threshold that keeps the miss rate low. The model achieved mean specificities of 83.87%, 83.61%, and 79.98% at the high-specificity operating points, and mean sensitivities of 87.83%, 77.79%, and 78.68% at the high-sensitivity operating points, for gingivitis, dental calculus, and soft deposits, respectively. For localization performance, the model achieved mean box-wise sensitivities of 66.57% and 45.61% for gingivitis and dental calculus, respectively, at the high-sensitivity operating point. Moreover, Fig. 2c shows the dentists' ratings of the attention-based localization for soft deposits. While the attention-based method has been widely applied to interpret CNNs 32,33 , it lacks formal evaluations of location-indicating accuracy for dental diagnosis. According to the experiment, our model achieved scores with a median of 3.00, a mean of 2.81, and a standard deviation of 1.02, on a scale from 1 to 5. Based on the raters' feedback, the following two factors can lead to lower localization scores. First, unlike bounding boxes, which pinpoint exact locations, the heat maps can only outline larger areas. Second, the model attention often localizes only part of the related regions. This can be explained by the fact that the model does not rely on all regions to reach a classification result 20,32 . Figure 3 (created with Matplotlib v3.2.1, https://matplotlib.org/) depicts selected results obtained on the testing images for a qualitative overview of our model's performance. Ground-truth annotations, heat-map predictions, and bounding box predictions are shown in the left, middle, and right columns, respectively.
The examples show that the model can tell the existence of dental conditions accurately and localize them with acceptable accuracy. By looking into the outputs, we found that over-exposure, under-exposure, and incorrect focus of photos can lead to wrong predictions.

Discussion and future work
Previous studies have explored predicting gum health with self-reported questionnaires 34,35 . Their results have shown that several self-reported measures and risk factors are strongly related to the presence of gum diseases. Different from those works, we aim to predict gingivitis as well as its irritants as early indicators from oral photos by learning their common appearance patterns. Such visual signals can reflect dental diseases more directly than questionnaire feedback. Moreover, the method is promising since oral photos can be collected with smartphones, which have become increasingly low-cost and ubiquitous. Our work pioneers this approach by designing, training, and validating a deep learning model for the task. Building on the detection results of deep learning, future systems can be developed to show users targeted health-enhancing activities, proper hygiene routines, and clinical treatments, which will be meaningful for promoting public dental health.
Considering that the users of such a system may have limited knowledge about dental health, our model shows not only the existence of dental conditions but also their locations. The localization can help users better understand the screening results and, through the increased explainability, help build users' trust in the system 36 . We formulated the localization of gingivitis and dental calculus as bounding box regression, considering the appearance of the conditions and the savings in labour cost. For soft deposits, we formulated the task as image-wise classification and inferred the locations from the model's attention.
To improve the system efficiency, we employed Multi-Task Learning, such that both types of tasks, i.e. classification and localization, can be solved with one integrated CNN model. Moreover, our experiments indicated that our model also outperformed in accuracy the state-of-the-art CNNs that carry out a single type of task, mainly because the co-optimization of multiple tasks increases the model's generalization. We further confirmed this with ablation tests, where the model with MTL showed significant accuracy improvements compared with its subnets that were trained solely for classification or localization. We believe the findings can help with CNN design for other dental diagnosis tasks with multiple goals.
Our work still has several shortcomings, and we discuss possible solutions for future research. First, our dataset is limited in the sense that the photos were collected from a single organization, and it currently only covers the age range from 14 to 60. We plan to further enrich the dataset with a wider age range from multiple sites globally. Second, our model achieved a relatively low accuracy on soft deposits for the localization task, partially due to the lack of spatial annotations to guide training supervision. Instead of collecting pixel-wise segmentation maps, which can be extremely labour-intensive, we suggest that future studies could apply recently proposed weakly-supervised learning to train with low-quality spatial annotations, e.g. partial labels over images that indicate several typical areas of diseases 37 . Moreover, the model could also benefit from semi-supervised learning by augmenting a part of the dataset with pixel-wise labels 38,39 , while the other part only comes with image-wise labels. Third, the algorithm can also be complementary to traditional questionnaire-based detection for higher reliability and accuracy.
The current model cannot use data modalities other than images for diagnosis. Encoding 40 and fusing a patient's medical history and self-reported symptoms into CNNs could be a promising way to improve the model accuracy 41 .

Conclusion
In this study, a deep learning model for the detection of gingivitis, dental calculus, and soft deposits from oral photos was proposed. We formulated the model with Multi-Task Learning, which effectively improves its compactness and accuracy. We evaluated our model on both classification and localization tasks. Based on the results, we show that deep learning is promising for enabling the cost-effective screening of dental diseases among large populations from oral photos, which can be captured with smartphones and other commonly available devices. Built upon the deep learning model, systems can be developed to provide user-specific health-enhancing activities according to one's dental conditions, which is promising for improving public dental health. Our work also discusses possible improvements in data quantity and model architecture.

Data availability
The data used in the current study were collected from the Medical School of Nanjing University and are available only for the granted research. However, the data can be made available upon request, within data protection and regulation guidelines.