Laryngopharyngeal reflux image quantization and analysis of its severity

Laryngopharyngeal reflux (LPR) is a prevalent disease affecting a high proportion of patients seeking laryngology consultation. Diagnosis is made subjectively based on history, symptoms, and endoscopic assessment. The results depend on the examiner's interpretation of endoscopic images. There are still no consistent objective diagnostic methods. The aim of this study is to use image processing techniques to quantize the laryngeal variation caused by LPR, to judge and analyze its severity. This study proposed methods of screening sharp images automatically from laryngeal endoscopic images and using throat eigen structure for automatic region segmentation. The proposed image compensation improved the illumination problems from the use of laryngoscope lens. Fisher linear discriminant was used to find out features and classification performance while support vector machine was used as the classifier for judging LPR. Evaluation results were 97.16% accuracy, 98.11% sensitivity, and 3.77% false positive rate. To evaluate the severity, quantized data of the laryngeal variation was used. LPR images were combined with reflux symptom index score chart, and severity was graded using a neural network. The results indicated 96.08% accuracy. The experiment indicated that laryngeal variation induced by LPR could be quantized by using image processing techniques to assist in diagnosing and treating LPR.

Selecting the appropriate image. This study proposed screening the sharpest image automatically from dynamic laryngeal endoscopic video using sharp contour of the sharp image. Four functions for judging sharpness were tested: the variance, the sum-modulus-difference, the gradient magnitude maximization and the energy of the Laplacian of the image. See supplementary material for method details. The variance reflects the image's gray level variation. Among endoscopic images, blurred and distorted images often have blurred contours. The more blurred the contour is, the smaller the variance. The sum-modulus-difference (SMD) uses the first derivative action as a high pass filter to extract the high frequency signals from the image. In terms of the endoscopic image calculation, the gray level difference between two adjacent pixels is calculated and the horizontal (SMD x ) and vertical (SMD y ) directions are processed. The sharp image is obtained by calculating the sum of SMD x and SMD y , and the endoscope image with the maximum value is selected. The gradient magnitude maximization uses Sobel operation and uses the first derivative to process images. Two masks are used in the endoscopic image for calculation. One is the x-direction and the other is the y-direction. The total gradient is calculated and the endoscopic image in which the maximum value occurs is selected. The energy of the Laplacian of the image uses the second differentiation of the image intensity function as a high pass filter. Laplace's operation for searching the larynx boundary in the endoscopic image is used for the second differentiation. The approximately discretized Laplace's operation kernel is used as a high pass filter.
Automatic segmentation method. Some Refs. 11,13,19,20 have performed region segmentation of the larynx for follow-up analysis. Pribuišienė et al. 21 indicated that the most distinctive signs of the LPR are the mucous membrane damage of the true vocal cords and arytenoid cartilage. This region can be displayed completely during shooting to avoid inconsistent feature comparison. This study used automatic segmentation to segment the left and right vocal cords, arytenoid cartilage and glottis. In comparison to manual segmentation, automatic segmentation saves time and is free of subjective cognition.
The glottis is a relatively dark region in the image, therefore the gray threshold was used for segmentation 22 . In order to highlight the differences in brightness, keep more details and avoid over-segmentation, this study calculated the gray level of the full image. The lower bound of the threshold of the image was calculated. This value was taken as binary threshold. Active contour method (ACM) 23,24 was used since the boundaries of the vocal cords, arytenoid cartilage, and peripheral tissues can be blurred sometimes. Its advantage is that a continuous closed segmentation boundary can be obtained even there is noise. In this method, the target object's edge detection problem is changed into an energy minimization framework. The purpose is to compute the deformation curve which can minimize the sum of the internal and external energy in an energy field. The internal energy aims to normalize the curve shape and the external energy aims to approach the target object edge and converge to the target object boundary to reconstruct the complete contour of the target object. Let s be the parametric curve such that v(s) = (x(s), y(s)), where x and y are the given curvilinear coordinates; the total energy function E snake is defined in Eq. (1): www.nature.com/scientificreports/ E internal aims to normalize the stretching and bending of the curve shape and represents the internal energy. E image is the image energy deduced from the image information, E con is the constraint energy, and the total energy of E image and E con is E external . E internal is defined as Eq. (2): where, |v s (s)| 2 is the curve elasticity, |v ss (s)| 2 is the curvature of the parametric curve, α(s) controls the stretching, β(s) controls the smootheness of the curve, and E internal makes the curve contract inward continuously and remain smooth. E image is defined as Eq. (3): where I(x,y) is the gradient of image I at (x,y) and the E external makes the curve approach the target object contour continuously until they are coincident. E snake minimization must be coincident, as expressed in Eq. (4): For the physical motion of the snake, E internal and E external will be balanced eventually, as expressed in Eq. (5): The target object can be approached effectively by the continuous initialization of the aforesaid snake model. When E snake reaches total energy minimization, the effect is reached. The parametric curve v(s) is the final contour of object.
Hue feature. A laryngeal endoscopic image is a color image and is directly applicable to the RGB (red, green, blue) color space. The RGB color space has rapid calculation and does not need to calculate coordinate conversions. Its defect is that the detection result will be affected by the environment and light source. In order to circumvent this issue, this study propose the use of two color spaces which are free from the effect of brightness: the HSV (hue, saturation, value) color space and the YC b C r color space 25 . In the HSV color space, H = 0° represents red, H = 120° represents green and H = 240° represents blue. S is 0-1, and the image is a gray level image when S = 0. V represents brightness, in which V = 0 represents black and V = 1 represents white. The main components of the YC b C r color space are the brightness (Y) and two chromaticies (C b , C r ). Y is the gray level of the gray scale image converted from a color image. The brightness separability is very high and is favorable for adjusting different chromaticity components. C b is the blue chromaticity component and C r is the red chromaticity component. The YC b C r color space can reduce the effect of brightness, hence it is used in image processing techniques. This study used chromaticity as a hue feature, and there were six chromatic values (R-G-B-H-C b -C r ) used as hue features. The left and right vocal cord, arytenoid cartilage, and the ratio of the vocal cords to arytenoid cartilage were analyzed. There were four region analysis with 24 hue features in all.

Textual features.
In terms of the perceptual experience of the human eye, the rough and directionality are the primary characteristics used by the human eye to distinguish texture. The Gray-level Co-occurrence Matrix (GLCM) 26,27 describes the grayness relationship between adjacent pixels in a local area or overall area of an image. To quantize the laryngeal variation induced by LPR, this study used the equalization, contrast, correlation and homogeneity of GLCM to describe the texture information of various regions of larynx. The angle was set as 0° to analyze the features of LPR. Normalization was performed before the GLCM eigenvalue were extracted and the sum of the elements of GLCM was set as 1 for computing. The eigenvalues used are discussed below.
Equalization (E) 28 . This eigenvalue is known as the energy, which was used in this study to measure the consistency and equalization of the gray level distribution in each region of an image. Consistency and equalization refer to the probability of the occurrence of a pixel pair and a higher probability of recurrence represents higher consistency and equalization. The range of (E) of GLCM was [0, 1]. It reaches the maximum value (E = 1) when the gray levels of the image were identical. Contrast (Con) 28 . This eigenvalue was used to measure the intensity contrast between adjacent pixels in each region of the image. A larger gray level difference between adjacent pixels represent a larger Con value of GLCM. In a k × k GLCM, the range of Con would be [0, (k − 1) 2 ].
The correlation (Cor). This one was used to measure the correlation between adjacent pixels in each region.
Homogeneity (Hom). It was used to measure the local grayness homogeneity in each region. If the local grayness homogeneity of the image was uniform, the Hom value would be large.
Classifiers. The features of various regions were extracted and classified to identify LPR. SVM 29 has a good training and classification execution speed and will separate the hyperplane during computing. This study extracted the hue and textural features of various regions of the larynx to accurately identify LPR. SVM was effective for dividing samples into negative and positive. An ANN has strong nonlinear fitting capability, strong noisy data tolerance and is characterized by multiple entry/exit features. The LPR severity analysis required www.nature.com/scientificreports/ multiple classification results and the ANN met the requirements of this study for strong error tolerance and multiple entry/exit characteristics. SVM 30,31 is a binary linear classifier that is used to find a hyper-plane in a space so that two classes of data can be separated. It is used to find a zone with the maximum boundary in two different classes of data. The hyper-plane is called the optimal hyper-plane and the class of the unknown data is determined according to the data position in it. The back-propagation neural network (BPNN) 32-34 has a forward recurrent learning ability network. This study used BPNN as a classifier to analyze the LPR severity because of its higher error tolerance and multiple entry/exit characteristics. The architecture was comprised of three layers; input layer, hidden layer and output layer as shown in supplementary Fig. 1.

Results and validation
Clear laryngeal endoscopic images were searched for automatic segmentation and feature analysis. The automatically-segmented regions included left and right vocal cords, and the arytenoid cartilage, and the hue and textural features of various regions were analyzed. LPR can be diagnosed by using image processing techniques to analyze larynx image features. This study proposed using the quantized data of variations of the throat induced by LPR to analyze its severity and provide doctors medical advice and assistance with patient diagnosis. The detection system was divided into five parts as shown in Fig. 1.
Image selection. Feature filtration. Non-throat images were filtered using red component. Blurred images were eliminated using the rapid execution of variance. Images with a red component less than 0.4 were filtered out. The average value of variance of the remaining images was calculated and images with a lower than average value of the variance were removed.
Glottis structure discrimination. The glottis is a relatively dark region in the image, as shown in Fig. 2A. The lower bound of threshold is used as the image binarization, as shown in Fig. 2B. The binary image is shown in . This study used glottis structure condition for screening. The conditions were: area, centroid position and aspect ratio. Regions that were too small were excluded, regions where the glottis was not in the center were eliminated, and since glottis shape is an inverted triangle, the east-west regions were filtered out using aspect ratio. The glottis segmentation is shown in Fig. 2D.
Sharp image. After the glottis is screened out, the sharp image is screened out of multiple images, because most images are blurred as the patient or lens moves, i.e. insufficient sharpness. The object boundary in the blurred image is quite blurred; on the contrary, the object boundary in the sharp image is relatively clear, as shown in Fig. 3. The Fig. 3A image is obviously blurred due to the movement of arytenoid cartilage (arrowhead); the Fig. 3B is a blurred image as the lens is unfocused; Fig. 3C is a sharp image. This study tested and compared 10 sample and used four functions for judging the sharpness: the variance, the sum-modulus-difference, the gradient magnitude maximization, and the energy of the Laplacian of the image. It is obvious that the red frame  www.nature.com/scientificreports/ images are relatively sharp. This study used sum-modulus-difference to search for the sharpest throat image, as this method had better execution speed and good precision as shown in Fig. 4.

Image compensation.
In order to allow similar color ranges of the laryngeal endoscopic images, histogram shifting was used to unify the average gray level of all images to the same value. Based on the difference between the average brightness value of the throat image and the set value, each pixel was increased or decreased to achieve the set average value. The formula is converted as follows: I(x,y) = O(x,y) + (S -Om), where (x, y) is the image coordinate, I is the grayscale value of the image after translation, O is the grayscale value of the current image, S is the set value, and Om is the grayscale value of the image before translation. The RGB was turned into YC b C r space where the Y channel value was shifted to 125. The image was turned back to RGB space to enable consistent standard for the subsequent hue features. The proposed method could avoid the light source problem resulting from the laryngoscope lens position and allow the segmentation to be more correct. Figure 5 shows the image compensation.
Region segmentation. Arytenoid cartilage segmentation. The image was binarized, relatively bright regions including the arytenoid cartilage were turned into white and the binary image was morphologically eroded. The largest region obtained by the notation was regarded as the arytenoid cartilage candidate block and discrimination of the arytenoid cartilage structure was performed. Finally, the region of arytenoid cartilage was obtained using ACM. The study then used the arytenoid cartilage structure conditions for screening which were: area, centroid position and aspect ratio. Regions that were too small were eliminated, regions where the arytenoid cartilage was not in the upper part were eliminated, and since the arytenoid cartilage shape is rectangular, the south-north regions were filtered out using aspect ratio. The arytenoid cartilage segmentation is shown in Supplementary Fig. 2. www.nature.com/scientificreports/ Vocal cord segmentation. Both the vocal cords were presented as columnar features near the glottis and appear relatively brighter. This study used the vocal cord position for segmentation. First, the glottis was morphologically dilated and the edge image was taken. The glottis was transected using its centroid, and the valley was searched and slit as shown in Supplementary Fig. 3.
Adaptive vocal cord segmentation. Laryngeal endoscopic images do not have a fixed lens distance, therefore the vocal cord size varies, which means the number of iterations for the ACM cannot be fixed as shown in Fig. 6. In addition, since the left and right vocal cord size could differ, there would be obvious over-segmentation or under-segmentation under a fixed value. In order to solve this, the study proposed an adaptive method to determine the number of iterations. The growth range of the ACM was mitigated gradually as the gap between the vocal cords and the false vocal cords narrowed. The difference between the gray level variations in the range grown by this iteration and in the growth range of the last iteration was slight and could be described by calculating the entropy of the gray level variation. This is to say, when the difference in the gray level and entropy in the range grown by two iterations were slight, the entropy will have a minimum difference. If the growth range increased continuously, the growth range would exceed the vocal cord and the entropy difference between the range grown by this iteration and the growth range of the last iteration would increase. Using this characteristic, the entropy of the growth range was calculated during every iteration and subtracted from the last iteration to find out the number of iterations with the minimum entropy difference. The result was the optimum number of iterations of the vocal cord. The minimum number of iterations of the ACM was 13, the maximum was 41, and the entropy of the growth range of every iteration was calculated. The entropy difference between iterations was calculated and the minimum difference was found out. In this example, the minimum entropy was 21st iteration and the entropy difference was 0.0043. As the minimum number of iterations of the ACM in this study was 13, the minimum entropy difference was the 21st iteration, meaning the growth ranges of the 33rd, 34th iterations of the ACM were approximately saturated. The optimum number of iterations of the left and right vocal cord could be found out as shown in Supplementary Fig. 4.

Segmentation accuracy validation.
In order to determine the segmentation accuracy, dice similarity coefficient (DSC) 35,36 was used to calculate the segmentation accuracy corresponding to each sample. This study drew 10 samples randomly and the manual segmentation performed by three doctors was compared with the method  Feature selection. In order to detect LPR, different features of various regions were required, therefore hue and textural features were used. This study performed the Fisher linear discriminant of 76 LPR and non-LPR samples. 36 features of different color spaces and textures in four regions were tested and the Fisher linear discriminator was used to find out the features with classification performance for LPR. It was found that the C b channel and C r of the arytenoid cartilage, the R channel of the vocal cords, the energy of GLCM of the arytenoid cartilage, and the contrasted Fisher linear discriminant of the vocal cord GLCM were 0.6425, 0.6409, 0.5213, 0.6568, and 0.5241 respectively representing the highest among the 36 features. Table 2 shows the classification capacity of various features.
LPR analysis. In order to diagnose LPR, this study used SVM as a classifier. There were 352 research samples, including 106 LPR and 246 non-LPR samples. According to Choi and Choi 37 and Javaid et al. 38 , when the  www.nature.com/scientificreports/ sample ratio is unbalanced, the positive-negative sample ratio of 1:1 is recommended for K = 10 cross validation during classification evaluation. Therefore, this study used 106 LPR and non-LPR samples for validation, and the results of the classification was evaluated 10 times using cross validation.
LPR identification results analysis. The non-LPR samples were defined as positives and the LPR samples were defined as negative. The results showed that both LPR and non-LPR are True. Table 3 shows the classification results. This study used accuracy, sensitivity, and false positive rate to evaluate the classification results.
Severity classification. The severity was preliminarily classified into three levels according to the classification of the digestant, which was combined with RSI as the classification criteria. RSI scores of 13-20 were primary, 21-30 were intermediate, and 31-45 were severe. There were 106 samples in this study, including 38 primary samples, 54 intermediate, and 14 severe samples. Stratified cross validation was used for classification, and tenfold stratified cross validation was used to evaluate the results of classification. The BPNN architecture was divided into the input, hidden, and output layers. The input layer had five processing units representing five eigenvalues, respectively which were the C b and C r channels of the arytenoid cartilage, the energy of GLCM, the R channel of the vocal cords and the contrast of the GLCM. There was one hidden layer and four processing units, and the output layer judge the LPR. The parameters of the BPNN used in this study were tested continuously through trial and error. The final cycle index was 1,000 times, the learning rate was 0.65 and the momentum factor was 0.5. Stratified cross validation was performed for this group of parameters to evaluate the results of classification and the overall recognition rate was 96.48% as show in Table 4.

Discussion
Manual examination of LPR using a laryngoscope is subjective. In this case, giving treatment according to the symptoms without specific diagnostic evidence may result in medical and economic burdens 38 . Using image processing techniques to analyze the hue and texture of the laryngeal images is a more objective technique. In order to identify LPR accuracy, obtaining sharp images and uniform light source is a priority for analyzing images. In the study, the sharpest larynx image was found out by using the variance and the sum-modulus-difference in laryngoscopic images. The image compensation proposed in this study used histogram shifting to give a consistent brightness range and to prevent the gray level of the image from exceeding boundaries that failed to display. The most distinctive sign of LPR is the mucosal damage on the true vocal cords and the arytenoid cartilage. In this study, the arytenoid cartilage, glottis, left and right vocal cords were segmented automatically for analysis. This study used DSC for segmentation validation and the results proved that the proposed automatic segmentation was accurate and stable. Du et al. 19 and Witt et al. 20 mentioned changes in hue and texture but did not   (36 features) and used the Fisher linear discriminant to find the features with classification performance for LPR. It was found that the C b and the C r channels of the arytenoid cartilage, the R channel of the vocal cords, the energy of GLCM of the arytenoid cartilage, and the contrasted Fisher linear discriminant of the vocal cord GLCM were outstanding. Our results revealed that LPR could induce changes in the larynx which cannot be described by human eye specifically. The aforementioned five features were combined with SVM to distinguish LPR and non-LPR conditions. The LPR recognition accuracy of the proposed method was 97.16%, the sensitivity was 98.11% and the false positive rate was 3.77%, proving that LPR could be identified according to the hue and textural features. This study used the quantized data of laryngeal variation induced by LPR and RSI as the training and output samples of the BPNN. The five features and RSI were used as training samples of the BPNN and the overall recognition rate was 96.48%. The RSI evaluation method was subjective, but large amount of RSI information approached the objective results. The test results of the RSI samples indicated that the severity of LPR could be classified by quantized data of the laryngeal variations induced by LPR. The results could be used by doctors to provide medication suggestion for patients in real-time treatment.

Conclusion
This study proposed searching for sharp larynx images in videos taken by a laryngoscope, to solve the difficulty in capturing sharp images. In order to eliminate the light source problem resulting from an inconsistent laryngoscope lens position, histogram shifting was used to give samples a consistent gray level range for subsequent region segmentation and feature analysis. The automatic segmentation of the larynx segments was consistent across all samples, the subjective differences of the manual segmentation were reduced, and the manual augmentation time was saved. The five features with discriminability for LPR were combined with SVM, and the LPR recognition result had high precision. In terms of severity of LPR, the five features were combined with RSI for ANN training. The results showed that using the quantized data of LPR images to classify the severity could assist doctors in diagnosis.
Ethics approval. The research protocol (No.: 1-108-05-132) has been reviewed and approved by the Institutional Review Board of Tri-Service General Hospital.

Consent for publication.
All methods were performed in accordance with the relevant guidelines and regulations. All patients provided written informed consent prior to participation.

Data availability
The datasets generated from this study are available from the corresponding author on reasonable request.