Ageing and degeneration analysis using ageing-related dynamic attention on lateral cephalometric radiographs

With the ageing of the world's population, studies of the ageing and degeneration of physiological characteristics in human skin, bones, and muscles have become important topics. Research on the ageing of bones, especially the skull, has received much attention in recent years. In this study, a novel deep learning method representing ageing-related dynamic attention (ARDA) is proposed. The proposed method can quantitatively display the ageing salience of the bones and their patterns of change with age on lateral cephalometric radiograph (LCR) images containing the craniofacial region and cervical spine. An age estimation-based deep learning model is trained on 14,142 LCR images from individuals aged 4 to 40 years to extract ageing-related features, and from these features ageing salience maps are generated by the Grad-CAM method. All ageing salience maps of the same age are merged into an ARDA map corresponding to that age. The ageing salience maps show that ARDA is mainly concentrated in three regions of LCR images: the teeth, craniofacial, and cervical spine regions. Furthermore, the dynamic distribution of ARDA at different ages and instances in LCR images is quantitatively analyzed. The experimental results on 3,014 cases show that ARDA can accurately reflect the development and degeneration patterns in LCR images.


Supplementary
Fitted regression models between the true age and the age predicted by different methods. a baseline model, b ARDA-constrained, c retest, d teeth region, e craniofacial region, f cervical spine region, g teeth region + craniofacial region, h teeth region + cervical spine region, i craniofacial region + cervical spine region.

Supplementary Note 1
The Squeeze-and-Excitation (SE) [1] module is an architectural unit embedded in MBConv [2][3], which is the main block of EfficientNet [4]. In EfficientNet, the SE module follows the feature map $U \in \mathbb{R}^{C \times H \times W}$ with $C$ channels and $H \times W$ spatial size obtained by convolution. Firstly, a statistic $z \in \mathbb{R}^{C}$ is generated by computing the global average of each channel of $U$, where the $c$-th element of $z$ is calculated by:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j), \qquad (1)$$

where $u_c(i, j)$ is the value at channel $c$ and position $(i, j)$ of the feature map $U$. Then the scale vector $s \in \mathbb{R}^{C}$ is obtained by:

$$s = \sigma\left(W_2\, \delta(W_1 z)\right), \qquad (2)$$

where $\sigma(\cdot)$ and $\delta(\cdot)$ are the sigmoid and ReLU functions, respectively, and $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are the learnable parameters. Finally, the feature map is scaled channel-wise by $s$ and fed into the next convolution layer:

$$\tilde{u}_c = s_c \cdot u_c. \qquad (3)$$

EfficientNet uses a compound coefficient $\phi$ to uniformly scale the network width, depth, and resolution in a principled way:

$$\text{depth: } d = \alpha^{\phi}, \quad \text{width: } w = \beta^{\phi}, \quad \text{resolution: } r = \gamma^{\phi}, \qquad (4)$$
$$\text{s.t. } \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, \quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1,$$

where $\alpha$, $\beta$, and $\gamma$ are constants that can be determined by a small grid search. $\phi$ is a coefficient that describes how many more resources are available for model scaling, while $\alpha$, $\beta$, and $\gamma$ specify how to assign these extra resources to the depth, width, and resolution of the network, respectively.
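The squeeze, excitation, and scaling steps above can be sketched as a single forward pass. This is a minimal NumPy illustration, not the EfficientNet implementation: the weights `W1` and `W2` are random stand-ins for the learned parameters, and the toy sizes ($C = 4$, reduction ratio $r = 2$) are chosen only for the example.

```python
import numpy as np

def squeeze_excite(U, W1, W2):
    """Forward pass of a Squeeze-and-Excitation block (inference sketch).

    U  : feature map of shape (C, H, W)
    W1 : reduction weights, shape (C // r, C)
    W2 : expansion weights, shape (C, C // r)
    """
    # Squeeze: global average pooling per channel -> z in R^C
    z = U.mean(axis=(1, 2))
    # Excitation: s = sigmoid(W2 @ relu(W1 @ z))
    hidden = np.maximum(W1 @ z, 0.0)            # delta(W1 z), ReLU
    s = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))    # sigma(W2 ...), sigmoid
    # Scale: multiply each channel c of U by the scalar s_c
    return U * s[:, None, None]

# Toy example: C = 4 channels, reduction ratio r = 2
rng = np.random.default_rng(0)
U = rng.standard_normal((4, 8, 8))
W1 = rng.standard_normal((2, 4))
W2 = rng.standard_normal((4, 2))
V = squeeze_excite(U, W1, W2)
```

Each output channel is the corresponding input channel multiplied by one learned scalar in $(0, 1)$, which is what lets SE recalibrate channel responses.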

Supplementary Note 2
The LCR images at each age were split into training, validation, and test datasets in the ratio 7 : 1.5 : 1.5, which keeps the age distribution of LCR images in each dataset consistent with that of the overall dataset.
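The per-age (stratified) split described above can be sketched as follows; `stratified_split` is a hypothetical helper, not the authors' code, and the exact rounding and shuffling details are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(samples, ages, ratios=(0.7, 0.15, 0.15), seed=0):
    """Split samples 7:1.5:1.5 within each age group so that every
    subset preserves the overall age distribution (illustrative helper)."""
    by_age = defaultdict(list)
    for sample, age in zip(samples, ages):
        by_age[age].append(sample)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for group in by_age.values():
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test

# 100 samples at each of three ages -> 70/15/15 per age
samples = list(range(300))
ages = [s // 100 for s in samples]
train, val, test = stratified_split(samples, ages)
```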
The weights of the model were randomly initialized from a normal distribution. In this study, a mini-batch strategy was used to train the model, i.e., in each training iteration a batch of samples is fed into the model. The optimizer used in our study is Adam, with weight decay set to 0.0001.
To prevent the model from converging to a local optimum and to speed up convergence, the learning rate was also decreased with the number of passes over the training dataset (epochs). During training, if the performance on the validation dataset did not improve for 3 consecutive epochs, or if the epoch count reached the maximum value, set to 100 in this study, training stopped. After training, we selected the network parameters that performed best on the validation dataset and obtained the final performance on the test dataset.
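The stopping rule above (patience of 3 epochs, at most 100 epochs, keep the best validation checkpoint) can be sketched as a plain loop; `validate` is a hypothetical callback standing in for one epoch of training plus validation, and a lower returned value is assumed to mean better performance.

```python
def train_with_early_stopping(validate, max_epochs=100, patience=3):
    """Stop when the validation error has not improved for `patience`
    consecutive epochs, or when max_epochs is reached (sketch)."""
    best_err, best_epoch, stale = float("inf"), -1, 0
    for epoch in range(max_epochs):
        err = validate(epoch)  # train one epoch, then evaluate on validation
        if err < best_err:
            best_err, best_epoch, stale = err, epoch, 0
            # here one would snapshot the model parameters
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_err

# Validation error improves until epoch 5, then plateaus:
errs = [10, 9, 8, 7, 6, 5, 5, 5, 5, 5]
best_epoch, best_err = train_with_early_stopping(lambda e: errs[e])
```

With the plateau starting at epoch 5, training runs three more stale epochs and stops, returning the epoch-5 checkpoint.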
Suppose the initial learning rate is $\eta_0$; the learning rate $\eta_i$ of epoch $i$ is obtained by decaying $\eta_0$ with the epoch number according to Eq. (5), where the initial learning rate $\eta_0 = 0.001$. The parameters were updated according to their gradient and the learning rate of the current training iteration:

$$w_{i+1} = w_i - \eta_i \nabla w_i,$$

where $w_i$ denotes the model parameters after $i$ epochs, and $\nabla w_i$ is the gradient of the parameters obtained by the back-propagation algorithm. The loss function used for training the model was the $L_1$ loss:

$$L = \frac{1}{N} \sum_{n=1}^{N} \left| y_n - \hat{y}_n \right|,$$

where $N$ is the number of samples, and $y_n$ and $\hat{y}_n$ are the true age and the predicted age of the $n$-th LCR image.
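The $L_1$ loss and the gradient step can be written out directly; this is a didactic sketch of a plain gradient update (Adam, used in the study, additionally maintains running moment estimates per parameter).

```python
def l1_loss(y_true, y_pred):
    """Mean absolute error: L = (1/N) * sum_n |y_n - y_hat_n|."""
    assert len(y_true) == len(y_pred)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def gradient_step(w, grad, lr):
    """One plain update w_{i+1} = w_i - eta_i * grad (element-wise sketch)."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# True ages vs. predicted ages for three LCR images: errors 1, 2, 0 -> L = 1.0
loss = l1_loss([12.0, 25.0, 33.0], [11.0, 27.0, 33.0])
```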

Supplementary Note 3
The baseline model used in this study requires a fixed input image size, since the feature map is flattened into a one-dimensional vector before being fed into the fixed-size fully connected layer. Meanwhile, to preserve the texture and structure information in the image and to keep the model efficient, the LCR images and the ARDA-concentrated ageing regions were resized to 1000 × 1000 and 500 × 500, respectively. However, the aspect ratios of the LCR images and the ARDA-concentrated ageing regions are not 1:1, so direct resizing would inevitably cause image distortion. Therefore, before resizing, the short sides of the LCR images and the ARDA-concentrated ageing regions were padded to the length of their long sides so that the aspect ratios of the images were always 1:1.
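The pad-to-square step above can be sketched as follows. The image is represented as a plain list of rows, and padding is applied on one side only (right/bottom); whether the authors pad one side or both is not specified, so this detail is an assumption.

```python
def pad_to_square(img, fill=0):
    """Pad the short side of a 2-D image (list of rows) to the length of
    the long side, making the aspect ratio 1:1 before any resize (sketch)."""
    h, w = len(img), len(img[0])
    side = max(h, w)
    # pad columns on the right, then rows at the bottom, with `fill`
    padded = [row + [fill] * (side - w) for row in img]
    padded += [[fill] * side for _ in range(side - h)]
    return padded

img = [[1, 2, 3],
       [4, 5, 6]]        # 2 x 3, aspect ratio not 1:1
sq = pad_to_square(img)  # 3 x 3, content preserved, zeros below
```

Resizing the padded square image then changes only the scale, not the aspect ratio, so anatomical structures are not distorted.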

Supplementary Note 4
Every LCR image and ARDA-concentrated ageing region was also augmented to increase the diversity of the samples, thereby improving the models' generalization performance. The samples in the training dataset were augmented with the Random Affine, Horizontal Flip, and Vertical Flip operations provided in PyTorch. Before data augmentation, the contrast of the input image was enhanced by contrast-limited adaptive histogram equalization (CLAHE). The source code for the contrast enhancement and shape fixing is provided in the source code repository (https://github.com/LiuNingtao/ARDA/blob/master/consum_transformer.py).
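To illustrate the contrast-enhancement idea, here is a global histogram equalization sketch in plain Python. It is a deliberate simplification of CLAHE, which additionally equalizes per tile and clips the histogram to limit contrast amplification; for the actual preprocessing, see the repository linked above.

```python
def equalize_hist(img):
    """Global histogram equalization of an 8-bit greyscale image given as
    a list of rows. Simplified stand-in for CLAHE (no tiling, no clipping)."""
    flat = [p for row in img for p in row]
    n = len(flat)
    hist = [0] * 256
    for p in flat:
        hist[p] += 1
    # cumulative distribution of grey levels
    cdf, total = [], 0
    for h in hist:
        total += h
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    # map each grey level through the normalized cumulative distribution
    lut = [round((c - cdf_min) / (n - cdf_min) * 255) for c in cdf]
    return [[lut[p] for p in row] for row in img]

# A low-contrast 8x8 image occupying grey levels 100..131:
img = [[(r * 8 + c) // 2 + 100 for c in range(8)] for r in range(8)]
out = equalize_hist(img)  # stretched to cover the full 0..255 range
```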
To eliminate the effect of size differences on the quantified ARDA analysis, we calculate only the mean ARDA of the ageing-significant region within each instance. For a specific age $t$:

$$\bar{A}_k = \frac{1}{P'} \sum_{p=1}^{P} a_p \cdot \mathbb{1}(a_p \ge \theta_t),$$

where $\bar{A}_k$ is the quantified ARDA of instance $k$, $P$ is the number of elements in the instance, $P'$ is the number of elements whose average ageing salience $a_p$ is greater than or equal to the ageing-significant region threshold $\theta_t$ of age $t$, and $\mathbb{1}(\cdot)$ is the indicator function determining whether an element with average ageing salience in ARDA belongs to an ageing-significant region. In this study, the