Unsupervised learning architecture for classifying the transient noise of interferometric gravitational-wave detectors

In the data obtained by laser interferometric gravitational wave detectors, transient noise with non-stationary and non-Gaussian features occurs at a high rate. This often results in problems such as detector instability and the hiding and/or imitation of gravitational-wave signals. This transient noise has various characteristics in the time–frequency representation, which is considered to be associated with environmental and instrumental origins. Classification of transient noise can offer clues for exploring its origin and improving the performance of the detector. One approach for accomplishing this is supervised learning. However, in general, supervised learning requires annotation of the training data, and there are issues with ensuring objectivity in the classification and its corresponding new classes. By contrast, unsupervised learning can reduce the annotation work for the training data and ensure objectivity in the classification and its corresponding new classes. In this study, we propose an unsupervised learning architecture for the classification of transient noise that combines a variational autoencoder and invariant information clustering. To evaluate the effectiveness of the proposed architecture, we used the dataset (time–frequency two-dimensional spectrogram images and labels) of the Laser Interferometer Gravitational-wave Observatory (LIGO) first observation run prepared by the Gravity Spy project. The classes provided by our proposed unsupervised learning architecture were consistent with the labels annotated by the Gravity Spy project, which manifests the potential for the existence of unrevealed classes.

Gravitational waves are distortions of the space-time continuum that propagate (with high probability) at the speed of light. They are emitted during events such as the coalescence of compact star binaries and supernova explosions. The first observation of a gravitational wave, which was from the coalescence of a black hole binary, was achieved in September 2015 by the Laser Interferometer Gravitational-wave Observatory (LIGO) 1, located in Livingston, Louisiana and Hanford, Washington in the USA 2. Subsequently, LIGO and Virgo 3 in Europe conducted three international joint observation runs and observed as many as 90 gravitational-wave events emitted by the coalescence of compact binaries [4][5][6][7]. Moreover, GEO600 8 in Germany and KAGRA [9][10][11][12] in Japan conducted a 2-week observation run (O3GK) in April 2020 13,14. The subsequent fourth observation run (O4) is planned to be conducted jointly by LIGO, Virgo, and KAGRA.
When searching for a gravitational-wave signal in the data from the interferometers, suitable techniques for separating the gravitational waves from instrumental noise are essential because gravitational-wave signals are generally smaller than the detector noise. The gravitational-wave detector is sensitive to environmental and instrumental states (such as ground motions, air pressure, optics suspensions, fluctuations in the laser, vacuum, and mirror). Consequently, non-stationary and non-Gaussian noise, called "transient noise", frequently appears in the detector. Transient noise causes instability in the detector and the hiding and/or imitation of gravitational-wave signals. The LIGO and Virgo collaboration reported that transient noise with a signal-to-noise ratio > 6.5 occurred at a rate of 1.10 events per minute at LIGO Livingston (LLO) in the first half of the third observation run (O3a) between 1 April 2019, 15:00 UTC and 1 October 2019, 15:00 UTC 5, and at a rate of 1.17 events per minute at LLO in the second half of O3 (O3b) between 1 November 2019, 15:00 UTC and 27 March 2020, 17:00 UTC 7.
Transient noise has various time-frequency characteristics that are related to its causes in the detector. Classifying transient noise could provide clues for exploring its origins and improving the performance of the detector. The Gravity Spy project [15][16][17][18] is one such effort to classify transient noise. The Gravity Spy project used the Omicron software 19 to identify transient noise in the time-series data. Thereafter, Omega Scan 20 was used to create time-frequency spectrograms around the identified transient noise as two-dimensional (2D) images. For a subset of these 2D images, LIGO detector characterisation experts and volunteer citizen scientists, using cloud resources for the analysis, annotated 22 types of labels associated with the characteristics or causes of transient noise. Both images and labels were recorded. Finally, the transient noise in the remaining images was classified by supervised learning using the pre-classified images and labels. As this process shows, data annotation for machine learning is highly labour-intensive.
Previous studies 21 using unsupervised classification grouped together similar transient noise in the Gravity Spy dataset 16. Bahaadini et al. used the DIRECT method 22 to analyse the feature embedding learned from the Gravity Spy dataset 16 and observed a class of transient noise different from the existing classes. Unsupervised clustering applying transfer learning 23 revealed a new class of transient noise in addition to the 22 classes of the Gravity Spy project. Moreover, supervised classification using the more recent O3 dataset identified a new class of transient noise 17.
As unsupervised learning does not require any pre-assigned labels for the training dataset, it is expected to reduce annotation work for the training data, increase the objectivity of the classification, and even identify new classes of transient noise. Unsupervised learning is also useful in various fields, such as text categorisation, feature representation, and clustering [24][25][26][27]. In this study, we focus on unsupervised learning using a deep convolutional neural network (CNN) and propose a classification architecture for transient noise. Our proposed architecture consists of two processes: feature learning and classification. In the feature learning process, the features of transient noise are extracted from the time-frequency spectrogram images (2D images) using a variational autoencoder (VAE) 28,29. In the classification process, invariant information clustering (IIC) 30 is used to classify images of the transient noise using features extracted by the encoder of the pre-trained VAE. We applied the proposed architecture to the dataset 16 created by the Gravity Spy project from the LIGO observation run 1 (O1) 4 as our input images, examined the validity of the unsupervised classification result, and analysed the correspondence with the labels of the Gravity Spy project.

Results
The Results section consists of two subsections: the training process and the evaluation of the unsupervised learning architecture. The Gravity Spy dataset of LIGO O1, which was developed by the Gravity Spy project and is shown in Fig. 1, was used for training our proposed architecture. This dataset contains a total of 8535 transient noises, each recorded with four time durations: 0.5, 1.0, 2.0, and 4.0 s. Each data unit has one of 22 labels, which are related to the origins or characteristics of the transient noise. The labels, annotated by the Gravity Spy project on Zooniverse, an online citizen science platform, were used only when evaluating the training results of the proposed architecture. The pre-processing of the dataset is described in the "Pre-processing" section.

Training process of our architecture
We investigated the training parameters to use for the VAE as follows. The dimension of the feature variable $\boldsymbol{z}$ was one of 64, 128, 256, 512, and 1024; the training size rate was in the range [0.6, 0.9] in increments of 0.1; the learning rate of the Adam 31 optimiser, with parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$ (coefficients used for computing running averages of the gradient and its square) and $\epsilon = 10^{-8}$ (term added to the denominator to improve numerical stability), was in the range $[5 \times 10^{-7}, 5 \times 10^{-2}]$ in steps of one order of magnitude; the minibatch size was in the range [32, 128] in increments of 32. The training objective is the maximisation of the lower bound (3); equivalently, with $\delta = -\sum_{i}^{N} \mathcal{L}(\boldsymbol{x}^{(i)}, \boldsymbol{\theta}, \boldsymbol{\phi})$, training minimises $\delta$.
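The search ranges listed above can be enumerated programmatically. The following is a minimal sketch, assuming an exhaustive Cartesian grid over the four VAE settings (the text does not state whether every combination was actually trained); all variable names are illustrative.

```python
from itertools import product

# Search ranges quoted in the text for the VAE.
latent_dims = [64, 128, 256, 512, 1024]
train_fractions = [0.6, 0.7, 0.8, 0.9]
learning_rates = [5 * 10.0 ** e for e in range(-7, -1)]  # 5e-7 up to 5e-2
batch_sizes = list(range(32, 129, 32))                    # 32, 64, 96, 128

# Adam settings quoted in the text (fixed, not searched).
adam = {"beta1": 0.9, "beta2": 0.999, "eps": 1e-8}

# Assumption: every combination is a candidate configuration.
grid = [
    {"latent_dim": d, "train_fraction": f, "lr": lr, "batch_size": b, **adam}
    for d, f, lr, b in product(latent_dims, train_fractions,
                               learning_rates, batch_sizes)
]
print(len(grid))
```

Under this assumption the grid has 5 × 4 × 6 × 4 candidate configurations; in practice a coarser or sequential search may have been used.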
The dimension of $\boldsymbol{z}$ and the training size rate do not have a significant effect on the value of δ. By contrast, the learning rate and minibatch size are related to the value of δ and its stability. The representative parameters for training are shown on the left side of Fig. 2a, and the training curves using these parameters are shown in Fig. 2b. In Case 1 (black line in Fig. 2b), the learning rate seems too low and δ does not decrease. In Case 2 (grey line), the training is not stable, as shown by the fluctuation in the curve, although δ decreases compared with Case 1. In Case 3 (blue line), δ decreases in both the training and evaluation and seems stable after 100 epochs. Considering these results, the parameters of Case 3 were used in the proposed architecture for the remainder of the study.
Examples of the reconstructed images of the transient noise generated by the decoder of the VAE at 100 epochs are shown in Fig. 2c. The characteristics of the reconstructed images seem similar to those of the input images. We confirmed a similar tendency for all the other input and reconstructed images. Therefore, the encoder of the VAE at 100 epochs was applied to the IIC for the classification of the transient noise.
Furthermore, the validity of the features learned by the VAE is shown in the Supplemental Material "Feature Visualization of Transient Noise using t-SNE" section by visualising the features $\boldsymbol{z}$ projected using t-SNE.
After training the VAE, the training parameters of the IIC were also investigated using the pre-trained encoder. The number of output classes was in the range [22, 100] in increments of 2; the number of over-clustering output classes was in the range [50, 500] in increments of 50; the number of classifiers was one of 3, 5, 10, and 20; the learning rate of the Adam optimiser, with parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$, was in the range $[5 \times 10^{-7}, 5 \times 10^{-2}]$ in steps of one order of magnitude; the minibatch size was in the range [64, 256] in increments of 32. As a result of the training, the mutual information (4) was high for between 30 and 40 output classes, which is consistent with subclasses being implied in the dataset. When the number of over-clustering output classes or the number of classifiers changes, the mutual information does not seem to change. In this study, the IIC parameters shown on the right side of Fig. 2a were used for the classification. In addition, for spectral clustering with multiple classifiers, the number of classifiers was K = 5 and the number of classes was C = 36. These values gave the best classification performance in terms of the accuracy defined in the "Discussion" section. The training of the VAE and IIC with a minibatch size of 128 took approximately 1.0 h per 100 epochs and approximately 0.3 h per 100 epochs, respectively, using two NVIDIA GeForce RTX 2080 Ti GPUs, an Intel Xeon CPU E5-2637 v4 (core 8), and 125 GB of main memory.

Evaluation of our architecture
The evaluation results are presented in this section. The proposed architecture shown in the "Proposed architecture" section was trained using the pre-processed dataset described in the "Pre-processing" section.
Fig. 3 shows a randomly selected image from each class (representative image) and images in the same class that have a high degree of similarity to the representative image. These similar images are selected using the cosine similarity 32 between the representative image and the other images, computed from an affinity matrix that is calculated by spectral clustering.
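The ranking by cosine similarity described above can be illustrated with a short sketch. It assumes that each image is represented by one row of the affinity matrix; the vectors below are toy values, not data from the study, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def rank_by_similarity(representative, candidates):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine_similarity(representative, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Hypothetical affinity-matrix rows for one representative and three images.
rep = [1.0, 0.8, 0.0]
others = [[0.0, 0.1, 1.0], [0.9, 0.9, 0.1], [1.0, 0.8, 0.0]]
order = rank_by_similarity(rep, others)
print(order)
```

Because cosine similarity depends only on direction, two rows that differ by a positive scale factor are treated as identical.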
The representative images seem to have different characteristics for each class, and similar images are close to their representative images. Moreover, the image of class (15) in Fig. 3 shows that the classifier recognises the same class even if the data are shifted in the time direction. Therefore, training that does not depend on perturbations in the time direction is achieved by pre-processing the dataset.
The "Scattered_Light" class is separated into classes (2), (3), (11), and (16) on the confusion matrix. These are treated as different classes by unsupervised learning, although their characteristics appear similar in Fig. 3. A previous study 17 on supervised learning with the Gravity Spy labels indicated that a subclass might exist in the "Scattered_Light" class. The unsupervised classification yielded the same result as the previous study, indicating the existence of a subclass of the "Scattered_Light" class.
Considering the "Blip" and "Koi_Fish" classes, both are separated into multiple classes as shown in Fig. 4. The representative images and their similar images from the separated classes are shown in Fig. 5, where the similar images are sampled randomly and sorted in descending order of cosine similarity to the representative image. Each separated class forms its own group, even for images with low cosine similarity. The images of the classes separated from "Blip" have a common Gravity Spy label. Moreover, the frequency growth of the spectrogram image for classes (9), (20), and (30) looks roughly similar, and the unsupervised classification distinguishes each class using the details of their characteristics. Similar results can be observed in "Koi_Fish" (class (5) and class (7)). Therefore, the images of "Blip" or "Koi_Fish" may be classified into more detailed subclasses.
The "Paired_Doves", "Wandering_Line", and "Air_Compressor" classes have few samples in the dataset (Fig. 1b). "Air_Compressor" is classified into one class; however, the other two classes are not classified into any unique class in the unsupervised classification. We assume that "Air_Compressor" is a class that cannot be divided further. Therefore, it is classified into one class, even with few data. Conversely, "Paired_Doves" and "Wandering_Line" are assumed to have more subclasses. That they are not classified into a specific class can be explained by the limited amount of transient noise labelled as "Paired_Doves" and "Wandering_Line".
The "None_of_the_Above" class of the O1 dataset comprises data that do not belong to any other Gravity Spy label. The unsupervised classification does not classify these data into unique classes; instead, it distributes them across various classes. This result is consistent with a previous study by Bahaadini et al. 16. In fact, Soni et al. 17 used the O3 dataset 5 and reported that several of the "None_of_the_Above" data appear in the "Blip" class or the new population of "Scattering_Light". A similar classification result is expected when applying our architecture to the O3 dataset and retraining it.
Based on the above results, the Gravity Spy labels whose data are classified into multiple classes in the unsupervised classification are shown in grey in the "Estimated number of class" column in Fig. 4. These data that are separated from the Gravity Spy labels may imply the existence of subclasses.

Discussion
Let the number of Gravity Spy classes (labels) be 22, and let the classification result for the i-th unsupervised class be a vector whose j-th component is the number of images with the j-th Gravity Spy label that are classified into the i-th unsupervised class. The total number of images classified into the i-th unsupervised class is expressed by the L1 norm 32 of this vector, and the j-th component divided by this norm is the ratio of images with the j-th Gravity Spy label in the i-th unsupervised class. The accuracy of unsupervised learning is defined in (1) on the basis of these ratios. It should be noted that the confusion matrix shown in Fig. 4 is not a square matrix, and the indices of its unsupervised labels (columns) depend on the initial values of training. Therefore, it is difficult to define evaluation indicators such as recall, precision, and F-measure. The accuracy of the proposed architecture was 90.9%, where the total number of unsupervised classes was set to C = 36. By comparison, although (1) differs slightly from the usual definition of accuracy in supervised learning, the supervised learning of the Gravity Spy project 15 achieved 97.1% accuracy on the testing data using the same dataset as that used here. Furthermore, we compared our results with those (shown in Table I of reference 23) of different CNN models, such as Google Inception 33 (versions 2 and 3), Microsoft ResNet 34, VGG 35 (with 16 and 19 layers), and the retrained CNN model based on the Gravity Spy project 15,18. Google Inception, ResNet, and VGG are among the most popular image recognition architectures, all of which were submitted to the ILSVRC competitions 36. Note that all models used the same dataset (Gravity Spy dataset of LIGO O1). The accuracy was more than 96% for all models. Although the accuracy of our model is lower than that of the above models, unsupervised learning has the advantage that data annotations are not required, and our model has the potential to suggest the existence of subclasses, as shown in the "Evaluation of our architecture" section.
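The per-class ratio described above (label counts in one unsupervised class divided by their L1 norm) can be computed as in the following minimal sketch. It covers only the ratio and the dominant label of a column, not the accuracy (1) itself; the label names are taken from the text, while the counts and function names are hypothetical.

```python
def class_ratios(column):
    """Divide the label counts of one unsupervised class by their L1 norm."""
    total = sum(abs(n) for n in column)
    if total == 0:
        return [0.0 for _ in column]  # empty unsupervised class
    return [n / total for n in column]

def dominant_label(column, label_names):
    """Return the Gravity Spy label with the largest ratio in this class."""
    ratios = class_ratios(column)
    j = max(range(len(ratios)), key=lambda k: ratios[k])
    return label_names[j], ratios[j]

# Toy column of a confusion matrix: counts per Gravity Spy label
# for a single unsupervised class (illustrative numbers only).
labels = ["Blip", "Koi_Fish", "Scattered_Light"]
column = [90, 8, 2]
name, ratio = dominant_label(column, labels)
print(name, ratio)
```

A column dominated by a single label indicates that the unsupervised class agrees closely with that Gravity Spy label, whereas a flat column indicates a mixed class.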
Let us now examine, in the classification results in Fig. 4, the factors that decrease the accuracy of unsupervised learning in (1). The representative images of the major characteristics and images with low similarity to them are shown in Fig. 6. Considering classes (0) and (35), the classifier is able to identify the global features of images because the images are similar to the representative images that also exist in the data of other Gravity Spy labels. Regarding classes (13) and (34), the classifier cannot recognise the images properly and may be learning the background features. This problem can be solved by adjusting the neural-network configuration. Moreover, regarding class (26), it is observed that the minor images (such as "Power_Line") are mixed with the major class ("Air_Compressor"). The same result can also be observed for class (32). Because both sets of images look similar, it is possible that both noises have similar characteristics. Additionally, a comparison of the classification results shown in Fig. 4 with the feature visualisation using t-SNE is discussed in the Supplemental Material "Feature Visualization of Transient Noise using t-SNE" section. Based on the above results, we can confirm the consistency between the labels annotated by the Gravity Spy project and the classes provided by our proposed unsupervised learning architecture, and we show the potential for the existence of unrevealed classes.
Subsequently, we will build a system for the classification of transient noise using the proposed architecture in KAGRA. In addition, we will extend our architecture to self-supervised learning 37 to enhance the accuracy of the classification. This algorithm is trained on data with specific labels, known as the golden set 15, generates pseudo labels for the given dataset, and is then retrained. Using the new classes classified by unsupervised learning, the semi-supervised learning can help reduce the annotation process for the training and can solve the problem of ensuring objectivity in the classification. We would like to construct a semi-supervised architecture that incorporates the advantages of both Gravity Spy's supervised learning and unsupervised learning.

Method
The proposed unsupervised learning method consists of two architectures: a variational autoencoder (VAE) and invariant information clustering (IIC). The VAE is used to learn the features from the time-frequency spectrogram (2D images) of transient noise, and the IIC classifies the transient noise from the features that are learned by the encoder of the VAE. Before we present the details of the method, we explain the target dataset.

Target dataset
The Gravity Spy dataset 16, which is the input dataset, is an image set of transient noise obtained from the LIGO O1 4. The Omicron software 19 searches for transient noise in time-series data, and the Omega Scan 20 software generates an image of the time-frequency spectrogram of each transient noise using the Q-transformation 20,38. The Q-transformation is a method that estimates the frequency component of the time-series data by setting a window function on each time-frequency component, generating a 2D image of the time-frequency spectrogram. The spectrogram image of each transient noise in the Gravity Spy dataset is provided for four time durations (0.5, 1.0, 2.0, and 4.0 s) centred on the transient noise, as shown in Fig. 1a. In addition, these transient noises are given 22 labels, which are related to their causes, as shown in Fig. 1b. For example, the images of 12 classes of transient noise are shown in Fig. 1c.

Pre-processing
The pre-processing applied to the Gravity Spy dataset for the training of our proposed architecture is shown in Fig. 7. Considering the characteristics of the time-frequency spectrogram, a small displacement in the time direction does not change its physical characteristics because this operation can be interpreted as a change in the event time. Therefore, the time-shifted images can be regarded as new events of transient noise, which allows the architecture to classify transient noise independently of small displacements in the time direction. Conversely, a small displacement of the spectrogram in the frequency direction changes its physical characteristics. Therefore, frequency-shifted images would fall into classes different from that of the original image. Thus, the perturbation of transient noise is applied only in the time direction and not in the frequency direction.
In the training process of the proposed architecture, a random time shift in the range 0-24 px is applied to the images used as training data. The data that were cropped without a time shift were used for the evaluation of the VAE and as the input images of the IIC.
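The time-shift perturbation can be sketched as a crop along the time axis only. The sketch below assumes stacked images of 4 × 224 × 272 px (four durations, frequency, time) cropped to 224 px in time, with the shift measured as the left crop offset; the helper names and the toy stack are illustrative rather than the authors' code.

```python
import random

HEIGHT, WIDTH, CROP, MAX_SHIFT = 224, 272, 224, 24

def crop_time(stack, offset):
    """Keep CROP time columns starting at `offset` in every row and channel."""
    return [[row[offset:offset + CROP] for row in channel] for channel in stack]

def input_image(stack):
    """Centred crop: remove equal margins on the left and right."""
    return crop_time(stack, (WIDTH - CROP) // 2)

def perturbed_image(stack, rng):
    """Crop at a random time offset of 0 to MAX_SHIFT pixels (assumed range)."""
    return crop_time(stack, rng.randint(0, MAX_SHIFT))

# Toy stack in which every pixel stores its own time-column index,
# so the applied offset can be read back from the cropped image.
stack = [[list(range(WIDTH)) for _ in range(HEIGHT)] for _ in range(4)]
rng = random.Random(0)
x = input_image(stack)
x_shifted = perturbed_image(stack, rng)
print(x[0][0][0], x_shifted[0][0][0])
```

No operation touches the frequency axis, which mirrors the argument that frequency shifts would change the physical meaning of the spectrogram.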

Variational autoencoder
In this study, the features of transient noise are obtained from their time-frequency 2D spectrogram images using a VAE, one of the approaches to feature learning 39,40 using convolutional deep learning. Generally, feature learning is a method for acquiring features that are effective for the prediction and classification of data. It can also convert high-dimensional data to low-dimensional features.
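Let the input dataset be $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)} \mid \boldsymbol{x}^{(i)} \in \mathbb{R}^{D},\ i = 1, \ldots, N\}$ and the marginal likelihood for $\mathcal{D}$ be $p_{\boldsymbol{\theta}}(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)})$, where $D$ is the dimension of the input data, $N$ is the number of input data, and $\boldsymbol{\theta}$ are the parameters of the architecture. The objective of the learning is to maximise the marginal likelihood. When the dataset $\mathcal{D}$ is independent and identically distributed, the log marginal likelihood becomes $\sum_{i=1}^{N} \ln p_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)})$. Consider an inference architecture $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x}^{(i)})$ (also known as the encoder) that approximates the posterior, $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x}^{(i)}) \simeq p_{\boldsymbol{\theta}}(\boldsymbol{z}|\boldsymbol{x}^{(i)})$, where $\boldsymbol{z} \in \mathbb{R}^{J}$ is a feature variable and $J < D$. Then, the log marginal likelihood $\ln p_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)})$ can be expressed as

$$\ln p_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}) = \ln \int p_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}, \boldsymbol{z})\, d\boldsymbol{z} \geq \int q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x}^{(i)}) \ln \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x}^{(i)})}\, d\boldsymbol{z} \equiv \mathcal{L}(\boldsymbol{x}^{(i)}, \boldsymbol{\theta}, \boldsymbol{\phi}). \tag{2}$$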

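The inequality in (2) is obtained from Jensen's inequality, and $\mathcal{L}(\boldsymbol{x}^{(i)}, \boldsymbol{\theta}, \boldsymbol{\phi})$ is an objective function known as the lower bound. Let the prior and the approximate posterior distribution of $\boldsymbol{z}$ be multivariate Gaussian distributions, that is, $p_{\boldsymbol{\theta}}(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{z}|\boldsymbol{0}, \boldsymbol{I})$ and $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x}^{(i)}) = \mathcal{N}(\boldsymbol{z}|\boldsymbol{\mu}_{\boldsymbol{\phi}}(\boldsymbol{x}^{(i)}), \boldsymbol{\Sigma}^{2}_{\boldsymbol{\phi}}(\boldsymbol{x}^{(i)})\boldsymbol{I})$, where $\boldsymbol{\mu}_{\boldsymbol{\phi}}(\cdot)$ and $\boldsymbol{\Sigma}_{\boldsymbol{\phi}}(\cdot)$ are the outputs from the encoder and $\boldsymbol{I}$ is the identity matrix of dimension $J$. Let the distribution of $\boldsymbol{x}$ given $\boldsymbol{z}$ be the multivariate Bernoulli distribution, $p_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}|\boldsymbol{z}) = \mathrm{bern}(\boldsymbol{x}^{(i)}|\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}))$, where $\boldsymbol{g}_{\boldsymbol{\theta}}(\cdot)$ are the outputs from the decoder. Thus, the expression of the lower bound to be maximised is

$$\mathcal{L}(\boldsymbol{x}^{(i)}, \boldsymbol{\theta}, \boldsymbol{\phi}) \simeq -D_{\mathrm{KL}}\left[q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x}^{(i)}) \,\|\, p_{\boldsymbol{\theta}}(\boldsymbol{z})\right] + \frac{1}{L}\sum_{l=1}^{L} \ln p_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)}|\boldsymbol{z}^{(i,l)}), \tag{3}$$

where $D_{\mathrm{KL}}[\cdot\|\cdot]$ is the Kullback-Leibler divergence between two distributions, $L$ is the number of samples drawn for each input, and $\boldsymbol{z}^{(i,l)}$ is obtained by the reparameterisation trick, such that $\boldsymbol{z}^{(i,l)} = \boldsymbol{g}_{\boldsymbol{\phi}}(\boldsymbol{\epsilon}^{(l)}, \boldsymbol{x}^{(i)}) = \boldsymbol{\mu}_{\boldsymbol{\phi}}(\boldsymbol{x}^{(i)}) + \boldsymbol{\Sigma}_{\boldsymbol{\phi}}(\boldsymbol{x}^{(i)}) \odot \boldsymbol{\epsilon}^{(l)}$, where $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$ and $\odot$ signifies the Hadamard product.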
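The lower bound for a Gaussian encoder and a Bernoulli decoder can be evaluated numerically as in the following minimal sketch. It uses the standard closed-form Kullback-Leibler divergence between a diagonal Gaussian and the standard normal prior and a Monte Carlo estimate of the reconstruction term; `toy_decoder` is a hypothetical stand-in for the trained decoder, and all function names are illustrative.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """Closed-form D_KL[N(mu, diag(sigma^2)) || N(0, I)]."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def reparameterise(mu, sigma, rng):
    """z = mu + sigma * eps (element-wise), with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def bernoulli_log_likelihood(x, probs, tiny=1e-12):
    """ln p(x | z) for pixel values x in [0, 1] and decoder outputs probs."""
    return sum(
        xi * math.log(max(p, tiny)) + (1.0 - xi) * math.log(max(1.0 - p, tiny))
        for xi, p in zip(x, probs)
    )

def lower_bound(x, mu, sigma, decoder, rng, n_samples=1):
    """Monte Carlo estimate: -KL + mean over samples of ln p(x | z)."""
    recon = 0.0
    for _ in range(n_samples):
        z = reparameterise(mu, sigma, rng)
        recon += bernoulli_log_likelihood(x, decoder(z))
    return -kl_to_standard_normal(mu, sigma) + recon / n_samples

def toy_decoder(z):
    """Hypothetical decoder: element-wise sigmoid, same size as z."""
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]

rng = random.Random(0)
bound = lower_bound([0.0, 1.0], [0.0, 0.0], [1.0, 1.0], toy_decoder, rng, 4)
delta = -bound  # the quantity minimised during training, for one input
print(delta)
```

Sampling through `reparameterise` keeps the randomness in `eps`, which is what allows gradients to flow to the encoder outputs in an automatic-differentiation framework.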

Classification using invariant information clustering
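A typical method for clustering is k-means, which uses the Euclidean distances between data. Recently, several variants of k-means have been developed (e.g. k-means++ 41, fuzzy c-means 42, and x-means 43). In clustering in a high-dimensional space, the variance of the distance between data becomes small owing to the "curse of dimensionality". Alternatively, IIC 30, which is a classification method, seems to be effective because it does not use the distances between data for learning. In this study, transient noise is classified using IIC by maximising the mutual information. Let $\boldsymbol{x} \in \mathbb{R}^{D}$ be the input data, $\boldsymbol{x}'$ be the perturbed data of $\boldsymbol{x}$, $C$ be the number of output classes, and $\boldsymbol{\Phi}(\boldsymbol{x}) \in \mathbb{R}^{C}$ be a classifier whose output layer uses the softmax activation function. Consider a pair of cluster assignments for two inputs, $\boldsymbol{x}$ and $\boldsymbol{x}'$. Their conditional joint distribution and the joint distribution marginalised over the dataset are $P(c, c'|\boldsymbol{x}, \boldsymbol{x}') = \Phi_{c}(\boldsymbol{x})\,\Phi_{c'}(\boldsymbol{x}')$ and $\boldsymbol{P} = \frac{1}{N}\sum_{i=1}^{N} \boldsymbol{\Phi}(\boldsymbol{x}^{(i)})\,\boldsymbol{\Phi}(\boldsymbol{x}'^{(i)})^{T}$, respectively, where the superscript $T$ denotes the transpose. The objective for the maximisation of the mutual information is expressed as

$$\max_{\boldsymbol{\Phi}} I(\boldsymbol{\Phi}(\boldsymbol{x}), \boldsymbol{\Phi}(\boldsymbol{x}')), \qquad I(\boldsymbol{\Phi}(\boldsymbol{x}), \boldsymbol{\Phi}(\boldsymbol{x}')) = \sum_{c=1}^{C}\sum_{c'=1}^{C} P_{cc'} \ln \frac{P_{cc'}}{P_{c}\,P_{c'}}, \tag{4}$$

where $P_{cc'}$ is the element of $\boldsymbol{P}$ in row $c$ and column $c'$, and $P_{c}$ and $P_{c'}$ are the marginals obtained by summing over the rows and columns of $\boldsymbol{P}$. To improve the performance of the classifier, auxiliary over-clustering 30 is also used when calculating the mutual information. The over-clustering formula is the same as (4), except that $\boldsymbol{\Phi}(\boldsymbol{x}) \in \mathbb{R}^{W}$, where $C < W$.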
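The mutual information between paired softmax outputs can be computed directly from their averaged outer product, as in this minimal sketch. The symmetrisation of the joint matrix follows common IIC practice and is an assumption here, since the text does not mention it; the function names and toy inputs are illustrative.

```python
import math

def joint_distribution(probs_a, probs_b):
    """Average outer product of paired softmax outputs, then symmetrise."""
    n = len(probs_a)
    c = len(probs_a[0])
    joint = [[0.0] * c for _ in range(c)]
    for pa, pb in zip(probs_a, probs_b):
        for i in range(c):
            for j in range(c):
                joint[i][j] += pa[i] * pb[j] / n
    # Assumption: symmetrise, since the pair (x, x') has no preferred order.
    return [[0.5 * (joint[i][j] + joint[j][i]) for j in range(c)]
            for i in range(c)]

def mutual_information(joint, tiny=1e-12):
    """Sum over classes of P_cc' * ln(P_cc' / (P_c * P_c'))."""
    c = len(joint)
    row = [sum(joint[i]) for i in range(c)]
    col = [sum(joint[i][j] for i in range(c)) for j in range(c)]
    mi = 0.0
    for i in range(c):
        for j in range(c):
            p = joint[i][j]
            if p > tiny:  # terms with zero probability contribute nothing
                mi += p * math.log(p / (row[i] * col[j]))
    return mi

# Confident, consistent, balanced predictions give the maximum ln(C).
one_hot = [[1.0, 0.0], [0.0, 1.0]]
mi_sharp = mutual_information(joint_distribution(one_hot, one_hot))

# Uninformative predictions give zero mutual information.
uniform = [[0.5, 0.5], [0.5, 0.5]]
mi_flat = mutual_information(joint_distribution(uniform, uniform))
print(mi_sharp, mi_flat)
```

The two toy cases show why maximising this quantity favours assignments that are both consistent under perturbation and spread evenly over the classes, without using any distance between data points.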

Proposed architecture
We propose the unsupervised classification architecture shown in Fig. 8. It is a deep learning architecture that is trained on time-frequency 2D spectrogram images of transient noise. In the proposed architecture, the feature variables of the input image $\boldsymbol{x}$ and its perturbed image $\boldsymbol{x}' = \boldsymbol{\xi}(\boldsymbol{x})$ are extracted by the pre-trained encoder of the VAE. The perturbation $\boldsymbol{\xi}$ is a transformation that does not change the information required for the classification (see "Pre-processing" section). Subsequently, the IIC learns to maximise the mutual information $I(\boldsymbol{\Phi}(\boldsymbol{z}), \boldsymbol{\Phi}(\boldsymbol{z}'))$, which is computed from the pair of feature variables ($\boldsymbol{z} = \boldsymbol{\mu}_{\boldsymbol{\phi}}(\boldsymbol{x})$, $\boldsymbol{z}' = \boldsymbol{\mu}_{\boldsymbol{\phi}}(\boldsymbol{x}')$).
The clustering of the IIC depends on the initial values of the neural networks, which are provided randomly. Thus, the classification results from the individual classifiers vary slightly. In unsupervised learning, it is difficult to apply an ensemble average over the classification results to remove the dependence on the initial values because the class labels are assigned randomly in each run. In this study, spectral clustering 44 was applied to compress the multiple classification results into one result.

Supplementary figure (feature visualisation of transient noise using t-SNE). Here, f is the pre-trained encoder of the VAE, which is a mapping from the input space $\mathbb{R}^{224 \times 224 \times 4}$ to the feature space $\mathbb{R}^{512}$, and h is t-SNE, which is an embedding from the feature space $\mathbb{R}^{512}$ to the 2D plane. By using the Gravity Spy labels for the input data, the embedded data have labels that are useful for visualisation on the plane. Top right: List of colours corresponding to the Gravity Spy labels. Bottom right: Embedding parameters used for t-SNE. The number of components indicates the number of dimensions after the embedding. Perplexity is a characteristic that is related to the number of nearest neighbours, and the iterator number refers to the maximum number of optimisations for exploring the nearest neighbours. The plotted data rate is the ratio of the number of plots to the entire dataset.
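One way to combine several classifiers whose class indices are arbitrary is to build an affinity matrix that ignores the label names. The sketch below assumes a co-association affinity (the fraction of classifiers that place two samples in the same class); this is a common choice for such ensembles and may differ from the procedure used in the study. The function name and the toy runs are hypothetical.

```python
def co_association(assignments):
    """Fraction of classifiers that place samples a and b in the same class."""
    k = len(assignments)
    n = len(assignments[0])
    affinity = [[0.0] * n for _ in range(n)]
    for labels in assignments:
        for a in range(n):
            for b in range(n):
                if labels[a] == labels[b]:
                    affinity[a][b] += 1.0 / k
    return affinity

# Three hypothetical classifiers over four samples. The first two agree
# on the grouping but use different label indices; the third disagrees
# about the third sample.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [2, 2, 2, 0],
]
A = co_association(runs)
for row in A:
    print([round(v, 2) for v in row])
```

Because only agreement between pairs of samples is counted, permuting the class indices of any classifier leaves the matrix unchanged. Such a matrix could then be passed to a spectral clustering routine that accepts a precomputed affinity, for example scikit-learn's `SpectralClustering` with `affinity="precomputed"`.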

Pre-processing steps (see Fig. 7):
1. For each transient noise, stack the images of the time-frequency spectrograms with the four time widths shown in Fig. 7, and use the stack as the input data for this transient noise. The resolution of the transient noise image for each time duration is 224 px × 272 px (frequency and temporal direction, respectively), and the dimensions of the stacked images are 4 × 224 px × 272 px.
2. Convert the stacked data into two types:
   Input image: Crop the left and right parts of the image equally such that the resulting image has dimensions of 4 × 224 px × 224 px.
   Perturbed image: Crop the left part of the image at a randomly time-shifted position in the range 0-24 px, and also crop the right part of the image, so that the resulting image has dimensions of 4 × 224 px × 224 px.

Figure 1. (a) Example of a 2D image of the time-frequency spectrogram of transient noise in the Gravity Spy dataset. For each transient noise, four time durations (0.5, 1.0, 2.0, and 4.0 s from the left of the figure) are recorded around the centre time. (b) Table showing all the classes, the number of data in each, and their ratio to the total number in the Gravity Spy dataset. There are 22 classes in total, and each of 21 classes is given a name related to a cause of occurrence or a characteristic of the shape on the spectrogram of transient noise. The other is "None_of_the_Above", which does not belong to any class. (c) Example of the image for each class in the Gravity Spy dataset. The figure shows 12 of the 22 classes of transient noise with a duration of 0.5 s.

Figure 2. Left (a) Training parameters for the VAE of the proposed architecture. The dimension of $\boldsymbol{z}$ is the number of outputs of the encoder. The training size rate is the ratio of the data size used as input for training to the total number of data. For the evaluation of the architecture, the input size is set to (1 − Training size rate). The learning rate is the initial learning rate, and the optimiser used is Adam 31. Right (a) Training parameters for the IIC of the proposed architecture. The number of output classes is set to the number of classes to be classified. The classifier number is the number of classifiers that are combined using spectral clustering to improve the classification performance. (b) Training curve during the training and evaluation of the VAE. The solid and dashed lines in the figure show the training objective $\delta \equiv -\sum_{i}^{N} \mathcal{L}(\boldsymbol{x}^{(i)}, \boldsymbol{\theta}, \boldsymbol{\phi})$ at the time of training and evaluation, respectively. (c) Reconstructed images generated by the decoder of the VAE at 100 epochs in Case 3.

Figure 3. Representative and similar images in all the classes classified using unsupervised learning. The representative image, which is denoted by i in the image, is randomly selected from class i ∈ c = {0, . . ., 35}, and its most similar image is to the right of the representative image in class (i). The cosine similarity to the representative image in class i is shown at the top of the image.

Figure 4 .
Figure 4. Confusion matrix of the classification results of the proposed architecture.The vertical axis of the confusion matrix represents the labels and number of data in the Gravity Spy dataset.The lower and upper horizontal axes denote the number of images classified into the unsupervised classes and the labels of the unsupervised classes, respectively.Each column of the confusion matrix is coloured using the ratio of the Gravity Spy-labelled images classified into the unsupervised class (i).In addition, the classes that are separated from the Gravity Spy labels on the confusion matrix, such as classes (0), (13), (26), (32),(35), and (36), also show the ratio values in the matrix.The potential number of classes on Gravity Spy labels which are estimated by unsupervised learning are shown in the right column of the figure.The notation "1" (in white cells) indicates that the number of classes labelled by the Gravity Spy matches the result of the unsupervised learning, and the inequality sign (in light grey cells) indicates that the class is separated into multiple classes in the unsupervised learning.The notation "0" (in dark grey cells) indicates an unclassified class in this training and dataset, and "-" notation indicates that they do not belong to any class of the unsupervised learning.

Figure 5. Representative images and their similar images in the classes obtained by unsupervised learning. In the figure, classes (9), (22), and (30) are separated from the "Blip" class, and classes (5) and (7) are separated from the "Koi_Fish" class. The representative images in the left column are sampled randomly from the images classified into class (i) by unsupervised learning. The similar images in the other columns are sampled randomly and sorted in descending order of their cosine similarity (the value at the top of each image) to the representative image.

Figure 6. Examples of images in the classes with reduced accuracy in unsupervised learning. The major images in the left column are randomly sampled from class i. The minor images in the other columns are sorted in ascending order of their cosine similarity to the major image, that is, they are sampled starting from the lowest similarity to the major image. The Gravity Spy label and the value of the cosine similarity are shown at the top of each sampled image.

Figure 8. Proposed architecture for the classification of transient noise. The tables show the details of the neural network architectures. B denotes batch normalisation applied to an object, and M denotes the mini-batch size. Left: Schematic architecture of the VAE for feature learning. The VAE trains neural networks to maximise the lower bound in (3). The input to the VAE is a perturbed image $\boldsymbol{x}$ of the time-frequency spectrogram of the transient noise. This pre-processing allows the encoder to learn features that do not depend on the perturbation. At the output layer of the encoder, the average and variance of the feature variable $\boldsymbol{z}$ are output from the same network and separated, each with dimensions (M, 512). Subsequently, the feature variables $\boldsymbol{z}$ are constructed using the reparameterisation trick. The decoder uses $\boldsymbol{z}$ to generate a reconstructed image that is close to the input image. Right: Schematic architecture of the IIC for classification. The IIC trains neural networks to maximise the mutual information between the input data and its perturbed data. The inputs to the pre-trained encoders of the VAE are the original and perturbed images, respectively. Both encoders have the same architecture, and the dashed lines in the figure indicate that the weights of the neural networks are shared. The IIC classifies transient noise by applying the SoftMax activation function at the output layer to the feature, which is the output of the pre-trained encoder. C is the estimated number of classes of the transient noise, and W is the number of classes used in the over-clustering.
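The two operations named in this caption, the reparameterisation trick and the mutual-information objective, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the function names are hypothetical, the encoder and classifier networks are replaced by hand-written toy outputs, the joint distribution is symmetrised as in the usual IIC formulation, and the over-clustering head with W classes is omitted.

```python
import math
import random


def reparameterise(mean, variance, rng):
    """Sample z = mean + sqrt(variance) * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, variance)]


def iic_mutual_information(probs, probs_perturbed, eps=1e-12):
    """Mutual information between the class assignments of paired SoftMax outputs.

    probs, probs_perturbed: M rows of C class probabilities for the original
    and the perturbed images. IIC training would maximise the returned value.
    """
    m = len(probs)
    c = len(probs[0])
    # Joint distribution over (class of original, class of perturbed), batch-averaged.
    joint = [[sum(probs[i][a] * probs_perturbed[i][b] for i in range(m)) / m
              for b in range(c)] for a in range(c)]
    # Symmetrise, since the order of the pair carries no meaning.
    joint = [[(joint[a][b] + joint[b][a]) / 2.0 for b in range(c)] for a in range(c)]
    row = [sum(joint[a]) for a in range(c)]
    col = [sum(joint[a][b] for a in range(c)) for b in range(c)]
    mi = 0.0
    for a in range(c):
        for b in range(c):
            if joint[a][b] > eps:
                mi += joint[a][b] * math.log(joint[a][b] / (row[a] * col[b]))
    return mi


rng = random.Random(0)
z = reparameterise([0.5, -1.0], [0.0, 0.0], rng)  # zero variance returns the mean

# Confident, consistent, balanced predictions over two classes.
consistent = iic_mutual_information([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [0.0, 1.0]])
# Uninformative predictions.
uniform = iic_mutual_information([[0.5, 0.5], [0.5, 0.5]],
                                 [[0.5, 0.5], [0.5, 0.5]])
```

In the toy example, consistent and balanced assignments reach the maximum value for two classes, log 2, whereas uniform predictions give zero. This is why maximising the objective favours confident predictions that agree between an image and its perturbed copy while spreading the data evenly over the classes.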