Advanced Steel Microstructural Classification by Deep Learning Methods

The inner structure of a material is called its microstructure. It stores the genesis of a material and determines all its physical and chemical properties. While microstructural characterization is widespread and well known, microstructural classification is mostly done manually by human experts, which gives rise to uncertainties due to subjectivity. Since the microstructure can be a combination of different phases or constituents with complex substructures, its automatic classification is very challenging and only a few prior studies exist. Prior works focused on features designed and engineered by experts and classified microstructures separately from the feature extraction step. Recently, Deep Learning methods have shown strong performance in vision applications by learning the features from data together with the classification step. In this work, we propose a Deep Learning method for microstructural classification, using certain microstructural constituents of low carbon steel as examples. This novel method employs pixel-wise segmentation via a Fully Convolutional Neural Network (FCNN) accompanied by a max-voting scheme. Our system achieves 93.94% classification accuracy, drastically outperforming the state-of-the-art method with 48.89% accuracy. Beyond the strong performance of our method, this line of research offers a more robust and, above all, objective way to approach the difficult task of steel quality appreciation.


Introduction
Steel is still one of the most important and versatile classes of materials because of its excellent mechanical properties at low cost, which enables a huge variety of applications [1,2]. The mechanical properties of steel are mainly determined by its microstructure [3], so the performance of the material depends strongly on the distribution, shape and size of the phases in the microstructure [4]. Thus, correct classification of these microstructures is crucial [5]. The microstructure of steels has different appearances, influenced by parameters such as alloying elements, rolling setup, cooling rates, heat treatment and further post-treatments [6]. Depending on how the steel is produced under these parameters, the microstructure consists of different phases such as ferrite, cementite, austenite, pearlite, bainite and martensite [7].

This motivation leads us to use Deep Learning methods, which have recently attracted the attention of scientists due to their strong ability to learn high-level features from raw input data. Recently, these methods have been applied very successfully to computer vision problems [8,9]. They are considered a set of artificial intelligence networks [9], such as Convolutional Neural Networks (CNNs), and can be trained for recognition and semantic pixel-wise segmentation tasks. Unlike traditional methods, in which feature extraction and classification are done separately, Deep Learning methods perform these steps simultaneously. The advantage of such methods is that they process raw, not pre-processed, information directly into semantic outputs. Masci et al. [10] used CNNs to find defects in steel.

In this work, we show that Deep Learning can be successfully applied to identify microstructural patterns. Our method uses a segmentation-based approach based on Fully Convolutional Neural Networks (FCNNs), an extension of CNNs, accompanied by a max-voting scheme to classify microstructure objects. Our experimental results show that the proposed method considerably increases the classification accuracy compared with the state of the art. They also show the effectiveness of pixel-based approaches compared with object-based ones in microstructure classification.

Related Works
Based on the instrument used for imaging, the related works can be grouped into a few categories, such as Light Optical Microscopy (LOM)-based and Scanning Electron Microscopy (SEM)-based imaging. High-resolution SEM imaging is very expensive compared with LOM imaging in terms of time and operating costs. However, low-resolution LOM imaging makes distinguishing microstructures based on their substructures even more difficult. Nowadays, microstructure classification is performed by an expert who observes a sample image and assigns one of the microstructure classes to it. As experts differ in their level of expertise, different experts can sometimes be expected to arrive at different opinions. Nevertheless, thanks to highly professional human experts, this task has so far been accomplished with low error.

Regarding automatic steel microstructure classification, microstructures are typically defined by means of standard procedures in metallography. Vander Voort [11] used LOM, but without any learning of the microstructural features. This method defined only procedures with which an expert can decide on the class of the microstructure. Other works [14][15][16][17][18][19] made morphological or crystallographic properties accessible. Any approach for the identification of phases in multiphase steel relies on these properties. These works aimed at the development of advanced metallographic methods for morphological analysis using SEM, accompanied by pixel- and context-based image analysis steps. Previously, Pauly et al. [20] proposed a data mining method that extracts morphological features and classifies them using SVMs [21]. They applied their method to a contrasted and etched steel dataset acquired by SEM and LOM imaging. As one of only a few works on this task, Pauly et al.'s method showed interesting results. However, it could not reach 50% accuracy in microstructure classification on the given dataset with four classes, due to the high complexity of the substructures and insufficiently discriminative features.

Deep learning methods have been applied to object classification and semantic image segmentation in different applications. AlexNet, a CNN proposed by Krizhevsky et al. [22] with 7 layers, won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [23] in 2012, one of the best-known object classification challenges in the computer vision community. This is the main reason that Deep Learning methods drew a lot of attention. AlexNet improved the ILSVRC 2012 result by 10%, a huge increase for this challenge. VGGNet, a CNN proposed by Simonyan et al. [8], has even more layers than AlexNet and achieves better accuracy. FCNN architectures, proposed by Long et al. [24], are among the first and best-known works to adapt object classification CNNs to the semantic segmentation task. FCNNs and extensions of the approach are currently the state of the art in semantic segmentation on a range of benchmarks, including the Pascal VOC image segmentation challenge 2015 [25] and Cityscapes [26]. Our method transfers the success of Deep Learning for segmentation tasks to the challenging problem of microstructure

Review of recent Deep Learning techniques in computer vision
Previous work on steel microstructure classification relies on hand-designed features. Engineering high-level texture features, e.g. morphological features [20], requires complex feature extraction algorithms and a tedious trial-and-error process of finding good features in the first place. Deep Learning methods are capable of learning complex features from raw input data, and these learned features have turned out to be superior across a wide range of application domains. These methods are inspired by neural networks and an "end-to-end" learning paradigm. Unlike in classical methods, feature extraction and classification are done simultaneously and optimized jointly. This is facilitated by composing complex, parameterized functions from simple, efficient, piece-wise differentiable building blocks. The whole system can therefore be trained by gradient descent on the parameters, using a recursive gradient computation via the chain rule of differentiation (the back-propagation algorithm [27]). Among these Deep Learning methods, Convolutional Neural Networks (CNNs) and their modification, Fully Convolutional Neural Networks (FCNNs), have been shown to be particularly successful for image classification and segmentation.

Convolutional Neural Networks (CNNs)
CNNs were developed originally for vision tasks. The basic concept of CNNs dates back to the 1980s [9,28], but owing to the increase in computational resources, available data and better learning algorithms, they have recently attained a qualitatively new level across a range of domains and are therefore the state-of-the-art approach in many vision applications. As images are of high dimensionality, applying traditional neural networks with fully connected neurons to visual data would lead to a huge number of trainable parameters. Furthermore, such networks are not able to exploit the natural structure of images, such as the correlation among neighboring pixels and stationary image statistics. Typical convolutional neural networks consist of multiple, repeating components that are stacked in layers: convolution, pooling, fully-connected and classifier layers.

Convolution Layer convolves the input data with a linear convolution filter as shown in Equation 1:

$$(h_k)_{ij} = (W_k * x)_{ij} + b_k \qquad (1)$$

where $k = 1, \ldots, K$ is the index of the k-th feature map in the convolution layer, $(i, j)$ is the index of the neuron in the k-th feature map, and $x$ represents the input data. $W_k$ and $b_k$ are the trainable parameters (weights) of the linear filter (kernel) and the bias for the neurons in the k-th feature map, respectively. $(h_k)_{ij}$ is the output value of the neuron at position $(i, j)$ in the k-th feature map. The spatial 2D convolution operation between the input data and the filter of the feature map is denoted by "$*$".
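As an illustration of Equation 1, the following is a minimal pure-Python sketch of a single feature map computed by a "valid" 2D convolution plus bias. The function name `conv2d_valid`, the toy input and the kernel values are assumptions made for this example and are not taken from the paper; note also that many deep learning frameworks implement cross-correlation (no kernel flip) under the name convolution.

```python
def conv2d_valid(x, w, b=0.0):
    """Compute (W * x)_ij + b for one feature map (hypothetical helper).

    x and w are 2D lists; only positions where the kernel fits
    entirely inside the input are evaluated ("valid" mode).
    """
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    # Flip the kernel in both directions: this makes the operation a
    # true convolution rather than a cross-correlation.
    wf = [row[::-1] for row in w[::-1]]
    h = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    acc += wf[u][v] * x[i + u][j + v]
            h[i][j] = acc + b  # add the bias b_k of this feature map
    return h


# Toy 3x3 input and 2x2 kernel (illustrative values only).
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
w = [[1, 0],
     [0, -1]]
feature_map = conv2d_valid(x, w, b=0.5)
```

A convolution layer with K feature maps would repeat this computation with K different kernels and biases, all sharing the same input.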
Pooling Layer is a non-linear down-sampling layer which takes either the maximum or the average value in each sub-region of the input data. Today's CNNs typically employ a max pooling layer in order to achieve invariance to small shifts in the feature maps.
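The max pooling operation described above can be sketched in a few lines. The window size and stride of 2 and the helper name `max_pool2d` are assumptions for this example; the paper does not state the pooling parameters at this point.

```python
def max_pool2d(x, size=2, stride=2):
    """Down-sample a 2D list by taking the maximum of each window."""
    out = []
    for i in range(0, len(x) - size + 1, stride):
        row = []
        for j in range(0, len(x[0]) - size + 1, stride):
            window = [x[i + u][j + v]
                      for u in range(size) for v in range(size)]
            row.append(max(window))
        out.append(row)
    return out


# Toy 4x4 feature map (illustrative values only).
feature_map_in = [[1, 3, 2, 0],
                  [4, 2, 1, 5],
                  [0, 1, 9, 8],
                  [2, 6, 7, 3]]
pooled = max_pool2d(feature_map_in)
```

Each output value only records the largest activation in its window, so moving that activation by one pixel inside the window leaves the output unchanged, which is the source of the shift invariance mentioned above.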
Fully-connected Layer is a classic neural network layer in which the features of the next layer are a linear combination of the features of the previous layer, as shown in Equation 2:

$$y_k = \sum_l W_{kl}\, x_l \qquad (2)$$

where $y_k$ represents the k-th output neuron and $W_{kl}$ is the weight between $x_l$ and $y_k$.
Activation Function usually follows a pooling or fully-connected layer and introduces a non-linear operation such as the sigmoid or the rectified linear unit (ReLU). The ReLU function, $\mathrm{relu}(x) = \max(0, x)$, is the most common choice because its gradient is piece-wise constant and does not vanish for high activations (in contrast to e.g. the sigmoid).
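The difference in gradient behavior between ReLU and sigmoid can be checked numerically. The sketch below uses only the standard definitions of both functions; the helper names and the test activation of 10.0 are chosen for illustration.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Piece-wise constant: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


high_activation = 10.0
g_relu = relu_grad(high_activation)        # stays at 1.0
g_sigmoid = sigmoid_grad(high_activation)  # close to zero (saturation)
```

For a large activation the sigmoid gradient is several orders of magnitude smaller than the ReLU gradient, which slows learning in the layers below it.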
Classifier Layer is the last layer of the network and computes the class posterior probability of the input image. The most widely used classifier layer in CNNs is the softmax function. The vector of real values in (0, 1) generated by this function denotes a categorical probability distribution, given by Equation 3 for the j-th class and an input vector X:

$$P(y = j \mid X) = \frac{\exp(X_j)}{\sum_{k=1}^{K} \exp(X_k)} \qquad (3)$$

where $X_j$ is the j-th component of $X$ and $K$ is the number of classes.
Loss Layer is used to measure the difference between the true class labels and the corresponding predicted class labels. The most well-known loss layer for classification is the cross-entropy loss, shown in Equation 4:

$$\text{Cross-entropy Loss Function} = -\sum_{x} P'(x) \log P(x) \qquad (4)$$

in which the softmax classifier output $P(x)$ is trained to minimize the cross-entropy between the "true" one-hot encoded distribution of the data, $P'(x)$, and the predicted class probabilities.
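The classifier and loss layers can be illustrated together with a small numerical sketch. It assumes four classes, matching the four microstructure classes mentioned in the related work, and uses made-up scores; the function names are hypothetical. Subtracting the maximum score before exponentiation is a standard numerical-stability measure and does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of real-valued class scores into probabilities."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(true_dist, pred_dist):
    """Cross-entropy between a true and a predicted distribution."""
    return -sum(t * math.log(p)
                for t, p in zip(true_dist, pred_dist) if t > 0)


# Illustrative scores for four classes; the true class is class 0.
scores = [2.0, 1.0, 0.1, -1.0]
probs = softmax(scores)
one_hot = [1, 0, 0, 0]
loss = cross_entropy(one_hot, probs)
```

With a one-hot target, the sum reduces to the negative log-probability of the true class, so the loss approaches zero as the predicted probability of that class approaches one.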
CNN Training is done end-to-end, i.e. without separation between the feature extraction and classification steps. CNNs receive raw input data (e.g. image pixels) and produce semantic outputs. They are trained to learn the hidden patterns in the data using a set of training data. A loss function L measures how far the outputs of the network are from the correct outputs. This optimization problem is solved by gradient descent [9] together with the well-known back-propagation procedure [9], which propagates the gradient of the loss function to previous layers. Gradient descent is performed based on the loss function gradient $\partial L / \partial w_{ij}$. The weights are then adapted to decrease the loss by moving them in the direction opposite to the gradient, with a step size (learning rate) $\eta$, as in Equation 5:

$$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial L}{\partial w_{ij}} \qquad (5)$$

Dropout Layer [29] is a technique to improve the generalization power of CNNs by randomly ignoring (dropping) neurons and their corresponding parameters from the network architecture during the training phase only.
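The weight update of Equation 5 can be demonstrated on a toy problem. The sketch below is not the training procedure of the paper: the quadratic loss, the target values, the learning rate of 0.1 and the function names are all assumptions chosen so that the gradient can be written down by hand instead of being obtained by back-propagation.

```python
def sgd_step(weights, grads, lr):
    """One gradient descent update: w <- w - lr * dL/dw."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Hypothetical toy loss L(w) = sum_i (w_i - target_i)^2.
target = [1.0, -2.0]


def toy_loss(w):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def toy_grad(w):
    # Analytic gradient of the toy loss with respect to each weight.
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]


w = [0.0, 0.0]
initial_loss = toy_loss(w)
for _ in range(200):
    w = sgd_step(w, toy_grad(w), lr=0.1)
final_loss = toy_loss(w)
```

In a real network the gradient for every weight comes from back-propagation through all layers, but the update rule applied to each weight is the same.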
As annotated SEM images are rare, training a network on such a dataset will most likely lead to overfitting to the noise present in the training set. To address this problem, networks which have already been trained on large datasets such as ImageNet are trained further on the new dataset. This trick is known as "Transfer Learning" or "Fine-tuning". With this technique, the weights are initialized more effectively, so it can be assumed that the network is already close to a good local minimum and needs far less training data to converge. Features can also be extracted from pretrained CNNs without fine-tuning. In this case, the output of the fully-connected layer before the classification layer is taken as the feature vector. These features are known as DeCAF [30] and can be classified with e.g. SVMs. Another trick when only small datasets are available is to artificially enlarge the training set, e.g. by flipping or rotating the images while preserving the class labels. This trick is known as "Data Augmentation".

Network Architectures: In the following, we describe the three specific CNN architectures (with increasing depth) that we use for classification.

CIFARNet is a CNN with 3 convolutional and max-pooling layers and two fully-connected layers. It is a modified version of LeNet [9], proposed by LeCun et al., containing 431K parameters, to which ReLU activation and dropout layers (described above) have been added.

AlexNet, proposed by Krizhevsky et al. [22], is a deep CNN with 60 million parameters. AlexNet is deeper than CIFARNet: it has 8 weight layers (5 convolutional and 3 fully-connected) as well as 3 max-pooling layers.
VGGNet, proposed by Simonyan et al. [8], is even deeper. The version known as VGG16 has 13 convolutional, 5 pooling and 3 fully-connected layers and 138 million parameters. Another version, VGG19, has 19 layers. The VGG19 network achieved empirically better performance than CIFARNet and AlexNet. VGGNet uses a 3x3 convolution kernel size, which results in fewer parameters for the same support (receptive field) when layers are stacked.
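The claim about 3x3 kernels can be verified with simple arithmetic. The sketch below compares a stack of three 3x3 convolution layers with a single 7x7 layer; the channel count of 64, the omission of biases and the helper names are assumptions made for this example.

```python
def conv_weights(kernel, channels, layers=1):
    """Weight count of stacked conv layers with equal in/out channels
    (biases ignored)."""
    return layers * kernel * kernel * channels * channels


def receptive_field(kernel, layers):
    """Support of a stack of stride-1 convolutions with the same kernel."""
    return 1 + layers * (kernel - 1)


channels = 64  # illustrative channel count
stack_3x3 = conv_weights(3, channels, layers=3)
single_7x7 = conv_weights(7, channels, layers=1)
support_stack = receptive_field(3, 3)
support_single = receptive_field(7, 1)
```

Both configurations cover a 7x7 support, but the stacked 3x3 layers need 27 weights per channel pair instead of 49, and they additionally allow a non-linearity after each layer.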

Fully Convolutional Neural Networks (FCNNs)
While CNNs have been shown to be successful for image classification, we are also interested in predicting a semantic class for each pixel, which is called semantic segmentation. CNNs can be extended to perform this task by removing the fully-connected layers and "sliding" the CNN over a larger image. However, the resulting semantic segmentation would not be of the same resolution as the original image. Therefore, Long et al. [24] proposed Fully Convolutional Neural Networks with an additional up-sampling layer.
Upsampling layer can be realized by simple bilinear interpolation (which is differentiable). Today's FCNs also learn the up-sampling kernel, which is parameterized as shown in Equation 6:

$$y_{ij} = \sum_{\alpha,\beta=0}^{1} \left|1 - \alpha - \{i/f\}\right| \left|1 - \beta - \{j/f\}\right| x_{\lfloor i/f \rfloor + \alpha,\, \lfloor j/f \rfloor + \beta} \qquad (6)$$

where $x$ and $y$ are the input and output of the up-sampling layer, and $f$ and $\{\cdot\}$ stand for the up-sampling factor and the fractional part, respectively. One can consider an up-sampling layer with factor f as a convolution operation which, instead of an integer stride, has a fractional input stride of 1/f. To capture different information about the input data while preserving the dimensions of the input feature maps, the outputs of different pooling layers can be fused, which is known as "skip layers". An FCNN can be seen as an encoder-decoder system in which the convolutional layers encode the input image by extracting high-level learned features and the deconvolution layers decode these features to produce the semantic segmentation of the input. As semantic image segmentation datasets are typically small, these networks are frequently pre-trained on an object recognition dataset and then fine-tuned to perform the segmentation task.
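The fixed bilinear form of the up-sampling layer can be sketched directly from the interpolation rule: each output pixel is a weighted sum of the four nearest input pixels, with weights given by the fractional parts of i/f and j/f. In the sketch below, clamping indices at the image border is an assumption of this example, and the helper name and toy input are hypothetical.

```python
import math


def bilinear_upsample(x, f):
    """Up-sample a 2D list by an integer factor f with bilinear weights."""
    h, w = len(x), len(x[0])
    out = [[0.0] * (w * f) for _ in range(h * f)]
    for i in range(h * f):
        for j in range(w * f):
            i0, j0 = math.floor(i / f), math.floor(j / f)
            ri, rj = i / f - i0, j / f - j0  # fractional parts
            acc = 0.0
            for a in (0, 1):
                for b in (0, 1):
                    # Clamp at the border (assumption of this sketch).
                    ii = min(i0 + a, h - 1)
                    jj = min(j0 + b, w - 1)
                    acc += abs(1 - a - ri) * abs(1 - b - rj) * x[ii][jj]
            out[i][j] = acc
    return out


# Toy 2x2 feature map up-sampled by a factor of 2.
coarse = [[0.0, 2.0],
          [4.0, 6.0]]
fine = bilinear_upsample(coarse, 2)
```

Output pixels that coincide with an input pixel reproduce its value, and pixels in between receive the distance-weighted average of their neighbors. A learned up-sampling layer starts from such weights and adjusts them during training.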
In this chapter, we describe our methods for classifying microstructures in steel. First, we apply deep learning methods to classify each cropped steel object from SEM or LOM images, which we call object-based microstructure classification. Then we explain our main method, which classifies each pixel as one of the microstructure classes and then classifies each object by considering the classes of the pixels inside it. To avoid misunderstandings: substructure and texture are used interchangeably, owing to the different terminology in materials science and computer vision.

Object-based Classification of Microstructures with CNNs
The first approach we take is to classify microstructures (second/dual phases) with CNNs. However, CNNs work on images of fixed size. Therefore, we normalize the image of each structure by cropping objects from the images, resizing them to a fixed size and then applying the CNN classifier.
Cropping & Normalization is performed using binary segmented LOM images obtained by threshold-based binary segmentation. We multiply the binary LOM image with the corresponding SEM image to produce a masked SEM image in which all non-object content (here ferrite) is removed, as illustrated in

Segmentation-based Classification of Microstructures with FCNNs
Resizing each cropped object image to a fixed size, as required in the object-based CNN approach, could destroy valuable information related to the phase texture through heavy distortion. Pixel-wise classification (segmentation), on the other hand, can work with any image size. Thus, we propose a segmentation-based microstructure classification approach for SEM or LOM images that uses an FCNN and a max-voting scheme to classify each object. The processing pipeline for this approach is illustrated in Figure 3. We refer to this approach as max-voted FCNN (MVFCNN). Using this method, SEM or LOM images are classified pixel-wise.
In our experiments, we used the network architecture proposed by Long et al. [24]. The architecture is almost the same as VGG16, except that the fully-connected layers are converted to convolutional ones and up-sampling and skip layers are added. Skip layers are used to fuse coarse semantic and local appearance information to improve the segmentation. Without skip layers, the network is called FCN-32s; with skip layer A, and with A and B together (as denoted in Figure 3), the network is called FCN-16s and FCN-8s, respectively. We use cropped raw SEM or LOM images as input to the FCNN. However, the maximum size of the cropped images is determined by GPU memory. The output of an FCNN is a 3D matrix with the number of channels equal to the number of classes. Each pixel in this matrix has a value representing the confidence score or posterior probability (the output of the softmax classifier, represented by P in Figure 3) of the FCNN for the corresponding phase class C_i. The pixel-wise classification step is then performed by choosing, for each pixel, the class with the highest posterior probability. Afterward, all segmented patches belonging to an input image are stitched together, as illustrated in Figure 3. In order to classify objects (microstructures) rather than pixels, a max-voting policy is applied to each object, assigning it to the class of the majority of its pixels. In other words, each object is assigned the class that received the largest number of pixel votes. The location information of the objects is obtained using the binary LOM image. The motivation for this aggregation step is that, using stitched patches, we can decide the class of each whole object instead of only a part of it.
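The two decision steps described above, the per-pixel argmax over class probabilities and the per-object max-voting, can be sketched as follows. This is an illustrative sketch and not the authors' implementation: the data layout (`probs[c][i][j]` for class c), the object-id map with 0 for the matrix, and the function names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def argmax_labels(probs):
    """Per-pixel class with the highest probability; probs[c][i][j]."""
    n_cls = len(probs)
    h, w = len(probs[0]), len(probs[0][0])
    return [[max(range(n_cls), key=lambda c: probs[c][i][j])
             for j in range(w)] for i in range(h)]


def max_vote(label_map, object_ids):
    """Assign each object the class most of its pixels were labeled with.

    object_ids marks object membership per pixel; 0 means matrix
    (background) and is ignored.
    """
    votes = defaultdict(Counter)
    for lab_row, id_row in zip(label_map, object_ids):
        for lab, oid in zip(lab_row, id_row):
            if oid != 0:
                votes[oid][lab] += 1
    return {oid: c.most_common(1)[0][0] for oid, c in votes.items()}


# Toy stitched segmentation (class labels) and object map.
label_map = [[1, 1, 2, 0],
             [1, 1, 2, 0],
             [0, 0, 3, 3]]
object_ids = [[1, 1, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 2, 2]]
object_classes = max_vote(label_map, object_ids)

# Two classes on a 1x2 image: pixel 0 favors class 1, pixel 1 class 0.
tiny_probs = [[[0.1, 0.7]], [[0.9, 0.3]]]
tiny_labels = argmax_labels(tiny_probs)
```

In this toy case, object 1 contains four pixels labeled class 1 and two labeled class 2, so it is assigned class 1 even though part of it was segmented differently. Ties are resolved here by the order in which classes are first encountered, which is a detail of this sketch rather than of the method.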

Implementation Details
In order to train and test the CNNs and FCNNs, the Caffe [31] framework and an NVIDIA K40m GPU were used.

Training Object-based CNN:
All cropped object images are resized to 224x224, which is the fixed input size of VGG16. We also use this input size for training networks from scratch. We use a fixed learning rate of 0.001, a momentum of 0.9 and a weight decay of 0.004 in the stochastic gradient descent algorithm. The training iterations continue until the training loss reaches a plateau. For training CIFARNet from scratch, the standard deviation of the Gaussian noise for the initial random weights is 0.0001 for the first convolutional layer and 0.01 for the rest. For fine-tuning, the pre-trained VGG16 network is used. We initialized the last fully-connected layer with random Gaussian noise with a standard deviation of 0.01. The learning rate of 0.0001 (chosen on the validation set) was used to train CIFARNet and VGG16 respectively.

Using pretrained extracted features (DeCAFs):
To classify DeCAF features using SVMs, we trained a multi-class SVM with an RBF kernel on features extracted from the pretrained VGG19 [8] network. In the VGG19 architecture, the output of the fully-connected layer before the classification layer, with a size of 1x1x4096, was taken as the feature vector.
Regarding input images, patches of 1000 × 1000 pixels were cropped and a batch size of 1 was used, due to memory constraints. We first trained the FCN-32s model, then added a skip layer (FCN-16s) and fine-tuned it. Afterward, another skip layer (FCN-8s) was added to fine-tune the final model. Direct training of FCN-8s gives worse results. The pre-trained FCN-32s model was trained on the ImageNet dataset. The network was trained for 7000 iterations over ∼4.5 days. The inference time for a 1000x1000 pixel input image is ∼600 ms.

Class Balancing and Data Augmentation
To address the problem of class imbalance in the dataset, cropping in the MVFCNN method is carried out with different stride parameters in the horizontal and vertical directions for different classes, so that in the end all classes have the same number of patches; i.e. the class with the smallest number of images is cropped with a smaller stride than the class with the largest number of training images.
The resulting cropped patches were also rotated by 90, 180 and 270 degrees to augment the dataset. In this case, the number of training images is increased three times.
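The effect of the stride on the number of patches, and the rotation-based augmentation, can be sketched with plain lists. The patch size of 1000 pixels follows the implementation details; the image width of 3000 pixels, the two stride values and the function names are hypothetical and only serve to show how a smaller stride yields more patches for an under-represented class.

```python
def crop_origins(length, patch, stride):
    """Start coordinates of patches along one image axis."""
    return list(range(0, length - patch + 1, stride))


def rotate90(patch):
    """Rotate a 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def rotations(patch):
    """Return the 90, 180 and 270 degree rotations of a patch."""
    r90 = rotate90(patch)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return [r90, r180, r270]


# Hypothetical 3000-pixel-wide image, 1000-pixel patches.
origins_large_class = crop_origins(3000, 1000, stride=1000)
origins_small_class = crop_origins(3000, 1000, stride=500)

# Toy 2x2 "patch" to show the three rotated copies.
augmented = rotations([[1, 2],
                       [3, 4]])
```

Along one axis, halving the stride raises the patch count from 3 to 5 in this example, so a class with fewer source images can be brought to the same number of patches as a larger class. The class label of each rotated copy is the same as that of the original patch.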

Dataset
In our experiments, we use a steel image dataset [20] provided by the Material Engineering Center Saarland (MECS) in Saarbrücken, Germany.

Task Definition and Metrics
In the microstructure classification problem, the goal is to classify objects inside steel images according to microstructure classes. The notion of an object is depicted in Figure 5. Objects are second-phase regions, which can be considered "foreground", while the matrix (ferrite) is the first-phase region and acts as "background". The substructure of an object can be seen as a texture, to which we try to assign the correct label. Therefore, the more objects are classified correctly, the more accurate the system is. In addition to the classification task, we also evaluate the semantic segmentation performance using the following four metrics [24]. In the following, $n_{ij}$ is the number of pixels of class i predicted as class j, $n_{cl}$ is the number of different classes and $t_i = \sum_j n_{ij}$ is the total number of pixels of class i:
• pixel accuracy: $\sum_i n_{ii} / \sum_i t_i$
• mean accuracy: $(1/n_{cl}) \sum_i n_{ii} / t_i$
• mean intersection over union (IU): $(1/n_{cl}) \sum_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
• frequency weighted IU: $(\sum_k t_k)^{-1} \sum_i t_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
In microstructure classification via the pixel-based MVFCNN approach, segmented objects are assigned the class that the majority of their pixels vote for (max-voting). Evaluation is then carried out by counting the correctly classified microstructure objects. After this step, the results of the object-based and pixel-based methods are comparable with each other as well as with the state of the art.

Figure 5. In this example, the substructure (microstructure) is martensite. The matrix is the first phase and a grain of the second (dual) phase can be considered an object.
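The segmentation metrics named in the task definition can be computed from a pixel-level confusion matrix. The sketch below follows the standard definitions used for FCN evaluation, with `n[i][j]` the number of pixels of class i predicted as class j; the two-class matrix is made up for illustration, the function name is hypothetical, and the code assumes that every class occurs at least once.

```python
def segmentation_metrics(n):
    """Pixel accuracy, mean accuracy, mean IU and frequency weighted IU
    from a confusion matrix n[i][j] (true class i, predicted class j)."""
    n_cl = len(n)
    t = [sum(row) for row in n]                       # pixels per true class
    pred = [sum(n[i][j] for i in range(n_cl))         # pixels per predicted class
            for j in range(n_cl)]
    diag = [n[i][i] for i in range(n_cl)]
    total = sum(t)
    iu = [diag[i] / (t[i] + pred[i] - diag[i]) for i in range(n_cl)]
    return {
        "pixel_accuracy": sum(diag) / total,
        "mean_accuracy": sum(diag[i] / t[i] for i in range(n_cl)) / n_cl,
        "mean_iu": sum(iu) / n_cl,
        "fw_iu": sum(t[i] * iu[i] for i in range(n_cl)) / total,
    }


# Illustrative two-class confusion matrix (not data from the paper).
confusion = [[8, 2],
             [1, 9]]
metrics = segmentation_metrics(confusion)
```

For this toy matrix the pixel accuracy is 17/20 = 0.85, while the per-class IU values are 8/11 and 9/12, which shows that IU penalizes both missed pixels and false positives of a class.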

Evaluation
Table 1 presents the comparable results of our experiments with the object-based CNN and the pixel-based MVFCNN using SEM images, as well as the performance of the state-of-the-art method of Pauly et al. [20]. The state-of-the-art method achieves only 48.89% microstructure classification accuracy. In contrast, CIFARNet trained from scratch outperforms this method with 57.03% accuracy. Pretrained VGG19 features with an SVM classifier and the fine-tuned VGG16 perform even better, with 64.84% and 66.50%, respectively. All of these methods are object-based. However, Table 1 shows that microstructure classification using pixel-based methods achieves a considerably higher accuracy of 93.94%, which indicates that pixel-based classification of microstructures is a much better strategy than the object-based one. As mentioned, the accuracies of both strategies can be compared directly. Table 2 shows the effect of data augmentation and fine-tuning in the MVFCNN approach using SEM images. The results show that fine-tuning with balanced, augmented training data achieves the best result of 93.94%. Of the training data and the training strategy, fine-tuning has the largest impact, with a 32.48% improvement over the MVFCNN trained from scratch with unbalanced data. Table 3 provides the accuracy of the method using LOM images, which reaches 70.14% with the same configuration that achieves the best result on SEM images. This result indicates that LOM is not as useful as SEM for microstructure classification. Figure 4 shows the confusion matrix of the best-performing method, the MVFCNN approach fine-tuned with balanced and augmented SEM training data, without the matrix (ferrite) class. In this matrix, objects missed in the segmentation step (48 objects) have not been taken into account. That is why the overall accuracy is 95.23%, which is slightly higher than the best classification accuracy of 93.94% shown in Table 1. The confusion matrix shows the number of samples of each class predicted by the system. The recall and precision numbers give the percentage of correctly classified objects among the actual classes and among the predictions, respectively; e.g. the network has classified 1190 martensite objects correctly, which is 94.97% of all martensite objects. On the other hand, the network has misclassified 24 and 39

Figure 6a shows some successful examples of SEM segmentation using the FCNN trained with balanced and augmented training data, and Figure 6b shows some failure cases. Regarding the failure case of bainite, it should be noted that the network was trained on non-normalized bainite samples, which showed no failure cases in the test set. The final results of microstructure classification, obtained by stitching and applying the max-voting scheme over the segmented patches, are depicted in Figure 7. These results show that most objects in each microstructure image have been classified correctly.
If we consider the classification of whole microstructure images, all 10 test images have been classified correctly.
Table 2. The effect of fine-tuning and data augmentation techniques using the MVFCNN approach. The results show that fine-tuning together with data augmentation achieves the best result. However, the effect of data augmentation is not significant.

To compare the segmentation performance between LOM and SEM images, Figure 8a shows the segmentation of patches of four microstructures using LOM and SEM images with the FCNN approach, trained with balanced and augmented training data

as network parameters can learn the pattern in this dataset. The surprising performance of the MVFCNN method indicates that the performance of object-based CNNs is negatively influenced by the image resizing step. In other words, we observe that resizing destroys the texture inside objects, hampering the accurate differentiation of objects bigger or smaller than the input size of the network. We also experimented with splitting big objects into 224x224 objects, which resulted in higher performance and supports our claim. However, this approach makes the system less practical and introduces a hyperparameter into the system, namely the split size.

Classification using MVFCNN:
The results in Table 1 show that the MVFCNN approach using SEM images achieves strong performance. This indicates that classification using pixel-wise segmentation is more efficient and accurate than the object-based CNN method and significantly better than hand-crafted features. Resizing is no longer required in this method and therefore has no impact on MVFCNN performance. The confusion matrix in Figure 4 shows that the network still misclassifies some martensite objects, due to the confusion of martensite with tempered martensite and bainite, which have similar textures, unlike the texture of pearlite, which is easy to distinguish. All wrongly classified bainite objects belong to the normalized bainite sample in the test set, a type which is not present in the training set. It is impressive that under this condition the network is still able to classify more than half of the objects in the normalized bainite sample correctly. The high accuracy achieved indicates that considering each pixel together with its neighboring pixels plays a crucial role in the correct classification of objects. Based on the results of Table 2, data augmentation improves the performance by 2%, which is not considerable. One possible reason is that different rotations of the textures inside objects are already present in the dataset before data augmentation, so the network has already seen and learned them. For example, in the pearlite microstructure there are many cases in which a rotated patch still contains orientations that were present before the rotation. The results in Table 3 confirm our expectation that LOM images perform poorly compared with SEM images due to their low resolution. The artifacts due to stitching as well as the varying illumination in LOM images degrade the classification accuracy. The results in Table 3 follow the same trend as those in Table 1 and confirm our findings. Table 5 shows that pixel-wise classification using MVFCNN achieves a high accuracy of 93.94%, even though it faces a more difficult problem, in the presence of complex microstructure textures, than the object-based classification approach. These results also show that the better the pixel-wise segmentation metrics are, the better the microstructure classification will be.
Regarding Table 6, the results show how important fine-tuning is when working with little training data. Without fine-tuning, the tempered martensite, bainite and pearlite microstructures could not be segmented at all. The low performance on the bainite class, even with data augmentation, could be due to the fact that bainite objects are much smaller than those of the other classes. In the dataset, there are quite a few tiny areas belonging to the matrix class which are not completely plain and contain textures similar to those in bainite. As a result, balancing and augmenting the dataset makes the network decide that those small objects are more likely matrix than bainite. This assumption is supported by the observation that, in the bainite test image, pixels are classified either as bainite or as matrix. Another reason is the presence of normalized bainite, which has a different appearance from bainite and on which the network was not trained. More training data, especially for the bainite class, would help to reduce the number of misclassified bainite objects in the segmentation step. Regarding noise robustness, Figure 8b shows strong robustness of the system to different types of noise. We also noted that the system is rotation invariant, meaning that similar patches with different rotation angles produce the same system response. It is worth mentioning that MVFCNN can also take input images smaller than 1000x1000 pixels. However, for input images smaller than 100x100 pixels the response is worse than for the same area in a 1000x1000 pixel input image.

Conclusion
This work demonstrates the feasibility of effective steel microstructure classification using deep learning methods without the need for separate segmentation and feature extraction steps. We performed pixel-wise microstructure segmentation using a trained FCNN followed by a max-voting scheme. The improved classification results confirm our idea of leveraging the raw input data for training deep learning-based classification systems. Besides the high accuracy, we are able to achieve very fast prediction. We found that resizing objects directly as input to object-based CNNs can eliminate discriminative texture information relevant to different microstructure classes. In contrast, the MVFCNN approach does not have this problem and is independent of the size of objects. Data augmentation was considered for further performance improvements. However, we found that rotating the SEM images does not introduce considerable new information, and therefore the performance is not significantly improved. We conclude that pixel-wise segmentation using Fully Convolutional Neural Networks is an effective and robust way of determining the distribution and size of different microstructures when these networks are trained end-to-end.

Figure 1. Some examples of different microstructure classes. The columns, from left to right, show martensite, tempered martensite, bainite and pearlite phases as objects (second phase). Ferrite is the matrix phase in these images and plays the role of a background. The upper row contains images taken by Scanning Electron Microscopy (SEM) and the lower row images taken by Light Optical Microscopy (LOM).

Afterward, each cropped object image is classified by CNN classification strategies. Since classification in this case is based directly on the object input image, we refer to this system as object-based CNN microstructure classification. Classification: We have considered three possible techniques of using CNNs for image classification: (I) fully trained (from scratch) CNNs, (II) fine-tuned CNNs and (III) DeCAF features with an SVM classifier. Technique (I) trains a network from randomly initialized parameters; with this method we are free to choose the size of the input image. Technique (II) uses transfer learning; in this case, however, we are limited to the input size of the CNN. Technique (III) uses DeCAF features and SVM classification. The first and second strategies are illustrated in Figure 2. The third strategy is similar, except that classification is done by SVMs rather than a softmax layer. The output of the object-based CNN before classification is P(C_i|O), where C_i is the class of each phase, O stands for the observation or the input image, and P is the posterior probability that the input image belongs to class i. Classification is performed by choosing the class with the highest probability.
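The final decision rule described above, choosing the class with the highest posterior P(C_i|O), can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class order, the raw scores and the function names are assumptions made for the example, and the posteriors are obtained with a plain softmax.

```python
import math

# Hypothetical class order, for illustration only.
CLASSES = ["martensite", "tempered martensite", "bainite", "pearlite"]

def softmax(scores):
    """Convert raw class scores into posterior probabilities P(C_i | O)."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_object(scores, class_names):
    """Return the class with the highest posterior, and that posterior."""
    posteriors = softmax(scores)
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return class_names[best], posteriors[best]

# Made-up network outputs for one cropped object.
label, prob = classify_object([2.0, 0.5, -1.0, 0.1], CLASSES)
```

In the DeCAF variant, the softmax layer would be replaced by an SVM decision function, but the final step of picking the best-scoring class stays the same.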

Figure 2. Workflow of the object-based classification approach using CNNs. In this figure, one object in an SEM image is cropped and classified using a trained CNN. 224x224 is the fixed input size of the VGG16 network and "C" stands for class.

Figure 3. Workflow of the max-voted segmentation-based microstructure classification approach using FCNNs (MVFCNN). In this figure, the input image is an SEM image. It is cropped into patches, and the patches segmented by the FCNN are stitched together. The max-voted stitched output is used to classify the microstructure objects. H and W represent height and width, and the third number is the number of feature maps.

Figure 4. SEM and LOM example images for each microstructure class with ferrite as matrix and with a diameter of up to 100 µm. The columns show SEM, LOM and segmented LOM images from left to right. Size ratios of samples have been preserved. (a) Martensite, (b) Tempered martensite, (c) Bainite and (d) Pearlite microstructure.

Figure 5. A microstructure sample in an SEM image. The right image is an original SEM image and the left one shows a magnified object. In this example, the substructure (microstructure) is martensite. The matrix is the first phase, and a grain of the second (dual) phase can be considered an object.

Figure 6. Examples of (a) successful and (b) failure cases in SEM segmentation using the best MVFCNN configuration. The ground truth colors of martensite, tempered martensite, bainite and pearlite are red, green, blue and yellow, respectively.

Figure 7. Examples of applying the max-voting scheme to segmented, stitched SEM patches for different microstructure classes. The ground truth colors of martensite, tempered martensite, bainite and pearlite are red, green, blue and yellow, respectively.
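The max-voting step illustrated in Figure 7 can be sketched as a majority vote over the predicted pixel labels inside each object. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes an object mask is available, that label 0 stands for the ferrite matrix, and that matrix pixels do not vote. An object whose pixels are all predicted as matrix is reported as not segmented (None).

```python
from collections import Counter

MATRIX = 0  # assumed label for the ferrite matrix (background)

def max_vote(segmentation, object_mask, ignore=MATRIX):
    """Assign one class to an object by majority vote over its pixels.

    segmentation: 2D list of predicted class labels, one per pixel.
    object_mask:  2D list of booleans, True where a pixel belongs to the object.
    Returns the most frequent non-ignored label, or None if there is none.
    """
    votes = Counter(
        label
        for seg_row, mask_row in zip(segmentation, object_mask)
        for label, inside in zip(seg_row, mask_row)
        if inside and label != ignore
    )
    if not votes:
        return None  # every object pixel was predicted as matrix
    return votes.most_common(1)[0][0]

# Toy 3x3 prediction: labels 1 and 3 compete inside the object.
seg = [[0, 1, 1],
       [0, 1, 3],
       [0, 0, 3]]
mask = [[False, True, True],
        [False, True, True],
        [False, False, True]]
winner = max_vote(seg, mask)
```

Because the vote counts pixels rather than resizing the object to a fixed input size, the decision does not depend on the absolute size of the object.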

Figure 8. (a) Comparison of LOM and SEM segmentation using the FCNN. Columns from left to right: ground truth, LOM, LOM segmentation, SEM and SEM segmentation. (b) Segmentation of noisy patches. The top row shows ferrite and the remaining rows show noise. The ground truth colors are the same as in the previous figures.

classification in the context of steel quality appraisal. It is the first demonstration of a Deep Learning technique in this context that in particular shows substantial gains over the previous state of the art.

Table 1. Microstructure classification results using the object-based CNN and pixel-based MVFCNN approaches. The object-based classification approaches improve over prior work by at most around 18%. The pixel-wise approach performs even better, with an improvement of around 45%.

Tempered martensite and bainite respectively. This performance gives a 94.97% recall rate and a 99.08% precision rate. In Table 5, the results of pixel-wise semantic segmentation with different configurations are presented. In this table, as expected, the FCNN method achieves the best results for the pixel-wise segmentation step: a pixel accuracy of 93.92, a mean pixel accuracy of 76.70, a mean intersection over union of 67.84 and a frequency-weighted intersection over union of 88.81. In Table 6, the same configurations have been evaluated pixel-wise for the accuracy of each class. As expected, matrix has the highest pixel accuracy, 94.22, and bainite has the lowest, 37.32. Matrix is present in all examples, which is why the network has learned its structure well. Bainite, however, has small objects, and one of the two bainite test images is normalized bainite, which the network has not seen, leading to poor performance in this class. Surprisingly, LOM achieves a matrix pixel accuracy of 94.11, comparable to that of the SEM images.
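The four segmentation metrics quoted above can all be computed from a pixel-level confusion matrix. The sketch below assumes the standard definitions used in the fully convolutional network literature, since the text does not spell them out; the function name and the toy matrix are illustrative only.

```python
def segmentation_metrics(conf):
    """Compute segmentation metrics from a confusion matrix.

    conf[i][j] is the number of pixels of true class i predicted as class j.
    Classes with no ground-truth pixels are left out of the class averages.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    true_count = [sum(row) for row in conf]                            # t_i
    pred_count = [sum(conf[i][j] for i in range(n)) for j in range(n)]
    correct = [conf[i][i] for i in range(n)]                           # n_ii
    present = [i for i in range(n) if true_count[i] > 0]

    # Intersection over union per class: n_ii / (t_i + predicted_i - n_ii)
    iou = {i: correct[i] / (true_count[i] + pred_count[i] - correct[i])
           for i in present}

    return {
        "pixel_accuracy": sum(correct) / total,
        "mean_pixel_accuracy":
            sum(correct[i] / true_count[i] for i in present) / len(present),
        "mean_iu": sum(iou.values()) / len(present),
        "frequency_weighted_iu":
            sum(true_count[i] * iou[i] for i in present) / total,
    }

# Toy two-class example (matrix vs. one object class), not data from the paper.
metrics = segmentation_metrics([[8, 2],
                                [1, 9]])
```

Pixel accuracy is dominated by the most frequent class, here the matrix, whereas the mean pixel accuracy and mean intersection over union weight every class equally. This is consistent with the gap between 93.92 and 76.70 reported above, given the weak bainite class.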

Table 3. Results of the same experiments with LOM images. The results show inferior performance using LOM rather than SEM images, as LOM images are of lower resolution and contain noise such as stitching artifacts and varying illumination.

Table 4. Confusion matrix of the best MVFCNN approach. The matrix shows the number of samples for each class predicted by the system. Because the multi-class problem is unbalanced, the percentages for each class show normalized recall rates. Note: 48 not-segmented objects have not been considered. Overall accuracy with them is 93.94%.
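The normalized recall rates and the overall accuracy mentioned in this caption can be derived from a confusion matrix as sketched below. Counting the not-segmented objects as errors is an assumption made here for illustration; the caption does not state exactly how the 93.94% figure was computed. The matrix values are toy numbers, not the paper's results.

```python
def recall_per_class(conf):
    """Row-normalise a confusion matrix (rows are true classes) to recall rates."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(conf)]

def overall_accuracy(conf, missed=0):
    """Fraction of correctly classified objects.

    `missed` objects (for example, objects that were not segmented at all)
    are assumed to count as errors, so they only enlarge the denominator.
    """
    correct = sum(row[i] for i, row in enumerate(conf))
    total = sum(sum(row) for row in conf) + missed
    return correct / total

# Toy unbalanced two-class confusion matrix.
toy = [[90, 10],
       [5, 15]]
recalls = recall_per_class(toy)
acc_without_missed = overall_accuracy(toy)
acc_with_missed = overall_accuracy(toy, missed=5)
```

Row normalization makes the classes comparable even when one class has far more samples than the others, which is why it suits an unbalanced problem.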

Table 5. Evaluation of the segmentation-based approach using FCNNs with different training data and training strategies. As the results show, the best performance is achieved using SEM images, a fine-tuned network and augmented data. LOM, however, has inferior performance compared with SEM.

Table 6. Pixel-wise accuracy evaluation of the segmentation approach using FCNNs for each microstructure class. Matrix has the highest and bainite the lowest pixel-wise accuracy. In most cases, SEM segmentation has better accuracy, but it is interesting that the network can still learn correct pixel-wise classification from low-resolution, noisy LOM images that suffer from varying illumination and stitching artifacts.

Figure 8b shows a few sample noisy patches in SEM images with the corresponding segmentation. The noise originates mostly from dirt and dust. The segmentation of the matrix (ferrite), which was considered as background, is also presented and shows very good robustness. These examples show the strong robustness of the system to noise and to the background. Based on Table 1, all of the deep learning methods outperform the state-of-the-art method, which supports our motivation for using learned high-level features rather than engineered features. The results also indicate that deeper networks perform better than shallower ones, i.e. depth matters. Even with features extracted from a pre-trained VGG19 and classified with an SVM, one can achieve performance comparable to that of the trained CIFARNET and VGG16. However, training VGG16 on the MECS dataset makes features more informative and discriminative