Abstract
Data-driven machine learning for predicting instantaneous and future fault slip in laboratory experiments has recently progressed markedly, primarily due to large training data sets. In Earth, however, earthquake inter-event times range from tens to hundreds of years, and geophysical data typically exist for only a portion of an earthquake cycle. Sparse data present a serious challenge to training machine learning models for predicting fault slip in Earth. Here we describe a transfer learning approach using numerical simulations to train a convolutional encoder-decoder that predicts fault-slip behavior in laboratory experiments. The model learns a mapping between acoustic emission and fault friction histories from numerical simulations, and generalizes to produce accurate predictions of laboratory fault friction. Notably, the predictions improve by further training the model latent space using only a portion of data from a single laboratory earthquake cycle. The transfer learning results elucidate the potential of using models trained on numerical simulations and fine-tuned with small geophysical data sets for potential applications to faults in Earth.
Introduction
In Earth, predicting instantaneous and future characteristics of fault slip has long been a fundamental goal of geoscientists, both from an earthquake hazards perspective and to improve the basic understanding of fault mechanics^{1}. Recent progress towards these goals has been achieved by applying a variety of machine learning (ML) approaches^{2,3,4} in the laboratory using shear experimental data to describe physical properties^{5,6,7,8,9,10,11} and in the Earth using geophysical data to characterize episodic slow slip that occurs in subduction zones^{8,12}, as well as transform faults^{13}. In shear experiments, earthquakes, or “labquakes”, generated during a single experiment produce a sufficiently large data set for training ML models. However, on natural faults the repeat cycles for all but the smallest earthquakes can span timescales on the order of tens to hundreds of years. Thus, in situ geophysical measurements as input for data-driven ML models are generally not available or sufficiently complete for more than a portion of a single earthquake cycle. In particular, this problem exists for large-magnitude (M > 7) earthquakes that produce strong, damaging ground motions. This conundrum presents a serious challenge if the goal is to use data-driven modeling techniques to characterize the physics of fault slip throughout the complete earthquake cycle and to advance earthquake hazards assessment.
Our group’s first work in this subject area^{5} showed that seismic signals emanating from a laboratory fault experiment were rich with information regarding the physics of slip, gleaned from machine learning analysis of the continuous waveform. This led to a large number of complementary efforts by others applying similar approaches to laboratory data, as well as a Kaggle competition on the topic of laboratory earthquake prediction^{4}. Subsequently, we showed the same machine learning approach could identify slip characteristics in the Earth using seismic signals broadcast by the slowly slipping subduction fault in Cascadia^{12} and the San Andreas Fault^{13}. The methodology worked for slow slip in Earth because that process exhibits relatively noisy tectonic tremor associated with slip deep on the fault. We have been working on the problem of applying similar approaches to characterize the physics of seismogenic fault slip. To date, these efforts have been challenging, but they provide insight regarding how to transition from the laboratory to the Earth. Our belief is that slip rates on natural faults in Earth are so slow that an emitted signal, if it exists, is hidden within cultural and Earth noise that is inherent to continuous seismic recordings. A suite of data-driven models, as applied to Earth, has been unable to tease out a characteristic signal or pattern in the seismic noise. Devising a new approach to characterize fault slip is the logical next step to overcome these obstacles.
A type of model generalization known as transfer learning^{14,15} is one potential solution to overcome the problem of data sparsity. Generalizing ML models using transfer learning has been applied in a number of areas in geophysics; for instance, in seismology applications, transfer learning has been used to improve nonlinear and ill-posed inverse problems associated with seismic imaging, subsurface feature classification, and fault detection^{16,17,18,19,20}. We postulate that transfer learning may provide a tractable means of bringing the success of data-driven approaches for predicting fault-slip characteristics in the laboratory to Earth. To our knowledge, no attempt has been made to apply transfer learning using data from numerical simulations to make quantitative predictions of fault slip in laboratory experiments or Earth observations. Herein, we examine the application of such a transfer learning approach to laboratory experiments, which we posit as an important first step in elucidating and laying the foundation for the potential success of applications to Earth.
In this work, we develop a deep learning convolutional encoder-decoder (CED) model that employs a time-frequency representation of acoustic emissions (AE) from numerical simulations and from laboratory shearing experiments. The model has a U-net architecture that encodes the salient features to a latent space that is then decoded to estimate the instantaneous friction coefficient that evolves through the slip cycle, as measured in the experiment. In brief, the model is initially trained using numerical fault-slip simulation data to learn the mapping between the AE and the friction coefficient. The latent space is then trained using only a small fraction of the laboratory experimental data, and the resulting cross-trained CED model is applied to unseen laboratory experimental data (Fig. 1). If such a procedure works at the laboratory scale, a next step is evaluating a similar approach in Earth, by conducting and applying fault simulations at scale in combination with data from seismogenic faults. In the following, we describe results from the CED model transfer learning and show the successful application of this technique for multiple data sets.
Results
Transfer learning from numerical simulations to laboratory shear experiments
The laboratory data^{10,21} are from a biaxial shear device that consists of a slider block bounded by fault gouge and external blocks through which a confining load is applied (Fig. 1). A constant shear velocity is applied and when the system approaches steady-state conditions, repetitive stick-slip motion occurs (see “Methods” section and Fig. 2b). The biaxial device setup simultaneously measures acoustic emission (AE) and the normal and shear stresses required to calculate the bulk friction coefficient.
The numerical simulation^{22} applies a combined finite-discrete element method (FDEM) model of a fault-shear apparatus resembling the biaxial device used in the laboratory experiments described here^{23} (see “Methods” section and Fig. 1). The input training data to the CED model from simulation is the kinetic energy, which is a proxy for the measured continuous AE signal in the biaxial shear experiment. Changes in seismic moment are reflected in variations observed in the kinetic energy; therefore, the kinetic energy represents the kinematic behavior of the granular fault system (see “Methods” section). The CED label data is the bulk friction coefficient between the sliding blocks.
Data from the numerical simulation and laboratory experiment (Fig. 2) are used by the CED model (Fig. 3) that is fully described in the Methods section. The supervised learning approach is a regression procedure, using the AE from experiment or the kinetic energy from simulation to predict the instantaneous characteristics of slip, specifically the coefficient of friction. As a point of reference for the transfer learning approach, the results shown in Fig. 4 are produced by training, validating, and testing entirely on the numerical simulation data. The predicted friction coefficient captures the general slip trends including many frictional failures. However, the prediction results are modest as reported by the mean absolute percentage error (MAPE) of 4.237% for the numerical simulation data.
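The text reports accuracy as MAPE without writing the formula out. The sketch below uses the conventional definition, the mean of |target − prediction| / |target| expressed in percent, which is assumed here to be the one used. The friction values are hypothetical and serve only to show the call.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (conventional definition)."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    # Undefined when a target value is zero; friction coefficients are positive.
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical measured and predicted friction coefficients.
mu_measured = [0.66, 0.67, 0.60, 0.64]
mu_predicted = [0.65, 0.66, 0.63, 0.64]
score = mape(mu_measured, mu_predicted)
print(f"MAPE = {score:.3f}%")
```

Because the error is normalized by the target, a missed friction drop contributes relatively little to the score when the drop is small compared with the mean friction level.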
As a second point of reference, the same procedure is followed using only the laboratory AE and friction data to train a separate CED model. The first 20% of the AE signal (0–60 s) is used for training data. The friction predictions from the testing data produce a MAPE of 1.137% (Fig. 5a). The model performs very well for estimating the variations in friction coefficients and capturing the frictional failures associated with slip events.
For the first transfer learning exercise, we use the model trained on simulation data and apply it to predict the friction in the laboratory experiment—the trained model uses the experiment AE as input and the label is the friction from experiment. We emphasize that the CED model never sees experimental data during training with the simulation data. The predictions show a decrease in performance with a MAPE of 4.232% (Fig. 5b) when compared to the model trained solely on the laboratory data. The maximum friction drop, which has an equivalence to event moment (moment = GAu; where G is the gouge shear modulus, A is slip area and u is the fault displacement), is consistently less than that measured from the experiment. Underprediction of the event moment is a common problem with many ML models when applied to the biaxial shear data^{8,13}, without considering transfer learning. Nonetheless, we find the timing and scale of the predictions are surprisingly good considering the significant differences between the FDEM numerical simulation and the laboratory shear experiment.
With an eye to faults in Earth where obtaining sufficient training data is a challenge, we introduce transfer learning by cross-training the model. Here, we allow the latent space of the CED to be trained on limited laboratory data, while fixing the encoder and decoder layers that are trained using only the simulation. This approach is an extension of an established transfer learning technique used in image classification tasks (see e.g., refs. ^{24,25}). As applied to image classification, the convolutional layers of a model are pretrained on a large database, e.g., ImageNet^{26,27}; specific convolutional layers are then extracted, held constant, and merged with an additional fully connected classification layer that is trained with data specific to the problem of interest.
Here, we apply a transfer learning approach in this same spirit. The difference is that the encoder and decoder layers are the ones directly extracted and held fixed, and only the latent space weights are updated. For the ML models previously trained on the numerical simulation data, all the parameters in the encoder and decoder are rendered non-trainable, while the parameters in the latent space are updated and fit to the training data from laboratory experiments.
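The freezing scheme can be illustrated with a deliberately small stand-in. The sketch below is not the CED: it uses three linear maps (hypothetical `w_enc`, `w_lat`, `w_dec`) in place of the encoder, latent space and decoder, random arrays in place of laboratory data, and plain gradient descent in place of the optimizer. It only shows the mechanics of updating the middle block while the outer blocks stay fixed.

```python
import numpy as np


def mse(pred, target):
    """Mean over samples of the summed squared error."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))


def fine_tune_latent(x, y, w_enc, w_lat, w_dec, steps=200):
    """Update only w_lat by gradient descent; w_enc and w_dec stay frozen."""
    h = x @ w_enc  # frozen encoder features
    # Step size 1/L, where L bounds the curvature of the quadratic loss in
    # w_lat, so no step can increase the loss.
    lipschitz = (2.0 / len(x)) * np.linalg.norm(h, 2) ** 2 * np.linalg.norm(w_dec, 2) ** 2
    w = w_lat.copy()
    for _ in range(steps):
        err = h @ w @ w_dec - y
        grad = h.T @ (2.0 * err / len(x)) @ w_dec.T
        w -= grad / lipschitz
    return w


rng = np.random.default_rng(0)
w_enc = rng.normal(size=(8, 4))   # hypothetical weights "pretrained on simulation"
w_lat = rng.normal(size=(4, 4))
w_dec = rng.normal(size=(4, 2))
x_lab = rng.normal(size=(32, 8))  # hypothetical small laboratory training set
y_lab = rng.normal(size=(32, 2))

enc_before, dec_before = w_enc.copy(), w_dec.copy()
loss_before = mse(x_lab @ w_enc @ w_lat @ w_dec, y_lab)
w_lat_tuned = fine_tune_latent(x_lab, y_lab, w_enc, w_lat, w_dec)
loss_after = mse(x_lab @ w_enc @ w_lat_tuned @ w_dec, y_lab)
print(loss_before, loss_after)
```

In a deep learning framework the same effect is usually obtained by marking the encoder and decoder layers as non-trainable before continuing training, which is how the text describes the procedure.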
The resulting predictions, shown in Fig. 5c, are very good with MAPE = 1.650%, which is a significant improvement from the MAPE of 4.232% before cross-training. The model predictions are now comparable to the MAPE of 1.137% obtained when training directly on the p4677 laboratory data.
As a more rigorous test on how well the cross-trained CED model predicts the laboratory experiments, we apply the identical model to a different laboratory experiment conducted in the biaxial apparatus. These experimental data are never seen during training of the latent space. This second experiment was conducted over a range of confining loads (normal stress) from 3 to 8 MPa (Fig. 6). The only information applied from the different confining load levels are the mean and standard deviation statistics used to normalize the AE and μ signals when producing the input scalograms to the model and when reconstructing the output scalograms (see “Methods” section).
The predictions applying the cross-trained model to the second experiment are shown in Fig. 7. The results are remarkably good as indicated by the MAPE values. The 3 MPa data exhibit the best MAPE, presumably because the confining load is close to the 2.5 MPa stress in the p4677 experiment that was used to train the latent space (Fig. 7a). The MAPE increases, i.e., the predictions degrade, with increasing normal load. The prediction errors appear to be due primarily to the poor predictions of the frictional failure magnitudes. Nonetheless, the instantaneous slip-event times are captured at all load levels, as are the stress build-ups during inter-event periods.
Transfer learning with extremely limited data in laboratory experiments
Because slip cycles in Earth are so long (decades to hundreds of years) and we rarely have more than a portion of associated seismicity within a full seismic cycle, we conduct a cross-training exercise that mimics this data-poor circumstance. We do so by using only limited portions of a single slip cycle from the laboratory experiment for training the model latent space. Specifically, we train the latent space by applying only the post-failure or the pre-failure μ signals from experiment p4677 data (Fig. 8). The post-failure period comprises the time when the shear stress is increasing relatively rapidly following the previous slip event. The pre-failure period comprises the time when the fault is late in the cycle, in a near-critical state, and beginning to nucleate^{21}. The model encoder and decoder trained with the simulation data again remain unchanged. The model is trained and validated using 90% and 10% of the data, respectively, for the pre-failure and post-failure analysis. Because the available data only span a short time interval, the size of the sliding windows is reduced from 2 to 0.4 s and the step size is reduced from 0.2 to 0.1 s (the sizes have no significant impact on the model performance, see Methods). To prevent overfitting, the training is terminated when the validation loss does not decrease for 100 epochs.
After cross-training the latent space using the two data sets from experiment p4677 (post-failure and pre-failure) to produce two separate CED models, the models are used to predict the friction in the second experiment, p4581, for 3, 5, and 7 MPa applied load levels, on the post-failure and pre-failure signals. The results are shown in Fig. 9. As before, the magnitudes of the frictional failures are not well predicted; otherwise the trained models perform surprisingly well in both cases. The procedure is repeated using the laboratory data, without transfer learning, to show the improvements in the forward predictions when data are extremely limited (Fig. S1). The results applying the post-failure data are slightly better than those from the pre-failure training data, suggesting the model has learned more frictional states during latent-space training. As anticipated, the model using experiment p4677 and trained applying six full cycles provides the most robust result; however, the model results with extremely limited cross-training data are very encouraging.
Transfer learning predicting time to failure in laboratory experiments
Transfer learning can also be applied to predicting other output time series (see e.g., refs. ^{4,6}). Here we showcase the predictions of time-to-failure (TTF) signals. Failure times are defined as the times when the time derivative of the μ signal is below −10/s. The raw AE signal is used as input, just as for the instantaneous predictions of the friction coefficient. The encoder and decoder models are again directly applied from the CED model trained on the numerical simulation data for the task of predicting the μ signal. Next, the latent space is trained to fit the TTF training data from experiment p4677 (the first 20% of the signal, from 0 to 60 s, including six stick-slip cycles). Data from p4581 are again used for testing purposes only. The predictions are illustrated in Fig. 10 for 3, 5, and 7 MPa load levels. The predictions are good, if not perfect, considering the task, as underscored by their respective MAPE scores. Indeed, they are notable considering that they are obtained from cross-training and transfer learning. Repeating this procedure using only laboratory data, without transfer learning, indicates the model can better estimate the full cycle with values extending to zero seconds (Fig. S2) and agrees with initial studies extracting information from the continuous waveforms^{5}.
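The TTF label described above can be sketched as follows. Only the threshold of −10/s is taken from the text. The finite-difference derivative (`np.gradient`), the handling of samples after the last detected failure (set to NaN) and the synthetic sawtooth friction signal are assumptions made for illustration.

```python
import numpy as np


def time_to_failure(t, mu, threshold=-10.0):
    """Time from each sample to the next sample where d(mu)/dt < threshold."""
    dmu_dt = np.gradient(mu, t)
    fail_times = t[dmu_dt < threshold]
    # Index of the first failure time at or after each sample time.
    idx = np.searchsorted(fail_times, t, side="left")
    ttf = np.full(t.shape, np.nan)
    valid = idx < len(fail_times)  # samples after the last failure stay NaN
    ttf[valid] = fail_times[idx[valid]] - t[valid]
    return ttf


# Synthetic stick-slip signal: slow loading over 2 s, then an abrupt drop.
n = np.arange(10_000)
t = n * 0.001                                 # 1000 Hz sampling, as in the text
mu = 0.60 + 0.05 * (n % 2000) / 2000.0
ttf = time_to_failure(t, mu)
print(ttf[0], ttf[1999])
```

With a real signal the derivative would likely need smoothing before thresholding, since measurement noise alone can exceed the threshold at high sampling rates.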
Discussion
The predictions of the instantaneous friction obtained applying transfer learning from FDEM simulations to laboratory data from the biaxial shear device are surprisingly good. When model cross-training of the latent space is then applied, the predictions improve significantly. Further, when we apply the same cross-trained model to the second experiment conducted at multiple applied loads, the model predictions are still surprisingly good: there exists a larger misfit with increasing load, but the timing of the events is accurate regardless of the underprediction in friction failure magnitude. The results are even more remarkable considering the FDEM simulation was not meant to directly simulate the experiment; material properties and dimensions of the fault gouge and shear blocks were considerably different. Indeed, the results indicate the simulations contain an AE (kinetic energy) evolution captured in the spectral characteristics that can predict the actual AE in experiment. The results suggest the simulations, despite the differences, provide a sufficient distribution of behaviors for the models to learn and reproduce the laboratory behavior: the FDEM simulation exhibits more complex slip behaviors than the experiments with regard to the range of inter-event times. Consequently, the trained model is able to predict the simpler behavior with more quasi-periodic inter-event times exhibited by the experimental data. The slip frictional-failure magnitude predictions are less accurate than the timing; the full range of frictional failures observed during sliding is underpredicted. Knowing this, one could conduct simulations that produce larger frictional failures to determine if this improves the laboratory failure predictions. It is also interesting that the latent space trained on the post-failure laboratory data produces better predictions than pre-failure training. This may give us clues in Earth regarding where we might anticipate better predictions.
The transfer learning and cross-training results are encouraging. As previously noted, for real-world seismic applications stick-slip repeat cycles can be on timescales ranging from several decades to centuries. Thus, in general, available geophysical recordings only include a partial earthquake cycle. Based on the work presented here, we imagine the following as one potential scenario for addressing the sparse data problem. After selecting a fault in Earth to be characterized, numerical simulations of numerous earthquake cycles will be conducted. A deep learning model is then developed applying simulation results, where continuous AE data are used as model input and fault displacement is used as target. Once the model is trained applying simulation results, we cross-train the model latent space with continuous seismic data recorded from the actual fault. This model can then be tested using continuous seismic data not used during the training (e.g., a different time period), to determine if the model can predict geodetically measured surface displacement for that time period. This is one potential approach; however, there exist parallel approaches one could imagine and test as well. The general transfer learning and cross-training approach may be of great value as we address evolving fault slip and earthquake hazards in the real Earth.
Methods
Numerical simulation and laboratory experiments
Numerical simulations of a laboratory experiment performed by our group (Gao et al.^{22}) were obtained by applying the combined finite-discrete element method (FDEM) using the Hybrid Optimization Software Suite package (HOSS)^{28} (Fig. 1). The FDEM used in this study was originally developed to simulate continuum to discontinuum transitional material behavior^{29}. FDEM combines the algorithmic advantages of the discrete element method with those of the finite element method. In FDEM, each discrete element is composed of a subset of finite elements that are allowed to deform according to the applied load, which is particularly useful in capturing deformations in the fault gouge material as well as at the gouge particle–plate boundary.
The FDEM model was applied to simulate a two-dimensional, photoelastic shear laboratory experiment conducted by Geller and others^{23}. Two-dimensional plane stress conditions were assumed and the model comprised 2817 circular particles confined between two identical plates. Three thousand bi-dimensional particles with diameters of 1.2 and 1.6 mm were used, respectively (1500 of each). The plates had dimensions of 570 × 250 mm. At the plate interfaces, semi-circular shaped “teeth” were placed to increase friction between plates. The particles had Young’s modulus and Poisson’s coefficient of 0.4 GPa and 0.4, respectively, while the plates had Young’s modulus and Poisson’s coefficient of 2.5 MPa and 0.49, respectively, far smaller than those used in the biaxial shear experiment described below. Shearing velocity was 0.5 mm/s.
The laboratory data^{21,30,31,32,33} were obtained applying a biaxial shear apparatus (Fig. 1). Laboratory experiments fail in quasi-periodic cycles of stick and slip that mimic to some degree the seismic cycle of loading and failure on tectonic faults. The apparatus comprises a central steel block that is driven at fixed loading velocity of 10 μm/s for the experiment. This loading imparts shear stresses within two gouge layers that are 100 mm square with an initial thickness of 5 mm. The gouge layers are located on either side of the central driving block and confined by a second steel layer of 20 mm thickness. The gouge consists of monodisperse glass beads of 104–149 μm diameter with Young’s modulus of 70 GPa and Poisson’s coefficient of 0.3; the steel blocks have Young’s modulus of approximately 180 GPa and Poisson’s coefficient of approximately 0.29. A load-feedback servo control system maintains a fixed normal stress of 2.5 MPa for experiment p4677, while measuring shear stress throughout the experiment. For experiment p4581, progressively larger loads were applied, and at each load level, steady state was achieved before a change to the successive load level. The shearing speed was 5 mm/s for both experiments. Mechanical data measured on the apparatus throughout the experiments included the shearing block velocity, the applied load, the gouge layer thickness, the shear stress, and the coefficient of friction. Continuous AE from the fault zone seismic wave radiation was recorded with piezoceramics embedded inside blocks of the shear assembly^{34}.
We note that the AEs from the FDEM simulations were not propagated as elastic waves in the model. Based on previous analysis^{22}, we assume the kinetic energy obtained from the fault simulations is equivalent to the AEs recorded on the experimental shear apparatus. The quantity used here as an equivalent to the AE is the kinetic energy (E_{k}) summed over the entire system. Since the plates and particles work together as an ensemble, it is the aggregate energy evolution that governs the stick-slip behavior in granular fault gouge. In the biaxial experiment, the source of the AE signal is at the grain contact level^{35}. Fault gouge contacts broadcast AE independently and/or simultaneously^{35}, and displace the side blocks equivalently along the dimensions of the block due to the extreme stiffness of the steel, in analogy to the E_{k} behavior in the simulation. Thus, the E_{k} is approximately equivalent to the magnitude of the continuous AE time series (norm of the acoustic emission recorded by the two channels of the lab experiments), which is the source of elastic waves. We say approximately because there is a modest amount of wave dissipation and wave scattering during wave propagation in the experiment from the fault gouge layer through the steel plates to the detectors.
The density distributions of the normalized inputs and outputs for the two data sets vary the most in the output friction values (Fig. S3). The input data show quite similar distributions, suggesting the model can extract the needed features from the simulation to make a prediction on the experiment. The friction values show a much wider distribution in the simulation data compared to an apparent concentration around a peak in the laboratory data. The two-sample Kolmogorov–Smirnov (KS) test shows that the distributions of the FDEM and laboratory friction data are statistically different (KS statistic = 0.258 and p-value = 0.0). This difference helps validate our choice to use transfer learning to adjust the weights in the latent space, accounting for the laboratory data being out-of-distribution relative to the simulation data.
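The KS statistic quoted above is the largest vertical gap between the two empirical cumulative distribution functions. A minimal sketch of that computation is given below. The two samples are synthetic stand-ins (one broad, one narrow) rather than the FDEM and laboratory friction data, and the p-value calculation is omitted.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # the gap can only change at data points
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


rng = np.random.default_rng(0)
mu_broad = rng.normal(0.0, 1.0, 5000)   # hypothetical wide distribution
mu_narrow = rng.normal(0.0, 0.3, 5000)  # hypothetical peaked distribution
d = ks_statistic(mu_broad, mu_narrow)
print(f"KS statistic = {d:.3f}")
```

The statistic is 0 for identical samples and 1 for samples that do not overlap at all, so a value such as the reported 0.258 indicates a clear but partial mismatch.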
Training, validation, and testing data
The continuous time series signals (AE, kinetic energy, and friction coefficient) from the experiment and FDEM simulation are converted into scalograms using the Continuous Wavelet Transform (CWT; see ref. ^{36} for a comprehensive description of the method) to utilize the time-frequency signal strength in the CED models. We adopt the real Ricker (Mexican-hat, DOG (m = 2)) wavelet for the CWT, which is commonly used in analyzing seismic data^{37}. For comparison, we also tested the Morlet wavelet and found the Ricker to produce improved MAPE results. The reconstruction of the signal from the scalograms (inverse CWT) is the sum of the real part of the wavelet transform over all scales.
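A scalogram of this kind can be sketched by convolving the signal with Ricker wavelets of increasing width. The wavelet formula below is the standard Ricker definition. The choice of 128 integer scales, the wavelet support of ten times the scale and the random input are assumptions, since the text gives the scalogram size but not the scale set.

```python
import numpy as np


def ricker(points, a):
    """Ricker (Mexican-hat) wavelet with width parameter a, sampled at `points`."""
    t = np.arange(points) - (points - 1) / 2.0
    amp = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return amp * (1.0 - (t / a) ** 2) * np.exp(-(t ** 2) / (2.0 * a ** 2))


def cwt_ricker(signal, scales):
    """Return a (len(scales), len(signal)) scalogram of real coefficients."""
    signal = np.asarray(signal, dtype=float)
    out = np.empty((len(scales), signal.size))
    for i, a in enumerate(scales):
        n = int(min(10 * a, signal.size))  # truncate the wavelet support
        out[i] = np.convolve(signal, ricker(n, a), mode="same")
    return out


fs = 1000                                 # Hz, as in the text
window = np.random.default_rng(0).normal(size=2 * fs)  # hypothetical 2 s window
scales = np.arange(1, 129)                # assumed scale set
scalogram = cwt_ricker(window, scales)
print(scalogram.shape)
```

Because the Ricker wavelet is real and symmetric, the coefficients are real and convolution coincides with correlation, which keeps the transform simple to invert approximately by summing over scales.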
For the FDEM simulation, the CWT is performed on the training/validation/testing (60/20/20% split) segments of the kinetic energy (E_{k}) and friction (μ) time series. Scalograms are calculated using moving windows with a size of 2 s and step of 0.2 s. The sliding window size does not impact the accuracy of the CED model, see “Methods” section Testing model design and training procedure. For a sampling frequency of f_{s} = 1000 Hz, each scalogram is 128 × 2000. The procedure produces 73 and 19 pairs of input E_{k} and output μ scalograms for the training and validation, respectively.
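The window arithmetic can be checked with a few lines. At 1000 Hz a 2 s window spans 2000 samples and a 0.2 s step spans 200 samples, which gives the 2000 scalogram columns stated above. The 20 s segment length used in the example is hypothetical and was chosen only to make the count easy to verify.

```python
def window_starts(n_samples, window, step):
    """Start indices of all full-length sliding windows."""
    return list(range(0, n_samples - window + 1, step))


fs = 1000                      # Hz, from the text
window = int(2.0 * fs)         # 2 s -> 2000 samples
step = int(0.2 * fs)           # 0.2 s -> 200 samples
starts = window_starts(20 * fs, window, step)  # hypothetical 20 s segment
print(len(starts), starts[:3], starts[-1])
```

The number of windows is (n_samples − window) / step + 1 when the division is exact, so halving the step roughly doubles the number of scalogram pairs obtained from the same record.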
The training data are augmented by producing additional noisy E_{k} input signals. The procedure is as follows: (1) take the Fast Fourier Transform (FFT) of the original signal data, (2) shuffle the positive-frequency terms of the imaginary coefficients, (3) set the negative-frequency terms to the opposite of the shuffled terms, and (4) perform the inverse FFT to produce a new E_{k} signal with the same amplitude spectrum and random phase. The procedure is repeated three times for the training signal and the final training data contain 292 scalogram pairs.
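The stated goal of the augmentation, a new real signal with the same amplitude spectrum and random phase, can be reached with the standard phase-randomization surrogate sketched below. This is a swapped-in technique, not the coefficient shuffle listed in the steps above: it draws new phases directly instead of shuffling imaginary parts, and it relies on `rfft`/`irfft` to enforce the conjugate symmetry that step (3) imposes by hand. The input signal is a random stand-in for E_{k}.

```python
import numpy as np


def phase_randomized_copy(signal, rng):
    """Real surrogate with the same amplitude spectrum and random phases."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    surrogate = np.abs(spectrum) * np.exp(1j * phases)
    surrogate[0] = spectrum[0]            # keep the mean (DC term) unchanged
    if signal.size % 2 == 0:
        surrogate[-1] = spectrum[-1]      # Nyquist term must stay real
    return np.fft.irfft(surrogate, n=signal.size)


rng = np.random.default_rng(0)
e_k = rng.normal(size=2000)               # hypothetical kinetic-energy segment
copies = [phase_randomized_copy(e_k, rng) for _ in range(3)]
n_pairs = 73 * (1 + len(copies))          # original plus three augmented sets
print(n_pairs)
```

With 73 original training pairs, the original plus three augmented copies gives 73 × 4 = 292, which matches the count stated in the text.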
The CWT transform procedure is applied to the laboratory experiment p4677 acoustic emission (AE) and friction (μ) time series to produce training/validation/testing (20/20/60% split) data. The scalogram dimensions are the same as the numerical simulations. The final data set contains 292 pairs of input AE_{norm} and output μ scalograms for the training and validation data. Scalograms are calculated for the laboratory experiment p4581 and only used as testing data for experiments conducted at different normal stresses.
Before applying the CWT, all input and output signals are normalized by subtracting the mean and dividing by the standard deviation using the statistics extracted from the training signal data. For FDEM data, the statistics are 3.28E−4 ± 5.00E−4 for the input E_{k} signals and 4.23E−1 ± 2.52E−2 for the output μ signals. For transfer learning on the p4677 data, the statistics from the training signals (0–60 s, including six stick-slip cycles) are 8.932 ± 14.900 for the input AE signals and 0.657 ± 0.0382 for the output μ. In the cases of limited sub-cycle data, the post-failure training signal has statistics of 7.712 ± 10.667 for AE and 0.641 ± 0.0440 for μ, and the pre-failure training signal has statistics of 10.205 ± 20.137 for AE and 0.667 ± 0.0377 for μ. When making predictions using the laboratory p4581 data with increasing normal loads, the statistics are extracted from the first 20% of the 3 MPa signal (from 0 to 40 s, including five stick-slip cycles) to obtain 17.776 ± 46.700 for AE and 0.433 ± 0.0230 for μ. For the TTF statistics the values are 4.816 ± 3.257 on the p4677 data and 4.817 ± 2.873 on the p4581 data, extracted from the same aforementioned signal segments.
Convolutional encoder-decoder model and transfer learning
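The normalization and its inverse amount to a z-score with fixed training statistics. The sketch below uses the p4677 friction statistics quoted above (0.657 ± 0.0382). The sample values and function names are hypothetical.

```python
def normalize(values, mean, std):
    """Z-score using statistics taken from the training segment only."""
    return [(v - mean) / std for v in values]


def denormalize(values, mean, std):
    """Map normalized model output back to physical units."""
    return [v * std + mean for v in values]


MU_MEAN, MU_STD = 0.657, 0.0382      # p4677 training statistics from the text

mu_raw = [0.657, 0.6952, 0.60]       # hypothetical friction samples
mu_z = normalize(mu_raw, MU_MEAN, MU_STD)
mu_back = denormalize(mu_z, MU_MEAN, MU_STD)
print(mu_z, mu_back)
```

Using the training-segment statistics for every later segment keeps test data from leaking into the preprocessing, and it is the only place where information from a new load level enters the pipeline.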
The CED architecture is composed of an encoder branch containing the salient features that feed to a latent space, and a decoder branch to construct the output variable. The input signal is passed to an encoding branch with a preprocessing block containing two convolutional layers and a rectified linear unit (ReLU) activation function (Fig. 3). Preprocessing is used to reduce the image size in the time dimension by a factor of 25. This is passed through four downsampling blocks containing three convolutional layers, each with batch normalization, ReLU activation, and a skip connection. The latent space contains two convolutions and a ReLU activation. The decoding branch reverses the encoding using convolutional transpose layers. The postprocessing contains two convolutional transpose layers to obtain the original dimensions. The dimension of each layer, i.e., the filter size, depth, and skip connections are labeled in Fig. 3. The model contains five “skip” connections^{38} that directly link the weights from the downsample blocks in the encoder to the upsample blocks in the decoder at each level. The trainable weights are initialized using glorot_uniform and the biases are non-trainable and set to zero. The CED model contains 363,696 trainable parameters, with 73,984 in the latent space.
The model design utilizing the CWT images with skip connections yields improved prediction accuracy over simpler, more standard approaches, such as directly inputting the waveform data and using 1D convolutions. As a point of comparison, a convolutional encoder model followed by a set of fully connected layers has 1–2 orders of magnitude more trainable parameters and does not outperform the adopted design. The number of filters and layers of the CED model (see Fig. 3) were also reduced and the performance was compared to validate the final model selection (Table S5). The adopted design produces the best overall performance.
Loss functions are calculated hierarchically for each pair of encoder/decoder blocks. This type of hierarchical regularization was recently introduced by Wang et al.^{39} to provide better interpretability and generalizability of CED models for learning fluid-flow patterns in complex rock pore structures. This regularization is found to improve the MAPE accuracy in predicting FDEM test data by 1% and provides a similar level of MAPE for the transfer learning. The total loss is calculated as \(L_{\text{total}} = L_{\text{hier}}^{0} + L_{\text{hier}}^{1} + L_{\text{hier}}^{2} + L_{\text{hier}}^{3} + L_{\text{hier}}^{4} + L_{\text{reconstr}} + L_{\text{l2}}\), where \(L_{\text{hier}}^{i}\) is the mean square error (MSE) between the target and predicted values for each sub-model linked with a “skip” connection (Fig. 3a), L_{reconstr} is the reconstruction loss using the entire CED model, and L_{l2} is the loss associated with the L2 regularization using a penalty multiplier^{40} set to 1E−5. After the initial training, the “skip” connections are deactivated so that information only passes down the encoder, through the latent space, and up the decoder for a prediction.
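The composition of the total loss can be sketched numerically. The text specifies MSE for the five hierarchical terms and a multiplier of 1E−5 for the L2 term. Treating the reconstruction loss as an MSE as well, and the L2 term as the multiplier times the sum of squared weights, are assumptions. All arrays are hypothetical placeholders for sub-model outputs.

```python
import numpy as np


def mse(target, pred):
    return float(np.mean((np.asarray(target) - np.asarray(pred)) ** 2))


def total_loss(target, hier_preds, full_pred, weights, l2_multiplier=1e-5):
    """Sum of hierarchical MSEs, reconstruction MSE and an L2 weight penalty."""
    hier = sum(mse(target, p) for p in hier_preds)    # L_hier^0 ... L_hier^4
    reconstr = mse(target, full_pred)                 # assumed to be an MSE
    l2 = l2_multiplier * sum(float(np.sum(w ** 2)) for w in weights)
    return hier + reconstr + l2


target = np.zeros((4, 4))
hier_preds = [np.ones((4, 4)) for _ in range(5)]      # five skip-linked sub-models
full_pred = np.zeros((4, 4))
weights = [np.array([1.0, 2.0])]
loss = total_loss(target, hier_preds, full_pred, weights)
print(loss)
```

Passing an empty `hier_preds` list reproduces the reduced loss used during latent-space training, where only the reconstruction and L2 terms remain.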
The model is trained on an NVIDIA Tesla P100 GPU using 292 pairs of scalograms with a batch size of 8, the Adam optimizer, and a learning rate of 1E−3. Validation is performed with 19 pairs of scalograms. The training is terminated when the reconstruction loss on the training data is below 0.1 and the validation reconstruction loss does not diminish for 100 epochs. The model with the lowest validation loss is used as the final CED model. Transfer learning is applied using the laboratory p4677 data. A new CED model is created with the weights from the final model trained on the FDEM simulation data. All trainable weights, except those in the latent space, are rendered non-trainable and held constant while the latent space is further trained with the laboratory data. Since the encoder and decoder branches are non-trainable layers, the total loss is L_{total} = L_{reconstr} + L_{l2}, and the early stopping criterion is the same. The initial training on the FDEM simulation data takes approximately 15 min, and the latent space cross-training on the laboratory data takes about 10 min for full convergence.
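The freezing step and the stopping rule described above can be sketched without any deep learning framework. The sketch below is one reading of the text, not the authors' code: the layer names, the "latent" prefix, and both function names are hypothetical, and training is assumed to stop at the first epoch where the training loss is below the threshold and the validation loss has not improved for `patience` epochs.

```python
def select_trainable(layer_names, trainable_prefix="latent"):
    """Map each layer name to True (trainable) or False (frozen)."""
    return {name: name.startswith(trainable_prefix) for name in layer_names}


def early_stopping(epoch_losses, train_threshold=0.1, patience=100):
    """Scan (train_loss, val_loss) pairs and return (stop_epoch, best_epoch).

    best_epoch is the epoch with the lowest validation loss seen so far;
    its weights would be kept as the final model.
    """
    best_val = float("inf")
    best_epoch = None
    for epoch, (train_loss, val_loss) in enumerate(epoch_losses):
        if val_loss < best_val:
            best_val, best_epoch = val_loss, epoch
        stalled = best_epoch is not None and epoch - best_epoch >= patience
        if train_loss < train_threshold and stalled:
            return epoch, best_epoch
    return len(epoch_losses) - 1, best_epoch


# Hypothetical layer names; only the latent space stays trainable.
flags = select_trainable(["encoder_block_1", "latent_conv_1", "decoder_block_1"])

# Toy loss history with a short patience so the rule triggers quickly.
history = [(0.5, 0.4), (0.2, 0.3), (0.09, 0.2), (0.08, 0.25),
           (0.07, 0.22), (0.06, 0.21), (0.05, 0.3)]
stop_epoch, best_epoch = early_stopping(history, patience=3)
```

In a framework such as TensorFlow, the boolean flags would typically be applied to each layer's trainable attribute before the model is recompiled for fine-tuning.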
Testing model design and training procedure
Due to random variable initialization and the stochastic nature of training a neural network, repeating the training procedure gives different results and variations in the overall performance. We performed multiple runs of the same transfer learning workflow to assess the average performance of the trained CED models and to quantify these expected variations. (1) Five runs are performed starting from the same initialized model weights with no noisy data augmentation (Table S1). (2) Five runs are performed starting from the same initialized model weights and including the noisy data augmentation (Table S2). (3) Ten runs are performed starting from randomly initialized model weights and including the noisy data augmentation (Table S3). The process is then repeated using laboratory data obtained at a different confining stress (Fig. S4). The results of these tests show that the random initialization and shuffling of the batches produce discrepancies between the model predictions, and that increasing the noise through data augmentation reduces the variance and improves accuracy. The main results presented come from the CED model trained in Run No. 8 (Table S3), whose overall accuracy is nearest to the mean performance of the ten separate runs with random initial weights and random noisy data augmentation.
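Choosing the representative run as the one closest to the mean performance can be sketched as follows. The scores are placeholder numbers invented for illustration; they are not the values reported in Tables S1 to S3, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def summarize_runs(scores):
    """Return the mean and population standard deviation of run scores."""
    values = list(scores.values())
    return mean(values), pstdev(values)


def nearest_to_mean(scores):
    """Return the run whose score is closest to the mean over all runs."""
    avg = mean(scores.values())
    return min(scores, key=lambda run: abs(scores[run] - avg))


# Placeholder accuracies for three repeated runs (not from the paper).
example_scores = {"run_1": 0.90, "run_2": 0.95, "run_3": 0.92}
average, spread = summarize_runs(example_scores)
representative = nearest_to_mean(example_scores)
```

Reporting the run nearest to the mean, rather than the best run, avoids presenting an unusually favorable random seed as typical behavior.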
The input data length is tested to evaluate the effect of the sliding-window size on the model predictions. We performed six runs with randomly initialized model weights and noisy data augmentation, using sliding-window sizes of 0.4, 0.8, 1, 3, 4, and 5 s (Table S4). These tests indicate that the window size produces little variation relative to the final results, which were obtained with a 2 s sliding window (Table S3). The transfer learning approach is therefore robust to the sliding-window size hyperparameter.
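The sliding-window segmentation can be sketched as index arithmetic on a sampled signal. Only the 2 s window length comes from the text; the sample rate, the step between windows, and the function name are assumptions made for this example.

```python
def sliding_windows(n_samples, sample_rate_hz, window_s, step_s):
    """Return (start, stop) sample indices of each full window in a signal."""
    window = round(window_s * sample_rate_hz)
    step = round(step_s * sample_rate_hz)
    if window <= 0 or step <= 0:
        raise ValueError("window and step must span at least one sample")
    # Only windows that fit entirely inside the signal are kept.
    return [(start, start + window)
            for start in range(0, n_samples - window + 1, step)]


# Hypothetical 10 s record sampled at 100 Hz, 2 s window, 1 s step.
windows = sliding_windows(n_samples=1000, sample_rate_hz=100,
                          window_s=2.0, step_s=1.0)
```

Changing `window_s` to the other tested values (0.4 to 5 s) only changes the number and length of the segments, which is the quantity varied in the robustness test.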
Data availability
The numerical FDEM data used in this study are publicly available at https://doi.org/10.5281/zenodo.1248174. The experimental data used in this study, from experiments p4677 and p4581, are hosted by Chris Marone at the Pennsylvania State University and are available at https://sites.psu.edu/chasbolton/.
Code availability
This study was performed using the Python packages NumPy, TensorFlow, and PyCWT. The Python code is under restricted access and is not available for public release due to institutional regulations at Los Alamos National Laboratory.
References
Scholz, C. H. The Mechanics of Earthquakes and Faulting (Cambridge University Press, 2019).
Bergen, K. J., Johnson, P. A., de Hoop, M. V. & Beroza, G. C. Machine learning for data-driven discovery in solid Earth geoscience. Science 363, 6433 (2019).
Ren, C. X., Hulbert, C., Johnson, P. A. & Rouet-Leduc, B. Machine learning and fault rupture: a review. Adv. Geophys. 61, 57–107 (2020).
Johnson, P. A. et al. Laboratory earthquake forecasting: a machine learning competition. Proc. Natl Acad. Sci. 118, e2011362118 (2021).
Rouet-Leduc, B. et al. Machine learning predicts laboratory earthquakes. Geophys. Res. Lett. 44, 9276–9282 (2017).
Rouet-Leduc, B. et al. Estimating fault friction from seismic signals in the laboratory. Geophys. Res. Lett. 45, 1321–1329 (2018).
Lubbers, N. et al. Earthquake catalog-based machine learning identification of laboratory fault states and the effects of magnitude of completeness. Geophys. Res. Lett. 45, 13269–13276 (2018).
Hulbert, C. et al. Similarity of fast and slow earthquakes illuminated by machine learning. Nat. Geosci. 12, 69–74 (2019).
Jasperson, H. et al. Attention network forecasts time-to-failure in laboratory shear experiments. J. Geophys. Res. Solid Earth 126, e2021JB022195 https://doi.org/10.1029/2021JB022195 (2021).
Bolton, D. C. et al. Characterizing acoustic signals and searching for precursors during the laboratory seismic cycle using unsupervised machine learning. Seismol. Res. Lett. 90, 1088–1098 (2019).
Zhou, Z., Lin, Y., Zhang, Z., Wu, Y. & Johnson, P. Earthquake detection in 1D time-series data with feature selection and dictionary learning. Seismol. Res. Lett. 90, 563–572 (2019).
Rouet-Leduc, B., Hulbert, C., McBrearty, I. W. & Johnson, P. A. Probing slow earthquakes with deep learning. Geophys. Res. Lett. 47, e2019GL085870 (2020).
Johnson, C. W. & Johnson, P. A. Learning the low frequency earthquake daily intensity on the central San Andreas Fault. Geophys. Res. Lett. 48, e2021GL092951 (2021).
Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2009).
Goodfellow, I., Bengio, Y., & Courville, A. Deep Learning (MIT Press, 2016).
Chevitarese, D., Szwarcman, D., Silva, R. M. D., & Brazil, E. V. Transfer learning applied to seismic images classification. In AAPG Annual and Exhibition (Salt Lake City, 2018).
Siahkoohi, A., Louboutin, M. & Herrmann, F. J. The importance of transfer learning in seismic modeling and imaging. Geophysics 84, A47–A52 (2019).
Cunha, A., Pochet, A., Lopes, H. & Gattass, M. Seismic fault detection in real data using transfer learning from a convolutional neural network pretrained with synthetic seismic data. Comput. Geosci. 135, 104344 (2020).
Zhang, Z. & Lin, Y. Data-driven seismic waveform inversion: a study on the robustness and generalization. IEEE Trans. Geosci. Remote Sens. 58, 6900–6913 (2020).
Yan, Z., Zhang, Z. & Liu, S. Improving performance of seismic fault detection by fine-tuning the convolutional neural network pre-trained with synthetic samples. Energies 14, 3650 (2021).
Johnson, P. A. et al. Acoustic emission and microslip precursors to stick-slip failure in sheared granular material. Geophys. Res. Lett. 40, 5627–5631 (2013).
Gao, K. et al. Modeling of stick-slip behavior in sheared granular fault gouge using the combined finite-discrete element method. J. Geophys. Res. 123, 5774–5792 (2018).
Geller, D. A., Ecke, R. E., Dahmen, K. A. & Backhaus, S. Stick-slip behavior in a continuum-granular experiment. Phys. Rev. E 92, 060201 (2015).
Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? Preprint at arXiv https://arxiv.org/abs/1411.1792 (2014).
Shin, H. C. et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35, 1285–1298 (2016).
Deng, J. et al. ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
Huh, M., Agrawal, P., & Efros, A. A. What makes ImageNet good for transfer learning? Preprint at arXiv https://arxiv.org/abs/1608.08614 (2016).
Knight, E. E. et al. HOSS: an implementation of the combined finite-discrete element method. Comput. Part. Mech. 7, 765–787 (2020).
Munjiza, A. A. The Combined Finite-Discrete Element Method (Wiley, 2004).
Dieterich, J. H. & Conrad, G. Effect of humidity on time- and velocity-dependent friction in rocks. J. Geophys. Res. 89, 4196–4202 (1984).
Marone, C. Laboratory-derived friction laws and their application to seismic faulting. Annu. Rev. Earth Planet. Sci. 26, 643–696 (1998).
Niemeijer, A., Marone, C., & Elsworth, D. Frictional strength and strain weakening in simulated fault gouge: competition between geometrical weakening and chemical strengthening. J. Geophys. Res. 115, B10 (2010).
Scuderi, M. M., Marone, C., Tinti, E., Di Stefano, G. & Collettini, C. Precursory changes in seismic velocity for the spectrum of earthquake failure modes. Nat. Geosci. 9, 695–700 (2016).
Rivière, J., Lv, Z., Johnson, P. A. & Marone, C. Evolution of b-value during the seismic cycle: Insights from laboratory experiments on simulated faults. Earth Planet. Sci. Lett. 482, 407–413 (2018).
Trugman, D. T. et al. The spatiotemporal evolution of granular microslip precursors to laboratory earthquakes. Geophys. Res. Lett. 47, e2020GL088404 (2020).
Torrence, C. & Compo, G. P. A practical guide to wavelet analysis. Bull. Am. Meteorol. Soc. 79, 61–78 (1998).
Gholamy, A. & Kreinovich, V. Why Ricker wavelets are successful in processing seismic data: towards a theoretical explanation. 2014 IEEE Symposium on Computational Intelligence for Engineering Solutions (CIES) 11–16 (IEEE, 2014).
Ronneberger, O., Fischer, P., & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention 234–241 (Cornell University, 2015).
Wang, K. et al. A physics-informed and hierarchically regularized data-driven model for predicting fluid flow through porous media. J. Comput. Phys. 443, 110526 (2021).
Chollet, F. Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1251–1258 (Cornell University, 2017).
Acknowledgements
K.W., K.C.B., and P.A.J. acknowledge support by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Chemical Sciences, Geosciences, and Biosciences Division under grant 89233218CNA000001. K.W. acknowledges additional support by the Center for Nonlinear Studies (CNLS) at Los Alamos National Laboratory. C.W.J. and K.C.B. acknowledge Institutional Support (Laboratory Directed Research and Development) under projects 20200681PRD1 and 20210686ECR, respectively, at the Los Alamos National Laboratory. We thank Chris Marone for the laboratory data and Ke Gao for the numerical simulation data.
Author information
Contributions
K.W. and C.W.J. developed the CED model and conducted the machine learning analysis with input from K.C.B. and P.A.J. K.W. and K.C.B. conceived of the transfer learning analysis, and K.W. and C.W.J. devised the workflow. P.A.J. conducted experiments with collaborators at the Pennsylvania State University. P.A.J. oversaw the numerical simulation work and experimental data collection. All authors were involved in writing the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review information
Nature Communications thanks Luz Garcia and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, K., Johnson, C.W., Bennett, K.C. et al. Predicting fault slip via transfer learning. Nat Commun 12, 7319 (2021). https://doi.org/10.1038/s41467-021-27553-5
DOI: https://doi.org/10.1038/s41467-021-27553-5