Space-based gravitational wave signal detection and extraction with deep neural network

Space-based gravitational wave (GW) detectors will be able to observe signals from sources that are nearly impossible to detect with current ground-based facilities. Consequently, the well-established signal detection method, matched filtering, will require a complex template bank, leading to a computational cost that is prohibitively expensive in practice. Here, we develop a high-accuracy GW signal detection and extraction method for all space-based GW sources. As a proof of concept, we show that a science-driven and uniform multi-stage self-attention-based deep neural network can identify synthetic signals that are submerged in Gaussian noise. Our method exhibits a detection rate exceeding 99% in identifying signals from various sources at a signal-to-noise ratio of 50 and a false alarm rate of 1%, while obtaining at least 95% similarity compared with target signals. We further demonstrate the interpretability and strong generalization behavior of the proposed model for several extended scenarios.


Introduction
The first direct detection of GWs coming from coalescing binary black holes (BBHs) was made by the LIGO/Virgo Collaboration [1], verifying Einstein's General Relativity. As detectors become more sensitive, more and more GW events are being discovered, enabling a new era of multi-messenger astronomy. A total of 93 events have been reported in the first three observing runs [2]. GWs have become a new probe allowing cross-validation with a variety of fundamental physical theories [3][4][5][6].
Ground-based GW detectors such as LIGO, Virgo, and KAGRA cannot detect GWs at frequencies lower than 10 Hz due to seismic noise [7]; therefore, space-based detectors are being developed. The Laser Interferometer Space Antenna (LISA) will be launched around 2034 [8], and Taiji [9] and TianQin [10] are also in progress. LISA is expected to observe a variety of GW sources [11], including Galactic binaries (GB), massive black hole binaries (MBHB), and extreme-mass-ratio inspirals (EMRI). The most common GB sources are binary white dwarf (BWD) systems, which populate the whole frequency band of the LISA detector. Massive black holes (MBHs) exist in most galactic centers, and MBHs merge along with their host galaxies, which happens regularly in the Universe [12]. An EMRI system forms when an MBH captures a compact object (CO) surrounding it. Unlike stars, COs can avoid tidal disruption and approach the central MBH, radiating a significant amount of energy in GWs at low frequencies. Beyond these resolvable sources, a huge number of unresolvable events sum up incoherently, forming a stochastic GW background (SGWB). The detection of these GWs in the LISA mission enables us to gain a better understanding of black holes and galaxies [13].
GW data processing is complicated by the overwhelming noise, which is non-Gaussian, sometimes non-stationary [14,15], and contains sudden temporary glitches [14][15][16]. Earlier GW detection methods fall into two categories: (a) theoretical-template-based algorithms such as matched filtering, and (b) template-independent algorithms. In principle, the most accurate results can be achieved by using a matched filtering algorithm to detect signals buried in Gaussian noise [17][18][19]; this is currently the most widely used algorithm for the detection of GWs. The additional complexity of space-based detection over ground-based detection can be attributed to the different types of sources. The optimal template bank for matched filtering would have to cover all the GW source parameters in the data. However, this is not practical because of the high dimension of the parameter space to be explored. Moreover, the typical duration of a compact binary coalescence signal detected by LISA is longer than that detected by ground-based detectors, resulting in an even larger computational effort for the matched filtering algorithm. For ground-based detection, a series of template-independent signal extraction and detection algorithms have been developed, such as cWB [20] and BayesWave [21], both based on the wavelet transform. Ref. [22] proposes a total-variation-based method, and a novel approach based on the Hilbert-Huang transform was recently developed [23]. The advantage of these algorithms is that they are not limited by theoretical template banks and can extract the signal from noisy data, or, in other words, reconstruct the signal waveform. Their disadvantage is that they are only applicable to burst signals and are therefore not suitable for space-based GW signals.
In this article, we develop a uniform deep learning-based model for space-based GW signal detection and extraction for the four main GW sources of LISA. Our model is based on a self-attention sequence-to-sequence architecture that performs well when dealing with time series data. We have integrated long-term and short-term feature extraction blocks in our model to capture the dependencies of the GW signal in a high-dimensional latent space. To our knowledge, this is the first study to achieve high-accuracy detection and high-precision extraction for all main potential GW signals from space-based detection. The model's intermediate results can be interpreted as the encoded signal waveform, revealing a strong correlation between what needs to be learned and what has been learned by the neural network. In our analysis of the test results, we obtained overlaps (see equation (14)) greater than 0.95 for 95% of our test samples. It takes less than 10⁻² seconds to perform extraction and detection, a factor of roughly 10⁵ improvement over traditional approaches that often require several hours. Finally, our method also achieves considerable extraction quality for signals generated by models that are not in the training dataset, demonstrating the strong generalization ability of the model.

Deep neural network
We extend the mask-based prediction method for speech enhancement in Conv-TasNet [42] with a self-attention-based network for our task. As shown in Fig. 1, the network consists of four processing stages: encoder, extraction net, decoder, and classifier. First, the encoder maps the signal from the detector to a high-dimensional representation and splits it into short chunks. This encoded representation is used by the extraction network to estimate a mask for signal extraction. The extraction network produces a mask matrix with Transformer blocks that capture both short-term and long-term dependencies across chunks. The decoder uses a transposed convolution layer on the element-wise product of the mask matrix and the encoded representation to reconstruct the extracted signal. Finally, the extracted signal is sent to a multi-layer perceptron classifier for detection. The classifier gives the predicted probability that the input data contains a true GW signal.

Generate space-based GWs dataset
Due to the large differences between the GW signals from different sources, we generate separate training and testing datasets for each source. We choose a universal sampling rate of 0.1 Hz for all datasets. Every data sample in the datasets has 16000 sampling points; hence the duration of each sample is 160000 seconds, or 44.4 hours. The parameter space of signal generation is sampled with a uniform grid. The ranges of the parameters used to generate the GW signals are listed in Tables 4-6. The parameters on each grid point are then used to generate the corresponding GW signals. For noise generation, we use the noise power spectral density (PSD) of LISA [43] to simulate Gaussian noise. It should be noted that galactic confusion noise has not been taken into consideration in this study. To simulate different signal-to-noise ratio (SNR) levels in signal extraction, we set the SNR equal to 50, 40, and 30 following equation (12).
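To make the grid sampling concrete, the following is a minimal sketch of how such a uniform parameter grid can be enumerated; the parameter names and ranges here are illustrative placeholders, not the actual values listed in Tables 4-6.

```python
import itertools
import numpy as np

# Illustrative grid only; the real parameter names and ranges are in Tables 4-6.
grid = {
    "M_tot": np.geomspace(1e5, 1e7, 10),  # e.g. total mass, log-uniform spacing
    "q": np.linspace(1.0, 5.0, 5),        # e.g. mass ratio
}

# Every grid point yields one parameter set, passed to the waveform generator
# of the corresponding source to produce one training signal.
parameter_points = [dict(zip(grid, values))
                    for values in itertools.product(*grid.values())]
print(len(parameter_points))  # 10 * 5 = 50 signals
```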
The SGWB data is generated directly from the PSD given by equation (4), in which the parameter α characterizes the amplitude of the SGWB signal. We set α equal to -11.35, -11.55, and -11.75 to generate the signals in the test datasets. To train the model for the GW detection task, we generate datasets in which half of the samples contain signals and half are pure noise. Fig. 2 shows some sample cases of our generated data. Specific details for each dataset are presented in Methods.

Interpretability of the network
To better understand what information the self-attention-based neural network has learned, we explored the correspondence between the attention mechanism and the embedded input signal. We calculate the attention maps of various layers of the network. The attention map presents the average output of the attention heads in each layer, between each pair of tokens. The output of each attention head is a weighted sum of the embedded tokens of the signal, which is defined in detail in the Methods.
The core of our network is the extraction net, consisting of several STTBs (Short-Time Transformer Blocks) and LTTBs (Long-Time Transformer Blocks) stacked together. Both the LTTB and the STTB contain multi-head attention layers, which give our model a universal learning capability for different GW sources. In an STTB, attention is only calculated between tokens in the same chunk, meaning that tokens exchange information only within the corresponding chunk. The attention map of the STTBs is a diagonal line consisting of squares of chunk size, implying that STTBs are only interested in local structures. The top panel of each sub-figure in Fig. 3 shows the attention maps between all embedded tokens in the LTTB. We can see that the model weights show different patterns for different GW sources. We sum each column of the attention map to obtain the attention received by each token, which we call the attention weights (middle panel). For the EMRI, MBHB, and BWD models, the attention weights and the signal after a sliding average follow the same pattern. We also applied the Augmented Dickey-Fuller (ADF) test to the summed attention map matrix, which is a 1 × 1999 vector, for both the BWD and SGWB models, owing to the stationary nature of the signals in the data. The resulting test statistics were -6.84 and -15.33, respectively, indicating that the attention maps are also stationary. This means that our LTTB can learn the global dependency of the data. The above experiments show that our self-attention-based model has the ability to learn both local and global structures for different physical scenarios.
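As a minimal sketch of this diagnostic, assuming the averaged LTTB attention map is available as a NumPy array, the column sums and the ADF statistic (via statsmodels) can be computed as follows; the random map below is only a stand-in for the network's real output.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

def attention_weights(attn_map: np.ndarray) -> np.ndarray:
    """Sum each column of an (l, l) attention map: total attention per token."""
    return attn_map.sum(axis=0)

# Stand-in for the averaged LTTB attention map (real shape: 1999 x 1999).
attn_map = np.random.rand(1999, 1999)
attn_map /= attn_map.sum(axis=-1, keepdims=True)  # rows normalized like softmax

weights = attention_weights(attn_map)    # the 1 x 1999 attention-weight vector
adf_stat = adfuller(weights)[0]          # first entry is the ADF test statistic
print(f"ADF statistic: {adf_stat:.2f}")  # strongly negative => stationary
```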

GW signal extraction
The most straightforward test is to use our model to extract a piece of data with Gaussian noise and see whether it is able to extract the signal. In Fig. 4 we show some examples of the signal extraction performance of our model for different types of GW signals. The three subplots a, b, and c show the signal extraction performance for EMRI, MBHB, and BWD signals, and the overlap between the model output and the template is calculated. The overlap shows a very good signal extraction effect. Since the SGWB does not have a specific waveform, we do not perform this test for the corresponding model.
Then, we perform statistical tests on the model. We generate the test data in the same way as the training dataset, but with a coarser grid, creating test datasets of 10,000 samples containing signals with SNR equal to 30, 40, and 50, respectively, for each type of GW. The left column of Fig. 4 depicts the signal extraction performance of the three models for each of the three types of GWs. The showcase is selected according to the 15th percentile of the testing overlap of each GW source. In the right column, it can be seen that for the case of SNR = 50, signal extraction is performed with great accuracy for MBHB signals, and the overlap is higher than 0.99 for all test samples. The overlap for the BWD signal is greater than 0.97 for 95% of the test samples. Although the model is slightly less effective at extracting the EMRI signal due to its complexity, 92% of samples have an overlap greater than 0.95.

GW signal detection
The signals extracted by the neural network are used to build four new datasets to test the ability of our model to detect GW signals. Each of these four test datasets includes 10,000 samples containing signals and 10,000 samples of pure noise. Here we use the detection rate, another term for the true positive rate (TPR), to measure the probability of correctly identifying signals. For each signal type, at SNRs of (30, 40, 50) and a false alarm rate of 1%, the detection rates are (98.20%, 99.70%, 99.71%) for EMRI, (99.99%, 100%, 100%) for MBHB, and (99.37%, 99.97%, 99.98%) for BWD, respectively. For SGWB signals with α of (-11.75, -11.55, -11.35) and a false alarm rate of 1%, the detection rates are (95.05%, 99.97%, 100%), respectively. To quantify the performance of our model on detection tasks, we use the receiver operating characteristic (ROC) curve, as shown in Fig. 5. In ROC analysis, the true positive rate (TPR) and the false positive rate (FPR) are plotted as the probability threshold for classifying a candidate as positive (i.e., signal) is varied. The area under the ROC curve (AUC), a single scalar value between 0 and 1, is used to evaluate the classifier's performance; in general, the higher the AUC, the better the classifier. We calculate the AUC using the Scikit-Learn library [44]. For all four signal types, the AUCs are close to 1, indicating a fairly high sensitivity for classification (signal detection). We show the ROC curve on a logarithmic scale to better visualize its shape at low false alarm rates. The high AUC values indicate the strong ability of our method to detect multiple GW sources.
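The detection-rate and AUC figures above can be reproduced from classifier scores with standard Scikit-Learn calls; a minimal sketch, with synthetic scores standing in for the classifier outputs, is:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def detection_rate_at_far(y_true, y_score, far=0.01):
    """TPR at a fixed false alarm rate, read off the ROC curve."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return np.interp(far, fpr, tpr)

# Synthetic stand-ins: 10,000 signal and 10,000 noise samples with scores.
y_true = np.r_[np.ones(10000), np.zeros(10000)]
y_score = np.r_[np.random.beta(8, 2, 10000), np.random.beta(2, 8, 10000)]

print("AUC:", roc_auc_score(y_true, y_score))
print("detection rate at 1% FAR:", detection_rate_at_far(y_true, y_score))
```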

Test on LDC2a
In this subsection, we provide an empirical assessment of our model on the LDC2a dataset. As depicted in Figure 6, the model proficiently extracts all 15 signals, with an overlap exceeding 0.9 for each; notably, 13 of these signals exhibit an overlap above 0.96. Concurrently, the model demonstrates a detection probability equal to 1. Signal #2 has a lower overlap because it is totally buried in the confusion noise.
When compared with MFCNN [40], our approach not only ensures the detection of all 15 signals but also generates denoised waveforms with an overlap greater than 0.9, underlining its efficacy and precision.
On the other hand, in order to compare with the traditional method [45], we run an MCMC test on the LDC2a dataset using the open-source code from their repository, and compare the signal reconstructed from the best-fit parameters with the signal extracted by our neural network. The result is presented in Table 1. The test results show that our neural network has waveform extraction accuracy comparable to the traditional method.

Model generalization behavior
Next, we evaluate the generalization ability of the model. Results are shown in Fig. 7. The parameter space of our EMRI training dataset is only 4-dimensional, with a fixed initial semi-latus rectum p_0 = 20M. Fig. 7a shows the result of testing our model on a signal with p_0 = 30M, which indicates a strong generalization capability beyond the training parameter space. The MBHB training dataset only contains GW signals from spin-aligned MBHB systems with quasi-circular orbits, without considering orbital eccentricity. Here we generate a GW signal with initial orbital eccentricity e_0 = 0.5 using the SEOBNRE model [56]. Fig. 7b shows the extraction result, which demonstrates that our model generalizes well. To test the generalization ability of the BWD model, our test signal is generated using an evolving BWD system that accounts for mass transfer, tidal forces, and gravitational radiation effects [57]. The extraction result is shown in Fig. 7c. The output detection statistics, interpreted as detection probabilities, of these three test signals are all equal to 1. The final test case is the SGWB model; here we consider a broken power law signal following equation (5) with parameters α = −11.18, n_t1 = 2/3, n_t2 = −1/3, and f_T = 0.002. This spectral shape might arise from the combination of two physically distinct sources. The classification test obtained AUC = 0.99997.
We then evaluate our models' generalization to weaker signals. For the EMRI, MBHB, and BWD models, we test on data with lower SNR; for the SGWB model, we use test data with a smaller amplitude. The histograms of the overlap between the extracted signal and the template are shown in Fig. 4, and the ROC curves of the signal detection results are shown in Fig. 5. Throughout this series of tests, we have demonstrated the generalizability of our models in a wide variety of scenarios.
From an astrophysical perspective, LISA will observe MBHBs with very high SNR, typically greater than 100, out to very high redshift [11]. For EMRI signals, due to their physical complexity, the detection SNR threshold is ∼30 [11]. For the LISA verification binaries, almost half reach an SNR ≥ 30 [58]. Finally, the upper limit of the SGWB is Ω_GW(25 Hz) ≤ 3.4 × 10⁻⁹ for the case of n_t = 2/3, corresponding to α = −11.74 [59]. In summary, the generalization ability of the model shows its potential to be applied in practical situations.

Figure 7: Showcases of the generalization behavior of our method. a, EMRI signal with a different semi-latus rectum. b, MBHB signal with eccentricity. c, BWD signal with mass transfer. The extracted waveform is compared with whitened templates. Only the middle part of the BWD waveform is presented to show the details of the waveform. The overlap between the extracted data and the waveform templates is shown at the top.

Discussion
With the results presented above, we show the efficacy of our Transformer-based deep neural network for space-based GW signal extraction and detection of multiple sources. Our method is reusable for either simulated or future observational data and gives an almost real-time analysis with low computational cost.
One potential limitation of our method is its generalization behavior. While we show the generalization performance for multiple signals outside the training dataset, those cases are still simpler than realistic astrophysical conditions. For example, the time-delay interferometry technique is normally required to suppress the laser frequency noise in space-based GW data analysis, which introduces further complexity in the detector response to the GW signal. Therefore, from a deep learning perspective, the patterns in the time domain might be different, and a re-training of our neural network would be required.
In this paper, we present a pioneering proof-of-principle study that utilizes DNNs for the efficient detection and extraction of space-based gravitational wave signals. It is essential to clarify that our neural network model is not aimed at replacing conventional matched filtering techniques. Instead, it seeks to offer an efficient way of processing the potentially large amount of data from space-based detectors, thereby facilitating more automated and real-time gravitational wave data analysis.
Our model has demonstrated promising results in analyzing MBHB signals, even in datasets with lower signal-to-noise ratios than the LDC datasets. Nonetheless, our approach still faces limitations when addressing long-lasting signals, such as EMRIs and BWDs, which accumulate signal-to-noise ratio over time spans on the order of years. Given current computational resources, analyzing these signals in a single pass poses a considerable challenge. Despite these challenges, this research stands as a stepping stone in the domain of GW data analysis. By laying the foundation for future endeavors, we aim at the continual development and optimization of deep learning techniques in this field. As computational resources and technology advance, our model holds the potential to be adapted and evolved to effectively tackle these challenges.

GW sources in space-based detection
As mentioned in the Introduction, space-based GW detectors are being developed to detect GW signals at frequencies of [10⁻⁴, 0.1] Hz. The main GW sources in this frequency band are EMRIs, MBHBs, BWDs, and the SGWB. Below, we describe the details of the signals that come from each GW source.

EMRI
The MBHs in the centers of galaxies are typically surrounded by clusters of stars. These stars eventually evolve into COs: black holes, neutron stars, or white dwarfs. Some of these COs can get captured onto orbits bound to the central MBH and then gradually inspiral into the MBH via the emission of GWs. Typically, the ratio of the mass of the infalling CO to the mass of the MBH is ∼10⁻⁵, so these events are called EMRIs.
EMRI signal waveforms are characterized by the complex time-domain strain h(t). In the source frame, h(t) is given by [60]:

h(t) = h_+ − ih_× = (µ/d_L) Σ_{lmkn} A_lmkn(t) S_lmkn(t, θ) e^{−imϕ} e^{iΦ_mkn(t)},    (1)

where µ is the mass of the small black hole, t is the time of arrival of the GW, θ is the source-frame polar angle, ϕ is the source-frame azimuthal angle, d_L is the luminosity distance, and {l; m; k; n} are the indices describing the harmonic modes. The indices l, m, k, and n label the orbital angular momentum, azimuthal, polar, and radial modes, respectively. Φ_mkn = mΦ_φ + kΦ_θ + nΦ_r is the summation of the phases of each mode, A_lmkn is the amplitude of the GW, and S_lmkn(t, θ) is the spin-weighted spheroidal harmonic function.
In the detector frame, the EMRI signal waveform is determined by 17 parameters: {M, µ, a, a⃗_2, p_0, e_0, x_{I,0}, d_L, θ_S, ϕ_S, θ_K, ϕ_K, Φ_{φ,0}, Φ_{θ,0}, Φ_{r,0}}. M is the mass of the MBH and a is the dimensionless spin of the MBH; θ_S and ϕ_S are the polar and azimuthal sky-location angles; θ_K and ϕ_K are the polar and azimuthal angles describing the orientation of the spin angular momentum vector of the MBH. a⃗_2 is the spin vector of the CO, which is not considered in the waveform model. p is the semi-latus rectum, e is the eccentricity, I is the orbit's inclination angle from the equatorial plane, and x_I ≡ cos I.

MBHB
Most galaxies appear to host black holes at their centers, and galaxies and MBHs coevolved during the evolution of the Universe. The observation of GWs from MBHB systems can therefore improve our understanding of important astronomical phenomena such as the formation of MBHs and the merging of galaxies [11].
In this paper we only consider GWs from spin-aligned MBHB systems, which are characterized by {M_tot, q, s_1^z, s_2^z}, where M_tot = m_1 + m_2, with m_1 and m_2 the masses of the two black holes. q = m_2/m_1 (m_1 > m_2) is the mass ratio. s_1^z and s_2^z are the spin parameters of the two black holes, where z represents the direction of the orbital angular momentum.

BWD
The Milky Way contains a large population of compact binaries, most of which are BWDs with periods of ∼1 hour, right in LISA's sensitive frequency band. The signal of a BWD in the source frame is quite simple:

h_+(t) = A (1 + cos²ι) cos Φ(t),   h_×(t) = 2A cos ι sin Φ(t),   Φ(t) = 2πf t + πḟ t² + ϕ_0,    (2)

where A is the overall amplitude, ϕ_0 is the initial phase at the start of the observation, and ι is the inclination of the BWD orbit to the line of sight from the origin of the Solar System Barycenter (SSB) frame. The intrinsic parameters are the frequency of the signal f and its derivative ḟ.
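Under the reconstructed form of equation (2) above (sign conventions vary between references), a minimal NumPy sketch of the source-frame BWD waveform is:

```python
import numpy as np

def bwd_strain(t, A, f, fdot, phi0, iota):
    """Source-frame BWD polarizations following equation (2)."""
    phase = 2 * np.pi * f * t + np.pi * fdot * t**2 + phi0
    h_plus = A * (1 + np.cos(iota) ** 2) * np.cos(phase)
    h_cross = 2 * A * np.cos(iota) * np.sin(phase)
    return h_plus, h_cross

# 16000 samples at a 0.1 Hz sampling rate, matching the dataset configuration.
t = np.arange(16000) * 10.0   # dt = 1 / 0.1 Hz = 10 s
hp, hc = bwd_strain(t, A=1e-22, f=5e-3, fdot=1e-17, phi0=0.0, iota=0.6)
```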
The frequency evolves slowly: some binaries chirp to higher frequencies due to orbital decay through the emission of GWs, while others move to lower frequencies as mass transfer between the binary components drives an increase in the orbital separation [11].

SGWB
There are many resolvable sources, but there is also a large number of events that cannot be resolved individually, giving rise to the SGWB. Astrophysical background components are guaranteed in the LISA band, originating from unresolved Galactic binaries (GBs) and stellar-origin black hole mergers. SGWBs that are Gaussian, isotropic, and stationary can be fully described by their spectrum [61]:

Ω_GW(f) = (1/ρ_c) dρ_GW/d ln f,    (3)

where ρ_GW is the energy density of gravitational radiation contained in the frequency range [f, f + df], and ρ_c = 3c²H_0²/(8πG) is the critical density of the Universe, where c is the speed of light, G is Newton's constant, and H_0 = 67.9 km s⁻¹ Mpc⁻¹ is the Hubble constant.
We follow the simplified assumption, as in most previous studies, that the signal can be well described by a power law, defined by an amplitude and a slope. The signal is then described by [62]:

h²Ω_GW(f) = 10^α (f/f_*)^{n_t},    (4)

where h is the dimensionless Hubble constant, f_* is the pivot frequency, α characterizes the amplitude at f_*, and n_t is the slope of the spectrum.
Another form, used to test our model's generalization ability, is the broken power law, defined as

h²Ω_GW(f) = 10^α (f/f_T)^{n_t1} H(f_T − f) + 10^α (f/f_T)^{n_t2} H(f − f_T),    (5)

where n_t1 and n_t2 are the slopes of the two spectral segments, f_T is the break frequency, and H(f) is the Heaviside step function.
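A short sketch of both spectral shapes, as reconstructed in equations (4) and (5), follows; the break behavior is implemented with a simple mask instead of explicit Heaviside factors.

```python
import numpy as np

def omega_power_law(f, alpha, n_t, f_star=1e-3):
    """h^2 Omega_GW(f) for the single power law of equation (4)."""
    return 10.0 ** alpha * (f / f_star) ** n_t

def omega_broken_power_law(f, alpha, n_t1, n_t2, f_T):
    """h^2 Omega_GW(f) for the broken power law of equation (5)."""
    return np.where(f < f_T,
                    10.0 ** alpha * (f / f_T) ** n_t1,   # H(f_T - f) branch
                    10.0 ** alpha * (f / f_T) ** n_t2)   # H(f - f_T) branch

f = np.geomspace(1e-4, 1e-1, 500)
omega = omega_broken_power_law(f, alpha=-11.18, n_t1=2/3, n_t2=-1/3, f_T=0.002)
```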

Data Curation
First, we simulate the noise data from the LISA sensitivity curve:

S_n(f) = (1/R(f)) [ P_OMS(f)/L² + 2(1 + cos²(f/f_*)) P_acc(f)/((2πf)⁴ L²) ],

where P_OMS(f) and P_acc(f) are the optical metrology noise and the acceleration noise, respectively, L = 2.5 × 10⁹ m, and f_* = 19.09 mHz. R(f) is the transfer function; the full expression for R(f) used here is computed numerically [43]. Next we specify the generation of each dataset.
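As a minimal, self-contained sketch of this step, the widely used analytic approximation of the LISA PSD (the paper instead evaluates R(f) numerically [43]) can be combined with the standard frequency-domain recipe for colored Gaussian noise:

```python
import numpy as np

L_ARM, F_STAR = 2.5e9, 19.09e-3   # arm length [m], transfer frequency [Hz]

def lisa_psd(f):
    """Analytic LISA sensitivity approximation (Robson et al. style);
    an assumption here, since the paper uses a numerical R(f)."""
    p_oms = (1.5e-11) ** 2 * (1 + (2e-3 / f) ** 4)                          # metrology
    p_acc = (3e-15) ** 2 * (1 + (0.4e-3 / f) ** 2) * (1 + (f / 8e-3) ** 4)  # acceleration
    p_n = p_oms + 2 * (1 + np.cos(f / F_STAR) ** 2) * p_acc / (2 * np.pi * f) ** 4
    return 10.0 / (3.0 * L_ARM ** 2) * p_n * (1 + 0.6 * (f / F_STAR) ** 2)

def noise_from_psd(n=16000, fs=0.1, rng=None):
    """Stationary Gaussian noise whose one-sided PSD matches lisa_psd."""
    rng = rng or np.random.default_rng()
    freqs = np.fft.rfftfreq(n, d=1 / fs)
    freqs[0] = freqs[1]                            # avoid division by zero at DC
    sigma = np.sqrt(n * fs * lisa_psd(freqs) / 4)  # per-bin standard deviation
    spec = sigma * (rng.standard_normal(freqs.size)
                    + 1j * rng.standard_normal(freqs.size))
    return np.fft.irfft(spec, n=n)
```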
We use the augmented analytic kludge (AAK) [63,64,60] model to generate the EMRI signals, because the AAK model combines the accuracy of the numerical kludge (NK) model with the computational speed of the analytic kludge (AK) model. Note that the parameter ι in the AAK model is an orbital parameter: cos ι = L_z/√(L_z² + Q), where Q is the Carter constant and L_z is the z component of the specific angular momentum. For simplicity, we only consider a small parameter space to generate the training data. The detailed parameter ranges are shown in Table 4; the lower bounds of a and e_0 are limited by the FastEMRIWaveforms toolkit we used [65]. For MBHB signal generation, we used SEOBNRv4_opt, a significantly optimized version of the SEOBNRv4 code [66], which can produce signals for high-spin, high-mass-ratio MBHB systems. We adopted the log-uniform distribution for the parameter M_tot from Ref. [67]. The detailed parameter ranges are shown in Table 5. For the BWD dataset, we generate the signals directly using equation (2). We follow the parameter settings of the LISA Data Challenge (LDC) 1-4 dataset [68], but focus only on the intrinsic parameters f and ḟ; see Table 6 for details. Upon generating a signal, we project it onto the LISA detector. For this proof-of-concept work, we did not incorporate the time-delay interferometry (TDI) technique. Instead, we employed the long-wavelength approximation:

h_{I,II}(t) = (√3/2) [F^+_{I,II}(t) h_+(t) + F^×_{I,II}(t) h_×(t)],

where F^+_{I,II} and F^×_{I,II} are the antenna pattern functions:

F^+_I = (1/2)(1 + cos²θ_S)(cos 2ϕ_S cos 2ψ_S − cos θ_S sin 2ϕ_S sin 2ψ_S),
F^×_I = (1/2)(1 + cos²θ_S)(cos 2ϕ_S sin 2ψ_S + cos θ_S sin 2ϕ_S cos 2ψ_S),

with F^{+,×}_II(θ_S, ϕ_S, ψ_S) = F^{+,×}_I(θ_S, ϕ_S − π/4, ψ_S), where ψ_S is the polarization angle. This antenna pattern function varies with time as a result of the motion of the LISA detector. In this paper, we set the sky position and the polarization angle equal to zero for simplicity.
Because of the limited length of our data segments, Doppler shift effects can be ignored.
For these three datasets, we inject the signal into the noise with a specified optimal SNR, defined as

SNR² = (s | s),    (12)

where s represents the signal template. The inner product (h | s) is

(h | s) = 4 Re ∫_{f_min}^{f_max} [h̃(f) s̃*(f) / S_n(f)] df,    (13)

where h̃(f) and s̃(f) are the frequency-domain signals, the superscript * denotes complex conjugation, S_n(f) is the noise PSD, f_min = 3 × 10⁻⁵ Hz, and f_max = 0.05 Hz. Following the settings of the LDC dataset, we set the SNR of the training data to 50. The data is then whitened and normalized to [−1, 1]; during the whitening procedure, we apply a Tukey window with α = 1/8. The inner product can also be used to measure how well the output of our model matches the signal waveform template: we calculate the overlap between them, defined as

O(h, s) = (h | s) / √((h | h)(s | s)),    (14)

where h represents the model output and s represents the template.
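A minimal discrete implementation of equations (12)-(14) is sketched below; `psd` is assumed to be the noise PSD pre-evaluated on the rfft frequency grid, and the FFT normalization follows the continuous-transform convention.

```python
import numpy as np

def inner_product(h, s, psd, dt, f_min=3e-5, f_max=0.05):
    """Noise-weighted inner product (h|s), a discrete form of equation (13)."""
    freqs = np.fft.rfftfreq(len(h), d=dt)
    hf, sf = np.fft.rfft(h) * dt, np.fft.rfft(s) * dt  # approximate continuous FT
    band = (freqs >= f_min) & (freqs <= f_max)
    df = freqs[1] - freqs[0]
    return 4 * np.real(np.sum(hf[band] * np.conj(sf[band]) / psd[band])) * df

def optimal_snr(s, psd, dt):
    """Optimal SNR of equation (12): SNR^2 = (s|s)."""
    return np.sqrt(inner_product(s, s, psd, dt))

def overlap(h, s, psd, dt):
    """Overlap of equation (14) between model output h and template s."""
    return inner_product(h, s, psd, dt) / np.sqrt(
        inner_product(h, h, psd, dt) * inner_product(s, s, psd, dt))
```

Injection at a target SNR then reduces to rescaling the template, e.g. `s *= snr_target / optimal_snr(s, psd, dt)`.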
The SGWB dataset is generated in a very different way from the other datasets: we simply simulate the SGWB data from its PSD, which is defined by

S_h(f) = (3H_0²/(4π²)) Ω_GW(f)/f³.

We choose n_t = 2/3, α = −11.35, and f_* = 10⁻³ Hz (see equation (4)) according to the LDC configuration and Ref. [62], representing an SGWB formed by compact binary coalescences. We generate the SGWB signal from S_h(f) directly, then combine the signal and noise and perform the whitening and normalization operations described above. Lastly, Fig. 2 presents the pure signals together with the noise PSD and the SGWB PSD in the Fourier domain, as well as the data generated to train our model.

Transformer
The Transformer [69] is a kind of deep neural network (DNN) originally proposed for machine translation, and it soon achieved superior performance in various tasks in natural language processing [70] and computer vision [71]. Based entirely on attention, the Transformer has a great ability to capture both long-range and short-range dependencies in sequence data. In this section, we introduce the key structures of a general Transformer.

Self-Attention
Self-attention can be described as a function mapping a query vector and a set of key-value vector pairs to an output, where the output is a weighted sum of the values and the weight assigned to each value is determined by the compatibility of the query with the corresponding key. In a Transformer block, all queries, keys, and values have the same dimension d. Given a sequence of queries Q ∈ R^{l×d} with length l, keys K ∈ R^{l×d}, and values V ∈ R^{l×d}, the Transformer computes the scaled dot-product attention as

Attention(Q, K, V) = softmax(QKᵀ/√d) V.
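A NumPy sketch makes the definition concrete:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (l, l) query-key scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax: attention map
    return A @ V, A                               # weighted values and map

l, d = 8, 16
Q = K = V = np.random.randn(l, d)  # self-attention: all three from one sequence
out, attn_map = scaled_dot_product_attention(Q, K, V)
```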

Multi-Head Attention
Instead of applying a single attention function with d_model-dimensional queries, keys, and values, the Transformer uses multi-head attention to combine information from different linear projections of the original queries, keys, and values.
If the Transformer has H heads, the attention output of head i is

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V),

where QW_i^Q, KW_i^K, and VW_i^V are the projected queries, keys, and values corresponding to head i ∈ {1, ..., H}, with learnable parameters W_i^Q, W_i^K, and W_i^V, respectively. Here, A = softmax(QW_i^Q (KW_i^K)ᵀ/√d) is also called the attention map, A ∈ R^{l×l}, in which each element A_qk indicates how much attention token q puts on token k. With the collection of all parameters W^H ∈ R^{Hd_model×d}, the multi-head attention (MHA) of these H heads can be written as

MHA(Q, K, V) = Concat(head_1, ..., head_H) W^H.

The multi-head attention mechanism can be computed in parallel for each head, which leads to high efficiency. Moreover, multi-head attention connects the information from different projection subspaces directly, helping the Transformer learn long-term dependencies of the input sequence more easily.

Feed Forward and Residual Connection
In addition to attention layers, Transformer blocks have a fully connected feed-forward network that operates separately and identically on each position:

FFN(H′) = ReLU(H′W_1 + b_1)W_2 + b_2,

where H′ is the output of the last layer and W_1, b_1, W_2, b_2 are trainable parameters. The dimension of the input and output is equal to the model's dimension d_model, and the inner-layer dimension d_ffn should be larger than d_model. In a deeper Transformer model, a residual connection module is inserted, followed by a layer normalization module, so the output of the Transformer block can be written as

H = LayerNorm(H′ + Sublayer(H′)),

where Sublayer denotes either the multi-head attention or the feed-forward network.

Network structure
Let the T-length observation x ∈ Rᵀ be the time-domain signal we receive. x is a mixture of a target GW signal s and noise n, x = s + n, where the noise comes from the environment and the instruments. Our goal is to recover s from x. The recovered signal ŝ ∈ Rᵀ can be written as

ŝ = Dec(Enc(x) ⊙ M),

where ⊙ denotes element-wise multiplication: the decoder reconstructs the signal from the encoded input Enc(x) multiplied element-wise by the mask M predicted by the masking net. After recovering the signal, we add a multi-layer perceptron to detect whether it is a GW signal or pure noise.

Encoder and Decoder
We use a CNN as the encoder because it can extract local features from a long time-domain sequence, compressing the information. With time-domain input x ∈ Rᵀ, the encoded representation is Enc(x) ∈ R^{L×T′}. Since the estimated output signal has the same length as the input x ∈ Rᵀ, the decoder uses a transposed convolution layer for reconstruction.
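A minimal PyTorch sketch of such an encoder/decoder pair is shown below; the kernel size, stride, and channel count are illustrative assumptions rather than the paper's exact hyperparameters. With these values, a 16000-sample input yields T′ = 1999 encoded frames, which would be consistent with the 1 × 1999 attention-weight vector mentioned earlier.

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (assumed, not taken from the paper).
L, kernel, stride = 256, 16, 8

encoder = nn.Conv1d(1, L, kernel_size=kernel, stride=stride, bias=False)
decoder = nn.ConvTranspose1d(L, 1, kernel_size=kernel, stride=stride, bias=False)

x = torch.randn(1, 1, 16000)  # one 16000-sample time-domain input
enc = torch.relu(encoder(x))  # (1, L, T') with T' = 1999 here
x_hat = decoder(enc)          # transposed convolution restores length 16000
```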

Masking Extraction Net
The masking network follows a basic structure of SepFormer [72]. We employ two kinds of Transformer blocks, the STTB (Short-Time Transformer Block) and the LTTB (Long-Time Transformer Block), in the masking net. The masking network is fed with the encoded input. We split the input into chunks and concatenate them into a tensor g ∈ R^{L×C×N}, where C is the length of each chunk and N is the number of chunks.
The tensor g is processed by the Transformer blocks. The STTB computes multi-head attention within each chunk, capturing the short-time dependency inside the chunk. The LTTB then operates on the other dimension of the tensor g, modeling the long-time dependency through attention across chunks. The output of the Transformer blocks passes through a PReLU non-linearity and a 2-D convolution layer to match the output size of the decoder. A two-path convolution with different activation functions is then used to obtain the mask.
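The dual-path idea reduces to attending along two different axes of the chunked tensor; a shape-level sketch (with standard Transformer encoder layers assumed for the blocks themselves) is:

```python
import torch

L_feat, C, N = 256, 25, 80     # feature dim, chunk length, number of chunks
g = torch.randn(L_feat, C, N)  # chunked encoded representation

# STTB: attention within each chunk -> sequence axis is C, one sequence per chunk.
sttb_in = g.permute(2, 1, 0)   # (N, C, L_feat)

# LTTB: attention across chunks -> sequence axis is N, one sequence per position.
lttb_in = g.permute(1, 2, 0)   # (C, N, L_feat)
```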

Multi-Layer Perceptron
The extracted signal recovered by the decoder is fed to the MLP for detection. We use two fully connected layers to classify GW signals and noise. The first linear layer has 512 dimensions, and the second linear layer maps the vector to the probability of a true GW signal.
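A sketch of this detection head, with the 512-dimensional first layer from the text and an assumed input size and activation, is:

```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(16000, 512),  # first fully connected layer (512 dimensions)
    nn.ReLU(),              # activation choice is an assumption
    nn.Linear(512, 1),      # second layer produces a single logit
    nn.Sigmoid(),           # predicted probability of a true GW signal
)
```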

Loss function
Our loss function combines an extraction loss and a detection loss. The extraction loss is based on the scale-invariant signal-to-distortion ratio (SI-SDR) [73] used in audio enhancement, which is defined as

SI-SDR(x̂, x) = 10 log₁₀ ( ||x_t||² / ||x̂ − x_t||² ),   with   x_t = (⟨x̂, x⟩ / ||x||²) x,

where x̂ is the estimated output and x is the target.
The detection loss is the binary cross-entropy (BCE). Suppose the dataset has N samples with labels y_i, and ŷ_i is the predicted probability for sample i. The BCE loss is defined as

L_BCE = −(1/N) Σ_{i=1}^{N} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ].

The total loss of the deep neural network is therefore the combination of the two terms:

L_total = −SI-SDR + L_BCE.
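A compact PyTorch sketch of both terms follows; the equal weighting of the two losses is an assumption, as the combination coefficients are not stated.

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(x_hat, x, eps=1e-8):
    """Negative scale-invariant SDR between estimate x_hat and target x."""
    alpha = (x_hat * x).sum(-1, keepdim=True) / (x.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * x                 # scaled target component
    noise = x_hat - target             # residual distortion
    si_sdr = 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))
    return -si_sdr.mean()              # maximizing SI-SDR = minimizing its negative

def total_loss(x_hat, x, y_hat, y):
    """Extraction loss plus detection (BCE) loss, equally weighted (assumed)."""
    return si_sdr_loss(x_hat, x) + F.binary_cross_entropy(y_hat, y)
```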

Implementation details
Our extraction network repeats both the STTB and the LTTB twice (M = 2), with 4 parallel attention heads and a 512-dimensional feed-forward layer in each block. We set the length of the split chunks to C = 25. In the training stage, the model is trained for 100 epochs. We set the initial learning rate to lr_max = 1 × 10⁻³. After 35 epochs, the learning rate is halved if the validation performance does not improve for two consecutive epochs. Adam [74] is used as the optimizer with β_1 = 0.9 and β_2 = 0.98. The network is trained on a single NVIDIA V100 GPU. All of our code is implemented in Python within the SpeechBrain [75] toolkit.

Data availability
The datasets used in this study are generated by the custom code, which is provided in the repository mentioned in the Code availability section.To reproduce the datasets, please follow the instructions provided in our repository's documentation.

Code availability
The PyCBC, FastEMRIWaveforms, and SpeechBrain codes used in this study are publicly available. The custom code developed for this research can be accessed on GitHub at the following repository: https://github.com/AI-HPC-Research-Team/space_signal_detection_1. The code is distributed under the MIT License.

Figure 1 :
Figure 1: The overall architecture of our Transformer-based deep neural network. Beginning with a convolutional encoder, the data is transformed and fed into the Transformer-based extraction network. This network, composed of several Short-Time Transformer Blocks (STTB) and Long-Time Transformer Blocks (LTTB), excels at capturing both local and global dependencies within the GW data, with the aim of extracting GW signals. The final stage is a multi-layer perceptron-based classifier, responsible for signal detection and providing a predicted probability.

Figure 2 :
Figure 2: Training data samples of each GW source in the frequency domain and time domain. a, examples of EMRI, MBHB, and BWD signal power spectra, along with the SGWB power spectral density (PSD) and the LISA sensitivity curve, in the frequency domain. b-e, each sub-figure shows an example of a GW signal from a specific source. Here we plot the whitened h_I(t) (orange) and h_I(t) with noise (purple). The signal-to-noise ratio (SNR) of the EMRI, MBHB, and BWD signals is 50, and the SGWB signal has n_t = 2/3 and α = −11.35 (see equation (4)).

Figure 5 :
Figure 5: The signal detection performance from our classification perspective. a-c, each sub-figure shows the ROC curves of a model aimed at detecting EMRI, MBHB, and BWD signals, with test data SNR equal to 30, 40, and 50, respectively. d, the ROC curves of the model aimed at detecting the SGWB with 3 different amplitudes. We show the ROC curves on a logarithmic scale to better visualize their shape at low false alarm rates. The high AUC values indicate the strong ability of our method to detect multiple GW sources.

Figure 6 :
Figure 6: Results of denoising and detection for the LDC2a dataset using our model. Each panel represents one of the 15 signals, with the number of the MBHB signal from the LDC2a dataset displayed at the top, along with the computed overlap between the denoised waveform generated by our model and the target template. Additionally, the predicted detection probability is provided. All 15 signals are effectively denoised, with most exhibiting high overlaps and detection probabilities equal to 1.


Table 1 :
The table presents the overlap between the waveform reconstructed by the MCMC method and the template, compared with the overlap between the waveform reconstructed by our neural network and the template. The major advantage of deep neural networks over the traditional method is speed. Table 2 summarizes the computational cost of traditional data analysis methods, based on the technical notes submitted by various research groups in MLDC Rounds 1, 1B, and 3, as well as related papers. We could not find the MLDC Round 2 and 4 technical notes. We can see that traditional approaches for searching for GW signals inside 1- or 2-year MLDC and LDC datasets typically take a few hours. The computational cost of our method is presented in Table 3. While our model can evaluate 474 days of data in 1.9 seconds, it takes 2.5 minutes to evaluate the entire test dataset, which contains 101250 data samples (79 iterations with a batch size of 256). The network can handle a batch of samples in parallel in a single computing iteration (see the 4th and 5th rows of Table 3; here, batch means the number of data samples input to the network per computing iteration), but this only adds up the signal duration; it means the model can process independent data segments in parallel given more computing resources.

Table 2 :
Computational cost of traditional methods on the MLDC and LDC datasets.

Table 4 :
Summary of parameter setups in EMRI signal simulation.

Table 5 :
Summary of parameter setups in MBHB signal simulation.

Table 6 :
Summary of parameter setups in BWD signal simulation. Recoverable fragments of the table give f ∈ [4, 15] mHz, with ḟ ranges of [… × 10⁻¹⁷, 6 × 10⁻¹⁶] Hz² and [−3 × 10⁻¹⁵, 4 × 10⁻¹⁴] Hz².