From rumor to genetic mutation detection with explanations: a GAN approach

Social media have emerged as increasingly popular means and environments for information gathering and propagation. This vigorous growth of social media has contributed not only to a fast-spreading and far-reaching pandemic of rumors and misinformation, but also to an urgent need for text-based rumor detection strategies. To speed up the detection of misinformation, traditional rumor detection methods based on hand-crafted feature selection need to be replaced by automatic artificial intelligence (AI) approaches. AI decision-making systems are required to provide explanations in order to assure users of their trustworthiness. Inspired by the thriving development of generative adversarial networks (GANs) in text applications, we propose a GAN-based layered model for rumor detection with explanations. To demonstrate the universality of the proposed approach, we also show its benefits on a gene classification with mutation detection case study. Like rumor detection, gene classification can be formulated as a text-based classification problem. Unlike fake news detection, which needs a previously collected verified news database, our model provides explanations in rumor detection based on tweet-level texts only, without referring to a verified news database. The layered structure of both the generative and discriminative models contributes to the outstanding performance. The layered generators produce rumors by intelligently inserting controversial information into non-rumors, and force the layered discriminators to detect detailed glitches and deduce exactly which parts of the sentence are problematic.
On average, in the rumor detection task, our proposed model outperforms state-of-the-art baselines on the PHEME dataset by 26.85% in terms of macro-f1. The excellent performance of our model on textual sequences is also demonstrated by the gene mutation case study, on which it achieves a 72.69% macro-f1 score.

Sequential synthetic data generation, such as generating text and images that are indistinguishable to human eyes, has become an important problem in the era of artificial intelligence (AI). Generative models, e.g., variational autoencoders (VAEs) 1 , generative adversarial networks (GANs) 2 , and recurrent neural networks (RNNs) with long short-term memory (LSTM) cells 3 , have shown outstanding power in generating fake faces, fake videos, etc. GANs, among the most powerful generative models, estimate generative models via an adversarial training process 2 . Real-valued generative models have found applications in image and video generation. However, GANs face challenges when the goal is to generate sequences of discrete tokens such as text 4 . Given the discrete nature of text, backpropagating the gradient from the discriminator to the generator becomes infeasible 5 . Training instability is a common problem of GANs, especially those with discrete settings. Unlike in image generation, the autoregressive property of text generation exacerbates the training instability, since the loss from the discriminator is only observed after a sentence has been generated completely 5 . To remedy some of these difficulties, several AI approaches (e.g., Gumbel-softmax 6,7 , Wasserstein GAN (WGAN) 8,9 , reinforcement learning (RL) 4,10 ) have been proposed 11,12 . For instance, Gumbel-softmax uses a reparameterization trick and a softmax calculation to approximate the non-differentiable sampling operation on the generator output, which allows the model to perform backpropagation while providing outputs that approximate discrete values. GANs with Gumbel-softmax take the first step toward generating very short sequences over a small vocabulary 7 . The WGAN method for discrete data directly calculates the Wasserstein divergence between discrete labels and the generator's output as the criterion of the discriminator. As a result, WGAN models can update parameters to learn the distribution of discrete data and produce some short sentences at the character level 9 . However, generating natural language-level sentences is still non-trivial. GANs with RL can skirt the problem of information loss in the data conversion by modeling text generation as a sequence of decisions and updating the generator with a reward function. Compared to previous methods, RL can help GANs generate interpretable text closer to natural language 4 . In addition to the recent developments in GAN-based text generation, discriminator-oriented GAN-style approaches have been proposed for detection and classification applications, such as rumor detection 13 . Unlike the original generator-oriented GANs, discriminator-oriented GAN-based models take real data (instead of noise) as the input to the generator. Fundamentally, the detector may achieve high performance through the adversarial training technique. Current adversarial training strategies improve the robustness against adversarial samples. However, these methods lead to a reduction in accuracy when the input samples are clean 14 .
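The Gumbel-softmax relaxation mentioned above can be illustrated with a minimal sketch. It is not taken from the cited works: the function name `gumbel_softmax`, the `temperature` parameter, and the example logits are assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw a relaxed (soft) one-hot sample from a categorical distribution.

    Illustrative sketch only: names and defaults are assumptions.
    """
    eps = 1e-12
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), eps), 1.0 - eps)
        gumbel_noise = -math.log(-math.log(u))  # sample from Gumbel(0, 1)
        noisy.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical generator scores for a 3-token vocabulary.
sample = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=random.Random(0))
```

Lower temperatures push the output closer to a one-hot vector, while higher temperatures give smoother vectors; the output stays differentiable with respect to the logits, which is what makes backpropagation possible.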
Social media and micro-blogging have become increasingly popular 15,16 . The convenient and fast-spreading nature of micro-blogs fosters the emergence of various rumors. Social media rumors, misinformation, and fake news are major concerns, especially during major events such as the global rise of COVID-19 and the U.S. presidential election. Some of the coronavirus rumors were later verified to be very dangerous false claims, e.g., "those that suggest drinking bleach cures the illness" 17 , which has pushed social media companies such as Facebook to find more effective solutions 18 . Commercial giants, government authorities, and academic researchers make great efforts to diminish the negative impacts of rumors 19 . Many researchers have formulated rumor detection as a binary classification problem. Traditional approaches based on hand-crafted features describe the distribution of rumors 20,21 . However, early works depending on hand-crafted features require heavy engineering skills. More recently, with the rise of deep learning architectures, deep neural network (DNN)-based methods extract and learn features automatically, and achieve significantly high accuracies on rumor detection 22 . Generative models have also been used to improve the performance of rumor detectors 13 , and to formulate multi-task rumor classification systems 23 that realize rumor detection, tracking, stance classification, and veracity classification. However, binary rumor classification lacks explanation, since it only provides a binary result without expressing which parts of a sentence could be the source of the problem. The majority of the literature defines a rumor as "an item of circulating information whose veracity status is yet to be verified at the time of posting" 24 . Providing explanations is challenging for detectors working on unverified rumors. By comparison, fake news is better studied, as it has a verified veracity. Attribute information, linguistic features, and the semantic meaning of posts 25 and/or comments 26 have been used to provide explainability for fake news detection. A verified news database has to be established for these approaches. However, for rumor detection, sometimes a decision has to be made based on the current tweet only. Text-level models with explanations that recognize rumors by feature extraction should be developed to tackle this problem.
Gene classification and mutation detection usually work with textual gene data and relate to a broad range of real-world applications, such as gene-disease association, genetic disorder prediction, gene expression classification, and gene selection. Machine learning-based classification and prediction tools have been proposed to solve these genetic problems 27,28 . Since a gene sequence is essentially of textual nature, we can process a genetic sequence as text. Gene mutation detection looks for abnormal places in a gene sequence 29 . Hence, we propose to solve this problem by using a natural language processing-based mutation detection model. When comparing a gene sequence with a natural language sequence, we observe that the mutations in genetic sequences represent abnormalities that make the sequence fit poorly, from a biological perspective, compared to other sequences. The detection and classification of known genetic mutations has been effectively explored in the literature, while the detection and classification of unknown mutations remains a harder problem in both the medical and machine learning fields. To detect unknown mutations and classify them, we propose a GAN-based framework that maintains a high performance level when facing unseen data with unknown patterns, while also providing explainability capabilities.
In this work, we propose a GAN-based layered framework that overcomes the afore-mentioned technical difficulties and provides solutions to (1) text-level rumor detection with explanations and (2) gene classification with mutation detection. In terms of solving the technical difficulties, our model keeps the ability to discriminate between real-world and generated samples, and also serves as a discriminator-oriented model that classifies real-world and generated fake samples. We overcome the infeasibility of propagating the gradient from the discriminator back to the generator by applying a policy gradient method similar to SeqGAN 4 to train the layered generators. In contrast to prior works, we adopt an RL approach in our framework because, by combining the GAN and RL algorithmic strategies, the framework can produce textual representations of higher quality and balance the adversarial training. The training instability of long sentence generation is lowered by selectively replacing words in the sentence. We solve the per-time-step error attribution difficulty by word-level generation and evaluation. We show that our model outperforms the baselines in terms of addressing the degraded accuracy problem with clean samples only.
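To make the policy gradient idea concrete, the sketch below shows a generic REINFORCE-style update for a single categorical choice (for example, choosing one substitute token). It is a simplified illustration under stated assumptions, not the training procedure of this paper or of SeqGAN: the names `softmax` and `reinforce_step`, the learning rate, and the stand-in reward function (which would come from a discriminator in a GAN setting) are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, lr=0.5, rng=random):
    """One REINFORCE update on the logits of a categorical policy.

    Sample an action, observe its reward, and move the logits along
    reward * grad log pi(action). For a softmax policy, the gradient of
    log pi(action) with respect to logit k is 1[k == action] - probs[k].
    """
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


# Hypothetical reward: pretend the discriminator is fooled only by token 2.
def toy_reward(action):
    return 1.0 if action == 2 else 0.0


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(500):
    logits = reinforce_step(logits, toy_reward, rng=rng)
```

Because the reward is attached to a sampled discrete action rather than backpropagated through it, no gradient needs to flow through the sampling step, which is the property that sidesteps the non-differentiability of discrete text.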
Our GAN-based framework consists of a layered generative model and a layered discriminative model. The generative model generates high-quality sequences by first intelligently selecting items to be replaced, then choosing appropriate substitutes to replace those items. The discriminative model provides classification output with explanations. For example, in the gene classification and mutation detection task, the generative model mutates part of the genetic sequence and then the discriminative model classifies this genetic sequence and tells which genes are mutated. The major contributions of this work are: (1) This work delivers explainable rumor detection without requiring a verified news database. Rumors can stay unverified for a long period of time because of insufficient information. Providing explanations of which words in the sentence are problematic is critical, especially when there is no verified fact. When a verified news database is available, our model is capable of fake news detection with minor modifications. (2) Our model is a powerful textual mutation detection framework. We demonstrate the mutation detection power by applying our proposed model to the task of gene classification with mutation detection. Our model accurately identifies tokens in the gene sequences that are exhibiting mutations, and classifies mutated gene sequences with high precision. (3) The layered structure of our proposed model avoids the function mixture and boosts the performance. We have verified that using one …
Table 1. Macro-f1 and accuracy comparison between our model and baselines on the rumor detection task. The models are trained on PHEME and tested on both the original dataset PHEME and the augmented dataset PHEME+PHEME'. * indicates the best result from the work that proposed the corresponding model. L indicates that the model is evaluated under the leave-one-out principle. Variance results in cross-validations are shown in Table 2. The best results are marked in bold.
… baselines on the rumor detection task. The models are trained on the augmented dataset PHEME+PHEME' and tested on both the original PHEME and the augmented PHEME+PHEME'. L indicates that the model is evaluated under the leave-one-out principle.
… baselines under PHEME+PHEME'. Our model reaches the highest values in both versions of PHEME+PHEME', and the variation of our model with LSTM cells follows as the second best. Under the leave-one-out (L) principle (i.e., leave out one news topic for testing and use the rest for training), our proposed model and the variation achieve the highest macro-f1 scores in all cases. These results confirm the rumor detection ability of the proposed layered structure under new, out-of-domain data. Adversarial training of the baselines improves generalization and robustness under PHEME+PHEME', but hurts the performance under clean data, as expected. Although our model and the variation are trained adversarially, they achieve the highest macro-f1 under the clean data PHEME. The results confirm that our model outperforms the baselines in terms of addressing the accuracy reduction problem. Table 3 shows two examples that are correctly detected by our model but incorrectly detected by the baselines. For the first rumor, the baselines CNN, LSTM, VAE-CNN, and VAE-LSTM provide scores of 0.9802, 0.9863, 0.4917, and 0.5138, respectively. Our model provides a very low score for this rumor, while the baselines all generate relatively high scores, and some even classify it as a non-rumor. This is a very difficult example, since from the sentence itself even we, as human rumor detection agents, cannot confidently pick out the suspicious parts. However, our model gives a reasonable prediction and shows that it has the ability to understand and analyze complicated rumors. For the second example, a non-rumor, the baselines CNN, LSTM, VAE-CNN, and VAE-LSTM provide scores of 0.0029, 0.1316, 0.6150, and 0.4768, respectively.
In this case, a non-rumor sentence gains a high score from our model, but several relatively low scores from the baselines. This example again confirms that our proposed model indeed captures the complicated nature of rumors and non-rumors.
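The leave-one-out principle described above (hold out one news topic for testing and train on the rest) can be sketched as follows. The sample layout `(topic, text, label)` and the function name `leave_one_topic_out` are assumptions made for illustration, and the toy records are placeholders rather than PHEME data.

```python
def leave_one_topic_out(samples):
    """Yield (held_out_topic, train, test) splits, one per distinct topic.

    Each sample is assumed to be a (topic, text, label) tuple.
    """
    topics = sorted({topic for topic, _text, _label in samples})
    for held_out in topics:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Placeholder records; topic names other than "sydneysiege" are hypothetical.
toy_samples = [
    ("sydneysiege", "tweet 1", "R"),
    ("sydneysiege", "tweet 2", "N"),
    ("topic_a", "tweet 3", "R"),
    ("topic_b", "tweet 4", "N"),
]

for held_out, train, test in leave_one_topic_out(toy_samples):
    print(held_out, len(train), len(test))
```

Splitting by topic rather than by random sample is what makes the test data out-of-domain: no tweet about the held-out event is seen during training.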
Explanation results. A component for decision explanation is realized by D explain , which offers insight into the detection problem by suggesting suspicious parts of given rumor texts. Our model's D explain recognizes the modified parts in sequences accurately. In 2-class PHEME experiments, its macro-f1 on PHEME'v5 and PHEME'v9 are 80.42% and 81.23% , respectively. Examples of D explain predicting suspicious parts in rumors are shown in Table 4. In the first rumor, "hostage escape" is the most important part in the sentence, and if these two words are problematic, then the sentence is highly likely to be problematic. Given an unverified or even unverifiable rumor, D explain provides reasonable explanation without requiring a previously collected verified news database.
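Since macro-f1 is the main metric reported here, a small worked example may help. The sketch below computes macro-f1 as the unweighted mean of per-class F1 scores; the function name `macro_f1` and the toy labels are illustrative assumptions and do not reproduce any result from the tables.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy rumor (R) / non-rumor (N) labels.
truth = ["R", "R", "N", "N", "N"]
preds = ["R", "N", "N", "N", "R"]
score = macro_f1(truth, preds)  # F1(R) = 1/2, F1(N) = 2/3, mean = 7/12
```

Because every class contributes equally regardless of its size, macro-f1 penalizes a detector that does well only on the majority class, which matters for small and imbalanced rumor datasets.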
Rumor/non-rumor, true/false, and real/fake. Misinformation, disinformation, fake news, and rumor classifications have been studied in the literature 23,30-32 and frequently suffer from small-scale datasets. The difference between misinformation, disinformation, fake news, and rumor is not well-defined, and the labeling in these tasks is sometimes ambiguous and imprecise. In this work, we specifically refer to a rumor as a piece of information whose veracity is not verified, and its label in the detection task is rumor (R)/non-rumor (N). With the consideration of veracity status, we refer to facts as true (T) and to false statements as false (F). Furthermore, we refer to purely human-written statements as real (E) and to machine-generated statements as fake (K). In the previous detection section, we perform binary classification in the rumor detection task. Our generative model replaces parts of a sequence, and due to the uncertain nature of rumors, we label the generated (modified) rumors as R and the non-rumors in the original dataset as N, to emphasize the purpose of filtering out non-rumors in real-world applications. However, with real/fake and true/false labeling in misinformation or fake news classification, the labeling should be precise, and 2-class labeling is no longer sufficient for the generated (modified) sequences. Specifically, if an input sequence is labeled as Y, its modified version (i.e., the output of our generative model) is labeled as Y ′ to represent that it is modified from a sequence with label Y. In what follows, we perform the following experiments: (1) rumor classification with PHEME, again using 4-class labels: R, R ′ , N, N ′ ; (2) misinformation (disinformation) classification with FMG (a misinformation/fake news dataset) using 4-class labels: T, T ′ , F, F ′ ; and (3) fake news classification with FMG using 4-class labels. Experimental results of PHEME (4-class) are shown in Table 5.
Table 3. Examples of D explain and D classify 's predictions on a rumor (first) and a non-rumor (second). The suspicious words in the rumor predicted by D explain are marked in bold. D classify provides a score ranging from 0 to 1, where 0 and 1 represent rumor and non-rumor, respectively.
0.1579: Who's your pick for worst contribution to sydneysiege mamamia uber or the daily tele
0.8558: Glad to hear the sydneysiege is over but saddened that it even happened to begin with my heart goes out to all those affected
Similar to the previous PHEME experiment in Table 1, we generate a dataset PHEME' for data augmentation. However, unlike before, this newly generated PHEME' (4-class) has four labels: R, R ′ , N, N ′ , and our GAN models are trained with 4-class classification. In addition, we train the baselines with the augmented dataset PHEME+PHEME' (4-class) and test them with PHEME. We find that training with augmented data improves the performance of the baselines. Our models (-LSTM and -CNN) still provide the best results compared to the (augmented) baselines. Besides rumor detection, we apply our framework to misinformation and fake news detection tasks using a fake news dataset (FMG) 33 , which includes both real/fake and true/false data. In the real/fake task, models differentiate between purely human-written statements and (partially or fully) machine-generated statements, while in the true/false task, models are required to identify true statements and false claims. We augment the original dataset (denoted as FMG) with our GAN-generated data (denoted as FMG') and train several models with the augmented dataset (denoted as FMG+FMG'). As in the PHEME (4-class) experiments, we find that models trained with the augmented FMG+FMG' achieve higher performance on the original FMG, as shown in Table 6. From these experimental results, we conclude that our framework is effective for data augmentation and helps models achieve higher accuracy. One thing to note is that in this experiment, our models do not outperform the augmented LSTM and CNN in the provenance classification task (although they are better than the unaugmented ones). This could be due to the fact that the nature of provenance classification is to distinguish patterns between human-written and machine-generated sentences.
In the early training process of our model, the training data (generated sequences) of our discriminative model are low-quality since the generative model is not well-trained. The generated sequences contain our machine-generated noisy patterns, which could make our model converge to suboptimal results.
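The labeling rule used in these 4-class experiments (a sequence with label Y yields a modified sequence with label Y′) can be sketched as a small augmentation helper. Everything here is illustrative: `augment_with_primed_labels` and `toy_modify` are hypothetical names, and `toy_modify` is only a stand-in for the layered generative model, not the model itself.

```python
def augment_with_primed_labels(dataset, modify):
    """Return the original samples plus modified copies labeled Y'.

    `dataset` is a list of (text, label) pairs; `modify` is any function that
    maps a text to its modified version (here a stand-in for the generator).
    """
    augmented = list(dataset)
    for text, label in dataset:
        augmented.append((modify(text), label + "'"))
    return augmented


def toy_modify(text):
    """Hypothetical stand-in: replace the last word with a blank token."""
    words = text.split()
    return " ".join(words[:-1] + ["_"]) if words else text


original = [("some unverified claim", "R"), ("a plain statement", "N")]
augmented = augment_with_primed_labels(original, toy_modify)
# Labels now cover the four classes R, N, R', N'.
```

Keeping the source label inside the primed label is what lets a classifier separate "modified from a rumor" from "modified from a non-rumor" instead of collapsing all generated text into a single class.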
Limitations and error cases in rumor detection. Examples of error cases of our model in the rumor detection task are presented in Table 7. For some short sentences, D explain sometimes fails to predict the suspicious parts. The reason is that the majority of the training data are long sentences, hence D explain performs better with long sentences. We can solve this problem by feeding more short sentences to our model. In most cases, although D explain does not generate predictions, D classify can still provide accurate classification. As shown in Table 7, D classify outputs a low score, i.e., classifies the input as a rumor, for four out of five rumors.
Table 5. Macro-f1 and accuracy comparison between our model and baselines on the extended 4-class experiments of the rumor detection task on the PHEME dataset. U indicates that the model is trained on PHEME+PHEME'; otherwise it is trained on the original PHEME dataset. All models are tested on PHEME (R/N) and PHEME+PHEME' (R/N/R ′ /N ′ ). The best results are marked in bold.
Table 6. Macro-f1 and accuracy comparison between our model and baselines on the extended 4-class experiments of the provenance (real/fake) and veracity (true/false) tasks. U indicates that the model is trained on FMG+FMG'; otherwise it is trained on FMG. All models are tested on FMG and FMG+FMG'. The best results are marked in bold. Column groups: FMG (E / K), FMG+FMG' (4-class), FMG (T / F), FMG+FMG' (4-class).
Gene classification with mutation detection. Genetic sequence classification, gene mutation detection/prediction, and DNA/RNA classification all work with genetic sequences, and deep learning-based methods in the literature take sequential data as input and output the classification results 27,28,34 . Since our proposed framework demonstrates very good results for sequential/textual data (as shown in previous sections), we next adopt a textual representation 35,36 of gene sequences and investigate a gene mutation phenomenon. Note that a binary format representation of genetic sequences is also frequently used in the literature 37,38 . In our GAN framework, the input to the models is first encoded into a high-dimensional vector; therefore, the binary formatting does not affect the experimental results. In this experiment, we first perform a mutation in genetic sequences with the generative model, and then use D classify to classify a genetic sequence and predict which parts of the sequence are mutated. We find that our framework not only provides high accuracy in the classification task, but also accurately identifies the mutations in the generated sequences.
In this experiment, all models are trained on NN269+NN269' (an augmented dataset) to ensure fairness, and we follow the labeling rule of the misinformation/fake news detection task. When testing with NN269+NN269', there are 8 classes in total: AP, AN, DP, DN from NN269 (the original splice site dataset) and AP ′ , AN ′ , DP ′ , DN ′ from NN269' (the generated dataset). The detailed experimental setup can be found in the "Methods" section. If solely clean data from NN269 is accessible during training, then our proposed model and its variation are the only models that can recognize whether a given sequence is modified or unmodified. A comparison between our model's (and the variation's) D classify and the baselines is shown in Table 8. Under long acceptor data, the baselines perform significantly worse than our model and the variation. Under short donor data, our model and the variation achieve the highest AURoCs. This implies that our model and the variation are stronger when the inputs are long sequences. The layered structure and adversarial training under the augmented dataset give our model the ability to extract meaningful patterns from long sequences. For short sequences, our model and the variation provide the highest AURoC, and simpler models such as CNN can also give good classification results. This is because for short sequences, textual feature mining and understanding is relatively easier than in long … Table 9. The results suggest that our model can not only classify a gene sequence, but also provide an accurate prediction that explains which part of the sequence is modified.
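For readers less familiar with the AURoC metric used in the gene experiments, the sketch below computes it for a binary task through its pairwise-ranking interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function name `auroc` and the toy scores are assumptions for illustration; they are not values from Table 8.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for positive and 0 for negative samples.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AURoC needs at least one positive and one negative")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy classifier scores: three of the four positive/negative pairs are ranked correctly.
value = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])  # 0.75
```

This quadratic-time form is meant for clarity; for large test sets a sort-based computation gives the same value more efficiently.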

Discussion
Rumor, as a piece of circulating information without verified veracity status, is hard to detect, especially when we have to point out why it is a rumor. Misinformation, whose veracity is determined, can be detected when there exists a verified database containing information about why the misinformation is wrong. Rumor detection is a hard problem, and rumor detectors in the literature usually suffer from low accuracy. The reasons for the unsatisfactory performance are manifold. For example, rumor datasets are usually small and imbalanced. Data-driven machine learning detectors do not have sufficient high-quality data to work with, hence the data shortage causes low or extremely imbalanced performance. Rumors usually emerge rapidly during national or even international emergencies, and confirming the veracity of rumors can take a long time and a large amount of human resources. Therefore, rumors can remain floating and circulating pieces of information without confirmed veracity for a long time and provoke social panic, as in the recent coronavirus outbreak. Rumors are associated with different events, so if the detector is trained with previously observed rumors on other events, the detection of unseen rumors associated with a new event usually results in low accuracy, because the patterns of the rumors have changed. Compared to the detection problem, pointing out the problematic parts of the rumors is even more difficult for similar reasons. We propose a framework that addresses the afore-mentioned issues. To solve the limited and imbalanced data issue and the low performance problem, our proposed GAN-based framework augments the dataset by generating new rumors/misinformation/fake news and uses the augmented data to train the discriminators to achieve high accuracy. The layered generative model intelligently decides where and how to modify the input sequences.
This process injects noise in data and pushes the discriminators to learn the essential semantic and syntactic features of the rumors. Therefore, this process alleviates the impact of event-associated patterns. To provide reasonable explanations of why the sentence is potentially a rumor, we improve the discriminator in GAN to include a layered structure to (1) make the detection decision, (2) generate the explanation, and (3) provide a corresponding layered model-tuning signal to the layered generative model.
Genetic sequence classification, genetic mutation detection/prediction, gene-disease association, and DNA expression classification all work with gene sequences. Machine learning-based methods such as support vector machines and deep neural networks have already been used to solve these problems. In this work, we propose and verify the applicability of our framework to gene classification and mutation detection. The fundamental rationale is that a genetic sequence is essentially textual data. Since our proposed framework takes textual data as input and makes classification decisions, it is reasonable to apply the framework to gene data. Mutation detection in gene data finds the abnormal places in a gene sequence, and rumor detection with explanation finds the abnormal places in a sentence. One problem faced by gene mutation detection is that there might be unknown patterns in the gene sequence, which is similar to the generalization problem in rumor detection: unknown patterns exist in unobserved rumors. Hence, our proposed GAN-based model can alleviate this issue by intelligently augmenting the dataset. From an algorithmic perspective, both rumor detection and gene classification can be formulated as a textual sequence classification problem. (Although genetic sequences can be represented in binary format, we have discussed that binary-formatted genetic sequences can be further encoded into vectors as the input to our model, which does not produce different results in our experiments.) Therefore, our framework, as a sequential data classification model, should be applicable to both rumor and gene classification. We can learn which parts of a rumor are suspicious/machine-generated, and this is no different from learning, given a sequence, which parts contain abnormal patterns.
Following similar reasoning, in the gene mutation detection task, our model learns which parts of a genetic sequence are abnormal. The difference is that language has intuitive semantic meanings, whereas a genetic sequence may have unknown hidden semantic meanings. Our goal is to investigate both, even though they are different, in order to provide an example of a methodology for interdisciplinary research and analysis. In summary, we proposed a layered text-level rumor detector and gene mutation detector with explanations based on GAN. We used the policy gradient method to effectively train the layered generators. Our proposed model outperforms the baseline models in mitigating the accuracy reduction problem that arises when the input samples are clean. We demonstrate the classification ability and generalization power of our model by comparing it with multiple state-of-the-art models in both rumor detection and gene classification with mutation detection problems. On average, in the 2-class rumor detection task, our proposed model outperforms the baselines on the clean dataset PHEME and the enhanced dataset PHEME+PHEME' by 26.85% and 17.04% in terms of macro-f1, respectively. Our model provides reasonable explanations without a previously constructed verified news database, and achieves significantly high performance. In the gene classification with mutation detection task, our model identifies the mutated gene sequence with high precision. On average, our model outperforms the baselines on both NN269 and NN269+NN269' (2-class) by 10.71% and 16.06% in terms of AURoC, respectively. In both the rumor detection and gene mutation detection tasks, our model's explanation generation ability is demonstrated by identifying the mutations accurately (above 70% macro-f1). We find that using two discriminators to perform classification and explanation separately achieves higher performance than using one discriminator to realize both functions. We also find that pre-training D classify and varying N replace contribute to the high accuracy of D explain .
Despite the high performance in both applications, we do find a limitation of our framework. D explain sometimes fails to provide explanations in rumor experiments when the input sentences are very short, even though the corresponding D classify generates accurate predictions. One potential reason for this result is that the dataset contains a small number of short sentences and the model is not trained enough in short sentence cases. We also observed D explain performs a bit worse in gene mutation detection experiments than in rumor detection task. It could be caused by the choice of N replace (the number of items to be replaced in a sequence), which is a hyper parameter that affects the mutation detection ability. As part of our future work, to improve the performance of the discriminators, we would like to choose N replace intelligently. To enhance the performance of our generators, we would like to explore the application of hierarchical attention network 39 . We will also investigate the dependencies between the discriminators of our model to benefit D explain from the accurate D classify .
We believe our proposed framework could be beneficial to numerous textual data-based problems, such as rumor and misinformation detection, review classification for product recommendation, twitter-bot detection and tracking, false information generation and attack defense, and various genetic data-based applications. We connect the genetic data processing and the natural language processing field and provide new angles and opportunities for researchers in both fields to contribute mutually.

Methods
Our model-overview. Figure 2 shows the architecture of our proposed model. A layered generative model takes an input sequence and modifies it intelligently; a layered discriminative model then performs classification and mutation detection. In the rumor detection task, the generators must construct a rumor that looks like a non-rumor in order to deceive the discriminators. Because a good lie usually has some truth in it, we replace only some of the tokens in the sequence and keep the majority. In our framework, intelligently replacing tokens in a sequence takes two steps: (1) determine where to replace (i.e., which words or items in the sequence), and (2) choose which substitutes to use. $G_{where}$ and $G_{replace}$ are designed to realize these two steps, respectively. With strong generators in place, the discriminators are designed to provide a defense mechanism. Through adversarial training, the generators and discriminators grow stronger together, at generating and detecting rumors, respectively. In the rumor detection task, two questions must be answered for a given sentence: (1) is it a rumor or a non-rumor, and (2) if it is a rumor, which parts are problematic? $D_{classify}$ and $D_{explain}$ are designed to answer these two questions, respectively. We found that realizing two functions in one layer, in either the discriminative or the generative model, hurts performance. Hence, our framework embeds a layered structure; the generative and discriminative models are described in detail below.
Our model-generative model. The sequence generation task is done by the generative model, which consists of $G_{where}$ and $G_{replace}$. Given a human-generated real-world input sequence $x = (x_1, x_2, \ldots, x_M)$ of length $M$, such as a tweet-level sentence containing $M$ words, $G_{where}$ outputs a probability vector $p = (p_1, p_2, \ldots, p_M)$ giving the probability that each item $x_i$ ($i \in [1, M]$) is replaced. $p$ is applied to the input $x$ to construct a new sequence $x^{where}$ in which some items are replaced by blanks. For example, if $x_2$ becomes a blank, then $x^{where} = (x_1, \_, \ldots, x_M)$.
Formally, $x^{where} = f(p) \bullet x$, where $f(\cdot)$ binarizes its input based on a hyper-parameter $N_{replace}$, which determines the percentage of words to be replaced in a sentence. The operator $\bullet$ works as follows: if $a = 1$, then $a \bullet b = b$; if $a = 0$, then $a \bullet b = \_$. $G_{replace}$ is an encoder-decoder model with an attention mechanism. It takes $x^{where}$, fills in the blanks, and outputs a sequence $x^{replace} = (x_1, x^{replace}_2, \ldots, x_M)$. The generative model is not fully differentiable because of the sampling operations in $G_{where}$ and $G_{replace}$. To train the generative model, we adopt policy gradients 40 from RL to address this non-differentiability.
$D_{classify}$ provides a probability in rumor detection, and $D_{explain}$ provides the probability that each word in the sentence is problematic. The explainability of our model is gained through adversarial training: we first insert adversarial items into the sequence, and then train $D_{explain}$ to detect them. Through this technique, our model can classify not only data with existing patterns, but also sequences with unseen patterns that may appear in the future. Adversarial training improves the robustness and generalization ability of our model.
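The blanking step of the generative model described above can be illustrated with a minimal sketch. The text does not specify how $f(\cdot)$ binarizes the probabilities, so the sketch assumes a top-k rule: the fraction $N_{replace}$ of items with the highest replacement probability is blanked. The function names `binarize` and `apply_mask` and the example tokens are hypothetical and serve only as illustration.

```python
BLANK = "_"

def binarize(p, n_replace):
    """Return a keep-mask f(p): 1 keeps an item, 0 marks it for replacement.

    Assumption (not stated in the text): the k items with the highest
    replacement probability are selected, with k = round(n_replace * len(p))
    and at least one item selected.
    """
    k = max(1, round(n_replace * len(p)))
    ranked = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    chosen = set(ranked[:k])
    return [0 if i in chosen else 1 for i in range(len(p))]

def apply_mask(mask, x):
    """Masking operator from the text: a = 1 keeps b, a = 0 yields a blank."""
    return [item if keep == 1 else BLANK for keep, item in zip(mask, x)]

# Illustrative input: five tokens and made-up replacement probabilities.
tokens = ["police", "confirm", "the", "suspect", "fled"]
p = [0.1, 0.9, 0.2, 0.8, 0.3]
mask = binarize(p, n_replace=0.4)
x_where = apply_mask(mask, tokens)
print(x_where)
```

In the full model, the blanks in the resulting sequence would then be filled by $G_{replace}$; that step is not sketched here.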
Training. In the rumor detection task, a sequence $x$ has a true label $Y$ that is either rumor $R$ or non-rumor $N$. After the generative model manipulates $x$, its output $x^{replace}$ is labeled $R$ because it is machine generated. The objective of the $\phi$-parameterized generative model is to mislead the $\theta$-parameterized discriminators. In our case, $D^{\theta}_{classify}(x^{replace})$ indicates how likely the generated $x^{replace}$ is to be classified as $N$, and $D^{\theta}_{explain}(x^{replace})$ indicates how accurately $D^{\theta}_{explain}$ detects the replaced words in the sequence. Error attribution per time step arises naturally because $D^{\theta}_{explain}$ evaluates each token and therefore provides a fine-grained supervision signal to the generators. For example, consider a case where the generative model produces a sequence that deceives the discriminative model. The reward signal from $D^{\theta}_{explain}$ then indicates how much the position of each replaced word contributes to the erroneous result, while the reward signal from $D^{\theta}_{classify}$ represents how well the combination of position and replacement word deceived the discriminator. The generative model is updated by applying a policy gradient to the rewards received from the discriminative model.
The rumor generation problem is defined as follows. Given a sequence $x$, $G^{\phi}_{where}$ produces a sequence of probabilities $p$ giving the replacement probability of each token in $x$. We apply the reward value provided by the discriminative model to the generative model only after the whole sequence has been produced. The reason is that $G^{\phi}_{replace}$ does not need to generate every word in the sequence; it only fills the few blanks created by $G^{\phi}_{where}$. Under this assumption, the long-term reward is approximated by the reward obtained once the whole sequence is finished.
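To make the idea of a sequence-level reward concrete, the following is a minimal REINFORCE-style sketch of one policy gradient step for the replace-or-keep decisions. It is not the training code of the model: it assumes independent Bernoulli decisions per token parameterized by logits, and the way the two discriminator signals are combined in `generator_reward` is a hypothetical choice made only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def generator_reward(p_nonrumor, explain_accuracy):
    """Hypothetical sequence-level reward for the generators.

    It is high when the classifier rates the generated sequence as a
    non-rumor and the explainer fails to locate the replaced tokens.
    """
    return p_nonrumor + (1.0 - explain_accuracy)

def reinforce_step(logits, actions, reward, lr=0.1):
    """One gradient-ascent step on reward * log pi(actions).

    logits[i] parameterizes the probability of replacing token i;
    actions[i] is 1 if token i was replaced in the sampled sequence, else 0.
    The same reward is applied to every decision, since it is only
    available after the whole sequence has been produced.
    """
    updated = []
    for z, a in zip(logits, actions):
        p = sigmoid(z)
        # For a Bernoulli policy with p = sigmoid(z): d log pi(a) / dz = a - p.
        updated.append(z + lr * reward * (a - p))
    return updated

logits = [0.0, 0.0, 0.0]
actions = [0, 1, 0]  # the sampled sequence replaced only the second token
reward = generator_reward(p_nonrumor=0.8, explain_accuracy=0.3)
print(reinforce_step(logits, actions, reward))
```

With a positive reward, the logit of the replaced position increases and the logits of the kept positions decrease, so decisions that fooled the discriminators become more likely in later samples.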
The discriminative model and the generative model are updated alternately. The loss function of the discriminative model combines an explanation term and a classification term, each weighted by its own balancing parameter. We adopt the GAN training method to train the networks. Because over-training either the discriminators or the generators may cause training to fail, the hyper-parameters $G_{STEP}$ and $D_{STEP}$ are introduced to balance the training: in each epoch, the generators are trained $G_{STEP}$ times, and then the discriminators are trained $D_{STEP}$ times.
Experiment setup-model setup. Our model contains a layered generative model, $G_{where}$ and $G_{replace}$, and a layered discriminative model, $D_{explain}$ and $D_{classify}$. The architecture setup is as follows. $G_{where}$ is an RNN with two bidirectional LSTM (BiLSTM) layers and one dense layer; it determines which items in a sequence are to be replaced. In all experiments, $G_{where}$ uses the architecture EM-32-32-16-OUT, where EM and OUT denote the embedding and output layers, respectively. $G_{replace}$ is an encoder-decoder with an attention mechanism and is responsible for generating the substitutes for the items selected by $G_{where}$. The encoder has two GRU layers, and the decoder has two GRU layers equipped with the attention mechanism. In all experiments, $G_{replace}$ uses the architecture EM-64-64-EM-64-64-OUT. $D_{explain}$ has the same architecture as $G_{where}$ and is responsible for determining which items are problematic. $D_{classify}$ is a CNN with two convolutional layers followed by a dense layer and is used for classification. In all experiments, it uses the architecture EM-32-64-16-OUT.
Experiment setup-data collection and augmentation. We evaluate our proposed model on a benchmark Twitter rumor detection dataset, PHEME 43 , a misinformation/fake news dataset, FMG 33 , and a splice site benchmark dataset, NN269 44 .
PHEME has two versions. PHEMEv5 contains 5792 tweets related to five news events, of which 1972 are rumors and 3820 are non-rumors. PHEMEv9 contains 6411 tweets related to nine news events, of which 2388 are rumors and 4023 are non-rumors. The maximum sequence length in PHEME is 40, and we pad shorter sequences with zeros. The FMG dataset contains two parts, corresponding to a veracity detection task (i.e., determine whether a news item is true or false) and a provenance classification task (i.e., determine whether a news item is real or fake), respectively. In the veracity task, sequences labeled true are verified facts and sequences labeled false are verified false statements. In the provenance task, sequences labeled real are purely human-written sentences, while the fake data are generated with pre-trained language models. We set the maximum sequence length to 1024 and 512 in the true/false and real/fake tasks, respectively; we pad shorter sequences with zeros and apply post truncation to texts longer than the length threshold. The NN269 dataset contains 13231 splice site sequences. It has 6985 acceptor splice site sequences with a length of 90 nucleotides, of which 5643 are positive (AP) and 1324 are negative (AN). It also has 6246 donor splice site sequences with a length of 15 nucleotides, of which 4922 are positive (DP) and 1324 are negative (DN).
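The length normalization applied to the text datasets above (zero padding of short sequences and post truncation of long ones) can be sketched as follows. The text does not say on which side the padding is added, so appending the zeros at the end is an assumption; the helper name `pad_or_truncate` and the token ids are hypothetical.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Bring a sequence of token ids to exactly max_len entries.

    Post truncation: tokens beyond max_len are dropped from the end.
    Zero padding: shorter sequences are filled with pad_id
    (appended at the end here, which is an assumption).
    """
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

# A short tweet padded to length 6 and a long one truncated to length 4.
print(pad_or_truncate([17, 4, 92], max_len=6))
print(pad_or_truncate([17, 4, 92, 8, 33, 51], max_len=4))
```

With `max_len` set to 40 this corresponds to the PHEME setting, and with 1024 or 512 to the two FMG tasks.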
In the rumor detection task, we generate a rumor/fake news/misinformation dataset denoted as PHEME' (and FMG'), and then augment the original dataset with the generated sequences. Similarly, for the gene classification with mutation detection task, the proposed model generates a dataset NN269' by replacing nine characters in acceptor sequences and three characters in donor sequences. We label the generated sequences by the following rules. In the rumor detection with explanation task, (1) generated rumors based on PHEME are labeled as R (rumor) in 2-class classification (corresponding to the results in Table 1); (2) in 4-class classification (corresponding to the results in Table 5 and Table 6), if the input sequence $x$ has label $Y$, then the output sequence $x^{replace}$ is labeled as $Y'$, indicating that $x^{replace}$ is from class $Y$ but with modification. In the gene mutation detection task, we follow the labeling rule described in (2), and the final classification output of our model is two-fold: AP or AN for acceptors, and DP or DN for donors. We merge the generated classes AP', AN' and DP', DN' with the original classes to evaluate the noise resistance of our model. Given a sequence, our model can classify it into one of the known classes, whether the sequence is clean or modified.
Experiment setup-baseline description. In the rumor detection task, we compare our model with six popular rumor detectors: an RNN with LSTM cells, a CNN, VAE-LSTM, VAE-CNN, a contextual embedding model with data augmentation (DATA-AUG) 45 , and a GAN-based rumor detector (GAN-GRU) 13 . One strength of our proposed model is that, under the layered structure we designed, the choice of model structure affects the results, but not significantly. To showcase this property of the layered structure, we generate a variation of the proposed model by replacing $G_{replace}$ with an LSTM model as one baseline.
This variant uses an LSTM-based encoder-decoder with architecture EM-32-32-EM-32-32-OUT as $G_{replace}$. Our model generates a set of sequences by substituting around 10% of the items in the original sequences. We pre-train $D_{classify}$ with the replacement ratio fixed at $N_{replace} = 10\%$. We then freeze $D_{classify}$ and train the other three models. During training, we lower $N_{replace}$ from 50% to 10% to guarantee data balance for $D_{explain}$ and to obtain better explanations. All the embedding layers in the generators and discriminators are initialized with 50-dimensional GloVe 46 pre-trained vectors. Early stopping is applied during training. The generated data in the rumor task are labeled as R, and we denote this dataset as PHEME'. For fairness and consistency, we train the baselines LSTM, CNN, VAE-LSTM, and VAE-CNN with PHEME and PHEME+PHEME'. For all baselines, we use two evaluation principles: (1) hold out 10% of the data for model tuning, i.e., we split the dataset into a training set (90% of the data) and a test set (10% of the data); (2) the leave-one-out (L) principle, i.e., leave out one news event for testing and train the models on the other events. For example, for PHEMEv5, which contains 5 events, we pick 1 event as our test set and use the remaining 4 events as our training set. (Similarly, for PHEMEv9, which contains 9 events, we pick 1 event as our test set and use the remaining 8 events as our training set.) Moreover, with the L principle, we apply 5- and 9-fold cross validation for PHEMEv5 and PHEMEv9, respectively. Final results are calculated as the weighted average of all results. The L principle constructs a realistic testing scenario and evaluates the rumor detection ability on new out-of-domain data. For DATA-AUG and GAN-GRU, we import the best results reported in their papers.
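The leave-one-out (L) principle described above can be sketched as a loop over events followed by a weighted average of the per-fold scores. The text does not state which weights are used, so weighting each fold by the size of its test set is an assumption; the function names, event names, and scores below are hypothetical placeholders, not results from the experiments.

```python
def leave_one_event_out(events):
    """Yield (held_out_event, train_items, test_items), one fold per event.

    events maps an event name to the list of tweets belonging to it.
    """
    for held_out in events:
        test = events[held_out]
        train = [item
                 for name, items in events.items() if name != held_out
                 for item in items]
        yield held_out, train, test

def weighted_average(scores, sizes):
    """Average per-fold scores, weighting each fold by its test-set size
    (the choice of weights is an assumption)."""
    total = sum(sizes)
    return sum(score * size for score, size in zip(scores, sizes)) / total

# Placeholder data: three events with a few tweet ids each.
events = {"event_a": [1, 2], "event_b": [3], "event_c": [4, 5, 6]}
for held_out, train, test in leave_one_event_out(events):
    print(held_out, len(train), len(test))
```

For PHEMEv5 this loop would produce 5 folds and for PHEMEv9 it would produce 9, matching the 5- and 9-fold cross validation mentioned above.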
In the gene classification with mutation detection task, we compare our model with five models: an RNN with LSTM cells, a CNN, VAE-LSTM, VAE-CNN, and the state-of-the-art splice site predictor EFFECT 47 . The first four baselines are trained on NN269+NN269', and tested on both NN269+NN269' and the clean data NN269. We import EFFECT's results from the original work 47 . The architectures of the baselines LSTM, CNN, VAE-LSTM, and VAE-CNN used in both tasks are defined in Table 10. VAE-LSTM and VAE-CNN use a pre-trained VAE followed by the LSTM and CNN architectures defined in Table 10. The VAE we pre-trained is an LSTM-based encoder-decoder. The encoder, with architecture EM-32-32-32-OUT, has two LSTM layers followed by a dense layer. The decoder has the architecture IN-32-32-OUT, where IN stands for the input layer.
Table 10. Baselines' architecture setup in both the rumor detection task and the gene classification with mutation detection task.

Model
Gene mutation detection task Rumor detection task