Assessing systemic risk in financial markets using dynamic topic networks

Systemic risk in financial markets refers to the breakdown of a financial system due to global events, catastrophes, or extreme incidents, leading to huge financial instability and losses. This study proposes a dynamic topic network (DTN) approach that combines topic modelling and network analysis to assess systemic risk in financial markets. We make use of Latent Dirichlet Allocation (LDA) to semantically analyse news articles, and the extracted topics then serve as input to construct topic similarity networks over time. Our results indicate how connected the topics are so that we can correlate any abnormal behaviours with volatility in the financial markets. With the 2015–2016 stock market selloff and COVID-19 as use cases, our results also suggest that the proposed DTN approach can provide an indication of (a) abnormal movement in the Dow Jones Industrial Average and (b) when the market would gradually begin to recover from such an event. From a practical risk management point of view, this analysis can be carried out on a daily basis when new data come in so that we can make use of the calculated metrics to predict real-time systemic risk in financial markets.

www.nature.com/scientificreports/ refers to the risk that a shock or failure event triggers the breakdown of the entire system; the scale could be domestic or transnational. We are interested in applying our DTN method to study the systemic risk during the 2015-2016 stock market selloff and the COVID-19 pandemic.
Research has been conducted to extract insights from textual information: for instance, to explain abnormal volatility in financial markets through news 28 , to forecast market stress and volatility 29,30 , to improve the accuracy of predictions in time series volatility models 31 , and to correlate Internet searches with subsequent movement in stock prices 32 . Sentiment analysis is another way to make use of textual information: for instance, the use of Twitter feeds to predict the Dow Jones Industrial Average (DJIA) 33 and the use of posts on Weibo to predict Chinese stock price movement 34 . These studies suggested that textual information provides certain predictive powers in financial markets. On the other hand, some studies have used network analysis to depict relations between different entities 35 . Initially, network analysis was applied predominantly in the field of social sciences 36 , but it has now been extended to other fields of study: for example, in the field of medicine to understand the transmission of infectious diseases 37 , the relations between genetics and human diseases 38 , and drug-target interactions 39 and in the finance field to assess systemic risk and contagion effects in financial markets 40 . Due to the global pandemic, an increasing number of studies have also made use of network analysis to predict and visualise pandemic risk [41][42][43][44] and its influence on financial market connectedness [45][46][47] .
Although real-time financial news is readily available, research combining textual information and network analysis to predict financial markets is limited. Some similar approaches have previously been proposed. One study, for example, made use of the topic modelling technique to evaluate the temporal change in extracted topics using Japanese news 48 ; in that paper, however, articles are grouped by month before the analysis, so the networks might not be responsive enough to detect any sudden changes or potential systemic risk in the market, especially in the case of an unprecedented event such as the global pandemic. Another paper proposed combining the two analytical techniques: In that study, documents were semantically analysed and a similarity network was subsequently constructed 49 ; however, the research focused mainly on the visualisation perspective and was therefore not correlated with financial markets for the prediction of market volatility. Given the gap in the existing literature, this paper intends to make two contributions. First, it explores the possibility of quantifying the topological features of extracted topics dynamically. We adopted a rolling window method to obtain updated topological results when new articles are provided; this way we could continuously keep track of and understand how connected the networks were and therefore detect any abnormal patterns immediately. Second, the results might serve as an indicator, on top of other existing financial indices, that topological features could explain systemic risk and opportunity in financial markets in the face of a tremendous change or event.
We used the 2015-2016 stock market selloff and COVID-19 as use cases to demonstrate our DTN approach using all U.S. news articles from Reuters between 1 January 2015 and 31 December 2016, and between 1 January 2019 and 30 September 2020. After data cleaning, we were left with 2,123,284 news articles and 290,771 unique words. The results revealed significantly less connected topic networks in mid May and November 2015, a few months before the downfalls of DJIA in August 2015 and January 2016 during the stock market selloff, and in mid-and late-March 2020, around and after the time when COVID-19 was declared a global pandemic. After exhibiting an abnormal pattern in March and May 2015, network connectedness showed a gradual rebound to normal levels. In March 2020, after exhibiting an abnormal pattern for more than a week, network connectedness reverted to its original level. This might indicate an increase in investors' confidence in the financial market, which was reflected by the gradual pick up of the DJIA a few months or a few days later. The results also showed the possibility of quantifying and visualising the influence of terms, in particular the emerging term coronavirus in mid-March 2020 in our use case, on topic similarity networks in different time periods. Figure 1 shows the design of this research with its six main steps.
Step 1 involved collecting news articles of interest for the analysis. In step 2, the input articles were cleaned, lemmatised, and had repeating phrases removed; they were then processed using a sliding window method in step 3. Each article subset was semantically modelled in step 4, and the results were further analysed by constructing topic similarity networks in step 5. Lastly, in step 6, we visualised the final results and compared them to financial indices.

Results
After the standard text cleaning process, lemmatisation, and the removal of repeated phrases, news articles were sliced by a sliding window of 7 days with a stride of 1 day for dynamic news retrieval to obtain M t documents at time t. Each segment of the M t documents was semantically analysed using the Latent Dirichlet Allocation (LDA) model, with the number of topics being set to K = 80 50 . These topics, each represented by words in the corpus and their respective topic-word probability matrices, β t , served as the input to form DTNs. We used the similarity between β it and β jt to investigate the topological features of the financial news over time, where β it is the i-th topic's word distribution at time t. The topological features were summarised using network statistics which can provide insights on financial market evolution, market variation, and systemic risk in financial markets. More detailed description of the construction of the DTNs can be found in the "Methods" section. The results are presented and visualised in the following subsections.
Visualisation of dynamic topic networks. To explore the network structures of latent topics from financial news, we present network diagrams where network nodes represent latent topics and edges are formed by the similarity of β it among 80 topics. As an illustration, Fig. 2 shows three network structures at different dates, where Fig. 2a-c represent the fifteenth day of January, March, and May 2020, respectively. The first date is before the WHO declared COVID-19 a pandemic on 11 March 2020, and the last two dates are after the announcement. In the figures, each node represents a topic, whose importance is reflected by its node size, proportional to the number of words generated by the topic across the entire corpus or proportional to the topic probabilities www.nature.com/scientificreports/ in θ t . If the topic contains the word coronavirus, its within-topic frequency percentage is obtained from β it and represented by the red gradient according to the percentage. In Fig. 2a, which shows the topic network on 15 January 2020, it is observed that topics were generally highly connected, without any main deviation of nodes from the main cluster. At this time, the pandemic was not an influential subject, therefore we do not see an influence of COVID-19 on any topics and all nodes are of the same colour. Figure 2b for the topic network on 15 March 2020, on the other hand, shows that the network structure was comparatively less connected and some important topics (with big node sizes) were noticeably farther away from the main cluster. This is not surprising as the nodes with different colours suggest that these topics are coronavirus related. With the emergence of this new pandemic-related word and the sudden growth of its influence on news articles, the network structure was altered from dense networks in January 2020 to networks with multiple small clusters in March 2020, causing a sudden drop in network connectedness, as seen in Fig. 2b. Moving forward to 15 May 2020, we observe in Fig. 2c that the term coronavirus still existed in various topics, but it no longer showed a significant influence in the news articles, as it did in March 2020, but instead was integrated with other terms. We can also see that the coronavirus-related nodes have lighter colours compared to those on 15 March 2020, suggesting a much lower influence of the term within topics as well. The network structure as a whole appears similar to that in January 2020 (Fig. 2a).  Network graphs at three different dates. Node size represents topic importance, which is proportional to the number the tokens generated by the topic across the whole corpus. For example, a size of 0.2 represents that an estimated 20% of tokens was covered by the chosen topic i. If the topic is coronavirus-related, its node will be represented by a red gradient, whose colour is mapped to the range between 0 and 0.1 according to the word-topic probability for coronavirus in β it . www.nature.com/scientificreports/ Network connectedness. To understand how connected the latent topics were, Figs. 3 and 4 show two network statistics, average degree ( D t ) and average clustering coefficient ( C t ), as presented in the "Methods" section, between 1 January 2015 and 31 December 2016, and 1 January 2019 and 30 September 2020. The average degree, D t , in the networks measures how many edges (out of 79, excluding itself) on average link to a node. The higher the average degree, the more dense the network at time t will be. Similarly, the average clustering coefficient, C t , measures how big the clusters formed by neighbourhoods of a node of the network are on average at time t. In our context, both D t and C t measure the level of connectedness of the topic network at time t.
In Fig. 3, the connectedness reached the lowest around late March 2015, with the lowest points 20 March ( D t = 43.8, C t = 0.783) and 28 March ( D t = 32.125, average C t = 0.723). After around a month of resuming to the normal level, the connectedness began to go downhill again in late April/early May 2015 and reached another local minimum on 12 May ( D t = 40.25, C t = 0.781). Moving forward, the connectedness reached yet again another local minimum in mid-November, and the drop was in particular obvious for the C t ( D t = 49.175, C t = 0.807). There were also some moderately low points in late April/early May, early August and early November in 2016.
As observed in Fig. 4, topics were highly connected (with D t of around 70) at all time t except for March 2020. The two network statistics, D t and C t , began to show a downward trend on 10 March 2020, and the trend continued until the statistics reached the lowest point on 18 March 2020 for average clustering coefficient, ( C t = 0.657) and on 19 March 2020 for average degree ( D t = 24.475). An interesting phenomenon is that D t and  www.nature.com/scientificreports/ C t rebounded quickly to "normal levels" after a sharp drop in mid-March 2020. This structural change in topic network connectedness in March 2020 indicates that semantic features summarised by the DTNs constructed by financial news may give us insights into how financial markets react to unexpected events or even catastrophes. Therefore, studying the relationship between these DTNs and financial market movement may provide systemic risk implications from financial news. Given the results from the network statistics, we try to understand what the topics were at several time points in the pre-pandemic period, the period when the network statistics started to drop to a low value, and the period when the network statistics returned to a normal level. The fifteenth day of January, March, May, and July 2020 were studied, and the top 30 most salient words are shown in Table 1. A majority of the words are finance related, and some of these words, such as share, target, and rating, are in the list of the most salient words in all four randomly selected dates. One interesting finding was that the word coronavirus, which was not the most salient word on other dates, topped the ranking on 15 March 2020, a few days after the WHO declared COVID-19 a global pandemic.
As mentioned, the word coronavirus was one of the most significant words on 15 March 2020, and thus we show the top four most important COVID-19 related topics on that date in Fig. 5 to understand the underlying word distribution for topic i, i.e. β it . The four topics are topics 1, 3, 4, and 6, where the first three topics contain the word coronavirus as a major word and topic 6 contains travel and flight as major words. Topics are represented by bubbles, whose sizes are proportional to their importance and which are clustered together on the basis of their similarities. The four topics, 1, 3, 4, and 6 are highlighted by red bubbles in four diagrams. We can see that coronavirus had a significant importance for the selected news period (9-15 March 2020). Also, it comes as no surprise that the bubbles corresponding to the four most important topics are close to each other, indicating a relatively higher across-topic similarity. Topic 1 is related to the financial market in general and how it reacted to the global pandemic. Topic 3 focuses on the societal perspective and the latest development of the virus. Topic 4, on the other hand, links the pandemic to politics in the United States, and Topic 6 shows how coronavirus was affecting the world and the travel industry. The above topic characteristics explain why the connectedness Relationships with financial indices. We chose to relate our dynamic topic network statistics to two indices: (1) DJIA: Composed of the 30 largest stocks listed on the U.S. stock exchanges, the DJIA is a good representative of U.S. stock market performance; (2) VIX: This is a derived index from S&P 500; it is widely used by investors and financial institutions to measure market risk and stability and is a good proxy for systemic risk in financial markets. A comparison of the network statistics with the VIX and DJIA gave us some interesting results. Figures 6 and 7 display the network statistics and the DJIA in the same graph. It can be seen that the topic network statistics, D t and C t , showed two main drops around April/May and November 2015, several months before the drops of DJIA in September 2015 and January 2016, respectively and a general downward trend that aligned with the DJIA in March 2020. During the COVID-19 pandemic, the two network statistics took a sudden dip for a few days and then quickly rebounded to their original levels and became stationary again, whereas the DJIA reached the lowest point of 18,591.93 on 23 March 2020 before gradually recovering to its original level.
Given the patterns in Figs. 6 and 7, one might be curious to see if there is any correlation or lead-lag relationship between the time series of the network statistics and the DJIA. Since we were also interested in whether one series was ahead of another, we needed a way to quantify the synchrony between the two series. We adopted the dynamic time warping (DTW) algorithm 51 to detect the leader-follower interaction. Specifically, we used the DTW algorithm to evaluate the similarity between network statistics and DJIA. It starts with the first time step of the query series and calculates its distance with all data points on the reference series. Then it moves to the second time step of the query series and it repeats the same calculation until the last time step. The best alignment between the two series is found by getting the minimum warping path using the calculations. Figures 8 and 9 show the point-by-point comparison between the average degree, D t , and the DJIA for the years 2015-2016 and 2020 respectively. The grey lines connect the two series on the basis of the shortest Euclidean distance with the   March 2020 are even steeper and show stronger evidence that the average degree "led" the DJIA to return to normal levels. The responsive nature of the network statistics potentially provides insights for academics, investors, and financial institutions on when the market would start to recover after an unprecedented event, such as the COVID-19 pandemic investigated in this study. Figure 10 shows another comparison between the average degree of the topic networks, VIX, and DJIA in 2020. These three series experienced an unusual movement during the pandemic period in February and March 2020. VIX showed an increasing trend starting in late February 2020, while the DJIA exhibited a substantial drop at a similar time until it reached its lowest point of 18591.93 on 23 March 2020. The two topic network statistics, D t and C t , started to show a drastic decrease in early March, dropping to their lowest value on 18 March 2020 before responsively returning to their original value a few days later.

Discussion
The severe downturn of financial markets or systemic risk can be attributed to different events or financial incidents. A possible incident is related to credit markets and a substantial increase in default probabilities in a short period, for example, in the financial tsunami in 2008. There is common scientific evidence of having higher systemic risk from a significant increase in financial network connectedness 45,[52][53][54] . The idea of using the topic network connectedness of latent topics to capture possible systemic risk is explained in Fig. 2. When an event that is of great financial relevance is evolving, there can have a transition in the topic network structure from  www.nature.com/scientificreports/ Fig. 2a,b, where we see the emergence of event-related words (in this case pandemic-related words), causing the quick formulation of small clusters of new topics and a sudden drop in the topic network connectedness. The small clusters of new topics may signal unexpected responses in the financial markets to the big events which can cause big impacts to the financial markets. To further elaborate the use of the dynamic topic networks as a possible tool to provide early warning signals for severe financial downturns, we have included the use cases of the 2015-2016 stock market selloff and the COVID-19 pandemic. We find a substantial drop in the topic network connectedness in March and May 2015 before seeing a substantial adjustment in the DJIA in August 2015. We also see a sudden drop in the topic network connectedness in March 2020 during the COVID-19 pandemic.
To understand the underlying topics in news articles and whether they could provide an early assessment of systemic risk in financial markets, we made use of LDA for topic modelling 55 , and the latent topics were used to construct DTNs on the basis of topic similarities 36 . The final outputs (i.e. metrics of network connectedness) were related to benchmark market indices to investigate the predictive power of our approach. Using the 2015-2016 stock market selloff as a case, Fig. 3 reveals a sudden drop in the topic network connectedness in March and May 2015. These significant decline in topic network connectedness can be an indication of the propagation of fear in financial markets which is reflected in the evolution of new influential topics in the corpus. As seen in Fig. 6, following the significant drops in the topic network connectedness in about three months, we see a steep downturn in DJIA in August 2015. Using COVID-19 as a case having worldwide impacts, we conclude that DTNs and their network statistics could be auxiliary measures to detect systemic risk and, in addition, could provide an opportunity to determine when a market might pick up after a tremendous change in the financial markets due to extraordinary events or catastrophes. As seen in Fig. 4, the topic networks began to be less connected on 10 March 2020. As we used a sliding window length of 7 days, this means the analysis on that date included news articles from 4 to 10 March 2020. This might be an indication that investors already foresaw instability in the financial market even before WHO declared COVID-19 a global pandemic on 11 March 2020, and this loss in confidence was captured in the news articles prior to the slump in the DJIA. Slightly more than a week after the global pandemic declaration, the topic network statistics started to pick up again and rapidly returned to their original level. The DJIA exhibited a similar pattern a few days after the recovery of the topic network statistics, the only difference being that the increasing trend was much slower.
As stated in the "Introduction", VIX is one of the widely recognised indices for measuring market volatility and a main risk indicator for numerous financial institutions and investors to refer to, although some have argued that it might have limited predictive power 12,13 . In this research, we included VIX for comparison with the network analysis results. As can be seen in Fig. 10, VIX started to drop at a similar time to when the network statistics rose. Same as the DJIA, however, the rate was comparatively slow due to the fact that it was highly related to the derivative products on the S&P500 index 8 , and it was not until May 2020 that VIX began to reach a plateau. The topic network statistics are much more responsive measures compared to VIX, possibly because news articles are real-time information that contain a variety of insights and latest events from around the globe. Financial news, in particular, is a useful source that captures and encapsulates investors' confidence in the market. The increase in the topic network statistics a week after the global pandemic declaration might have provided an indication that investors were experiencing less uncertainty and becoming less anxious about future movement in the financial market, which could therefore subsequently explain the gradual increase in the DJIA that followed a few days later.
The point-by-point comparison in Figs. 8 and 9 also suggest that the average degree, D t , was ahead of the DJIA (i.e. possibly showing a leading time series signal for future movement of the DJIA). During the COVID-19 pandemic, When the average degree began to recover in mid-March 2020, it provided an indication that the DJIA would also slowly pick up in the coming month. This further highlights the possibility of using news articles to ascertain when the recovery of global stock markets might happen after a period of crash and thus detect opportunities in the financial market. The dynamic information in news articles can supplement other indices, such as VIX, in guiding investors and financial institutions to have better predictive power in financial markets, especially on systemic risk management. It is important to note that the DTW algorithm tries to match the patterns by constructing the optimal warping paths and does not necessarily imply causality. Nevertheless, it provides us with a glimpse of the similarity between the two series and how time indices in one series could be matched to those in the other series.
This paper shows the possibility of combining the topic modelling and network analysis of textual information to give predictions in the financial market. A limited number of studies have attempted to combine the two methods to understand news articles but from other perspectives 48,49 . This paper looks at the methodology from the perspective of constructing DTNs and uses topic network statistics to capture network connectedness hidden in text and financial news. We also relate our topic network results to the DJIA and VIX. The results show the potential of using this novel approach to predict the influence of the pandemic on financial risk. The COVID-19 pandemic induced pandemic-related salient words in the news articles on 15 March 2020, and it is evident from Fig. 2 that their presence caused topic similarity networks to be less coherent. The word coronavirus was still in the corpus after two months; however, it was much less influential and began to become part of other bigger topics. Moreover, our proposed topic network approach allowed us to easily quantify the influence of latent topics dynamically, as in Fig. 4. Through the combination of topic modelling and network analysis, we have gained a better understanding of the influential words and topics in selected time periods.
This paper takes the COVID-19 pandemic as a use case. The paper 45 documented the extraordinarily high financial network connectedness due to the impacts of the COVID-19 pandemic. In addition, the papers 46,47 presented evidence that the pandemic network connectedness can lead the financial network connectedness. The above research documented the possible contagion or transmission of the pandemic risk to financial risk. Possible channels of the transmission can also be found in 24 . The implication of our methodology does not limit to the COVID-19 pandemic. To illustrate further insights of the dynamic topic networks on financial risk Latent topics from financial news reflect contemporary issues or events relevant to the financial markets. Our findings present evidence that new latent topics may emerge suddenly and become influential to financial investors. The emergence of new topics can induce small clusters of new topics as seen in Fig. 2. This paper demonstrates that by extracting the dependence in the latent topics to form dynamic topic networks, we can provide hints to extraordinary financial market downturns or systemic risk. We have presented evidence from the financial market adjustment in 2015 and the COVID-19 pandemic. For applying the dynamic topic network methodology to other financial markets, we are recommended to use corpora relevant to the financial markets of interest. For example, if we want to infer and trace the systemic risk in the financial market in Hong Kong, we can construct the dynamic topic networks using the news corpus of Hong Kong or Asia. From the risk management point of view, the dynamic nature of the research framework means that the analysis can be continuously updated to generate predictions on a daily basis, provided that the latest financial news articles are available. Just as we quantify the influence of the COVID-19 pandemic in this study, in the future, we could also capture and quantify the sentiment of financial institutions and investors through news, observe if any anomalies exists in world events that could change investors' confidence, and dissect the fluctuations via observation of topic network connectedness over time. We can also attribute these changes in network connectedness to particular words to provide a comprehensive and scientific view of the semantic dynamic of the news. Further research will be conducted to solidify the findings and extend the applicability of the proposed research framework.

Methods
This section describes the methodology of the research. Each step corresponds to the research design in Fig. 1. Our data analysis is entirely based on Python. We used gensim for most of the natural language processing, in particular the LDA that we heavily focused on in this research. To create network graphs, we used networkx. To create topic distribution graphs in Fig. 5, we used pyLDAvis. For the DTW algorithm, we used dtw-python. All other visualizations are made with the package matplotlib.
Data. For this research, we collected U.S. news articles between 1 January 2015 and 31 December 2016, and 1 January 2019 and 30 September 2020 from Reuters. Table 2 shows a few examples of financial news with the posting date, headline, and body text.
Text cleaning. For each article, the headline and the body text were combined to form the full text. Standard text-cleaning procedures were carried out, including the removal of white spaces, non-alphabetical characters, and stop-words and the conversion of all words to lower case. Only nouns and proper nouns (part-of-speech tags: NOUN and PROPN) 56 were retained before applying lemmatisation. Words that appeared only once in a given period are removed from the constructed dictionary. On the other hand, there were numerous repeated phrases in the articles: for instance, 'Reporting by ... ' , 'For company coverage: ... ' , and 'Click the following link to watch video: ... ' , and so on. These phrases do not add value to the analysis and were therefore removed using regular expressions. The final cleaned version consisted of 2123284 news articles with 290771 unique words, and this served as the input for the topic modelling. Tables 3, 4 Sliding window. This study adopted a sliding window approach in which news articles in a given window length were extracted, semantically modelled, and analysed with a topic similarity network; then, the window would slide to the next period. These steps were repeated until the window reached the last available date of the provided news articles. In this study, the window length was set as 7 days with a stride of 1 day so that the LDA in the next step could capture the latent topics with enough news articles while also being able to reflect changes in a responsive manner. Specifically, on day t, we used the documents on day t − 6 to day t to perform the LDA. By doing so, we intended to observe how the connectedness between latent topics changed over time.  55 . This generative probabilistic model represents a document as a random mixture of latent topics, where each topic is further characterised by a distribution of words. Suppose we have N words in the corpus and M t documents on days t − 6 to t. Given M t documents as input at time t, let K be the number of latent topics in the documents. The M t documents contain a sequence of words which are assumed to be generated from a Dirichlet mixture of multinomial distributions under LDA in two steps. For each document at time t, the first step is to generate a latent topic probability vector θ t from a K-dimensional Dirichlet distribution, denoted by Dirichlet K (α t ) with a K × 1 parameter vector α t . In the second step, given the latent topic probability vector θ t , we first simulate a latent topic vector z n for the n-th word. Then, for z n = i , where i is a topic index between 1 and K, we generate the n-th word from the multinomial distribution with the N × 1 probability vector β it . Combining all K vectors β it , i = 1, ..., K , we can form a K × N topic-word probability matrix at time t. We are interested in estimating α t and β t using LDA to determine the topological features of the latent topics of financial news.    In an unsupervised learning setting, there is no ground truth of how many topics K are behind the articles. Literature has suggested the use of a large number. In previous studies, K was arbitrarily set to a large number 50 . In this paper, we selected 80 as the number of topics.
Topic similarity network. The primary focus of network analysis is to better understand relationships between entities. This study intended to take the output from the LDA model as the input into the topic similarity networks so that we could further understand topic connectedness and how it changes over time.
Measuring topic similarity. As previously mentioned, topics are represented as distributions of words and captured in the matrix β . Therefore, the similarity between two topics can be measured by comparing their word distributions. Kullback-Leibler (KL) divergence is often used when calculating the similarity between distributions. Although it is intuitively used to measure the distance between probability distributions, it is not a true metric given its asymmetry and the fact that it does not satisfy the triangle inequality. Alternatively, this study adopted Jensen-Shannon (JS) divergence, which is based on KL divergence, to calculate a normalised score. In other words, JS divergence can be considered as a symmetric and smoothed version of KL divergence. Equation (1) shows how JS divergence between probability distributions P and Q is calculated: where When P = Q , the JS divergence will be zero. The smaller the JS(P||Q), the more similar P and Q will be.
Topic network. A topic similarity network G t = (V t , E t ) can be constructed, where V t is the set of K vertices representing K topics at time t and E t is the set of edges at time t. As previously mentioned, K equals 80 in this paper. As shown in Eq. (3), an edge between two topics e ijt for 1 ≤ i, j ≤ n and i = j at time t is established if their divergence JS(β it ||β jt ) in terms of topic-word distribution is below a certain predefined threshold, κ . In this paper, we have included a general guideline which is similar to an 'empirical rule' in Statistics to filter out abnormally large JS divergence values. After constructing the dynamic topic networks, we propose using the mean and standard deviation of the median JS divergence of distance pairs to define the threshold, κ . Our data analysis on the two cases of the 2015-2016 stock market selloff and the COVID-19 pandemic indicates that setting κ at three standard deviations above the mean highlights topological properties of the dynamic topic networks which are able to generate early warning signals of severe financial market downturns or systemic risk. As in an early paper 44 , we have also conducted the sensitivity analysis to study the effect of the threshold on the topic network connectedness. Figure 11 shows the result during the 2015-2016 stock market selloff. The signals from using different thresholds are largely consistent in the sense that we see obvious drops in the average degree of the topic networks in March and May 2015, though the empirical rule of 'three standard deviations above the mean' may give a better result or a clear early warning signal of severe financial adjustments. The sensitivity analysis gives us extra confidence in using the dynamic topic network methodology as an additional tool for assessing systemic risk.
The data analysis suggested that the median JS divergence of distance pairs given a week of news articles had an mean of 0.73 and a standard deviation of 0.009 for the 2015-2016 time series, and those of 0.807 and 0.0056 for the 2019-2020 series. We preformed a sensitivity analysis to understand how the average degree of networks would change and suggested the empirical rule. In this paper, we present results by setting κ to three standard deviations above the mean, i.e. 0.76 for the 2015-2016 stock market selloff and 0.82 for the COVID-19 pandemic.  www.nature.com/scientificreports/ Average degree D t as shown in Eq. (4) is a straightforward measure to calculate how many edges |E t | are in G t with |V t | vertices.
To measure the degree to which the nodes tend to be linked together at time t, the average clustering coefficient C t is calculated. For any vertex i, its neighbourhood H it can be defined as H it = {j : e ijt ∈ E t } . The number of vertices of H it is defined as k it . Therefore, the local clustering coefficient C local i for vertex i is calculated as The average clustering coefficient C at any given time is simply the average of the local clustering coefficients of all vertices in G t : Visualisation. pyLDAvis was used to help visualise the topic model in an interactive manner. Topics were projected on a two-dimensional space using principal coordinates analysis, in which the distance matrix was created by calculating the JS divergence between topic-term distributions β t . The importance of topic k is represented by its node size, A k ∝ N k / k N k , proportional to the number of tokens generated by the topic across the corpus, with N k = d θ dk N d , where N d denotes the number of words in document d and d = 1, 2, ..., M t . There is a relevance metric to control the ranking of terms. It has a range of 0 ≤ ≤ 1 , with 1 indicating terms are ranked on the basis of their topic-specific probability in a descending order and 0 meaning terms are entirely ranked by their lift, a term's topic-specific probability over its marginal probability across the whole corpus. In this paper, is set to 0.6 in accordance with the literature 50 .