News Cohesiveness: an Indicator of Systemic Risk in Financial Markets

Motivated by recent financial crises, significant research efforts have been put into studying contagion effects and herding behaviour in financial markets. Much less has been said about the influence of financial news on financial markets. We propose a novel measure of collective behaviour in financial news on the Web, the News Cohesiveness Index (NCI), and show that it can be used as a systemic risk indicator. We evaluate the NCI on financial documents from large Web news sources on a daily basis from October 2011 to July 2013 and analyse the interplay between financial markets and financially related news. We hypothesize that strong cohesion in financial news reflects movements in the financial markets. Cohesiveness is a more general and robust measure of systemic risk expressed in news than measures based on simple occurrences of specific terms. Our results indicate that cohesiveness in the financial news is highly correlated with and driven by volatility in the financial markets.


Introduction
With the growth of the Internet, the relationship between online information and financial markets has become a subject of ever-increasing interest. Online information differs with respect to its origin and purpose, and reflects either the interest of some profile of users, in the form of queries, or knowledge about a certain topic, in the form of news, blogs or reports.
Financial markets are strongly information-driven and these effects can be seen by studying either search query volumes or social media sentiment. Many studies have analysed the relationship between search query volumes of specific terms and movements in financial markets of related items [1]. Bordino et al. [2] show that daily trading volumes of stocks traded in NASDAQ-100 are correlated with daily volumes of Yahoo queries related to the same stocks, and that query volumes can anticipate peaks of trading by one or more days. Dimpfl et al. [3] report that the Internet search queries for the term "dow" obtained from Google Trends can help predict Dow Jones realized volatility. Vlastakis et al. [4] study information demand and supply using Google Trends at the company and market level for 30 of the largest stocks traded on NYSE and NASDAQ-100. Chauvet et al. [5] devise an index of investor distress in the housing market, the housing distress index (HDI), also based on Google search query data. Preis et al. [6] demonstrate how Google Trends data can be used for designing a market strategy or defining a future orientation index [7].
In principle, different effects between information sources and financial markets are expected when considering news, blogs or even Wikipedia articles [8]. Andersen et al. [9] characterize the response of US, German and British stock, bond and foreign exchange markets to real-time U.S. macroeconomic news. Zhang and Skiena [10] exploit blog and news data and build a sentiment model using large-scale natural language processing to study how a company's media frequency, sentiment polarity and subjectivity anticipate or reflect stock trading volumes and financial returns. Chen et al. [11] investigate the role of social media in financial markets, focussing on single-ticker articles published on Seeking Alpha, a popular social-media platform among investors. Mao et al. [12] compare a range of different online sources of information (Twitter feeds, news headlines, and volumes of Google search queries) using sentiment tracking methods and compare their value for financial prediction of market indices such as the DJIA (Dow Jones Industrial Average), trading volumes, and implied market volatility (VIX), as well as gold prices. Casarin and Squazzoni [13] compute a Bad News Index as a weighted average of negative sentiment words in headlines of three distinct news sources.
The idea of cohesiveness of news as a systemic financial risk indicator is related to recent works studying mimicry and co-movement in financial markets as phenomena reflecting systemic risk in financial systems [14,15,16,17,18]. Harmon et al. [17] show that the last economic crisis and earlier large single-day panics were preceded by extended periods of high levels of market mimicry, which is direct evidence of uncertainty and nervousness, and of the comparatively weak influence of external news. Kenett et al. [18] define an index representing the balance between the stock correlations and the partial correlations after subtraction of the index contribution and study the dynamics of S&P 500 over the period of 10 years (1999-2010).
The idea of cohesiveness as a measure of news importance is simple: if many sources report about the same events, then this should reflect their importance and correlate with the main trends in financial markets. However, in order to capture the trends of systemic importance one must be able to track different topics over the majority of relevant online news sources. In other words, one needs: (i) access to the relevant news sources and (ii) a comprehensive vocabulary of terms relevant for the domain of interest. We satisfy the second prerequisite for a systemic approach through the use of a large vocabulary of financial terms corresponding to companies, financial institutions, financial instruments and financial glossary terms. To satisfy the first prerequisite, in our analysis we rely on financial news documents extracted from a large number of Web sources by a novel text-stream processing pipeline, NewStream [19] (http://newstream.ijs.si/). These texts are then filtered and transformed into a form convenient for computing NCI for a particular period of time.
We show that the importance of financial news can be measured in a more systemic way than via sentiment towards individual entities or the number of occurrences of individual terms, and that strong cohesiveness in news reflects the trends in financial markets. There is already strong evidence linking co-movement of financial instruments to systemic risk in financial markets [18]. We hypothesize that cohesiveness of financial news reflects in some part this systemic risk. Our News Cohesiveness Index (NCI) captures average mutual similarity between the documents and entities in the financial corpus. If we represent documents as sets of entities, then there are two alternative views on similarity: (i) two documents are more similar than some other two documents if they share more entities and (ii) two entities are more similar than some other two entities if they co-occur in more documents. We construct NCI so that the overall similarity in a corpus of documents is equal regardless of the view we choose to adopt.
We analyse the NCI in the context of different financial indices, their volatility, trading volumes, as well as Google search query volumes. We show that NCI is highly correlated with volatility of main US and EU stock market indices, in particular their historical volatility and VIX (implied volatility of S&P500). Furthermore, we demonstrate that there is a substantial difference between aggregate term occurrence and cohesiveness in their relations toward financial indices.

News Cohesiveness Index
In order to measure the herding effects in financial news we introduce a News Cohesiveness Index (NCI), a systemic indicator that quantifies cohesion in a collection of financial documents. A starting point for the calculation of NCI is a document-entity matrix that quantifies occurrences of entities in each individual document collected over a certain period of time. We use the concept of an entity (instead of e.g. a term) to represent different lexical appearances of some concept in texts. In our case we use a vocabulary of entities that includes financial glossary terms, financial institutions, companies and financial instruments. The full taxonomy of entities is available in Supplementary Information section 3. We start with the definition of occurrence, which says whether some entity is present or not in some document, regardless of how many times it occurs in the document. This makes the document-entity matrix $A$ a binary matrix: $A$ is an $m \times n$ matrix, where $m$ is the number of documents published in the selected time period and $n$ is the total number of entities we monitor. The document-entity matrix $A$ also corresponds to a biadjacency matrix of a bipartite graph between documents and entities. An edge between document $d_i$ and entity $e_j$ exists if the entity $e_j$ appears in the document $d_i$.
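As an illustration of the binary occurrence definition above, the following minimal sketch builds a small document-entity matrix. The vocabulary, the lexical forms and the naive substring matching are all hypothetical choices made for the example; they are not the entity recognition used by the NewStream pipeline.

```python
# Hypothetical entity vocabulary: entity name -> lower-case lexical forms.
VOCABULARY = {
    "European Central Bank": ["european central bank", "ecb"],
    "Greece": ["greece", "greek"],
    "bond": ["bond", "bonds"],
}

def document_entity_matrix(documents, vocabulary):
    """Return (entities, A) where A[i][j] = 1 if entity j occurs in document i."""
    entities = list(vocabulary)
    matrix = []
    for text in documents:
        lowered = text.lower()
        # Occurrence is binary: repeated mentions of an entity still count once.
        row = [int(any(form in lowered for form in vocabulary[e])) for e in entities]
        matrix.append(row)
    return entities, matrix

docs = [
    "ECB buys Greek bonds as yields climb.",
    "The ECB held rates; the European Central Bank cited weak growth.",
    "Football results from the weekend.",
]
entities, A = document_entity_matrix(docs, VOCABULARY)
for row in A:
    print(row)
```

Substring matching is only a stand-in here: it would also match lexical forms inside longer words, so a real pipeline would need proper tokenization and entity resolution.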
The overall similarity in the collection of documents should be equal regardless of whether we choose to view it as the similarity between the documents or between the entities. To achieve this we define the similarity as the scalar product of either document pairs $\langle d_i, d_j \rangle$ or entity pairs $\langle e_i, e_j \rangle$. We then define NCI as the Frobenius norm of the scalar similarity matrix between all pairs of documents, $C^d_{ij} = \langle d_i, d_j \rangle$, or entities, $C^e_{ij} = \langle e_i, e_j \rangle$:
$$\mathrm{NCI} = \|C\|_F = \sqrt{\sum_{i}\sum_{j} C_{ij}^2},$$
where $C$ is either $C^d$ or $C^e$. The Frobenius norms of the document-document similarity matrix $C^d = AA^T$ and the entity-entity similarity matrix $C^e = A^TA$ are equal, therefore the cohesion is conserved whether we measure it as the document or the entity similarity:
$$\|AA^T\|_F = \|A^TA\|_F.$$
In the network representation these two similarity matrices correspond to two projections of a bipartite graph of the original document-entity matrix, as illustrated in Figure 1. Moreover, one can exploit properties of the Frobenius norm of the scalar similarity matrix and express cohesiveness as a function of singular values of the document-entity matrix $A$ (proof in the Supplementary Information section 2):
$$\mathrm{NCI} = \sqrt{\sum_{i=1}^{k} \sigma_i^4},$$
where $\sigma_i$ are the $k$ largest singular values of matrix $A$ in a singular value decomposition:
$$A = U S V^T.$$
The expression is exact when $k$ equals the rank of $A$ and an approximation for smaller $k$. Because singular values are calculated on the original document-entity matrix and not its document or entity projections, we claim that we capture an intrinsic property of the corresponding document-entity matrix that is invariant to projection. This can also be inferred from the fact that the nonzero eigenvalues of the similarity matrices $AA^T$ and $A^TA$ are equal and correspond to the squared singular values of the document-entity matrix $A$.
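The equivalence of the three ways of computing NCI can be checked numerically. The sketch below uses NumPy on a toy binary matrix chosen for the example; the function names are illustrative and not part of the original method description.

```python
import numpy as np

def nci_from_similarity(A):
    """NCI as the Frobenius norm of the document-document similarity matrix A A^T."""
    return float(np.linalg.norm(A @ A.T, "fro"))

def nci_from_singular_values(A, k=None):
    """NCI from the k largest singular values of A (all of them if k is None)."""
    sigma = np.linalg.svd(A, compute_uv=False)  # returned in descending order
    if k is not None:
        sigma = sigma[:k]
    return float(np.sqrt(np.sum(sigma ** 4)))

# Toy binary document-entity matrix: 4 documents (rows) x 3 entities (columns).
A = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
], dtype=float)

nci_docs = nci_from_similarity(A)                    # document projection
nci_ents = float(np.linalg.norm(A.T @ A, "fro"))     # entity projection
nci_svd = nci_from_singular_values(A)                # singular-value form

print(nci_docs, nci_ents, nci_svd)  # all three agree up to rounding
```

For this toy matrix the sum of squared entries of either similarity matrix is 29, so all three values equal the square root of 29.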
Figure 1: Matrix representation of document-document and entity-entity similarity matrices (left), and the corresponding network representations of entity and document projections (right). Frobenius norms of the two similarity matrices correspond to the sum of squares of connection weights in two projections, and they are equal, which means that cohesiveness is conserved in both projections.
This approach can be beneficial for large document-entity matrices, as it is much more efficient in terms of time and memory than explicit calculation of the similarity matrix. We can calculate just the first $k$ singular values incrementally, until we reach the desired accuracy of NCI (see Supplementary Information section 1). In practice, only a small number of singular values is enough to calculate NCI up to the desired precision.
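The following sketch illustrates how the truncated sum over singular values converges to the full NCI. The synthetic corpus and the 1% tolerance are assumptions made for the example. For clarity the code computes all singular values with a dense SVD; on a large sparse matrix one would instead compute only the leading values with an iterative solver.

```python
import numpy as np

def truncated_nci(A, k):
    """NCI approximated from the k largest singular values of A."""
    sigma = np.linalg.svd(A, compute_uv=False)[:k]
    return float(np.sqrt(np.sum(sigma ** 4)))

def smallest_k_for_accuracy(A, rel_tol=0.01):
    """Smallest k whose truncated NCI is within rel_tol of the full value."""
    sigma = np.linalg.svd(A, compute_uv=False)
    partial = np.sqrt(np.cumsum(sigma ** 4))   # truncated NCI for k = 1, 2, ...
    full = partial[-1]
    k = int(np.argmax(partial >= (1.0 - rel_tol) * full)) + 1
    return k, float(partial[k - 1] / full)

rng = np.random.default_rng(0)
# Hypothetical corpus: 200 documents, 50 entities with decreasing frequency.
entity_probability = np.linspace(0.6, 0.01, 50)
A = (rng.random((200, 50)) < entity_probability).astype(float)

k, ratio = smallest_k_for_accuracy(A, rel_tol=0.01)
print(f"{k} singular values recover {ratio:.4f} of the full NCI")
```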
As the number of documents changes each day while the number of entities stays constant, all NCI indices in our analyses are normalized by the number of documents, i.e. multiplied by $1/m$, where $m$ is the number of documents in the corpus. We have statistically confirmed that the NCI is well above the level of fluctuations of a random null model of cohesiveness (see Supplementary Information section 2).
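One simple way to compare the normalized NCI against a random baseline is sketched below. The null model used here (shuffling each entity column independently across documents, which preserves entity frequencies but destroys co-occurrence) is an assumption made for illustration; the null model actually used in the analysis is the one described in the Supplementary Information and may differ. The synthetic corpus is likewise hypothetical.

```python
import numpy as np

def normalized_nci(A):
    """NCI divided by the number of documents m."""
    sigma = np.linalg.svd(A, compute_uv=False)
    return float(np.sqrt(np.sum(sigma ** 4)) / A.shape[0])

def shuffled_null(A, n_samples=200, seed=0):
    """Normalized NCI of column-shuffled copies of A (assumed null model)."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_samples):
        shuffled = np.column_stack([rng.permutation(col) for col in A.T])
        values.append(normalized_nci(shuffled))
    return np.array(values)

rng = np.random.default_rng(3)
# Hypothetical corpus: 150 documents, 20 entities; the first 50 documents all
# report the same "event" (entities 0-4), which creates cohesion.
A = (rng.random((150, 20)) < 0.05).astype(float)
A[:50, :5] = 1.0

observed = normalized_nci(A)
null = shuffled_null(A)
z_score = (observed - null.mean()) / null.std()
print(f"observed = {observed:.3f}, null mean = {null.mean():.3f}, z = {z_score:.1f}")
```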

Semantic partitions of NCI
Sometimes it is interesting to perform a detailed analysis of which groups of entities or documents contribute the most to the overall cohesiveness. For this purpose we can divide entities or documents into groups using any appropriate semantic criteria and calculate cohesiveness for each group separately or between pairs of groups. Semantic partitions in the entity projection are created by grouping entities into mutually disjoint groups, defined by their taxonomy labels (hence the semantic interpretation). On the other hand, semantic partitions in the document projection can be created by grouping documents either by their temporal or source membership. Figure 2 illustrates the concept of partitioning in the context of different projections.
Figure 2: Semantic partitioning. Semantic partitioning for two entity semantic groups, "Financial Institutions" and "Financial Context", and three document semantic groups, "day 1", "day 2" and "day 3". The Frobenius norm in shaded regions quantifies cohesiveness within each semantic group, while Frobenius norms of all other regions quantify cohesiveness based on pairs of semantic groups.
We can calculate cohesiveness separately for each semantic group or a combination of semantic groups. Note that even in this case we do not need to explicitly calculate similarity matrices (see Supplementary Information section 1). Following the taxonomy of entities described in Supplementary Information section 3, we defined four semantic groups: companies, regions, financial instruments and Euro crisis terms. Figure 3 shows the most frequent entities in each of the semantic partitions, based on the news corpus collected over the period of analysis.
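A minimal sketch of group-wise cohesiveness in the entity projection follows. The matrix and the assignment of entity columns to groups are hypothetical. The last step relies on the block structure of the combined similarity matrix: its squared Frobenius norm is the sum of the two squared within-group norms plus twice the squared between-group norm, so the between-group value can be recovered from norms that are each computable from singular values.

```python
import numpy as np

def cohesiveness(A_left, A_right):
    """Frobenius norm of the similarity block A_left^T A_right (entity projection)."""
    return float(np.linalg.norm(A_left.T @ A_right, "fro"))

rng = np.random.default_rng(1)
A = (rng.random((30, 6)) < 0.3).astype(float)   # 30 documents x 6 entities

# Hypothetical assignment of entity columns to two semantic groups.
g1 = [0, 1, 2]        # e.g. "region" entities
g2 = [3, 4, 5]        # e.g. "eurocrisis" entities

within_1 = cohesiveness(A[:, g1], A[:, g1])
within_2 = cohesiveness(A[:, g2], A[:, g2])
between = cohesiveness(A[:, g1], A[:, g2])
combined = cohesiveness(A[:, g1 + g2], A[:, g1 + g2])

# Between-group cohesiveness recovered without forming the cross block.
between_from_norms = float(
    np.sqrt(max(combined**2 - within_1**2 - within_2**2, 0.0) / 2.0)
)
print(between, between_from_norms)
```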

NCI in relation to financial markets and query volumes
Figure 3: Occurrences of most frequent entities in each of the semantic partitions. Among the most frequent entities are those defining major regions corresponding to the world's leading financial markets: United States, China, Europe, United Kingdom, London, Japan, Germany. Considering the frequency of United States, it is no surprise that the majority of other frequent entities, from companies to instruments, are also tied to the US financial market and terminology.
In order to assess NCI's utility as a systemic risk indicator, we use correlation analysis and Granger causality tests against the pool of financial market and information indicators. The analysis should also provide deeper insight into the interplay between news, trends in financial markets and behaviour of investors. We adopt terminology from [4], and treat our news based indicators (NCI variants and entity occurrence) as indicators of the information supply in online media, while volumes of Google Search Queries will be treated as indicators of information demand or as a proxy of investor interest. We group indicators as follows:
• Information supply indicators: cohesiveness index based on all the news (NCI) from NewStream, cohesiveness index based only on filtered financial news (NCI-financial) from NewStream, total entity occurrences based on the aggregate from all news documents, and total entity occurrences based on strictly financial documents of NewStream.
• Information demand indicators: volumes of Google Search Queries (GSQ) for 4 finance/economy related categories from Google Finance (from Google Domestic Trends: Finance & Investment, Bankruptcy, Financial Planning, Business).
• Financial market indicators: daily realized volatilities, historical volatilities and trading volumes of major stock market indices (S&P 500, DAX, FTSE, Nikkei 225, Hang Seng), as well as implied volatilities of S&P 500 (VIX).
Details on preparation of individual indicators are given in Methods section.
We start the analysis with a simple comparison of NCI calculated on all news and NCI calculated on filtered financial news. Figure 4 shows the dynamics of NCI and NCI-financial in comparison to VIX (implied volatility of S&P 500, the so-called "fear factor" [20]). Scatter plots on the right show that the correlation of VIX and NCI-financial is significantly higher than that of VIX and NCI. This is a first illustration of the importance of filtering the right content for the construction of indicators from texts. For more details on how filtering affects correlations with other indices see Supplementary Information section 3. Figure 5 shows Pearson correlation coefficients between different information indicators and financial market indicators. Corresponding p-values are calculated using a permutation test and are available in Supplementary Information section 5. All correlations reported in this article have p-value $< 10^{-4}$ unless explicitly stated.
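A Pearson correlation with a permutation-test p-value can be sketched as follows. The two series are synthetic stand-ins, not the actual NCI-financial or VIX data, and the function names are illustrative. The test shuffles one series to break any association; this simple scheme ignores autocorrelation in time series, and resolving p-values below $10^{-4}$ would require more than 10,000 permutations.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def permutation_p_value(x, y, n_perm=10000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = abs(pearson_r(x, y))
    hits = 0
    for _ in range(n_perm):
        if abs(pearson_r(x, rng.permutation(y))) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction keeps p > 0

rng = np.random.default_rng(42)
vix_like = rng.normal(size=300)                                # synthetic series
nci_like = 0.7 * vix_like + rng.normal(scale=0.7, size=300)    # correlated with it

r = pearson_r(nci_like, vix_like)
p = permutation_p_value(nci_like, vix_like, n_perm=2000)
print(f"r = {r:.3f}, permutation p-value = {p:.4f}")
```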
Interestingly, the correlations between total entity occurrences, NCI and NCI-financial are relatively low, confirming that cohesiveness captures a very different signal from the entity occurrences. Furthermore, correlations between total entity occurrences, NCI and financial indices are, on average, much lower than correlations between NCI-financial and financial indices. The relatively low correlation between NCI-financial and NCI confirms the importance of filtering out strictly financial market-related articles from NewStream, rather than keeping all the articles that contain some of the entities from the vocabulary. We have performed a more detailed analysis of these effects by studying in parallel the behaviour of different variants of entity occurrences and NCI-financial using different subsets of the vocabulary and the document space, independently. The main insight gained was that entity occurrences become more informative when a smaller vocabulary of the most frequent entities is used, but this requires use of the whole document space. NCI has proven to be much more robust to the choice of both vocabulary and document space (details in Supplementary Information section 6).
The NCI-financial index is highly correlated with implied volatility (R > 0.7, Figure 5), as well as with historical and daily realized volatilities (R > 0.4, Figure 5). These correlations are much higher than the correlations of the GSQ categories (R < 0.3, Figure 5). In contrast to NCI-financial, GSQ categories exhibit relatively stronger correlations with stock trading volumes (0.3 < R < 0.4, Figure 5). Google Bankruptcy and Google Unemployment are significantly correlated with NCI-financial (correlation above 0.2, Figure 5), which is most probably due to similarities in the vocabulary used in constructing NCI-financial and the respective GSQ indicators.
A more in-depth picture of the news cohesiveness index is obtained when observing individual semantic components of NCI-financial and their correlation patterns with financial and Google search query indicators. Semantic components based on [region] and [eurocrisis] taxonomy categories all have similar correlation patterns to NCI-financial (with correlation above 0.7 for [eurocrisis] and above 0.5 for [region], Figure 5); this also shows that these categories are most important for the behaviour of NCI-financial. On the other hand, semantic components based on [company] and [instrument] exhibit quite different, in many parts opposite, correlation patterns (with correlations close to 0 or even negative). It is interesting to note that both the NCI-financial and GSQ indicators have strong negative correlation with Nikkei 225 volatility and trading volume (up to -0.4 for NCI-financial and up to -0.5 for GSQ Unemployment).

Granger causality relations
The Granger causality test (GC test) is frequently used to determine whether a time series Y(t) is useful in forecasting another time series X(t). The idea of the GC test is to evaluate whether X(t) can be better predicted using both the histories of X(t) and Y(t) rather than using only the history of X(t) (i.e. Y(t) Granger-causes X(t)). The test is performed by regressing X(t) on its own time-lagged values, with and without the lagged values of Y(t) included. An F-test is used to examine whether the null hypothesis that X(t) is not Granger-caused by Y(t) can be rejected.
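The regression comparison behind the GC test can be sketched with ordinary least squares as below. This is a textbook version for stationary series with a fixed lag order, written for illustration on synthetic data; it is not the Toda-Yamamoto procedure used in the analysis, and the lag order of 2 is an arbitrary choice for the example.

```python
import numpy as np

def lagged_design(series, p):
    """Columns series[t-1], ..., series[t-p] for t = p, ..., T-1."""
    T = len(series)
    return np.column_stack([series[p - lag:T - lag] for lag in range(1, p + 1)])

def granger_f_statistic(x, y, p=2):
    """F statistic for H0: lags of y do not help predict x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = x[p:]
    ones = np.ones((len(target), 1))
    X_restricted = np.hstack([ones, lagged_design(x, p)])      # own lags only
    X_full = np.hstack([X_restricted, lagged_design(y, p)])    # plus lags of y

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ beta
        return float(resid @ resid)

    rss_r, rss_f = rss(X_restricted), rss(X_full)
    df_num = p
    df_den = len(target) - X_full.shape[1]
    f_stat = ((rss_r - rss_f) / df_num) / (rss_f / df_den)
    return f_stat, df_num, df_den

# Synthetic example in which y drives x with a one-step lag.
rng = np.random.default_rng(0)
T = 500
y = rng.normal(size=T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + 0.3 * rng.normal()

f_y_to_x, df1, df2 = granger_f_statistic(x, y, p=2)   # expected to be large
f_x_to_y, _, _ = granger_f_statistic(y, x, p=2)       # expected to be small
print(f_y_to_x, f_x_to_y, df1, df2)
```

The F statistic is compared with the F distribution with (df1, df2) degrees of freedom to obtain a p-value.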
In Table 6 we show results of pairwise G-causality tests between information supply and demand indicators and financial indicators. Cells of the table give both directionality (X → Y, Y → X or bidirectional X ↔ Y) and significance at two levels of the F-test (p-values ≤ 0.01; ≤ 0.05). Besides GC testing of NCI-financial and its semantic components at higher taxonomy levels, we also show results obtained for NCI (non-filtered news NCI) and total entity occurrences as a baseline.
The results in Table 6 paint a much different picture than the correlation study. Our financial news indicator NCI-financial seems to be G-caused solely by FTSE daily volatility. This finding is in contrast with the fact that NCI-financial is strongly correlated with several other indicators, such as implied volatility (VIX) (R > 0.7). GSQ indicators seem to be divided in two groups by their G-causality: (i) those that are G-caused mainly by trading volumes (Business and industrial, Bankruptcy, Financial planning and Finance and investment) and total entity occurrences in the news, and (ii) those that are strongly G-caused by all other GSQ categories (Unemployment). Interestingly, total entity occurrence in the news seems to be the strongest G-causality driver of the GSQ volumes, while two of the semantic components of NCI-financial are G-caused by GSQ Finance and investment and Financial planning.

Discussion
This work introduces a new indicator of financial news importance based on a concept of cohesiveness of texts from large corpora of news and blog sources. In contrast to indicators introduced by other authors, which are based on sentiment modelling, NCI measures cohesiveness in the news by approximating the average similarity between texts.
Our correlation results confirm the main hypothesis that cohesiveness of the financial news is a signal that is strongly correlated with systemic financial market indicators, in particular volatilities of major stock exchanges.
[Figure: axis labels list the indicators NCI, NCI-financial, financial entity occurrences, the five Google Search Query categories, VIX, and the historical volatilities of S&P 500, FTSE, DAX, Hang Seng and Nikkei 225.]
The analysis of Granger causality tests over a pool of financial and information related indicators suggests that NCI-financial is mainly related to the volatility of the market. In our analysis the most important semantic components of NCI-financial are mainly G-caused by implied (VIX), historical and daily volatilities. This implies effects from both short term and long term risks in the financial market. The only exception (bidirectional causality between [region]x[eurocrisis] and Hang Seng daily volatility) might be plausibly explained as a time zone effect. This does not seem to be the case for GSQ indicators, which are mainly driven by trading volumes, with the exception of GSQ Unemployment, which seems to be driven mostly by search volumes of other GSQ categories. Similar to findings of some previous studies [21,13], in which aggregate sentiment or financial headline occurrence were used as measures of the state of the financial market, information supply co-movement as measured by NCI-financial seems to be primarily caused by trends in the financial market rather than the opposite. We find that similar results also hold for the GSQ categories, which approximate the information demand side in our case. G-causality patterns show, similarly to correlation, that cohesiveness captures quite a different signal from total entity occurrence; the results also suggest the presence of a somewhat circular interplay between information supply and information demand indicators.
For example, total entity occurrence is G-causing three of the GSQ categories (Business and industry, Bankruptcy and Financial planning), while Financial planning and Unemployment are G-causal for semantic components [instrument]x[eurocrisis] and [eurocrisis]x[eurocrisis], which suggests feedback mechanisms between news and search behaviour.
In comparison with the findings of studies which used simpler measures of news importance or sentiment, we find that financial news cohesiveness reflects the level of volatility in the market and is G-caused both by the current level of volatility and by implied volatility, while GSQ volumes are driven mainly by trading volumes. The impact of news cohesiveness and GSQ volumes in the reverse direction, as determined by GC tests, is only weakly implied in the case of semantic components of NCI-financial and the Hang Seng index. This is not in line with previous works that report predictive utility (mostly for GSQ volumes) with respect to certain financial instruments. However, one has to bear in mind that the results of GC tests reflect an average of lagged correlations between indicators over the specific period in time (in our case Oct 2011 - Jul 2013). It is possible that the direction of causality between information and financial indicators changes in time, but this was hard to detect in our data due to the limited length of the time series. Another possible reason for the different results is that most previous works were based on a limited number of Google search query terms, typically more closely related to the particular stock market index of interest. In principle, this is different from the volumes of GSQ term categories in our case, which reflect aggregates over a larger number of different query terms. GSQ categories closely resemble the concept of semantic components, and it is possible that the application of the concept of cohesiveness, if adapted to GSQ category volumes, may produce signals more predictive with respect to financial market trends.

Data
Access to structured information about the financial markets with its various instruments and indicators has been available for several decades, but systematic quantification of unstructured information hidden in news from diverse Web sources is of relatively recent origin.
We base our analyses on a newly created text processing pipeline, NewStream [19], designed and implemented within the scope of EU FP7 projects FIRST (http://project-first.eu/) and FOC (http://www.focproject.eu/). NewStream continuously downloads articles from more than 200 worldwide news sources, extracts the content and stores complete texts of articles. It is a domain independent data acquisition pipeline, but biased towards finance by the selection of news sources and the taxonomy of entities relevant for finance.
For the purpose of filtering, efficient storing and analytics, an expert based financial taxonomy and a vocabulary of entities and terms have been created, containing names of relevant financial institutions, companies, finance and economics specific terms, etc. The NewStream pipeline has been collecting data since October 2011. In our analyses we use text corpora from October 2011 to June 2013 and we have filtered over 1,400,000 financially related texts stored in the form of document-entity matrices. The full structure of the taxonomy is given in the Supplementary Information section 3, and the list of the domains from which most documents were downloaded in the Supplementary Information section 4.
Filtering of financial documents
The NewStream pipeline downloads articles from more than 200 Web sites of online news and blogs. However, despite the selection of financial news sites, many articles are only indirectly related to finance, covering topics such as politics or even sport. To obtain a clean collection of strictly financial texts, we have developed a rule-based model utilizing taxonomy categories as features to describe documents, and a gold standard of human labelled documents (≈ 3500 documents). A machine learned rule-based model is used as a filter for extracting strictly finance related texts from a corpus. This model has a recall of over 50%, with precision of well over 80%. The rule-based model for filtering financial documents can be found in Supplementary Information section 3.

Financial indicators
We analyse NCI in comparison to the financial market indicators of worldwide markets and Google Search Query volumes. For that purpose we have downloaded stock market indices from the Yahoo Finance Web service (http://finance.yahoo.com/): High, Low, Open, Close prices and volume of S&P 500, DAX, FTSE, Nikkei 225 and Hang Seng index. We also use implied volatility of S&P 500 (VIX). Implied volatility is calculated for the next 30 days by the Chicago Board Options Exchange (CBOE, http://www.cboe.com/) using current prices of index options. Historical (realized) volatilities are calculated from the past prices of the indices themselves. We use daily prices of individual indices to calculate a proxy of daily realized volatility:
$$\text{daily volatility}_t = \frac{\mathrm{High}_t - \mathrm{Low}_t}{0.5\,(\mathrm{Close}_t + \mathrm{Close}_{t-1})}.$$
Historical (realized) volatilities are calculated as standard deviations of daily log returns $r_t = \ln(p_t / p_{t-1})$ in the appropriate time window of $n$ days, where $p_t$ are daily prices. In our analyses we used a window of 21 working days.
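The two volatility measures can be sketched in plain Python as follows. The price values are made up for the example, and two details are assumptions not stated in the text: the sample standard deviation (dividing by n - 1) is used, and no annualization is applied.

```python
import math
import statistics

def daily_volatility(high, low, close, prev_close):
    """Daily price range scaled by the mean of today's and yesterday's close."""
    return (high - low) / (0.5 * (close + prev_close))

def historical_volatility(prices, window=21):
    """Rolling standard deviation of daily log returns over `window` returns."""
    returns = [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]
    return [
        statistics.stdev(returns[i - window:i])   # assumption: sample std (n - 1)
        for i in range(window, len(returns) + 1)
    ]

# Made-up prices for illustration only.
dv = daily_volatility(high=102.0, low=98.0, close=101.0, prev_close=99.0)
prices = [100.0 + 2.0 * math.sin(i) for i in range(30)]
hv = historical_volatility(prices, window=21)

print(dv)        # 4 / 100 = 0.04
print(len(hv))   # 29 returns give 9 rolling windows of length 21
```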
Google Search Query Volumes
Almost all previous studies used search query volumes of specific terms. Instead, we used Google search query volumes of predefined term categories from the Google Finance website. We have chosen five categories from Google Domestic Trends that are related to the financial market: Business and industrial, Bankruptcy, Financial Planning, Finance and investing, Unemployment. We downloaded YOY (Year Over Year) change values for these categories from the Google Finance Web service (https://www.google.com/finance).

Granger causality testing
We have used functions of the R packages tseries, lmtest, vars, urca to download and calculate indices, construct joint time series dataset, determine correlations and study Granger causality relations. We have followed the methodology of Toda and Yamamoto [22] for Granger causality testing of nonstationary series. Details of the procedure are given in Supplementary Information section 5.

Cohesiveness through SVD approximation
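We start from the definition of the NCI index as the Frobenius norm of the similarity matrix $C = AA^T$:
$$\mathrm{NCI} = \|C\|_F = \sqrt{\mathrm{tr}(C^T C)},$$
where $\mathrm{tr}$ denotes the trace of the matrix. Since $C = AA^T$, we have $\mathrm{tr}(C^T C) = \mathrm{tr}(AA^TAA^T)$. By making a singular value decomposition $A = U S V^T$ we get:
$$\mathrm{tr}(AA^TAA^T) = \mathrm{tr}(U S V^T V S^T U^T U S V^T V S^T U^T) = \mathrm{tr}\big(U (SS^T)^2 U^T\big) = \mathrm{tr}\big((SS^T)^2\big) = \sum_i \sigma_i^4,$$
which proves the equality $\mathrm{NCI} = \sqrt{\sum_i \sigma_i^4}$. If we use only the first $k$ singular values, then we get the best low rank approximation of the similarity matrix $C$ by the Eckart-Young theorem.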
It is very important to note the run-time and memory improvement (see Figure 1) obtained by using the singular value approach on large matrices. In order to calculate the first $k$ singular values, one can use the iterative Lanczos algorithm [2]. As the first singular values contain most of the energy, the approximation of NCI can be done with just a few values (see Figure 2).
Sometimes it is necessary to perform a detailed analysis of which entities or documents contribute the most to the overall cohesiveness. For this purpose we can divide entities or documents into groups using any appropriate semantic criteria and calculate cohesiveness for each group separately or between each pair of groups. Note that even in this case we do not need to explicitly calculate similarity matrices but can still use the singular values technique. Cohesiveness between any pair of the groups is calculated by exploiting the properties of the Frobenius norm. If we have two semantic groups $G_1$ and $G_2$, and the cohesiveness of each of them is $\|G_{11}\|_F$ and $\|G_{22}\|_F$ respectively, while that of their combination is $\|G_{1212}\|_F$, then the cohesiveness between the two groups $\|G_{12}\|_F$ is:
$$\|G_{12}\|_F = \sqrt{\frac{\|G_{1212}\|_F^2 - \|G_{11}\|_F^2 - \|G_{22}\|_F^2}{2}}.$$
2 Statistical significance of cohesiveness
In this section we quantify the statistical significance of NCI with respect to a cohesiveness null model of our system. Let us first recall the definition of the normalized NCI of a document-entity matrix $A$ of size $m \times n$:
$$\mathrm{NCI}^* = \frac{1}{m}\,\|AA^T\|_F.$$
Now, we start with a rather simple question: what is the probability that $m$ documents have a high $\mathrm{NCI}^*$ index just by chance in a system with $n$ entities? In the special case when each of $m$ independent random documents $x_i$ contains exactly one entity, drawn with equal probability from a vocabulary of size $n$, the expected $\mathrm{NCI}^*$ index has an upper bound.
Figure 8 illustrates the behaviour of these aggregations for different fractions of entities from the vocabulary (plots A and B) and different fractions of documents from the corpus (plots C and D).
In plots A and B, we start from having only the most frequent entities and add those less frequent until we end with the complete vocabulary. We aggregate over all documents. In plots C and D, we start from having only the documents with the highest number of entities and gradually add those that have a smaller number of entities from the vocabulary. We aggregate taking into account the whole vocabulary of entities.
If we focus first on plots A and B, we can observe that NCI and its close approximation 3 (the diagonal part of cohesiveness in the entity projection $A^TA$) give very stable and high correlations over a very broad choice of entities from the vocabulary. Note also that approximation 4 (the diagonal part of cohesiveness in the document projection $AA^T$) gives very poor results. In contrast, the NCI measure has the same stable performance both in the entity ($A^TA$) and document ($AA^T$) projection. On the other hand, total entity occurrence exhibits very low correlation with volatility and has its maximum for a very low number of most frequent entities. Its normalized version seems to be a very good approximation of cohesiveness, but its behaviour is again dependent on the choice of entities.
One has to bear in mind that frequent entities are determined on a daily basis, i.e. the most frequent entities in principle change from day to day, which means that we cannot observe this kind of behaviour with a small vocabulary. In other words, although only a small number of entities is responsible for overall cohesiveness, we need a larger vocabulary in order to capture concept drift in the news.
If we now observe plots C and D in Figure 8, we see that a relatively large fraction of documents is needed to obtain rather stable and high correlations of NCI with volatility. In these plots aggregations 1 and 2 attain much lower correlations than in the plots above, which is the consequence of aggregation over all entities from the vocabulary.
This analysis also emphasizes specific properties of news corpora: i) that most of the cohesiveness signal is based on a small fraction of the most frequent entities for the particular day and ii) that one needs a rather large fraction of documents to get a stable (high) correlation with volatility. This analysis further supports the hypothesis that cohesiveness is a more robust measure of news importance than entity volume, and is thus a more appropriate measure for systemic risk reflected in financial news.
Figure 8: Rather high correlation of NCI and its close approximation (3) with implied and realized daily volatility is obtained for a very low number of most frequent entities for a particular day. Total entity occurrence (1) and its normalized variant (2) are not stable with respect to the choice of vocabulary. Plots C and D: Influence of the choice of the documents, while using the whole vocabulary. We observe that a relatively large fraction of documents is needed in order to get high correlation of NCI and its approximation (3) with volatilities. Total entity occurrence (1) and its normalized variant (2) are not stable with respect to the choice of vocabulary.