The CoRisk-Index: Measuring economic risks related to COVID-19 in real time



Introduction
The Covid-19 pandemic has caused the largest global economic disruption of the 21st century [1][2][3]. Travel bans, supply chain failures, and store closures pose significant risks to entire industries. In an attempt to mitigate the general economic downturn, governments are mobilising considerable financial support 4,5. While immediate general support programmes can help to stabilise economies ad hoc, targeted and continuous countermeasures require reliable and up-to-date information on economic trends. However, the rapidly changing dynamics of the Covid-19 pandemic pose a challenge for economic forecasting and targeted policy intervention.
In the absence of timely empirical data, investigations of the economic impact of Covid-19 explore two pathways. One stream of research relies on metrics based on past economic shocks, such as economic simulations 6,7. Another stream, following the paradigm of computational social science 8, explores alternative data sources, such as stock markets [9][10][11], news articles 12, website content 13 or search queries 14.
However, both approaches come with limitations in the current situation. Traditional economic indicators, like composite indices 15 or labour market statistics, have a sizeable time lag. Empirical data with a high time resolution, such as stock markets, search queries, and news articles, include a lot of noise and are prone to herd behaviour. Moreover, most metrics do not allow for a granular analysis of the economic repercussions by industry. Examining current stock market dynamics, for example, reveals a general economic downturn, but it does not allow isolating the sector- and domain-specific risks related to the economic threats of the pandemic. First attempts to quantify risk perceptions related to Covid-19 at the firm level are limited to the company perspective 16 and lack the relation to the macro-economic environment.
In response, we propose a real-time analysis of business reports (10-K reports) filed with the U.S. Securities and Exchange Commission (SEC) using data-mining and natural language processing techniques. SEC filings represent financial statements of publicly listed companies, including a risk assessment. These reports are essential for complying with legal and insurance requirements and therefore contain reliable risk evaluations. Prior work has underlined the forward-looking nature of SEC filings, as they allow for an effective prediction of the volatility of stock market returns and other financial metrics [17][18][19].
In collecting all 10-K reports published from 30th January 2020 (the day the term coronavirus first appeared in a 10-K report) to 30th June, we assess and track the reported risk assessments related to Covid-19 for the different sectors 1 of the US economy. The 5,800 reports analysed here cover the risk assessments of U.S. companies representing more than 44 million U.S. employees, i.e. one-third of the total U.S. economy. We construct three measures: (a) the number of 'corona' words in each 10-K report, (b) the share of corona-mentioning firms per industry, and (c) the average text negativity of the sentences mentioning corona in each report. The text negativity is calculated as the share of negative words 20 per corona-mentioning sentence. The geometric mean of corona-words and text negativity forms the CoRisk-Index, which is calculated for each industry. Moreover, to understand how companies in different industries assess Covid-19 as a risk over time, we use natural language processing to identify the topical context in which Covid-19 related risks are described.
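The measures described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the negative-word list below is a tiny hypothetical stand-in for the published lexicon the paper cites, the sentence splitter is deliberately naive, and averaging the two measures over all reports of an industry before taking the geometric mean is an assumption about the aggregation order.

```python
import math
import re

# Hypothetical mini-lexicon for illustration only; the paper relies on a
# published list of negative words.
NEGATIVE_WORDS = {"adverse", "adversely", "disruption", "loss", "decline", "closure"}
KEYWORD = re.compile(r"corona|covid")
TOKEN = re.compile(r"[a-z0-9'-]+")


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def report_measures(text):
    """Return (keyword count, average negativity of keyword sentences) for one report."""
    text = text.lower()
    keyword_count = len(KEYWORD.findall(text))
    shares = []
    for sentence in split_sentences(text):
        if KEYWORD.search(sentence):
            tokens = TOKEN.findall(sentence)
            if tokens:
                negative = sum(1 for t in tokens if t in NEGATIVE_WORDS)
                shares.append(negative / len(tokens))
    negativity = sum(shares) / len(shares) if shares else 0.0
    return keyword_count, negativity


def corisk_index(reports):
    """Geometric mean of the average keyword count and the average text negativity.

    Assumption: both measures are averaged over the reports of one industry
    before the geometric mean is taken.
    """
    measures = [report_measures(r) for r in reports]
    if not measures:
        return 0.0
    mean_count = sum(c for c, _ in measures) / len(measures)
    mean_negativity = sum(n for _, n in measures) / len(measures)
    return math.sqrt(mean_count * mean_negativity)
```

Calling `corisk_index` on the list of report texts of one industry yields a single value for that industry; repeating this per week would give a time series of the kind plotted in Fig. 2A.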
As we will demonstrate, the CoRisk-Index represents a highly informative, reliable, up-to-date and low-cost data source. The data can help researchers and policy-makers to better understand and forecast the economic repercussions related to COVID-19 in different industries and to design better policy responses. For immediate access to all raw data we offer an interactive online dashboard. 2
The remainder of the paper proceeds as follows. In the next section, we introduce the data and CoRisk-Index in more detail. We show that the CoRisk measures relate to stock market and unemployment trends. Moreover, we demonstrate that the CoRisk-Index can be generalised to retrace past shocks such as the U.S.-China trade war in 2018. Subsequently, we discuss policy implications and critically reflect upon the limitations of our approach. The Methods section provides a comprehensive step-by-step explanation of the data collection, pre-processing and analysis.

Anticipation of the economic shock
The business risk measures captured by the CoRisk-Index anticipated the economic reactions of the unfolding crisis before the number of Covid-19 cases surged in the United States (Fig. 1A). Before stock markets plummeted on February 20th (Fig. 1B), the share of 10-K reports mentioning corona had been rising steadily (Fig. 1C). In early May, when the number of new infections started to decrease, the share of corona-mentioning reports settled at a high level of just below 80 %, before rising further in June, when the number of new cases started to rise again.
Similarly, the average share of negative words in sentences mentioning 'corona' or 'covid' (text negativity) spiked in early February (Fig. 1D), peaking at a share of 7 % negative words per corona-sentence just before the most severe stock market losses ('Black Monday') on March 9th. The text negativity measures the sentiment with which firms report about Covid-19 related business issues. The sharp increase reflects the uncertainty in global economic outlooks, which businesses suddenly became aware of after the pandemic turned from a regional to a global health emergency. In the subsequent months, the text negativity steadily decreased, supported by economic aid packages and the release of stay-at-home orders, declining to a level of around 5 % recently.

The CoRisk-Index
Based on the count of 'corona' keywords and the text negativity, we construct the CoRisk-Index (Fig. 2A) for eight out of 13 sectors of the U.S. economy, which are well represented by SEC filings (see Methods section). The index serves as a real-time economic indicator measuring industry risks related to Covid-19. Sectors differ in their timing and magnitude of risk awareness. For example, Finance shows relatively low levels of risk awareness, while Manufacturing and Wholesale & Retail reveal early and strong risk sentiments. Overall, 75 % of the firms in manufacturing and 74 % of the retail companies have mentioned the coronavirus as a potential economic risk. Only 28 % of the businesses in finance have done so. The topical attribution of corona-mentioning sentences allows us to investigate how the pandemic influenced business perceptions related to the topics of travel restrictions, supply shocks, demand disruptions, and governmental aid (Fig. 2B). As the U.S. government imposed the first travel restrictions on January 31st, travel-related reporting in Manufacturing surged, followed by Wholesale & Retail a month later. Subsequently, both industries began to express significant concerns regarding supply delays throughout February and March. The supply disturbance was overshadowed by a substantial demand shock with the announcement of stay-at-home orders by mid-March. Since then, the demand disruptions have particularly affected the reporting in the Wholesale & Retail sector. More recently, reporting in both industries started to deal with the implications of the "Coronavirus Aid, Relief and Economic Security (CARES) Act" (signed by President Trump on March 27th, releasing more than USD 500 billion in loans to the U.S. economy).

Application & Generalisability
To validate the CoRisk data as a broadly applicable real-time economic indicator, we relate the weekly changes of the CoRisk-Index to the U.S. unemployment initial claims data (as a percentage of total employment per sector; see Methods section). Despite limited data availability (currently, in the very early phase of the crisis, not much unemployment data are available), Figure 3A reveals correlations between the two stationary time series for a number of industries. However, the crisis has not yet reached all sectors of the economy. Hence, the financial industry, for example, shows little reaction in terms of rising unemployment initial claims, which results in a very low correlation.
The CoRisk methodology can be generalised beyond the case presented here. Reports filed with the U.S. Securities and Exchange Commission represent a rich source of textual data reflecting political and economic events, which can be used to track and predict the influence of external events on the business environment. We exemplify the generalisability of the approach using the case of the U.S.-China trade war in 2018. Figure 3B shows the text negativity of 'china' sentences, i.e. the share of negative words in sentences mentioning the keyword 'china'. The negativity varies between 1.7 % and 3.4 % over the course of the trade war, with tariff announcements being reflected by periods of sharply increasing negativity and relaxing events such as trade talks being reflected by decreasing text negativity. The example illustrates that the CoRisk methodology provides a flexible tool that is able to track business risk factors in numerous applications. Besides this example, we provide a comparison of the overall report negativity (share of negative words in 10-K reports) and macroeconomic variables in the appendix.

Discussion
Governments are eager to counterbalance the looming global economic crisis induced by the Covid-19 pandemic with cyclical and fiscal policy packages of enormous volumes. The CoRisk-Index represents an attempt to contribute to the pressing demand for empirical data on the economic impact of the ongoing pandemic. It equips policy-makers and researchers with a tool to analyse and track industry-specific risk perceptions related to Covid-19 in real time, based on data that represents a large part of the U.S. economy. Our analysis shows that industries vary greatly in their risk perception as well as with regard to the topics that they are concerned with. It underlines the necessity to incorporate alternative data sources of up-to-date empirical information in the current situation, in order to better understand the differential effects of Covid-19 on different sectors of the economy. The CoRisk data will help to forecast the economic effects of the crisis and to design viable support instruments targeting the most affected industries.
However, the analysis comes with potential limitations. Firstly, we limit our investigation to the United States given the proven reliability of 10-K reports as a source of company risk reporting 17. Secondly, the value of the approach relies on company self-reporting. As many of the implications of the corona crisis are still uncertain, our approach approximates potential implications based on the current estimates of company experts in the different sectors. Naturally, this does not include risks that are unforeseeable for companies themselves at a given point in time. Thirdly, the CoRisk-Index is currently limited to the United States, as it appears to be the only country that provides company risk reports in a highly standardised and freely accessible format. Nevertheless, we are working on implementing the index for other countries.
Nonetheless, we believe that our measures represent a valuable source of empirical information about the issues faced by different industries in the current situation. The rapidly changing global pandemic calls for a constant re-assessment of economic repercussions and the corresponding policy responses. In this situation, agile computational social science methods such as data-mining and natural language processing allow us to extract valuable information at low cost and in real time. As the crisis unfolds, we continuously extend our work. The CoRisk-Index is updated weekly until the end of the pandemic, in order to support fellow researchers and policy makers with timely data on industry-specific risk perceptions. In addition to the findings presented in this paper, we provide detailed industry-specific insights and all raw data (including filter and download options) via an interactive online dashboard. 3

Automated extraction of 10-K reports
The U.S. Securities and Exchange Commission (SEC) stores all reporting in a central repository 4. Here, users can directly access meta-level information such as index files (e.g., lists of all reports issued in the second quarter of 2020). Alternatively, individual filings and meta-level information can be retrieved via various statistical packages, e.g., "edgar" (R) or "sec-edgar-downloader" (Python), or a freemium API 5.
However, several limitations make the use of these ready-made tools impractical for the specific research purposes of this work. Downloading all relevant reports as .txt/.html files via the ready-made packages is possible in theory but requires a lot of time and hard disk space. In addition, the .txt file contains several unwanted css/html code patterns that make the identification of corona-sentences and the counting of words in them unreliable.
Instead, we start from the crawler.idx index file 6 in the SEC repository. The index file holds a full list of all reports issued in a given quarter. This document shows the meta-level index.htm page for each company. On this page, the most recent 10-K report is linked 7. Unfortunately, the htm-version of each 10-K report has a cryptic file name that cannot be anticipated or guessed from the company, date or industry parameters. Hence, the currently implemented scraper pipeline of our project (Python 2.7) first fetches the list of recently listed 10-K reports from crawler.idx and, secondly, constructs the meta-level index.htm for each company and finds the link to the most recent 10-K report. The algorithm then scrapes the report text before, lastly, identifying the sentences related to corona (examples of text elements that contain the term 'coronavirus' are displayed in Figure A4). All sentences and their respective report properties are stored for later processing.
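The first step of this pipeline, filtering the quarterly index for 10-K filings, might look roughly like the following Python 3 sketch. It is an illustration rather than the project's scraper: it assumes that each data row of crawler.idx lists company name, form type, CIK, filing date and URL in whitespace-aligned columns with at least two spaces after the company name, and it works on index text that has already been downloaded. The function name and the sample rows used to try it out are hypothetical.

```python
import re

# Assumed row layout: company name, form type, CIK, date filed, URL.
# The company name is assumed to be followed by at least two spaces.
ROW = re.compile(
    r"^(?P<company>.+?)\s{2,}(?P<form>\S.*?)\s+(?P<cik>\d+)\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<url>https?://\S+)\s*$"
)


def list_10k_index_pages(idx_text, form_type="10-K"):
    """Return one dict per filing of the requested form type.

    Header and separator lines do not match the row pattern and are skipped.
    """
    filings = []
    for line in idx_text.splitlines():
        match = ROW.match(line)
        if match and match.group("form") == form_type:
            filings.append(match.groupdict())
    return filings
```

Each returned `url` would point to the meta-level index.htm page, from which the link to the actual report still has to be extracted in a second step.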

Collection of stock market data
The collection of stock market data serves to provide reference data for the company SEC filings. We obtained stock market data through the Yahoo! Finance API in two successive steps: First, we extracted the CIK (central index key) identifier of each company from the SEC filings to compile a comprehensive list of all companies, their CIK and stock ticker identifiers. Using the stock tickers, we then retrieved the historic closing stock values of each company per trading day between January 1st 2018 and May 15th 2020.

Collection of unemployment data
Due to the recent outbreak of the crisis, only a few data points of the monthly industry-specific unemployment rates provided by the Bureau of Labor Statistics are available. The crisis had its first labour market repercussions only in March; as of 5th June, unemployment rates are available only for March (published in April) and April (published in May). Therefore, we consider a different data set to approximate the labour market repercussions of the pandemic. We use weekly unemployment initial claims data provided by the Economic Policy Institute. 8 The data covers the number of weekly initial claims per sector in nine weeks from 14th March to 9th May 2020 for 19 US states. Additionally, the data contains information about the total employment per industry and state.
Using this dataset, we can calculate the share of weekly unemployment initial claims per total employment (the total employment is assumed to be constant in the observation period) in the 19 US states.
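As a rough illustration of this calculation, the following Python sketch pools claims and employment over all states before dividing. Pooling across states, the record layout and the function name are assumptions made for the example; the paper only states that employment is held constant over the observation period.

```python
from collections import defaultdict


def weekly_claims_share(claims, employment):
    """Share of weekly initial claims in total employment, per sector and week.

    claims: iterable of (state, sector, week, initial_claims) records.
    employment: dict mapping (state, sector) to total employment, which is
    treated as constant over the observation period.
    """
    sector_employment = defaultdict(float)
    for (_state, sector), employed in employment.items():
        sector_employment[sector] += employed

    sector_claims = defaultdict(float)
    for _state, sector, week, initial_claims in claims:
        sector_claims[(sector, week)] += initial_claims

    return {
        (sector, week): total / sector_employment[sector]
        for (sector, week), total in sector_claims.items()
        if sector_employment[sector] > 0
    }
```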

Specification of corona keywords
Since this study examines the attention attributed to COVID-19 in the SEC filings, the discovery mechanism of relevant COVID-19 mentions is of central importance. To mitigate susceptibility to errors due to word splitting, stemming and other text preprocessing, we opted for the simplest approach, based on the matching of regular expressions. We scanned the reports for the two relatively unambiguous terms "corona" and "covid", also accounting for "coronavirus" and "covid-19" without duplication. For this process, the entire text is set to lower case.
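A regular expression of this kind could be written as follows in Python. The exact pattern used in the study is not given, so the one below is an assumption; it treats "coronavirus" and "covid-19" as single matches so that they are not counted twice.

```python
import re

# Optional suffixes make "coronavirus" and "covid-19" count as one match each.
CORONA_PATTERN = re.compile(r"corona(?:virus)?|covid(?:-19)?")


def count_corona_terms(text):
    """Count keyword matches after lower-casing the whole text."""
    return len(CORONA_PATTERN.findall(text.lower()))
```

Because the pattern matches substrings, a word such as "coronary" would also be counted; whether that matters depends on the corpus.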

Topic detection
We apply unsupervised methods for topic detection from Natural Language Processing to further analyse the reports with regard to topic-related risks. Different sectors are facing different challenges; therefore, companies report different corona-related risks. We aim to capture these risk topics via a keyword search on predefined topics. In order to explore possible topics, we used Latent Dirichlet Allocation (LDA) for unsupervised topic modelling, similar to Dyer et al. (2017) 21. We only apply the topic model to corona-related paragraphs in the risk sections. We additionally examine the most frequent words and bi-grams in the documents. Using this exploratory analysis, we define a set of topics, which are specified by keywords. We then conduct a keyword search to count how often these terms are mentioned in the different industries in order to estimate the topic prevalence.

Unsupervised topic modelling
We use unsupervised learning techniques to explore the space of topics that companies discuss when describing coronavirus-related risks. Latent Dirichlet Allocation (LDA) is a Bayesian computational linguistic technique that identifies the latent topics in a corpus of documents 22. This statistical model falls into the category of generative probabilistic modelling: a generative process which defines a joint probability distribution over the observed random variable, i.e. the words of the documents, and the hidden random variables, i.e. the topic structure. In other words, LDA uses the probability of words that co-occur within documents to identify sets of topics and their associated words 21. The number of topics has to be defined in advance. LDA is a frequently used technique to identify main topics in a corpus. Nevertheless, the interpretation of these topics can sometimes be difficult. We thus perform LDA for explorative purposes in our research only and apply the following steps:

Sample restriction
We filter all sentences from the risk sections that mention either "corona" or "covid", thereby also accounting for "coronavirus" and "covid-19".
Text preparation
Before we train the LDA model, we prepare the documents to achieve better performance of the method. We remove all common English stopwords, which are frequent words such as "is," "the," and "and", as well as those words which appear in at least 80% of the documents. These words are not useful in classifying topics as they are too frequent and therefore decrease performance. Moreover, we delete all words that do not occur in at least two documents.
LDA
We turn the documents into numerical "bag of words" feature vectors, disregarding word order. We then use LDA to extract the topic structure. Like any unsupervised topic model, this requires setting the number of topics a priori. We selected this key parameter based on semantic coherence, evaluating a range of two to eight topics, leading to a final model of four topics. The top ten terms of each topic are displayed in Table A1.
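The preparation and vectorisation steps can be sketched in plain Python as shown below. The stopword list is a short hypothetical stand-in for a full English list, the tokeniser is deliberately simple, and the thresholds mirror the description above (words in at least 80% of the documents or in fewer than two documents are dropped). Fitting the topic model itself is left out.

```python
import re
from collections import Counter

# Hypothetical short stopword list; a full English list would be used in practice.
STOPWORDS = {"is", "the", "and", "a", "of", "to", "in", "our", "we", "may"}


def tokenize(document):
    """Lower-case a document and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", document.lower())


def build_vocabulary(documents, max_doc_share=0.8, min_docs=2):
    """Drop stopwords, words in >= max_doc_share of documents, and rare words."""
    tokenized = [[t for t in tokenize(d) if t not in STOPWORDS] for d in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(documents)
    vocabulary = sorted(
        word for word, df in doc_freq.items()
        if df >= min_docs and df / n_docs < max_doc_share
    )
    return vocabulary, tokenized


def bag_of_words(tokenized, vocabulary):
    """Turn each tokenized document into a count vector over the vocabulary."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for tokens in tokenized:
        vector = [0] * len(vocabulary)
        for token in tokens:
            if token in index:
                vector[index[token]] += 1
        vectors.append(vector)
    return vectors
```

The resulting count vectors are the kind of input that standard LDA implementations, for example in scikit-learn or gensim, expect; the paper does not state which implementation was used.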

Dictionary-based topic search
The algorithmically derived topics give a good insight into the general narratives of risks used in the documents. Nevertheless, they are hard to interpret, as early corona-related risk reports are still generic in that various risk factors are covered. Topic four, for example, provides an unspecified context with regard to the outbreak of the illness in China. Similarly, topic three covers the potential impact of the crisis in an unspecified context. In contrast, most of the business and economics related keywords appear to be covered in topic two. Moreover, the unsupervised methods are not deterministic. To ensure a robust and comparable topic identification over time, we use a dictionary-based keyword search. For this, we combine the results of the unsupervised methods with domain knowledge from economics to label the five main topics and specify defining keywords, displayed in Table A2. The most frequently mentioned bi-grams and words are considered. Using these, we can conduct dictionary-based searches in the filings. We measure the topical context of each corona-sentence by calculating the share of topic keywords relative to the word length of the sentence. This metric is later aggregated for industries and specific points in time.
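The sentence-level metric can be illustrated with a short Python sketch. The topic names and keyword sets below are hypothetical placeholders, not the dictionary of Table A2, and only single-word keywords are handled; matching bi-grams would need an additional step.

```python
import re

# Hypothetical topic dictionary for illustration; not the keywords of Table A2.
TOPIC_KEYWORDS = {
    "supply": {"supply", "supplier", "suppliers", "shortage"},
    "demand": {"demand", "customers", "sales"},
    "travel": {"travel", "flights", "quarantine"},
}


def topic_shares(sentence):
    """Share of topic keywords relative to the number of words in the sentence."""
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    if not tokens:
        return {topic: 0.0 for topic in TOPIC_KEYWORDS}
    return {
        topic: sum(1 for t in tokens if t in keywords) / len(tokens)
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
```

Averaging these shares over all corona-sentences of an industry in a given week would give one possible form of the aggregated topic prevalence.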

Matching industry classification systems
The SEC classifies firms into industries using an amended version of the 1987 Standard Industrial Classification (SIC). 9 The SIC is a system for classifying industries by a four-digit code. It was replaced by the North American Industry Classification System (NAICS) in 1997. The NAICS system, last revised in 2017, is still in use by US government agencies. Thus, it is important to match the SIC codes used by the SEC with the NAICS system, in order to allow researchers and economists to use the CoRisk data in industry-specific applications.
To match the different classification systems, we use merging tables provided by the US Census Bureau. 10 The code (R) and the merging tables are available on GitHub. 11 After having merged the four-digit SIC codes, we use the NAICS overview provided by the US Census Bureau to identify the larger two-digit sectors, on which the CoRisk data are aggregated. 12 During merging, some industries are assigned to several categories. In order to drop duplicates, we only keep the most frequent sector per SIC code. While it might be possible that some information is lost during the merging process from the 1987 four-digit SIC system to 2017 two-digit NAICS sectors, we assume that fewer categories, which are compatible with the system used in current US statistics, provide a better overview of relevant processes than an aggregation that is too fine-grained and consists of low sample sizes within each category.
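The de-duplication rule ("keep the most frequent sector per SIC code") can be sketched in Python as follows. The authors' code is written in R; this version is only an illustration. It assumes the crosswalk is available as (SIC code, NAICS code) pairs and uses the first two NAICS digits as the sector, without grouping multi-code sectors such as 31-33.

```python
from collections import Counter, defaultdict


def most_frequent_sector(crosswalk_rows):
    """Map each SIC code to its most frequent two-digit NAICS prefix.

    crosswalk_rows: iterable of (sic_code, naics_code) pairs, where one SIC
    code may appear with several NAICS codes.
    """
    counts = defaultdict(Counter)
    for sic_code, naics_code in crosswalk_rows:
        counts[sic_code][str(naics_code)[:2]] += 1
    return {sic: counter.most_common(1)[0][0] for sic, counter in counts.items()}
```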

Filing of 10-K reports to SEC
Companies with more than 10 million USD in assets or a class of equity securities that is held by more than 2,000 owners must file annual 10-K reports with the SEC, regardless of whether the securities are publicly or privately traded 13. All 10-K reports are made publicly available by the SEC. In particular, but not exclusively, in the risk section of the report, the company lays out anything that could go wrong, likely external effects, possible future failures to meet obligations, and other risks disclosed to adequately warn investors and potential investors. Companies are required to use "plain English" in describing these risk factors, avoiding overly technical jargon that would be difficult for a layperson to follow.

Comparing 10-K and 10-Q reports
In addition to annual 10-K reports, the SEC requires companies to publicly disclose their actions in a set of reporting formats (10-X). Apart from historical reporting standards and niche reporting categories, company disclosures with the SEC can be separated into two groups: annual 10-K and quarterly 10-Q reporting. One of the four quarterly 10-Q reports is subsumed by the annual 10-K report.
We limit our analysis to the information contained in 10-K reporting alone for the following reason. On March 4, the SEC explicitly advised public companies to assess what the coronavirus means for their future operations and financial results and to make appropriate disclosures to their shareholders and other members of the investment community. Furthermore, the SEC encouraged companies to delay SEC filings if necessary to develop the information required to make accurate and complete disclosures of the impact of the coronavirus on their operations and financial conditions. Specifically, the SEC issued an order stating that public companies that are unable, because of the coronavirus, to meet filing deadlines for SEC reports due to be filed March 1 to April 30, 2020, will have 45 additional days to file these reports so long as, among other things, they file reports on Form 8-K describing the reasons why the report may not be filed on a timely basis 14. This announcement has interfered with the normal reporting procedures in at least two ways. Companies delayed their quarterly reporting until the last possible date (April 30, 2020), and companies were incentivised to "talk" about Covid-19 related issues.
From our perspective, this reporting bias should manifest most strongly in 10-Q reports, as they are more susceptible to short-term changes in reporting standards. We decided to include solely 10-K reports in our CoRisk analysis, as they have a long-term outlook of one year and a lower susceptibility to ad-hoc changes in reporting standards compared to 10-Q reports.
We have examined historical SEC filings data 2017 and calculated the share of negative words (one of the two key components of the CoRisk-Index) for these filings in two ways.
First, we calculated the share of negative words per report in all 152,694 reports that were filed between 2000 and 2018. To compare this historical text negativity with macroeconomic data, we aggregated the per-report negativity score for all reports filed in one quarter and compared this to US quarterly unemployment rates and quarter-to-quarter changes in GDP. 18 Figure A7 shows the results. During the two recessions (grey bars) that happened in the period from 2000 to 2018, the GDP (upper panel) dropped significantly and the unemployment rate increased (central panel). 19 Correlated with these overall macroeconomic developments, the share of negative words in 10-K reports (lower panel) increased in these periods. In periods of economic recovery (falling unemployment rates in 2004 to 2008 and in 2010 to 2012), the overall text negativity decreased.
This observation provides supportive evidence that textual data extracted from SEC filings contains information that is correlated with changing real-world economic circumstances. However, the overall text negativity is a highly aggregated measure, not comparable to the fine-grained data on specific topics that are covered by the text analysis of sentences mentioning specific keywords, such as Covid-19. Moreover, over the course of the past 20 years, the length of 10-K reports has increased substantially, which might influence text mining measures on the report level.

[Figure 1 panels: (A) Daily new cases of Covid-19 patients in the US (14-day m.a.); (B) S&P 1200 Global Index (index points); (C) Share of 10-K reports mentioning 'corona' or 'covid' (14-day m.a.); (D) Corona-sentence text negativity in 10-K reports (avg. daily share of negative words per 'corona' sentence)]

Figure 1. The CoRisk measures anticipate the expected economic shocks of the Covid-19 pandemic. (A) Daily cases of Covid-19 infections in the United States (source: www.ourworldindata.org): the cases started to rise sharply in March and April 2020. (B) S&P 1200 Global Index (source: www.spglobal.com): as the Covid-19 pandemic reaches the level of a global health crisis, stock markets plummet on 20th February. (C) Share of companies mentioning 'corona' in 10-K reports: weeks before the pandemic hits the United States, the share of firms reporting about the virus starts to rise. (D) Share of negative words in 'corona' sentences of 10-K reports: the negative sentiment surges before the global stock market crash and peaks before the most severe stock market losses on "Black Monday".

Figure 3. The CoRisk data reflect macroeconomic developments. (A) The CoRisk-Index (weekly growth: CoRisk_{t-1} / CoRisk_{t-2}) and weekly unemployment initial claims in different industries (28th March to 9th May): changes of the CoRisk-Index are correlated with unemployment claims in 'All Industries', Manufacturing, and Wholesale & Retail. By May, the economic crisis has not led to a rise of unemployment in Finance. (B) Generalisability of the CoRisk methodology: share of negative words in 'china' sentences in 10-K reports during the 'US-China trade war' in 2018 (solid line and grey dotted trend line). The methodology allows extracting the china-related negative sentiment from the reports, and it reflects negative (red) and positive (green) events during the crisis, pointing towards the general applicability of the tool to reveal granular insights about the effects of events on businesses.

Figure 5. Number of employees working in different sectors of the US economy and share of employees working in firms that report to the SEC. In eight out of 13 sectors, the firms that file with the SEC represent at least 22 % of all employees in that sector.

Figure 6. Share of reports per quarter of the eight different industries. There is some seasonality, which is mainly driven by Finance's large number of reports in the first quarter. The other industries do not show substantial variation within a calendar year.

Figure 7. Comparison of quarterly GDP change (upper panel), unemployment rates (central panel), and share of negative words in 10-K reports (lower panel). The figure exemplifies that the sentiment of 10-K reports tends to correlate with overall macroeconomic developments.