CovidCounties is an interactive real time tracker of the COVID19 pandemic at the level of US counties

Management of the COVID-19 pandemic has proven to be a significant challenge to policy makers. This is in large part due to uneven reporting and the absence of open-access visualization tools to present local trends and infer healthcare needs. Here we report the development of CovidCounties.org, an interactive web application that depicts daily disease trends at the level of US counties using time series plots and maps. This application is accompanied by a manually curated dataset that catalogs all major public policy actions made at the state-level, as well as technical validation of the primary data. Finally, the underlying code for the site is also provided as open source, enabling others to validate and learn from this work.


Introduction
The disease known as COVID-19 was first reported in December of 2019 in Wuhan, China 1 . Three months later it was declared a pandemic by the WHO, and since then its death toll has reached over 820,000 while infecting over 24 million people across 210 countries worldwide 2 . Additionally, the pandemic has disrupted the daily lives of billions and has incurred significant socioeconomic costs at the global level.
In the US, the very assessment of the disease's impact has been challenged by limitations in accurate data capture and analysis. Variable testing, uneven reporting, barriers to data sharing, and a lack of easy-to-use analytic tools have all contributed to a lack of clarity in establishing and trending the state of the pandemic. As a consequence, policy makers at all levels have been forced to make decisions of great socioeconomic consequence in the face of significant uncertainty.
To improve the accessibility of basic COVID-19-related information in the US, especially for the general public and policymakers without a data science background, we report the creation of a new interactive visualization tool that depicts daily disease trends at the level of individual US counties. This web application features the novel reuse of several publicly available sources of data while also introducing a new, manually curated dataset accompanying this manuscript. The site features several unique views, including local doubling times and estimated ICU bed requirements by county. Additionally, we report the technical validation of the primary data (counts per county or state per day) against other official and commonly used sources of data.
Public policy. The data and tools incorporated into CovidCounties support the effectiveness of social distancing measures, consistent with several events that have occurred following the initial release of the website. South Dakota, one of six states which did not have a statewide shelter in place order as of April 15, 2020, experienced rapid case growth following an exposure at a meat plant (Fig. 2a). This outbreak accounted for more than half of the state's cases 3 as of April 15, 2020, with the fastest statewide doubling time of 4.5 days on April 15, 2020 (Fig. 2b).
Web application deployment. The web application located at covidcounties.org was first released to the public on April 3, 2020. It features two sections: a line plot depicting time-series trends in disease dynamics, and a map depicting geospatial relationships (Fig. 3). The site had over 15 thousand unique site visits in the first week and a total of nearly 27 thousand unique site visits over its lifetime, most of which were made from a mobile device.

Discussion
The effective management of the COVID-19 pandemic has been hindered both by inaccurate data collection and reporting and by the relative inaccessibility of the data to non-data scientists. Taken together, these difficulties have impeded optimal policymaking by both government (imposing social distancing policies) and health systems (anticipating ICU utilization). Consequently, responses across institutions have been highly variable, with varying degrees of success. To help address these gaps we developed covidcounties.org and performed the technical validation reported in this work.
The curation of COVID-19 case and death counts by The New York Times is an impressive effort by over 60 reporters to collect, curate and analyze a constantly growing and evolving dataset 4 . However, they acknowledge that the underlying data is extremely fragmented and comes from thousands of different sources at both the state and county levels and thus is inherently limited by accuracy, consistency, and timeliness. The New York Times notes that reported cases have been corrected mere hours after the initial report and there have been numerous instances where data has disappeared from databases without explanation.
Fig. 1 [...] and daily deaths (n = 2,348) (c) reported by CovidCounties against corresponding data reported by the California Department of Public Health. Each point corresponds to a measurement from a given California county on a particular date where both datasets report counts. Data is from 6/28/2020-8/8/2020. (d-f) Comparison of the estimated hospital bed occupancy (n = 248) (d), daily cases (n = 240) (e), and daily deaths (n = 240) (f) reported by CovidCounties against corresponding data reported by the Connecticut Department of Public Health. Each point corresponds to a measurement from a given Connecticut county on a particular date where both datasets report counts. Data is from 6/28/2020-8/8/2020. (g,h) Comparison of the estimated daily cases (n = 121,944) (g) and daily deaths (n = 121,944) (h) reported by CovidCounties against corresponding data reported by the website Corona Data Scraper. Data is from 6/28/2020-8/8/2020. Each point corresponds to a measurement from any US county in the dataset at a particular time where both datasets report counts. (i-k) Comparison of the estimated hospital bed occupancy (n = 287) (i), daily cases (n = 287) (j), and daily deaths (n = 287) (k) reported by CovidCounties [...]
The New York Times has also chosen to count patients where they were treated rather than by their place of residence, and reports a number of geographic exceptions in its dataset (https://github.com/nytimes/covid-19-data), including the treatment of cities like New York City and Kansas City and the allocation of cases from cruise ships. Further, there is a subset of cases where the patient's county of residence cannot be or has not yet been identified. This subset is generally a small fraction of a state's total cases but can be a significant number in a small state like Rhode Island (Table 1).
Taken together, these subtleties of the data collection process imply that the COVID-19 data from The New York Times may not exactly agree with the numbers reported by various state and county Departments of Public Health. We quantified the consistency between The New York Times COVID-19 data at the county (Fig. 1b,c,e,f) and state (Fig. 1j,k) level and state Department of Public Health data and found the datasets to be largely comparable. Based on the exact agreement, it seems likely that The New York Times is deriving its data for Connecticut directly from the Connecticut Department of Public Health (Fig. 1e,f).
Anomalies in the data become more apparent when comparing different reporting sources. In Fig. 1j, we observe a set of points where the CovidCounties daily cases for the state of Rhode Island obtained from The New York Times are zero while the Rhode Island Department of Public Health reports these values to be nonzero. Inspection of the dates corresponding to these values reveals that they all fell on weekends and that all the cases that occurred over the weekend were attributed to the following Monday. While this would have a small impact on our prediction of current hospitalizations and current ICU patients due to the 10-day average length of stay, it does have a direct impact on daily hospitalizations (Fig. 1i).
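The weekend check described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: the function names and the dictionary layout (date mapped to daily count) are hypothetical, and the counts in the usage note are invented for the example.

```python
from datetime import date


def zero_mismatches(series_a, series_b):
    """Dates present in both series where series_a is zero but series_b is not."""
    shared = series_a.keys() & series_b.keys()
    return sorted(d for d in shared if series_a[d] == 0 and series_b[d] != 0)


def all_on_weekends(dates):
    """True if there is at least one date and every date is a Saturday or Sunday."""
    # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6
    return bool(dates) and all(d.weekday() >= 5 for d in dates)


# Illustrative (made-up) daily counts around the weekend of July 4-5, 2020
aggregated = {date(2020, 7, 4): 0, date(2020, 7, 5): 0, date(2020, 7, 6): 90}
health_dept = {date(2020, 7, 4): 30, date(2020, 7, 5): 25, date(2020, 7, 6): 35}

suspect_dates = zero_mismatches(aggregated, health_dept)
weekend_only = all_on_weekends(suspect_dates)
```

In this toy example the two flagged dates are a Saturday and a Sunday, and the aggregated Monday count equals the sum of the three health-department counts, which is the pattern described for Rhode Island.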
Despite the simple model formulation used to estimate ICU patients and hospitalizations, and despite parameters that are derived from averages across multiple states rather than for specific states or counties, the concordance with data reported by state Departments of Public Health is quite reasonable (Fig. 1a,d,i). We would like to highlight that this concordance and variability are similar to those of the underlying data used by the model, namely the daily cases reported by The New York Times compared with the daily cases reported by the state Departments of Public Health (Fig. 1b,j). This variability is due to the inherent noise and error in the reporting and aggregation of this data from a vast number of sources without any specified reporting standards. The true power of using such a simple model is that it is easily extrapolated to any county in the United States on any given date, provided the previous 10 days of daily cases are available. While our benchmarking demonstrates the internal validity of our modeling approach, a lack of readily available external datasets means we are unable to assess its external validity, which is a limitation of this approach.
With the advent of the COVID-19 pandemic we have observed a trend towards government agencies at the municipal, county, state, and national levels making their data increasingly accessible for re-use, thereby increasing its potential value. However, many of the most popular tools built upon this freely available data do not provide their source code for further development. The Johns Hopkins dashboard 2 , which receives more than 1.2 billion hits per day, has made its data publicly available 5 ; however, the source code for the dashboard is not available for further development by third parties. Similarly, the IHME dashboard 6 , which has been referenced by the White House for making policy decisions 7 , has been peer reviewed 8 ; however, its epidemiological model has yet to be peer reviewed 9 . While IHME provides open source code on their data aggregation process (https://github.com/beoutbreakprepared/nCoV2019) and some features of their model including the curve fitting of their projections (https://github.com/ihmeuw-msca/CurveFit), the whole [...]
CovidCounties represents an improvement over existing dashboards in terms of both scope and granularity. Existing COVID-19 dashboards generally focus either on county level data within a particular state (primarily at a static timepoint) or on the state level across the United States. We have developed an intuitive tool that facilitates temporal comparisons between all counties in the US. However, we are inherently limited by the availability of data. While CovidCounties' estimation of ICU needs at the county level allows for higher resolution allocation of resources compared to the widely used state level model from IHME (https://covid19.healthdata.org/united-states-of-america), zip code level data would further improve the value of these estimations for resource allocation.
States like Maryland 12 , Arizona 13 , and South Carolina 14 and counties like Johnson County, Kansas 15 , San Diego County, California 16 , and King County, Washington 17 have already made zip code level data available. However, many states and counties are hesitant to provide data of this granularity due to concerns over privacy, highlighting the challenge of balancing privacy with public good.
A limitation of CovidCounties is the inherent dependence on publicly available data. To date, most states and counties are primarily providing case and death data, with an increasing number also providing hospitalization data and testing data. The increased availability of testing data opens new avenues to make inferences on the infection rate in the population and to improve model trajectories. As testing continues to ramp up, this will also allow for the evaluation of claims that there has been an under-ascertainment of cases, especially among asymptomatic individuals 18 , which can influence case rates. States and counties are continuously ramping up testing, and this sudden availability of tests can artificially distort counts by attributing individuals who were infected previously to a later date due to an earlier shortage of tests. These numbers are further complicated by the wide variety of commercially available tests that rely on different technologies with varying sensitivity and specificity.
With its release, covidcounties.org represents a powerful open-source platform to empower non-data scientists to track the current trends of the COVID-19 pandemic at the county level to help facilitate policy and healthcare decisions which can help improve outcomes. We welcome volunteers (both technical and non-technical) to help us to further develop CovidCounties (https://covidcounties.org/buttelabcovid/www/volunteers.html).

Methods
Data sources. Data on state-wide and county-level counts were obtained from The New York Times 4 via their GitHub repository (https://github.com/nytimes/covid-19-data). County-wise population data were obtained from the US Census 19 using the R package tidycensus 20 . Data on ICU bed availability per county were obtained from Kaiser Health News 21 .
As per The New York Times, cases and deaths reported from New York, Kings, Queens, Bronx and Richmond counties were assigned to New York City. Similarly, Cass, Clay, Jackson and Platte counties in Missouri were assigned to Kansas City. When a patient's county of residence was unknown or pending many state departments reported these cases as coming from "unknown" counties. Cases reported from unknown counties were only included at the state level.
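The geographic assignment rules above can be illustrated with a short Python sketch. It is an assumption-laden example, not the site's R code: the record layout of (state, county, cases) tuples, the function names, and the use of the literal label "Unknown" are all hypothetical.

```python
from collections import defaultdict

# Counties folded into a single city-level reporting unit, as described in the text
NYC_COUNTIES = {"New York", "Kings", "Queens", "Bronx", "Richmond"}
KANSAS_CITY_COUNTIES = {"Cass", "Clay", "Jackson", "Platte"}


def reporting_unit(state, county):
    """Map a county to the unit under which its counts are reported."""
    if state == "New York" and county in NYC_COUNTIES:
        return "New York City"
    if state == "Missouri" and county in KANSAS_CITY_COUNTIES:
        return "Kansas City"
    return county


def aggregate(records):
    """Sum cases per reporting unit and per state from (state, county, cases) tuples."""
    county_totals = defaultdict(int)
    state_totals = defaultdict(int)
    for state, county, cases in records:
        # Cases from unknown counties still count towards the state total
        state_totals[state] += cases
        if county.lower() == "unknown":
            continue
        county_totals[(state, reporting_unit(state, county))] += cases
    return dict(county_totals), dict(state_totals)


# Illustrative (made-up) records
records = [
    ("New York", "Kings", 5),
    ("New York", "Queens", 7),
    ("Missouri", "Clay", 2),
    ("Rhode Island", "Unknown", 4),
    ("Rhode Island", "Providence", 6),
]
county_totals, state_totals = aggregate(records)
```

Here the four Rhode Island cases with an unknown county contribute to the state total of 10 but to no county-level total.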
Data related to state-wide implementation of social-distancing policies were manually curated by web search and independently reviewed by a second author; disagreements were rare and resolved by discussion. Government websites were prioritized as sources of truth where feasible; otherwise, news reports covering state-wide proclamations were used. All citations are captured in the open data file accompanying this manuscript on Dryad 22 . These data were up to date and confirmed as of the date of data deposit: April 19, 2020.
Ground truth data used for validation were manually curated from the websites of multiple state departments of public health as well as Corona Data Scraper (https://coronadatascraper.com/), a commonly used resource for aggregating county-level tracking of COVID-19 over time. Citations of the validation data are included in the data file accompanying this manuscript on Dryad 22 .
Descriptive statistics on all datasets except that of the US Census and validation data are reported in Tables 1-3.
Doubling time. Doubling time was calculated for each state and county by taking the reciprocal of the difference between the log (base 2) case counts corresponding to adjacent days, then applying the R function loess for smoothing. The input of this model required a minimum of 8 days of data where the minimum number of cases was greater than 10. Regularization was performed by replacing extreme doubling times (>500 days) with the average of the surrounding values.
ICU bed occupancy model. We incorporated modified parameters related to rates of hospitalization and ICU admission from work previously published by Ferguson et al. 23 . Although simpler than other models, it fit publicly available county-level ICU bed data in California well and was easier for the user to understand than the more complicated models proposed 9,24-26 . Our adapted model assumed an 8.26% rate of hospitalization among all new cases, a 29.64% rate of intensive care unit admission among hospitalized patients, and a 10-day average length of stay (time until discharge or death). The 29.64% average rate of ICU admission was estimated from empirical data on COVID-19 ICU admissions and total COVID-19 hospitalizations obtained from state department of public health websites for each California county, and statewide for Minnesota, Idaho, Illinois, and Indiana, from June 28th, 2020 to August 8th, 2020. The 8.26% average rate of hospitalization for positive cases was estimated from empirical data on the cumulative rate of hospitalization per 100,000 population from June 28th, 2020 to August 8th, 2020 across 14 states in COVIDNet 27 . The cumulative hospitalization rate was normalized to a daily hospitalization rate for positive cases by dividing by the cumulative positive cases per 100,000 population for the 14 states across the same time span.
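The doubling-time calculation described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the site's R implementation: a centered moving average stands in for the R function loess, extreme values are replaced before smoothing (the text does not specify the order), zero or negative growth is treated as an extreme value, and all function names are hypothetical.

```python
import math

MIN_CASES = 10       # only days with more than 10 cumulative cases are used
MIN_DAYS = 8         # minimum number of qualifying days
MAX_DOUBLING = 500   # doubling times above this many days are treated as extreme


def raw_doubling_times(cumulative_cases):
    """Reciprocal of the day-over-day difference in log2 cumulative cases."""
    counts = [c for c in cumulative_cases if c > MIN_CASES]
    if len(counts) < MIN_DAYS:
        return []
    times = []
    for prev, curr in zip(counts, counts[1:]):
        diff = math.log2(curr) - math.log2(prev)
        # Assumption: no growth (or a downward correction) is flagged as extreme
        times.append(1.0 / diff if diff > 0 else math.inf)
    return times


def replace_extremes(times):
    """Replace extreme values with the mean of their non-extreme neighbours."""
    cleaned = list(times)
    for i, t in enumerate(times):
        if t > MAX_DOUBLING:
            neighbours = [times[j] for j in (i - 1, i + 1)
                          if 0 <= j < len(times) and times[j] <= MAX_DOUBLING]
            # Assumption: cap at MAX_DOUBLING when no usable neighbour exists
            cleaned[i] = sum(neighbours) / len(neighbours) if neighbours else MAX_DOUBLING
    return cleaned


def moving_average(values, window=3):
    """Centered moving average, used here as a simple stand-in for loess."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def doubling_times(cumulative_cases):
    """Smoothed doubling time (in days) for each pair of adjacent days."""
    return moving_average(replace_extremes(raw_doubling_times(cumulative_cases)))
```

As a sanity check, a series that doubles every day, such as 16, 32, 64 and so on over nine days, yields a doubling time of one day throughout.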
To our knowledge there does not currently exist a high quality machine-readable standardized source of data for ICU and hospitalization rate parameter estimation. To ensure that our simple ICU model utilizes up to date parameters that reflect the current state of the pandemic, we will manually extract the relevant data each month and update these parameters. The parameters used in the live version of CovidCounties and all prior hospitalization and ICU rate parameter estimations will appear on the web application. This will serve as a log to reflect how time, testing, policy decisions, antivirals, and treatment options all affect the state of the pandemic.
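One way to read the ICU bed occupancy model, with its parameters kept configurable so that they can be updated over time, is sketched below in Python. This is an interpretation rather than the authors' code: it assumes that everyone hospitalized during the most recent 10 days is still in hospital, and the function and constant names are hypothetical.

```python
HOSPITALIZATION_RATE = 0.0826   # share of new cases that are hospitalized
ICU_RATE = 0.2964               # share of hospitalized patients admitted to the ICU
LENGTH_OF_STAY_DAYS = 10        # average stay until discharge or death


def estimate_occupancy(daily_cases,
                       hosp_rate=HOSPITALIZATION_RATE,
                       icu_rate=ICU_RATE,
                       los=LENGTH_OF_STAY_DAYS):
    """Estimate current hospitalized and ICU patients from daily new cases.

    daily_cases is ordered oldest first; only the last `los` days are used.
    """
    if len(daily_cases) < los:
        raise ValueError(f"need at least {los} days of daily cases")
    recent_cases = sum(daily_cases[-los:])
    hospitalized = hosp_rate * recent_cases
    icu = icu_rate * hospitalized
    return hospitalized, icu


# Example: a county reporting 100 new cases per day for 10 days
hospitalized, icu = estimate_occupancy([100] * 10)
```

With these inputs the sketch gives roughly 82.6 hospitalized patients and roughly 24.5 ICU patients. Passing different values for hosp_rate and icu_rate mirrors a monthly parameter update without changing the model itself.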
Incident cases were chosen as the basis for the ICU bed occupancy model rather than incident deaths as they better track the dynamic changes in hospitalizations and ICU admissions. A model built around incident deaths has the advantage of being less dependent on testing capacity and thus the hospitalization parameters of such a model would be agnostic to the gamut of testing capabilities across states over the course of the pandemic. However, incident deaths are a lagging indicator of the current state of the pandemic as they result from infections from weeks prior and are therefore less informative about the immediate impacts of events such as the introduction or lifting of state mandates, which are informative from a public policy standpoint.
See Fig. 4 for an overall schematic of the web application. The source code was written in R (4.1.0) 28 using the shiny 29 , shinyjs 30 , tidyverse 31 and plotly 32 packages. Software version control was achieved using Docker. The entire software code for the site is publicly available on GitHub (https://github.com/vivical/ButteLabCOVID) and Docker Hub (https://hub.docker.com/r/pupster90/covid_tracker). The web hosting was organized as a unified data share between all instances running R shiny [...]
Fig. 4 Database schematic. Source data was obtained from The New York Times, US Census Bureau, Kaiser Health News, and from a manual curation of state governmental websites and news outlets as described in Methods. Data was processed to reflect case and death counts at the level of states and counties. Functions were written to perform x- and y-axis rescaling, normalization by population, doubling time estimation, and ICU bed utilization. Results were depicted using interactive line plots and maps.