Chest imaging representing a COVID-19 positive rural U.S. population

As the COVID-19 pandemic unfolds, radiology imaging is playing an increasingly vital role in determining therapeutic options, patient management, and research directions. Publicly available data are essential to drive new research into disease etiology, early detection, and response to therapy. In response to the COVID-19 crisis, the National Cancer Institute (NCI) has extended the Cancer Imaging Archive (TCIA) to include COVID-19 related images. Rural populations are one population at risk for underrepresentation in such public repositories. We have published in TCIA a collection of radiographic and CT imaging studies for patients who tested positive for COVID-19 in the state of Arkansas. A set of clinical data describes each patient including demographics, comorbidities, selected lab data and key radiology findings. These data are cross-linked to SARS-COV-2 cDNA sequence data extracted from clinical isolates from the same population, uploaded to the GenBank repository. We believe this collection will help to address population imbalance in COVID-19 data by providing samples from this normally underrepresented population.

studies performed thus far in the US are chest radiographs. The image collection presented here is consistent with these guidelines.
The ability to sequence the whole genome of the SARS-CoV-2 virus is essential for tracking the evolution of the virus and the resulting disease. GenBank 7 contains 22,542 SRA runs and 11,142 Nucleotide records (as of July 20, 2020) and is growing daily. We have contributed to this repository sequences of the virus strains infecting our local population.

Methods
Patient selection. Images and clinical correlate data were obtained with approval of the UAMS Institutional Review Board (IRB number 228350) which includes open access publication of this data in de-identified form. All data were collected from hospitalized patients with a positive COVID-19 lab test (PCR) verified diagnosis who had imaging studies within eight days prior to diagnosis and at least one imaging study post diagnosis. Emphasis was placed on patients with multiple post diagnosis imaging studies. All imaging studies of the head were excluded due to issues of potential patient identification from volumetric CT reconstructions 8,9 . Imaging. Imaging techniques were standard of care, primarily portable (97%), digital chest radiographs collected either by computed radiography, CR, (19 patients, 26 image series) or direct digital capture, DX, (100 patients, 236 series). CT (computed tomography) studies were performed on 21% (23) of patients depending on the severity of symptoms and clinical status (resulting in 199 CT image series). Radiographs were performed using a single view AP (anteroposterior) technique or PA (posteroanterior) and lateral views depending on patient mobility and intubation status. CT was performed on multi-slice, spiral scanners. Contiguous 1 mm axial images were obtained through the chest (Online-only Table 1 provides acquisition parameters for each of the 199 CT series). Axial (3 mm spatial resolution) and 2 mm sagittal and coronal reconstructions were performed and subsequently reviewed on a dedicated Picture Archive and Communication System (PACS) workstation. A total of 256 chest imaging studies were identified, including 233 radiographs and 23 CT studies performed on a total of 105 patients. Figure 2 is an illustrative example of a chest radiograph and CT for the same patient taken on the same day. Final radiology reports for these examinations signed by board-certified radiologists were analyzed for key imaging findings.  www.nature.com/scientificdata www.nature.com/scientificdata/ Summary of key imaging findings. The most frequent pattern of imaging findings is 'Organizing pneumonia' which is essentially a pattern of lung changes as a response to inflammation. The reported key imaging findings on radiographs and CT include confluent and patchy multifocal airspace opacification, either ground-glass opacities or consolidation. These airspace opacities were predominantly bilateral, multifocal, and peripheral. Although more commonly, these changes were diffuse and bilateral, lower lobe predominance was seen in cases that were more focal. A total of 59/232 (25%) radiographs and 8/23 (35%) CT were negative for airspace opacification. In our patient population, when present, ground glass changes were more common than consolidation, likely suggesting an earlier presentation in the disease course. Pleural effusions were rare and, when present, were only trace to mild. Other atypical findings like mediastinal lymphadenopathy, cavitation, and pneumothorax were not identified in our patient population. One of the 23 CT studies showed bilateral segmental pulmonary emboli.
The majority of ICU patients (28/29) showed radiographic AP changes, with 59% showing bilateral diffuse changes with left lower lobe involvement. One Radiographic study for one ICU patient had no significant findings. Seventy percent (7/10) of ICU mortality occurred in patients showing left lower lobe opacities. In contrast, one of these patients had no significant findings on CR or DX radiographic studies, 1 had general atelectasis, and 1 had diffuse disease without left lower lobe opacity. Clinical correlates. The average age in the cohort was 54.3 years old (range 19-91) with an even sex distribution (52 Male, 53 Female). The worldwide incidence is reported to be close to 1:1, with a 50% increase in hospitalizations, ICU stays, and mortality among males 10 . The racial characteristics of the cohort are presented in Fig. 1 in comparison to the total population of Arkansas and the current characteristics of the state-wide infected population. The average BMI in the cohort is 33.1 (18.7-64.9), well within obese range (30.0 or higher). Key Comorbidities include burns (2%), malnutrition (3%), pregnancy (4%), chronic kidney disease (11%), diabetes (21%), organ transplant (3%), and cancer (24%).
The overall ICU admission rate was 28% (29/105). The Average age among those admitted to the ICU was 58 (range 25-89), slightly higher than the average for the cohort as a whole. The racial breakdown of ICU admissions included 28% of the white patients, 25% of the black, 50% of other, and 100% of Pacific islanders. The ICU population was 66% male and 33% female and included 1 pregnant patient. Forty three percent of patients admitted to the ICU had BMI greater than 30. Of the black patients, 21% had diabetes and 29% chronic renal disease, while among white patients the highest percentage comorbidities were cancer (10%) and diabetes (10%). The ICU mortality rate was 34.4% (10/29) which is 1.5 times the national average of 23.6% 11 . The overall mortality rate was 10% (10/105) and all 10 patients in the mortality group were first admitted to ICU. Arkansan males were 1.95 times more likely to go to ICU (19/52 vs. 10/53). Our data shows an almost even overall mortality rate among males (5/53) and females (5/52). The data also suggest that once in the ICU, female mortality occurs at a rate 1.9 times that of males (5/19 vs. 5/10). Figure 3 summarizes rates of hospitalization, ICU admission and mortality stratified by sex and race.  SARS-CoV-2 specific primer set version 3 (218 primers) was kindly provided by Dr. Joshua Quick, University of Birmingham. The PCR amplification and extension condition of SARS-CoV-2 genome was performed using the COVID-19 ARTIC v3 Illumina library construction and sequencing protocol V.4 (https://doi.org/10.17504/ protocols.io.bgxjjxkn). The library was prepared and loaded on a R9.4.1/FLO-MIN106 flow cell, and sequenced with the MinION Mk1B (Oxford Nanopore Technology, ONT). Base-calling of the resulting FAST5 files was performed using Guppy (version 3.4.5) 12 , and the RAMPART software (v1.0.5) from the ARTIC Network 13 was used to monitor sequencing in real-time. The ARTIC Network bioinformatics protocol was used for all genome assembly and variant calling steps 14 . Figure 4 illustrates the resultant SARS-CoV-2 genome organization and identified mutations. Two isolates of UAMS SARS-CoV-2 genome were identified and grouped into a clade 15 . Black circles and black squares in Fig. 4 identify the locations used for clade identification characterized by Global Initiative on Sharing All Influenza Data (GISAID) 15 . TrackViewer software was used for visualization 16 .

Data Records
COVID-19-AR image collection. Image data were extracted from the clinical PACS (Sectra AB, Linkoping, Sweden) at the University of Arkansas for Medical Sciences using Digital Imaging and Communications in Medicine (DICOM) query/retrieve software. All image data were de-identified and stored in DICOM standard format 17 on TCIA as collection COVID-19-AR 18 . Clinical data structure. All clinical data were obtained from the Arkansas Clinical Data Repository (AR-CDR) 19 . The AR-CDR is a research data warehouse that provides a single and secure source of data for use in clinical and translational research; housing data for this project that are extracted from the EPIC (Epic Systems Inc, Verona, WI) electronic health record (EHR) system and legacy systems.
Data are provided on TCIA 18 in a spreadsheet format (Microsoft excel) with one tab for patient data and a second for imaging study related data. Patient demographics, outcomes, co-morbidities and other clinical information are provided. Online-only Table 2 defines categorical variables (column headings) that are included. Fig. 4 The illustration of SARS-CoV-2 genome organization and a plot of mutations. Two isolates of UAMS SARS-CoV-2 genome, UAMS001 (a) and UAMS002 (b), are grouped into clade G (D614G mutation) but different sub-group (GH and GR, respectively). Genome size of both isolates is 29,903-nucleotides. The leader sequence is represented by a small grey square at the 5′ terminus of the genome. Open reading frames (ORFs) 1a (yellow) and 1b (brown) encode the nonstructural proteins. Spike (red), envelope, membrane, and nucleocapsid are shown in purple and the accessory proteins in green. Vertical lines represent the presence of a mutation. Above each line, the variants are annotated as the nucleic acid change (such as g.C241T) and amino-acid change (such as p.D614G) at that specific site. The non-synonymous mutations are presented by circles and the synonymous mutations squares. Black circles and black squares are the locations used for clade identification characterized by Global Initiative on Sharing All Influenza Data (GISAID). (2020) 7:414 | https://doi.org/10.1038/s41597-020-00741-6 www.nature.com/scientificdata www.nature.com/scientificdata/ Genomic data structure. Viral genomes are stored in the GenBank FASTA format 20 which stores both a nucleotide sequence and its corresponding quality scores. All required metadata, including collection dates, location, nucleotide sequence, amino acid sequences, and gene annotations, as described in the GenBank data model 21 were provided for each sequence.
Data are available from the NCBI Sequence Read Archive 22 and via direct links to each SARS-CoV-2 viral genome 23,24 . An NCBI BioProject was created (https://identifiers.org/bioproject:PRJNA643530) and also provides links to the viral genomes. Links to these data are also provided on TCIA 18 .

Technical Validation
Imaging and clinical data. All image data were processed using standard TCIA curation workflows. TCIA uses a standards-based approach for de-identification of images stored in the Digital Imaging and Communications in Medicine (DICOM) format. DICOM Standard PS3.15 2016a -Security and System Management Profiles 25 defines how to correctly de-identify DICOM objects. The image data described in this data descriptor were de-identified using the "Basic Application Confidentiality Profile, " amended by inclusion of profile options: Clean Pixel Data Option (removal of information burned into the pixel data), Clean Descriptors Option (removal of PHI from data elements of type text or string), Retain Longitudinal With Modified Dates Option (all dates are shifted in a consistent manner by a random number so that relative temporal information is retained but absolute dates are masked), Retain Patient Characteristics Option (descriptive patient information is retained), Retain Device Identity Option (acquisition system information of potential scientific value is retained), and Retain Safe Private Option (retain scientifically valuable information stored in vendor private data elements) as is standard TCIA practice. Additional details on the de-identification process including the process for modifying dates by shifting using a random number are provided on the TCIA web site (wiki.cancerimagingarchive. net/display/Public/Submission+and+De-identification+Overview).
The TCIA curation team verifies completeness of the received collection, full removal of all PHI, proper labeling of all information to facilitate retrieval, and proper linkages among components of the collection [26][27][28] . TCIA curation procedures assure image quality and data integrity. TCIA uses the Posda open-source framework to implement its curation process 29,30 . Posda-based TCIA curation workflows ensure the scientific utility of data and eliminate protected health information. After Posda processing is complete, a data curator visually inspects every DICOM image.
Patient demographics, outcomes, and other clinical information were also validated during the TCIA curation process. Prior to this review, two of the authors (AB, FP) manually reviewed all clinical data for consistency, accuracy, and completeness, while SD and FP reviewed all image data and radiology reports. Images were reviewed on the clinical PACS and using TCIA's radiology image viewer 31 .

SARS-CoV-2 genome sequencing.
To validate the ARTIC protocol for both experimental and computational methods, we included the Washington strain SARS-CoV-2 genomics RNA (2019-nCoV/USA-WA1/2020, MN985325.1) acquired from the American Type Culture Collection (ATCC) as the positive control and nuclease-free water as the negative control. The positive control and negative control were sequenced simultaneously with the clinical sequences and analyzed using the ARTIC protocol to obtain SARS-CoV-2 genomes. The SARS-CoV-2 genome sequence from the positive control showed no differences relative to the original genome sequence, MN985325.1. For the negative control, no viral sequence was obtained.