A comprehensive collection of systems biology data characterizing the host response to viral infection

The Systems Biology for Infectious Diseases Research program was established by the U.S. National Institute of Allergy and Infectious Diseases to investigate host-pathogen interactions at a systems level. This program generated 47 transcriptomic and proteomic datasets from 30 studies that investigate in vivo and in vitro host responses to viral infections. Human pathogens in the Orthomyxoviridae and Coronaviridae families, especially pandemic H1N1 and avian H5N1 influenza A viruses and severe acute respiratory syndrome coronavirus (SARS-CoV), were investigated. Study validation was demonstrated via experimental quality control measures and meta-analysis of independent experiments performed under similar conditions. Primary assay results are archived at the GEO and PeptideAtlas public repositories, while processed statistical results together with standardized metadata are publicly available at the Influenza Research Database (www.fludb.org) and the Virus Pathogen Resource (www.viprbrc.org). By comparing data from mutant versus wild-type virus and host strains, RNA versus protein differential expression, and infection with genetically similar strains, these data can be used to further investigate genetic and physiological determinants of host responses to viral infection.


Background & Summary
With the recognition that host responses to pathogen infection play key roles in disease severity and mortality, virologists have shifted toward integrated systems biology approaches to identify therapeutics that target host pathways 1,2 . To support the cross-disciplinary approaches necessary for a systems-level analysis, the Division of Microbiology and Infectious Diseases (DMID) at the National Institute of Allergy and Infectious Diseases (NIAID) established the Systems Biology for Infectious Diseases Research (SysBio) program, which supported four centers from 2008-2013: Systems Influenza, Systems Virology, Systems Biology of Enteropathogens, and Mycobacterium tuberculosis Systems Biology 3 . The Systems Influenza and Systems Virology centers focused on elucidating how viral regulation of the host cellular circuitry contributes to viral replication and disease severity, thereby identifying host pathways that could serve as potential new therapeutic targets. Several publications have reported the findings from these virology-focused SysBio centers. Overall characterization of the host responses demonstrated that pro-inflammatory interferon (IFN) signalling pathways were enriched following infection with either influenza A virus or a betacoronavirus in vitro 4,5 and in vivo 6,7 . These studies characterized several aspects of viral-host interactions that influence disease severity. Experiments designed to tease out the contribution of the virus and the host to the severity of host response reported that specific mutations in the H5N1 viral genome affected the kinetics and magnitude of the host response, rather than changes in specific host response factors 8 . Another study found that early response signatures to influenza virus infection correlated with the severity of late disease 9 . Lipidomics analysis of mice infected with influenza A virus strains of varying pathogenicity has also been reported 10 .
Similarly, the disease severity of SARS-CoV correlated with the dysregulation of the urokinase pathway and an increase in fibrinolysin pathway activity 7 . An integrated network interrogation approach using both transcriptomic and proteomic data sets predicted a subset of key regulator molecules responding to SARS-CoV and influenza infections that could be used as host targets for therapeutic intervention 11,12 . Genetic modification of non-structural protein 16 in SARS-CoV showed enhanced susceptibility to type I and III interferon responses, making it a good vaccine strain candidate 13 .
To support the broad dissemination of research datasets from these SysBio centers and other supported programs, DMID established the Bioinformatics Resource Centers for Infectious Diseases (BRC) program to provide public database and analysis resources for the infectious disease research community. A data dissemination working group, consisting of representatives from NIAID, the SysBio centers, and the BRCs, was established to develop a data management and release approach that would maximize the reuse potential of the generated data. The working group leveraged existing public archiving strategies to store the primary assay results and devised metadata standards to represent the experiment descriptions and interpreted results in a consistent and standardized way. The final derived host response data and associated structured standardized metadata were made publicly available at the BRC appropriate for the system under study. As the virus-centric BRCs, the Influenza Research Database (IRD, www.fludb.org) 14 and the Virus Pathogen Database and Analysis Resource (ViPR, www.viprbrc.org) 15 host data produced by the Systems Influenza and Systems Virology centers.
Here we describe 47 transcriptomic and proteomic datasets from 30 studies generated by these SysBio centers and made available through public resources. These studies characterized the host response to infection by members of the Orthomyxoviridae and Coronaviridae virus families, including pandemic influenza A H1N1 virus, highly pathogenic H5N1 avian influenza (HPAI) virus, severe acute respiratory syndrome coronavirus (SARS-CoV), and Middle East respiratory syndrome coronavirus (MERS-CoV), all of which are capable of causing severe respiratory infections and pose significant threats to humans on a global scale 4,16 . Host responses were evaluated in either in vivo (mouse) or in vitro (human cell line) model systems using three major experimental designs: longitudinal time-course, dose response, and genetic modification (involving both genetically manipulated viruses and hosts) comparisons. Host responses to these manipulated variables were assessed using transcriptomic (gene expression microarray) and proteomic (mass spectrometry) assays. Together, this collection represents the first coordinated effort to create a systems-level description of host-pathogen interactions using multiple viral strains, host models, and '-omics' technologies.

Methods
The experimental designs reported here include longitudinal (i.e. time course), dose response, and genetic modification of both virus and host strain comparisons. The intent of these designs was to assemble a unified view of the virulence mechanisms and replication strategies resulting from viral regulation of host cellular processes. For the present experiments, either transcriptomic (using gene expression microarrays) or proteomic (using liquid chromatography coupled with mass spectrometry) methodologies were used to collect the reported data. The materials used, sample preparation protocols, validation procedures, data processing, and hypothesis testing that were performed are described below. Throughout the Methods section, Study IDs are appended to indicate which studies used a given method. A summary of the overall experiment workflow and the experiment factor conditions that were compared is presented in Fig. 1. The relationship between the experiment workflow and the experimental metadata, analysis metadata, primary assay results, and derived data is also presented. Study designs for all experiments, including relevant repository identifiers, are summarized in Table 1 Table 2 (available online only) and individual protein experiment samples detailed in Table 3 (available online only). Sample tracking from animal subjects to experiment samples is given in Table 4 (available online only).

Viruses
Within the studies reported here, the host response to viral agents from two major viral families, Orthomyxoviridae and Coronaviridae, is characterized. From Orthomyxoviridae, two influenza A subtypes, H5N1 and H1N1, were chosen in order to investigate influenza viruses with higher and lower reported pathogenicity, respectively. Highly Pathogenic Avian Influenza H5N1 (HPAI) has been commonly referred to as 'Avian Flu' as it is endemic in bird populations. HPAI is known to exhibit 'stuttering transmission' in humans, with a relatively high mortality rate (>50%), but to date it has not acquired efficient human-to-human transmission. This high mortality rate, together with its pathological similarities to the 1918 H1N1 'Spanish Flu', makes it an important model for studying and preventing future lethal pandemics. On the other hand, the more recent seasonal and pandemic H1N1 strains are considered to have lower pathogenicity, rarely causing severe disease or death. Since its emergence, pandemic H1N1 has been found to recur in the human population as a seasonal influenza. H1N1 influenza isolates used in these studies include recent pandemic strains, previous seasonal strains, and a reconstructed 1918 strain.
In addition to the influenza A strains, the studies reported here investigate recently characterized Coronaviridae strains. These strains are all members of the Betacoronavirus genus, including Severe Acute Respiratory Syndrome (SARS-CoV) and Middle East Respiratory Syndrome (MERS-CoV) Coronaviruses. Both viruses cause severe acute respiratory illness, with SARS-CoV and MERS-CoV infection resulting in an overall mortality rate of ~10% and 38%, respectively. SARS-CoV was marked by its ease of transmissibility between humans, while MERS-CoV appears to be more limited in its human transmissibility. These viruses represent emerging pathogens not formerly recognized as a threat to human health, which adds an aspect of urgency to their research and characterization.
The SARS-CoV virus preparations used in these studies include a wild-type infectious clone derived from SARS-CoV (abbreviated icSARS-CoV), icSARS-CoV dORF6 in which accessory ORF6 was deleted, and Bat-SRBD, a reconstruction from a Bat-SCoV consensus genome with SARS-CoV receptor-binding domain (RBD) 27 . These strains were obtained from the Baric laboratory's existing infectious clone constructs as previously described 28-30 .

Human cell lines, infections, and extractions
Most cell line studies involving infection with influenza A viruses were conducted in Calu-3 cells, a human lung adenocarcinoma cell line (kindly provided by Raymond Pickles; University of North Carolina, Chapel Hill, NC, USA). Cells were maintained in a 1:1 mixture of Dulbecco's modified Eagle's medium and Ham's F12 nutrient medium (DF12; Invitrogen, Carlsbad, CA) supplemented with 10% fetal bovine serum. For infection, Calu-3 cells were plated in 6-well plates at 1 × 10^6 cells per well. Cells were washed after 24 hours and infected 48 hours after plating. Infections were performed by removing culture medium, washing cells twice with 1x PBS, and inoculating each well with 300 μL of DF12 supplemented with 0.3% bovine serum albumin (DF12-BSA) containing virus at a multiplicity of infection of 1 to 3 plaque forming units per cell. Cultures were mock-infected by inoculating with DF12-BSA lacking virus. Cells were then incubated for 40 min at 37°C, and after the inoculum was removed, cells were washed with 1x PBS and cultured in 2 mL of DF12-BSA containing L-(tosylamido-2-phenyl) ethyl chloromethyl ketone (TPCK)-treated trypsin. At each sample collection time point, aliquots of medium were reserved for plaque assays (see below). Time zero samples were immediately harvested using the appropriate procedure depending on the assay to be performed.
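The relationship between cell number, multiplicity of infection (MOI), and inoculum can be illustrated with a short calculation. This is a minimal sketch, not the centers' procedure: the function name and the stock titer are hypothetical, and the cell count at the time of infection is assumed to equal the number plated, which may not hold after 48 hours of culture.

```python
def virus_stock_volume_ul(cells_per_well, moi, stock_titer_pfu_per_ml):
    """Volume of virus stock (in microliters) that delivers `moi` PFU per cell.

    `stock_titer_pfu_per_ml` is a hypothetical stock titer; the source text
    does not report one.
    """
    if stock_titer_pfu_per_ml <= 0:
        raise ValueError("stock titer must be positive")
    pfu_needed = cells_per_well * moi          # total PFU required per well
    return pfu_needed / stock_titer_pfu_per_ml * 1000.0  # mL -> microliters


# Example: 1e6 cells per well at an MOI of 1 with an assumed stock of 1e8 PFU/mL.
stock_ul = virus_stock_volume_ul(1e6, 1, 1e8)
```

The computed stock volume would then be brought up to the inoculum volume with diluent (300 μL of DF12-BSA in the protocol above).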
Studies examining members of the Betacoronavirus genus were performed in a clonal population of Calu-3 cells sorted for high levels of expression of the SARS-CoV cellular receptor angiotensin-converting enzyme 2 (ACE2), referred to as Calu-3 2B4 cells (kindly provided by Chien K. Tseng, University of Texas Medical Branch, Galveston, TX) 32 . Cells were maintained in 1x MEM (Gibco catalog number 11095) supplemented with defined Fetal Bovine Serum (HyClone catalog number SH30070.03) and antibiotic/antimycotic (100x Gibco catalog number 15240). Infections of Calu-3 2B4 cells were performed in a similar manner to the above Calu-3 cells. In the case of Calu-3 2B4 cells, 1.5 × 10 6 cells per well were seeded with 3 mL of complete culture medium. Culture medium was refreshed after 24 hours and infections administered after 48 hours as above. (SCL005, SCL006).
A subset of studies involving infection with influenza A virus and members of the Betacoronavirus genus were performed in human airway epithelial (HAE) cells, which were obtained by the Cystic Fibrosis Center Tissue Culture Core from airway specimens resected from patients undergoing surgery, under protocols approved by the University of North Carolina Institutional Review Board. Primary cells were plated at a density of 2.5 × 10^5 cells per well on permeable Transwell-COL (12-mm-diameter) supports. HAE cultures were generated over 6 to 8 weeks, forming well-differentiated polarized cultures that resembled in vivo pseudostratified mucociliary epithelium 33 . Infection of HAE cultures began by washing with 1x PBS followed by challenge with 200 μL of either viral agent or mock to the apical surface. Cultures were incubated at 37°C for 2 hours and the inoculum and unbound viruses were removed by washing three times with 1x PBS. Apical wash samples were harvested to analyze viral growth kinetics and titers by plaque assay in Vero E6 cells (see below). (SHAE002, SHAE003, SHAE004).
Isolation of RNA from cell cultures was performed as follows: cell monolayers were first washed with 5 mL of cold PBS and then lysed directly in the culture dish by adding 1 mL of TRIZOL reagent. Cells were pipetted up and down until cells had a uniform color and consistency. They were then transferred to a 2 ml polypropylene tube (O-ring, screw cap), vortexed thoroughly, and incubated for 5 minutes at room temperature. Lysates were flash frozen on dry ice and stored at −80°C until shipping on dry ice. (ICL004, ICL006, ICL010, ICL011, ICL012, ECL001, SCL005, SCL006, SHAE002, SHAE003, SHAE004).
Isolation of protein from cell cultures was performed as follows: culture medium was removed and cells were washed with 1 mL cold 150 mM (pH 8.5) ammonium bicarbonate buffer. This process was repeated for a total of 3 washes. 300 μL of cold 8 M urea dissolved in 50 mM ammonium bicarbonate buffer [pH 7.8] was then added directly to the well. Cells were scraped into a pre-sterilized 2 mL siliconized tube and incubated at room temperature for 15 minutes to inactivate viruses. Samples were then frozen in liquid nitrogen or dry ice/ethanol bath and stored at −80°C until further processing. (ICL004, ICL006, ICL010, ICL011, ICL012, SCL005, SCL006).

Animal strains, infections, and extractions
The following mouse strains were obtained from the Jackson Laboratory (Bar Harbor, ME, USA): C57BL/6J (stock no. 000664), Serpine1−/− (stock no. 002507), PLAT−/− (stock no. 002508), B6;129S6-Ppp1r14c tm1Uhl /J (stock no. 013041), B6.129S2-Tnfrsf1b tm1Mwm /J (stock no. 002620), B6.129S7-Tnfrsf1b tm1Imx /J (stock no. 003243), B6.129S4-Timp1 tm1Pds /J (stock no. 006243), B6.129P2-Cxcr3 tm1Dgen /J (stock no. 005796), and B6.129-Ido1 tm1Alm /J (stock no. 005867). Young (six- to eight-week-old) female BALB/c mice were purchased from Charles River Laboratories (Wilmington, MA, USA). All animal housing, care and treatment were conducted in accordance with the University of Wisconsin-Madison School of Veterinary Medicine Animal Care and Use Committee or the University of North Carolina Institutional Animal Care and Use Committee, or under the guidance of the CDC's Institutional Animal Care and Use Committee in an animal facility accredited by the Association for Assessment and Accreditation of Laboratory Animal Care International. For virus infections, mice were anesthetized by isoflurane inhalation (VN1203) or lightly anesthetized with ketamine/xylazine (SARS-CoV and CA04) and intranasally inoculated with 50 μL of either phosphate-buffered saline (PBS) alone (mock infection) or PBS containing a viral agent. Individual or group mouse body weights were collected on a daily basis to monitor the disease course, and mice were humanely euthanized upon reaching the experimental endpoint (i.e. sample collection or severe clinical symptoms). Weight loss significance was determined by Student's t test (Microsoft Excel), and significance in survival data was determined by the Mantel-Cox test (GraphPad). Both viral titers and qPCR assays were used to confirm the infection status of all mice (see below). For more details, see animal strain and cell line descriptions given in Table 6 (available online only).
To prevent RNA degradation, immediately after dissection, lung tissues were directly submerged in RNAlater stabilization solution (Ambion, Catalog number AM7021) following manufacturer's recommendations. Specifically, tissue was cut into small chunks (<0.5 cm in any single dimension) and placed immediately into 10-20 volumes (w/v) of RNAlater. Samples were thoroughly immersed in RNAlater solution and incubated overnight at 4°C, followed by freezing at −80°C.
Homogenization of tissue for protein isolation was performed as follows: lung tissue was rinsed in 50 mM ammonium bicarbonate to remove blood and other fluid, then frozen at −80°C and stored until the next step. The tissue was homogenized in 2 mL of 8 M urea dissolved in 50 mM ammonium bicarbonate buffer [pH 7.8]. The lung homogenates were then centrifuged at low speed (e.g. 5,000 × g) to remove large pieces of debris. The supernatants were transferred to new 2 mL siliconized microcentrifuge tubes, and incubated at room temperature for 1 hour to inactivate viruses. Samples were then frozen in liquid nitrogen or dry ice/ethanol bath and stored at −80°C until further processing. (CA04M001, IM001, IM002, IM004, IM005, IM006A, IM006B, IM007, SM001).
Alternatively, mice were anesthetized with 350 μL of Avertin (2,2,2-Tribromoethanol) administered intraperitoneally. The virus inoculum was administered intranasally in a total volume of 30 μL (10^5 PFU of virus in 1x PBS). After infection, mice were kept on a heating pad until they regained consciousness. At 12 hours following infection, mice were sacrificed by CO2 inhalation. A hole was clipped in the trachea and bronchoalveolar fluid (BALF) collected. Lungs were perfused three times with 1 mL cold 1x PBS, for a total of 3 mL. Whole lungs were then removed and homogenized in 1 mL TRIzol. (SBRI_AA_E1).

Viral titers and infection confirmation
Influenza viral titers were determined by standard plaque assay 34 . Briefly, MDCK cells were plated in 6-well plates, washed twice and then inoculated with 100 μL of 10-fold diluted virus in minimum essential medium supplemented with 0.3% BSA (MEM-BSA). Cells were incubated at room temperature for 1 hour, after which the inoculum was removed. Cells were washed three times with PBS and overlaid with 1% Seaplaque agarose in MEM-BSA, supplemented with TPCK-treated trypsin (final 0.5 μg/mL). After the agarose solidified, plates were incubated upside down at 37°C and 5% CO2 for 48 hours. To visualize and count plaques, all wells were fixed with 10% phosphate-buffered formalin overnight, after which the agarose plugs were removed and plates were dried. (CA04M001, ICL004, ICL006, ICL010, ICL011, ICL012, IM001, IM002, IM004, IM005, IM006A, IM006B, IM007, IM009, IM010, SBRI_AA_E1).
Plaque assays for members of the Betacoronavirus genus were performed on supernatants collected at various times post-infection. Briefly, 5 × 10^5 Vero E6/Vero 81 cells were plated in 6-well plates. 10-fold dilutions of harvested viruses were generated in 1 mL of PBS. Then 200 μL of diluted virus was added to each well. Inoculated cells were incubated at 37°C for 1 hour (rocking gently every 15 minutes). Then 3 mL per well of agarose, consisting of 2x DMEM, 1.6% Seaplaque low melting point agarose, 10% final concentration fetal clone II for SARS-CoV and 4% for MERS-CoV, and 1x Antibiotic/Antimycotic, was added to the cells. After solidification, plates were incubated face up at 37°C and 5% CO2 for ~48 hours. 1x neutral red solution was added to the top of the agar and incubated for at least 2 hours. Neutral red was then removed and the plaques were counted. (ECL001, SCL005, SCL006, SHAE002, SHAE003, SHAE004, SM001, SM003, SM004, SM007, SM009, SM012, SM014, SM015, SM019, SM020).
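Plaque counts from either assay are conventionally converted to a titer by dividing by the product of the dilution and the inoculum volume. The sketch below illustrates that standard arithmetic; the function name and the example plaque count are hypothetical, and the source does not state how countable wells were selected or averaged.

```python
def titer_pfu_per_ml(plaque_count, dilution, inoculum_volume_ml):
    """Titer in PFU/mL from one counted well.

    `dilution` is the fraction of the original sample in the well
    (e.g. 1e-4 for the fourth 10-fold dilution);
    `inoculum_volume_ml` is the volume plated (0.1 mL for the influenza
    assay and 0.2 mL for the betacoronavirus assay described above).
    """
    if dilution <= 0 or inoculum_volume_ml <= 0:
        raise ValueError("dilution and inoculum volume must be positive")
    return plaque_count / (dilution * inoculum_volume_ml)


# Example with hypothetical counts: 42 plaques at a 1e-4 dilution, 200 uL plated.
example_titer = titer_pfu_per_ml(42, 1e-4, 0.2)
```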
In addition to viral titers, real-time PCR quantification of viral genomic and viral mRNA species was performed to confirm viral infection and for comparisons between time points both within an infection and across studies. For Betacoronavirus genus, first-strand cDNA synthesis was performed using 500 ng of total RNA and Thermo-Script reverse transcriptase (Invitrogen) according to the manufacturer's protocol. For influenza, first-strand cDNA synthesis was performed similar to above except 15 pmol of RNA-specific primers for VN1203, CA04 or NL602 nucleoprotein genomic RNA and mRNA and endogenous control were used to prime the reactions. Primers for NP RNAs were designed with unique 5′ sequences, thereby ensuring differentiation between genomic RNA and mRNA species. The reverse transcriptase reaction primer sequences used were: NP genomic RNA (VN1203, CA04, and NL602) -5′GGCCGTCATGGTGGCGAATAGCAAAAGCAGGGTAGATAATCACTC3′. NP mRNA (VN1203) -5′CCAGATCGTTCGAGTCGTTTTTTTTTTTTTTTTTCTTTAATTGTC3′. NP mRNA (NL602 and CA04) -5′CCAGATCGTTCGAGTCGTTTTTTTTTTTTTTTTTCAACTGTCATACTC3′. RPL14 (human endogenous control) -5′TTCAATCTTCTTGGCCCATC3′. RPL10 (mouse endogenous control) -5′GAACGATTTGGTAGGGTATAGGAG3′.
The qPCR assay was performed using a SYBR green kit (Applied Biosystems, Carlsbad, CA, USA) with specific primers for the different RNA species, according to the manufacturer's standard protocol. Relative RNA quantities were determined using the comparative threshold cycle (CT) method, with human RPL14 or mouse RPL10 as endogenous controls. The qPCR primer sequences for endogenous controls used were: RPL14 forward -5′TTCATCCTCAAGTTTCCGC3′; RPL14 reverse -5′TTCAATCTTCTTGGCCCATC3′; RPL10 forward -5′TGAAGACATGGTTGCTGAGAAG3′; RPL10 reverse -5′GAACGATTTGGTAGGGTATAGGAG3′.
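The comparative CT method is commonly computed as 2 raised to the negative ΔΔCT. The following minimal sketch shows that calculation under the usual assumption of approximately 100% amplification efficiency for both target and control; the function name, the choice of calibrator sample, and the CT values are illustrative and not taken from the source.

```python
def relative_quantity(ct_target_sample, ct_control_sample,
                      ct_target_calibrator, ct_control_calibrator):
    """Relative RNA quantity by the comparative CT (2^-ddCT) method.

    Each target CT is first normalized to the endogenous control
    (e.g. RPL14 or RPL10) measured in the same sample, then compared
    with the calibrator sample. Assumes ~100% PCR efficiency.
    """
    delta_sample = ct_target_sample - ct_control_sample
    delta_calibrator = ct_target_calibrator - ct_control_calibrator
    delta_delta_ct = delta_sample - delta_calibrator
    return 2.0 ** (-delta_delta_ct)


# Hypothetical CT values: the target appears 5 cycles earlier (after
# normalization) in the sample than in the calibrator.
fold = relative_quantity(20.0, 18.0, 25.0, 18.0)
```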

Isolation of total RNA and quality control
Isolation of RNA from Trizol lysates was performed using the Qiagen RNeasy Mini kit (Qiagen Inc., Valencia, CA). The steps of the protocol were performed as follows: lysates were mixed by vortexing followed by a 5-minute incubation at RT. To each sample, 200 μL of chloroform per mL Trizol was added and the tubes were shaken vigorously for 15 seconds. In many cases only a portion of the 1 mL Trizol lysate was used and the volume of chloroform was adjusted accordingly. After incubation for 10 minutes at RT, samples were spun at 12,000 g for 15 minutes at 4°C. The aqueous phase was transferred to a clean tube and an equal volume of RLT + BME added. The volume was then doubled by adding 70% ethanol and mixed well by pipetting. This solution was applied to an RNeasy mini column and the RNA was isolated from the column per the manufacturer's instructions. Appropriate amounts of RNA for quality control were removed and the remaining RNA was frozen on dry ice and stored at −80°C. Determination of RNA quality and quantity was done using an Agilent Bioanalyzer 2100 and a Nanodrop spectrophotometer following the manufacturers' recommendations. (CA04M001, ICL004, ICL006, ICL010, ICL011, ICL012, IM001, IM002, IM004, IM005, IM006A, IM006B, IM007, IM009, IM010, ECL001, SCL005, SCL006, SHAE002, SHAE003, SHAE004, SM001, SM003, SM004, SM007, SM009, SM012, SM014, SM015, SM019, SM020). Alternatively, chloroform (0.2 mL per 1 mL TRIzol) was added to samples, tubes were inverted 20 times, and incubated at room temperature for 3 minutes. Samples were centrifuged at 12,000 g for 15 minutes at 4°C. The aqueous phase (top phase) was transferred to a new tube and an equal volume of isopropyl alcohol was added and samples incubated for 10 minutes. Tubes were centrifuged at 12,000 g for 10 minutes at 4°C. The RNA pellet was washed once with 70% ethanol and re-suspended in 100 μL of H2O, 1 μL of glycogen and 1/10 volume of 3 M NaAc.
RNA was re-precipitated with an equal volume of 100% ethanol and washed 2 times with 70% ethanol. The dried pellet was resuspended in 30 μL water. The RNA concentration and purity were determined by absorbance at 260 nm, 270 nm, and 280 nm using a plate reader. RNA integrity was determined using a BioAnalyzer (Agilent, Santa Clara, CA). (SBRI_AA_E1).

Labeling and quantification of total RNA for Agilent microarray
Fluorescently labeled probes were generated from each total RNA sample using the Agilent one-color Low Input Quick Amp Labeling Kit (Agilent Technologies) following the manufacturer's instructions. The cRNA probe synthesis began with adding 100 ng total intact RNA to 3.25 μL water, combining RNA and Master Mix (consisting of 1.0 μL H2O, 0.25 μL Spike-in (dilution 2), and 0.8 μL T7 Promoter Primer, for a total volume of 2.05 μL per reaction), denaturing for 10 minutes at 65°C, and incubating at 4°C for 5 minutes. Next, double-stranded cDNA was synthesized by combining the denatured RNA-primer mixture from above with 4.7 μL per reaction of First Strand mix (consisting of 2.0 μL 5X First Strand Buffer, 1.0 μL 0.1 M DTT, 0.5 μL 10 mM dNTP mix, and 1.2 μL AffinityScript RNase Block Mix) at room temperature followed by incubations at 40°C for 2 hours, 70°C for 15 minutes, and 4°C for 5 minutes.
Labeling of each sample was done by adding 6.0 μL of transcription master mix (consisting of 0.75 μL H2O, 3.2 μL 5X Transcription Buffer, 0.6 μL 0.1 M DTT, 1.0 μL NTP Mix, 0.21 μL T7 RNA Polymerase Blend, and 0.24 μL of Cy3 per reaction), mixing by pipette, and incubating at 40°C for 2 hours in the dark. Next, probes were purified with the Qiagen RNeasy kit (RNA cleanup protocol). After 84 μL of nuclease-free water was added (to 100 μL total volume), samples were transferred to 1.5 mL microfuge tubes. Next, 350 μL of RLT and 250 μL of 100% ethanol were added and mixed thoroughly by pipetting. 700 μL of this mixture was then applied to a column and centrifuged for 30 seconds at 13,000 rpm at 4°C (discard flow through and collection tube). The column was then transferred to a new collection tube, 500 μL of RPE washing buffer was added, and the column was centrifuged for 30 seconds at 13,000 rpm at 4°C (discard flow through, reuse tube). Another 500 μL of RPE was added and the column was centrifuged for 60 seconds at 13,000 rpm at 4°C, then transferred to a clean microfuge tube and spun for 30 seconds at maximum speed to dry the column.
Quantification of cRNA probes was done by Nanodrop spectrophotometry. For each sample, 1.2 μL of purified probe was measured and the cRNA concentration (ng/μL) and dye incorporation (pmol/μL) were recorded. Specific activity (pmol of dye per μg of cRNA) was calculated as (dye incorporation / cRNA concentration) × 1000. Samples with a specific activity greater than 8.0 were used for hybridization.
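The specific activity calculation and acceptance threshold can be written out as a small check. This is an illustrative sketch: the function names and the example Nanodrop readings are hypothetical, while the formula and the 8.0 cutoff come from the text above.

```python
def specific_activity(dye_pmol_per_ul, crna_ng_per_ul):
    """Specific activity in pmol dye per microgram cRNA.

    (pmol/uL) / (ng/uL) gives pmol per ng; multiplying by 1000
    converts to pmol per microgram.
    """
    if crna_ng_per_ul <= 0:
        raise ValueError("cRNA concentration must be positive")
    return dye_pmol_per_ul / crna_ng_per_ul * 1000.0


def passes_labeling_qc(dye_pmol_per_ul, crna_ng_per_ul, threshold=8.0):
    """True when the probe exceeds the specific activity threshold."""
    return specific_activity(dye_pmol_per_ul, crna_ng_per_ul) > threshold


# Hypothetical readings: 2.0 pmol/uL dye and 200 ng/uL cRNA.
activity = specific_activity(2.0, 200.0)
```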

Agilent microarray hybridization and scanning
In summary, 250 ng of each RNA sample was hybridized to either an Agilent 4x44K mouse whole-genome oligonucleotide microarray G4122F (design ID 014868) or 4x44K human HG (design ID 014850) array according to the manufacturer's instructions (details below).
Prior to hybridization, cRNA was first fragmented by mixing Cy3-cRNA per reaction volume of 55.0 μL (consisting of Cy3-cRNA 1650 ng, 11 μL of 10X Blocking Reagent, and Nuclease free H 2 O; Bring volume to 52.8 μL), with fragmentation buffer (consisting of 2.2 μL 25X fragmentation buffer) and then incubating at 60°C for 30 minutes in the dark. 55 μL of 2X Gene Expression Hybridization buffer HI-RPM was then added (Total volume = 110 μL). Samples were mixed well by pipetting, and spun for 1 minute at room temperature at 13,000 rpm.
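The volumes in the fragmentation and hybridization mix depend on the measured cRNA concentration. The sketch below works through that arithmetic for one reaction using the quantities stated above (1650 ng Cy3-cRNA, 11 μL 10X Blocking Reagent, water to 52.8 μL, 2.2 μL 25X fragmentation buffer, 55 μL 2X hybridization buffer); the function name and the example concentration are hypothetical.

```python
def hybridization_mix(crna_ng_per_ul, crna_ng=1650.0):
    """Per-reaction volumes (microliters) for fragmentation and hybridization."""
    blocking_ul = 11.0         # 10X Blocking Reagent
    pre_fragment_ul = 52.8     # volume before adding fragmentation buffer
    fragmentation_ul = 2.2     # 25X fragmentation buffer
    hyb_buffer_ul = 55.0       # 2X Gene Expression Hybridization buffer HI-RPM

    crna_ul = crna_ng / crna_ng_per_ul
    water_ul = pre_fragment_ul - blocking_ul - crna_ul
    if water_ul < 0:
        raise ValueError("cRNA too dilute to fit 1650 ng into the reaction")
    total_ul = pre_fragment_ul + fragmentation_ul + hyb_buffer_ul
    return {"crna_ul": crna_ul, "water_ul": water_ul, "total_ul": total_ul}


# Hypothetical probe at 200 ng/uL.
mix = hybridization_mix(200.0)
```

With these inputs the total comes to 110 μL, matching the total volume given in the protocol.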
Following fragmentation, 100 μL of each fragmented cRNA sample from above was transferred onto the center of each of the wells for hybridization (the 4x44K format has 4 wells per slide with one sample per well). A slide was carefully lowered, bar code facing up, onto the bead of liquid, followed by the metal chamber cover and clamp. The chamber was rotated vertically to wet the gasket and make sure that all bubbles were mobile. Two chambers were loaded at a time into the hybridization oven, set at 65°C and level 10-rotation speed (10 rpm) for 17 hours. Slides were then washed once with slide cover on in Gene Expression Wash Buffer 1, then the cover was pried off and the slide washed in Gene Expression Wash Buffer 1 for a second time. Next, slides were washed in Gene Expression Wash Buffer 1 on a stir plate for ~1 minute and placed into pre-warmed Gene Expression Wash Buffer 2 on a stir plate for ~1 minute. The excess liquid was blotted onto a Kimwipe and the dry slides placed into a light-tight slide box to be scanned at once. Dry slides were scanned on an Agilent DNA microarray scanner (model G2505B) using the XDR setting. Raw images were analyzed using the Agilent Feature Extraction software (version 9.5.3.1) and the GE1-v5_95_Feb07 extraction protocol. (CA04M001, ICL004, ICL006, ICL010, ICL011, ICL012, IM001, IM002, IM004, IM005, IM006A, IM006B, IM007, IM009, IM010, ECL001, SCL005, SCL006, SHAE002, SHAE003, SHAE004, SM001, SM003, SM004, SM007, SM009, SM012, SM014, SM015, SM019, SM020).

Agilent microarray processing and expression analysis
Normalization and QA/QC: Raw data consisted of the output from the Agilent Feature Extraction (AFE) software. Data were read into the R/Bioconductor Framework and visualization of raw intensity level was performed using boxplots, histograms, and correlation plots. All intensities were background corrected using the norm-exp method and quantile normalized using Agi4x44PreProcess and RMA Bioconductor packages. Distribution of control probes was examined and then those probes were removed. Descriptive statistics were reported for non-control replicate probes and the intensity distributions were examined via histograms. All non-control replicate probes were then summarized by taking the mean intensity measurement. To assess the quality of the reading for each probe, five Agilent QC flags were considered. The first is a test for minimal spot diameter and signal greater than 1.5 times background noise (gIsFound); the second requires positive intensity and background-subtracted intensity greater than 2.6 times background noise (gIsWellAboveBG); the third is true if more than 50% of spot pixels are greater than the saturation threshold (gIsSaturated); the fourth examines variance of spot pixels (gIsFeatNonUnifOL); the fifth checks if a replicate probe is greater than 1.42 times the interquartile range (gIsFeatPopnOL).
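For readers unfamiliar with quantile normalization, the idea can be sketched in a few lines: every array is forced onto a common reference distribution, taken as the mean of the sorted intensities across arrays. This is a simplified illustration in plain Python, not the Agi4x44PreProcess or RMA implementation used in these studies; in particular, tied values are ranked in input order rather than averaged, and the example intensities are made up.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length intensity lists.

    Returns new lists in which the value at each rank is replaced by
    the mean, across arrays, of the values holding that rank.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]

    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical arrays with three probes each.
result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalization both arrays contain exactly the same set of values, differing only in which probe holds which value.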
To be retained, a given probe had to pass all 5 QC flags in all treatment (i.e., infected) replicates of at least one experimental time point. This criterion was conservative with respect to probe quality, yet lenient enough to detect transcripts that may be expressed only briefly (i.e., at as little as one time point) during an infection time course. It was imperative that this filtering occurred before differential expression analysis so that low intensity (below background) and low quality probes were not considered in the analysis. Final QA/QC involved CV plots, heatmaps, cluster plots, scatter plots and PCA as needed in order to assess if there were sample outliers or additional batch effects.
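The probe-retention rule can be expressed as a short predicate. The sketch below is one reading of that rule and rests on an assumption the text does not spell out: that "passing" means gIsFound and gIsWellAboveBG are set while the three outlier/saturation flags are not. The data layout (a dictionary of time points, each holding one flag dictionary per infected replicate) is hypothetical.

```python
# Assumed polarity of the five Agilent Feature Extraction flags.
MUST_BE_SET = ("gIsFound", "gIsWellAboveBG")
MUST_BE_CLEAR = ("gIsSaturated", "gIsFeatNonUnifOL", "gIsFeatPopnOL")


def replicate_passes(flags):
    """True if one replicate measurement passes all five QC flags."""
    return (all(flags[name] for name in MUST_BE_SET)
            and not any(flags[name] for name in MUST_BE_CLEAR))


def probe_passes(infected_flags_by_time):
    """True if all infected replicates pass at one or more time points.

    `infected_flags_by_time` maps a time point to a list of flag
    dictionaries, one per infected replicate.
    """
    return any(
        len(replicates) > 0 and all(replicate_passes(r) for r in replicates)
        for replicates in infected_flags_by_time.values()
    )


# Hypothetical probe: fails at 0 h (one replicate not above background),
# passes in every replicate at 24 h, so it is retained.
good = dict(gIsFound=1, gIsWellAboveBG=1, gIsSaturated=0,
            gIsFeatNonUnifOL=0, gIsFeatPopnOL=0)
weak = dict(good, gIsWellAboveBG=0)
keep = probe_passes({"0h": [good, weak, good], "24h": [good, good, good]})
```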
Differential expression analysis was performed by comparing infected samples to time-matched mock-infected controls, based on a linear model fit for each transcript/probe using the R package Limma 35 . Fold changes were calculated by taking the average log2 probe intensity of infected replicates and subtracting the average log2 probe intensity of mock-infected replicates from the same time point. Significant differential expression of a probe was defined using the following criteria: absolute log2 fold change >1.5 and a false discovery rate (FDR)-adjusted P value of <0.05 for a given time point.
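The fold change definition and significance call can be illustrated as follows. This sketch covers only the arithmetic: the adjusted P value is taken as an input because Limma's moderated statistics are not reproduced here, the threshold is read as applying to the absolute log2 fold change, and the function names and intensities are hypothetical.

```python
from statistics import mean


def log2_fold_change(infected_log2, mock_log2):
    """Mean log2 intensity of infected replicates minus that of
    time-matched mock-infected replicates."""
    return mean(infected_log2) - mean(mock_log2)


def is_differentially_expressed(log2_fc, adjusted_p,
                                fc_cutoff=1.5, fdr_cutoff=0.05):
    """Apply the fold change and FDR thresholds for one probe at one time point."""
    return abs(log2_fc) > fc_cutoff and adjusted_p < fdr_cutoff


# Hypothetical log2 intensities for one probe at one time point.
fc = log2_fold_change([10.0, 10.5, 9.5], [8.0, 8.0, 8.0])
call = is_differentially_expressed(fc, adjusted_p=0.01)
```

Because the values are already on the log2 scale, subtracting means corresponds to a ratio of geometric means on the linear scale.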

Positive-strand cDNA synthesis, fragmentation, and single-stranded cDNA for Affymetrix microarray analysis
cDNA was synthesized from isolated RNA (250 ng from each sample) using the Ambion WT kit (Life Technologies, Grand Island, NY, USA; http://tools.lifetechnologies.com/content/sfs/manuals/cms_064619.pdf) following the manufacturer's protocol. Briefly: First strand cDNA was prepared from total RNA using reverse transcriptase and specific primers (5 μL of First-Strand master mix) resulting in single stranded cDNA which contains a T7 promoter sequence. Second strand cDNA was synthesized using Second-Strand master mix and the newly created double stranded DNA was isolated. Antisense cRNA was synthesized from the second strand DNA using T7 RNA polymerase and then stored at 4°C overnight. In the next step, cRNA was purified using cRNA Binding Mix and the yield was determined by absorbance at 260 nm. Purified cRNA was used to synthesize positive sense strand cDNA using reverse transcriptase and random primers. In this reaction, dUTP was added at a fixed ratio relative to dTTP. cRNA was then hydrolyzed using RNase H, cDNA was purified using cDNA Binding Mix, and cDNA yield was determined by absorbance at 260 nm.
Single stranded cDNA was prepared for chip hybridization using the Affymetrix GeneChip WT Terminal Labeling Kit (Affymetrix, Santa Clara, CA, http://www.umich.edu/~caparray/Files/protocols/affymetrix/Ambion%20WT%20Labelling%20and%20Hyb.pdf) following the manufacturer's protocol. Briefly: Single-stranded cDNA was fragmented using Fragmentation master mix from the kit, then

Affymetrix microarray processing and expression analysis
Probe intensities were measured using the Affymetrix GeneChip Scanner 3000 and processed into CEL files using the Affymetrix GeneChip Operating Software. Microarrays were normalized at the gene level using the BrainArray custom CDF (Entrez Gene, Version 14) for probeset definitions and RMA as implemented in the justRMA function of the Bioconductor package affy for background adjustment, quantile normalization, and summarization. Linearized data were imported into Genedata Analyst software (Genedata AG, Basel, Switzerland). ANOVA was performed on all samples (100 balanced permutations) and genes with a permutation Q-value of 0.01 or less and fold change of at least 2 (between any two groups) were exported. (SBRI_AA_E1).

Preparation of protein extracts for proteomic analysis
Protein concentrations of cell lysates or lung tissue homogenates were determined by BCA protein assay. Final protein amounts ranged from 100 to 226 μg, with an average of 150 μg per sample. Each sample was then diluted to uniform volume in 50 mM ammonium bicarbonate, pH 7.8. Proteins were reduced with 10 mM dithiothreitol, followed by alkylation of free sulfhydryl groups with 40 mM iodoacetamide at 37°C in the dark; each reaction was performed for 1 hour at 37°C with constant shaking at 550 rpm. Denatured and reduced samples were diluted 10-fold with 50 mM ammonium bicarbonate, pH 7.8, and CaCl2 was added to a final concentration of 1 mM prior to enzymatic digestion. Sequencing-grade modified trypsin was activated by adding 20 μL of 50 mM ammonium bicarbonate, pH 7.8, to 20 μg lyophilized trypsin and incubating for 10 min at 37°C. Activated trypsin was added to the samples at 1:50 (w/w) trypsin-to-protein ratio, and samples were digested at 37°C for 3 hours with constant shaking at 800 rpm; reactions were quenched by rapid freezing in liquid nitrogen. Digested samples were desalted using solid phase extraction columns (Discovery C18, Supelco, Bellefonte, PA, USA), which were conditioned with 3 mL of methanol and rinsed with 2 mL of 0.1% trifluoroacetic acid (TFA) in water. Digest-loaded columns were washed with 4 mL of H2O/acetonitrile (95:5, v/v) containing 0.1% TFA, and peptides were eluted with 1 mL of acetonitrile/H2O (80:20, v/v) containing 0.1% TFA. Samples were concentrated to 100 μL in vacuo (Speed-Vac SC 250 Express, Thermo Savant, Holbrook, NY, USA), and a BCA protein assay was performed to verify final peptide concentrations. Samples were stored at −80°C until either strong cation exchange fractionation followed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) or quantitative LC-MS analyses. (CA04M001, ICL004, ICL006, ICL010, ICL011, ICL012, IM001, IM002, IM004, IM005, IM006A, IM006B, IM007, SCL005, SCL006, SM001).

Reversed-phase capillary LC-MS/MS and LC-MS analyses
Capillary LC-MS/MS analysis was used to generate separate accurate mass and time (AMT) tag databases for virus-infected cells and lung homogenates (see below). For this, dried peptide fractions from pooled samples were reconstituted in 30 μL of 25 mM ammonium bicarbonate, pH 7.8, and analyzed using a 4-column custom-built capillary LC system coupled online to a linear ion trap mass spectrometer (LTQ; Thermo Scientific, San Jose, CA, USA) by way of an in-house manufactured electrospray ionization interface. Electrospray emitters were custom made using 150 μm outer diameter (o.d.) x 20 μm inner diameter (i.d.) chemically etched fused silica. Reversed-phase capillary columns were prepared by slurry packing 3-μm Jupiter C18 bonded particles (Phenomenex, Torrance, CA, USA) into a 75 μm x 65 cm fused silica capillary (Polymicro Technologies, Phoenix, AZ, USA) using 0.5 cm sol-gel plugs for particle retention. Mobile phases consisted of (a) 0.1% formic acid in water and (b) 0.1% formic acid in acetonitrile, and they were degassed on-line using a Degasys Model DG-2410 vacuum degasser (Dionex, Germany); the HPLC system was equilibrated at 10,000 psi with 100% mobile phase (a) for initial starting conditions. After loading 2.5 μg of peptides onto the column, the mobile phase was held at 100% mobile phase (a) for 50 min. Exponential gradient elution was initiated 50 min after sample loading with a column flow rate of 300 nL/min, and the mobile phase was ramped from 0% to 55% mobile phase (b) over 100 min using a 2.5 mL stainless steel mixing chamber, followed by a rapid increase to ~100% (b) for 10 min to wash the column. To identify the eluting peptides, the LTQ was operated in a data-dependent MS/MS mode (400-2,000 m/z), in which a full MS scan was followed by ten MS/MS scans using a normalized collision energy of 35%.
www.nature.com/sdata/ SCIENTIFIC DATA | 1:140033 | DOI: 10.1038/sdata.2014.33
A dynamic exclusion window of 1 min was used to discriminate against previously analyzed ions. The temperature of the heated capillary and the electrospray ionization (ESI) voltage were 200°C and 2.2 kV, respectively.
Following AMT tag database generation, capillary LC-MS analyses were performed on all virus-infected and mock-infected samples to generate quantitative data using the AMT tag approach. For this, individual dried peptide samples (see above) were reconstituted in 30 μL of 25 mM ammonium bicarbonate, pH 7.8, and analyzed in either duplicate or triplicate and in random order using identical chromatographic and electrospray conditions as for capillary LC-MS/MS analyses. The LC system was interfaced to an Exactive Orbitrap mass spectrometer (Thermo Scientific), and the temperature of the heated capillary and the ESI voltage were 250°C and 2.2 kV, respectively. Data were collected over the mass range 400-2,000 m/z. (CA04M001, ICL004, ICL006, ICL010, ICL011, ICL012, IM001, IM002, IM004, IM005, IM006A, IM006B, IM007, SCL005, SCL006, SM001).

Development of the AMT tag database for virus-infected model systems
A novel AMT tag database was generated for peptides within the specific model system, using mock-infected and virus-infected samples. To generate the AMT tag database, aliquots of the virus-infected or mock-infected samples were combined to make the appropriate sample pools. Each pool was subjected to strong cation exchange fractionation as described above, and each fraction was analyzed by capillary LC-MS/MS. The SEQUEST analysis software was used to match the MS/MS fragmentation spectra with sequences from the appropriate UniProt/Swiss-Prot protein database. When searching, SEQUEST used a dynamic mass modification on methionine residues corresponding to oxidation (15.9949 Da) and a static mass modification on cysteinyl residues to account for alkylation by iodoacetamide (57.0215 Da). Peptides passing the following filter criteria were stored as AMT tags in a Microsoft SQL Server database: 1) a SEQUEST DelCn2 value ≥ 0.10 (normalized Xcorr difference between the top scoring peptide and the second highest scoring peptide in each MS/MS spectrum) and 2) a database-specific SEQUEST correlation score (Xcorr). For the mouse lung tissue database, the required Xcorr was ≥ 1.6, 2.4, and 3.2 for fully tryptic peptides with 1+, 2+, and 3+ charge states, respectively, and ≥ 4.3 and 4.7 for partially tryptic or non-tryptic protein terminal peptides with 2+ and 3+ charge states, respectively. For the human cell line database, the required Xcorr was ≥ 2, 2.6, and 3.5 for fully tryptic peptides with 1+, 2+, and 3+ charge states, respectively, and ≥ 2.5, 3.6, and 4.1 for partially tryptic or non-tryptic protein terminal peptides with 1+, 2+, and 3+ charge states, respectively. These filter criteria resulted in estimated peptide false discovery rates of <2% based on target-decoy database searches for both lung and cell samples. Nontryptic peptides were excluded, and a minimum peptide length of 6 amino acid residues was required.
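The filter criteria listed above lend themselves to a small lookup table. The sketch below encodes the stated thresholds; the function name, the database keys, and the cleavage labels ("full", "partial") are hypothetical, and rejecting any charge/cleavage combination for which no threshold is listed (for example, 1+ partially tryptic peptides in the mouse lung database) is an assumption rather than a documented rule.

```python
# Minimum Xcorr by database, cleavage class, and charge state,
# transcribed from the filter criteria in the text.
XCORR_MIN = {
    "mouse_lung": {
        "full": {1: 1.6, 2: 2.4, 3: 3.2},
        "partial": {2: 4.3, 3: 4.7},
    },
    "human_cell": {
        "full": {1: 2.0, 2: 2.6, 3: 3.5},
        "partial": {1: 2.5, 2: 3.6, 3: 4.1},
    },
}
DELCN2_MIN = 0.10   # minimum DelCn2 value
MIN_LENGTH = 6      # minimum peptide length in residues


def passes_amt_filter(database, sequence, charge, cleavage, xcorr, delcn2):
    """Return True if a peptide identification would be stored as an AMT tag."""
    if len(sequence) < MIN_LENGTH or delcn2 < DELCN2_MIN:
        return False
    threshold = XCORR_MIN[database].get(cleavage, {}).get(charge)
    if threshold is None:
        # Assumption: combinations without a listed threshold are rejected.
        return False
    return xcorr >= threshold
```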
The elution times for these peptides were normalized to a range of 0 to 1 using a predictive peptide LC normalized elution time (NET) model and linear regression. A NET average and standard deviation were assigned to each identified peptide if the same peptide was observed in multiple analyses. Both calculated monoisotopic masses and observed NETs of identified peptides were included in the AMT tag database. From the primary data records corresponding to the data used to generate the AMT tag database (PASS00416 and PASS00417), the user can download the raw data, the assembled database, and parameter files necessary to recreate the AMT tag approach. In addition to the data files, a detailed tutorial has been made available for download. (CA04M001, ICL004, ICL006, ICL010, ICL011, ICL012, IM001, IM002, IM004, IM005, IM006A, IM006B, IM007, SCL005, SCL006, SM001).

Processing of quantitative LC-MS datasets
Quantitative LC-MS datasets were processed using the PRISM Data Analysis system 37 , which is a series of software tools developed in-house (e.g. Decon2LS 38 and VIPER 39 ; freely available at http://ncrr.pnl.gov/software/). Individual steps in this data processing approach are reviewed elsewhere 40 . The peptide identities for detected features in each dataset (i.e. a single LC-MS analysis) were determined for features matched to AMT tags with high confidence based upon the accurate measured monoisotopic masses and NETs for each of the peptides in the filtered AMT tag databases within initial search tolerances of ± 6 ppm and ± 0.025 NET for monoisotopic mass and elution time, respectively. The peptides identified from this matching process were retained as matrices for subsequent data analysis. (CA04M001, ICL004, ICL006, ICL010, ICL011, ICL012, IM001, IM002, IM004, IM005, IM006A, IM006B, IM007, SCL005, SCL006, SM001).
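The core of the matching step, comparing each detected feature against the AMT tag database within the stated mass and NET tolerances, can be sketched as follows. This is a simplified tolerance check only; it does not reproduce the VIPER software, and the function names and tuple layouts are hypothetical.

```python
def ppm_error(observed_mass, reference_mass):
    """Mass error of an observation relative to a reference, in ppm."""
    return (observed_mass - reference_mass) / reference_mass * 1e6


def match_features(features, amt_tags, ppm_tol=6.0, net_tol=0.025):
    """Match LC-MS features to AMT tags within mass and NET tolerances.

    features : list of (monoisotopic_mass, net) tuples for one LC-MS run.
    amt_tags : list of (peptide, monoisotopic_mass, net) tuples.
    Returns a list of (feature_index, peptide) pairs.
    """
    matches = []
    for idx, (mass, net) in enumerate(features):
        for peptide, tag_mass, tag_net in amt_tags:
            mass_ok = abs(ppm_error(mass, tag_mass)) <= ppm_tol
            net_ok = abs(net - tag_net) <= net_tol
            if mass_ok and net_ok:
                matches.append((idx, peptide))
    return matches
```

A feature at mass 1000.004 and NET 0.31 would match a tag at mass 1000.0 and NET 0.30 (4 ppm and 0.01 NET apart), whereas a feature 20 ppm away or 0.10 NET away would not.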

Proteomic data statistical processing
The integrated LC-MS peak intensity values (i.e. abundances) for the final peptide identifications were processed in a series of steps using MATLAB R2010b that included quality control, normalization, protein quantification, and comparative statistical analyses. Quality control processing was performed to identify and remove peptides with an insufficient amount of data across the set of samples 41 , and LC-MS runs that showed significant deviation from the standard behavior of all LC-MS analyses 42 , using a significance level of 0.0001 (ref. 43). The peptide abundance values were normalized across the technical replicates using a rank invariant subset of peptides (p-value threshold of 0.1) followed by median absolute deviation centering of the data. Normalized log10 abundance values were averaged across the technical replicates within each biological sample. Proteins were quantified using a standard R-Rollup method 44 using the most abundant reference peptide, after filtering out peptides that were redundant, had low data content, or were outside the dominant significance pattern. Comparative statistical analyses of time-matched mock-infected and virus-infected samples were performed using a Dunnett adjusted t-test to assess differences in protein average abundance. (CA04M001, ICL004, ICL006, ICL010, ICL011, ICL012, IM001, IM002, IM004, IM005, IM006A, IM006B, IM007, SCL005, SCL006, SM001).
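To give a sense of how a reference-peptide rollup works, the sketch below scales each peptide of a protein to the most abundant peptide and takes the per-sample median. It is a simplified illustration in the spirit of R-Rollup, not the published implementation: the peptide filtering steps are omitted, the reference is chosen by highest mean log10 abundance (an assumption), and the function name is hypothetical.

```python
from statistics import mean, median


def rollup_to_protein(peptides):
    """Estimate per-sample protein abundance from peptide abundances.

    peptides : dict mapping peptide -> list of log10 abundances in a fixed
               sample order, with None marking a missing value.
    Returns a list of protein abundances (None where nothing was observed).
    """
    def mean_observed(name):
        observed = [v for v in peptides[name] if v is not None]
        return mean(observed) if observed else float("-inf")

    # Assumption: the reference is the peptide with the highest mean abundance.
    reference = max(peptides, key=mean_observed)
    ref_vals = peptides[reference]

    scaled = []
    for vals in peptides.values():
        # Log-scale offsets to the reference in samples where both are seen.
        offsets = [r - v for r, v in zip(ref_vals, vals)
                   if r is not None and v is not None]
        if not offsets:
            continue  # no shared samples, so the peptide cannot be scaled
        shift = median(offsets)
        scaled.append([None if v is None else v + shift for v in vals])

    protein = []
    for s in range(len(ref_vals)):
        observed = [row[s] for row in scaled if row[s] is not None]
        protein.append(median(observed) if observed else None)
    return protein
```

Because the values are on a log scale, shifting a peptide by the median offset is equivalent to multiplying its raw abundance by a median ratio to the reference.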

Data Records
Study designs for all experiments, including relevant repository identifiers, are summarized in Table 1 (Available online only), with individual RNA experiment samples detailed in Table 2 (Available online only) and individual protein experiment samples detailed in Table 3 (Available online only). Sample tracking from animal subjects to experiment samples is given in Table 4 (Available online only). Primary assay results (txt and mzXML files) are archived in public repositories as listed below (Data Record 1 and 2). Derived data resulting from processing and hypothesis testing, accompanied by structured metadata describing both the experimental and data analysis methods, are available at the Influenza Research Database and the Virus Pathogen Resource (Data Record 3).

Technical Validation
Two approaches were used to establish the technical validity of the datasets described. First, the biological samples used to generate the transcriptomic and proteomic assay results reported here have also been assessed using various criteria to ensure technical validity. For example, productive viral infections following inoculation of animals used for in vivo studies were confirmed by examining viral titers, viral mRNA levels, and body weight loss to quantify virus replication and host response to infection as detailed below. For in vitro studies, viral genomic RNA, viral mRNA, and viral titers were quantified to ensure that the virus was actively replicating and transcribing its genome in cell culture. Second, the processed data produced from in vivo and in vitro studies were compared, using Boolean logic, to quantify the number of overlapping significant genes identified in multiple studies performed under similar experimental conditions as an assessment of general reproducibility of the derived results between experiments.

In vivo validation
To validate the in vivo studies, mean viral mRNA levels, mean % body weight, and mean viral titers were examined as surrogates for viral infection in the animal subjects. The CA04M001 and IM001 experiments, in which mice were infected with influenza A virus strains A/California/04/2009 (pandemic H1N1) and A/Viet Nam/1203/2004 (H5N1) respectively, were used as representative datasets for this validation process. For both experiments, the viral mRNA levels indicate that the virus was actively transcribing mRNA within the host organism with mean viral mRNA levels peaking at either 2 or 4 days post infection (dpi), when either a high or low initial inoculum was used, respectively 8 . Similarly, the mice that were exposed to virus in either experiment showed a dose-dependent decrease in mean % body weight. The viral growth kinetics for A/California/04/2009 in the CA04M001 study showed mean viral titers peaking at 2 dpi regardless of the viral concentration present in the initial inoculation. In contrast, the growth kinetics for A/Viet Nam/1203/2004 used in the IM001 study showed mean viral titers peaking at 2 dpi when the initial inoculum contained either 10^3 or 10^4 plaque forming units (PFU) of virus, whereas mean viral titers were highest at 4 dpi when the initial inoculum contained 10^2 PFU of virus.

In vitro validation
For in vitro studies, measurements from study ICL004, performed in Calu-3 cells infected with the A/Viet Nam/1203/2004 influenza virus strain, were used as representative of all studies. Mean viral genomic RNA levels increased logarithmically between 0 and 12 hours post infection (hpi), eventually peaking at 18 hpi before decreasing at 24 hpi. Similarly, mean viral mRNA levels increased exponentially between 0 and 7 hpi, where they peaked before decreasing through the 24-hour time point. Mean viral titers were observed to increase logarithmically between the 7- and 12-hour time points, and then continued to rise through 24 hpi 8 .
Overall, the metrics used to validate the in vivo and in vitro studies indicate that the biological samples that were taken from these sample sources were productively infected with the intended virus. Consequently, any downstream assays that were performed on these biological samples should give rise to biologically valid results.

Reproducibility of derived results
The primary assay results obtained from the transcriptomic and proteomic experiments were processed to identify specific RNAs or peptides that were differentially expressed under a particular set of conditions (e.g. virus amount, time post infection, virus strain, etc.) in comparison with relevant control samples to derive so-called 'Host Factor Biosets'. Another measure of technical validity of the combined laboratory and analytical workflow is that the Host Factor Biosets derived from similar experiment samples treated in a similar way should have similar membership, thereby indicating experimental reproducibility. To examine experimental reproducibility, we determined the overlap that was observed between all experiments that used the same virus strains in the same infection system. In the evaluation reported here, we used datasets reporting results from wild-type A/California/04/2009 (H1N1), A/Viet Nam/1203/2004 (H5N1), and SARS-CoV as representatives of all experiments. In order to avoid complications related to slight differences in infection kinetics, we took the union of all biosets across the entire time course from each experiment (i.e. the non-redundant set of differentially expressed genes) for comparison. This approach can lead to a discrepancy between the total number of genes in each experiment due to differences in the number of time points sampled. As an alternative strategy, the data could also be more strictly paired using specific time points and dosages.
Seven situations in which the same treatment conditions (viral strain and viral amount) were reproduced in separate experiments were identified. Since the magnitude of the responses and the number of time points differ between experiments, we calculated the percentage overlap of the number of host factors using the smallest set as the denominator for the sake of comparison. This may not be an ideal comparison, since it would tend to result in higher reproducibility values than if the largest set were used as the denominator. However, since we are primarily interested in measuring the commonality between similar experiments that have slight variations in study designs and are inherently noisy, we chose to focus on determining whether the smaller set is largely a subset of the larger sets, which would demonstrate that these significant hits are reproducible even in the presence of confounding variables such as biological and technical variation. For transcriptomic experiments the following overlap was observed: 83% for A/California/04/2009 in Calu-3 (Fig. 2a), 91% for A/Viet Nam/1203/2004 in Calu-3 (Fig. 2b, 3-way comparison), 62% for A/California/04/2009 in C57BL/6 (Fig. 2c), 77% for A/Viet Nam/1203/2004 in C57BL/6 (Fig. 2d), 72% for A/California/04/2009 in HAE cells (Fig. 2e), and 40% for SARS-CoV in HAE cells (Fig. 2f). For the one replicated proteomic experiment, in which A/Viet Nam/1203/2004 was used to infect Calu-3, 43% overlap was observed (Fig. 2g). The reproducibility of host responses to both pandemic and highly pathogenic avian influenza viruses was high in both the Calu-3 cell line and mouse infection systems. Reproducibility using HAE was somewhat lower both in terms of the absolute number of host factors responding (Fig. 2e) and the percentage overlap (only 39% in Fig. 2f), perhaps due to the variability in establishing similar cell culture conditions using primary epithelial cells. Reproducibility of proteomic results was also somewhat lower with an overlap of 43% following infection of Calu-3 with A/Viet Nam/1203/2004 (Fig. 2g).
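The overlap calculation described above (union of biosets across the time course, then the shared fraction relative to the smallest set) can be expressed in a few lines. This is a minimal sketch; the function names are hypothetical and the gene symbols in the usage note are illustrative placeholders, not results from these datasets.

```python
def union_across_timepoints(biosets_by_timepoint):
    """Non-redundant set of significant genes across a time course."""
    combined = set()
    for genes in biosets_by_timepoint.values():
        combined |= set(genes)
    return combined


def percent_overlap(*gene_sets):
    """Percentage of shared genes, using the smallest set as denominator."""
    sets = [set(s) for s in gene_sets]
    smallest = min(len(s) for s in sets)
    if smallest == 0:
        raise ValueError("cannot compute overlap with an empty set")
    shared = set.intersection(*sets)
    return 100.0 * len(shared) / smallest
```

For instance, if one experiment yields the union {IFIT1, MX1, OAS1, ISG15} and a second yields {MX1, OAS1, IRF7, STAT1, ISG15}, three genes are shared and the smaller set has four members, giving 75% overlap. The same function accepts three or more sets for multi-way comparisons.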
In general, the reproducibility of this collection of experiments using well-controlled standard operating procedures was quite high. In other microarray studies it has been shown that technical variability can result in 10% difference between replicate RNA samples 46,47 , while biological variability can cause anywhere from 10% to 30% differences depending on the biological system sampled 48 . In addition, variability in stimulation conditions has been found to have an even more profound impact than either technical or biological variability 47 . Overall, the results from the independent in vitro and in vivo validation experiments and the high degree of reproducibility of the derived host factor lists show that the datasets described here are of high quality and can be useful for in-depth investigation and hypothesis generation in follow-up analyses.

Usage Notes
The systems biology studies described herein were designed to characterize how the host responds to viral infection. Separate experiments were performed to measure the response to multiple independent variables that include sampling at multiple time points after infection, at different viral dosages, in various model systems, and in systems with various genetic backgrounds (both viral and host). A number of publications have reported the use of these datasets to address questions concerning pathogenesis. For instance, comparisons between the similarities and dissimilarities of the host response at early and late stages of infection with genetically modified viral strains have been performed to identify possible correlates of disease severity 9 . Likewise, the host response was measured in different model systems to identify host factors that have a role in disease progression 45 . Lastly, all of these variables were combined to conduct a detailed evaluation of HPAI H5N1 infection based on time, dosage, and genetically-modified viruses 8 .
Various analytical and statistical tools can be used to interpret the results from systems biology experiments. The commonly used methods fall into three categories: enrichment analysis, co-expression network analysis, and transcription factor predictions. Enrichment analysis tools, such as DAVID or GSEA 49-51 , use diverse statistical methods to identify annotations that are over-represented in a list of significant genes or proteins. Co-expression network analysis methods, such as Weighted Correlation Network Analysis (WGCNA) 52 , find correlated patterns across multiple microarray samples. Lastly, transcription factor prediction tools, such as PSCAN and PRISM 53,54 , examine the promoter regions of co-expressed genes to identify over-represented transcription factor binding sites. All of these tools allow researchers to generate hypotheses concerning how different components of the biological system being studied are modulated.
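As an illustration of what an over-representation test computes, the sketch below evaluates a one-sided hypergeometric P value for a single annotation term. This is a generic textbook test shown for orientation only; it is not the specific statistic used by DAVID or GSEA, and the function name is hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_selected, n_hits):
    """P(X >= n_hits) for an annotation term under random selection.

    n_background : number of genes in the background (e.g. all genes assayed).
    n_annotated  : background genes carrying the annotation.
    n_selected   : genes in the significant list.
    n_hits       : significant genes carrying the annotation.
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total
```

With a background of 10 genes, 4 of them annotated, and a list of 3 significant genes containing 2 annotated ones, the probability of observing 2 or more hits by chance is 40/120, or about 0.33. In practice, P values from many terms would also need multiple-testing correction.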
The host factor component within IRD and ViPR contains the processed data derived from these studies focusing primarily on the host response to virus infection. Here we present several scientific use cases that highlight some of the exploration and data mining uses that are currently possible using these data in the IRD/ViPR systems.

Host responses to related viruses
Comparisons of host responses to infection by viruses that are phylogenetically related but associated with subtle or dramatic differences in disease symptoms or severity can be used to help identify the genetic determinants of virulence.
The  metabolism, amino acid metabolism, cardiovascular disease, and cancer gene responses were associated with virulence differences 6 .
To illustrate one potential way to compare datasets in ViPR/IRD, we quantified the overlap between differentially expressed genes identified from samples infected with this collection of 'pre-pandemic' H1N1 influenza A virus. As before, we compared the non-redundant union of significant genes derived from the transcriptomic results reported in the IM002 experiment to determine differences detected in mice infected with these viruses. Although the pairwise comparisons of the host response to virus infection were somewhat variable, the differences appeared to correlate with the virulence of the infecting virus. No overlap was observed between all three virulent viruses and the A/Brisbane/59/2007 strain. In contrast, 262 host factors (66%) were shared in mice infected with the three virulent viruses (Fig. 3a). Performing a functional annotation analysis for this list of 262 genes identified the same list of GO terms as reported in the original study 6 except for the exclusion of cardiovascular disease and cancer. Similar analyses designed to compare the host response to other related viruses could be performed using the datasets that are currently available in IRD/ViPR. For example, the ICL010-R experiment examines the host transcriptional response to 2 different 2009 influenza A (H1N1) pandemic strains in Calu-3 cells at 12, 24, and 48 hpi. The ECL001-R and SCL005-R experiments could be analysed to better understand the similarities and differences in the host response during infection with either MERS-CoV or SARS-CoV in Calu-3 cells across multiple time points.

Host responses to mutant viruses
In order to investigate how specific viral proteins might function to regulate host intracellular processes, experiments were performed to identify shared and unique host genes that are differentially expressed in hosts infected with either wild-type or mutant viruses 5,11,13 . For example, analysis of experimental results involving variants of SARS-CoV with a deletion in the ORF6 coding region performed in human cell lines (SCL005) or mice (SM012) determined that antagonism of karyopherin-based transport was an important function of the ORF6 gene product in SARS-CoV. As nuclear signaling disruption is a strategy used by many viral pathogens, identifying trends in these data presents a possible area for general therapeutic intervention.
Using Host Factor Bioset data in ViPR/IRD comparing human cell lines infected with either wild-type or ORF6 mutant viruses, 74% of significant host factors were shared between the two infection conditions (Fig. 3b). When an enrichment analysis was performed on the non-overlapping host factors (i.e. unique to infection with the ORF6 mutant strain), we found enrichment for terms relating to cell proliferation and immune response (P < 0.05), which matches what had been reported previously 12 .
Similar analyses can be performed with data from other studies in IRD/ViPR to compare infection with wild-type versus genetically engineered mutant strains. For instance, there are multiple studies that characterize the host responses to SARS-CoV variant strains in a human cell line (SCL006) and a mouse model (SM003 and SM012), and host responses to four influenza mutant variants of VN1203 in mice (IM004, IM005, IM006A, IM006B, and IM007) and a human cell line (ICL011 and ICL012).

Differences between RNA and protein responses
While proteins carry the functional workload in cells, it is still challenging to achieve the levels of dynamic range necessary to measure their abundances and activities at a comprehensive genome-wide level to support systems-level analysis, given practical constraints upon measurement throughput and sensitivity. In contrast, while gene expression arrays allow for the comprehensive measurement of gene expression levels across the entire genome, these levels do not directly reflect functional physiology mediated by the encoded proteins. Therefore, it is interesting to compare quantitative proteomic and transcriptomic experiments using similar samples to estimate how often transcriptional profiles are reflected in protein profiles.
Prior reports have compared transcriptomic and proteomic data as a way of validating experimental results 12 . In these cases, proteomic data was integrated with the corresponding transcriptomic data to improve the biological networks produced and to support the associated statistical analysis of SARS-CoV host responses in general, and the karyopherin transport hubs in particular.
We performed a similar comparison of RNA and protein Host Factor Biosets in ViPR/IRD from the same tissue/cell samples using either microarray or mass spectrometry technologies (ICL004). In this case, transcriptomic data from the 12 h time point of human Calu-3 cells infected with A/Viet Nam/1203/2004 influenza virus was chosen because this time point contained the highest number of significant probes. These transcriptomic data were compared with a pooled list from the time-matched and all subsequent time points from the related proteomics data in order to control for the 'lag' between changes in mRNA expression and the corresponding changes in abundance of the translated protein product. We found approximately 56% overlap between the statistically significant RNA at 12 hpi and the statistically significant protein at the combined 12 and 18/24 hpi time points (Fig. 3c), similar to what has been reported previously 55,56 .
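The lag-aware comparison can be sketched as follows: protein hits from the RNA time point and all later time points are pooled before the overlap is computed. The sketch assumes that probes and proteins have already been mapped to a common gene identifier, and it uses the smaller set as the denominator by analogy with the Technical Validation section; the text does not state which denominator was used for this comparison. Function names are hypothetical.

```python
def pooled_protein_hits(protein_hits_by_hpi, start_hpi):
    """Union of significant proteins at start_hpi and all later time points."""
    pooled = set()
    for hpi, genes in protein_hits_by_hpi.items():
        if hpi >= start_hpi:
            pooled |= set(genes)
    return pooled


def rna_protein_overlap(rna_hits, protein_hits_by_hpi, rna_hpi):
    """Percent overlap between RNA hits and lag-pooled protein hits."""
    rna = set(rna_hits)
    proteins = pooled_protein_hits(protein_hits_by_hpi, rna_hpi)
    # Assumption: the smaller of the two sets is used as the denominator.
    denominator = min(len(rna), len(proteins))
    if denominator == 0:
        raise ValueError("cannot compute overlap with an empty set")
    return 100.0 * len(rna & proteins) / denominator
```

Protein hits observed only before the chosen RNA time point are excluded from the pool, since they cannot reflect transcriptional changes measured later.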

Meta-analysis
While many of the usage examples provided involve comparative analysis of results obtained in the same experiment, in some cases similar comparisons can be performed between experiments. Indeed, the Validation section focused on the reproducibility of derived results and involved the comparison of similar experiment conditions performed in separate experiments. Combining and contrasting results from different studies is often referred to as meta-analysis. One of the goals behind making these datasets freely available through public database resources like GEO, ViPR and IRD is to support their incorporation into meta-analysis approaches that include new experiments being performed by the research community. To facilitate this kind of meta-analysis, considerable effort has been dedicated to providing complete descriptions of all experimental and data analysis methodologies in ViPR and IRD, combined with structured, standardized metadata describing selected key aspects of the experiment design, especially the experiment factors that serve as the independent (manipulated) variables in the study. Clear representations of the experiment metadata will make it relatively simple to appropriately combine these primary and derived results with data from future related experiments for meta-analysis.

Re-analysis
Although we have emphasized the use of the derived Host Factor Biosets, in every case a close linkage is maintained between the processed data and the primary assay results from which they were derived, even though they may be stored in distinct resources. Thus, the final usage example is related to the re-processing and re-analysis of primary assay results.
The choice of algorithm for a specific processing or analysis step in a multistep workflow can change the results. Many bioinformatics data processing workflows currently implement a 'one size fits all' approach with a single algorithm being applied to all datasets without regard for the specific nuances of each individual dataset. The problem with this approach is that the processing algorithms and parameters that are ideal for one dataset may not be optimal for other datasets, especially when dealing with different experimental systems. Hence, processing of the data using a single universally applied algorithm may not sufficiently address the individual requirements of each dataset. To truly solve this underlying problem, an analytical workflow that explores the relevant parameter space for a diverse set of algorithms would need to be implemented. This approach would provide an objective and automated process to intelligently 'choose' the optimal combination of parameters and algorithms that produce the best set of results for each individual dataset. Such an approach could be implemented in analytical workflow suites like Galaxy or GenePattern 57,58 to objectively and confidently reprocess the raw data from each '-omics' dataset in a data-driven fashion to further improve the quality of the derived data. Until such a system is developed, we advocate for using the processed data that was provided by the original submitters and subjected to the peer-review process when performing meta-analyses given its overall high quality.
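The kind of parameter-space exploration envisioned here could, in its simplest form, be an exhaustive search over algorithms and parameter grids scored by a user-supplied quality metric. The sketch below is purely illustrative: no such workflow was implemented for these datasets, the function name and input layout are hypothetical, and defining an objective quality score for real '-omics' data is the hard part that this sketch leaves open.

```python
from itertools import product


def best_configuration(data, algorithms, score):
    """Search algorithms and parameter grids for the best-scoring result.

    data       : the dataset passed to every algorithm.
    algorithms : dict mapping name -> (callable, param_grid), where
                 param_grid maps parameter name -> list of candidate values
                 and the callable is invoked as run(data, **params).
    score      : callable(result) -> float, where higher is better.
    Returns (best_score, algorithm_name, params), or None if nothing ran.
    """
    best = None
    for name, (run, grid) in algorithms.items():
        keys = sorted(grid)
        # Try every combination of candidate parameter values.
        for values in product(*(grid[k] for k in keys)):
            params = dict(zip(keys, values))
            quality = score(run(data, **params))
            if best is None or quality > best[0]:
                best = (quality, name, params)
    return best
```

Exhaustive search grows quickly with the number of parameters, so a practical system would likely need a smarter search strategy and a held-out measure of quality to avoid tuning to noise.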