## Introduction

Respiratory infections are a leading cause of adult morbidity and mortality in the United States1. Short-term increases in ambient fine particulate air pollution (≤ 2.5 µm in diameter; PM2.5) concentrations have been associated with increased emergency department (ED) visits for influenza and culture-negative pneumonia in adults2. Studies of adults in the USA have also observed associations between acute increases in air pollution concentrations and an increased risk of healthcare encounters for respiratory viral infection (RVI) or other lower respiratory tract infections (LRTI), including respiratory bacterial infection (RBI)3,4,5,6. With over 90% of the world breathing unhealthy air7, and the large health care burden of respiratory infections8, understanding the mechanisms underlying air pollution effects on the innate immune response to respiratory infections is crucial.

The association between the inflammation caused by air pollution and disruption of the lung’s innate immune system, including epithelial barrier disruption, macrophage function, and protein/cytokine response, is typically studied via in vitro human cell and in vivo rodent models9,10,11,12,13,14,15,16,17,18,19. Although the majority of studies suggest that different air pollution components might predispose patients to RVI or RBI by multiple mechanisms, laboratory-based studies are generally limited to inbred mouse strains and/or only a single pollutant or a single category of pollution (traffic pollution), and do not fully recapitulate the complex events occurring in naturally exposed humans. Studying natural exposures also helps characterize pollutant effects that controlled exposure studies may miss. Epidemiologic studies allow examination of health responses to single pollutants as they occur within a pollutant mixture.

By examining the association between multiple pollutants and the transcriptional profiles of the peripheral blood of patients with respiratory infection, we can better understand which specific gene pathways may be driving the relationship between specific particulate pollution and different types of infection, including RVI and RBI. Studying respiratory infection types in aggregate can give important insight into potential shared mechanisms between the effects of air pollution on respiratory infection in general, while studying infections individually may elucidate infection-specific mechanisms applicable to only RVI or RBI. Furthermore, studying the potential mechanisms of the effects of air pollution on respiratory infection may help design potential risk reduction strategies (e.g. anti-inflammatory treatment) for patients exposed to air pollution during periods of high rates of respiratory infection.

Although most literature focuses on PM2.5 and not its constituents, our prior source-specific study in NY State observed associations between combustion related constituents of PM2.5 and ED visits for influenza20. Among the PM2.5 constituent sources, we hypothesized that ambient combustion sources including traffic related air pollution (black carbon) and biomass burning (Delta-C) would be associated with dysregulation of key gene pathways within the immune response to respiratory infection. In this study, we focused on the gene expression patterns in patients with RVI as the primary outcome of interest associated with ambient air pollution. We also performed the same analysis in patients with RBI and combined viral and bacterial infection. First, we highlighted the patterns in gene expression across these three infection groups. Next, we analyzed the associations between air pollution concentrations and gene expression in all types of respiratory infection combined. We then analyzed the gene pathways associated with increases in air pollution concentrations in the aggregate group and visualized patterns of gene expression in key pathways for RVI, RBI and combined infection individually (Fig. 1). We report several associations between gene expression and air pollution exposure, involving gene pathways relevant to respiratory infections such as immunity, protein folding and iron homeostasis.

## Results

### Patient characteristics

The majority of participants were white (76%), female (58%), and older, with a median age range of 58–61 years (Table 1). The majority of participants had viral infection (n = 66) and fewer patients had bacterial infection (n = 20) or coinfection (n = 25). The most common viruses were influenza (n = 22) and respiratory syncytial virus (n = 19), and the most common bacterial infection was Streptococcus pneumoniae (n = 12) (Supplemental Table S1 online). Over 90% of patients lived at home and the most common comorbidities included diabetes (34%), COPD (38%) and obesity (39%). Forty percent of participants were active smokers, with the lowest proportion of smokers in the viral infection group (35%) and the largest proportion of smokers in the co-infection group (52%). Inhaled steroid use was most prevalent within the bacterial infection group (65%) and least prevalent in the viral infection group (30%). Home oxygen use was most prevalent in the bacterial infection group (30%) and less common in the viral (15%) and combined infection (12%) groups. Air pollution distributions for Rochester, NY during the same season as the hospitalizations are reported in Table 2. Although Rochester, NY is a medium-sized city, the average PM2.5 concentration is well below the EPA 24-hour fine particle standard of 35 μg/m3 (ref. 21). Moderate correlations (0.5 to 0.7) were observed between the pollutants (Supplemental Fig. S1 online).

### Distinct patterns of gene expression within the highest variance genes correspond to infection type and ambient air pollution

We performed an initial exploratory analysis focusing on the 150 genes with the highest variance across samples. In this analysis, we observed infection specific differences in gene expression (Fig. 2). Based on the hierarchical clustering of genes, we defined 7 distinct gene clusters. One cluster (#2) showed clear expression differences between infections, as previously shown by Suarez et al.22. Though air pollution modeling was not performed in this exploratory portion of the analysis, increased expression of genes in this cluster visually corresponded to high pollution levels among the viral infection samples. In contrast, elevated expression of these genes in patients with bacterial or combined infection did not clearly correspond to high pollution levels.
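The gene selection step described above can be illustrated with a minimal Python sketch. It assumes expression values are held as a mapping from gene name to one value per sample; the function name, the input layout and the toy values are hypothetical and are not taken from the study. The subsequent hierarchical clustering step is not shown.

```python
from statistics import variance

def top_variance_genes(expression, n_top=150):
    """Return the names of the n_top genes with the largest sample variance.

    expression: dict mapping gene name -> list of expression values,
    one value per sample (hypothetical input layout).
    """
    ranked = sorted(expression, key=lambda g: variance(expression[g]), reverse=True)
    return ranked[:n_top]

# Toy example with made-up values (not study data).
toy = {
    "GENE_A": [1.0, 1.1, 0.9, 1.0],  # nearly constant across samples
    "GENE_B": [0.0, 4.0, 0.0, 4.0],  # highly variable
    "GENE_C": [2.0, 3.0, 2.0, 3.0],  # moderately variable
}
selected = top_variance_genes(toy, n_top=2)
print(selected)
```

Ranking by variance is unsupervised: it uses neither infection type nor pollutant levels, which is why any correspondence with those variables in the heatmap is descriptive only.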

Within a specific type of respiratory infection, the expression levels were not consistent across all high variance genes. We observed a higher gene expression corresponding to the highest air pollution concentrations in Cluster #2 for viral infection compared with Clusters #3–6. While other gene clusters captured consistent expression patterns across subjects, these differences were not clearly associated with either infection or pollutant levels.

In the heatmap of the 150 most variable genes (Fig. 2), the 38 individual genes within Cluster #2 had the highest gene expression within the viral infection group and included a variety of immune related genes (Supplemental Table S2 online). For example, there were multiple interferon induced protein coding genes including IFI44L, OAS1 and MX1. IFI27, IFIT1, IFIT2, IFIT3 and RSAD2 participate in interferon gamma signaling and the innate immune response. OAS2 also participates in the innate immune response by encoding a member of the 2-5A synthetase family. The HERC5 gene is upregulated by endothelial cell inflammation. This result suggests that exposure to particulate air pollutants in the week prior to hospitalization with RVI may be associated with a more exuberant immune and inflammatory response.

### Transcriptomic analysis quantifies the associations between the expression of individual genes and both infection type and ambient air pollution

To quantify the association between expression of individual genes and air pollution measurements prior to hospitalization for individuals with a respiratory infection, we performed a LIMMA analysis controlling for sex and infection type (see “Methods” for details). This analysis was performed on all measured genes.
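As a rough illustration of the model structure only, the sketch below fits an ordinary least squares model for a single gene with a pollutant term, sex, and infection type indicators (bacterial infection as the baseline). It is not the LIMMA procedure itself, which additionally moderates gene-wise variances across all genes; the covariate coding, function names and toy values are assumptions made for this example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_gene(y, design):
    """Least squares coefficients for one gene via the normal equations.

    y: log2 expression per sample; design: one row of covariates per sample.
    """
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical columns: intercept, pollutant, sex (1 = female),
# viral indicator, combined indicator (bacterial is the baseline).
design = [
    [1, 0.2, 0, 0, 0], [1, 0.5, 1, 1, 0], [1, 0.9, 0, 1, 0], [1, 0.4, 1, 0, 1],
    [1, 0.7, 0, 0, 1], [1, 0.1, 1, 0, 0], [1, 0.8, 1, 1, 0], [1, 0.3, 0, 0, 1],
]
true_beta = [5.0, 2.0, -0.5, 1.5, 0.3]  # made-up coefficients
y = [sum(b * x for b, x in zip(true_beta, row)) for row in design]
beta = fit_gene(y, design)
print([round(b, 3) for b in beta])
```

In this layout the second coefficient plays the role of the pollutant effect: the change in log2 expression per one-unit increase in concentration, holding sex and infection type fixed.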

#### Few individual genes associated with changes in air pollution

We then analyzed the association between air pollution and gene expression in the overall study population including all infection types. The average effect of each 1 µg/m3 increase in DC in the prior 7 days, regardless of infection type, was a log2-fold increase of 3.5 for RNF14 and 2.4 for UBE2F, two genes participating in antigen presenting cell pathways (Supplemental Table S3 online). Other notable genes upregulated in association with increased concentrations of DC at the 7-day lag period include a 3.1 log2-fold change increase in MAP2K3 (TLR signaling) and a 2.9 log2-fold change increase in ADORA1 (pro-inflammatory monocyte activation). Two genes involved in hemoglobin synthesis (HMBS) and blood cell size (TMCC2) were also upregulated in association with increased DC concentrations in the seven days prior to hospitalization. No significant changes in genes were associated with increased concentrations of PM2.5, BC, UFP or AMP. To further investigate these putative associations between gene expression and air pollution, as well as others not readily apparent from the unsupervised and individual gene analysis, we completed a pathway analysis using the differentially expressed genes grouped by clusters.
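For readers less familiar with the log2 scale, the short Python sketch below converts the log2-fold changes quoted above into linear fold changes (for example, a log2-fold increase of 3.5 corresponds to roughly an 11-fold increase). The helper name is hypothetical; the input values are those reported in the text.

```python
def log2fc_to_fold(log2_fc):
    """Convert a log2 fold change to a linear fold change (ratio)."""
    return 2.0 ** log2_fc

# Log2-fold changes quoted in the text per 1 µg/m3 increase in DC (7-day lag).
reported = {"RNF14": 3.5, "UBE2F": 2.4, "MAP2K3": 3.1, "ADORA1": 2.9}
fold = {gene: log2fc_to_fold(value) for gene, value in reported.items()}

for gene, ratio in fold.items():
    print(f"{gene}: {ratio:.1f}-fold")
```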

#### Differential expression of viral versus bacterial infection to independently replicate baseline expression differences between RVI and RBI (not factoring in air pollutants)

The top genes in the individual LIMMA analyses that were differentially expressed in peripheral blood samples of patients hospitalized with viral infection vs. bacterial (baseline in the model) infection were ISG15, TIMM10, IFI27, IFI44L and OAS2 (Supplemental Table S4 online). These genes were consistently differentially expressed regardless of which pollutant was included in the model. For consistency, we specifically presented the gene expression values within the same group of participants with data for the pollutant DC at the 0–6 lag period (as in the above pollution focused individual gene analysis). There was a log2-fold increase of 3.3 for ISG15 and 2.2 for TIMM10, two genes participating in antigen presenting cell pathways. IFI27 and IFI44 (and its paralog IFI44L) are part of the interferon induced response to RVI and were observed to have a log2-fold increase in the 3–4 range. Finally, OAS2, a part of the innate immune system, was observed to have a 2.5 log2-fold increase. Three of the genes (IFI44, OAS2 and IFI27) that we observed in this replication effort (Table S4) matched the classifier genes found in the original Suarez et al.22 study. Though we control for infection type in our model considering air pollution, highlighting the genes that are classifiers distinguishing RBI from RVI (independent of pollution) is helpful to understand whether or not there is overlap between the genes that are associated with air pollution in patients with respiratory infection and the genes that are associated with respiratory infection alone (without accounting for pollution).

#### Pathway analysis to characterize the broad areas of gene expression associated with different air pollutants in patients with all types of respiratory infection together and targeted infection specific heatmaps

With a signal appearing between high concentrations of air pollution and gene expression within Fig. 2, we then identified specific gene pathways that were associated with modelled increases in air pollution concentrations across all infections (Table 3 and Supplemental Tables S5–S9). When analyzing the association between air pollutants and gene pathways using the CAMERA method, we observed several broad pathways with implications on innate immunity. Then, given the differences in gene expression across infections, we constructed heatmaps of specific pathways of interest for infection specific changes at multiple pollution lags (Supplemental Figs. S2–S12 online).
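The idea behind a competitive gene set test such as CAMERA can be sketched in Python as follows: per-gene statistics inside a pathway are compared with those of all other genes, and the variance of the pathway mean is inflated to account for correlation between genes in the set. This is a simplified illustration, not the limma implementation; the function name, the preset inter-gene correlation and the toy statistics are assumptions, and the text does not state which settings were used in the analysis.

```python
from math import sqrt
from statistics import mean, variance

def competitive_set_statistic(stats, in_set, inter_gene_corr=0.01):
    """Two-sample t-like statistic comparing genes in a set with all other genes.

    stats: per-gene statistics (e.g. t-statistics for a pollutant term).
    in_set: booleans of the same length marking membership in the pathway.
    inter_gene_corr: assumed mean correlation between genes in the set.
    Both groups need at least two genes.
    """
    inside = [s for s, flag in zip(stats, in_set) if flag]
    outside = [s for s, flag in zip(stats, in_set) if not flag]
    m, m2 = len(inside), len(outside)
    # Variance inflation factor: correlated genes carry less independent information.
    vif = 1.0 + (m - 1) * inter_gene_corr
    delta = mean(inside) - mean(outside)
    # Pooled within-group variance of the per-gene statistics.
    pooled = ((m - 1) * variance(inside) + (m2 - 1) * variance(outside)) / (m + m2 - 2)
    return delta / sqrt(pooled * (vif / m + 1.0 / m2))

# Made-up statistics: the first three genes belong to the hypothetical pathway.
stats = [2.1, 1.8, 2.5, 0.1, -0.3, 0.4, -0.2, 0.0]
in_set = [True, True, True, False, False, False, False, False]
t_independent = competitive_set_statistic(stats, in_set, inter_gene_corr=0.0)
t_correlated = competitive_set_statistic(stats, in_set, inter_gene_corr=0.2)
print(t_independent, t_correlated)
```

With zero assumed correlation the statistic reduces to an ordinary pooled two-sample t-statistic; a positive correlation shrinks it, which guards against calling a pathway significant merely because its genes move together.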

#### Delta-C (DC)

Multiple gene pathways associated with red blood cell synthesis and iron handling were significantly upregulated in association with increases in DC concentrations at all lag periods (Table 3 and Supplemental Table S5 online). Gene pathways associated with endoplasmic reticulum activity (protein folding essential for cytokine production in the immune system) were also upregulated by increased DC concentrations at the 7–13, 14–20, and 21–28 lag periods, while increased concentrations of DC at the 7–13 and 14–20 lag periods were associated with upregulation of a viral replication pathway.
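To make the lag periods concrete, the following Python sketch averages a daily concentration series over the 0–6, 7–13, 14–20 and 21–27 day windows used in most of the results. It assumes that lag 0 is the day of admission and that exposure is the arithmetic mean of daily values in each window; both are assumptions for illustration, as are the function name and the made-up series.

```python
from statistics import mean

# Lag windows in days before admission (lag 0 = admission day, by assumption).
LAG_WINDOWS = [(0, 6), (7, 13), (14, 20), (21, 27)]

def lag_window_means(daily, admission_index):
    """Mean concentration in each lag window before an admission day.

    daily: daily mean concentrations ordered by date.
    admission_index: position of the admission day in `daily`.
    Returns a dict mapping (start, end) -> mean, or None when the window
    reaches back before the start of the series.
    """
    result = {}
    for start, end in LAG_WINDOWS:
        first = admission_index - end
        last = admission_index - start
        if first < 0:
            result[(start, end)] = None
        else:
            result[(start, end)] = mean(daily[first:last + 1])
    return result

daily = [float(day) for day in range(28)]  # made-up series: 0, 1, ..., 27
windows = lag_window_means(daily, admission_index=27)
print(windows)
```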

In the infection specific heatmaps, DC was observed to have different patterns of expression in the iron homeostasis pathway (Protoporphyrinogen IX Biosynthetic Process) when comparing infections at the 0–6 and 21–27 lag times (Supplemental Figs. S2–S7 online). While the highest gene expression in iron homeostasis was observed in patients with bacterial or viral infection at the highest concentrations of DC, the combined infection group displayed a mixed to decreased expression. There was no clear pattern of gene expression along the pollution concentration gradient when comparing infection type within an endoplasmic reticulum or viral gene expression pathway.

#### Black carbon (BC)

Increased concentrations of BC were associated with upregulation of a combination of anti-viral pathways and pro-viral pathways in the 0–13 day lag periods (Table 3 and Supplemental Table S6 online). In the 14–20 day lag period, several immune pathways were significantly downregulated in association with a one-unit increase in BC concentrations. Differences in gene expression of the type 1 interferon pathway were observed between infection types for BC (Fig. 3). For BC at the 0–6 lag period, the highest gene expression was found in Cluster #5 of genes for patients with RVI and the lowest expression was observed in this cluster for patients with RBI. Patients with RVI appeared to be driving the positive association between BC and the type 1 interferon pathway.

#### PM2.5

In the 7–20 day lag periods, multiple ribosomal/ER related pathways were downregulated. Iron binding pathways were upregulated in the 20–27 day lag period and iron homeostasis pathways were also upregulated at all lag times (Table 3 and Supplemental Table S7 online).

For PM2.5 at the 0–6 and 14–20 day lag periods, the highest expression of iron homeostasis genes was observed within patients with bacterial infection (Supplemental Figs. S8–S9 online). At the 14–20 day lag period for PM2.5, patients with RVI had the lowest gene expression within the iron homeostasis pathways. In patients with combined infection, low levels of gene expression in iron homeostasis were also observed at the 0–6 lag time for the highest concentrations of PM2.5. These findings indicate that the patients with RBI were driving the signal in the combined analysis of all infections.

#### Ultrafine particle (UFP)

A one-unit increase in UFP was associated with downregulation of several immune related pathways at the 7–13 day lag period, including neutrophil and myeloid activation and regulation of leukocyte degranulation (Table 3 and Supplemental Table S8 online). In the 0–13 day lag period, increased concentrations of UFP were associated with upregulation of both iron binding and heme related iron pathways. Finally, we observed upregulation of ER related pathways in the 14–27 day lag period associated with a one-unit increase in UFP concentrations.

In the infection specific heatmaps for UFP at the 14–20 and 21–27 lag times, we observed the highest expression of viral gene expression pathway genes in patients with viral infection, but there was no clear pattern within bacterial or combined infection (Supplemental Figs. S10–S11 online). While the infection specific heatmaps showed a similar expression pattern for iron homeostasis across different infections, there were distinct expression patterns for immune specific (highest expression for viral infection) and protein folding pathways (highest for combined infection) when comparing infections.

#### Accumulation mode particles (AMP)

One-unit increases in AMP were associated with upregulation of multiple ER associated pathways in the 0–6 day lag period and upregulation of iron homeostasis related pathways at the 21–27 day lag period. An increased concentration of AMP was associated with downregulation of a myeloid leukocyte pathway at the 0–6 lag time (Table 3 and Supplemental Table S9 online).

In the infection specific heatmaps for AMP at the 0–6 lag period, we observed the highest level of expression in the mRNA catabolic pathway in patients with bacterial infection and lowest levels in patients with viral infection (Supplemental Fig. S12 online).

### Summary of the comparison between gene pathway expression patterns across pollutants

The pathways associated with all pollutants included genes related to the upregulation of endoplasmic reticulum related pathways (except for downregulation seen with PM2.5), including ribosomal synthesis (Table 3). Iron binding and hemoglobin related pathways were also commonly affected by all pollutants except for BC. Out of all the pollutants, BC and UFP were associated with the greatest number of immune and viral related gene pathways.

When broadly visualizing the patterns of relative gene expression divided by type of pollution, the most distinct differences were observed when comparing RVI and RBI. Combined infection commonly displayed a mixed (indeterminate) pattern. The strongest gene expression differences when comparing between infections occurred with BC at the 0–6 lag period for Type 1 interferon (Fig. 3). These differences associated with BC displayed decreased expression in the patients with bacterial infection and higher expression in patients with viral infection. One of the most consistent patterns of discordance across multiple pollutants was observed within an iron homeostasis pathway. The highest concentrations of pollutants including AMP, DC and UFP corresponded to higher expression of the middle to lower clusters of genes at the 0–6 and 21–27 lag periods for both viral and bacterial infection, but the relationship was inconsistent in the combined infection group. In the infection specific analysis, there were also multiple examples of low gene expression, which appeared to be independent of air pollution levels.

### Targeted exploration of individual genes within the Interferon pathway

Given the contrast between the gene expression patterns corresponding to the highest pollutant concentrations among the types of infection within Cluster #5 of the Response to Type 1 Interferon pathway (Fig. 3), we examined the 18 individual genes present within this cluster (Table 4). The majority of the genes in this cluster were related to the immune response to infection including OAS-1/3, IFI44L, HLA-A and HLA-DRB-1/4. Genes related to ribosomal activity (RPS4Y-1) and heme biosynthesis (ALAS2) were also present in this cluster. This review of individual genes highlights the overlap between infection related pathways (Interferon) and the ribosomal and heme related genes. OAS2 and IFI44L were also found to be expressed at a high level in patients with RVI in the analysis of the 150 highest variance genes (Fig. 2). The gene expression in the other clusters in the heatmap did not appear to correspond to pollution levels when comparing across infection types. Cluster #2, for example, contains a variety of genes related to acute phase reaction (ORM1), viral replication (IFIT1) and granulocyte regulation (CD17, CLC). However, Cluster #2 also has genes that are not directly related to immunity, including genes related to collagen formation (COL9A2) and the urea cycle (ARG1). In summary, the cluster which displays the greatest infection specific differences in gene expression corresponding to high pollutant concentrations (Cluster #5) has a higher proportion of immune related genes than the more diversely populated Cluster #2.

Descriptions of an additional sensitivity analysis on participants who are smokers (Supplemental Table S10 online) and of the individual genes present in the aforementioned significant gene pathways (Supplemental Tables S11–S15 online) are provided in the supplemental information.

## Discussion

In patients hospitalized with viral, bacterial and combined infection, we observed associations between gene expression and air pollution exposure in the 1 to 4 weeks prior to hospitalization, including gene pathways related to immunity, protein folding and iron homeostasis.

As hypothesized, combustion related air pollutants including DC (marker of wood burning) and BC (marker of traffic pollution) were associated with changes in gene expression of multiple pathways related to the immune response to respiratory infection. All pollutants were associated with genes involved in protein folding, and all but BC were associated with key iron homeostasis pathways. Of all pollutants, BC was associated with upregulation of the largest number of immune specific pathways in the week prior to infection. In the infection specific exploration of selected pathways, we observed similar patterns of gene expression with the overall pathway analysis at the early lag times (0–6 days) but observed different patterns at the later lag times. There appeared to be infection specific patterns of gene expression when comparing patients with RVI, RBI and combined respiratory infection. Overall, this analysis suggests that the mechanistic effects of air pollution on the pathogenesis of respiratory infection may be pollutant, timing, and infection specific.

Discerning the effect of air pollution on the normal immune response to respiratory infection is made more difficult by limitations of our use of an epidemiological design, and by the overlapping effects of both air pollution and respiratory infection on the human immune response.

To date, it is not clear by what mechanism(s) short term air pollution exposures contribute to diagnosed respiratory infection. Independent of air pollution exposure, RVIs can disrupt epithelial barriers23,24,25, activate an inflammatory cascade mediated by nuclear factor kappa-light-chain-enhancer of activated B cells (NF-κB)26, and activate key antiviral proteins, including Type I (e.g. interferon beta [IFN-β]) and Type II (e.g. interferon gamma [IFN-γ]) interferons27,28.

Differentiating RVI from RBI using gene expression is an area of active research29. In a study by Tsalik et al.29, three externally validated, well performing (AUC 0.90–0.99), host-response classifiers were described for non-infectious disease, bacterial and viral respiratory infection respectively. Though direct comparison to genes in our study was limited, the overall theme of distinct gene profiles for viral and bacterial infection and a variable (heterogeneous) pattern for combined viral and bacterial infection was similar to the patterns observed in the exploratory heatmap (without pollution modeling) of our study (Fig. 2). Another study, focused on individual genes, found that a single gene (IFI27) was able to differentiate influenza from RBI30. IFI27 was also identified in the Suarez study22 as a classifier of RBI vs. RVI and was observed in our brief replication analysis (Supplemental Table S4). While distinct gene expression can exist between RVI and RBI, air pollution itself can also elicit a strong immune response independent of respiratory infection.

Independent of respiratory infection, air pollution exposure is known to broadly lead to immune dysregulation in cell and animal models through pro-inflammatory changes to lung epithelia, dysregulation of cell signaling pathways and direct effects on immune cells including macrophages, dendritic cells and granulocytes31. Specifically, in response to diesel exhaust particles, a NF-κB mediated inflammatory cascade is thought to occur within the lung epithelium10. This response has the potential to disrupt the tight junctions between epithelial cells, thereby increasing the risk for viral or bacterial penetration and subsequent infection9,11. We did not observe many inflammatory pathways associated with air pollution and only observed two individual inflammatory related genes in our study. ADORA1 participates in the activation of monocytes, which leads to a pro-inflammatory response. HERC5 was another example of a gene upregulated by endothelial inflammation that was differentially expressed between types of infections in our study. The paucity of observed inflammatory changes may be related to the overall low concentrations of ambient air pollution in the Rochester area compared to other more heavily polluted areas.

In terms of immune effects, decreased levels of IFN-γ (important for macrophage activation) were observed in the peripheral blood of mice exposed to traffic pollution in China12, and in the peripheral blood of humans exposed to diesel exhaust32. Further research has also observed dysregulation of the epithelial cell junction, respiratory microbiome and cytokine response as additional factors increasing pathogenic virulence in the setting of PM exposure33. Our analysis suggests that while BC is associated with upregulation of a type 1 interferon related pathway in the two weeks prior to infection, there may also be a component of immune suppression associated with exposure to traffic pollution in the later lag periods. Specifically, in the pathway analysis (Table 3), BC was associated with upregulation of numerous immune related pathways at the 0–6 day lag period and downregulation of natural killer cell activity and antigen processing in the 14–20 day lag period. BC may have unique health effects when compared to other PM due to its physical shape as a chain aggregate particle. This shape provides a large surface area and concavity between intersecting spheres (most other particles are convex/spherical) that improves the ability for BC to serve as a transport vector for other chemicals or possibly infectious organisms into the body34. UFP was also associated with upregulation of a viral related pathway at the 0–6 day lag period and suppression of multiple immune related pathways at the 7–13 day lag period. Though speculative, these findings suggest that for BC and UFP, an inflection point may exist where the effects of air pollution alone (lag days 13–28) are then comingled with the effect of acute infection during the incubation period in the 7 days preceding infection. In contrast to BC and UFP, PM2.5 was associated with suppression of multiple immune related pathways at the 0–6 day lag period but had no associations with immune pathways at later lag periods. Though the etiology of the gene suppression from PM2.5 is not clear, it may suggest that effects on immunity may be pollutant specific in addition to being timing specific. The relative importance of the directionality and timing of these immune changes for specific pollutants (and considering the composition of pollutants like PM2.5) in the pathogenesis of respiratory infection deserves further study in a prospective manner.

Despite our study population preceding the current COVID-19 pandemic, our participants had a high prevalence of several comorbidities that are risk factors for severe COVID-19 illness including diabetes, COPD, smoking and obesity35. A deficiency in the type 1 interferon response has been hypothesized to be a risk factor for a severe clinical course of COVID-1936. Our study observed a difference in type 1 interferon expression when comparing RVI (high expression) and RBI (low expression) in the week prior to hospitalization (Fig. 3). The most distinct differences in gene expression for type 1 interferon between RVI and RBI corresponded to the highest concentrations of black carbon. We observed that one cluster of genes within the Type 1 Interferon pathway appeared to drive the association between air pollution and gene expression. Determining the potential gene pathways and individual genes involved in the air pollution/respiratory infection association is a key area of research for the current and future pandemics. A further benefit of improved knowledge of the risk of specific air pollutants and respiratory infections could be the ability to make real time policy changes (e.g. diesel traffic modifications) during a pandemic to reduce pathogen virulence.

In addition to the immune specific pathways, there were two additional general pathways broadly related to protein folding and iron homeostasis, which were associated with changes in air pollution. All pollutants were associated with three pathways related to the endoplasmic reticulum (ER), an organelle central to protein synthesis (e.g. cytokines or other immune proteins) and transport in the body37 (Table 3). In a prior in vitro study of PM exposure to bronchial epithelial cells, PM increased stress on the ER (upregulation) that led to a deleterious unfolded protein response (UPR)38. Influenza has also been observed to cause similar dysfunction in the ER39. This can impair the function of cells central to innate immunity in the lung including bronchial epithelial cells40, and also has the potential to lead to dysregulation in the synthesis and transport of other immune related proteins. Aside from AMP, all other pollutant/ER related pathway associations were observed at the 7–27 day lag period, with no association observed in the 0–6 day lag period. This may suggest that the ER (protein folding) related dysregulation precedes the time point of acute infection for infections with incubation periods under 7 days (e.g. influenza)41. While all other pollutants led to upregulation of ER related pathways, PM2.5 was associated with suppression of the ER pathways. While both upregulation and suppression could lead to dysregulation, upregulation may be more detrimental due to the risk of the UPR.

Iron homeostasis plays an important role in the clinical course of RBI42, and is increasingly recognized as an important factor in RVI43 as well. Air pollution is known to induce a relative iron deficiency through dysregulation of iron homeostasis through mechanisms of chelation and/or sequestration44. Aside from BC, all other pollutants were observed to upregulate multiple pathways of iron homeostasis (Table 3). The association between multiple pollutants and an upregulation of iron related pathways is consistent with the relative iron deficient state induced by air pollution. Intracellular iron deficiency can lead to an increased oxidative state and inflammation within the host. Whether the observed changes in iron homeostasis are protective (immune priming), deleterious (immunosuppressive), or related to changes in the proportion of blood cells remains unclear. The effect of air pollution on iron homeostasis pathways in viral infection is deserving of further study.

### Limitations

Our study results should be interpreted in light of several limitations. First, our study focused on a small group of severely ill patients requiring hospitalization, though the final specific strata of severity within the hospital were not recorded as participants were enrolled upon hospital arrival. Our cohort was also older with multiple comorbidities, potentially limiting the generalizability to younger patients with respiratory infection. Second, the lack of a control population limited the pathway and LIMMA analysis to comparisons between types of infection. Furthermore, the use of alternative gene-list based gene set enrichment algorithms, such as Enrichr45,46, was not possible in the study due to the relatively small number of marginally statistically significant genes identified. Third, this analysis was not able to correct for the blood cell proportions in the peripheral whole blood of our samples. In theory, the lack of inclusion of blood cell proportions should only minimally change the magnitude of effect estimates and would result in a larger standard error and reduced statistical significance. Fourth, the epidemiological design does not allow for a causal link to be established between air pollution and the immune response to respiratory infection. Fifth, as there was no external validation with gene sets outside of our study, the generalizability of our findings is reduced. Sixth, there was likely an element of exposure misclassification given that pollution was estimated from a central site monitor, which likely reduced the magnitude of the observed effects. Finally, Rochester, NY has a relatively low average concentration of PM air pollution, so generalizability to areas of higher pollution may be limited if dose thresholds exist in the pathogenic response to PM. Future studies can improve exposure assessment by using land use regression techniques, account for multipollutant mixtures and improve overall generalizability by including patients with mild infection and non-infected controls.

## Conclusions

Overall, this epidemiological study suggests that combustion-related pollution, particularly BC, is associated with changes in gene expression within innate immune pathways. Increased concentrations of most pollutants also appear to correspond to changes in the expression of protein folding and iron homeostasis pathways. Distinct from other pollutants, PM2.5 was associated with downregulation of immune and protein folding pathways. The relatively low pollution in the study region may explain the lack of inflammatory changes accompanying the changes in the immune pathways. Future controlled exposure studies informed by epidemiological studies are needed to further explore the relationship between inflammatory and immune responses to particulate air pollution in patients with respiratory infection.

## Methods

We used existing data from 111 patients originally enrolled in the study by Falsey et al.47, who underwent transcriptional profiling as detailed in the study by Suarez et al.22. The original study included adults over the age of 21 years with symptoms compatible with acute respiratory tract infection who were admitted through the emergency department at Rochester General Hospital (RGH), Rochester, NY from 2008 to 2011. As detailed in Falsey et al.47, each patient was assigned an admitting diagnosis by a pulmonary specialist after examination of each subject and review of laboratory, microbiologic and radiographic data. All ethical approvals and informed consent were obtained, and relevant guidelines were followed, in this previous study. Subjects had comprehensive microbiologic testing, and cases were adjudicated by specialists as viral alone, bacterial alone, or mixed viral-bacterial infection. From this population, 1–3 ml of peripheral whole blood was collected from 118 patients in Tempus tubes, and RNA was hybridized using an Illumina Human HT-12 v4 BeadChip kit. Transcripts from the Illumina GenomeStudio based analysis were included if they were present in 10% or more of the samples and if they exhibited a minimum of a twofold expression change. Because 7 of the 118 patients had missing pollution data, our current study of the association between air pollution and gene expression analyzed the 111 patients who had both a transcriptional analysis of peripheral blood and pollution data.

### Air pollution data

Ambient air pollution concentrations were measured at a central site monitor in Rochester, NY, and all patients living in Monroe County, NY were assigned pollutant concentrations from this monitor. The daily ambient air pollutant concentrations in the 28 days prior to the date of each participant’s hospital admission were matched to each participant, as an estimate of the patient’s air pollution exposure in those 28 days. Specifically, measurements of particle number concentrations in the size range of 10–500 nm are made continuously and sequentially at the New York State Department of Environmental Conservation (NYS DEC) site in Rochester, NY48. From 2004 to the present, measurements have been made at the NYS DEC primary site (latitude 43°09′56″ N, longitude 77°33′15″ W) on the east side of Rochester, NY. This sampling site is close to two major interstates (I-490 and I-590) as well as NY Route 96, a major route carrying traffic traveling to and from downtown Rochester. Hourly PM2.5 mass, wind speed and wind direction, ambient temperature and relative humidity are also measured at the above-mentioned site. Size distribution measurements are made using a scanning mobility particle sizer (SMPS, TSI Inc.) system consisting of an electrostatic classifier (TSI model 3071), with an impactor having an orifice size of 0.0457 cm, an 85Kr aerosol neutralizer (TSI model 3077), and a condensation particle counter (CPC; TSI model 3010). The size range bounds are 10.4 nm (lower) and 0.542 µm (upper), leading to measurement of mid-point particle sizes ranging from 11.1 nm to 0.47 µm (32 channels per decade) at a total scan time of 5 min per sample. Routine maintenance such as calibrating rates is performed once a week to ensure that the system is functioning properly. PM2.5 is measured with a TEOM (model 1400ab, Thermo Fisher Scientific Inc., USA). Black carbon (BC) was measured with a 2-wavelength aethalometer.
Delta-C (DC) is the difference between BC measured at 370 and 880 nm and has been shown to be a marker for biomass burning49. Pollution measurements were taken from Nov 1st, 2008 to May 31st, 2011.
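
As an illustration of the exposure assignment described above, the following minimal Python sketch computes Delta-C from the two aethalometer channels and summarizes a participant's daily concentrations over the four weekly lag windows used in the Statistics section. It is not the authors' code. The function names, the use of a simple arithmetic mean within each window, and the treatment of lag 0 as the admission day are assumptions made for illustration.

```python
from datetime import date, timedelta
from statistics import mean

# Weekly lag windows in days before admission (from the Statistics section).
LAG_WINDOWS = [(0, 6), (7, 13), (14, 20), (21, 27)]


def delta_c(bc_370nm, bc_880nm):
    """Delta-C: BC measured at 370 nm minus BC measured at 880 nm."""
    return bc_370nm - bc_880nm


def lag_window_means(daily, admission):
    """Average daily concentrations over each lag window before admission.

    `daily` maps a date to that day's concentration. Days without a
    measurement are skipped; a window with no data at all yields None.
    Assumption: lag 0 is the admission day itself.
    """
    means = {}
    for start, end in LAG_WINDOWS:
        values = [
            daily[admission - timedelta(days=lag)]
            for lag in range(start, end + 1)
            if admission - timedelta(days=lag) in daily
        ]
        means[(start, end)] = mean(values) if values else None
    return means
```

A participant with a window that has no monitor data receives `None` for that window, which mirrors the exclusion of patients with missing pollution data rather than imputing a value.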

### Microarray data acquisition and processing

The dataset, GSE60244 on the Gene Expression Omnibus, contains background corrected, non-normalized whole-blood genome data from microarrays run on the Illumina HumanHT-12 V4.0 expression BeadChip platform. The reported detection p-values were used to infer the mean and variance of the negative control probe intensities in order to perform background correction using the normal-exponential convolution model. This method prevents the negative values that arise from subtraction-based background correction. Quantile normalization was used after background correction in order to minimize variation between arrays. Specifically, we used the neqc function in LIMMA to perform adaptive background correction based on each array in order to account for background intensity around each feature and control for variability between arrays. The neqc function uses both negative and positive controls for normalization50. The probes were matched to gene names using the annotation package illuminaHumanv451. Unidentified and non-detected genes were removed.
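
To make the quantile normalization step concrete, here is a minimal Python sketch of the basic idea. It covers only the normalization step, not the normal-exponential background correction, and it is not the neqc implementation. The function name and the tie-breaking rule (ties are ranked by position) are simplifying assumptions.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length lists of probe intensities.

    Each value is replaced by the mean, across arrays, of the values that
    share its rank, so every array ends up with the same distribution.
    Simplification: ties are broken by probe position.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value over arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)
    ]

    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized
```

After this step every array contains the same set of values, only in a different probe order, which is what removes array-to-array differences in overall intensity distribution.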

### Statistics

#### Exploratory analysis

After preprocessing, the 150 genes with the highest variance were selected for initial exploratory analyses. The selected expression values were centered and scaled to have mean zero and variance one. We performed hierarchical clustering with Euclidean distance and complete linkage on both the genes and samples. The resulting sample dendrogram was qualitatively compared to ambient air pollution levels in the weeks prior to infection diagnosis, which were obtained from a prior study22. Finally, the ambient pollution values for each patient over the 0 to 6 days preceding diagnosis of infection were overlaid. The R package ComplexHeatmap was used to generate the heatmaps in this study52.
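
The exploratory steps above (variance filtering, standardization, and complete-linkage clustering on Euclidean distance) can be sketched in plain Python as follows. This is a naive illustration written for readability, not the ComplexHeatmap workflow used in the study. All function names are hypothetical, and the use of the sample standard deviation for scaling is an assumption.

```python
from math import dist
from statistics import mean, stdev, variance


def top_variance_genes(expr, k):
    """Return the names of the k genes with the largest sample variance.

    `expr` maps a gene name to its list of expression values across samples.
    """
    return sorted(expr, key=lambda g: variance(expr[g]), reverse=True)[:k]


def standardize(values):
    """Center and scale one gene's values to mean 0 and variance 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def complete_linkage(points):
    """Agglomerative clustering with Euclidean distance, complete linkage.

    Returns the merges in order as (cluster_a, cluster_b, height), where
    clusters are tuples of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                height = max(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges
```

The sequence of merges and their heights is the information a dendrogram draws; clustering the samples uses one point per sample, and clustering the genes uses one point per gene.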

#### Individual gene analysis

The Linear Models for MicroArray (LIMMA) package53 in R (version 4.0.3)54 was used to test hypotheses about the effect of ambient air pollution levels in the weeks prior to infection diagnosis on patients’ gene expression using all 47,231 probes available in the microarray platform.

Each of the five pollutant exposures was tested individually by fitting a separate linear model for each of the four exposure time intervals: 0–6 days, 7–13 days, 14–20 days, and 21–27 days prior to date of diagnosis. This resulted in fitting a total of 20 models of the following form:

$$E[{Y}_{ij}]={\beta }_{0j}+{\beta }_{1j}I({Viral}_{i})+{\beta }_{2j}I({Coinfection}_{i})+{\beta }_{3j}I({Female}_{i})+{\beta }_{4j}{x}_{i}^{pollutant}$$

Here $${Y}_{ij}$$ is the gene expression in subject i for gene probe j. Since no pollution exposure data were available for the gene expression control group, the bacterial infection group was set as the baseline category; therefore, the differences in expected gene expression between viral and bacterial infection and between coinfection and bacterial infection are $${\beta }_{1j}$$ and $${\beta }_{2j}$$, respectively. The difference in expected gene expression between female and male subjects is $${\beta }_{3j}$$. Sex was selected as a covariate of interest as it was predictive of respiratory viral infection and had a potential effect on gene expression. None of the other covariates tested, including race, smoking, COPD, diabetes, congestive heart failure, white blood cell count, oxygen requirement, chronic renal failure, statin use, and obesity, were predictive of respiratory viral infection, and they were therefore not included in the model. Ambient pollutant levels measured for AMP, PM2.5, UFP, DC, and BC in four different time lags were each used separately as the pollution value $${x}_{i}^{pollutant}$$, and $${\beta }_{4j}$$ is the change in expected gene expression for a one unit increase in a given pollutant. Standard errors for these coefficients were calculated using the empirical Bayes method55 central to the LIMMA method. Differential expression was determined by using a false discovery rate (FDR) threshold of 0.1.
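
The model structure can be illustrated with a short Python sketch that enumerates the 20 pollutant-by-window models, encodes one subject's design-matrix row with bacterial infection as the baseline, and adjusts a list of p-values for multiple testing. This is not the LIMMA implementation. The function names are hypothetical, and the Benjamini-Hochberg procedure is an assumption, since the text states only that an FDR threshold of 0.1 was used.

```python
from itertools import product

POLLUTANTS = ["AMP", "PM2.5", "UFP", "DC", "BC"]
LAG_WINDOWS = ["0-6", "7-13", "14-20", "21-27"]  # days before diagnosis

# One model per pollutant and lag window: 5 x 4 = 20 models.
MODEL_GRID = list(product(POLLUTANTS, LAG_WINDOWS))


def design_row(infection, sex, pollutant_value):
    """Row [1, I(Viral), I(Coinfection), I(Female), x] for one subject.

    Bacterial infection and male sex are the baseline categories, so both
    are encoded as zeros in their indicator columns.
    """
    if infection not in ("bacterial", "viral", "coinfection"):
        raise ValueError(f"unknown infection type: {infection!r}")
    return [
        1.0,
        float(infection == "viral"),
        float(infection == "coinfection"),
        float(sex == "female"),
        float(pollutant_value),
    ]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Under this sketch, a probe would be called differentially expressed for a given model when its adjusted p-value for the pollutant coefficient is at most 0.1.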

#### Pathway analysis

In order to test the effect of pollutants on gene pathways, we used the CAMERA algorithm56. CAMERA implements a competitive gene set test comparing each pathway against all other genes not in the pathway. Such tests focus on identifying the most important biological processes relative to all other processes. Gene-wise moderated t-statistics are used as in LIMMA, but here the goal is to determine whether the mean of the gene-wise statistics differs between the pathway of interest and all other genes. The key feature of CAMERA is that it accounts for inter-gene correlation in order to better control type I error. In this work, we used a false discovery rate threshold of 0.1.
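
The idea of a competitive test that accounts for inter-gene correlation can be sketched as a two-sample t-statistic in which the variance of the pathway mean is multiplied by a variance inflation factor. The Python below is a simplified illustration, not the CAMERA implementation: the function name is hypothetical, the mean inter-gene correlation is passed in as an assumed input rather than estimated from the data, and the degrees of freedom and p-value calculation are omitted.

```python
from math import sqrt
from statistics import mean


def inflated_two_sample_t(stats, in_set, mean_correlation):
    """Competitive t-statistic with a variance inflation factor (VIF).

    stats: gene-wise statistics for all genes.
    in_set: booleans flagging the genes that belong to the pathway.
    mean_correlation: assumed average correlation between pathway genes.
    """
    inside = [s for s, flag in zip(stats, in_set) if flag]
    outside = [s for s, flag in zip(stats, in_set) if not flag]
    m1, m2 = len(inside), len(outside)
    if m1 < 1 or m2 < 1 or m1 + m2 < 3:
        raise ValueError("need genes both inside and outside the pathway")

    # Positive correlation among pathway genes inflates the variance of
    # their mean statistic by this factor.
    vif = 1.0 + (m1 - 1) * mean_correlation

    mean_in, mean_out = mean(inside), mean(outside)
    within_ss = sum((s - mean_in) ** 2 for s in inside)
    within_ss += sum((s - mean_out) ** 2 for s in outside)
    pooled_var = within_ss / (m1 + m2 - 2)

    return (mean_in - mean_out) / sqrt(pooled_var * (vif / m1 + 1.0 / m2))
```

With a mean correlation of zero the factor equals one and the statistic reduces to the ordinary pooled two-sample t-statistic; any positive correlation shrinks the statistic toward zero, which is how this kind of adjustment guards against false positives.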

Pathway definitions came from the Gene Ontology (GO) gene set collection of the Molecular Signatures Database (version 7.1), which lists 10,192 pathways. These pathways are derived from the GO resource, and they were compiled into R data files that mapped probes to gene symbols, which were subsequently used to define pathway membership. We cross-referenced the significant LIMMA results against the significant CAMERA results to limit our scope to only the pathways found significant in CAMERA that also contained individual genes found to be significant in the LIMMA analysis. From this subset of 448 pathways, we chose those known to be associated with infection and clustered those gene sets as in the exploratory analysis to visually examine trends across patients and expression levels compared with pollutant levels.
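
The cross-referencing step amounts to a set intersection, as in the following minimal Python sketch. The function name and the placeholder pathway and gene identifiers in the usage example are hypothetical and do not correspond to results from this study.

```python
def cross_reference(significant_pathways, significant_genes, pathway_members):
    """Keep significant pathways that contain at least one significant gene.

    significant_pathways: pathway names significant in the gene set test.
    significant_genes: gene symbols significant in the gene-level test.
    pathway_members: maps a pathway name to its member gene symbols.
    """
    genes = set(significant_genes)
    return [
        pathway
        for pathway in significant_pathways
        if genes & set(pathway_members.get(pathway, ()))
    ]
```

For example, with placeholder inputs:

```python
members = {
    "pathway_A": ["gene_1", "gene_2"],
    "pathway_B": ["gene_3"],
}
kept = cross_reference(["pathway_A", "pathway_B"], ["gene_2"], members)
# kept == ["pathway_A"]
```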