Abstract
We investigate patterns of COVID-19 mortality across 20 Italian regions and their association with mobility, positivity, and sociodemographic, infrastructural and environmental covariates. Notwithstanding limitations in accuracy and resolution of the data available from public sources, we pinpoint significant trends by exploiting information in curves and shapes with Functional Data Analysis techniques. These depict two starkly different epidemics: an “exponential” one unfolding in Lombardia and the worst-hit areas of the north, and a milder, “flat(tened)” one in the rest of the country, including Veneto, where cases appeared concurrently with Lombardia but aggressive testing was implemented early on. We find that mobility and positivity can predict COVID-19 mortality, even when controlling for relevant covariates. Among the latter, primary care appears to mitigate mortality, and contacts in hospitals, schools and workplaces to aggravate it. The techniques we describe could capture additional and potentially sharper signals if applied to richer data.
Introduction
At the end of January 2020, two Chinese tourists were hospitalized in Rome and tested positive for SARS-CoV-2. At the beginning of February, a group of Italian citizens was repatriated from Wuhan; one of them tested positive. As the news media reported these headlines, neither the Italian public nor the Italian authorities appeared to perceive an imminent threat, though retrospective analyses now suggest that the virus may have been circulating in the north of the country as far back as December 2019 (e.g., detection of SARS-CoV-2 in the wastewater of Milan and Turin^{1}). The first recorded non-travel-related COVID-19 case occurred in Codogno (Lombardia), where a 38-year-old man visited the hospital first on February 17, and then again on February 19 with worsening respiratory symptoms; on this date, he was tested and diagnosed. On February 20, two individuals tested positive in Vo’ Euganeo (Veneto). Notably, the outbreaks in Lombardia and Veneto took two very different paths, something many observers attributed to the early response and aggressive testing strategy adopted by the regional authorities in Veneto^{2,3}. After some initial, much debated inconsistencies (e.g., hesitations in implementing local lockdowns in areas hosting major industrial production hubs, contested decisions to move patients between hospitals and nursing homes and to keep major sports events open to the public in Lombardia), starting in early March, local and central authorities took progressively more stringent measures to limit mobility and social gatherings. These culminated in a general nationwide lockdown on March 9 and the suspension of all nonessential production activities on March 23 (starting in early May, activities restarted and mobility and gathering restrictions were gradually loosened).
Lockdown notwithstanding, based on official records, Italy saw a total of \(\approx\)35,200 COVID-19 deaths as of the beginning of August 2020. While other countries (e.g., the U.S. and Brazil) reached much higher death counts, Italy’s relative death toll remained rather stark at 58.25 per 100,000 inhabitants. This may be partially attributable to the fact that Italy’s population is very old (nationally, the median age is almost 46 years and the percentage of individuals over 65 almost 22%^{4}), and that age itself correlates with conditions such as type II diabetes, hypertension and chronic respiratory ailments, which substantially worsen illness and increase the likelihood of death for individuals affected by the virus^{5}. But perhaps the most striking aspect of the COVID-19 epidemic in Italy has been its heterogeneity. Some parts of Lombardia and of other regions in the industrialized north were hit early and especially hard, yet other demographically and socioeconomically similar areas fared better^{6,7}. Moreover, most of the central and southern regions of the country experienced a much milder epidemic, notwithstanding waves of relocations from employment-related domiciles in the north back to family homes in the center and south around the time of the nationwide lockdown. Potential contributors to this heterogeneity discussed by both scientists and the media include human density characteristics; centralized, hospital-based vs distributed, primary health care systems; and pollution levels^{8,9,10,11,12}.
A broad and extremely sophisticated literature exists on epidemiological models^{13}, which many research groups are utilizing both to aid policy through forecasts and to dissect what happened, in Italy and around the world. We did not utilize these models. Instead, we applied a mix of statistical tools from the field of Functional Data Analysis (FDA^{14,15}), some well-established, and some recently developed by our group; the latter are still undergoing peer review and have not yet been validated by the community at large. FDA offers very powerful approaches to analyze data sets composed of curves or surfaces, exploiting information in their shapes. These techniques, which have been successfully applied in a variety of scientific domains^{16,17,18}, can effectively complement traditional epidemiological analyses and provide useful insights^{19}. We used them to characterize patterns of COVID-19 deaths occurring around the country and analyze their statistical association with two key predictors; namely, mobility and positivity (the fraction of performed tests returning positive results). We also considered various sociodemographic, infrastructural and environmental covariates. We focused on the period from February 16, 2020, right before the first cases were recorded in Codogno and Vo’ Euganeo, to April 30, 2020, right before the first lockdown relaxations (restarting of manufacturing and construction activities at the beginning of May). Based on data availability, we performed our analyses at the spatial resolution of regions, which is suboptimal for several reasons. An epidemic is certainly better studied at a much finer resolution (municipalities, urban areas, perhaps the provinces within which Italian regions are further partitioned), and so are its links to predictors and covariates whose signals may dilute when aggregated at the regional level. 
Moreover, operating with 20 observational units (the Italian regions) limits the size of the statistical models one can reliably fit on the data. The techniques we employed allowed us to pinpoint significant trends working with what we could retrieve from public data sources. Unquestionably though, access to data at higher resolution would allow more nuanced, in-depth analyses and likely produce sharper results.
Results
Below we describe the salient outcomes of our analyses. After addressing some shortcomings in publicly available COVID-19 death records, we characterize two starkly different epidemic patterns and rank regional mortality curves. Next, we relate mortality to mobility and positivity, and to a number of sociodemographic, infrastructural and environmental factors.
Undercounting deaths
Since February 24, 2020, the Italian Civil Protection Agency (Dipartimento della Protezione Civile; DPC) has released daily counts of recorded COVID-19 deaths at the coarse resolution of regions (only the number of recorded cases was released at the finer resolution of provinces). In Italy and elsewhere, official death records have often been criticized as undercounts^{20,21}. Alternative data sources do exist, e.g., daily mortality rates, which can be contrasted to those from prior years to gauge differential mortality. In Italy these are provided by the National Statistical Institute (ISTAT) at the resolution of municipalities. We aggregated the data over municipalities belonging to the same region and subtracted averages over the prior 5 years (2015-2019, see Methods)^{22}. Figure 1(a) shows smoothed DPC and ISTAT differential mortality curves (per 100,000 inhabitants) for some example regions (Lombardia, Veneto, Emilia Romagna and Campania). The undercounting in the official DPC records was dramatic, especially in badly affected areas and in the initial stages of the epidemic. However, ISTAT differential mortality curves have limitations of their own, especially in less affected areas, where they can fluctuate at small levels and even take negative values, either idiosyncratically or reflecting other COVID-19 related phenomena (e.g., increases in mortality due to untreated emergencies or reductions in mortality due to fewer accidents during the lockdown). We therefore formed maxima curves (MAX), where the larger of the DPC and ISTAT values is taken on each day and for each region, and then smoothed. These are shown in Fig. 1(b) (DPC and ISTAT smoothed curves for all regions are shown in Figs. S1 and S2). We repeated our analyses on all three data sets; given the small number of observational units at our disposal (\(n=20\) regions), this allowed us to borrow strength by replicating results across data sets, with their differences and limitations.
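The construction of the differential and MAX curves described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and the centered moving average is a stand-in for the smoother described in the Methods, which are not reproduced here.

```python
import numpy as np

def differential_mortality(deaths_2020, deaths_prior_years):
    """Daily 2020 deaths minus the day-by-day average over prior years.

    deaths_prior_years has shape (n_years, n_days), e.g. 5 rows for 2015-2019.
    """
    prior = np.asarray(deaths_prior_years, dtype=float)
    return np.asarray(deaths_2020, dtype=float) - prior.mean(axis=0)

def moving_average(curve, window=7):
    """Centered moving average with an odd window (assumed stand-in smoother)."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    padded = np.pad(np.asarray(curve, dtype=float), window // 2, mode="edge")
    return np.convolve(padded, np.ones(window) / window, mode="valid")

def max_curve(dpc, istat_diff, window=7):
    """Pointwise maximum of the DPC and ISTAT series, then smoothed."""
    return moving_average(np.maximum(dpc, istat_diff), window)
```

Taking the maximum before smoothing, as the text describes, keeps the larger of the two records on each day; negative ISTAT differentials are replaced by the DPC count whenever the latter is larger.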
Two different epidemics
Italy saw the unfolding of two very different epidemics; a relatively mild one in the majority of the country, and a tragic, seemingly out of control one in its most hard-hit regions. These two epidemics can be effectively characterized with probKMA, an FDA technique designed to identify recurrent motifs within a set of curves, and group the curves based on the motifs they comprise^{22,23}. Here, the motifs are the temporal patterns of deaths that characterize alternative epidemic unfoldings, which may in fact start at different times in different curves (regions). Thus, the algorithm also produces the shifts required to align regions comprising the same motif to each other. ProbKMA is similar to a K-means algorithm; it requires the user to specify the number of motifs (K) at the outset, and to select a distance, which can be defined on the curve levels, their derivatives, or a combination of both (see Methods).
The solution with \(K=2\) and distance defined on curve levels depicts two starkly different epidemics, shown for the MAX curves in Fig. 2(a). Allowing for shifts, these are represented by 65-day long motifs. Group 1 undergoes a steep ascent (the “exponential” pattern) followed by a slower descent from the peak; it includes many northern regions. Based on the shifts, Lombardia was first, followed by Emilia Romagna, Marche, Liguria, Piemonte, Trento/Bolzano, and last Valle d’Aosta. Lombardia and Valle d’Aosta presented the most extreme peaks, but Valle d’Aosta’s descent was steeper (with a second late ascent likely due to data recording imprecisions; Valle d’Aosta is a very small region with only \(\approx 125,000\) inhabitants). Group 2 follows a “flat(tened)” pattern; it includes all regions in southern and central Italy and, remarkably, Veneto, where the curve was successfully curbed. The shifts produced for this group are less stable and less meaningful in terms of interpretation, as flatter profiles leave more leeway in aligning curves against each other. All results (except for the shifts in Group 2) are rather consistent when using DPC and ISTAT curves (see Fig. S3a and Fig. S3c), and when using distances defined on derivatives instead of curve levels. The solution with \(K=3\) places Lombardia (ISTAT curves) or Lombardia and Valle d’Aosta (MAX and DPC curves) in a cluster of their own (see Fig. S4). We also validated our results using a modification of funBI^{24}, a functional biclustering technique, and IWTomics^{25}, a functional testing technique which contrasts two sets of aligned curves pinpointing the locations and scales at which they differ (see Methods). Figure 2(b) shows how, starting a little over two weeks from the beginning of their motif (wherever that was in each curve), Group 1 and Group 2 differ significantly at all temporal scales (see also Fig. S3b and Fig. S3d, and Table S1).
Why the two epidemics? The pattern of deaths characterizing Group 1 may be due, in large part, to the fact that the virus had circulated silently in the north of Italy for a long period of time before any kind of behavioral changes by the general public, medical protocols, or mitigation policies by local and central authorities were put in place. Mounting evidence suggests that a large share of COVID-19 cases are asymptomatic and yet contagious^{3,26}; their numbers may have increased until a pent-up reservoir of virus found its way to vulnerable individuals (some researchers also hypothesize Antibody-Dependent Enhancement of SARS-CoV-2^{27}, and thus a role for reinfections). But a variety of additional factors may have contributed to shaping the two epidemics; we explore some below.
Ranking mortality curves
Nonparametric FDA methods can be used to rank curves based on the notion of depth, from the innermost to the most extreme, and to identify outliers^{28,29}. Figure 3 shows a functional box plot of the MAX mortality curves and a depth ranking of the curves in the DPC, ISTAT and MAX data sets, shifted based on probKMA run with \(K=2\) and restricted to their aligned 65-day portions. The ranking is directional; we attributed signs to the depth measurements, so that curves far over or under the median curve are at the top or bottom of the ranking, respectively (see Methods). The top portion of the ranking comprises regions with “exponential” epidemics (Group 1) and is rather stable across data sets; Lombardia and Valle d’Aosta are consistently among the most extreme curves (they are also identified as outliers in the MAX and DPC data sets). The mid and bottom portions of the ranking comprise regions with “flat(tened)” epidemics (Group 2) and are less stable across data sets, as the flatter profiles can more easily switch in their depth ranks. However, Toscana (which is the median in the MAX and ISTAT data sets) and Veneto are consistently among the deepest, most central curves. This analysis highlights again the tragic epidemic unfolding in Lombardia, and, by contrast, confirms how Veneto managed to “flatten” its curve back into the bulk.
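To make the notion of depth concrete, the sketch below computes a modified band depth with bands formed by pairs of curves, one common depth used for functional box plots, and then attaches a sign to it. The source does not state which depth was used, and the signing rule here (mean position relative to the deepest curve) is an assumption made for illustration; the authors' rule is in their Methods.

```python
import numpy as np
from itertools import combinations

def modified_band_depth(curves):
    """Modified band depth; curves has shape (n_curves, n_times).

    For every pair of curves, measure the fraction of time points at which
    each curve lies inside the band delimited by the pair, then average.
    """
    curves = np.asarray(curves, dtype=float)
    pairs = list(combinations(range(curves.shape[0]), 2))
    depth = np.zeros(curves.shape[0])
    for j, k in pairs:
        lower = np.minimum(curves[j], curves[k])
        upper = np.maximum(curves[j], curves[k])
        inside = (curves >= lower) & (curves <= upper)
        depth += inside.mean(axis=1)
    return depth / len(pairs)

def directional_score(curves):
    """Signed outlyingness: positive above the deepest curve, negative below.

    The signing rule is a hypothetical choice, not taken from the source.
    """
    curves = np.asarray(curves, dtype=float)
    depth = modified_band_depth(curves)
    median_curve = curves[np.argmax(depth)]
    side = np.sign((curves - median_curve).mean(axis=1))
    return side * (1.0 - depth)
```

Sorting regions by `directional_score` in decreasing order would place curves far above the bulk at the top and curves far below it at the bottom, with the deepest curves in the middle.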
Local mobility and positivity as statistical predictors of mortality
Next, we focus on two key variables. The first is one of the most discussed policy-actionable variables, mobility, which has been curtailed to various degrees through lockdown measures in most of the countries affected by COVID-19. The second is one of the most discussed sentinel indicators, positivity, i.e. the fraction of performed tests returning positive results. For both these variables, daily values for the period from February 16 to April 30, 2020, were obtained from data in the public domain at regional resolution.
We considered differential mobility curves provided by Google for the category “Grocery & pharmacy”. These express the fractional change with respect to January 2020 levels (negative values indicating a reduction), and refer to mobility linked to first necessities, such as buying food and medicine. For Italy, they were provided at the resolution of regions. Even though individuals were allowed to leave their homes for these necessities also during the most restrictive phase of the lockdown, the reduction captured by Google’s “Grocery & pharmacy” was substantial. Mobility on weekdays fell by roughly 0.30, i.e. 30%, in the week after the lockdown (March 9), and further decreased in the following weeks, reaching its lowest levels (between approximately \(-0.60\) and \(-0.40\) depending on the region) in the week after the suspension of nonessential production activities (March 23). It then slowly increased, getting back to a range between approximately \(-0.40\) and \(-0.20\) at the very end of April (see Fig. S5). In Lombardia, the peak MAX mortality was between March 20 and 25, i.e., roughly simultaneous with the lowest mobility and two weeks after its first substantial drop. Notably, in most Italian regions mobility during lockdown weekends reached \(-1.00\), i.e. \(-100\%\). For comparison, in the state of New York, which had among the strongest restriction measures in the U.S., Google’s “Grocery & pharmacy” never fell below \(-0.40\). We refer to Google’s “Grocery & pharmacy” curves as local mobility because they measure how much individuals move around where they live, as opposed to how much individuals move from place to place (e.g., to go from Wuhan to Milan, or from Milan to Palermo, or New York City). Obviously both types of mobility are relevant for the spread of a virus, and definitions depend on scale/resolution, but the first one is the one we analyzed.
To construct positivity curves, we combined daily public records on number of tests performed and number of new cases, which are also provided by the Italian Civil Protection Agency. Taking daily ratios of new cases to tests performed is clearly imperfect, because of (variable and unreported) delays in test results. But regularizing and smoothing these ratios (see Methods) produced a reasonable proxy. Smoothed positivity surpassed 0.10, i.e. 10%, as early as February 20 in some hard-hit regions, peaked in a staggered fashion throughout March, and fell below 0.10 for all regions by around April 22 (see Fig. S5). Lombardia surpassed 0.10 around February 22 and peaked around March 15-18; that is, roughly a month and roughly a week, respectively, prior to the peak of MAX mortality. Though we cannot draw exact parallels (our positivity curves are approximate and smoothed), this is consistent with what was observed, e.g., in New York City, where positivity was above 0.10 approximately from March 6-7 to May 12-13 and peaked at about 0.70 around March 28, with deaths peaking between April 5 and 13.
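One simple way to regularize daily positivity ratios is to divide trailing sums of cases by trailing sums of tests, which damps the effect of reporting delays and of days with very few tests. The sketch below illustrates this idea only; the function name and the 7-day window are assumptions, and the regularization and smoothing actually used are described in the Methods.

```python
import numpy as np

def positivity_curve(new_cases, tests, window=7):
    """Ratio of trailing-window sums of new cases and tests, clipped to [0, 1]."""
    cases = np.asarray(new_cases, dtype=float)
    tests = np.asarray(tests, dtype=float)
    kernel = np.ones(window)
    # Entry i of each sum covers days i-window+1 .. i (fewer at the start).
    cases_sum = np.convolve(cases, kernel, mode="full")[: len(cases)]
    tests_sum = np.convolve(tests, kernel, mode="full")[: len(tests)]
    ratio = np.divide(
        cases_sum, tests_sum, out=np.zeros_like(cases_sum), where=tests_sum > 0
    )
    return np.clip(ratio, 0.0, 1.0)
```

Days on which no tests were recorded within the window are assigned a positivity of zero here; other conventions (e.g., carrying the last value forward) would be equally defensible.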
To anchor local mobility and positivity curves to the epidemic unfolding in each region, we shifted them congruently with the mortality curves. Figure 4(a) displays shifted curves based on probKMA run on MAX data with \(K=2\) (Fig. S6 displays shifted curves based on probKMA run on DPC and ISTAT data). The horizontal axis now indicates again days in the region-specific epidemic unfolding, restricted to the 65-day portions where mortality curves align forming the two probKMA motifs.
We then used function-on-function regressions^{14,30} to model the statistical dependence of mortality on local mobility and positivity; in symbols, we fit the joint model \(y(t) = \alpha (t) + \int \beta _{mob}(t,s)x_{mob}(s)ds + \int \beta _{pos}(t,s)x_{pos}(s)ds + \varepsilon (t)\), where \(y(t)\) is the response curve, i.e. mortality, \(\alpha (t)\) is the intercept, \(\varepsilon (t)\) is the model error, and \(x_{mob}(s)\) and \(x_{pos}(s)\) are the predictor curves (mobility and positivity, respectively). These predictors are integrated over time, with “effects” represented by surfaces; \(\beta _{mob}(t,s)\) is the association of mortality at time t with local mobility at time s, and similarly \(\beta _{pos}(t,s)\) for positivity (see Methods).
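To illustrate how an effect surface \(\beta(t,s)\) can be estimated, the sketch below discretizes the integral on a regular grid with spacing `ds` and solves a ridge-penalized least squares problem for a single predictor. This is a simplified stand-in under stated assumptions: the function names, the Riemann-sum discretization and the ridge penalty are illustrative choices, whereas the authors' estimator is specified in their Methods.

```python
import numpy as np

def fit_fof(X, Y, ds, ridge=1e-3):
    """Discretized function-on-function regression with a ridge penalty.

    X: (n, S) predictor curves on a grid with spacing ds.
    Y: (n, T) response curves.
    Returns alpha with shape (T,) and beta with shape (T, S), so that
    y_i(t) is approximated by alpha(t) + ds * sum_s beta(t, s) * x_i(s).
    """
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    gram = Xc.T @ Xc + ridge * np.eye(X.shape[1])
    W = np.linalg.solve(gram, Xc.T @ Yc)   # (S, T), equals ds * beta transposed
    beta = W.T / ds                        # beta[t, s]
    alpha = y_mean - ds * beta @ x_mean
    return alpha, beta

def predict_fof(X, alpha, beta, ds):
    """Predicted response curves for predictor curves X."""
    return alpha + ds * np.asarray(X, dtype=float) @ beta.T
```

A joint model with two predictors, as in the text, could be handled in this sketch by concatenating the mobility and positivity grids column-wise and splitting the estimated surface afterwards. With only 20 curves and grids of 65 days the unpenalized problem is underdetermined, which is why some form of regularization or basis expansion is needed.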
Figure 4(b) shows the effect surfaces for local mobility and positivity estimated using the MAX curves as response. \({\hat{\beta }}_{mob}(t,s)\) suggests that local mobility levels early on and midway through the epidemic (e.g., around the March 9 lockdown date for Lombardia) are strong positive predictors of mortality at its peak, with the early predictive signal stronger than the midway one. In contrast, the local mobility level late in the epidemic has a negative association with mortality at its peak, likely reflecting a faster resumption of mobility in regions with milder epidemics. \({\hat{\beta }}_{pos}(t,s)\) suggests that positivity levels early on and midway through the epidemic are also positive predictors of mortality at its peak, though the predictive signals are substantially weaker than those of mobility, likely because they are confounded with the latter. However, the positivity level late in the epidemic has a marked positive association with mortality at its peak. Here the signal is “detangled” from that of mobility, and one finds a sort of retrospective signature; regions which fared worse still had heightened positivity in the late stages of their epidemics. The data at our disposal do not allow an accurate evaluation of the lags that might occur between mobility, positivity and mortality. However, we performed some additional analyses to investigate this. We further denoised the curves by projecting them on their first functional principal components^{14}, and measured the distances between the peaks of such projections. On average, there were \(\approx 20\) days between the peak of mobility and the peak of positivity, and \(\approx 10\) days between the peak of positivity and the peak of mortality (see Fig. S7). Back to the estimated effect surfaces, we found them to be remarkably similar across the three data sets (MAX, DPC and ISTAT). 
The joint models all have in-sample \(R^2\)s above 90% and leave-one-out cross-validated (LOOCV) \(R^2\)s above 50% (see Table S2), with strong and comparable contributions of local mobility and positivity (e.g., for the MAX curves, the partial \(R^2\)s are 62% and 53%, respectively). Also, while this is not the case for all regions, residuals are rather consistent across data sets for Veneto, whose mortality is well predicted, and for Lombardia, whose mortality is always and sizably underestimated (see Fig. S8a and Fig. S9).
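The difference between in-sample and leave-one-out cross-validated \(R^2\) can be sketched as below. Two assumptions are made for illustration: the \(R^2\) of a set of predicted curves is taken as one minus the ratio of residual to total sums of squares pooled over regions and time points, and a plain ridge regression on discretized curves plays the role of the functional model. The exact definitions used for Table S2 are in the Methods.

```python
import numpy as np

def ridge_fit(X, Y, ridge):
    """Ridge regression of response curves Y (n, T) on features X (n, p)."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(X.shape[1]), Xc.T @ Yc)
    return x_mean, y_mean, W

def ridge_predict(X, model):
    x_mean, y_mean, W = model
    return y_mean + (X - x_mean) @ W

def r2(Y, Y_hat):
    """Pooled R^2 over all curves and time points (assumed definition)."""
    sse = np.sum((Y - Y_hat) ** 2)
    sst = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - sse / sst

def loocv_r2(X, Y, ridge=1.0):
    """Refit the model n times, each time predicting the held-out unit."""
    Y_hat = np.empty_like(Y, dtype=float)
    for i in range(len(Y)):
        keep = np.arange(len(Y)) != i
        model = ridge_fit(X[keep], Y[keep], ridge)
        Y_hat[i] = ridge_predict(X[i : i + 1], model)[0]
    return r2(Y, Y_hat)
```

With 20 regions, each cross-validation fit uses 19 curves and predicts the remaining one, so a gap between in-sample and LOOCV values of the size reported in the text is not unusual.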
In order to further assess the roles of local mobility and positivity, we also considered marginal function-on-function regressions for mortality on each, separately; in symbols, \(y(t) = \alpha (t) + \int \beta _{mob}(t,s)x_{mob}(s)ds + \varepsilon (t)\) and \(y(t) = \alpha (t) + \int \beta _{pos}(t,s)x_{pos}(s)ds + \varepsilon (t)\). Effect surface estimates for local mobility are very similar to those in the joint models for all three data sets (see Fig. S10). Those for positivity confirm a strong association with mortality at its peak, but are less defined in terms of time profile (see Fig. S11). In summary, we find substantial evidence that local mobility and positivity are associated with COVID-19 mortality, and can predict it with some lag time. Though the data at our disposal do not allow us to pinpoint lag lengths with accuracy, our analysis does support their roles as policy-actionable and monitoring variables, respectively. We also find that, even when considered jointly, these variables are not enough to fully account for the massive numbers of COVID-19 deaths recorded in Lombardia, the worst-hit region in the country.
The role of sociodemographic, infrastructural and environmental factors
We considered 68 scalar (non-longitudinal) covariates retrieved from public sources, proxying for sociodemographic, infrastructural and environmental factors debated by scientists and policymakers during the epidemic (see Table S3). Many of these are suboptimal proxies; they refer to the closest times we could find data for (in some cases 2016 or earlier) and are also at the coarse resolution of regions. We performed an initial screen among these covariates to guarantee reasonable data quality (eliminating older and less complete data sets), facilitate interpretations and control collinearity. Fig. S12 shows a histogram of the pairwise correlations, about a quarter of which exceed 0.5 in absolute value, and Fig. S13 shows a dendrogram where the covariates agglomerate in distinct groups. We thus selected 12 covariates which were relatively recent (2017 or 2018) and well spread across the dendrogram groups. These capture aging of the population; prevalence of pre-existing conditions believed to affect disease severity; quality of distributed primary health care vs. centralized hospital-based health care; the potential of hospitals and nursing homes, but also schools, workplaces, households and public transport to act as contagion hubs; and pollution levels (see Table 1; Fig. S14 provides marginal densities, pairwise scatter plots and correlations for the 12 selected covariates).
Even this restricted set of 12 covariates presents a distinct interdependence structure (see covariates dendrogram in Fig. 5(a) and Variance Inflation Factors in Table S4). For instance, our contagion hubs proxies for hospitals, schools and work places, and our (inverse) proxy for quality of distributed, primary health care (number of adults per family doctor), tend to vary closely together across regions. Also, our contagion hub proxy for public transport and pollution levels tend to vary together (this is not counterintuitive, as both increase in more industrialized regions with large metropolitan areas), as do the percentages of individuals affected by diabetes and allergies, and our proxy for quality of centralized, hospital-based health care (ICU beds per 100,000 inhabitants) and the percentage of individuals over 65.
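The two collinearity diagnostics mentioned above can be computed from the correlation matrix of the covariates. The sketch below uses the standard identity that the Variance Inflation Factor of covariate j equals the j-th diagonal entry of the inverse correlation matrix; the function names and the 0.5 threshold argument are illustrative, and the input matrix is assumed to have regions as rows and covariates as columns.

```python
import numpy as np

def variance_inflation_factors(Z):
    """VIF of each column of Z, via the inverse of the correlation matrix."""
    corr = np.corrcoef(np.asarray(Z, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

def share_of_strong_correlations(Z, threshold=0.5):
    """Fraction of covariate pairs whose absolute correlation exceeds threshold."""
    corr = np.corrcoef(np.asarray(Z, dtype=float), rowvar=False)
    upper = corr[np.triu_indices(corr.shape[0], k=1)]
    return float(np.mean(np.abs(upper) > threshold))
```

A VIF of 1 indicates a covariate uncorrelated with the others; the correlation matrix must be invertible, which requires more regions than covariates and no exact linear dependence among them.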
Conversely, some regions show similar profiles across covariates (see regions dendrogram in Fig. 5(a)). For instance, Lombardia, Veneto, Emilia Romagna and Piemonte have strong similarities, as do groups of southern regions (e.g., Sicilia, Campania, Puglia and Calabria; Basilicata, Abruzzo and Molise). An interesting characterization is produced using Cheng and Church’s biclustering algorithm^{31}, which we implement with an adjusted mean squared residue, or H-score^{32}. A bicluster is a subset of regions which exhibit similar behavior across a subset of covariates. Figure 5(b) shows two biclusters with similar adjusted H-score values, obtained through the same run of the algorithm. The first bicluster comprises central and southern regions, all with “flat(tened)” epidemics (Group 2). Its regions have low ratios of adults to family doctors, limited concentrations in hospitals, nursing homes, work places and public transport, and low pollution levels. They also have high percentages of diabetic individuals and limited availability of ICU beds. The second bicluster comprises northern regions with “exponential” epidemics (Group 1), such as Lombardia, Emilia Romagna and Piemonte, but also northern and central regions with “flat(tened)” epidemics (Group 2), such as Veneto, Friuli Venezia Giulia and Toscana. Its regions have high ratios of adults to family doctors, high concentrations in hospitals, work places and classrooms, and tend to have large percentages of individuals over 65. They also have low percentages of diabetic individuals and medium or small-sized households.
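For reference, the (unadjusted) mean squared residue that Cheng and Church's algorithm seeks to keep small can be computed as below. The sketch evaluates the score of a given candidate bicluster only; it does not implement the search over rows and columns, nor the adjustment to the H-score that the text refers to, and the function name is hypothetical.

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Cheng and Church H-score of the submatrix of A given by rows and cols.

    Each entry is compared with what its row mean, column mean and overall
    mean would predict under a purely additive pattern; H is the mean of the
    squared discrepancies, so H = 0 for a perfectly additive bicluster.
    """
    sub = np.asarray(A, dtype=float)[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    residue = sub - row_means - col_means + sub.mean()
    return float(np.mean(residue ** 2))
```

In this setting the rows would be regions and the columns standardized covariates, so a low score means the selected regions move together across the selected covariates.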
Next, we used functional regressions with a two-fold aim: pursue a more direct, systematic assessment of the associations between the scalar covariates and COVID-19 mortality; and use the scalar covariates as controls in models comprising mobility and positivity to reassess these key predictors. We stress again that the coarse resolution of the data poses serious limitations for these analyses, because it may dilute some predictive signals and because it confines us to a small sample size. With only \(n=20\) observational units (the regions), fitting functional regression models comprising many terms (e.g., several scalar covariates and possibly their interactions; mobility and positivity curves along with more than one scalar covariate) produces unstable, overfit outcomes. Thus, we evaluate only the marginal effects of the scalar predictors, and the effects of mobility and positivity with one scalar control at a time. The marginal function-on-scalar regressions of mortality curves on each of the 12 covariates have in-sample \(R^2\)s ranging between \(\approx 20\%\) and 65%. Here the “effects” are curves; \(\beta _{x}(t)\) represents the association of mortality at time t with the covariate x. For 8 of the covariates the \({\hat{\beta }}_{x}(t)\)’s show the expected signs throughout the peak period of the epidemic. In particular, the (inverse) proxy for quality of distributed, primary health care is the strongest marginal predictor; adults per family doctor shows a very large positive association with mortality. Also hospital, school and work place contagion hub proxies show strong positive associations with mortality. Nursing homes and public transport contagion hub proxies, pollution and the percentage of individuals over 65 are positive but comparatively weaker marginal predictors. For 4 of the covariates the \({\hat{\beta }}_{x}(t)\)’s show unexpected signs. 
The percentages of diabetic and allergic individuals show negative associations with mortality, likely due to the fact that their prevalence is high(er) in areas which were spared the brunt of the epidemic. In fact, estimated effect curves become positive when a differential intercept is included in the model to account for different overall mortality levels in Group 1 and Group 2 regions (see Fig. S15). Also the average number of members per household shows a negative association with mortality. Its small range of variation across regions (\(\approx 2.0\) to \(2.8\), mean 2.3, s.d. 0.16) may not allow it to properly proxy the effect of household contagions. At the same time, a strong negative correlation with the percentage of individuals over 65 may not allow it to properly proxy intergenerational contacts; regions with more elderly people are in fact those with smaller households. The negative association of average number of members per household with mortality, which persists even when including a differential intercept for Group 1 and Group 2 in the model (see Fig. S15), may simply be a “shadow” of its negative correlation with the percentage of individuals over 65. Finally, ICU beds per 100,000 inhabitants shows a positive association with mortality which also persists when including a differential intercept for Group 1 and Group 2 in the model (see Fig. S15), and may be in part a “shadow” of positive correlations with percentage of individuals over 65 and average number of beds per hospital. However, this proxy for quality of centralized, hospital-based health care, so prominent in the public debate during the epidemic, is not a negative predictor of mortality in our analysis.
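A marginal function-on-scalar regression \(y(t) = \alpha(t) + \beta_x(t)\,x + \varepsilon(t)\) can be illustrated by fitting an ordinary least squares line separately at each time point, as sketched below. This pointwise fit, without any smoothness penalty on \(\beta_x(t)\), is an assumed simplification and not necessarily the estimator used by the authors; the function name is hypothetical.

```python
import numpy as np

def fit_fos(x, Y):
    """Pointwise least squares for y_i(t) = alpha(t) + beta(t) * x_i + error.

    x: (n,) scalar covariate, one value per region.
    Y: (n, T) response curves evaluated on a common grid.
    Returns alpha and beta, each with shape (T,).
    """
    x = np.asarray(x, dtype=float)
    Y = np.asarray(Y, dtype=float)
    xc = x - x.mean()
    beta = xc @ (Y - Y.mean(axis=0)) / (xc @ xc)
    alpha = Y.mean(axis=0) - beta * x.mean()
    return alpha, beta
```

A positive value of `beta[t]` indicates that regions with larger values of the covariate tended to have higher mortality at time t. A differential intercept for Group 1 and Group 2, as discussed in the text, would amount to adding a group indicator as a second regressor.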
In conclusion, better proxies and finer resolution may reveal stronger aggravating roles for age, nursing homes, public transport and pollution^{8, 9} and better dissect the roles of chronic conditions, households and intergenerational contacts, and ICU availability^{33,34}. But our analysis, notwithstanding limitations in the data, suggests important roles of primary care in mitigating mortality, and of contacts in hospitals, schools and work places in aggravating it.
The results of our marginal function-on-scalar regressions, which are summarized in Fig. 6(a) for MAX mortality curves, are also consistent across data sets (see Fig. S15), which lends them support, at least at the resolution of regions. To further validate their stability we ran a functional generalization of SsNALEN^{35}, an Elastic Net-like algorithm that performs feature selection for regressions with many predictors, producing reasonably stable outcomes even with small sample sizes and collinear features. Reassuringly, the output of SsNALEN is consistent with the marginal analysis, and again consistent across data sets (see Table S5): the top feature is always adults per family doctor, and the top 5 always include, in addition to it, average beds per hospital, average students per classroom, average employees per firm, and average members per household.
Finally, we ran again the function-on-function regression of mortality on local mobility and positivity, and reevaluated the effects of these predictors introducing in the model one of the top 5 scalar covariates at a time (see Fig. S16 for results on DPC, ISTAT and MAX data), as well as their first principal component, which explains \(\approx 68\%\) of their variability and can act as a “summary” control (see Fig. 6(b) for MAX curves and Fig. S17 for DPC and ISTAT curves). Remarkably, while the control covariate “subsumes” some of the predictive power in each model, the estimated effect surfaces of local mobility and positivity retain the same shapes, and they remain very strong and comparable contributors (e.g., for MAX curves in Fig. 6(b), the overall in-sample \(R^2\) reaches 94%, the LOOCV \(R^2\) is 70%, and the partial \(R^2\)s are 66%, 61% and 39%, respectively, for local mobility, positivity and the first principal component; see also Table S2). Thus, with all the limitations of the data at our disposal, controlling for relevant covariates does not modify how the epidemic unfolding is associated with local mobility and positivity over time. Introducing sociodemographic, infrastructural and environmental factors in the modeling also does not change what we observed concerning residuals: mortality in Veneto is well predicted, and mortality in Lombardia remains sizably underestimated (see Fig. S8b for MAX and Fig. S17 for DPC and ISTAT).
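The “summary” control can be illustrated with a standard principal component analysis of the selected covariates. The sketch below standardizes each covariate before extracting the first component; whether the authors standardized is not stated in this section, so that step, like the function name, is an assumption.

```python
import numpy as np

def first_principal_component(Z):
    """First principal component of the standardized columns of Z.

    Returns the scores (one per region), the loadings (one per covariate)
    and the share of total variance explained by the component.
    """
    Z = np.asarray(Z, dtype=float)
    Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Zs, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return U[:, 0] * s[0], Vt[0], float(explained[0])
```

The scores can then enter a regression as a single scalar control in place of the five correlated covariates, at the cost of discarding the share of their variability not captured by the first component.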
Discussion
Notwithstanding the limitations of the data employed in this study, using FDA techniques we were able to characterize heterogeneous and staggered epidemics in different areas of Italy, recapitulating and quantifying what scientists, policy makers and the public saw unfolding during the months of February, March and April 2020. In addition, we were able to document strong associations of COVID-19 mortality with local mobility and positivity, which persist in models that control for other relevant covariates. Investigating local mobility and positivity as, respectively, an actionable effector and a sentinel indicator of epidemic strength and progression, possibly to be used to adapt mitigation and containment efforts in real time, will require more and better data; in particular, accurate data on cases and hospitalizations in addition to deaths, at a resolution much finer than that of Italian regions. Such data would allow a more systematic evaluation of the lags between the temporal patterns of mobility, contagions, illnesses and casualties. This is an important avenue for future studies, which could again utilize FDA tools (e.g., registration and dimension reduction techniques^{36}). Such data would also be critical to better capture predictive signals in a number of covariates, which may weaken and/or become confounded when aggregating data over broad, internally heterogeneous areas. Clearly, the limited data at our disposal for this study prevent us from drawing causal implications with confidence, but our results, along with those of other recent studies^{37,38}, do support a role for mobility as a key modulator of COVID-19 spread and for positivity as a monitoring variable. Moreover, they support a role for distributed, primary health care in mitigating mortality, and for hospitals, schools and workplaces as contagion hubs that may aggravate the epidemic. 
If confirmed and fine-tuned on higher resolution data, these findings could also inform decision making, e.g., on short- and medium-term investments to boost distributed health care, or to “pod” patients, students or employees. Finally, an extension of the temporal span of the data would also be of great interest to properly characterize different phases of the Italian epidemic, including its evolution after the gradual weakening of lockdown measures in May 2020. We believe that our work demonstrates the potential of FDA techniques for analyzing epidemiological data and we note that, while some of the techniques we used are well-established, others are very recent. These novel tools appear to offer original and useful insights, but it is important to point out their limited validation to date. Of course, our pipelines and the mix of FDA tools used in this study could be applied to COVID-19 data from other parts of the world.
Methods
Data retrieval and preprocessing
Functional variables
Daily cumulative COVID-19 death counts per region were retrieved from the Italian Civil Protection Agency (Dipartimento della Protezione Civile; DPC^{39}). DPC mortality curves from February 24 to April 30, 2020, were computed for each region as the daily increments in COVID-19 death counts, divided by the population of the region as of January 1, 2019 (as recorded by ISTAT^{40}). DPC mortality curves were set to zero for the period February 16–23, 2020, before the Civil Protection Agency started releasing data. Daily death counts from all causes in 7270 Italian municipalities (about 93.5% of the Italian population) for the years 2015–20 were downloaded from the Italian National Institute of Statistics (ISTAT^{41}) on June 4, 2020. Data were aggregated by region, and ISTAT differential mortality curves from February 16 to April 30, 2020, were computed for each region as the daily difference between 2020 deaths and the average daily deaths in 2015–19, divided by the total population of the municipalities included in the death counts as of January 1, 2019^{42}. MAX mortality curves were created taking, for each region and each day, the maximum between DPC mortality and ISTAT differential mortality. Daily measurements concerning “Grocery & pharmacy” mobility from February 16 to April 30, 2020, were downloaded for each region from the Google Mobility Report^{43} (local mobility curves). These measurements express percent changes with respect to the corresponding daily mobility levels in the first five weeks of 2020 (January 3 to February 6). Positivity curves were constructed using raw data from the Italian Civil Protection Agency^{39}. For each day from February 24 to April 30, 2020, and each region, we took the ratio between the number of new positive cases and the number of new tests performed. 
The ratios were truncated at 0 and 1 to account for irregularities in the raw data (e.g., positive cases \(=-1\), or positive cases exceeding tests performed, presumably due to delays in test results). Like DPC mortality, positivity curves were set to zero for the period before the Civil Protection Agency started releasing data (February 16–23, 2020). We point out that positivity curves must be used with caution due to the fact that positive cases have been tallied with different rules and approaches in different areas and at different times^{44}. For all functional data sets, the two self-governing provinces of Trento and Bolzano were considered together as the Trento/Bolzano region, since not all data were available for both provinces separately. The 20 curves in each functional data set were smoothed using cubic smoothing B-splines with knots at each day and roughness penalty on the curve second derivative^{14}. For each functional data set, the smoothing parameter was selected minimizing the average generalized cross-validation error (GCV^{45}) across the 20 curves. All computations were performed using the statistical software R^{46}, and specifically the R package fda^{47}.
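As an illustration only, the construction of the raw (unsmoothed) curves described above can be sketched in a few lines of Python. The function names, the toy inputs, the convention that the cumulative count before the first reported day is zero, and the treatment of days with no tests are assumptions made for this sketch; the analysis itself was carried out in R.

```python
def daily_mortality(cumulative_deaths, population):
    """Daily increments of a cumulative death count, divided by population.

    Assumption of this sketch: the count before the first reported day is 0.
    """
    previous = [0] + list(cumulative_deaths[:-1])
    return [(today - before) / population
            for before, today in zip(previous, cumulative_deaths)]


def max_mortality(dpc_curve, istat_curve):
    """Day-by-day maximum between DPC and ISTAT differential mortality."""
    return [max(d, i) for d, i in zip(dpc_curve, istat_curve)]


def positivity(new_cases, new_tests):
    """Ratio of new positive cases to new tests, truncated to [0, 1].

    Assumption of this sketch: days with no tests are given ratio 0.
    """
    ratios = []
    for cases, tests in zip(new_cases, new_tests):
        ratio = cases / tests if tests > 0 else 0.0
        ratios.append(min(1.0, max(0.0, ratio)))
    return ratios
```

For instance, `positivity([5, -1, 30], [10, 20, 20])` truncates the negative count to 0 and the ratio above one to 1. Smoothing of the resulting curves is not covered by this sketch.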
Scalar covariates
We considered a large number of scalar covariates of potential interest (see Table S3), and focused on the 12 listed in Table 1 and below. In retrieving and computing various measurements, as was done for the functional variables, the provinces of Trento and Bolzano were aggregated into the Trento/Bolzano region. % Over 65 was retrieved from ISTAT^{40} at the regional level for the year 2018. % Diabetics and % Allergics were retrieved from ISTAT^{48} at the regional level for the year 2018. Adults per family doctor was retrieved from the Ministry of Health^{49} at the regional level for the year 2017. To compute ICU beds per 100,000 inhabitants, we collected the total number of ICU beds in each region in 2018 from the Ministry of Health^{50}, multiplied by 100,000 and divided by the population of the region as of January 1, 2019^{40}. To compute Ave. beds per hospital (whole) we used data from the Ministry of Health^{51}, which provides the number of beds per ward in each hospital in 2018. We first aggregated them over wards belonging to the same hospital, and then averaged over hospitals in each region. Ave. beds per nursing home (ward) was also obtained based on data for the year 2018 from the Ministry of Health^{52}; here we considered regional averages at the level of wards, without aggregating over wards inside the same nursing home (the ward-level covariate had a slightly higher association with mortality outcomes). To compute Ave. students per classroom we used data from the Ministry of Education^{53}, which provides the number of students in each classroom of each school in the country (public or private, at every level of education), for the year 2018. We averaged them over schools in each region. 
Data for Trento/Bolzano and Valle d’Aosta were missing, and were imputed through random forest imputation^{54} using the R package missForest^{55} with default parameters maxiter=10 (maximum number of iterations) and ntree=100 (number of trees)—the latter is very large compared to the small number of missing values to be imputed, and our runs always converged well before the 10th iteration. To compute Ave. employees per firm we used data from ISTAT^{56}, which provides number of employees per firm at the level of municipalities. We averaged them over firms in each region. Data for Valle d’Aosta were missing, and were again imputed through random forest imputation with default parameters. Ave. members per household was retrieved from ASR Lombardia^{57} at the regional level for the year 2017. To compute Public transport rides per capita we used data from ISTAT^{58}, which provides the number of rides per capita for each Italian province in 2017. We multiplied these by the provinces’ population as of January 1, 2019^{40}, summed up over provinces in the same region, and divided by the region population as of January 1, 2019^{40}. To compute PM10 we used data from ISTAT^{58}, which provides the average annual concentrations of PM10 (in \(\upmu \hbox {g}/\hbox {m}^{3}\)) detected by air quality meters distributed over the Italian territory. We averaged them over meters located in each region.
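The aggregation of province-level rides per capita to the regional level amounts to a population-weighted average. A minimal sketch is given below; the function name and the input layout are hypothetical, and the sketch assumes that the region population equals the sum of the populations of its provinces.

```python
def regional_rides_per_capita(provinces):
    """Population-weighted average of province-level rides per capita.

    `provinces` is a list of (rides_per_capita, population) pairs for the
    provinces of one region (hypothetical layout used for illustration).
    """
    total_rides = sum(rides * pop for rides, pop in provinces)
    total_population = sum(pop for _, pop in provinces)
    return total_rides / total_population
```

With two provinces of 1000 and 3000 inhabitants and 100 and 50 rides per capita, respectively, the regional value is 62.5 rides per capita.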
Multivariate analysis tools
We used a number of standard multivariate techniques to analyze first the entire set of scalar covariates and then the 12 we focused on, including the extraction of Principal Components^{59}, the calculation of Variance Inflation Factors^{60} to evaluate multicollinearities, and clustering based on hierarchical agglomeration^{59}. The latter was used both to agglomerate covariates with similar behavior across regions and to agglomerate regions with similar behavior across covariates. Agglomerative hierarchical clustering groups elements in a set with a bottom-up procedure that results in a dendrogram. Each element starts in its own cluster, and pairs of clusters are merged iteratively with a chosen distance for elements and linkage criterion for clusters. We employed the correlation distance, defined as \(d(x_1,x_2)=1-\vert corr(x_1,x_2)\vert\) for two generic elements \(x_1\) and \(x_2\), and the complete linkage, defined as \(D(X_1,X_2) = \max _{x_1 \in X_1, x_2 \in X_2} d(x_1,x_2)\) for two generic clusters \(X_1\) and \(X_2\) (thus, the distance between two clusters is defined as the furthest distance between their elements).
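The agglomerative procedure with correlation distance and complete linkage can be sketched as follows. This is a naive illustration written for readability, not the implementation used for the analysis; all function names are hypothetical, and the sketch assumes that no element is constant (otherwise the correlation is undefined).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_distance(x, y):
    """d(x1, x2) = 1 - |corr(x1, x2)|."""
    return 1.0 - abs(pearson(x, y))


def complete_linkage(cluster_a, cluster_b, data):
    """Largest distance between an element of one cluster and one of the other."""
    return max(correlation_distance(data[i], data[j])
               for i in cluster_a for j in cluster_b)


def agglomerate(data):
    """Merge clusters bottom-up; return the merge history (the dendrogram)."""
    clusters = [[i] for i in range(len(data))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = complete_linkage(clusters[a], clusters[b], data)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges
```

Because the distance uses the absolute value of the correlation, strongly negatively correlated elements are treated as close, just like strongly positively correlated ones.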
We also used biclustering on the 20 (regions) by 12 (covariates) data matrix, to identify subsets of regions exhibiting similar behaviors across subsets of covariates. Following standard literature, we sought submatrices of the data whose entries are consistent with the “ideal” additive model \(x_{i,j} = \mu + \alpha _i + \tau _j\), where \(\mu\) is the typical value within the bicluster, and \(\alpha _i\) and \(\tau _j\) are additive adjustments for row i and column j, but we set all \(\alpha _i\)s to 0 in order to find constant column biclusters, i.e., submatrices with constant columns (covariates). We employed the Cheng and Church Biclustering Algorithm^{31}, a greedy algorithm which finds the largest submatrices whose departure from the additive model is below a user-defined threshold. The departure is computed using the H-score (or mean squared residue score); in symbols, \(H(I,J) = \frac{1}{\mid I \mid \mid J \mid } \sum _{i \in I, j \in J} \left( x_{i,j} - x_{I,j} \right) ^{2}\), where I and J index the sets of rows and columns composing the bicluster, \(x_{i,j}\) is a generic cell in the bicluster and \(x_{I,j}\) is the mean of column j (note the algorithm thus estimates the typical values in the additive model using means). We implemented this algorithm with a recently proposed adjustment to the H-score^{32} that corrects a bias towards smaller biclusters in the original formulation. The adjusted H-score is defined as \(H_{adj}(I,J) = \left( \prod _{r=2}^{I-1}\frac{r^2}{r^2-1}\prod _{q=2}^{J-1}\frac{q^2}{q^2-1}\right) ^{-1} H(I,J)\), where r and q indicate the number of rows and columns, respectively.
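For concreteness, the unadjusted H-score of a candidate constant-column bicluster can be computed as in the sketch below. The function name and the list-of-lists matrix layout are assumptions of this illustration; the greedy search of the Cheng and Church algorithm and the bias adjustment are not reproduced here.

```python
def h_score(matrix, rows, cols):
    """Mean squared residue of the submatrix (rows x cols) with respect to
    its column means, i.e. the H-score under the constant-column model."""
    total = 0.0
    for j in cols:
        column_mean = sum(matrix[i][j] for i in rows) / len(rows)
        total += sum((matrix[i][j] - column_mean) ** 2 for i in rows)
    return total / (len(rows) * len(cols))
```

A submatrix whose columns are exactly constant has H-score 0; any within-column variability increases the score.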
Functional data analysis tools
Local clustering of curves and functional motif discovery
We performed local clustering of smoothed mortality curves (DPC, ISTAT and MAX, separately) using probabilistic K-mean with local alignment (probKMA^{22}). ProbKMA is a K-mean-like algorithm for functional data that finds K groups in a set of curves based on local similarity among portions of the curves themselves. This allows the discovery of functional motifs, i.e. of typical local shapes that recur within and across the curves. In symbols, the algorithm finds K motifs \(v_1,\dots ,v_K\), membership probabilities \(p_{k,i}\) and shifts \(s_{k,i}\) (i.e. the starting points of the motif instances) for each cluster-curve pair that minimize the generalized least-squares functional \(J(v_1,\dots ,v_K,p_{k,i},s_{k,i})=\sum _{i=1}^N \sum _{k=1}^K p_{k,i}^2 d^2({\tilde{x}}_{i},v_k)\), where \({\tilde{x}}_{i}\) is the portion of the curve i corresponding to the shift \(s_{k,i}\), and d is the distance used to capture local similarity. For each data set, we considered \(K=2\) and \(K=3\) (using larger K values did not improve results and did not produce robust clusters across the three data sets considered). ProbKMA is probabilistic; it returns as output a membership probability \(p_{k,i}\) for each cluster-curve pair. However, such an output can be turned into a hard partition by assigning each curve to the group with highest membership probability, which is what we did here. Notably, for \(K=2\), membership probabilities showed that Lombardia’s and Valle d’Aosta’s extreme mortality patterns were not well accommodated even in the “exponential” group^{22}. The algorithm can employ different definitions of similarity d and thus capture different aspects of curve shapes. We used Euclidean (\(L^2\)) distance between curve levels for our main analysis (in symbols, \(d^2=\frac{1}{c} \int _0^c (x(t)-v(t))^2 dt\) for two generic curves x and v), though using Euclidean distance between curve derivatives produced similar results (not shown). 
ProbKMA allows the length of the motifs to be extended endogenously starting from a minimal one fixed as input. However, to identify epidemic patterns we ran it with a fixed motif length of 65 days, hence allowing a maximum shift of 10 days between curves (the mortality curves are 75 days long). The same clusters and very similar shifts were obtained with a fixed motif length of 50 days, which allows a maximum shift of 25 days (results not shown). The shifts produced by probKMA with \(K=2\) on the three mortality data sets (DPC, ISTAT and MAX) were employed to align, in addition to the mortality curves themselves, local mobility and positivity curves. All subsequent analyses employing shifted curves (tests contrasting groups of curves, functional boxplots and depth analyses, and functional regression models) were therefore restricted to the 65-day portions where mortality curves aligned following the two probKMA motifs. We also validated the groups produced by probKMA with a modified version of funBI^{24}, an algorithm typically used for finding functional biclusters. We used the modified funBI to identify groups of curves characterized by group-specific fixed-length motifs, considering all possible subcurves of a fixed length and clustering them with a divisive hierarchical algorithm (results not shown).
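Two ingredients of the procedure above lend themselves to a short illustration: the evaluation of the \(L^2\) distance between a motif and each admissible portion of a curve, and the conversion of membership probabilities into a hard partition. The sketch below is not probKMA itself (the iterative update of motifs, memberships and shifts is omitted); the function names are hypothetical, and the integral is approximated by an average over a daily grid.

```python
def l2_distance_sq(portion, motif):
    """Discretized version of d^2 = (1/c) * integral of (x(t) - v(t))^2 dt,
    computed as an average over daily grid points (an approximation)."""
    c = len(motif)
    return sum((x - v) ** 2 for x, v in zip(portion, motif)) / c


def best_shift(curve, motif):
    """Starting point of the curve portion closest to the motif."""
    c = len(motif)
    shifts = range(len(curve) - c + 1)
    return min(shifts, key=lambda s: l2_distance_sq(curve[s:s + c], motif))


def hard_partition(memberships):
    """Assign each curve to the cluster with highest membership probability.

    memberships[i][k] holds p_{k,i} for curve i and cluster k.
    """
    return [max(range(len(p)), key=lambda k: p[k]) for p in memberships]
```

With 75-day curves and a 65-day motif, `best_shift` scans the 11 starting points from 0 to 10, which corresponds to the maximum shift of 10 days mentioned above.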
Testing for differences between groups of curves
We employed an Interval-Wise Testing algorithm developed for omics data (IWTomics^{25}) to test for differences between the two groups of shifted mortality curves produced by probKMA with \(K=2\) (again, separately for DPC, ISTAT and MAX). IWTomics is a non-parametric, permutation-based functional hypothesis test. It contrasts two sets of curves aligned on a common domain to detect locations where their distributions differ significantly, and scales at which such significant differences are displayed (scales correspond to varying degrees of adjustment for multiple testing on intervals of varying lengths). Here locations are represented by the 65 days where the shifted mortality curves are defined, while scales vary from 1 day to the whole 65 days. The test was performed with the R package IWTomics^{25,61}. The package allows the user to select among various possible test statistics. Since our tests contrasted groups of curves produced by probKMA with a Euclidean (\(L^2\)) distance, so that cluster centers are in effect the functional means of the aligned curves in each cluster, we employed the mean as test statistic in IWTomics. The number of permutations was set to 1000 (default value).
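The permutation principle underlying the test can be illustrated with a much simpler, pointwise version that uses the absolute difference between group means as test statistic. The sketch below is not the IWTomics procedure: it performs no interval-wise adjustment across locations and scales, and its function names, seed and p-value convention are assumptions of this illustration.

```python
import random


def mean_curve(curves):
    """Pointwise mean of a list of equally long curves."""
    n = len(curves)
    return [sum(c[t] for c in curves) / n for t in range(len(curves[0]))]


def pointwise_permutation_pvalues(group1, group2, n_perm=1000, seed=0):
    """Unadjusted pointwise permutation p-values for a difference in means."""
    rng = random.Random(seed)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    n_points = len(pooled[0])

    def statistic(g1, g2):
        return [abs(a - b) for a, b in zip(mean_curve(g1), mean_curve(g2))]

    observed = statistic(group1, group2)
    exceed = [0] * n_points
    for _ in range(n_perm):
        permuted = pooled[:]
        rng.shuffle(permuted)  # random reassignment of curves to groups
        current = statistic(permuted[:n1], permuted[n1:])
        for t in range(n_points):
            if current[t] >= observed[t]:
                exceed[t] += 1
    # Add-one convention, so that no p-value is exactly zero
    return [(e + 1) / (n_perm + 1) for e in exceed]
```

Locations where the two groups clearly differ receive small p-values, while locations where the groups coincide receive p-values equal to 1.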
Functional boxplots and depth analyses
The functional boxplot^{28} is an exploratory tool used to visualize functional data. It is constructed after ordering a set of curves based on a depth measure, such as the modified band depth^{29}. The statistics employed to construct a functional boxplot are: the 50% central region envelope, the median curve, and the maximum non-outlying envelope. The 50% central region envelope corresponds to the box in a classical boxplot; it contains the 50% deepest, most centrally located curves. The median, i.e. the deepest curve, is inside this box and represents a robust “center” of the functional data set. The maximum non-outlying envelope is obtained by inflating the 50% central region envelope by 1.5 times its range. All curves extending outside of this envelope are flagged as outliers (the fact that the ISTAT data set in Fig. 3(b) lacks outlying curves based on this definition is due to the width of its 50% central region envelope). We ranked the curves based on their depth measurements, after attributing a sign to such measurements with an ad hoc procedure. We subtract the median from each curve, and consider the share of the domain on which the difference is positive. If this is larger than 50%, we attribute a positive sign to the curve’s depth; otherwise, we attribute a negative sign. Curves can thus be ranked from the most outlying above the median (labeled as positive), down to those close to the median, down to the most outlying below the median (labeled as negative); see Fig. 3(b). While this is not a fully general procedure, it works well on the DPC, ISTAT and MAX mortality curves we considered, which are rather unambiguously above/below the median (the share of the domain where the difference from the median is positive is \(\ge 70\%\) or \(\le 30\%\) for all curves in all three data sets). 
Note also that the median curve of a data set, defined as the deepest, does not necessarily have half of the curves above it and half of the curves below it in the signed ranking we created (e.g., Toscana is the median curve in both ISTAT and MAX data sets, but the number of curves above/below it differs).
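The signed ranking described above can be sketched as follows, using the modified band depth computed on bands delimited by pairs of curves. Function names are hypothetical, curves are assumed to be evaluated on a common grid, and ties in depth are resolved by taking the first deepest curve as the median; these are choices of this illustration.

```python
from itertools import combinations


def modified_band_depth(curves):
    """For each curve, average over all pairs of curves of the share of the
    domain on which the curve lies inside the band delimited by the pair."""
    n_points = len(curves[0])
    pairs = list(combinations(range(len(curves)), 2))
    depths = []
    for x in curves:
        total = 0.0
        for a, b in pairs:
            inside = sum(
                1 for t in range(n_points)
                if min(curves[a][t], curves[b][t]) <= x[t]
                <= max(curves[a][t], curves[b][t])
            )
            total += inside / n_points
        depths.append(total / len(pairs))
    return depths


def signed_depths(curves):
    """Attach a sign to each depth: positive if the curve lies above the
    median (deepest) curve on more than 50% of the domain, negative otherwise."""
    depths = modified_band_depth(curves)
    median = curves[max(range(len(curves)), key=lambda i: depths[i])]
    signed = []
    for x, depth in zip(curves, depths):
        share_above = sum(1 for a, m in zip(x, median) if a > m) / len(x)
        signed.append(depth if share_above > 0.5 else -depth)
    return signed


def signed_ranking(curves):
    """Indices from the most outlying curve above the median, through the
    curves close to the median, to the most outlying curve below it."""
    signed = signed_depths(curves)
    above = sorted((i for i, s in enumerate(signed) if s > 0),
                   key=lambda i: signed[i])
    below = sorted((i for i, s in enumerate(signed) if s <= 0),
                   key=lambda i: signed[i])
    return above + below
```

Under this convention the median curve itself receives a negative sign, since it is never strictly above itself, and it opens the lower part of the ranking.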
Functional regression models
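We consider models where a functional response variable is regressed against functional predictors and/or scalar covariates^{14,15}. All are special cases of the general equation^{30}
\[y_i(t) = \alpha (t) + \sum _{\ell =1}^{L} \int x_{i,\ell }(s)\, \beta _\ell (s,t)\, ds + \sum _{j=1}^{J} x_{i,j}\, \beta _j(t) + \varepsilon _i(t), \quad i=1,\ldots ,n.\]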
Here, n is the number of observations, in our case \(n=20\) regions. \(y_i(t)\), \(i=1,\ldots ,n\), are the aligned mortality curves (DPC, ISTAT or MAX, modeled separately), \(\alpha (t)\) is a functional intercept and \(\varepsilon _i(t)\), \(i=1,\ldots ,n\), are i.i.d. Gaussian model errors. L is the number of functional predictors. \(x_{i,\ell }(s)\), \(i=1,\ldots ,n\), \(\ell =1, \ldots ,L\), are such predictors, measured on the n observations. The regression coefficient of each functional predictor, \(\beta _\ell (s,t)\), is a surface. J is the number of scalar covariates. \(x_{i,j}\), \(i=1,\ldots ,n\), \(j=1, \ldots ,J\), are such covariates, measured on the n observations. The regression coefficient of each scalar covariate, \(\beta _j(t)\), is a curve. For the marginal regressions of mortality on local mobility and mortality on positivity, we have \(L = 1\) and \(J=0\). For the joint regression of mortality on local mobility and positivity, we have \(L = 2\) and \(J = 0\). For the marginal regressions of mortality on individual scalar covariates, we have \(L=0\) and \(J=1\). In Fig. S15 we fit marginal regressions of this type allowing the estimation of two different intercepts: \(\alpha _1(t)\) for curves in Group 1 and \(\alpha _2(t)\) for curves in Group 2. Finally, for the joint regression of mortality on local mobility, positivity and one scalar control variable, we have \(L=2\) and \(J=1\). To fit all these functional regressions we used the R package refund^{62}, which estimates the functional coefficients as well as their standard errors. We used these standard errors to construct confidence bands around the estimated functional coefficients. To gauge the explanatory power of each model, we computed the in-sample \(R^2\) as well as the Leave-One-Out Cross-Validation (LOOCV) \(R^2\). 
The former is a functional generalization of the classical coefficient of determination defined as \(SS_{reg}/(SS_{reg} + SS_{res})\), where \(SS_{reg}\) and \(SS_{res}\) are the regression and the residual sum of squares, respectively. To compute the latter, for each observation i, one replaces the fitted response curve \({{\hat{y}}}_i(t)\) (from the model fitted on all observations) with the predicted response curve \({{\hat{y}}}_{pred, i}(t)\) obtained for i from the model fitted withholding i itself. Finally, for models with multiple terms (predictors and/or covariates), the partial \(R^2\) of each term is computed as \((R^2 - R^2_{red})/(1 - R^2_{red})\), where \(R^2\) is the coefficient of determination of the complete model, and \(R^2_{red}\) that of the model comprising all terms but the one being evaluated.
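The three measures can be written compactly as in the sketch below. How the sums of squares are accumulated over the domain is an assumption of this illustration (here, a plain sum over the grid points on which the curves are evaluated, with the regression sum of squares taken around the pointwise mean response); the function names are hypothetical.

```python
def _sum_of_squares(curves_a, curves_b):
    """Sum of squared pointwise differences between two sets of curves."""
    return sum((a - b) ** 2
               for ca, cb in zip(curves_a, curves_b)
               for a, b in zip(ca, cb))


def functional_r2(observed, fitted):
    """SS_reg / (SS_reg + SS_res) for curves evaluated on a common grid."""
    n = len(observed)
    mean = [sum(c[t] for c in observed) / n for t in range(len(observed[0]))]
    ss_reg = _sum_of_squares(fitted, [mean] * n)
    ss_res = _sum_of_squares(observed, fitted)
    return ss_reg / (ss_reg + ss_res)


def partial_r2(r2_full, r2_reduced):
    """(R^2 - R^2_red) / (1 - R^2_red)."""
    return (r2_full - r2_reduced) / (1.0 - r2_reduced)
```

Passing leave-one-out predicted curves in place of `fitted` yields the LOOCV version; obtaining those predictions requires refitting the model n times, which is outside the scope of this sketch.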
SsNAL-EN for feature selection
SsNAL-EN^{35} is an algorithm to perform Elastic Net^{63} feature selection in a standard regression framework (i.e. when both response and features are scalars) which has been designed to provide computational efficiency. The Elastic Net is a hybrid between LASSO and Ridge, which penalizes both the \(L^1\) and the \(L^2\) (Euclidean) norm of the regression coefficients. The \(L^1\) penalty induces sparsity, selecting only the most predictive among the features. The \(L^2\) penalty regularizes coefficient estimates, mitigating variance inflation due to collinearity. To perform feature selection in the functional regression setting, we applied a generalization of SsNAL-EN which incorporates a group structure in the Elastic Net objective function and uses the Functional Principal Components basis expansion to represent a functional response. In particular, we performed feature selection for the regression of mortality against all 12 scalar covariates in Table 1. Notably, we selected the same top 5 features across all three data sets (DPC, ISTAT and MAX) (see Table S5), lending strong support to their association with mortality.
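To make the role of the two penalties concrete, the sketch below evaluates an Elastic Net penalty for scalar regression coefficients. The parametrization with an overall weight `lam` and a mixing weight `alpha` is one common convention and is an assumption of this illustration; it is not necessarily the one used in SsNAL-EN, whose functional generalization additionally groups coefficients.

```python
def elastic_net_penalty(beta, lam, alpha):
    """lam * (alpha * ||beta||_1 + 0.5 * (1 - alpha) * ||beta||_2^2).

    alpha = 1 gives the LASSO penalty, alpha = 0 the Ridge penalty
    (assumed parametrization, chosen for illustration).
    """
    l1_norm = sum(abs(b) for b in beta)
    l2_norm_sq = sum(b * b for b in beta)
    return lam * (alpha * l1_norm + 0.5 * (1.0 - alpha) * l2_norm_sq)
```

For example, with coefficients (1, -2, 0) and `lam = 1`, the pure \(L^1\) penalty equals 3 and the pure \(L^2\) penalty equals 2.5.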
References
La Rosa, G. et al. SARS-CoV-2 has been circulating in northern Italy since December 2019: Evidence from environmental monitoring. Sci. Total Environ. 750, 141711 (2021).
Mugnai, G. & Bilato, C. COVID-19 in Italy: Lesson from the Veneto region. Eur. J. Internal Med. 77, 161–162 (2020).
Lavezzo, E. et al. Suppression of COVID-19 outbreak in the municipality of Vo’, Italy. Nature 584, 425–429 (2020).
ISTAT. Demographic indicators. http://dati.istat.it/Index.aspx?DataSetCode=DCIS_INDDEMOG1&Lang=en.
Lim, S., Bae, J. H., Kwon, H.-S. & Nauck, M. A. Covid-19 and diabetes mellitus: From pathophysiology to clinical management. Nat. Rev. Endocrinol. 17, 11–30 (2021).
Pluchino, A. et al. A novel methodology for epidemic risk assessment of covid-19 outbreak. Sci. Rep. 11, 1–20 (2021).
Rovetta, A. & Castaldo, L. Relationships between demographic, geographic, and environmental statistics and the spread of novel coronavirus disease (covid-19) in Italy. Cureus 12, e11397 (2020).
Wu, X., Nethery, R. C., Sabath, B. M., Braun, D. & Dominici, F. Air pollution and COVID-19 mortality in the United States: Strengths and limitations of an ecological regression analysis. Sci. Adv. 6, eabd4049 (2020).
Coccia, M. Factors determining the diffusion of COVID-19 and suggested strategy to prevent future accelerated viral infectivity similar to COVID. Sci. Total Environ. 729, 138474 (2020).
Binkin, N., Salmaso, S., Michieletto, F. & Russo, F. Protecting our health care workers while protecting our communities during the COVID-19 pandemic: A comparison of approaches and early outcomes in two Italian regions, 2020 (2020). Preprint at https://www.medrxiv.org/content/10.1101/2020.04.10.20060707v2.
Frumento, P. & Sylos Labini, M. Mortalità da coronavirus: quanto vale l’effetto Lombardia. LaVoce.info https://www.lavoce.info/archives/65752/mortalitadacoronavirusquantovaleleffettolombardia (2020).
Cortés, M. E. Enfermedad por coronavirus 2019 (covid-19): Importancia de la comunicación científica y de la enseñanza actualizada de las zoonosis. Revista peruana de investigación en salud 4, 87–88 (2020).
James, L. P., Salomon, J. A., Buckee, C. O. & Menzies, N. A. The use and misuse of mathematical modeling for infectious disease policymaking: Lessons for the covid-19 pandemic. Med. Decis. Making 41, 379–385 (2021).
Ramsay, J. O. & Silverman, B. W. Functional data analysis, 2nd edn (Springer, 2005).
Kokoszka, P. & Reimherr, M. Introduction to Functional Data Analysis (CRC Press, 2017).
Ramsay, J. O. & Silverman, B. W. Applied Functional Data Analysis: Methods and Case Studies (Springer, 2007).
Ullah, S. & Finch, C. F. Applications of functional data analysis: A systematic review. BMC Med. Res. Methodol. 13, 43 (2013).
Cremona, M. A. et al. Functional data analysis for computational biology. Bioinformatics 35, 3211–3213 (2019).
Carroll, C. et al. Time dynamics of COVID-19. Sci. Rep. 10, 21040 (2020).
Ciminelli, G. & Garcia-Mandicó, S. Covid-19 in Italy: An analysis of death registry data. VOXEU, Centre for Economic Policy Research, London https://voxeu.org/article/covid19italyanalysisdeathregistrydata (2020).
Modi, C., Böhm, V., Ferraro, S., Stein, G. & Seljak, U. Estimating covid-19 mortality in Italy early in the covid-19 pandemic. Nat. Commun. 12, 1–9 (2021).
Cremona, M. A. & Chiaromonte, F. Probabilistic K-mean with local alignment for clustering and motif discovery in functional data (2020). Preprint at arXiv:1808.04773.
Di Iorio, J. & Vantini, S. funBI: A biclustering algorithm for functional data. MOX-Report 46/2019 (2019).
Cremona, M. A. et al. IWTomics: Testing high-resolution sequence-based “Omics” data at multiple locations and scales. Bioinformatics 34, 2289–2291 (2018).
Ra, S. H. et al. Upper respiratory viral load in asymptomatic individuals and mildly symptomatic patients with sars-cov-2 infection. Thorax 76, 61–63 (2021).
Cegolon, L. et al. Hypothesis to explain the severe form of COVID-19 in northern Italy. BMJ Glob. Health 5, e002564 (2020).
Sun, Y. & Genton, M. G. Functional boxplots. J. Comput. Graph. Stat. 20, 316–334 (2011).
López-Pintado, S. & Romo, J. On the concept of depth for functional data. J. Am. Stat. Assoc. 104, 718–734 (2009).
Horváth, L. & Kokoszka, P. Inference for functional data with applications, vol. 200 (Springer, 2012).
Cheng, Y. & Church, G. M. Biclustering of expression data. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, La Jolla, CA, pp. 93–103 (2000).
Di Iorio, J., Chiaromonte, F. & Cremona, M. A. On the bias of H-scores for comparing biclusters, and how to correct it. Bioinformatics 36, 2955–2957 (2020).
Dowd, J. B. et al. Demographic science aids in understanding the spread and fatality rates of COVID-19. Proc. Natl. Acad. Sci. 117, 9696–9698 (2020).
Nepomuceno, M. R. et al. Besides population age structure, health and other demographic factors can contribute to understanding the COVID-19 burden. Proc. Natl. Acad. Sci. 117, 13881–13883 (2020).
Boschi, T., Reimherr, M. & Chiaromonte, F. An efficient semi-smooth newton augmented lagrangian method for elastic net (2020). Preprint at arXiv:2006.03970.
Boschi, T., Chiaromonte, F., Secchi, P. & Li, B. Covariance based low-dimensional registration for function-on-function regression. MOX-Report (2018).
Cintia, P. et al. The relationship between human mobility and viral transmissibility during the COVID-19 epidemics in Italy (2020). Preprint at arXiv:2006.03141.
Martellucci, C. A. et al. Changes in the spatial distribution of covid-19 incidence in Italy using gis-based maps. Ann. Clin. Microbiol. Antimicrob. 19, 1–4 (2020).
DPC. COVID-19 dati regioni. https://github.com/pcmdpc/COVID19/tree/master/datiregioni.
ISTAT. Atlante statistico territoriale delle infrastrutture. http://asti.istat.it/asti.
ISTAT. Decessi e cause di morte: cosa produce l’istat. https://www.istat.it/it/files/2020/03/Datasetdecessicomunaligiornalierietracciatorecord30giugno.zip.
ISTAT. Popolazione residente al 1\(^{\circ }\) gennaio. http://dati.istat.it/Index.aspx.
Google. Community mobility reports. https://www.google.com/covid19/mobility/.
Barone, N. & Bartoloni, M. La giravolta comunicativa sul coronavirus, meno tamponi e contare solo i casi gravi. Il sole 24 ore https://www.ilsole24ore.com/art/lagiravoltacomunicativacoronavirusmenotamponiecontaresolocasigraviACQYXQMB (2020).
Craven, P. & Wahba, G. Smoothing noisy data with spline functions. Numer. Math. 31, 377–403 (1978).
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2021). Software version 4.1.0.
Ramsay, J. O., Wickham, H., Graves, S. & Hooker, G. fda: Functional Data Analysis (2011). R package version 2.26.
ISTAT. Aspetti della vita quotidiana. http://dati.istat.it/Index.aspx?QueryId=15448.
Ministry of Health. Assistenza primaria. http://www.salute.gov.it/imgs/C_17_pubblicazioni_1203_ulterioriallegati_ulterioreallegato_8_alleg.pdf.
Ministry of Health. http://www.dati.salute.gov.it/dati/dettaglioDataset.jsp?menu=dati&idPag=96.
Ministry of Health. http://www.salute.gov.it/imgs/C_17_bancheDati_6_0_1_file.xls.
Ministry of Health. http://www.salute.gov.it/imgs/C_17_bancheDati_6_0_0_file.xls.
Ministry of Education. https://dati.istruzione.it/opendata/opendata/catalogo/elements1/leaf/?area=Studenti&datasetId=DS0030ALUCORSOINDCLASTA,DS0030ALUCORSOINDCLAPAR, DS1114INFANZIACLASTA,DS1115INFANZIACLAPAR.
Stekhoven, D. J. & Bühlmann, P. MissForest – non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2012).
Stekhoven, D. J. missForest (2012). R package version 1.4.
ISTAT. Atlante statistico dei comuni. http://asc.istat.it/ASC/.
ASR Lombardia. Numero di famiglie, convivenze e numero medio di componenti per famiglia. https://www.asrlombardia.it/asrlomb/it/13740numerodifamiglieconvivenzeenumeromediodicomponentifamigliaregionale.
ISTAT. Ambiente urbano. https://www.istat.it/it/archivio/236912.
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, pp. 534–541 (Springer, 2009).
Allison, P. D. Multiple Regression: A Primer 140–145 (Pine Forge Press, 1999).
Cremona, M. A. IWTomics (2018). R package version 1.16.0. https://bioconductor.org/packages/release/bioc/html/IWTomics.html.
Goldsmith, J. et al. refund: Regression with functional data (2016). R package version 0.1.16.
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67, 301–320 (2005).
Acknowledgements
M.A. Cremona acknowledges support from the NSERC. F. Chiaromonte and T. Boschi acknowledge support from the Huck Institutes of the Life Sciences (Penn State University). F. Chiaromonte, J. Di Iorio and L. Testa acknowledge support from the Sant’Anna School of Advanced Studies. We are grateful to Paola Cesari, Christian Esposito, Giovanni Felici, Daniele Licari, Andrea Mina and Flavia Petruso for useful feedback.
Contributions
All authors conceived ideas and analysis approaches. T.B., J.Di I., L.T. and M.A.C. retrieved and processed data from multiple public sources, implemented pipelines and performed statistical analyses. All authors interpreted findings and participated in the writing of the manuscript. M.A.C. and F.C. supervised the research.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Boschi, T., Di Iorio, J., Testa, L. et al. Functional data analysis characterizes the shapes of the first COVID-19 epidemic wave in Italy. Sci Rep 11, 17054 (2021). https://doi.org/10.1038/s4159802195866y