Optimal-design domain-adaptation for exposure prediction in two-stage epidemiological studies



In the first stage of a two-stage study, the researcher uses a statistical model to impute the unobserved exposures. In the second stage, the imputed exposures serve as covariates in epidemiological models. Imputation errors in the first stage act as measurement errors in the second stage, and thus bias the exposure effect estimates.
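The bias mechanism can be illustrated with a small simulation. The sketch below is not taken from the article: it assumes, for illustration only, that the imputation error behaves like classical additive noise that is independent of the true exposure, which real imputation errors need not do. The function name `simulate_attenuation` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_attenuation(n=20000, beta_x=2.0, error_sd=1.0, seed=0):
    """Compare second-stage slopes fitted on true vs. imputed exposures.

    Assumption (illustrative only): the imputation error is classical,
    i.e. additive noise independent of the true exposure.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n)                 # true exposure, unobserved in practice
    y = beta_x * x + rng.normal(0.0, 1.0, n)    # health outcome
    x_hat = x + rng.normal(0.0, error_sd, n)    # imputed exposure with error

    # np.polyfit returns coefficients from highest degree down: [slope, intercept]
    slope_true = np.polyfit(x, y, 1)[0]
    slope_imputed = np.polyfit(x_hat, y, 1)[0]
    return slope_true, slope_imputed

slope_true, slope_imputed = simulate_attenuation()
print(f"slope on true exposure:    {slope_true:.3f}")
print(f"slope on imputed exposure: {slope_imputed:.3f}")
```

Under these assumptions the slope fitted on the imputed exposure shrinks toward zero by roughly the factor var(x) / (var(x) + var(error)), which is 0.5 for the default values.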


This study aims to improve the estimation of exposure effects by sharing information between the first and second stages.


At the heart of our estimator is the observation that not all second-stage observations are equally important to impute. We thus borrow ideas from optimal-experimental-design theory to identify the individuals of higher importance. We then improve the imputation for these individuals using ideas from the machine-learning literature on domain adaptation.
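One way to picture this combination is sketched below. It is not the authors' estimator, whose details are not given in this abstract. As stand-ins, the sketch uses the leverage of each second-stage observation as its importance score and a kernel density ratio to turn those scores into weights for the first-stage fit. All names (`leverage`, `importance_weights`, `weighted_linear_fit`), the one-dimensional predictor `z`, the bandwidth, and the simulated data are hypothetical.

```python
import numpy as np

def leverage(x_hat):
    """Diagonal of the hat matrix for the second-stage design [1, x_hat]."""
    X = np.column_stack([np.ones_like(x_hat), x_hat])
    Q, _ = np.linalg.qr(X)            # reduced QR: hat matrix equals Q @ Q.T
    return np.sum(Q**2, axis=1)

def gaussian_kernel(a, b, bandwidth):
    """Pairwise Gaussian kernel between two 1-D arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def importance_weights(z_mon, z_cohort, importance, bandwidth=0.5):
    """Weight each monitor by (importance-weighted cohort density) / (monitor density)."""
    target = gaussian_kernel(z_mon, z_cohort, bandwidth) @ importance
    source = gaussian_kernel(z_mon, z_mon, bandwidth).sum(axis=1)
    w = target / source
    return w / w.mean()               # normalise to mean 1

def weighted_linear_fit(z, x, w):
    """Weighted least squares of x on [1, z]."""
    Z = np.column_stack([np.ones_like(z), z])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Z * sw[:, None], x * sw, rcond=None)
    return coef

# Simulated data (assumption): monitors and cohort differ in their predictor distribution.
rng = np.random.default_rng(1)
z_mon = rng.normal(0.0, 1.0, 200)                        # predictor at monitoring sites
x_mon = 1.0 + 0.8 * z_mon + rng.normal(0.0, 0.3, 200)    # measured exposure at monitors
z_cohort = rng.normal(1.0, 1.0, 500)                     # predictor at cohort locations

# Step 1: unweighted first-stage fit and initial imputation for the cohort.
coef0 = weighted_linear_fit(z_mon, x_mon, np.ones_like(z_mon))
x_hat0 = coef0[0] + coef0[1] * z_cohort

# Step 2: score cohort members by their leverage in the second-stage design.
h = leverage(x_hat0)

# Step 3: refit the first stage with importance weights and re-impute.
w = importance_weights(z_mon, z_cohort, h)
coef1 = weighted_linear_fit(z_mon, x_mon, w)
x_hat1 = coef1[0] + coef1[1] * z_cohort
```

Steps 2 and 3 could be repeated, since the leverage scores depend on the current imputations. Whether and how the published estimator iterates in this way is not stated here beyond the figure captions.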


Our simulations confirm that our exposure effect estimates are more accurate than those obtained with the current best practice. An empirical demonstration yields smaller estimates of the PM effect on hyperglycemia risk, with tighter confidence bands.


Sharing information between the environmental scientist and the epidemiologist improves health effect estimates. Our estimator is a principled approach for harnessing this information exchange, and may be applied to any two-stage study.

Fig. 1: Simulation results.
Fig. 2: \(\hat \beta _{\hat x} - \beta _x\) for the Naïve (green) and ODIWI with L = 10 iterations (red) estimators.
Fig. 3: \(\hat \beta _{\hat x}\) estimates over the number of iterations.
Fig. 4: Densities of \(\hat \beta _{\hat x} - \beta _x\) over 200 simulation repetitions of ODIWI (red lines) and Naïve (green lines) estimators.
Fig. 5

Data availability

The data are not available for replication because of privacy issues.



The authors wish to thank Dr. Raanan Raz and Dr. Lena Novack for their comments and ideas.


The results reported herein correspond to specific aims of grant no. 900/16 to JDR from the Israel Science Foundation.

Author information

Authors and Affiliations



RS conceived the presented idea. RS developed the theory with support from JDR and performed the computations. JDR and IK verified the analytical methods and supervised the findings of this work. RS wrote the manuscript with support from JDR and IK. All authors discussed the results and contributed to the final manuscript.

Corresponding author

Correspondence to Ron Sarafian.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sarafian, R., Kloog, I. & Rosenblatt, J.D. Optimal-design domain-adaptation for exposure prediction in two-stage epidemiological studies. J Expo Sci Environ Epidemiol (2022).


  • Environmental epidemiology
  • Two-stage studies
  • Optimal-design
  • Domain-adaptation

