Batch-effect correction with sample remeasurement in highly confounded case-control studies

Abstract

Batch effects are pervasive in biomedical studies. One approach to addressing batch effects is to remeasure a subset of samples in each batch. These remeasured samples are then used to estimate and correct the batch effects. However, rigorous statistical methods for batch-effect correction with remeasured samples are severely underdeveloped. Here we developed a framework for batch-effect correction using remeasured samples in highly confounded case-control studies. We provided theoretical analyses of the proposed procedure, evaluated its power characteristics and provided a power calculation tool to aid in study design. We found that the number of samples that need to be remeasured depends strongly on the between-batch correlation. When the correlation is high, remeasuring a small subset of samples can rescue most of the power.
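The dependence on between-batch correlation described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' ReMeasure procedure: it assumes a completely confounded design (all controls in batch 1, all cases in batch 2), unit-variance Gaussian noise in both batches, and a pure location batch effect. The function names, the default parameter values and the simple paired-difference estimator are all hypothetical choices made for illustration; only the symbols a0 (biological effect), a1 (batch location effect), ρ (between-batch correlation) and n1, n2 come from the text.

```python
import random
import statistics


def simulate_estimates(rho, n1=50, n2=50, m=10, a0=0.5, a1=1.0, seed=0):
    """Simulate one confounded study and return three estimates of a0.

    Assumed toy model (not taken from the paper): unit-variance Gaussian
    noise, controls only in batch 1, cases only in batch 2, and the first
    m controls remeasured in batch 2 with correlation rho.
    """
    rng = random.Random(seed)
    # Batch 1: control samples, no biological effect and no batch shift.
    controls_b1 = [rng.gauss(0.0, 1.0) for _ in range(n1)]
    # Batch 2: case samples, shifted by biological effect a0 plus batch effect a1.
    cases_b2 = [a0 + a1 + rng.gauss(0.0, 1.0) for _ in range(n2)]
    # Remeasured controls in batch 2: shifted by a1 only, correlated with batch 1.
    resid_sd = (1.0 - rho ** 2) ** 0.5
    remeasured_b2 = [
        a1 + rho * controls_b1[i] + resid_sd * rng.gauss(0.0, 1.0)
        for i in range(m)
    ]

    # Estimate 1: ignore the batch effect entirely (biased by a1).
    naive = statistics.fmean(cases_b2) - statistics.fmean(controls_b1)
    # Estimate 2: use batch-2 measurements only (cases vs remeasured controls).
    batch2_only = statistics.fmean(cases_b2) - statistics.fmean(remeasured_b2)
    # Estimate 3: estimate a1 from paired differences, then subtract it.
    a1_hat = statistics.fmean(
        [r - c for r, c in zip(remeasured_b2, controls_b1)]
    )
    paired = naive - a1_hat
    return naive, batch2_only, paired


def mse_by_estimator(rho, reps=1000, a0=0.5, **design):
    """Monte Carlo mean squared error of each estimate of a0."""
    totals = [0.0, 0.0, 0.0]
    for seed in range(reps):
        estimates = simulate_estimates(rho, a0=a0, seed=seed, **design)
        for k, est in enumerate(estimates):
            totals[k] += (est - a0) ** 2
    return [t / reps for t in totals]


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        naive, batch2_only, paired = mse_by_estimator(rho)
        print(f"rho={rho}: naive={naive:.3f} "
              f"batch2_only={batch2_only:.3f} paired={paired:.3f}")
```

In this toy setting the naive estimate stays biased by a1 regardless of ρ, while the paired-difference estimate becomes more precise as ρ approaches 1, because the paired differences then carry little noise. When ρ is near zero, the same estimate can be noisier than using batch-2 measurements alone, which is consistent with the abstract's point that the required number of remeasured samples depends on the between-batch correlation.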

Fig. 1: Illustration of the study design.
Fig. 2: The MSE of a0 estimate for different procedures when n1 = n2 = 50.
Fig. 3: Evaluation of empirical type I error and power for different procedures in testing the biological effect a0 = 0 when n1 = n2 = 50.
Fig. 4: Proportion of control samples that need to be remeasured to achieve 80%, 90% and 95% relative power versus between-batch correlation ρ when n1 = n2 = 50, 100, 200.
Fig. 5: Comparison of ReMeasure, Batch2, Ignore and LS on the real dataset.
Fig. 6: Comparison with ComBat, SVA and RUV on the real dataset.

Data availability

Source data for Figs. 2–6 are available with this paper. They can also be found at https://github.com/yehanxuan/BatchReMeasure-manuscript-sourcecode.

Code availability

All code needed to reproduce the results in this paper is available at https://github.com/yehanxuan/BatchReMeasure-manuscript-sourcecode. The developed R package BatchReMeasure is available at https://github.com/yehanxuan/BatchReMeasure. The specific version used to produce the results in this paper is also available on Code Ocean (ref. 35).

References

  1. Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).

  2. Goh, W. W. B., Wang, W. & Wong, L. Why batch effects matter in omics data, and how to avoid them. Trends Biotechnol. 35, 498–507 (2017).

  3. Scherer, A. Batch Effects and Noise in Microarray Experiments: Sources and Solutions (John Wiley & Sons, 2009).

  4. Tom, J. A. et al. Identifying and mitigating batch effects in whole genome sequencing data. BMC Bioinform. 18, 351 (2017).

  5. Price, E. M. & Robinson, W. P. Adjusting for batch effects in DNA methylation microarray data, a lesson learned. Front. Genet. 9, 83 (2018).

  6. Sims, A. H. et al. The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets—improving meta-analysis and prediction of prognosis. BMC Med. Genomics 1, 42 (2008).

  7. Li, C. & Wong, W. H. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl Acad. Sci. USA 98, 31–36 (2001).

  8. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

  9. Zhang, Y., Parmigiani, G. & Johnson, W. E. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2, lqaa078 (2020).

  10. Shabalin, A. A., Tjelmeland, H., Fan, C., Perou, C. M. & Nobel, A. B. Merging two gene-expression studies via cross-platform normalization. Bioinformatics 24, 1154–1160 (2008).

  11. Benito, M. et al. Adjustment of systematic microarray data biases. Bioinformatics 20, 105–114 (2004).

  12. Alter, O., Brown, P. O. & Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA 97, 10101–10106 (2000).

  13. Jolliffe, I. T. Principal Component Analysis (Springer, 2013).

  14. Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161 (2007).

  15. Gagnon-Bartsch, J. A. & Speed, T. P. Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539–552 (2012).

  16. Gagnon-Bartsch, J. A., Jacob, L. & Speed, T. P. Removing Unwanted Variation From High Dimensional Data With Negative Controls Technical Report (Department of Statistics, Univ. California Berkeley, 2013).

  17. Jacob, L., Gagnon-Bartsch, J. A. & Speed, T. P. Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Biostatistics 17, 16–28 (2016).

  18. Sun, Y., Zhang, N. R. & Owen, A. B. Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Stat. 6, 1664–1688 (2012).

  19. Wang, J., Zhao, Q., Hastie, T. & Owen, A. B. Confounder adjustment in multiple hypothesis testing. Ann. Stat. 45, 1863–1894 (2017).

  20. Tasaki, S. et al. Multi-omics monitoring of drug response in rheumatoid arthritis in pursuit of molecular remission. Nat. Commun. 9, 2755 (2018).

  21. Xia, Q., Thompson, J. A. & Koestler, D. C. Batch effect reduction of microarray data with dependent samples using an empirical Bayes approach (BRIDGE). Stat. Appl. Genet. Mol. Biol. 20, 101–119 (2021).

  22. Zhou, L., Sue, A. C.-H. & Goh, W. W. B. Examining the practical limits of batch effect-correction algorithms: when should you care about batch effects? J. Genet. Genomics 46, 433–443 (2019).

  23. Olson, J. E. et al. Characteristics and utilisation of the Mayo Clinic Biobank, a clinic-based prospective collection in the USA: cohort profile. BMJ Open 9, e032707 (2019).

  24. Rundle, A. G., Vineis, P. & Ahsan, H. Design options for molecular epidemiology research within cohort studies. Cancer Epidemiol. Biomarkers Prev. 14, 1899–1907 (2005).

  25. Cohen, J. Statistical Power Analysis for the Behavioral Sciences (Routledge, 2013).

  26. Wang, C. et al. Expression signature distinguishing two tumour transcriptome classes associated with progression-free survival among rare histological types of epithelial ovarian cancer. Br. J. Cancer 114, 1412–1420 (2016).

  27. Konecny, G. E. et al. Prognostic and therapeutic relevance of molecular subtypes in high-grade serous ovarian cancer. J. Natl Cancer Inst. 106, dju249 (2014).

  28. Fridley, B. L. et al. Transcriptomic characterization of endometrioid, clear cell, and high-grade serous epithelial ovarian carcinoma. Cancer Epidemiol. Biomarkers Prev. 27, 1101–1109 (2018).

  29. Chen, G. M. et al. Consensus on molecular subtypes of high-grade serous ovarian carcinoma. Clin. Cancer Res. 24, 5037–5047 (2018).

  30. Leek, J. T. & Storey, J. D. A general framework for multiple testing dependence. Proc. Natl Acad. Sci. USA 105, 18718–18723 (2008).

  31. Buja, A. & Eyuboglu, N. Remarks on parallel analysis. Multivariate Behav. Res. 27, 509–540 (1992).

  32. Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145 (2015).

  33. Chen, J. et al. An omnibus test for differential distribution analysis of microbiome sequencing data. Bioinformatics 34, 643–651 (2018).

  34. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).

  35. Ye, H., Zhang, X. & Chen, J. BatchReMeasure: batch effects correction with sample remeasurement. Code Ocean https://doi.org/10.24433/CO.4806327.v1 (2023).

Acknowledgements

This work was supported by the National Institutes of Health R21HG011662 and R0GM144351 (J.C. and X.Z.), National Science Foundation DMS2113359 (X.Z.) and National Science Foundation DMS2113360 and Mayo Clinic Center for Individualized Medicine (J.C.).

Author information

Contributions

J.C. and X.Z. conceived, designed and supervised the work together. H.Y. implemented the method and performed the evaluation. J.C., X.Z. and H.Y. wrote the paper. C.W. and E.L.G. prepared the real data and aided in the interpretation. All authors approved the paper.

Corresponding authors

Correspondence to Xianyang Zhang or Jun Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Rhonda Bacher, W. Evan Johnson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Table 1, Figs. 1–9, proof of theorems and additional simulations.

Reporting summary

Peer Review File

Source data

Source Data Fig. 2

Columns: ‘TrueEffect’: biological effect a0; ‘LocationEffect’: batch location effect a1; ‘ScaleEffect’: batch scale effect sigma1; ‘SNR’: between-batch correlation; ‘ReMeasureNo’: the number of remeasured samples. ‘Methods’: compared methods; ‘Value’: the mean MSE; ‘ymax/ymin’: mean value ± s.e.m.

Source Data Fig. 3

Columns: ‘TrueEffect’: biological effect a0; ‘LocationEffect’: batch location effect a1; ‘ScaleEffect’: batch scale effect sigma1; ‘SNR’: between-batch correlation; ‘ReMeasureNo’: the number of remeasured samples. ‘Methods’: compared methods; ‘Value’: the average power (type I error for a0 = 0); ‘ymax/ymin’: mean value ± s.e.m.; ‘SD’: s.e.m.

Source Data Fig. 4

The proportion of remeasured samples. Columns: ‘samples’: the total sample size. ‘Ratio’: the ratio of desired power to the maximal power. ‘ReMeasureNo’: the ratio of the remeasured sample size to the number of control samples.

Source Data Fig. 5

Contains the numbers of discoveries (sheet 1) and rankings of P values for comparison of different types between two platforms (sheets 2–5). Specifically, column ‘Method’ in each sheet indicates the methods we compared in the paper, column ‘ReMeasureNo’ represents the number of remeasured samples, and column ‘Value’ refers to the number of detected signature genes for sheet 1 and to the ranking of P values for the other sheets. Sheet 1 corresponds to Fig. 5a, that is, C1+C2+C4+C5 versus C1+C2+C4+C5. Sheets 2–5 correspond to Fig. 5b. Sheet 2: C1+C2+C5 versus C4. Sheet 3: C1+C4+C5 versus C2. Sheet 4: C1+C2+C4 versus C5. Sheet 5: C2+C4+C5 versus C1.

Source Data Fig. 6

Contains the numbers of discoveries for all methods (sheets 1–5) and P values for plotting the histogram (sheets 6 and 7). Sheet 1 corresponds to Fig. 6a, and sheets 2–5 correspond to ‘C1+C2+C5 versus C4’, ‘C1+C4+C5 versus C2’, ‘C1+C2+C4 versus C5’ and ‘C2+C4+C5 versus C1’ in Fig. 6b. Column ‘ReMeasureNo’: the number of remeasured samples. Column ‘Value’: the number of identified signature genes. Sheets 6 and 7 correspond to Fig. 6c,d, respectively. Column ‘Gene’: the index of genes. Column ‘Pvalue’: the estimated P values across genes. Column ‘Methods’: all methods that were compared.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ye, H., Zhang, X., Wang, C. et al. Batch-effect correction with sample remeasurement in highly confounded case-control studies. Nat Comput Sci 3, 709–719 (2023). https://doi.org/10.1038/s43588-023-00500-8

  • DOI: https://doi.org/10.1038/s43588-023-00500-8
