Batch-effect correction with sample remeasurement in highly confounded case-control studies

Ye, Hanxuan; Zhang, Xianyang; Wang, Chen; Goode, Ellen L.; Chen, Jun

doi:10.1038/s43588-023-00500-8

Article
Published: 23 August 2023

Nature Computational Science volume 3, pages 709–719 (2023)

Abstract

Batch effects are pervasive in biomedical studies. One approach to addressing batch effects is to remeasure a subset of samples in each batch. These remeasured samples are used to estimate and correct the batch effects. However, rigorous statistical methods for batch-effect correction with remeasured samples are severely underdeveloped. Here we developed a framework for batch-effect correction using remeasured samples in highly confounded case-control studies. We provided theoretical analyses of the proposed procedure, evaluated its power characteristics and provided a power calculation tool to aid in study design. We found that the number of samples that need to be remeasured depends strongly on the between-batch correlation. When the correlation is high, remeasuring a small subset of samples can rescue most of the power.
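To make the role of the remeasured samples concrete, the following is a minimal simulation sketch and not the authors' procedure. It assumes a simple illustrative model: controls are measured in batch 1, cases in batch 2, and the first `n_r` controls are measured again in batch 2, with unit variances, a biological effect `a0`, a batch location shift `a1` and a between-batch correlation `rho`. The estimator names are hypothetical; `estimate_ignore` and `estimate_batch2` are only loosely modelled on the "Ignore" and "Batch2" labels in the figure captions, and `estimate_paired_shift` is a plain difference estimator chosen for illustration.

```python
import random
import statistics


def simulate(n1=50, n2=50, n_r=10, a0=0.5, a1=1.0, rho=0.9, seed=0):
    """Simulate one confounded two-batch study (assumed model, unit variances).

    Returns (controls_b1, cases_b2, remeasured_b2). The remeasured controls
    are the first n_r entries of controls_b1, measured again in batch 2.
    """
    rng = random.Random(seed)
    controls_b1 = [rng.gauss(0.0, 1.0) for _ in range(n1)]
    cases_b2 = [a0 + a1 + rng.gauss(0.0, 1.0) for _ in range(n2)]
    # The second measurement has unit variance and correlation rho with the first.
    noise_sd = (1.0 - rho ** 2) ** 0.5
    remeasured_b2 = [a1 + rho * y + noise_sd * rng.gauss(0.0, 1.0)
                     for y in controls_b1[:n_r]]
    return controls_b1, cases_b2, remeasured_b2


def estimate_ignore(controls_b1, cases_b2, remeasured_b2):
    """Compare cases with controls across batches; biased by the shift a1."""
    return statistics.mean(cases_b2) - statistics.mean(controls_b1)


def estimate_batch2(controls_b1, cases_b2, remeasured_b2):
    """Use batch-2 data only: cases versus remeasured controls."""
    return statistics.mean(cases_b2) - statistics.mean(remeasured_b2)


def estimate_paired_shift(controls_b1, cases_b2, remeasured_b2):
    """Estimate the batch shift from paired remeasurements, then subtract it."""
    # zip stops at the shorter list, so only the remeasured controls are paired.
    shift = statistics.mean(
        [y2 - y1 for y1, y2 in zip(controls_b1, remeasured_b2)]
    )
    return statistics.mean(cases_b2) - statistics.mean(controls_b1) - shift


def mse(estimator, reps=2000, a0=0.5, **kwargs):
    """Monte Carlo mean squared error of an estimator of a0."""
    errors = []
    for seed in range(reps):
        data = simulate(a0=a0, seed=seed, **kwargs)
        errors.append((estimator(*data) - a0) ** 2)
    return statistics.mean(errors)


if __name__ == "__main__":
    for name, est in [("ignore", estimate_ignore),
                      ("batch2", estimate_batch2),
                      ("paired", estimate_paired_shift)]:
        print(name, round(mse(est, rho=0.9), 4))
```

Under these assumptions the paired shift estimate has variance 2(1 − ρ)/n_r, so a high between-batch correlation lets a small number of remeasured controls pin down the batch shift. At low correlation this naive estimator can be less precise than the batch-2-only comparison, which is one reason a more careful procedure is needed.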

**Fig. 1: Illustration of the study design.**

**Fig. 2: The MSE of a₀ estimate for different procedures when n₁ = n₂ = 50.**

**Fig. 3: Evaluation of empirical type I error and power for different procedures in testing the biological effect a₀ = 0 when n₁ = n₂ = 50.**

**Fig. 4: Proportion of control samples that need to be remeasured to achieve 80%, 90% and 95% relative power versus between-batch correlation ρ when n₁ = n₂ = 50, 100, 200.**

**Fig. 5: Comparison of ReMeasure, Batch2, Ignore and LS on the real dataset.**

**Fig. 6: Comparison with ComBat, SVA and RUV on the real dataset.**

Data availability

Source data for Figs. 2–6 are available with this paper. They can also be found at https://github.com/yehanxuan/BatchReMeasure-manuscript-sourcecode.

Code availability

All the codes to reproduce the results in this paper are available at https://github.com/yehanxuan/BatchReMeasure-manuscript-sourcecode. The developed R package BatchReMeasure is available at https://github.com/yehanxuan/BatchReMeasure. The specific version used to produce the results in this paper is also available on Code Ocean³⁵.

References

Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).
Goh, W. W. B., Wang, W. & Wong, L. Why batch effects matter in omics data, and how to avoid them. Trends Biotechnol. 35, 498–507 (2017).
Scherer, A. Batch Effects and Noise in Microarray Experiments: Sources and Solutions (John Wiley & Sons, 2009).
Tom, J. A. et al. Identifying and mitigating batch effects in whole genome sequencing data. BMC Bioinform. 18, 351 (2017).
Price, E. M. & Robinson, W. P. Adjusting for batch effects in DNA methylation microarray data, a lesson learned. Front. Genet. 9, 83 (2018).
Sims, A. H. et al. The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets—improving meta-analysis and prediction of prognosis. BMC Med. Genomics 1, 42 (2008).
Li, C. & Wong, W. H. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl Acad. Sci. USA 98, 31–36 (2001).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Zhang, Y., Parmigiani, G. & Johnson, W. E. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2, lqaa078 (2020).
Shabalin, A. A., Tjelmeland, H., Fan, C., Perou, C. M. & Nobel, A. B. Merging two gene-expression studies via cross-platform normalization. Bioinformatics 24, 1154–1160 (2008).
Benito, M. et al. Adjustment of systematic microarray data biases. Bioinformatics 20, 105–114 (2004).
Alter, O., Brown, P. O. & Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA 97, 10101–10106 (2000).
Jolliffe, I. T. Principal Component Analysis (Springer, 2013).
Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161 (2007).
Gagnon-Bartsch, J. A. & Speed, T. P. Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539–552 (2012).
Gagnon-Bartsch, J. A., Jacob, L. & Speed, T. P. Removing Unwanted Variation From High Dimensional Data With Negative Controls Technical Report (Department of Statistics, Univ. California Berkeley, 2013).
Jacob, L., Gagnon-Bartsch, J. A. & Speed, T. P. Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Biostatistics 17, 16–28 (2016).
Sun, Y., Zhang, N. R. & Owen, A. B. Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Stat. 6, 1664–1688 (2012).
Wang, J., Zhao, Q., Hastie, T. & Owen, A. B. Confounder adjustment in multiple hypothesis testing. Ann. Stat. 45, 1863–1894 (2017).
Tasaki, S. et al. Multi-omics monitoring of drug response in rheumatoid arthritis in pursuit of molecular remission. Nat. Commun. 9, 2755 (2018).
Xia, Q., Thompson, J. A. & Koestler, D. C. Batch effect reduction of microarray data with dependent samples using an empirical Bayes approach (BRIDGE). Stat. Appl. Genet. Mol. Biol. 20, 101–119 (2021).
Zhou, L., Sue, A. C.-H. & Goh, W. W. B. Examining the practical limits of batch effect-correction algorithms: when should you care about batch effects? J. Genet. Genomics 46, 433–443 (2019).
Olson, J. E. et al. Characteristics and utilisation of the Mayo Clinic Biobank, a clinic-based prospective collection in the USA: cohort profile. BMJ Open 9, e032707 (2019).
Rundle, A. G., Vineis, P. & Ahsan, H. Design options for molecular epidemiology research within cohort studies. Cancer Epidemiol. Biomarkers Prev. 14, 1899–1907 (2005).
Cohen, J. Statistical Power Analysis for the Behavioral Sciences (Routledge, 2013).
Wang, C. et al. Expression signature distinguishing two tumour transcriptome classes associated with progression-free survival among rare histological types of epithelial ovarian cancer. Br. J. Cancer 114, 1412–1420 (2016).
Konecny, G. E. et al. Prognostic and therapeutic relevance of molecular subtypes in high-grade serous ovarian cancer. J. Natl Cancer Inst. 106, dju249 (2014).
Fridley, B. L. et al. Transcriptomic characterization of endometrioid, clear cell, and high-grade serous epithelial ovarian carcinoma. Cancer Epidemiol. Biomarkers Prev. 27, 1101–1109 (2018).
Chen, G. M. et al. Consensus on molecular subtypes of high-grade serous ovarian carcinoma. Clin. Cancer Res. 24, 5037–5047 (2018).
Leek, J. T. & Storey, J. D. A general framework for multiple testing dependence. Proc. Natl Acad. Sci. USA 105, 18718–18723 (2008).
Buja, A. & Eyuboglu, N. Remarks on parallel analysis. Multivariate Behav. Res. 27, 509–540 (1992).
Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145 (2015).
Chen, J. et al. An omnibus test for differential distribution analysis of microbiome sequencing data. Bioinformatics 34, 643–651 (2018).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
Ye, H., Zhang, X. & Chen, J. BatchReMeasure: batch effects correction with sample remeasurement. Code Ocean https://doi.org/10.24433/CO.4806327.v1 (2023).

Acknowledgements

This work was supported by the National Institutes of Health grants R21HG011662 and R0GM144351 (J.C. and X.Z.), National Science Foundation grant DMS2113359 (X.Z.), and National Science Foundation grant DMS2113360 and the Mayo Clinic Center for Individualized Medicine (J.C.).

Author information

Authors and Affiliations

Department of Statistics, Texas A&M University, College Station, TX, USA
Hanxuan Ye & Xianyang Zhang
Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
Chen Wang, Ellen L. Goode & Jun Chen

Contributions

J.C. and X.Z. conceived, designed and supervised the work together. H.Y. implemented the method and performed the evaluation. J.C., X.Z. and H.Y. wrote the paper. C.W. and E.L.G. prepared the real data and aided in the interpretation. All authors approved the paper.

Corresponding authors

Correspondence to Xianyang Zhang or Jun Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Rhonda Bacher, W. Evan Johnson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Table 1, Figs. 1–9, proof of theorems and additional simulations.

Reporting summary

Peer Review File

Source data

Source Data Fig. 2

Columns: ‘TrueEffect’: biological effect a₀; ‘LocationEffect’: batch location effect a₁; ‘ScaleEffect’: batch scale effect sigma1; ‘SNR’: between-batch correlation; ‘ReMeasureNo’: the number of remeasured samples. ‘Methods’: compared methods; ‘Value’: the mean MSE; ‘ymax/ymin’: mean value ± s.e.m.

Source Data Fig. 3

Columns: ‘TrueEffect’: biological effect a₀; ‘LocationEffect’: batch location effect a₁; ‘ScaleEffect’: batch scale effect sigma1; ‘SNR’: between-batch correlation; ‘ReMeasureNo’: the number of remeasured samples. ‘Methods’: compared methods; ‘Value’: the average power (type I error for a₀ = 0); ‘ymax/ymin’: mean value ± s.e.m.; ‘SD’: s.e.m.

Source Data Fig. 4

The proportion of remeasured samples. Columns: ‘samples’: the total sample size. ‘Ratio’: the ratio of desired power to the maximal power. ‘ReMeasureNo’: the ratio of the remeasured sample size to the number of control samples.

Source Data Fig. 5

Contains the numbers of discoveries (sheet 1) and rankings of P values for comparison of different types between two platforms (sheets 2–5). Specifically, column ‘Method’ in each sheet indicates the methods compared in the paper, column ‘ReMeasureNo’ represents the number of remeasured samples, and column ‘Value’ refers to the number of detected signature genes for sheet 1 and to the ranking of P values for the other sheets. Sheet 1 corresponds to Fig. 5a, that is, C1+C2+C4+C5 versus C1+C2+C4+C5. Sheets 2–5 correspond to Fig. 5b. Sheet 2: C1+C2+C5 versus C4. Sheet 3: C1+C4+C5 versus C2. Sheet 4: C1+C2+C4 versus C5. Sheet 5: C2+C4+C5 versus C1.

Source Data Fig. 6

Contains the numbers of discoveries for all methods (sheets 1–5) and P values for plotting the histogram (sheets 6 and 7). Sheet 1 corresponds to Fig. 6a, and sheets 2–5 correspond to ‘C1+C2+C5 versus C4’, ‘C1+C4+C5 versus C2’, ‘C1+C2+C4 versus C5’ and ‘C2+C4+C5 versus C1’ in Fig. 6b. Column ‘ReMeasureNo’: the number of remeasured samples. Column ‘Value’: the number of identified signature genes. Sheets 6 and 7 correspond to Fig. 6c,d, respectively. Column ‘Gene’: the index of genes. Column ‘Pvalue’: the estimated P values across genes. Column ‘Methods’: all methods that were compared.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ye, H., Zhang, X., Wang, C. et al. Batch-effect correction with sample remeasurement in highly confounded case-control studies. Nat Comput Sci 3, 709–719 (2023). https://doi.org/10.1038/s43588-023-00500-8

Received: 06 September 2022
Accepted: 11 July 2023
Published: 23 August 2023
Issue Date: August 2023
DOI: https://doi.org/10.1038/s43588-023-00500-8

This article is cited by

Leveraging remeasured samples in biomedical studies
- Luer Zhong
- Rhonda Bacher
Nature Computational Science (2023)