Abstract
Batch effects are pervasive in biomedical studies. One approach to addressing batch effects is to remeasure a subset of samples in each batch. These remeasured samples are then used to estimate and correct the batch effects. However, rigorous statistical methods for batch-effect correction with remeasured samples are severely underdeveloped. Here we developed a framework for batch-effect correction using remeasured samples in highly confounded case-control studies. We provided theoretical analyses of the proposed procedure, evaluated its power characteristics and provided a power calculation tool to aid in study design. We found that the number of samples that need to be remeasured depends strongly on the between-batch correlation. When the correlation is high, remeasuring a small subset of samples can rescue most of the power.
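As a rough illustration of why the between-batch correlation matters, consider a deliberately simplified setting that is not the estimator developed in the paper: each remeasured sample yields one measurement per batch, the two measurements have standard deviations sigma1 and sigma2 and correlation rho, and the batch location shift is estimated by the mean paired difference. The variance of a single paired difference is then sigma1^2 + sigma2^2 - 2*rho*sigma1*sigma2, so a higher correlation means fewer remeasured samples are needed for the same precision. The function names and default values below are hypothetical and chosen only for this sketch.

```python
import math


def paired_shift_variance(n_remeasured, rho, sigma1=1.0, sigma2=1.0):
    """Variance of the mean paired difference over n remeasured samples.

    Simplified illustrative model (an assumption, not the paper's method):
    Var(y2 - y1) = sigma1^2 + sigma2^2 - 2 * rho * sigma1 * sigma2 per pair.
    """
    if n_remeasured <= 0:
        raise ValueError("n_remeasured must be positive")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    per_pair = sigma1 ** 2 + sigma2 ** 2 - 2.0 * rho * sigma1 * sigma2
    return per_pair / n_remeasured


def pairs_needed(target_variance, rho, sigma1=1.0, sigma2=1.0):
    """Smallest number of remeasured samples reaching the target variance."""
    if target_variance <= 0:
        raise ValueError("target_variance must be positive")
    # Variance with a single pair equals the per-pair variance.
    per_pair = paired_shift_variance(1, rho, sigma1, sigma2)
    return max(1, math.ceil(per_pair / target_variance))


if __name__ == "__main__":
    # Compare how many remeasured samples each correlation level requires.
    for rho in (0.0, 0.5, 0.9):
        print(f"rho={rho}: pairs needed = {pairs_needed(0.05, rho)}")
```

Under these assumptions the required number of remeasured samples shrinks in proportion to the per-pair variance, which mirrors the qualitative finding above; the paper's own power calculation tool should be used for actual study design.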
Data availability
Source data for Figs. 2–6 are available with this paper. They can also be found at https://github.com/yehanxuan/BatchReMeasure-manuscript-sourcecode.
Code availability
All code needed to reproduce the results in this paper is available at https://github.com/yehanxuan/BatchReMeasure-manuscript-sourcecode. The developed R package BatchReMeasure is available at https://github.com/yehanxuan/BatchReMeasure. The specific version used to produce the results in this paper is also available on Code Ocean (ref. 35).
References
Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).
Goh, W. W. B., Wang, W. & Wong, L. Why batch effects matter in omics data, and how to avoid them. Trends Biotechnol. 35, 498–507 (2017).
Scherer, A. Batch Effects and Noise in Microarray Experiments: Sources and Solutions (John Wiley & Sons, 2009).
Tom, J. A. et al. Identifying and mitigating batch effects in whole genome sequencing data. BMC Bioinform. 18, 351 (2017).
Price, E. M. & Robinson, W. P. Adjusting for batch effects in DNA methylation microarray data, a lesson learned. Front. Genet. 9, 83 (2018).
Sims, A. H. et al. The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets—improving meta-analysis and prediction of prognosis. BMC Med. Genomics 1, 42 (2008).
Li, C. & Wong, W. H. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl Acad. Sci. USA 98, 31–36 (2001).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Zhang, Y., Parmigiani, G. & Johnson, W. E. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2, lqaa078 (2020).
Shabalin, A. A., Tjelmeland, H., Fan, C., Perou, C. M. & Nobel, A. B. Merging two gene-expression studies via cross-platform normalization. Bioinformatics 24, 1154–1160 (2008).
Benito, M. et al. Adjustment of systematic microarray data biases. Bioinformatics 20, 105–114 (2004).
Alter, O., Brown, P. O. & Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA 97, 10101–10106 (2000).
Jolliffe, I. T. Principal Component Analysis (Springer, 2013).
Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161 (2007).
Gagnon-Bartsch, J. A. & Speed, T. P. Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539–552 (2012).
Gagnon-Bartsch, J. A., Jacob, L. & Speed, T. P. Removing Unwanted Variation From High Dimensional Data With Negative Controls Technical Report (Department of Statistics, Univ. California Berkeley, 2013).
Jacob, L., Gagnon-Bartsch, J. A. & Speed, T. P. Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Biostatistics 17, 16–28 (2016).
Sun, Y., Zhang, N. R. & Owen, A. B. Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Stat. 6, 1664–1688 (2012).
Wang, J., Zhao, Q., Hastie, T. & Owen, A. B. Confounder adjustment in multiple hypothesis testing. Ann. Stat. 45, 1863–1894 (2017).
Tasaki, S. et al. Multi-omics monitoring of drug response in rheumatoid arthritis in pursuit of molecular remission. Nat. Commun. 9, 2755 (2018).
Xia, Q., Thompson, J. A. & Koestler, D. C. Batch effect reduction of microarray data with dependent samples using an empirical Bayes approach (BRIDGE). Stat. Appl. Genet. Mol. Biol. 20, 101–119 (2021).
Zhou, L., Sue, A. C.-H. & Goh, W. W. B. Examining the practical limits of batch effect-correction algorithms: when should you care about batch effects? J. Genet. Genomics 46, 433–443 (2019).
Olson, J. E. et al. Characteristics and utilisation of the Mayo Clinic Biobank, a clinic-based prospective collection in the USA: cohort profile. BMJ Open 9, e032707 (2019).
Rundle, A. G., Vineis, P. & Ahsan, H. Design options for molecular epidemiology research within cohort studies. Cancer Epidemiol. Biomarkers Prev. 14, 1899–1907 (2005).
Cohen, J. Statistical Power Analysis for the Behavioral Sciences (Routledge, 2013).
Wang, C. et al. Expression signature distinguishing two tumour transcriptome classes associated with progression-free survival among rare histological types of epithelial ovarian cancer. Br. J. Cancer 114, 1412–1420 (2016).
Konecny, G. E. et al. Prognostic and therapeutic relevance of molecular subtypes in high-grade serous ovarian cancer. J. Natl Cancer Inst. 106, dju249 (2014).
Fridley, B. L. et al. Transcriptomic characterization of endometrioid, clear cell, and high-grade serous epithelial ovarian carcinoma. Cancer Epidemiol. Biomarkers Prev. 27, 1101–1109 (2018).
Chen, G. M. et al. Consensus on molecular subtypes of high-grade serous ovarian carcinoma. Clin. Cancer Res. 24, 5037–5047 (2018).
Leek, J. T. & Storey, J. D. A general framework for multiple testing dependence. Proc. Natl Acad. Sci. USA 105, 18718–18723 (2008).
Buja, A. & Eyuboglu, N. Remarks on parallel analysis. Multivariate Behav. Res. 27, 509–540 (1992).
Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145 (2015).
Chen, J. et al. An omnibus test for differential distribution analysis of microbiome sequencing data. Bioinformatics 34, 643–651 (2018).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
Ye, H., Zhang, X. & Chen, J. BatchReMeasure: batch effects correction with sample remeasurement. Code Ocean https://doi.org/10.24433/CO.4806327.v1 (2023).
Acknowledgements
This work was supported by the National Institutes of Health R21HG011662 and R0GM144351 (J.C. and X.Z.), National Science Foundation DMS2113359 (X.Z.) and National Science Foundation DMS2113360 and Mayo Clinic Center for Individualized Medicine (J.C.).
Author information
Contributions
J.C. and X.Z. conceived, designed and supervised the work together. H.Y. implemented the method and performed the evaluation. J.C., X.Z. and H.Y. wrote the paper. C.W. and E.L.G. prepared the real data and aided in the interpretation. All authors approved the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Rhonda Bacher, W. Evan Johnson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.
Supplementary information
Supplementary Information
Supplementary Table 1, Figs. 1–9, proof of theorems and additional simulations.
Source data
Source Data Fig. 2
Columns: ‘TrueEffect’: biological effect a0; ‘LocationEffect’: batch location effect a1; ‘ScaleEffect’: batch scale effect sigma1; ‘SNR’: between-batch correlation; ‘ReMeasureNo’: the number of remeasured samples. ‘Methods’: compared methods; ‘Value’: the mean MSE; ‘ymax/ymin’: mean value ± s.e.m.
Source Data Fig. 3
Columns: ‘TrueEffect’: biological effect a0; ‘LocationEffect’: batch location effect a1; ‘ScaleEffect’: batch scale effect sigma1; ‘SNR’: between-batch correlation; ‘ReMeasureNo’: the number of remeasured samples. ‘Methods’: compared methods; ‘Value’: the average power (type I error for a0 = 0); ‘ymax/ymin’: mean value ± s.e.m.; ‘SD’: s.e.m.
Source Data Fig. 4
The proportion of remeasured samples. Columns: ‘samples’: the total sample size. ‘Ratio’: the ratio of desired power to the maximal power. ‘ReMeasureNo’: the ratio of the remeasured sample size to the number of control samples.
Source Data Fig. 5
Contains the numbers of discoveries (sheet 1) and rankings of P values for comparison of different types between two platforms (sheets 2–5). Specifically, column ‘Method’ in each sheet indicates the methods compared in the paper, column ‘ReMeasureNo’ gives the number of remeasured samples, and column ‘Value’ gives the number of detected signature genes for sheet 1 and the ranking of P values for the other sheets. Sheet 1 corresponds to Fig. 5a, that is, C1+C2+C4+C5 versus C1+C2+C4+C5. Sheets 2–5 correspond to Fig. 5b. Sheet 2: C1+C2+C5 versus C4. Sheet 3: C1+C4+C5 versus C2. Sheet 4: C1+C2+C4 versus C5. Sheet 5: C2+C4+C5 versus C1.
Source Data Fig. 6
Contains the numbers of discoveries for all methods (sheets 1–5) and P values for plotting the histogram (sheets 6 and 7). Sheet 1 corresponds to Fig. 6a, and sheets 2–5 correspond to ‘C1+C2+C5 versus C4’, ‘C1+C4+C5 versus C2’, ‘C1+C2+C4 versus C5’ and ‘C2+C4+C5 versus C1’ in Fig. 6b. Column ‘ReMeasureNo’: the number of remeasured samples. Column ‘Value’: the number of identified signature genes. Sheets 6 and 7 correspond to Fig. 6c,d, respectively. Column ‘Gene’: the index of genes. Column ‘Pvalue’: the estimated P values across genes. Column ‘Methods’: all methods that were compared.
About this article
Cite this article
Ye, H., Zhang, X., Wang, C. et al. Batch-effect correction with sample remeasurement in highly confounded case-control studies. Nat Comput Sci 3, 709–719 (2023). https://doi.org/10.1038/s43588-023-00500-8
DOI: https://doi.org/10.1038/s43588-023-00500-8