Quantify and control reproducibility in high-throughput experiments

Zhao, Yi; Sampson, Matthew G.; Wen, Xiaoquan

doi:10.1038/s41592-020-00978-4

Article
Published: 12 October 2020

Nature Methods volume 17, pages 1207–1213 (2020)

Abstract

Ensuring reproducibility of results in high-throughput experiments is crucial for biomedical research. Here, we propose a set of computational methods, INTRIGUE, to evaluate and control reproducibility in high-throughput settings. Our approaches are built on a new definition of reproducibility that emphasizes directional consistency when experimental units are assessed with signed effect size estimates. The proposed methods are designed to (1) assess the overall reproducible quality of multiple studies and (2) evaluate reproducibility at the individual experimental unit levels. We demonstrate the proposed methods in detecting unobserved batch effects via simulations. We further illustrate the versatility of the proposed methods in transcriptome-wide association studies: in addition to reproducible quality control, they are also suited to investigating genuine biological heterogeneity. Finally, we discuss the potential extensions of the proposed methods in other vital areas of reproducible research (for example, publication bias and conceptual replications).
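To make the idea of directional consistency concrete, the following is a minimal sketch and not the INTRIGUE model itself: it only flags experimental units whose signed effect size estimates share one sign across studies. The function name `sign_consistent` and the toy numbers are illustrative assumptions, not part of the published software.

```python
import numpy as np

def sign_consistent(effects: np.ndarray) -> np.ndarray:
    """Flag rows whose estimates are all strictly positive or all strictly negative.

    effects: array of shape (n_units, n_studies) of signed effect size estimates.
    An estimate of exactly zero counts as inconsistent in this naive check.
    """
    signs = np.sign(effects)
    all_positive = np.all(signs > 0, axis=1)
    all_negative = np.all(signs < 0, axis=1)
    return all_positive | all_negative

# Hypothetical estimates for three units measured in two studies.
effects = np.array([
    [0.8, 1.1],    # both positive
    [-0.5, -0.2],  # both negative
    [0.9, -0.7],   # opposite signs
])
print(sign_consistent(effects).tolist())  # [True, True, False]
```

A raw sign check like this ignores the uncertainty of each estimate, so it should be read only as an illustration of the definition, not as a substitute for a model-based assessment.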

**Fig. 1: Accuracy and performance of the proposed methods in simulations.**

**Fig. 2: Highly reproducible TWAS signals identified from the height GWAS data in the UK Biobank and the GIANT consortium.**

**Fig. 3: Tissue-consistent and -specific height TWAS signals identified from whole blood and skeletal muscle tissues.**

Data availability

All processed data for simulations and real data analysis are available at https://github.com/ArtemisZhao/INTRIGUE/intrigue_paper. GWAS summary statistics for the UK Biobank and the GIANT consortium are available at https://doi.org/10.5281/zenodo.3629742. eQTL data for TWAS analysis are available at https://gtexportal.org/home/datasets.

Code availability

The source code for the software implementation (in R and C/C++), simulation studies and real data processing is provided at https://github.com/ArtemisZhao/INTRIGUE. A Docker image that duplicates the complete computational environment for reproducing the reported results can be freely downloaded from https://hub.docker.com/r/xqwen/intrigue.

Acknowledgements

This work was supported by National Institutes of Health grant nos. R35GM138121, R01DK108805 and R01DK119380.

Author information

Authors and Affiliations

Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
Yi Zhao & Xiaoquan Wen
Division of Nephrology, Boston Children’s Hospital, Boston, MA, USA
Matthew G. Sampson
Department of Pediatrics, Harvard Medical School, Boston, MA, USA
Matthew G. Sampson
Broad Institute of MIT and Harvard, Cambridge, MA, USA
Matthew G. Sampson

Contributions

Y.Z., M.G.S. and X.W. conceived the ideas. Y.Z. and X.W. designed the experiments. Y.Z. and X.W. developed methods, implemented software and performed analyses. Y.Z., M.G.S. and X.W. wrote the manuscript.

Corresponding author

Correspondence to Xiaoquan Wen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Proportion estimates from batch effect affected high-throughput experiments with no genuine biological signals.

Each simulated dataset consists of 1,000 genes. No gene is differentially expressed between the case (N = 20) and the control (N = 20) samples. In each replication dataset, 500 genes are affected by unobserved batch effects of various magnitudes (η/σ). The figure shows the estimates of (π_IR, π_R) from the CEFN and the META models for all magnitudes of batch effects examined. The reproducible proportions across all datasets remain close to 0, while the estimates of the irreproducible proportions increase monotonically as the batch effects become stronger.
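The simulation setup described in this caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' simulation code: the caption does not say how the batch effect enters the data, so the sketch assumes a hidden binary batch label drawn independently for each sample that shifts the affected genes by η, and it assumes the affected genes are redrawn for each dataset. The function name `simulate_null_dataset` and its parameters are hypothetical.

```python
import numpy as np

def simulate_null_dataset(rng, n_genes=1000, n_case=20, n_ctrl=20,
                          n_batch_genes=500, eta_over_sigma=1.0, sigma=1.0):
    """Simulate one null experiment; return per-gene (effect estimate, standard error).

    ASSUMPTION: the batch is a hidden binary label per sample, independent of
    case status, adding eta = eta_over_sigma * sigma to the affected genes.
    """
    n = n_case + n_ctrl
    is_case = np.r_[np.ones(n_case, dtype=bool), np.zeros(n_ctrl, dtype=bool)]
    # No gene is differentially expressed: pure noise around zero.
    expr = rng.normal(0.0, sigma, size=(n_genes, n))
    # Hidden batch label per sample and a random subset of affected genes.
    batch = rng.integers(0, 2, size=n).astype(float)
    affected = rng.choice(n_genes, size=n_batch_genes, replace=False)
    expr[affected] += eta_over_sigma * sigma * batch
    # Case-versus-control mean difference with a two-sample standard error.
    case, ctrl = expr[:, is_case], expr[:, ~is_case]
    beta_hat = case.mean(axis=1) - ctrl.mean(axis=1)
    se = np.sqrt(case.var(axis=1, ddof=1) / n_case
                 + ctrl.var(axis=1, ddof=1) / n_ctrl)
    return beta_hat, se

rng = np.random.default_rng(0)
# Two replication datasets at the same (hypothetical) batch-effect magnitude.
beta_hat_1, se_1 = simulate_null_dataset(rng, eta_over_sigma=2.0)
beta_hat_2, se_2 = simulate_null_dataset(rng, eta_over_sigma=2.0)
```

The pairs of estimates and standard errors from the replication datasets are the kind of signed summary statistics that a reproducibility analysis would take as input; fitting the CEFN or META models is left to the authors' software.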

Extended Data Fig. 2 A directed acyclic graph representation of the proposed Bayesian hierarchical model.

The estimated effects, \(\hat{\beta}_{i,j}\), are observed; \(\bar{\beta}_{i}\) and \(\beta_{i,j}\) are latent random variables; \(\omega\) and \(k\) (or \(r\)) are hyper-parameters.
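One plausible reading of this graph, offered only as a sketch, is the usual three-layer factorization of a hierarchical model. The caption does not list the edges, so the assignment of hyper-parameters to layers below is an assumption: \(\omega\) is taken to govern the unit-level mean effect \(\bar{\beta}_{i}\), and \(k\) (or \(r\)) to govern how the study-specific effects \(\beta_{i,j}\) vary around it.

\[
p(\hat{\beta}, \beta, \bar{\beta} \mid \omega, k)
= \prod_{i} \Big[\, p(\bar{\beta}_{i} \mid \omega) \prod_{j} p(\beta_{i,j} \mid \bar{\beta}_{i}, k)\; p(\hat{\beta}_{i,j} \mid \beta_{i,j}) \Big]
\]

Here \(i\) indexes experimental units and \(j\) indexes studies. Under this reading, each observed estimate depends on the hyper-parameters only through the latent effects.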

Supplementary information

Supplementary Information

Supplementary Table 1 and Notes.

Reporting Summary

About this article

Cite this article

Zhao, Y., Sampson, M.G. & Wen, X. Quantify and control reproducibility in high-throughput experiments. Nat Methods 17, 1207–1213 (2020). https://doi.org/10.1038/s41592-020-00978-4

Received: 09 January 2020
Accepted: 14 September 2020
Published: 12 October 2020
Issue Date: December 2020
DOI: https://doi.org/10.1038/s41592-020-00978-4
