Statistical methods

  • Article
    | Open Access

    Scalable trajectory inference for multi-omic single-cell datasets is challenging because complex, non-tree topologies are hard to capture. Here, the authors present a method, VIA, that scales to millions of cells across multiple omic modalities using lazy-teleporting random walks.

    • Shobana V. Stassen, Gwinky G. K. Yip & Kevin K. Tsia
  • Article
    | Open Access

    Genetic plasticity drives phenotypic differences. Here, the authors develop a framework to quantify the individual and combinatorial contributions of SNPs to a phenotype of interest and use it to identify SNP-SNP interactions associated with variations in bacteria’s response to external changes.

    • Dengcheng Yang, Yi Jin & Rongling Wu
  • Article
    | Open Access

    Glycomics can uncover important molecular changes but measured glycans are highly interconnected and incompatible with common statistical methods, introducing pitfalls during analysis. Here, the authors develop an approach to identify glycan dependencies across samples to facilitate comparative glycomics.

    • Bokan Bao, Benjamin P. Kellman & Nathan E. Lewis
  • Article
    | Open Access

    Mass spectrometry-based metabolomics is a powerful method for profiling large clinical cohorts but batch variations can obscure biologically meaningful differences. Here, the authors develop a computational workflow that removes unwanted data variation while preserving biologically relevant information.

    • Taiyun Kim, Owen Tang & Jean Yee Hwa Yang
  • Article
    | Open Access

    Existing genetic prediction tools typically assume that genetic variants contribute equally towards the phenotype. The authors develop eight prediction tools that allow the user to specify the heritability model, and show that these tools enable substantially improved prediction of complex traits.

    • Qianqian Zhang, Florian Privé & Doug Speed
  • Article
    | Open Access

    Precision medicine needs prognostic markers to select the patients who will benefit most from targeted therapy. Here, the authors show that a high level of baseline T cell receptor diversity is an indicator of favourable prognosis in multiple cancer types, and that monoclonal expansion of T cells correlates with good response to immune checkpoint blockade therapy in metastatic melanoma patients.

    • Sara Valpione, Piyushkumar A. Mundra & Richard Marais
  • Article
    | Open Access

    Cross-linking mass spectrometry (MS) can identify protein-protein interaction (PPI) networks but assessing the reliability of these data remains challenging. To address this issue, the authors develop and validate a method to determine the false-discovery rate of PPIs identified by cross-linking MS.

    • Swantje Lenz, Ludwig R. Sinn & Juri Rappsilber
  • Article
    | Open Access

    The genome-wide investigation of chromatin organization enables insights into global gene expression control. Here, the authors present a computationally efficient method for the analysis of chromatin organization data and use it to recover principles of 3D organization across conditions.

    • Merve Sahin, Wilfred Wong & Christina S. Leslie
  • Article
    | Open Access

    Allele-specific expression in diploid organisms can be quantified by RNA-seq, and it is common practice to rely on a single library. Here, the authors show that the standard approach has a variable error rate and present Qllelic as a tool to improve the reproducibility of allele-specific RNA-seq analysis.

    • Asia Mendelevich, Svetlana Vinogradova & Alexander A. Gimelbrant
  • Article
    | Open Access

    Association analyses that capture rare and noncoding variants in whole genome sequencing data are limited by factors like statistical power. Here, the authors present KnockoffScreen, a statistical method using the knockoff framework to detect, localise and prioritise rare and common risk variants at genome-wide scale.

    • Zihuai He, Linxi Liu & Iuliana Ionita-Laza
  • Article
    | Open Access

    The vast majority of somatic mutations observed in tumors are rare. Here, the authors show that these large numbers of rare mutations are more predictive of the tissue of origin of a tumor than the information from a few common driver mutations.

    • Saptarshi Chakraborty, Axel Martin & Ronglai Shen
  • Article
    | Open Access

    Cellular genetic heterogeneity is common across biological conditions, yet application of long-read sequencing to this subject is limited by error rates. Here, the authors present iGDA, a tool for detection and phasing of minor variants from long-read sequencing data, allowing accurate reconstruction of haplotypes.

    • Zhixing Feng, Jose C. Clemente & Eric E. Schadt
  • Article
    | Open Access

    Study of human disease remains challenging due to convoluted disease etiologies and complex molecular mechanisms at genetic, genomic, and proteomic levels. Here, the authors propose a computationally efficient Permutation-based Feature Importance Test to assist interpretation and selection of individual features in complex machine learning models for complex disease analysis.

    • Xinlei Mi, Baiming Zou & Jianhua Hu
  • Article
    | Open Access

    Biomedical measurements usually generate high-dimensional data where individual samples are classified into several categories. Vogelstein et al. propose a supervised dimensionality reduction method which estimates the low-dimensional data projection for classification and prediction in big datasets.

    • Joshua T. Vogelstein, Eric W. Bridgeford & Mauro Maggioni
  • Article
    | Open Access

    Quantifying the effects of individual loci on the human phenome is a challenging task. Here, the authors introduce a modelling technique, TGCA, that assesses total genetic contribution per locus and apply this to UK Biobank phenotype domains, revealing top loci and links to tissue-specific gene expression.

    • Ting Li, Zheng Ning & Xia Shen
  • Article
    | Open Access

    Single-nucleotide variants in enhancers or promoters may affect gene transcription by altering transcription factor binding sites. Here, the authors present a meta-analysis, empowered by a new statistical method and covering thousands of ChIP-Seq experiments, that identifies more than 500,000 allele-specific binding (ASB) events in the human genome.

    • Sergey Abramov, Alexandr Boytsov & Ivan V. Kulakovskiy
  • Article
    | Open Access

    Estimates of COVID-19-related mortality are limited by incomplete testing. Here, the authors perform counterfactual analyses and estimate that there were 59,000–62,000 deaths from COVID-19 in Italy until 9th September 2020, approximately 1.5 times higher than official statistics.

    • Chirag Modi, Vanessa Böhm & Uroš Seljak
  • Article
    | Open Access

    Tissue damage and turnover lead to the release of DNA into the blood, and this cell-free DNA can be used to monitor changes in tissue state. Here, the authors develop a tool to accurately estimate the proportion of cell types contributing to cell-free DNA in the blood, with an application to pregnant women and ALS patients.

    • Christa Caggiano, Barbara Celona & Noah Zaitlen
  • Article
    | Open Access

    Functional RNA secondary structure is important for pre-mRNA processing, including splicing, cleavage and polyadenylation, and RNA editing. Here, the authors present a catalog of conserved long-range RNA structures in the human transcriptome by defining pairs of conserved complementary regions (PCCRs) in pre-aligned evolutionarily conserved regions.

    • Svetlana Kalmykova, Marina Kalinina & Dmitri Pervouchine
  • Article
    | Open Access

    Methods for profiling differences between individual cells are constantly expanding. Here, the authors present a computational framework for the analysis of chromatin accessibility data at the single-cell level that takes into account previous knowledge and data-specific characteristics.

    • Shengquan Chen, Guanao Yan & Zhixiang Lin
  • Article
    | Open Access

    Genetic correlation analyses give insight into complex disease, yet are limited by oversimplification. Here, the authors present LOGODetect, a method using summary statistics from genome-wide association studies to identify genomic regions with correlation signals across multiple phenotypes.

    • Hanmin Guo, James J. Li & Lin Hou
  • Article
    | Open Access

    Clustering cells based on similarities in gene expression is the first step towards identifying cell types in scRNA-seq data. Here, the authors incorporate biological knowledge into the clustering step to facilitate the biological interpretability of clusters and subsequent cell type identification.

    • Tian Tian, Jie Zhang & Hakon Hakonarson
  • Article
    | Open Access

    Human leukocyte antigen (HLA) genes contribute to risk of many complex traits, yet understanding inter-ethnic heterogeneity is computationally challenging. Here, the authors develop DEEP*HLA for imputation of HLA genotypes and show its ability to disentangle HLA variant risk effects in diverse populations.

    • Tatsuhiko Naito, Ken Suzuki & Yukinori Okada
  • Article
    | Open Access

    Human leukocyte antigen (HLA) genes influence many immune phenotypes; however, methods to impute HLA type have been limited in accuracy. Here, the authors present an HLA imputation method, CookHLA, which uses locally embedded prediction markers to adaptively impute HLA genes across a range of scenarios.

    • Seungho Cook, Wanson Choi & Buhm Han
  • Article
    | Open Access

    Single-cell genomics uses cells from the same individual, or pseudoreplicates, which can introduce biases and inflate type I error rates. Here, the authors apply generalized linear mixed models with a random effect for individual to properly account for both zero inflation and the correlation structure among cells within an individual.

    • Kip D. Zimmerman, Mark A. Espeland & Carl D. Langefeld
  • Article
    | Open Access

    Mendelian randomization is a popular method to detect causal relationships between traits, but can be confounded by instances of horizontal pleiotropy. Here, the authors present a Mendelian randomization workflow which includes causal discovery analysis and filtering of genetic instruments based on their conditional independencies.

    • David Amar, Nasa Sinnott-Armstrong & Manuel A. Rivas
  • Article
    | Open Access

    Pathogenicity scores are instrumental in prioritizing variants for Mendelian disease, yet their application to common disease is largely unexplored. Here, the authors assess the utility of pathogenicity scores for 41 complex traits and develop a framework to improve their informativeness for common disease.

    • Samuel S. Kim, Kushal K. Dey & Alkes L. Price
  • Article
    | Open Access

    Single-cell transcriptomics has enhanced our ability to profile heterogeneous cell populations, but it is not known which statistical frameworks perform well at detecting subpopulation-level responses. Here, the authors develop a simulation framework to evaluate various methods across a range of scenarios.

    • Helena L. Crowell, Charlotte Soneson & Mark D. Robinson
  • Article
    | Open Access

    Many published studies of the current SARS-CoV-2 pandemic have analysed data from non-representative samples of populations. Here, using UK Biobank samples, Gibran Hemani and colleagues discuss the potential for such studies to suffer from collider bias, and provide suggestions for optimising study design to account for this.

    • Gareth J. Griffith, Tim T. Morris & Gibran Hemani
  • Article
    | Open Access

    The HIV reservoir is a major hurdle to curing HIV, but the factors determining its size and dynamics remain unclear. Here, the authors show in a large cohort of 610 HIV-1-infected individuals, who have been on suppressive ART for a median of 5.4 years, that viral genetic factors contribute substantially to the HIV-1 reservoir size.

    • Chenjie Wan, Nadine Bachmann & Sabine Yerly
  • Article
    | Open Access

    Fraudulent adulteration of edible oils exploits the fact that their characteristic fatty acid profile can be mimicked with mixtures of other oil types. Here, the authors use a deep learning method to uncover fatty acid patterns discriminative for ten different plant oil types and to discern the composition of mixtures.

    • Kevin Lim, Kun Pan & Rong Hui Xiao
  • Article
    | Open Access

    Mendelian randomization is a useful tool to infer causal relationships between traits, but can be confounded by the presence of pleiotropy. Here, the authors have developed MR-link, a Mendelian randomization method which accounts for unobserved pleiotropy and linkage disequilibrium between instrumental variables.

    • Adriaan van der Graaf, Annique Claringbould & Serena Sanna
  • Article
    | Open Access

    Smoking-associated DNA methylation changes in whole blood have been reported by many EWAS. Here, the authors use a cell-type deconvolution algorithm to identify cell-type specific DNA methylation signals in seven EWAS, identifying lineage-specific smoking-associated DNA methylation changes.

    • Chenglong You, Sijie Wu & Andrew E. Teschendorff
  • Article
    | Open Access

    Linear mixed models are biased by the assumed independence between random effects. Here, the authors describe a genome-based restricted maximum likelihood method, CORE GREML, which estimates the covariance between random effects. Application to UK Biobank data highlights this covariance as an important parameter for multi-omics analyses of phenotypic variance.

    • Xuan Zhou, Hae Kyung Im & S. Hong Lee
  • Article
    | Open Access

    Variance components analysis may be used for a variety of applications including heritability estimation and association mapping. Here, the authors present a computationally efficient method, scalable to extremely large GWAS datasets, and use it for heritability analysis of 22 traits from UK Biobank.

    • Ali Pazokitoroudi, Yue Wu & Sriram Sankararaman