Statistical methods articles within Nature Communications

Featured

  • Article
    | Open Access

    Clustering cells based on similarities in gene expression is the first step towards identifying cell types in scRNASeq data. Here the authors incorporate biological knowledge into the clustering step to facilitate the biological interpretability of clusters, and subsequent cell type identification.

    • Tian Tian
    • , Jie Zhang
    •  & Hakon Hakonarson
  • Article
    | Open Access

    Human leukocyte antigen (HLA) genes contribute to risk of many complex traits, yet understanding inter-ethnic heterogeneity is computationally challenging. Here, the authors develop DEEP*HLA for imputation of HLA genotypes and show its ability to disentangle HLA variant risk effects in diverse populations.

    • Tatsuhiko Naito
    • , Ken Suzuki
    •  & Yukinori Okada
  • Article
    | Open Access

    Human leukocyte antigen (HLA) genes influence many immune phenotypes; however, methods to impute HLA type have been limited in accuracy. Here, the authors present an HLA imputation method, CookHLA, which uses locally embedded prediction markers to adaptively impute HLA genes across a range of scenarios.

    • Seungho Cook
    • , Wanson Choi
    •  & Buhm Han
  • Article
    | Open Access

    Single cell genomics uses cells from the same individual, or pseudoreplicates, that can introduce biases and inflate type I error rates. Here the authors apply generalized linear mixed models with a random effect for individual, to properly account for both zero inflation and the correlation structure among cells within an individual.

    • Kip D. Zimmerman
    • , Mark A. Espeland
    •  & Carl D. Langefeld
  • Article
    | Open Access

    Mendelian randomization is a popular method to detect causal relationships between traits, but can be confounded by instances of horizontal pleiotropy. Here, the authors present a Mendelian randomization workflow which includes causal discovery analysis and filtering of genetic instruments based on their conditional independencies.

    • David Amar
    • , Nasa Sinnott-Armstrong
    •  & Manuel A. Rivas
  • Article
    | Open Access

    Pathogenicity scores are instrumental in prioritizing variants for Mendelian disease, yet their application to common disease is largely unexplored. Here, the authors assess the utility of pathogenicity scores for 41 complex traits and develop a framework to improve their informativeness for common disease.

    • Samuel S. Kim
    • , Kushal K. Dey
    •  & Alkes L. Price
  • Article
    | Open Access

    Single-cell transcriptomics enhanced our ability to profile heterogeneous cell populations. It is not known which statistical frameworks are performant to detect subpopulation-level responses. Here, the authors developed a simulation framework to evaluate various methods across a range of scenarios.

    • Helena L. Crowell
    • , Charlotte Soneson
    •  & Mark D. Robinson
  • Article
    | Open Access

    Many published studies of the current SARS-CoV-2 pandemic have analysed data from non-representative population samples. Here, using UK Biobank samples, Gibran Hemani and colleagues discuss the potential for such studies to suffer from collider bias, and provide suggestions for optimising study design to account for this.

    • Gareth J. Griffith
    • , Tim T. Morris
    •  & Gibran Hemani
  • Article
    | Open Access

    The HIV reservoir is a major hurdle for a cure of HIV, but the factors determining its size and dynamics remain unclear. Here the authors show in a large cohort of 610 HIV-1 infected individuals, who are on suppressive ART for a median of 5.4 years, that viral genetic factors contribute substantially to the HIV-1 reservoir size.

    • Chenjie Wan
    • , Nadine Bachmann
    •  & Sabine Yerly
  • Article
    | Open Access

    Fraudulent adulteration of edible oils is based on the fact that their characteristic fatty acid profile can be mimicked with mixtures of other oil types. Here, the authors use a deep learning method to uncover fatty acid patterns discriminative for ten different plant oil types and to discern composition of mixtures.

    • Kevin Lim
    • , Kun Pan
    •  & Rong Hui Xiao
  • Article
    | Open Access

    Mendelian randomization is a useful tool to infer causal relationships between traits, but can be confounded by the presence of pleiotropy. Here, the authors have developed MR-link, a Mendelian randomization method which accounts for unobserved pleiotropy and linkage disequilibrium between instrumental variables.

    • Adriaan van der Graaf
    • , Annique Claringbould
    •  & Serena Sanna
  • Article
    | Open Access

    Smoking-associated DNA methylation changes in whole blood have been reported by many EWAS. Here, the authors use a cell-type deconvolution algorithm to identify cell-type specific DNA methylation signals in seven EWAS, identifying lineage-specific smoking-associated DNA methylation changes.

    • Chenglong You
    • , Sijie Wu
    •  & Andrew E. Teschendorff
  • Article
    | Open Access

    Linear mixed models have bias due to the assumed independence between random effects. Here, the authors describe a genome-based restricted maximum likelihood, CORE GREML, which estimates covariance between random effects. Application to UK Biobank data highlights this as an important parameter for multi-omics analyses of phenotypic variance.

    • Xuan Zhou
    • , Hae Kyung Im
    •  & S. Hong Lee
  • Article
    | Open Access

    Variance components analysis may be used for a variety of applications including heritability estimation and association mapping. Here, the authors present a computationally efficient method, scalable to extremely large GWAS datasets, and use it for heritability analysis of 22 traits from UK Biobank.

    • Ali Pazokitoroudi
    • , Yue Wu
    •  & Sriram Sankararaman
  • Article
    | Open Access

    Next-generation sequencing has provided the opportunity to look for signatures of carcinogenesis on a genome-wide scale. Here, the authors develop an algorithm, sigLASSO, that provides confidence in assigning mutational signatures when the mutation count is low and the samples used are variable.

    • Shantao Li
    • , Forrest W. Crawford
    •  & Mark B. Gerstein
  • Article
    | Open Access

    It is not clear which designs, other than completely randomized ones, are valid for scRNA-seq experiments so that batch effects can be adjusted. Here the authors show that under flexible reference panel and chain-type designs, biological variability can also be separated from batch effects, at least by BUSseq.

    • Fangda Song
    • , Ga Ming Angus Chan
    •  & Yingying Wei
  • Article
    | Open Access

    Lineage tracing studies combining CRISPR-Cas9 editing and scRNA-seq face several challenges and cannot integrate lineages from multiple individuals. Here the authors show that integration of mutation and expression leads to accurate lineage tree inference and enables the learning of a species-invariant lineage tree.

    • Hamim Zafar
    • , Chieh Lin
    •  & Ziv Bar-Joseph
  • Article
    | Open Access

    Linking epigenetic marks to clinical outcomes promises insight into the underlying processes. Here, the authors introduce a statistical approach to estimate associations between a phenotype and all epigenetic probes jointly, and to estimate the proportion of variation captured by epigenetic effects.

    • Daniel Trejo Banos
    • , Daniel L. McCartney
    •  & Matthew R. Robinson
  • Article
    | Open Access

    Mayaro virus (MAYV) is an emerging arbovirus, but cross-reactivity with other alphaviruses makes analysis of its epidemiology difficult. Here, the authors develop an analytical framework to assess MAYV epidemiology and find evidence for an important sylvatic cycle and seroprevalences of up to 18% in some areas of French Guiana.

    • Nathanaël Hozé
    • , Henrik Salje
    •  & Simon Cauchemez
  • Article
    | Open Access

    Methods to integrate association evidence across multiple traits often focus on individual common variants in GWAS. Here the authors present multi-trait analysis of rare-variant associations (MTAR), a framework for joint analysis of association summary statistics between multiple rare variants and different traits.

    • Lan Luo
    • , Judong Shen
    •  & Zheng-Zheng Tang
  • Article
    | Open Access

    Sample index hopping results in various artefacts in multiplexed scRNA-seq experiments. Here, the authors introduce a statistical model to estimate the sample index hopping rate in droplet-based scRNA-seq data and show that these artefacts can be corrected by purging phantom molecules from the data.

    • Rick Farouni
    • , Haig Djambazian
    •  & Hamed S. Najafabadi
  • Article
    | Open Access

    Differential expression (DE) and gene set enrichment (GSE) analysis tend to be carried out separately. Here, the authors present iDEA (integrative Differential expression and gene set Enrichment Analysis) for the analysis of scRNAseq data, which uses a Bayesian approach to jointly model DE and GSE for improved power in both tasks.

    • Ying Ma
    • , Shiquan Sun
    •  & Xiang Zhou
  • Article
    | Open Access

    Downstream of trajectory inference for cell lineages based on scRNA-seq data, differential expression analysis yields insight into biological processes. Here, Van den Berge et al. develop tradeSeq, a framework for the inference of within- and between-lineage differential expression, based on negative binomial generalized additive models.

    • Koen Van den Berge
    • , Hector Roux de Bézieux
    •  & Lieven Clement
  • Article
    | Open Access

    GWAS analysis currently relies mostly on linear mixed models, which do not account for linkage disequilibrium (LD) between tested variants. Here, Sesia et al. propose KnockoffZoom, a non-parametric statistical method for the simultaneous discovery and fine-mapping of causal variants, assuming only that LD is described by hidden Markov models (HMMs).

    • Matteo Sesia
    • , Eugene Katsevich
    •  & Chiara Sabatti
  • Article
    | Open Access

    In Mendelian randomization (MR) studies, one typically selects SNPs as instrumental variables that do not directly affect the outcome to avoid violation of MR assumptions. Here, Cho et al. present a framework, MR-TRYX, that leverages knowledge of such outliers of horizontal pleiotropy to identify putative causal relationships between exposure and outcome.

    • Yoonsu Cho
    • , Philip C. Haycock
    •  & Gibran Hemani
  • Article
    | Open Access

    Complex diseases often share genetic determinants and symptoms, but the mechanistic basis of disease interactions remains elusive. Here, the authors propose a network topological measure to identify proteins linking complex diseases in the interactome, and identify mediators between COPD and asthma.

    • Enrico Maiorino
    • , Seung Han Baek
    •  & Amitabh Sharma
  • Article
    | Open Access

    For single-cell RNA-seq experiments the sequencing budget is limited, and how it should be optimally allocated to maximize information is not clear. Here the authors develop a mathematical framework to show that, for estimating many gene properties, the optimal allocation is to sequence at the depth of one read per cell per gene.

    • Martin Jinye Zhang
    • , Vasilis Ntranos
    •  & David Tse
  • Article
    | Open Access

    Most currently available statistical tools for the analysis of ATAC-seq data were repurposed from tools developed for other functional genomics data (e.g. ChIP-seq). Here, Gabitto et al. develop ChromA, a Bayesian statistical approach for the analysis of both bulk and single-cell ATAC-seq data.

    • Mariano I. Gabitto
    • , Anders Rasmussen
    •  & Richard Bonneau
  • Article
    | Open Access

    Although transcription factor (TF) cooperativity is widespread, a global mechanistic understanding of its role is still lacking. Here the authors introduce a statistical learning framework that, based on next-generation sequencing data, provides structural and mechanistic insight into TF cooperativity, its functional consequences, and its impact on protein-phenotype interactions.

    • Ignacio L. Ibarra
    • , Nele M. Hollmann
    •  & Judith B. Zaugg
  • Article
    | Open Access

    Multivariable Mendelian randomization (MR) extends the standard MR framework to consider multiple risk factors in a single model. Here, Zuber et al. propose MR-BMA, a Bayesian variable selection approach to identify the likely causal determinants of a disease from many candidate risk factors, such as those in high-throughput data sets.

    • Verena Zuber
    • , Johanna Maria Colijn
    •  & Stephen Burgess
  • Article
    | Open Access

    Sequencing cancer genomes reveals low frequency novel somatic variants without known function. Here, the authors leverage statistical methodology from the fields of computational linguistics and ecology to highlight the potentially important signals harboured by these novel variants that are often dismissed.

    • Saptarshi Chakraborty
    • , Arshi Arora
    •  & Ronglai Shen
  • Article
    | Open Access

    Disease heritability and genetic correlations between traits depend on genetics, the environment and their interaction. Here, Jia et al. compute disease prevalence curves and disease embeddings from electronic health records and impute heritability for hundreds of diseases and genetic correlations for thousands of disease pairs.

    • Gengjie Jia
    • , Yu Li
    •  & Andrey Rzhetsky
  • Article
    | Open Access

    Allele-specific expression at single-cell resolution can reveal stochastic and dynamic features of gene expression in greater detail. The authors propose scBASE, a soft zero-and-one inflated model that improves estimation of cellular allelic proportions by pooling information across cells.

    • Kwangbom Choi
    • , Narayanan Raghupathy
    •  & Gary A. Churchill
  • Article
    | Open Access

    Various approaches are being used for polygenic prediction including Bayesian multiple regression methods that require access to individual-level genotype data. Here, the authors extend BayesR to utilise GWAS summary statistics (SBayesR) and show that it outperforms other summary statistic-based methods.

    • Luke R. Lloyd-Jones
    • , Jian Zeng
    •  & Peter M. Visscher
  • Article
    | Open Access

    Taxonomy classification of amplicon sequences is an important step in investigating microbial communities in microbiome analysis. Here, the authors show that incorporating environment-specific taxonomic abundance information can lead to improved species-level classification accuracy across common sample types.

    • Benjamin D. Kaehler
    • , Nicholas A. Bokulich
    •  & Gavin A. Huttley
  • Article
    | Open Access

    Programmed ribosomal frameshifting (PRF) is an alternative translation strategy that causes controlled slippage of the ribosome along the mRNA, changing the sequence of the synthesized protein. Here the authors provide a thermodynamic framework that explains how mRNA sequence determines the efficiency of frameshifting.

    • Lars V. Bock
    • , Neva Caliskan
    •  & Helmut Grubmüller
  • Article
    | Open Access

    The HiChIP/PLAC-seq assay is popular for profiling 3D genome interactions among regulatory elements at kilobase resolution. Here the authors describe FitHiChIP, an empirical null-based, flexible computational method for statistical significance estimation and loop calling from HiChIP data.

    • Sourya Bhattacharyya
    • , Vivek Chandra
    •  & Ferhat Ay