Technical Reports

Annotation-free quantification of RNA splicing using LeafCutter

LeafCutter is a new tool that identifies variable intron splicing events from RNA-seq data for analysis of complex alternative splicing. The method does not require transcript annotation and can be used to map splicing quantitative trait loci.

Yang I. Li
David A. Knowles
Jonathan K. Pritchard
Technical Report | 11 Dec 2017
Covariate selection for association screening in multiphenotype genetic studies

Covariates for multiphenotype studies (CMS), a new approach for testing for associations in large-scale datasets, leverages genetic and environmental factors shared between correlated variables measured on the same samples. Applying CMS to real and simulated data demonstrates a large increase in power, equivalent to that gained by doubling the sample size.

Hugues Aschard
Vincent Guillemot
Noah Zaitlen
Technical Report | 16 Oct 2017
Graphtyper enables population-scale genotyping using pangenome graphs

Graphtyper is a fast and scalable method for variant genotyping that aligns short-read sequence data to a pangenome. Graphtyper accurately genotyped ∼90 million sequence variants in the whole genomes of ∼28,000 Icelanders, including variants in six HLA genes.

Hannes P Eggertsson
Hakon Jonsson
Bjarni V Halldorsson
Technical Report | 25 Sept 2017
Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data

Adam Siepel and colleagues report a new computational method, LINSIGHT, that combines evolutionary conservation and functional genomic information to predict the fitness consequences of noncoding mutations in the human genome. They use LINSIGHT to show that fitness consequences of enhancer mutations depend on tissue and cell type specificity and promoter constraints.

Yi-Fei Huang
Brad Gulko
Adam Siepel
Technical Report | 13 Mar 2017
Identification of methylation haplotype blocks aids in deconvolution of heterogeneous tissue samples and tumor tissue-of-origin mapping from plasma DNA

Kun Zhang and colleagues present a metric called methylation haplotype load (MHL) that quantifies methylation patterns within blocks of tightly linked CpG dinucleotides. They show that the MHL can distinguish samples from different human somatic tissues and that it can be used to improve detection of cancer-derived circulating DNA and identify its tissue of origin.

Shicheng Guo
Dinh Diep
Kun Zhang
Technical Report | 06 Mar 2017
Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome

Adam Phillippy, Curtis Van Tassell, Timothy Smith and colleagues present a new reference genome assembly for the domestic goat using a pipeline that improves contiguity of the assembly by more than 250-fold. The pipeline uses a combination of short- and long-read sequencing, optical mapping, and chromatin interaction mapping.

Derek M Bickhart
Benjamin D Rosen
Timothy P L Smith
Technical Report | Open Access | 06 Mar 2017
Variant-aware saturating mutagenesis using multiple Cas9 nucleases identifies regulatory elements at trait-associated loci

Stuart Orkin, Daniel Bauer and colleagues present DNA Striker, a computational tool to design variant-aware saturating-mutagenesis screens with multiple CRISPR-associated nucleases. They apply their methodology to the HBS1L-MYB intergenic region, which is associated with red-blood-cell traits, and identify putative regulatory elements that control MYB expression.

Matthew C Canver
Samuel Lessard
Stuart H Orkin
Technical Report | 20 Feb 2017
A method for identifying genetic heterogeneity within phenotypically defined disease subgroups

James Liley, John Todd and Chris Wallace present a statistical method for determining whether disease-associated variants have different effect sizes in phenotypically defined subgroups of disease cases. The test can be combined with existing methods to determine whether genetic heterogeneity is driven by population stratification or by different mechanisms of disease pathology.

James Liley
John A Todd
Chris Wallace
Technical Report | 26 Dec 2016
Robust and scalable inference of population history from hundreds of unphased whole genomes

Yun Song and colleagues present SMC++, a statistical method for population history inference capable of analyzing unphased whole genomes and sample sizes much larger than can be analyzed by current methods. The authors apply SMC++ to sequence data from human, Drosophila and finch populations.

Jonathan Terhorst
John A Kamm
Yun S Song
Technical Report | 26 Dec 2016
Scaling probabilistic models of genetic variation to millions of humans

John Storey, David Blei and colleagues present a method, TeraStructure, for estimating population structure from human genomic data sets on a scale not possible with current methods. TeraStructure is able to analyze data from the Human Genome Diversity Panel and the 1000 Genomes Project in less than three hours.

Prem Gopalan
Wei Hao
John D Storey
Technical Report | 07 Nov 2016
M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity

Gill Bejerano and colleagues present M-CAP, a classifier that estimates variant pathogenicity in clinical exome data sets. They show that M-CAP outperforms existing methods at all thresholds and correctly dismisses 60% of rare missense variants of uncertain significance at 95% sensitivity.

Karthik A Jagadeesh
Aaron M Wenger
Gill Bejerano
Technical Report | 24 Oct 2016
Reference-based phasing using the Haplotype Reference Consortium panel

Po-Ru Loh, Alkes Price and colleagues present Eagle2, a reference-based phasing algorithm that allows for highly accurate and efficient phasing of genotypes across a broad range of cohort sizes. They demonstrate an approximately 10% improvement in accuracy and 20% improvement in speed compared to a competing method, SHAPEIT2.

Po-Ru Loh
Petr Danecek
Alkes L Price
Technical Report | 03 Oct 2016
Unsupervised detection of cancer driver mutations with parsimony-guided learning

Runjun Kumar, S. Joshua Swamidass and Ron Bose present an unsupervised parsimony-guided method, ParsSNP, for prioritizing candidate cancer driver mutations. They apply ParsSNP to a gastric cancer data set and predict potential driver mutations not detected by other methods, including truncations in known tumor-suppressor genes and previously confirmed drivers.

Runjun D Kumar
S Joshua Swamidass
Ron Bose
Technical Report | 12 Sept 2016
Tensor decomposition for multiple-tissue gene expression experiments

Victoria Hore, Jonathan Marchini and colleagues present a method for multiple-tissue gene expression studies aimed at uncovering gene networks linked to genetic variation. They apply their method to RNA sequencing data from adipose, skin and lymphoblastoid cell lines and identify several biologically relevant gene networks with a genetic basis.

Victoria Hore
Ana Viñuela
Jonathan Marchini
Technical Report | 01 Aug 2016
Rapid genotype imputation from sequence without reference panels

Richard Mott, Simon Myers and colleagues present a new imputation method, STITCH, which does not require genotyping arrays or high-quality reference panels. They use STITCH to accurately impute genotypes in both outbred laboratory mice and a sample human population directly from low-coverage (<2×) sequencing data.

Robert W Davies
Jonathan Flint
Richard Mott
Technical Report | 04 Jul 2016
Fast and accurate long-range phasing in a UK Biobank cohort

Po-Ru Loh, Pier Francesco Palamara and Alkes Price develop a new long-range phasing method, Eagle, that harnesses long, shared identical-by-descent tracts and can be applied to large outbred populations. They use Eagle to phase samples from the UK Biobank and find that it is faster and has better accuracy than existing methods.

Po-Ru Loh
Pier Francesco Palamara
Alkes L Price
Technical Report | 06 Jun 2016
Haplotype estimation for biobank-scale data sets

Jonathan Marchini and colleagues develop a new method for haplotype phasing, SHAPEIT3, capable of handling large data sets from biobanks containing >100,000 genotyped samples. They find that their method is fast and accurate, with a low switch error rate, and can be scaled to data sets from increasingly larger cohorts.

Jared O'Connell
Kevin Sharp
Jonathan Marchini
Technical Report | 06 Jun 2016
A method to decipher pleiotropy by detecting underlying heterogeneity driven by hidden subgroups applied to autoimmune and neuropsychiatric diseases

Soumya Raychaudhuri, Buhm Han and colleagues present a statistical method to distinguish whether shared genetic risk variants among complex traits are driven by whole-group pleiotropy or a subset of individuals who constitute a genetically heterogeneous subgroup. They use the method to examine genetic sharing among autoimmune diseases and between major depressive disorder and schizophrenia and find that most genetic sharing cannot be explained by subgroup heterogeneity but that, in contrast, seronegative rheumatoid arthritis is a heterogeneous condition.

Buhm Han
Jennie G Pouget
Soumya Raychaudhuri
Technical Report | 16 May 2016
A multiple-phenotype imputation method for genetic studies

Andy Dahl and colleagues present a method for imputing missing phenotype data in genetic studies with multiple correlated phenotypes where samples can have any level of relatedness. They apply their method to simulated and real data sets and show that it improves the sensitivity to detect association signals.

Andrew Dahl
Valentina Iotchkova
Jonathan Marchini
Technical Report | 22 Feb 2016
A spectral approach integrating functional genomic annotations for coding and noncoding variants

Iuliana Ionita-Laza, Kenneth McCallum and colleagues developed an unsupervised statistical approach, Eigen, that integrates different functional annotations into a single measure of functional importance for coding and noncoding variants. Their meta-score can outperform the recently proposed CADD score and can be applied to fine-mapping studies.

Iuliana Ionita-Laza
Kenneth McCallum
Joseph D Buxbaum
Technical Report | 04 Jan 2016