Data mining

  • Article
    | Open Access

    Retinoblastoma is the most frequent intraocular paediatric malignancy, yet its molecular basis remains poorly understood. Here, the authors perform multi-omic analysis and identify two subtypes: one in a cone-differentiated state and one, more aggressive, showing cone dedifferentiation and expressing neuronal markers.

    • Jing Liu, Daniela Ottaviani & François Radvanyi
  • Article
    | Open Access

    The high dimensional and complex nature of mass spectrometry imaging (MSI) data poses challenges to downstream analyses. Here the authors show an application of artificial intelligence in mining MSI data revealing biologically relevant metabolomic and proteomic information from data acquired on different mass spectrometry platforms.

    • Walid M. Abdelmoula, Begona Gimenez-Cassina Lopez & Nathalie Y. R. Agar
  • Article
    | Open Access

    Comparing changes in behaviour across various species is not always trivial, especially across significantly divergent species. Here, the authors develop a deep learning framework that allows them to map changes in locomotion demonstrated in dopamine-deficient humans, mice and worms.

    • Takuya Maekawa, Daiki Higashide & Susumu Takahashi
  • Article
    | Open Access

    Disambiguating abbreviations is important for automated clinical note processing; however, deploying machine learning for this task is restricted by lack of good training data. Here, the authors show novel data augmentation methods that use biomedical ontologies to improve abbreviation disambiguation in many datasets.

    • Marta Skreta, Aryan Arbabi & Michael Brudno
  • Article
    | Open Access

    The global pattern of the mammalian methylome is formed by changes in methylation and demethylation. Here the authors describe a metric, methylation concurrence, that measures the ratio of unmethylated CpGs inside partially methylated reads, and show that methylation concurrence is associated with epigenetically regulated tumour suppressor genes.

    • Jiejun Shi, Jianfeng Xu & Wei Li
  • Article
    | Open Access

    The link between gRNA sequence and Cas9 activity is well established but the mechanism underlying this relationship is not well understood. Here the authors show that gRNA sequence primarily influences activity by dictating the time it takes for Cas9 to find the target site in a species-specific manner.

    • E. A. Moreb & M. D. Lynch
  • Article
    | Open Access

    The blood transcriptome of human subjects can be profiled on an almost routine basis in translational research settings. Here the authors show that a fixed and well-characterized repertoire of transcriptional modules can be employed as a reusable framework for the analysis, visualization and interpretation of such data.

    • Matthew C. Altman, Darawan Rinchai & Damien Chaussabel
  • Article
    | Open Access

    Recent advances in super-resolution microscopy have made it possible to measure chromatin 3D structure and transcription in thousands of single cells. Here, the authors present a deep learning-based approach to characterise how chromatin structure relates to the transcriptional state of individual cells and to determine which structural features of chromatin regulation are important for gene expression state.

    • Aparna R. Rajpurkar, Leslie J. Mateo & Alistair N. Boettiger
  • Article
    | Open Access

    Although autophagy has been linked to tumourigenesis, it is unclear how genomic alterations affect autophagy selectivity in tumours. Here, the authors establish a pipeline that integrates computational and experimental approaches to show that altered autophagy selectivity is frequent in cancer cells and link glycogen autophagy with tumourigenesis.

    • Zhu Han, Weizhi Zhang & Da Jia
  • Article
    | Open Access

    Current genome mining methods predict many putative non-ribosomal peptides (NRPs) from their corresponding biosynthetic gene clusters, but it remains unclear which of those exist in nature and how to identify their post-assembly modifications. Here, the authors develop NRPminer, a modification-tolerant tool for the discovery of NRPs from large genomic and mass spectrometry datasets, and use it to find 180 NRPs from different environments.

    • Bahar Behsaz, Edna Bode & Hosein Mohimani
  • Article
    | Open Access

    The study of human disease remains challenging due to convoluted disease etiologies and complex molecular mechanisms at the genetic, genomic, and proteomic levels. Here, the authors propose a computationally efficient Permutation-based Feature Importance Test to assist the interpretation and selection of individual features in complex machine learning models for complex disease analysis.

    • Xinlei Mi, Baiming Zou & Jianhua Hu
  • Article
    | Open Access

    Patient-derived xenografts are widely used for drug development, but the impact of murine viral infection remains underexplored. Here, the authors demonstrate the widespread presence of murine viral sequences in patient-derived xenografts and significant expression changes in crucial genes in samples with a high viral load.

    • Zihao Yuan, Xuejun Fan & W. Jim Zheng
  • Article
    | Open Access

    Most diseases disrupt multiple proteins, and drugs treat such diseases by restoring the functions of the disrupted proteins; how drugs restore these functions, however, is often unknown. Here, the authors develop the multiscale interactome, a powerful approach to explain disease treatment.

    • Camilo Ruiz, Marinka Zitnik & Jure Leskovec
  • Article
    | Open Access

    Kinases drive fundamental changes in cell state, but predicting kinase activity based on substrate-level changes can be challenging. Here the authors introduce a computational framework that utilizes similarities between substrates to robustly infer kinase activity.

    • Serhan Yılmaz, Marzieh Ayati & Mehmet Koyutürk
  • Article
    | Open Access

    Bulk approaches fail to capture the cell-to-cell heterogeneity of chromatin landscapes, while single-cell approaches provide low-coverage datasets. Here, the authors present ChromSCape, a user-friendly interactive application that processes single-cell epigenomic data to assist the biological interpretation of chromatin landscapes within cell populations, as demonstrated in the context of cancer.

    • Pacôme Prompsy, Pia Kirchmeier & Céline Vallot
  • Article
    | Open Access

    The histone variant mutation H3.3-G34W occurs in the majority of giant cell tumors of bone (GCTB). By profiling patient-derived GCTB tumor cells, the authors show that this mutation is associated with epigenetic alterations in heterochromatic and bivalent regions that contribute to impaired osteogenic differentiation and the osteolytic phenotype of GCTB.

    • Pavlo Lutsik, Annika Baude & Christoph Plass
  • Article
    | Open Access

    Comparative analysis of animal behaviour using locomotion data such as GPS data is challenging because the sheer volume of data makes it difficult to contrast group differences. Here the authors apply deep learning to detect and highlight trajectories characteristic of a group across scales of millimetres to hundreds of kilometres.

    • Takuya Maekawa, Kazuya Ohara & Ken Yoda
  • Article
    | Open Access

    How to design experiments that accelerate knowledge discovery on complex biological landscapes remains a tantalizing question. Here, the authors present OPEX, an optimal experimental design method to identify informative omics experiments for both experimental space exploration and model training.

    • Xiaokang Wang, Navneet Rai & Ilias Tagkopoulos
  • Perspective
    | Open Access

    In this Perspective, the authors review the different applications for mobile phone data to support COVID-19 pandemic response, the relevance of these applications for infectious disease transmission and control, and potential sources and implications of selection bias in mobile phone data.

    • Kyra H. Grantz, Hannah R. Meredith & Amy Wesolowski
  • Article
    | Open Access

    How cell clusters are defined in single-cell sequencing data has important consequences for downstream analyses and the interpretation of results, but is often not straightforward. Here, the authors present a new approach that enables the prediction of differentially expressed genes without relying on explicit clustering of cells.

    • Alexis Vandenbon & Diego Diez
  • Article
    | Open Access

    Identification of genes that determine and regulate cell identity remains challenging. Here, the authors use machine learning to identify cell identity genes and master regulator transcription factors based on gene expression profiles and histone modifications.

    • Bo Xia, Dongyu Zhao & Kaifu Chen
  • Article
    | Open Access

    Sample contamination has been reported in high-throughput RNA sequencing. Here the authors analyze RNA sequencing data from the Genotype-Tissue Expression project and describe how highly expressed, tissue-specific genes contaminate other samples, a finding corroborated in other data sets.

    • Tim O. Nieuwenhuis, Stephanie Y. Yang & Marc K. Halushka
  • Article
    | Open Access

    Whether immune system aging differs between men and women is largely unknown. Here the authors characterize gene expression, chromatin state and immune subset composition in the blood of healthy humans 22 to 93 years of age, uncovering shared as well as sex-unique alterations, and create a web resource to interactively explore the data.

    • Eladio J. Márquez, Cheng-han Chung & Duygu Ucar
  • Article
    | Open Access

    Data-independent acquisition (DIA) is an emerging technology in proteomics but it typically relies on spectral libraries built by data-dependent acquisition (DDA). Here, the authors use deep learning to generate in silico spectral libraries directly from protein sequences that enable more comprehensive DIA experiments than DDA-based libraries.

    • Yi Yang, Xiaohui Liu & Liang Qiao
  • Article
    | Open Access

    Identification of clinically relevant gene expression signatures for cancer stratification remains challenging. Here, the authors introduce a flexible nonlinear signal superposition model that enables dissection of large gene expression data sets into signatures and extraction of gene interactions.

    • Michael Grau, Georg Lenz & Peter Lenz
  • Article
    | Open Access

    Metabolic syndrome is characterized by complex phenotypes that increase the risk of cardiovascular disease and type 2 diabetes. Here the authors’ integrative network analysis suggests that the BTK inhibitor ibrutinib is a promising treatment because it lowers obesity-associated inflammation.

    • Karla Misselbeck, Silvia Parolo & Corrado Priami
  • Article
    | Open Access

    Deep learning approaches for image preprocessing and analysis offer important advantages, but these are rarely incorporated into user-friendly software. Here the authors present an easy-to-use visual programming toolbox integrating deep learning and interactive data visualization for image analysis.

    • Primož Godec, Matjaž Pančur & Blaž Zupan
  • Article
    | Open Access

    Most databases of genotype-phenotype associations are manually curated. Here, Kuleshov et al. describe a machine curation system that extracts such relationships from the GWAS literature and synthesizes them into a structured knowledge base called GWASkb that can complement manually curated databases.

    • Volodymyr Kuleshov, Jialin Ding & Michael Snyder
  • Article
    | Open Access

    The oomycete Bremia lactucae is a highly variable pathogen that causes lettuce downy mildew. Here, the authors generate a high-quality genome assembly for B. lactucae, detect a high prevalence of heterokaryosis, and investigate its pathogenic consequences.

    • Kyle Fletcher, Juliana Gil & Richard Michelmore
  • Article
    | Open Access

    A challenge with single-cell resolution methods is that cell heterogeneity should be captured while still allowing for comparisons between populations. Here the authors fuse information from the dispersion profiles with the average profiles at the level of the profiles’ similarity matrices for single-cell imaging data.

    • Mohammad H. Rohban, Hamdah S. Abbasi & Anne E. Carpenter
  • Article
    | Open Access

    With the increasing availability of multi-OMICs data comes the need for easy-to-use data analysis tools. Here, the authors introduce Metascape, a biologist-oriented portal that provides a gene list annotation, enrichment and interactome resource and enables integrated analysis of multi-OMICs datasets.

    • Yingyao Zhou, Bin Zhou & Sumit K. Chanda
  • Article
    | Open Access

    miRNAs have emerged as regulators of diverse biological processes including cancer. Here the authors present an extended pan-cancer analysis of the miRNAs in 15 epithelial cancers. Integrating methylation, transcriptomic and mutation data, they reveal alternative mechanisms of tumour suppressor regulation in the absence of mutation, methylation or copy number alterations.

    • Andrew Dhawan, Jacob G. Scott & Francesca M. Buffa
  • Article
    | Open Access

    Health care in the United States is heterogeneous with respect to factors like disease incidence, treatment choices and health care spending. Here, the authors use insurance claims data from over 150 million patients to compare prescription rates of over 600 drugs, and uncover patterns of geographical variation that suggest an influence of race, health care laws and wealth.

    • Rachel D. Melamed & Andrey Rzhetsky
  • Article
    | Open Access

    New natural products can be identified via mass spectrometry by excluding all known ones from the analysis, a process called dereplication. Here, the authors extend a previously published dereplication algorithm to different classes of secondary metabolites.

    • Hosein Mohimani, Alexey Gurevich & Pavel A. Pevzner
  • Article
    | Open Access

    How reproducible evolutionary processes are remains an important question in evolutionary biology. Here, the authors compile a compendium of more than 15,000 mutation events for Escherichia coli under 178 distinct environmental settings, and develop an ensemble of predictors to predict evolution at a gene level.

    • Xiaokang Wang, Violeta Zorraquino & Ilias Tagkopoulos
  • Article
    | Open Access

    Technical noise in experiments is unavoidable, but it introduces inaccuracies into the biological networks we infer from the data. Here, the authors introduce a diffusion-based method for denoising undirected, weighted networks, and show that it improves the performance of downstream analyses.

    • Bo Wang, Armin Pourshafeie & Jure Leskovec
  • Article
    | Open Access

    Community detection allows one to decompose a network into its building blocks. While communities can be identified with a variety of methods, their relative importance cannot be easily derived. Here the authors introduce an algorithm to identify the modules that are most promising for further analysis.

    • Marinka Zitnik, Rok Sosič & Jure Leskovec
  • Article
    | Open Access

    While breast cancer incidence in the Asia Pacific region is rising, the molecular basis remains poorly characterized. Here the authors perform genomic screening of 187 Korean breast cancer patients and find differences in molecular subtype distribution, mutation pattern and prevalence, and gene expression signature when compared to TCGA.

    • Zhengyan Kan, Ying Ding & Yeon Hee Park
  • Article
    | Open Access

    Modules composed of groups of genes with similar expression profiles tend to be functionally related and co-regulated. Here, Saelens et al. evaluate the performance of 42 computational methods and provide practical guidelines for module detection in gene expression data.

    • Wouter Saelens, Robrecht Cannoodt & Yvan Saeys