Databases articles within Nature Communications

Featured

  • Article
    | Open Access

    Extracting scientific data from published research is a complex task that requires specialised tools. Here the authors present a scheme based on large language models to automate the retrieval of information from text in a flexible and accessible manner.

    • John Dagdelen
    • , Alexander Dunn
    •  & Anubhav Jain
  • Article
    | Open Access

    In this work, the authors report the NMR lipids Databank to promote decentralised sharing of biomolecular molecular dynamics (MD) simulation data with an overlay design. Programmatic access enables analyses of rare phenomena and advances the training of machine learning models.

    • Anne M. Kiirikki
    • , Hanne S. Antila
    •  & O. H. Samuli Ollila
  • Article
    | Open Access

    Accurate benchmarking of small variant calling is critical for the continued improvement of human genome sequencing. Here, the authors show that current approaches are biased towards certain variant representations and develop a new approach to ensure consistent and accurate benchmarking, regardless of the original variant representations.

    • Tim Dunn
    •  & Satish Narayanasamy
  • Article
    | Open Access

    Over their careers, medicinal chemists develop a gut feeling for what is a promising molecule. Here, the authors use machine learning models to learn this intuition and show that it can be successfully applied in several drug discovery scenarios.

    • Oh-Hyeon Choung
    • , Riccardo Vianello
    •  & José Jiménez-Luna
  • Article
    | Open Access

    Rare Mendelian disorders pose a major diagnostic challenge, but evaluation of automated tools that aim to uncover causal genes is limited. Here, the authors present a computational pipeline that simulates realistic clinical datasets to address this deficit.

    • Emily Alsentzer
    • , Samuel G. Finlayson
    •  & Isaac S. Kohane
  • Article
    | Open Access

    During preclinical drug development, the ability of cancer cell lines to faithfully model human disease is important for identifying potential therapeutic strategies. Here, using transcriptomic datasets of over 1000 cell lines, the authors evaluate how representative each line is of its cancer type and present their cell line selection tool.

    • Han Jin
    • , Cheng Zhang
    •  & Adil Mardinoglu
  • Article
    | Open Access

    Research aimed at improving healthcare has largely focused on male animals and cells. Here, the authors use data from the International Mouse Phenotyping Consortium to show that body weight does not account for all phenotypic differences between male and female mice, supporting more female-focused research.

    • Laura A. B. Wilson
    • , Susanne R. K. Zajitschek
    •  & Shinichi Nakagawa
  • Article
    | Open Access

    Studies of cell heterogeneity in white matter in primates have been limited to date. Here the authors describe a marmoset brain cell atlas that bridges rodent and human data, revealing strong gray-white matter glial segregation.

    • Jing-Ping Lin
    • , Hannah M. Kelly
    •  & Daniel S. Reich
  • Article
    | Open Access

    There is a broad range of research available on the relationship between food security and mental health. Here the authors carry out a systematic mapping of evidence on food security and nutrition related to mental health and identify trends in themes, setting, and study design over the 20-year period studied.

    • Thalia M. Sparling
    • , Megan Deeney
    •  & Suneetha Kadiyala
  • Article
    | Open Access

    A comprehensive data portal to explore plant regulomes is still unavailable. Here, the authors develop a web-based platform, ChIP-Hub, following the ENCODE standards and demonstrate its applications in the identification of hierarchical regulatory networks, tissue-specific chromatin dynamics, putative enhancers and chromatin states.

    • Liang-Yu Fu
    • , Tao Zhu
    •  & Dijun Chen
  • Article
    | Open Access

    This paper describes the ‘4DN Data Portal’ that hosts data generated by the 4D Nucleome network, including Hi-C and other chromatin conformation capture assays, as well as various sequencing-based and imaging-based assays. Raw data have been uniformly processed to increase comparability and the portal is implemented with visualization tools to browse the data without download.

    • Sarah B. Reiff
    • , Andrew J. Schroeder
    •  & Peter J. Park
  • Comment
    | Open Access

    Ensuring international benefit-sharing from sequence data without jeopardising open sharing is a major obstacle for the Convention on Biological Diversity and other UN negotiations. Here, the authors propose a solution to address the concerns of both developing countries and life scientists.

    • Amber Hartman Scholz
    • , Jens Freitag
    •  & Jörg Overmann
  • Article
    | Open Access

    Here, we present TP-DB, a pattern-based search engine based on 1.67 million helices from the Protein Data Bank (PDB). We demonstrate the utility of TP-DB in identifying microbe-specific antigens, as well as in the design of antimicrobial peptides and protein-protein interaction blockers.

    • Cheng-Yu Tsai
    • , Emmanuel Oluwatobi Salawu
    •  & Lee-Wei Yang
  • Article
    | Open Access

    Local gene co-expression is found throughout the genome, but systematic analysis of these co-expressed genes is needed. Here, the authors identify local co-expressed genes in 49 tissues and characterize the genetic variants which may affect their expression and contribute to disease.

    • Diogo M. Ribeiro
    • , Simone Rubinacci
    •  & Olivier Delaneau
  • Article
    | Open Access

    Single-nucleotide variants in enhancers or promoters may affect gene transcription by altering transcription factor binding sites. Here the authors present a meta-analysis empowered by a new statistical method covering thousands of ChIP-Seq experiments resulting in the identification of more than 500 thousand allele-specific binding (ASB) events in the human genome.

    • Sergey Abramov
    • , Alexandr Boytsov
    •  & Ivan V. Kulakovskiy
  • Article
    | Open Access

    Sarcomas are morphologically heterogeneous tumours, which makes their classification challenging. Here the authors develop a classifier using DNA methylation data from several soft tissue and bone sarcoma subtypes, which has the potential to improve classification for research and clinical purposes.

    • Christian Koelsche
    • , Daniel Schrimpf
    •  & Andreas von Deimling
  • Perspective
    | Open Access

    The IMEx consortium provides one of the largest resources of curated, experimentally verified molecular interaction data. Here, the authors review how IMEx evolved into a fundamental resource for life scientists and describe how IMEx data can support biomedical research.

    • Pablo Porras
    • , Elisabet Barrera
    •  & Sandra Orchard
  • Article
    | Open Access

    With the generation of large pan-cancer whole-exome and whole-genome sequencing projects, a question remains about how comparable these datasets are. Here, using The Cancer Genome Atlas samples analysed as part of the Pan-Cancer Analysis of Whole Genomes project, the authors explore the concordance of mutations called by whole exome sequencing and whole genome sequencing techniques.

    • Matthew H. Bailey
    • , William U. Meyerson
    •  & Christian von Mering
  • Article
    | Open Access

    Schulz et al. systematically benchmark performance scaling with increasingly sophisticated prediction algorithms and with increasing sample size in reference machine-learning and biomedical datasets. Complicated nonlinear intervariable relationships remain largely inaccessible for predicting key phenotypes from typical brain scans.

    • Marc-Andre Schulz
    • , B. T. Thomas Yeo
    •  & Danilo Bzdok
  • Article
    | Open Access

    Reference databases are essential for studies on host-microbiota interactions. Here, the authors present the construction of VIRGO, a human vaginal non-redundant gene catalog, which represents a comprehensive resource for taxonomic and functional profiling of vaginal microbiomes from metagenomic and metatranscriptomic datasets.

    • Bing Ma
    • , Michael T. France
    •  & Jacques Ravel
  • Article
    | Open Access

    The authors previously developed the Protein Common Interface Database (ProtCID), which compares and clusters the interfaces of pairs of full-length protein chains with defined Pfam domain architectures in different PDB entries to identify biological assemblies. Here the authors extend ProtCID to the clustering of domain-domain interactions, which also allows the analysis of domain interactions with peptides, nucleic acids, and ligands.

    • Qifang Xu
    •  & Roland L. Dunbrack Jr.
  • Article
    | Open Access

    Most databases of genotype-phenotype associations are manually curated. Here, Kuleshov et al. describe a machine curation system that extracts such relationships from the GWAS literature and synthesizes them into a structured knowledge base called GWASkb that can complement manually curated databases.

    • Volodymyr Kuleshov
    • , Jialin Ding
    •  & Michael Snyder
  • Review Article
    | Open Access

    Glycomics is gaining momentum in basic, translational and clinical research. Here, the authors review current reporting standards and analysis tools for mass-spectrometry-based glycomics, and propose an e-infrastructure for standardized reporting and online deposition of glycomics data.

    • Miguel A. Rojas-Macias
    • , Julien Mariethoz
    •  & Niclas G. Karlsson
  • Perspective
    | Open Access

    Questions of causality are ubiquitous in Earth system sciences and beyond, yet correlation techniques still prevail. This Perspective provides an overview of causal inference methods, identifies promising applications and methodological challenges, and initiates a causality benchmark platform.

    • Jakob Runge
    • , Sebastian Bathiany
    •  & Jakob Zscheischler
  • Article
    | Open Access

    Short tandem repeats (STRs), similar to single nucleotide polymorphisms (SNPs), contribute to complex traits, but their ascertainment by next-generation sequencing is costly. Here, Saini et al. provide a SNP+STR haplotype reference panel that allows imputation of STRs from SNP array data.

    • Shubham Saini
    • , Ileena Mitra
    •  & Melissa Gymrek
  • Article
    | Open Access

    Proteoforms arise as protein isoforms or as protein haplotypes, which are the result of genetic variation. Here, the authors develop Haplosaurus, a database that computes protein haplotypes genome-wide from existing genotype data and analyse protein haplotype variability in the 1000 Genomes dataset.

    • William Spooner
    • , William McLaren
    •  & Catherine Chaillan Huntington
  • Article
    | Open Access

    Data sharing is recognized as a way to promote scientific collaboration and reproducibility, but some are concerned over whether research based on shared data can achieve high impact. Here, the authors show that neuroimaging papers using shared data are no less likely to appear in top-ranked journals.

    • Michael P. Milham
    • , R. Cameron Craddock
    •  & Arno Klein
  • Article
    | Open Access

    Here, Libertini and colleagues devise a computational tool that can analyze whole-genome bisulfite sequencing (WGBS) data to recover ∼30% of the lost differential methylation position information. They use COMETgazer and COMETvintage to analyze 13 different methylome datasets to demonstrate the tools' performance.

    • Emanuele Libertini
    • , Simon C. Heath
    •  & Stephan Beck