Although increasingly recognized as critical to genomic research, genomic data sharing is hindered by an absence of standards regarding timing, patient privacy, use agreements, and data characterization and quality. Only after months of identifying a dataset, obtaining permission to use it, committing to terms that restrict use and sharing, downloading it, and assessing its quality is it possible to know whether the dataset can be used. In this paper, we evaluate the barriers to data sharing based on the Treehouse experience and offer recommendations for use agreement standards, data characterization and metadata standardization to enhance data sharing and outcomes for all pediatric cancer patients.
Data Access in Genomics
In this collection, we highlight commentary and reviews from across the Nature Research journals that address some of the questions and considerations surrounding access of human genomic data. We have also selected research articles from across our journals that demonstrate the power of large-scale genomic data and data access.
We are now firmly in the genomics era. The cost of large-scale genotyping and sequencing has declined to the point that human genomic association studies with hundreds of thousands of individuals are now a reality. At the same time, the biological insight that can be derived from these data dramatically increases when data are shared, used in combination with other data sets, and analyzed with new computational methods. Data sharing also allows verification of results. For these reasons, publishers, including Nature Research and BMC, and funders increasingly mandate some level of data access. However, the unique considerations for sensitive human data mean that these data cannot be made openly available without restrictions.
Panel Discussion at ASHG 2019
We invite you to join us on Thursday, October 17, 2019 at the American Society of Human Genetics 2019 Annual Meeting for a discussion with leaders in the field about the challenges and opportunities for data access in the era of genomics and biobanks.
Comment & Review
Who benefits from sharing data? The scientists of the future do, as data sharing today enables new science tomorrow. Far from being mere rehashes of old datasets, evidence shows that studies based on analyses of previously published data can achieve just as much impact as original projects.
Open science can lead to greater collaboration, increased confidence in findings and goodwill between researchers.
Indigenous peoples are still underrepresented in genetic research. Here, the authors propose an ethical framework consisting of six major principles that encourages researchers and Indigenous communities to build strong and equal partnerships to increase trust, engagement and diversity in genomic studies.
Melinda Mills and Charles Rahal discuss genome-wide association studies published in the last 13 years, finding increases in sample sizes, rates of discovery, and traits studied over time. They discuss limitations, including sample diversity, and make recommendations for scientists and funding bodies.
A considerable proportion of the usefulness and interest of research publications in our field comes from the data and associated metadata. We therefore insist that data be available for peer reviewers to see and readers to use. Authors should use public permanent repositories designed for appropriately consented data.
Creating large genome/phenome collections can require consortium-scale resources. DNA.Land is a digital biobank that collects genetic data from individuals tested by consumer genomic companies using a fraction of the resources of traditional studies.
Anonymized data sets are growing and it is becoming easier to identify individuals. Research-consent procedures must be updated to protect people from being targeted.
All disciplines should follow the geosciences and demand best practice for publishing and sharing data, argue Shelley Stall and colleagues.
A paper that analysed genetic variants in 14,000 people to identify disease-associated regions set the standard for collaborative genome-wide association studies and provided methodological advances whose effects are still felt today.
Analysis of the UK Biobank genetic and phenotypic data demonstrates the power of including a large population and detailed phenotyping in a prospective study to identify genetic and lifestyle factors related to health and disease.
The increasing amount of public omics data is an important and valuable resource for the research community. Here, the authors develop a set of metrics to quantify the attention and impact of biomedical datasets and integrate them into the framework of Omics Discovery Index (OmicsDI).
Re-analysis of public genetic data reveals a rare X-chromosomal variant associated with type 2 diabetes
Genome-wide association studies have uncovered several loci associated with diabetes risk. Here, the authors reanalyse public type 2 diabetes GWAS data to fine map 50 known loci and identify seven new ones, including one near AGTR2 on the X chromosome that doubles the risk of diabetes in men.
Persistently low levels of estimated glomerular filtration rate (eGFR) are a biomarker of chronic kidney disease. Here, the authors reinterpret the genetic architecture of kidney function across ancestries, to identify not only genes, but the tissue and anatomical contexts of renal homeostasis.
Most databases of genotype-phenotype associations are manually curated. Here, Kuleshov et al. describe a machine curation system that extracts such relationships from the GWAS literature and synthesizes them into a structured knowledge base called GWASkb that can complement manually curated databases.
Genetic analyses of ancestrally diverse populations show evidence of heterogeneity across ancestries and provide insights into clinical implications, highlighting the importance of including ancestrally diverse populations to maximize genetic discovery and reduce health disparities.
Similarities in cancers can be studied to interrogate their etiology. Here, the authors use genome-wide association study summary statistics from six cancer types based on 296,215 cases and 301,319 controls of European ancestry, showing that solid tumours arising from different tissues share a degree of common germline genetic basis.
Oral ulcerations are sores of the mucous membrane of the mouth and are highly prevalent in the population. Here, in a genome-wide association study, the authors identify 97 loci associated with mouth ulcers, highlighting genes involved in T cell-mediated immunity and TH1 responses.
GWAS have identified more than 500 genetic loci associated with blood lipid levels. Here, the authors report a genome-wide analysis of interactions between genetic markers and physical activity, and find that physical activity modifies the effects of four genetic loci on HDL or LDL cholesterol.
Genome-wide analysis of insomnia in 1,331,010 individuals identifies new risk loci and functional pathways
Genome-wide analyses in >1 million individuals identify new loci and pathways associated with insomnia. The findings implicate key brain areas and cell types in the neurobiology of insomnia and highlight potential targets for developing new treatments.
A large meta-analysis combining genome-wide and custom high-density genotyping array data identifies 63 new susceptibility loci for prostate cancer, enhancing fine-mapping efforts and providing insights into the underlying biology.
Trans-ancestry meta-analysis of estimated glomerular filtration rate (eGFR) from 1,046,070 individuals identifies 264 associated loci, providing a resource of molecular targets for translational research of chronic kidney disease.
Genome-wide association study of knee pain identifies associations with GDF5 and COL27A1 in UK Biobank
Weihua Meng, Mark Adams et al. report a genome-wide association study of knee pain in the UK Biobank, identifying two loci near GDF5 and COL27A1 as significantly associated. These findings are supported by association data in additional cohorts, using self-reported osteoarthritis or radiographic knee osteoarthritis as a proxy for knee pain.
Mendelian randomisation analyses find pulmonary factors mediate the effect of height on coronary artery disease
Eirini Marouli et al. use Mendelian randomisation analyses to investigate the causal relationship between adult height, coronary artery disease (CAD) and type 2 diabetes (T2D) in the UK Biobank. They find that height has a causal effect on CAD, which is mediated by lung function, while there is no direct effect on the risk of T2D.
The majority of published GWAS were performed in European-ancestry populations. Here, Kuchenbaecker et al. test the extent to which lipid loci are shared and find that the major lipid loci are mostly transferable between Europeans and Asians, while there are notable exceptions for African populations.
Anonymization has been the main means of addressing privacy concerns in sharing medical and socio-demographic data. Here, the authors estimate the likelihood that a specific person can be re-identified in heavily incomplete datasets, casting doubt on the adequacy of current anonymization practices.