Epigenome reference data are continually being enriched—researchers should explore them, even if raw data access still presents some hurdles.
It's mostly a success story. The International Human Epigenome Consortium (IHEC), started in 2010, recently completed its phase I, which coordinated the generation of epigenomic reference data: genome-wide profiles of DNA methylation, histone modifications and DNA accessibility, together with transcriptome data, from primary cells and tissues.
IHEC is a truly global undertaking that integrates well-known national endeavors such as the Canadian CEEHRC, European Blueprint, US ENCODE and NIH Roadmap, German DEEP, Japanese CREST, Korean KNIH, Singapore's GIS and China's EpiHK. Data sharing is one of its stated goals, and such data will be invaluable for understanding normal and disease development. But to have the most impact, easier access to the underlying raw data is needed.
IHEC is successfully sharing its aggregated data—i.e., data that do not reidentify participants and do not contain any sequence information or variant calls. All summary data can be freely browsed via the IHEC data portal and also viewed in several browsers, including the WashU EpiGenome Browser.
As IHEC embarks on phase II with an added focus on reanalyzing and applying the data to medical questions, standardized data annotation and raw data sharing were identified as areas in need of improvement.
Harmonized metadata, such as a uniform nomenclature for cell lines, as well as details about protocols and quality-control procedures, are essential for data reuse and integrated analysis. Encouragingly, all IHEC members agreed to revised metadata specifications and will update data sets accordingly by April 1, 2018. This information will be freely accessible in the metadata repository EpiRR.
Access to the raw data proves more complicated. To protect the privacy of the donors, the underlying raw sequence data can only be obtained via controlled access. Since IHEC is a distributed consortium, the data are protected by the participants' national laws. A user needs to apply to the individual institutions' regional Data Access Committee (DAC) and fill out Data Access Agreements (DAAs).
To probe the extent to which these DAAs differ, a group of researchers, including Stephan Beck from University College London, who cochairs IHEC Integrative Analysis, and Yann Joly at McGill University, who chairs the IHEC Bioethics working group, compared them. They found that some jurisdictions had legal clauses that were not possible for other countries to meet, making raw data access all but impossible.
These hurdles are of course not unique to IHEC, and have been encountered by other large consortia producing genomic data. Paul Flicek at the EBI, a member of several such consortia, including IHEC, likens the period when controlled-access genomic data were first becoming widely available to an arms race of DAAs that peaked in 2010. As an illustration he cites the DAA of the Wellcome Trust Case Control Consortium, established in 2005, which was relatively simple compared to that of the International Cancer Genome Consortium (ICGC), which started in 2008. ICGC initially discussed, but did not implement, a commercial-style material transfer agreement (MTA).
Joly describes the use of MTAs for data access as a mistake. He expresses the reasonable view that these two types of documents have different goals: MTAs regulate ownership and intellectual property, whereas DAAs should just make sure that data are shared and the identity of the participants is protected.
Joly and Beck were behind an effort to develop a harmonized DAA for all IHEC members that would be much easier to use, but they could not get agreement. While disappointing, this was arguably not a surprise, since it would have involved an eight-way negotiation of all the legal teams, an effort some DAC members saw as impractical. Despite the setback, Joly is currently pursuing other ways to simplify DAAs.
An alternative approach for integrated analysis, under development in IHEC's phase II, is the use of harmonized analysis pipelines, to be shared through software containers. Martin Hirst, at the University of British Columbia and chair of IHEC's scientific steering committee, describes this effort as a response to data access challenges.
It is noteworthy that other large consortia have successfully made use of similar cloud computing technology. For example, TCGA data are available through the Cancer Genomics Cloud, which provides analysis tools and also lets researchers access the data in two tiers, depending on their authorization level.
The protection of private data is of great importance and will require a concerted effort by data producers, users, IT specialists, legal experts and, importantly, funders, so that a safe but efficient solution can be reached.
In the meantime, users should not give up on valuable data because of red tape. If they run into problems, we encourage them to contact IHEC's steering committee, to document the urgency and specifics of the issue.