Background & Summary

The Personal Genome Project UK (PGP-UK) is a member of the global PGP network together with the PGPs in the United States, Canada, Austria and China. The PGP network aims to provide multi-omics and trait data under open access to the community. This contributes to personalised medicine by advancing our understanding of how phenotypes and the development of diseases are influenced by genetic, epigenetic, environmental and lifestyle factors. While all five PGP centres generate whole-genome sequencing (WGS), some PGPs, such as PGP-UK, produce additional multi-omics data.

To participate in this study, volunteers must pass the eligibility criteria (e.g. be a UK citizen or permanent resident), sign the open consent form and pass a very thorough entrance exam. The objective of the exam is to ensure that the participant understands the key PGP-UK procedures and the potential risks of being involved in a project of this nature. At present, 1100 subjects have successfully enrolled in the project, and over a hundred of them have had their genomes sequenced. Once enrolled, participants are invited for sample collection which involves giving a blood or saliva sample or both for DNA and RNA extraction. DNA sequencing is then performed followed by data analysis. The results are reported back to the participants in the form of a Genome Report that is made publicly available after a grace period of one month. However, the participant is able to withdraw from the project at any time. DNA methylation data is generated using the Illumina HumanMethylation450 BeadChip array (450 k) and results are displayed in a freely available Methylome Report, a unique feature of the UK branch of the project. The preparation of both Genome and Methylome reports is discussed in more details in the Usage Notes Section.

A pilot cohort of ten members of the public make up the PGP-UK multi-omics reference panel. For this cohort, we collected whole-genome bisulfite sequencing (WGBS) and RNA sequencing (RNA-seq) in addition to WGS and 450 k data. Figure 1 shows a schematic of the PGP-UK workflow. More information about PGP-UK can be found in1,2 and on the project’s website www.personalgenomes.org.uk.

Fig. 1
figure 1

PGP-UK workflow. Horizontal panels depict the general sample/data categories and options (e.g blood and/or saliva) and vertical panels depict specific data types and their flow from start to end.

While controlled access multi-omics data can be submitted into a single public repository (e.g. EGA in Europe or dbGaP in the USA), there is currently no single public repository for open access multi-omics data. Consequently, the different types of datasets (WGS, WGBS, RNA-seq, 450 k) were submitted to the corresponding repositories (European Nucleotide Archive (ENA), European Variation Archive (EVA), ArrayExpress) at EMBL-EBI. The details are given in the Data Records section and direct data download links are provided on the PGP-UK data web page www.personalgenomes.org.uk/data. For convenience, we offer a web API to download all the available PGP-UK data (see Data Records). The cumulative size of the PGP-UK multi-omics reference panel exceeds 2TB, which means that it would take over 3 days (more than 85 hours) to download (with mean UK download speed of 54.2 Mbps, Ofcom 2018). To overcome this limitation, we collaborated with two cloud platform providers (Seven Bridges Genomics and Lifebit) to host PGP-UK data in their respective clouds for unrestricted access as briefly described in Data Records section.

In this paper, we describe the PGP-UK multi-omics human reference panel derived from 10 participants. We followed best practices to perform various quality control (QC) checks to ensure the quality of the pilot WGS, WGBS, RNA-seq and 450 k datasets as described in the Technical Validation section. Finally, we describe the methods employed for multi-omics data matching, which ensures that samples are mapped to the correct participant.

Methods

Ethics

The PGP-UK study is approved by the University College London (UCL) Research Ethics Committee (ID Number 4700/001) subject to annual reviews and renewals. All the research activities in the project are conducted in accordance with the Declaration of Helsinki, UK national laws and medical research regulatory requirements. Prior to their enrolment, every participant must pass an entrance exam, give their consent to participate in the project and agree for their data and associated reports to be made publicly available under open access.

Tissue samples

Blood samples were collected using EDTA Vacutainers (Becton Dickinson). Saliva samples were collected using Oragene OG-500 self-sampling kits. Sample processing and storage protocols were in line with HTA-approved standard operating procedures.

Whole-genome sequencing (WGS)

WGS libraries were prepared from whole blood DNA using Illumina TruSeq Nano in accordance with standard operating procedures. Illumina TruSeq Nano is a PCR-based method which, like all PCR-based methods, has limitations compared to PCR-free methods3,4,5. In addition, recent studies have shown that algorithms used to call copy number variation (CNV) from PCR-based library such as EnsembleCNV6 can be adapted to identify CNV regions from WGS data7. This indicates that PGP-UK WGS data can still be used to call CNV regions with good degree of accuracy.

Sequencing was performed on an Illumina HiSeq X Ten platform with an average depth of 30X. The resulting reads were trimmed using TrimGalore software, mapped to the human reference genome hg19 (GRCh37) using BWA-MEM algorithm (BWA v. 0.7.128). Ambiguously mapped reads (MAPQ <10) and duplicated reads were removed using SAMtools v. 1.29 and Picard v. 1.130 respectively. Genomic variants were called following the Genome Analysis Toolkit software (GATK v. 3.4–46) best practices10.

The corresponding FASTQ, BAM and VCF files were deposited in European Nucleotide Archive (ENA) with with study ID PRJEB1752911.

Whole-genome bisulfite sequencing (WGBS)

DNA was extracted from blood samples followed by bisulfite conversion and library preparation using the TruMethyl Whole Genome Kit v2.1. WGBS was performed on an Illumina HiSeq X Ten platform with an average depth of 15X. Generated FASTQ files were processed using GemBS v. 0.11.7 software12.

Resulting FASTQ and BAM files were deposited in the ENA with with study ID PRJEB1752911.

RNA Sequencing (RNA-seq)

RNA-seq was performed using 20 ng of RNA isolated from whole blood. All the involved procedures were implemented in accordance with the corresponding manufacturers’ protocols.

Libraries for RNA-seq were prepared with SENSE mRNA-seq Library Prep Kit v2, purified and amplified (18 PCR cycles). After adding adapters and indices, sequencing libraries were further purified using Solid Phase Reversible Immobilisation beads. The output was QC-verified and quantified using Qubit fluorometer. Finally, library QC was performed on Bioanalyzer 2100 and further quantified by qPCR with KAPA library quantification kit and the sequencing was performed on Illumina HiSeq 4000.

RNA-seq FASTQ files are available to download from the ArrayExpress (accession ID E-MTAB-652313) and ENA (project ID PRJEB2513914).

DNA methylation profiling

Genomic DNA (500 ng) extracted from whole blood and saliva was bisulfite converted using the EZ DNA Methylation Kit (Zymo Research) following the recommended incubation conditions for 450 k. Methylation profiling was subsequently performed on 450 k arrays using Illumina iScan Microarray Scanner at UCL Genomics, in accordance with standard operating procedures.

Raw DNA methylation array data (IDAT files) for PGP-UK participants were submitted to the ArrayExpress repository with accession number E-MTAB-537715.

Data Records

The entire PGP-UK dataset is freely available for download from public repositories with no access restrictions. Links for the particular datasets are provided on the PGP-UK website (www.personalgenomes.org.uk). Accession numbers and dataset identifiers are presented in Table 1. Basic phenotype data, which includes self-reported age, sex, smoking status, etc., alongside with genome and methylome reports, generated by the PGP-UK, can be found on the project’s data web page www.personalgenomes.org.uk/data. Furthermore, all of the data (including associated metadata) are available through the PGP-UK API. The API is compliant with the Open API Specification 3.0 and is documented at www.personalgenomes.org/api.

Table 1 PGP-UK data identifiers for the reference panel comprised of 10 PGP-UK participants.

Whole genome sequencing and whole genome bisulfite sequencing data are freely available from the ENA under the project ID PRJEB1752911. RNA-seq data is deposited in ArrayExpress under the accession number E-MTAB-652313 and in ENA PRJEB2513914. DNA methylation array data for PGP-UK participants is stored in ArrayExpress under the accession number E-MTAB-537715.

The PGP-UK pilot dataset described in2 resulted in the PGP-UK multi-omics reference panel described here. The datasets are available from the above-mentioned repositories and from the Seven Bridges Cancer Genomics cloud (docs.cancergenomicscloud.org/docs/personal-genome-project-uk-pgp-uk-pilot-dataset), which offers various tools and workflows for genomic and epigenomic data analysis.

The PGP-UK multi-omics reference panel is also available in the Lifebit cloud through their Open Data project (opendata.lifebit.ai/table/pgp) along with interactive analyses (ancestry, phenotypic traits, genetic variance) and custom pipelines provided by Lifebit’s cloud-computing platform Deploit (deploit.lifebit.ai). As a part of our collaboration with Lifebit our data have also been uploaded to a public Amazon Web Services (AWS) Simple Storage Service (S3) Bucket. This S3 Bucket is available at https://s3.console.aws.amazon.com/s3/buckets/pgp-lifebit (publically accessible with an AWS account) and can be used independently of the Lifebit platform within AWS or any other cloud platform using AWS S3 APIs.

To provide maximum access, PGP-UK data can in principle be hosted in any cloud complying with ‘best practices’ as well as adequate legal and ethical governance16. To this end, we have initiated discussions to also host our data under the Early Adopter Programme of the European Open Science Cloud (https://www.eosc-portal.eu/) and with Open Humans (https://www.openhumans.org/) which opened their project to global members in March 2019.

Technical Validation

In this section, we describe the outcomes of the PGP-UK data quality control checks and validation for the pilot cohort. In a first instance, we describe the QC framework and discuss outputs for each types of data collected. Then, we provide details of multi-omics data matching validation procedures based on cross-comparison of variants between different data types for each individual.

Data quality control

WGS data QC

Quality control of the reported WGS data was performed using FastQC v. 0.11.2 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and Picard v. 1.130 (https://github.com/broadinstitute/picard) tools. QC reports were generated using MultiQC v. 1.5 software17.

The WGS data average median coverage is above 35X (varies between 30X and 47X across samples) with more than 73% of the bases covered reaching 30X or more (varies between 54% and 95% across samples), see Fig. 2(a). A summary of the WGS QC analysis is presented in Table 2.

Fig. 2
figure 2

PGP-UK QC images for WGS, WGBS, RNA-seq and 450 k methylation data. (a) WGS coverage depth plot. (b) WGBS coverage depth plot. (c) RNA-seq reads distribution over the different genome features. (d) Density plot for Illumina 450 k methylation profiles.

Table 2 Quality control metrics summary of the WGS data derived from blood samples of 10 PGP-UK participants.

WGBS data QC

GemBS v. 3.2.1, FastQC v. 0.11.7 and Picard v. 2.18.23 tools were used in quality control of the PGP-UK WGBS and data QC reports were generated using MultiQC v. 1.5 software17.

WGBS average median coverage is above 14X (ranging from 10X to 16X across samples) with more than 19% of bases covered reaching 30X or deeper (varies between 15% and 25% across samples), see Fig. 2(b). Summary of the WGBS QC analysis is presented in Table 3.

Table 3 Quality control metrics summary of the WGBS data derived from blood samples of 10 PGP-UK participants.

RNA-seq data QC

All of the RNA-seq samples were processed with a modified version of the nextflow18 nf-core RNA-seq pipeline (https://github.com/UCL-BLIC/rnaseq). Specifically, reads were trimmed with TrimGalore v. 0.4.1, aligned against hg19 with STAR v. 2.5.2a19 and duplicated reads were identified and removed with Picard v. 2.18.9 tools. QC reports were generated using MultiQC v. 1.517 as part of the same pipeline. Reads were further split and trimmed using GATK4.

The mean RNA integrity number (RIN) value of the RNA used for sequencing was 8.55 (ranging between 7.1 and 9.3). Figure 2(c) demonstrates the distribution of mapped reads over various genomic features. A summary of the RNA-seq QC analysis is presented in Table 4.

Table 4 Quality control metrics summary of the RNA-seq data derived from blood samples of 10 PGP-UK participants.

450 k methylation data QC

450 k DNA methylation profiles were generated from whole blood and saliva for each of the ten participants in the PGP-UK multi-omics reference panel. For quality control of these data, we used R v. 3.5.2 with minfi v. 1.28.3 and ewastools v. 1.4 libraries20,21.

We performed quality checks based on 17 metrics assessed at control probes as described in the Illumina’s BeadArray Controls Reporter. All 17 metrics derived from the array control probes’ data are within the manufacturer’s recommended thresholds. In addition, we analysed detection p-values and bead count information, which is available for 100% and 99.92% of CpGs respectively. 99.96% of the detection p-values are below the threshold of 0.01. Average CpG bead count number across all samples is 14, and 100% of the available bead count numbers ≥3. A summary of this analysis is presented in Table 5. Figure 2(d) shows the overlay of the β-value density distributions for all samples.

Table 5 Quality control metrics summary of the Illumina 450 k data derived from blood and saliva samples of 10 PGP-UK participants.

Multi-omics data matching

In order to ensure data integrity and exclude the possibility of sample mix-up between study participants, we validated our sample assignments, by matching the available 450 k, WGBS and RNA-seq data against WGS. First, we matched the 450 k against WGS data for each participant using 65 single nucleotide polymorphisms (SNP) control probes from Illumina 450 k array. Second, we matched the WGBS-derived genotypes for the same 65 SNP loci with the WGS data. Third, we compared genotypes derived from RNA-seq and WGS data based on the set of loci from protein coding regions. The schema of the multi-omics data matching is given in Fig. 3(a) and further details are provided below.

Fig. 3
figure 3

Multi-Omics Data Matching. (a) Multi-Omics Data Matching Schema. 65 loci were used in matching WGS with methylation and WGBS data, 279 loci were used in matching WGS with RNA-seq data. (b) Correlation plot displaying matching results for WGS vs. 450 k datasets. (c) Correlation plot displaying matching results for WGS vs. WGBS datasets. (d) Correlation plot displaying matching results for WGS vs. RNA-seq datasets. On correlation plots (bd) scale is represented by the combination of ball size and colour (from white to dark blue) and goes from 0 (0% match) to 1 (perfect 100% match).

We used β-values recorded at the 65 450 k SNP control probes to distinguish between heterozygous and homozygous alleles in the 450 k dataset. These SNPs are by design highly variable and can therefore provide a unique genetic signature that can be used to differentiate between each study participant. Note that 64 out of these 65 SNPs are outside protein-coding regions and, hence, not available for the RNA-seq data. We identified 279 SNP loci present in at least 4 WGS samples which were also highly expressed across RNA-seq samples (in the top 100 most expressed genes) yielding a suitable validation set for the WGS vs. RNA-seq comparison.

To match the different datasets, we extracted the locations of the loci used for validation (65 loci for the WGS vs. 450 k and WGBS vs. 450 k comparisons, and 279 loci for WGS vs. RNA-seq) and used the HaplotypeCaller and GenotypeGVCFs tools from (GATK v. 3.8.0) on the corresponding BAM files to force the call of genotypes in these locations. Percentage of matching genotypes were then obtained across samples and datasets to confirm sample identity as presented on Fig. 3 and Table 6.

Table 6 Summary of data cross-validation between 450 k, WGBS and RNA-seq against WGS.

WGS vs. 450 k

In order to obtain genotype information from 450 k data, we extracted β-values for the 65 SNP control probes for each of the 10 PGP-UK participants. As expected, these β-values clustered into three separate peaks around 0.5 (which corresponds to heterozygous genotypes), 0 and 1 (which correspond to homozygous genotypes). We checked and confirmed that reported β-values for all SNP control probes which were derived from the whole blood and corresponding saliva 450 k datasets were a 100% match. In other words, we established that the zygosity of each probe was the same across both DNA samples for any given participant.

We then extracted the genotypes for those 65 SNPs from WGS and matched them with to the corresponding zygosity in the 450 k data. This resulted in perfect 100% match for corresponding samples, i.e. samples from the same participant, see Fig. 3(b) and Table 6.

WGS vs. WGBS

This comparison was performed by matching WGS- and WGBS-derived genotypes for 65 Illumina 450 k array SNP control probes. The mean agreement between matched samples was 99.45%, which corresponds to a total of 3 loci mismatch observed in 3 out of ten participants (i.e. a single mismatch for each of those three participants). Altogether, 100% and 84.77% of 65 SNPs had coverage in the WGS and WGBS data respectively, which allowed us to make our comparison based on 51–61 common loci per participant, see Fig. 3(c) and Table 6.

WGS vs. RNA-seq

To match RNA-seq with WGS samples, we used a set of common loci in highly expressed genes as described above. Available genotypes for these loci were extracted from the RNA-seq and WGS samples and cross-validated. In total, 92.65% and 80.93% of these 279 loci had coverage in the RNA-seq and WGS data respectively, which allowed us to make our comparison based on 152–197 loci per participant. On average, corresponding WGS and RNA-seq data are in agreement for 76.17% of genotype calls (range 69.68–83.23%), see Fig. 3(d) and Table 6.

Results of matching 450 k, RNA-seq and WGBS data with WGS are presented in Table 6. The correlation plots presented on Fig. 3, demonstrate a substantially higher level of correspondence between samples from the same individual compared to those from different people when comparing WGS vs. 450 k (Fig. 3(b)), WGS vs. WGBS (Fig. 3(c)) and WGS vs. RNA-seq (Fig. 3(d)).

Usage Notes

Here we describe two key outputs generated for each PGP-UK participant, the Genome and Methylome Reports. These reports are freely available to download on PGP-UK website, see https://www.personalgenomes.org.uk/data/.

Genome Reports leverage the information from variant call files (VCFs) and provide an overview of the potential influence of genetic variants on several genetic traits, as well as ancestry information. Potentially beneficial or harmful traits for each participant were identified using public data from SNPedia22, gnomAD v2.0.223, GetEvidence24 and ClinVar25. Plots to visualise the ancestry of each participant were created by applying principal component analysis (as implemented in Plink v1.926) on a genotype matrix resulting from merging the participant genotypes with those from 2504 unrelated samples from 26 worldwide populations available from the 1000 Genomes Project27. Population membership proportions were obtained using the Admixture v1.3.0 software28 on above-mentioned genotype matrix.

Methylome reports contain epigenetic age and smoking status prediction for PGP-UK participants based on their methylome as assessed by 450 k array experiments. Raw data were processed, quality controlled and analysed using ChAMP29,30 and minfi20 pipelines for R. Epigenetic age calculation was based on the multi-tissue Horvath clock31, which predicts age using a linear combination of the methylation levels from a reference panel of 353 CpGs. Smoking status was predicted by calculating smoking scores as linear combinations of the methylation levels at 183 CpGs and then comparing them to a particular threshold as described in32. More details on the PGP-UK Genome and Methylome reports are described in2.

Outlook

Because of its open access status, the PGP-UK multi-omics reference panel described here has the potential to become the reference panel of choice for the implementation of the FAIR (Findable, Accessible, Interoperable, Reusable) principles33 for data sharing and the integration of new data standards and formats. While we shall aim to increase the size of the panel in the longer term, the immediate aim is to add more data. Towards this, we have already generated methylation count files (MCFs) which are the epigenetic equivalent to variant count files (VCFs). These pre-processed data files are very popular with users for downstream analyses. While agreed standards and procedures are in place for generating and depositing VCFs into public databases, PGP-UK is at the forefront of helping to establish these for MCFs in collaboration with EBI (https://www.ebi.ac.uk) and ELIXIR (https://elixir-europe.org/). In addition, PGP-UK is spearheading efforts to add Phenopackets to our reference panel. PhenoPackets are represented as phenotype exchange files (PXFs), a novel open standard for sharing disease and phenotype information (http://phenopackets.org). For disease and health information in general, we are currently exploring with Patients Know Best (https://www.patientsknowbest.com/) how to link our reference panel to the corresponding NHS health records. All these activities are conducted in compliance and collaboration with EU standards for precision medicine, EU-STANDS4PM (https://www.eu-stands4pm.eu/).