Flexible and scalable diagnostic filtering of genomic variants using G2P with Ensembl VEP

We aimed to develop an efficient, flexible and scalable approach to diagnostic genome-wide sequence analysis of genetically heterogeneous clinical presentations. Here we present G2P (www.ebi.ac.uk/gene2phenotype) as an online system to establish, curate and distribute datasets for diagnostic variant filtering via association of allelic requirement and mutational consequence at a defined locus with phenotypic terms, confidence level and evidence links. An extension to Ensembl Variant Effect Predictor (VEP), VEP-G2P was used to filter both disease-associated and control whole exome sequence (WES) with Developmental Disorders G2P (G2PDD; 2044 entries). VEP-G2PDD shows a sensitivity/precision of 97.3%/33% for de novo and 81.6%/22.7% for inherited pathogenic genotypes respectively. Many of the missing genotypes are likely false-positive pathogenic assignments. The expected number and discriminative features of background genotypes are defined using control WES. Using only human genetic data VEP-G2P performs well compared to other freely-available diagnostic systems and future phenotypic matching capabilities should further enhance performance.


Loss of function
Where any of the mutations are nonsense, frame-shifting indel, essential splice site mutation, whole gene deletion OR any other mutation where functional analysis demonstrates clear reduction or loss of function

All missense/in frame
Where all the mutations described in the data source are either missense or in frame deletions and there is no evidence favoring either loss-of-function, activating or dominant negative effect

Dominant negative
Mutation within one allele of a gene that creates a significantly greater deleterious effect on gene product function than a monoallelic loss of function mutation

Activating
Mutation, usually missense, that results in functional activation of the gene product

Increased gene dosage
Copy number variation that increases the functional dosage of the gene

Cis-regulatory or promoter mutation
Mutation in cis-regulatory elements that lies outwith the known transcription unit and promoter of the controlled gene

5' or 3'UTR mutation
Mutation within the 5' or 3' untranslated region of the transcript which results in mislocalisation or altered stability of the RNA molecule

Uncertain
Where the exact nature of the mutation is unclear or not recorded

G2P implementation details
The G2P web application is built with the Perl Mojolicious web framework. The front end uses the Bootstrap framework, with its HTML- and CSS-based design templates, and the jQuery JavaScript library. Our data are stored in a MySQL relational database. The ensembl-gene2phenotype API provides access to the database and supports data retrieval and data edits. The database schema and API have been developed according to the design principles of existing Ensembl databases and APIs. The ensembl-gene2phenotype API inherits all methods for manipulating data in the underlying database from the Ensembl core API [1].

G2P VEP plugin logic
The VEP-G2P plugin (https://www.ebi.ac.uk/gene2phenotype/g2p_vep_plugin) identifies possible disease-causing variants (i.e. "valid hits") by applying a set of filtering rules. Variants that pass all filtering rules are then assessed to decide whether enough of them overlap the same transcript to fulfil the allelic requirement of the transcript's gene. The number of variants considered sufficient is determined by the allelic requirement of the gene: for a biallelic gene, at least 2 heterozygous variants which pass all filtering rules and are located in the same transcript are required.
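The biallelic rule stated above can be illustrated with a minimal Python sketch. It is not the plugin's code (the plugin is written in Perl); the record keys `id`, `transcript`, `zygosity` and `passed` are hypothetical names chosen for this example, and only the rule described in the text is implemented. Other allelic requirements are deliberately left out.

```python
from collections import defaultdict


def biallelic_hits(variants):
    """Return transcripts with at least 2 heterozygous variants that passed filtering.

    `variants` is an iterable of dicts with the hypothetical keys
    'id', 'transcript', 'zygosity' ('HET' or 'HOM') and 'passed' (bool).
    """
    het_by_transcript = defaultdict(list)
    for variant in variants:
        # Only variants that passed every filtering rule are counted.
        if variant["passed"] and variant["zygosity"] == "HET":
            het_by_transcript[variant["transcript"]].append(variant["id"])
    # Keep transcripts that reach the threshold of two heterozygous variants.
    return {t: ids for t, ids in het_by_transcript.items() if len(ids) >= 2}


example = [
    {"id": "v1", "transcript": "T1", "zygosity": "HET", "passed": True},
    {"id": "v2", "transcript": "T1", "zygosity": "HET", "passed": True},
    {"id": "v3", "transcript": "T2", "zygosity": "HET", "passed": True},
    {"id": "v4", "transcript": "T2", "zygosity": "HET", "passed": False},
]
hits = biallelic_hits(example)
```

In this example only transcript T1 qualifies, because the second variant on T2 did not pass filtering.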

VEP-G2P plugin input data format
The plugin requires as input a file which lists genes of interest and their allelic requirements.
Files for individual G2P panels can be downloaded from the G2P website (https://www.ebi.ac.uk/gene2phenotype/downloads). Equivalent input files can be generated for any gene set.
The file needs to be tab-delimited and must contain at least two columns: the first listing the gene symbol and the second listing the allelic requirement. A customized file must start with a header line.
Recognized header fields are 'gene symbol' and 'allelic requirement'. Each row lists one pair of gene symbol and allelic requirement. If a gene has more than one allelic requirement, either a separate row can be provided for each allelic requirement or the allelic requirements can be listed in one row separated by semicolons. The plugin also accepts PanelApp data files (https://panelapp.genomicsengland.co.uk). A PanelApp data file is recognized by its header fields 'Gene_Symbol' and 'Model_Of_Inheritance', which are the equivalents of 'gene symbol' and 'allelic requirement'. Variant data can be input in VCF or a simple tab-delimited format.
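As an illustration of the gene list format described above, the following Python sketch reads a tab-delimited file with either pair of header fields and collects the allelic requirements per gene. The function name `read_gene_requirements` and the gene symbols in the example are hypothetical, and the plugin itself may parse the file differently. Requirement values are passed through unchanged; no mapping of PanelApp terms is attempted.

```python
import csv
import io

# Header fields named in the text, mapped to two internal column roles.
HEADER_ALIASES = {
    "gene symbol": "gene",
    "allelic requirement": "requirement",
    "Gene_Symbol": "gene",
    "Model_Of_Inheritance": "requirement",
}


def read_gene_requirements(handle):
    """Return {gene symbol: set of allelic requirements} from a tab-delimited file."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = {}
    for name in reader.fieldnames or []:
        if name in HEADER_ALIASES:
            columns[HEADER_ALIASES[name]] = name
    if set(columns) != {"gene", "requirement"}:
        raise ValueError("header must name a gene symbol and an allelic requirement column")

    requirements = {}
    for row in reader:
        gene = (row[columns["gene"]] or "").strip()
        if not gene:
            continue
        # One row may hold several requirements separated by semicolons.
        for value in (row[columns["requirement"]] or "").split(";"):
            value = value.strip()
            if value:
                requirements.setdefault(gene, set()).add(value)
    return requirements


example_file = io.StringIO(
    "gene symbol\tallelic requirement\n"
    "GENE1\tmonoallelic;biallelic\n"
    "GENE2\tbiallelic\n"
    "GENE2\tmonoallelic\n"
)
gene_requirements = read_gene_requirements(example_file)
```

Both ways of giving multiple requirements (one row with semicolons, or one row per requirement) end up in the same per-gene set.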

Computational time optimization
To reduce VEP-G2P computational time, we extracted from the WES VCFs for each cohort only those variants located in the genomic regions containing the genes in G2P DD (~145 Mbp) and G2P Cancer (~8 Mbp).
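The text does not say which tool was used for this extraction. The Python sketch below only illustrates the idea, assuming the gene regions are available as BED-style lines (chromosome, 0-based start, exclusive end); the function names are hypothetical. For real WES data an indexed VCF tool would be the usual choice, since this linear scan checks each variant's POS against every region on its chromosome.

```python
def parse_regions(bed_lines):
    """Read BED-style lines into {chromosome: [(start, end), ...]}."""
    regions = {}
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def filter_vcf(vcf_lines, regions):
    """Yield header lines plus the records whose POS falls inside a region."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        chrom, pos = line.split("\t", 2)[:2]
        pos = int(pos)
        # BED intervals are 0-based and half-open; VCF POS is 1-based.
        if any(start < pos <= end for start, end in regions.get(chrom, [])):
            yield line


regions = parse_regions(["1\t99\t200\n"])
vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t100\t.\tA\tG\n",
    "1\t201\t.\tC\tT\n",
    "2\t150\t.\tG\tA\n",
]
kept = list(filter_vcf(vcf, regions))
```

Here only the record at position 100 on chromosome 1 is retained alongside the two header lines.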

Gene sets for supported assemblies
The default gene set for VEP variant annotation is the Ensembl/GENCODE gene set, which is the full merge of Ensembl evidence-based transcript predictions with Ensembl manual annotation [2]. VEP supports annotation for the human GRCh37 and GRCh38 assemblies.
Ensembl/GENCODE data on GRCh37 have been provided as a stable archive since the release of the GRCh38 assembly. The gene annotation on the GRCh37 archive is based on Ensembl/GENCODE data from release 75 (Ensembl release/90: GRCh37.p13 February 2014). The gene set for GRCh38 is updated twice per year. The RefSeq gene models can also be used in VEP annotation.

Allele frequency annotation
VEP can use local cache files which are built from the variants stored in an Ensembl variation database together with allele frequencies from the 1000 Genomes Project, the NHLBI Exome Sequencing Project and the gnomAD exome data to provide allele frequency data. The variation data for GRCh37 are updated roughly annually whereas the database for GRCh38 is updated four times per year. The cache files contain allele frequencies for variants which have been assigned RefSNP (rsID) identifiers. The VEP-G2P plugin uses the allele frequencies from the cache files for variant filtering but can also add allele frequencies from other sequencing projects like TOPMed, UK10K and the gnomAD whole genome data. The plugin uses a function implemented in the Ensembl API to match input variants directly to alleles in VCF files, resolving different normalisation approaches that may have been used. VEP can also deal with gnomAD's representation of multi-allelic variants on one VCF line.
To derive the correct co-located variant allele for an input allele, VEP normalizes the input variant allele and the co-located allele independently to arrive at a potential match. The algorithm is outlined on the VEP documentation pages [http://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#colocated]. Searching additional allele frequency resources slows down the overall VEP run but provides more extensive frequency annotation.
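The idea of normalizing two alleles independently before comparing them can be sketched in Python as follows. This is a simplified illustration, not VEP's implementation: it trims bases shared by the reference and alternate alleles from both ends, and it does not shift indels within repeats. The function names and the `(position, ref, alt)` tuple layout are assumptions made for this example.

```python
def normalise(pos, ref, alt):
    """Trim bases shared by ref and alt; return (pos, ref, alt) with '-' for empty."""
    # Treat the '-' placeholder for an empty allele as an empty string.
    ref = "" if ref == "-" else ref
    alt = "" if alt == "-" else alt
    # Trim shared trailing bases (position is unaffected).
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases, moving the start position right each time.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref or "-", alt or "-"


def alleles_match(first, second):
    """True if two (pos, ref, alt) tuples describe the same change after trimming."""
    return normalise(*first) == normalise(*second)


# A deletion written VCF-style (with an anchor base) and without the anchor base.
vcf_style = (100, "CAG", "C")
trimmed_style = (101, "AG", "-")
same_deletion = alleles_match(vcf_style, trimmed_style)

# An insertion padded with extra context, as can happen on multi-allelic lines.
padded = (100, "CAG", "CAGAG")
minimal = (100, "C", "CAG")
same_insertion = alleles_match(padded, minimal)
```

Both pairs compare as equal after trimming, whereas a direct string comparison of the original alleles would miss them.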