Determining the pathogenicity of genetic variants is a critical challenge, and functional assessment is often the only option. Experimentally characterizing millions of possible missense variants in thousands of clinically important genes requires generalizable, scalable assays. We describe variant abundance by massively parallel sequencing (VAMP-seq), which measures the effects of thousands of missense variants of a protein on intracellular abundance simultaneously. We apply VAMP-seq to quantify the abundance of 7,801 single-amino-acid variants of PTEN and TPMT, proteins in which functional variants are clinically actionable. We identify 1,138 PTEN and 777 TPMT variants that result in low protein abundance, and may be pathogenic or alter drug metabolism, respectively. We observe selection for low-abundance PTEN variants in cancer, and show that p.Pro38Ser, which accounts for ~10% of PTEN missense variants in melanoma, functions via a dominant-negative mechanism. Finally, we demonstrate that VAMP-seq is applicable to other genes, highlighting its generalizability.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We thank J. Underwood and K. Munson of the UW PacBio Sequencing Services for assistance with long-read sequencing; A. Leith of the UW Foege Flow Lab and L. Gitari and D. Prunkard of the UW Pathology Flow Cytometry Core Facility for assistance with cell sorting; and B. Shirts and C. Pritchard in the UW Department of Lab Medicine for advice. The authors would like to acknowledge the American Association for Cancer Research and its financial and material support in the development of the AACR Project GENIE registry, as well as members of the consortium for their commitment to data sharing. Interpretations are the responsibility of study authors. This work was supported by the National Institute of General Medical Sciences (1R01GM109110 and 5R24GM115277 to D.M.F., P50GM115279 to M.V.R. and W.E.E., National Cancer Institute R01CA096670 to S.B. and P30CA21765 to M.V.R.) and an NIH Director’s Pioneer Award (DP1HG007811 to J.S.). K.A.M. is an American Cancer Society Fellow (PF-15-221-01), and was supported by a National Cancer Institute Interdisciplinary Training Grant in Cancer (2T32CA080416). M.A.C. and V.E.G. are supported by the National Science Foundation Graduate Research Fellowship. J.N.D. is supported by a National Institute of General Medical Sciences Training Grant (T32GM007454). J.S. is an Investigator of the Howard Hughes Medical Institute. D.M.F. is a Canadian Institute for Advanced Research Azrieli Global Scholar.
Integrated supplementary information
Supplementary Figure 1 Validation experiments of EGFP-fusions for assessing PTEN and TPMT steady-state abundance.
a, Representative gating strategy for mTagBFP2 negative, mCherry positive cells containing 15,000 recombined cells. b, PTEN variant EGFP:mCherry ratio geometric means as a fraction of WT, for known and previously uncharacterized PTEN low-abundance variants. Error bars denote 95% confidence intervals of the mean (red), with individual data points shown in grey. Each variant was assessed in at least 3 independent experiments. c, Similar plot for TPMT, with error bars denoting 95% confidence intervals of the mean (red), with individual data points shown in grey. All variants were independently assessed three times, except variants p.Asp15Tyr, p.Arg64Ser, p.Ala80Pro, p.Ile143Thr, p.Lys238Glu, p.Tyr240Cys, which were assessed twice. d, Scatterplot comparison of WT-normalized EGFP:mCherry ratios for EGFP- or 15-aa split-GFP fused PTEN variants. Values are the mean of 3 independently performed experiments. n = 6 samples. “r” and “ρ” denote Pearson’s and Spearman’s correlation coefficients, respectively
a, b, Pairwise VAMP-seq abundance score correlations between replicate sorting experiments for PTEN (a) and TPMT (b). n values are the number of variants scored in both experiments. Replicates 5 and 6 for TPMT contained a subset of mutagenized positions different from those mutagenized in replicates 1 through 4, with both subsets mixed together for Replicates 7 and 8. Pearson’s correlation coefficients are shown. Score numbers in this figure correspond to experiment numbers in Supplementary Table 1
a, b, Scatterplot comparison of VAMP-seq abundance scores (x-axis) and individually assessed log10-transformed, WT-normalized geometric means of the EGFP:mCherry ratios for various PTEN (a) and TPMT (b) variants (see also Supplementary Figure 1b, c). r and ρ denote Pearson’s and Spearman’s correlation coefficients, respectively. c, PTEN VAMP-seq scores for variants whose steady-state expression was characterized by western blot analysis in previous publications (see Supplementary Table 9). d, Scatterplot comparing TPMT VAMP-seq scores (y-axis) and previously published abundance values from western blots (see Supplementary Table 10). e, Nonsense variant VAMP-seq scores by amino acid position, for PTEN (top) and TPMT (bottom). WT abundance score (1.0) shown as a blue line. N-terminal nonsense variants append a small number of residues to EGFP, which does not affect its abundance. C-terminal nonsense variants remove a small number of residues from PTEN or TPMT, which also does not impact abundance. f, Missense variant abundance score density plots for PTEN (gray) and TPMT (green). The thresholds of the 5% lowest synonymous variant scores are shown, for each protein, by the dotted lines. g, h, Scatterplot comparing positional median PTEN (g) and TPMT (h) VAMP-seq scores to PSIC evolutionary conservation scores for each position (Sunyaev et al.). i, j, Positional median PTEN (i) and TPMT (j) abundance scores for positions found in various secondary structure types, with the red line denoting the median value for the group. n values denote the number of positions that fell into each category
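The synonymous-variant threshold in panel f can be illustrated with a minimal sketch. It assumes a nearest-rank percentile; the legend does not say which percentile method was used, and the function names and example scores below are hypothetical.

```python
import math

def synonymous_threshold(synonymous_scores, fraction=0.05):
    """Return the score bounding the lowest `fraction` of synonymous variants.

    Assumption: nearest-rank percentile. The original analysis may have
    used a different (e.g. interpolating) percentile definition.
    """
    if not synonymous_scores:
        raise ValueError("need at least one synonymous score")
    ordered = sorted(synonymous_scores)
    # Rank of the last score that still lies within the lowest fraction.
    k = max(1, math.ceil(fraction * len(ordered)))
    return ordered[k - 1]

def fraction_at_or_below(scores, threshold):
    """Fraction of scores that are at or below the threshold."""
    return sum(1 for s in scores if s <= threshold) / len(scores)

# Hypothetical example: 20 synonymous scores, one of them unusually low.
example_synonymous = [1.0] * 19 + [0.4]
example_threshold = synonymous_threshold(example_synonymous)
```

Applying `fraction_at_or_below` to a protein's missense scores with this threshold gives the share of missense variants that score as low as the lowest 5% of synonymous variants.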
a, Scatterplot comparing abundance score (y-axis) to in vitro characterized melting temperatures of select PTEN variants (Johnston et al.). r and ρ denote Pearson’s and Spearman’s correlation coefficients, respectively. b, A plot of positional median scores for PTEN positions with potential hydrogen bonds or salt bridges. A position was considered intolerant only if it had 5 or more variants and more than 90% of the abundance scores were at or below the score threshold containing the lowest 5% of synonymous variants. Red bars denote median abundance score values. n = 26 for substitution intolerant, and n = 50 for the remaining positions. c, Substitution-intolerant PTEN positions with potential polar contacts, clustered by distance based on PDB coordinates (PDB: 1d5r). Positions within 11 Å of each other were considered part of a group. The dashed line shows the 11 Å distance cutoff. d, Histogram of the number of PTEN missense variants per position in COSMIC. Substitution-intolerant positions potentially involved in polar contacts with counts in COSMIC greater than 7 are labeled in red. e, Minimum distance of all PTEN positions (gray) or elevated-abundance positions (red) from known phospholipid-binding positions. The black line denotes a 7 Å distance. A position was considered elevated in abundance only if it had 5 or more variants and there were more than 5 variants with scores above the median of the synonymous distribution. f, VAMP-seq scores for variants at position S385, with a synonymous variant in black, negatively charged variants in red, positively charged variants in blue, and all other variants in gray
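The two positional rules stated in panels b and e can be written out as a short sketch. The cutoffs (5 or more variants, more than 90% low scores, more than 5 elevated variants) come from the legend; the function names and the way scores are passed in are assumptions made for illustration.

```python
from statistics import median

def is_substitution_intolerant(position_scores, low_threshold,
                               min_variants=5, min_fraction=0.9):
    """Panel b rule: 5 or more scored variants, and more than 90% of them
    at or below the threshold bounding the lowest 5% of synonymous scores."""
    if len(position_scores) < min_variants:
        return False
    n_low = sum(1 for s in position_scores if s <= low_threshold)
    return n_low / len(position_scores) > min_fraction

def is_elevated_abundance(position_scores, synonymous_scores,
                          min_variants=5, min_elevated=5):
    """Panel e rule: 5 or more scored variants, and more than 5 of them
    scoring above the median of the synonymous distribution."""
    if len(position_scores) < min_variants:
        return False
    syn_median = median(synonymous_scores)
    n_high = sum(1 for s in position_scores if s > syn_median)
    return n_high > min_elevated
```

Note that both comparisons are strict: a position with exactly 90% low scores, or exactly 5 elevated variants, does not qualify under this reading.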
Supplementary Figure 5 PTEN variant abundance classification and relationship to germline and somatic variation.
a, Illustrative examples of variant abundance classifications, with the dotted line representing the threshold above which 95% of synonymous variants reside. Points represent the VAMP-seq score for each representative variant, with error bars denoting the 95% confidence interval derived from experimental replicates. n values are 3, 5, 2, and 4 for p.Thr2Asp, p.Thr5Ala, p.Glu7His, and p.Lys6Ile, respectively. b, Frequencies of each PTEN abundance class for each PTEN ClinVar interpretation, as well as for all possible SNVs with abundance classifications. c, Abundance scores and classes for PTEN variants with allele counts highly unlikely to be causal for Cowden’s Syndrome. d, Frequencies of all observed PTEN variants across different cancer types in the TCGA and AACR GENIE data. Highly recurrent PTEN variants are labeled in red. e, Western blot analysis of a clonal line stably expressing WT or missense variants of N-terminally HA-tagged PTEN. This line was derived independently from the line used to generate the data shown in Figure 4f. This experiment was independently performed twice with similar results. f, Comparison of PTEN abundance scores with changes in folding energies predicted by Rosetta using the ddg_monomer protocol. Variants are shown as gray circles, with the exception of those with Rosetta ΔΔG predictions greater than 17, which are marked by a black “x” at a ΔΔG value of 17. Contour lines are colored by the regional density of points. Previously or newly identified PTEN dominant negative variants are shown as blue points with blue labels
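Panel a combines a score, its 95% confidence interval, and the synonymous threshold into a class, but this legend does not spell out the rule. The sketch below shows one plausible four-way scheme. The class labels, the decision rule, and the handling of ties are all assumptions and should be checked against the paper's methods.

```python
def classify_abundance(score, ci_lower, ci_upper, threshold):
    """Assign a hypothetical abundance class from a score and its 95% CI.

    Assumed rule (not stated in this legend):
      - score below threshold, whole CI below it       -> "low"
      - score below threshold, CI reaches the threshold -> "possibly_low"
      - score at/above threshold, CI dips below it      -> "possibly_wt_like"
      - score at/above threshold, whole CI at/above it  -> "wt_like"
    """
    if score < threshold:
        return "low" if ci_upper < threshold else "possibly_low"
    return "wt_like" if ci_lower >= threshold else "possibly_wt_like"
```

Under this reading, the confidence interval only decides how firmly a variant is placed on its side of the threshold; the side itself is set by the point estimate.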
Supplementary Figure 6 Flow chart of PTEN p.Ile135Lys pathogenicity reinterpretation using VAMP-seq data.
The ACMG/AMP joint criteria for classifying variants were used, with low abundance classification by VAMP-seq considered strong experimental support of pathogenicity (PS3). Without functional data, there is no strong or very strong evidence of pathogenicity for this variant, so the pathogenic criteria cannot be fulfilled and the variant remains classified as likely pathogenic. With low abundance data, PS3 can be applied and the pathogenic criteria are met
a, Scatterplot comparing abundance scores and previously characterized red blood cell (RBC) activity from patients. b, c, Scatterplots comparing individually assessed, WT-normalized EGFP:mCherry geometric means to previously published values of average RBC activity (b), or average patient dosage intensity (c). Dose intensity is the dose where 6-MP becomes toxic to the patient before reaching the 100% protocol dose of 75 mg/m². r and ρ denote Pearson’s and Spearman’s correlation coefficients, respectively. n = 6 samples for each plot. d, Western blotting results for individually expressed TPMT variant GFP fusions. Each variant was blotted with 45, 15, and 5 µg of total protein input per lane. This experiment was performed once
A histogram of protein stability indices from Yen et al. Protein stability index values for proteins tested in the VAMP-seq assay are shown as dashed vertical lines. Protein stability indices were not available for PTEN, CYP2C9, CYP2C19, and PMS2
Scatterplots comparing variant frequency derived from replicate PCR amplification and sequencing for each of the four bins in every PTEN experiment are shown
a, b, Scatterplots showing the total frequencies and weighted average values of WT (black), synonymous variants (red), or non-terminal nonsense variants (blue) for each experiment, for PTEN and TPMT respectively. A combination of synonymous variant coefficient of variation (c and d), synonymous variant mean (black) and median (red) (e and f), and total number of scored missense variants (g and h) for PTEN (c, e, and g) and TPMT (d, f, and h) were assessed at increasing total frequency filtering threshold values to obtain the threshold value that we required across the four bins for a variant to be included in the analyses we present. The 1 × 10^-4.75 total frequency threshold used for the final analysis is displayed as a dotted line in each plot
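The frequency filter described in this legend can be sketched as follows. The sketch assumes that "total frequency" means the sum of a variant's frequencies over the four sorted bins and that a variant is kept when this sum is at or above the threshold; neither detail is stated explicitly in the legend, and the function names are hypothetical.

```python
# Threshold from the legend: 1 x 10^-4.75, roughly 1.78e-5.
TOTAL_FREQUENCY_THRESHOLD = 10 ** -4.75

def total_frequency(bin_frequencies):
    """Sum a variant's frequencies across the four bins (assumed definition)."""
    return sum(bin_frequencies)

def passes_filter(bin_frequencies, threshold=TOTAL_FREQUENCY_THRESHOLD):
    """Keep a variant if its total frequency reaches the threshold."""
    return total_frequency(bin_frequencies) >= threshold

def count_retained(variant_bin_frequencies, thresholds):
    """Count variants retained at each candidate threshold.

    `variant_bin_frequencies` maps a variant name to its four bin frequencies.
    This mirrors the threshold scan in panels g and h in spirit only.
    """
    return {
        t: sum(passes_filter(freqs, t) for freqs in variant_bin_frequencies.values())
        for t in thresholds
    }
```

Scanning a range of thresholds with `count_retained` shows the trade-off the panels explore: stricter thresholds drop poorly sampled variants but leave fewer scored variants overall.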
a, Barcode counts from independent amplifications of the barcoded PTEN library plasmid preparation used for recombination. n = 67,162 data points. r denotes Pearson’s correlation coefficient. b, A filter based on a minimum count of 200 was imposed (black dotted line), resulting in 40,560 unique barcodes. c, The barcode-variant map was used to determine the frequencies of different types of sequences in the plasmid preparation of the barcoded PTEN library. d, Nucleotide biases at the degenerate codon for the single amino acid PTEN variants. e, Amino acid biases of the single amino acid variants of the PTEN library, with the frequencies expected from perfect NNK mutagenesis shown in red. f, Number of substitutions observed at each position of the PTEN protein amongst the 40,560 barcodes in the PTEN library plasmid preparation. g, Distribution of number of substitutions per position in the PTEN protein. h, Distribution of single amino acid variant frequencies in the PTEN library (black), along with an illustrative log-normal distribution that closely fits the PTEN data (red), shown as a density plot (top panel), or a cumulative distribution function plot (bottom panel). i, Sampling simulations of observed and hypothetical PTEN libraries, displaying the fraction of the 8,040 possible PTEN single amino acid and nonsense variants observed for increasing sampling sizes, with a step size of 1. Results of sampling from the PTEN variant frequency distribution observed in the library plasmid preparation are shown in black. Results of sampling hypothetical, uniformly distributed libraries containing either the subset of single amino acid variants observed in the PTEN library plasmid preparation (dark gray), or all possible PTEN single amino acid variants (light gray) are shown for comparison
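The sampling simulation in panel i can be approximated with a short sketch. It assumes variants are drawn with replacement in proportion to their library frequencies and that coverage is recorded after every draw (a step size of 1). The log-normal parameters, the seed handling, and the function names are hypothetical and are not taken from the original analysis.

```python
import random

def coverage_curve(weights, max_samples, n_possible, seed=0):
    """Fraction of possible variants seen after each successive draw.

    weights     relative frequency of each variant present in the library
    max_samples number of draws (sampling with replacement is assumed)
    n_possible  number of possible variants (8,040 for PTEN in the legend)
    """
    rng = random.Random(seed)
    draws = rng.choices(range(len(weights)), weights=weights, k=max_samples)
    seen = set()
    curve = []
    for variant_index in draws:
        seen.add(variant_index)
        curve.append(len(seen) / n_possible)
    return curve

def lognormal_weights(n_variants, mu=0.0, sigma=1.0, seed=0):
    """Illustrative skewed library: log-normally distributed frequencies."""
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n_variants)]

def uniform_weights(n_variants):
    """Hypothetical library in which every variant is equally frequent."""
    return [1.0] * n_variants
```

Comparing `coverage_curve(lognormal_weights(n), ...)` with `coverage_curve(uniform_weights(n), ...)` reproduces the kind of contrast the panel draws between an observed, skewed library and a hypothetical uniform one.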
Supplementary Figures 1–11, Supplementary Tables 1–3, 5–10 and Supplementary Note
Dataset of PTEN variant scores, classifications, and annotations
Dataset of TPMT variant scores, classifications, and annotations
Dataset of PTEN residue scores, classifications, and annotations
Dataset of TPMT residue scores, classifications, and annotations
R Markdown file recreating all of the analyses
Table of PTEN variant pathogenicity reclassifications that are possible with abundance data