A harmonized meta-knowledgebase of clinical interpretations of somatic genomic variants in cancer

Precision oncology relies on accurate discovery and interpretation of genomic variants, enabling individualized diagnosis, prognosis and therapy selection. We found that six prominent somatic cancer variant knowledgebases were highly disparate in content, structure and supporting primary literature, impeding consensus when evaluating variants and their relevance in a clinical setting. We developed a framework for harmonizing variant interpretations to produce a meta-knowledgebase of 12,856 aggregate interpretations. We demonstrated large gains in overlap between resources across variants, diseases and drugs as a result of this harmonization. We subsequently demonstrated improved matching between a patient cohort and harmonized interpretations of potential clinical significance, observing an increase from an average of 33% per individual knowledgebase to 57% in aggregate. Our analyses illuminate the need for open, interoperable sharing of variant interpretation data. We also provide a freely available web interface (search.cancervariants.org) for exploring the harmonized interpretations from these six knowledgebases.

Precision oncology, in which treatment is informed by the mutational profile of a cancer, requires concise, standardized and searchable clinical interpretations of detected variants. Interpretations of biomarker-disease associations can be diagnostic, prognostic, therapeutic (predictive of favorable or adverse response to therapy) and/or predisposing (germline variants that increase risk of developing cancer). Many groups have curated the biomedical literature to collect and formalize these interpretations into knowledgebases [1][2][3][4][5][6][7][8][9][10][11][12]. These isolated efforts have resulted in disparate knowledge representation, and exchange of these biomarker-disease associations remains a difficult challenge 13. Consequently, stakeholders interested in the effects of somatic cancer variants are faced with the following trade-off: (1) reconciling multiple representations and interpretations across knowledgebases; or (2) potentially omitting clinically significant interpretations that are not universally captured. Manual aggregation of information across knowledgebases to interpret the variant profile for each patient is an unsustainable approach at scale. Moreover, the lack of an integrated resource has made it difficult to assess the current state of precision treatment options. Published reports [14][15][16][17] have relied on individual, often highly discordant knowledgebases. Interoperability and automated aggregation are required to make a comprehensive approach to cancer precision medicine tractable and to establish consensus across knowledgebases.
The current diversity and number of 'knowledge silos' and the associated difficulties of coordinating these disparate knowledgebases have led to an international effort to maximize genomic data sharing 18,19. The Global Alliance for Genomics and Health (GA4GH) has emerged as an international cooperative project to accelerate the development of approaches for responsible, voluntary and secure sharing of genomic and clinical data 20,21. The Variant Interpretation for Cancer Consortium (VICC; cancervariants.org) is a Driver Project of GA4GH, established to co-develop standards for genomic data sharing (https://www.ga4gh.org/how-we-work/driver-projects/). Specifically, the VICC is a consortium of clinical variant interpretation experts addressing the challenges of representing and sharing curated interpretations across the cancer research community.
Somatic variants in cancer-relevant genes are evaluated from multiple partially overlapping perspectives (Supplementary Note). The Association for Molecular Pathology, the American Society of Clinical Oncology and the College of American Pathologists (AMP/ASCO/CAP) have published structured somatic variant clinical interpretation guidelines that specifically address diagnostic, prognostic and therapeutic implications 22. These guidelines do not provide systematic and comprehensive procedures to classify somatic variant oncogenicity, as has been published in the American College of Medical Genetics and Genomics (ACMG)/AMP guidelines 23 for pathogenicity interpretation of germline variants.
Another common difference between somatic and germline classification is the frequent use of variant representations that are defined by multiple alternative genomic alterations, including protein variants such as NP_004295.2:p.F1174L (ALK F1174L; caused by either NC_000002.11:g.29443695G>T or NC_000002.11:g.29443695G>C), and categorical variants 24 , such as 'loss-of-function mutations' or 'activating mutations' (the use of the word 'mutations' in these variant names is a somatic-specific nomenclature that is common across these knowledgebases). This represents an important distinction from the interpretation of germline variants, which are typically described by singular and specific DNA variants, and only rarely in broader terms. A primary challenge of this work was to handle the complexity of these somatic variant representations.
We [...] (Table 1) 1,5,[9][10][11]. From a larger survey of published and available knowledgebases of clinical interpretations of genomic variants (Supplementary Table 1), these knowledgebases were selected for their similarity in somatic disease focus. The institutions leading each constituent knowledgebase agreed upon a core set of principles describing minimal data licensing and structure requirements (http://cancervariants.org/principles/ and Supplementary Note).
Our cooperative effort developed a framework for structuring and harmonizing clinical interpretations across these knowledgebases. Specifically, we defined key elements of variant interpretations (genes, variants, diseases, drugs and evidence), developed strategies for harmonization and implemented this framework to consolidate interpretations into a single, harmonized meta-knowledgebase (freely available at search.cancervariants.org).

Results
Aggregating and structuring interpretation knowledge. A review of the constituent somatic knowledgebases of the VICC (Fig. 1 and Supplementary Table 1) 1,5,[9][10][11] showed dramatic differences in the components of variant interpretations, which were often a mixture of concepts with standardized (such as HUGO Gene Nomenclature Committee (HGNC) gene symbols 25, Human Genome Variation Society (HGVS) variant nomenclature 26), externally referenced (identified elements of an established ontology or database) or knowledgebase-specific (shorthand, internal identifier) representations (Fig. 1). Representations of an element could vary within a knowledgebase, such as with the use of shorthand for diseases, including both standardized representations (for example, 'CLL' and 'ALL' are both listed synonyms in the NCI Thesaurus 27) and internal representations (for example, 'G' (glioma), 'L' (lung cancer) or 'OV' (ovarian cancer)).
We harmonized variant interpretations from each of these knowledgebases by mapping all data elements in each knowledgebase to established standards and ontologies describing genes, variants, diseases and drugs (Fig. 1 and Supplementary Note). Briefly, genes were harmonized using the HGNC gene symbols. Variants were harmonized through a combination of knowledgebase-specific rules, matching to the Catalog of Somatic Mutations in Cancer (COSMIC) 3, and use of the ClinGen Allele Registry (reg.clinicalgenome.org) 28. Diseases were harmonized using the European Bioinformatics Institute (EBI) Ontology Lookup Service (OLS; www.ebi.ac.uk/ols/index) to retrieve Disease Ontology (DO) terms and identifiers. Drugs were harmonized through queries to the Mychem.info API (mychem.info), PubChem 29 and ChEMBL 30. Details for each of these harmonization strategies are described in Methods and Extended Data Fig. 1.
Due to the knowledgebase-specific nature of describing an interpretation evidence level (Fig. 1), harmonization required manual mapping of evidence levels to a common standard. The AMP/ASCO/CAP somatic classification guidelines were released after (and partially informed by) the design of the VICC knowledgebases. These guidelines are compatible with (but not identical to) the existing evidence levels of these knowledgebases. We constructed a mapping of evidence levels provided by each knowledgebase to the evidence levels constituting AMP/ASCO/CAP tier I and II variants (Table 1).
The landscape of variant interpretation knowledge. The meta-knowledgebase v.0.10 release contained 12,856 harmonized interpretations (hereafter referred to as the core dataset; Methods) supported by 4,354 unique publications for an average of 2.95 interpretations per publication. Notably, 83% of all publications were referenced by only one knowledgebase, and only one publication 31 was referenced across all six knowledgebases (Extended Data Fig. 2a). Gene symbols were almost universally provided; the few interpretations lacking gene symbols (<0.01%) were structural variants that were not associated with an individual gene. In contrast to publications, the genes curated by the cancer variant interpretation community are much more frequently observed in multiple knowledgebases. We observed that 23% of genes (97/415) with at least one interpretation were present in at least half of the knowledgebases, compared to only 5% of publications (203/4,354; odds ratio, OR = 1.6 × 10^-1, P = 4.7 × 10^-34; Fisher's exact test, two-sided; Extended Data Fig. 2b).
Variants had little overlap across the core dataset (Fig. 2a). Of the constituent 3,439 unique variants, 76.6% were described by only one knowledgebase, and <10% were observed in at least three (Fig. 2b). This lack of overlap was partially due to the complexity of variant representation. For example, the representation of an ERBB2 variant as described in nomenclature defined by the HGVS 26 is NP_004439.2:p.Y772_A775dup, and yet it is referenced in multiple different forms in the biomedical literature. p.E770delinsEAYVM 32, p.M774insAYVM 33 and p.A775_G776insYVMA 34 all describe an identical protein kinase domain alteration, although they appear to identify different variants (Fig. 2c). Despite having a standard representation by the HGVS guidelines, these alternative forms continue to appear in the literature. Consequently, a researcher looking to identify a specific match to NP_004439.2:p.E770delinsEAYVM may find no direct matches, although several exist under various alternate representations. This component of variant harmonization was addressed through the use of the ClinGen Allele Registry (Methods). Some differences in the scale and structure of these knowledgebases may be attributed to curation strategies (Supplementary Note).
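The equivalence of these ERBB2 representations can be checked mechanically. The following is a minimal sketch, not the ClinGen Allele Registry procedure: it applies each named edit to a seven-residue fragment and compares the resulting protein products. The fragment "EAYVMAG" (positions 770-776) is an assumption inferred from the residues named in the variant strings above, and the helper functions are illustrative only.

```python
# Residues 770-776 of ERBB2 as implied by the variant names in the text
# (E770, A771, Y772, V773, M774, A775, G776). This fragment is an assumption.
FRAGMENT = "EAYVMAG"
START = 770  # 1-based protein position of the first residue in FRAGMENT


def _idx(pos: int) -> int:
    """Convert a 1-based protein position to a 0-based index into FRAGMENT."""
    return pos - START


def dup(first: int, last: int) -> str:
    """Duplicate residues first..last immediately after the original copy."""
    i, j = _idx(first), _idx(last) + 1
    return FRAGMENT[:j] + FRAGMENT[i:j] + FRAGMENT[j:]


def delins(first: int, last: int, inserted: str) -> str:
    """Replace residues first..last with the inserted residues."""
    i, j = _idx(first), _idx(last) + 1
    return FRAGMENT[:i] + inserted + FRAGMENT[j:]


def ins(after: int, inserted: str) -> str:
    """Insert residues directly after the given position."""
    j = _idx(after) + 1
    return FRAGMENT[:j] + inserted + FRAGMENT[j:]


# Each literature representation, expressed as an edit on the fragment.
products = {
    "p.Y772_A775dup": dup(772, 775),
    "p.E770delinsEAYVM": delins(770, 770, "EAYVM"),
    "p.M774insAYVM": ins(774, "AYVM"),
    "p.A775_G776insYVMA": ins(775, "YVMA"),
}

for name, product in products.items():
    print(f"{name:22s} -> {product}")
```

All four names yield the same product on this fragment, which is why a string match on the variant name alone can miss relevant interpretations.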
To illustrate the challenges of searching across multiple variant representations, we surveyed all interpretations describing the previously discussed ERBB2 variant (NP_004439.2:p.Y772_A775dup) using the web interfaces provided by each knowledgebase (Table 2 and Supplementary Table 2). Each knowledgebase represented this variant differently. Two did not have specific interpretations for this variant, although they did have relevant categorical variants (for example, 'exon 20 insertions'; Table 2). Most of the knowledgebases had a single internal representation of the variant, although the majority of these representations did not match across knowledgebases. The evidence describing these interpretations varied considerably in form, as each used knowledgebase-specific nomenclature (for example, evidence described as 'level 3A' in OncoKB is equivalent to 'level 1B' from MolecularMatch, or 'level B' from CIViC; Tables 1 and 2). Of the 19 unique publications describing the collected evidence, only three were observed in more than one knowledgebase, and none were observed in more than two. Interestingly, the curated interpretations from these shared publications varied by knowledgebase in disease scope ('advanced solid tumor' compared to 'non-small cell lung cancer' (NSCLC) 35; 'breast cancer and NSCLC' compared to 'cancer' 36). A review of the interpretations showed some that are present in most of the knowledgebases (for example, 'use of afatinib, trastuzumab or neratinib in NSCLC'; Table 2), and others that are present in only one or two (for example, 'use of lapatinib in lung adenocarcinoma' and 'use of afatinib and rapamycin in combination in NSCLC'; Table 2). Importantly, this includes sparse interpretations that describe conflicting evidence (for example, 'no benefit from neratinib in NSCLC'; Table 2) or negative evidence (for example, 'does not support sensitivity/response to dacomitinib in NSCLC'; Table 2). Collectively, these data illustrate the diversity in knowledgebase structure, content, terminology and curation methodology. Consequently, utilizing a subset of these knowledgebases would likely result in differing interpretations before the harmonization performed in this study.

Harmonization improves consensus across interpretations.
To test the effect of our harmonization methods on generating consensus, we evaluated the overlap of unique interpretation elements from each knowledgebase of the core dataset in comparison to unharmonized (but aggregated) data (Methods). As noted above, genes from each resource used HGNC gene symbols, resulting in very little gain from harmonization; 45% of genes across knowledgebases overlapped without harmonization, compared to 46% with harmonization. This is in contrast to variants (8% overlapping unharmonized, 26% overlapping harmonized), diseases (27% unharmonized, 34% harmonized) and drugs (20% unharmonized, 36% harmonized) (Supplementary Table 3). None of the evidence levels were consistent across resources when unharmonized, and all are consistent with a common standard (Table 1) after harmonization, which is a primary contribution of this work. Notably, in some cases, harmonization dramatically increased the number of elements to be considered. For example, CGI had an increase in variant count from 283 (unharmonized) to 1,600 (harmonized) due to the expansion of ambiguous categorical variants (for example, 'oncogenic mutation') to the set of variants considered oncogenic by CGI (through extraction and mapping of the CGI Catalog of Validated Oncogenic Mutations). As mentioned above, the PMKB does not have a formalized 'drug' field for interpretations, so there is no reasonably accessible data for aggregating or harmonizing drugs for that resource. Drugs and variants both had a relatively greater benefit from normalization compared to the other interpretation elements, which was likely driven by the diverse and numerous synonymous representations of these concepts in use. While the complexities of variant representation have been discussed above, the complexity of drug labeling in these resources is driven by the multiple synonyms given to drugs in their numerous formulations and brands, which change relatively frequently over time.
Fig. 2 | a, [...] Objects are attributed to the largest containing set; thus, a variant described by all six knowledgebases is attributed to the dark blue set with eight variants. b, Pie chart visualizing overall uniqueness of variants, with categories indicating the number of knowledgebases describing each variant. Nearly 77% of variants are unique across the knowledgebases, with only 0.2% ubiquitously represented. The eight variants present in all six knowledgebases are listed on the right. c, A comparison of element uniqueness across knowledgebases. Despite having the greatest degree of overlap across all elements, approximately 61% of genes are unique across the knowledgebases. Literature cited to support interpretations has the smallest degree of overlap across all elements, with 83% of publications remaining unique across the knowledgebases. *Drugs are not evaluated for PMKB, which does not formally represent this concept. d, Multiple syntactically valid representations of an identical protein product can lead to confusion in describing the change in the literature and in variant databases. The wild-type protein sequence (dark blue with orange lettering) is represented for ERBB2 (top). Two (of many) possible representations of an in-frame insertion (orange with dark blue lettering) are shown (bottom). A nonstandard HGVS expression describes a five-amino-acid insertion replacing one glutamate residue (middle). At the bottom, the HGVS standard representation shows an identical protein product from a four-amino-acid duplication. A search for one representation against a database with another (nonoverlapping) representation may lead to omission of a clinically relevant finding.
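The overlap comparison in this section can be illustrated with a small sketch. It assumes that 'overlap' means the fraction of unique elements reported by more than one knowledgebase; the exact definition used for Supplementary Table 3 is given in Methods and may differ. The knowledgebase names, the variant lists and the synonym table are all hypothetical.

```python
from collections import Counter


def overlap_fraction(elements_by_kb):
    """Fraction of unique elements that appear in more than one knowledgebase."""
    counts = Counter()
    for elements in elements_by_kb.values():
        counts.update(set(elements))  # count each element once per knowledgebase
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n > 1)
    return shared / len(counts)


# Hypothetical synonym table standing in for a real harmonization service.
SYNONYMS = {
    "ERBB2 p.E770delinsEAYVM": "ERBB2 p.Y772_A775dup",
    "ERBB2 p.A775_G776insYVMA": "ERBB2 p.Y772_A775dup",
}


def harmonize(name):
    """Map a variant name to its canonical form, if one is known."""
    return SYNONYMS.get(name, name)


# Toy, unharmonized variant lists from three hypothetical knowledgebases.
unharmonized = {
    "kb_one": ["ERBB2 p.Y772_A775dup", "ALK p.F1174L"],
    "kb_two": ["ERBB2 p.E770delinsEAYVM", "BRAF p.V600E"],
    "kb_three": ["ERBB2 p.A775_G776insYVMA", "ALK p.F1174L"],
}
harmonized = {
    kb: [harmonize(v) for v in variants] for kb, variants in unharmonized.items()
}

print(f"unharmonized overlap: {overlap_fraction(unharmonized):.2f}")
print(f"harmonized overlap:   {overlap_fraction(harmonized):.2f}")
```

In this toy data, collapsing the three ERBB2 names into one canonical form raises the shared fraction, mirroring the direction (not the magnitude) of the gain reported for variants.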

Harmonization increases findings of clinical significance.
Evaluation of patient variants for strong clinical significance requires an assessment of these variants in the appropriate disease context. When grouped to the nearest top-level disease term (Supplementary Table 4 and Supplementary Note), five major cancer group terms each accounted for over 5% of all interpretations in the core dataset: lung cancer (24%), breast cancer (13%), hematologic cancer (11%), large intestine cancer (9%) and melanoma (6%) (Fig. 3a and Supplementary Table 5). Notably, the most common interpretations mirror top-level terms that have both high incidence (Fig. 3b) and high mortality (Fig. 3c) as reported by the American Cancer Society (Supplementary Table 6) 37 : lung cancer, breast cancer and hematologic cancer. The 'large intestine cancer' term contains numerous interpretations describing colorectal cancers, which are closely related to colon cancer (a top-five cancer in both incidence and mortality; Supplementary Table 7). Evaluation of these terms across the core dataset showed significant differences in the distribution of common cancer types constituting each knowledgebase, illustrating the value of aggregating knowledgebases for a more comprehensive landscape of interpretations (Extended Data Fig. 3 and Supplementary Table 8).
To further test the value of harmonized interpretation knowledge, we evaluated the 38,207 patients of the AACR Project Genomics Evidence Neoplasia Information Exchange (GENIE) 38. We first queried the 237,175 moderate- or high-impact variants from GENIE using a broad search strategy (Methods and Extended Data Fig. 4). Notably, 11% (4,355) of patients lacked any variants to search before filtering on predicted impact, and 12% (4,543) after. This search yielded 2,316,305 interpretation search results for an average of 9. [...] 5). This is congruent with our observation that the interpretations of the core dataset for the most common diseases are highly focused on these and other specific genes (Fig. 3d), including tier I interpretations (Fig. 3e). Examining our results at the patient level showed that a focused, variant-level search resulted in at least one interpretation (in any cancer type with any level of evidence) for 57% of patients in the GENIE cohort, compared to the average 33% obtained when using each constituent knowledgebase individually (Fig. 3f). We observed that broadening the search scope to include any regional match (Extended Data Fig. 4) increased the cohort coverage to 86% of patients (compared to an average of 68% per individual knowledgebase). However, part of the increase in matching percentage from regional rather than exact matching is likely due to non-oncogenic passenger variants.
A key component in determining the clinical relevance of an interpretation is whether the tumor type reported in the interpretation matches the patient's tumor type (see 'Defining characteristics' in Table 1). Restricting patient search results to those interpretations that are of matching grouped disease terms (Extended Data Fig. 4 and Supplementary Note) resulted in 29% of patients with at least one clinical interpretation (compared to an average individual knowledgebase match rate of 13%), and 18% of patients with at least one tier I clinical interpretation (compared to an average 6% per individual knowledgebase) (Fig. 3f). Patients with rare diseases were disadvantaged in this analysis, as automated mapping of their disease terms to DO was less likely to succeed (Supplementary Note). Allowing matching to any ancestor or descendant term and allowing partial variant overlaps improved matches to 60% (compared to an average of 35% per individual knowledgebase). This broader strategy, however, requires contextual re-evaluation of assigned AMP/ASCO/CAP evidence levels, which are designated for a precise match to variant and disease context. Consequently, evidence level or tier filtering can only be used with an exact search strategy. We evaluated an alternative, highly permissive search strategy that matches sample variants to any interpretation in the gene (Extended Data Fig. 6). The resulting match profile across the knowledgebases is comparable to findings from the overlapping variant strategy, indicating that many of the commonly mutated genes have gene-level interpretations (which would be a match by either strategy).
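The difference between exact and ancestor-or-descendant disease matching can be sketched as follows. This is an illustration under stated assumptions, not the matching code used for the GENIE analysis: the child-to-parent map is a simplified toy hierarchy rather than the actual Disease Ontology structure, and the function and mode names are hypothetical.

```python
# Toy child -> parent map. This is a simplified stand-in for a disease
# ontology and does not reproduce the real Disease Ontology hierarchy.
PARENT = {
    "lung adenocarcinoma": "non-small cell lung carcinoma",
    "non-small cell lung carcinoma": "lung cancer",
    "lung cancer": "cancer",
    "breast cancer": "cancer",
}


def ancestors(term):
    """Return all ancestors of a term, nearest first."""
    found = []
    while term in PARENT:
        term = PARENT[term]
        found.append(term)
    return found


def disease_match(patient_term, interpretation_term, mode="exact"):
    """Match a patient disease to an interpretation disease.

    mode="exact": the two terms must be identical.
    mode="related": also accept an ancestor or descendant relationship.
    """
    if patient_term == interpretation_term:
        return True
    if mode == "exact":
        return False
    return (
        interpretation_term in ancestors(patient_term)
        or patient_term in ancestors(interpretation_term)
    )


print(disease_match("lung adenocarcinoma", "lung cancer"))                  # exact
print(disease_match("lung adenocarcinoma", "lung cancer", mode="related"))  # broader
```

Under the broader mode, an interpretation curated against a very general term (such as 'cancer') matches every patient, which is one reason the text notes that evidence levels need contextual re-evaluation when matches are not exact.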
A comparison of interpretations across the previously described common cancers (with proportion >5% in Supplementary Table 5) showed that the use of grouped terms instead of exact terms for matching interpretations to patients' cancers varies dramatically by cancer type, with some cancers (for example, lung cancer and melanoma) showing little increased interpretation breadth, while others have enormous effect (for example, breast cancer and large intestine cancer; Fig. 3g). This is primarily due to the specific nature by which patients are classified with certain diseases, versus the aggregate nature by which interpretations are ascribed to diseases. Interestingly, 56% of GENIE patient samples (6,196/11,149) have disease-matched interpretations across the frequently observed cancers, compared to only 40% (5,430/13,724) of patient samples across all other cancers (OR = 1.9, P = 3.9 × 10^-140; Fisher's exact test, two-sided). These numbers are reduced to 44% (4,881/11,149) and 18% (2,438/13,724), respectively, when considering only tier I interpretations (OR = 3.6, P < 2.2 × 10^-308; Fisher's exact test, two-sided).
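The odds ratios quoted above follow directly from the reported counts. The sketch below recomputes the sample (cross-product) odds ratio from each 2 × 2 table using only the standard library; it does not compute the Fisher's exact test P values, and the function name is illustrative.

```python
def odds_ratio(matched_a, total_a, matched_b, total_b):
    """Sample odds ratio for group A versus group B from a 2 x 2 table."""
    unmatched_a = total_a - matched_a
    unmatched_b = total_b - matched_b
    return (matched_a * unmatched_b) / (unmatched_a * matched_b)


# Counts as reported in the text: common cancers versus all other cancers.
any_level = odds_ratio(6196, 11149, 5430, 13724)  # any disease-matched interpretation
tier_one = odds_ratio(4881, 11149, 2438, 13724)   # tier I interpretations only

print(f"any evidence level: OR = {any_level:.1f}")
print(f"tier I only:        OR = {tier_one:.1f}")
```

Rounded to one decimal place, these reproduce the reported values of 1.9 and 3.6.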

A resource for searching variant interpretation knowledge.
We have developed and hosted a public web interface for exploring the VICC meta-knowledgebase, freely available at search.cancervariants.org. This interface retrieves the most recent data release from an ElasticSearch index. Searching the knowledgebase is performed by specifying filters for any term or entering free text or compound (for example, and/or logic) queries in the search box at the top of the page (Fig. 4a). Panels with data distribution visualizations describe the current result set (Fig. 4b). These interactive panels provide additional information about specific subsets and may be used to create additional filters (for example, clicking on a level in the 'evidence_level' panel filters results throughout the page to display only those interpretations with the designated evidence level). This allows investigators to see the distribution of interpretations by evidence level, disease, gene and drug, and to filter according to their interests. Tabulated results are provided at the bottom of the page (Fig. 4c), and are expandable with all details, including the (unharmonized) record provided by the original knowledgebase for each interpretation. These search tools are available via both the web interface and an application programming interface (API) search endpoint (Methods), in addition to a GA4GH beacon on beacon-network.org. Additionally, a Python interface and analysis workbook have been developed to enable reproduction (and additional exploration) of the data presented in this paper, as well as full downloads of the underlying data (Methods). Usage documentation and example queries for this resource may be found at docs.cancervariants.org.

Discussion
In this study, we aggregated, harmonized and analyzed clinical interpretations of cancer variants from six major knowledgebases 1,5,9-11. Our analysis uncovered highly disparate content in curated knowledge, structure and primary literature across these knowledgebases. Specifically, we evaluated the unique nature of the vast majority of genomic variants reported across these knowledgebases and demonstrated the challenge of developing a consensus interpretation given these disparities. These challenges are exacerbated by nonstandard representations of clinical interpretations, in both the primary literature and curated knowledge of these resources. It is encouraging that the curators of these knowledgebases have, without coordination, independently curated diverse literature and knowledge sources. However, this reflects an enormous curation burden generated from the increasingly employed molecular characterizations of patient tumors and the related expansion of the primary literature describing them. Even at the gene level, for which there is the highest degree of overlap across any element of an interpretation, 61% of genes with interpretations are observed in only one knowledgebase. Our findings thus highlight the need for a cooperative, global effort to curate comprehensive and thorough clinical interpretations of somatic variants for robust practice of precision medicine. We observed that harmonization improved concordance between interpretation elements across resources (Supplementary Note), and as a result we were able to achieve at least one specific (position-matched) harmonized variant interpretation for 57% of the patients in the GENIE cohort. In the most stringent searches, we required a precise variant match to a tier I interpretation also matching the patient's cancer; in these cases, 18% of the cohort had a finding of strong clinical significance. Notably, these findings were substantially higher in patients with more common cancers, with 39% of the common cancer samples variant matching at least one tier I interpretation, compared to 15% of other cancer samples. These findings are concordant with observations of matched therapy rates in precision oncology trials, including 15% from IMPACT/COMPACT 15, 11% from MSK-IMPACT 14, 5% from the MD Anderson Precision Medicine Study 16 and 23% from the NCI-MATCH trials 17. Collectively, our results portray a confluence of knowledge describing the most common genomic events relevant to the most frequent cancers, with highly disparate knowledge describing less frequent events in rare cancer types. The differing content of these knowledgebases may be a result of research programs targeted at frequent cancers, highlighting a need for a broader focus on less common cancers. This sparse landscape of curated interpretation knowledge is exacerbated by paucity in cross-references between ontologies describing disease, highlighting the importance of bridging this gap 39. Similarly, complexities in variant representation have elucidated a need for sophisticated methods to harmonize genomic variants; harmonization with the ClinGen Allele Registry 28 is suited to point mutations and indels, but the representation and harmonization of complex and nongenomic (for example, expression or epigenetic) variants remains a challenge.
Fig. 3 | [...] (disease axis labels: Melanoma, Lrg. int. cancer, Hem. cancer, Breast cancer, Lung cancer) [...] Fig. 4) that allows for regional variant matches (for example, gene level) and broader interpretation of disease terms (for example, DOID:162, cancer) nearly doubles the number of patients with matching interpretations. These broader match strategies are incompatible with the AMP/ASCO/CAP evidence guidelines. g, Most significant finding (by evidence level) across patient samples, by disease. Each column represents one of the common diseases indicated in a, and the rows represent the evidence levels described in Table 1.
Our harmonized clinical interpretation meta-knowledgebase represents a significant step forward in building consensus that was previously unattainable due to a lack of harmonization services, such as the Allele Registry, and expert standards and guidelines, such as those recommended by AMP/ASCO/CAP. This meta-knowledgebase serves as an open resource for evaluating interpretations from institutions with distinct curation structure, procedures and objectives. Potential uses include expert-guided therapy matching, supporting FDA regulatory processes associated with laboratory-developed genomic tests for guiding therapy and identification of diseases and biomarkers that warrant future study. The meta-knowledgebase web application is available at search.cancervariants.org, with usage documentation and examples at docs.cancervariants.org. The content of this meta-knowledgebase is dynamic, as we routinely poll the constituent knowledgebases for their associations between variants and clinical interpretations, which primarily comprise predictions of somatic variant effect on disease response to a therapy. Unlike the recently FDA-recognized ClinGen Expert Curated Human Variant Data 40,41, this resource is not meant to be used to directly annotate clinical reports, but rather to serve as a search tool for existing knowledge pertaining to observed genomic variation.
While our initial efforts provide a structure by which variant interpretation knowledgebases can contribute to a broader and more consistent set of interpretations, much work remains to be done. In particular, VICC members contribute to GA4GH Work Streams to develop and integrate new and existing [42][43][44][45] standards for the representation of variant interpretations and the evidence that describes them. Our web interface is being redesigned as a full-scale web service and user interface to concisely represent the most relevant interpretations for one or more variants. Specifically, we plan to add visual elements depicting the distribution of diseases corresponding to a searched variant, search modes specific to user intent (for example, disease-focused search, gene-focused search or multivariant search) and restyled result summaries. These and other planned changes are tracked on our central repository at git.io/metakb (see Supplementary Note for other planned improvements).
In conclusion, there is a great need for a collaborative effort across institutions to build structured, harmonized representations of clinical interpretations of cancer genomic variants to advance the implementation of precision medicine. Our work has illustrated the diversity of variant interpretations available across resources, leading to inconsistency in interpretation of cancer variants. We have assembled a framework and recommendations for structuring and harmonizing such interpretations, from which the cancer genomics community can improve consensus interpretation for cancer patients. We have also developed and released open-source (MIT-licensed) and freely available aggregated knowledge resources (web application, data downloads and API) and associated analysis tools. Our working group and open-source software development environment are open to all and we welcome participation from anyone with an interest in learning about, utilizing, augmenting, improving or proposing new directions for this community-based project, for the benefit of cancer patients.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41588-020-0603-8.

Methods
Collecting cancer variant interpretation knowledge. OncoKB, the CGI and JAX-CKB all contain complementary knowledge of variant oncogenicity. While valuable, knowledge of a variant's potential role in driving tumorigenesis is structured differently from clinical interpretations of genomic variants, and is therefore outside the scope of this manuscript. Although these annotations are omitted from the analyses presented in this paper, we do aggregate them because of their potential utility in clinical research. ClinGen, ACMG, AMP, ASCO, VICC and CAP are developing guidelines to enable consistent and comprehensive assessment of the oncogenicity of somatic variants. In the future, variant oncogenicity interpretations based on such guidelines can be incorporated into meta-knowledgebases and should help to improve the harmonization of related interpretations.
Harmonizing genes. Gene symbols were matched to the table of gene symbols from HGNC, hosted at the EBI 47 : ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/json/non_alt_loci_set.json. This table was used to construct an 'aliases' table composed of retired and alternate symbols for secondary lookup if the interpretation gene symbol was not found among the primary gene symbols from HGNC. If the gene symbol was omitted by the knowledgebase, matched an alias shared between two genes, or failed to match either the primary or alias table, the gene was omitted from the normalized gene field.
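The two-stage gene lookup can be illustrated with a minimal Python sketch. It assumes each HGNC record exposes `symbol`, `alias_symbol` and `prev_symbol` fields; the function names and the toy records (including the placeholder symbols `GENE_A`, `GENE_B`, `SHARED` and `OLD1`) are hypothetical and are not taken from the project code.

```python
from collections import defaultdict

def build_tables(records):
    """Return (primary symbols, alias -> set of primary symbols)."""
    primary = {rec["symbol"] for rec in records}
    aliases = defaultdict(set)
    for rec in records:
        # Alternate and retired symbols both feed the secondary lookup table.
        for alt in rec.get("alias_symbol", []) + rec.get("prev_symbol", []):
            aliases[alt].add(rec["symbol"])
    return primary, aliases

def normalize_gene(symbol, primary, aliases):
    """Return the primary symbol, or None if missing, unknown or ambiguous."""
    if not symbol:
        return None  # gene omitted by the knowledgebase
    if symbol in primary:
        return symbol
    candidates = aliases.get(symbol, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # alias shared between genes, or no match at all

# Toy records for illustration only (assumed field names, invented aliases).
records = [
    {"symbol": "ERBB2", "alias_symbol": ["HER2"], "prev_symbol": ["OLD1"]},
    {"symbol": "GENE_A", "alias_symbol": ["SHARED"]},
    {"symbol": "GENE_B", "alias_symbol": ["SHARED"]},
]
primary, aliases = build_tables(records)
```

In this sketch an alias that resolves to more than one primary symbol is dropped rather than guessed, mirroring the omission rule described above.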
Harmonizing variants. Variants collected from each knowledgebase were first evaluated for attributes specifying a precise genomic location, such as chromosome, start and end coordinates, variant allele and an identifiable reference sequence. Variant names were queried against the Catalog of Somatic Mutations in Cancer (COSMIC) 3 v.81 to infer these attributes in knowledgebases that did not provide them. Custom rules were written to transform some types of variants without clear coordinates (for example, amplifications) into gene coordinates. All variants were then assembled into HGVS strings and submitted to the ClinGen Allele Registry (http://reg.clinicalgenome.org) to obtain distinct, cross-assembly allele identifiers, if available.
Harmonizing diseases. Diseases were matched to the DO 48 , through lookup with the EBI OLS 47 , unless a preexisting ontology term for a different ontology existed (98.7% of interpretations map to DO). We downloaded the March 2018 release of the TopNode terms from https://github.com/DiseaseOntology/HumanDiseaseOntology/blob/master/src/ontology/subsets/TopNodes_DOcancerslim.json and mapped our interpretation diseases to this list, assigning each disease to its nearest TopNode ancestor (Supplementary Table 4 and Supplementary Note). We assigned remaining interpretation diseases to the nonspecific term of DOID:162 (cancer) if the disease was a descendant of this term, but not a descendant of one of the TopNode terms.
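One way to picture the TopNode assignment is a breadth-first walk up the ontology, sketched below in Python. The child-to-parents dictionary, the function name and the `TOY:` identifiers are assumptions made for illustration, and the hierarchy is deliberately simplified (intermediate DO terms are left out).

```python
from collections import deque

CANCER = "DOID:162"

def nearest_topnode(doid, parents, topnodes):
    """Return the closest TopNode at or above `doid`, else DOID:162 for
    other cancer terms, else None."""
    queue = deque([doid])
    seen = {doid}
    under_cancer = False
    while queue:
        term = queue.popleft()
        if term in topnodes:
            return term  # breadth-first order yields the nearest TopNode
        if term == CANCER:
            under_cancer = True  # assumption: DOID:162 itself also counts
        for parent in parents.get(term, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return CANCER if under_cancer else None

# Simplified toy hierarchy (child -> parents); TOY:* terms are invented.
parents = {
    "DOID:3008": ["DOID:3007"],
    "DOID:3007": ["DOID:1612"],
    "DOID:1612": ["DOID:162"],
    "TOY:1": ["DOID:162"],   # a cancer term with no TopNode ancestor
    "TOY:2": ["TOY:3"],      # a term outside the cancer branch
}
topnodes = {"DOID:1612"}
```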
Harmonizing evidence level. Evidence levels were standardized to the AMP/ASCO/CAP guidelines as outlined in Table 1.
Comprehensive evaluation of ERBB2 duplication. Public web portals for the six VICC knowledgebases were manually searched for interpretations for variants describing the alteration detailed in Fig. 2c. The MolecularMatch resource changed its data access policy after peer review of this manuscript, and is no longer accessible to the public. The web portals for the remaining resources are freely available online without registration at the following URLs:
Evaluating nonharmonized aggregate content. To evaluate the gains provided by our harmonization methods, we collected and minimally formatted interpretation elements from each knowledgebase without using any harmonization routines. We selected the set of unique elements for each resource and calculated the overlap across the union of those sets (Supplementary Table 3). We then repeated this procedure for harmonized elements and compared total element count and percentage overlap between harmonized and nonharmonized elements. Calculations for the specific fields of that table are provided in the Supplementary Note.
Project GENIE. GENIE data were downloaded from the v.3.0.0 data release available at: https://www.synapse.org/#!Synapse:syn7222066/files/. Variants were extracted from 'data_mutations_extended.txt', and clinical data from 'data_clinical_sample.txt'. Variants were filtered on predicted consequence of medium or high impact. This classification was based upon the VEP consequence table (http://useast.ensembl.org/info/genome/variation/prediction/predicted_data.html#consequences) and resulted in exclusion of variants classified as Silent, 3′Flank, 3′UTR, 5′Flank, 5′UTR, Intron or Splice_Region. Patients without any variants after filtering were included in all calculations. Oncotree cross-references were obtained from their API at http://oncotree.mskcc.org/api/tumorTypes (data version, oncotree_2018_05_01) and cross-references were then mapped to DO terms where they matched. In cases where one-to-many mappings occurred, those mappings were manually reviewed to select the most appropriate mapping.
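The consequence filter applied to the GENIE variants might look like the following Python sketch. It assumes a tab-delimited MAF with `Hugo_Symbol`, `Tumor_Sample_Barcode` and `Variant_Classification` columns, and that classification labels use an ASCII apostrophe; the function name, sample identifiers and gene placeholders are hypothetical.

```python
import csv
import io

# Classifications excluded as low predicted impact (ASCII apostrophes assumed).
EXCLUDED = {"Silent", "3'Flank", "3'UTR", "5'Flank", "5'UTR",
            "Intron", "Splice_Region"}

def filter_variants(handle, samples):
    """Map every sample to the genes of its retained variants."""
    # Seed with all samples so that those left without variants stay counted.
    kept = {sample: [] for sample in samples}
    lines = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["Variant_Classification"] in EXCLUDED:
            continue
        kept.setdefault(row["Tumor_Sample_Barcode"], []).append(row["Hugo_Symbol"])
    return kept

# Toy MAF content for illustration only.
MAF_TEXT = (
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification\n"
    "GENE_A\tSAMPLE-1\tMissense_Mutation\n"
    "GENE_B\tSAMPLE-1\tSilent\n"
    "GENE_C\tSAMPLE-2\t3'UTR\n"
)
kept = filter_variants(io.StringIO(MAF_TEXT), ["SAMPLE-1", "SAMPLE-2"])
```

Seeding the result with every sample keeps patients whose variants were all filtered out in the denominator, consistent with the statement that such patients were included in all calculations.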

Variant intersection search. Variant coordinates were used to search genomic features via coordinate intersection. A complete intersection of query and target is considered a 'positional match', or a more specific 'exact match' if the alternate alleles also match. A 'focal match' is reported if the intersection fraction is less than complete, but over 10% overlapping (reciprocally). A 'regional match' is reported if there is any intersection, but the match is of no other type (Extended Data Fig. 4a).
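The four match types can be sketched in Python as follows. This is an illustration under stated assumptions rather than the project implementation: coordinates are treated as closed, 1-based intervals on one assembly, 'complete intersection' is read as identical coordinates, and the `Feature` type and `classify_match` function are hypothetical names.

```python
from typing import NamedTuple, Optional

class Feature(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # inclusive
    alt: Optional[str] = None

def classify_match(query, target):
    """Return 'exact', 'positional', 'focal', 'regional' or None."""
    if query.chrom != target.chrom:
        return None
    overlap = min(query.end, target.end) - max(query.start, target.start) + 1
    if overlap <= 0:
        return None  # no intersection at all
    q_frac = overlap / (query.end - query.start + 1)
    t_frac = overlap / (target.end - target.start + 1)
    if q_frac == 1 and t_frac == 1:
        # Identical coordinates: exact only if alternate alleles agree.
        if query.alt is not None and query.alt == target.alt:
            return "exact"
        return "positional"
    if min(q_frac, t_frac) > 0.10:
        return "focal"  # more than 10% overlap in both directions
    return "regional"

snv = Feature("1", 100, 100, "T")
gene = Feature("1", 1, 1000)
```

For example, `classify_match(snv, gene)` yields a regional match in this sketch, because the single-base query covers far less than 10% of the 1,000-base target.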
Disease TopNode search. Disease searching returns a distance of the number of ancestor or descendant TopNode terms between the queried disease and the matching target (see Supplementary Note for more on TopNode terms). Two diseases sharing a TopNode term (for example, DOID:3008, invasive ductal carcinoma, and its parent term DOID:3007, breast ductal carcinoma, are both members of DOID:1612, breast cancer) would have a distance of 0. However, if two diseases share a TopNode term but do not have a direct lineage, they are not a match. For example, DOID:0050938, 'breast lobular carcinoma', does not match to DOID:3007, 'breast ductal carcinoma', even though they share a TopNode term (DOID:1612, 'breast cancer'), as they are sibling concepts and do not have an ancestor/descendant relationship (Extended Data Fig. 4b).
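The direct-lineage requirement can be made concrete with a short Python sketch using the breast cancer terms from the example. The child-to-parents dictionary is a simplified toy hierarchy (intermediate DO terms are omitted), and the function names are hypothetical; the sketch checks lineage only and does not compute the TopNode distance.

```python
def ancestors(term, parents):
    """Collect every term reachable by walking up from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def lineage_match(a, b, parents):
    """True if the terms are identical or one is an ancestor of the other."""
    return a == b or a in ancestors(b, parents) or b in ancestors(a, parents)

# Simplified toy hierarchy (child -> parents).
parents = {
    "DOID:3008": ["DOID:3007"],     # invasive ductal carcinoma
    "DOID:3007": ["DOID:1612"],     # breast ductal carcinoma
    "DOID:0050938": ["DOID:1612"],  # breast lobular carcinoma
}
```

Sibling terms fail the check even though both descend from DOID:1612, which is the behavior described for lobular versus ductal carcinoma.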
Enrichment testing for GENIE Oncotree diseases that map to DO TopNode was performed by comparing the count of a given disease term across the GENIE patients, and then splitting these counts into two groups: those diseases that mapped to DO in our analysis, and those that did not. This set of counts was ranked and compared by group using the Mann-Whitney U-test. The sets of counts (as well as the statistical test) may be found in cell 208 of the analysis notebook accompanying this study.
Gene intersection search. To assess cohort interpretability (Extended Data Fig. 6) when considering only matching a variant to a gene, we used the assigned gene symbols for each GENIE variant and compared them to interpretation gene symbols. Patients with at least one variant matching an interpretation gene symbol were considered a match. Matches were subsequently filtered by broad disease matching and by interpretation tier; no adjustment was made to the evidence level and tier to account for this imprecise aggregation strategy.
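Gene-level matching reduces to a set intersection per patient, as in the hedged Python sketch below. The function name, patient identifiers and gene placeholders are hypothetical, and the subsequent disease and tier filtering is not shown.

```python
def gene_level_matches(patient_genes, interpretation_genes):
    """Return the patients with at least one variant in an interpreted gene."""
    return {
        patient
        for patient, genes in patient_genes.items()
        if genes & interpretation_genes
    }

# Toy cohort: patient -> set of gene symbols carrying a retained variant.
patient_genes = {
    "P1": {"GENE_A", "GENE_B"},
    "P2": {"GENE_C"},
    "P3": set(),  # no variants after filtering; still part of the cohort
}
interpretation_genes = {"GENE_A", "GENE_D"}

matched = gene_level_matches(patient_genes, interpretation_genes)
match_rate = len(matched) / len(patient_genes)
```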
ElasticSearch API and web front end. Collectors create 'Association' documents segmented by the source field. Documents are posted to an ElasticSearch v.6.0 instance provisioned by the AWS Elasticsearch service.
On top of ElasticSearch, we built web services using the Flask web framework. The search.cancervariants.org endpoint provides two simple REST-based web services: an association query service and a GA4GH beacon service. The association query service allows users to query for evidence using any combination of keywords, while the beacon service provisions G2P associations into the GA4GH beacon network (beacon-network.org) enabling retrieval of associations on the basis of genomic location. OpenAPI (swagger) documentation is provided to accelerate development and provide API integration scaffolding. Client applications can use the API to create higher level sets of queries driven by cohort allele sets (for example, MAF/VCF files) with varying genomic resolutions and disease/drug combinations. The API server and nginx proxy are described by Docker configurations and deployed colocated within a t2.micro instance.
The user interface is a customized Kibana dashboard that enhances Lucene-based full-text search of associations with interactive aggregation heat maps,
