Privacy-preserving cancer type prediction with homomorphic encryption

Cancer genomics tailors diagnosis and treatment based on an individual’s genetic information and is the crux of precision medicine. However, analyzing and maintaining a high volume of genetic mutation data to build a machine learning (ML) model that predicts the cancer type is computationally expensive and is often outsourced to powerful cloud servers, raising critical privacy concerns for patients’ data. Homomorphic encryption (HE) enables computation on encrypted data, thus providing cryptographic guarantees to protect privacy, but the restrictive overheads of encrypted computation deter its usage. In this work, we explore the challenges of privacy-preserving cancer type prediction using a dataset of more than 2 million genetic mutations from 2713 patients across several cancer types, by building a highly accurate ML model and then implementing its privacy-preserving version in HE. Our solution for cancer type inference encodes somatic mutations into the feature space based on their impact on the cancer genomes and then uses statistical tests for feature selection. We propose a fast matrix multiplication algorithm for HE-based models. Our final model achieves 0.98 micro-average area under curve, improving accuracy from 70.08% to 83.61%, while being 550 times faster than standard matrix multiplication-based privacy-preserving models. Our tool can be found at https://github.com/momalab/octal-candet.


S1.1 Dataset
We use the cancer classification dataset from the iDASH 2020 competition Task I [9], which was collected for private tumor classification. This data is curated from a centralized database, The Cancer Genome Atlas (TCGA) [27], using patients from 11 different cancer types. The TCGA cancer genomics dataset consists of 25 petabytes of data. The data includes clinical data, copy number data, DNA sequencing data, imaging data, DNA methylation, microsatellite instability, miRNA sequencing and expression, and protein expression data. However, not all patients or cancer types are characterized by each type of data, and different subsets of TCGA have been used in different types of studies. In our work, we study the impact of somatic mutations on the prediction of cancer. Our dataset consists of two types of somatic alteration information (considered as two subsets of features): Single-Nucleotide Variations (SNVs) and Copy Number Variations (CNVs) on protein-coding genes. In the SNV subset, four different characteristics are given for each somatic SNV of a gene. These characteristics are the chromosome location, whether the mutation is a single-nucleotide polymorphism, and the effect of the mutation (using two different measures). The effect of the mutation is calculated using the Ensembl Variant Effect Predictor (VEP) [20] and is reported in two ways: 1) a mutation is assigned one of the categorical values high, moderate, modifier, or low, followed by a real number denoting the impact of the mutation; 2) a mutation is qualitatively denoted as tolerated or deleterious, based on the Sorting Intolerant from Tolerant (SIFT) pathogenicity prediction. All of this information reflects the importance of a mutation, i.e., VEP scores help translate the observation of a mutation into its possible impact on the development of the tumor.
VEP scores help in developing the biological intuition for our feature engineering methodology, which is required because the SNV subset contains 2,044,328 somatic mutation rows. In the copy number subset, each gene for each sample (patient) is given a copy number value depending on whether there has been a change from their parents' genes: 0 for no alteration, 1 or 2 for duplication, and -1 or -2 for deletion from one or both parents, respectively. For each sample, there are 25,128 genes, and thus 25,128 features. The dataset comes from 2713 patients belonging to 11 different cancer types. The composition of the dataset is depicted in Table S1.
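To make the shape of the two feature subsets concrete, the following is a minimal sketch of how SNV rows could be aggregated into one feature vector per patient. The impact weights, the max-aggregation rule, the function name `encode_snv_features`, and the example rows are all illustrative assumptions; they are not the encoding used in this work.

```python
from collections import defaultdict

# Hypothetical ordinal weights for the four VEP impact categories named in the text.
# These numbers are assumptions for illustration, not values from the paper.
IMPACT_WEIGHT = {"HIGH": 3.0, "MODERATE": 2.0, "LOW": 1.0, "MODIFIER": 0.5}


def encode_snv_features(snv_rows, genes):
    """Aggregate SNV rows into one feature vector per patient.

    snv_rows: iterable of (patient_id, gene, impact_category) tuples.
    genes:    ordered list of gene names that defines the feature columns.
    Each feature holds the largest impact weight seen for that gene (assumed rule).
    """
    column = {gene: i for i, gene in enumerate(genes)}
    features = defaultdict(lambda: [0.0] * len(genes))
    for patient, gene, impact in snv_rows:
        if gene not in column:
            continue  # gene outside the selected feature set
        i = column[gene]
        features[patient][i] = max(features[patient][i], IMPACT_WEIGHT[impact])
    return dict(features)


# Illustrative rows only; they do not come from the dataset.
example_rows = [
    ("p1", "TP53", "HIGH"),
    ("p1", "TP53", "LOW"),
    ("p1", "PIK3R1", "MODERATE"),
    ("p2", "KRAS", "MODIFIER"),
]
example_features = encode_snv_features(example_rows, ["TP53", "PIK3R1", "KRAS"])
```

The copy number subset needs no such aggregation, since each gene already carries a single value in {-2, -1, 0, 1, 2} per sample.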

S1.2 Grid search for hyper-parameters/model
We performed a grid search over the following classifiers with their respective hyper-parameters, and we report a subset of the models (the best-performing ones) in Table 1.

S1.3 Matrix multiplication illustration
Here we describe the matrix multiplication $Y = X \times W$, where $X$ is the encrypted input matrix (encoded genomic data) and $W$ is the encoded matrix of LR weights. The polynomial degree is $n$, $|X|$ is the number of inputs, $|Y|$ is the number of outputs, and $f$ is the number of features. The operator $\times$ stands for standard matrix multiplication, while $\otimes$ represents our algorithm, $[\cdot]_n$ is modular reduction over $n$, and the intervals $[a, b)$ and $[a, b]$ represent elements packed in a ciphertext. When $b < a$, there is a rotation of the $n$ elements of the ciphertext. Function $\rho(\cdot)$ is the element-wise addition of all rotations of a ciphertext, and function $\alpha(\cdot)$ represents the compression part of the algorithm, where one slot from each of $n$ ciphertexts is selected and combined into a new ciphertext.
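The roles of $\rho(\cdot)$ and $\alpha(\cdot)$ can be illustrated with a plaintext simulation in which a Python list stands in for the $n$ slots of a ciphertext. This is a sketch under stated assumptions: the helper names (`rotate`, `rho`, `alpha`, `packed_matvec`) are hypothetical, nothing is encrypted, and the way the two functions are combined here is a simplified single-output analogue rather than the $\otimes$ algorithm itself.

```python
def rotate(slots, k):
    """Cyclically rotate a packed vector left by k slots."""
    k %= len(slots)
    return slots[k:] + slots[:k]


def rho(slots):
    """Element-wise sum of all n rotations: every slot ends up holding the total."""
    acc = [0] * len(slots)
    for k in range(len(slots)):
        acc = [a + r for a, r in zip(acc, rotate(slots, k))]
    return acc


def alpha(vectors):
    """Compression: keep slot i of the i-th vector and merge them into one vector."""
    n = len(vectors[0])
    assert len(vectors) <= n, "cannot compress more vectors than there are slots"
    out = [0] * n
    for i, vec in enumerate(vectors):
        mask = [1 if j == i else 0 for j in range(n)]  # one-hot selection mask
        out = [o + v * m for o, v, m in zip(out, vec, mask)]
    return out


def packed_matvec(X, w, n):
    """Compute X·w with one packed vector per input row (requires f <= n, |X| <= n)."""
    w_packed = w + [0] * (n - len(w))
    sums = []
    for row in X:
        x_packed = row + [0] * (n - len(row))
        product = [a * b for a, b in zip(x_packed, w_packed)]  # slot-wise multiply
        sums.append(rho(product))  # dot product replicated in every slot
    return alpha(sums)[: len(X)]   # one output per input row, packed together


result = packed_matvec([[1, 2, 3], [4, 5, 6]], [1, 0, -1], n=4)
```

In an actual HE library, the slot-wise operations would act on ciphertexts, and the sum of all rotations is commonly computed with about $\log_2 n$ rotate-and-add steps rather than $n$ separate rotations.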

S1.4 Predictive genes analysis
This subsection depicts the top genes selected from the CNV and SNV pool, with their corresponding Gene Ontology (GO) enrichment terms.
[Table fragment (gene, GO terms), partially recovered: … ATPase activity and microtubule motor activity; PIK3R1, GTP binding and transcription factor binding]

S1.5 Homomorphic encryption
Homomorphic Encryption (HE) is a type of encryption that allows computation on encrypted data without decryption. Let us consider a function $f(\cdot)$ operating on plaintext operands $p_1, p_2$, and the equivalent function $f_{enc}(\cdot)$ operating on the corresponding ciphertexts $c_1, c_2$, such that $c_1 = \mathrm{Enc}(p_1)$ and $c_2 = \mathrm{Enc}(p_2)$, where $\mathrm{Enc}(\cdot)$ is the encryption function. Then the result of computing $f(\cdot)$ on the plaintext operands equals the decryption of the result of computing $f_{enc}(\cdot)$ on the ciphertexts, i.e., $f(p_1, p_2) = \mathrm{Dec}(f_{enc}(c_1, c_2))$, where $\mathrm{Dec}(\cdot)$ is the decryption function. Depending on the type of computation possible in the encrypted domain, there are several types of HE schemes. For linear models with unencrypted weights, Partial Homomorphic Encryption (PHE) schemes like Paillier [29] can be used. Nevertheless, the encryption and decryption operations, which consist of modular exponentiations, hinder the performance of ML models with larger inputs or outputs. In addition, although it is possible to encode several plaintexts into one ciphertext in Paillier for certain applications, the density of plaintexts per ciphertext is much lower than in Somewhat Homomorphic Encryption (SHE) or Fully Homomorphic Encryption (FHE). Furthermore, Paillier is not post-quantum secure, since it can be broken by Shor's algorithm [28]. Thus, it is not suitable for handling genomics data, which must remain secure for decades or even generations.
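The identity $f(p_1, p_2) = \mathrm{Dec}(f_{enc}(c_1, c_2))$ and the use of Paillier for a linear model with unencrypted weights can be illustrated with a toy implementation. This is a sketch for intuition only: the primes are far too small to be secure, the function names are hypothetical, and it is not part of the tool described in this work.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier key generation with tiny primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (n, lam, mu)  # (public key, private key)


def encrypt(n, m):
    """Enc(m) = (n + 1)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(priv, c):
    """Dec(c) = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) // n."""
    n, lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (n * n)


def mul_plain(n, c, k):
    """Multiply an encrypted value by an unencrypted non-negative integer k."""
    return pow(c, k, n * n)


def encrypted_dot(n, enc_features, weights):
    """Linear model score on encrypted features with plaintext weights."""
    acc = encrypt(n, 0)
    for c, w in zip(enc_features, weights):
        acc = add(n, acc, mul_plain(n, c, w))
    return acc


pub, priv = keygen()
total = decrypt(priv, add(pub, encrypt(pub, 20), encrypt(pub, 22)))
score = decrypt(priv, encrypted_dot(pub, [encrypt(pub, x) for x in [3, 1, 4]], [2, 5, 1]))
```

Every encryption and every plaintext multiplication here is a modular exponentiation, and each ciphertext carries a single value, which reflects the two performance limitations noted above.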
A better approach comes from using SHE/FHE schemes like BFV (Brakerski/Fan-Vercauteren) [16] or CKKS (Cheon, Kim, Kim, Song) [30]. CKKS enables fixed-point arithmetic and is the standard choice for ML applications. During computation, CKKS drops the lower bits of the plaintext after each operation, reducing the precision of the result. Unfortunately, with current HE libraries and standard encryption parameters, CKKS does not provide enough precision for our model. Conversely, BFV works on integers (modular arithmetic), where we can emulate fixed-point arithmetic by scaling double-precision floating-point numbers up into integers. As with CKKS, there is a limit on how much precision a BFV ciphertext can provide. However, since BFV computes in modular arithmetic, we can use the Chinese Remainder Theorem (CRT) to break our values into several smaller values, each under a unique modulus that is coprime to all other moduli. Each smaller value is then encrypted under a different key.
In our threat model, training is not privacy-preserving, but inference is private. To make inference private, we resort to encrypted computation (cancer prediction) using homomorphic encryption. Fig. 1 summarizes our threat model.
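Returning to the BFV encoding described in this section, the combination of fixed-point scaling and CRT decomposition can be sketched on plain integers. The moduli, the scaling factor, and the function names below are illustrative assumptions, not the encryption parameters used in this work, and the per-modulus dot product stands in for a computation that would run on ciphertexts.

```python
from math import prod

MODULI = [10007, 10009, 10037]  # illustrative pairwise-coprime moduli (assumed)
SCALE = 1000                    # hypothetical fixed-point scaling factor


def to_fixed(x):
    """Scale a floating-point number up into an integer."""
    return round(x * SCALE)


def to_residues(value):
    """Break an integer into one small value per modulus."""
    return [value % m for m in MODULI]


def crt_reconstruct(residues):
    """Recombine residues with the CRT and map the result to a signed range."""
    M = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse
    total %= M
    return total - M if total > M // 2 else total


def dot_mod(xs, ws, m):
    """Dot product computed entirely modulo m (one independent channel)."""
    return sum((x % m) * (w % m) for x, w in zip(xs, ws)) % m


def fixed_point_dot(xs, ws):
    """Dot product of real vectors via per-modulus integer arithmetic."""
    fx = [to_fixed(x) for x in xs]
    fw = [to_fixed(w) for w in ws]
    residues = [dot_mod(fx, fw, m) for m in MODULI]
    # One multiplication of two scaled values carries a factor of SCALE squared.
    return crt_reconstruct(residues) / (SCALE * SCALE)


value = fixed_point_dot([1.5, -2.25], [0.5, 1.0])
```

The result is only correct while the true scaled integer stays within half the product of the moduli, which is why adding more coprime moduli extends the usable precision.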