Introduction

Background

Personal human genome information is used daily in medical treatment, in the diagnosis of monogenic diseases, e.g., Fabry disease and Lynch syndrome1, and in the prevention of drug side effects, e.g., avoiding thiopurine treatment for inflammatory bowel disease and acute lymphocytic leukemia in patients homozygous for the p.Arg139Cys allele of NUDT152. Additionally, personal human genome information is useful for personalized medical treatment not only of monogenic diseases but also of polygenic diseases, e.g., type 2 diabetes and coronary artery disease (CAD). Statin prevention therapy for individuals with a high genetic risk of CAD (polygenic risk score; PRS) was estimated to reduce the mean cost per individual, improve quality-adjusted life years, and avert future CAD events3. The polygenic risk score is usually calculated for each individual by summing, over the disease-associated variants, the product of the weight, i.e., the beta value from a genome-wide association study of an independent, population-matched CAD cohort, and the genotype dosage, e.g., 0, 1, or 2. In contrast to monogenic diseases, calculating a PRS therefore requires genotypes at many disease-associated variants from every individual.
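A minimal Python sketch of this weighted sum is shown below; the variant identifiers, beta values, and genotypes are hypothetical and purely illustrative, not real CAD associations:

```python
# Hypothetical GWAS weights (beta values) and an individual's genotype dosages
# (0, 1, or 2 copies of the effect allele); all values are illustrative only.
weights = {"rs0000001": 0.12, "rs0000002": -0.05, "rs0000003": 0.30}
genotypes = {"rs0000001": 2, "rs0000002": 1, "rs0000003": 0}

# PRS = sum over variants of (beta * genotype dosage)
prs = sum(beta * genotypes[snp] for snp, beta in weights.items())
print(f"polygenic risk score = {prs:.2f}")   # 0.12*2 + (-0.05)*1 + 0.30*0 = 0.19
```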

In particular, the advancement of whole-genome sequencing (WGS) technology allows us to measure the whole human genome of 3.05 billion bases easily and on a large scale, from saliva, within a few days, and for less than $1000. Therefore, in order to capture high-resolution relationships between phenotypes and genotypes, including rare variants, genomic cohort studies around the world are shifting from SNP array technology, which measures hundreds of thousands to millions of pre-designed SNPs per sample, to whole-genome sequencing. To date, more than one million samples have already been sequenced, e.g., in the NIH’s Trans-Omics for Precision Medicine (TOPMed) program4, the Million Veteran Program5, the Tohoku Medical Megabank Project (TMM)6, and UK Biobank7. For example, in November 2021, one large prospective cohort study, UK Biobank, built a research analysis platform on the public cloud for authorized researchers to utilize WGS information of 50,000 participants8. In late 2023, UK Biobank plans to complete WGS for all 500,000 participants9. The analysis of WGS data together with thousands of phenotypes allows researchers to accelerate the discovery of novel variant-phenotype relationships.

The use of WGS technology for rare diseases also increases the possibility of identifying disease-causing variants. Therefore, a whole-genome sequencing project covering tens of thousands of individuals with 190 rare diseases is being conducted by Genomics England10. Consequently, such novel discoveries should also be incorporated frequently into clinical practice. In 2021, the American College of Medical Genetics changed the update interval of its guidance on reporting secondary findings in clinical exome and genome sequencing from every four years to annually1.

For each sample, major human whole-genome sequencing technologies measure billions of short DNA fragments as sequenced data, e.g., reads of hundreds of DNA bases on the NovaSeq 6000 (Illumina, San Diego, CA) or DNBSeq-T7 (MGI, Shenzhen, China). The human genome is composed of about 3.05 billion base pairs, and the total coverage of sequencing reads over the whole genome is usually thirty to forty fold. These billions of sequencing reads are usually analyzed by WGS analysis pipelines on CPUs, GPUs, or FPGAs11. These pipelines are frequently updated to improve the precision and recall of detected variants. In addition, the reference assembly of the human genome is updated occasionally. In 2022, the T2T consortium released the new complete human genome assembly, CHM13v2.0, and reported improved variant detection accuracy compared to the standard reference assembly, GRCh38, through re-analysis of WGS data with CHM13v2.012.

Taking all of these advances and the surrounding ecosystem into consideration, a reasonable implementation of personal germline genome data management for medical treatment is to apply whole-genome sequencing once to a patient, store the raw WGS data, i.e., FASTQ, BAM, or CRAM files, and analyze the raw data using the latest reference assembly, data analysis pipeline, and interpretation guideline. In the future, e.g., annually, if the reference assembly, data analysis pipeline, or interpretation guideline is updated, the stored raw data are reanalyzed and reinterpreted, e.g., using an additional reliable catalog of the patient's variants missed by the previous pipeline, to diagnose a monogenic disease that could not be diagnosed previously or to improve the PRS.

Information security issues of genomic data

The human genome is basic information that identifies an individual and cannot be changed over a lifetime. Hence, protection of genomic information is required. Genome data protection techniques and various laws and guidelines are summarized in13. Notably, the National Institutes of Health has stated that two essential values of scientific research—the need to share data broadly to maximize its use for ongoing scientific exploration and the need to protect research participants’ privacy—should be balanced14.

However, when information for genome analysis is transmitted and received, it is often exchanged with encryption only by the standard TLS protocol15. It is unlikely that its security will be compromised immediately, but it cannot be guaranteed that the information will not be decrypted decades later by a “harvest now, decrypt later” attack, owing to the improved performance of quantum computers and supercomputers in the future16. For the transmission and storage of data that require long-term confidentiality, such as genomic data, sufficient security against future computers and cryptanalysis algorithms is required. In recent years, laws that impose severe civil penalties for the leakage of personal information have come into force13, and in line with the growing awareness of personal information protection, it is desirable to build a secure system that eliminates the threat of future decryption altogether (in short, an information theoretically secure system).

Solution for genomic data protection

As a solution to this need, quantum key distribution (QKD)17,18, which allows two parties to share keys with information theoretic security, has been attracting attention in recent years. Vernam's one-time pad (OTP)19 encryption using keys from a QKD link enables information theoretically secure data transmission and eliminates concerns about future decryption. On the other hand, in QKD using, for example, the BB84 protocol, a single photon is used as the key transmission medium, so it is strongly affected by transmission loss, and the transmission distance in the field is limited to about 50 to 100 km. However, current QKD networks can extend the key supply distance by performing key relay via “trusted nodes”20. Moreover, a distributed storage system has been built on the QKD network and realized information theoretically secure data storage20. From the viewpoint of information security, such a distributed storage system on the QKD network would be suitable for storing data that requires confidentiality for a very long period, such as personal genomic data. This secure distributed storage system has recently been updated with an enhanced computing functionality for secure secondary use of data, referred to as a “quantum secure cloud”. For secure secondary use of data, several solutions have been proposed, e.g., secure computation using multi-party computation21,22 and homomorphic encryption23. Unfortunately, each of these two schemes has its own issues, such as increased communication volume and computational complexity. These issues result in a degradation of throughput.

Previous work on methods for secure secondary use of data (multi-party computation and homomorphic encryption)

Multi-party computation requires a huge amount of computational and communication resources. Computational resources increase with the number of shares21,22, and communication between the data owner and the share holders would be a rate-determining step.

Research aiming at both communication efficiency and confidentiality in multi-party computation schemes is underway. Zhao et al.22 surveyed recent activities and introduced implementation schemes targeting several genomic data sets. In24, three secure sequence comparison protocols based on garbled circuit techniques were proposed under a semi-honest model; however, the authors noted that their schemes become inefficient when dealing with large amounts of data. A theoretical study of efficiency improvement using secure two-party computation (S2PC) was reported in25; although more efficient than conventional systems, it is not sufficient from the standpoint of practicality. S2PC implementations for genome data comparison were also proposed in26,27, but their methods differ from our method for whole-genome data analysis and are not consistent with our policy of information theoretic security, which we consider essential for genomic data analysis.

For homomorphic encryption schemes, the throughput is not sufficient because the computation is complex28. As a scheme to mitigate the computational complexity of homomorphic encryption, a method called switchable homomorphic encryption (SHE), which can combine additive and multiplicative schemes when needed, was proposed29. In30, genome-wide association study (GWAS) analysis using homomorphic encryption was demonstrated. Moreover, FPGA and ASIC hardware solutions would boost the throughput of homomorphic encryption, although such efforts are still at the development stage31,32. These implementation innovations have the potential to expand the use of homomorphic encryption. However, the throughput improvements are still under development, and information theoretic security has not been achieved.

In addition to requiring high-throughput technology, genome analysis uses a large amount of data, e.g., whole-genome sequencing data from one individual usually exceeds 30 GB in FASTQ format, so both multi-party computation and homomorphic encryption need to conserve computational and communication resources. Secondly, genomic data, e.g., a FASTQ file, is categorized as unstructured data33 and is therefore not well suited to secure computation by multi-party computation or homomorphic encryption.

Our approach: using a trusted server for secure secondary use of data

To improve throughput, secure computation using a "trusted server", which enables realistic and secure data use, has been proposed as one form of implementation34. Essentially, a one-stop system is needed that can guarantee information theoretic security between the data owner, i.e., the authorized genomic data bank, and the users, e.g., medical doctors.

Towards the implementation of such a one-stop system, as a first step, we developed a method to guarantee the integrity of distributed storage data with information theoretic security by assuming a “trusted share calculator” and a verifier in the quantum secure cloud35. In this paper, based on this concept of a "trusted server" in the quantum secure cloud, we developed an advanced high-performance one-stop system for large-scale genome data analysis that provides secure secondary use of data for the data owner and for multiple users with different levels of data access control.

This manuscript is organized as follows. In Materials and Methods, we introduce our experimental setup, i.e., the quantum secure cloud and the implementation of the trusted server. In Results, the throughputs of our system are reported. We summarize our experiment and discuss the outlook of our system in the Conclusion.

Materials and methods

Quantum secure cloud system

Our quantum secure cloud system is a distributed secure genomic data analysis system (DSGD) built on the Tokyo QKD Network. The Tokyo QKD Network36 is established on the National Institute of Information and Communications Technology (NICT) optical fiber testbed. The QKD network works as a secure key supply infrastructure. Figure 1b shows a conceptual diagram of the Tokyo QKD Network with reference to the open systems interconnection (OSI) reference model. The Tokyo QKD Network has five trusted nodes (cylinders in the figure), and each node is connected by QKD links (blue lines) in the quantum layer (bottom layer in the figure) and by key management systems with public channels (green lines) in the key management layer (middle layer in the figure). The five trusted nodes are physically distributed over about 100 km between NICT (nodes 1 to 4) and Otemachi (node 5), with QKD links supplied by various vendors37,38,39,40,41. Each node works as a "trusted node", and the keys from the QKD links are strictly managed. Each trusted node has a key management agent (KMA) and a key supply agent (KSA) (boxes in Fig. 1) and is located in a physically protected place. Keys generated in each QKD link in the quantum layer are pushed up (dashed line in Fig. 1) to the KMA in the key management layer and transferred to the KSA. In the key management layer, the key management server (KMS) gathers link information and instructs KMAs to execute key relay according to requests from the DSGD.

Figure 1

Conceptual view of the Tokyo QKD Network. The QKD network works as a secure key supply infrastructure. (a) Service layer; secret sharing or other services are installed on this QKD network. The data owner, the trusted server, the share holders, and the user at the service layer communicate through OTP encrypted communication lines, in which secure keys are provided from the QKD network. Once supplied with the keys or random numbers, the key data in the QKD network are erased and the responsibility for key management moves to the users in the service layer. (b) Quantum layer and key management layer; generated keys in each QKD link are pushed up to servers called key management agents (KMAs). Each KMA is set in a physically protected place, referred to as “a trusted node”. A key supply agent (KSA) is integrated into the KMA. The KSAs supply the keys to users. A key management server (KMS) gathers link information and instructs KMAs to execute key relay according to requests from the service layer.

In the service layer, the DSGD realizes a high-performance one-stop system for large-scale genome data analysis that provides secure secondary use of data for the owner of the genomic data and for multiple users with different levels of data access control. The data owner has the administrative privilege of depositing genomic data and controlling secondary use of the data on the DSGD. As shown in Fig. 1a, the DSGD stores genomic data, e.g., FASTQ, CRAM, BAM, and VCF files, on a distributed storage system (the triangle system on the left), transfers the encrypted data to a trusted server and decodes it there (top of the figure), and reports the filtered (partial) genotype information, e.g., the minimum genotypes of the patient needed for diagnosis, to authorized users (right side of the figure). For the distributed data encryption and decryption in the service layer, OTP encrypted communication is used between the key management layer and the service layer via the KSAs.

Distributed system in service layer

We have previously implemented the secret sharing protocol of Shamir’s scheme42 in the service layer and realized information theoretically secure data transmission, storage, authentication, and restoration21. In this work, to deal with a large amount of genomic data, we implemented an XOR-based secret sharing scheme that is computationally simpler than Shamir’s scheme while retaining information theoretic security, and is therefore expected to realize high throughput. The method we implemented is a (2,3)-threshold scheme.

For the secret data S = S1‖S2 (split so that both halves have the same size), we prepare random numbers R1 and R2 with the same number of bits as S1 and calculate three shares as follows: A = A1‖A2 = (S1⊕R1)‖(S2⊕R2⊕R1), B = B1‖B2 = (S1⊕R1⊕R2)‖(S2⊕R2), C = C1‖C2 = R1‖R2.
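A minimal Python sketch of this (2,3) XOR-based sharing and one of its reconstruction paths is shown below; the function names and the padding convention are ours for illustration and are not part of the deployed system:

```python
import os
from typing import Tuple

def xor(*blocks: bytes) -> bytes:
    """Bitwise XOR of equal-length byte strings."""
    out = bytearray(blocks[0])
    for block in blocks[1:]:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def make_shares(secret: bytes) -> Tuple[bytes, bytes, bytes]:
    """(2,3)-threshold XOR secret sharing: S = S1 || S2 with random masks R1, R2."""
    if len(secret) % 2:
        secret += b"\x00"                 # pad so the two halves have equal size
    h = len(secret) // 2
    s1, s2 = secret[:h], secret[h:]
    r1, r2 = os.urandom(h), os.urandom(h)
    share_a = xor(s1, r1) + xor(s2, r2, r1)
    share_b = xor(s1, r1, r2) + xor(s2, r2)
    share_c = r1 + r2
    return share_a, share_b, share_c

def reconstruct_from_a_and_c(share_a: bytes, share_c: bytes) -> bytes:
    """Recover the secret from shares A and C (any two of the three shares suffice)."""
    h = len(share_a) // 2
    a1, a2 = share_a[:h], share_a[h:]
    r1, r2 = share_c[:h], share_c[h:]
    return xor(a1, r1) + xor(a2, r2, r1)

if __name__ == "__main__":
    secret = b"ACGTACGTAC"
    a, b, c = make_shares(secret)
    assert reconstruct_from_a_and_c(a, c) == secret
```

Since R1 and R2 are uniform and independent, any single share alone reveals nothing about S, while any two shares allow exact reconstruction.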

XOR-based secret sharing can achieve high throughput with information theoretic security because the XOR calculation is simple. However, its shares do not have the additive homomorphic property of Shamir's scheme, so secure computation based on multi-party computation is not possible. There is an example of high-speed secure computation based on multi-party computation43. That scheme performs arithmetic with a (2,3)-threshold secret sharing scheme, and XOR is used in the particular case of the ring of integers modulo 2. However, the size of each share is twice that of the original data (six times in total), so the cost of communication and computation must be considered when secret sharing large volumes of data. XOR-based secret sharing is well suited to distributed processing of large amounts of data because it does not require the development of special software and is superior in terms of simplicity.

Figure 2 shows the secret sharing and secure computation configuration implemented on the Tokyo QKD Network.

Figure 2

The network configuration of secret sharing and secure computation on the Tokyo QKD Network. The share holders (shares A, B, and C) and the data owner are located at trusted nodes of the Tokyo QKD Network. The data owner also functions as a share holder. The trusted server is installed at the same location as the data owner. The share holders, the data owner, and the users communicate using OTP encryption with keys generated by the Tokyo QKD Network.

Definition of trusted server in service layer

Since large-volume data with a block structure, such as FASTQ, should be treated in the same way as “unstructured data”, it is extremely difficult to achieve a throughput satisfactory to users with multi-party computation or homomorphic encryption. In other words, we need a system that allows efficient and secure secondary use of genomic data. Given these conditions, we propose to establish a trusted server in the quantum secure cloud to enable secondary use of data distributed securely by XOR-based secret sharing.

QKD networks around the world assume “trusted nodes” and perform key relay to expand their service areas. Since such QKD networks are built on physically protected nodes, secondary use of data assuming a “trusted server" in the quantum secure cloud is conceptually a natural fit.

Ideally, requirements similar to those of a hardware security module (HSM) as defined in FIPS 140-2 would be imposed as the conditions for a trusted server. With reference to the HSM specification, the conditions listed in Table 1 are implemented in our trusted server.

Table 1 Conditions of the trusted server in our experiment. Requirements similar to those of a hardware security module (HSM) as defined in FIPS 140-2 would be required for “a trusted server.” With reference to the HSM requirements, the conditions listed in Table 1 were implemented in this demonstration.

Implementation of trusted server in service layer

The above conditions were implemented in a dedicated server for genome data analysis to realize secure secondary use of data. Below, we describe our secure secondary use system installed in the Tokyo QKD Network.

Figure 3 shows the diagram of the trusted server. We used DRAGEN44 as the computational engine in the server for genome analysis. DRAGEN generates a VCF file, a reasonably compact representation of variant sites across the genome, from a FASTQ file, a text file containing the sequence data from the clusters that pass the quality filter on a flow cell. When a data request comes from an end user, the data owner, who holds share A, asks one of the share holders to send back its share (B or C) and reconstructs the data (a FASTQ file). When the devices of the data owner and the end user are connected, Wegman-Carter authentication45, which is information theoretically secure, is carried out. The FASTQ file is input to DRAGEN and transformed into a VCF file. When the data owner sends the VCF file by OTP, the data owner controls which part of the VCF file is disclosed according to the access rights of the end user. The VCF file includes all sites within the region of interest in a single file for each sample. There is a risk that an individual could be identified from variant information in the VCF file that is not needed for the research purpose. Therefore, to prevent unnecessary leakage of personal information, the data owner controls the range of variants to be disclosed according to the user's access rights. In the next section, we describe the performance of our system (Fig. 4).
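As a purely illustrative sketch of this access-controlled filtering step, the region restriction could be implemented along the following lines in Python; the file names, region coordinates, and function names are hypothetical and not those of the deployed system:

```python
import gzip

# Hypothetical whitelist of regions the requesting user is authorized to see.
ALLOWED_REGIONS = {("chr1", 155234452, 155244627)}   # e.g., a single gene of interest

def region_allowed(chrom: str, pos: int) -> bool:
    """True if the variant position falls inside an authorized region."""
    return any(chrom == c and start <= pos <= end for c, start, end in ALLOWED_REGIONS)

def filter_vcf(src: str, dst: str) -> None:
    """Copy the VCF header and only the variant records inside authorized regions."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in fin:
            if line.startswith("#"):              # keep header lines unchanged
                fout.write(line)
                continue
            chrom, pos = line.split("\t", 2)[:2]
            if region_allowed(chrom, int(pos)):
                fout.write(line)

filter_vcf("sample.vcf.gz", "sample.filtered.vcf.gz")
```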

Figure 3

Diagram of the trusted server for genomic data analysis. DRAGEN is used as the computation engine of the trusted server for genome analysis in our reference implementation. After the encrypted data are reconstructed as a FASTQ file from the shares, the trusted server takes the FASTQ file and generates a variant call format (VCF) file, which is a list of genotypes that differ from the reference human genome assembly, e.g., three to four million genotyping records from an individual FASTQ file. A part of the VCF file is sent to the user after the unnecessary genotyping information has been filtered out. The trusted server is installed in a physically protected, tamper-resistant area.

Figure 4

Definition of the throughput of secure computation using the trusted server. The trusted server consists of DRAGEN and a filtering server. DRAGEN takes the decrypted FASTQ.gz file as input, processes it to generate a VCF file, and then produces a GZIP-compressed VCF file (VCF.gz) (throughput A). The VCF.gz file is transferred to the filtering server. On the filtering server, the VCF.gz file is decompressed (throughput B-1), the requested genotype regions are filtered (throughput B-2), the filtered VCF is compressed (throughput B-3), and the result is transferred to the server of an authenticated user (throughput C). The equations for throughputs A, B-1, B-2, B-3, and C, i.e., X [Mbit], are given at the bottom of this figure.

Results

Benchmarking of DSGD

The secret sharing and secure computation were established on the Tokyo QKD Network, and the throughput was measured including the genotype filtering function. A FASTQ file has a volume at the 10 GB level, but it is reduced to about 400 MB when converted to a VCF file. Furthermore, it is operationally assumed that filtering reduces the disclosed range from 50 MB to less than 1 MB. The throughput was measured according to the following five definitions:

$$\mathrm{A}=\frac{size(FASTQ)+size(VCF)}{processing\, time}$$
(1)
$$\mathrm{B}-1=\frac{size(VCF.gz)+size(VCF)}{processing\, time}$$
(2)
$$\mathrm{B}-2=\frac{size(VCF)+size(F\_VCF)}{processing\, time}$$
(3)
$$\mathrm{B}-3=\frac{size(F\_VCF)+size(F\_VCF.gz)}{processing\, time}$$
(4)
$$\mathrm{Total\, throughput}=\frac{size(FASTQ)}{total \,processing\, time}$$
(5)

The processing time includes the data format transformation.
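For orientation only, the snippet below evaluates definitions (1) and (5) with hypothetical round numbers based on the orders of magnitude quoted above; the processing time is a placeholder chosen so that the total throughput comes out near the ~400 Mbps figure reported below, not a measured value:

```python
# Purely illustrative evaluation of Eqs. (1) and (5); all sizes and the
# processing time are hypothetical placeholders, not experimental results.
def throughput_mbps(num_bytes: float, seconds: float) -> float:
    """Throughput in Mbit/s for num_bytes processed in the given number of seconds."""
    return num_bytes * 8 / seconds / 1e6

size_fastq = 10e9   # ~10 GB FASTQ (order of magnitude quoted in the text)
size_vcf   = 400e6  # ~400 MB VCF (order of magnitude quoted in the text)
t_total    = 200.0  # hypothetical total processing time, dominated by DRAGEN

print(f"A     ≈ {throughput_mbps(size_fastq + size_vcf, t_total):.0f} Mbit/s")  # Eq. (1)
print(f"Total ≈ {throughput_mbps(size_fastq, t_total):.0f} Mbit/s")             # Eq. (5)
```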

Assuming several filtering ranges (extracting 0, 30,000, 1.5 million, and 3 million records), the throughput in each case was measured. The results are summarized in Table 2. Table 3 focuses on the throughput of OTP transmission (encryption and decryption processes).

Table 2 Throughput in each process. The data storage type in each process is “share”; the process of restoring the original data is also included. Transfer conditions: plain, no encryption; OTP, Vernam’s one-time pad encryption and decryption.
Table 3 Throughput of OTP transmission. The throughput with and without OTP encryption is summarized. Communication over the TCP and UDP protocols is listed. “Header encryption” means OTP encryption that includes the header information of each protocol. The maximum transmission unit corresponds to the data size of IPsec.

A throughput exceeding 400 Mbps was achieved by secret sharing on the Tokyo QKD Network and by secure computation using the trusted server. These results mean that the overall throughput is limited by the processing time of DRAGEN. Our system can therefore provide genomic analysis data to users without additional latency associated with data concealment. To our knowledge, such an operation cannot be handled by multi-party computation or homomorphic encryption. In addition, the filtering function of this system enables genome analysis research to be conducted without worrying about the leakage of unnecessary personal data.

Security enhancement of trusted node

We have also made various efforts to implement the security of the trusted server. In particular, erasing used keys is a necessary function for OTP encryption, but careful consideration is required to realize it. For example, on solid-state drives (SSDs), which are widely used as data storage media, data can often be recovered even after it appears to have been erased, suggesting a risk of data leakage46. In that respect, data in dynamic random access memory (DRAM) is reliably erased when the power is turned off. If all the key data for OTP could be stored in DRAM, erasure would be guaranteed. However, there is a risk that valuable keys from the QKD links would be lost due to momentary power outages, etc., so stable operation is a concern. To solve these problems, the method shown in Fig. 5 is used to erase the random numbers used for OTP transmission in our system47.

Figure 5

Key generation and erasing method for OTP encryption. Steps 1 to 3 are the key generation steps, and Steps 4 and 5 are the key erasing steps for OTP encryption. Step 1: the key K0 provided from the QKD link is separated into K1 and K2 (size(K1) ≫ size(K2)). Step 2: K1 is stored in an SSD or another medium that can retain data for a long time, while K2 is stored in DRAM (short-term memory). Key expansion using the AES scheme is performed with K2 as the initial random number, and the result is stored in DRAM as K3. At that time, size(K1) = size(K3) in DRAM. Step 3: the XOR of K1 and K3 is calculated as K4. Step 4: the keys in DRAM are deleted. Step 5: the keys in the SSD are deleted. After Steps 4 and 5 are completed, it is extremely difficult to guess K4 by a probing attack on the SSD.

The detailed procedure is as follows. The key K0 provided by the QKD link is separated into K1 and K2 (size(K1) ≫ size(K2)). K1 is stored in an SSD or another medium that can retain data for a long time, while K2 is stored in DRAM. Key expansion using the AES scheme is performed with K2 as the initial random number, and the result is stored in DRAM as K3. At that time, size(K1) = size(K3). Furthermore, the XOR of K1 and K3 is calculated as K4, and K4 is also stored in DRAM. K4 is the key used for OTP transmission. With this method, the transmission has information theoretic security against eavesdropping on the communication path. Moreover, even if the SSD that should have been erased is stolen by a malicious third party, it is extremely difficult to guess K4. This approach is only computationally secure against attacks on the SSD; however, it makes the implementation considerably safer.
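The following Python sketch illustrates the key-derivation part of this procedure (Steps 1 to 3). The use of AES in counter mode as the key-expansion pseudorandom generator, the seed length, and all names are our assumptions for illustration; they do not describe the deployed implementation:

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def expand_key(k2: bytes, length: int) -> bytes:
    """Expand the short seed K2 into K3 of the requested length, using AES in
    counter mode as a pseudorandom generator (the exact AES mode is an assumption)."""
    encryptor = Cipher(algorithms.AES(k2), modes.CTR(b"\x00" * 16)).encryptor()
    return encryptor.update(b"\x00" * length)

def derive_otp_key(k0: bytes, seed_len: int = 32):
    """Split K0 into K1 (long, stored on the SSD) and K2 (short, kept in DRAM),
    then compute K4 = K1 XOR K3, where K3 is the AES expansion of K2."""
    k1, k2 = k0[:-seed_len], k0[-seed_len:]
    k3 = expand_key(k2, len(k1))                   # kept only in DRAM
    k4 = bytes(a ^ b for a, b in zip(k1, k3))      # OTP key, kept only in DRAM
    return k1, k2, k4

# Illustrative usage with a random stand-in for the QKD-supplied key K0.
k0 = os.urandom(1024 + 32)
k1_on_ssd, k2_in_dram, k4_otp_key = derive_otp_key(k0)
```

Because K1 is a truly random string independent of K3, the derived K4 is uniformly distributed, which is why OTP transmission with K4 remains information theoretically secure against eavesdropping on the channel.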

A fuzzing test was conducted on the network around the trusted server, and general cyber-security countermeasures were implemented. In addition, the rack that mounts the trusted server is locked with face-recognition access control, so our experimental system realizes a defense-in-depth implementation. It is difficult to draw a general conclusion about how secure our experimental implementation is, because the other network devices would also have to be checked for security holes. However, we can conclude that we succeeded in a proof-of-principle demonstration of secure computation using a trusted server with information theoretic security.

Genome data, i.e., a FASTQ file, is structured to some extent, but it is not amenable to statistical processing in the way that electronic medical record data is. Therefore, analysis by dedicated analysis devices is adopted worldwide. We think that it is difficult to apply multi-party computation or homomorphic encryption to such genomic data, whose volume is large and which is not well structured. Therefore, we consider the secure computation method using a trusted server in a secure network capable of information theoretically secure transmission and storage to be a practical and promising candidate for secure secondary use of genomic data.

Conclusion

As a method to enable information theoretically secure genomic data transmission, storage, and secondary use, we proposed and implemented a secret sharing system on a quantum key distribution network together with a method for secure secondary use of data that assumes dedicated hardware as a trusted server. We realized high-speed processing of over 400 Mbps for data analysis to obtain genotyping information from GB-class sequencing data and made it possible to hand the results over to authorized users in an information theoretically secure manner using OTP transmission and secret sharing. From the perspective of enabling secure secondary use of genomic data, multi-party computation would be a candidate; however, it cannot always be implemented with a realistic throughput. In addition, while multi-party computation requires as many computational resources as there are shares, our method is equivalent to ordinary computation. Moreover, since genome analysis methods are updated frequently, multi-party computation would require updating the software on as many computational resources as there are shares. In contrast, with the scheme assuming a trusted server, an update is completed by updating the software and firmware of one dedicated piece of hardware, so maintenance work can be minimized, and there are many advantages in terms of cost and operation. Research and diagnosis using human genome data must maintain secrecy for a very long time. The combination of a QKD network, secret sharing, and secondary use of data via a trusted server is extremely effective as a practical implementation method for this purpose. To make this scheme more mature, security certification methods will be required. The penalties for leakage of personal information are increasing, which raises the cost of using personal information. If personal information is aggregated in the quantum secure cloud and the secure computation scheme proposed in this study is used, the cost of protecting personal information at hospitals or institutes can be reduced. In the future, we would like to complete a system that allows various medical institutions to store data with information theoretic security and allows cross-referencing and highly efficient secondary use of data by multiple organizations using our proposed system.