Published online 12 April 2010 | Nature | doi:10.1038/news.2010.178


Anonymizing patient records for genomics

New method for concealing identity could open up more data for science.

Diagnostic data held in medical records could be very useful to geneticists. Image: DNY59/iStockphoto

Combining genetic information with existing medical records could allow researchers to rapidly identify genes linked to diseases. But bringing together diagnostic and genetic information in one place risks compromising the confidentiality of patients.

Now a team from Vanderbilt University in Nashville, Tennessee, has developed a method they say can anonymize electronic medical records, eliminating the risk of individual patients being identified while preserving the ability of researchers to do such work.

There is a huge push under way in the United States and elsewhere to create and exploit electronic medical records (see 'Clinical epidemiology: Archived answers'). Combining the diagnosis codes that list a patient's conditions in these records with genetic information from the patient could save time and money in genome-wide association studies, which seek to link genetic variations with diseases.

If the new approach is adopted, it could open up vast amounts of data for researchers to analyse while keeping patients anonymous, the team says.

"There are lots of different types of information that exist within medical records. In the past it has been shown that demographic information is readily linkable to public information that could lead to the identification of patients. What this research demonstrates is that residual features, particularly the diagnosis codes, are also exploitable," says Bradley Malin, a biomedical informatics researcher at Vanderbilt University and one of the authors of the study. Malin says the new research, published in the Proceedings of the National Academy of Sciences, shows that it is possible to strike a balance between protecting patient privacy and providing researchers with useful data for their studies [1].

Institutions such as universities are currently unwilling to release such data over concerns about privacy, the team says.

Code conundrum

Patients often have a unique combination of diagnosis codes. That could allow someone such as a hospital employee to identify them, even if the patient's name and other identifying information have been removed. But the diagnosis codes provide exactly the information that would help researchers link genetic variants to diseases.
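The re-identification risk described above can be illustrated with a minimal sketch. It assumes a toy data set in which each patient is represented only by a set of diagnosis codes; the patient labels, the code names (placeholders, not real diagnosis codes) and the function `unique_patients` are all hypothetical and are not taken from the study.

```python
from collections import Counter

# Hypothetical, made-up records: patient label -> set of placeholder codes.
records = {
    "A": {"T2D", "HTN"},
    "B": {"T2D", "HTN"},
    "C": {"T1D", "ASTHMA", "GERD"},
    "D": {"HTN"},
}

def unique_patients(records):
    """Return labels of patients whose code combination nobody else shares."""
    # Count how many patients have each exact combination of codes.
    counts = Counter(frozenset(codes) for codes in records.values())
    return sorted(
        label for label, codes in records.items()
        if counts[frozenset(codes)] == 1
    )

print(unique_patients(records))
```

In this toy data set, patients C and D each have a combination that no other patient shares, so someone who already knows their diagnoses could pick them out even with names removed.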

To solve this problem, the new method allows researchers to set two parameters: the minimum number of patients (k) that should have the same set of codes, and a 'utility policy' which specifies how codes should be linked in the anonymized data.

Grigorios Loukides, co-author of the paper and a member of Malin's research group, cites an example where the first parameter is 2 and the utility policy groups type 1 diabetes with type 2 diabetes. If in the original data only Patient X has type 1 and only Patient Y has type 2, the system will anonymize the data by making both patients have the codes for both types. However, the data will still allow researchers to identify people with diabetes.

Essentially, their new system modifies the data — for example by generalizing the rarest codes — until k is reached, says Loukides. "Our method will always allow useful inferences to be made when the utility policy is satisfied — that is, when diagnosis codes are grouped as required — even when there are fewer than k patients having a diagnosis code in the original data."
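Loukides's diabetes example can be sketched in a few lines of code. This is an illustrative simplification under stated assumptions, not the authors' algorithm: it makes a single pass, represents the utility policy as a list of code groups, and simply reports any patient still below k. The names `generalize`, `anonymize` and `policy_groups`, and the placeholder codes `T1D` and `T2D`, are hypothetical.

```python
from collections import Counter

def generalize(codes, policy_groups):
    """Replace each code with every code in its policy group, if it has one."""
    out = set()
    for code in codes:
        # A code outside every group is kept unchanged.
        group = next((g for g in policy_groups if code in g), {code})
        out |= set(group)
    return frozenset(out)

def anonymize(records, k, policy_groups):
    """Generalize records whose code set is shared by fewer than k patients.

    Returns the modified records and the labels still below k afterwards.
    """
    result = {label: frozenset(codes) for label, codes in records.items()}
    counts = Counter(result.values())
    for label, codes in list(result.items()):
        if counts[codes] < k:
            result[label] = generalize(codes, policy_groups)
    # Recount after generalizing to see who is still identifiable.
    counts = Counter(result.values())
    unresolved = sorted(l for l, c in result.items() if counts[c] < k)
    return result, unresolved

# Only Patient X has type 1 and only Patient Y has type 2; k is 2.
records = {"X": {"T1D"}, "Y": {"T2D"}}
policy_groups = [{"T1D", "T2D"}]  # assumed policy: group the two diabetes codes
anonymized, unresolved = anonymize(records, k=2, policy_groups=policy_groups)
print(anonymized, unresolved)
```

After the pass, both patients carry both codes, so neither can be told apart from the other, yet a researcher can still count two people with diabetes.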

Consensus eMERGE-ing?

The paper comes from universities in the Electronic Medical Records and Genomics (eMERGE) Network, which is overseen by the National Human Genome Research Institute (NHGRI). Malin says that although the method has only been applied at Vanderbilt, other universities have been positive about the idea. If it is adopted, he says, researchers "would have access to significantly greater amounts of data than they currently have".

"The paper addresses an important issue related to potential for re-identification from electronic medical record data used for genome-wide association studies," says Teri Manolio, director of the Office of Population Genomics at the NHGRI. "It does a good job of trying to maximize the information shared while minimizing the risk for re-identification, recognizing that these goals are in dynamic tension and both cannot be fully met at the same time."


She adds that the agreements that bind researchers to ethical use of the data are the best insurance against unintended re-identification. "But it is useful to know the magnitude of the possible risk should an approved user decide to violate these agreements," Manolio says.

"Genome-wide association studies and whole-genome sequencing is by itself very sensitive data, with or without the electronic medical record data being anonymized," says Thomas Hudson, a member of the International Cancer Genome Consortium executive and president of the Ontario Institute for Cancer Research in Toronto.

Hudson plans to bring the new research to the attention of ethics and data-access groups of the International Cancer Genome Consortium. "This is a hot issue, which we need to understand from multiple perspectives — informatics, legal, societal, ethical — that may differ in different parts of the world," he says. 

References

1. Loukides, G. et al. Proc. Natl Acad. Sci. USA doi:10.1073/pnas.0911686107 (2010).