The studies conclude that it could soon be possible to search crime-scene DNA for links to nearly all Americans of European descent, while massively expanding the potential reach of an existing forensic genetic database. The results also raise urgent privacy issues, say researchers.
“It’s important to have this discussion early on,” says Yaniv Erlich, chief scientific officer at consumer genetics firm MyHeritage in Yehuda, Israel, and a computational geneticist at Columbia University in New York City, who led one of the studies, published in Science (ref. 1).
From the mid-1970s to the late 1980s, a string of burglaries, sexual assaults and murders committed in California were attributed to an unknown person dubbed the Golden State Killer or the East Area Rapist. The case went cold, but in April 2018, police arrested a suspect named Joseph James DeAngelo. He was identified as a suspect, in part, by matching crime-scene DNA to genetic profiles posted by his distant relatives on the genetic-genealogy website GEDmatch, which allows people to upload genetic profiles obtained from consumer genetic companies to search for relatives.
The Golden State Killer case wasn’t the first in which police nabbed a suspect through a relative’s DNA. But its high profile, coupled with the breakneck growth of consumer genetics testing, has led to a surge of similar investigations. Between April and August 2018, more than a dozen cases were solved using this technique, which is known as long-range familial search.
Erlich’s team, which has previously shown (ref. 3) that it can identify anonymous DNA samples in public databases, set out to measure the reach of long-range familial search. Many criminal cases that have incorporated such genetic searches used GEDmatch, which contains the DNA profiles of roughly 1 million people.
To study the potential of these searches, Erlich’s team analysed private, anonymized DNA profiles from 1.28 million MyHeritage customers. Like other consumer genetics firms, the company allows customers to search for relatives who share DNA segments inherited from a common ancestor, such as a great-great-grandparent.
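To give a sense of how little DNA such distant relatives are expected to share, here is a minimal sketch of the standard expectation for full cousins. The function name is illustrative, and the figures are averages only: real pairs vary widely around them, and this is not the matching procedure used by MyHeritage or GEDmatch.

```python
def expected_shared_fraction(cousin_degree: int) -> float:
    """Expected fraction of autosomal DNA shared by full cousins of a given degree.

    Textbook average only; actual sharing between any two cousins varies.
    """
    if cousin_degree < 1:
        raise ValueError("cousin_degree must be 1 or greater")
    # Each cousin sits (degree + 1) generations below the shared ancestral
    # couple, so 2 * (degree + 1) meioses separate the pair through each
    # ancestor, and there are two shared ancestors.
    meioses = 2 * (cousin_degree + 1)
    return 2 * 0.5 ** meioses


for degree in (1, 2, 3):
    print(degree, expected_shared_fraction(degree))
# 1 0.125
# 2 0.03125
# 3 0.0078125
```

Third cousins, who share a pair of great-great-grandparents, are thus expected to have less than 1% of their DNA in common, which is why such matches depend on detecting shared segments rather than overall similarity.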
Erlich’s team found that 60% of MyHeritage customers had a third cousin or closer relative in its database. Searches of 30 randomly selected GEDmatch profiles found a similar rate of relative matching in that database.
But such genetic databases have the potential to identify many more people who aren't in them. DeAngelo, for example, was not on GEDmatch; detectives found him using profiles of his third cousins. Erlich’s team estimates that a database containing genetic profiles of 3 million Americans of European descent could enable the identification of 90% of this demographic using public genealogy records.
(Consumer genetics customers are overwhelmingly of European descent, in stark contrast to forensic databases, in which minorities tend to be over-represented, and nearly all the cases solved using GEDmatch have involved people of European descent.)
GEDmatch’s database is currently growing at a rate of 1,000–2,000 profiles per day, says GEDmatch’s co-administrator Curtis Rogers, and should hit the 3-million-profile threshold within the next few years.
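The arithmetic behind these estimates can be sketched roughly as follows. This is a back-of-the-envelope illustration, not the population model used in the Science study: it assumes each relative is in the database independently with the same probability, and the number of relatives passed to `match_probability` is a hypothetical input. Both function names are made up for this example.

```python
def match_probability(database_fraction: float, relatives: int) -> float:
    """Chance that at least one of `relatives` appears in the database.

    Assumes each relative is included independently with probability
    `database_fraction` (a simplification).
    """
    if not 0.0 <= database_fraction <= 1.0:
        raise ValueError("database_fraction must be between 0 and 1")
    if relatives < 0:
        raise ValueError("relatives must be non-negative")
    return 1.0 - (1.0 - database_fraction) ** relatives


def years_to_reach(current: int, target: int, profiles_per_day: float) -> float:
    """Years needed to grow from `current` to `target` profiles at a steady rate."""
    if profiles_per_day <= 0:
        raise ValueError("profiles_per_day must be positive")
    return max(target - current, 0) / profiles_per_day / 365.0


# Figures quoted in the article: about 1 million profiles now, 3 million
# needed, and growth of 1,000 to 2,000 profiles per day.
for rate in (1000, 2000):
    print(rate, round(years_to_reach(1_000_000, 3_000_000, rate), 1))
```

At the quoted growth rates this works out to roughly 2.7 to 5.5 years, under the further assumption that every new profile counts towards the threshold.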
Such searches involve significant detective work. The full details of the Golden State Killer investigation have not been revealed, but before focusing on DeAngelo, investigators screened dozens, if not hundreds, of people — including some of his close relatives.
To see whether they could track down people not in the database, Erlich and his team set out to identify an anonymous woman from Utah who had made her DNA public as part of a genomics project called 1000 Genomes. In a 2013 paper (ref. 3), the team determined the identity of the woman’s husband (who also donated his DNA to the project) using a database that links Y-chromosome sequences with surnames.
To find the man’s wife, the team uploaded her 1000 Genomes profile to GEDmatch and searched the database for distant cousins. Of the people who had enough DNA in common with the Utah woman to suggest that they shared an ancestor in the past few generations, two — from North Dakota and Wyoming — also had enough public genealogical information to narrow the search. A day’s worth of research, which involved ruling out hundreds of descendants, eventually identified the Utah woman.
Erlich’s team contacted the US National Institutes of Health, which is involved with the 1000 Genomes Project database, to let it know that the group had identified a participant. The woman is not named in the paper and the researchers made no attempt to contact her.
DeAngelo was identified and arrested only because crime-scene DNA had been preserved. This allowed forensic scientists to analyse the genetic material using modern techniques that determine the sequence of hundreds of thousands of DNA variants, or single nucleotide polymorphisms (SNPs), across the genome. This is the same genotyping approach used in consumer genetics testing and many biomedical studies.
For the past few decades, though, most crime-scene DNA samples have been analysed using a technology that determines the sequences of more than a dozen ‘short tandem repeats’, the lengths of which vary from person to person. The FBI’s Combined DNA Index System (CODIS) holds more than 13 million such profiles in its computer database.
These allow forensic scientists to determine a genetic signature for individuals, and are relatively easy to generate from highly degraded samples, such as blood spots. But the profiles are poorly suited to matching relatives, says Noah Rosenberg, a population geneticist at Stanford University in California. They don’t have the resolution to determine ancestry and relatedness in the same way that SNP assays based on 1 million variants do, and false positives are common in familial searches.
To circumvent this problem, Rosenberg’s team developed a computational method to cross-match CODIS profiles with a close relative’s SNP profile (the test used by most consumer genetics companies and available for searching on GEDmatch). The method takes advantage of the fact that DNA is inherited in large chunks, and it is possible to identify SNP sequences that tend to be passed down on the same chunk of DNA as a particular short tandem repeat.
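The linkage idea can be illustrated with a toy example. This is not the method from Rosenberg’s team, only a sketch of the underlying principle: given a reference panel in which each chromosome copy carries both a nearby SNP haplotype and a short-tandem-repeat allele, one can count how often each repeat allele travels with each SNP haplotype. The panel data, the haplotype strings and the function names are all invented for illustration.

```python
from collections import Counter, defaultdict


def build_linkage_table(reference_haplotypes):
    """Estimate P(STR allele | SNP haplotype) by counting in a reference panel.

    `reference_haplotypes` is an iterable of (snp_haplotype, str_allele) pairs.
    """
    counts = defaultdict(Counter)
    for snp_haplotype, str_allele in reference_haplotypes:
        counts[snp_haplotype][str_allele] += 1
    table = {}
    for snp_haplotype, allele_counts in counts.items():
        total = sum(allele_counts.values())
        table[snp_haplotype] = {
            allele: n / total for allele, n in allele_counts.items()
        }
    return table


def str_allele_probability(table, snp_haplotype, str_allele, floor=1e-3):
    """Look up the estimated probability, with a small floor for unseen pairs."""
    return table.get(snp_haplotype, {}).get(str_allele, floor)


# Hypothetical panel: SNP haplotype near one STR locus, and the repeat count.
panel = [("ACG", 12), ("ACG", 12), ("ACG", 12), ("ACG", 13),
         ("TTG", 15), ("TTG", 15)]
table = build_linkage_table(panel)
print(str_allele_probability(table, "ACG", 12))  # 0.75
print(str_allele_probability(table, "TTG", 12))  # falls back to the floor
```

In this made-up panel, someone carrying the "ACG" haplotype most likely carries 12 repeats at the linked locus, so a short-tandem-repeat profile showing 12 repeats is more consistent with that SNP profile than one showing 15.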
The method can thus far match only first-degree relatives: siblings, or parents and their children. Simulations suggested that about one-third of people genotyped using short tandem repeats could be correctly matched to a first-degree relative genotyped with SNPs. This could allow investigators who are unable to generate SNP profiles from crime-scene material to look for matches to CODIS profiles in databases such as GEDmatch, and vice versa, Rosenberg says. His team’s study appears in Cell (ref. 2).
Forensic genealogical investigations similar to the Golden State Killer case are set to grow. Parabon NanoLabs, a forensic DNA company in Reston, Virginia, that has been involved in many such investigations, now markets the service to investigators and has dozens of cases in the works.
The lack of regulation surrounding such searches is striking, says Rori Rohlfs, a statistical geneticist at San Francisco State University in California who has written about the ethics of familial searching. She can imagine policymakers limiting when and how law-enforcement agencies can use public databases such as GEDmatch.
Some such restrictions already exist. In California, for example, law-enforcement forensic databases can be used to find family members only in serious crimes where there is a risk to public safety, and the genealogical investigative team must be distinct from local detectives working on a case.
Erlich contends that technology can protect people from unwanted searches. Consumer genetics firms often allow customers to download their data and post them on third-party databases such as GEDmatch. Erlich says that consumer genetics companies could include digital signatures with these files, allowing GEDmatch to differentiate them from crime-scene profiles uploaded by investigators, shielding consumers from searches.
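A rough sketch of how such tagging might work is shown below. The article does not describe a concrete scheme, so everything here is an assumption. In particular, the example uses an HMAC from Python’s standard library, which relies on a secret key shared between the testing company and the database; a true digital signature would use a public/private key pair so that the database never holds the signing key. The function names and the sample file contents are hypothetical.

```python
import hashlib
import hmac


def sign_profile(raw_file: bytes, company_key: bytes) -> str:
    """Return a tag the testing company would ship alongside the raw data file."""
    return hmac.new(company_key, raw_file, hashlib.sha256).hexdigest()


def is_company_issued(raw_file: bytes, tag: str, company_key: bytes) -> bool:
    """Check that the uploaded file matches the tag issued by the company."""
    expected = sign_profile(raw_file, company_key)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, tag)


key = b"hypothetical-shared-secret"
genuine = b"rs123\tAG\nrs456\tCC\n"
tag = sign_profile(genuine, key)

print(is_company_issued(genuine, tag, key))                # True
print(is_company_issued(genuine + b"rs789\tTT\n", tag, key))  # False
```

A profile assembled from crime-scene material would carry no valid tag, so a database could route it differently or decline the search, depending on its policy.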
However, Rohlfs notes that GEDmatch has so far made no effort to discourage investigations. “So it’s not obvious to me that GEDmatch wants to protect against that use,” she says.
Rogers says GEDmatch has no plans to limit law-enforcement access to the site (after the Golden State Killer case emerged, the site updated its terms of service to explicitly warn users that investigators could use the site), and he worries that regulating use will interfere with the site’s raison d’être: helping people find relatives. “I don’t think anyone’s privacy is being violated,” he says. “People should be able to control their own DNA and not the government.”
Colleen Fitzpatrick, co-executive director of the DNA Doe Project in Sebastopol, California, which has used familial searching to help solve a number of missing-person cases, says the information that investigators glean from these searches isn’t so different from other leads — and therefore shouldn’t be treated any differently.
“Just about anything we do in life reveals information about others,” she says. “Reporting that my brother came home with a black eye the night of a fight in the neighbourhood bar can be just as revealing to the right parties as posting a photograph labelled with the name of my grandmother on Facebook.”
Nature 562, 315-316 (2018)
1. Erlich, Y., Shor, T., Pe'er, I. & Carmi, S. Science https://doi.org/10.1126/science.aau4832 (2018).
2. Kim, J., Edge, M. D., Algee-Hewitt, B. F. B., Li, J. Z. & Rosenberg, N. A. Cell https://doi.org/10.1016/j.cell.2018.09.008 (2018).
3. Gymrek, M., McGuire, A. L., Golan, D., Halperin, E. & Erlich, Y. Science 339, 321–324 (2013).