The Sumatran orang-utan (Pongo abelii) reference genome was first published in 2011, in conjunction with ten re-sequenced genomes from unrelated wild-caught individuals. Together, these published data have been utilized in almost all great ape genomic studies, plus in much broader comparative genomic research. Here, we report that the original sequencing Consortium inadvertently switched nine of the ten samples and/or resulting re-sequenced genomes, erroneously attributing eight of these to the wrong source individuals. Among them is a genome from the recently identified Tapanuli (P. tapanuliensis) species: thus, this genome was sequenced and published a full six years prior to the species’ description. Sex was wrongly assigned to five known individuals; the numbers in one sample identifier were swapped; and the identifier for another sample most closely resembles that of a sample from another individual entirely. These errors have been reproduced in countless subsequent manuscripts, with noted implications for studies reliant on data from known individuals.
Alongside their publication of a Sumatran orang-utan (Pongo abelii) draft genome assembly in 2011, The Orang-utan Genome Consortium re-sequenced the genomes of ten additional unrelated wild-caught individuals – ostensibly five Sumatran and five Bornean (P. pygmaeus) orang-utans – using short-read Illumina sequencing1. Their manuscript, and its accompanying 297 Gb of sequence data, has since been cited more than 500 times. During the course of our own studies, however, we noted several inconsistencies between the data made available in the NCBI Sequence Read Archive and their accompanying metadata and descriptors in the paper.
We found no record of a sample with the identifier “KB5543”, for example, in the Frozen Zoo repository, the reported source of a sample attributed to the orang-utan, Louis. The closest match in their database to this ID was for another sample, “15543”, which derived from a different individual. We also observed that the identifier “KB9528”, as reported for the orang-utan Baldy in the manuscript’s Tables S4-1, was catalogued as a sample from an “African pig” – though, in a supplemental file, it was correctly denoted as KB9258, which derived from another orang-utan. The sample identifier “SB550”, as reported for the orang-utan Doris, appeared to reference a studbook number (i.e. “SB”) that belonged to another sequenced orang-utan, Sibu. The sex reported for five individuals also contradicted their known sexes, as had been recorded in contemporary studbook records2, plus differed from the sexes assigned to each sample in Locke et al.’s supplementary data.
Thus, we were driven to reconsider the identities of each genome’s source individual, through re-analysis of the published data combined with new molecular studies. Herein, we report that nine of the ten samples and/or published genomes were erroneously labelled in the original Nature publication. We present the corrected data and discuss the implications for other published works.
We first mapped the re-sequencing reads of all 10 Locke et al. whole genomes, plus those previously published from 27 conspecifics3,4, to the latest iteration of the (female) orang-utan reference genome (ponAbe35). To this, we had concatenated a recent orang-utan Y chromosome assembly6. Using the idxtools function in samtools 1.147, we inferred sex by comparing the ratios to which sequence reads were mapped against the X and Y chromosomes. Following two rounds of bootstrapped base recalibration, we then jointly called genotypes with GATK 4.1.8.08, all as previously described9. We randomly sampled 1,000,000 biallelic autosomal SNPs with no missing genotypes and ≥5% minor allele frequency (MAF), pruned linked loci in PLINK10 (–indep-pairwise 50 10 0.1), and assigned populations in ADMIXTURE 1.3911 as supervised with provenance data reported for the conspecifics3,4 (K = 3).
Additionally, we sampled and assayed eight orang-utans known to be first, second or third-degree relatives of seven of those purportedly sequenced by Locke et al., using the Illumina iScan Multi-Ethnic Global Array, also as previously described12. The reproduction of those seven, and thus these known relationships, had been contemporaneously recorded2 (Fig. 1). To convert the microarray intensity data to variant calls, we mapped the probe flank sequences to ponAbe3 (using --fasta-flank) and exported genotypes (--sam-flank) with the bcftoools7 plugin gtc2vcf (https://github.com/freeseek/gtc2vcf), subject to the following filter parameters: meanR_AB < 0.2, meanR_AA < 0.2, meanR_BB < 0.2, Cluster_Sep < 0.35, meanTHETA_AA > 0.3, meanTHETA_BB < 0.7, meanTHETA_AB < 0.3 and > 0.7, devTHETA_AA > 0.025, devTHETA_AB ≥ 0.07, devTHETA_BB > 0.025 and GenTrain_Score < 0.7. We then re-genotyped all 37 whole genomes at each of the resulting loci, as previously described9; merged these with the microarray genotype VCF, and LD-pruned and MAF-filtered biallelic SNPs precisely as aforementioned. With a view to avoiding the spurious kinship associations that typify highly structured data, we then bootstrapped ADMIXTURE’s cross-validation procedure to infer the most suitable K (trialling 1 through 10) before estimating kinship coefficients (Φij) in REAP13.
We adopted a tri-fold method to confirm each sample’s identity. Identities were first inferred with an exclusionary approach, from computed (versus known and reported) sex and species. Each was then confirmed, where available, when observed kinship coefficients resembled those expected from known relationships. Third, we reviewed the historical biomaterial records retained by the Frozen Zoo, the original source of the samples, plus notes from the Laboratory Information System (LIMS) retained at Washington University in Saint Louis, where the samples were originally sequenced. Identity was assigned to a given sample when all these factors concorded.
We observed X:Y sequence ratios in known males to range from 0.369–0.569 (mean 0.476) and in females from 4.114 to 5.827 (mean 4.973). From this, we interpreted that sex had been incorrectly assigned by Locke et al. to the sample SAMN00007170. This sample was inferred to be female (4.170) and thus cannot have derived from Baldy as purported. The species of each sample was correctly reported, though we inferred that the sample SAMN00007170 derived from a Tapanuli orang-utan (Fig. 2). This species was not formally described until 20174.
Kinship analyses linked one or more known relatives to seven of the ten samples sequenced by Locke et al. Specifically, we linked Billy to his granddaughter, Batari (observed/expected kinship coefficients: 0.116/0.125); Dumplin to her parents, Dennis (0.162/0.25) and Dolly (0.19/0.25); Sibu to her son, Tengku (0.246/0.25); Louis to his grandson, Max (0.082/0.125); Likoe to his great granddaughter, Menari (0.063/0.063); and Bubbles to her son, Oliver (0.233/0.25), daughter, Bella (0.242/0.25) and granddaughter, Nairi/Nadira (0.036/0.125). Admixture (K = 3) and kinship were inferred from a total of 1,132,210 biallelic SNPs.
We assigned identity to the remaining three samples as molecular sex, sample and LIMS records were all concordant, and as the relatedness data had excluded other possible candidates. Table 1 corrects the record as originally presented in Nature.
Because Locke et al. focused solely on genome content, their discrepancies have no bearing on the accuracy of their data or their manuscript’s published findings. These errors have had considerable impact on other studies that utilized the published data, however, particularly those dependent on using data from known individuals. Three of our co-authors (GLB, EDF, AK) write from first-hand experience: reliant on tables and metadata from the original Nature publication, we came perilously close to incorrectly reporting that Baldy, a male orang-utan who lived at the Sacramento Zoo, was the first of the recently described Tapanuli species to be captured and exported from a wild population – a full five decades before his species’ formal description. On the contrary, this dubious honour belongs to Bubbles, a female orang-utan who lived at the San Diego Zoo. Though beyond the scope of our manuscript, the implications of this switch have not escaped our attention: principally, that Bubbles produced eight Sumatran x Tapanuli hybrid descendants, who were previously thought to be Sumatran. The genetic integrity of the captive population is therefore unexpectedly compromised, as we present in detail in a manuscript that is currently under review.
Though we eventually caught these errors, others did not. Mattle-Greminger et al. (2018) reproduced eight erroneous sample identities in their paper, meaning each of the genomes they analysed were from different animals than reported14. Sudmant et al. (2013) reproduced seven such errors, thus also misattributing samples15. Neither Ma et al. (2013) nor Beeravolu et al. (2018) recognized that the sample identities were wrong, though as they reported only sample IDs (versus animal identities), no corrections to their manuscripts are warranted16,17. As sample identities are normally only reported in supplemental data – which is not always indexed by search engines – we cannot easily ascertain the full extent to which papers citing Locke et al. have reproduced these errors.
Given these findings and implications, we have corrected the samples’ identities in the NCBI BioSample database. Tables detailing the revisions made are included in Supplementary File 1. We respectfully ask that those utilizing these updated identities cite this article in Scientific Data, in addition to the Correction concurrently published in Nature.
No custom code was used to generate or process the data described in the manuscript.
Locke, D. P. et al. Comparative and demographic analysis of orang-utan genomes. Nature 469, 529–533 (2011).
Jones, M. L. Studbook of the orang utan, Pongo pygmaeus (Zoological Society of San Diego, 1980).
Prado-Martinez, J. et al. Great ape genetic diversity and population history. Nature 499, 471–475 (2013).
Nater, A. et al. Morphometric, behavioral, and genomic evidence for a new orangutan species. Curr. Biol. 27, 3487–3498.e10 (2017).
Kronenberg, Z. N. et al. High-resolution comparative analysis of great ape genomes. Science 360, eaar6343 (2018).
Cechova, M. et al. Dynamic evolution of great ape Y chromosomes. Proc. Natl. Acad. Sci. U.S. 117, 26273–26280 (2020).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra 1st edn (O’Reilly Media, 2020).
Banes, G. L. et al. Genomic targets for high-resolution inference of kinship, ancestry and disease susceptibility in orang-utans (Genus: Pongo). BMC Genomics 21, 873 (2020).
Purcell, S. et al. PLINK: a toolset for whole-genome association and population-based linkage analysis. Am. J. Hum. Genet. 81, 559–575 (2007).
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
Fountain, E. D. et al. Cross-species application of Illumina iScan microarrays for cost-effective, high-throughput SNP discovery. Front Ecol Evol 9, 629252 (2021).
Thornton, T. et al. Estimating kinship in admixed populations. Am. J. Hum. Genet. 91, 122–138 (2012).
Mattle-Greminger, M. P. et al. Genomes reveal marked differences in the adaptive evolution between orangutan species. Genome Biol. 19, 193 (2018).
Sudmant, P. H. et al. Evolution and diversity of copy number variation in the great ape lineage. Genome Res. 23, 1373–1382 (2013).
Ma, X. et al. Population genomic analysis reveals a rich speciation and demographic history of orang-utans (Pongo pygmaeus and Pongo abelii). PLOS ONE 8, e77175 (2013).
Beeravolu, C. R., Hickerson, M. J., Frantz, L. A. F. & Lohse, K. ABLE: blockwise site frequency spectra for inferring complex population histories and recombination. Genome Biol. 19, 145 (2018).
Banes, G. L., Fountain, E. D. & Karklus, A. Nine of ten samples were mistakenly switched by The Orang-utan Genome Consortium, Figshare, https://doi.org/10.6084/m9.figshare.c.5313095.v1 (2022).
This research was performed using the compute resources and assistance of the University of Wisconsin-Madison Center for High Throughput Computing (CHTC) in the Department of Computer Sciences. The CHTC is supported by UW-Madison, the Advanced Computing Initiative, the Wisconsin Alumni Research Foundation, the Wisconsin Institutes for Discovery, and the National Science Foundation, and is an active member of the Open Science Grid, which is supported by the National Science Foundation and the U.S. Department of Energy’s Office of Science. We thank Audubon Zoo, Birmingham Zoo, Cameron Park Zoo, the Center for Great Apes, Cheyenne Mountain Zoo, Cincinnati Zoo & Botanical Garden, Cleveland Metroparks Zoo, Columbus Zoo and Aquarium, Dallas Zoo, El Paso Zoo, Fort Wayne Children’s Zoo, Houston Zoo, Indianapolis Zoo, Kansas City Zoo, Little Rock Zoo, Louisville Zoo, Metro Richmond Zoo, Milwaukee County Zoo, Phoenix Zoo, Sedgwick County Zoo, Topeka Zoo, Toronto Zoo and ZooTampa at Lowry Park for providing samples from known descendants, and the Orangutan Species Survival Plan (SSP) for approval by recommendation to the SSP’s member institutions. We are especially grateful to Dr Cynthia Steiner, who provided metadata from the San Diego Zoo Institute for Conservation Research. This work was funded in part by the Arcus Foundation, the Eppley Foundation for Research, Inc., the Ronna Noel Charitable Trust, and the Institute of Museum and Library Services through National Leadership Grant MG-249168-OMS-21 (all to GLB). AK was supported by the Morris Animal Foundation. Research reported in this publication was also supported in part by the Office of the Director, National Institutes of Health, under award number P51OD011106 to the Wisconsin National Primate Research Center and conducted in part at a facility constructed with support from the Research Facilities Improvement Program under grant numbers RR15459-01 and RR020141-01. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Banes, G.L., Fountain, E.D., Karklus, A. et al. Nine out of ten samples were mistakenly switched by The Orang-utan Genome Consortium. Sci Data 9, 485 (2022). https://doi.org/10.1038/s41597-022-01602-0
This article is cited by