Introduction

Alongside their publication of a Sumatran orang-utan (Pongo abelii) draft genome assembly in 2011, The Orang-utan Genome Consortium re-sequenced the genomes of ten additional unrelated wild-caught individuals – ostensibly five Sumatran and five Bornean (P. pygmaeus) orang-utans – using short-read Illumina sequencing1. Their manuscript, and its accompanying 297 Gb of sequence data, has since been cited more than 500 times. During the course of our own studies, however, we noted several inconsistencies between the data made available in the NCBI Sequence Read Archive and their accompanying metadata and descriptors in the paper.

We found no record of a sample with the identifier “KB5543”, for example, in the Frozen Zoo repository, the reported source of a sample attributed to the orang-utan, Louis. The closest match in their database to this ID was for another sample, “15543”, which derived from a different individual. We also observed that the identifier “KB9528”, as reported for the orang-utan Baldy in the manuscript’s Tables S4-1, was catalogued as a sample from an “African pig” – though, in a supplemental file, it was correctly denoted as KB9258, which derived from another orang-utan. The sample identifier “SB550”, as reported for the orang-utan Doris, appeared to reference a studbook number (i.e. “SB”) that belonged to another sequenced orang-utan, Sibu. The sex reported for five individuals also contradicted their known sexes, as had been recorded in contemporary studbook records2, plus differed from the sexes assigned to each sample in Locke et al.’s supplementary data.

Thus, we were driven to reconsider the identities of each genome’s source individual, through re-analysis of the published data combined with new molecular studies. Herein, we report that nine of the ten samples and/or published genomes were erroneously labelled in the original Nature publication. We present the corrected data and discuss the implications for other published works.

Methods

We first mapped the re-sequencing reads of all 10 Locke et al. whole genomes, plus those previously published from 27 conspecifics3,4, to the latest iteration of the (female) orang-utan reference genome (ponAbe35). To this, we had concatenated a recent orang-utan Y chromosome assembly6. Using the idxtools function in samtools 1.147, we inferred sex by comparing the ratios to which sequence reads were mapped against the X and Y chromosomes. Following two rounds of bootstrapped base recalibration, we then jointly called genotypes with GATK 4.1.8.08, all as previously described9. We randomly sampled 1,000,000 biallelic autosomal SNPs with no missing genotypes and ≥5% minor allele frequency (MAF), pruned linked loci in PLINK10 (–indep-pairwise 50 10 0.1), and assigned populations in ADMIXTURE 1.3911 as supervised with provenance data reported for the conspecifics3,4 (K = 3).

Additionally, we sampled and assayed eight orang-utans known to be first, second or third-degree relatives of seven of those purportedly sequenced by Locke et al., using the Illumina iScan Multi-Ethnic Global Array, also as previously described12. The reproduction of those seven, and thus these known relationships, had been contemporaneously recorded2 (Fig. 1). To convert the microarray intensity data to variant calls, we mapped the probe flank sequences to ponAbe3 (using --fasta-flank) and exported genotypes (--sam-flank) with the bcftoools7 plugin gtc2vcf (https://github.com/freeseek/gtc2vcf), subject to the following filter parameters: meanR_AB < 0.2, meanR_AA < 0.2, meanR_BB < 0.2, Cluster_Sep < 0.35, meanTHETA_AA > 0.3, meanTHETA_BB < 0.7, meanTHETA_AB < 0.3 and > 0.7, devTHETA_AA > 0.025, devTHETA_AB ≥ 0.07, devTHETA_BB > 0.025 and GenTrain_Score < 0.7. We then re-genotyped all 37 whole genomes at each of the resulting loci, as previously described9; merged these with the microarray genotype VCF, and LD-pruned and MAF-filtered biallelic SNPs precisely as aforementioned. With a view to avoiding the spurious kinship associations that typify highly structured data, we then bootstrapped ADMIXTURE’s cross-validation procedure to infer the most suitable K (trialling 1 through 10) before estimating kinship coefficients (Φij) in REAP13.

Fig. 1
figure 1

Genogram depicting relationships between orang-utans sampled by Locke et al. and known relatives assayed in this study. Orang-utans used in the kinship analysis (Table 1) are circled. The species affiliations of uncircled orang-utans reflect those purported by contemporary studbook records.

We adopted a tri-fold method to confirm each sample’s identity. Identities were first inferred with an exclusionary approach, from computed (versus known and reported) sex and species. Each was then confirmed, where available, when observed kinship coefficients resembled those expected from known relationships. Third, we reviewed the historical biomaterial records retained by the Frozen Zoo, the original source of the samples, plus notes from the Laboratory Information System (LIMS) retained at Washington University in Saint Louis, where the samples were originally sequenced. Identity was assigned to a given sample when all these factors concorded.

Results

We observed X:Y sequence ratios in known males to range from 0.369–0.569 (mean 0.476) and in females from 4.114 to 5.827 (mean 4.973). From this, we interpreted that sex had been incorrectly assigned by Locke et al. to the sample SAMN00007170. This sample was inferred to be female (4.170) and thus cannot have derived from Baldy as purported. The species of each sample was correctly reported, though we inferred that the sample SAMN00007170 derived from a Tapanuli orang-utan (Fig. 2). This species was not formally described until 20174.

Fig. 2
figure 2

Ancestry proportions of the Locke et al. orang-utans, as supervised with provenance data from 27 conspecifics sampled across the natural range of the genus3,4. Sample SAMN000007170 derives from a Tapanuli orang-utan.

Kinship analyses linked one or more known relatives to seven of the ten samples sequenced by Locke et al. Specifically, we linked Billy to his granddaughter, Batari (observed/expected kinship coefficients: 0.116/0.125); Dumplin to her parents, Dennis (0.162/0.25) and Dolly (0.19/0.25); Sibu to her son, Tengku (0.246/0.25); Louis to his grandson, Max (0.082/0.125); Likoe to his great granddaughter, Menari (0.063/0.063); and Bubbles to her son, Oliver (0.233/0.25), daughter, Bella (0.242/0.25) and granddaughter, Nairi/Nadira (0.036/0.125). Admixture (K = 3) and kinship were inferred from a total of 1,132,210 biallelic SNPs.

We assigned identity to the remaining three samples as molecular sex, sample and LIMS records were all concordant, and as the relatedness data had excluded other possible candidates. Table 1 corrects the record as originally presented in Nature.

Table 1 Corrected identities and metadata for the samples sequenced and published by Locke et al.1, as deposited in the NCBI BioSample database.

Discussion

Because Locke et al. focused solely on genome content, their discrepancies have no bearing on the accuracy of their data or their manuscript’s published findings. These errors have had considerable impact on other studies that utilized the published data, however, particularly those dependent on using data from known individuals. Three of our co-authors (GLB, EDF, AK) write from first-hand experience: reliant on tables and metadata from the original Nature publication, we came perilously close to incorrectly reporting that Baldy, a male orang-utan who lived at the Sacramento Zoo, was the first of the recently described Tapanuli species to be captured and exported from a wild population – a full five decades before his species’ formal description. On the contrary, this dubious honour belongs to Bubbles, a female orang-utan who lived at the San Diego Zoo. Though beyond the scope of our manuscript, the implications of this switch have not escaped our attention: principally, that Bubbles produced eight Sumatran x Tapanuli hybrid descendants, who were previously thought to be Sumatran. The genetic integrity of the captive population is therefore unexpectedly compromised, as we present in detail in a manuscript that is currently under review.

Though we eventually caught these errors, others did not. Mattle-Greminger et al. (2018) reproduced eight erroneous sample identities in their paper, meaning each of the genomes they analysed were from different animals than reported14. Sudmant et al. (2013) reproduced seven such errors, thus also misattributing samples15. Neither Ma et al. (2013) nor Beeravolu et al. (2018) recognized that the sample identities were wrong, though as they reported only sample IDs (versus animal identities), no corrections to their manuscripts are warranted16,17. As sample identities are normally only reported in supplemental data – which is not always indexed by search engines – we cannot easily ascertain the full extent to which papers citing Locke et al. have reproduced these errors.

Given these findings and implications, we have corrected the samples’ identities in the NCBI BioSample database. Tables detailing the revisions made are included in Supplementary File 1. We respectfully ask that those utilizing these updated identities cite this article in Scientific Data, in addition to the Correction concurrently published in Nature.