Genetic barcoding of museum eggshell improves data integrity of avian biological collections

Natural history collections are often plagued by missing or inaccurate metadata for collection items, particularly for specimens that are rare or difficult to verify. Avian eggshell in particular can be challenging to identify due to extensive morphological ambiguity among taxa. Species identifications can be improved using DNA extracted from museum eggshell; however, the suitability of current methods for use on small museum eggshell specimens has not been rigorously tested, hindering uptake. In this study, we compare three sampling methodologies to genetically identify 45 data-poor eggshell specimens, including a putatively extinct bird’s egg. Using an optimised drilling technique to retrieve eggshell powder, we demonstrate that sufficient DNA for molecular identification can be obtained from even the tiniest eggshells without significant alteration to the specimen’s appearance or integrity. This method proved superior to swabbing the external surface or sampling the interior; however, we also show that these methods can be viable alternatives. We then applied our drilling method to confirm that a purported clutch of Paradise Parrot eggs collected 40 years after the species’ accepted extinction date was falsely identified, laying to rest a 53-year-old ornithological controversy. Thus, even the smallest museum eggshells can offer new insights into old questions.

supernatant was then removed and discarded. Samples were then snap-frozen at -20°C. The final diameter of the blow-hole was measured (Table S1, Table S2).
Figure S1 | Photographs of the: A swabbing process, B drilling process, and C approximate amount of powder obtained from drilling (note that in some cases there was perhaps half this amount of powder). Photographs by Alicia Grealy.
Other observations. We experimented with drill bits of several sizes and shapes (Figure S2). Some of these drill bits tended to catch on any loose membrane and tear the eggshell away further; others did not allow the powder to be released easily from the bit into water. A moistened standard 0.8 mm twist drill bit (Figure S2d) proved the gentlest on every egg size while also catching and releasing powder with ease. We also experimented with various types of self-adhesive templates to reinforce the blow-hole prior to drilling and to provide a guide; however, all types of adhesive tended to tear away eggshell in an unpredictable manner, even when the adhesive was low-tack or archival. We therefore do not recommend using any adhesive on the eggshell.
Figure S2 | Various types of drill bits tested to sample eggshell powder: A Tapered, B Straight, C Carbide down-cut inlay, D Plain shank (of varying thickness, but the best was 0.8 mm), E Diamond wheel point taper, F Carbide grout bit, G High speed cutter. The asterisk indicates the best-performing drill bit.

DNA extraction of collection eggshell
All extraction steps were carried out in a designated ultra-clean facility at the ANU (Ecogenomics and Bioinformatics Laboratory), in an area separate from the sampling area. 200 μl of a digest buffer containing 2 mg/mL Proteinase K (Ambion) in 0.5 M EDTA (Invitrogen) was added to eggshell powder and internal samples. 1000 μl of the same digest buffer was added to the swab samples. Samples were incubated with shaking at 1000 rpm overnight at 55°C in a Thermoshaker (Eppendorf). Digests were centrifuged at maximum speed for 10 minutes in a bench-top centrifuge to pellet cell debris. For the powder and internal samples, the supernatant was removed and placed in a clean 15 mL Falcon tube along with 4 mL of Glocke and Meyer (2017) binding buffer (i.e., 2 M Guanidine Hydrochloride, 70% isopropanol, 0.05% Tween-20, and 1:250 v/v Qiagen pH indicator in Ultra-pure water). For the swab samples, the supernatant was transferred to a Vivaspin 500 centrifugal concentrator column (MWCO 30 kDa) and concentrated to a volume of 50 μl by centrifuging at 15,000 x g, discarding the flow-through. The 50 μl concentrated digest was transferred to a 1.5 mL Safe-lock Lo-Bind Eppendorf tube and combined with 650 μl of Glocke and Meyer (2017) binding buffer. The binding-buffer/digest solution was passed, 700 μl at a time, through a MinElute PCR purification silica spin column (Qiagen) by centrifugation for 1 minute at 13,000 rpm, discarding the flow-through. 750 μl of PE buffer (Qiagen) was passed through the column twice by centrifugation for 1 minute at 13,000 rpm, discarding the flow-through each time. The silica membrane was dried by centrifuging the column for a further 1 minute at 13,000 rpm. The column was placed in a clean 1.5 mL Lo-Bind Eppendorf tube with the lid cut off, 15 μl of EB buffer (Qiagen) was added to the silica membrane, and the column was incubated at 37°C for 5 minutes. DNA was eluted by centrifuging for 1 minute at 10,000 rpm.
An additional 15 μl of EB buffer was then passed through the column as above for a total of 30 μl of eluate. Finally, the eluate was passed back through the column as above after a further 5 minutes of incubation at 37°C. The eluate was transferred to a clean 0.5 mL Safe-lock Lo-Bind Eppendorf tube. 1.5 μl of 1% TE-Tween-20 was added to the extract, which was then stored at -20°C. DNA-free extraction controls were included.
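The number of spin-column loads implied by the volumes above can be checked with a short calculation. This is only a sketch: it assumes the full 200 μl of digest is recovered as supernatant (the text does not state the recovered volume), and the function name is illustrative rather than part of the protocol.

```python
import math


def column_loads(digest_ul: float, binding_buffer_ul: float,
                 load_ul: float = 700.0) -> int:
    """Number of passes needed to load a digest/binding-buffer mix
    onto a single spin column, load_ul at a time."""
    total_ul = digest_ul + binding_buffer_ul
    return math.ceil(total_ul / load_ul)


# Powder and internal samples: assumed 200 ul supernatant + 4 mL binding buffer.
powder_loads = column_loads(200, 4000)

# Swab samples: 50 ul concentrated digest + 650 ul binding buffer.
swab_loads = column_loads(50, 650)

print(powder_loads, swab_loads)
```

Under these assumptions the powder and internal samples need six loads per column, whereas the concentrated swab digests fit in a single load.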

SI 1.4 Amplification and sequencing of mini-barcodes
PCR reaction set-up was carried out in a designated ultra-clean facility at the ANU (Ecogenomics and Bioinformatics Laboratory) in a designated UV hood inside a room physically separate from the DNA extraction and sampling room. DNA extracts were amplified with two avian-specific mitochondrial 12S rRNA mini-barcodes: 12SAC with a 53 bp insert (Forward 5'-CTGGGATTAGATACCCCACTAT-3', Reverse 5'-GTTTTAAGCGTTTGTGCTCG-3') and 12SAH with a 232 bp insert (Forward 5'-CTGGGATTAGATACCCCACTAT-3', Reverse 5'-CCTTGACCTGTCTTGTTAGC-3') (Cooper 1994), following the methods described by Grealy et al. (2019). Each PCR contained reagents at final concentrations of: 1.2 mg/ml BSA, 1X Gold PCR buffer (Applied Biosystems), 2.5 mM MgCl2, 0.25 mM dNTPs, 1.25 U AmpliTaq Gold DNA polymerase, 0.12X SYBR Green, 0.4 μM of each IDT primer, and 2 μl DNA in a final reaction volume of 25 μl. Thermal cycling and all post-PCR procedures were carried out in another physically separated, post-PCR laboratory. Thermal cycling conditions were: 95°C for 10 min, followed by 50 cycles of 95°C for 30 sec, 54°C (12SAC) or 57°C (12SAH) for 30 sec, 72°C for 45 sec, and a final extension of 72°C for 10 minutes. DNA-free PCR negative controls were included, as was a positive control (an extract from fresh tissue of Chalcites minutillus; note that the DNA for the positive control was added in a separate facility to eliminate the possibility of cross-contamination from this sample).
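For readers who want the cycling conditions in machine-readable form, the sketch below encodes them as a list of steps per amplicon and sums the programmed hold time. The data layout and function names are illustrative assumptions, and the total ignores ramp times, which depend on the instrument.

```python
# Annealing temperatures (deg C) for each mini-barcode, as stated in the text.
ANNEAL_C = {"12SAC": 54, "12SAH": 57}
CYCLES = 50


def cycling_program(amplicon: str):
    """Return the program as (label, temp_C, seconds, repeats) tuples."""
    anneal = ANNEAL_C[amplicon]
    return [
        ("initial denaturation", 95, 600, 1),
        ("denaturation", 95, 30, CYCLES),
        ("annealing", anneal, 30, CYCLES),
        ("extension", 72, 45, CYCLES),
        ("final extension", 72, 600, 1),
    ]


def hold_time_minutes(program) -> float:
    """Total programmed hold time in minutes (ramp times not included)."""
    return sum(seconds * repeats for _, _, seconds, repeats in program) / 60


for name in ANNEAL_C:
    print(name, hold_time_minutes(cycling_program(name)))
```

Both amplicons share the same step durations, so the programmed hold time is identical; only the annealing temperature differs.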
For each sample, the longest amplicon that amplified successfully was then amplified in triplicate using fusion primers containing Illumina flow-cell binding sites, followed by a custom sequencing adapter and unique multiplexing index upstream of the gene-specific primer. The same PCR reaction conditions were employed as above, but 5 μl of DNA extract was used per reaction, and reactions were performed in duplicate. Duplicate reactions were pooled and purified using SeraPure beads at a 1.6X bead ratio, following the manufacturer's instructions, and eluting in 20 μl Ultra-pure water. The DNA concentration of 1 μl of each amplicon was quantified using the Qubit fluorometer (Invitrogen) HiSense kit, following the manufacturer's instructions. Amplicons were pooled in approximately equimolar concentrations and purified again as above, eluting in 100 μl EB buffer (Qiagen). The molarity of the final library was determined by quantifying 5 μl with the Qubit HiSense kit (Invitrogen), following the manufacturer's instructions, and by running 10 μl on the LabChip GXII fragment analyser using the 5K chip. The library was diluted to 2 nM and sequenced on Illumina's MiSeq (single end, Nano 300 cycle v2 kit, no indexing) using a spiked-in custom sequencing primer at the BRF based at ANU.
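The conversion from a fluorometric mass concentration and a fragment size to molarity, and the subsequent dilution to 2 nM, can be sketched as follows. The calculation uses the common approximation of 660 g/mol per base pair of double-stranded DNA; the example concentration, fragment length, and final volume are hypothetical values chosen for illustration, not measurements from this study.

```python
def dsdna_nanomolar(conc_ng_per_ul: float, length_bp: float) -> float:
    """Convert a dsDNA mass concentration (ng/ul) to molarity (nM),
    assuming an average of 660 g/mol per base pair."""
    return conc_ng_per_ul * 1e6 / (660.0 * length_bp)


def dilution_volumes(stock_nm: float, target_nm: float, final_ul: float):
    """Return (library_ul, diluent_ul) from C1 * V1 = C2 * V2."""
    if target_nm > stock_nm:
        raise ValueError("stock is already more dilute than the target")
    library_ul = target_nm * final_ul / stock_nm
    return library_ul, final_ul - library_ul


# Hypothetical example: 1.32 ng/ul library with a 400 bp mean fragment size.
stock_nm = dsdna_nanomolar(conc_ng_per_ul=1.32, length_bp=400)
library_ul, diluent_ul = dilution_volumes(stock_nm, target_nm=2.0, final_ul=20.0)
print(round(stock_nm, 2), round(library_ul, 2), round(diluent_ul, 2))
```

The fragment length should include the fusion-primer sequence as well as the insert, which is why a sizing measurement such as the fragment analyser trace is needed alongside the mass concentration.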
For each 12SAC read, taxonomy was initially assigned as follows:
- >99 to 100% sequence similarity to reference across 100% of the query: genus
- >96 to <99% sequence similarity to reference across 100% of the query: family
- 90 to <96% sequence similarity to reference across 100% of the query: order
For each 12SAH read, taxonomy was initially assigned as follows:
- >99 to 100% sequence similarity to reference across 100% of the query: species
- >96 to <99% sequence similarity to reference across 100% of the query: genus
- >95 to <96% sequence similarity to reference across 100% of the query: family
- 90 to <95% sequence similarity to reference across 100% of the query: order
These cut-offs were based on alignments made from every available avian 12SAC and 12SAH sequence on GenBank as of December 2019 for 51 Australian families. Sequences were aligned by family and the intra- and inter-specific pairwise identity was calculated by genus in MEGAX (Kumar et al. 2018; Stecher et al. 2020). The inter-generic pairwise identity was calculated by family in the same way. These identities were averaged to obtain the cut-offs above. Results are summarised in Table S2.
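The initial assignment rules above can be expressed as a small lookup. This is a minimal sketch rather than the pipeline used in the study. The stated ranges leave values exactly on a boundary (for example 96% or 99%) unassigned, so the sketch assumes such values fall to the lower-resolution rank; that choice, and the function and variable names, are assumptions.

```python
# (exclusive lower bound on % identity, rank), highest resolution first.
CUTOFFS = {
    "12SAC": [(99.0, "genus"), (96.0, "family")],
    "12SAH": [(99.0, "species"), (96.0, "genus"), (95.0, "family")],
}
ORDER_MIN = 90.0  # inclusive lower bound for an order-level assignment


def initial_rank(amplicon: str, pct_identity: float, query_cover: float):
    """Return the initial taxonomic rank for one BLAST hit, or None.

    Hits must span 100% of the query. Identities exactly on a boundary
    are assigned to the lower-resolution rank (an assumption, since the
    published ranges do not cover them).
    """
    if query_cover < 100.0 or pct_identity < ORDER_MIN:
        return None
    for lower_bound, rank in CUTOFFS[amplicon]:
        if pct_identity > lower_bound:
            return rank
    return "order"


print(initial_rank("12SAC", 100.0, 100.0))
print(initial_rank("12SAH", 95.5, 100.0))
```

Because 12SAC is the shorter mini-barcode, its best possible initial assignment is genus, whereas 12SAH can reach species.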
Next, IDs were downgraded to the last common ancestor if:
- There was more than one equally top-scoring hit to different taxa
- The closest Australian relative of the match taxon was not also represented in GenBank for that locus
- The second top hit did not share the most recent common ancestor with the top hit compared with lower hits
- More than one plausible read was assigned to a different taxon for a sample, a plausible read being one where the appearance of the egg is consistent with the taxon
Finally, IDs were upgraded if:
- Only one species exists within a genus or one genus within a family
- There is only one Australian taxon within a given taxonomic level
- The egg clearly matches in appearance only one taxon
- The most abundant read is 10X more abundant than any other read in that sample
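The first downgrade rule (tied top-scoring hits to different taxa) can be illustrated with a last-common-ancestor calculation over taxonomic lineages. This sketch covers only that one rule; the lineage representation, the four-rank scheme, and the function names are assumptions, and the example lineages are for illustration rather than results from the study.

```python
RANKS = ("order", "family", "genus", "species")


def last_common_ancestor(lineages):
    """Return (rank, name) of the deepest rank shared by all lineages,
    or None if they differ even at the order level.

    Each lineage is a tuple of names ordered as in RANKS.
    """
    shared = None
    for rank, names in zip(RANKS, zip(*lineages)):
        if len(set(names)) != 1:
            break
        shared = (rank, names[0])
    return shared


def resolve_top_hits(hits):
    """hits is a list of (score, lineage). All hits tied for the top
    score are collapsed to their last common ancestor."""
    best = max(score for score, _ in hits)
    tied = [lineage for score, lineage in hits if score == best]
    return last_common_ancestor(tied)


# Illustrative lineages only.
paradise = ("Psittaciformes", "Psittaculidae", "Psephotellus",
            "Psephotellus pulcherrimus")
golden = ("Psittaciformes", "Psittaculidae", "Psephotellus",
          "Psephotellus chrysopterygius")
print(resolve_top_hits([(100.0, paradise), (100.0, golden)]))
```

With a single top-scoring hit the function returns that hit's species unchanged, so the downgrade only takes effect when the top score is shared.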

SI 1.6 Statistical analyses
Amplification, reads passing filter, and a plausible ID were each scored for every extract as 1 (present) or 0 (absent) (see Table S2). For each extract type and amplicon, Spearman's non-parametric correlations were conducted to examine the relationship between each of these binary response variables and the size and thickness of eggshell. None of the correlations was statistically significant (p>0.05). Pairwise non-parametric Mann-Whitney U tests with a Bonferroni correction were conducted to test whether the plausibility of the ID was related to the extract type and/or amplicon (see Fig. 2 of the main text). Statistical tests were performed in Past3.23 (Hammer et al. 2001).
Note: all BLAST hits, including the most abundant sequence variant, were examined but have not been detailed for brevity (full details can be found in the BLAST files on DataDryad). The number of less abundant reads is reported. Specimens yielding no ID either returned no BLAST hits above 90% identity across 100% of the query, or returned only one taxon from the swab that the egg morphology did not match. These samples probably had very degraded DNA and only amplified contamination.
Table S3 | Intra- and inter-specific and generic pairwise identities within various avian families for 12SAC and 12SAH mini-barcodes.
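The pairwise comparison described in SI 1.6 can be sketched in plain Python as a Mann-Whitney U test followed by a Bonferroni adjustment. This is not the Past3 implementation: it assumes a two-sided test with a tie-corrected normal approximation and no continuity correction, so p-values may differ slightly from those reported, and the group names and 0/1 scores in the example are hypothetical.

```python
import math
from itertools import combinations


def midranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-sided p) using a tie-corrected normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every observation identical: no evidence of a shift
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2))


def bonferroni_pairwise(groups):
    """Bonferroni-adjusted p-values for all pairs in a dict of name -> scores."""
    pairs = list(combinations(groups, 2))
    return {
        (a, b): min(1.0, mann_whitney_u(groups[a], groups[b])[1] * len(pairs))
        for a, b in pairs
    }


# Hypothetical 0/1 plausibility scores for three extract types.
scores = {"powder": [1, 1, 1, 1], "swab": [0, 0, 0, 0], "internal": [1, 0, 1, 0]}
print(bonferroni_pairwise(scores))
```

With binary outcomes almost every observation is tied, which is why the tie correction matters here; an exact test or a dedicated statistics package would be preferable for small samples.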

SI 2.0 Supplementary Results
Table S4 | The lowest common ancestor for top BLAST hits of each sequence variant in samples that had more than one unique filtered read. The percentage of the total filtered reads is provided. For downstream analysis, the most abundant read was considered to be the molecular ID. However, as discussed, swabs and internal samples in particular yield many sequences of conflicting identity, which supports the idea that they are not reliable methods for sampling eggshell, even if the most abundant read returns a plausible ID. Complete BLAST files can be found on DataDryad.
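The per-sample summary behind Table S4 (percentage of filtered reads per variant, with the most abundant read taken as the molecular ID) can be sketched as below. The input format, a list with one assigned label per filtered read, and the function name are assumptions; the 10X check mirrors the upgrade rule given in the taxonomy assignment criteria, and the labels and counts are hypothetical.

```python
from collections import Counter


def summarise_variants(read_labels):
    """Summarise the filtered reads of one sample.

    Returns (top_label, percentages, dominant), where dominant is True
    when the most abundant label is at least 10 times as abundant as
    any other label in the sample.
    """
    if not read_labels:
        raise ValueError("sample has no filtered reads")
    counts = Counter(read_labels)
    total = sum(counts.values())
    ranked = counts.most_common()
    percentages = {label: 100.0 * n / total for label, n in ranked}
    top_label, top_count = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return top_label, percentages, top_count >= 10 * runner_up


# Hypothetical sample: labels and counts are illustrative only.
reads = ["taxon A"] * 90 + ["taxon B"] * 8 + ["taxon C"] * 2
print(summarise_variants(reads))
```

A sample in which the top label fails the 10X check would be a candidate for the downgrade rules rather than an upgrade.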
Figure S3 | Agreement among the three extract types in cases where all three produced an identification.