Data on the diets of Salish Sea harbour seals from DNA metabarcoding

Marine trophic ecology data are in high demand as natural resource agencies increasingly adopt ecosystem-based management strategies that account for complex species interactions. Harbour seal (Phoca vitulina) diet data are of particular interest because the species is an abundant predator in the northeast Pacific Ocean and Salish Sea ecosystem that consumes Pacific salmon (Oncorhynchus spp.). A multi-agency effort was therefore undertaken to produce harbour seal diet data on an ecosystem scale using, 1) a standardized set of scat collection and analysis methods, and 2) a newly developed DNA metabarcoding diet analysis technique designed to identify prey species and quantify their relative proportions in seal diets. The DNA-based dataset described herein contains records from 4,625 harbour seal scats representing 52 haulout sites, 7 years, 12 calendar months, and a total of 11,641 prey identifications. Prey morphological hard parts analyses were conducted alongside, resulting in corresponding hard parts data for 92% of the scat DNA samples. A custom-built prey DNA sequence database containing 201 species (192 fishes, 9 cephalopods) is also provided. Measurement(s) diet Technology Type(s) DNA metabarcoding diet analysis Sample Characteristic - Organism Phoca vitulina Measurement(s) diet Technology Type(s) DNA metabarcoding diet analysis Sample Characteristic - Organism Phoca vitulina Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.19077809


Background & Summary
In recent years the fisheries and ecosystem modelling communities have expressed increased interest in harbour seal (Phoca vitulina) diet information to help inform prey consumption estimates and ecosystem models [1][2][3] . Harbour seals are a common pinniped found in the northeast Pacific Ocean and are particularly abundant in the inland marine waters of southern British Columbia (Canada) and Washington state (USA) -an area defined as the Salish Sea 4,5 . Growth in the population of seals since their protection in 1972 has led to concerns about their potential predatory impacts on fishes of conservation concern such as Pacific salmon (Oncorhynchus spp.) 6 . In addition to being known salmon predators, the inverse population trend between seals and salmon species has led to speculation that seals may be a causal factor influencing salmon populations 7,8 . More specifically, the marine survival of Chinook (O. tshawytscha), coho (O. kisutch), and steelhead (O. mykiss) salmon in the region decreased dramatically during the same time period when harbour seals grew exponentially 7,9,10 . However, such correlative evidence is generally not sufficient to mandate management actions. Additional data, such as detailed predator diet information, is needed to establish a potential ecological link between seals and fish populations 2,11 .
Although harbour seal diet in the region has been studied for nearly a century 12 , methodological limitations have prevented accurate quantification of the salmon proportion of seal diet. Previous studies relied primarily on morphological identification of hard remains (e.g., bones) in stomachs or scat samples, and diet was summarized based on the proportion of samples containing bones of certain prey species 13,14 . Unfortunately, the majority of salmon bones in scats cannot easily be identified to the species level, and simple prey occurrence data are www.nature.com/scientificdata www.nature.com/scientificdata/ of limited use for quantitative analyses of predation. Such estimates generally require (as an input) the predator population diet fraction comprised of a particular prey species 1,2 . A new harbour seal diet method was therefore needed to meet the data requirements of regional modelling efforts.
In 2011, collaborating researchers at the University of British Columbia (UBC) and the Australian Antarctic Division undertook an effort to create a new diet analysis method capable of providing both the salmon species and the proportional biomass of prey contained in harbour seal scat samples. At the time, analysis of predator diet using DNA metabarcoding methods was a new approach that offered the potential to provide both the seal prey species and the relative proportions of prey in scat samples 15 . High-resolution taxonomic data can be achieved with DNA metabarcoding using massively parallel sequencing of short (e.g., 200 bp), diagnostic genetic markers amplified from scat samples 15,16 . Sequences are compared to a database containing known DNA barcodes of potential prey species, and the proportions of DNA sequences assigned to different prey are used as an index of proportional biomass composition for each scat sample 16 . The product of the described R&D effort was a DNA metabarcoding diet analysis method for harbour seals that has since been applied to thousands of wild harbour seal scat samples collected throughout the Salish Sea 8,17 .
Between the years of 2011 and 2019, over four-thousand regional harbour seal scat samples were collected and processed using our standardized DNA metabarcoding diet analysis method. This is the product of a collaborative transboundary research effort including two universities (University of British Columbia, Western Washington University), three government agencies (Washington Department of Fish and Wildlife, Fisheries and Oceans Canada, Australian Antarctic Division), a native American tribe (Nisqually Indian Tribe), a non-profit organization (Long Live the Kings) and private corporation (Smith-Root Inc.). Subsets of the data have been used to address specific questions with respect to pinniped predation on particular prey species or taxa 1,2,8 . Further, the diet analysis method produces proportional data for all prey species amplified by the semi-universal PCR primers; therefore, these data products are also useful for a broad range of ecological questions and modelling exercises that extend beyond the previous taxon-specific inquiries.
Our objective in publishing this dataset is to make these valuable trophic ecology data broadly available to the fisheries and ecosystem modelling communities. The core of the dataset is a single spreadsheet containing data from 4,625 harbour seal scats representing 52 haulout sites, 7 years, 12 calendar months, and totaling 11,641 prey identifications. We hope that by making these data publicly available they will help facilitate an open and free exchange of ideas regarding harbour seal trophic ecology in the Salish Sea.

Methods
Scat sample collection and preparation. At known harbour seal haulout sites individual scat samples were collected using a standardized protocol (Fig. 1). Disposable wooden tongue depressors were used to transfer deposited scats into 500 ml single-use jars or zip-style bags lined with 126 µm nylon mesh paint strainers 18 . Samples were either preserved immediately in the field by adding 300 ml 95% ethanol to the collection jar, or were taken to the lab and frozen at −20 °C within 6 hours of collection 19 . Later, samples were thawed and filled with ethanol before being manually homogenized with a disposable wooden depressor inside the paint strainer to separate the scat matrix material from hard prey remains (e.g. bones, cephalopod beaks). The paint strainer containing prey hard parts was then removed from the jar leaving behind the ethanol preserved scat matrix for genetic analysis 20 . The paint strainer containing prey hard parts was refrozen for subsequent parallel morphological prey ID. Molecular laboratory processing. Scat matrix samples were subsampled (approximately 20 mg), centrifuged and dried to remove ethanol prior to DNA extraction. DNA was extracted from scat with the QIAGEN QIAamp DNA Stool Mini Kit according to the manufacturer's protocols. For additional details on the extraction process see Deagle et al. 21 and Thomas et al. 20 .
The metabarcoding marker we used to quantify fish and cephalopod proportions was a 16S mDNA fragment (~260 bp) previously described in Deagle et al. 15 for pinniped scat analysis. We used the combined Chord/Ceph primer sets: Chord_16S_F (GATCGAGAAGACCCTRTGGAGCT), Chord_16S_R (GGATTGCGCTGTTATCCCT), and Ceph_16S_F (GACGAGAAGACCCTAWTGAGCT), Ceph_16S_R (AAATTACGCTGTTATCCCT). This multiplex PCR reaction is designed to amplify both chordate and cephalopod prey species DNA. A blocking oligonucleotide was included in the all 16S PCRs to limit amplification of seal DNA 22 . The oligonucleotide (32 bp: ATGGAGCTTTAATTAACTAACTCAACAGAGCA-C3) matches harbour seal sequence (GenBank Accession AM181032) and was modified with a C3 spacer so it is non-extendable during PCR 22 .
A secondary metabarcoding marker was used in a separate PCR reaction to quantity the salmon portion of seal diet, because the primary 16S marker was unable to reliably differentiate between coho and steelhead DNA sequences. This marker was a COI "minibarcode" specifically for salmonids within the standard COI barcoding region: Sal_COI_F (CTCTATTTAGTATTTGGTGCCTGAG), Sal_COI_R (GAGTCAGAAGCTTATGTTRTTTATTCG). The COI amplicons were sequenced alongside 16S such that the overall salmonid fraction of the diet was quantified by 16S, and the salmon species proportions within that fraction were quantified by COI.
To take full advantage of sequencing throughput, we used a two-stage labeling scheme to identify individual samples that involved both PCR primer tags and labeled MiSeq adapter sequences. The open source software package EDITTAG was used to create 96 primer sets each with a unique 10 bp primer tag and an edit distance of 5; meaning that to mistake one sample's sequences for another, 5 insertions, substitutions or deletions would have to occur 23 .
All PCR amplifications were performed in 20 μl volumes using the Multiplex PCR Kit (QIAGEN). Reactions contained 10 μl (0.5 X) master mix, 0.25 μM of each primer, 2.5 μM blocking oligonucleotide and 2 μl template DNA. Thermal cycling conditions were: 95 °C for 15 min followed by 34 cycles of: 94 °C for 30 s, 57 °C for 90 s, and 72 °C for 60 s.
Amplicons from 96 individually labeled samples were pooled by running all samples on 1.5% agarose gels, and the luminosity of each sample's PCR product was quantified using Image Studio Lite (Version 3.1). To combine all samples in roughly equal proportion (normalization), we calculated the fraction of each sample's PCR product added to the pool based on the luminosity value relative to the brightest band. After 2013, amplicon normalization was performed using SequalPrep ™ Normalization Plate Kits, 96-well.
Sequencing libraries were prepared from pools of 96 samples using an Illumina TruSeq DNA sample prep kit which ligated uniquely labeled adapter sequences to each pool. Libraries were then pooled and DNA sequencing was performed on Illumina MiSeq using the MiSeq Reagent Kit v2 (300 cycle) for SE 300 bp reads. Samples were sequenced on multiple different runs as part of the larger study; however, typically between 4 and 6 libraries (each a pool of 96 individually identifiable samples) were sequenced on a single MiSeq run.

Bioinformatics.
To assign DNA sequences to a fish or cephalopod species, we created a custom BLAST reference database of 16S sequences by an iterative process. First, using a list of the fish species of Puget Sound, we searched Genbank for the 16S sequence fragment of all fishes known to occur in the region (71 fish families 230 species) 24,25 . Reference sequences for each prey species were included in the database if the entire fragment was available, and preference was given to sequences of voucher specimens. When the database was first generated (November, 2012) Genbank contained 16S sequences for 192 of the 230 fish species in the region, and the remaining 38 species were mostly uncommon species unlikely to occur in seal diets. Following a similar procedure, we added to this database sequences for all of the regional cephalopods for which 16S data were available (7 squid species, 2 octopus species). A separate reference database was generated for the COI salmon marker containing Genbank sequences for the nine salmonid species known to occur regionally: Oncorhynchus gorbuscha (Pink Salmon), Oncorhynchus keta (Chum Salmon), Oncorhynchus kisutch (Coho Salmon), Oncorhynchus mykiss (Steelhead), Oncorhynchus nerka (Sockeye Salmon), Oncorhynchus tshawytscha (Chinook Salmon), Oncorhynchus clarkii (Cutthroat Trout), Salmo salar (Atlantic Salmon), Salvelinus malma (Dolly Varden) 24 .
To determine if some species in the database cannot be distinguished from each other at 16S (i.e. have identical sequences in the reference database) a distance matrix was performed on the complete database using the DistanceMatrix function in the R package DECIPHER 26 . Species with identical sequences were identified as having a distance of "0.00". In some cases, one haplotype for a species was identical to another species but other haplotypes were not. When two species' sequences were identical, we ultimately reported both species in the prey_ID field.
Sequences were automatically sorted (MiSeq post processing) by amplicon pool using the indexed TruSeqTM adapter sequences. FASTQ sequence files for each library were imported into MacQIIME (version 1.9.1-20150604) for demultiplexing and sequence assignment to species 27 . For a sequence to be assigned to a sample, it had to match the full forward and reverse primer sequences and match the 10 bp primer tag for that sample (allowing for up to 2 mismatches in either primers or tag sequence).
Next, we clustered the DNA sequences that were assigned to scat or tissue samples with USEARCH (similarity threshold = 0.99; minimum cluster size = 3; de novo chimera detection), and entered a representative sequence from each cluster into a GenBank nucleotide BLAST search 28,29 . If the top matching species for any cluster was not included in the existing database (or the sequence differed indicating haplotype variation), we put the top matching entry in the reference database. We repeated this procedure with every new batch of sequence data to minimize the potential for incorrect species assignment or prey species exclusion. This process was conducted for both the 16S and COI reference databases with each new batch of samples.
For all DNA sequences successfully assigned to a sample, a BLAST search was performed against our custom 16S or COI reference databases. A sequence was assigned to a species based on the best match in the database (threshold BLASTN e-value < 1e-20 and a minimum identity of 0.9), and the proportions of each species' sequences were quantified by individual sample after excluding harbour seal sequences or any identified contaminants 27 . Samples were excluded from subsequent analysis if they contained <10 identified prey DNA sequences (given the current costs of DNA sequencing, a higher threshold is now advisable). Harbour seal DNA diet percentages for individual scats were then calculated using the Relative Read Abundance (RRA) calculation commonly used in metabarcoding studies (Box 1) 16 . The RRA formula was used to calculate the "DNA_diet_ percent" field in data record "Harbour_Seal_DNA_Diet_Data.csv". Box 1 (From Deagle et al. 16 ). Metrics used to summarise sequence data in dietary studies.

Occurrence Data
Frequency of occurrence (FOO) is the number of samples that contain a given food item, most often expressed as a percent (%FOO). Percent of occurrence (POO) is simply %FOO rescaled so that the sum across all food items is 100%. Weighted percent of occurrence (wPOO) is similar to POO, but rather than giving equal weight to all occurrences, this metric weights each occurrence according to the number of food items in the sample (e.g., if a sample contains 5 food items, each will be given weight 1/5). Mathematical expressions are as follows: where T is the number of food items (taxa), S is the number of samples, and I is an indicator function such that Many metabarcoding diet studies make use of both %FOO and POO e.g. 42 . POO provides a convenient view since each food taxon contributes a percentage of total diet (unlike %FOO which does not sum to 100%). In POO summaries samples with a high number of food taxa have a stronger influence, whereas in wPOO each sample is weighted equally (i.e. lower weighting to food taxa in a mixed meal) and this may be more biologically realistic. wPOO is the same as split-sample frequency of occurence; see 43 and references within.

Read Abundance Data
Using the sequence counts, relative read abundance (RRA i ) for food item i is calculated as: www.nature.com/scientificdata www.nature.com/scientificdata/ Prey hard parts analysis. Extraction and identification of hard structures from harbour seal scats was conducted by three different analysts. We used the "all structures" approach to identify harbor seal prey contained in individual scat samples, which make our results comparable to similar studies previously conducted in the region 8,13,14 . Prey "hard parts" retained in paint strainers were cleaned of debris using either a conventional washing machine or nested sieves. All diagnostic prey hard parts were identified to the lowest possible taxon using a dissecting microscope and reference fish bones from Washington and British Columbia, in addition to published keys for fish bones and cephalopod beaks [30][31][32][33] . Samples containing prey hard parts identifiable only to the family level (e.g., Clupeidae), and bones identifiable to the species level of the same family (e.g., Pacific herring, Clupea pallasii) were both tallied.
In previous studies of harbour seal impacts to juvenile salmonids in the Salish Sea 1,2,8,34 , these diagnostic hard structures (e.g., otoliths, bones) were combined with DNA extracted from each scat sample to estimate the proportion of juvenile and adult salmon (by species) in the seal diet. This approach (see: Thomas et al. 8 ) integrates separate analyses of hard parts and DNA through an algorithm that apportions the salmonid DNA component in each sample to a "juvenile" or "adult" classification. The decision algorithm is based on the co-occurrence of age-classified salmon bones and salmon DNA in samples, and (when bones were not present but salmon DNA was detected) on known seasonal life-history information. For example, an individual scat sample found to contain 5% Chinook salmon, and a 1:1 ratio of juvenile to adult salmon bones, would be disaggregated into a final classification of 2.5% juvenile Chinook salmon and 2.5% adult Chinook salmon. For individual scat samples that do not contain diagnostic hard structures, the ratio of juveniles to adults in that sample would rely on the ratio of hard parts pooled for the collection month. If no hard structures are available for the collection month-which only occurred for 7% of samples in Thomas et al. 8 a seasonal classification would then be applied to the sample (spring = juvenile, fall = adult). The classification of hard parts as "juvenile" or "adult" was performed by taxonomic experts who differentiated samples visually, and/or according to otolith or vertebral measurements (e.g., Nelson et al. 34 ).

Data Records
A summary figure depicting the average diet of the complete Salish Sea harbour seal dataset comprised of 4,625 scat samples is available in Fig. 2.
The following files are available at figshare 35 . Harbour_seal_dna_diet_data.csv. This is the primary dataset that contains scat sample composition data for 4,625 harbor seal scats. Field names and descriptions are outlined in Table 1. Multiple entries exist for each sample ID, with each entry representing an identified prey species and its proportional composition within the sample. Notes for Harbour_Seal_DNA_Diet_Data.csv: The sample ID nomenclature varied between projects and we present here the IDs used by the respective projects and institutions. This presentation allows for links to drawn to other parallel projects or datasets using the common sample IDs.
Similarly, the collection location names were assigned by the separate projects and do not necessarily represent specific cartographic locations. Therefore, the Latitude and Longitude are provided in decimal degree format for each sampling location.
Database duplicates: When the prey ID sequence in the database was a perfect match for another prey species sequence, both prey IDs are provided separated by "OR" in the entry. For example, "Copper_OR _ Quillback_ OR _China_Rockfish". The same approach was used for the taxonomic names of the prey species.
DNA_diet_percent is the product of the RRA calculation for each prey species and can be used as an index of proportional prey consumption (see methods).

Harbour_seal_hardparts_diet_data.csv.
These are the results of the morphological prey hard parts ID of collected scat samples, with corresponding sample IDs to those in Harbour_Seal_DNA_Diet_Data.csv (Fig. 3). These data can be used to make direct comparisons between the traditional seal diet analysis technique (prey hard parts analysis) and the newly developed diet characterization method (DNA metabarcoding diet analysis).
Notes for Harbour_Seal_Hardparts_Diet_Data.csv: The.csv file contains data in five fields: 1. Sample ID -the scat sample ID that corresponds to "sample ID" in the DNA dataset. 2. Prey Species -the prey species scientific name. 3. Salmon classification -Binary age classification for salmon structures as noted by the analyst. 4. Sample comments -Additional comments about the specific prey hard structure. 5. Analyst -The name of the prey hard parts identification analyst.
Because the prey hard parts analysis was conducted by three different analysists without an existing standardization protocol, the reported results varied slightly throughout the dataset. Here we attempted to preserve as much of information provided by the analyst as possible in the "Sample comments" field.
Further, given the intense intertest in salmon predation by harbour seals and the large magnitude differences in consumption estimates depending on prey life stage, we have included the binary salmon age classification (Juvenile, Adult). We should acknowledge, however, that seals consume salmon prey in a wide range of age/size classes, and this binary scheme is an oversimplification. (2022) 9:68 | https://doi.org/10.1038/s41597-022-01152-5 www.nature.com/scientificdata www.nature.com/scientificdata/ Prey_database.txt. The prey reference sequence database (Fig. 4) is a product of the iterative process of adding prey sequences based on the representative sequences of clustered MiSeq data from each successive scat sequencing run. The database started with a Genbank search for 16S sequences using a list of local fish species, and then the clustering processes for each sequencing run added additional species or haplotypes to the database over time.

Notes for prey_database.txt. This version of the database does not contain identified contaminants or
predators other than harbour seal that were inadvertently collected.
Prey_species_distance_matrix.xlsx. This distance matrix was used to determine when the reference sequences of two prey species in the database were identical. Both prey species names are given in the Prey_ID field of the diet dataset in those occurrences.
Notes for Prey_Species_Distance_Matrix.xlsx: Multiple sequence entries exist for some prey species as a byproduct of the iterative sequence entry process. In some cases, one haplotype for a species was identical to another species but other haplotypes were not.

QiiMe_mapping_files.tgz.
A tar archive of the QIIME mapping files (.txt) needed for demultiplexing the .fastq files in the NCBI SRA database (Online-only Table 1) 36 . The mapping file names in the tar archive match those of the associated.fastq files. Each mapping file contains the F and R primer sequences along with the unique primer tag for each of the 96 scat samples listed in the sequencing plate. www.nature.com/scientificdata www.nature.com/scientificdata/

technical Validation
In parallel to the field sample collection effort, we conducted a series of studies to evaluate the quantitative capabilities of DNA metabarcoding diet analysis for seal diet estimation. Specifically, we wanted to determine if prey DNA sequence percentages recovered from scats (RRA) accurately reflect the proportional biomass of the prey consumed. This work involved a series of experiments including: a captive seal feeding study with known diets, the development of food tissue control materials (homogenized fish tissue) to produce correction factors, and computer simulations to compare RRA to alternative diet indices such as weighted percent of occurrence. Here we will briefly describe those studies and the principal conclusions drawn from each. Lastly, we will outline why we have chosen to report RRA as the principal diet index for this dataset.
Ideally, RRA would be a direct reflection of the proportional biomass of the prey species consumed -i.e. a 1:1 relationship would exist between metabarcoding DNA sequence proportion and diet biomass proportion. However, a large number of potential factors could skew this relationship, including variability in mDNA density (number of mitochondria per gram of tissue) between different prey species, technical biases such as preferential primer binding during PCR caused by primer mismatches, and the many bioinformatic processes that may  Table 1. Field names and descriptions of the data contained in "Harbour_Seal_DNA_Diet_Data.csv". www.nature.com/scientificdata www.nature.com/scientificdata/ select for one prey species' sequences over another. For this reason, our first validation study was an evaluation of the factors that influence the relationship between diet biomass proportion and RRA in a captive seal feeding study wherein aquarium animals were fed a known diet of fixed biomass proportions of three prey species 37 .
In that study using the Ion Torrent PGM sequencer, we found that DNA sequence quality filtering, direction of sequencing (F vs R), and our chosen minimum read length, all largely impacted the proportions of prey sequences ultimately resulting from RRA analysis 37 . Furthermore, we detected interactive effects between biasing factors such as variable effects of quality filtering depending on the primer tags used to identify sequences from unique samples. However, despite these effects we found that replicate samples of a common diet produced largely consistent DNA sequence percentages when technical factors were held constant. This consistency implied that biasing factors could potentially be corrected using a set of standards sequenced alongside scat samples, such as homogenized mock communities of known composition (i.e. prey fish tissue mixes).
The next study therefore attempted to correct for biasing factors by sequencing a prey fish tissue mixture that matched the diet biomass proportions of captive seals, allowing us to calculate and apply Tissue Correction Factors (TCFs) to the scat DNA sequence counts 20 . When applied to seal scat DNA, TCFs substantially improved the relationship between RRA and prey biomass proportion for all prey species. We also surmised that while the TCFs account for technical biases (e.g. quality filtering, differential primer binding) and mDNA density variability, they did not account for any potential effects of differential prey species digestion (e.g., one species being more fully digested than another in the seal's gastrointestinal tract). The study design of the experiment allowed us to quantify the bias introduced by differential prey digestion and determine in that case the magnitude of the required Digestion Correction Factor (DCF) for each species. It is unrealistic to experimentally determine the DCF for all prey when hundreds of prey species are possible, so we explored prey characteristics (proximate composition analysis) that could be used as a proxy for prey digestion. We found that the percent lipid of prey species closely predicted the DCF, implying that highly accurate diet biomass estimates could be obtained by applying both TCFs and lipid-based DCF correction factors. The correction factor design of this experiment however relied on a priori knowledge of the seal diets; therefore, a more generalizable approach was needed that could be applied to samples of unknown composition.
Using the lessons learned from the captive feeding studies, we thus created a novel correction factor approach for samples of unknown composition by generating 50/50 biomass percentage mixtures of variable potential prey fish combined with a "control" fish that was held constant in all mixtures 38 . By holding one of the two prey constant in all 50/50 mixtures, we were able to detect biases in the "test" prey relative to the control species and to calculate Relative Correction Factors (RCFs) that could be applied to DNA sequence counts from samples of unknown biomass composition (i.e. seal scat samples). We then built a prey library of the 50/50 mixtures and sequenced them alongside seal scat samples, calculating RCFs for prey species in the library and applying them to scat samples to determine the magnitude of corrective effect. Similar to previous results, RCFs were highly consistent between sequencing replicates, and RCF values indicated that there is some phylogenetic structure to the magnitude of bias (active swimming fishes were overestimated relative to more sedentary species, likely reflecting mDNA density). The effects of RCF correction were most pronounced on individual samples, although when samples were averaged together (as is common practice when calculating population level diet summaries) the effect of RCF correction were less impactful. We ultimately concluded that RCF correction is www.nature.com/scientificdata www.nature.com/scientificdata/ only a worthwhile endeavor when a high degree of diet biomass accuracy is needed for a single sample, or when generating population diet summaries from a small number of diet samples.
Despite the known biases, researchers have consistently concluded that RRA is a semi-quantitative tool even without the application of correction factors. When the diet proportion of a prey species increases, there is a corresponding increase in the relative number of DNA sequences for that prey species in diet samples. So that begs the question, are uncorrected "semi-quantitative" RRA estimates good enough? One way to answer this question is to compare the accuracy of the RRA method to other competing methodological alternatives. DNA metabarcoding RRA may be subject to known biasing factors, but if it outperforms alternative diet metrics in terms of taxonomic resolution and biomass estimate accuracy, then it may be the current best option for pinniped diet analysis.
The most common practice for pinniped diet studies is to treat detections of prey species in diet samples as presence/absence data (i.e., occurrences). Percent of occurrence (POO) tables can be generated for groups of samples using such data, or when proportions must sum to 1 (as is needed for bioenergetics modelling) a weighted percent of occurrence (wPOO) can be calculated (Box 1). The latter has also been called Split Sample Frequency of Occurrence (SSFO). Occurrence-based indices however are prone to overestimating prey eaten in low proportion and underestimating prey eaten in high proportions. In a metanalysis of DNA metabarcoding diet studies, Deagle et al. 16 compared multiple diet indices using an in silico simulation experiment. Diet datasets were simulated at multiple sample sizes and the proportional composition was calculated using RRA and occurrence indices, then compared to the theoretical true diet using Bray-Curtis dissimilarity. They found that occurrence indices produce consistent diet estimates (relatively precise) but result in a less accurate reflection of the true diet by comparison to RRA. RRA produced a more accurate estimate than POO even when the simulated level of bias was extreme (20X), suggesting that the quantitative signature of RRA is important for generating accurate diet estimates despite the presence of biases.
Comparing DNA metabarcoding RRA to non-DNA based methods with a proven history of use (e.g., scat morphological hard parts analysis) is another means of assessment. In an extensive methods comparison www.nature.com/scientificdata www.nature.com/scientificdata/ between scat DNA and hard parts analysis, Thomas (2015) 17 found a high degree of agreement between the two methods in the proportions of salmon estimated in seal diets (Fig. 5), in addition to other diet species (Online-only Table 2). Furthermore, in a recent report 39 , pinniped diet experts Dominic Tollit and Ruth Joy compared DNA metabarcoding RRA (Termed "DNA-Fixed" in report) to the current best practices method (hard parts biomass reconstruction -"BR-Fixed") and found "near identical" diet contributions between methods (see Supplementary Information for full report, posted with permission). This implies that DNA metabarcoding RRA provides not only better taxonomic resolution than hard parts, but also produces comparable diet proportions to the current best practices method while requiring substantially less labor. A copy of the report is available in Supplementary Information. Given the strengths of RRA for proportional biomass estimation in the absence of correction factors, the cost/labor involved in generating a complete prey library for RCF correction, and fact that harbour seal prey are minimally biased with our 16S marker, we have chosen to present uncorrected RRA as the proportional diet index in the current dataset. Caution should therefore be exercised when generating diet summaries using small numbers of samples (e.g.< 70 per stratum), as is considered best practice with other diet indices 40 . Future efforts may be made to build a 50/50 RCF prey library for harbour seals in the Salish Sea region, in which case the diet database presented here could be RCF corrected for improved sample proportional biomass accuracy. It should also be noted that our methods evaluations did not account for other potential sources of bias (e.g., sampling bias introduced during scat collection, selective influence of blocking oligos, secondary prey consumption, reliance on Genbank for reference database sequences, etc.). Those issues are worth investigating further in future methods refinement.

Usage Notes
As stated previously, we encourage users of these data to follow statistical best-practices when creating diet summaries from this dataset. Ample literature is available on the topic of appropriate size for predator diet studies 40,41 .

Code availability
The bioinformatic code used to process scat DNA sequence data are available in the "Example Bioinformatic Files. zip" folder on figshare along with example data 35 . Sample demultiplexing and sequence taxonomy assignment steps were performed using MacQIIME (version: 1.9.1-20150604) and the code used for these steps is in folder "1_Qiime_processing". Example output files from MacQIIME processing steps containing 16S and COI sequence taxonomy assignments are in the folder "2_Qiime_output_R_input". Those files were then further processed in R (multiple versions) to generate DNA sequence percentages using the code contained in folder "3_R_processing". Note: Demultiplexing requires QIIME mapping files for each sample plate which are stored in the figshare folder 35 "QIIME_Mapping_Files" for sequence data stored the NCBI SRA database 36 . A complete list of the bioinformatic functions in MacQIIME is available online http://qiime.org/scripts/.