Estimation of tuna population by the improved analytical pipeline of unique molecular identifier-assisted HaCeD-Seq (haplotype count from eDNA)

Many studies have investigated the ability to identify species from environmental DNA (eDNA). However, even when individual species are identified, the accurate estimation of their abundances by traditional eDNA analyses has been still difficult. We previously developed a novel analytical method called HaCeD-Seq (Haplotype Count from eDNA), which focuses on the mitochondrial D-loop sequence. The D-loop is a rapidly evolving sequence and has been used to estimate the abundance of eel species in breeding water. In the current study, we have further improved this method by applying unique molecular identifier (UMI) tags, which eliminate the PCR and sequencing errors and extend the detection range by an order of magnitude. Based on this improved HaCeD-Seq pipeline, we computed the abundance of Pacific bluefin tuna (Thunnus orientalis) in aquarium tanks at the Tokyo Sea Life Park (Kasai, Tokyo, Japan). This tuna species is commercially important but is at high risk of resource depletion. With the developed UMI tag method, 90 out of 96 haplotypes (94%) were successfully detected from Pacific bluefin tuna eDNA. By contrast, only 29 out of 96 haplotypes (30%) were detected when UMI tags were not used. Our findings indicate the potential for conducting non-invasive fish stock surveys by sampling eDNA.

HaCeD (Haplotype Count from eDNA)-Seq, was successfully adopted to noninvasively estimate the abundances of various eel species inhabiting the same experimental tank as well as the aquaculture pond at the same time 10 .
In HaCeD-Seq, the accuracy of base identification in the sequencing data is very important. Even a single base difference will be counted as a different haplotype in this method; thus, only reads with a base identification accuracy of 99.9% or higher were used in the previous report on eels 10 . However, even with 99.9% base identification accuracy, the probability of error occurrence at least once in every 300 bases (= the average length of a read) is (1-0.999 300 ) = 26%. Therefore, even 99.9% accuracy is not high enough for this method. In addition, when samples contain several individuals, there are variations in the yield of eDNA derived from each individual as described above. It is generally difficult to discriminate between true minor haplotypes and false haplotypes derived from sequencing errors of the major haplotypes. In our previous report, due to sequencing errors, eels with less than 3% of those having the major haplotypes were indistinguishable from those with false haplotypes 10 .
Safe-SeqS is a sequencing technique that uses unique molecular identifiers (UMIs) to remove sequencing errors 11 . This technique assigns a random UMI tag to each DNA fragment, amplifies the fragment by PCR, sequences the amplified product and gathers the reads with the same UMI tag during data analysis, thus removing sequencing errors. Numerous studies have used UMI tags to remove PCR amplification bias, mainly in singlecell analysis 12,13 . Using UMI tags to remove PCR errors and sequencing errors has been adopted to analytical tools such as pRESTO 14 , MAGERI 15 and AmpUMI 16 , which facilitate the assembly of high-quality reads from sequenced data. Thus, it is interesting to explore whether UMI tags enable HaCeD-Seq to correctly detect the abundance of individuals, even when the abundance is low.
HaCeD-Seq was established to determine population sizes, and it is applicable to organisms with individual specific D-loop sequences, namely, those with high haplotype diversity. The haplotype diversities of the D-loop of tuna are reportedly as high as 0.9980 for longtail tuna 17 , 0.9987-1.000 for yellowfin tuna 17,18 , and 0.996 for Pacific bluefin tuna 18 . The Tokyo Sea Life Park in Japan planned to add naturally harvested Pacific bluefin tuna juveniles to this aquarium in 2016. This was a good opportunity to validate the improved UMI-assisted HaCeD-Seq method, because fin tissue samples could be taken for the determination of D-loop haplotype before they were transferred into the aquarium.
Tuna are top predators with unique features in terms of swimming ability 19 . These fish continuously swim at a high speed to extract oxygen with their gills and to chase prey. Tuna has long been an international market commodity for canning 20 . The fast muscle of tuna is also consumed raw in fish dishes (sashimi); formerly, this consumption was almost exclusively in Japan but is now common throughout the world. The consumption of tuna has greatly increased owing to the desirable taste of sashimi. This fish also has high contents of unsaturated n-3 fatty acids, such as eicosapentaenoic acid (EPA, 20:5 n-3) and docosahexaenoic acid (DHA, 22:6 n-3), which are thought to be beneficial for human health 21,22 .
Due to the increasing demand from consumption, the tuna stocks are considered to be at risk of overexploitation, although it previously had been thought that tuna harvesting would never cause their populations to become threatened 20,23 . In response to the risk of over fishing, the aquaculture programme of the Pacific bluefin tuna started in 1970 in Japan, resulting in predominantly farm-raised fish by 2002 24,25 . Despite this innovative achievement, most tuna aquaculture business still depends on wild stocks of juveniles because the survival rates of larval and juvenile tuna from artificial reproduction is not high 26 . Therefore, it is urgent to assess natural tuna stocks correctly and easily.
The objective of this study was to improve our HaCeD-Seq method by combining UMI assembles and to examine the accuracy of this improved method to estimate the abundance of Pacific bluefin tuna that were reared in a large aquarium tank in the Tokyo Sea Life Park.

Results
UMI tagging to improve base accuracy. eDNA was extracted from water samples taken from a total of 70 Pacific bluefin tuna transported to the Tokyo Sea Life Park with 7 live fish transport vehicles (Fig. 1). This was only one opportunity to obtain samples in our experiments.
Using the extracted eDNA as a template, PCR was performed with and without UMI-tagged primers, and the amplified products were sequenced by MiSeq at 300 bp × 2 paired-end sequencing (Fig. 2). The number of reads obtained is summarized in Tables 1 and 2. For sequencing data without the UMI tags, only high precision reads with all bases above Q30 (equal to 99.9% accuracy) were extracted, with an average of 32% of the reads remaining ( Table 1). The average quality of bases in the obtained filtered reads was 37.9 (99.98% accuracy). The UMI-tagged reads were error-corrected by MAGERI, resulting in an average of 6% of the reads remaining ( Table 2). The average quality of the bases was 39.8 (99.99% accuracy), but the real base accuracy was much higher since the upper limit of base quality output by MAGERI is up to 40. Estimation of tuna abundances in water tanks of live fish transport vehicles. Each of the seven tanks in the transport vehicles contained 7-17 tuna specimens (Table 1). When the D-loop sequences of all 70 individuals were determined by Sanger sequencing, duplicated sequences were found for two sets, whereas one sequence was found for three individuals. Thus, in total, 64 haplotypes were found for the 70 fish in the tanks (Table 3, Supplementary Information Table S1). The sequences of 64 haplotypes determined by Sanger sequencing were identical to those determined by NGS with and without the UMI tags. We also performed HaCeD-Seq with and without the UMI tags on water samples from all the tanks to determine the optimal cluster size cut-off. Since false haplotypes result from sequencing errors, for the most abundant haplotypes, the changes in sensitivity and specificity depend on the cut-off for low-frequency haplotypes. The sequences and frequencies of haplotypes are shown in Supplementary Information Table S2, and we confirmed that the true haplotype sequences with the UMI tags were identical to those without the UMI tags. The data from each tank was normalized by the www.nature.com/scientificreports/ maximum haplotype cluster size and the sensitivities of haplotype detection are shown in Fig. 3A. Here, the sensitivity is the percentage of correct haplotypes among the total haplotypes observed. The sensitivity in HaCeD-Seq was not different between those with and without the UMI tags, although it tended to be slightly higher with the UMI tags. This indicates that the UMI tags remove the PCR bias caused by difference in the concentrations of different haplotypes that had been amplified by PCR.
The specificity results are shown in Fig. 3B. Here, the specificity is the percentage of correct haplotypes compared to all the detected haplotypes. The specificity varied markedly depending on whether the UMI tags were used. We determined the minimum percentage that would result in a specificity of at least 85% as the cut-off. Without the UMI tags, the specificity decreased to below the detection threshold of 2%. By contrast, with the UMI tags, the specificity remained almost unchanged until reaching 0.2%. We chose a threshold of 0.2% with the UMI tags and 1.5% without the UMI tags. The HaCeD-Seq results of the transport vehicle water tank samples using these thresholds are shown in Fig. 4. All of the true haplotypes (determined by Sanger sequencing) were detected both with and without the UMI tags, with exception of tank 6 where only 9/10 true haplotypes were detected without the UMI tags. Across all tanks, the absence of the UMI tags generated 8 false haplotype detections, compared to 4 with the UMI tags.
Estimation of tuna abundance in the aquarium tank. We next conducted the experiment in the aquarium and compared the results obtained from HaCeD-Seq with and without the UMI tags. A total of 38 tuna were already present in the aquarium tank, and then the 70 tuna were newly added. It was not feasible to determine the haplotypes of the existing 38 tuna by Sanger sequencing because it was difficult to collect tissue samples from live specimens in the large aquarium tank. Therefore, the water in the aquarium containing the 38 tuna was sampled, and 36 haplotypes were successfully obtained from an analysis with the UMI tags (Table 3 and "before" sequences in Supplementary Information Table S1). Among these 36 haplotypes, four were identical to those of fish in the transport vehicle tanks. Thus, together with the additional 70 tuna with 64 haplotypes, the total number of unique haplotypes of the total 108 tuna in the aquarium was 96 (Table 3, Supplementary  Information Table S1). www.nature.com/scientificreports/ The HaCeD-Seq analysis of water samples from the aquarium tank detected 90 out of the 96 haplotypes (94%) when the UMI tags were used (Fig. 5). However, HaCeD-Seq analysis without the UMI tags detected only 29 haplotypes (30%), with two false haplotypes. These results show that when a large number of haplotypes need to be detected (i.e., greater than 100), the UMI tags are essential to obtain the correct number of haplotypes.  www.nature.com/scientificreports/

Discussion
In this paper, we report the first application of a random sequence (unique molecular identifier: UMI) to eDNA analysis. In this method, the D-loop is PCR-amplified as the target region by adding the UMI tags to the end of this region with eDNA as a template. The D-loop and UMI sequences are then determined by next-generation sequencing, and identical DNA sequences with the same UMI sequence are counted as one unique sequence. The reads with different UMI tags are considered to be derived from different DNA copies; by counting the number of UMI tags, an accurate number of DNA copies can be obtained. Furthermore, PCR and sequencing errors   The relationship between the detection threshold and sensitivity changed depending on whether the UMI tags were used and depending on the detection threshold for the cluster size. Here, the sensitivity is the percentage of correct haplotypes among the total haplotypes observed. (B) The relationship between the detection threshold and specificity. Here, the specificity is the percentage of correct haplotypes compared to all the detected haplotypes. The horizontal axis shows the detection threshold, which is the size of the cluster when the most abundant haplotype is set to 100%. A threshold that was as low as possible without decreasing specificity was selected. The threshold finally adopted by HaCeD-Seq with the UMI tags (0.2%) is indicated by a blue triangle; the threshold adopted without the UMI tags (1.5%) is indicated by an orange triangle. www.nature.com/scientificreports/ can be avoided by integrating sequences with the same UMI tag, and more accurate D-loop sequences can be obtained referring to the data previously reported [14][15][16]27,28 .
The accuracy of DNA polymerase in DNA synthesis is approximately 99.995%, including that of high-fidelity DNA polymerase 29 . However, as sequencing accuracies have increased over the years, PacBio's HiFi reads exceed 99.99% accuracy, so the effect of PCR errors during library construction has become non-negligible. PCR errors can be also avoided by integrating the results of multiple PCR runs operated in separate tubes. However, because there are cases in which PCR errors might occur at particular sites with similar frequencies even in separate reaction tubes with the same template DNA 27,28 , the removal of PCR errors by software has limitations. For greater accuracy, not only software processing, but also physical processing must be used in combination. We have successfully applied UMI tags to reduce the noise from PCR and next-generation sequencing errors by an order of magnitude when counting haplotypes; these UMI tags were very practical for eliminating PCR errors [14][15][16] .
By using HaCeD-Seq combined with the UMI tags to estimate the number of overlapping haplotypes in the Pacific bluefin tuna rearing in the aquarium tank at Tokyo Sea Life Park, we were able to detect 90 out of 96 haplotypes (94%) with no false haplotype. These results suggest that almost all the haplotypes present in the aquarium tank could be detected. On the other hand, 4 false haplotypes were detected among 71 haplotypes in 7 vehicle tanks with the UMI tags, suggesting the effect of different DNA concentrations in different sizes of tanks even on the number of false haplotypes for the UMI-assisted method. Meanwhile, only 29 out of 96 haplotypes (30%) could be detected with two false haplotypes without the UMI tags. Nevertheless, we were unable to detect 6 out of the 96 haplotypes (6%). Because 6 individuals died during the first month after addition to the aquarium tank, the six missing haplotypes may have resulted from these deaths. Therefore, it is likely that our haplotype detection rate was greater than 94%. The improved detection sensitivity when using the UMI tags is an important achievement to address the challenge of determining fish abundances from open seawater samples www.nature.com/scientificreports/ because the variability in concentration among haplotypes is expected to be greater than tank experiments. HaCeD-Seq with the UMI tags is likely to be adopted in the studies of the species living in open seawater with high haplotype diversity. Although the haplotype diversity of tuna is high 17,18 , only 96 haplotypes were detected in samples from the aquarium which housed 108 individuals. This discrepancy shows that there are differences between the number of haplotypes and the number of individuals. To estimate the abundance more accurately from the number of detected haplotypes, it is necessary to account for overlap in haplotypes among individuals and thus to make estimates based on a population genetic model as well as the effect of different DNA concentrations in open seawater. The iNEXT tool is a method that can produce such an estimate, and it is possible to estimate the error range of the resulting population numbers from the detected haplotype diversity 30 . In the case of Pacific bluefin tuna, the error range was estimated using the D-loop sequence of 507 bluefin tuna in a public database (Fig. 6). For a haplotype number of 96, the estimated number of individuals ranged from 98 to 101. These results will be an important guideline for setting research policies for field studies.
HaCeD-Seq makes it possible to estimate the number of individual fish in a tank from the eDNA contained in a water sample. Some fish hide behind rocks inside the aquarium, and even fish keepers are often unsure of the number of individuals in a given tank. Most fish in aquariums are wild-caught individuals, and the haplotype  www.nature.com/scientificreports/ diversity is expected to be high. Therefore, the method tested in this study can also be applied to monitoring the fish abundance in aquarium tanks as well.
The extent of exploitation of commercially important fish has been increasing due to increasing global demand. In particular, the global depletion of top predatory fish species such as the Pacific bluefin tuna is a serious concern in the context of sustainable fisheries management 31,32 . The estimation of the abundance and biomass (i.e., stock assessment) of fishery resources has heavily depended on fishery information, such as catch data from global stock assessment systems. A variety of resource estimation models have been devised to improve the accuracy of conventional resource estimation, which relies on catch data. The close-kin method, which uses kinship information, has been newly established for Pacific bluefin tuna and southern bluefin tuna 33,34 . This method can estimate the absolute abundance of parental fish based on the number of parent-offspring pairs determined through DNA marker-based parentage analysis. However, it is critical to collect specimens without bias in terms of distribution, migration, spawning time and location. Samples for the close-kin method are collected by fishery-independent sampling surveys or from marketplaces. Therefore, the reliability of this method depends on the accuracy of catch information. By contrast, the method proposed in the current study does not require fish sampling because it uses the eDNA contained in environmental water samples to estimate the abundance of a target species. In addition, as the conventional method of stock assessment relies on fisheries, many individuals are to be missed. High haplotype diversity has already been reported for some commercially important small pelagic fish such as jack mackerel (Trachurus japonicus) 35 , chub mackerel (Scomber japonicus) 36 , and spotted mackerel (Scomber australasicus) 37 for which a total allowable catch (TAC) has been set in the fishery management system of Japan 38 . The haplotype diversities of jack mackerel, chub mackerel and spotted mackerel have been reported to be 0.964-1.000, 0.505-0.967 and 0.996, respectively. Hence, the proposed method could be applicable for stock assessments of these major target species with high haplotype diversity in fisheries. It is interesting to investigate how high haplotype diversity is needed to apply this approach. A combination of the proposed method and the conventional methods is expected to improve the accuracy of estimating the abundance and biomass of aquatic organisms. Furthermore, this method would also potentially be applicable to any commercial or non-commercial species if the estimation process is first established based on haplotype diversity. From such a viewpoint, this method is anticipated to provide a possible venue for multispecies approach under the conceptual framework of ecosystem-based fishery management 39 , which is a new direction for the management of fisheries that places priority on the ecosystem rather than on a specific target species.
There are many studies showing the correlation between the abundance of eDNA and the biomass [5][6][7] . However, abundance estimation in the natural environment has been difficult, because the concentrations of eDNA are variable by variations in size and age-class distributions even in the same species 8 as well as alteration in environmental factors such as water temperature and flow 9,40,41 . This has been an open question linking eDNA to species abundance. Yamamoto et al. 42 used eDNA with quantitative PCR in combination with echo sounder technology to estimate the distribution and biomass of jack mackerel. Our method allows the fish abundance and biomass to be estimated from environmental water samples without any further supplementary method. Such methodological independence is a strong advantage of our approach, which establishes a simple and accurate estimation of fish abundance. Although different numbers of the sequences were distributed across different haplotypes with the UMI tagged data, we can detect almost 100% of the haplotypes. This data would allow us to see how variable eDNA production rates amongst individuals of the same species, which we think would be a very useful addition to our knowledge.
One of the most suitable applications of the present method is to estimate the number of individuals of endangered species. Endangered species are rapidly declining in abundance and biomass, making it more difficult to sample individuals in the field. A direct abundance estimation based on our method using only environmental water samples could be a desirable way to assess the population status of endangered species without disturbing the ecosystem through field sampling. Our proposed method provides a route for the simultaneous pursuit of sustainable management, ecosystem conservation, and necessary assessment.

Materials and methods
Specimens. Pacific bluefin tuna were caught in the Pacific Ocean off Kashiwa Island, Otsuki, Kochi Prefecture; off Kannoura, Toyo, Kochi Prefecture; and off Nachikatsuura, Wakayama Prefecture. All specimens were temporarily kept in cages off Kashiwa Island and then transported by a live fish boat to Misaki, Kanagawa Prefecture. From this location, four live fish transport vehicles with a combined total of seven tanks were used to transport 70 tuna to the Tokyo Sea Life Park on June 12, 2016. Each transport tank (weighing 7-17.1 tons) contained 7-17 tuna (7, 7, 7, 10, 11, 11 and 17 tuna) (Fig. 1). While these transport tanks had no circulation system, aeration with pure oxygen and air were provided. No fish deaths occurred during transportation from Misaki to the park or during transfer from the vehicle tanks to the aquarium. A 10 L sample of environmental water was collected from each tank after transportation. A total of 38 Pacific bluefin tuna were currently housed in a 2,200-ton aquarium tank at the park, and a 10 L sample of the surface water from this tank was collected on June 8, 2016, before adding the 70 new Pacific bluefin tuna. The water temperature of the aquarium tank was 23 °C, and the circulation volume was 2,200,000 L/h. About 60% of the water in the aquarium tank was replaced per month. The water of fresh natural seawater was taken off Hachijojima in Izu Islands and transported to the aquarium. Aeration with pure oxygen and air were provided, where dissolved oxygen concentration was maintained at 6.3-6.7 mg/L. Pieces of dorsal fin tissues of the tuna were dissected with small forceps from the 70 individuals to determine the D-loop sequences by Sanger sequencing, and pit tags were embedded for individual identification. One month after the addition of the 70 tuna, 10 L of surface water was again collected from the aquarium tank on July 6, 2016. The dates of the individuals that died in the aquarium tank between June 8 and Extraction of environmental DNA from water. A 10 L sample of collected environmental water was passed through a 47 mm GF/D (2.7 μm pore size) glass-fibre filter (Whatman). After filtering the samples, the filters were fixed by adding 15 mL of ethanol 46 . The filter was divided into two equal parts with scissors before DNA extraction; one of the two pieces was used for DNA extraction in order to properly fit the extraction kit. DNA extraction from the filter was performed using a DNeasy Blood & Tissue Kit 47 . The extracted eDNA was eluted in 400 µL (twice with 200 µL) of buffer AE supplied with the kit. The extracted eDNA was quantified using a Qubit 2.0 Fluorometer (Thermo Fisher Scientific) and subjected to PCR under the same conditions as used for Sanger sequencing to confirm PCR amplification.

UMI-tagged PCR and MiSeq sequencing.
To construct the UMI-tagged libraries, another primer set (degenerate primers) was designed within the D-loop region: To2nd_CRF2 (5′-CTA GTA CYY AAC CAT TCA TA-3′) and To2nd_CRR1 (TGA CCC YCT AGA RAG AAC G-3′). The extracted eDNA was adjusted for UMItagged libraries according to the following procedures. The construction of the UMI-tagged libraries was performed in a three-step PCR protocol ( Supplementary Information Fig. S1).
For the first PCR, 10 ng of DNA was used as a template, and To2nd_CRF2_CS_UMI13: 5′-CTC TTC CGA TCT GTCNNNNNNNNNNNNNCTA GTA CYY AAC CAT TCA TA-3′ and To2nd_CRR1_withCS: 5′-CTC TTC CGA TCT CAG TGA CCC YCT AGA RAG AACG-3′ were used as primers. The composition of the PCR reaction medium (25 µL total) was 0.5 µL of PrimeSTAR GXL DNA Polymerase, 5.0 µL of 5× Prime STAR GXL Buffer, 2.0 µL of dNTP mixture (2.5 mM each), 1.0 µL of primer To2nd_CRF2_CS_UMI13 (10 µM), 1.0 µL of primer To2nd_CRR1_withCS (10 µM), 10 ng of sample, and adjusted to 15.5 µL with distilled water. Four cycles of 10 s at 98 °C, 30 s at 55 °C and 1 min at 68 °C were carried out in a thermal cycler, and the reaction solution was then stored at 4 °C. Next, 0.1 µL of exonuclease I (Takara Bio), 3 µL of 10× reaction buffer (included with the exonuclease), and 1.9 µL of nuclease-free water were added for a total of 30 µL, and the reaction was carried out at 37 °C for 30 min to remove the primers. The exonuclease was then inactivated at 80 °C for 20 min, and 10 µL of distilled water was added to bring the total volume to 40 µL. The post-reaction mixture was treated with AMPure XP beads (Beckman Coulter), and PCR amplicons were eluted in 17 µL of 10 mM Tris-HCl buffer (pH 8.0), 15.5 µL of which was collected and used as sample for the second PCR.  Library construction without the UMI tag was performed with the primers To2nd_CRF2_withCS: CTC TTC  CGA TCT GTC CTA GTA CYY AAC CAT TCATA and To2nd_CRR1_withCS: CTC TTC CGA TCT CAG TGA CCC YCT AGA RAG AACG as in the first PCR. The subsequent procedures were performed under the same conditions as in the UMI library preparation. The constructed libraries were paired-end sequenced at 600 bp on a MiSeq next-generation sequencer (Illumina).
HaCeD-Seq analytical pipeline. Without UMI tags. After joining the paired-end reads with FLASh ver 1.2.11 48 using the option "-M 300", 20 nucleotides from the 5′-end and 22 nucleotides from the 3′-end were deleted as the adapter and primers region, and the reads that contained one or more < Q30 bases were pruned using a self-made script. The variety and abundance of each haplotype was calculated by clustering the remaining reads at 100% agreement.
Next, 20 nucleotides from the 5′-end and 22 nucleotides from the 3′-end were deleted as the adapter and primers regions of the obtained consensus sequence, and the number of 100% matching consensus sequences was calculated for each haplotype.

Data availability
All MiSeq sequencing data were registered in the DNA Data Base of Japan (DDBJ) under accession number DRA010682.