Phylogenomic and morphological relationships among the botryllid ascidians (Subphylum Tunicata, Class Ascidiacea, Family Styelidae)

Ascidians (Phylum Chordata, Class Ascidiacea) are a large group of invertebrates which occupy a central role in the ecology of marine benthic communities. Many ascidian species have become successfully introduced around the world via anthropogenic vectors. The botryllid ascidians (Order Stolidobranchia, Family Styelidae) are a group of 53 colonial species, several of which are widespread throughout temperate or tropical and subtropical waters. However, the systematics and biology of this group of ascidians is not well-understood. To provide a systematic framework for this group, we have constructed a well-resolved phylogenomic tree using 200 novel loci and 55 specimens. A Principal Components Analysis of all species described in the literature using 31 taxonomic characteristics revealed that some species occupy a unique morphological space and can be easily identified using characteristics of adult colonies. For other species, additional information such as larval or life history characteristics may be required for taxonomic discrimination. Molecular barcodes are critical for guiding the delineation of morphologically similar species in this group.


Results
The results are presented in two parts: (1) phylogenetic relationships among botryllid species, and (2) morphological analyses in the botryllid group. These two sets of results were generated using distinct datasets. The phylogenomic tree was constructed using sequencing reads for 200 loci from 55 specimens of botryllids. Collection information and a unique identifier for each of the 55 specimens can be found in Table 1. The morphological dataset includes all 53 described botryllid species, plus four additional taxa that are not currently described in the literature (see Supplementary Fig. S1 for morphological descriptions of these four taxa). We compiled data on 31 morphological characters for each of these 57 species (see Supplementary Fig. S2), and used these characters in a Principal Components Analysis.

Phylogenetic relationships among botryllid species. 200 Anchored Hybrid Enrichment (AHE) loci
were sequenced in 55 specimens. The mean length of the loci is 705 bp and the median length is 296 bp, indicating that the distribution of locus length is skewed to the right. The range of locus length is between 121 and 5172 bp. The total length of the alignment is 141,107 bp. Figure 1 is the tree generated by ASTRAL, and Supplementary Fig. S3 is the tree generated by RAxML. The topologies of the two trees are identical. The Botryllus genus is paraphyletic with respect to the Botrylloides genus. Focusing on the Botryllus genus first, there are three Botryllus-only clades (a, c, and d in Fig. 1), Clade a comprising Botryllus sp. (Bp1), and two specimens from the Western Pacific (the Philippines and Australia: Bsp1 and Bsp2). Clade c includes a Botryllus specimen from Florida (Bsp3) and a specimen of Botryllus horridus from Japan (Bh1). Clade d is formed by Botryllus gaiae(Bga1, Bga2), and Botryllus schlosseri from Italy (Bsp4), and is a sister group to Clade c. A specimen we collected in Bocas del Drago (Bsp5), Panama (in the Bocas del Toro archipelago), is the only Botryllus in a clade (e) that includes all the Botrylloides species. We will refer to this specimen as Bocas del Drago.
Moving on to the Botrylloides portion of the phylogeny, Clade g contains Botrylloides fuscus (Bf1-Bf3 in Clade h) as a sister taxon to a Botrylloides giganteus/Botrylloides violaceus group (Clade j). Clade r contains three specimens (Bsp7-9) of a species we are calling Rabbit Key in this manuscript, based on its collection location offshore of Rabbit Key in the Florida Keys. This species has also been found offshore of Barnes Key in the Florida Keys (the location of the sample in the phylogenomic trees). Clade r is a sister clade to Clade q, containing a Botrylloides sp. from the Bahamas (Bsp6). Clade r + q is a sister group to Clade p, containing specimens (Bsp10-13) from a species found in Bocas del Toro, Panama, which we will refer to as Bocas del Toro. Clades p, q, and r form Clade o, which is a sister clade to Clade n, a species from the Verde Island Passage in the Philippines which is represented by Bsp16-Bsp23. Clade s includes a specimen from the Pacific (Sydney, Australia: Bsp14) as a sister group to a clade labeled Botrylloides diegensis (v) + a clade labeled Botrylloides niger (u). The clade labeled Botrylloides diegensis (v) contains specimens with three different names: Botrylloides diegensis from California (Bd4 and Bd5), Botrylloides leachii from New Zealand (Bd6-8), and Botrylloides praelongus from Japan (Bd1-3). For brevity's sake, we labeled Botrylloides diegensis/leachii/praelongus samples as Botrylloides diegensis in Fig. 1 Morphological analyses in the botryllid group. The results of the PCA analysis are presented in Fig. 2.
All species have been given a numerical code , and the correspondence between the species name and the numerical code is presented in the legend. For easier viewing, Fig. 2a excludes three species that are widely divergent from the other botryllids: Botryllus flavus, Botryllus magnus, and Botryllus renierii (Species 33,38,47). These three outlier species, which are close together in the PCA, are included in Fig. 2b.
The distances between the species in Fig. 2 point to low morphological variation among many botryllid species. Despite this, there are clear morphological outliers. For example, Botryllus primigenus (43) and Botryllus tuberatus (56) both have small zooids (1.5 and 0.8 mm zooid length, respectively), where the average is 2.8 mm (Supplementary Table S1). Botrylloides lenis (11) has smaller zooids (1.3 mm, when the average is 2.2 mm: Supplementary Table S1) and a thinner tunic than other Botrylloides (1.7-2.0 mm, when the average is 3.25 mm: Supplementary Table S1), although external colony morphology such as tunic thickness may be environmentally influenced (Brunetti 2009, CS Cohen personal observation). Botrylloides giganteus (8) has the largest zooids in the botryllid group (18-20 stigmatal rows), where the average is 10-12. The tunic thickness in this species can reach 15 mm, when the average in the Botrylloides is 3.25 mm.
It should be noted that the first two principal components only represent 13% of the variation in the morphological characters when the outliers are excluded and 14% when they are included, so the attempt to reduce the dimensionality of the data set excluded a large proportion of the useful variation between the species. Therefore, the values in these characters cannot be easily correlated with one other.

Discussion
The topology of the nuclear tree shown in Fig. 1 can be compared to the topologies of three previously published trees. The first two trees were constructed using 13 mitochondrial proteins 38,48 , and the third using 18 s rDNA 39 . The species represented in the mitochondrial trees are Botryllus schlosseri, Botrylloides giganteus (identified as Botrylloides pizoni in Ref. 48 ), Botrylloides leachii (which may be Botrylloides diegensis 24 , Botrylloides niger, and Botrylloides violaceus. The more recent mitochondrial tree 38 also includes Botryllus gaiae, which was formerly Clade E in the Botryllus schlosseri complex. In both of these mitochondrial trees, the Botrylloides clade is sister to the Botryllus clade. This is in contrast to Fig. 1, where Botryllus is paraphyletic with respect to Botrylloides although based on a single taxon (Bsp5). Within the Botrylloides clade, Botrylloides leachii and Botrylloides niger are sister species in both mitochondrial trees and in Fig. 1. Botrylloides giganteus and Botrylloides violaceus are sister species in Ref. 38 and Fig. 1, whereas Botrylloides violaceus is sister to a giganteus/leachii/niger clade in Ref. 48 ).
Comparing nodes in Fig. 1 to well-supported nodes in Ref. 39 's 18 s rDNA Maximum Likelihood Tree (those nodes that have ≥ 80% bootstrap support), both trees show Botryllus species as the outgroups to the Botrylloides clade. The 18 s rDNA tree includes Botrylloides fuscus, Botrylloides violaceus, Botrylloides simodensis, and a sample from Venice Lagoon that was provided and named as Botrylloides leachii by A. Sabbadin. Botrylloides fuscus is an outgroup to a violaceus, leachii, simodensis clade, but the relationship is not well-supported. In the current phylogeny, Botrylloides fuscus is in a clade with Botrylloides violaceus, and this Botrylloides fuscus/Botrylloides violaceus clade is sister to a clade that includes Botrylloides diegensis.
Previous studies in the Botryllus schlosseri species complex have revealed that sister species can have very different dispersal patterns 42,49,50 . There are five clades in the species complex, with Botryllus schlosseri and Botryllus gaiae having widespread ranges, and Clades B, C, and D geographically restricted. Beyond the Botryllus schlosseri species complex, the four widespread species in the botryllid group are Botrylloides diegensis, Botrylloides giganteus, Botrylloides niger, and Botrylloides violaceus. Botrylloides species may have a higher likelihood of global spread than Botryllus species, despite Botrylloides being less diverse than Botryllus (21 vs. 32 described species). Botrylloides larvae are larger than Botryllus larvae and have longer developmental times 39 , which could impact dispersal or settlement abilities. In the current phylogeny, Botrylloides giganteus and Botrylloides violaceus are sister groups, as are Botrylloides diegensis and Botrylloides niger ( Fig. 1 and Supplementary Fig. S3). However, these sister groups may not remain as such when additional species are added, and testing whether range size has a phylogenetic signal across the entire group will require a more taxonomically comprehensive phylogeny.
The morphological distinctions between the genera Botryllus and Botrylloides have been much debated, and taxonomists have disagreed about whether there should be one genus or two genera (reviewed in Ref. 32 ). Based on the morphological analyses in this study, species in each genus cluster together, although there is some overlap where the clusters meet. The five Botrylloides species that cluster with Botryllus in the PCA were named for a single Botrylloides-like character when the majority of the characters are Botryllus-like. This supports the argument that the two genera should be maintained 32 . But with a robust and broadly representative phylogeny, the generic assignations in this group can now be evaluated in a phylogenetic context.
According to phylogenetic systematics, all taxa should be monophyletic 51 . While some evolutionary systematists have argued for the maintenance of paraphyletic groups 52,53 , the precise location in the phylogeny where a nested group should be given a different name cannot easily be determined because evolution is most often a gradual process 54,55 . For this reason and others detailed in Ref. 55 , a broad agreement in the systematic community has formed in favor of monophyly [55][56][57] .
We classified the Bocas del Drago specimen (Bsp5) as Botryllus because many of its morphological characters were Botryllus-like. This species falls at the base of the Botrylloides clade. In a two genera classification scheme, the Botryllus clade would comprise two monophyletic groups plus one species in the Botrylloides clade. This paraphyly of Botryllus is also reflected in the 18S rDNA phylogeny 39 . Because the genus Botryllus is not a monophyletic group, we suggest that the generic distinctions within the botryllid group be re-evaluated. Based on the molecular phylogeny presented here, we agree with taxonomists who propose that the genus Botrylloides should be deleted, as it is a junior synonym of the genus Botryllus 37,47 . Their argument is based on morphology: no clear morphological distinction exists between the two genera and all morphological characters represent "various states of continuous evolution" 37,47 .
The botryllid ascidians are taxonomically understudied, despite their research significance as model systems, and the extensive ranges of Botrylloides diegensis, Botrylloides giganteus, Botrylloides niger, Botryllus schlosseri, and Botrylloides violaceus. Several widely distributed botryllid ascidians have been misidentified, and correct identification of these species is critical for understanding their biology and spread as well as detecting the spread of additional species. The identification of botryllid species within each genus can be accomplished through morphological examination for those species that are clear morphological outliers. But the majority of the botryllid species are morphologically very similar to several other species, so discrimination based on morphological characters alone is very difficult. Species delimitation analyses are therefore necessary to identify and describe species in this group. While the current phylogenomic tree does not have enough samples per taxon to conduct such analyses, we can obtain preliminary results from mtCOI barcode sequences. Table 1 includes a column describing whether each barcode is unique (i.e. does not have 100% identity to any other sequences on GenBank). Table 1 also lists the best BLAST result where the subject and query are labeled as different species, to illustrate the amount of divergence between species. Not all barcodes are unique, but when a barcode query exhibit 100% identity, the subject is always the same species. When subject and query are labeled as different species, the % identity between them is 92% or lower, suggesting a barcode gap between species.
The mtCOI barcodes provide preliminary evidence that the barcoded taxa in the tree are distinct species. Robust species delimitation analyses will be possible using the 200 AHE loci developed here. These loci could www.nature.com/scientificreports/ also be used to expand the current phylogenomic tree, in order to fully represent the evolutionary relationships within the botryllid group.

Methods
Sample collection. Samples were collected from 1995 to 2017 from both artificial and natural substrates ( Table 1). The collection location, geographic coordinates, habitat, and GenBank accession number of mitochondrial cytochrome oxidase I (mtCOI) gene for each sample are presented in Table 1. Photographs of samples can be viewed in Supplementary Fig. S4. A small piece of tissue was removed from each colony in the field, cleaned to remove algae and other contaminants, and preserved in 95% ethanol, RNAlater (Thermo-Fisher), or a DMSO solution saturated with NaCl. For the species that are described in a morphological context in this study, colonies were relaxed using menthol crystals and subsequently preserved in 10% formalin in salt water buffered with sodium borate. Some samples collected in the Philippines were relaxed with tricaine methanesulfonate (MS222). However, we do not recommend this for ethanol samples that are intended for genetic analysis; these samples required multiple rounds of isopropanol precipitation at the end of the DNA extraction process in order to purify them, and even with purification did not always produce usable libraries. PCR products were incubated with 1 µl each of Exonuclease I (New England Biolabs) and Antarctic Phosphatase (New England Biolabs) at 37 °C for 1 h, followed by 90 °C for 10 min. The PCR products were sequenced at the University of Kentucky's HealthCare Genomics Core Laboratory using an ABI-3730 automated sequencer (Applied Biosystems). Forward and reverse sequences were edited and combined into a consensus sequence using Codon Codes Aligner (Codon Code Corporation).
We compared each sequence to botryllid sequences available on GenBank. If the sequence we obtained had 98-100% identity to a sequence identified on GenBank using blastn, we considered our sample to be the same species as the sample on GenBank. Because GenBank sequences can be mis-assigned, we only used GenBank identifications in which the submitting author had independently verified the taxonomic assignment of the sample using morphological characters. In many cases, the mtCOI sequence had no close match on GenBank.
Sample identification: morphological techniques. Taxa were also assigned to species by examination of morphological characters, when formalin preserved samples were available, using descriptions from the literature 11,[30][31][32][33][34][35][36]38,40,[43][44][45][46][47][48][60][61][62][63][64][65] . The 31 analyzed morphological characters are summarized as follows: arrangement of systems, position of ovaries and testes, testes morphology, number of stigmatal rows, completeness of the second stigmatal row, arrangement of stigmata, shape of intestine, location of anterior edge of intestinal loop, location of anus, number of stomach folds, appearance of the stomach folds, shape of the stomach, shape and size of the pyloric caecum, number and size orders of the oral tentacles, distribution of pigment cells in the zooid, zooid length, colony color when living and after fixation, and tunic thickness [30][31][32] . If the species could not be identified with complete certainty from the literature, we used a subset of the 31 characters: arrangement of systems, position of ovaries and testes (if ovaries and testes were present), appearance of the stomach folds, shape of the stomach, and size of the pyloric caecum. At least 30 zooids were examined from each colony.
When possible, we examined the morphology of the same colony from which the mtCOI sequencing was obtained. If a formalin preserved sample of the original colony was not available, we were often able to examine the morphology of another colony with an identical mtCOI sequence. We did not assign a specimen to a specific morphologically-described species if we could not obtain a formalin sample of the original colony or of a colony with an identical mtCOI sequence.
In the course of our species assignment using morphology, several specimens did not match the descriptions of any species previously found in the geographic region in which the specimen was collected. Either these samples represent new species, or they are known species that have only recently been identified from the location where they were collected. To determine where these species fit in relation to the morphologies of the other Botryllus/Botrylloides species, we searched the literature for morphological information on the 53 described botryllid species. We compiled data on the 31 morphological characters described above. The genus names (Botrylloides or Botryllus) and the morphological data come from the type description. In a few cases, the type description is lacking data on the majority of these characters. If a re-description was available, it was used to supplement the type description. Because our examinations were more thorough than type descriptions (often type descriptions do not provide data on individual zooids or individual colonies), we averaged our quantitative www.nature.com/scientificreports/ data across the 30 + zooids we examined in each colony to obtain a single value for each colony. We then averaged quantitative data across multiple colonies to obtain a single value for each character for each species, to match the data available in the literature. A brief description of each considered morphological character is available in Supplementary Fig. S2, and the entire data matrix is available in Supplementary Table S1. Using the morphological data from the 53 described species and the 4 undescribed species, we then conducted a Principal Components Analysis using PCAmixdata 66 as implemented in R version 3.6.1.
To accompany the PCA, we constructed a phylogenetic tree of the 57 species using the 31 morphological characters using MrBayes 3.2.2 67 on the CIPRES (Cyberinfrastructure for Phylogenic Research) Science Gateway 68 .
The GTR + G model of nucleotide substitution was applied to all data sets (Nset = 6). Each analysis was run for 10 million generations, with sampling every 1000 generations. The first 2000 trees were eliminated as burn-in.
Anchored hybrid enrichment (AHE) locus identification and probe design. Our aim was to develop a resource for collecting hundreds of orthologous loci across the botryllid ascidians using Anchored Hybrid Enrichment (AHE) 69 . The pre-existing genomic resources included an assembled genome of Botryllus schlosseri 70 , and two assembled transcriptomes: Botryllus schlosseri 71 , and Botrylloides leachii 72 , recently reassigned to Botrylloides diegensis in Ref. 24 . In order to better represent the high diversity of the botryllid group, we collected low-coverage, whole genome data assemblies for seven additional species (details are given in Supplementary Table S2). DNA extracts for these seven species were sent to the Center for Anchored Phylogenomics (http:// www. ancho redph yloge ny. com) for processing. In brief, after the quality/quantity of DNA was assessed using Qubit, Illumina libraries with single 8 bp indexes were prepared following 73 , with modifications described in Ref. 74 . Libraries were pooled and sequenced on two Illumina HiSeq2500 lanes with a paired-end 150 bp protocol. A total of 125 Gb of data was collected yielding 25-65 × coverage per species. Reads were filtered for quality using the Cassava high chastity filter, demultiplexed with no mismatches tolerated, and merged to remove sequence adapters 75 prior to downstream processing.
In order to identify suitable conserved targets for AHE, we performed reciprocal blast on local machines at the Center for Anchored Phylogenomics using the two assembled transcriptomes (blastn). Using the results from the blast searches, we identified 482 preliminary targets with matching transcripts, which we aligned using MAFFT v7.023b 76 . Alignments were manually inspected in Geneious (vR9, Biomatters Ltd., Kearse et al. 2012), then trimmed to regions that were well-aligned. For the remainder of the locus development/identification, we followed the protocol outlined in Ref. 77 . More specifically, we isolated the Botryllus schlosseri (transcriptome) sequences from the aforementioned alignments, and using those as a reference scanned the Botryllus schlosseri genome for the AHE regions. Regions of 10,000 bp containing a 17 of 20 initial spaced k-mer match, followed by a 55 of 100 confirmation match to one of the references were kept. K-mers are all of a sequence's subsequences of length = k. For example, the sequence GCTA would have the following k-mers: G, C, T, A, GC, CT, TA, GCT, CTA, and GCTA. K-mers from the Botryllus schlosseri transcriptome were used to search the Botryllus schlosseri genome for AHE regions, and matches were based on spaced seeds as described in Ref. 78 . We then aligned (using MAFFT), the best matching genome sequence for each locus to the two transcriptome-derived sequences for that locus. Using Geneious (vR9, Biomatters Ltd.), we identified well-aligned regions of each three-sequence alignment and trimmed the alignments accordingly. The three-sequence alignment contained only two species: Botryllus schlosseri and Botrylloides leachii (recently re-assigned as Botrylloides diegensis) 24 .
In order to incorporate whole genome sequencing (WGS) data from the seven additional species, we utilized sequences from Botrylloides leachii and Botryllus schlosseri in the alignments as references. Each WGS read was checked against the reference database and reads with a preliminary 17 of 20 initial spaced k-mer match, followed by a final 55 of 100 bp consecutive match were retained, then aligned by locus to form seeds for an extension assembly that allowed flanking regions to be recovered (see Ref. 77 for details and scripts). In order to construct the final alignments, the (up to) 10 sequences for each locus were aligned in MAFFT, then trimmed to well-aligned regions after inspection in Geneious (vR9, Biomatters Ltd.). In order to avoid problems associated with missing data in downstream projects 79 , loci represented by less than 50% of the sequences in the alignment were removed from downstream analysis. When alignments from two loci were found to be overlapping (i.e. containing some of the same 20-mers), one locus was removed to ensure that each locus was a unique target. Lastly, we checked for repetitive elements by profiling the k-mers found in the alignments with respect to their occurrence in the WGS reads. Regions with a substantially elevated k-mer coverage were masked. A total of 200 AHE targets resulted from the process. Supplementary Table S3 contains the size of each locus, and genomic position of each locus in the Botryllus schlosseri genome, as determined from the best blastn match to the Botryllus schlosseri genome assembly using the locus sequence as a query. Finally, in-silico probes were tiled uniformly across the 10 sequences for each locus at 3.5 × coverage depth. A total of 54,350 probes covered the 200 AHE targets (total target size ~ 139 kb) that resulted from the process. These loci were successfully amplified in Symplegma brakenhielmi, to provide an outgroup for the phylogenomic tree. These loci will therefore be useful for Symplegma, which is the sister group to the Botrylloides/Botryllus clade 80 . The utility of these loci beyond the genera Botrylloides, Botryllus and Symplegma has not been investigated.
DNA extraction and library preparation methods. DNA for all samples in Table 1 was extracted using the E.Z.N.A DNA isolation kit (Omega BioTek), and an additional isopropanol precipitation was performed to further purify the DNA. The quantity and quality of the DNA extractions were determined using Qubit and 2% TAE agarose gels. The extracted DNA was fragmented into 300-500 bp pieces using a Covaris E220 focused-ultrasonicator with microTUBES (Covaris). Then, library preparation and indexing were performed on a Beckman-Coulter Biomek FXp liquid-handling robot, using a protocol based on Ref. 73  Raw read alignment. Sequence reads were demultiplexed with no mismatches tolerated and filtered for quality using the Illumina CASAVA pipeline with a high chastity setting. Overlapping reads were identified and merged using the approach described by Ref. 75 . This process removes sequence adapters and corrects sequencing errors in overlapping regions. Reads were then assembled using the quasi-de novo approach described by Ref. 77 . This assembly approach uses divergent references to identify sequences coming from conserved regions to which reads can be mapped. The mapped reads are in turn used as references when the assembly is extended into less conserved regions (see Ref. 77 for details). Probe region sequences from eight of the nine species used in the probe design were used as references for the initial mapping, while sequences from the Botryllus schlosseri genome (the 9th species) served as the primary reference. Consensus sequences were constructed from assembly clusters containing greater than an average of 250 reads. Ambiguity codes were employed for sites in which base frequencies could not be explained by a 1% sequencing error.
Phylogenomic tree building. Orthologous groups of consensus sequences were identified using a clustering approach that relied on an alignment-free distance matrix constructed by measuring the degree of 20-mer distribution overlap among taxa (see Ref. 77 for details). Sequences from orthologous sets of loci were then aligned using MAFFT (v7.023b 76 ). Alignments were trimmed and masked (i.e. excluded) using the automated procedure described by Ref. 77 , with 50% threshold required for identifying reliable sites, a 14-base threshold for masking misaligned regions, and 25 sequences required to be present at a site to prevent removal of the site. Alignments were inspected visually in Geneious (vR9, Biomatters Ltd.) to ensure that the settings used in the automated procedure were appropriate and also to identify any undetected misaligned regions. Phylogenomic trees were built using two methods: a concatenated species tree method using RAxML v8.2.8 81 , and a coalescent species tree method using ASTRAL-II v4.10.12 82 . Maximum likelihood gene trees were first created for each locus separately. Then, maximum likelihood trees were created using a concatenated alignment partitioned by locus. A GTR + G model of nucleotide substitution and 1000 bootstrap replicates were employed for both gene trees and the species tree. The gene trees produced by RAxML were then used as inputs for ASTRAL-II. ASTRAL-II obtains quartet trees from the gene tree inputs, and creates a species tree that contains the maximum number of quartet trees present in all gene trees 82 .
A maximum likelihood tree-building framework using concatenated multiple gene alignments to obtain a species tree is a common approach 83,84 . We also employed a second, coalescent-based method (ASTRAL-II). Coalescent-based methods are often used because concatenation can lead to inaccurate species trees with high levels of bootstrap support 84 . ASTRAL-II is a summary method 83 , and is preferable to Bayesian co-estimation coalescent methods due to computational difficulties with datasets containing > 100 loci or > 30 samples [85][86][87] .

Data availability
All mtCOI barcode sequences associated with this study have been uploaded to GenBank: Accession numbers are available in Table 1. Genome raw reads, genome assemblies, and alignments for probes are available on Dryad (https:// doi. org/ 10. 5061/ dryad. 3r228 0gf7). All other data generated or analyzed during this study are included in this published article (and its Supplementary Information files).