The crown-of-thorns starfish genome as a guide for biocontrol of this coral reef pest

The crown-of-thorns starfish (COTS, the Acanthaster planci species group) is a highly fecund predator of reef-building corals throughout the Indo-Pacific region. COTS population outbreaks cause substantial loss of coral cover, diminishing the integrity and resilience of reef ecosystems. Here we sequenced genomes of COTS from the Great Barrier Reef, Australia and Okinawa, Japan to identify gene products that underlie species-specific communication and could potentially be used in biocontrol strategies. We focused on water-borne chemical plumes released from aggregating COTS, which make the normally sedentary starfish become highly active. Peptide sequences detected in these plumes by mass spectrometry are encoded in the COTS genome and expressed in external tissues. The exoproteome released by aggregating COTS consists largely of signalling factors and hydrolytic enzymes, and includes an expanded and rapidly evolving set of starfish-specific ependymin-related proteins. These secreted proteins may be detected by members of a large family of olfactory-receptor-like G-protein-coupled receptors that are expressed externally, sometimes in a sex-specific manner. This study provides insights into COTS-specific communication that may guide the generation of peptide mimetics for use on reefs with COTS outbreaks.

COTS genes are labelled and are marked with red lines; other asteroids, two shades of orange and yellow lines; sea urchins, dark green; hemichordates, light green; molluscs, pink; annelids, purple; cnidarians, black; and vertebrates, blue. The three clades to which COTS sequences belong are indicated by the outer circle. The asterisk denotes the fish-specific true ependymin clade. c, One of the COTS EPDR gene clusters on scaffold 218, with exons (grey bars and arrowheads), intergenic regions and introns (thin black lines) and direction of transcription (arrowhead at end of coding sequence) shown. Scale bar, 10 kb. In all panels, EPDRs secreted by COTS into the seawater are highlighted by red or green ovals as in Fig. 2b. letter reSeArCH 1 3 A p r i l 2 0 1 7 | V O l 5 4 4 | N A T U r E | 2 3 3 both populations declined and recovered in a similar manner during the late Pleistocene epoch ( Fig. 1e and Supplementary Note 4). Given that GBR and OKI genomes are nearly identical, we treat them as one in subsequent analyses.
The COTS genome is a noteworthy addition to the existing suite of deuterostome genomes 10 . It shares many gene family and domain expansions with hemichordates and the sea urchin, although the COTS genome has fewer lineage-restricted gene and domain expansions (Extended Data Fig. 3 and Supplementary Notes 5, 6). The genome also has extensive microsynteny with other deuterostomes, including conserved Hox (Extended Data Fig. 4), ParaHox and Nkx gene clusters 11,12 .
To identify candidate factors for the development of biocontrol mitigation technologies, we targeted genes potentially involved in conspecific chemical communication. COTS, like many other marine invertebrates, rely primarily on their chemosensory system to detect environmental signals including those from prey, predators, and conspecifics during reproduction 13,14 . Water-borne signals probably guide adults to form aggregations before a synchronised spawning event 15,16 .
Proteins and peptides released by aggregating or alarmed COTS into the surrounding seawater were sequenced using mass spectrometry. By mapping these sequences to the genome, we identified gene products released by COTS when aggregating (244) or alarmed (77) or in both situations (73) (Supplementary Note 7). When exposed in a Y-maze assay to seawater containing putative aggregation factors, naive, normally sedentary COTS become highly active and move in the direction of the source of the conditioned seawater ( Fig. 2a and Supplementary  Videos 1, 2). This is consistent with water-borne factors being released during aggregation and detected by conspecifics at a distance. These released factors provide a potential basis for future biocontrol measures that include mass attraction to facilitate efficient collection and removal of COTS. This starfish also reacts rapidly and adversely to C. tritonis-conditioned seawater (Extended Data Of the exoproteins identified, 108 contain signal peptides and are probably secreted in a regulated manner. 71 of these were secreted from aggregating COTS, 14 from alarmed COTS and 23 were secreted under both conditions ( Fig. 2b and Supplementary Note 7). The genes encoding these secreted proteins are expressed in external tissues, including the spines, body wall and mouth, consistent with their release into the surrounding environment (Fig. 2c). Of the secreted proteins, 48 are enzymes, of which 83.3% (40) are hydrolyases, including plancitoxin-1, a type II DNase present in COTS venom 17 . In addition, 37 proteins are related to known secreted signalling and structural proteins, including 15 ependymin-related proteins (EPDRs), five lectins and four proteins related to deleted in malignant brain tumours 1. There are also 21 uncharacterized secreted proteins, 14 of which have substantial shared identity with proteins in other animals (Extended Data Fig. 6).
The detection of 15 EPDRs in the secretome of aggregating COTS (Fig. 2b) suggest that they potentially have a role in conspecific communication; an additional 11 EPDR genes in the COTS genome are also highly expressed, mostly in externally connected organs and tissues (Fig. 3a). Except for the signal peptide and a small number of spatially conserved cysteine residues, these 26 EPDRs share very low sequence similarity. Although EPDRs are present in most metazoans 18,19 , this family has uniquely expanded in asteroids into at least eight orthology groups comprising two larger clades (Fig. 3b, Extended Data Fig. 7 and Supplementary Note 8). Identified asteroid orthologues share little sequence similarity, suggesting that these EPDRs have rapidly evolved to give rise to species-specific repertoires of putative communication factors. Nineteen of the COTS EPDR genes comprise two compact tandem arrays that vary markedly in sequence and expression ( letter reSeArCH 2 3 4 | N A T U r E | V O l 5 4 4 | 1 3 A p r i l 2 0 1 7 provide a potential basis for the development of peptide mimetics for COTS biocontrol. The COTS genome encodes approximately 950 G-protein-coupled receptor (GPCR) genes, similar to other deuterostomes 21,22 (Extended Data Table 2 and Supplementary Note 9). Many of the approximately 750 COTS rhodopsin-class GPCRs-the class comprising chemoreceptors 23 -are organized in species-specific tandem arrays of unique, single exon genes, akin to putative olfactory receptors in other deuterostomes 21,22 (Fig. 4a, b and Extended Data Fig. 8). The enrichment of COTS olfactory-receptor-like GPCR transcripts in external and sensory tissues, including the radial nerve, spine and body wall (Fig. 4c), is consistent with their role in the detection of water-borne signals. A similar enrichment is observed in sea urchins, which also appear to change behaviours because of olfactory signals 13,14,21 . The COTS radial nerve is in direct contact with the external environment 13,14,24 and displays sexually dimorphic expression of rhodopsin GPCRs (Fig. 4c), suggesting that conspecific signals are perceived differently by male and female starfish.
Sequencing of the COTS genome and proteomic analyses have enabled the identification of species-specific secreted factors associated with aggregating starfish, such as EPDRs, which can lead to the development of peptide mimetics for biocontrol measures. The high similarity of GBR and OKI genomes indicates that genome-based mitigation strategies developed for one locale can be applied throughout the species' range. These genomic data will also be useful in ecological and population studies into the causes of COTS outbreaks, contri buting to regional-scale management of this coral reef pest. Further, this study suggests that species-specific secreted factors involved in conspecific communication in marine animals 25 -identified through combined genomic and proteomic approaches-have the potential to revolutionise mitigation technologies for aquatic pests more generally.
Online Content Methods, along with any additional Extended Data display items and Source Data, are available in the online version of the paper; references unique to these sections appear only in the online paper. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment. Genome sequencing and assembly. Genomic DNA was extracted from testes and sperm of two individuals. One was collected from Rudder Reef on the northern Great Barrier Reef (GBR), Australia (16° 11′ 46.4′ ′ S 145° 41′ 48.7′ ′ E) and the second from Motobu, Okinawa (OKI), Japan (26° 40′ 46.1′ ′ N 127° 52′ 46.1′ ′ E) (Fig. 1c). These two genomes were sequenced, assembled and annotated separately using standard methods 11,26 (Extended Data Table 1, Supplementary Fig. 2.1 and Supplementary Note 2). Paired-end libraries of 40× (GBR) and 46× (OKI) coverage were sequenced on an Illumina MiSeq sequencer, generating 250 nucleotide reads. An approximately 800 bp paired-end library for GBR was sequenced in three MiSeq runs, and two paired-end libraries of around 600 bp and around 1,000 bp for OKI were each sequenced in two MiSeq runs. For the GBR genome, 3 mate-pair libraries with insert sizes of 3, 8 and 12 kb were sequenced by Macrogen, Inc. For the OKI genome, 4 mate-pair libraries with insert sizes of 1.5-4, 4-6, 6-8 and 8-12 kb were sequenced on an Illumina HiSeq 2500 sequencer. Overall read coverage including both mate-pair and paired-end libraries was 152× for GBR and 139× for OKI (Extended Data Table 1 and Supplementary Note 2).
(2) DNA from COTS sperm was compared with DNA from Takifugu rubripes and Danio rerio sperm 26,30 by flow cytometry. All nuclei were treated with a DAPI flow cytometry kit and a BD Cycletest Plus DNA Reagent Kit (BD Biosciences), and analysed on a BD FACSAria II cell sorter (BD Biosciences) 31 (Supplementary Note 2). (3) Genome length was estimated on the basis of total scaffold length of the assembled genomes. Transcriptome sequencing and assembly. RNA was extracted from testes, podia, spines and stomach/mouth tissues from the same individuals from which GBR and OKI genomic DNA was obtained; RNA from the other tissues was isolated from different OKI and GBR individuals (Supplementary Note 2). Tissue-specific RNA-seq libraries were generated and sequenced on an Illumina HiSeq 2500 using standard methods 26 . De novo transcriptomes were assembled using Trinity version r20131110 (ref. 32). Genome-guided RNA transcripts were generated using Tuxedo 33 (Supplementary Note 2). Gene modelling and annotation. Both GBR and OKI genomes were masked using RepeatMasker version 4.0.3 (parameters: -qq -pa 8 -gff -species 'fungi/metazoa group' -no_is). Assembled transcripts were then mapped back to either the GBR or the OKI masked genome and used to generate a consensus transcript set via PASA (version 20140417). Only transcripts with over 90% transcript coverage (parameter: min_percent_aligned) and 95% identity (parameter: min_avg_per_id) were merged. Open reading frames (ORFs) predicted from PASA-assembled transcripts using TransDecoder (https://sourceforge.net/p/transdecoder/) were used to train Augustus to generate gene predictions for each genome (Supplementary Note 2). Additionally, all core eukaryotic genes were mapped to each genome using CEGMA (version 2.4). CEGMA predictions were used to train SNAP (version 2013-11-29). Unsupervised genes were also predicted using GeneMarkES (version 20120203).
Final gene predictions were generated using EVM 34 by combining (1) ab initio predictions by Augustus, SNAP and GeneMarkES, (2) consensus transcripts generated by PASA based on combined transcriptomes of both populations and (3) TransDecoder best ORF predictions based on PASA consensus transcripts. A genome browser is available at: http://marinegenomics.oist.jp/cots/. Estimation of intra-and inter-genome heterozygosity. Overall genome heterozygosity was estimated by single-nucleotide polymorphism (SNP) analysis, by mapping paired-end reads onto the scaffolds using BWA 35 . SNPs were called and analysed using Stools 36 . Further SNP analysis was done by mapping OKI reads to the GBR genome, and vice versa. OKI and GBR COTS genomic assemblies were aligned by reciprocal BLASTN+ 37 . Scaffolds > 10 kb or alignments with E values < 1 × 10 −5 were analysed. SNP distribution across each genome was also compared (Supplementary Note 3). LiftOver analysis between GBR and OKI genomes. Comparison of OKI and GBR gene models was performed using batch coordinate conversion (LiftOver) from the UCSC Genome Browser Utilities 38 . LiftOver settings were optimised to generate the maximal number of significant gene model matches between the two genomes (Supplementary Note 3).
Phylogenomic analysis. Phylogenomic methods followed the general approach of ref. 39 (Supplementary Note 4). Transcriptomes from refs 40 and 41 were assembled as described previously. Other publicly available Illumina transcriptomes were digitally normalized and assembled using the 13April 2014 release of Trinity 42 . Contigs were translated with TransDecoder (https://sourceforge.net/p/ transdecoder/) using Pfam 27 as a guide. Predicted proteins and translated transcriptomes were combined for each of the COTS.
For orthology inference, we employed HaMStR 13 (ref. 43) using the 1,032 'model organisms' profile hidden Markov models (pHMMs). Sequences matching the pHMM of an orthology group were then compared to the proteome of Homo sapiens using BLASTP with the default settings implemented by HaMStR. If the H. sapiens amino acid sequence contributing to the pHMM was the best BLASTP hit in each of these back-BLASTs, the sequence was then assigned to that orthology group.
Sequences in orthology groups that were shorter than 50 amino acids were discarded. Redundant identical sequences were removed with UniqHaplo (http://raven.iab.alaska.edu/~ ntakebay/) leaving only the most complete, unique sequences for each taxon. In cases where one of the first or last 20 characters of an amino acid sequence was an X (corresponding to a codon with an ambiguity, gap or missing data), all characters between the X and that end of the sequence were deleted and treated as missing data. Each orthology group was then aligned with MAFFT (mafft-auto-localpair-maxiterate 1000) 44 . Alignments were trimmed with Aliscore 45 and Alicut 46 to remove ambiguously aligned regions. Subsequently, any putatively mistranslated sequence regions were deleted; these were regions of 20 or fewer amino acids in length surrounded by ten or more gaps on either side. Next, alignments that were shorter than 50 amino acids in length were discarded. Last, we deleted sequences that did not overlap with all other sequences in the alignment by at least 20 amino acids, starting with the shortest sequences. Finally, orthology groups, which were sampled for fewer than 15 taxa after these filters, were discarded.
To screen putative orthology groups or evidence of paralogy, an 'approximatelymaximum likelihood' tree was inferred for each remaining alignment using FastTree 2 (refs 47, 48) with the 'slow' and 'gamma' options. PhyloTreePruner 48 was then used to generate a tree-based approach to screen each candidate orthology group for evidence of paralogy. Nodes with support values below 0.95 were collapsed into polytomies and the maximally inclusive subtree was selected where all taxa were represented by no more than one sequence or, in cases where more than one sequence was present for any taxon, all sequences from that taxon formed a monophyletic clade or were part of the same polytomy. Putative paralogues (sequences falling outside of this maximally inclusive subtree) were then deleted. In cases where multiple sequences from the same taxon formed a clade or were part of the same polytomy, all sequences except the longest were deleted.
Phylogenetic analysis was conducted using ML with RAxML 7.7.6 (ref. 49). Matrices were partitioned by gene and the PROTGAMMALG model was used for all partitions. For each analysis, the tree with the best likelihood score after 10 random addition sequence replicates was retained and topological robustness (that is, nodal support) was assessed with 100 replicates of nonparametric bootstrapping (the -f a command line option was used). Multiple sequential Markovian coalescent analysis. The multiple sequential Markovian coalescent analysis method 9 , using default parameters, was used to infer the historical effective population sizes of the OKI and GBR COTS. All paired-end reads were aligned to each soft-masked genome using Bowtie 2 (ref. 50). SAMtools(version 1.19) 36 was used to filter the unmapped reads and reads with minimum base and mapping quality scores of 20. Bcftools (version 1.19) 36 , of the SAMtools package, was then used to call genotype for each position. Real historical time and effective population size were estimated assuming a generation time of 3 years and a substitution mutation rate of 1.0 × 10 −8 per generation, which was based on an estimated genome size of 431 Mb 51

letter reSeArCH
To assess differences in protein domains across metazoan genomes, we examined protein domain expansion and contraction in each species, based on the total number of unique genes that each Pfam domain contained. We used the scaled value for each individual Pfam domain as a proxy for expansion, whereby any value greater than the mean was considered a domain expansion (Extended Data Fig. 3). Analysis of tissue transcriptomes. Seven and ten tissue transcriptomes were sequenced from GBR and OKI, respectively (Supplementary Notes 2 and 6). Trimmed reads from all transcriptomes were mapped to GBR and OKI gene models, and fragments per kb of transcript per million mapped reads (FPKM) were calculated for all genes in all transcriptomes. Relationships between tissue-type and geographic location (that is, GBR versus OKI) were determined using Euclidean distance and principle component analyses (PCAs) of FPKM values for all genes shared between the two genomes based on the LiftOver analysis using the R package, DESeq2 (ref. 54). On the basis of the PCAs, we selected the following eight tissue transcriptomes for further analysis: male and female radial nerves, tube foot (podia), spine, body wall, stomach, mouth and spent testes (Supplementary Note 6). The presented order of these tissues was derived from the Euclidean distance 54 of these transcriptomes on the basis of both expression and Pfam 52 similarity (Supplementary Note 6). In silico prediction of secreted and cleaved proteins. A. planci protein N-terminal signal sequences were predicted using SignalP 4.1 (ref. 55) (neural network and hidden Markov model algorithms), Predisi 56 and Phobius 57 , while transmembrane domains were determined using TMHMM 58 and HMMTOP 59 . A protein was designated as secreted only when it met the criteria of both SignalP and Predisi, and did not have a transmembrane domain. Proteolytic cleavage sites and posttranslational modifications (PTMs) were determined on the basis of homology to other known proteins and predicted with the Neuropred tool (http://neuropro teomics.scs.illinois.edu/neuropred.html) with a cleavage probability > 0.8. Collection of samples and exoproteome analysis. COTS aggregationconditioned seawater was produced from three adult COTS (300-350 mm diameter) that were kept in a single 60 l flow-through glass tank (780 × 380 mm) for 24 h. Water flow was stopped for 1 h before draining 51 l; the COTS remained in situ to minimise release of alarm-related chemistry. Conditioned water was acidified with 0.1% trifluoroacetic acid (TFA), then filtered through a 0.45-μ m PVDF membrane (Millipore) and absorbed onto a C18 Sep-Pak, 5 g sorbent per cartridge, 37-55 μ m particle size (Waters). Filter cartridges were washed with 100% methanol between samples to remove any carryover. Biomolecules were eluted with 70% acetonitrile:0.5% acetic acid, and then lyophilised and stored at − 20 °C. COTS alarm-conditioned seawater was produced when a single adult COTS (300-350 mm diameter)-kept in a single 60 l flow-through glass tank for 24 h before the water flow was stopped for 1 h-was then exposed to a giant triton (separated by a mesh divider) for 1 h. The giant triton was removed and 51 l was drained; the COTS remained in situ. Conditioned seawater was acidified (0.1% TFA), then biomolecules purified and lyophilised as described above. Procedures for collection of both treatments were repeated three times (n = 3). Reconstituted samples containing about 1 mg exoproteins (evaluated by NanoDrop 2000, Thermo Scientific) in 100 μ l extraction buffer (8 M urea, 0.8 M NH 4 HCO 3 , pH 8.0), were processed by reduction, alkylation, trypsin digestion, SCX-HPLC and then NanoHPLC-ESI-Triple Time-of-Flight mass spectrometry (see Supplementary Note 7 for details). Protein identification and quantification. The protein database used for MS/MS data analysis was derived from both GBR and OKI, (Supplementary Note 2, Supplementary Table 7.2a-n). A composite target− decoy database was built with the forward and reverse sequences for calculating the false discovery rate (FDR). MS/MS data were imported to the PEAKS studio (Bioinformatics Solutions Inc., version 7.0) with the assistance of the MS Data Converter (Beta 1.3, http://sciex. com/software-downloads-x2110). De novo sequencing of peptides, database search and characterization of unspecified PTMs were used to analyse the MS/MS data; the FDR was set to ≤ 1%, and (− 10logP) was calculated accordingly where P is the probability that an observed match is a random event. The PEAKS studio parameters are defined in Supplementary Note 7.
The quantitative analysis of proteins was carried out using the label-free quantification module (PEAKS Q 60 ) of PEAKS studio version 7.0, which is based on the relative intensities of featured peptides detected in multiple samples. The detection of features was separately performed on each sample and the expectationmaximisation algorithm 61,62 was used to identify overlapping features. Then, an alignment algorithm 63 was used to align the features of the same peptide from different samples. The extracted proteins in different replicate samples were quantified as described above; for each sample, 1.5 μ g of protein was analysed using LC-MS/MS. Biological triplicate samples of aggregation and alarm were used in tandem repeats for LC-MS/MS procedure as described above, and the relative concentrations of proteins were compared and presented as the final results. The mass shift between different runs was set to 50 p.p.m., and 1.0 min was used for evaluating the retention time shift tolerance. Featured peptides with a FDR threshold of 1%, including PTMs mentioned above, were included in the quantitative analysis. Validation of quantitative analysis was performed as described in Supplementary Note 7. Behavioural response of COTS to signals from starfish aggregations. Adult COTS were collected from various regions of the central GBR by the Australian Marine Parks Tourist Operators (AMPTO) Crown-of-Thorns Starfish Control Program and transported to the Australian Institute of Marine Science (AIMS) SeaSim aquarium precinct (www.aims.gov.au/seasim). COTS were housed in outdoor flow-through seawater aquaria at ambient conditions with temperatures of 26-29 °C and salinity averaging 35 p.p.t. When moved to indoor experimental systems, water temperature and photoperiod were simulated as per ambient natural outdoor conditions. COTS were not fed in captivity; therefore they were replaced fortnightly with freshly collected specimens to minimise behavioural changes owing to partial starvation.
Behavioural responses of starfish were examined in black fibreglass tanks (4.4 × 2.3 m) containing 6,070 l of seawater and a Y-maze (main channel 1.75 m long, channel width of 0.6 m, each arm 2.35 m long) to test for behavioural changes, such as motivation for, or direction of, movement. Seawater supply to each arm of the Y-maze was balanced to give a flow of 5 cm s −1 moving towards the main channel, and a 0.8 m divider at the base of the arms ensured no backflow from one arm into the other. One arm of the Y-maze was fed with ambient seawater directly from a pipe (control). The water to the second arm passed through a 250 l header tank (1 × 0.5 × 0.5 m) that was either empty (control) or contained six adult starfish that had been in place a minimum of 24 h before the experiment. The COTS in the header tank formed aggregations. At the start of the experiment, a test subject starfish was placed into the distal end of the main channel in a 'starter box' , which was 0.6 m 2 , and its movement recorded for up to 8 h on video. As COTS are primarily nocturnal, experiments were conducted at the end of the daylight illumination period and filmed during the nocturnal period. As the aggregating starfish were in an inaccessible header tank, the test subject could not visually detect or physically join the aggregation.
The tanks were illuminated with a bank of 850 nm infrared LED lamps (CMVision, Model IR-200LF/WF) filmed with an infrared acA1300-60gmNIR camera (Basler AG) fitted with a 4.4− 11 mm/F1.6 1/1.8′ ′ manual C-mount CCTV lens (Kowa Optical Products Ltd) and 850 nm cut-off filter (Helipan ES43). The infrared spectrum is beyond the detection range of starfish photoreception (425-580 nm) and therefore does not interfere with overlying photoperiod lighting 64,65 . The tanks were exposed to regional photoperiod changes (19.25° S, 146.8° E) with full sunlight spectrum plasma units (Luxim Model GRO-41-01, Luma America) with crepuscular twilight ramping. In this arrangement, only the reflection of infrared from the body of the starfish is detected by the infrared sensitive camera. Video footage was captured and analysed with Ethovision XT (http://www.noldus.com/animal-behaviorresearch/products/ethovision-xt). The experiments were conducted for n ≥ 10. The results were summed to determine the overall typical behaviour. Statistical analyses were performed using Ethovision XT and SPSS (version 20, IBM) 66 .
Motivation was determined if the test subject moved out of the original starter box. Changes in motivation were graphically represented in heat maps where the frequency of a specific position in a 2D space was visualized as a colour representing the minimum and maximum per-pixel frequency over the duration of the experiment. The spectrum variation was set from dark blue (minimum) to dark red (maximum); heat maps are primarily qualitative. Activity, how long and how frequent a subject has been active, was determined by the number of changed pixels for a current sample divided by the total number of pixels in the arena. Activity is not necessarily an indication of total distance moved, as anxiety movement will be detected as activity and such behaviour is typically triggered by a stimulus. A threshold of > 60% active time was imposed as a measure of 'highly active' . For example, test subject starfish were clearly agitated when exposed to seawater conditioned with aggregated COTS and their behaviour was indicative of searching for the source; even though starfish entered a particular Y-maze arm, few remained or settled within that arm but rather exhibited continued mobility. Analysis of EPDR genes. Potential COTS EPDR genes were identified from (1) transcriptomes via BLAST searches using partial exoprotein sequences, (2) the genome assembly via HMMER3.1 (ref. 67) searches using the ependymin pHMM (Pfam 52 accession PF00811.14), and (3) HMMER searches on ORFs exceeding 50 nucleotides extracted from genome scaffolds using the getorf tool from the EMBOSS package 6.5.7 (ref. 68). EPDR transcripts were then used as queries to identify the correct intron/exon architecture of the genes in the genome assemblies (schematic created using FancyGene 69 ). An alignment of the manually curated GBR EPDRs was created using AliView 70  a HMMER E-value greater than 1 × 10 −5 were found to align poorly and were removed). Sequence logos were created using WebLogo 71 .
EPDR pHMM searches were performed on predicted protein datasets from whole-genome data from select species and on ambulacrarian transcriptomes used in the phylogenomic analyses (see above) using an E-value cutoff of 1 × 10 -5 (Supplementary Note 8). Sequences were aligned and identical or very similar sequences within a species were removed (Supplementary Note 8). Maximum likelihood trees were performed as described above using RAxML 7.7.6, automatic model selection (VT) and 1,000 bootstrap replicates. Bayesian analysis was performed using MrBayes 3.2.6, with automatic (mixed) model selection (BLOSUM) and sampling every 10,000 generations until convergence (standard deviation of split frequencies < 0.01, 2.3 million generations). Topologies of the resulting phylogenetic trees were largely congruent. A sequence logo was created for each subclass of EPDRs using WebLogo 71 . Identification and analysis of GPCRs and olfactory receptor-like genes. Methods for GPCR identification followed the general approach of ref. 72. We screened the protein models of both OKI and GBR genomes, B. floridae, H. sapiens, S. kowalevskii, P. flava, and S. purpuratus using PFAM-scan.pl (ftp://ftp.sanger. ac.uk/pub/databases/Pfam/Tools/) against version 27 of the Pfam-A database. Sequences annotated by PFAM_scan.pl with domains in the GPCR_A Pfam clan (CL0192), and with at least 5 transmembrane regions according to HMMTOP 59 , were considered to be GPCRs and were further annotated with InterProScan 5.8-49.0 (ref. 73).
Many sequences were annotated as rhodopsin-like and therefore sequences annotated with PFAM 00001 were trimmed specifically to the region annotated as '7 transmembrane receptor (rhodopsin family)' by InterProScan and subsequently parsed into subfamilies using FastOrtho (http://enews.patricbrc.org/fastortho/), a modified version of OrthoMCL 74 with an inflation parameter of 1.5. This resulted in the identification of 957 groups of at least two GPCRs in the rhodopsin family (7tm_1) (Supplementary Note 9). The number of rhodopsin genes in each group for each species was visualized using Pretty Heatmap (https://cran.r-project.org/ web/packages/pheatmap) in R 53 .
Other GPCRs were similarly trimmed to the transmembrane receptor region for phylogenetic analysis. The annotations used for trimming each of these GPCRs were as follows: To identify putative olfactory-receptor-like genes in the COTS genomes, we followed the approach of refs 75 and 76 with modifications to incorporate the approaches of ref. 21 (Supplementary Note 9). We built 13 distinct pHMMs from previously curated olfactory-receptor repertoires comprising fishes (fugu, medaka, pufferfish, zebrafish and stickleback), amphioxus, sea urchin ('Specific rapidly expanded lineages of rhodopsin family' GPCRs (surreal GPCRs) groups 1-6) and manually curated olfactory receptors from Swiss-Prot. All non-redundant hits were retrieved from the combined results of all pHMM searches (Supplementary Note 9).
To distinguish olfactory receptors from the other 12 rhodopsin subfamilies (non-olfactory receptors), we conducted a BLASTP search (default settings) against a local database containing all class A or rhodopsin-like GPCRs from the Swiss-Prot database, followed by an all-against-all BLASTP comparison of COTS rhodopsinlike GPCRs. To determine if these COTS paralogue clusters of class A GPCRs are species-specific, and to resolve their relationship to other class A deuterostome GPCRs, we conducted a phylogenetic analysis. The dataset included class A rhodopsin-like GPCRs from S. purpuratus, which includes the surreal GPCRs, and two hemichordates (P. flava and S. kowalevskii), as well as olfactory receptors from fish (fugu, medaka, pufferfish, zebrafish and stickleback) and amphioxus. All sequences that contained 5 to 7 transmembrane helices were considered complete and were included in the phylogenetic analysis. The final dataset (2,615 sequences) was aligned using MAFFT version 7 with the FFT-NS-2 progressive method 77 and the alignment was manually trimmed to conserved blocks of transmembrane regions for phylogenetic tree reconstruction. The maximum likelihood phylogenetic tree was built using MEGA7 using a Poisson model with rate uniformity across sites 78 . Data availability. The Acanthaster planci genome sequence can be accessed at DDBJ and Bioproject (NCBI) as PRJDB3175, which links to the Sequence Read Archive for all genome raw and assembled scaffold (nucleotide) data for GBR and OKI, under BioSamples SAMD00020546 and SAMD00054104, respectively.