A large family of Dscam genes with tandemly arrayed 5′ cassettes in Chelicerata

Drosophila Dscam1 (Down Syndrome Cell Adhesion Molecules) and vertebrate clustered protocadherins (Pcdhs) are two classic examples of the extraordinary isoform diversity from a single genomic locus. Dscam1 encodes 38,016 distinct isoforms via mutually exclusive splicing in D. melanogaster, while the vertebrate clustered Pcdhs utilize alternative promoters to generate isoform diversity. Here we reveal a shortened Dscam gene family with tandemly arrayed 5′ cassettes in Chelicerata. These cassette repeats generally comprise two or four exons, corresponding to variable Immunoglobulin 7 (Ig7) or Ig7–8 domains of Drosophila Dscam1. Furthermore, extraordinary isoform diversity has been generated through a combination of alternating promoter and alternative splicing. These sDscams have a high sequence similarity with Drosophila Dscam1, and share striking organizational resemblance to the 5′ variable regions of vertebrate clustered Pcdhs. Hence, our findings have important implications for understanding the functional similarities between Drosophila Dscam1 and vertebrate Pcdhs, and may provide further mechanistic insights into the regulation of isoform diversity.

A lternative transcription and alternative splicing are two major means to expand the transcriptomic and proteomic repertoire from a single gene 1,2 . Drosophila Dscam1 (Down Syndrome Cell Adhesion Molecules) and vertebrate clustered protocadherins (Pcdhs) are two classic examples of the extraordinary protein isoform diversity that can arise from a single complex genomic locus in two phyla 3,4 . Dscam1 gene encodes 38,016 distinct isoforms via mutually exclusive alternative splicing of 4 arrays of tandem duplicated exons in D. melanogaster 3 . These Dscam1 isoforms are expressed stochastically and combinatorially, and exhibit isoform-specific homophilic binding [5][6][7][8][9][10] . These properties provide the molecular basis of Drosophila Dscam1 as a key molecule for self-avoidance, and genetic studies have indicated that thousands of Dscam1 isoforms are required for neuronal wiring and self-avoidance [8][9][10][11][12][13][14] .
In contrast to insect Dscam1, vertebrate Dscam genes do not generate extraordinary protein diversity 15 .
However, another set of genes, the clustered Pcdhs, might perform the analogous function in vertebrates [16][17][18] . Pcdhs are the largest subgroup of the cadherin superfamily of cell adhesion proteins and are abundantly expressed in the central nervous system. In the human, 52 Pcdh proteins are encoded by 3 tightly linked gene clusters called Pcdha, Pcdhb and Pcdhg, which are organized in a tandem array and on a single chromosome 4 . In these genes, each variable exon is preceded by a promoter, and Pcdh diversity is produced via differential promoter choice and cis-alternative splicing 19,20 . The Pcdh gene cluster encodes a large repertoire of cell surface recognition proteins, which can engage in specific homophilic interactions 21 . Functional experiments show that deletion of the mouse Pcdhg gene cluster could cause defective dendritic self-avoidance in retinal starburst amacrine cells or in Purkinje cells 22 . This observation suggests that clustered Pcdhs, similar to Drosophila Dscam1, may also mediate neurite self-avoidance by specifying single-cell identity [21][22][23][24] . Conversely, such vertebrate clustered Pcdh genes have not been identified in Drosophila 16 .
Given the striking molecular parallels between and complementary phylogenetic distribution of Dscam diversity in Drosophila and the clustered Pcdh diversity in vertebrates, it is attractive to speculate that they may have similar roles. These two phyla appear to have evolved a common molecular strategy for self-avoidance by recruiting different molecules 18 . Nevertheless, since there is a big evolutionary gap between insects and vertebrates, who shared a common ancestor more than 500 million years ago, how the evolutionary transitions and complementarities occurred remains unclear. Moreover, Drosophila Dscam1 generally generates tens of thousands of isoforms, while only 58 isoforms exist for clustered Pcdh genes in mice. This discrepancy in isoform diversity by at least 2 orders of magnitude is unlikely to be explained by the much higher common isoform tolerance for Pcdhs than is assumed for Dscam1 (ref. 18).
In this study, we identified a novel Dscam gene family (sDscam) in Chelicerata that contained tandemly arrayed 5 0 cassettes. The encoded proteins had a striking similarity to Drosophila Dscam1, but all lacked the canonical Immunoglobulin 1 (Ig1)-6, 10 and Fibronectin III (FNIII) 3-4, 6 domains present in classical DSCAM. The N-terminal domains of each sDscam protein are generally encoded by only one of a cluster of tandemly arrayed 5 0 cassettes. These 5 0 cassettes are generally comprised of two or four exons (sDscama and sDscamb), which correspond to variable Ig7 or Ig7-8 domains of Drosophila Dscam1. There was also high splicing complexity across variable 5 0 clusters, which expanded the isoform diversity via a combination of alternative promoter and splicing activities. Thus, Drosophila Dscam1 and Chelicerata sDscam represent examples of convergent evolution for isoform diversity. This genomic organization is remarkably similar to that of the clustered Pcdhs in vertebrates. Hence, our findings have important implications to aid in our understanding of the functional similarities between two structurally unrelated families of Drosophila Dscam and vertebrate Pcdhs, and may provide further insights into the regulatory mechanisms governing the selection of tandemly arrayed 5 0 variable regions.

Results
A novel shortened Dscam gene family in Mesobuthus martensii.
To trace the origins of duplicated exons of the Dscam genes in Arthropoda, the exons encoding the Ig7 orthologues of Drosophila Dscam1 in the M. martensii genome were analysed. These Ig-coding exons were tandemly arrayed across the gene body, similar to Drosophila Dscam1. Nevertheless, RNA-seq analyses and sequencing of 5 0 RACE (rapid-amplification of cDNA ends) products indicated that these transcripts shared no common upstream exons, and therefore, they might initiate immediately upstream of each variable exon (Fig. 1). Importantly, we believe this was located close to the transcription start sites for each variable exon, because a stop codon was generally located in the frame immediately upstream from the ATG initiation codon in each variable cassette ( Supplementary Fig. 1). Last, computerassisted and RNA-seq analyses revealed seven novel Dscam genes in M. martensii, which were characterized by tandemly arrayed 5 0 cassettes ( Fig. 1). Their encoding isoforms were similar to each other and to previously characterized Drosophila Dscam1, but all lacked the canonical Ig1-6,10 and FNIII 3-4, 6 domains present in classical DSCAM. We therefore designated these novel shortened Dscam genes as sDscam. Based on different units of tandemly arrayed 5 0 cassettes, these sDscams could be subdivided into two closely related subfamilies, sDscama and sDscamb (Fig. 1a,b). The former (sDscama) contained tandemly arrayed 5 0 cassettes with 2 exons. This tandem cassette encoded a single Ig domain, which corresponded to the Ig7 of Drosophila Dscam1 (Fig. 1a). Genome-wide analyses revealed the presence of only one member of the sDscama subfamily, which contained at least 40 tandem copies at the 5 0 variable regions.
The tandemly arrayed 5 0 cassette of another gene cluster subfamily (sDscamb) generally contained 4 exons (Fig. 1b). These tandem cassettes encoded 2 Ig repeats, which corresponded to the Ig7-8 domains of Drosophila Dscam1. This is similar to Ig7-8 arrays in Ixodes scapularis Dscam, albeit without the annotation of the first exons 25 . We identified up to 6 members (sDscamb1-sDscamb6) of the sDscamb subfamily, which contained 13, 8, 13, 9, 10 and 2 tandemly arrayed cassettes, respectively. In some cases, tandem cassettes could be made by the combination of different duplication units. Taken together, this unusual organization of the sDscam family potentiates the capacity to expand the transcript isoforms.
sDscam 5 0 clustered organization is conserved in Chelicerata. We examined whether this clustered organization of sDscam found in M. martensii was conserved at the 5 0 variable regions throughout Arthropoda. This analysis was expanded to include the Araneae Stegodyphus mimosarum, 2 Ixodoidean species (I. scapularis and Tetranychus urticae) and Merostomatan Limulus polyphemus. Together, these organisms comprise some of the major taxonomic groups of the Chelicerata subphylum that last shared a common ancestor B420 million years ago 26 . We identified the clustered organization at the 5 0 regions of sDscam in all species of the Arachnida class investigated, although the members of the tandemly arrayed 5 0 cassettes differed among species (Supplementary Fig. 2). This led us to believe that the 5 0 clustered organization of the sDscam family was evolutionarily conserved in Arachnida. Moreover, the sequence comparison revealed the 5 0 clustered organization of the sDscama and sDscamb subfamilies in Merostomatan L. polyphemus ( Supplementary  Fig. 2). However, a similar 5 0 clustered organization was not identified in any of the Dscam genes from the Mandibulata species of insect, Crustacea or Myriapoda classes, suggesting that it arose after radiation of Mandibulata and Chelicerata during the evolution of Arthropoda. Thus, we concluded that the 5 0 clustered organization of sDscam was Chelicerata-specific and conserved throughout Chelicerata evolution.
Origin and lineage-specific expansion of 5 0 clustered sDscam. How the 5 0 clustered organization of the sDscam gene arose was investigated next. Following a comprehensive comparative analysis of Dscam sequences from arthropod species ( Supplementary Fig. 3), it was speculated that the sDscam gene might have originated from the sequential shortening and expansion of the Ig and FNIII domains of canonical Dscam (Fig. 2, Supplementary Fig. 4a) Variable (13) Variable (10) Variable (8) Variable (9) Variable (  the fact that Dscam genes lacking the FNIII3-4 and Ig10 domains are present in all Chelicerata species investigated ( Supplementary  Fig. 3). The further loss of the FNIII domain proximal to the transmembrane domain was followed later by the loss of the coding region encoding the N-terminal Ig1-6 domains ( Fig. 2; Supplementary Fig. 4a). Eventually, a shortened Dscam evolved in the ancestral gene. Second, this shortening was followed later by 5 0 segmental duplication to create two or multiple tandemly arrayed cassettes. The duplication unit may include both exons 1-2 encoding an Ig domain or exons 1-4 encoding two Ig domains and their promoters (green or blue dashed box, Fig. 2; Supplementary Fig. 4a). Moreover, phylogenetic analysis indicated that these clustered cassettes were more similar to each other than to the variable cassettes from other species (Supplementary Figs 5 and 6), suggesting that the variable cassettes were expanded in a species-specific manner. Notably, the genome analysis indicated that most sDscam genes tended to be clustered in Chelicerata (Supplementary Fig. 4b). For example, three sDscam genes clustered in the T. urticae genome, of which sDscamb2 and sDscamb3 were only 4 kb apart and in the same orientation. These findings strongly suggest that sDscam gene clusters result from lineage-specific duplications. Together, these results demonstrate that 5 0 cassette tandem duplication, combined with gene duplication, jointly shaped the large lineagespecific repertoire of sDscam isoforms in Chelicerata.
Expression patterns of sDscam variable cassettes. To determine the expression profiles of the variable cassettes in M. martensii sDscams, paired-end sequencing of poly(A)-tailed transcripts was performed on five dissected adult tissue samples, including the cephalothorax, abdomen, muscles, haemocytes and poison glands. RNA-seq reads were mapped to the genome sequence of sDscams as described above. Based on the RNA-seq data of constitutive exons, the sDscama and sDscamb1-6 transcripts were differentially expressed (Fig. 3a). The sDscama and sDscamb1-6 transcripts were expressed at much higher levels in the cephalothorax than in the abdomen, muscles and haemocytes ( Fig. 3a; Supplementary Fig. 7a). This is largely consistent with previous studies in which Dscams were highly expressed in neural tissues 13,27 . Notably, sDscamb3, sDscamb5 and sDscamb6 transcripts were expressed at maximum levels in the poison glands. It would be of interest to know whether the sDscam isoform diversity contributes to immune protection, as previously reported for Dscam1 isoforms in insects 27 . Transcriptional signals were detected for almost all of the 5 0 variable exons of sDscama and the six sDscamb genes in at least one of the tissues of M. martensii (Fig. 3b,c; Supplementary Fig. 7b,c). For each sDscam gene, the relative abundance of isoforms differed markedly among the variable exons. For example, the most abundant 10 sDscama isoforms accounted for 54.7% and 52.5% of all reads from the cephalothorax and abdomen, respectively (Fig. 3b,c). Interestingly, the variable cassettes most distal to the constitutive exons tended to occur less frequently in all tissues for all sDscams, except for sDscamb4. In sDscamb2-3 and sDscamb5-6, the inclusion frequency of a variable exon largely correlated with its proximity to the first constitutive exon (Supplementary Fig. 8a-d).
Several significant differences existed in the expression profiles of various sDscam variable cassettes in different tissues. The 5 0 variable exon usage in sDscamb1-5 showed moderate to dramatic changes in different tissues, whereas differences in the sDscama cassettes were relatively modest (Fig. 3b,c). Most of the 5 0 variable lg1 lg2 lg3 lg4 lg5 lg6 lg7 lg8 lg9 lg10 lg1 lg2 lg3 lg4 lg5 lg6 lg7 lg8 lg9 lg1 lg2 lg3 lg4 lg5 lg6 lg7 lg8 lg9 exons of sDscama were expressed in the cephalothorax, abdomen, haemocytes and poison glands. Nonetheless, only a subset was lowly expressed in the muscles (Fig. 3b). Similarly, most of the 5 0 variable exons of sDscamb1-6 could be detected in the cephalothorax, abdomen and poison glands, while only a subset was expressed in the haemocytes and muscles. Variable cassette 4 of sDscamb1 was abundantly expressed in the cephalothorax, but was barely detectable in the abdomen ( Fig. 3c; Supplementary  Fig. 7c,d). sDscamb3 variable cassette 11 was abundantly expressed in the poison gland, but was barely detectable in other tissues ( Fig. 3c; Supplementary Fig. 7c,d). These data indicate that the selection of 5 0 variable exons of sDscama and sDscamb is differentially regulated in different tissues.
Variable cassettes are preceded by promoters. To clarify the mechanisms by which isoforms were generated and regulated from a single sDscam gene locus, it was ascertained whether the sDscam genes applied a similar strategy to that in vertebrate Pcdhs, with the alternative use of a separate promoter upstream of each first exon of a variable region 19,20 . In Pcdhs, each first exon is preceded by a promoter and produces a transcript in which the first exon is spliced to common exons. To determine whether each sDscam variable cassette has its own promoter, sequences immediately upstream of the transcription start site of each variable region in sDscama and six sDscamb genes were examined. A rich array of potential promoter elements (PPEs) was predicted to be located upstream of the 5 0 end of each variable region ( Fig. 4a; Supplementary Fig. 9). Therefore our data suggest that each variable cassette is generally preceded by a given promoter.
Next, we firstly validated the promoter activity of sDscamb6, which contains only two tandemly arrayed variable cassettes.   Alternative exon 2 was selected to calculate the level of expression. (c) The relative frequency of the variable exon clusters of sDscamb1-6. Variable cassette 4 of sDscamb1 was abundantly expressed in the cephalothorax (shown as the black arrow), but was barely detectable in other tissues. sDscamb3 variable cassette 11 was abundantly expressed in the poison gland (shown as the blue arrow), but was barely detectable in other tissues. The 25-nt fragmented RNA-seq data sets were mapped to calculate the relative expression level. These results based on 25-nucleotide (nt) mapping were consistent with those based on 50-nt mapping, except for some very lowly expressed tissues ( Supplementary Fig. 7).
To this end, a B1.0-2 kb DNA fragment preceding the variable V1 and V2 cassettes was fused to luciferase in an expression vector. As shown in Fig. 4b, both constructs displayed significant promoter activity in transient transfection reporter assays in Drosophila S2 cells. This indicates that these predictable promoter sequences are sufficient to direct the reporter expression of heterologous cells. To determine the minimal DNA sequence requirements for promoter activity, a series of deletion constructs was tested. Promoter function was not significantly diminished by truncations to B300 bp (Fig. 4b). Moreover, promoter activity was only partially reduced by disruption of a given PPE, suggesting that it resulted from the combinatorial interaction of multiple PPEs, including those beyond the prediction capabilities of the program, which was based on distantly related species. Together, these results indicate that the transcription of individual variable cassettes is under the control of a distinct promoter upstream of each variable exon.
High splicing complexity across the 5 0 variable regions. Inconsistent with the presence of a large first exon in the clustered Pcdh gene 4 , a cassette repeat composed of two or four exons was identified in the clustered sDscam gene. This raised the question of how these variable exons were combined into distinct mRNA isoforms, particularly because the exclusion or multiple inclusions of exons 2, 3 or 4 variants would not result in a frameshift.
To explore this, we defined exon junctions based on a total of 0.7 billion RNA-seq reads from different tissues. At least 264 distinct exon junctions were detected, 249 of which were joined neighbouring junctions in single tandem cassettes. This suggests that most isoforms could be made through joining neighbouring junctions in variable cassette regions. Moreover, we detected a small fraction of isoforms from the same cassette with either exon 2, 3 and/or 4 skipped. In these cases, the variable exon skipping resulted in an incomplete Ig domain (that is, the sDscamb6 variable exon 3.1) (Fig. 5a). This abnormal splicing is analogous   ARTICLE to the skipping of Dscam exon 4 variants, which results in a partial Ig2 domain and is likely to be biologically relevant 28 . In addition, we detected other non-canonical splicing isoforms that contained variable exons from different tandem cassettes, as well as the isoforms containing within-cassette introns (Fig. 5a). Based on the exon junctions from the RNA-seq data, we estimated that B10-40% of isoforms resulted from non-canonical splicing in most sDscam genes, which showed differential expression in various tissues (Fig. 5a,b; Supplementary Fig. 10a,b). Taken together, these data indicate that sDscams have potentially complex splicing patterns at the 5 0 variable regions.
Given the low expression of a considerable number of sDscam variable exons, we systematically examined the possible exon combinations derived from different tandem cassettes using a nested reverse transcription-PCR (RT-PCR) approach. Several unexpected types of splice isoforms were detected. One type of isoform was produced by combining exons from different tandem cassettes, which encoded 2 Ig domains identical to the canonical isoform from a single cassette. For example, sDscamb1 exon 2.1 could be spliced with the downstream variable exon 3.2, while variable exon 3.13 could be spliced with the upstream variable exon 2.10 (Fig. 5c,d). Surprisingly, sDscamb1 variable exon 3.13 could be spliced with the upstream variable exons 4.5 and 4.6, and the resulting variable region of the mRNA isoform encoded 3 Ig repeats (Fig. 5c,d). Moreover, sDscamb3 variable exon 4.10 could be spliced with the downstream variable exon 2.11, and the resulting variable region of the mRNA isoform encoded 4 Ig repeats. Furthermore, other distinct types of variable 3 0 isoforms were detected (Fig. 5e,f). Similar results were obtained for other sDscama and sDscamb genes ( Supplementary Fig. 11). Together, these results show that the multi-exon repeat architecture of sDscams can increase not only Ig sequence diversity but also Ig number plasticity (Fig. 5g).  Intron retention was much more abundant in the abdomen than in the cephalothorax. The 25-nt fragmented RNA-seq data sets were mapped to calculate the intron retention rate. Because of the low expression of the V5 and V6 isoform in the muscles, haemocytes and poison glands (Fig. 3c), the images of these RNA-seq reads are not shown. (c) RT-PCR analysis of V5 and V6 isoform expression. (d) Schematic diagrams of expression of sDscamb2 isoforms. Different types of splice isoforms are indicated by the symbol "I, II, III, IV". (e) RT-PCR was used to detect isoform expression. These experiments revealed the splicing of multiple adjacent cassette variants. Due to the low expression of sDscam variable exons, nested PCR was necessary to amplify the products; only the primers used in the second PCR are depicted. The PCR products were confirmed by cloning and sequencing.

Cap
(B200 kb in the variable regions) and complexity of the clustered Pcdhs. Surprisingly, we found that abundant intron sequences immediately downstream of the last variable exon of each cassette were frequently retained in the RNA-seq data, while cassettes within introns were exclusively spliced out (that is, sDscamb1 V5, Fig. 6a,b). Interestingly, the extent of this retention differed in different tissues ( Fig. 6b; Supplementary Fig. 10b). The frequent occurrence of this unusual intronic retention might be a result of the splicing of the variable exons immediately downstream of the cap-proximal cassette to the constant exon (type II; Fig. 6a). Taken together, we propose that not only the cap-proximal, but also the downstream variable exons spliced to the constant exon.
Next, a more sensitive assay was designed that used primers in exons 1.5 and 4.6 to validate the findings above (Fig. 6a). It was hypothesized that if the downstream variable cassette 6 (V6) could be spliced into the constant when sDscamb1 was transcribed under the control of the V5 promoter, then one mRNA isoform should be produced containing the two neighbouring variable cassettes (V5 and V6) without a withincassette intron, but with the between-cassette sequence (type II, Fig. 6a). The presence of this mRNA isoform was confirmed by RT-PCR and sequencing (Fig. 6c). A similar mRNA isoform was detected in sDscamb2, although the partial interval sequences between the two neighbouring variable cassettes had been spliced out (Fig. 6d,e). Similar mRNA isoforms were observed in other sDscamb genes (Fig. 5e; Supplementary Fig. 11c-h). Taken together, these observations strongly support our hypothesis that not only the cap-proximal, but also the downstream variable cassettes could splice to the constant exon. This also suggests that the expression of 5 0 variable cassettes is not only associated with specific promoter activity, but also with post-transcriptional alternative splicing.

Discussion
This study identified a novel shortened Dscam gene family with tandemly arrayed 5 0 cassettes in Chelicerata. These sDscams had a high sequence similarity to the 3 0 region of Drosophila Dscam1, but shared striking organizational resemblance to the 5 0 variable region of vertebrate clustered Pcdhs. Moreover, sDscam gene family members tended to be arranged in tandem clusters, much like the vertebrate clustered Pcdh genes 4 . Finally, sDscams generally contained separate promoters upstream of each first exon of the variable cassette, as occurs in vertebrate Pcdhs 19,20 . Hence, our findings have important implications for understanding the functional similarities between Drosophila Dscam1 and vertebrate Pcdhs.
Compared with the large exons in clustered Pcdh genes, Chelicerata sDscam genes were composed of two to four exons. This tandem multi-exon organization not only expanded the diversity of amino acid sequences, but also enabled Ig structural plasticity. In Chelicerata sDscams, additional alternative splicing methods might be employed to expand isoform diversity (Fig. 5). For example, additional isoform diversity could be generated  Fig. 1 and Supplementary Fig. 2. The number of copies is shown in parentheses.
through mutually exclusive splicing of within-cassette duplicated exons (that is, sDscamb1 V7; Fig. 5c). Notably, additional sequence and structural diversity could potentially be generated through combining exons from different tandem cassettes. Thus, clustered sDscams could potentially achieve much more isoform diversity than the clustered Pcdh gene. It is very likely that this more complex organization provides a genetic mechanism for generating higher numbers and additional types of isoforms required for the diverse functions and adaptations in Chelicerata. Phylogenetic analysis of Arthropoda Dscam genes revealed that Chelicerata sDscam and Drosophila Dscam1 were classified into different clades ( Supplementary Fig. 3), suggesting that they may have converged on the common protein domain diversity from independent origins. Notably, duplication of the Ig7-encoding exon 9 or its orthologues occurred internally or 5 0 terminally in all Arthropoda species investigated. This suggests that the diversity of Dscam1 Ig7 or its orthologues conferred intrinsic structural and regulatory benefits during Arthropoda evolution. Recent studies indicated that Ig7 domain diversity was crucial for the proper function of Dscam1 (refs 6,8,10,12-14). Dscam1 generates functionally distinct isoforms through mutually exclusive splicing of internal exons in Drosophila (Fig. 7). However, no Chelicerata Dscam genes appeared to have a similar arrangement, although a random array of only two alternatives for the Dscam1 exon 9 orthologue are often observed in Chelicerata (that is, sDscamb1 V7). In contrast, sDscam genes have evolved other mechanisms that serve this function in Chelicerata, through a combination of alternative promoter use and alternative splicing (Fig. 7). In this scenario, Drosophila Dscam1 and Chelicerata sDscam represent examples of convergent evolution for isoform diversity.
It is noteworthy that, compared with Drosophila Dscam1 and other Dscam proteins from metazoans containing 10 Ig and 6 FNIII extracellular repeats, a single transmembrane segment and a cytoplasmic tail 15 , the Chelicerata sDscams reported in this study lacked the N-terminal Ig1-6,10 domains and FNIII3-4, 6 domains present in classical DSCAM. In fact, the Ig domains differed markedly across the immunoglobulin superfamily (IgSF) proteins, ranging from 2 to 10, but with mostly 4 to 5 repeats 29 . Hence, we speculate that such shortened isoforms have important functions. Because Chelicerata sDscams share a striking similarity with Drosophila Dscam1, and there was a remarkable organizational resemblance to the vertebrate clustered Pcdhs, with the latter two proteins both able to mediate self-recognition and self-avoidance, it is reasonable to speculate that Chelicerata sDscams have analogous roles in the nervous system.
Our results indicated that not only the cap-proximal but also the downstream variable cassettes spliced to the constant exon. Based on this evidence, we propose a mechanistic framework for the selection of tandemly arrayed 5 0 variable exons (Fig. 8). This extends and revises a previously proposed model for the mechanism governing the selection of tandemly arrayed 5 0 variable regions 4,19,20 . Interestingly, intron sequences downstream of the variable region exons of Pcdhs were frequently contained in complementary DNA (cDNA) in independently derived cDNA libraries, which were previously assumed to be truncated mRNA isoforms or correspond to trans-splicing precursors 4 . Considering the similarity of the 5 0 gene structure of Chelicerata sDscams and vertebrate Pcdhs, we speculate that these unusual intron-containing cDNAs might be a consequence of the variable exons downstream of the cap-proximal exons spliced to the constant exon in vertebrate Pcdh genes. Therefore, our mechanistic framework might be broadly applicable to tandemly arrayed 5 0 variable exons in invertebrates and vertebrates.
The selection of tandemly arrayed 5 0 cassettes was highly regulated by a variety of mechanisms at both the transcriptional and post-transcriptional levels. Previous studies indicated that expression of the corresponding Pcdh mRNA might correlate with specific promoter activity 19,20 . Because sDscam was under the control of a distinct promoter upstream of each variable cassette, Chelicerata sDscams should be regulated by a similar mechanism. Second, the 5 0 splice site strength might have an effect on the selection of the variable exon. In general, the variant inclusion largely correlated with the strength of the 5 0 splice site, but decreased with distance from the 3 0 splice site of the first constitutive exon 30 . Based on the correlation of the inclusion frequency of a variable exon with its proximity to the first constitutive exon in sDscamb2-3 and sDscamb5-6, it seems that distance had some effect on the inclusion, at least for some genes. This was possibly due to higher levels of pre-mRNA for the proximal exons of the first constitutive exon present after transcription under multiple promoters. Finally, the selection of variable cassettes could easily be overridden in a developmentalor tissue-specific manner by the expression of specific activatorand repressor-binding proteins. Thus, the outcome of the variable exon results from multiple mechanisms acting in an overlapping manner.

Methods
Annotation and identification of Dscams. The sequences of the Dscam genes from the Scorpione M. martensii, the Araneae S. mimosarum, the Ixodoidean I. scapularis and T. urticae, and the Merostomatan L. polyphemus have been annotated through BLAST searches, using the annotated Dscam sequence of the most closely related organism and confirmed by available genome annotation and phylogenetic analysis (http://blast.ncbi.nlm.nih.gov/Blast.cgi; http://flybase.org/ blast/, Supplementary Table 1). Gaps in the Dscam sequences for M. martensii were closed by PCR and sequencing. Genomic DNA was isolated from M. martensii (a gift from Zhijian Cao) using a QIAamp DNA Kit (Qiagen, Hilden, Germany). PCR was performed using primers designed against genomic sequences. Amplification products were cloned into the pGEM-T Easy Vector (Promega, Madison, WI, USA) for sequencing. Primer sequences are available on request. All Dscam homologues were analysed by classifying into families and predicting domains with InterPro 31 (http://www.ebi.ac.uk/interpro/).  Fig. 1. Each variable cassette was preceded by a promoter. The expression of the specific combination of sDscam isoforms was achieved by alternative promoter activation, followed by alternative splicing. When sDscamb was transcribed by a given promoter preceding a variable cassette (V1), both V1 and the downstream variable cassettes (V2, V3) could be spliced into the constant exon 5.
RNA-seq. Five tissues (cephalothorax, abdomen, poison gland, haemocyte and muscle) from an M. martensii adult and the whole body of a L. polyphemus adult were collected for RNA preparation. RNA library construction and paired-end RNA-seq were performed by LC Sciences (Houston, TX, USA). Briefly, total RNA was extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer's instructions. The total RNA quantity and purity were analysed using a Bioanalyser 2100 and RNA 6000 Nano LabChip Kit (Agilent, Santa Clara, CA, USA) with RNA integrity number 47.0. For the RNA-seq experiment, B10 mg of total RNA was subjected to enrichment of the poly(A)-tailed mRNAs with poly(T) oligo-attached magnetic beads (Thermo Fisher Scientific, Waltham, MA, USA). After purification, the mRNA was fragmented into small pieces using divalent cations under elevated temperature. Then the cleaved RNA fragments were reverse transcribed to produce the final cDNA library according to the instructions in the mRNA-seq sample preparation kit (Illumina, San Diego, CA, USA). The paired-end RNA-seq was performed on the Illumina Hiseq 2500 platform (Illumina) following the vendor's recommended protocols.
Analysis of RNA-seq data. The RNA-seq reads were de novo assembled to obtain transcripts of M. martensii and L. polyphemus using Trinity 32 (https://github.com/ trinityrnaseq/trinityrnaseq/wiki) with the default parameters. Transcripts sharing high sequence similarity were assigned to a cluster based on the default parameter settings of Trinity. For a cluster, the longest transcript was designated as the unigene of the cluster. The unigenes were functionally annotated based on sequence similarity at the protein level. Specifically, by using BLASTX (E-valueo0.00001), the protein sequences translated from the unigenes were searched against the protein databases, including the NCBI non-redundant protein database, SwissProt, Kyoto Encyclopaedia of Genes and Genomes (KEGG) and Clusters of Orthologous Groups (COG) of proteins. The ends most 5 0 of the sDscam unigenes were analysed for their potential transcription start sites, some of which were further verified by 5 0 RACE. Tophat 33 (http://ccb.jhu.edu/software/tophat/index.shtml) was used for RNA-seq mapping, the results of which were visualized using integrative genomics viewer (IGV) 34 (http://www.broadinstitute.org/igv/). Considering the similarity among exon duplicates, the RNA-seq reads were split into 25-and 50-nucleotide (nt) fragments, which were mapped to calculate the expression levels of variable exons. Furthermore, to eliminate influences on calculations of the expression levels from identical sequence regions among exon duplicates, the 25-and 50-nt fragments with multiple loci were correctly allocated by referring to the mapping results of the full-length RNA-seq data sets. The correlation coefficient was calculated between the 25-and 50-nt mapping results. Similarly, to analyse the intron retention rate, both the 25-and 50-nt fragmented RNA-seq data sets were utilized to calculate the expression levels of the exon and neighbouring intron.
An in-house computational program was developed to search for sequencing evidence supporting the exon-exon junctions. First, exonic sequences covering all of the possible junctions between the variable exons were created. We used 10 positions from each exon in a pair to assign a given read to an exon-exon junction. For example, the 230-nt exonic sequences included 115-nt upstream and 115-nt downstream of the junction for 125-nt RNA-seq reads. Second, all of the RNA-seq reads were mapped onto the exonic sequences created above, and the perfectly mapped RNA-seq reads covering exon-exon junctions were retained. A similar analysis was performed for 25 positions from each exon in a pair to determine the correlation with the results based on 10 positions. In addition, a similar method was used to analyse the exon-intron junction of the isoforms containing withincassette introns.
RT-PCR. Total RNA was isolated using an RNeasy Mini Kit (Qiagen). Total RNA was reverse transcribed using SuperScript III RT (Invitrogen) with oligo(dT)15 primer, and the resulting single-stranded cDNA product was treated with DNase I at 37°C for 30 min. The PCR was implemented with an initial denaturing at 95°C for 3 min, followed by 35 cycles of denaturing at 95°C for 45 s, annealing at 55°C for 50 s, and extension at 72°C for 2 min and 10 s, followed by a final extension at 72°C for 10 min. The products of the PCR or the RT-PCR were purified and cloned into the pGEM-T Easy Vector and transformed into JM109 competent cells. Sequencing of individual clones was carried out using an automatic DNA sequencer. In some cases, nested PCR was necessary to amplify the products. Primer sequences are listed in Supplementary Table 2.
Phylogenetic analysis. The alignment of specific regions between species was performed using the ClustalW2 program (http://www.ebi.ac.uk/clustalw/ index.html). Full-length variable region coding sequences were translated, and the resulting polypeptides were aligned. The genetic distances for each gene were estimated with MEGA 6.0 (ref. 35).
5 0 RACE analysis. The 5 0 RACE analysis was performed according to the 5 0 RACE Kit (Invitrogen) protocol and using the reagents from the kit. Total RNA was extracted from adult M. martensii cephalothoraxes using TRIzol Reagent (Life Technologies, Carlsbad, CA, USA). The RNA was subjected to reverse transcription using SuperScript II at 42°C for 50 min, and incubated at 70°C for 15 min to terminate the reaction. RT-PCR was carried out under the following cycling conditions: an initial denaturation of 2 min at 94°C followed by 30-35 cycles of denaturation at 94°C for 30 s, annealing at 55-60°C for 30 s and extension at 72°C for 30 s, with a final extension at 72°C for 10 min.