Main

Cetaceans include whales, dolphins and porpoises. They are placed phylogenetically in Cetartiodactyla, the clade that includes Cetacea and Artiodactyla (even-toed ungulates such as the hippopotamus, cow and pig)1. Whales and modern terrestrial artiodactyls are related to Indohyus (an extinct semiaquatic deer-like ungulate), from which they are known to have split 54 million years ago2. The adaptations of cetaceans to the physiological stresses of underwater life, along with their unique morphology, make them of particular biological interest. The minke whale is the most abundant baleen whale and is classified into two species: the common minke whale (Balaenoptera acutorostrata) and the Antarctic minke whale (Balaenoptera bonaerensis)3. The wide geographical distribution of the minke whale makes it an ideal candidate for whole-genome sequencing. In addition to a low-coverage (2.59×) assembly of the bottlenose dolphin (Tursiops truncatus) genome4,5,6, there are now several sequenced cetaceans that can be used as resources for evolutionary and population-management studies. We report the de novo assembly of the common minke whale genome and a comparative analysis of additional genomic sequences (30× depth, aligned to the reference genomes but not assembled) of three minke whales, a fin whale (Balaenoptera physalus), a bottlenose dolphin and a finless porpoise (Neophocaena phocaenoides) (Supplementary Tables 1 and 2).

We extracted DNA from male minke whale muscle and sequenced it to a 128× average depth of coverage using the Illumina HiSeq 2000 platform (Supplementary Tables 3–5). Raw reads were assembled into 104,325 scaffolds totaling 2.44 Gb in length (Supplementary Figs. 1–4 and Supplementary Tables 6–9). We assessed the quality of the assembly by aligning the assembled minke whale transcripts onto the scaffolds (>98% coverage) and by using a core eukaryotic gene mapping method7 (>98.6% conserved genes) (Supplementary Tables 10–13). Additionally, we validated heterozygous single-nucleotide variants (SNVs) by Sanger sequencing (Supplementary Fig. 5 and Supplementary Tables 14 and 15). We identified all four analyzed minke whales as North Pacific minke whales (B. acutorostrata scammoni) by mapping their raw reads to a previously published mitochondrial genome8 (Supplementary Figs. 6 and 7). Minke whales have 21 pairs of autosomes and a pair of sex chromosomes (2n = 44), which is common in cetaceans9. We identified eight scaffolds that represent a small fraction of the sex chromosomes (Supplementary Table 16).

We found that the minke whale genome contains 20,605 genes (Supplementary Tables 17–19) and 2,598 noncoding RNAs (Supplementary Table 20). Repetitive elements occupy 37.3% of the whole genome (Supplementary Figs. 8 and 9 and Supplementary Tables 21–24). We confirmed genes on the basis of the transcriptomes of eight organs sequenced from an additionally acquired minke whale sample (Supplementary Table 10). We constructed orthologous gene clusters using eight mammalian genomes (Supplementary Table 25). We found that the minke whale genome contains 12,675 orthologous gene families, excluding singletons, 9,848 of which are shared by all four artiodactyl genomes (minke whale, bottlenose dolphin, cow and pig). Of these gene families, 494 are specific to the minke whale (Fig. 1; estimates of divergence time are shown in Supplementary Figs. 10 and 11). Additionally, we estimated segmental duplication (33.4 Mb) and genomic synteny (30–45%) in the minke whale genome (Supplementary Figs. 12 and 13, Supplementary Tables 26–29 and Supplementary Note). An inspection of the gene families showed that olfactory, rhodopsin-like G protein–coupled receptor and mammalian taste receptor domains were markedly under-represented in the whales compared to in the cow and pig (Supplementary Tables 30 and 31). We further analyzed the genome to identify all olfactory receptor genes and found that the number of these genes is much lower in whales than in other mammals (Supplementary Figs. 14 and 15, Supplementary Tables 32 and 33 and Supplementary Note).

Figure 1: Orthologous gene clusters in the artiodactyl lineage.

Shown is a Venn diagram of unique and shared gene families in the minke whale, bottlenose dolphin, cow and pig genomes. The total numbers of gene families are given in parentheses.

We investigated the genotypes underlying the marine adaptations of the whale lineage by analyzing the expansion or contraction of gene families, species-specific amino acid changes and positively selected genes (PSGs). We found that the minke whale genome contains 1,156 expanded and 2,048 contracted gene families (Fig. 2a and Supplementary Tables 34 and 35). Compared with other non-whale mammals, the whale lineage contains a total of 4,773 genes with unique amino acid changes (fixed in the four minke whales and two bottlenose dolphins), and 574 genes had minke whale–specific amino acid changes (fixed only in the four minke whales). Of the 4,773 genes, 695 encoded function-altering amino acid changes that were specific to the whale lineage (Supplementary Table 36). We identified PSGs, on the basis of dN/dS ratios (nonsynonymous substitutions per nonsynonymous site to synonymous substitutions per synonymous site), by comparing the whale genomes with those of cow and pig using the branch-site likelihood ratio test10. We identified 279 and 557 PSGs in the minke whale and bottlenose dolphin, respectively, whereas 64 PSGs were present in both (Supplementary Tables 37–43). Additionally, we identified rapidly evolving gene ontology (GO) categories11 in the minke whale and bottlenose dolphin (Supplementary Tables 44 and 45), as well as copy number variations in the fin whale and finless porpoise (Supplementary Tables 46–48 and Supplementary Note).

Figure 2: Relationship of the minke whale to other mammalian species.

(a) Gene family expansion or contraction. The numbers indicate the number of gene families that have expanded (orange) or contracted (blue) since the split from a common ancestor. MYA, million years ago; MRCA, most recent common ancestor. Timelines indicate the divergence times among species. (b) The expanded peroxiredoxin (PRDX) gene family in the whale lineage.

Notably, a number of whale-specific genes were strongly associated with stress resistance. The peroxiredoxin (PRDX) family, which has an important role in eliminating peroxides and in redox signaling generated during metabolism12,13, was markedly expanded (GO:0051920, P = 0.000030, Fisher's exact test, seven genes; Supplementary Table 34), meaning there was an increase in gene number in the whale lineage (Fig. 2b and Supplementary Tables 49 and 50). PRDX1 was expanded in the minke whale (five copies) and bottlenose dolphin (two copies). The fin whale and finless porpoise also had expanded PRDX1 homolog genes. Furthermore, PRDX3 was expanded in the two baleen whales (two copies), and PRDX4 was positively selected in the minke whale and bottlenose dolphin.

The level of O-linked N-acetylglucosaminylation (O-GlcNAcylation) in numerous nucleocytoplasmic proteins is known to increase in response to multiple forms of cellular stress, such as hypoxia, oxidative stress and osmotic stress14,15,16. O-GlcNAc transferase (encoded by OGT), the enzyme that can catalyze the addition of a single N-acetylglucosamine to a serine or threonine residue through an O-glycosidic linkage, was expanded in the minke whale (3 copies) and in the bottlenose dolphin (11 copies), whereas the cow and pig had only 1 copy each (Supplementary Fig. 16 and Supplementary Tables 49 and 50). The fin whale and finless porpoise also had expanded OGT homolog genes.

Perhaps the most marked environmental adaptation for a whale is deep diving, which induces hypoxia. Under hypoxic conditions, reactive oxygen species (ROS) are generated by several cellular mechanisms17,18. Glutathione is a well-known antioxidant that prevents damage to important cellular components by ROS19. Seven glutathione metabolism pathway genes (GPX2, ODC1, GSR, GGT6, GGT7, GCLC and ANPEP) showed cetacean-specific amino acid changes; these changes were present in the four minke whales, a fin whale, two bottlenose dolphins and a porpoise (Fig. 3a and Supplementary Figs. 17–23). GSR in the glutathione metabolism pathway was also positively selected in the dolphin. It is known that the increased expression of GSR increases the antioxidant capacity of cells20. Furthermore, functional categories, such as antioxidant activity (GO:0016209, P = 0.010, 13 genes) and oxidoreductase activity (GO:0016491, P = 0.00000035, 162 genes), were enriched in the minke whale genome (Supplementary Table 34). These signatures likely reflect adaptation to increased diving duration, as these genes can combat the damaging effects of hypoxia-induced ROS. To test this hypothesis, we measured glutathione levels experimentally. Cultured kidney cells from the Atlantic spotted dolphin (Stenella frontalis) showed an increased ratio of reduced glutathione to glutathione disulfide when subjected to hypoxic or oxidative stress (Supplementary Fig. 24 and Supplementary Note).

Figure 3: Cetacean-specific amino acid changes in glutathione metabolism–associated genes and haptoglobin.

(a) A positively selected gene (GSR) in the bottlenose dolphin is shown in a red rectangle. Genes with cetacean-specific amino acid changes (GSR, GPX2, GGT6, GGT7, ANPEP, ODC1 and GCLC) are shown in blue rectangles. The seven cetacean-specific genes are involved in glutathione metabolism pathways (KEGG pathway map00480). The solid lines indicate direct relationships between enzymes and metabolites. The dashed lines indicate that more than one step is involved in a process. (b) The positions of unique amino acid changes in the crystal structure of the haptoglobin-hemoglobin complex. The haptoglobin protein is shown in a cartoon form; the CCP domain is green, and the SP domain is yellow. Of the ten amino acid changes, eight positions are represented by violet sticks; the other two positions are not displayed because they were not included in the complex structure. Hemoglobin is shown with green sticks, and the CCP domain of the contacting haptoglobin is represented in an electrostatic potential surface model (blue, positive; red, negative; white, neutral). The black dots indicate the polar interaction between His137 of haptoglobin and the terminal carboxylate of hemoglobin.

Haptoglobin, an antioxidant protein that functions by controlling heme-induced ROS, exhibited ten cetacean-specific amino acid changes (Supplementary Figs. 25 and 26). Haptoglobins bind free plasma hemoglobins, thereby preventing the loss of iron through the kidneys and protecting against renal damage caused by hemoglobin-derived ROS21. We identified two genetic variations (encoding p.Pro48Leu and p.Val58Leu) at the dimeric interface between the complement control protein (CCP) domains in two residues that are in hydrophobic contact (Fig. 3b). In these cases, replacement with bulkier hydrophobic residues appears to strengthen the contact. We observed another notable variant (p.His137Asn) on the hemoglobin-interacting face of the serine protease (SP) domain. His137 participates in a polar interaction with the C-terminal carboxylate of hemoglobin. The p.His137Asn substitution could facilitate two polar interactions between the amide side chain of asparagine and the terminal carboxylate, thereby strengthening the interaction between hemoglobin and haptoglobin.

In whales, blood lactate concentration increases after prolonged diving22,23, and hypoxia is known to control lactate concentration by activating hypoxia-inducible factor24. Lactate dehydrogenase (encoded by LDH) is the enzyme responsible for converting pyruvate to lactate. In our analysis, we found that the LDHA homolog genes had undergone an expansion in mammals that are known to live under hypoxic conditions, namely whales and naked mole rats (Supplementary Fig. 27). Additionally, we discovered that the genes encoding homologs of monocarboxylate transporter 1 (encoded by SLC16A1, also called MCT1), which catalyze the rapid transport of monocarboxylates such as lactate and pyruvate across the plasma membrane, had also expanded in the whale lineage and in naked mole rats (Supplementary Fig. 28).

Whales extract the majority of their water from food by metabolizing fat, but they still consume seawater under certain circumstances25. The renin-angiotensin-aldosterone system (RAAS) is a key hormone system that regulates blood pressure and water balance in response to sodium level. Notably, we found functional changes in angiotensin-converting enzyme 2 (encoded by ACE2) in the whale lineage (p.Val747Ala, p.Asp798Gly and p.Gln801His in minke and fin whales and p.Asp784Gly in bottlenose dolphin and finless porpoise; Supplementary Fig. 29), and five genes in the RAAS pathway (AGTR1, ANPEP, LNPEP, MME and THOP1; Supplementary Figs. 23 and 30–33) had cetacean-specific amino acid changes.

Mysticeti whale species, including minke whales, grow baleen instead of teeth. We observed that ENAM, MMP20 and AMEL, which are involved in tooth enamel formation and biomineralization26,27, are pseudogenes with premature stop codons in the baleen whales (Supplementary Figs. 34–37). Keratin-related gene families, which are essential for hair formation, were contracted specifically in the whale lineage (Supplementary Fig. 38). Additionally, several HOX genes (HOXA5, HOXB1, HOXB2, HOXB5, HOXD12 and HOXD13), which have an important role in the body plan and embryonic development28, were positively selected in the whale lineage compared to terrestrial mammals, reflecting the morphological adaptation of the whale to the aquatic environment (Supplementary Fig. 39 and Supplementary Note). The adaptations of whales to echolocation (Supplementary Fig. 40), blood clotting (Supplementary Table 51) and oxygen transportation (Supplementary Fig. 41 and Supplementary Tables 52–54) are described in the Supplementary Note.

Analysis of whole-genome sequences can provide a general overview of the total genetic variation in a species, and in this study we identified 1.37–1.59 million heterozygous SNVs in the genomes of the three resequenced minke whales (Supplementary Tables 55 and 56). This gave an estimated nucleotide diversity (mean per-nucleotide heterozygosity) of 0.00061, which is comparable to the nucleotide diversity observed in humans (0.00069)29. The nucleotide diversities of the fin whale (0.00151) and bottlenose dolphin (0.00142) were higher than those of the minke whale and finless porpoise (0.00086). Although blue and fin whales are as genetically distant as gorillas and humans30, they are known to interbreed and produce hybrid individuals31, which could explain the observed high level of heterozygosity in the fin whale. Similarly, bottlenose dolphins are known to hybridize with other dolphins32,33,34. We also inferred a marked population bottleneck in the demographic history of the whale using the pairwise sequentially Markovian coalescent (PSMC) model35 (Fig. 4 and Supplementary Table 57). The minke whale and finless porpoise genomes indicated no substantial population increase during the upper Pleistocene age (12,000–130,000 years ago), whereas the fin whale and bottlenose dolphin populations increased, suggesting that either a population expansion or substantial genetic exchanges through hybridization occurred.
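As a rough consistency check on the heterozygosity figures above, per-nucleotide heterozygosity can be approximated as the number of heterozygous SNVs divided by the number of callable bases. The sketch below is illustrative only: it assumes the 2.44-Gb assembly length as the denominator (the actual number of callable sites is not stated in this section), and the function name is hypothetical.

```python
def heterozygosity(het_snvs: float, callable_bases: float) -> float:
    """Per-nucleotide heterozygosity: heterozygous sites / callable sites."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return het_snvs / callable_bases


# Assumption: total scaffold length (2.44 Gb) stands in for callable sites.
ASSEMBLY_BP = 2.44e9

# Reported range of heterozygous SNVs in the three resequenced minke whales.
low = heterozygosity(1.37e6, ASSEMBLY_BP)
high = heterozygosity(1.59e6, ASSEMBLY_BP)
print(f"{low:.5f} to {high:.5f}")
```

Under this assumption the range is roughly 0.00056 to 0.00065, which brackets the reported mean of 0.00061.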

Figure 4: Estimated whale population size history.

Tsurf, atmospheric surface air temperature; RSL, relative sea level; 10 m.s.l.e., 10 m sea level equivalent; MW, minke whale; FW, fin whale; BD, bottlenose dolphin; PP, finless porpoise; g, generation time; μ, mutation rate (per site, per year). Minke whale and fin whale data were generated on the basis of comparisons with minke whale scaffolds (“-B” after the species abbreviation) during SNV calling, whereas the bottlenose dolphin and finless porpoise data were generated on the basis of comparisons with the bottlenose dolphin scaffolds (“-T” after the species abbreviation) during SNV calling.

To the best of our knowledge, the minke whale reference genome is the first high-depth marine mammalian genome to be sequenced. The cetacean genomes support hypotheses regarding adaptation to hypoxic resistance, metabolism under limited oxygen and high-salt conditions and the development of unique morphological traits. In particular, the expansion of antioxidant-related genes and whale-specific variations in glutathione-associated and haptoglobin proteins are evidence for adaptation to hypoxic conditions during diving. These data will contribute to future studies of marine mammal diseases, conservation and evolution.

Methods

Genome sequencing and assembly.

The samples used for genome sequencing were acquired from four minke whales and a finless porpoise that had been accidentally killed near the east coast of Korea; these incidents were investigated by the Korean maritime police. The bottlenose dolphin sample was obtained from Marine Park in Jeju Island, Korea. The fin whale sample was collected from a dead stranded fin whale by the Southwest Fisheries Science Center. Libraries with different insert sizes were constructed at BGI-Shenzhen (import permit: APO/IL 378/12; export permit: ES2012-00776). The insert sizes of the libraries were 170 bp, 500 bp, 800 bp, 2 kb, 5 kb, 10 kb and 20 kb. The libraries were sequenced using a HiSeq 2000 instrument. Other cetacean species (three additional minke whales, one fin whale, one bottlenose dolphin and one finless porpoise) were sequenced at the Theragen BiO Institute (TBI), Korea, using a HiSeq 2000 instrument with a 400-bp insert library. We applied filtering criteria to reduce the effects of sequencing errors in the assembly (Supplementary Note).

The corrected reads were used to complete the genome assembly using SOAPdenovo-1.05 (ref. 36). Only qualified data were used in the genome assembly. First, the short insert size library data were used to construct a de Bruijn graph. The tips, merged bubbles and connections with low coverage were removed before resolving the small repeats. Second, all qualified reads were realigned with the contig sequences. The number of shared paired-end relationships between pairs of contigs was calculated and weighted with the rate of consistent and conflicting paired ends before constructing the scaffolds in a stepwise manner from the short–insert size paired ends to the long–insert size paired ends. Third, the gaps between the constructed scaffolds were composed mainly of repeats, which were masked during scaffold construction. These gaps were closed using the paired-end information to retrieve read pairs in which one end mapped to a unique contig and the other was located in the gap region. Subsequently, local assembly was conducted for these collected reads. SSPACE v1-1 (ref. 37) was also used to build the scaffolds. The assembly quality was assessed by mapping the DNA and short RNA reads to the scaffolds; 91.0% of the DNA reads and 78.9% of the RNA reads were mapped (Supplementary Tables 9 and 10). The mapping was conducted using BWA-0.6.2 (ref. 38), and SNVs and small insertions or deletions (indels) were called using SAMtools-0.1.18 (ref. 39) with the default options, except that the '-d 5 -D 150' options were used when 'vcfutils.pl varFilter' was executed to filter the SNVs and indels. A total of 141 heterozygous SNVs, which were located in nonrepeat regions, were validated using the Sanger sequencing method, and 139 (98.6%) were true heterozygous SNVs (Supplementary Fig. 5 and Supplementary Table 15). To further assess the assembly quality, the eight minke whale transcriptomes were assembled using Trinity40, and all transcripts were longer than 200 bp. The assembled transcripts were aligned to the minke whale scaffolds using BLAT41 with default options except for an identity cutoff of 90%. We found that over 98% of the assembled transcripts were covered by the minke whale scaffolds (Supplementary Table 12). Additionally, we used the core eukaryotic genes mapping approach (CEGMA)7 to identify the core genes in the minke whale genome assembly. A total of 452 core eukaryotic genes (98.69%) out of 458 were found in the minke whale assembly (Supplementary Table 13).
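The first assembly step, building a de Bruijn graph from short reads, can be illustrated with a toy sketch. This is not SOAPdenovo's implementation: the function name, the reads and the choice of k are hypothetical, and real assemblers add error correction, tip removal and bubble merging on top of this structure.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read.

    Repeated entries in an adjacency list record how often an edge was
    observed, which serves as a simple proxy for coverage.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix node -> suffix node
    return dict(graph)


reads = ["ACGTAC", "GTACGA"]  # hypothetical reads, for illustration only
graph = de_bruijn_graph(reads, k=4)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

In this toy graph the node ACG has two distinct successors (a branch), whereas the edge from GTA to TAC is seen twice. Edges observed only rarely are the kind of low-coverage connection that would be pruned before repeats are resolved.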

Genome annotation.

The genome was searched for repetitive elements using Tandem Repeats Finder42 version 4.04. Transposable elements were identified using homology-based approaches. The Repbase (version 16.10) database of known repeats and a de novo repeat library generated by RepeatModeler43 were used as the repeat libraries for RepeatMasker (version 3.3.0). Four types of noncoding RNAs (microRNAs, transfer RNAs, ribosomal RNAs and small nuclear RNAs) were also annotated using tRNAscan-SE44 (version 1.23) and the Rfam database45 (Release 9.1) (Supplementary Note).

The locations and structures of genes, as well as their biological functions and pathways, were predicted using three approaches (Supplementary Tables 17 and 18). First, de novo prediction was performed using the repeat-masked genome based on a hidden Markov model. The programs used were AUGUSTUS (version 2.5.5)46 and GENSCAN (version 1.0)47. Second, homologous proteins in other species (from the Ensembl 64 release) were mapped to the genome using tBLASTn (Blast 2.2.23)48 with an E-value cutoff of 1 × 10⁻⁵. The aligned sequence and its query proteins were then filtered and passed to GeneWise (version 2.2.0)49 to search for accurately spliced alignments. Third, source evidence generated using the above two approaches was integrated with GLEAN-1-0-1 (ref. 50) to produce a consensus gene set. Additionally, transcriptome sequencing data were mapped to the genome using TopHat51, and gene models were predicted using Cufflinks52. Then, additional gene models were added (mainly from the Cufflinks predictions) to the GLEAN gene set to construct the final gene set. Gene functions were assigned on the basis of the best matches in the alignments using BLASTP with the SwissProt and TrEMBL databases (Uniprot release 2011-08)53,54. The gene motifs and domains were determined using InterProScan (version 4.7)55 against public protein databases, including ProDom56, PRINTS57,58, Pfam59, SMART60, PANTHER61 and PROSITE62. The GO63 IDs of each gene were obtained from the corresponding InterPro entries. All genes were aligned against KEGG64 genes (release 58) (Supplementary Table 19).

Gene families.

Orthologous gene sets were used for genome comparisons. The TreeFam methodology65 was used to define a gene family, which represents a group of genes descended from a single gene in the last common ancestor. BLASTP was applied to all protein sequences using a database containing a protein data set from all species with E values <1 × 10⁻⁷, and fragmental alignments were conjoined for each gene pair by Solar. A connection (edge) was assigned between two nodes (genes) if >1/3 of the region aligned to both genes. An H score ranging from 0 to 100 was used to weight the similarity (edge). For two genes, G1 and G2, the H score was defined as score(G1G2)/max(score(G1G1), score(G2G2)), where score is the BLAST bit score. Gene family extraction, i.e., clustering by Hcluster_sg, used the average distance for the hierarchical clustering algorithm, which required a minimum edge weight (H score) of >5 and a minimum edge density (total number of edges/theoretical number of edges) of >1/3. Expansion or contraction was defined by comparing the cluster size of the ancestor to that of each of the current species using the CAFÉ program66. The expansions of the PRDX1 and OGT genes were validated using quantitative PCR (Supplementary Note). tBLASTn was used to identify regions containing olfactory receptor–related sequences with at least one of the following conserved motifs: MAYDRYVAIC (TMIII), KAFSTCASH (TMVI) or PMLNPFIY (TMVII), or variants thereof with a <40% sequence difference from the conserved motifs (Supplementary Note).
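The H score and the edge criteria described above can be sketched as follows. This is a minimal illustration, not the TreeFam or Hcluster_sg code. It assumes the ratio is multiplied by 100 to give the stated 0 to 100 range, and it applies the H score threshold used for clustering together with the aligned-fraction rule in one check. The function names and the example bit scores are hypothetical.

```python
def h_score(score_12, score_11, score_22, scale=100.0):
    """Bit score of G1 vs G2, normalized by the larger self-alignment score.

    Assumption: the ratio is scaled by 100 to match the stated 0-100 range.
    """
    denom = max(score_11, score_22)
    if denom <= 0:
        raise ValueError("self-alignment bit scores must be positive")
    return scale * score_12 / denom


def keep_edge(h, aligned_frac_g1, aligned_frac_g2,
              min_h=5.0, min_fraction=1.0 / 3.0):
    """Keep an edge if more than 1/3 of both genes is aligned and H > 5."""
    return (h > min_h
            and aligned_frac_g1 > min_fraction
            and aligned_frac_g2 > min_fraction)


# Hypothetical bit scores for a gene pair and their self-alignments.
h = h_score(score_12=250.0, score_11=500.0, score_22=400.0)
print(h, keep_edge(h, aligned_frac_g1=0.6, aligned_frac_g2=0.8))
```

Normalizing by the larger self-score keeps a short gene that matches only a fragment of a long gene from receiving a high weight.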

Genome evolution.

Single-copy gene families were used to construct a phylogenetic tree for B. acutorostrata and the other sequenced mammalian genomes. Fourfold degenerate sites were extracted from each family and concatenated to form one supergene for each species. The HKY85+gamma substitution model was selected, and PhyML v3.0 (ref. 67) was used to reconstruct the phylogenetic tree. Molecular sequence data of fourfold degenerate sites were used to estimate species divergence time using the program MCMCtree v3.0 with the approximate likelihood calculation algorithm as implemented in the PAML package68 (version 4.5) (Supplementary Note).
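Extraction of fourfold degenerate sites can be sketched as below. A third codon position is fourfold degenerate when any base there encodes the same amino acid, which in the standard genetic code holds for eight two-base codon prefixes. The sketch makes a conservative assumption that a column is kept only when every species carries the same fourfold-degenerate prefix. The function name and sequences are hypothetical, and this is not the pipeline used in the study.

```python
# Two-base prefixes whose third position is fourfold degenerate in the
# standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_sites(aligned_cds):
    """Concatenate third-codon bases at shared fourfold degenerate codons.

    aligned_cds maps species name -> in-frame, codon-aligned coding sequence.
    """
    lengths = {len(seq) for seq in aligned_cds.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n = lengths.pop()
    out = {species: [] for species in aligned_cds}
    for i in range(0, n - n % 3, 3):
        codons = {sp: seq[i:i + 3].upper() for sp, seq in aligned_cds.items()}
        prefixes = {codon[:2] for codon in codons.values()}
        thirds_ok = all(codon[2] in "ACGT" for codon in codons.values())
        # Keep the column only if all species share one fourfold prefix.
        if len(prefixes) == 1 and prefixes <= FOURFOLD_PREFIXES and thirds_ok:
            for sp, codon in codons.items():
                out[sp].append(codon[2])
    return {sp: "".join(bases) for sp, bases in out.items()}


# Hypothetical codon-aligned fragments: GCN (Ala), ATG (Met), CCN (Pro).
sites = fourfold_sites({"whale": "GCTATGCCA", "cow": "GCCATGCCG"})
print(sites)
```

Running this per gene family and concatenating the per-species strings would give one "supergene" per species, as described above.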

Single amino acid polymorphisms in the minke whale and bottlenose dolphin genes were compared with those in the cow and pig genes by multiple sequence alignments using ClustalW2 (ref. 69). Protein sequences of the fin whale and finless porpoise were predicted by aligning and substituting the raw reads to the minke whale scaffolds and bottlenose dolphin scaffolds, respectively. Artifacts were removed from the alignments manually, and the filtering option required ≥1/2 coverage and ≥1/2 well-matched amino acids (the consensus string was '*', ':' or '.'). To exclude individual variation, only amino acid changes shared by all the whales tested (four minke whales and two bottlenose dolphins) were used. Significant changes in protein function ('probably or possibly damaging') were predicted using PolyPhen-2 (ref. 70).
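The rule of keeping only changes shared by all tested whales can be sketched as a column-wise scan of a protein alignment. The example below is an assumption-laden illustration: the function name, species labels and sequences are hypothetical, and it omits the coverage and match-quality filters described above.

```python
def lineage_specific_changes(alignment, foreground):
    """Find columns where all foreground sequences share one residue that
    no background sequence carries.

    alignment maps name -> aligned protein sequence (equal lengths);
    foreground is the collection of names forming the lineage of interest.
    Returns (0-based column, foreground residue, background residues).
    """
    fg = [seq for name, seq in alignment.items() if name in foreground]
    bg = [seq for name, seq in alignment.items() if name not in foreground]
    if not fg or not bg:
        raise ValueError("need at least one foreground and one background")
    if len({len(seq) for seq in fg + bg}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    changes = []
    for i in range(len(fg[0])):
        fg_res = {seq[i] for seq in fg}
        bg_res = {seq[i] for seq in bg}
        fixed = len(fg_res) == 1            # identical in every whale
        gap_free = "-" not in fg_res and "-" not in bg_res
        if fixed and gap_free and fg_res.isdisjoint(bg_res):
            changes.append((i, next(iter(fg_res)), "".join(sorted(bg_res))))
    return changes


# Hypothetical aligned fragments, for illustration only.
alignment = {"minke": "MKLHA", "dolphin": "MKLHA",
             "cow": "MKPHA", "pig": "MKPHS"}
changes = lineage_specific_changes(alignment, {"minke", "dolphin"})
print(changes)
```

Only column 2 qualifies here; column 4 varies within the background and shares a residue with the foreground, so it is not reported.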

PSGs were identified on the basis of dN/dS ratios using branch-site likelihood ratio tests for single-copy gene families with a conservative 10% false discovery rate (FDR) criterion10. The minke whale was used as the foreground branch, and the cow and pig were used as the background branches for the PSGs of the minke whale. The bottlenose dolphin was used as the foreground branch for the PSGs of the bottlenose dolphin. The coding sequences of the single-copy orthologous genes were aligned using PRANK71, and alignments shorter than 150 bp without gaps were discarded. The codeml program in the PAML package was used to calculate the log likelihoods for the alternative model and the null model. The FDR was determined on the basis of the q values calculated using the q-value library in R72. All the PSGs were mapped to KEGG pathways and assigned GO terms on the basis of their P values, which were calculated by Fisher's exact test with a 10% FDR. The over-representation of glutathione and glutathione disulfide was validated experimentally using kidney Sp1K cells from the Atlantic spotted dolphin (S. frontalis). Additional information regarding the methods used to identify rapidly evolving GO categories and copy number variations is provided in the Supplementary Note.
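The likelihood ratio test and FDR control can be sketched in a few lines. Two assumptions are made here and should be read as such: the test statistic is compared with a chi-square distribution with one degree of freedom (a common convention for the branch-site test), and the Benjamini-Hochberg procedure is substituted for the q-value library in R that the study used. Function names and example values are hypothetical.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt):
    """P value for the statistic 2 * (lnL_alt - lnL_null), assuming a
    chi-square distribution with one degree of freedom."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # For 1 d.f., the chi-square survival function equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical log likelihoods for one gene (null vs alternative model).
p = lrt_pvalue(lnl_null=-1000.0, lnl_alt=-1000.0 + 1.92073)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
candidates = [i for i, q in enumerate(adj) if q < 0.10]  # 10% FDR
print(round(p, 3), candidates)
```

Genes whose adjusted value falls below 0.10 would be retained as candidate PSGs under the 10% FDR criterion.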

Demographic history.

The population size histories were inferred using the PSMC model35. The consensus sequence of each whale was constructed and divided into nonoverlapping 100-bp bins, which were marked as homozygous or heterozygous on the basis of the SNV data sets. These SNVs were called from the minke whale and fin whale sequencing reads mapped to the minke whale scaffolds and from the bottlenose dolphin and finless porpoise sequencing reads mapped to the bottlenose dolphin scaffolds. The resulting bin sequences were used as the input for PSMC estimation after removal of the sex chromosomes. Bootstrapping was performed to determine the estimation accuracy by randomly resampling 100 sequences from the original sequences. The generation times were derived from a previously published report73.
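The binning step can be illustrated with a short sketch: each nonoverlapping 100-bp window is labeled heterozygous if it contains at least one heterozygous SNV and homozygous otherwise. The single-letter labels ('K' and 'T'), the function name and the example positions are assumptions for illustration, and handling of missing data is omitted.

```python
def bin_genotypes(seq_length, het_positions, bin_size=100):
    """Label each nonoverlapping bin 'K' if it holds at least one
    heterozygous site (0-based positions), otherwise 'T'.

    A trailing partial bin is dropped.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    n_bins = seq_length // bin_size
    labels = ["T"] * n_bins
    for pos in het_positions:
        bin_index = pos // bin_size
        if pos >= 0 and bin_index < n_bins:
            labels[bin_index] = "K"
    return "".join(labels)


# Hypothetical 450-bp scaffold with four heterozygous sites.
bins = bin_genotypes(450, [5, 120, 130, 410])
print(bins)
```

The site at position 410 falls in the trailing partial bin and is therefore ignored, while the two sites in the second bin yield a single heterozygous label.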

URLs.

Minke whale genome, http://whalegenome.net/; SOAP, http://soap.genomics.org.cn/; Ensembl, http://www.ensembl.org/index.html; KEGG, http://www.genome.jp/kegg/; Repbase, http://www.girinst.org/repbase/index.html; RepeatMasker, http://repeatmasker.org/; MCMCtree, http://abacus.gene.ucl.ac.uk/software/paml.html.

Accession codes.

The minke whale whole-genome shotgun project has been deposited at the DNA Data Bank of Japan, European Molecular Biology Laboratory and GenBank under accession ATDI00000000. The version described in this paper is the first version, ATDI01000000. Raw DNA and RNA sequencing reads have been submitted to the NCBI Sequence Read Archive database (SRA090057 and SRA091100).