Author Correction: The axolotl genome and the evolution of key tissue formation regulators

In the originally published version of this Article, the sequenced axolotl strain (the homozygous white mutant) was denoted as ‘D/D’ rather than ‘d/d’ in Fig. 1a and the accompanying legend, the main text and the Methods section. The original Article has been corrected online.


Overview
Reads are overlapped, subsequently patched and overlapped again. The initial overlap doesn't have to be a full all-against-all comparison, but a partial overlapping run, comparing a block not against all others, but rather just enough (aligning a read to reads summing up to 15x coverage usually suffices), in order to have enough information for the patching. After the second overlapping run with patched reads, they are annotated (read quality, read trimming, repeat regions). The annotation is then used later to filter repeat induced alignments, detect chimeras and other read artifacts that could negatively impact the assembly pipeline downstream. After this filtering of alignments, we should be left with only true overlaps, based on which the overlap graph is built and then subsequently toured. The paths through the overlap graph represent the contigs.

Requirements
For anything beyond bacterial samples, which can be assembled rather quickly on a standard laptop, a cluster environment with sufficient CPU and storage is recommended. We recommend having at least 8GB of RAM per CPU core for the cluster nodes and a fast interconnect to the cluster-attached storage. RAM per core demand can be lessened by further increasing the partitioning of the data into blocks, at the price of increased I/O demand. Example CPU times can be found in Supplementary Table 1 and have been measured on our in-house cluster based on Intel's Sandy-Bridge and Haswell CPUs. Storage demands increase as the coverage and the repetitiveness of the genome increase. Again, storage requirements for a single phase of MARVEL can be found in Supplementary Table 1. For peak storage requirements, this number needs to be multiplied by two.

Major features
1) A front-to-end assembler of noisy long reads that is not dependent on external toolkits and libraries MARVEL was built with the intention to provide a clean code base that is as free of external dependencies as possible. External dependencies often result in the blight of so called version-chaos, which forces the end-user to have libraries and tools of a certain version installed, thereby lowering the ease-of-use significantly.
2) Uncorrected assembly MARVEL breaks with the established paradigm of correcting the reads to >99% identity prior to assembly. It is vital for the read correction that the reads used to derive the consensus (i.e. corrected) sequence are not taken from the repeat induced alignments. However, given the knowledge of which alignments constitute true overlaps, assembling immediately without prior correction is achievable. This has the added advantage of not constructing hybrid sequences that represent essentially the "average" of the repeats. The probability of creating hybrid sequences correlates with the repeat content of the genome. For genomes featuring a non-excessive repeat content, hybrid sequences can largely be avoided by employing heuristics that enrich for true overlaps, such as using the longest alignments for correction. But this approach starts breaking down rapidly with >50% repeat content, massive heterozygosity or polyploidy. The resulting hybrid sequences subsequently cause problems in the assembly pipeline, resulting in subpar assemblies.
3) Read patching (in lieu of correction) Instead of correcting the reads, we opted for a process we called patching, which repairs large-scale errors (missed adaptors, large "random" inserts, high-error regions, missing sequence) that result in a stop of the alignment once MARVEL reaches such erroneous regions. The main point here is that when restoring large-scale errors with uncorrected parts of others reads, the sequence used for the correction only needs to be good enough to allow the alignment to continue at the native error rate and doesn't actually have to be the right one. Since the likelihood of encountering a large-scale error in a read scales with its length, the most valuable long reads are the ones affected the most and, therefore, patching represents an important way to preserve the long reads for the use in the assembly. Patching essentially works by finding regions in the reads that are: a) of low quality. The quality (i.e. error rate) across a read is derived from the identity of the alignments to it. b) not spanned by any alignment.
The two scenarios above are treated as follows, a) is corrected by taking the sequence of a read that spans the low-quality region and replacing the low-quality region with it. Whereas in case of b), we look to the left/right of the break and see if the same reads are on both sides and that the size of the gap from the perspective of those reads is roughly the same. The region in the read is then replaced by sequence taken from one of the reads that align to it. Supplementary Figures 4-6 illustrate the patching process. Its impact on read length distributions and subsequent assembly statistics can be seen in Supplementary Figures 7 and  8.

4) Repeat masking during overlapping
For large genomes whose full assembly would take prohibitively large amounts of CPU time and disk space, MARVEL offers the option of dynamic repeat masking during the overlapping phase. This is achieved through a component called the masking server, which runs in parallel to the overlapper jobs. This server analyses the alignments computed by the overlapper jobs, collects coverage statistics and derives a repeat annotation from them. The repeat annotation is queried by the overlapper jobs and used for soft-masking regions in reads, causing them to be excluded from the k-mer seeding (i.e. no alignment starts in them). This results in compute time and storage requirements savings of up to one order of magnitude.
Repeat regions are masked and the pile-up of repeat induced alignments is largely never computed and completely removed later-on during the assembly process. This has, in addition to saving compute time and disk space, the added advantage that junctions in the assembly graph are disambiguated ( Supplementary Fig. 9), if there are reads available that span them. This feature significantly reduces the amount of work that has to be performed when touring the assembly graph. 5) Post-assembly correction MARVEL performs read correction after the assembly is done. This ensures that true overlaps are used for the correction and only the reads used to build the contigs need to be corrected, thereby resulting in often significant compute time savings. The correction module can also be used without performing an assembly to produce a complete set of corrected reads, if that is desired.

6) Assembly of uniquely decomposable repeat regions
The resolution of the repeat annotation computed either for the on-the-fly read masking or later on in the assembly process is not at the base-pair level, meaning, that neighboring repeats that are only a few bases apart from one another are merged into a single repeat during the repeat detection. During the assembly, we attempt to decompose repeats into their respective modules and furthermore establish whether the pattern of the repeat modules is unique in the genome. Given a unique pattern of consecutive repeat modules it is often possible to find true overlaps through repeat regions, which otherwise would have resulted in diminished assembly contiguity. 7) Post-assembly polishing Finally, quiver can be used after the assembly resulting in a high-quality sequence. However, it often requires a rather long runtime, especially with the genome the size of the axolotl. In order to reduce the long runtimes and circumvent the execution of blasr, which is the main source of the long runtimes, the position of the reads within the assembly are recycled. This was not possible with the axolotl assembly since the coverage of the reads in the final assembly was too low for a high-quality output from quiver. Therefore, the patched and trimmed reads were corrected and then used to correct the final version of the assembly, which was subsequently polished using short reads.

Setup phase
The highest scoring read of each ZMW from every SMRT cell based is extracted using H5dextract. The following metric is used to score the reads: read_score = 0.2 * read_length + 0.4 * read_quality + 0.4 * slow_polymerase_events The fasta files are then converted to a daligner database providing a compression factor of 4x. This is done by FA2db, during this step the read statistics are gathered and can be viewed with DBstats. Since the read database of certain genomes can exceed hundreds of gigabytes in size, making a direct all-against-all comparison infeasible, thereby necessitating a blockwise overlap computation. For that, the database file is split into blocks of 400MB or smaller.
If the alignments are run on a high performance cluster, the strategy to compute all local alignments between all blocks of the daligner database is created by HPCaligner. Also a strategy to merge all resulting overlap files is created. HPCaligner creates jobs only for the upper triangular block matrix. This happens because daligner produces symmetrical overlap files, i.e. when daligner computes overlaps i,j between databaseBlock i and databaseBlock j the corresponding overlap j,i is also computed. In order to reduce the size of the alignment files on the harddrive, only the anchor positions (called trace points) of the alignment between read i and read j in reference to read i are saved. However, in order to quickly recompute the alignment between read i and read j correctly the trace points in reference to read j also need to be saved. Trace points are set every 100bp.

Patch phase
The local alignment computation is by far the most time-consuming part of the pipeline and can result in computational times of several CPU-months, especially for genomes that exceed 1 Gb and/or have a high repeat content. Therefore, the overlap computation can be performed in 5 different modes (for axolotl mode 4 was used): mode 1: compute all local alignments mode 2: dynamically mask repeat regions WWW.NATURE.COM/NATURE | 9 SUPPLEMENTARY INFORMATION RESEARCH mode 3: dynamically mask contained reads (i.e. reads that are completely contained within other reads) mode 4: dynamically mask repeat regions and contained reads mode 5: dynamically mask repeat regions and contained reads, exclude masking of the flanking regions The dynamic masking is managed by a masking server that keeps track of the number of the local alignments for each read as well as of their exact positions within that read. A previously calculated masking track, if available, can also be used instead of the masking server. In addition, the masking server can be initialized with previously calculated tracks. This was done for the second daligner run in the assembly phase by using the calculated repeat track of the first daligner run. However, this required the adjustment of the coordinates of the previously computed repeat annotation, since those could have shifted during the patching of the reads.
Before daligner compares databaseBlock i with databaseBlock j it fetches the mask tracks for these blocks from the masking server. Network traffic is negligible since block specific tracks are comparatively small (< 1 MB). Once daligner completes the alignments, the masking server analyzes the alignments and updates the read statistics and the masking information.
In order to keep the masking evenly distributed over all reads and mask the dominant repeats first, we schedule daligner jobs processing the main diagonal of the block-vs-block matrix first. Although the diagonal block scheduling has no effect on the final masking results, it seems to enhances the performance due to the early uniform masking of the most common repeat elements. In highly repetitive genomes such as that of the axolotl, the repeat induced masking phase can result in certain reads being (almost) completely masked, effectively eliminating potential alignment seed k-mers and thereby resulting in no overlaps between such reads. This is most likely the reason for the fragmented assembly of the axolotl genome. Nevertheless, it was possible to generate one of the best assemblies of such a giant genome. Had the assembly not been handled in such a way, the demands towards computational time and storage would have been prohibitively large.
For each databaseBlock i a merged file with all overlaps of that block is created called DB.i.las is created using LAmerge.
Based on the alignments computed previously, a new set of reads is derived from the original reads, their local alignments and quality in a process called patching, performed by LAfix. Prior to LAfix read quality (with the resolution of 100 bp segments) and trim information is computed using LAq.
LAfix then eliminates the worst sequencing errors that are validated by other reads. This is based on the local alignments and the previously generated trim tracks. Errors that are patched, including gaps up to 500 bp, are filled. In order to be filled they must be supported by at least 3 high quality reads. Also bad read segments are replaced by good quality segments. A new database with the patched reads is then created and used for the following overlap run.

Assembly phase
First, all overlaps for the assembly phase with the previously patched reads are calculated. For the second daligner run in the assembly phase the repeat annotation of the first daligner run can be reused. The following steps are based on the generated overlaps. First, the remaining gaps shorter than 50 bp within the pairwise alignments were stitched with LAstitch. Then the quality scores for all reads are recalculated and trim tracks generated once again by LAq and subsequently trimmed by LAtrim. This is followed by the repeat annotation of LArepeat. Repeat regions are determined based on the coverage of the overlaps. If a potential repeat region has a coverage of more than twice the expected coverage of the genome the region is annotated as a repeat. All following bases that have a coverage of at least 1.5 times the expected coverage are added to the repeat region, this is in order to compensate for coverage fluctuations in repeat regions. Once the coverage falls below 1.5 times the expected coverage the repeat region is closed. After which, the reads are scanned for gaps (points which are not spanned by any overlap) using LAgap. Gaps can exist at this stage due to either a read being a chimera or left-overs of the "weak" regions in the reads that were not detected in previous stages. In order to resolve a gap, the overlaps from the shorter side are discarded. Gap resolution is followed by a round of trimming. Now with properly trimmed reads devoid of any large-scale errors we can perform a final filtering of the overlaps using LAfilter, discard local alignments and remove repeat induced overlaps. In addition, MARVEL tries to salvage overlaps of repeat regions that span the complete repeat module. This is done by trying to decompose larger repeat regions into a unique series of modules.
Based on the final set of overlaps the overlap graph is built using OGbuild. Two passes over the final overlap files are performed. Pass 1 computes memory requirements for the overlap graph and classifies reads as either proper or contained. Contained means that the read aligns end-to-end (trimmed) to another read. Assuming transitivity in the overlaps, contained reads hold no valuable information and can be ignored. Pass 2 then builds the overlap graph based on overlaps between proper (i.e. non-contained) reads. Afterwards, the graph is written to disk for the subsequent touring. Due to the erroneous nature of the reads and the greedy alignment strategy used by daligner, the assumption of transitivity can break. Therefore, we optionally allow iterative extension of the graph using overlaps involving contained reads, what would otherwise result in dead-ends in the graph during the touring.
Touring the overlap is performed by OGtour. Nodes (reads) are ranked by their read quality and the number of edges (overlaps) leaving into both directions (ends of the read). Nodes with a distance of two or less to an existing tour in the graph, that only have edges in a single direction (ie. dead-ends) or have otherwise been tagged as excluded are not used a starting nodes. The touring continues until all potential start nodes have been exhausted. Given a start node the next node is determined by finding all potential paths of a length n (i.e. the lookahead) from the start node. The edge of the next node used is the edge that is most common amongst those potential paths within the look-ahead. This is continued until the current node has no neighbors left in the direction the path is growing or only has visited nodes as neighbors. Touring then tries to extend the path from the starting node of the tour into the other direction.
After the graph has been toured, the contigs (ie. the paths of the tour) are written either using the patched reads or by running read correction on the final set of overlaps (which all should be true overlaps) and using the corrected reads as drop-in replacement of the patched reads.

Determining the read quality in MARVEL
The quality of a read is inferred at the resolution of 100 bp segments based on the local alignments it receives during the overlapping. The average quality of up to 20 alignments is used as a measure of the level of noise in the read.

Annotation of repeat regions in MARVEL
The repeat regions are annotated initially by the masking server and later by LArepeat. While both use the alignments to determine repeat regions the masking server annotates the repeat region dynamically and once the repeat coverage threshold is exceeded (in most cases twice the sequencing coverage) the region is masked and not used for subsequent k-mer seeding. LArepeat runs sequentially over the reads to annotate the repeat regions. Once a base with a coverage that exceeds the repeat coverage (in most cases twice the sequencing coverage) a repeat region is opened. A repeat region is closed once a base under the non-repeat coverage is found. The minimum overlap length of 1 kb can result in missing repeat annotations if the ends of a read are repetitive, but do not reach far enough into a repeat (ie. have less than 1 kb of repetitive sequence). TKhomogenize can be used to transitively transfer an existing repeat annotation based on the computed alignments, thereby properly annotating the tips of the reads.

Read correction in MARVEL
Despite the noisy nature of PacBio long-reads, MARVEL does not correct the reads to an error-rate of less than 1% as is common practice other assemblers when dealing with noisy long-reads. The reads are first patched after the initial overlap phase, removing the major sequencing related InDels, chimeras and additional artefacts. This is necessary for highly contiguous assembly. At the end of the assembly, only the true overlaps are left and the consensus can then be safely calculated based on the remaining overlaps.

MARVEL assembly pipeline for axolotl
A standard assembly consists of the following steps.
1. H5dextract -extract the reads from the sequencing data generated by the sequencer 2. DBprepare -creates a database for the overlapper, splits it into blocks and creates a "plan" (the set of commands to compute the overlaps and merge them) 3. daligner -the overlapper used by MARVEL 4. LAq -calculates the read quality 5. LAfix -repairs the large-scale errors within the reads 6. DBprepare -creates a database for the overlapper, splits it into blocks and creates a "plan" (the set of commands to compute the overlaps and merge them). This run is based on the patched reads. 7. LAstitch -repair alignments that have been interrupted a left-over bad region 8. LArepeat -create the repeat repeat annotation of the reads 9. TKhomogenize -transitively transfer the repeat annotation using the local alignments 10. LAq -calculate the read quality and trim information 11. LAgap -find borders where alignments break and discard the shortest read side.
Ignore if the gap is in a repeat region. 12. LAq -update the trim track, but only for changes due to LAgap 13. LAfilter -purge all local and repeat induced alignments 14. OGbuild -build the overlap graph 15. OGtour -tour the overlap graph and export the contigs

MARVEL assembly results for other genomes
To determine MARVEL's runtime and space requirements and the quality of its assemblies we performed assemblies of various genomes. See Supplementary Table 1 and Supplementary Figure 10. We further tested our performance on Arabidopsis datasets of varying coverages and found that even at 5x satisfactory assemblies are possible ( Supplementary Fig. 11).

Example of an alignment of a transcript sequence to the genome sequence
The genome sequence was corrected using Illumina sequencing data as described in section 2.2.2, and the sites of heterozygosity determined as described in section 2.2.3. Supplementary  Figure 12 provides an illustration of the transcript sequence to genome sequence alignment before (left) and after (right) the correction/heterozygosity analysis. On the left, the positions marked in red with an asterisk are mismatches or InDels found in the alignment to the uncorrected assembly. On the right, the green arrows mark sites that were corrected, whereas the green arrowheads mark sites where heterozygosity was confirmed. One site was still ambiguous due to lack of Illumina sequence coverage. Further data on the contiguity of the transcriptome contig alignments to the corrected genome is given in the transcriptome section 3.5.

Genome correction
We generated 524,974,802 paired-end (2x250 bp) Illumina reads from libraries made from DNA of the same individual used for PacBio sequencing. The Illumina library had an insert size of 440 bp. This represented approximately seven-fold coverage of the genome (estimated using Pilon 40 ). The reads were mapped to the PacBio genome assembly by Bowtie2 41 using the default settings. Afterwards, Pilon was run to correct both, InDels and polymorphisms with a setting of at least 5 reads covering that particular site being required (Pilon default). After the correction, the PacBio assembly was used to generate the hybrid Bionano scaffolded assembly.

Genome heterozygosity analysis
In order to estimate heterozygosity based on the sequence of the single individual, the same reads that were used to correct the PacBio assembly were mapped to the scaffolded Bionano assembly. SAMtools 42 mpileup was used to generate the BCF file that was further filtered by vcfutils.

DNA extraction and data collection
Genome mapping was performed using Bionano's Saphyr™ System based on NanoChannel array Technology. DNA was extracted as previously described 43 with some alterations. In brief, 50 mg of frozen axolotl spleen tissue was incubated with 2 mL of 2% formaldehyde and 2 mL EtOH to stabilize the DNA. The tissue was disrupted to produce a suspension by blending it with a QIAGEN Tissue Ruptor for 10 seconds. The resulting suspension was embedded in agarose gel plugs in a four point serial dilution titration and processed as described in Bionano Prep Animal Tissue DNA Isolation Fibrous Tissue Protocol. After DNA quantitation, 900 ng of DNA was labeled with 2.1 units of Nt.BspQI and 18 units of Nb.BssSI enzymes in separate labeling reactions. Labeling was completed according to the Bionano Prep Animal Tissue Kit protocol. Each enzyme reaction was run on the Saphyr System. 2.813 Tbp of data were collected on three Saphyr Chip™ for Nt.BspQI and 2.0 Tbp of data was collected on two Saphyr Chip for Nb.BssSI samples, single molecule N50 lengths were 240 kb and 184 kb, respectively.

De novo assembly and two enzyme hybrid scaffolding
Each data set was de novo assembled using Bionano Solve™ 2.1 software as described previously, but clusterArguments.xml was modified to make all stage0 jobs (e.g. refineB0) run on hosts with 128GB memory, and Assembler, alignmol and svdetect jobs on hosts with 256GB memory. In addition many regular jobs that would normally run on the 8GB MIC accelerator cards were instead run on hosts with 128GB memory. In addition the optArguments.xml was modified as follows: To reflect a genome approximately 10x larger than human, the molecule alignment p-values were made 10x more stringent (eg from 1e-11 to 1e-12 for molecule alignments) molecule minlen increased from 150kb to 160kb (180kb for pairwise alignment); to reduce memory usage in score tables Kmax, the maximum consecutive number of misresolved labels, was reduced from 6 to 3. The de novo assemblies were 39 Gb and 34.7 Gb in total length for Nt.BspQI and Nb.BssSI, respectively. Hybrid scaffolding of the PacBio assembly was performed with Bionano Solve hybrid scaffold pipeline for two enzymes (https://bionanogenomics.com/wp-content/uploads/2017/03/30073-Rev-B-Bionano-Solve-Theory-of-Operation-Hybrid-Scaffold-1.pdf). Briefly, two Bionano de novo optical maps (Nt.BspQI and Nb.BssSI) and one sequence assembly are input into the pipeline. Each map is aligned to the sequence assembly after it was in silico digested (enzyme defined sequence motifs are marked to produce a map) and discrepancies are detected, i.e. where a Bionano map diverges from a sequence map by approximately 50 kb or more. Sites of divergence are usually chimeric sequence contigs and sometimes chimeric Bionano maps or phasing errors where large heterozygous structural variants exist. At and around conflict points (10 kb flanking the conflict point), quality scores of the Bionano assembly are assessed and, unless the score is weak (≤ 35), the sequence contig is cut. The conflict resolved contigs and maps are then used for hybrid scaffolding. This is done in two steps. First, each set of Bionano maps are separately merged with the corresponding sequence contigs to create two sets of single enzyme contiguous hybrid scaffold. Next, the two sets of single enzyme hybrid scaffolds are further merged into two enzyme hybrid scaffolds. This two enzyme hybrid scaffold map which has a higher density of enzyme sites, is used for anchoring smaller sequence contigs. Finally, an AGP file is generated and a FASTA file for the scaffolds is exported. To accommodate for genome of this size, the maximum memory limits for the single enzyme hybrid scaffold and two enzyme hybrid scaffold were raised to 256MB and 512MB, respectively. The Bionano map orthogonally validates the sequence assembly for 81% of the sequence after resolving 4,828 putative chimeras; (the other 19% is too small to align based on optical pattern mapping).

Tissues sampled, RNA preparation and sequencing
We assembled a de novo transcriptome primarily based on Illumina sequencing of multiple tissues (Supplementary Table 3) and also incorporated sequencing information available from previous projects 44 .

RNA isolation
Total RNA was isolated using TRIzol (Invitrogen) from all but the cell culture and limb blastema samples, for which RNeasy isolation (Qiagen) was used due to very limited amounts of starting material. The RNA quality was assessed using Agilent BioAnalyzer requiring the RNA integrity number (RIN) 45 to be 8 or higher.

Illumina library synthesis and sequencing
mRNA enrichment was carried out using 1µg of total RNA and the NEBNext Poly(A) mRNA Magnetic Isolation Module (New England BioLabs, NEB). After elution in 15µl 2x first strand cDNA synthesis buffer (NEBnext, NEB) RNA was chemically fragmented by incubating for 15 min at 94°C. The sample was immediately subjected to the strand specific RNA-Seq library preparation (Ultra Directional RNA Library Prep, NEB) using custom ligation adaptors (Supplementary Table 4) 46,47 . In this method dTTPs are replaced by dUTPs (uracil) when synthesizing the second cDNA strand, which later is selectively degraded by uracil-DNA-glycosylase. This only leaves the first strand, which has the opposite orientation to that of the original mRNA. This property of the strand-specific library was used later to determine the proper orientation of the assembled contigs (see Supplement section 3.2).
After the ligation the adaptors were depleted by SpriBead bead purification (Beckman Coulter). Indexing was done in the following PCR enrichment (15 cycles) using custom amplification primers (Supplementary Table 5) carrying the index sequence indicated by 'NNNNNN'.
After SpriBead purification (1:1) the libraries were size selected for fragments in the range of 200-300 bp on the 1,5% Agarose gels (E-Gel, Invitrogen). The DNA was eluted using the QIAquick gel extraction Kit (Qiagen) and quantified with Qubit dsDNA HS Assay Kit (Invitrogen). For Illumina flowcell production samples were equimolarly pooled and distributed on all lanes used for 76 bp, 100 bp or 125 bp strand-specific paired-end sequencing on Illumina HiSeq 2500 or Illumina HiSeq 2000, respectively (Supplementary Table 3

Filtering of Illumina raw reads
Sequences were filtered using a Phred score of greater than or equal to 30 over at least 70% of the read length. Additionally, low complexity reads were removed using the DUST filter 48 and homopolymers using a custom produced software tool. The FASTX toolkit [http://hannonlab.cshl.edu/fastx_toolkit] was used to collapse identical reads into a single sequence. In order to preserve paired-end information, the entire pair was removed if one of the reads did not pass the filters.

de novo transcriptome assembly using Trinity
After quality assessment and data filtering, 1.5 billion Illumina reads and 17,522 Sanger reads were obtained from different libraries. For simplicity unassembled Sanger reads will be referred to as ESTs.
For the complete assembly all Illumina reads, except oocyte and islet libraries were pooled together and assembled by Trinity 49 version r20140717. Oocyte and islet libraries were assembled individually. All non-Illumina ESTs were assembled separately using Mira 50 (v.4.9.3) 50 .
BLAST was used to align the sequences from the two datasets and then CD-Hit 51 set at 95% identity (-c 0.95, -aS 0.8) was used to create a non-redundant dataset. The subsequent merging of both assemblies allowed the joining of some of the Illumina-based contigs that belonged to the same transcript but had failed to be joined due to low coverage at the junction. The 180,649 contigs of the assembly added up to 230,144,606 bases.
The statistics of the final transcriptome are summarized in Supplementary Table 6.

Transcriptome annotation
The transcriptome was annotated (Supplementary Table 7 and Supplementary Fig. 13) using two related approaches: i) the contigs were BLASTed 52 (BLASTx, e-value threshold 1e-5) against the NR database release (2017-07-13) limited to vertebrate GIs only; and ii) for each contig all possible ORFs longer than 69aa were generated and BLASTed (BLASTp, e-value threshold 1e-5) against the NR limited to vertebrate GIs. The rationale behind using two steps rather than a single BLASTx step was as follows: BLASTx allowed for annotation of contigs that had an internal InDel causing a frameshift that would disrupt the ORF, while using BLASTp with predicted ORFs allowed us to identify assembly chimeras, as those might have two annotated, but distinct ORFs.

Assessment of transcriptome completeness
All proteins from the CEGMA 53 dataset were BLASTed against the transcriptome assembly. Of 2,748 proteins belonging to 458 families (6 proteins per family), 2,620 proteins belonging to 455 families could be identified as homologous (e-value threshold 1e-20). The three families that do not have any homologs are KOG2311, KOG1762, KOG0602; these contain only arthropod proteins that have no hits outside of the Arthropoda.
We therefore downloaded the CEGMA subset containing human proteins only and BLASTed those against the transcriptome assembly. Of 456 proteins, we found hits for 452. We also BLASTed the remaining 4 manually and found that 3 indeed have no hits in the assembly, while the last protein has a partial hit, but the e-value was to high (1e-9) for the chosen threshold (1e-20). Those four proteins are: NP_009111___KOG0602 Trehalase, a protein of the small intestine, a tissue that was not sequenced. Trehalase is also absent in several mammals, thus, it is not an essential gene. NP_036255___KOG2311 MTO1 NP_057736___KOG1915 CGI-201 NP_998890___KOG1762 60S acidic ribosomal protein BUSCO analysis was also performed and arrived at the parameters listed in Supplementary  Tables 8 and 9. However, we find this analysis is rather stringent, since manual BLAST analysis of some of the supposed 185 missing BUSCO genes yielded clear matches with significant e-values, indicating that BUSCO's sensitivity is lower than that of BLAST. Therefore we consider the analysis based on CEGMA genes more reliable.

Transcriptome-to-genome mapping statistics
The transcripts were mapped to the genome using simple BLASTn (e-value cut-off 1e-10). If up to 10 (on average 3) BLAST hits were considered and a cumulative coverage of each transcript was calculated ( Supplementary Fig. 14), then 151,257 (85.260%) of all transcript contigs mapping to the genome are covered to 95% or more along their lengths (Supplementary Table 10). There were 3,244 transcript contigs that did not map to the genome. 35 of those have clear homologs (BLASTn, e-value cut-off 1e-30) among different bacterial sequences and are likely to represent contamination of the sequenced samples used

Genome completeness assessment -alignment of ultraconserved elements (UCEs)
We first used non-exonic ultraconserved elements to assess assembly completeness. From the 481 ultraconserved elements, originally defined as genomic regions ≥ 200 bp that are identical between human, mouse and rat 54,55 , we obtained the subset that does not overlap with exons according to the human Ensembl gene annotation. For the remaining non-exonic ultraconserved elements, we used liftOver with parameter -minMatch=0.1 to obtain those that align to chicken (galGal5 assembly) and the teleost fish zebrafish (danRer10 assembly) and medaka (oryLat2 assembly), and thus are likely under strong selection in vertebrates. This resulted in 197 vertebrate non-exonic ultraconserved elements. We used Blat 56 with sensitive parameters (-minIdentity=60 -minScore=30 -minMatch=1 -stepSize=8 -mask=lower) to align these sequences against our axolotl genome, the Xenopus xenTro7 assembly, the tibetan frog nanPar1 assembly and the coelacanth latCha1 assembly. All Blat alignments were filtered for a minimum of 75% identity and at least 50 bp.
The number of aligning UCEs is given in the main text. Inspecting the three non-aligning UCEs, we found that UCE.168, which is 250 kb upstream of the arrestin domain containing 3 (Arrdc3) gene, is also not present in the other two frogs, indicating an amphibian-specific loss and not an assembly issue.

Gene annotation
We obtained all transcripts that were annotated by BLAST and thus have a homolog in other species. We filtered these transcripts for an open reading frame of at least 150 amino acids, resulting in 39,753 transcripts. Additionally, we obtained 1,398 transcripts that are not annotated by BLAST but have an open reading frame of at least 200 amino acids, and thus are likely novel protein-coding genes. We observed that our transcriptome assembly contains many transcripts that overlap in their exons but have different starts and ends, and align to the same protein in other species. These partial transcripts therefore belong to the same gene. To reduce this redundancy and obtain gene models with a higher completeness, we clustered all transcripts that have overlapping exons on the same strand. This resulted in 23,251 annotated coding genes. It should be noted that partial transcripts mapping to different contigs, likely because of long, repeat-rich introns, cannot be clustered. Therefore, the true number of coding genes in the axolotl is likely smaller. (JGI 4.2) and zebrafish 25,672 (GRCz10)) by downloading protein-coding genes from Ensembl Biomart (Ensembl genes version 89) 57 .

Transposable element annotation and expansions
Repeat content was assessed in two ways. An all-on-all mapping of the contig assembly was performed with a repeat coverage threshold of 8X, which resulted in a repeat content figure of ~62%. In parallel, the RepeatModeler package (www.repeatmasker.org) was used to identify repeated regions in the contig assembly (28.4 Gb total) and to generate an axolotlspecific repeat library. This analysis yielded a repeat content of ~61%. In order to classify repeats, RepeatMasker was run with the axolotl-specific repeat library, as well as the available vertebrate, tetrapod and amphibian libraries.
The final repeat statistics shown in Figure 2 and in the main text are based on merging the RepeatMasker annotation (Supplementary Table 11) with the annotation of tools that are specialized in detecting LTR elements (Supplementary Table 12, also see Supplement section 4.2.3 below).

LTR element identification and clustering
Given the abundance of LTR elements, we performed a deeper analysis of those. LTR elements were identified independently using 3 prediction programs: LTR_FINDER 1.0.6 58 (run with and without inbuilt masking of highly repetitive sequences), LTRharvest (GenomeTools 1.5.9), and MGEScan-LTR 59 . Where programs expose the options, maximum and minimum element sizes were set to 50 kb and 2 kb, respectively, and minimum LTR size to 150 bp. As this was program-dependent, however, all identified features were further filtered to be within the defined size range.
To merge sets of predictions, identified features overlapping one another >=80% were merged using BEDOPS 2.4.20 60 . Additionally, features containing or overlapping other features <80% (potentially indicative of nested repetitive elements) were discarded, as were those within 100 bp of the start or end of assembled scaffolds. To ensure the accuracy of predictions and to better exclude nested elements, all predicted regions were translated in all 6 frames and scanned using hmmsearch (HMMER 3.1b2) 61 for protease ('RVP' PF00077.15), reverse transcriptase ('RVT_1' PF00078.22), RNase H ('RNase_H' PF00075.19), and integrase ('rve' PF00665.21), and with the equivalent HMMs retrieved from GyDB 2.0 62 . Individually for each scan, predicted regions with more than one hit were removed and a final list of regions compiled that contained at least one hit from any HMM.
Analysis of the frequencies of predicted element lengths (Extended Data Fig. 1) revealed few elements >16 kb (10.9% of total) and no visible spikes in frequency indicative of expansions of a particular repeat family above this size. Therefore, de novo clustering was performed on elements <16 kb using vsearch 2.4.2 63 . Elements were split into 250 bp-500 bp overlapping bins according to their length, filtered to exclude any low-complexity sequences with bbduk (bbtools 35.66), and clustered with edit distances <=15%. Any clusters containing <10 elements were excluded and passing elements from all bins were de-replicated and again clustered to give final groupings.
Element groups were aligned with MAFFT 7.271 64 and 'Strict' gap-less consensus sequences formed (threshold for consensus set to 0.5, an ambiguous base 'N' was inserted if no base formed a majority). As predictions were from overlapping bins, they were finally dereplicated.
Translations of identified elements were clustered alongside the GyDB 2.0 elements according to their reverse transcriptase (RT) sequences using MAFFT and a neighbor-joining tree built with FastTree 2.1.9 65 with bootstrapping. Placement of the elements was manually assessed to assign their membership of the Retroviridae, Ty3/Gypsy, and Bel/Pao lineages.
Different repeat classes were classified by RepeatMasker using a custom library (Supplementary Phylogenetic analysis on the reverse transcriptase of LTRs and comparison to previously identified elements indicated the presence of three distinct classes of LTR retroelements: Ty3/Gypsy, Bel/Pao, and retroviridae 62 (Figure 2b), with Ty3/Gypsy-type LTR retroelements representing the largest group (11.3%, 3.2 Gb) (Supplementary Table 12). Length analysis of the 2,077 consensus sequences revealed that the epsilon-and spuma-retrovirus groups contribute the longest elements with lengths in excess of 10 kb (Figure 2c, Extended Data Fig. 1).
To estimate the relative age of the LTR retroelements, we computed the number of substitutions to the repeat consensus, which indicated transposon activity over a long period of time followed by a recent and apparently continuing burst of expansion (Figure 2d). This is a profile consistent with previous characterizations of LTR element activity based on 1% genome sequencing of several salamander species 66 . An expansion of LTR retroelements was also observed in other salamanders, conifers and sunflowers 67,68 , suggesting that LTR expansion is a common feature of giant plant and animal genomes.

Genes with short vs. long introns
To compare the intron size distribution between axolotl and other species, we first obtained 1:1 coding gene orthologs between human, mouse and frog from Ensembl Biomart (version 89). To infer axolotl orthologs of human coding genes, we used reciprocal best hits from BLAST searches between axolotl transcripts and human proteins (TBLASTn for BLASTing human against axolotl and BLASTx for the reciprocal BLAST search) with an e-value cutoff of 1e-20. This resulted in 4,949 coding genes with a likely 1:1 orthology relationship between the four tetrapods. We compared the intron size distribution of all these genes, resulting in a median size of 22,759 bp for axolotl, 1,750 bp for human, 1,469 bp for mouse and 906 bp for frog, as mentioned in the main text. Afterwards, we divided this set of genes into developmental genes (a list of developmental genes was kindly provided by Boris Lenhard, personal communication; see Supplementary Table 14) and other genes. For both sets, we determined the size of all introns and used a two-sided Wilcoxon rank-sum test to assess if the differences are significant (Figure 3a, Supplementary Table 15).
Next, we compared functional enrichments between genes with short and long introns. We used the size of the gene (all exons and introns; from transcription start to end) as a proxy for genes with only short vs. at least one large intron. The rationale is that the gene size will be small if all introns are short. In contrast, a single large intron will result in a large gene size. We used the size distribution of all annotated axolotl genes (first quartile 60,981 bp, median 180,521 bp, third quartile 455,994 bp, mean 350,475 bp) and obtained the top 25% genes (size less than the first quartile) as well as the bottom 25% genes (size larger than the third quartile). The genes from the two categories were analyzed for functional enrichments using the PANTHER Overrepresentation Test 69 (release 20170413) and the GeneOntology 70

Assessment of intergenic distances
In order to investigate the expansion of intergenic spaces in the axolotl genome, we compared the distance of non-exonic elements that are conserved throughout tetrapods. We defined a set of 31,058 Conserved Non-Coding Elements (CNEs) in the mouse genome that are conserved in many mammals, including human, and frog. We mapped the mouse mm10 DNA sequences of each CNE to the axolotl genome with lastz (--gappedthresh=3000 --hspthresh=2500 --seed=match6). We then removed CNEs that mapped to more than one axolotl contig, resulting in a set of 13,962 CNEs. Next, we used liftOver with -minMatch=0.1 to map the CNE coordinates to Xenopus xenTro7 and human hg38. We filtered the full set of CNEs to find those neighboring elements that do not overlap with a gene or don't have a gene in between them. For this, we used the refSeq tracks from UCSC genome browser 71 for each of the assemblies, and selected those CNE pairs that fulfill the no-gene criteria in all three species (human, mouse, and frog), resulting in 4,722 pairs of CNEs. From this CNE pair set, we took those pairs that align to the axolotl genome on the same axolotl scaffold. This resulted in 1,272 CNE pairs that we used to calculate the distance (see Supplementary Table  22 for the list of CNEs and statistics).

Pax family
In order to confirm the apparent deletion of the Pax3 gene in the axolotl, we investigated the presence of CNEs in the Pax3 locus and for comparison in the Pax7 locus. From a total of 31,058 CNEs that align between the mouse and Xenopus genomes (assemblies mm10 and xenTro7), we selected those that are located around both Pax3 and Pax7 genes. For Pax3, we included 54 CNEs located within the mouse chr1 genomic locus chr1:77,000,000-81,000,000. For Pax7, we included 42 CNEs located within the chr4 locus chr4:137,900,000-141,800,000. Next, we aligned these elements to the axolotl genome with LASTZ, using the sensitive parameters --gappedthresh=3000 --hspthresh=2500 --seed=match6. We filtered the results for minimum alignment coverage of 40% and minimum identity of 60%.

HoxA locus analysis
HoxA genes from human, mouse, chicken, and frog or alligator were aligned using MAFFT 64 . Then all four protein sequences were aligned by using TBLASTN against the axolotl transcriptome assembly to find axolotl orthologs. An axolotl homolog was only considered as ortholog if it was the best hit for all four query sequences. Orthology was additionally verified by BLAST searches using the axolotl contig sequences against the non-redundant protein database (NR). Finally, the axolotl protein sequences were mapped to the genome assembly using SPALN 72 and the alignments were corrected manually in order to resolve the exact coordinates of the coding sequences (CDS) and calculate the intron lengths (Supplementary Table 17).

Hh, Wnt, Pax family analysis
For the identification of homologous proteins in Ambystoma mexicanum, we obtained all the homologues for human WNT, BMP and HH protein families available in the ENSEMBL database (see Supplementary Tables 23 and 24 for the list of gene sequences used in the analysis), and then used BLAST to identify different homologues using the transcriptome assembly as database, all under the default parameters, and selecting sequences that had an evalue not greater than 1e-10 for further analysis. To confirm the identity of the selected sequences, we aligned them against the UniProt/Swiss-Prot database using BLAST.
To construct the phylogenetic trees, sequences from selected species (Homo sapiens, Mus musculus, Anolis carolinensis, Xenopus tropicalis, Danio rerio and Gallus gallus) were extracted from the dataset previously used to identify homologous proteins in Ambystoma  Fig. 2). Multiple sequence alignment was done using the MUSCLE 73 aligner using either CDS or peptide sequences.
Phylogenetic trees were constructed using RAxML 74 , using a GTR substitution model and a Gamma model of rate heterogeneity, with 1000 bootstrap replicates (the seed used for bootstrapping was 12345) for both protein and nucleic acid based trees.

GENERATION AND CHARACTERIZATION OF Pax7 MUTANT AXOLOTLS USING TALEN AND CRISPR GENE EDITING
Animal experiments were performed conforming to European Union animal welfare legislation, with local approval from the Magistrat Wien. Analysis of the animals in this study were not based on randomization or blinding methods.

TALEN design
We designed Pax7 TALENs targeting the first coding exon (exon 1), using the program 'TALEN hit' (Cellectis), with the following criteria: 17 Repeat Variable Diresidues (RVDs) in each arm of the TALEN (Left or right) and 15-20 base pairs of spacer. We cloned the TALEN RVDs into the RCIscript-GoldyTALEN vector (Addgene 1000000024 and 38142), and synthesized TALEN mRNAs using the mMESSAGE mMACHINE T3 Transcription Kit (Invitrogen) as previously described 75,76 . List of TALENs targeting sequences: Pax7-TALEN#1: We injected TALEN mRNAs (125-500 pg) into single-cell-stage axolotl eggs. Initial assessment of gene modification in larvae was performed using the Surveyor assay that uses PCR amplification of the target site, followed by DNA hybrid melting and reannealing and finally by CelI cutting of mismatched sites and gel electrophoresis. To define the sequence changes in animals showing a high rate of modification, we further carried out genotyping PCR on genomic DNA from F0 larvae at 20-30 days post injection/fertilization as previously described 77 . For analysis of axPax7 exon 1 targeting we used primer pair P1 (5'-CCAACTCCTCCCAAGAACTCTG) & P2 (5'-GTCTGGAGCTCGAAACGCTCAAC), which yields a 425 bp product for the axPax7 wild-type allele. PCR products were cloned into pGEMT vector (Promega) followed by sequencing of 10-15 individual clones. The same genotyping procedure was used to evaluate the progeny from intercrossing of F0 individuals to each other or F1 individuals to each other.

Overview of TALEN F0, F1, F2 generations
Two different TALEN pairs (TALEN pair number 1 and number 2) were injected separately and animals were assessed for mutations and phenotype. Based on the Surveyor (CelI) assay of larval tissue we concluded that Pax7-TALEN#2-injected animals harbored a detectable number of gene modifications. This was confirmed by sequencing PCR products ( Supplementary Fig. 15). These animals did not show an obvious phenotype presumably due to mosaicism in gene modification. Animals harboring gene modifications were raised to adulthood.
We then interbred two F0 Pax7-TALEN#2-high animals and asked whether any animals homozygous for Pax7 frameshift mutations would be produced. From such a mating, a minority (approximately 15%) of F1 animals displayed a developmental phenotype as described in Supplement section 6.1.3. Genotyping indicated that normal animals harbored either no frameshift alleles or one (heterozygous) axPax7 frameshift allele, while affected individuals harbored two copies of frameshifted axPax7 sequences showing a correspondence between homozygous genotype and phenotype. The homozygous animals from these matings were used to first characterize the phenotype and to establish the phenotyping criteria for subsequent crosses.
In parallel, F0 Pax7-TALEN#2-high animals were bred with normal animals and the F1 progeny genotyped to establish F1 animals that were heterozygous for either the 7 nt or the 20 nt deletion. To obtain homozygous mutants, two Pax7 Δ 7nt/+ F1 animals were interbred, and separately, two Pax7 Δ 20nt/+ F1 animals were interbred (two matings). The intercross of the Pax7 Δ 7nt/+ yielded 232 progeny of which 57 showed the developmental phenotype as described in Supplement section 6.1.3 initially assessed at 3 months of age. An intercross of the Pax7 Δ 20nt/+ yielded 345 progeny of which 83 showed the developmental phenotype. From the Pax7 Δ 20nt/+ mating, genotyping PCR of 30 normal-appearing progeny and 30 progeny showing the developmental phenotype was performed. 100% correspondence between the onset of the developmental phenotype and homozygosity for the 20 nt deletion was observed. Similar results were obtained for the Pax7 Δ 7nt/+ intercross. Loss of protein was analyzed as described in Supplement section 6.3.

Description of developmental phenotype of mutant animals
The Pax7-TALEN#2-high animals intercrossed in the F0 generation and the F1 intercross of Pax7 Δ 7nt/+ or Pax7 Δ 20nt/+ all produced a proportion of progeny showing the same profile of developmental defects as described below. Limb muscle but not Schwann cells, nerve, or bone, were depleted in affected individuals (Figure 4d, Extended Data Fig. 7). The trunk and tail muscle were relatively normal at early stages before 3 months (Extended Data Fig. 3). However, over time the trunk and tail muscle were progressively lost to nearly complete depletion including truncation of the terminal tail myotomes (Extended Data Fig. 5). Correspondingly the body posture became progressively curved (Extended Data Fig. 3). Due to the lack of dorsal muscle the spine was visible in whole mount views (Extended Data Fig.  3 and 4). Immunohistochemical analysis with MHC antibodies on cross-sections displayed a strong reduction of tail muscle, abdominal muscle along with dorsal musculature (Figure 4c, Extended Data Fig. 5 and 6).
In addition to the muscle phenotypes, we observed the lack of xanthophores, and iridophores and reduction of melanophores (Figures 4f and g, Extended Data Fig. 8). The most subtle phenotypes were changes in head bone structure and the open brain phenotype. These were investigated in more detail by Alcian blue/Alizarin red staining for bone and paraffin sectioning followed by Hematoxylin/Eosin staining for brain and spinal cord in selected individuals (Figure 4e, Extended Data Fig. 9).
When we examined the phenotype of 78 animals with confirmed Pax7 Δ 20nt/ Δ 20nt genotype, all of the animals presented limbs that were depleted of muscle. We then took subsets of those animals to examine other phenotypes. Alcian Blue/ Alizarin red staining of 7 animals showed that all were missing the praefrontale bone in the head. All of the 12 animals examined lacked xanthophores and had reduced numbers of melanophores. Similarly all 17 animals examined were lacking iridophores. The open brain phenotype was observed in a portion (11/46) of the animals (see Datafile "SD Fig4.xlsx").

Summary
To further exclude the possibility of non-specific TALEN effects, we used CRISPR-Cas9 gene editing to mutate the axPax7 gene in exon 1 and separately exon 2. Injection of CAS9-gRNAs axPax7-EX1-gRNA#3, axPax7-EX2-gRNA#1, and axPax7-EX2-gRNA#3 individually into axolotl eggs elicited efficient gene mutation in F0 animals. When we surveyed the gene mutations present in somatic tissues of the larvae, some individuals displayed a single detectable sequence deletion, while others displayed multiple different sequence changes (Supplementary Fig. 15). After phenotypic classification of the animals (described below), the location of protein loss was examined in those animals by immunostaining of sections every 200 µm along the body axis ( Supplementary Fig. 16). Animals that had developed a phenotype comparable to that seen in F2 TALEN#2 axPax7 Δ 20nt/ Δ 20nt and axPax7 Δ 7nt/ Δ 7nt homozygous mutant animals showed what appeared to be a complete or a nearly complete loss of protein expression (Supplementary Fig. 16). CRISPRengineered animals that had presented the developmental defects in localized body regions or ameliorated phenotypes correspondingly showed mosaicism of protein loss ( Supplementary  Fig. 16). In summary, editing the axPax7 locus at several positions using two genome engineering methods produced animals with comparable phenotypes and a corresponding reduction of axPAX7 protein expression.
Considering the existence of a potential downstream alternative starting methionine in exon 1 ( Supplementary Fig. 15b), we then sought to engineer mutations in Pax7 exon 2. Among three gRNAs targeting axPax7 exon 2, Pax7-EX2-gRNA#1 and #3 injected animals displayed a high proportion of strong and moderate animals. Among 69 animals, 55 showed a strong or a moderate phenotype.
To characterize the gene modifications we carried out genotyping PCR on genomic DNA from F0 larva at 20-30 days post injection as previously described 77 . For analysis of exon 1 targeting, we used the primer pair P1 (5'-CCAACTCCTCCCAAGAACTCTG) and P2 (5'-GTCTGGAGCTCGAAACGCTCAAC), which yielded a 425 bp product for the Pax7 wildtype allele. For analysis of the exon 2 targeting, we used the primer pair P3 (5'-GTGAAGAGCTCACTGTTTATCTTG) and P4 (5'-GGTGTTCATCTGATAACACCTTGT), which yielded a 515 bp product for the Pax7 wild-type allele (Supplementary Fig. 15).
The PCR results indicated that the various individuals harbored a diversity of gene modifications (Supplementary Fig. 15f). Interestingly, samples taken from certain F0 individuals deriving from injection of Pax7-EX2-gRNA#1 displayed either two detectable mutant alleles (Supplementary Fig. 15f, an#1) or a single detectable mutant allele ( Supplementary Fig. 15f, an#2) raising the possibility that the gene modification could have occurred at the one-cell stage. A single allele would have relied on template-dependent repair of one allele with the other or that the other allele had a large deletion that precluded successful PCR. These animals also showed essentially no or few detectable PAX7 + cells in sections taken along the whole body axis, screening sections every 200 µm ( Supplementary  Fig. 16). On the phenotypic level, these animals showed a strong phenotype similar to Pax7 Δ 20nt/ Δ 20nt and Pax7 Δ 7nt/ Δ 7nt homozygous mutants including loss of limb and body muscle, lack of xanthophores and iridophores and the open brain phenotype (Figure 4). These results suggested that such individuals could have had fully or close to fully penetrant mutation of the Pax7 locus for example, that might have occurred at the one cell stage.

Western Blot and immunohistological analysis of mutant tissues
We analyzed protein content and distribution in the F2 Pax7-TALEN#2 and the F0 Pax7-EX2-gRNA#1 axolotls. We first describe the analysis of the F2 Pax7-TALEN#2 animals. Immunofluorescence staining of cryosections from F2 Pax7-TALEN#2 Pax7 Δ 20nt/ Δ 20nt using a monoclonal antibody to PAX7 showed dramatic loss of positive cells associated with muscle masses but maintenance of the signal in dorsal spinal cord (Supplementary Fig. 17a). Western blotting of spinal cord tissue from a 30-day old mutant and a control animal showed the absence of protein at 55 kD, corresponding to protein produced from the start site associated with full-length protein. A smaller isoform of approximately 48 kD was still present in these animals and is presumably the isoform produced from the alternative in-frame methionine codon (ATG) in exon 2 ( Supplementary Fig. 15b and 17b). Since these TALEN engineered animals have an indistinguishable phenotype from the CRISPR exon 2 mediated frameshift animals that lack detectable protein (see below), we conclude that the loss of 55 kD protein leads to a strong loss/reduction-of-function of the Pax7 gene.
We then characterized protein distribution in the F0 CRISPR-engineered Pax7-EX2-gRNA#1 animals that showed a severe phenotype by making sections along the entire body for immunostaining. F0 Pax7-EX2-gRNA#1, animal #1, whose genotyping detected two types of frameshift deletions and no detection of wild-type sequence (An#1 in Supplementary Fig.  15f), showed essentially or nearly complete absence of PAX7 immunofluorescence signal in both, the muscle and dorsal spinal cord (Supplementary Fig. 16). Sections from trunk to tail were examined every 200 micrometers, and the lack of signal was uniform along the entire body. The same analysis with similar results was performed for An#2 shown in Supplementary Figure 15f. Notably these penetrant CRISPR-mediated exon 2 mutants showed little or no expression of PAX7 in spinal cord or muscle, yet had an indistinguishable phenotype to the F2 TALEN exon 1 mutants. This observation lead us to conclude that the Δ20nt and the Δ7nt alleles that lead to loss of the 55 kD form of the PAX7 protein are loss/reduction-of-function alleles.
Genotyping of F0 Pax7-EX2-CRISPR#1 mutants that exhibited a moderate phenotype showed the presence of several types of mutations as well as wild-type sequences ( Supplementary Fig. 15f, an#3 and #4). Immunofluorescence analysis from these animals displayed some PAX7-positive cells in subregions of spinal cord and muscle on cross sections ( Supplementary Fig. 16 second and third row). We conclude that the severity of the phenotypes in F0 Pax7-CRISPR animals is correlated with frequency of observed frame-shift mutations at the targeted Pax7 locus.

Regeneration transcriptome analysis
We aligned each set of tissue-enriched sequences from Bryant et al. (2017) 80 to our transcriptome, which we additionally annotated by BLASTing against the NCBI NR database using GIs specific for defined vertebrate classes: (Lower) Vertebrates, Teleostomi, Amphibia, Sauropsida, Amniotes, and Mammals. We used an e-value cut-off of 1e-20 to initially categorize transcripts according to the presence of orthologs in other vertebrate classes. Given that we expect a higher statistical significance when comparing axolotl sequences to amphibian versus amniote sequences, we demanded a differential threshold of at least ten logs in e-value when assigning a "lack" of ortholog in Sauropsida, Amniotes and Mammals. For one sequence (AMEXTC_0340000033959), no Sauropsidian, Amniote or Mammalian ortholog could be assigned. For the others back-BLASTing of the best the mammalian sequence back to the axolotl sequence database identified a different axolotl sequence as the best hit. The exception was AMEXTC_030000014162. However in this case the e-value, when aligning to amphibian sequences was 0, while for mammals it was 1e-12.
The various tissue-enriched datasets showed a similar proportion of annotated versus nonannotated sequences (Supplementary Table 20). As we wanted to limit our analysis to likely protein sequences, we did not consider the completely non-annotated sequences further but rather focused on the set of sequences with a significant BLAST hit at least up to amphibians (Supplementary Table 21).
The limb-enriched datasets contained the highest proportion of transcripts limited to amphibian orthologs (Vertebrates>Teleostomi>Amphibia) that we call "species-restricted" in the following text (Supplementary Table 20). The converse set of limb-depleted transcripts did not show this rank ordering. The small set of blastema-enriched sequences did not show an unusually high proportion of lower vertebrate-specific genes but we noted that the blastema tissue collection in Bryant et al. (2017) 80 had focused on a subset of blastema cells excluding the epidermis, which is an essential tissue for regeneration. We therefore examined the relative expression levels of the "lower"-vertebrate specific, limb-enriched transcripts in our own, whole-blastema RNA-Seq data compared to internal organs such as the heart as a reference to the Bryant et al. (2017) data, and to a non-limb, "Body" sample that was comparable in tissue composition to the mature limb sample in the Bryant et al. (2017) data. Our RNA-Seq data further allowed us to examine the expression level of these genes in the limb bud. Finally, we examined the gene expression time-course data from Stewart et al. (2013) 81 and Knapp et al. (2013) 82 to select only those sequences that were up-regulated during regeneration compared to mature tissue. The heat-map in Supplementary Figure 18a shows the relative expression of the transcripts in the different tissues (Supplementary Table  25) and in situ hybridization as performed in Knapp et al. (2013) 82 confirmed expression of AMEXTC_0340000000645 in the wound epidermis of the limb blastema ( Supplementary  Fig. 18b).

Annotation of limb regeneration microRNAs
To perform the identification of known and novel microRNAs, we used the data reported in King and Yin (2016) 83 first removing low quality sequences and sequencing adapters using Trimmomatic v0.33 (ILLUMINACLIP:Adapters.fasta:1:20:6 SLIDINGWINDOW:4:15 MINLEN:15). A peak of reads was observed at 22 nucleotides, consistent with microRNAs being the most abundant small RNA molecule type present in all the samples ( Supplementary  Fig. 19).
To search for novel and conserved microRNAs and their respective pre-miRNAs, we mapped all the sequences to the genome using Bowtie2 (v2.2.9) using the parameters -N 0 -L25 -a and retained all the alignments where no mismatch was found. After this, using the software ShortStack v3.8.1 under the parameters --mmap f --ranmax none --dicermin 18 --dicermax 25 --foldsize 400 --pad 5000 --mincov 5, we identified sequences that had the canonical hairpin structure. From all samples we identified 92 qualifying pre-miRNA sequences, from which 47 were at least 85% identical with a length of alignment greater or equal to 18 nt to a microRNA reported in miRbase, whereas 45 were identified as potentially novel (Supplementary Table 26, yellow rows denote potential novel microRNAs). Similar to the previous scenario, the alignments break at the sides of a region that is not used in any alignment (dead sequence). Here the sequencer either produced random sequence or the noise level of the sequence is so high that a meaningful alignment is not possible. Again, we look at the length distribution of the gap from the perspective of the B reads and replace the dead sequence in A with a part from one of the B reads.       Supplementary Figure 16 | PAX7 protein localization in F0 Pax7-CRISPR animals with "strong" and "moderate" phenotypes. Immunofluorescence signal from anti-PAX7 (green), MHC (red) and DAPI (blue) staining. Top row: Representative sections from animal #1 with the genotype shown in Supplementary Figure 15f, that displayed a strong phenotype. Little to no signal was observed throughout trunk or tail sections. White arrows point to major blood vessel that shows autofluorescence in red (insets) and green channels. Arrowhead (Spinal Cord) marks background signal in Mauthner neuron. Arrowhead (Trunk (right, R)) marks blood vessel with autofluorescence in red and green channels and cytoplasmic staining. Second and third rows: Representative sections from animals #3 and #4 from the Supplementary Figure 15f. These animals showed mosaicism in gene modification, and a moderate phenotype. Immunolocalization shows the presence of some PAX7 + cells in the dorsal spinal cord (an#3) and right trunk and tail muscle.     ABL1  ETV1  HOXA11  MAFK  NR4A3  TAX1BP1 ZNF644  ARNT  ETV4  HOXA2  MAX  NSD2  TBR1  ARNT2  EVX1  HOXB9  MED13  PAX5  TBX20  ATF2  EYA1  HOXC11 MEF2A  PAX6  TBX3  ATF3  EYA4  HOXC12 MEF2D  PCDH1  TBX4  BACH2  FEZF2  HOXD10 MEIS1  PIAS1  TBX5  BARX1  FGF10  HSF2  MITF  PITX2  TCEA3  BCL6  FIBCD1  IKZF1  MSC  PKNOX2 TCF12  BHLHE40 FOSB  IRX1  MTA2  POU2F3  TCF15  BNC1  FOXA1  IRX2  MXI1  PPARA  TCF21  BTF3  FOXI1  ISL2  MYB  PRDM1  TCF3  CELF4  FOXJ1  JAG1  MYF5  PRDM8  TFAP2C  CNOT7  FOXN3  JAZF1  NELFA  PROX1  TGIF1  CREB1  GABPA  JUN  NEUROD4 PRRX1  THRA  CTNNB1 GATA4  JUNB  NEUROD6 RFX4  TOX  DDIT3 GATAD2A