AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from Xanthomonas genomic sequences

Transcription activator-like effectors (TALEs) are virulence factors, produced by the bacterial plant pathogen Xanthomonas, that function as gene activators inside plant cells. Although the contribution of individual TALEs to infectivity has been shown, the specific roles of most TALEs and the overall TALE diversity in Xanthomonas spp. are not known. TALEs possess a highly repetitive DNA-binding domain, which is notoriously difficult to sequence. Here, we describe an improved method for characterizing TALE genes using PacBio sequencing. We present ‘AnnoTALE’, a suite of applications for the analysis and annotation of TALE genes from Xanthomonas genomes, and for grouping similar TALEs into classes. Based on these classes, we propose a unified nomenclature for Xanthomonas TALEs that reveals similarities pointing to related functionalities. This new classification enables us to compare related TALEs and to identify base substitutions responsible for the evolution of TALE specificities.

Supplementary Figure S1 | Coverage profile of Xoo PXO83 chromosome.
We map PacBio reads against the final Xoo PXO83 chromosome sequence in a resequencing experiment using the PacBio SMRT Portal software. We plot the PacBio read coverage against the genomic positions, ignoring reads with a mapping quality of zero. Except for positions near the chromosome borders, we find a stable coverage varying around the mean resequencing coverage of approximately 182.
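As an illustration of how such a coverage profile can be derived from mapped reads, the following is a minimal sketch and not the SMRT Portal implementation. The read representation (0-based start, exclusive end, mapping quality) and the names `coverage_profile` and `mean_coverage` are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical read record: (start, end, mapping_quality),
# with a 0-based start and an exclusive end coordinate.
Read = Tuple[int, int, int]


def coverage_profile(reads: List[Read], genome_length: int,
                     min_mapq: int = 1) -> List[int]:
    """Per-position read depth, skipping reads with mapq < min_mapq."""
    # Difference array: +1 where a read starts, -1 just past its end.
    diff = [0] * (genome_length + 1)
    for start, end, mapq in reads:
        if mapq < min_mapq:
            continue  # e.g. drop reads with a mapping quality of zero
        start = max(start, 0)
        end = min(end, genome_length)
        if start >= end:
            continue
        diff[start] += 1
        diff[end] -= 1
    # A running sum of the difference array gives the depth at each position.
    coverage = []
    depth = 0
    for delta in diff[:genome_length]:
        depth += delta
        coverage.append(depth)
    return coverage


def mean_coverage(coverage: List[int]) -> float:
    """Arithmetic mean of the per-position depths."""
    return sum(coverage) / len(coverage) if coverage else 0.0
```

The difference-array approach keeps the computation linear in the number of reads plus the genome length, which matters for a chromosome of several megabases at a coverage above 100.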

Supplementary Figure S2 | Coverage for different sub-samples of PacBio reads.
We map PacBio reads against the final Xoo PXO83 chromosome sequence in a resequencing experiment using the PacBio SMRT Portal software. We then sub-sample fractions of 0.005 to 1.0 of the mapped PacBio reads and compute the coverage values for all genomic positions using only the sub-sampled reads. Left: We plot the mean coverage against the fraction of sub-sampled reads. As expected, the mean coverage scales approximately linearly with the fraction of sub-sampled reads. For instance, we obtain a mean coverage of 18.3 for a sub-sample containing 10% of the original PacBio reads compared to the original mean coverage of approximately 182.7 using all reads. Right: We also create boxplots of the corresponding coverage values for each of the sub-sampled sets of PacBio reads. As expected, the median coverage scales approximately linearly with the fraction of sub-sampled reads. For instance, we obtain a median coverage of 18 for a sub-sample containing 10% of the original PacBio reads compared to the original median coverage of 182 using all reads.
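The linear scaling of mean coverage with the sub-sampled fraction follows from the fact that mean coverage equals the total number of aligned bases divided by the genome length. The sketch below illustrates this under stated assumptions: reads are hypothetical (start, end) pairs with an exclusive end, a fixed number of reads is drawn without replacement, and the function names are chosen for this example only.

```python
import random
from typing import List, Tuple

# Hypothetical mapped read: (start, end), 0-based start, exclusive end.
Read = Tuple[int, int]


def subsample_reads(reads: List[Read], fraction: float,
                    seed: int = 0) -> List[Read]:
    """Draw round(fraction * n) reads without replacement."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed for reproducibility
    k = round(fraction * len(reads))
    return rng.sample(reads, k)


def mean_coverage(reads: List[Read], genome_length: int) -> float:
    """Mean depth = total aligned bases / genome length."""
    total_bases = sum(end - start for start, end in reads)
    return total_bases / genome_length
```

With reads of similar length, keeping 10% of the reads keeps roughly 10% of the aligned bases, so the mean coverage drops by the same factor, in line with the values of 18.3 and 182.7 reported above.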

Supplementary Figure S3 | Minimum coverage of genomic regions for different sub-samples of PacBio reads.
We map PacBio reads against the final Xoo PXO83 chromosome sequence in a resequencing experiment using the PacBio SMRT Portal software. We then sub-sample fractions of 0.005 to 1.0 of the mapped PacBio reads and compute the coverage values for all genomic positions using only the sub-sampled reads. We then compute, for each of the sub-sampled sets of PacBio reads, the fraction of the genome that is covered by at least t reads. We find that for sub-samples containing at least 5% of the PacBio reads, almost all genomic positions (99.998%) are already covered by at least one read. For sub-samples containing at least 20% of the reads, the genome is covered by at least 5 reads. Using all PacBio reads, almost the complete genome (99.89%) is covered by at least 100 PacBio reads.
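The fraction of the genome covered by at least t reads can be computed directly from a per-position coverage vector. The following is a minimal sketch; the function names and the representation of coverage as a plain list of integers are assumptions for illustration.

```python
from typing import Dict, Iterable, List


def fraction_covered(coverage: List[int], t: int) -> float:
    """Fraction of positions whose depth is at least t."""
    if not coverage:
        return 0.0
    return sum(1 for depth in coverage if depth >= t) / len(coverage)


def coverage_table(coverage: List[int],
                   thresholds: Iterable[int]) -> Dict[int, float]:
    """Map each threshold t to the fraction of positions covered >= t."""
    return {t: fraction_covered(coverage, t) for t in thresholds}
```

Evaluating `coverage_table` once per sub-sampled read set, for thresholds such as 1, 5, and 100, gives the kind of values summarized in this figure.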

Supplementary Figure S4 | Concordance of base calls for different sub-samples of PacBio reads.
We map PacBio reads against the final Xoo PXO83 chromosome sequence in a resequencing experiment using the PacBio SMRT Portal software. We then sub-sample fractions of 0.005 to 1.0 of the mapped PacBio reads and compute the coverage values for all genomic positions using only the sub-sampled reads. For each genomic position, we then call the corresponding base, an insertion, or a deletion, depending on the most frequent event in the covering PacBio reads. We then compare this call to the base in the assembled Xoo PXO83 chromosome. For lower coverages due to sub-sampling, we find substantial deviations from the final Xoo PXO83 chromosome. For all sub-samples containing at least 10% of the PacBio reads (concordance of 99.995%), however, we find an almost perfect concordance between these calls and the assembled chromosome, reaching 100% using 30% of the reads. This indicates that a local coverage of at least 20 should be sufficient to make high-confidence base calls, although this coverage might not have been sufficient to yield one closed contig in a de novo assembly. However, the complete set of PacBio reads corresponds to a coverage of at least 80, except for at most 20 positions at each of the chromosome borders.
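The majority-vote call and the concordance with the assembly can be sketched as follows. This is an illustrative simplification: the pileup is assumed to be a list of `Counter` objects (one per reference position) whose keys are bases, "-" for a deletion, or "+" for an insertion. Ties are resolved by `Counter.most_common`, and uncovered positions count as discordant. These conventions are assumptions of this example, not taken from the analysis itself.

```python
from collections import Counter
from typing import List, Optional


def call_position(events: Counter) -> Optional[str]:
    """Most frequent event at one position, or None if uncovered."""
    if not events:
        return None
    return events.most_common(1)[0][0]


def concordance(pileup: List[Counter], reference: str) -> float:
    """Fraction of positions where the majority call equals the reference."""
    if len(pileup) != len(reference):
        raise ValueError("pileup and reference must have equal length")
    if not reference:
        return 0.0
    matches = sum(
        1 for events, ref_base in zip(pileup, reference)
        if call_position(events) == ref_base
    )
    return matches / len(reference)
```

Running `concordance` on pileups built from each sub-sampled read set would give one concordance value per sampling fraction.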

Supplementary Figure S5 | Ambiguity of base calls.
We map PacBio reads against the final Xoo PXO83 chromosome sequence in a resequencing experiment using the PacBio SMRT Portal software. For each genomic position, we then determine the number of reads supporting the base called in the final assembly of the Xoo PXO83 chromosome and compare it to the number of reads supporting the best alternative, i.e., the best alternative base, or an insertion or a deletion at that genomic position. We create a logarithmic histogram (left) and boxplot (right) of the ratios of the number of reads supporting the called base (N_base) to the number of reads supporting the best alternative (N_alternative). We find that for the large majority (99.9998%) of positions, the called base is supported by at least twice as many reads as the best alternative, corresponding to a log2 ratio of 1 (red line). For more than two thirds of the positions, the called base is supported by at least 10-fold more reads than the best alternative (green line). This is an additional indication that the given PacBio read coverage is sufficient to obtain a high-confidence genome sequence.
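The log2 support ratio can be written out as a short sketch. The event counts per position are assumed to be given as a `Counter` keyed by base or indel symbol. The handling of positions without any alternative (returned as infinity) is an assumption of this example, since the caption does not state how such positions are treated.

```python
import math
from collections import Counter
from typing import List


def support_log2_ratio(events: Counter, called: str) -> float:
    """log2(N_base / N_alternative) for one position."""
    n_base = events.get(called, 0)
    # Best alternative: the most frequent event other than the called one.
    n_alt = max((n for e, n in events.items() if e != called), default=0)
    if n_alt == 0:
        # No competing event: unbounded ratio (NaN if there are no reads).
        return math.inf if n_base > 0 else math.nan
    if n_base == 0:
        return -math.inf
    return math.log2(n_base / n_alt)


def fraction_at_least(ratios: List[float], threshold: float) -> float:
    """Fraction of covered positions whose log2 ratio is >= threshold."""
    valid = [r for r in ratios if not math.isnan(r)]
    if not valid:
        return 0.0
    return sum(1 for r in valid if r >= threshold) / len(valid)
```

A threshold of 1 corresponds to twofold support (red line), and a threshold of `math.log2(10)`, about 3.32, corresponds to tenfold support (green line).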

Supplementary Figure S6 | Coverage of TALE repeats.
We map PacBio reads against the final Xoo PXO83 chromosome sequence in a resequencing experiment using the PacBio SMRT Portal software. We then record the mean coverage values for the genomic locations of each full TALE repeat (left) and each last half repeat (right) of all TALEs predicted on the Xoo PXO83 chromosome. For full and last half TALE repeats, we find a median coverage of approximately 134 and 135, respectively, and a minimum coverage of approximately 93 and 102, respectively. Hence, all TALE repeats have sufficient coverage to obtain high-confidence base calls (see above).

Supplementary Figure S7
We map PacBio reads against the final Xoo PXO83 chromosome sequence in a resequencing experiment using the PacBio SMRT Portal software. We then record the coverage profile for the genomic locations of each TALE repeat of all TALEs predicted on the Xoo PXO83 chromosome. For each position in the standard and last half TALE repeats, we create a boxplot of the corresponding coverage values across all repeats (panels A and B). We highlight codons 12 (red) and 13 (green), which code for RVDs. In N*, H*, and S* repeats, the RVD comprises only the amino acid encoded by codon 12, and positions 100 to 102 (positions 58 to 60 for half repeats) are missing. For each position in the aberrant TALE repeats (panel C), we plot the corresponding coverage profiles. We find a largely uniform coverage across all repeat positions.

Supplementary Figure S8 | Coverage of TALEs is rather uniform.
We map PacBio reads against the final Xoo PXO83 chromosome sequence in a resequencing experiment using the PacBio SMRT Portal software. We then record the average coverage for each TALE repeat and plot the mean coverage values of the individual repeats of each TALE in chromosomal order. We find that the coverage differs between TALEs due to general fluctuations of coverage along the chromosome. However, for each individual TALE, the mean coverage values of its repeats are similar, and larger fluctuations (e.g., for TalAR3) are not limited to individual repeats but follow general trends. These findings indicate that reads of two different repeats are not erroneously mapped to the same repeat, which would result in an approximate doubling of coverage. They also indicate that reads belonging to a single repeat are not erroneously divided between two repeats, which would result in an abrupt, substantial drop in coverage.

Supplementary Figure S9 | Ambiguity of base calls in TALE repeats.
We map PacBio reads against the final Xoo PXO83 chromosome sequence in a resequencing experiment using the PacBio SMRT Portal software. For each position of each full, non-aberrant (A) and last half (B) TALE repeat of all TALEs predicted on the Xoo PXO83 chromosome, we then determine the number of reads supporting the base called in the final assembly of the Xoo PXO83 chromosome and compare it to the number of reads supporting the best alternative, i.e., the best alternative base, or an insertion or a deletion at that genomic position. We create a logarithmic boxplot of the ratios of the number of reads supporting the called base (N_base) to the number of reads supporting the best alternative (N_alternative) for each repeat position. Since TALE repeats are highly conserved at the DNA level and mainly differ in the codon pair coding for the RVD, we would expect a different ratio for these codon pairs (highlighted in red and green) in the case of erroneous mappings. However, we do not observe a deviating pattern for the RVD-coding positions, which indicates that the base calls in those codon pairs are reliable.

Supplementary Figure S10 | PacBio reads spanning TALEs.
We map PacBio reads against the final Xoo PXO83 chromosome sequence in a resequencing experiment using the PacBio SMRT Portal software. For a correct assembly of TALE genes, it is important that a sufficient number of PacBio reads spans the complete TALE including some upstream and downstream genomic sequence, because of the repetitive TALE DNA sequence and the high conservation of N-terminal and C-terminal regions. For each of the PXO83 TALEs, we count the number of PacBio reads that span the TALE and additionally at least 100 bp upstream and 100 bp downstream of the TALE sequence. We find that all TALEs are spanned in this manner by at least 10 PacBio reads, which should be sufficient to place shorter or partially overlapping PacBio reads on the TALEs and to correctly place TALEs in the complete chromosome sequence.
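The spanning criterion can be expressed compactly. In the sketch below, reads and TALE loci are assumed to be given as 0-based coordinates with exclusive ends, and the function name is hypothetical. A read counts as spanning if it extends at least `flank` bases beyond both ends of the TALE.

```python
from typing import List, Tuple

# Hypothetical mapped read: (start, end), 0-based start, exclusive end.
Read = Tuple[int, int]


def count_spanning_reads(reads: List[Read], tale_start: int,
                         tale_end: int, flank: int = 100) -> int:
    """Count reads covering the TALE plus `flank` bp on each side."""
    left = tale_start - flank   # read must start at or before this position
    right = tale_end + flank    # read must end at or after this position
    return sum(1 for start, end in reads if start <= left and end >= right)
```

For example, for a TALE at positions 1000 to 4000, a read mapped to 900 to 4100 is counted, whereas a read mapped to 950 to 4500 is not because it includes only 50 bp of upstream sequence.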

Supplementary Figure S11 | Graphical representation of selected TALE classes.
Different TALE classes with aligned RVD sequences are displayed. Identical amino acids (aa) are indicated by black lines. Differing aa are indicated by two points between the aa. Red lines between RVDs indicate synonymous substitutions in RVD codons. Aberrant short repeats (class AI, NG repeat with 28 aa) and aberrant long repeats (class AQ, NI repeat with 42 aa) are shown in green and red, respectively.

Supplementary Sequences | Full-length DNA sequences of Xoo PXO83 TALEs.