Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon

A significant portion of genes in vertebrate genomes belongs to multigene families, with each family containing several gene copies whose presence/absence, as well as isoform structure, can be highly variable across individuals. Existing de novo techniques for assaying the sequences of such highly-similar gene families fall short of reconstructing end-to-end transcripts with nucleotide-level precision or assigning alternatively spliced transcripts to their respective gene copies. We present IsoCon, a high-precision method using long PacBio Iso-Seq reads to tackle this challenge. We apply IsoCon to nine Y chromosome ampliconic gene families and show that it outperforms existing methods on both experimental and simulated data. IsoCon has allowed us to detect an unprecedented number of novel isoforms and has opened the door for unraveling the structure of many multigene families and gaining a deeper understanding of genome evolution and human diseases.


Supplementary Note 1: IsoCon method details
In this note we describe additional details of the method.
Combining pairwise alignments into a multi-alignment

In this section, we describe our heuristic algorithm to construct a multi-alignment matrix A from a set of pairwise alignments between a sequence c and the sequences x ∈ X. Each entry in A corresponds to either a nucleotide or the gap character "-". Our approach is heuristic and may produce a multi-alignment that is not optimal according to a global scoring scheme. However, it is designed to be fast and accurate on similar sequences. It also ensures that identical insertions relative to c are aligned in the same columns. This is crucial when using the alignment support of strings for both correction and statistical testing. We note, however, that our strategy may not generalize well to alignments of more distant or repetitive sequences. In such cases, a more accurate method such as progressive multiple alignment may work better.
First, for every character of c, we construct the corresponding column of A. This is done in a straightforward manner, where the row for each sequence x contains the character of x that aligns to that character of c.
Second, we construct the columns of A that will contain a gap character in c, i.e., those columns where at least one sequence in X has an insertion with respect to c. Let x(j) be the substring of x that is inserted between positions j and j + 1 of c. Here, x(0) and x(|c|) denote any string inserted before the start or after the end of c, respectively. We let v(j) be the longest insertion observed at position j over all x ∈ X. If there is more than one longest insertion, we choose the lexicographically smallest string to avoid stochasticity.
For a given insertion site j, we pairwise align each x(j) to v(j), using the unit cost model but not allowing internal deletions in v(j); that is, the only deletions allowed in v(j) are before its first character or after its last character. To improve performance, we cache the computed pairwise alignments to avoid redundant alignment computations. From these pairwise alignments, we construct new columns in A between positions j and j + 1 of c. After we do this for every insertion site, our multi-alignment matrix is complete.
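The placement of an inserted substring against the longest insertion at the same site can be sketched as follows. This is a deliberate simplification of the scheme above (it also disallows internal gaps in the shorter string, so the alignment reduces to choosing the best offset); the function name is ours, not IsoCon's.

```python
def pad_insertion(s, v):
    """Place insertion s against the longest insertion v at the same site.

    Simplified sketch: gaps are allowed only before or after s, so we
    slide s over v and keep the offset with the fewest mismatches.
    Ties are broken by the smallest offset to keep the result
    deterministic. Returns s padded with '-' to the length of v.
    """
    assert len(s) <= len(v)
    best_offset, best_mism = 0, len(s) + 1
    for k in range(len(v) - len(s) + 1):
        mism = sum(a != b for a, b in zip(s, v[k:k + len(s)]))
        if mism < best_mism:
            best_offset, best_mism = k, mism
    return "-" * best_offset + s + "-" * (len(v) - len(s) - best_offset)
```

Every padded insertion then has the same length as v(j), so all of them can be written into the same block of new columns of the matrix.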
Estimating the probability of a sequencing error

Recall that we are given a set of reads X, a candidate to be tested t, and a reference candidate t'. We align all the reads and t to t'. We then build a multi-alignment matrix A of the reads and t. We denote by ε(i, j) the probability that the character at position j of read i in the matrix, A(i, j), is due to a sequencing error. In this section, we compute ε(i, j) by multiplying the probability of an erroneous base call by the probability that the erroneous base call is exactly the character at position (i, j). We derive these as a function of the Phred quality scores and the empirical distribution of the different types of errors (the error profile) in the alignment matrix.
Each CCS read comes with a Phred base-quality score for each base. Let q(i, j) be the Phred score of read i at position j in the multi-alignment matrix. If A(i, j) is a gap character, then the quality of the nucleotide immediately to the right of the gap is used. By definition, a Phred quality score of q corresponds to a probability of an erroneous base call of 10^(-q/10) 1. The CCS caller in PacBio produces quality values in the range 3 ≤ q ≤ 93; however, we have generally found these to be overconfident estimates.
Therefore, we remap them onto the range [3, T], with T = 43 (default), by capping each quality value at T. With this adjustment, the probability of an erroneous base call is 10^(-min(q(i, j), T)/10), which in practice means that no base call is allowed a Phred quality score larger than T.
Next, we need to compute the probability π(i, j) that the error results in the character A(i, j), as opposed to some other character. Ideally, one would want the base caller to provide such information, but it is not currently provided. Instead, we use empirical estimates of the relative frequencies of different error types from the read's alignment. While not ideal, this does model the fact that the error profile differs between reads with different numbers of passes and in different homopolymer regions (Supplementary Fig. 2). According to the CCS base call uncertainty protocol 2, the only variant for which we can fully attribute the base-call uncertainty to a given error is when the variant is an indel in the first nucleotide of a homopolymer region. The uncertainty over that position reflects the uncertainty of the homopolymer length. We therefore attribute the whole uncertainty in those cases. For the other variants we are testing, let p_S, p_I, and p_D be the relative frequencies of substitutions, insertions, and deletions in each read in X. For a given variant at position j whose state is a substitution, insertion, or deletion in the candidate to be tested (t) with respect to the reference (t'), we set π(i, j) to p_S, p_I, or p_D, respectively. For a homopolymer of length h, since a deletion can happen in any of the h bases (and similarly, an insertion after any of the h bases), we scale the uncertainty by h in these regions. We then have

ε(i, j) = 10^(-min(q(i, j), T)/10) · π(i, j),

where q(i, j) is the Phred score of read i at position j and T is the quality cap described above. In some cases, Phred quality values are not available, e.g., in simulated data or if the CCS reads have been post-processed prior to IsoCon. In this case we use the same formula as before, but estimate the base-call error probability empirically as the total number of errors (substitutions, deletions, insertions) in the read's alignment to t', divided by the length of the read.
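The computation above can be sketched as follows; the function and parameter names are ours (not IsoCon's API), and the error-type frequencies are passed in rather than estimated from an alignment.

```python
def base_error_prob(q, T=43):
    """Probability of an erroneous base call from a Phred score q,
    capped at T (the remapping described above)."""
    return 10 ** (-min(q, T) / 10)

def error_prob(q, variant_type, freqs, homopolymer_len=1, T=43):
    """Sketch of the per-position error probability: the capped
    base-call error probability times the read's empirical relative
    frequency of the variant's error type, scaled by the homopolymer
    length h (an indel can occur at any of the h bases).

    `freqs` maps 'S'/'I'/'D' to the read's error-type frequencies.
    """
    return base_error_prob(q, T) * freqs[variant_type] * homopolymer_len
```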

Implementation details
We avoid any non-determinism by always having explicit tie-breakers. For example, if the number of reachable vertices is the same for two vertices when partitioning the nearest-neighbor graph in the PartitionStrings() routine (Supplementary Fig. 1), we take the one with the larger number of neighbors. If there is still a tie, we choose the lexicographically smallest sequence.
We calculate edit distances using edlib 3 to find the closest neighbors, using the 'global' method. We then align with parasail, before error correction and statistical tests, with parameters gap_open=-2, gap_extend=0, match=2; the mismatch score is set to -1, -2, or -4 depending on the ratio of the edit distance to the total alignment length: below 1%, below 5%, or 5% and above, respectively. Note that alignments with a higher edit distance get a more negative mismatch score to favor indels, as indels are more common in PacBio sequencing.
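The mismatch-score selection can be sketched as a small helper (the function name and thresholds-as-fractions encoding are ours):

```python
def mismatch_penalty(edit_distance, alignment_length):
    """Pick the parasail mismatch score from the edit-distance fraction,
    as described above: <1% -> -1, <5% -> -2, otherwise -4."""
    frac = edit_distance / alignment_length
    if frac < 0.01:
        return -1
    if frac < 0.05:
        return -2
    return -4
```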
We do not reassign reads to candidates from scratch in each iteration of the IsoCon algorithm (Supplementary Fig. 1, Algorithm 3, line 6). Instead, we realign only the reads that were assigned to a candidate that was filtered out.
We remove any edge in the nearest-neighbor graph that was formed from an alignment with an internal gap of more than W nucleotides (a parameter to IsoCon, set to 20 by default). Here, internal means that aligned sequence occurs both before and after the gap. The intuition is that a larger internal gap in a CCS read likely stems from an exon difference. We keep edges formed despite gaps at the ends because sequences might be cut at different end positions in the Iso-Seq protocol. IsoCon also has a parameter to ignore small length differences between otherwise identically derived candidate transcripts (set by default to 15 base pairs). This accounts for the small variation in primer cutting sites when the primers are removed. After the CCS read correction phase, all candidates whose sequences are identical up to a start or end offset of at most 15 base pairs are treated as coming from the same transcript. In practice, the candidate with the highest CCS read support after correction is kept (the most common primer cut site), and the candidates that differ from it only by such an offset are removed.
Supplementary Note 2: Generating simulated data

Generating gene families

To simulate a gene family with m member genes, we choose a reference gene as a starting point and download its exons from Ensembl 4. We concatenate the exons in the order they appear on the genome. We use this artificial transcript as a start sequence, i.e., it is the root node of our evolutionary tree. We then branch the node into two children. We let one child be identical to its parent, and in the other child we simulate mutations with mutation rate μ (a parameter). For each nucleotide, we simulate a substitution with probability μ/3, a deletion with probability μ/3, and an insertion with probability μ/3. With probability 1 - μ, we do not alter the nucleotide. We then recursively continue the branching process for each child in a breadth-first manner until we have m leaves. The sequences at the leaves are then taken as the gene family members. For a meaningful evaluation, if two members have identical sequences, we redo the simulation. The result is a set of m distinct gene family members with identical exon structures, where each exon may or may not differ between members.
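The branching simulation can be sketched as follows. This is a simplified illustration (function names are ours; the redo-on-duplicates check described above is omitted for brevity).

```python
import random

def mutate(seq, mu, rng):
    """Per-nucleotide substitution, deletion, or insertion, each with
    probability mu/3; with probability 1 - mu the base is unchanged."""
    alphabet = "ACGT"
    out = []
    for ch in seq:
        r = rng.random()
        if r < mu / 3:                      # substitution
            out.append(rng.choice([c for c in alphabet if c != ch]))
        elif r < 2 * mu / 3:                # deletion
            pass
        elif r < mu:                        # insertion after the base
            out.append(ch)
            out.append(rng.choice(alphabet))
        else:
            out.append(ch)
    return "".join(out)

def simulate_family(root, m, mu, seed=0):
    """Branch breadth-first until m leaves; at each split one child is
    identical to its parent and the other is mutated."""
    rng = random.Random(seed)
    leaves = [root]
    while len(leaves) < m:
        parent = leaves.pop(0)
        leaves.append(parent)               # identical child
        leaves.append(mutate(parent, mu, rng))
    return leaves
```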
Next, we create n isoforms from a single gene by dropping exons as follows. Let e be the number of exons. One isoform is always taken to be the full set of exons. We can generate up to e - 1 additional isoforms by dropping one exon at a time. If n > e, we can generate up to e(e-1)/2 more isoforms by dropping two exons at a time, and so on. We make sure that each isoform we generate is not identical to an isoform generated for another gene in the family; this could happen if all the mutations that discriminate between the two genes lie in the dropped exons. In such cases, the isoform is skipped and the next one is generated. This deterministic approach to dropping exons makes isoforms within and between copies as similar as possible, thereby presenting the greatest challenge to an error-correction algorithm.
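The exon-dropping scheme can be sketched as follows (a simplified version that generates isoforms in the order described; the cross-gene duplicate check is noted in a comment but not implemented here):

```python
from itertools import combinations

def make_isoforms(exons, n):
    """Deterministically generate up to n isoforms from a list of exon
    sequences: the full exon set first, then all ways of dropping one
    exon, then two, and so on. In the full pipeline, an isoform that is
    identical to another gene's isoform would be skipped here."""
    e = len(exons)
    isoforms = []
    for k in range(e):                      # number of exons to drop (never all)
        for dropped in combinations(range(e), k):
            if len(isoforms) == n:
                return isoforms
            iso = "".join(ex for i, ex in enumerate(exons) if i not in dropped)
            isoforms.append(iso)
    return isoforms
```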
Finally, each isoform is given an abundance (i.e., its relative frequency in the pool of transcripts). In the case of equal abundance, each transcript has the same abundance. In the case of unequal abundance, we set the relative abundances of the isoforms to be exponentially increasing as follows. Given a sequence of n isoforms, we first randomly permute them and then assign the i-th isoform an unnormalized abundance of 2^i. We then obtain the abundance frequencies by dividing each unnormalized abundance by their total sum. In each experiment, the number of simulated reads was also varied. We can obtain the isoform depth from the relative abundance of the isoforms. For example, in Figure 1, the isoform abundance ranges from 1/828 to 128/828. With 2,500 reads, the coverage therefore ranges from 3.02x (= 2500 × 1/828) to 387x (= 2500 × 128/828).
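As a small illustration of the unequal-abundance scheme and the depth calculation (assuming the unnormalized abundances are successive powers of two, consistent with the 1/828 to 128/828 range cited above; the random permutation is omitted and the function names are ours):

```python
def exponential_abundances(n):
    """Assign isoform i the unnormalized abundance 2**i, then
    normalize to relative frequencies summing to 1."""
    raw = [2 ** i for i in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

def isoform_depth(total_reads, abundance):
    """Expected read depth of an isoform given its relative abundance."""
    return total_reads * abundance
```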

Generating PacBio CCS reads
Before implementing our own circular consensus (CCS) PacBio read simulator, we explored the possibility of using existing PacBio read simulators such as PBSIM 5, LongISLND 6, and SiLiCO 7. However, none of these options was viable for our purpose of simulating full-length non-chimeric reads. SiLiCO does not simulate the circular consensus (CCS) protocol, which we use. PBSIM and LongISLND both assume genomic (i.e., not transcriptome) data. We tried both PBSIM and LongISLND with different parameters but were unable to obtain reads that would cover our full transcripts. This is because these tools simulate start and end positions within a reference sequence that is assumed to be a genome, whereas in our case the reference sequences are transcripts.
We therefore wrote our own PacBio CCS read simulator. The length of a PacBio polymerase read (i.e., the total length prior to the passes being collapsed by CCS) is simulated from a triangular distribution 8 to roughly mimic the shape of the polymerase read length distribution given in 9. In the triangular distribution, we set the smallest read length to 0 (start of the triangle's base), the mode read length to 10,000 (top of the triangle), and the largest read length to 45,000 (end of the triangle's base). Polymerase lengths are generated from this distribution, and the number of passes of each read through a transcript is determined by this length. We let the number of passes be p = ⌊L/n⌋, where L is the polymerase read length and n is the transcript length. Reads where L < n are discarded, as we only use reads that span the whole transcript. The CCS read is then simulated with the nucleotide accuracy that corresponds to p passes, using the reported PacBio CCS error rates 10. For the case of a single pass, the simulated error rate is 13%, based on the raw accuracy of the PacBio polymerase 9; for more than 18 passes, we fix the accuracy at 0.999 10. When an erroneous nucleotide is simulated, the type of error is drawn according to published probabilities for PacBio reads without CCS 11 12: 68.75% for insertions, 25% for deletions, and 6.25% for substitutions.
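Two pieces of the simulator can be sketched as follows (function names are ours; the per-pass accuracy table is omitted):

```python
import random

def simulate_passes(transcript_len, rng):
    """Number of CCS passes: draw a polymerase read length from the
    triangular distribution described above (min 0, mode 10,000,
    max 45,000) and take the floor of its ratio to the transcript
    length. A result of 0 (L < n) means the read is discarded."""
    L = rng.triangular(0, 45_000, 10_000)   # (low, high, mode)
    return int(L // transcript_len)

def sample_error_type(rng):
    """Error-type mix for PacBio reads without CCS, as used above:
    68.75% insertion, 25% deletion, 6.25% substitution."""
    r = rng.random()
    if r < 0.6875:
        return "insertion"
    if r < 0.6875 + 0.25:
        return "deletion"
    return "substitution"
```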

Computing recall and precision
To measure the accuracy of an algorithm's results on the simulated data, we define a true positive as a predicted sequence that is identical to a true isoform, a false positive as any other predicted sequence, and a false negative as a true isoform that is not present exactly among the predicted sequences. Matches were required to be exact in order to capture nucleotide-level accuracy. We then computed recall and precision based on these definitions.
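These definitions translate directly into a small evaluation function (a sketch; the function name is ours):

```python
def recall_precision(predicted, truth):
    """Exact-match evaluation as described above: a prediction is a
    true positive iff it is identical to a true isoform."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    recall = tp / len(truth)
    precision = tp / len(predicted) if predicted else 0.0
    return recall, precision
```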

Supplementary Note 3: Additional analysis of simulated data
We expect IsoCon's precision and recall to approach 1 at very high read depths. Supplementary Figure 1 shows accuracy at a read depth of 12,500. In all but a few cases (described below), IsoCon successfully captures low-abundance gene copies at edit distance 1 to other copies. This includes transcripts with abundance ratios 4:64 and 1:32 in TSPY, with μ = 0.0001 and 0.001 respectively, and transcripts with abundance ratios 4:64 and 1:32 in the HSFY and DAZ datasets with μ = 0.0001 (Supplementary Figures 1 and 2). There are two cases of false negatives. At mutation rate 0.0001, the lowest-abundance transcript was not captured in HSFY. This transcript has an edit distance of one to a copy to which it has an abundance ratio of 1:4, and an edit distance of two to a copy to which it has an abundance ratio of 1:64 (Supplementary Figure 2). This is a difficult case for the algorithm due to the presence of similar transcripts that are drastically more abundant. For DAZ, the transcript with the second-lowest abundance is not captured in 9 out of the 10 replicate experiments; it has an edit distance of one to a copy to which it has an abundance ratio of 1:2 (Supplementary Figure 2). Further investigation showed that this transcript was not a closest neighbor of any node in IsoCon's error correction step, indicating room for future improvement.
We also evaluated IsoCon's and ICE's precision and recall on datasets with only one gene copy but several isoforms. We investigated this to confirm that there was no bias in separating transcripts at the exon level. We attempted to separate 8 different isoforms of one copy, where the isoforms were sampled with equal and with unequal abundance. Recall and precision are shown in Supplementary Figure 5. As expected, since isoforms have different exon structures, they are more dissimilar than gene copies differing only by mutations, and should therefore be easier to capture than the highly similar gene copies in the previous experiment. Supplementary Figure 5 confirms this for IsoCon. For ICE this also holds in general, except for the DAZ datasets with 125,000 reads (panels A and B), where ICE's recall drops significantly; we were unable to determine the reason for this.
Somewhat counterintuitively, IsoCon's recall is higher for lower mutation rates at the lowest read depths for HSFY and DAZ (Supplementary Figure 5A; 20 reads for HSFY, 20 and 100 reads for DAZ). This is because with low read depth, IsoCon clusters the highly similar copies into the same partition and therefore gains enough coverage to recover one of the copies. In the more divergent cases, this cluster merging (and the increased coverage it brings) does not happen, and both copies are instead lost. P-values of all of IsoCon's predictions for the simulated datasets are shown in Supplementary Figure 14A.
While our investigation of the dependence of recall on read depth in Figure 2 aimed for a somewhat realistic scenario, we also performed a more controlled experiment in order to see the trends more clearly. We simulated pairs of transcripts with varying relative abundance, edit distance, and read depth (Supplementary Figure 12). The trends show that more reads are needed to recover a transcript when it has lower abundance, when the edit distance to the close transcript is smaller, or when the CCS read quality is lower. Supplementary Figure 12 can be used as a guideline for a prediction of the read depth needed to capture a transcript. A conservative estimate of the total number of reads required to capture a specific transcript can be obtained simply by multiplying the number of reads from the figure by the relative abundance in the figure and dividing by the estimated fraction of the target transcript in the sample. Alternatively, given the read depth of an experiment, Supplementary Figure 12 can give the maximum achievable recall for a transcript based on its abundance and the divergence of its gene family. Finally, we note that the controlled nature of the experiment means that in reality, more reads may be required to achieve the same recall.

Supplementary Note 4: Data processing pipeline
We used a snakemake 13 workflow for the bioinformatics pipeline processing the Iso-Seq data. Supplementary Figure 13 shows the workflow as a graph. The raw data were converted to BAM files using bax2bam; then PacBio's consensus caller 2 was used to obtain CCS reads from subreads. We used the "classify" algorithm (part of the Tofu pipeline for processing Iso-Seq reads) to separate the CCS reads into reads having at least one full pass and reads shorter than one pass. We further separated the reads having at least one full pass into batches based on their primers, using a customized script. We used these primer-separated batches of reads for downstream analysis with all tools; that is, ICE, IsoCon, and proovread were each run on each of these batches separately. Both ICE and IsoCon take a set of full-pass CCS reads with base-pair quality values and perform clustering and error correction. ICE also uses the reads with less than one full pass in a post-polishing step, and we supplied ICE with these reads. ICE further classifies its output transcripts as "high quality" if they have >99% base-pair accuracy (the default parameter value). We observed that the number of "low quality" transcripts was very high (70-80% of the number of CCS reads) and would give ICE extremely low precision. We therefore used only the high-quality transcripts for evaluation.
The consensus caller that generated the CCS reads has two modes, "--polish" and "--noPolish". IsoCon and proovread were evaluated on CCS reads generated with the --polish option (the default). When we evaluated the original PacBio reads, we also used CCS reads generated with the --polish option. The ICE documentation, on the other hand, suggests using reads from the --noPolish option. We tried both versions, and our experiments confirmed that the --noPolish option resulted in slightly better metrics in our evaluations of ICE compared with --polish. ICE with --noPolish predicted 475 high-quality transcripts, compared to only 335 for ICE with --polish. ICE with --noPolish had a higher number (9) and percentage (1.9%) of transcripts fully supported by Illumina RNA-seq reads, compared to three transcripts (0.9%) with --polish; it had only a slightly lower median transcript support of 93.5%, compared to 93.7% with --polish; finally, ICE with --noPolish also had two more predictions shared between samples (4 compared to 2). However, ICE with --noPolish had only 8 exact matches to Ensembl, compared to 11 for ICE with --polish. It is therefore ambiguous which input data give the best results with ICE, but we conclude that the difference is minor compared to the differences from other approaches, irrespective of which dataset is used. Based on these findings and on ICE's documentation, we decided to present the results for ICE with --noPolish reads in the paper.
Finally, we note that for simulated data, we did not run the polishing step. It is not applicable, as we did not generate any base call quality values or non-full-length reads (Sup. Note B).
We provide the exact run commands used below. We ran ccs version 2.0.5 (GitHub commit 390a42e), part of the PacBio SMRTlink 4.0 suite. As described above, we also specified the flag --noPolish to generate non-polished reads. We ran version 2.14.0 of proovread, using the --no_sampling option as suggested by the software author (see https://github.com/BioInf-Wuerzburg/proovread/issues/88).

Supplementary Note 5: Analysis of human testes data
Comparing against database: We downloaded coding DNA sequences (CDS) for our 9 ampliconic gene families from the Ensembl database. There were 61 unique sequences in this database. We aligned the predicted transcripts to the Ensembl database. We consider a transcript a perfect match to a CDS if the transcript is a perfect substring of the CDS with at most 100 bp missing at the 3' and 5' ends. We allow the missing ends because the primers in our targeted approach attach within the start and end exons of a transcript, resulting in missing ends (we observed values in the range of ~20-70 bp). If a predicted transcript has more than one perfect match, we choose the match with the smallest total number of clipped bases at the ends. If there are several perfect matches with the same number of clipped bases, we pick the lexicographically smallest accession as the match, in order to assign the prediction to a unique transcript in the database.
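The match criterion can be sketched as follows (a simplified check on plain strings; the function name is ours, and the real pipeline works on alignments rather than exact string search):

```python
def perfect_match(transcript, cds, max_clip=100):
    """Check whether `transcript` occurs as an exact substring of `cds`
    with at most `max_clip` bases of the CDS left unmatched at each
    end. Returns the total number of clipped bases for the best such
    placement, or None if there is no qualifying match."""
    pos = cds.find(transcript)
    while pos != -1:
        left = pos
        right = len(cds) - pos - len(transcript)
        if left <= max_clip and right <= max_clip:
            return left + right
        pos = cds.find(transcript, pos + 1)
    return None
```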

Support from Illumina reads:
We aligned our barcode-separated Illumina reads to the predicted transcripts from IsoCon, ICE, and the original CCS reads. We used BWA-MEM with default mapping parameters. We then used a customized script, available at https://github.com/ksahlin/IsoCon_Eval/blob/master/analysis/nucleotide_level/get_unsupported_positions.py, to count positions on the predicted transcripts that have at least two Illumina reads supporting the base pair. As IsoCon, ICE, the original CCS reads, and the Illumina-corrected CCS reads have different numbers of predicted transcripts for each gene family, the Illumina coverage per transcript differs substantially between the four approaches (especially for TSPY and RBMY). In particular, the coverage per transcript would be lower for the original and Illumina-corrected reads, thus favoring ICE and IsoCon in the evaluation. To account for this artifact, we split each method's predictions into batches of 50 transcripts and aligned all Illumina reads to each batch separately. This makes the Illumina coverage constant across methods and removes any bias resulting from coverage differences.

Number of groups computation:
To compute the number of groups, we implemented the following post-processing algorithm to separate the transcripts into groups (i.e., maximal cliques):

1. Perform Smith-Waterman alignments between all pairs of transcripts in each family. We used SW parameters of gap_open = -50, gap_extend = 0, mismatch = -20, and match = 1.

2. Create a graph where a node represents an isoform and an edge means that two isoforms can potentially come from the same copy (i.e., the pairwise alignment contains no substitutions and no indels smaller than 3 base pairs). The graphs are included in Sup. Data.

3. Enumerate all maximal cliques in the graph, using a brute-force algorithm.

4. For each maximal clique, use the pairwise alignments to perform a progressive multiple alignment of all the transcripts in the clique, and derive the consensus sequence. This consensus contains the full set of exons from all isoforms assigned to a copy. The alignments are included in Sup. Data.
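Step 3 can be done by brute force because the per-family graphs are small. A minimal sketch (function name and edge encoding are ours):

```python
from itertools import combinations

def maximal_cliques(nodes, edges):
    """Brute-force enumeration of maximal cliques; adequate for small
    graphs. `edges` is a set of frozensets {u, v}."""
    def is_clique(sub):
        return all(frozenset(p) in edges for p in combinations(sub, 2))
    # collect every clique, then keep those not strictly contained
    # in a larger clique
    cliques = [set(sub) for k in range(1, len(nodes) + 1)
               for sub in combinations(sorted(nodes), k) if is_clique(sub)]
    return [c for c in cliques if not any(c < d for d in cliques)]
```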
In the case of RBMY transcripts 45-61 (Fig. 5), a manual inspection of the alignments revealed incorrect pairwise alignments, due to short repeats. We tweaked these manually and redid steps 2-4.
Open reading frame (ORF) prediction: Each predicted transcript was aligned to the reference sequences used to design the primers (alignments included in Sup. Data).
Those that fully aligned to the references and could be translated into proteins without premature stop codons are categorized as coding; those with premature stop codons are categorized as non-coding. If no stop codon is found, the transcript is still categorized as coding, because the stop codon may lie downstream of the primers or may have been clipped out as part of the primer.
Creating a database of known splice variants: We first constructed a database of known potential splice variants, against which we would later compare IsoCon transcripts. We downloaded all 105 transcripts of the nine ampliconic gene families from the NCBI RefSeq database 14. We combined these with the 61 unique CDS sequences from the Ensembl database to obtain 166 known transcripts. We used BLAT (server version 15) to align these transcripts to the hg19 reference genome. We filtered out all non-chrY alignments. For each transcript, we then considered the highest-scoring alignment and any other alignments with more than 99% identity and 99% coverage (due to the highly repetitive nature of these families). For every alignment, an hg19 deletion of at least 10 nt is marked as an intron and its coordinates as a splice junction (we varied the 10 nt parameter to 5 nt and 3 nt in this analysis and obtained the same results). In this manner, the 166 known transcripts yielded 668 distinct candidate splice variants (i.e., combinations of splice junctions). This number is an overestimate, but it gives us confidence in the novelty of any splice variants we discover that do not match one of the known candidate splice variants. The alignments contained a total of 471 distinct splice junction coordinates, which we stored in a database of known potential splice junctions (Supplementary Dataset 1).

Splice variant analysis:
To analyze splice variants in IsoCon's transcripts, we aligned them to hg19 and stored the splicing coordinates for the highest-scoring alignment and any other alignment to chrY with >99% identity and >99% coverage. We aimed to classify IsoCon's predictions into three categories: splice match (SM), novel in catalog (NIC), and "ambiguous" (the first two terms are part of the terminology used in 16). SM transcripts have splice coordinates identical to a reference transcript at all junctions. NIC transcripts "contain new combinations of already annotated splice junctions or novel splice junctions formed from already annotated donors and acceptors" 16. However, these classifications rely on accurate alignments which, given the repetitive nature of our gene families and the fact that the reference does not contain all gene copies, are not always obtainable. We therefore classify transcripts without a clear SM or NIC alignment as ambiguous, as they might either contain novel splice junction(s) or be novel distinct gene copies (i.e., copies without a representation in the reference genome). For transcripts with more than one alignment, we did not observe mixed categories; i.e., all the alignments were either SM, NIC, or ambiguous. We also note that for a NIC transcript, we cannot distinguish with full confidence whether it is a true novel splice variant of an existing gene copy or a novel distinct gene copy itself. Of IsoCon's 121 predictions that were shared between samples, 83 transcripts were classified as SM, 21 as NIC, and 17 as ambiguous (the NIC and ambiguous transcripts are presented in Supplementary Table 1).
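A simplified version of this classification can be sketched as follows. It checks only exact junction chains and individually known junctions; the real analysis also admits novel junctions formed from annotated donors/acceptors under NIC, and uses alignment quality to decide "ambiguous". The function name is ours.

```python
def classify_splice_variant(junctions, known_chains, known_junctions):
    """Classify a transcript's chain of splice junctions:
    SM  - the exact junction chain matches a known transcript;
    NIC - every individual junction is known, but the combination is new;
    ambiguous - otherwise (sketch; see caveats in the text above)."""
    chain = tuple(junctions)
    if chain in known_chains:
        return "SM"
    if all(j in known_junctions for j in junctions):
        return "NIC"
    return "ambiguous"
```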

Analysis of transcripts with lower significance:
For each transcript, IsoCon outputs a significance value, which is the maximum of all its pairwise significance values with other transcripts. We investigated the possibility that transcripts with lower significance are enriched for false positives. To do this, we examined our accuracy metrics as a function of the significance values, focusing on transcripts with a p-value greater than 10^-20 (Supplementary Figure 15). We observe that 67% of the transcripts with a p-value greater than 10^-20 have full Illumina support. This is lower than the 80% for the transcripts with lower p-values, but still fairly high. For reference, recall that for ICE and proovread these numbers are 2% and 15%, respectively. Looking at the percentage of nucleotides supported by Illumina, it is 99.0% for transcripts with a p-value greater than 10^-20, almost as high as for transcripts with lower p-values (99.2%). For reference, recall that ICE and proovread transcripts have support of 93% and 96%, respectively. We also observe that of the 21 distinct transcripts recovered from the Ensembl database, four were recovered by IsoCon predictions with a significance value greater than 10^-8. This indicates that retaining predictions with higher p-values is likely necessary to obtain higher sensitivity.
Additionally, the p-values of all of IsoCon's predictions are shown in Supplementary Figure 14B.
Running time and memory analysis: We compared the runtime and memory usage of IsoCon, ICE, and proovread on our human samples (Supplementary Table 2). We used an x86_64 machine running Linux v3.2.0-4-amd64, equipped with 32 2-threaded cores and 512 GB RAM. IsoCon is roughly 2x faster than ICE when running on a single core, and ~5x faster when running over 64 processes. A direct comparison to proovread is challenging because it imposes a minimum read length criterion that makes it unable to process the BPY and XKRY families. With this caveat in mind, proovread's running times were 12% slower than IsoCon's on 1 core and >3x slower on 64 cores.

Supplementary Note 6: Analysis of FMR1 data
We downloaded the 49 isoforms of the FMR1 gene detected in 17 from https://zenodo.org/record/833502#.Wsav_NPwZTb. We also included the three isoforms that were predicted in a previous study 18 but not found in 17; these three isoforms were constructed from hg19 based on the splicing coordinates given in 18. We refer to these 52 isoforms as validation isoforms. We used BLAT (server version 15) to align these transcripts to the hg19 reference genome and stored, for each isoform, its unique set of splicing coordinates on hg19.
The original CCS read data were downloaded in fastq format from https://zenodo.org/record/185011#.WtEWadPwZTY. We ran IsoCon on each of the six samples individually with the same parameters as for the ampliconic gene families (default settings, see Supplementary Note D). We then aligned all of IsoCon's predictions using the server version of BLAT and replicated the filtering criteria of 17: we retained only the predicted transcripts with an alignment coverage >= 99%, identity >= 95%, and a location within the region targeted by the primer design (start between chrX:147,014,079-147,014,470 and end between chrX:147,030,158-147,030,505).
We classified each of the 52 validation isoforms as detected if there was a predicted transcript that had identical donor and acceptor splice coordinates across all exons in the transcript. All 52 validation isoforms have distinct splice-junction coordinates, which means that each IsoCon prediction can be assigned to at most one validation isoform.
We ran IsoCon on the FMR1 datasets on a MacBook Pro with three 3.1 GHz Intel Core i7 cores and 16 GB of 1867 MHz DDR3 memory. The three premutation carrier samples, each containing CCS reads from 3 SMRT cells, were each processed by IsoCon in approximately 3 hours with a maximum memory consumption of less than a gigabyte. The controls, each containing CCS reads from 1 SMRT cell, were each processed in 30-40 minutes.