Compound heterozygous mutations in the noncoding RNU4ATAC cause Roifman Syndrome by disrupting minor intron splicing

Roifman Syndrome is a rare congenital disorder characterized by growth retardation, cognitive delay, spondyloepiphyseal dysplasia and antibody deficiency. Here we utilize whole-genome sequencing of Roifman Syndrome patients to reveal compound heterozygous rare variants that disrupt highly conserved positions of the RNU4ATAC small nuclear RNA gene, a minor spliceosome component that is essential for minor intron splicing. Targeted sequencing confirms allele segregation in six cases from four unrelated families. RNU4ATAC rare variants have been recently reported to cause microcephalic osteodysplastic primordial dwarfism, type I (MOPD1), whose phenotype is distinct from Roifman Syndrome. Strikingly, all six of the Roifman Syndrome cases have one variant that overlaps MOPD1-implicated structural elements, while the other variant overlaps a highly conserved structural element not previously implicated in disease. RNA-seq analysis confirms extensive and specific defects of minor intron splicing. Available allele frequency data suggest that recessive genetic disorders caused by RNU4ATAC rare variants may be more prevalent than previously reported.

up-regulated genes are significantly enriched in minor intron genes. Abbreviations: fdr10, fdr20: differential expression Benjamini-Hochberg FDR <= 10%, 20%; pv0.05: differential expression nominal p-value <= 0.05; NotSigFc2: genes with log2 (fold-change) >= 2 but nominal p-value > 0.05 (i.e. not significant); UP_MIn: number of up-regulated genes without minor introns; UP_MIy: number of upregulated genes with at least one minor intron; UP_OR, UP_OR_CI1, UP_OR_CI2: Supplementary Table 17. MOPD1 published causal variants and degree of severity. Note that when the same mutation was found in more than one affected individual from the same kindred, information about life expectancy and cause of death is separated by comma (e.g. 10 months, 11 months), if more individuals were affected from an extended pedigree, information may be provided as a range.

SUPPLEMENTARY NOTE 1: PATIENT INFORMATION
Patient 1 (Kindred 1) Patient 1 was born at 35 weeks gestation to non-consanguineous parents of English descent. His birth weight and length plotted below the 3 rd centile for his gestational age. Birth head circumference measurement was not available. The patient suffered repeated upper respiratory and ear infections, and required myrigotomy tube insertion at the age of 2 ½ years. He continued up to the age of 5 years to experience febrile episodes once per month and had multiple admissions to the hospital for bacterial pneumonias. At the age of 6 years, he was admitted for high fever, hepatosplenomegaly and progressive respiratory failure. Chest radiography showed bilateral lung consolidations. He was diagnosed with pneumonia and septicemia and improved after treatment with piperacillin and tobramycin. Throughout his childhood, the patient suffered from moderate atopic dermatitis and repeated episodes of Herpes Simplex Virus (HSV) infections with lesions at the right maxillary ridge and right frontal areas. Upon immunological assessment for his history of recurrent infections, the boy was diagnosed with antibody deficiency (see Supplementary Table 2) and prescribed monthly IVIg therapy, which reduced the infection frequency.
At the age of 14 years, the patient was admitted to hospital for repeated episodes of vomiting, decreasing exercise tolerance and fatigue over a period of 3 months. Physical exam and related investigations revealed a gallop rhythm, hepatomegaly and pulmonary edema. Electrocardiogram showed left ventricular hypertrophy and echocardiogram revealed severe left ventricular dysfunction with ejection fraction of 17%; the heart wall movement showed a pattern typical of non-compaction. The boy's symptoms resolved gradually in response to treatment with furosemide and digoxin [21].
The patient had global developmental delay and mild generalized hypotonia. He sat at eight months and walked at 2 years. His speech was also delayed. At the age of 7 years, the patient underwent psycho-educational assessment. His intelligence could not be assessed by the Wechsler scales because of his inability to co-operate. Instead, the Leiter test was used and showed that the patient's intellectual ability was equivalent to that of a 4 year old (< 1 st centile).
Single word comprehension and expression as well as grammatical comprehension were at the 5 year old level. Basic concepts fundamental to understanding verbal instructions measured at the 3 rd centile.
Head circumference plotted below the 3 rd centile (-3SD) when he was assessed at the age of 17 years. MRI brain was performed and was normal. That patient's weight and height continued to track below the 3 rd centile for age and final adult height was 3-4 SD below the mean. Adult weight plotted 2-3 SD below the mean. Skeletal survey was performed and showed spondyloepiphyseal dysplasia and shortened metacarpals.
Eye exam was performed in light of the diagnosis of Roifman syndrome, and was normal.
Patient 2 is patient 1's older brother and was diagnosed following patient 1's evaluation. The boys also have an eldest brother who is healthy. Patient 2 was born at term birth weight and height below the 3 rd centile. Birth head circumference was not known. He spent his first 11 days in the NICU because of slight breathing and feeding difficulties and jaundice. During infancy and childhood, he suffered repeated episodes of pneumonia and ear infections. Myringotomy tubes were inserted at the age of 2 years and again at 12 years. Immunological work up revealed low number of circulating CD19+ B cells, but normal number and distribution of T cells. While serum IgG, IgA and IgM levels were normal, specific antibodies were universally markedly reduced.
An eye exam was performed at the age of 14 years because of vision concerns and revealed signs of retinal dystrophy. Psychoeducational assessment performed at the age of 11 years because of academic difficulties showed he was functioning at a grade 1 level (4 grade levels below expected for his age) in math computations, reading comprehension, reading and spelling. At the age of 16 years a full scale IQ assessment was performed. The patient's major difficulties were with tasks requiring him to analyze abstract designs and construct them from blocks, solve puzzles and coding. Reading ability on the Woodcock test placed him at grade 3-4. Math assessment using the KeyMay Diagnostic Test showed competency levels at grade 2 or 3. His weight was in the normal range in adulthood (10-25 th centile) but his short stature remained (-3-4 SD). Head circumference in adulthood was within the normal range. Skeletal survey showed spondyloepiphyseal dysplasia with shortened metacarpals.

Patient 3 (Kindred 2)
Patient 3 was born at 36 weeks via emergent Caesarian section due to placental abruption. His birth weight and height plotted below the 3 rd centile. Birth head circumference measurement was not available. The patient suffered repeated upper respiratory infection. Evaluation of his immune function at the age of 2 years, revealed low serum Ig and an inability to mount an antibody response after vaccination. He was given IVIg treatment which led to a slow decline in infection frequency. The patient had one episode of autoimmune hemolytic anemia at age 2 ½ years following a flulike illness. Direct Coombs test was positive and blood smear showed polychromasia with spherocytes and Howell-Jolly bodies. He was treated with a course of Prednisone and the anemia resolved 6 months later. At the age of 7 years, the patient was admitted because of a 5-month history of chronic diarrhea. Gut biopsies obtained during endoscopy revealed signs of chronic colitis at multiple sites. The cecum and transverse colon showed reactive glandular changes; some of the glandular epithelium was infiltrated by neutrophils, eosinophils and lymphocytes; some of the glandular crypt bases had increased apoptosis; the descending colon showed reactive lymphoid follicle within the lamina propria; one mucosal gland contained an intra-epithelial infiltrated of eosinophils; and the rectum had signs of cryptitis and multiple mucosal gland infiltrates of neutrophils in the lamina propria. Immunostains for CMV and adenovirus were negative. The patient had early gross motor delays, sitting at 11 months and walking at 22 months, with cognitive difficulties being more prominent later in his childhood. Psychological assessment was performed when he was in Grade 4, and revealed both a learning disorder and Attention Deficit Hyperactivity Disorder -predominantly hyperactive/impulsive type. Cognitive functioning was tested using the Wechsler Intelligence Scale for Children (WISC-IV). His performance on the main areas of WISC-IV comprehension was at the 50 th centile, perceptual reasoning at 5 th centile, working memory at 21 st centile, and processing speed at the 4 th centile. He was found to be 'exceptionally weak' (identified as performance below the 3 rd centile for age) in the following areas: skills in analyzing and organizing visual perceptual material; memory of meaningful visual information in recognizing changes or additions to scenes; and computational and math reasoning skills, (as measured by two subtests of the WIAT-II), Head circumference measured at 17 years of age plotted just below the 3 rd centile. MRI brain was normal. His weight in adulthood was in the low-normal range (3-10 th centile) and his height remained below the 3 rd centile (-2-3 SD). Skeletal survey was performed because of short stature and brachydactyly; it showed signs of epiphyseal dysplasia, affecting the vertebrae, hips and metacarpals bilaterally. Based on the features of humoral immune deficiency, cognitive delay and spondyloepiphyseal dysplasia, the diagnosis of Roifman syndrome was made. An eye examination was then performed and revealed signs of retinal dystrophy.

Patient 4 (Kindred 3)
Patient 4, a female, was born at 36 weeks gestation to non-consanguineous parents of Lebanese-Austrailian origin. Her weight, height and head circumference plotted at the 10 th centile. Early on, she was noted to be mildly dysmorphic with brachydactyly of the hands and feet and hyper-extensible ligaments. A ventricular septal defect was also found. During the neonatal period, she had a transient conjugated hyperbilirubinemia. While the jaundice settled, hepatosplenomegaly and elevated liver enzymes persisted. She therefore underwent liver biopsy, which was performed at the age of 3 ½ months. This showed expansion of the portal areas and sinusoids with extramedullary haematoposiesis including eosinophils, neutrophils as well as B and T cells. In childhood, the patient suffered repeated discharging ear infections and recurrent oral thrush. Her growth remained stunted, tracking below the 1 st percentile for height and weight. Skeletal survey revealed epiphyseal dysplasia and shortened metacarpals. The vertebrae appeared normal. She was also found to have mild conductive hearing loss. Eye examination was normal. She had generalized hypotonia. The patient's cognitive delays were mild and did not come to attention until school age. Intellectual function, assessed by the WISC-IV test, showed mild intellectual impairment in verbal intelligence (0.5 th percentile) and nonverbal intelligence (2 nd percentile), borderline range processing speed (5 th percentile) and low average working memory (9 th percentile). Her ability to learn and remember structured verbal material was borderline and her memory for unstructured verbal material was average. Patient 5, the younger brother of patient 4 was born at term with weight at the 2 nd centile, length below the 3 rd centile (-3-4 SD) and head circumference below 3 rd percentile. He, too, suffered neonatal jaundice and hepatosplenomegaly, and when the jaundice subsided he continued to have a larger liver and spleen. Evaluation by ultrasound demonstrated increased pressure within the portal tract, which was felt to indicate likely hepatic fibrosis. Phenotypically, he is similar to his sister with similar dysmorphic features (see Supplementary Table 1) and short stature (< 1 st percentile). Distinct from his sister, he had retinal dystrophy, myopic astigmatism and marked sensorineural hearing loss considered sufficient to require hearing aids. His hypotonia and gross motor and speech delays were more severe than his sister's as well.
He suffered repeated infections, especially chest infections, since the age of 1 year. He had repeated pneumonias since the age of 2 ½ years, which were responsive to antibiotics. Patient 5 had delayed motor and speech milestones. His psychological assessment (WISC-IV test) was performed at 7 years 4 months of age and showed mild intellectual impairment in verbal intelligence (0.5 th percentile) and nonverbal intelligence (0.5 th percentile) as well as processing speed (1 st percentile) and borderline range working memory (3 rd percentile). His overall level of intellectual functioning was in the range of mild intellectual impairment. Formal assessment of attention using the Test of Everyday Attention for Children showed extremely low ability to focus his attention on mundane tasks for an extended period of time. His ability to learn and remember structured verbal material was also poor. Skeletal survey showed epiphyseal dysplasia with shortened metacarpals. The vertebrae were normal.

Patient 6 (Kindred 4)
Patient 6 is the youngest known case of Roifman syndrome at 4 years old. He was born at 37 weeks gestation to non-consanguineous parents of Albanian descent. His weight, length and head circumference measured <3 rd percentile. He suffered mild neonatal jaundice. From the age of 8 months he suffered repeated episodes of croup and at 9 months he had bronchiolitis. At 26 months of age he suffered pneumonitis. At 26 months of age he suffered pneumonia and had convulsions with fever. At 3 years he had a second episode of febrile convulsions during an episode of pneumonitis due to parainfluenza 3 and adenovirus. At 51 months, he had mycoplasma pneumonia and at 52 months metapneumovirus pneumonitis. At 53 months, he suffered perforation of his left tympanic membrane and purulent otorrhoea. He also had severe eczema, which was first noticed at 8 weeks of age over his trunk, arms and legs, which gradually improved with age. Evaluation of the immune system revealed low serum IgG with normal IgA and IgM, and low specific antibody levels to tetanus Haemophilus Influenzae and Diptheria Toxin. Lymphocyte markers revealed normal number of circulating T cells and NK cells but low numbers of CD19+ B cells. Linear growth remained delayed at <3 rd percentile, while weight was appropriate for age. Head circumference also remained below the 2 nd percentile. Skeletal survey performed at 28 months demonstrated marked delay in ossification of the capital femoral epiphyses and flattening of acetabular roof. Both capital femoral epiphyses remained entirely cartilaginous. Knee epiphyses appeared fragmented and flattened bilaterally. Metatarsals and metacarpals appeared broadened. The metaphyses were normal and no abnormality was detected in the spine.
The boy's development was mildly delayed with low to average performance. Formal cognitive assessments have not yet been performed. Head MRI and EEG were normal and eye examination showed normal retinas.

Coverage
For both subjects, > 97% of the genome was covered at depth >= 5 by uniquely aligned reads.

SNV and indel variant counts
The two affected siblings had a very similar number of total variants (no quality filter: 4,070,206 and 4,086,111, quality tier 2: 3,495,872 and 3,526,066) as well as gene-mapped, potentially damaging coding (quality tier 2: 483, 496) and non-coding variants (quality tier 2: 107, 104).

SNV and indel prioritization results, primary pipeline
None of the two siblings had any quality tier-2, 1% rare, potentially damaging homozygous variant.
The two siblings had several quality tier-2, 1% rare, potentially damaging X-linked variants, but none was shared. Since the X-linked mode of inheritance had been proposed, we investigated these particularly carefully. None of the coding variants present in only one of the two siblings were sufficiently damaging and disrupting a gene producing the expected phenotypic outcome: one of the two siblings had a missense damaging tier-1 variant on SH3KBP1, frequency 1-0.3%, and a missense damaging tier-1 variant on AFF2, frequency < 0.2%; SH3KBP1 is implicated only in immune and metabolic abnormalities in mouse, while AFF2 is implicated in intellectual disability in humans, but neither matches the full Roifman Syndrome phenotypic spectrum (growth, neurodevelopmental, bone and immune abnormalities). The other sibling had a very rare (< 0.6%) and damaging tier-2 X-linked variant on ARSH (arylsulfatase family, member H), which is not implicated in any human phenotype and does not have a mouse homolog. In addition, one of the siblings had a novel, X-linked non-coding variant disrupting the core splicing site of a poorly characterized antisense transcript gene (HS6ST2-AS1). The two siblings had several quality tier 2, 1% rare, potentially damaging variants forming potential compound heterozygous sets (i.e. compound hets), but only one set of two quality tier-2 variants was shared, corresponding to the non-coding RNU4ATAC variants. Since several of the potential compound hets can be on the same phase (thus not forming authentic compound hets), it is not surprising that the prioritized variants found in both siblings are markedly fewer. Thanks to the short genomic interval between them, the phase of the RNU4ATAC variants was resolved by the variant calling algorithm, showing that the alternate alleles are on opposite phases. Notably, the variant mapping to the MOPD1-implicated 5' stem loop is novel with respect to all frequency databases and dbSNP, while the 3' stem loop variant is reported only for cgW597 (1000 Genomes subset on Complete Genomics) at frequency 0.0008, corresponding to only one allele in the 597 subjects set.
The two siblings also had 8 shared quality tier-2, 1% rare, potentially damaging variants on dominant genes, none of which was novel; three of these variants had frequencies below 0.1% (for the genes HTT, RP1L1, GUCY2D), and thus worth some more detailed investigation; however, only one of them was damaging tier-2 (for the gene HTT, whose known dominant mode of action is through toxic gain of function rather than loss of function), and none of them matched the expected phenotype. As far as the novel variants, which could have a de-novo origin, none was shared by the two siblings. Finally, there was no prior evidence that Roifman Syndrome could have a dominant mode of inheritance. Notably, all prioritized variants shared between the two siblings were quality tier-2, further validating our thresholds for high quality. For detailed results, please see Supplementary Data 1.

SNV and indel prioritization results, secondary pipeline
The two siblings shared 3 autosomal dominant classified as likely pathogenic variants, but none in the other likely pathogenic categories; in fact, all the other variants output by the secondary pipeline were heterozygous were classified as carrier status or mapped to genes with complex modes of inheritance; in addition, one shared variant was classified as forming a likely pathogenic potential compound het in one of the two siblings, but the other variant was not shared. The three heterozygous variants on dominant genes (CYP2C19, SLC6A2, MC1R) were at relatively high frequency (1% or higher) and did not match the expected phenotype. The RNU4ATAC compound het was not found by this pipeline, as it only considers loss-of-function (stop-gain, frameshift, splicing) or known variants classified as pathogenic by HGMD or ClinVar. For detailed results, please see Supplementary Data 1.

Structural variants
Structural rearrangements reported in "highConfidenceSvEventsBeta" had 122 and 127 complex events, 1138 and 1177 deletions, 46 and 49 distal duplications, 2 and 1 interchromosomal events, 6 and 9 inversions, 19 and 25 probable inversions, 74 and 71 tandem duplications, for the two siblings respectively. Events were filtered to retain only the ones disrupting genes and supported by junctions never observed in the 54 unrelated control samples from the Complete Genomics diversity panel (pipeline version 2.0). This resulted in 21 and 20 events for the two affected siblings, with 9 genes disrupted by the same or similar events in both siblings (note that, unlike for SNV and indels, it is preferable to relax the exact overlap criterion to partial overlap). Most of the deletions were intronic, and none of the events severely disrupted the sequence of a gene that could account for the Roifman Syndrome phenotype (see Supplementary Data 3).

Copy number variants
The two affected siblings had 145 and 157 losses, and 135 and 139 gains respectively, with min CNV size 1,306 bp, median size 12,000 bp and max size 350,000 bp. Of these, 26 and 34 losses, and 32 and 32 gains, were not found in the 54 controls and had at least one gene with exonic mapping in the two affected siblings, resulting in 36 genes with a rare exonic loss in both siblings and 27 genes with a rare exonic gain in both siblings. However, all rare exonic CNVs were found in the DGV (Database of Genomic Variants) [22] and/or were completely within a segdup or overlapped other repeat-rich sequence; none of them could account for the Roifman Syndrome phenotype based on the overlapped genes (see Supplementary Data 2).

RNU4ATAC variants in internal whole genome controls
We analyzed the following whole genomes, sequenced for diverse research projects unrelated to Roifman Syndrome or MOPD1: probands (for a total of 6 trios) had at least one RNU4ATAC variant; all subjects but one carried only one heterozygous variant each; one subject (a mother of an autism proband, not reported to have any clinical abnormality) carried two RNU4ATAC heterozygous variants (chr2:122288468:C>T, chr2:122288549:A>G); since neither variant was inherited by her autistic child, the two variants were inferred to be on the same phase, thus not constituting a compound heterozygous set; even in case of true compound heterozygosity, one of the two variants (chr2:122288549:A>G) is at a poorly conserved position within the dispensable region of the 3' stem-loop (94), predicted to have no functional effect on minor intron splicing.

Estimated prevalence of RNU4ATAC genetic disorder
We extracted all RNU4ATAC single nucleotide variants from the three allele frequency control databases based on whole genome sequencing (1000 Genomes, Complete Genomics Wellderly and Complete Genomics 1000 Genomes subset). We transformed allele frequencies to subject counts by multiplying the allele frequencies to the allele total (1092 * 2, 597 * 2, 436 * 2, respectively). For each position corresponding to (a) disease variants (Roifman Syndrome and MOPD1), or (b) disease variants and conserved positions within elements critical for splicing (see main text figure 3), we then added up the corresponding variant subject count from each database, and divided by the subject grand total, to finally obtain the frequency of heterozygous variant carriers in the general population, i.e. the probability of a subject from the general population being a heterozygous variant carrier. We can assume none of the subjects is homozygous or compound heterozygous, as he/she would otherwise develop disease (in addition, for Complete Genomics datasets genotype counts are available, and none is homozygous). These estimates were also calculated excluding the 1000 Genomes dataset, which has lower whole genome sequencing depth. The probability of a pregnancy presenting a bi-allelic RNU4ATAC alteration was estimated as ¼ of the square of the sum of all variant heterozygous carrier probabilities; assuming each variant belongs to a separate haplotype, probabilities at single positions represent separate events; since frequencies are typically < 1%, the probability of combined events (homozygosity, multiple heterozygous positions) is negligible, and thus (in first approximation) we can simply sum the probabilities of heterozygosity without removing the probabilities of combined events. This resulted in the probability 7.80E-06 when considering only disease variants (1.82E-05 when excluding 1000 Genomes), and 1.57E-05 when also considering conserved positions within splicing-critical elements (3.25E-05 when excluding 1000 Genomes), leading to a final estimate of 0.78-3.25 / 100,000 pregnancies (i.e. up to 1 in 30,000).

SUPPLEMENTARY NOTE 4: MOPD1 AND OTHER MINOR SPLICEOSOME DISORDERS
MOPD1 is characterized by severe intrauterine and postnatal growth retardation; short, bowed long bones with severe delay in epiphyseal maturation; severe microcephaly; brain malformation including pachygyria and corpus callosum agenesis; cardiac abnormalities; dysmorphic facial features and sparse hair; stillborn or premature death within the first year(s) of life [15,[23][24][25]. Most often, patients die because of infectious diseases [15], although severe neurological and endocrine abnormalities have also been reported as the cause of death [25]. MOPD2 and MOPD3 are syndromes with related phenotypic presentation. MOPD2 is caused by recessive causal variants of PCNT (pericentrin, entrez-gene id 5116, required for centrosome function), and does not present immunodeficiency. The syndrome originally designated MOPD3 was later recognized to be a phenotypic variation of MOPD1, and the few reported cases presented death in the first years of life due to infections [26][27] or even more premature death due to failure to thrive [13]. In contrast, and probably adding to the confusion, when Majewski later recognized that the originally defined MOPD3 cases should be classified as MOPD1, he also proposed the use of "MOPD3" for other cases of primordial dwarfism with severe intellectual disability, but no immunodeficiency or death in the first years of life, including the case of Caroline Crachami (also known as the Sicilian fairy) [28]. We think this type of MOPD3 is less likely to be caused by RNA4ATAC causal variants, or could be caused by RNA4ATAC causal variants with reduced penetrance, in a different domain than MOPD1 causal variants.
Besides MOPD1 and Roifman Syndrome, specific deficit of the minor spliceosome function has been reported only for isolated familial growth hormone deficiency, caused by compound heterozygous causal variants in the RNPC3 protein-coding gene [29]. This suggests growth retardation, intellectual disability and bone defects as common outcomes of minor spliceosome dysfunction, though with different degrees of severity and specific presentations. Additional disorders are caused by defective minor intron splicing of specific genes and result in organ or even cell type-specific alterations (e.g. spondyloepiphyseal dysplasia tarda is caused by SEDL minor intron splice site causal variants, entrez-gene id 6399) [1].
Strikingly, of all the minor spliceosomal snRNAs, only RNU4ATAC has been implicated in Mendelian disorders. Since minor snRNAs (U4atac, U6atac, U11, U12) are expressed from single loci in the genome, homozygous or compound heterozygous loss-of-function causal variants in these snRNA genes could disable the minor spliceosome function as demonstrated for RNU4ATAC [1]; however, disease-implicated variants may have been missed by exome studies that did not include snRNAs in the captured genomic regions. Furthermore, seven proteins have been reported as specific to the minor spliceosome; even though their knockdown has been shown to interfere with human cell viability [30][31], only RNPC3 recessive causal variants have been implicated in Mendelian disorder. This suggests that more Mendelian disorders caused by spliceosomal defects may be discovered, further extending the associated phenotypic spectrum. Disruption of protein components may result in less severe outcomes, or less widespread organ abnormalities, compared to snRNAs disruption.

Alignment, QC, and summary of results (cufflinks, DESeq, iReckon)
The number of trimmed reads was comprised in the 27,971,870 -38,088,242 range. The human rRNA content was comprised in the 3.74% -12.8% range (of the trimmed reads). Concordantly aligned reads were in the 82.6 -91.5% range (of the trimmed reads) (see Supplementary Table  16).
Cluster analysis confirmed separation of affected and unaffected samples (see Supplementary  Fig. 4).
Cufflinks [10] assembled novel transcripts with minor intron retention for 244 genes; intron retention transcripts were expressed at significantly higher levels in affected subjects for 32 genes, while they were significantly higher in unaffected subjects for only one gene (differential isoform nominal p-value <= 0.05). For these 244 genes, the most expressed physiological transcripts were expressed at reduced levels in affected subjects, a significant trend compared to genes without minor introns (Wilcoxon test p-value < 1E-04, see supplementary results section D4 and D5). iReckon [11] detected a similar number of minor intron retention transcripts, with significantly higher expression levels in affected subjects (Wilcoxon paired test p-value < 1E-13), and a corresponding decrease in physiological transcripts (Wilcoxon paired test p-value < 1E-08, see supplementary results section D7). Specific retention of minor rather than major introns was confirmed in affected subjects using DESeq [9] (Fisher's Exact Test p-value < 1E-100, see supplementary results section D6). In contrast, gene-level expression analysis (which does not distinguish different transcript categories) showed that minor intron genes have increased levels in affected subjects, with a significant enrichment compared to other genes (Fisher's Exact Test p-value < 0.0001, see supplementary results section D2 and D3).

DESeq differential gene expression analysis
Genes carrying a minor intron were significantly enriched in up-regulated genes at different significance thresholds for differential expression (Benjamini-Hochberg FDR adjusted p-value <= 0.10, <= 0.20 and nominal p-value <= 0.05) (see Supplementary Table 6), but they were not significant or just borderline significant for enrichment in down-regulation (same significance thresholds for differential expression, see Supplementary Table 7); up-regulation here is used to indicate higher levels in affected compared to unaffected, and vice-versa for down-regulation. Interestingly, up-regulation enrichment p-values increased in significance when relaxing the differential expression significance threshold, although the odds ratio moderately decreased. As a negative control, we tested enrichment in up-regulated genes with log2 (fold-change) >= 2 but nominal p-value > 0.05, finding a significant depletion. Altogether, these results suggested that noisy differential expression is not driving the up-regulation enrichment. Enrichment was tested using the two-sided Fisher's Exact Test (FET), with contingency table counts defines as: (a) significantly up-regulated (or down-regulated) minor intron genes, (b) significantly up-regulated (or down-regulated) genes without minor introns, (c) minor intron genes with no significant change in the DESeq output, (d) genes without minor introns with no significant change in the DESeq output.
Total number of minor intron genes for DESeq gene differential analysis: 742 Total number of genes for DESeq gene differential analysis: 25,369
Adjusting for different stringency, we found a reasonable overlap between DESeq and cufflinks/cuffdiff up-regulated genes: for the genes without minor introns, 320 of the 722 DESeq and 755 cufflinks/cuffdiff overlapped; the overlap was greater for the up-regulated genes with minor introns: 50 of the 81 DESeq and 77 cufflinks/cuffdiff. Total number of minor intron genes for cufflinks gene differential analysis: 695 Total number of genes for cufflinks gene differential analysis: 24,206

Cufflinks/cuffdiff differential splicing output analysis
Genes carrying a minor intron were significantly enriched in differentially spliced genes at different significance thresholds for differential splicing output (FDR q-value <= 0.05, <= 0.10, <= 0.25 and nominal p-value <= 0.05). Enrichment p-values increased in significance when relaxing the differential splicing significance threshold, while the odds ratio decreased as expected (see Supplementary Table 10).
Total number of minor intron genes for cufflinks gene differential analysis: 727 Total number of genes for cufflinks gene differential analysis: 25,162

Cufflinks isoform intron retention analysis
Cufflinks assembled novel transcript isoforms (transcript class code "j") with minor intron retention (at least 25% of a minor intron overlapped) for 244 genes; intron retention isoforms were expressed at significantly higher levels in affected subjects for 32 genes, while they were significantly higher in unaffected subjects for only one gene (cufflinks differential isoform nominal p-value <= 0.05).
For these 244 genes, the most expressed physiological transcript isoforms were expressed at lower levels in affected subjects, a significant trend if compared to genes without minor introns. For this analysis, we categorized transcript isoforms in three groups: novel with minor intron retention (cufflinks isoform code "j" and at least 25% of a minor intron overlapped), known without minor intron retention (cufflinks isoform code "=" and no overlap with minor or major introns), other (isoforms not matching any of the other two definitions, thus having major intron retention, limited minor intron retention, or not matching known isoforms for other reasons). For each gene, the most expressed physiological transcript isoform was identified as the known isoform with the highest FPKM in the unaffected subjects, relative to other known isoforms. Enrichment in down-regulation of the most expressed physiological transcript isoform in the 244 genes with minor intron retention compared to other genes (excluding any gene with minor introns) was tested using the Wilcoxon one-sided test to compare the log2 fold-change; this resulted in a p-value of 1.402e-05.
We additionally investigated down-regulation of isoforms without intron retention, for genes with minor intron retention. We classified all isoforms in three groups: (i) minor intron retention (overlapping at least 25% of a minor intron), (ii) dubious (overlapping < 25% of a minor intron), (iii) without minor intron retention, regardless of other intron retention events and cufflinks isoform class code. Then, for every gene, we generated FPKM totals for isoform group (i) and (iii) by summing the average FPKM values for the affected and unaffected subject; to stabilize the FPKM ratio between affected and unaffected samples, zero values were replaced with 0.001 (roughly corresponding to the minimum non-zero value) and genes with FPKM for both groups and conditions below the median FPKM value were discarded. Minor intron genes with any assembled isoform (593 genes), genes with at least one minor intron retention isoform (244 genes) and genes with at least one minor intron retention isoform significantly upregulated in affected subjects (32 genes) all presented a small in absolute terms albeit significant negative shift of the log2 (affected / unaffected) FPKM ratio relative to the isoform group without minor intron retention, compared to other genes (see Supplementary Table 11). The shift significance was tested using the two-sided Wilcoxon's test.

DESeq differential intron expression analysis
Of 203,456 processed introns in the DESeq differential results, 9,235 corresponded to genomic loci with overlapping multiple genes and were ignored for further analysis. In the resulting DESeq differential output table, 776 unique introns had a match to a minor intron. Compared to major introns, minor introns were extremely enriched in the up-regulated introns, when considering all genes (p-value < 1E-100, odds ratio > 40), and also when considering only up-or down-regulated genes. An even more extreme enrichment effect size (odds ratio > 150) was found when considering only unchanged genes (DESeq gene differential expression nominal p-value >= 0.1, absolute log2 (fold change) < log2 (1.1), corresponding to 2,975 genes); this was probably the case because unspecific intron up-regulation tied to overall gene up-regulation is not present for such genes. Finally, when only minor intron genes were considered, we still found a strong enrichment in minor intron up-regulation compared to major introns. We additionally demonstrated that the enrichment was present when defining upregulated introns based on DESeq significance (p-value, BH-FDR) as well as when defining upregulated introns based on DESeq fold-change (i.e. the effect size as opposed to the significance of intron up-regulation) (see Supplementary Table 12).

iReckon intron retention analysis
iReckon isoforms were classified based on 7 mutually exclusive categories: (a) known isoform with intron retention involving at least one minor intron ("IntronRetention_MIy"), (b) known isoform with intron retention involving only major intron(s) ("IntronRetention_MIn"), (c) canonical known isoform ("known"), (d) novel isoform without intron retention ("novel"), (e) novel isoform with intron retention involving at least one minor intron ("NovelIntronRetention_MIy"), (g) novel isoform with intron retention involving only major intron(s) ("NovelIntronRetention_MIn"), (h) unspliced pre-mRNA ("pre-mRNA"). Categories were constructed by parsing iReckon isoform names, matching the tags "IntronRetention", "novel", "unspliced"; in addition, isoform exons were matched to minor intron coordinates, to further categorize intron retention events as involving at least one minor intron or only major introns. Then, for each gene and each subject, isoform RPKM values were summed to obtain category totals, resulting in 7 RPKM values per gene. RPKM tables for different subjects were merged, replacing missing values with RPKM = 0. The merged table had 22,580 genes, including 538 / 744 minor intron genes. More genes had "IntronRetention_MIy" isoforms in affected subjects compared to unaffected (affected: 158, 96; unaffected: 13,16,16). The RPKM for that category were accordingly shifted to higher levels for affected compared to unaffected To more systematically explore the relative representation of different isoform categories we transformed absolute RPKM to percentage RPKM: for each subject and for each gene, RPKM for each category were divided by the total gene RPKM. We found a major difference between affected and unaffected "IntronRetention_MIy", which was followed up by more detailed tests, while other differences were less pronounced and did not correlate with the affected / unaffected condition and thus might reflect an isoform detectability bias (see Supplementary Figure 5).
Since iReckon does not provide a native differential expression analysis for genes and isoforms (unlike cufflinks), we decided to compare aggregate abundance estimates (RPKM) for different isoform groups, for each affected-unaffected pair. Therefore, for each affected-unaffected subject pair, we tested the subset of genes with detected "NovelIntronRetention_MIy" isoforms in any of the affected or unaffected subjects, demonstrating that (i) the "NovelIntronRetention_MIy" category RPKM is significantly higher in affected subjects, (ii) the "known" category RPKM is significantly lower in affected subjects. We used two one-sided paired Wilcoxon t-test, where genes were used to pair RPKM values from each affected and unaffected subject tested (see Supplementary Table 13-14).