Single molecule real time sequencing in ADTKD-MUC1 allows complete assembly of the VNTR and exact positioning of causative mutations

Recently, the Mucin-1 (MUC1) gene has been identified as a causal gene of autosomal dominant tubulointerstitial kidney disease (ADTKD). Most causative mutations are buried within a GC-rich 60 basepair variable number of tandem repeat (VNTR), which escapes identification by massive parallel sequencing methods due to the complexity of the VNTR. We established long read single molecule real time sequencing (SMRT) targeted to the MUC1-VNTR as an alternative strategy to the snapshot assay. Our approach allows complete VNTR assembly, thereby enabling the detection of all variants residing within the VNTR and simultaneous determination of VNTR length. We present high resolution data on the VNTR architecture for a cohort of snapshot positive (n = 9) and negative (n = 7) ADTKD families. By SMRT sequencing we could confirm the diagnosis in all previously tested cases, reconstruct both VNTR alleles and determine the exact position of the causative variant in eight of nine families. This study demonstrates that precise positioning of the causative mutation(s) and identification of other coding and noncoding sequence variants in ADTKD-MUC1 is feasible. SMRT sequencing could provide a powerful tool to uncover potential factors encoded within the VNTR that associate with intra- and interfamilial phenotype variability of MUC1 related kidney disease.

SCIENTIFIC REPORTS | (2018) 8:4170 | DOI: 10.1038/s41598-018-22428-0 #613092; ADTKD-REN), and HNF1B (OMIM #137920; ADTKD-HNF1B)) have so far been repeatedly assigned to the ADTKD-spectrum 1,[5][6][7][8] . Shared clinical features in ADTKD and phenotype overlap with other renal kidney diseases (RKDs) frequently make a definite clinical or histopathological diagnosis difficult 4,8,9 . Sanger sequencing or NGS based analyses of the first three identified genes is a widely available standard, while analysis of the MUC1 gene still poses a substantial diagnostic problem. ADTKD-MUC1 constitutes the first kidney disorder where mutations reside within a coding variable number of tandem repeat (VNTR contained in exon 2). The inaccessibility of many genomic repeats to direct short-read based sequencing approaches explains why the association of the MUC1 gene to ADTKD has been obscured for long 5 .
In contrast to simple tandem repeats commonly seen in (coding and noncoding) repeat expansion diseases (e.g. Huntington Disease or Fragile X Syndrome etc.) the actual size of the coding 60 basepair VNTR in ADTKD-MUC1 seems irrelevant to the pathomechanism of ADTKD ( Fig. 1) 5,10,11 . As the VNTR size of each parental MUC1 allele is highly polymorphic (usually between 20 to 125 units per allele) with no expansion or retraction of units, commonly used methods to detect repeat size alterations are per se useless here 12 .
The prototypic causative variant in MUC1 associated ADTKD, the insertion of a eighth cytosine base (insC) in a seven cytosine stretch within one unit of the VNTR composed of almost identical 60-mer units (according The cytosine insertion into the 7C-homopolymer in a single repeat unit of the variable number of tandem repeats (VNTR) domain is exemplarily introduced into the next-to-last repeat unit (indicated by arrows). In the lower part, coding genomic VNTR sequence and the corresponding amino acid sequence are shown for WT and the insC mutant. The insC causes a frameshift introducing a premature stop codon shortly beyond the VNTR domain and thereby, creating a 'neoprotein' lacking the SEA, TM and the cytoplasmic domains (b) MUC1-VNTR assembly of the risk allele in family F1. The complete VNTR sequence is shown for affected individuals III-10, IV-16, IV-27, and IV-28. This allele was identified in all affected individuals but it was not identified in unaffected individuals. The insC was detected in the second consensus repeat (highlighted in red). Capital letter code of the 60 mer VNTR-units according to Kirby et al. are displayed next to each repeat sequence. Nucleotide sequence differences within the repeat units are highlighted in dark grey. Uniform pseudo-repeats (nonvariable) units at beginning and at the end of the VNTR are underlined.
to HGVS nomenclature recommendations a duplication, but for the sake of clarity we stay with term insertion used in most publications), has been repeatedly found worldwide (also arbitrarily referred to as c.428dupC) 4,5 . Based on the VNTR assembly from three families, the exact location of the insC with regard to the unit number seemed to be variable, although always occurring within the 7C-homopolymer at relative base positions 53-59 of the 60-mer unit ( Fig. 1) 5 . Due to the nature of the VNTR, insC within different located 60-mer unit numbers as well as other small insertions or deletions located at different positions, would result in a frameshift leading to a similar, presumably toxic MUC1-neoprotein that is retained within the cell (Fig. 1) 5 .
The signal transducing transmembrane MUC1 protein is abundantly expressed in most epithelia and over-expressed in epitheloid malignancies. MUC1 is also expressed in renal development and later found at the apical surfaces of normal kidney epithelia that form luminal surfaces [13][14][15][16] . The role of MUC1 in the kidney is not very clear apart from the N terminal ectodomain of the glycoprotein being released into the urine. At urothelial surfaces MUC1 and uromodulin form protective polymer layers. Recently it was shown that MUC1 regulates the renal calcium channel TRPV5 and potentially is protective against calcium nephrolithiasis by increasing urinary calcium reabsorption 17 .
Since the high GC content in combination with the genomic architecture of the VNTR largely prevents direct sequencing, a probe extension (snapshot) assay refined by mass spectrometry validation has been developed as a first diagnostic test. The test needs to accomplish the detection of a single 60-mer unit carrying the insC against a high background of normal units (expected frequency between 0.8% (1/125 units) to 5% (1/20 units) depending on individual VNTR allele sizes) 4,5,18 . The evident limitations of the assay, the loss of almost all structural and positional information, led us to hypothesize on the existence of other causative alterations located within the VNTR that cannot be interrogated by the current assay.
To generate complete VNTR assemblies we thus adapted targeted single molecule real time (SMRT) sequencing on the PacBio RS II platform. We argued that this method could improve the testing in general and especially of sporadic cases, a group that has been largely omitted from current studies 4,5,19,20 . We show precise VNTR structural assembly and location of the causative insC, a prerequisite for identification of additional causative variants within the VNTR and detailed characterisation of genotype-phenotype correlations in ADTKD-MUC1.

Material and Methods
Probands. All investigations were conducted in accordance with the principles of the Declaration of Helsinki and after obtaining written informed consent. The study was approved by the local institutional review boards of the Universities of Cologne (IRB approval Nr 237/2013). Out of our (non-syndromal) ADTKD cohort previously tested negative for the ADTKD-genes UMOD, REN and HNF1B (including MLPA analysis for HNF1B, SALSA MLPA P241; MRC-Holland) we selected DNA samples from 9 families with positive snapshot findings as well as six families and one sporadic case with negative snapshot results who were still suspected to have ADTKD-MUC1 based on their clinical course and kidney biopsy findings for SMRT sequencing. Clinical and biochemical data were collected retrospectively from medical charts.
Library preparation and SMRT sequencing. Target amplicons of equal size were pooled equimolar and a SMRT bell library was prepared as recommended by Pacific Biosciences (10 kb Template Preparation and Sequencing with Low-Input DNA) without an initial fragmentation. Raw sequence data is available upon request. Libraries were quantified by fluorometry and quality was assessed by capillary electrophoresis (Agilent DNA 12000 reagents and chips, Agilent). SMRT bell templates were bound to polymerases using the DNA/polymerase binding kit P6 and v2 primers. Polymerase-template complexes were bound to magnetic beads with the Magbead Binding Kit and sequencing was done on the PacBio RS II sequencer with C4 sequencing reagents with a movie length of 180 or 360 min. Cluster analysis to extract haplotypes of individuals was done with the in house algorithm (Wenzel and Pannes unpublished). Amplicons were split according to the 16mer barcodes of PacBio. Additionally amplicons were split with a Perl script.
Assembly of the MUC1 VNTR using an in-house algorithm. Assembly of both paternal VNTR alleles was performed with the help of an in-house developed algorithm using a database-software (Microsoft ACCESS). The barcode-separated reads were converted into text files using Notepad++ editor and loaded into the database. Each single read was recognized by " > m" (fasta format) and at the end by barcodes, respectively. The second step included the identification of start and end of each single 60-mer repeat unit by known and universal sequences of fixed repeat units 1 and 9. Part three included the determination of the repeat type based on sequences of actual known repeat variants described in Kirby et al. Certain repeat types are recognized based on selected base pair sequences within one repeat unit. The 60 basepair sequences of all known repeat types were subdivided into 9 "identification sequences", respectively, each with a length of 4 bases in a row. This procedure overcomes the sequence variation based on reading failures made by the polymerase. Reconstruction of the repeat type assembly can be transferred into Microsoft EXCEL including two sheets: 1. Identified repeat variants corresponding to the capital letter code provided by Kirby et al. 5 are shown in the first sheet for every single read allowing complete VNTR assembly. If one repeat unit type cannot be identified this position remains blank providing the possibility to identify so far unknown repeat variants. The datasets generated during and/or analysed during the current study using the in-house algorithm are available from the corresponding author on reasonable request.

Results
Prerequisites for MUC1-VNTR long-read sequencing. Using a modified PCR protocol published by Fowler et al. allowed stable generation of PCR amplicons across a wide range of MUC1 VNTR allele sizes (from 25 to 74 60-mer units corresponding to ca. 1.5 to 4.5 kb amplicons) that could be visualized by agarose gel electrophoresis (Fig. 2) 22 .
The fidelity of the PCR was validated by allele length determination using Southern blot analysis (data not shown).
We observed preferential amplification of shorter VNTR alleles (<4 kb), a phenomenom that is known as short allele dominance 23,24 . PCR amplification had to be repeated for 8 out of 43 individuals with at least one long allele, for PCR failure.
Targeted SMRT sequencing of the MUC1-VNTR. Direct sequencing of the long-range MUC1-VNTR on the PacBio RS II platform was performed in 43 individuals (16 unaffected, 32 affacted) in 6 independent runs and resulted in an average sequencing depth of 160 complete reads (minimum 4, maximum 600 reads). Libraries were prepared by pooling 5 to 18 long-range (LR) PCR products depending on VNTR size that were subsequently analyzed on single SMRT cells.
Short allele dominance was observed in the first run. Shorter VNTR alleles (<4 kb) were overrepresented (range 32 to 600 complete reads) compared to longer alleles (range 4 to 229 complete reads). To minimize this effect PCR products were pooled for library preparation according to their length in all further runs into two libraries containing amplicons <4 and >4kb, respectively (for detailed information please see Material & Methods). This measure improved read depth for the longer alleles (range 36 to 440 complete reads).
Sequencing had to be repeated for 8 individuals for low complete VNTR coverage, which was mostly associated with allele size >4kb.  Table 1 and Fig. 4a).  Interfamilial variability of insC position within the X-unit in the VNTR. SMRT sequencing consistently enabled us to generate sequence reads that allowed for complete assembly of both the wildtype and the ADTKD risk allele carrying the insC. The precise location of the 60-mer unit location harbouring the insC within the MUC1-VNTR could be solved in eight of the nine positive families studied. In family F3 the insC was detected in three affected family members, but for unclear reasons we were not able to locate the precise 60-mer unit here.
Position of the insC ranged (counting only the variable units; see Fig. 4a, Table 1) from the second up to the 39 th unit and always occurred on the background of the X unit which is the most abundant 60-mer unit found within the VNTR (51.4-62.0%). The exact position of the causative variants was consistent in all affected individuals within families. Two families (F5 and F7) carried the insC in the 39 th variable unit, but their VNTR assembly and allele size was different (Table 1, Figs 3 and 4a). Of note the Swiss family F1 demonstrated exactly the same risk allele assembly and position of the prototypic mutation within the second X unit as one family (family 4) previously reported by Kirby et al. 5 Identical assembly could indicate that these two families are related.

SMRT sequencing in unclear non UMOD/HNF1B/REN ADTKD families. To further investigate our
hypothesis that the VNTR might be a mutational hotspot prone to other mutations (e.g. small deletions etc.) within or outside of the 7C homopolymer a small cohort of three large families (F12, F13, and F16), three smaller families (F10, F14 and F15) and one sporadic case (F11), all compatible with a diagnosis of ADTKD-MUC1, were analysed ( Table 2, Fig. 4b).
No other small insertion or deletion resulting in a frameshift consequence and no causative structural variant could be detected by complete VNTR assembly, although family F12 demonstrated linkage to a 12.9 Mb interval encompassing the MUC1-locus on chromosome 1q21 with a maximum parametric LOD score of 1.8 under assumption of autosomal dominant inheritance (data not shown; only affected individuals analyzed) 25 . No linkage data were available of the latter two larger families F13 and F16. Large deletions and loss of heterozygosity (LOH) could be excluded in this cohort as we were always able to reconstruct two parental VNTR alleles.
MUC1 mutations outside the VNTR were excluded by Sanger sequencing in all families (plus whole exome sequencing in F12).

Identification of novel repeat variants.
We found altogether nine novel repeat variants and annotated them in continuation of the one letter code introduced by Kirby et al. (Fig. 5) since there is no suitable HGVS nomenclature for complex tandem repeat (TR). Three of these repeat variants include the following non-synonymous SNPs on relative repeat position 7G > A (p.G3S), position 22C > A (p.P8T), position 40C > A (p.P14T) termed Q, R, and N. In addition three novel repeat variants with synonymous SNPs were found, which were termed L (rel. position 18G > A, p.P6=), O (rel. position 42G > A, p.P6=), and P (36G > A, p.P6=). The remaining repeat variants 5C, M and S consist of previously described SNPs in a novel combination. Variant 5C is a fusion of previously described variants 5 (fixed variant) and C (non-synonymous SNPs at rel. positions 8G > A (p.G3D) and 59C > A (p.P20Q)). The repeat variant M contains one synonymous SNP at rel. position 21C > A (p.P7=) and variant S three non-synonymous known SNP´s at relative positions 27C > G (p.D9E), 29C > G (p.T10S), and 58C > G (p.P20A).

Discussion
Sequencing with the PacBio RS II system based on single molecule real time (SMRT) technology offers the benefit of extraordinarily long sequencing reads with high accuracy. We adapted SMRT for targeted sequencing of the mutational hotspot in the coding VNTR of the MUC1 gene. The need for double stranded DNA and the comparatively low capacity of the system made the generation of PCR amplicons spanning the VNTR indispensable. SMRT sequencing has been previously performed for complex genomic regions, including another mucin, encoded by the MUC5AC gene, mainly secreted in the respiratory tract 26 . The discovery of the genetic basis of this long sought ADTKD type cannot be overemphasized as it has opened the field of tandem repeats to the spectrum of hereditary kidney disease. The paper by Kirby et al. also points out to the pitfall of missing hereditary disease contained in those repetitive sequences with the current strategies employed in gene identification studies 5 .
Tandem repeats (TR) are a major class of repetitive DNA which makes up for a significant amount of the humane genome. TR the size of 9 to 80 bp that are repeated a few to more than a hundred times are also called VNTRs as they can be highly polymorphic within a population. From a molecular point of view tandem repeat polymorphisms (TRPs) provide a dynamic source of genomic variability as their mutation rate is higher and the extent of polymorphism (multiallelic) is more diverse than that contained in single nucleotide polymorphisms (SNPs). Expansions in short/simple TR´s are known to cause many monogenic disorders, which mainly result in neurodegenerative disease.
The technical dilemma caused by the difficult nature of the VNTR's genomic sequence (tertiary structure resulting from the repetitive sequence, GC content, and repeat length) is well illustrated by the surprisingly few publications on ADTKD-MUC1 after the initial description. These inherent problems have created a diagnostic bottleneck for MUC1 testing worldwide with only a few laboratories performing the probe extension assay and only one laboratory being able to validate the snapshot assay by mass spectrometry 4,5,20,27 .
SMRT sequencing allowed complete assembly of both the wildtype and the ADTKD risk allele and precise positioning of the insC in all families (except family F3). In contrast even mutations occurring at the 5′ and 3′ borders of the VNTR are usually not accessible to Sanger sequencing or to short read based NGS technologies.
The reason why several affected family members from F3 repeatedly showed ambiguous position of the insC on various X units remains unexplained, but could not be related to a low number of complete VNTR reads. Interestingly, the prototypic mutation insC was exclusively found on the background of the X unit in our cohort. In contrast the only other publication providing data on VNTR assembly identified the insC on the variable X-and B-unit and on the"fixed" 5-unit 5 . Our observation could be an accidental finding as the number of analysed families in our study was still small and the X unit constitutes the most abundant unit within the VNTR. A binomial test calculates a one-tailed P-value of 0.016 for the chance of exactly 8 mutational hits in the X unit in 8 families, which does not support the hypothesis of a chance finding.
By pinpointing the precise repeat unit carrying the prototypic insC we could confirm the allelic heterogeneity of MUC1 associated kidney disease. The mutational spectrum is further expanded by a recent study that identified the first causative MUC1 mutation close to but outside of the VNTR that is accessible to standard sequencing   28,29 .
In the largest study on ADTKD-MUC1 24 families, with the insC identified by the snapshot method, have been described with high intra-and interfamilial variability regarding the progression of renal disease and onset of ESRD 20 . They speculated on the existence of anticipation (worsening of the phenotype in succeeding generations) in some families, a phenomenon that has been noted in the very first publication on the MUC1 locus on chromosome 1q21 25 . Although the underlying molecular mechanism in a non-expansion TR would be unclear at the moment, precise structural information on the MUC1-VNTR (allele sizes, position of the frameshift mutation, polymorphisms that could possibly alter the structure and thereby the toxicity of the frameshift protein) could provide valuable information to predict a potential role of the hypervariable VNTR in anticipation/disease progression from observer bias. To our knowledge this study contains the most comprehensive data on MUC1-VNTR topology, but was neither designed to nor is powered to answer those questions. Assessing the impact of coding and noncoding variation for potential genotype phenotype correlations in ADTKD-MUC1 will  [30][31][32] . According to these studies, the flanking regions were characterized by a highly conserved pattern of variant units A-F in genomic DNA from individuals of different ethnic background. It was concluded that the variant repeat topology could have resulted from multiple duplication events in the phylogeny of the MUC1 repeat domain. Later it was shown that the entire domain contains clusters of variant A and B units interspersed in clusters of X units 30 . As the A, C, E, and F unit exhibit replacements of the second proline residue in the STAPPA motif, the expressed mucin is characterized by a pronounced underglycosylation here. Again at the protein level the DT to ES replacement within the immunodominant DTR motif results in higher conformational flexibility at this site and altered B lymphocyte responses 32 . Upper case letters were continued following the one letter code introduced by Kirby et al. Nucleotide sequence differences to the canonical repeat variant "X" are highlighted in dark grey. The repeat type 5C is a fusion of repeat types 5 and C. It is composed of the first half of the uniform repeat type 5 and the second half of the repeat type C. At the top corresponding amino acid sequences are shown for the canonical and variant-containing units. Synonymous variants are marked in light blue. Nonsynonymous variants and corresponding amino acid exchanges are highlighted in different colors. Among other variants we identified two novel non-synonymous variants in repeat units N and Q resulting in amino acid replacements G to S in the HGT motif (CCG to ACG in unit Q) and P to T in the PAP motif (CCG to ACG in unit N).
SCIENTIFIC REPORTS | (2018) 8:4170 | DOI:10.1038/s41598-018-22428-0 require a larger number of resolved VNTR datasets. Thus, our study provides a methodological basis to sequence this region which might help to uncover the factors explaining the considerable intra-and interfamilial variability in progression of MUC1 related kidney disease.
As the pathomechanisms in ADTKD-MUC1 are likely linked to inflammatory and immune processes in the kidney tissue and structural variation in the VNTR domain has been shown to alter B lymphocyte response, it is not beyond reason to assume a potential role of the VNTR's topology here (please see Fig. 5 for further information) [30][31][32] .
In summary we recommend to perform first line or complementary SMRT sequencing in all snapshot positive and negative suspected ADTKD-MUC1 cases (to potentially identify novel mutations occurring within the VNTR in the latter). We propose to document next to the precise position of the causative mutation, allele lengths and (sequence) topology of both the wild type and the disease causing allele in all future studies until the role of the VNTR becomes clearer.