A comprehensive WGS-based pipeline for the identification of new candidate genes in inherited retinal dystrophies

González-del Pozo, María; Fernández-Suárez, Elena; Bravo-Gil, Nereida; Méndez-Vidal, Cristina; Martín-Sánchez, Marta; Rodríguez-de la Rúa, Enrique; Ramos-Jiménez, Manuel; Morillo-Sánchez, María José; Borrego, Salud; Antiñolo, Guillermo

doi:10.1038/s41525-022-00286-0

Download PDF

Article
Open access
Published: 04 March 2022

A comprehensive WGS-based pipeline for the identification of new candidate genes in inherited retinal dystrophies

María González-del Pozo ORCID: orcid.org/0000-0003-4858-938X^1,2^na1,
Elena Fernández-Suárez ORCID: orcid.org/0000-0001-6478-5968^1,2^na1,
Nereida Bravo-Gil^1,2,
Cristina Méndez-Vidal^1,2,
Marta Martín-Sánchez¹,
Enrique Rodríguez-de la Rúa^3,4,
Manuel Ramos-Jiménez⁵,
María José Morillo-Sánchez³,
Salud Borrego^1,2 &
…
Guillermo Antiñolo ORCID: orcid.org/0000-0002-2113-074X^1,2

npj Genomic Medicine volume 7, Article number: 17 (2022) Cite this article

6773 Accesses
7 Citations
19 Altmetric
Metrics details

Subjects

Abstract

To enhance the use of Whole Genome Sequencing (WGS) in clinical practice, it is still necessary to standardize data analysis pipelines. Herein, we aimed to define a WGS-based algorithm for the accurate interpretation of variants in inherited retinal dystrophies (IRD). This study comprised 429 phenotyped individuals divided into three cohorts. A comparison of 14 pathogenicity predictors, and the re-definition of its cutoffs, were performed using panel-sequencing curated data from 209 genetically diagnosed individuals with IRD (training cohort). The optimal tool combinations, previously validated in 50 additional IRD individuals, were also tested in patients with hereditary cancer (n = 109), and with neurological diseases (n = 47) to evaluate the translational value of this approach (validation cohort). Then, our workflow was applied for the WGS-data analysis of 14 individuals from genetically undiagnosed IRD families (discovery cohort). The statistical analysis showed that the optimal filtering combination included CADDv1.6, MAPP, Grantham, and SIFT tools. Our pipeline allowed the identification of one homozygous variant in the candidate gene CFAP20 (c.337 C > T; p.Arg113Trp), a conserved ciliary gene, which was abundantly expressed in human retina and was located in the photoreceptors layer. Although further studies are needed, we propose CFAP20 as a candidate gene for autosomal recessive retinitis pigmentosa. Moreover, we offer a translational strategy for accurate WGS-data prioritization, which is essential for the advancement of personalized medicine.

Genome-wide association studies

Article 26 August 2021

Utility of polygenic scores across diverse diseases in a hospital cohort for predictive modeling

Article Open access 12 April 2024

Genomic data in the All of Us Research Program

Article Open access 19 February 2024

Introduction

Inherited retinal dystrophies (IRD) constitute a group of clinically and genetically heterogeneous, rare Mendelian disorders that lead to irreversible and progressive visual impairment due to dysfunction or loss of photoreceptors¹. The most common form of IRD is retinitis pigmentosa (RP, ORPHA:791) defined by the primary death of rods, which results in night blindness and constriction of the visual field². To date, pathogenic variants in 89 genes can cause RP (RetNet, the Retinal Information Network, https://sph.uth.edu/retnet/, accessed January 2021), however, an estimated 40% of cases remain without a genetic diagnosis after testing for the most prevalent retinal genes, suggesting that the RP in these patients could be attributed to mutations that were either undetectable by the current methods, or not routinely analyzed, such as deep-intronic variants, complex structural variants (mobile elements insertions, inversions, translocations, etc.), or variants in yet unidentified disease genes^3,4,5,6.

In this scenario, identifying novel disease genes or variants is important to increase the diagnostic rate and to facilitate new approaches for clinical care of IRD patients. The advances in next-generation sequencing (NGS) technologies have ushered in a new era for genetic diagnosis and disease-gene discovery⁷. Recent studies have reported the clinical utility of Whole Genome Sequencing (WGS), especially for rare diseases^8,9, and its large expectations on personalized medicine¹⁰, highlighting that the use of WGS as a first diagnostic strategy could constitute a unique and powerful analysis. This approach provides a bigger evenness of coverage and the proportion of transcripts covered in their entirety compared to targeting sequencing, allowing a superior detection of structural variants, variants in non-coding regions, and detection of variants in GC-rich regions¹¹. However, the clinical translation of this approach is currently limited due to its still high cost, a large amount of generated raw data, and the lack of efficient protocols for the WGS-data analysis^12,13. Nevertheless, in recent years, the cost of generating genome information has shown a rapid decline making it possible a greater application of WGS as in the clinical research as in some health care systems^9,10. Concerning bioinformatic processing, it is still necessary the application of advanced filters to categorize variants efficiently¹⁰. In this regard, deleteriousness predictors provide the opportunity to facilitate variant prioritization in WGS studies. Multiple prediction algorithms have been developed but it is still unclear which ones and how they should be applied in human disease studies to minimize both false-positive and false-negative rates¹⁴.

The aim of this work was to design a WGS-based pipeline for the identification of potentially pathogenic variants in a group of previously analyzed RP patients without genetic diagnosis. In this regard, we conducted a comparative study of 14 variant pathogenicity prediction tools to choose the most reliable cutoff for variants associated with IRDs. These results enabled us to optimize the filtering and prioritization of WGS data in order to rapidly obtain a dataset enriched in likely pathogenic variants. The application of our workflow allowed us to discover a variant in the CFAP20 gene in one family. Here, we propose CFAP20 as a new likely candidate gene for arRP.

Results

Establishment of the optimal cutoffs

The carefully curated training dataset comprised a total of 942 distinct rare SNVs located in any of the IRD associated genes, including 247 pathogenic or likely pathogenic variants and 695 benign or likely benign variants (Supplementary Table 1). ROC curves for each tool were computed using the prediction scores from the training dataset (Fig. 1A, B). Of note, a subgroup of 99 splicing variants (34 pathogenic/likely pathogenic variants and 65 benign/likely benign variants) was used for the ROC curves of the splicing predictors.

**Fig. 1: The ROC curves and combinatorial analysis results for different pathogenicity prediction tools.**

The specificities of each prediction method were evaluated according to AUC values. We found that all values were significantly >0.5 (P-value < 0.0001) indicating that all methods were suitable to discern between pathogenic and benign variants. For the training dataset, the predictor with a higher AUC was CADDv1.6 (AUC = 0.891) (Fig. 1A), whereas for the splicing subset the predictor with higher AUC was NNS (AUC = 0.971) (Fig. 1B).

Although three different approaches were conducted to establish the optimal cutoff for each prediction method, the optimal threshold was defined as the value in which the sensitivity is 90% for each predictor (Table 1). In order to visually compare the distribution of the filtered variants using both the cutoff most widely described in the literature and the cutoff calculated in this study, dot histograms were represented (Supplementary Fig. 1).

Table 1 Relevant statistical data and the optimal cutoffs for the 14 prediction tools tested in this study.

Full size table

Optimization and validation of the discovery pipeline

As the estimated FP rates, with the exceptions of CADD and the splicing tools, were not acceptable in most cases (≥35%) (Table 1), a combinatorial analysis was carried out. For this purpose, we applied our cutoff values to filter the training dataset and calculated the TP and FP rates in each of the 109 combinatorial models (Supplementary Table 2). Thirty-six of the predictor combinations met the criteria (TP ≥ 85%, FP ≤ 35%, and Missing values ≤ 30%), including 11 non-splicing and 25 splicing tool combinations. Models passing quality filters were graphically assessed by bubble plots (Fig. 1C, D). Considering the balance between FP and TP rates, the optimal combination of splicing tools was “SpliceAI + NNS”, which presented the lowest FP rate (3.08%) with a still elevated TP rate (94.18%). On the other hand, four of non-splicing predictors: “CADDv1.6”, “CADDv1.6 + MAPP”, “CADDv1.6 + MAPP + Grantham”, and “CADDv1.6 + MAPP + Grantham + SIFT” were initially proposed as the most suitable options.

To finally determine the most enriched approach in likely causal variants, the IRD validation dataset was submitted to the four combinations of the non-splicing tools. This dataset comprised a total of 5085 distinct variants in known IRD genes, including 49 pathogenic causal mutations. Taking into account the ratio of causal and non-causal variants prioritized in each model (Fig. 2A), the “CADDv1.6+MAPP + Grantham+SIFT” combination showed to be the most accurate option with enrichment of causal variants of 28.57%.

**Fig. 2: Validation of the discovery pipeline in three different inherited diseases.**

The application of the discovery pipeline (Fig. 3) in the IRD validation dataset allowed us to validate the 89.80% (44 out of 49) of the causal variants. The remaining 10.20% (5 out of 49) were filtered out by CADDv1.6 cutoff and consisted of two in-frame variants, two splicing variants in non-canonical positions, and one missense variant (Fig. 2B). Additionally, the discovery pipeline was applied in the dataset from the hereditary cancer cohort and neurological diseases cohort to evaluate its efficacy in these diseases. Regarding the hereditary cancer cohort, the 97.83% (90 out of 92) of the causal variants were validated (Fig. 2C). In the neurological diseases cohort, our algorithm allowed us to recover the 95.65% (44 out of 46) of the causal variants (Fig. 2D). The nature of the variants that integrate each validation dataset can influence the validation ratios, being the highest for the hereditary cancer dataset, which is composed of 70, 66% of loss-of-function variants (stop gain, frameshift, and canonical splicing), in contrast to the 44.9% of loss-of-function variants of the IRD cohort. Furthermore, the highest ratio of causal and non-causal variants was obtained applying the same combination of tools (“CADDv1.6+MAPP + Grantham+SIFT”).

**Fig. 3: Discovery pipeline for WGS-data analysis.**

Application of the discovery pipeline

The discovery dataset encompassed more than twelve million of SNVs, of which 7,724,071 variants passed the recurrence and multiallelic variants filters. The application of the frequency filtering revealed 523,478 variants, of which 1524 variants passed “CADDv1.6 + MAPP + Grantham+SIFT” filter (Fig. 4A).

**Fig. 4: Variants filtering and prioritization scheme using the optimal cutoffs vs. the literature cutoffs.**

The pedigree filtering applied below is exclusive of each family, so the number of variants pending to be manually evaluated varies according to the initially assumed mode of inheritance and the genotype/phenotype of the sequenced individuals as a first approach (Table 2). In simplex families, variants consistent with autosomal recessive, autosomal dominant, and X-linked traits have been considered. In consanguineous families, variants that were homozygous in affected patients but not in their unaffected relatives were first prioritized, followed by the compound heterozygous variants.

Table 2 Variants prioritized using the WGS pipeline in the RP families of the discovery cohort.

Full size table

This approach resulted in the identification of 45 rare SNVs prioritized in the seven RP families of the discovery cohort (~6 variants per family), all of them were absent in homozygous status in unrelated controls (0 homozygous in gnomAD database). According to ACMG¹⁵ criteria, these variants were classified as pathogenic (n = 6), likely pathogenic (n = 1), variants of uncertain significance (n = 33), and likely benign variants (n = 5), which were located in 42 different genes (Table 2). Eleven out of these genes have been previously associated to a human phenotype according to OMIM database (accessed in November 2021) (Supplementary Table 3). Of note, the RPGR orf15 region was manually inspected in the 14 patients of the discovery cohort due to its difficulty to sequence. We tested the coverage of this region, resulting in a mean coverage of 10.53x in men and 20.87x in women within the most complex interval (chrX:38144794-38146346; GRCh37) (Supplementary Fig. 2). Non-causal variants were detected here.

The number of variants remaining after the application of each filtering step in family A is depicted in Fig. 4. The pedigree filter further reduced the number of candidate pathogenic variants to 160, including ClinVar pathogenic variants and variants passing “SpliceAI+NNS” thresholds.

As family A was consanguineous, two homozygous variants were firstly prioritized, one in the CFAP20 gene (c.337 C > T; p.Arg113Trp), and the other in the FAHD2A gene (c.328 T > C; p.Cys110Arg); none of which have been previously associated with a human phenotype in OMIM database. It should be noted that, when the threshold values previously described in the literature were used (Supplementary Table 4), the number of variants in each step was greater, being up to 90% more for manual curation (from 2 to 20) (Fig. 4B).

During the manual prioritization, CFAP20 was selected for further analysis, since it is a ciliary gene^16,17,18 that interacts with a known RP gene (RPGeNet¹⁹). Besides, the function and mutational data reported in the literature^20,21 stronger supported the prioritization of CFAP20 over FAHD2A, which was discarded based on its poor functional and mutational bibliographic support, its lack of interaction with other known RP genes, and the milder effect of the variant according to the ACMG¹⁵ criteria (Table 2). Sanger sequencing confirmed segregation of the CFAP20 variant with the RP in the five members of Family A (Fig. 5A). Remarkably, up to now, this variant has been detected only in heterozygous state in 5 out of 165,392 unrelated controls (MAF = 0.0000121) from different public allele frequency databases such as gnomAD, EVS, Bravo, 1000 g, and CSVS²², which collects genomic data from Spanish-local population. Moreover, we investigated how tolerated were variants in the CFAP20 gene in the base of the gnomAD constraint metric LOEUF. The statistical performance denoted outstanding discrimination by the LOUEF score, reflected in the high AUC value obtained (AUC = 0.932) in the ROC curve analysis. The LOUEF score for the CFAP20 gene is 1.008 which is under our established cutoff (≤1.455) (Supplementary Fig. 3).

**Fig. 5: Segregation studies and in silico pathogenicity assessment of the candidate variants identified in *CFAP20*.**

The manual prioritization in the rest of the families (Families B–G) is resulting in a number of prioritized variants and genes (Table 2). However, further expression, localization, segregation, and interaction studies are needed to evaluate the role of these variants in the etiopathogenesis of the RP in these families.

Regarding the SVs analysis, after applying the pedigree and manual filters, no variants consistent with the disease were identified in the discovery cohort.

Protein structural analysis, expression assays, localization studies, and mutational screening of CFAP20

To evaluate evolutionary conserved positions in CFAP20, we performed the alignment of 11 CFAP20 orthologous sequences using Jalview. The strong evolutionary conservation of the CFAP20 protein and the complete physicochemical conservation of the mutated residue Arg113 is shown in Fig. 5B.

Furthermore, three-dimensional modeling for CFAP20 using PyMOL Molecular Graphics System showed that Arg113, a positively charged amino acid, interacts with three other amino acids through hydrogen bonding (Fig. 5C). Specifically, Arg113 forms one hydrogen bond with Ser110 and Thr111, and two with Thr120. In silico mutagenesis at position 113 to tryptophan, a non-polar aromatic amino acid, predicted loss of two hydrogen bonding interaction points, (Ser110, and Thr111).

In addition, the protein-protein interaction studies revealed a network, comprised of 25 CFAP20-connected proteins, some of which are involved in ciliary function or forming part of the spliceosome (Fig. 6A). Remarkably, CFAP20 interacts with disease-causing proteins including: (i) ARL2BP, associated with RP, (ii) TBC1D32 and FOXJ1, related with ciliopathies, and (iii) LRRK2 and DICER1, involved in retinal degeneration in animal models.

**Fig. 6: Analysis of CFAP20 interaction network, and expression profiles.**

In order to study the expression of CFAP20 in different human tissues, we used real-time PCR and ready-to-use cDNA from retina, brain, placenta, kidney, and skeletal muscle. As a result, we found that the expression level of CFAP20 mRNA was the highest in adult retina, followed by kidney and placenta (Fig. 6B).

The tissue distribution of human CFAP20 was also investigated by immunohistochemistry using human retina sections from unaffected individuals. Specific immunolabeling using the CFAP20 antibodies was observed, from the stronger to the weaker staining, in the inner segment of the photoreceptor cells, the outer plexiform layer, the nucleus of the cells of the inner nuclear layer, and in the ganglion cells layer (Fig. 6C).

Amplicon NGS sequencing of all coding exons and its intronic flanking regions of CFAP20 revealed no variants consistent with the disease among the 264 additional IRD unsolved cases analyzed.

Clinical findings in the family A

The family A proband, a 43-year-old female, is the first child of first-degree cousin parents with two other unaffected siblings. The patient displayed progressive night blindness with photophobia since age 17 and impaired color vision, poor visual acuity (left eye, 20/100; right eye, 20/63), and concentric narrowing of visual field, at diagnosis. The recent fundoscopic study, and the fundus autofluorescence imaging, were consistent with a clinical diagnosis of typical RP characterized by bone spicule pigmentation, narrowed retinal vessels, loss of the retinal pigment epithelium, and atrophic patches in macula (Fig. 7 A and B). OCT imaging revealed generalized atrophy of the photoreceptor cells layer but relatively preserved in central macula (Fig. 7C). Full-field electroretinography (ERG) revealed completely bilateral extinguished scotopic and photopic responses (Fig. 7D). The abolished ERG responses, the RPE degeneration, and the diminished visual acuity (best-corrected visual acuity of 0.2 in both eyes) indicated an advanced disease. Additional findings included posterior capsular opacification. The patient did not display systemic symptoms consistent with a syndromic phenotype. Other unrelated pathologies present in the index patient were subclinical hypothyroidism and beta-thalassemia.

**Fig. 7: Ophthalmic characterization of the right (OD) and left (OS) eye of the 43-year-old female with RP from the family A.**

Discussion

To date, targeted sequencing, such as gene-panel sequencing and WES, are the NGS approaches more frequently used in the clinical setting. However, the recent advances in WGS have enabled wider use of this technology, even leading to its gradual incorporation in some health systems⁹. Currently, we consider that the cost-benefit balance regarding data quality, analytical efforts, and diagnostic rate indicates that panel-based sequencing is still the most efficient first NGS strategy for the detection of disease-causative genetic variants in IRD, at least in the context of the diagnostic routine of public hospitals²³. However, around 40% of cases remain unsolved after this application, which would be eligible for larger-scale techniques as WGS. Thus, these extended strategies would be applied only as a second step and would not replace panel sequencing. Nevertheless, WGS is starting to emerge as an efficient first-level test²⁴, thanks to its ability to screen for both deep-intronic regions and variants in novel genes, and its greater uniformity of coverage allows better detection of structural variants. Before proceeding to the identification of variants in novel genes, it may be helpful to discard the presence of any pathogenic allele types in genes already involved in the disease, only in this way, the level of uncertainty associated with the causality of a variant in a new candidate gene would be reduced. However, one of the most important barrier to implementing WGS in the clinical practice is data management and storage²⁵. The lack of systematized protocols to filter and prioritize causative variants in WGS data, prompted us to develop an effective approach to be used as a standardized workflow for the identification of disease-relevant variants in novel candidate genes for IRD.

Deleteriousness prediction methods are instrumental for variant effect interpretation helping to prioritize large amounts of data generated by sequencing projects. This study provides a comprehensive analysis of which predictor tool, or combination of them, is best suited for discovery applications, as well as which are the most reliable cutoffs regardless of those reported in the literature. In this regard, although CADDv1.6 prediction showed the highest performance, probably because it is an ensembled method that provides scores for all types of variants²⁶, the filtered FP rate was still very elevated. The combination of this method with the predictors MAPP, Grantham, and SIFT enabled us to further reduce the number of neutral variants. Additionally, the use of our customized cutoffs, instead of the published thresholds, allowed us to significantly reduce the number of variants on the common VCF file, resulting in an increased effectiveness by reducing the number of variants for manual filtering. Of note, although this pipeline could be used for the analysis of both, WES and panel data, it is specially designed for WGS data, since not all annotation tools work well with large sequencing experiments²⁷.

Our results demonstrated the importance of integrating different prediction tools in a standardized pipeline and applying filters validated and optimized using local carefully curated datasets. In fact, previous work highlighted the need for a detailed catalog of local variability since there are relevant differences in allelic frequencies of both polymorphic and pathogenic variants between populations²⁸. For this reason, working with local datasets is crucial for an accurate establishment of the clinical significance of candidate variants. Although other authors^26,29 have performed multiple comparisons among prediction methods, the input data was taken from public databases which may not be properly curated or be deficient in local data, leading to the misclassification of variants and limiting the accuracy of the resulting performance estimations^26,29. In addition, unlike other studies in which variants with high MAF composed the neutral dataset^29,30, our group of benign variants was previously filtered by MAF letting us test how well a predictor performs when the benign variants have the same allele frequency that known pathogenic variants. This fact approaches our study to a real filtering scenario being able to establish a more precise fixed threshold. The favorable results obtained using heterogeneous validation cohorts demonstrated that our optimized pipeline could be applied to the analysis of NGS data from individuals with other genetic disorders, not only for IRDs patients. Hence, the implemented translational strategy allows an accurate prioritization and assessment of NGS data in the clinical setting, which is essential to establish personalized medicine.

Remarkably, the application of our pipeline to the discovery cohort allowed the identification of one homozygous variant (c.337 C > T; p.Arg113Trp) in the candidate gene CFAP20 as the most likely cause of non-syndromic RP in one of the families. Previous studies, involving unicellular^16,31, and multicellular organisms¹⁸, showed that Bug22 (ortholog name of the cilia and flagella associated protein 20, CFAP20) plays a critical role in cilia and flagella formation and morphogenesis. Bug22 depletion causes defects in ciliary and flagellar morphology and motility in Paramecium¹⁶, Chlamydomonas¹⁷, and Drosophila¹⁸ (Supplementary Table 5). Of note, knockdown experiments in Zebrafish¹⁷ revealed a phenotype consistent with ciliary dysfunction³² including a curved body axis, short somite length, and defective heart-looping orientation. In addition, CFAP20 has also been detected in the primary cilium-derived photosensory rod outer segments of mouse retina³³. These results implied that CFAP20 may be also important for assembly or stability of cilia in vertebrates¹⁷. Moreover, depletion of CFAP20 in human hTERT-RPE1 cells resulted in the appearance of longer cilia, and reduced axonemal polyglutamylation¹⁸, demonstrating the implication of CFAP20 in the regulation of post-translational modifications of the ciliary axoneme in human cells. The fact that almost one-quarter of known photoreceptor degeneration genes are associated with ciliary structure or function^33,34, along with the high evolutionary conservation of CFAP20, and its low LOEUF score (below our cutoff), support the prioritization CFAP20 as a candidate gene for autosomal recessive IRD.

Sequencing of more than one individual per family and the application of the recurrence filter has allowed us to refine the number of likely causative homozygous variants, which in consanguineous individuals would be expected to be higher. Our patient, born to consanguineous parents, harbored a homozygous rare missense variant in CFAP20 (c.337 C > T; p.Arg113Trp), and received a clinical diagnosis consistent with non-syndromic RP. Recently, a conference report described another family with three affected individuals with clinical manifestations partially resembling the phenotype observed in our proband, including RP with an onset in adolescence²¹. These patients harbored two heterozygous CFAP20 variants, one missense, and one canonical splicing variant, segregating in the family²¹. In addition, the three siblings had a history of learning disabilities in school and motor coordination difficulties, suggesting the implication of CFAP20 in a syndromic form of RP. As occurs with mutations in ~30 ciliary genes³⁵, the manifestation of extra-ocular features can vary from patient to patient^36,37, depending on the severity of the mutations^36,37, the genetic background³⁸, the presence of genetic modifiers³⁹ or tissue-specific alternative splicing⁴⁰, among other factors. Interestingly, depending on the mutation, the same ciliary gene can cause syndromic or non-syndromic retinopathies, thus emphasizing the highly refined specialization of the photoreceptor neurosensory cilia, and raising the possibility of photoreceptor-specific molecular mechanisms⁴¹.

Further, we observed high CFAP20 gene expression in the retina compared to other tissues, and localization in the inner segment of photoreceptor cells, suggesting that CFAP20 could have a role in the human retina. Moreover, the molecular modeling of CFAP20 revealed that the p.Arg113 residue may be involved in some interactions with important biological roles. In fact, p.Arg113 was predicted to interact with p.Thr111, one of the seven consensus positions in species that have cilia or centrioles, suggesting a relevant role of this specific residue in the development and function of the cilia or centrioles¹⁶. These data suggest that the CFAP20 variant, p.Arg113Trp, could affect protein folding and interaction with the consensus residue p.Thr111.

PPI network analysis of CFAP20 significantly contributed to our understanding of potential relationships between CFAP20 interactors and retinal disease mechanisms. One of the top-ranked interactors of CFAP20 was ARL2BP, a known autosomal recessive RP gene⁴² required for the formation of ciliary doublets of the photoreceptors and for the morphogenesis of its outer segment⁴³. We also found other ciliopathy associated partners of CFAP20, namely, TBC1D32, mutated in patients with oro-facio-digital syndrome type IX^44,45; FOXJ1, implicated in primary ciliary dyskinesia 43⁴⁶; LRRK2, a Parkinson disease 8 gene, involved in retinal degeneration by a gain-of-function mechanism in Drosophila^46,47; and DICER1, which deficit induces retinal pigmented epithelium degeneration in a mouse model of age-related macular degeneration⁴⁸. The establishment of a robust interaction network led us to hypothesize that the variant identified in our family might alter some of the interactions with other crucial proteins involved in the etiology of retinal degeneration. However, further functional studies that deepen our understanding of these interactions and their role in disease are needed to test this hypothesis.

Clinically, genotype and phenotype correlations are only now starting to emerge for CFAP20, which demands the comprehensive screening of larger patient cohorts to better understand disease pathogenesis in new cases with candidate CFAP20 variants. Nevertheless, if confirmed, CFAP20-associated disease would be clinically variable, ranging from isolated to syndromic RP with a spectrum of neurological defects. The identification and characterization of additional cases will contribute to a better understanding of the factors influencing the variable expressivity of clinical features possibly associated with mutations in this novel candidate gene.

In conclusion, the arrival of the WGS techniques into the clinical practice has aroused great expectations about its potential for identifying the genetic bases of diseases. In this scenario, the development of a translational pipeline for the analysis of WGS data in the clinical setting, based on the reliable use of computational prediction tools, becomes a priority. The use of statistically proven filtering criteria using in-house curated patient genetic data, reinforced the huge diagnostic and discovery capacity of WGS. Our study suggests that the combination of several prediction tools and the use of customized cutoff values improve enormously WGS-data management. Herein, the application of our pipeline has allowed us to identify, in one family, a homozygous variant in CFAP20, a potential candidate gene for autosomal recessive RP. Therefore, our study could contribute to expand the mutational landscape of ciliary genes associated to human diseases, reinforcing the importance of this complex organelle as a key player in photoreceptor degeneration.

Methods

Subjects and previous NGS studies

The research was conducted in accordance with the tenets of the Declaration of Helsinki (Edinburgh, 2000)⁴⁹, and all experimental protocols were approved by the Institutional Review Board of the University Hospitals Virgen del Rocio and Virgen Macarena (Spain). Written informed consent was obtained from all participants. The genomic DNA of all subjects was isolated from peripheral blood using standard procedures. All affected individuals underwent a thorough ophthalmic evaluation as described elsewhere⁵⁰.

This study involved 429 individuals grouped in three different cohorts: the training cohort (n = 209), the validation cohort (n = 206), and the discovery cohort (n = 14) (Fig. 8). The training cohort comprised 209 IRD patients selected among those who received a genetic diagnosis at the Department of Maternofetal Medicine, Genetics and Reproduction of the University Hospital Virgen del Rocio of Seville in the period from 2016 to 2019 using different NGS targeted approaches^51,52,53, among others. The accurate genetic characterization of these patients enabled this group to design and define the prioritization pipeline.

**Fig. 8: Composition of the cohorts and datasets used in the study.**

The validation cohort was composed of 206 additional, unrelated patients who also underwent targeted sequencing at our department (unpublished data). This cohort was composed of three sub-cohorts of affected patients from IRD (n = 50), hereditary cancer (n = 109), and neurological diseases (n = 47). The sub-cohort of IRD patients including 33 patients with a genetic diagnosis and 17 patients without a genetic diagnosis to conduct a blind trial, allowing an unbiased evaluation of the parameters proposed with the training dataset. In order to assess if our pipeline could be applied to the analysis of other inherited diseases, the hereditary cancer cohort and the neurological diseases cohort, comprising genetically diagnosed individuals, were employed.

The discovery cohort involved 14 individuals, of which nine were affected and five were unaffected members, belonging to seven unsolved IRD families (Families A–G). WGS was conducted in all the individuals of the discovery cohort, and a comprehensive analysis of the 274 genes previously associated with IRD (RetNet), including coding and non-coding regions, was performed as previously described⁵⁴, but no causal variants were detected in any of these genes. The discovery cohort was employed for the application of the validated workflow in order to achieve their genetic diagnosis and the identification of new disease genes. Interestingly, to facilitate the filtering and prioritization of variants in novel genes, the unaffected individuals of the rest of the families were used as pseudo-controls of the family in the study.

Additionally, 264 unsolved IRD individuals from our cohort were collected in order to conduct the mutational screening of the novel candidate genes.

The genomic data of the individuals belonging to the three cohorts were combined using the VCF sort tool⁵⁵ and the VCF combine tool⁵⁶. The multi-sample VCF files comprised the study datasets (Fig. 8) enabling the application of the pipeline in a more efficient way.

Curation of the training dataset

The training dataset composed of SNVs affecting IRD genes was first filtered by MAF ( ≤ 0.01) and by the number of homozygous individuals in GnomAD (0, 1). The resulting variants were then classified according to ACMG¹⁵, using VarSome⁵⁷ v10.1 as a support, and their clinical association in multiple databases (ClinVar, LOVD, HGMD professional, and the literature review). This categorization allowed us to differentiate two groups of variants: (i) Pathogenic and likely pathogenic; and (ii) Benign and likely benign.

The statistical analysis of the splicing predictors was conducted using a subgroup of variants affecting intronic positions ±10 and the first/last codon of the exons. This subgroup was similarly classified as: (i) Pathogenic and likely pathogenic; and (ii) Benign and likely benign attending to the same criteria mentioned above.

Those changes that were not clearly classified in these categories (Variants of Unknown Significance) were discarded for the statistical analysis.

Predictive tools tested in this study

To obtain the prediction scores used in the statistical analysis, the training dataset was annotated using Alamut® Batch v1.11 software (Interactive Biosoftware), Bystro Genomics²⁷, and Ensembl Variant Effect Predictor (VEP, web interface release 104)⁵⁸ (Supplementary Table 4).

Alamut® Batch is based on efficient external prediction tools reporting update information, of which we used the deleteriousness prediction scores for Sorting Intolerant From Tolerant⁵⁹ (SIFT), Grantham⁶⁰, PhastCons⁶¹, PhyloP⁶², Multivariate Analysis of Protein Polymorphism⁶³ (MAPP), Splicing Predictions in Consensus Elements⁶⁴ (SPiCE), Splice Site Finder-like⁶⁵ (SSF), MaxEntScan⁶⁶ (MaxEnt), and NNSplice⁶⁷ (NNS). Bystro Genomics provides three prediction methods: PhastCons-100way, PhyloP-100way, and CADDv1.3. Since the CADD version provided by Bystro is only defined for single-nucleotide variants, a more recent version of CADD (GRCh37-v1.6) was also tested, which was obtained from VEP annotation. This variant annotator gives also the SpliceAI⁶⁸ prediction allowing its assessment. Therefore, two different versions of PhastCons, PhyloP, and CADD were evaluated independently to assess the most efficient method.

To compare the performance of the quantitative score of these prediction methods, SIFT and MAPP scores given by Alamut® Batch were converted, so that a higher score indicates a higher risk of deleteriousness. Similarly, scores of splicing tools SSF, MaxEnt and NNS were converted into the percent variation between the scores for the wild-type sequence and variant sequences. Among the four different delta scores (DS) provided by SpliceAI, the maximum score was used (Supplementary Table 4).

In addition, motivated by the fact that genes that are crucial for the function of an organism will be depleted of loss-of-function variants in natural populations, whereas non-essential genes will tolerate their accumulation⁶⁹, we evaluated the tolerance to inactivation of the novel candidate genes using the constraint metrics from gnomAD. Among them, the LOEUF Score (“loss-of-function observed/expected upper bound fraction”) was used for its good performance to improve molecular diagnosis and advance in the understanding of disease mechanisms⁷⁰.

Comparison of the predictive tools

To calculate potential cutoff values with a certain degree of sensitivity and specificity for each of the predictive tools, we conducted receiver operating characteristic (ROC) curves using the prediction scores of the training dataset and the ROC curve toolbox of SigmaPlot v14 (Systat Software, Inc). Resulting data were used to establish the optimal cutoff for each prediction method by using three different approaches: Youden’s index⁷¹, the cutoff value in which sensitivity is equivalent to specificity⁷², and the cutoff value in which sensitivity is 90%.

The area under the ROC curve (AUC) was used to compare the prediction tools, considering a value <0.5 as the result of chance and statistical randomness⁷³, and a value close to 1 as a sign of utility of the predictor. The DeLong et al. method⁷⁴ was used for the calculation of AUC since our data type was paired. Sensitivity, specificity, and AUC values were computed with a confidence level of 95%. Due to the existence of missing values for the different prediction methods, the pair-wise deletion⁷⁵ was computed to compare ROC areas. The distribution of both categories of variants (pathogenic and benign) along the prediction scores, were also plotted by dot histograms for each predictor (Supplementary Fig. 1), representing the literature cutoffs (Supplementary Table 4) and our selected optimal values (Table 1) as horizontal lines.

Similarly, a ROC curve analysis was conducted to compare the LOEUF Scores from 207 known autosomal recessive IRD (arIRD) genes (https://sph.uth.edu/retnet/) with the LOEUF Scores from 374 olfactory receptor genes as relatively unconstrained genes. Low LOEUF scores indicate strong selection against predicted loss-of-function (pLoF) variation in a given gene, while high LOEUF scores suggest a relatively higher tolerance to inactivation. The LOEUF cutoff in which sensitivity is 90% was obtained (Supplementary Fig. 3).

In order to ascertain which was the optimal combination of predictors that allowed preserving a high True-Positive (TP) rate, reducing the False-Positive (FP) rate, a combinatorial analysis was performed. Based on its ease of subsequent application, a total of 109 combinations of different predictors, divided into three groups, were analyzed as shown in Supplementary Table 2. We conducted bubble plots to visually inspect the data. To select the most appropriate models, the following ad hoc criteria were established: TP rate ≥85%, FP rate ≤35%, and missing values rate ≤30%. If the model met the criteria, we prioritized a lower FP rate.

Finally, the selected combinatorial models were applied in the IRD validation dataset to determine the most optimal filtering steps for our discovery pipeline, according to the percentage of recovered causal and non-causal variants.

Variants filtering, prioritization, and pathogenicity assessment

The validated combination of predictors was applied to the WGS data from the discovery cohort as part of our optimized discovery pipeline (Fig. 3).

Briefly, for SNVs and indels, the recurrence filtering, consisting of removing homozygous variants in the unaffected individuals (pseudo-controls), and the multiallelic variants filtering were applied using the tool “Filter tabular” from open source, web-based platform Galaxy⁷⁶ (VCF 1). On the one hand, the VCF 1 file was annotated with the population allele frequency from gnomAD database using the Slivar v0.2.7 software⁷⁷ and, then, the frequency filtering (MAF ≤ 0.01) was applied. The resulting VCF file (VCF 2) was annotated in VEP and filtered by the CADD (CADD PHRED ≥ 22.25) and SpliceAI (max. SpliceAI DS ≥ 0.405) separately. Variants passing these filters were used to create a third and fourth VCF files which were also annotated with Alamut® Batch. Then, MAPP filtering (≤0.098 or missing), Grantham filtering (≥28 or missing), and SIFT filtering (≤0.175 or missing) were applied for the VCF 3, and NNS filtering (≥62.73 or missing) was applied for the VCF 4.

On the other hand, the VCF 1 was intersected with Clinvar VCF (October 2021) to recover variants classified as pathogenic or likely pathogenic in ClinVar database (ClinVar filtering) regardless of whether they meet the above-mentioned filtering criteria or not. This set of variants (VCF 5) was also annotated in Alamut® Batch. All these prioritized variants converged into a single file enriched in pathogenic SNVs and indels (Fig. 3).

Regarding the structural variants (SVs), the CNVs calling was performed using the tool Estimation by Read Depth with Single-nucleotide variants v1.1 (ERDS)⁷⁸, which generated as output a VCF file containing all called SVs per individual. Then, we employed the VCF sort tool⁵⁵ and the VCF combine tool⁵⁶ to create a single multi-sample VCF, which was annotated and ranked using the AnnotSV 2.2 online software⁷⁹. CNVs prioritization was done using the subsequent filters: (i) Genotype filtering which considers only homozygous, heterozygous, and hemizygous deletions and duplications excluding complex and multi-allelic CNVs; (ii) Recurrence filtering which limits the co-occurrence of the same CNV in no more than three individuals of our discovery cohort; (iii) Frequency filtering (MAF ≤ 0.01 or absent in gnomAD); and (iv) SiteType filtering consisting of prioritizing events that include exonic bases. In addition, we used the Mobile Element Locator Tool (MELT v2.2.2)⁸⁰ to discover mobile element insertions (Alu, L1, and SVA elements) in the discovery cohort. The resulting call sets were annotated using AnnotSV and filtered according to the quality status and the recurrence between samples.

Remarkably, a single multi-sample file containing the passing filters variants (SNVs, indels, and SVs) of the 14 individuals, belonging to seven IRD families, was the starting point for the application of the pedigree filtering. This filter should be applied considering the specific pedigree of each family. This step was the first filter specific to the family in the study and focused on the analysis of only those variants present in the index patient, taking into account the genotype, and the phenotype, of the additional sequenced family members. In a first approach, we carry out the prioritization of variants considering the mode of inheritance initially assumed and a common genetic cause in all affected individuals of the same family. However, in those families in which this approach did not lead to candidate variants, the data analysis was conducted under other considerations.

Finally, we conducted a manual curation of candidate variants considering: (i) the number of heterozygous, hemizygous and homozygous individuals and constraint metrics of gnomAD; (ii) the results of the application of ACMG classification¹⁵ rules; (iii) the clinical significance recorded in additional variant databases (HGMD professional, LOVD, ClinGen, DGV⁸¹ or DECIPHER⁸²); and (iv) the reported retinal association regarding gene function, interaction networks (RPGeNet¹⁹), expression databases, animal models, etc.

Candidate variants were segregated by Sanger sequencing (SNVs), PCR (MEIs) or RT-PCR (CNVs) according to the manufacturer’s protocols (BigDye® Terminator v3.1 Cycle Sequencing Kit, 3730 DNA Analyzer, Applied Biosystems, USA; Qiagen Multiplex PCR Master Mix, and RT2 SYBR Green ROX qPCR Mastermix Qiagen, Hilden, Germany) in additional family members. The primers used are available in Supplementary Table 6. Structural, expression, localization, and mutational screening studies were conducted if needed.

In case no likely candidate variants were detected using this pipeline, a reanalysis of the data, including the screening of both deep-intronic regions of novel genes, and complex rearrangements, are being conducted.

Protein structural analysis

The multiple sequence alignment was generated by Jalview v2.11.1.0⁸³ with the T-Coffee alignment algorithm⁸⁴. Sequences of CFAP20 orthologs were obtained via UniProt⁸⁵ and filtered for reviewed (Swiss-Prot), including A8IU92 (Chlamydomonas reinhardtii), Q9Y6A4 (Homo sapiens), Q9VKV8 (Drosophila melanogaster), Q6PBJ2 (Danio rerio), A0CDD4 (Paramecium tetraurelia), Q8BTU1 (Mus musculus), Q6B857 (Bos taurus), Q499T7 (Rattus norvegicus), Q5ZHP3 (Gallus gallus), Q6GL74 (Xenopus tropicalis) and Q86D25 (Caenorhabditis elegans).

Protein predictive models of human CFAP20 were obtained using I-Tasser^86,87. Among the predicted structures, the model with the highest C-score was selected. To analyze the impact of mutagenesis on terms of size and hydrogen bonding, PyMOL Molecular Graphics System, v1.8⁸⁸ was used.

The protein-protein interaction (PPI) network was created by integrating Biological General Repository for Interaction Datasets (BioGRID v3.5)⁸⁹ and IntAct databases⁹⁰ at EMBL-EBI. To restrict the number of PPIs to those with higher levels of evidence, we removed the PPIs predicted by spoke expanded co-complexes. Cytoscape v3.8.0⁹¹ was used to construct and visualize the PPI network which included common interaction pairs in both databases. The function of connected genes was checked in OMIM (https://omim.org/), Uniprot⁸⁵, and the literature.

Expression and localization studies in the human retina

The expression of the human CFAP20 gene was evaluated by real-time qPCR using the RT² SYBR Green ROX qPCR MasterMix (Qiagen, Hilden, Germany) in an Applied Biosystems 7500HT instrument (Life Technologies, CA, USA) with ready-to-use cDNA from five different tissues: retina (QUICK-Clone™ Clontech Laboratories, Inc., CA, USA), brain, kidney, placenta and skeletal muscle (Zyagen, CA, USA). The relative expression of CFAP20 in the mRNA in retina tissue vs. the other tissues was determined using the comparative Ct (2^-ΔΔCt) method⁹² with GAPDH as endogenous control. All the samples were executed in triplicates.

Localization studies of human CFAP20 in retina sections were done by immunohistochemistry. The human retina sections belonged to five unaffected donors from the University Hospital Virgen del Rocio-Institute of Biomedicine of Seville Biobank (Andalusian Public Health System Biobank and ISCIII-Red de Biobancos PT17/0015/0041). For this purpose, four-micrometer-thick tissue sections from paraffin blocks were baked for 20 min at 65 °C. Antigen retrieval was performed with a PT Link instrument (Agilent, CA, USA), using EDTA buffer (97°C, 20 min). Sections were immersed in H₂O₂ aqueous solution (Blocking peroxidase reagent, Agilent, CA, USA) for 10 min to exhaust endogenous peroxidase activity and then covered with 1% blocking reagent (Roche, Mannheim, Germany) in PBS, to block nonspecific binding sites. Sections were then incubated with a 1:400 dilution of primary antibody (Abcam, ab225952) for 1 h at room temperature in a humid chamber. Later, horseradish peroxidase polymer conjugated secondary antibodies (Visualization reagent, Agilent, CA, USA) were used for 1 h at room temperature in a humid chamber and 3,3'-diaminobenzidine was applied for 5 min to develop immunoreactivity. Slides were counterstained with hematoxylin and mounted in DPX (BDH Laboratories, Poole, UK). Images of the stained sections were obtained with an Olympus BX61 microscope and the cellSens Dimension software (Olympus, PA, USA).

Mutational screening of CFAP20 in additional IRD families

To evaluate the prevalence of CFAP20 variants in additional IRD families of our cohort, we designed an amplicon NGS-based approach of all coding exons of CFAP20 and their flanking intronic regions (Supplementary Table 6). For this purpose, 264 additional unsolved IRD patients underwent deep-amplicon sequencing using a Custom rhAmpSeq library Panel (Integrated DNA Technologies, Inc., IA, USA) in the Illumina’s MiSeq instrument (2 × 150bp paired-end). Data analysis was conducted using MiSeq Reporter software (v2.6) without flag duplicates.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The authors confirm that the data supporting the findings of this study are available within the article and its supplementary materials. The prioritized variants were submitted to ClinVar database under the accession ID: SCV002061327. The Whole-genome sequencing data are not publicly available due to families enrolled in this study did not provide additional consent to share raw dataset in a public repository. De-identified data or additional specific variant information may be accessible and requested from corresponding authors G.A. (guillermo.antinolo.sspa@juntadeandalucia.es) and S.B. (salud.borrego.sspa@juntadeandalucia.es).

References

Toulis, V. et al. Increasing the genetic diagnosis yield in inherited retinal dystrophies: assigning pathogenicity to novel non-canonical splice site variants. Genes https://doi.org/10.3390/genes11040378 (2020).
Hartong, D. T., Berson, E. L. & Dryja, T. P. Retinitis pigmentosa. Lancet. 368, 1795–1809 (2006).
Article CAS PubMed Google Scholar
Arno, G. et al. Mutations in REEP6 cause autosomal-recessive retinitis pigmentosa. Am. J. Hum. Genet. 99, 1305–1315 (2016).
Article CAS PubMed PubMed Central Google Scholar
Van Schil, K. et al. Mapping the genomic landscape of inherited retinal disease genes prioritizes genes prone to coding and noncoding copy-number variations. Genet. Med. 20, 202–213 (2018).
Article PubMed Google Scholar
Nishiguchi, K. M. et al. A founder Alu insertion in RP1 gene in Japanese patients with retinitis pigmentosa. Jpn. J. Ophthalmol. 64, 346–350 (2020).
Article CAS PubMed Google Scholar
Webb, T. R. et al. Deep intronic mutation in OFD1, identified by targeted genomic next-generation sequencing, causes a severe form of X-linked retinitis pigmentosa (RP23). Hum. Mol. Genet. 21, 3647–3654 (2012).
Article CAS PubMed PubMed Central Google Scholar
Zhu, X. et al. Identification of novel USH2A mutations in patients with autosomal recessive retinitis pigmentosa via targeted next‑generation sequencing. Mol. Med. Rep. 22, 193–200 (2020).
CAS PubMed PubMed Central Google Scholar
Liu, H. Y. et al. Diagnostic and clinical utility of whole genome sequencing in a cohort of undiagnosed Chinese families with rare diseases. Sci. Rep. 9, 19365 (2019).
Article CAS PubMed PubMed Central Google Scholar
Turro, E. et al. Whole-genome sequencing of patients with rare diseases in a national health system. Nature 583, 96–102 (2020).
Article CAS PubMed PubMed Central Google Scholar
van El, C. G. et al. Whole-genome sequencing in health care. Recommendations of the European society of human genetics. Eur. J. Hum. Genet. 21, S1–5 (2013).
PubMed PubMed Central Google Scholar
Dockery, A., Whelan, L., Humphries, P. & Farrar, G. J. Next-generation sequencing applications for inherited retinal diseases. Int. J. Mol. Sci. https://doi.org/10.3390/ijms22115684 (2021).
Dewey, F. E. et al. Clinical interpretation and implications of whole-genome sequencing. JAMA 311, 1035–1045 (2014).
Article CAS PubMed PubMed Central Google Scholar
Ng, P. C. & Kirkness, E. F. Whole genome sequencing. Methods Mol. Biol. 628, 215–226 (2010).
Article CAS PubMed Google Scholar
Niroula, A. & Vihinen, M. How good are pathogenicity predictors in detecting benign variants? PLoS Comput. Biol. 15, e1006481 (2019).
Article CAS PubMed PubMed Central Google Scholar
Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).
Article PubMed PubMed Central Google Scholar
Laligné, C. et al. Bug22p, a conserved centrosomal/ciliary protein also present in higher plants, is required for an effective ciliary stroke in Paramecium. Eukaryot. Cell 9, 645–655 (2010).
Article PubMed PubMed Central Google Scholar
Yanagisawa, H. A. et al. FAP20 is an inner junction protein of doublet microtubules essential for both the planar asymmetrical waveform and stability of flagella in Chlamydomonas. Mol. Biol. Cell 25, 1472–1483 (2014).
Article PubMed PubMed Central Google Scholar
Mendes Maia, T., Gogendeau, D., Pennetier, C., Janke, C. & Basto, R. Bug22 influences cilium morphology and the post-translational modification of ciliary microtubules. Biol. Open 3, 138–151 (2014).
Article PubMed PubMed Central Google Scholar
Arenas-Galnares, R. et al. RPGeNet v2.0: expanding the universe of retinal disease gene interactions network. Database https://doi.org/10.1093/database/baz120 (2019).
Boldt, K. et al. An organelle-specific protein landscape identifies novel diseases and molecular mechanisms. Nat. Commun. 7, 11491 (2016).
Article CAS PubMed PubMed Central Google Scholar
Billie Au, P. Y.; Tagoe, J.; Novak, J.; MacDonald, I. 40th Annual David W Smith workshop on malformations and morphogenesis. Am. J. Med. Genet. A. 182, 877–942 (2020).
Peña-Chilet, M. et al. CSVS, a crowdsourcing database of the Spanish population genetic variability. Nucleic Acids Res. 49, D1130–D1137 (2021).
Article PubMed Google Scholar
Martín-Sánchez, M. et al. A multi-strategy sequencing workflow in inherited retinal dystrophies: routine diagnosis, addressing unsolved cases and candidate genes identification. Int. J Mol. Sci. https://doi.org/10.3390/ijms21249355 (2020).
Marshall, C. R. et al. The medical genome initiative: moving whole-genome sequencing for rare disease diagnosis to the clinic. Genome Med 12, 48 (2020).
Article PubMed PubMed Central Google Scholar
Michelson, D. J. & Clark, R. D. Optimizing genetic diagnosis of neurodevelopmental disorders in the clinical setting. Clin. Lab. Med. 40, 231–256 (2020).
Article PubMed Google Scholar
Ioannidis, N. M. et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885 (2016).
Article CAS PubMed PubMed Central Google Scholar
Kotlar, A. V., Trevino, C. E., Zwick, M. E., Cutler, D. J. & Wingo, T. S. Bystro: rapid online variant annotation and natural-language filtering at whole-genome scale. Genome Biol. 19, 14 (2018).
Article PubMed PubMed Central Google Scholar
Dopazo, J. et al. 267 Spanish exomes reveal population-specific differences in disease-related genetic variation. Mol. Biol. evolution 33, 1205–1218 (2016).
Article CAS Google Scholar
Dong, C. et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum. Mol. Genet. 24, 2125–2137 (2015).
Article CAS PubMed Google Scholar
Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894 (2019).
Article CAS PubMed Google Scholar
Meng, D., Cao, M., Oda, T. & Pan, J. The conserved ciliary protein Bug22 controls planar beating of Chlamydomonas flagella. J. Cell Sci. 127, 281–287 (2014).
CAS PubMed Google Scholar
Malicki, J., Avanesov, A., Li, J., Yuan, S. & Sun, Z. Analysis of cilia structure and function in zebrafish. Methods Cell Biol. 101, 39–74 (2011).
Article PubMed Google Scholar
Liu, Q. et al. The proteome of the mouse photoreceptor sensory cilium complex. Mol. Cell. Proteomics. 6, 1299–1317 (2007).
Article CAS PubMed Google Scholar
Wright, A. F., Chakarova, C. F., Abd El-Aziz, M. M. & Bhattacharya, S. S. Photoreceptor degeneration: genetic and mechanistic dissection of a complex trait. Nat. Rev. Genet. 11, 273–284 (2010).
Article CAS PubMed Google Scholar
Bujakowska, K. M., Liu, Q. & Pierce, E. A. Photoreceptor cilia and retinal ciliopathies. Cold Spring Harb. Perspect. Biol. https://doi.org/10.1101/cshperspect.a028274 (2017).
Estrada-Cuzcano, A. et al. BBS1 mutations in a wide spectrum of phenotypes ranging from nonsyndromic retinitis pigmentosa to Bardet–Biedl syndrome. Arch. Ophthalmol. 130, 1425–1432 (2012).
Article CAS PubMed Google Scholar
Murga-Zamalloa, C. A., Swaroop, A. & Khanna, H. RPGR-containing protein complexes in syndromic and non-syndromic retinal degeneration due to ciliary dysfunction. J. Genet. 88, 399–407 (2009).
Article CAS PubMed PubMed Central Google Scholar
Badano, J. L. et al. Heterozygous mutations in BBS1, BBS2 and BBS6 have a potential epistatic effect on Bardet–Biedl patients with two mutations at a second BBS locus. Hum. Mol. Genet. 12, 1651–1659 (2003).
Article CAS PubMed Google Scholar
Ramsbottom, S. A. et al. Mouse genetics reveals Barttin as a genetic modifier of Joubert syndrome. Proc. Natl. Acad. Sci. USA 117, 1113–1118 (2020).
Article CAS PubMed Google Scholar
Wheway, G., Lord, J. & Baralle, D. Splicing in the pathogenesis, diagnosis and treatment of ciliopathies. Biochim. Biophys. Acta Gene Regul. Mech. 1862, 194433 (2019).
Article CAS PubMed Google Scholar
Sanchez-Bellver, L., Toulis, V. & Marfany, G. On the wrong track: alterations of ciliary transport in inherited retinal dystrophies. Front. Cell Dev. Biol. 9, 623734 (2021).
Article PubMed PubMed Central Google Scholar
Davidson, A. E. et al. Mutations in ARL2BP, encoding ADP-ribosylation-factor-like 2 binding protein, cause autosomal-recessive retinitis pigmentosa. Am. J. Hum. Genet. 93, 321–329 (2013).
Article CAS PubMed PubMed Central Google Scholar
Moye, A. R. et al. ARL2BP, a protein linked to retinitis pigmentosa, is needed for normal photoreceptor cilia doublets and outer segment structure. Mol. Biol. Cell 29, 1590–1598 (2018).
Article CAS PubMed PubMed Central Google Scholar
Adly, N., Alhashem, A., Ammari, A. & Alkuraya, F. S. Ciliary genes TBC1D32/C6orf170 and SCLT1 are mutated in patients with OFD type IX. Hum. Mutat. 35, 36–40 (2014).
Article CAS PubMed Google Scholar
Alsahan, N. & Alkuraya, F. S. Confirming TBC1D32-related ciliopathy in humans. Am. J. Med. Genet. A https://doi.org/10.1002/ajmg.a.61717 (2020).
Wallmeier, J. et al. De novo mutations in FOXJ1 result in a motile ciliopathy with hydrocephalus and randomization of left/right body asymmetry. Am. J. Hum. Genet. 105, 1030–1039 (2019).
Article CAS PubMed PubMed Central Google Scholar
Liu, Z. et al. A drosophila model for LRRK2-linked parkinsonism. Proc. Natl Acad. Sci. USA 105, 2693–2698 (2008).
Article CAS PubMed PubMed Central Google Scholar
Kaneko, H. et al. DICER1 deficit induces Alu RNA toxicity in age-related macular degeneration. Nature 471, 325–330 (2011).
Article CAS PubMed PubMed Central Google Scholar
World Medical Association Declaration of Helsinki. Ethical principles for medical research involving human subjects. JAMA 310, 2191–2194 (2013).
Article Google Scholar
Mendez-Vidal, C. et al. Whole-exome sequencing identifies novel compound heterozygous mutations in USH2A in Spanish patients with autosomal recessive retinitis pigmentosa. Mol. Vis. 19, 2187–2195 (2013).
PubMed PubMed Central Google Scholar
Bravo-Gil, N. et al. Improving the management of inherited retinal dystrophies by targeted sequencing of a population-specific gene panel. Sci. Rep. 6, 23910 (2016).
Article CAS PubMed PubMed Central Google Scholar
Bravo-Gil, N. et al. Unravelling the genetic basis of simplex retinitis pigmentosa cases. Sci. Rep. 7, 41937 (2017).
Article CAS PubMed PubMed Central Google Scholar
González-Del Pozo, M. et al. Searching the second hit in patients with inherited retinal dystrophies and monoallelic variants in ABCA4, USH2A and CEP290 by whole-gene targeted sequencing. Sci. Rep. 8, 13312 (2018).
Article PubMed PubMed Central Google Scholar
González-Del Pozo, M. et al. Unmasking retinitis pigmentosa complex cases by a whole genome sequencing algorithm based on open-access tools: hidden recessive inheritance and potential oligogenic variants. J. Transl. Med. 18, 73 (2020).
Article PubMed PubMed Central Google Scholar
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Article CAS PubMed PubMed Central Google Scholar
Vcflib. A simple C++ library for parsing and manipulating VCF files. GitHub https://github.com/vcflib/vcflib (2015).
Kopanos, C. et al. VarSome: the human genomic variant search engine. Bioinformatics 35, 1978–1980 (2019).
Article CAS PubMed Google Scholar
McLaren, W. et al. The ensembl variant effect predictor. Genome Biol. 17, 122 (2016).
Article PubMed PubMed Central Google Scholar
Ng, P. C. & Henikoff, S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 31, 3812–3814 (2003).
Article CAS PubMed PubMed Central Google Scholar
Grantham, R. Amino acid difference formula to help explain protein evolution. Science.185, 862–864 (1974).
Article CAS PubMed Google Scholar
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
Article CAS PubMed PubMed Central Google Scholar
Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).
Article CAS PubMed PubMed Central Google Scholar
Stone, E. A. & Sidow, A. Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity. Genome Res. 15, 978–986 (2005).
Article CAS PubMed PubMed Central Google Scholar
Leman, R. et al. Novel diagnostic tool for prediction of variant spliceogenicity derived from a set of 395 combined in silico/in vitro studies: an international collaborative effort. Nucleic Acids Res. 46, 7913–7923 (2018).
Article CAS PubMed PubMed Central Google Scholar
Shapiro, M. B. & Senapathy, P. RNA splice junctions of different classes of eukaryotes: sequence statistics and functional implications in gene expression. Nucleic Acids Res. 15, 7155–7174 (1987).
Article CAS PubMed PubMed Central Google Scholar
Yeo, G. & Burge, C. B. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comput. Biol. 11, 377–394 (2004).
Article CAS PubMed Google Scholar
Reese, M. G., Eeckman, F. H., Kulp, D. & Haussler, D. Improved splice site detection in genie. J. Comput. Biol. 4, 311–323 (1997).
Article CAS PubMed Google Scholar
Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548 e524 (2019).
Article CAS PubMed Google Scholar
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Article CAS PubMed PubMed Central Google Scholar
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Article CAS PubMed PubMed Central Google Scholar
Ruopp, M. D., Perkins, N. J., Whitcomb, B. W. & Schisterman, E. F. Youden index and optimal cut-point estimated from observations affected by a lower limit of detection. Biom. J. 50, 419–430 (2008).
Article PubMed PubMed Central Google Scholar
Habibzadeh, F., Habibzadeh, P. & Yadollahie, M. On determining the most appropriate test cut-off value: the case of tests with continuous results. Biochemia Med. 26, 297–307 (2016).
Article Google Scholar
Dave, R. A. & Morris, M. E. Novel high/low solubility classification methods for new molecular entities. Int. J. Pharm. 511, 111–126 (2016).
Article CAS PubMed PubMed Central Google Scholar
DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
Article CAS PubMed Google Scholar
Peugh, J. L. & Enders, C. K. Missing data in educational research: a review of reporting practices and suggestions for improvement. Rev. Educ. Res. 74, 525–556 (2004).
Article Google Scholar
Johnson, J. E. et al. Improve your galaxy text life: the query tabular tool. F1000Res. 7, 1604 (2018).
Article PubMed Google Scholar
Pedersen, B. S. et al. Effective variant filtering and expected candidate variant yield in studies of rare human disease. NPJ Genom. Med. 6, 60 (2021).
Article CAS PubMed PubMed Central Google Scholar
Zhu, M. et al. Using ERDS to infer copy-number variants in high-coverage genomes. Am. J. Hum. Genet. 91, 408–421 (2012).
Article CAS PubMed PubMed Central Google Scholar
Geoffroy, V. et al. AnnotSV: an integrated tool for structural variations annotation. Bioinformatics 34, 3572–3574 (2018).
Article CAS PubMed Google Scholar
Gardner, E. J. et al. The mobile element locator tool (MELT): population-scale mobile element discovery and biology. Genome Res. 27, 1916–1929 (2017).
Article CAS PubMed PubMed Central Google Scholar
MacDonald, J. R., Ziman, R., Yuen, R. K., Feuk, L. & Scherer, S. W. The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic Acids Res. 42, D986–D992 (2014).
Article CAS PubMed Google Scholar
Firth, H. V. et al. DECIPHER: database of chromosomal imbalance and phenotype in humans using ensembl resources. Am. J. Hum. Genet. 84, 524–533 (2009).
Article CAS PubMed PubMed Central Google Scholar
Waterhouse, A. M., Procter, J. B., Martin, D. M., Clamp, M. & Barton, G. J. Jalview version 2-a multiple sequence alignment editor and analysis workbench. Bioinformatics 25, 1189–1191 (2009).
Article CAS PubMed PubMed Central Google Scholar
Notredame, C., Higgins, D. G. & Heringa, J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217 (2000).
Article CAS PubMed Google Scholar
UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).
Yang, J. & Zhang, Y. I-TASSER server: new development for protein structure and function predictions. Nucleic Acids Res. 43, W174–W181 (2015).
Article CAS PubMed PubMed Central Google Scholar
Roy, A., Kucukural, A. & Zhang, Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5, 725–738 (2010).
Article CAS PubMed PubMed Central Google Scholar
Schrodinger, LLC. The PyMOL Molecular Graphics System, Version 1.8 https://pymol.org/2/ (2015).
Oughtred, R. et al. The BioGRID interaction database: 2019 update. Nucleic Acids Res. 47, D529–D541 (2019).
Article CAS PubMed Google Scholar
Orchard, S. et al. The MIntAct project-IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 42, D358–D363 (2014).
Article CAS PubMed Google Scholar
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
Article CAS PubMed PubMed Central Google Scholar
Livak, K. J. & Schmittgen, T. D. Analysis of relative gene expression data using real-time quantitative PCR and the 2(-Delta Delta C(T)) Method. Methods 25, 402–408 (2001).
Article CAS PubMed Google Scholar

Download references

Acknowledgements

The authors thank the families who participated in this study, the donors and the University Hospital Virgen del Rocio-Institute of Biomedicine of Seville Biobank (Andalusian Public Health System Biobank and ISCIII-Red de Biobancos PT17/0015/0041) for the human specimens used in this study, and the Andalusian Association of Retinitis Pigmentosa (AARP). This work was supported by the Instituto de Salud Carlos III (ISCIII), Spanish Ministry of Economy and Competitiveness, Spain and co-funded by the European Union (ERDF, “A way to make Europe”) [PI18-00612; PI21-00244], Regional Ministry of Health and Families of the Autonomous Government of Andalusia [PEER-0501-2019] and the Foundation Isabel Gemio/Foundation Cajasol [FGEMIO-2019-01]. EFS is supported by fellowship FI19/00091 from ISCIII (ESF, “Investing in your future”). MMS is supported by a fellowship associated with the CTS-1664 project, which has been funded by the Regional Ministry of Economy, Knowledge, Enterprise, and the University of the Regional Government of Andalusia. NBG is supported by a fellowship RH-0118-2020, which has been funded by the Regional Ministry of Health and Families of the Autonomous Government of Andalusia.

Author information

These authors contributed equally: María González-del Pozo, Elena Fernández-Suárez.

Authors and Affiliations

Department of Maternofetal Medicine, Genetics and Reproduction, Institute of Biomedicine of Seville, University Hospital Virgen del Rocio/CSIC/University of Seville, Seville, Spain
María González-del Pozo, Elena Fernández-Suárez, Nereida Bravo-Gil, Cristina Méndez-Vidal, Marta Martín-Sánchez, Salud Borrego & Guillermo Antiñolo
Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), Seville, Spain
María González-del Pozo, Elena Fernández-Suárez, Nereida Bravo-Gil, Cristina Méndez-Vidal, Salud Borrego & Guillermo Antiñolo
Department of Ophthalmology, University Hospital Virgen Macarena, Seville, Spain
Enrique Rodríguez-de la Rúa & María José Morillo-Sánchez
Retics Patologia Ocular, OFTARED, Instituto de Salud Carlos III, Madrid, Spain
Enrique Rodríguez-de la Rúa
Department of Clinical Neurophysiology, University Hospital Virgen Macarena, Seville, Spain
Manuel Ramos-Jiménez

Authors

María González-del Pozo
View author publications
You can also search for this author in PubMed Google Scholar
Elena Fernández-Suárez
View author publications
You can also search for this author in PubMed Google Scholar
Nereida Bravo-Gil
View author publications
You can also search for this author in PubMed Google Scholar
Cristina Méndez-Vidal
View author publications
You can also search for this author in PubMed Google Scholar
Marta Martín-Sánchez
View author publications
You can also search for this author in PubMed Google Scholar
Enrique Rodríguez-de la Rúa
View author publications
You can also search for this author in PubMed Google Scholar
Manuel Ramos-Jiménez
View author publications
You can also search for this author in PubMed Google Scholar
María José Morillo-Sánchez
View author publications
You can also search for this author in PubMed Google Scholar
Salud Borrego
View author publications
You can also search for this author in PubMed Google Scholar
Guillermo Antiñolo
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

M.G.-P. and E.F.-S. are co-first authors. G.A. and S.B. conceived and designed the study. E.R.-R. and M.J.M.-S. performed the ophthalmic evaluations. M.G.-P., E.F.-S., N.B.-G., C.M.-V., and M.M.-S. conducted the experiments. E.F.-S., M.G.-P., N.B.-G., C.M.-V., and M.M.-S. analyzed and interpreted the generated data. M.R.-J. conducted the electrophysiological study. S.B. coordinated the integration of additional clinical information. M.G.-P., and E.F.-S., wrote the manuscript with the collaboration of all co-authors. G.A. and S.B. revised the paper critically for important intellectual content. All authors approved the final version to be published.

Corresponding authors

Correspondence to Salud Borrego or Guillermo Antiñolo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Materials

Reporting summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

González-del Pozo, M., Fernández-Suárez, E., Bravo-Gil, N. et al. A comprehensive WGS-based pipeline for the identification of new candidate genes in inherited retinal dystrophies. npj Genom. Med. 7, 17 (2022). https://doi.org/10.1038/s41525-022-00286-0

Download citation

Received: 29 April 2021
Accepted: 04 February 2022
Published: 04 March 2022
DOI: https://doi.org/10.1038/s41525-022-00286-0

This article is cited by

From Data to Cure: A Comprehensive Exploration of Multi-omics Data Analysis for Targeted Therapies
- Arnab Mukherjee
- Suzanna Abraham
- K. S. Mukunthan
Molecular Biotechnology (2024)
The inner junction protein CFAP20 functions in motile and non-motile cilia and is critical for vision
- Paul W. Chrystal
- Nils J. Lambacher
- Michel R. Leroux
Nature Communications (2022)

Subjects

Abstract

Similar content being viewed by others

Genome-wide association studies

Utility of polygenic scores across diverse diseases in a hospital cohort for predictive modeling

Genomic data in the All of Us Research Program

Introduction

Results

Establishment of the optimal cutoffs

Optimization and validation of the discovery pipeline

Application of the discovery pipeline

Protein structural analysis, expression assays, localization studies, and mutational screening of CFAP20

Clinical findings in the family A

Discussion

Methods

Subjects and previous NGS studies

Curation of the training dataset

Predictive tools tested in this study

Comparison of the predictive tools

Variants filtering, prioritization, and pathogenicity assessment

Protein structural analysis

Expression and localization studies in the human retina

Mutational screening of CFAP20 in additional IRD families

Reporting summary

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Additional information

Supplementary information

Supplementary Materials

Reporting summary

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

From Data to Cure: A Comprehensive Exploration of Multi-omics Data Analysis for Targeted Therapies

The inner junction protein CFAP20 functions in motile and non-motile cilia and is critical for vision

Search

Quick links