Introduction

Rare diseases (RD) affect all populations worldwide. Although each disease is individually rare, their cumulative prevalence is estimated at approximately 5% of the population [1]. With a global population of nearly 8 billion people in 2023, this figure suggests that around 400 million people may suffer from an RD, and 330 million of them could live in low- and middle-income countries (LMIC) (estimated from [2]). Given the rarity of each condition, affected individuals, families and health systems face many challenges in access and quality of care. Among these challenges is the prolonged “diagnostic odyssey” or “medical pilgrimage”, which can last decades [3].

The majority of RDs have a genetic etiology. Exome and genome sequencing (ES and GS, respectively) have revolutionized diagnosis and discovery of RDs in the past 15 years in an unprecedented way [4], particularly for persons with rare undiagnosed disorders (RUD). The early use of ES and GS has been shown to be cost-effective compared with stepwise testing, and has clinical utility in terms of changes in clinical management and reproductive decisions [5,6,7,8,9,10]. Nevertheless, these powerful diagnostic tools are not available globally. Only 31 countries have ES as a clinical diagnostic service listed in the Genetic Testing Registry [11], and most of them are high- or upper-middle-income countries according to the World Bank classification, except for India and Bangladesh, which have large population sizes and thus likely high testing volumes [12]. The lack of availability and coverage of complex laboratory genomic services increases the already substantial healthcare gaps for persons with RD. Barriers to implementation of ES and GS in LMICs include lack of adequately trained and certified personnel, especially in laboratory genetics and genomics and in bioinformatics; higher costs of equipment and reagents compared with the global North; lack of coverage and reimbursement of testing; and insufficient regulatory, legal and ethics frameworks [13, 14]. As a result, there is limited availability of genomic testing for persons with RDs, and therefore prolonged and unsolved diagnostic odysseys. In some countries, patients and clinicians opt to outsource testing, but this usually requires substantial out-of-pocket payments, which limits access to those who can afford them [15, 16].

Although Chile is a high-income country according to World Bank standards [12], it can be considered an “emerging nation” in terms of genomic medicine implementation [17]. For example, to date, there are no high-throughput sequencers in clinical laboratories to provide ES or GS for diagnosis of RD in a cost- and time-efficient manner, and there are no training or certification programs in clinical laboratory genetics and genomics or in bioinformatics.

It remains unknown if and how countries with limited genomic resources can develop RD diagnostic resources. To address this gap, we implemented a program for RD in Chile, called DEcoding Complex Inherited PHenotypes of Rare Diseases (DECIPHERD), with a specific focus on patients with RUD. The strategy was implemented in two phases: first, a training phase, which relied on international collaboration, followed by a second, local development phase. This second phase used a hybrid strategy, with clinical evaluation and variant data analysis and interpretation performed locally, but ES outsourced abroad. We report on the process of implementation, and the results of the first 103 analyzed probands.

Methods

Implementation of the DECIPHERD sequencing and analysis pipeline consisted of two phases: a training phase and a local development phase. During the training phase, samples from probands and their parents were submitted to Baylor College of Medicine (BCM) for trio ES. The resulting exome data was analyzed by members of the DECIPHERD team who had the opportunity to train at BCM. In addition, a workshop on variant interpretation, led by collaborators from the US, was hosted at Universidad del Desarrollo (UDD). In parallel, probands underwent clinical exome sequencing (CES) and analysis at UDD. Comparisons of the overall diagnostic yield of both strategies and of the concordance of results for a subset of probands that were analyzed under both strategies were performed to decide the best design for the local development phase.

Participants

Probands with RUD, of any age or sex, and residing in Chile, were candidates to participate. The inclusion criteria were: (1) the presence of 2 or more major congenital anomalies, or one major and several minor anomalies, (2) neurological abnormalities such as global developmental delay or intellectual disability with one major or multiple minor congenital anomalies, or (3) significant dysfunction of the immune system, for which a genetic etiology was suspected. Probands were referred to the DECIPHERD study by their treating physicians if a cause was not identified despite prior evaluations at their local health centers (e.g., karyotype, FMR1, microarray, metabolic or gene panel testing, among others). Demographic information was obtained from interviews with probands and/or parents or guardians. Candidate participants were presented by their treating physicians at a weekly screening meeting led by the core DECIPHERD team, which includes clinicians, laboratory geneticists and bioinformaticians. Clinical history was presented using a structured form, and the probands’ clinical features were described using Human Phenotype Ontology (HPO) terms [18]. After a multidisciplinary discussion, probands that fulfilled the inclusion criteria were invited to participate. Parent participation for trio analysis was preferred when possible. Written informed consent was obtained from the participants and/or their guardians.

Samples and DNA extraction

Blood samples were obtained in EDTA tubes from probands and their parents when available. Genomic DNA was extracted using the DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany).

Exome sequencing and bioinformatics pipelines

Training phase

For trio exome sequencing at BCM (BCM-ES-trio), exome enrichment libraries were obtained using the NimbleGen VCRome 2.1 kit (Roche, Switzerland) according to the manufacturer’s protocol. These samples were subsequently sequenced using Illumina HiSeq 2500 equipment (Illumina, CA, USA). All data were aligned to GRCh37. Data processing and variant annotation were performed using the Variant Analyzer bioinformatic pipeline developed at BCM (https://github.com/BCM-Lupskilab/VariantAnalyzer). Variants with fewer than 3 reads were excluded, as were synonymous, deep intronic and intergenic variants, and variants in 3’ and 5’ UTR regions. Annotated variants were evaluated using recessive, dominant, and X-linked models. For the recessive model, variants with minor allele frequency (MAF) greater than 0.005 in the Baylor-Hopkins Center for Mendelian Genomics (CMG), ESP 5400, 1000 Genomes, gnomAD v3 and ExAC databases, or in our internal control population database (unpublished), were excluded. Potential recessive variants were also excluded if they were present in the ExAC database with a homozygous or hemizygous count of 10 or greater. For autosomal dominant model analyses, variants were excluded if their MAF was greater than 0.001 in these databases, and if they were present in gnomAD v3 and our database with an allele count greater than 5. Rare variants were analyzed manually considering specific variant characteristics, including type of variant and the combined annotation dependent depletion (CADD) score. Different publicly available resources were used to evaluate the association of a patient’s phenotype with his/her detected variant allele, such as OMIM (https://www.omim.org), gnomAD (https://gnomad.broadinstitute.org), UCSC Genome Browser (https://genome.ucsc.edu) and PubMed (https://pubmed.ncbi.nlm.nih.gov/), among others.
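The frequency and quality filters described in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Variant Analyzer pipeline itself: the `Variant` fields, the consequence labels and the function names are hypothetical, and the two dominant-model criteria (MAF and allele count) are assumed to each exclude a variant independently.

```python
from dataclasses import dataclass

# Thresholds as stated in the text.
MIN_READS = 3                    # variants with fewer reads are excluded
RECESSIVE_MAX_MAF = 0.005        # recessive model: MAF above this is excluded
DOMINANT_MAX_MAF = 0.001         # dominant model: MAF above this is excluded
EXAC_HOM_HEMI_CUTOFF = 10        # recessive model: excluded at this count or higher
DOMINANT_MAX_ALLELE_COUNT = 5    # dominant model: counts above this are excluded

# Hypothetical consequence labels; a real annotation vocabulary may differ.
EXCLUDED_CONSEQUENCES = {
    "synonymous", "deep_intronic", "intergenic", "3_prime_utr", "5_prime_utr",
}


@dataclass
class Variant:
    read_depth: int
    consequence: str
    max_maf: float           # highest MAF across the population databases consulted
    exac_hom_hemi: int = 0   # homozygous or hemizygous count in ExAC
    allele_count: int = 0    # allele count in gnomAD v3 / internal database


def passes_common_filters(v: Variant) -> bool:
    """Read-support and consequence filters shared by all inheritance models."""
    return v.read_depth >= MIN_READS and v.consequence not in EXCLUDED_CONSEQUENCES


def passes_recessive(v: Variant) -> bool:
    return (
        passes_common_filters(v)
        and v.max_maf <= RECESSIVE_MAX_MAF
        and v.exac_hom_hemi < EXAC_HOM_HEMI_CUTOFF
    )


def passes_dominant(v: Variant) -> bool:
    # Assumption: either criterion alone is enough to exclude the variant.
    return (
        passes_common_filters(v)
        and v.max_maf <= DOMINANT_MAX_MAF
        and v.allele_count <= DOMINANT_MAX_ALLELE_COUNT
    )
```

Under these assumptions, a missense variant with a MAF of 0.003 would be retained for the recessive model but dropped from the dominant one, which reflects the stricter frequency threshold applied to dominant conditions.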

A subset of candidate variants was orthogonally confirmed by Sanger sequencing at UDD. Regions containing the genetic variants were amplified by PCR with custom primers designed using the ApE software [19] (ApE, A plasmid Editor; utah.edu) and synthesized by Macrogen™ (Korea). The resulting ~300 bp amplicons were sequenced bidirectionally using BigDye™ Terminator v1.1 Cycle Sequencing Kit (Applied Biosystems, Foster City, CA, US) and the SeqStudio™ Genetic Analyzer (Thermo Fisher Scientific, Waltham, MA, US).

Local development phase

In parallel, and due to the lack of available high-throughput sequencing equipment in clinical laboratories in Chile, we initially performed clinical exome sequencing (CES) in probands only at UDD (UDD-CES-Solo), using the SOPHiA Clinical Exome Solution V1 (SOPHiA Genetics, Lausanne, Switzerland). This method captures exons from 4964 genes known to cause Mendelian disorders. Sequencing was carried out on an Illumina MiSeq sequencer. Results were analyzed using the SOPHiA DDM™ software with the same filters described above and compared with those obtained in the training phase. A subset of candidate variants was confirmed by Sanger sequencing in the proband and parents, if available, as described above.

Subsequently, we opted for a hybrid model, where ES was performed abroad, and the analysis of the results was performed at UDD by laboratory scientists of the DECIPHERD core team (DECIPHERD-ES). Sequencing was outsourced to Novogene (Beijing, China) using SOPHiA Exome Solution_V1 capture (SOPHiA Genetics, Lausanne, Switzerland), which captures the exons of approximately 19,680 protein-coding genes. The resulting annotated variants were evaluated by the core DECIPHERD team using the SOPHiA DDM™ platform. This platform provides information on population frequencies from the databases mentioned above, but it also includes those obtained from the SOPHiA users’ community and our local cohort. All the annotated variants were reviewed stepwise according to their inheritance model, starting with autosomal recessive (AR, considering homozygous and compound heterozygous variants), followed by autosomal dominant (AD) and X-linked (XL) models, as well as de novo occurrences, using the same frequency filters described for the training phase, and with the additional filter of excluding variants present in two or more unrelated participants from our cohort. The process also allows the detection of copy-number variants (CNVs). HPO terms were used to create virtual gene panels to initially narrow down the search for candidate variants within genes already associated with each patient’s phenotype; subsequently, the exomes were also analyzed in a phenotypically unbiased manner. Rare variants were analyzed manually considering variant characteristics including type of mutation and combined annotation dependent depletion (CADD) score. The same databases described above were used to obtain additional variant information.
We then prioritized the filtered variants based on their ACMG interpretation provided by the SOPHiA platform and further assessed their interpretation with additional freely available resources, including ClinVar (www.ncbi.nlm.nih.gov/clinvar), Varsome (www.varsome.com) and Franklin (https://franklin.genoox.com). Candidate variants were interpreted as either pathogenic (P), likely pathogenic (LP), variant of unknown significance (VUS), likely benign (LB) or benign (B). In case of discrepancies among these resources, the interpretation provided by ClinVar, if available, was selected; if a variant was not included in ClinVar, the most common interpretation among the other resources was chosen.
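The rule for resolving discrepant interpretations can be written as a short function. The sketch below is illustrative only: the function name and argument layout are hypothetical, and because the text does not say how a tie among the non-ClinVar resources was handled, the sketch returns `None` in that case to flag the variant for manual review.

```python
from collections import Counter
from typing import Optional, Sequence


def resolve_classification(
    clinvar: Optional[str], others: Sequence[str]
) -> Optional[str]:
    """Pick one interpretation (P, LP, VUS, LB or B) for a variant.

    clinvar: the ClinVar interpretation, or None if the variant is not listed.
    others:  interpretations from the remaining resources consulted.
    """
    # ClinVar takes precedence whenever it has an entry for the variant.
    if clinvar is not None:
        return clinvar
    if not others:
        return None
    # Otherwise use the most common interpretation among the other resources.
    counts = Counter(others)
    top_label, top_count = counts.most_common(1)[0]
    tied = [label for label, n in counts.items() if n == top_count]
    # Assumption: a tie is left unresolved (None) rather than broken arbitrarily.
    return top_label if len(tied) == 1 else None
```

For example, a variant absent from ClinVar and rated VUS, LP and VUS by three other resources would be interpreted as VUS under this rule.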

Clinical interpretation

Results were considered “informative” when a P or LP variant or a VUS was identified in a gene consistent with the proband’s phenotype. Informative cases were then labeled as “solved” for P/LP variants and “suggestive” when a VUS was identified. Recessive conditions in which only one informative variant was identified were considered “partially informative”, given that we could not determine whether this represented carrier status only, a novel potentially dominant form of the disease (as has been previously described for other disorders [20]), or compound heterozygosity with a non-coding variant undetected by this method. Results were considered “uninformative” when there were no variants identified that could explain the patient’s phenotype. Results were returned to families and referring clinicians in a genetic counseling session and with a written report.
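The labeling scheme above amounts to a small decision rule, sketched here in Python. The `Finding` structure and function names are hypothetical, findings are assumed to be already restricted to genes consistent with the proband's phenotype, and the precedence used when a proband has several findings (solved over suggestive over partially informative) is an assumption rather than something stated in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    classification: str  # "P", "LP", "VUS", "LB" or "B"
    inheritance: str     # mode of the associated condition: "AD", "AR" or "XL"
    alleles_found: int   # 1 = single variant; 2 = homozygous or compound heterozygous


# Assumed precedence when a proband has more than one finding.
RANK = {"uninformative": 0, "partially informative": 1, "suggestive": 2, "solved": 3}


def label_finding(f: Finding) -> str:
    if f.classification not in {"P", "LP", "VUS"}:
        return "uninformative"
    # A single informative variant in a recessive condition is only partial evidence.
    if f.inheritance == "AR" and f.alleles_found < 2:
        return "partially informative"
    return "solved" if f.classification in {"P", "LP"} else "suggestive"


def label_case(findings: List[Finding]) -> str:
    """Return the highest-ranked label across all findings for one proband."""
    labels = [label_finding(f) for f in findings]
    return max(labels, key=RANK.__getitem__, default="uninformative")
```

A proband with a single LP variant in a gene for a recessive condition would be labeled "partially informative", whereas the same variant found in the homozygous state would be labeled "solved".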

Statistical analysis

All data were stored in REDCap [21]. Descriptive analysis included frequencies, means and ranges. To identify variables associated with the likelihood of obtaining an informative result, first, we built bivariate logistic regression models for each of the following variables: sex, age (in years, and categorized in groups representing childhood (0–12 years), adolescence (12–18 years) and adulthood (18–52 years)), type of care (public or private), number of systems affected (based on the grouping of individual HPO terms into “systems” according to their hierarchy in the HPO’s terms tree), and whether the patient had congenital anomalies, neurological or immune compromise (represented as binary variables each). Then, we built a multivariate regression model including all variables mentioned before. The analyses were performed in RStudio v.2023.06.0 + 421 with R v.4.2.2, using the glm function and the binomial family modifier.
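To illustrate what a bivariate logistic regression with a binomial family estimates, the following Python sketch fits a one-predictor model by Newton-Raphson. It is not the R code used in the study: the function name is hypothetical and the counts in the usage example are invented purely for illustration. For a single binary predictor, the exponentiated slope equals the odds ratio of the corresponding 2×2 table.

```python
import math
from typing import Sequence, Tuple


def fit_logistic(
    xs: Sequence[float], ys: Sequence[int], iterations: int = 25
) -> Tuple[float, float]:
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson; returns (b0, b1).

    Assumes x is not constant and the data are not perfectly separated,
    otherwise the Hessian is singular or the estimates diverge.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            # Gradient of the log-likelihood.
            g0 += y - p
            g1 += (y - p) * x
            # Entries of the (negated) Hessian.
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Invented example: x = 1 if a feature is present, y = 1 if the result is informative.
# Feature present: 12 informative, 8 not; feature absent: 5 informative, 15 not.
xs = [1] * 20 + [0] * 20
ys = [1] * 12 + [0] * 8 + [1] * 5 + [0] * 15
b0, b1 = fit_logistic(xs, ys)
odds_ratio = math.exp(b1)  # matches the cross-product ratio (12 * 15) / (8 * 5)
```

A multivariable model extends the same idea to several predictors at once, so that each coefficient is adjusted for the others.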

Results

Demographic and clinical characterization of patients

Between June 2019 and December 2022, 133 probands were screened for the study and 103 of them consented to participate. Non-participants included probands who did not fulfill inclusion criteria (n = 21) or declined to participate (n = 9).

To characterize the probands’ phenotypes, we collected information from clinical records and in-person evaluations, describing clinical manifestations using HPO terms. Demographic characteristics are summarized in Table 1. The most frequent clinical inclusion criteria were congenital anomalies, with or without neurological abnormalities, present in 75% of probands (Supplementary Table 1). When grouped by systems, the overall cohort had involvement of 21 different organ systems. The majority of probands had neurologic manifestations, followed by craniofacial, musculoskeletal and growth manifestations (Fig. 1a). All probands had manifestations in at least two major systems, with a median of six and a range of two to eleven compromised systems (Fig. 1b). These results show that probands had heterogeneous and complex clinical manifestations.

Table 1 Demographic characterization of probands.
Fig. 1: Clinical manifestations.

The proband’s phenotypes were categorized into systems based on the Human Phenotype Ontology (HPO) hierarchy tree. A Frequency of affected systems in probands. The figure shows the percentage of probands with the corresponding system affected. B Number of affected systems per proband. Absolute frequency of systems affected per participant, based on the previously mentioned definition of system.

Comparison of findings in the training and initial development phase

During the training and the initial development phases, 13 trios had BCM-ES and 16 probands had UDD-CES, with an overlap of 6 probands. An informative finding was identified in 7 BCM-ES-trio analyses (53.8%) and 4 UDD-CES-solo analyses (25%). Three of the six overlapping probands had informative findings identified by both strategies. In addition, one UDD-CES-solo uninformative case had an informative result by BCM-ES-trio, resulting from a gene that was not included in the CES panel. The higher yield of trio ES compared with CES led to the decision to continue with ES in the local development phase.

Exome analysis

After the training phase, 74 additional probands were included in the local development phase (DECIPHERD-ES) using the hybrid model. From this point onward, findings are presented in aggregated form including both the training and local development phases. Ten probands with uninformative results on UDD-CES subsequently had DECIPHERD-ES. Therefore, we performed 113 total analyses in the 103 probands, of which 97 were ES. Of those undergoing ES, seventy-one (69%) participated as trios with both parents, 7 (6.7%) as duos (proband with mother, 3.8%) and 19 were solo cases (19.6%), for a total of 246 participants.

Forty-seven probands (45.6%) had informative results, that is, a P/LP variant or a VUS in genes known to cause conditions consistent with their clinical phenotypes. The different strategies had different yields of informative results. CES provided informative results for 4 of 16 probands (25%), ES-solo for 9 of 19 (47.4%), ES-duo for 2 of 7 (28.6%) and ES-trio for 32 of 71 families (45.1%). The results of these strategies are summarized in Table 2.
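The per-strategy yields follow directly from the counts reported in this paragraph, as the short Python sketch below shows. The dictionary keys and the helper name are illustrative; the counts are those given in the text.

```python
# (informative results, probands or families analyzed) per strategy, from the text.
STRATEGY_COUNTS = {
    "CES": (4, 16),
    "ES-solo": (9, 19),
    "ES-duo": (2, 7),
    "ES-trio": (32, 71),
}


def yield_pct(informative: int, analyzed: int) -> float:
    """Diagnostic yield as a percentage, rounded to one decimal place."""
    return round(100.0 * informative / analyzed, 1)


yields = {name: yield_pct(k, n) for name, (k, n) in STRATEGY_COUNTS.items()}

# Overall yield: informative probands across strategies over the 103 probands.
total_informative = sum(k for k, _ in STRATEGY_COUNTS.values())
overall = yield_pct(total_informative, 103)
```

The overall denominator is the number of probands (103) rather than the number of analyses, because some probands were analyzed under more than one strategy.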

Table 2 Detection rate by familial and sequencing strategies.

Thirty-one of these 47 probands with informative results were considered “solved” cases, i.e. a molecular diagnosis concluded with the identification of P/LP variants (30% of all probands and 66% of those with informative results), of which 23 had an AD condition, 4 AR and 4 XL. In contrast, 13 cases had VUS and were considered as “suggestive results” (12.6% of all probands and 27.7% of those with informative results, respectively), of which 6 corresponded to AD conditions, 3 AR and 4 XL. Finally, three cases were considered “partially informative” (2.9% and 6.4%, respectively) since they had a single variant identified in a gene associated with a recessive condition and a phenotype consistent with this AR disease. Clinical and molecular findings of the patients with informative results can be found in Table 3.

Table 3 Informative variants.

Four patients with suggestive results warrant further description since they illustrate interesting findings. Patient ID#64 was diagnosed with renal and pancreatic cysts at age 2 years, had recurrent infections, and was found to have two VUS in PKD1, inherited in trans from healthy parents. This is generally considered an AD disease; however, cases have been described with biallelic hypomorphic mutations in this gene, postulated to be an AR form of disease [22]. Another patient (ID#91) with severe ID and born to consanguineous parents had a homozygous P/LP variant in SETD5, known to cause dominant ID, but with incomplete penetrance and variable expressivity [23, 24]. In this case, one parent was described as having mild ID, and we propose the existence of a potentially more severe recessive phenotype for this condition. Finally, two patients (ID#106 and #125) had VUS in two different genes each, with each variant potentially associated with different components of their phenotype, suggesting that these situations may constitute dual genetic diagnoses.

This heterogeneous cohort allowed the identification of novel variants potentially associated with disease. We found 55 distinct informative variants in the 47 probands (counting a homozygous variant once). Fifty variants were SNVs and 5 were CNVs. Among informative variants, 25 (45.5%) were confirmed de novo, and 26 (47.3%) were inherited, underscoring the utility of performing trio ES. Parental samples were unavailable for 4 probands, and thus the inheritance of their variants could not be determined. Twenty-six informative variants (47.3%) were listed in ClinVar as of June 30, 2023. Interestingly, the other twenty-nine informative variants (52.7%) were not listed in ClinVar, of which sixteen were predicted as P/LP and twelve as VUS according to ACMG criteria. In summary, more than half of informative variants had not been reported in public databases at the time of this writing.

Finally, to assess factors associated with a higher likelihood of informative findings, we performed bivariate and multivariate analyses of probands’ demographics and clinical phenotypes using logistic regression. In the bivariate analyses, the presence of neurologic abnormalities, immune dysfunction, and being in the 12–18 years age group were significantly associated with diagnostic yield. In the multivariate analysis, only being in the 12–18 years age group remained significantly and positively associated with diagnostic yield (Table 4).

Table 4 Logistic regression analysis. Identification of variables associated with obtaining an informative result.

Discussion

This is the first reported experience of a local implementation program for the genetic evaluation of patients with RUD in Chile using exome sequencing. The process allowed us to establish a trained team for clinical assessment and exome analysis through a rigorous pipeline that enabled the diagnosis in nearly half of the study participants. It also permitted tens of patients with severe previously undiagnosed disorders to receive information on the cause of their condition.

The process of implementation highlights the relevance of international collaboration to obtain critical knowledge and training when these resources are unavailable locally. To circumvent the lack of genomic capabilities at scale in Chile, our strategy relied on outsourcing the sequencing component, while leveraging and developing capacities in the clinical evaluation and bioinformatics analysis that can be performed in-house. Although the use of a commercially available interpretation platform increases the costs compared to the use of freely available software, we opted for the former for data security reasons, consistency in the analyses, and to build this project with standards that could be transferred to a clinical laboratory. The hybrid strategy used in DECIPHERD can serve as an intermediate step in establishing autonomous local genomic services for undiagnosed diseases and other genetic conditions in similarly resource-limited settings.

As a result of the study, we were able to identify a molecular etiology for almost half of the participants across a wide range of clinical phenotypes. This is similar to published data from large studies in European and North American populations [10, 25] and higher than studies in Argentina and Brazil, reporting detection rates of 30–40% [26, 27]. Our approach also identified patients in special situations, such as known dominant conditions presenting in probably recessive forms, and potential dual diagnoses.

ES had higher diagnostic yield than CES in this study. We explored the latter strategy in the initial phase of the program, given the available local sequencing capacities and published reports showing that proband-only CES could be a cost-effective solution in resource-limited settings [28]. Nevertheless, the use of CES limits the identification of potential novel candidate genes, which can be of special relevance in understudied populations. In addition, and as seen in other admixed populations, de novo variants were a major contributor to the severe phenotypes observed in the patients, highlighting the value of including parental studies [10, 25, 29, 30]. The ES pipeline also allowed detection of both SNVs and CNVs, demonstrating the benefits of a single integrated genomic test for persons with RUD. These observations provide arguments in favor of the recommendation for trio ES analysis to increase diagnostic rates in this population [6].

Regarding the inherited variants, three of the eight patients with homozygous findings reported parental consanguinity. This information may have been unknown or undeclared by the rest of the families, or the findings may alternatively suggest founder effects that could be of relevance to the broader Chilean population.

Importantly, about half of our informative findings have not been previously described in large, publicly available population or clinical databases such as gnomAD and ClinVar. This knowledge gap regarding genomic variation in the local population also affects the filtering process; the use of an internal database, generated through this and other ongoing projects, was crucial for interpretation. This highlights the need to increase diversity of genomic variation studies, since including understudied populations can make substantial contributions to relevant new knowledge, valuable for clinical and other genomic applications [31,32,33].

The presence of neurologic findings (mainly intellectual disability) was significantly associated with diagnostic yield, while immune dysfunction was negatively associated. The former is similar to the recent findings of Wright et al. [10], but we did not find other statistically significant and clinically relevant associations as shown in their study. This could be due to the small sample size that limits statistical power. It is relevant to note that all probands entering the study due to dysfunction of the immune system had prior negative gene panel testing, while the remaining patients had more heterogeneous pre-entry genetic testing.

Limitations of our study included the relatively small sample size and the small number of samples that underwent head-to-head comparison of results, which precluded an in-depth analysis of performance metrics in this implementation process. Pre-inclusion clinical evaluations were very heterogeneous among participants, as mentioned above, owing to the different availability of genetic services across the country. Despite DECIPHERD’s efforts to have broad geographic representation and work as a country-wide network, most of the participants came from the Metropolitan Region of Santiago, which has the largest concentration of the population and an even disproportionately larger concentration of medical subspecialists [16]. Despite these limitations and potential biases, this work likely represents a “real-world” situation for Chilean RUD patients.

Several next steps are necessary to increase the diagnostic yield and to implement genomic sequencing in the Chilean health care system. First, the study of this understudied population has the potential of identifying variants in novel, unassociated candidate genes. Selected variants in novel candidate genes have been shared in GeneMatcher [34] and connections with colleagues worldwide are underway. We anticipate this may contribute to the discovery of new genetically determined disorders. Second, reanalysis has been shown to improve the diagnostic yield of genomic testing based on new discoveries and updated bioinformatic tools [25]. Our team is committed to reanalyzing exome results of this cohort at least one year after the initial analysis, and we predict this could also contribute to improving the diagnostic yield. In addition, we plan to analyze secondary findings among the families that consented to receive them, which may lead to additional clinically relevant findings. It would be ideal to also implement other strategies such as genome and/or transcriptome sequencing to further increase diagnostic yield for patients with non-informative or partially informative findings [35,36,37].

Nevertheless, local implementation of a sequencing pipeline is still hindered by prohibitive costs. If prices drop to cost-efficient levels for LMICs, it may be feasible to implement NGS at scale in our country. Sample volume is a pricing factor, particularly in a small country, and therefore collaboration between different areas and centers that use NGS can also facilitate the development of local sequencing centers. Formal and more extensive training of the clinical, laboratory and bioinformatics workforce is also crucial, along with certification and accreditation. Performing testing in the country is key for patient access, as the Chilean health system does not provide financial coverage for testing performed abroad.

Another crucial element to consider for decision-makers is the impact of reaching a diagnosis. In parallel to the diagnostic pipeline described in this article, our team is collecting quantitative and qualitative information on the effects of achieving a genetic diagnosis for patients, caregivers, healthcare teams and the healthcare system to contribute with locally pertinent data and insights that can guide the elaboration of national policies.

Our work highlights the feasibility of establishing a program for rare and undiagnosed diseases in a country with limited genomic resources. This is aligned with the World Health Organization’s call to advance genomic medicine worldwide [14], and the United Nations Sustainable Development Goals to “leave no one behind” [38]. We expect this work may be useful for other countries in similar situations to develop their own RUD programs.