Introduction

Small airways and interstitial pulmonary disease (also known as childhood hereditary interstitial lung diseases, chILD) refers to complex respiratory disorders characterized by overlapping signs and symptoms of pulmonary dysfunction1. These entities often manifest clinically from early infancy2,3,4. A leading cause is surfactant dysfunction, which has been associated with pathogenic variants involving, for example, ABCA3 (ATP-binding cassette, subfamily A, member 3) and CSF2RB (granulocyte–macrophage colony-stimulating factor receptor, beta)5,6,7,8,9. Overlapping etiologies include pathogenic variants of SFTP (surfactant, pulmonary-associated proteins) family genes and MUC5B (MUCIN 5, subtype B, tracheobronchial), which have been associated with idiopathic pulmonary fibrosis10,11. Similarly, pathogenic variants of SCNN1B (sodium channel, nonvoltage-gated 1, beta subunit) have been linked to a structurally destructive small airway disease that leads to bronchiectasis12. Variants of SERPINA1 (serpin peptidase inhibitor, clade A, member 1) can cause alpha-1-antitrypsin deficiency13.

Currently, genetic testing is the best approach to diagnosing these disorders, as it lowers the cost and the need for invasive investigations, such as lung biopsy14,15,16,17. Lung biopsy nevertheless retains a reasonable yield in some settings: it may reveal fibro-inflammatory changes in an autoimmune setting that warrant specific intervention, and it provides histological analysis in individuals with variants of unknown clinical significance.

The local population of the United Arab Emirates includes tribes from the Arabian Peninsula, Persia, Baluchistan, and East Africa. Founder mutations and autosomal recessive disorders are exceptionally common18,19. Many of these diseases may be amenable to prevention through genetic screening and counselling. This study examines the pathogenicity of variations in genes associated with interstitial lung disorders, mainly found in our pediatric patients. Its main purpose is to provide computational and clinical information that improves family counseling. In addition, the results may improve the clinical care of these children and provide opportunities for participating in clinical trials that involve molecularly targeted therapies.

Methods

This retrospective data collection study was approved by ‘Tawam Human Research Ethics Committee’ (SA/AJ/566 on 19th April 2018 and AA/AJ/653 on 19th June 2019). Informed consent to participate in this ‘Retrospective Chart Review’ was exempt. All methods were performed in accordance with the relevant guidelines and regulations.

Our pediatric pulmonary service routinely requests genetic studies for children with an unexplained respiratory disease. These investigations are performed by Centogene AG (Germany), and include diagnostic exome sequencing20 or the Comprehensive Pulmonary Disease Panel (https://www.centogene.com/science/centopedia/comprehensive-pulmonary-disease-panel.html. Accessed 01 June 2020).

Variant information in open databases was combined using Ensembl Variant Effect Predictor21. Pathogenicity prediction included scores available in dbNSFP (One-Stop Database of Functional Predictions and Annotations for Human Non-synonymous and Splice Site) for SIFT, PolyPhen, Condel, CADD, FATHMM, LRT, MetaLR, MetaSVM, Mutation Assessor, Mutation Taster, PROVEAN, REVEL, and VEST322,23. Multiple sequence alignment was performed to determine amino acid conservation at sites shown in Table 1, and to compute Jensen-Shannon Divergence (JSD) scores. Amino acid sequences of proteins from Homo sapiens (human), Pan troglodytes (chimpanzee), Mus musculus (house mouse), Rattus norvegicus (Norway rat), Canis lupus familiaris (dog), Equus caballus (horse), Bos taurus (bovine), Xenopus tropicalis (frog), and Gallus gallus (chicken) were collected from NCBI RefSeq and aligned using MUSCLE24 in Geneious 9.1.8 (https://www.geneious.com). Aligned sequences were exported in FASTA format to compute JSD. Potential binding pockets in the studied proteins were evaluated using firestar25 and 3DLigandSite26; post-translational modifications were assessed using UniProt27, PhosphoSitePlus28 and iPTMnet29; and functional domains were determined using InterPro30. The variants were grouped into three clusters (likely pathogenic, uncertain, and likely benign) by k-means clustering in R version 3.6.0, using all pathogenicity scores in Table 1 (Table S1, Supplementary Material). This method yielded p < 0.020 on the Kruskal–Wallis test between the three groups for each of the 13 scoring tools. The American College of Medical Genetics (ACMG) classification of the variants from Varsome31 was evaluated.
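The per-site conservation score can be illustrated with a minimal sketch. The source does not state which JSD implementation, background distribution, or gap and window weighting was used, so the version below assumes a uniform background over the 20 amino acids and simply ignores gaps. The function name `jsd_conservation` and the example alignment columns are hypothetical and are not taken from the study data.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def jsd_conservation(column, pseudocount=1e-6):
    """Jensen-Shannon divergence (base 2, range 0..1) between the residue
    distribution of one alignment column and a uniform background.

    Assumptions (not from the source): uniform background, gaps and
    unknown characters are dropped, no sequence or window weighting.
    """
    residues = [r for r in column.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("column contains no standard amino acids")
    counts = Counter(residues)
    total = len(residues) + pseudocount * len(AMINO_ACIDS)
    p = [(counts[a] + pseudocount) / total for a in AMINO_ACIDS]
    q = [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence in bits; zero-probability terms contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical columns from a nine-species alignment.
conserved_column = "VVVVVVVVV"   # same residue in every species
variable_column = "VVIIVIVLI"    # several substitutions
print(round(jsd_conservation(conserved_column), 3))
print(round(jsd_conservation(variable_column), 3))
```

Under these assumptions a fully conserved column scores higher than a column with substitutions, which is the direction in which the JSD values in the Results are read. Absolute values from this sketch are not expected to match the reported scores.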

Table 1 Studied variants of interstitial lung diseases.

Homology modeling

Homology models were generated for variants for which suitable template structures were available. The following Protein Data Bank (PDB) structures were used for modeling: CSF2RB (PDB ID: 2GYS), SCNN1B (PDB ID: 6BQN), and SERPINA1 (PDB ID: 3NE4). Models were generated using Schrödinger Prime 2019-2 (Prime, Schrödinger, LLC, New York, NY, 2019).

Statistics

The analyses were performed using the SPSS statistical package (version 20). The Kruskal–Wallis H test (nonparametric, k independent samples) was used to compare the three groups of variants. The Mann–Whitney U test [nonparametric, 2 independent samples, “Exact Sig (2-tailed)”] was used to compare two groups of variants. p < 0.05 was considered significant.
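As an illustration of the three-group comparison, the sketch below computes the Kruskal–Wallis H statistic with a tie correction. The analyses in this study were run in SPSS; this Python version is only a sketch under stated assumptions. It uses the asymptotic chi-square approximation, which for exactly three groups (2 degrees of freedom) has the closed form exp(−H/2). The function name and all scores are hypothetical.

```python
import math


def kruskal_wallis_three_groups(a, b, c):
    """Kruskal-Wallis H (tie-corrected) and asymptotic p-value for three groups.

    With three groups the chi-square reference distribution has 2 degrees
    of freedom, whose upper tail is exactly exp(-H / 2).
    Assumes the pooled values are not all identical.
    """
    groups = [a, b, c]
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)

    # Assign midranks to tied values and accumulate the tie-correction term.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sums_sq = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sums_sq - 3 * (n + 1)
    h /= 1 - tie_term / (n ** 3 - n)
    p = math.exp(-h / 2)
    return h, p


# Hypothetical scores for clusters of 7, 8, and 5 variants (not study data).
likely_pathogenic = [0.61, 0.70, 0.78, 0.85, 0.86, 0.90, 0.93]
uncertain = [0.20, 0.25, 0.33, 0.41, 0.47, 0.50, 0.55, 0.58]
likely_benign = [0.01, 0.03, 0.04, 0.15, 0.18]

h, p = kruskal_wallis_three_groups(likely_pathogenic, uncertain, likely_benign)
print(f"H = {h:.3f}, p = {p:.5f}")
```

The hypothetical groups here do not overlap at all, so this is the most extreme H possible for group sizes of 7, 8, and 5.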

Results

Table 1 summarizes the pathogenicity of 24 variants in the six gene families studied: two genes are linked to surfactant metabolism dysfunction (ABCA3 and CSF2RB), two to pulmonary fibrosis (MUC5B and SFTP), one to bronchiectasis (SCNN1B), and one to alpha-1-antitrypsin deficiency (SERPINA1). Twenty variants are missense, two nonsense, and two intronic. None of the variations in the coding region are at sites known to be post-translationally modified based on UniProt27, PhosphoSitePlus28 and iPTMnet29.

Figure 1 shows a multidimensional scaling (MDS) plot for the 20 missense variants, using all pathogenicity prediction scores in Table 1. Seven variants cluster in the lower left zone (‘red’), likely pathogenic with mean ± SD (median) Condel scores of 0.804 ± 0.097 (0.855). Five variants cluster in the middle right zone (‘green’), likely benign with Condel scores of 0.110 ± 0.131 (0.042). The remaining eight variants are in between (‘orange’), ‘uncertain’ with Condel scores of 0.430 ± 0.231 (0.468).

Figure 1

A multidimensional scaling (MDS) plot for the 20 missense variants, using all scores shown in Table 1. The three k-means clusters obtained are colored in red (likely pathogenic), orange (uncertain), and green (likely benign).
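The three-cluster grouping can be sketched as follows. The study ran k-means in R 3.6.0 on all pathogenicity scores; the source does not say whether scores were rescaled or how centroids were initialized. This Python sketch therefore assumes z-scored columns and a deterministic farthest-first initialization, and it uses a toy table of two hypothetical scores per variant rather than the study data.

```python
import math


def zscore_columns(rows):
    """Standardize each score column to mean 0 and SD 1 (assumed preprocessing)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with farthest-first initialization (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical (Condel-like, CADD-like) scores for six variants.
toy_scores = [
    [0.85, 28.0], [0.80, 26.0],   # high on both scores
    [0.45, 20.0], [0.40, 18.0],   # intermediate
    [0.05, 5.0], [0.10, 8.0],     # low on both scores
]
labels, _ = kmeans(zscore_columns(toy_scores), k=3)
print(labels)
```

Cluster indices returned by k-means are arbitrary, so the mapping of each cluster to "likely pathogenic", "uncertain", or "likely benign" has to be made afterwards from the cluster's score profile.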

Figure S1 (Supplementary Material) shows ‘dot plots’ of the distribution of the pathogenicity prediction scores used in the MDS plot. The difference between the three clusters is significant for each score (p-values using the Kruskal–Wallis H test): SIFT = 0.0198; PolyPhen = 0.001; Condel = 0.0007; CADD = 0.0009; FATHMM = 0.0024; MetaLR = 0.0005; MetaSVM = 0.00063; Mutation Assessor = 0.01778; Mutation Taster = 0.00725; PROVEAN = 0.00596; REVEL = 0.00035; and VEST3 = 0.0013.

The difference between the two clusters ‘likely pathogenic’ and ‘likely benign’ for each score (p values using the Mann–Whitney U test) is: SIFT = 0.003; PolyPhen = 0.003; Condel = 0.003; CADD = 0.003; FATHMM = 0.003; MetaLR = 0.003; MetaSVM = 0.003; Mutation Assessor = 0.018; Mutation Taster = 0.003; PROVEAN = 0.003; REVEL = 0.003; and VEST3 = 0.003.
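A short worked example may help explain why a p value such as 0.003 can recur across predictors. With groups of seven and five variants, an exact two-tailed Mann–Whitney test cannot return a p value below 2/C(12, 5) = 2/792 ≈ 0.0025, and that floor is reached whenever the two groups do not overlap. The sketch below enumerates the exact null distribution. It assumes every variant has a score and that there are no ties; the function name and the scores are hypothetical.

```python
from itertools import combinations


def mann_whitney_exact(x, y):
    """Exact two-sided Mann-Whitney U test by full enumeration.

    Assumes all pooled values are distinct (no ties) and that the group
    sizes are small enough to enumerate every rank assignment.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted(x + y)
    rank_of = {v: i + 1 for i, v in enumerate(pooled)}

    def u_from_ranks(ranks):
        return sum(ranks) - n1 * (n1 + 1) / 2

    u_obs = u_from_ranks([rank_of[v] for v in x])
    mean_u = n1 * n2 / 2
    dev_obs = abs(u_obs - mean_u)

    total = 0
    as_extreme = 0
    for ranks in combinations(range(1, n1 + n2 + 1), n1):
        total += 1
        if abs(u_from_ranks(ranks) - mean_u) >= dev_obs:
            as_extreme += 1
    return u_obs, as_extreme / total


# Hypothetical, non-overlapping scores for 7 and 5 variants (not study data).
pathogenic_scores = [0.61, 0.70, 0.78, 0.85, 0.86, 0.90, 0.93]
benign_scores = [0.01, 0.03, 0.04, 0.15, 0.18]

u, p = mann_whitney_exact(pathogenic_scores, benign_scores)
print(f"U = {u:.0f}, exact two-tailed p = {p:.4f}")
```

Because the exact distribution is discrete, identical p values across several predictors indicate the same rank ordering of the two groups and say nothing about how large the score differences are.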

The difference between the two clusters ‘likely pathogenic’ and ‘uncertain’ for each score (p values using the Mann–Whitney U test) is: SIFT = 0.675; PolyPhen = 0.006; Condel = 0.002; CADD = 0.009; FATHMM = 0.014; MetaLR = 0.002; MetaSVM = 0.002; Mutation Assessor = 0.232; Mutation Taster = 0.232; PROVEAN = 0.232; REVEL = 0.001; and VEST3 = 0.014.

The difference between the two clusters ‘uncertain’ and ‘likely benign’ for each score (p values using the Mann–Whitney U test) is: SIFT = 0.045; PolyPhen = 0.006; Condel = 0.011; CADD = 0.003; FATHMM = 0.030; MetaLR = 0.003; MetaSVM = 0.006; Mutation Assessor = 0.019; Mutation Taster = 0.030; PROVEAN = 0.011; REVEL = 0.003; and VEST3 = 0.006.

Four autosomal recessive (AR) variants involve ABCA3. ABCA3:p.Ala149Val has conflicting predictions of pathogenicity (Table 1), mainly due to the high scores of CADD and FATHMM. Ala149 is conserved in mammals (JSD: 0.769, Fig. 2A); the variant replaces it with valine, which has a nonpolar side chain. It is found in the heterozygous state together with DNAH5:p.Gln1835*. Its clinical significance is unknown; its Varsome ACMG classification is likely benign. ABCA3:p.Val1057Met also has conflicting predictions of pathogenicity (Fig. 1). Val1057 is replaced by methionine in horse (JSD: 0.712, Fig. 2B). The child also has heterozygous SFTPA1:p.Gly98Ala. Its clinical significance is unknown; Varsome classifies it as of uncertain significance. ABCA3:p.Val1399Met has pathogenic scores (Fig. 1). Val1399 is highly conserved (JSD: 0.779, Fig. 2C) and InterPro30 indicates that this residue is part of the ATP-binding cassette (ABC) transporter-like domain (IPR003439) of the protein. Evaluation of functionally important residues using firestar25 suggests the adjacent residue Ala1398 could be part of the ATP binding site. The variant is identified in the homozygous state in three siblings with severe respiratory disease (one died at 7 months of age). Findings on their chest radiographs and computerized tomography (CT) scans suggest small airway disease (diffuse ground-glass opacification). Both computational and clinical data indicate this variant is pathogenic. ABCA3:p.Arg1559* is nonsense (CADD: 12.31). It is found in the heterozygous state during screening for genetic diseases. Its clinical significance is unknown; Varsome classifies it as pathogenic.

Figure 2

Twenty-one amino acid regions, centered on the missense variation, obtained from a multiple sequence alignment of protein sequences from human, chimpanzee, mouse, rat, dog, horse, bovine, frog, and chicken, where available. (A) ABCA3, A149V (c.446C > T); (B) ABCA3, V1057M (c.3169G > A); (C) ABCA3, V1399M (c.4195G > A); (D) CSF2RB, V105I (c.313G > A); (E) CSF2RB, R461C (c.1381C > T); (F) SCNN1B, R206W (c.616C > T); (G) SCNN1B, E468K (c.1402G > A); (H) SCNN1B, R624H (c.1871G > A); (I) SERPINA1, P393S (c.1177C > T); (J) SFTPA1, G98A (c.293G > C); (K) SFTPA1, N225K (c.675C > G); (L) SFTPC, H59R (c.176A > G); (M) SFTPC, T158M (c.473C > T).
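The text writes protein changes in three-letter form (for example, p.Val1399Met), whereas the figure panels use one-letter shorthand (V1399M). A small converter such as the sketch below can be used to cross-check the two notations. The function name `protein_change_short` is hypothetical, and the pattern handles only simple substitutions and stop-gain variants written with `*`.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

_PATTERN = re.compile(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2}|\*)")


def protein_change_short(hgvs_p):
    """Convert e.g. 'p.Val1399Met' to 'V1399M' and 'p.Arg1559*' to 'R1559*'.

    Only simple substitutions and '*' stop codons are supported; anything
    else raises ValueError.
    """
    match = _PATTERN.fullmatch(hgvs_p)
    if match is None:
        raise ValueError(f"unsupported protein change: {hgvs_p!r}")
    ref, position, alt = match.groups()
    if ref not in THREE_TO_ONE or (alt != "*" and alt not in THREE_TO_ONE):
        raise ValueError(f"unknown amino acid code in {hgvs_p!r}")
    alt_short = "*" if alt == "*" else THREE_TO_ONE[alt]
    return f"{THREE_TO_ONE[ref]}{position}{alt_short}"


for change in ("p.Ala149Val", "p.Val1399Met", "p.Arg1559*"):
    print(change, "->", protein_change_short(change))
```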

Two variants involve autosomal recessive CSF2RB. CSF2RB:p.Val105Ile has benign scores (Table 1). Val105 is not conserved (JSD: 0.548); it is replaced by isoleucine in multiple species (Fig. 2D). In CSF2RB (Fig. 3A), Val105 (Fig. 3B) is located on a solvent-exposed loop. Since Ile105 (Fig. 3C) has similar physicochemical properties, it is not expected to significantly affect the protein structure or function. It is probably benign, in agreement with its Varsome ACMG classification. CSF2RB:p.Arg461Cys has conflicting predictions of pathogenicity, mainly due to the low scores of LRT, Mutation Assessor, Mutation Taster, and PROVEAN. Arg461 is conserved (JSD: 0.675); it is replaced by cysteine in frog (Fig. 2E). It is identified as pathogenic in the MDS plot (Fig. 1). It is found in the heterozygous state during screening for genetic diseases. Its clinical significance is unknown, in agreement with the Varsome ACMG classification of uncertain significance.

Figure 3

Structural models of wild type and variant proteins. The protein structure is shown in white cartoon representation and the amino acid is shown in stick representation. The red boxed region in each case is enlarged in the subsequent images. (A) Structure of CSF2RB with (B) wild type Val105 and (C) variant Ile105. (D) Structure of SERPINA1 with (E) wild type Pro393 and (F) variant Ser393. (G) Structure of SCNN1B with (H) wild type Arg206, (I) variant Trp206, (J) wild type Glu468 and (K) variant Lys468.

Six variants involve autosomal dominant MUC5B. MUC5B:p.Arg2200Gln, MUC5B:p.Thr3451Met, and MUC5B:p.Pro4895Ser have consistent benign scores (Table 1). They are identified in children with significant respiratory infections. Here, the clinical information is inconsistent with Varsome ACMG classification of these variants (Table 1). MUC5B:p.Ile4979Thr and MUC5B:p.Gly5580Arg have conflicting predictions of pathogenicity. Gly5580 is located in the von Willebrand factor type C (VWFC) domain (InterPro ID: IPR001007) of the protein. Both variants are identified in compound heterozygous state in two siblings with severe respiratory disease from birth (one died at 3 years of age). In one sibling, chest radiographs and CT scans at 10 months and 3 years of age show marked perihilar bands of atelectasis and bronchial wall thickening (small airway disease). The clinical information is also inconsistent with Varsome ACMG classification of these variants (Table 1).

MUC5B:c.16861G > T, p.Glu5621* has pathogenic predictions (e.g., CADD: 37.0). It is identified in homozygous state in two siblings with severe respiratory disease since birth. One sibling has homozygous MUC5B:c.16861G > T plus heterozygous SFTPA1:c.675C > G, and one has only homozygous MUC5B:c.16861G > T. The one with the two different variants has more severe respiratory disease (e.g., frequent intensive care admissions). The one with only homozygous MUC5B:c.16861G > T had lung biopsy at 18 months of age, which showed significant alveolar growth abnormality (deficient alveolarisation) and interstitial fibrosis (Fig. 4)32.

Figure 4

Lung (left lower lobe) biopsy at 18 months of age in the child with homozygous MUC5B:c.16861G > T, p.Glu5621*. Hematoxylin and eosin stain showing diffuse enlargement and simplification of the airspaces with thin alveolar septae (stars), mild interstitial fibrosis (long thin arrow), and intra-alveolar macrophages (short thin arrow).

Three variants involve autosomal dominant SCNN1B. SCNN1B:p.Arg206Trp has conflicting predictions of pathogenicity (Table 1), mainly due to the high CADD and Mutation Taster scores. Arg206 is highly conserved (JSD: 0.836, Fig. 2F). In SCNN1B (Fig. 3G), Arg206 is located on a solvent-exposed β-strand on the surface (Fig. 3H); it does not make notable interactions within the protein. The change to aromatic Trp206 (Fig. 3I), while significant in terms of structure and physicochemical properties, is largely local. This variant is found in two cousins with mild bronchiectasis. Radiologically, the disease mainly involves the small airways (Fig. 5A). The clinical information suggests pathogenicity (Table 1). SCNN1B:p.Glu468Lys has conflicting predictions of pathogenicity (Table 1). Glu468 is conserved (JSD: 0.788, Fig. 2G). The negatively charged Glu468 is located on a solvent-exposed helix on the surface of the protein (Fig. 3J). The change to positively charged Lys468 (Fig. 3K) is physicochemically drastic. Given its surface location and lack of intramolecular interactions, however, the change may not affect the protein. It is found in a child with severe respiratory symptoms and cricoid cartilage cleft. His radiological findings are ground-glass opacification and dependent atelectasis; his lung biopsy shows lipid-laden alveolar macrophages. The clinical information suggests pathogenicity (Table 1). SCNN1B:p.Arg624His also has conflicting predictions of pathogenicity (Table 1). Arg624 is highly conserved (JSD: 0.821, Fig. 2H). It is found in a child with recurrent sinusitis and normal chest radiograph at 10 months of age. He also has heterozygous DNAH5:c.8765G > A. The clinical significance of this variant is unknown.

Figure 5

(A) Chest radiograph and unenhanced high-resolution chest CT axial image of an 18-month-old girl with heterozygous SCNN1B:p.Arg206Trp. The chest radiograph shows hyperinflated lungs with bronchial wall thickening and a band of atelectasis. The chest CT image demonstrates features of air-trapping, bronchial wall thickening and mild bronchiectasis. (B) Chest radiograph and unenhanced high-resolution chest CT axial image of an 18-month-old girl with heterozygous SFTPA1:p.Gly98Ala. The chest radiograph shows bilateral, fairly symmetrical ground-glass opacification that relatively spares the lung apices. The chest CT image demonstrates a combination of septal thickening and alveolar ground-glass opacification, creating a typical crazy-paving pattern.

SERPINA1:p.Pro393Ser (autosomal recessive) has consistent pathogenic scores (Table 1). Pro393 is highly conserved (JSD: 0.795, Fig. 2I). In SERPINA1 (Fig. 3D), Pro393 is located at the beginning of a β-strand (Fig. 3E). The physicochemical properties of serine are notably different, and the Ser393 variant (Fig. 3F) is likely to affect the structure or intramolecular interactions in this protein. This variant, also known as Mwürzburg, results in a significant reduction in the level of the enzyme in vitro and in vivo, indicating it could affect the structure and function of the protein33. It is found in the heterozygous state during screening for genetic diseases.

Eight variants involve SFTP (surfactant, pulmonary-associated proteins; autosomal dominant). SFTPA1:p.Gly98Ala has pathogenic scores, except for LRT and Mutation Taster (Table 1). Gly98 is conserved (JSD: 0.700), but replaced by alanine in chicken (Fig. 2J). It is identified in a child with severe respiratory disease and a crazy-paving pattern on the chest CT suggesting interstitial lung disease (Fig. 5B). The clinical information is inconsistent with the Varsome ACMG classification of likely benign (Table 1). SFTPA1:p.Asn225Lys has conflicting predictions of pathogenicity (Table 1). Asn225, located in the Collectin, C-type lectin-like domain (InterPro ID: IPR033990) of the protein, is highly conserved (JSD: 0.860, Fig. 2K). It is identified in a child with severe lung disease and homozygous MUC5B:p.Glu5621*. Its clinical significance is unknown, and its Varsome ACMG classification is likely benign (Table 1). SFTPA2:p.Val25Ile has benign scores (Table 1), consistent with the likely permissible replacement of valine with isoleucine. It is found in two siblings with atopy and recurrent sinusitis. SFTPA2:p.Tyr191Cys has conflicting predictions of pathogenicity (Table 1). Tyr191 is located in the Collectin, C-type lectin-like domain (InterPro ID: IPR033990) of the protein. The variant is identified in a toddler with chronic wet cough and normal chest radiograph at 14 months of age; he was lost to follow-up. Its clinical significance is unknown, and its Varsome ACMG classification is likely benign (Table 1). SFTPB:c.1039-6C > G is found in a toddler with respiratory symptoms since birth, which improved with age. He has normal chest radiographs at 2 and 4 months of age. SFTPC:p.His59Arg has conflicting predictions of pathogenicity (Table 1). His59, part of the surfactant protein C, N-terminal propeptide (InterPro ID: IPR015091), is highly conserved (JSD: 0.858, Fig. 2L), favoring PolyPhen (0.96) and CADD (23.6) scores. It is found in a child with chronic wet cough and normal chest radiograph at 5 years of age. Its clinical significance is unknown, and its Varsome ACMG classification is likely benign (Table 1). SFTPC:p.Thr158Met has benign scores (Table 1). Thr158, located in the BRICHOS domain (InterPro ID: IPR007084) of the protein, is not highly conserved (JSD: 0.614, Fig. 2M). The variant is found in a boy with respiratory symptoms since infancy, which improved with age. His chest radiograph at five years of age is normal. Its clinical significance is unknown, and its Varsome ACMG classification is ‘uncertain significance’ (Table 1). SFTPD:c.199 + 9G > A has a CADD score of 5.1, and a benign Varsome ACMG classification (Table 1). It is found in the heterozygous state during screening for genetic diseases.

Discussion

The results here show significant respiratory diseases associated with likely pathogenic variants, such as ABCA3:p.Val1399Met, MUC5B:p.Ile4979Thr, MUC5B:p.Gly5580Arg, MUC5B:p.Glu5621*, SCNN1B:p.Arg206Trp, SCNN1B:p.Glu468Lys, and SFTPA1:p.Gly98Ala. Many of these variants have conflicting predictions of pathogenicity. Therefore, investigating phenotypes associated with such variants is important. Future studies, however, are needed to determine their prevalence in the community. It is worth emphasizing that family (parents and all siblings) genetic studies are important when a pathogenic variant is identified. The cost of this endeavor may need to be included in the original agreement between the treating institution and the investigating laboratory.

Identifying a variant as disease-causing is expected to improve the overall clinical care plan, including counseling. Some of these children may be eligible for lung transplantation, and the variant analysis may be helpful in this regard17. Other hopes may include gene therapy and gene editing (when available). Many of these variants are autosomal dominant and, thus, are not directly amenable to prevention by premarital screening. Autosomal dominant disorders, however, are amenable to prevention through preimplantation genetic testing. This procedure involves in vitro fertilization followed by biopsy of the embryo for genetic testing. The selected embryo is then transferred into the uterus34. Thus, a genetic diagnosis is essential for all these serious disorders.

Improved efforts are needed to minimize delays in diagnosis and treatment. The success with management of cystic fibrosis (including the novel use of specific ATP analogs) should encourage translational research focused on other devastating respiratory diseases, such as chILD. Advancements toward this goal require continual reports on the molecular diagnosis.

Another important finding of this study is the conflicting predictions of pathogenicity. For example, ABCA3:p.Val1057Met has a SIFT score of zero (damaging) and a PolyPhen score of 0.195 (benign), with a classification of uncertain significance in Varsome and no reports in ClinVar. Another example is SCNN1B:p.Glu468Lys, which has a SIFT score of zero, a PolyPhen score of 0.369 (benign), no reports in ClinVar, and an ACMG classification of uncertain significance in Varsome. This autosomal dominant variant is identified in a child with severe respiratory disease. Thus, a thorough clinical interpretation of genetic variants is needed. Moreover, clinicians need to provide detailed information on the natural history of the disease for both the index case and the extended family. In addition, commercial laboratories need to commit to a better investigation of variants, including variants of unknown significance, without extra charges.

MUC5B has sequences with vastly varying lengths in different species (e.g., human, 5792; chimpanzee, 7982; rat, 4096). These large length differences potentially affect the pathogenicity scores, as predictors depend directly or indirectly on sequence alignments. Examining MUC5B in the dataset is necessary to understand the level of normalization used for this gene. Therefore, the prediction scores for MUC5B:p.Arg2200Gln, MUC5B:p.Thr3451Met, and MUC5B:p.Pro4895Ser may require future confirmation.

In summary, variants associated with interstitial lung and small airway diseases are described here. Affected children show significant respiratory symptoms at an early age, and the disease advances over time. Genetic tests should also be included in the evaluation of adults with an unexplained lung disease. Homology modeling of the variants may assist in designing compounds that modulate the function of the defective proteins. The results emphasize the use of genetic tests in unexplained respiratory disorders. They also help in generating population-based genetic panels for childhood lung diseases.