Genetic variants of small airways and interstitial pulmonary disease in children

Genetic variants of small airways and interstitial pulmonary disease have not been comprehensively studied. This cluster of respiratory disorders usually manifests from early infancy (‘lung disease in utero’). In this study, 24 variants linked to these entities are described. The variants involved two genes associated with surfactant metabolism dysfunction (ABCA3 and CSF2RB), two with pulmonary fibrosis (MUC5B and SFTP), one with bronchiectasis (SCNN1B), and one with alpha-1-antitrypsin deficiency (SERPINA1). A nonsense variant, MUC5B:c.16861G > T, p.Glu5621*, was found in homozygous state in two siblings with severe respiratory disease from birth. One of the siblings also had heterozygous SFTPA1:c.675C > G, p.Asn225Lys, which resulted in more severe respiratory disease. The sibling with only the homozygous MUC5B variant underwent a lung biopsy, which showed alveolar simplification, interstitial fibrosis, intra-alveolar lipid-laden macrophages, and foci of foreign body giant cell reaction in distal airspaces. Two missense variants, MUC5B:c.14936T > C, p.Ile4979Thr (rs201287218) and MUC5B:c.16738G > A, p.Gly5580Arg (rs776709402), were also found in compound heterozygous state in two siblings with severe respiratory disease from birth. Overall, the results emphasize the need for genetic studies in patients with complex respiratory problems. Identifying pathogenic variants, such as those presented here, assists in effective family counseling aimed at genetic prevention. In addition, results of genetic studies improve clinical care and provide opportunities for participation in clinical trials, such as those involving molecularly targeted therapies.


Methods
This retrospective data collection study was approved by 'Tawam Human Research Ethics Committee' (SA/AJ/566 on 19th April 2018 and AA/AJ/653 on 19th June 2019). This 'Retrospective Chart Review' was exempt from the requirement for informed consent to participate. All methods were performed in accordance with the relevant guidelines and regulations. Our pediatric pulmonary service routinely requests genetic studies for children with an unexplained respiratory disease. These investigations are performed by Centogene AG (Germany), and include diagnostic exome sequencing 20 or the Comprehensive Pulmonary Disease Panel (https://www.centogene.com/science/centopedia/comprehensive-pulmonary-disease-panel.html. Accessed 01 June 2020).
Variant information in open databases was combined using Ensembl Variant Effect Predictor 21. Pathogenicity prediction included scores available in dbNSFP (One-Stop Database of Functional Predictions and Annotations for Human Non-synonymous and Splice Site) for SIFT, PolyPhen, Condel, CADD, FATHMM, LRT, MetaLR, MetaSVM, Mutation Assessor, Mutation Taster, PROVEAN, REVEL, and VEST3 22,23. Multiple sequence alignment was performed to determine amino acid conservation at the sites shown in Table 1, and to compute Jensen-Shannon Divergence (JSD) scores. Amino acid sequences of proteins from Homo sapiens (human), Pan troglodytes (chimpanzee), Mus musculus (house mouse), Rattus norvegicus (Norway rat), Canis lupus familiaris (dog), Equus caballus (horse), Bos taurus (bovine), Xenopus tropicalis (frog), and Gallus gallus (chicken) were collected from NCBI RefSeq and aligned using MUSCLE 24 in Geneious 9.1.8 (https://www.geneious.com). Aligned sequences were exported in FASTA format to compute JSD. Potential binding pockets in the studied proteins were evaluated using firestar 25 and 3DLigandSite 26, post-translational modifications were assessed using UniProt 27, PhosphoSitePlus 28 and iPTMnet 29, and functional domains were determined using InterPro 30. The variants were grouped into three clusters (likely pathogenic, uncertain, and likely benign) by k-means clustering in R version 3.6.0, using all pathogenicity scores in Table 1 (Table S1, Supplementary Material). This method yielded p < 0.020 on the Kruskal-Wallis test between the three groups for each of the 13 scoring tools. The American College of Medical Genetics (ACMG) classification of the variants from Varsome 31 was evaluated.
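The conservation scores quoted throughout the Results are Jensen-Shannon Divergence values computed per alignment column. The sketch below illustrates the idea only and is not the implementation used in this study: it assumes a uniform background distribution over the 20 amino acids, a small pseudocount, a simple gap penalty, and no sequence weighting or window smoothing, so its values will not reproduce the JSD scores in Table 1. The function names (`column_distribution`, `jsd`, `conservation_score`) are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(column, pseudocount=1e-6):
    """Amino acid frequencies for one alignment column (gaps are ignored)."""
    counts = Counter(res for res in column.upper() if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS]


def jsd(p, q):
    """Jensen-Shannon divergence with log base 2, so the result lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def conservation_score(column, background=None):
    """JSD of a column against a background, down-weighted by its gap fraction.

    Assumption: the background is uniform unless one is supplied.
    """
    if background is None:
        background = [1 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    non_gap = sum(1 for res in column.upper() if res in AMINO_ACIDS)
    if non_gap == 0:
        return 0.0
    gap_penalty = non_gap / len(column)
    return jsd(column_distribution(column), background) * gap_penalty
```

Each column is passed as a string with one character per aligned species, for example `conservation_score("VVVVVVVIV")` for a site that is valine in eight sequences and isoleucine in one. A fully conserved column scores higher than a mixed one, and columns with gaps are penalized in proportion to the gap fraction.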
Statistics. The analyses were performed using the SPSS statistical package (version 20). The Kruskal-Wallis H test (non-parametric, k independent samples) was used to compare groups of variants. The Mann-Whitney U test [non-parametric, 2 independent samples, "Exact Sig (2-tailed)"] was used to compare two groups of variants. p < 0.05 was considered significant.
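As an illustration of the three-group comparison, the sketch below computes the tie-corrected Kruskal-Wallis H statistic and its chi-square p-value in plain Python. It is a minimal sketch and not the SPSS procedure used here. It is restricted to exactly three groups because, with two degrees of freedom, the chi-square survival function reduces to exp(-H/2). The function names are hypothetical, and the asymptotic p-value is only approximate for very small groups.

```python
import math
from collections import Counter


def rank_with_ties(values):
    """Assign 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def kruskal_wallis_3(groups):
    """Return (H, p) for exactly three independent groups of scores."""
    if len(groups) != 3 or any(len(g) == 0 for g in groups):
        raise ValueError("expected three non-empty groups")
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = rank_with_ties(pooled)

    # Sum of (rank sum squared / group size) over the three groups.
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Standard correction for tied observations.
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h = max(h / correction, 0.0)

    # Chi-square survival function with 2 degrees of freedom.
    p = math.exp(-h / 2)
    return h, p
```

For three fully separated groups of three made-up scores each, such as `[[0.80, 0.86, 0.90], [0.43, 0.47, 0.50], [0.04, 0.11, 0.30]]`, the function returns H = 7.2 and p of about 0.027.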
Ethics approval and consent to participate. This retrospective (Chart Review) data collection study was approved by 'Tawam Human Research Ethics Committee' (SA/AJ/566 on 19th April 2018 and AA/AJ/653 on 19th June 2019).

Results
Table 1 summarizes the pathogenicity of 24 variants in the six gene families studied; two genes are linked to surfactant metabolism dysfunction (ABCA3 and CSF2RB), two to pulmonary fibrosis (MUC5B and SFTP), one to bronchiectasis (SCNN1B), and one to alpha-1-antitrypsin deficiency (SERPINA1). Twenty variants are missense, two nonsense, and two intronic. None of the variations in the coding region are at sites known to be post-translationally modified, based on UniProt 27, PhosphoSitePlus 28 and iPTMnet 29. Figure 1 shows a multidimensional scaling (MDS) plot for the 20 missense variants, using all pathogenicity prediction scores in Table 1. Seven variants cluster in the lower left zone ('red') and are likely pathogenic, with mean ± SD (median) Condel scores of 0.804 ± 0.097 (0.855). Five variants cluster in the middle right zone ('green') and are likely benign, with Condel scores of 0.110 ± 0.131 (0.042). The remaining eight variants are in between ('orange') and are 'uncertain', with Condel scores of 0.430 ± 0.231 (0.468). Figure S1 (Supplementary Material) shows 'dot plots' for the distribution of the pathogenicity prediction scores used in the MDS plot. The difference between the three clusters is significant for each score (p-values using the […] (Table 1), mainly due to the high scores of CADD and FATHMM. Ala149 is conserved in mammals (JSD: 0.769, Fig. 2A); it is replaced by Val149, which has a nonpolar sidechain. The variant is found in heterozygous state with DNAH5:p.Gln1835*. Its clinical significance is unknown; the Varsome ACMG classification is likely benign. ABCA3:p.Val1057Met also has conflicting predictions of pathogenicity (Fig. 1). Val1057 is replaced by methionine in horse (JSD: 0.712, Fig. 2B). The child also has heterozygous SFTPA1:p.Gly98Ala.
Its clinical significance is unknown; Varsome classifies it as of uncertain significance. ABCA3:p.Val1399Met has pathogenic scores (Fig. 1). Val1399 is highly conserved (JSD: 0.779, Fig. 2C) and InterPro 30 indicates that this residue is part of the ATP-binding cassette (ABC) transporter-like domain (IPR003439) of the protein. Evaluation of functionally important residues using firestar 25 suggests the adjacent residue Ala1398 could be part of the ATP binding site. The variant is identified in homozygous state in three siblings with severe respiratory disease (one died at 7 months of age). Findings on their chest radiographs and computerized tomography (CT) scans suggest small airway disease (diffuse ground-glass opacification). Both computational and clinical data indicate this variant is pathogenic. ABCA3:p.Arg1559* is nonsense (CADD: 12.31). It is found in heterozygous state during screening for genetic diseases. Its clinical significance is unknown; Varsome classifies it as pathogenic.

Two variants involve autosomal recessive CSF2RB. CSF2RB:p.Val105Ile has benign scores (Table 1). Val105 is not conserved (JSD: 0.548); it is replaced by isoleucine in multiple species (Fig. 2D). In CSF2RB (Fig. 3A), Val105 (Fig. 3B) is located on a solvent-exposed loop. Since Ile105 (Fig. 3C) has similar physicochemical properties, the substitution is not expected to significantly affect the protein structure or function. The variant is probably benign, in agreement with its Varsome ACMG classification. CSF2RB:p.Arg461Cys has conflicting predictions of pathogenicity, mainly due to the low scores of LRT, Mutation Assessor, Mutation Taster, and PROVEAN. Arg461 is conserved (JSD: 0.675); it is replaced by cysteine in frog (Fig. 2E). The variant falls in the likely pathogenic cluster of the MDS plot (Fig. 1). It is found in heterozygous state during screening for genetic diseases. Its clinical significance is unknown, in agreement with the Varsome ACMG classification of uncertain significance.
Six variants involve autosomal dominant MUC5B. MUC5B:p.Arg2200Gln, MUC5B:p.Thr3451Met, and MUC5B:p.Pro4895Ser have consistent benign scores (Table 1). They are identified in children with significant respiratory infections. Here, the clinical information is inconsistent with Varsome ACMG classification of these variants (Table 1). MUC5B:p.Ile4979Thr and MUC5B:p.Gly5580Arg have conflicting predictions of pathogenicity. Gly5580 is located in the von Willebrand factor type C (VWFC) domain (InterPro ID: IPR001007) of the protein.
Both variants are identified in compound heterozygous state in two siblings with severe respiratory disease from birth (one died at 3 years of age). In one sibling, chest radiographs and CT scans at 10 months and 3 years of age show marked perihilar bands of atelectasis and bronchial wall thickening (small airway disease). The clinical information is also inconsistent with the Varsome ACMG classification of these variants (Table 1).
Figure 1. MDS plot for the 20 missense variants, using all pathogenicity prediction scores in Table 1. The three k-means clusters obtained are colored in red (likely pathogenic), orange (uncertain), and green (likely benign).
www.nature.com/scientificreports/
MUC5B:c.16861G > T, p.Glu5621* has pathogenic predictions (e.g., CADD: 37.0). It is identified in homozygous state in two siblings with severe respiratory disease since birth. One sibling has homozygous MUC5B:c.16861G > T plus heterozygous SFTPA1:c.675C > G, and one has only homozygous MUC5B:c.16861G > T. The one with the two different variants has more severe respiratory disease (e.g., frequent intensive care admissions). The one with only homozygous MUC5B:c.16861G > T had a lung biopsy at 18 months of age, which showed significant alveolar growth abnormality (deficient alveolarisation) and interstitial fibrosis (Fig. 4) 32.
Three variants involve autosomal dominant SCNN1B. SCNN1B:p.Arg206Trp has conflicting predictions of pathogenicity (Table 1), mainly due to the high CADD and Mutation Taster scores. Arg206 is highly conserved (JSD: 0.836, Fig. 2F). In SCNN1B (Fig. 3G), Arg206 is located on a solvent-exposed β-strand on the surface (Fig. 3H); it does not make notable interactions within the protein. The change to aromatic Trp206 (Fig. 3I), while significant in terms of structure and physicochemical properties, is largely local. This variant is found in two cousins with mild bronchiectasis. Radiologically, the disease mainly involves the small airways (Fig. 5A). The clinical information suggests pathogenicity (Table 1). SCNN1B:p.Glu468Lys has conflicting predictions of pathogenicity (Table 1). Glu468 is conserved (JSD: 0.788, Fig. 2G). The negatively charged Glu468 is located on a solvent-exposed helix on the surface of the protein (Fig. 3J). The change to positively charged Lys468 (Fig. 3K) is physicochemically drastic. Given its surface location and lack of intramolecular interactions, however, the substitution may not affect the protein.
It is found in a child with severe respiratory symptoms and cricoid cartilage cleft. His radiological findings are ground-glass opacification and dependent atelectasis; his lung biopsy shows lipid-laden alveolar macrophages. The clinical information suggests pathogenicity (Table 1). SCNN1B:p.Arg624His also has conflicting predictions of pathogenicity (Table 1). Arg624 is highly conserved (JSD: 0.821, Fig. 2H). It is found in a child with recurrent sinusitis and a normal chest radiograph at 10 months of age. He also has heterozygous DNAH5:c.8765G > A. The clinical significance of this variant is unknown. SERPINA1:p.Pro393Ser (autosomal recessive) has consistent pathogenic scores (Table 1). Pro393 is highly conserved (JSD: 0.795, Fig. 2I). In SERPINA1 (Fig. 3D), Pro393 is located at the beginning of a β-strand (Fig. 3E). The physicochemical properties of serine are notably different, and the Ser393 variant (Fig. 3F) is likely to affect the structure or intramolecular interactions of this protein. This variant, also known as Mwürzburg, results in a significant reduction in the level of the enzyme in vitro and in vivo, indicating it could affect the structure and function of the protein 33. It is found in heterozygous state during screening for genetic diseases.
Eight variants involve SFTP (surfactant, pulmonary-associated proteins; autosomal dominant). SFTPA1:p.Gly98Ala has pathogenic scores, except for LRT and Mutation Taster (Table 1). Gly98 is conserved (JSD: 0.700), but replaced by alanine in chicken (Fig. 2J). The variant is identified in a child with severe respiratory disease and a crazy-paving pattern on the chest CT suggesting interstitial lung disease (Fig. 5B). The clinical information is inconsistent with the Varsome ACMG classification of likely benign (Table 1). SFTPA1:p.Asn225Lys has conflicting predictions of pathogenicity (Table 1). Asn225, located in the Collectin, C-type lectin-like domain (InterPro ID: IPR033990) of the protein, is highly conserved (JSD: 0.860, Fig. 2K). The variant is identified in a child with severe lung disease and homozygous MUC5B:p.Glu5621*. Its clinical significance is unknown, and its Varsome ACMG classification is likely benign (Table 1). SFTPA2:p.Val25Ile has benign scores (Table 1), consistent with the likely permissible replacement of valine with isoleucine. It is found in two siblings with atopy and recurrent sinusitis. SFTPA2:p.Tyr191Cys has conflicting predictions of pathogenicity (Table 1). Tyr191 is located in the Collectin, C-type lectin-like domain (InterPro ID: IPR033990) of the protein. The variant is identified in a toddler with chronic wet cough and a normal chest radiograph at 14 months of age; he was lost to follow-up. Its clinical significance is unknown, and its Varsome ACMG classification is likely benign (Table 1). SFTPB:c.1039-6C > G is found in a toddler with respiratory symptoms since birth, which improved with age. He has normal chest radiographs at 2 and 4 months of age. SFTPC:p.His59Arg has conflicting predictions of pathogenicity (Table 1). His59, part of the […] It is found in a child with chronic wet cough and a normal chest radiograph at 5 years of age. Its clinical significance is unknown, and its Varsome ACMG classification is likely benign (Table 1).
SFTPC:p.Thr158Met has benign scores (Table 1). Thr158, located in the BRICHOS domain (InterPro ID: IPR007084) of the protein, is not highly conserved (JSD: 0.614, Fig. 2M). The variant is found in a boy with respiratory symptoms since infancy, which improved with age. His chest radiograph at five years of age is normal. Its clinical significance is unknown, and its Varsome ACMG classification is also 'uncertain significance' (Table 1). SFTPD:c.199 + 9G > A has a CADD score of 5.1 and a benign Varsome ACMG classification (Table 1). It is found in heterozygous state during screening for genetic diseases.

Discussion
The results here show significant respiratory diseases associated with likely pathogenic variants, such as ABCA3:p. […] Identifying a variant as disease-causing is expected to improve the overall clinical care plan, including counseling. Some of these children may be eligible for lung transplantation, and the variant analysis may be helpful in this regard 17. Other hopes may include gene therapy and gene editing (when available). Many of these variants are autosomal dominant and, thus, are not directly amenable to prevention by premarital screening. Autosomal dominant disorders, however, are amenable to prevention through preimplantation genetic testing. This procedure involves in vitro fertilization followed by biopsy of the embryo for genetic testing. The selected embryo is then transferred into the uterus 34. Thus, a genetic diagnosis is essential for all these serious disorders.
Improved efforts are needed to minimize delays in diagnosis or treatment. The success with management of cystic fibrosis (including the novel use of specific ATP analogs) should encourage translational research focused on other devastating respiratory diseases, such as chILD. Advancements toward this goal require continual reports on the molecular diagnosis.
Another important finding of this study is the conflicting predictions of pathogenicity. For example, ABCA3:p.Val1057Met has a SIFT score of zero (damaging) and a PolyPhen score of 0.195 (benign). Likewise, SCNN1B:p.Glu468Lys has a SIFT score of zero and a PolyPhen score of 0.369 (benign); it is not reported in ClinVar and has an ACMG classification of uncertain significance in Varsome. This autosomal dominant variant is identified in a child with severe respiratory disease. Thus, it is clear that a thorough clinical interpretation of genetic variants is needed. Moreover, clinicians need to provide detailed information on the natural history of the disease for both the index case and the extended family. In addition, commercial laboratories need to commit to better investigation of variants, including variants of unknown significance, without extra charges. MUC5B has sequences with vastly varying lengths in different species (e.g., human, 5792; chimpanzee, 7982; rat, 4096). These large differences potentially affect the pathogenicity scores, as predictors depend directly or indirectly on sequence alignments. Examining MUC5B in the dataset is necessary to understand the level of normalization used for this gene. Therefore, the prediction scores for MUC5B:p.Arg2200Gln, MUC5B:p.Thr3451Met, and MUC5B:p.Pro4895Ser may require future confirmation.
In summary, variants associated with interstitial lung and small airway diseases are described here. Affected children show significant respiratory symptoms at an early age, and the disease advances over time. Genetic tests should also be included in the evaluation of adults with an unexplained lung disease. Homology modeling of the variants may assist in designing compounds that modulate the function of the defective proteins. The results emphasize the use of genetic tests in unexplained respiratory disorders. They also help in generating population-based genetic panels for childhood lung diseases.

Data availability
All data generated and analyzed in this study are included in the article.