Quantitating the epigenetic transformation contributing to cholesterol homeostasis using Gaussian process

To understand the impact of epigenetics on human misfolding disease, we apply Gaussian-process regression (GPR) based machine learning (ML) (GPR-ML) through variation spatial profiling (VSP). VSP generates population-based matrices describing the spatial covariance (SCV) relationships that link genetic diversity to fitness of the individual in response to histone deacetylases inhibitors (HDACi). Niemann-Pick C1 (NPC1) is a Mendelian disorder caused by >300 variants in the NPC1 gene that disrupt cholesterol homeostasis leading to the rapid onset and progression of neurodegenerative disease. We determine the sequence-to-function-to-structure relationships of the NPC1 polypeptide fold required for membrane trafficking and generation of a tunnel that mediates cholesterol flux in late endosomal/lysosomal (LE/Ly) compartments. HDACi treatment reveals unanticipated epigenomic plasticity in SCV relationships that restore NPC1 functionality. GPR-ML based matrices capture the epigenetic processes impacting information flow through central dogma, providing a framework for quantifying the effect of the environment on the healthspan of the individual.

A major challenge in the genomic era is to understand the impact of genetic and epigenetic diversity on the protein fold and its function in biology in response to the environment [1][2][3][4] . Lysine (Lys) acetylation/deacetylation balance, which is catalyzed by writers (histone acetyltransferases (HATs)) and erasers (histone deacetylases (HDACs)), manage epigenetic information to harmonize numerous cellular processes through readers (acetyl-binding proteins) 5 . Given the strong links to development, metabolism and aging, Lys acetylation/deacetylation has been considered an important epigenetic mechanism that uses cellular and environmental cues to specialize individual phenotypes in response to their unique genomic and environment exposure [5][6][7][8] . How the acetylation/deacetylation balance managed by HDAC coordinates the phenotype of genetic variation programmed by central dogma that is responsible for fitness in health and disease is largely unknown.
To generate a more complete understanding of the impact of HDAC on genetic variation in the population and its impact on the individual, we apply variation spatial profiling (VSP) 1 to understand the process triggering onset and progression of Niemann-Pick C1 (NPC1) disease, an early onset neurodegenerative disease caused by accumulation of unesterified cholesterol in the late endosome/lysosome (LE/Ly) compartments of all cells. VSP is a Gaussian-process regression (GPR)-based machinelearning (ML) (GPR-ML) approach 9,10 we have recently developed to characterize spatial covariance (SCV) relationships that predict the genomic-phenomic transformation in the individual in response to genetic diversity in the extant population 1,3,11 . VSP emphasizes sequence-to-function as a prior to structure as it is function that dictates the process of fitness in evolability in response to the local genetic, biologic and physical environment 1 . Using our VSP strategy to define the impact of HDAC on Niemann-Pick C1 variants found in the world-wide population that disrupt cholesterol homeostasis, we show on a residue-byresidue basis that epigenetic-sensitive SCV relationships in the protein fold can be retuned by HDACi to coordinate protein trafficking with the flow of cholesterol in the LE/Ly. Restoring plasticity to the fold design disrupted by variation in the population we predict will delay the age of neurological disease onset. Our results highlight the power of GPR-ML to quantitatively capture the role of HDAC biology in acetylation balance to improve fitness using a SCV matrix based definition of central dogma (SCV[DNA|RNA|Protein]) 1 .

NPC1 variant impact on trafficking and cholesterol.
To characterize the process by which HDACs impact genetic variation in the human population using GPR-ML 1 , we focused on the rare autosomal recessive childhood disease, Niemann-Pick type C (NPC). NPC disease is triggered by familial genetic diversity within the NPC1 gene (95% of patients) through >300 disease variants 12,13 . Defects in NPC1 function lead to abnormal accumulation of cholesterol and other lipids in the LE/Ly in all cell types, although typically manifested in clinical presentation as neurodegeneration where the age of onset and severity of disease is unique for each patient genotype [13][14][15][16][17] . NPC1 encodes a multimembrane spanning protein (Fig. 1a) that is translocated and folded in the endoplasmic reticulum (ER) and trafficked through the Golgi to late endosome (LE)/lysosome (Ly) (LE/Ly) compartments where it manages cellular cholesterol homeostasis 18,19 . Immunoblot analysis of patient-derived NPC1 primary fibroblasts harboring different alleles shows substantial heterogeneity in both polypeptide expression and stability compared with fibroblasts expressing the wild-type (WT) NPC1 ( Supplementary  Fig. 1, Supplementary Table 1). NPC1 acquires up to 14 N-linked glycans during cotranslational translocation into the ER. These ER-localized high mannose glycoforms are sensitive to digestion by endoglycosidase (Endo) H (Endo H S ) in cell lysates prepared by detergent solubilization. Following delivery from the ER to the Golgi, the Nlinked glycans are progressively processed to Endo H resistant (Endo H R ) glycoforms by the Golgi, leading to slower migration on SDS-PAGE. In primary fibroblasts (Fig. 1b), the WT-NPC1 glycoform was highly resistant to Endo H digestion, indicating efficient transfer to the Golgi whereas heterozygous alleles and the common I1061T homozygous variant fibroblast populations 20 showed differential sensitivity to Endo H with I1061T being restricted to the ER (Fig. 1b). To characterize the impact of each allele on NPC1 features, we silenced NPC1 expression with shRNA to generate stable Hela and U2OS null cell lines 21 . These null cell lines were transiently transfected with a sparse collection of plasmids that each harbors one of 48 distinct NPC1 diseaseassociated variants distributed among the various NPC1 domains (Fig. 1c, Supplementary Fig. 2) 21 .
To follow trafficking from the ER, we measured the level of Endo H S and Endo H R glycoforms for each variant and generated an Endo H R /(Endo H S + Endo H R ) ratio that reports on the ER versus post-ER distribution of NPC1 variants in the cell, referred to hereafter as the trafficking index (TrIdx) 1 . We binned this sparse collection of variants into four functional classes (classes I-IV) based on TrIdx (Fig. 1d, left panel, black circles; Supplementary Table 2, Source data are provided as a Source Data file). Class I variants lack polypeptide expression in response to non-sense or splicing (truncation) mutations (null) (Supplementary Table 2); class II missense variants are largely ER retained (defined as~< 0.  Table 2). Comparison of the two developmentally distinct cell-based lineages suggests a remarkable general conservation of NPC1 variant trafficking through the endomembrane system to the LE/Ly ( Supplementary  Fig. 3).
The impact of cholesterol (Chol) storage in the LE/Ly for each of these variants was measured using a well-established automated high content screening image analysis based on filipin staining in the U2OS cell line 21,22 (Fig. 1d, right panel, black circles). The combined results revealed a gradient of cholesterol accumulation in the LE/Ly compartments reflecting the differential impact of a specific variant on trafficking and/or function in the LE/Ly (Fig. 1d, black circles) 21 . Because each of the 48 variants tested contribute to clinical disease, these results suggest that either nascent synthesis, ER stability, trafficking from the ER or function in LE/Ly can differentially contribute to clinical presentation of disease.
Impact of SAHA on trafficking and cholesterol. The FDAapproved histone deacetylase inhibitors (HDACi) Vorinostat (SAHA) and Panobinostat (LBH589) can correct the I1061T phenotype in cell-based models by stabilizing the I1061T protein for export to the LE/Ly where it contributes to improved cholesterol homeostasis 21,23,24 . Each of the 48 NPC1 variants were examined for the impact of SAHA on Chol in the LE/Ly in the U2OS cell line to that of trafficking in the Hela cell line given the   (Fig. 1d, left panel, classes III and IV) showed a more variable impact of SAHA on TrIdx. In contrast, HDACi improved Chol homeostasis for most variants to a level comparable to or greater than that observed in either untreated or SAHA-treated WT-NPC1 cells (Fig. 1d, right panel), a result similar to that observed for LBH589, a HDACi previously shown to correct I1061T at nM concentrations 21 ( Supplementary Fig. 4a). The effects of HDACi were NPC1-dependent as restoration of Chol homeostasis by SAHA was not observed in fibroblasts lacking NPC1 23 and in null cell lines lacking NPC1 21 . NPC1 contains three luminal domains (SNLD1, MLD3, and CLD5) and three transmembrane domains (NTMD2, STMD4, and CTMD6) with 13 transmembrane helices (TM 1-13) (Fig. 1a) [25][26][27][28][29] . When we grouped variants by structural domains (Fig. 1a, c), both the luminal MLD3 and CLD5 variants that have relatively low TrIdx values showed a statistically significant correction of TrIdx in their response to SAHA compared with all other domains (Fig. 1e, left panel). In contrast, cholesterol homeostasis was significantly restored by either SAHA (Fig. 1e, right panel) or LBH589 ( Supplementary Fig. 4b, c) 21 for variants in all domains. While Chol homeostasis of NPC1 variants in the vehicle control showed a modest but significant correlation with TrIdx ( Fig. 1f, left panel, Pearson's r-value = −0.36, p-value = 0.01 (ANOVA test)), SAHA eliminated this correlation by shifting most variants to a class III phenotype (Fig. 1f, right panel, Pearson's r-value = −0.07, p-value = 0.64 (ANOVA test), Fig. 1g). Moreover, the correlation between the delta (Δ) of Chol and Δ of Trldx in the absence or presence of SAHA, or the Δ of Chol and the Δ of the mature NPC1 glycoform, indicating delivery to the Golgi in response to SAHA, were not statistically significant ( Supplementary Fig. 5). These results suggest that SAHA differentially impacts the management of variants contributing to ER export and those contributing to cholesterol homeostasis in the LE/Ly, illustrating that the HDACi modified environment can significantly and differentially contribute to variant functionality.
VSP analysis of NPC1 variants. To understand in depth the complex genotype to phenotype transformation process responsible for NPC1 disease in response to HDACi reported by NPC1 variants, we applied variation spatial profiling (VSP) to quantitate the spatial covariance (SCV) of sequence-to-function-to-structure relationships 1 for the entire NPC1 polypeptide in the absence or presence of HDACi. VSP uses sparse known variants and their related functional features in the population to predict a quantitative function value (referred to as a SCV set-point) for each residue in the polypeptide chain. These set-point values are associated with an uncertainty that assigns the SCV tolerance for all known and unknown variants in predicting any functional feature 1,11 . The combined known and predicted unknown values constitute a phenotype landscape that provides a quantitative interpretation of covarying sequence-to-function-to-structure relationships across the entire NPC1 fold, in this case the impact of epigenetics on the functional rescue of NPC1 variant by HDACi.
To build phenotype landscapes 1 , the x-axis is the normalized linear position of each of the 48 variant in the NPC1 polypeptide sequence (referred to as VarSeqP) that provides spatial information across the entire primary sequence as would be normally assigned by transcription and translation machineries (Fig. 2a). To assign the relationship of each genotype to a function, the y-axis input was set as the known value of each variant's Chol measurement (Fig. 2a). To address the impact of VarSeqP-to-Chol relationships on the TrIdx, given the spatial importance of cellular location in proper management of cholesterol by NPC1 in LE/Ly 18,30-32 , an input z-value was set as the known TrIdx value for each variant (Fig. 2a, z-axis, colored gradient). The spatial relationships defined by the x-axis genotype and functional y-and z-axes coordinates provide a quantitative framework to understand the transformation from the linear genotype to phenotype reflected in function based on the choice of y-and z-axis coordinates 1 .
To assess the SCV relationships for all possible 1128 pairwise combinations of known variants (Fig. 2a), a molecular variogram 1 is computed (Fig. 2b, Supplementary Fig. 6a). Here, the molecular variogram shows that the spatial variance of TrIdx in the absence of SAHA (Fig. 2b, black line, y-axis) increases according to VarSeqP-to-Chol distances (Fig. 2b, x-axis), referred to as the range 1 (Fig. 2b, x-axis position 0.19, vertical dashed black line), until it reaches a plateau (Fig. 2b, y-axis position 0.05, horizontal black line). A range of 0.19 suggests that the TrIdx and Chol function of variants are generally dependent on each other only over a short sequence range of~250 amino acids (Fig. 2b, x-axis) 1 , indicating a SCV modular design of the NPC1 polypeptide sequence that relates genotype to the different phenotypes contributing to trafficking and cholesterol management. Strikingly, SAHA significantly reduced the range shown on the x-axis (Fig. 2b, red line SAHA, range value 0.03 (~40 amino acids)). Moreover, the spatial variance on the y-axis defines a plateau which describes the stringency of the fold 1 . Here, the plateau of the spatial variance for TrIdx is reduced by~40% in response to SAHA ( Fig. 2b; compare horizontal black line vehicle control (control (CTL), plateau = 0.05) to the horizontal red line in the presence of SAHA (plateau = 0.03); Supplementary Fig. 6a, p-value = 7 × 10 −23 (student's ttest)). Thus, the plateau spatial variance defined by the molecular variogram suggests that HDACi increases the set-point tolerance of each known variant residue interaction leading to increased plasticity in TrIdx. The outcome is to reduce the loss-of-function impact of the variants, resulting in an improvement in cholesterol homeostasis for most of the disruptive variant features.
Phenotype landscapes predict the TrIdx-functional structure. To expand the spatial relationships modeled by the molecular variogram with known variants (Fig. 2b) to all other unknown residues, we apply GPR-ML 1 to generate an output TrIdxphenotype landscape (Fig. 2c). The Trldx-phenotype landscape quantitates all SCV relationships between the known and the unknown (predicted) TrIdx values (z-axis) for each amino acid residue in NPC1 polypeptide chain in response to the VarSeqP (x-axis) and Chol (y-axis) coordinates relationships, thereby linking sequence position to cellular location to cholesterolrelated function of NPC1 (Fig. 2c). GPR-ML not only generates set-point predictions on a residue-by-residue basis but also assesses the uncertainty of the prediction (Supplementary Fig. 6b) (i.e., SCV tolerance of each set-point), of which the leave-one-out cross-validation achieves a more accurate prediction than multivariate linear regression or decision tree-based regression methods ( Supplementary Fig. 6c-e). The uncertainty (or confidence) in using SCV relationships for prediction is defined by contour maps that are embedded in the phenotype landscape ( Fig. 2c, Supplementary Fig. 6b) that show the strength of all SCV relationships found in the range of the molecular variogram ( Fig. 2b) 1  these sequence regions are critical for export from the ER based on poor acquisition of Endo H R glycoforms. Furthermore, TrIdxphenotype landscapes based on input data from U2OS cells reveal similar regions critical for trafficking ( Supplementary Fig. 6g), indicating the conserved role of MLD3 and CLD5 mediating NPC1 trafficking across different cell lines. Strikingly, the TrIdx-phenotype landscape (Fig. 2c, left panel) undergoes a general compression in the presence of SAHA (Fig. 2c, right panel). These results reveal that the global improvement in cholesterol homeostasis in response to HDACi (Fig. 2c, y-axis) occurs, in part, through new SCV relationships that convert poor class II TrIdx values (Fig. 2c, left panel, orangered) to improved class III TrIdx values (Fig. 2c, right panel,  yellow). For example, SCV cluster 2 in CLD5 (Fig. 2c, left panel, dashed oval 2), that has a severe defect for both TrIdx and Chol, undergoes a coordinated shift towards WT-like Chol function (yaxis) and a class III TrIdx (z-axis) (Fig. 2c, right panel, dashed oval 2) as highlighted in a 3D projection of the TrIdx-phenotype landscape ( Supplementary Fig. 6f, cluster 2). In contrast in SCV cluster 1, the lack of significant correction of Chol of H510P 21 is due to the inability of SAHA to improve its TrIdx (Fig. 2c,  Supplemental Fig. 6f). Moreover, the large improvement of the TrIdx of R518Q set-point does not significantly improve cholesterol homeostasis reflecting the critical role of this residue in binding of NPC2 for cholesterol transfer (Fig. 2c, Supplementary Fig. 6f) 21 .
VSP generated TrIdx-phenotype landscapes (Fig. 2c) can be directly mapped onto the NPC1 structure 25,27,29 to generate a functional structure 1 that enables function assignment of each amino acid residue in a structure snapshot (Fig. 2d). The functional structure reveals the overlapping and/or distinct contribution of all the NPC1 residues to trafficking (Fig. 2d, ball color) and cholesterol homeostasis (Fig. 2d, ball size) with a prediction confidence (Fig. 2d, ball transparency) 1 . Strikingly, the TrIdx-functional structure reveals a molecular handshake between MLD3 and CLD5 (Fig. 2d, CTL panel, clusters 1 and 2) as a central feature that determines the ER export of NPC1 in Hela (Fig. 2d, CTL panel, red residues) and U2OS cell lines ( Supplementary Fig. 6h) from the ER. Moreover, atomic resolution mapping of the impact of SAHA on the functional structure reveals a significant improvement in cholesterol homeostasis for nearly all residues (Fig. 2d, SAHA panel, increased of ball size), partially by shifting the class II TrIdx (Fig. 2d, CTL panel, red balls) to that of class III indicative of significant export from the ER (Fig. 2d, SAHA panel, orange and yellow balls). Thus, the phenotype landscape predicted from the collective of fiduciary NPC1 variants found in the world-wide patient population using VSP teach us about the SCV modules and their function linked structural features required for normal NPC1 trafficking.
Phenotype landscapes predict the Chol-functional structure. To assign a value to cholesterol homeostasis based on the TrIdx response to NPC1 variants, we flipped the biological features used for y-axis and z-axis (Fig. 3a, Supplementary Fig. 7). The molecular variogram modeling of these relationships in the absence of SAHA shows that the spatial variance of Chol (Fig. 3b, y-axis) increases according to the distance defined by the VarSeqP-to-TrIdx spatial relationship (Fig. 3b, x-axis), revealing a range~0.08 (Fig. 3b, black vertical dash line). A range of~0.08 suggests that the spatial variance of cholesterol homeostasis (Fig. 3b, black line, y-axis) is dependent on VarSeqP-to-TrIdx over a sequence range of~100 amino acids (Fig. 3b, black line, x-axis). Strikingly, SAHA decreases the range from 0.08 to 0.02 and significantly reduces the spatial variance of the Chol value by nearly 70% (Fig. 3b; compare black line plateau = 0.02 with red line plateau = 0.005; Supplementary Fig. 7a, p-value = 5 × 10 −79 (student's t-test)). The decreased Chol spatial variance of NPC1 variants in response to SAHA (Fig. 3b, 70%) is significantly larger than the decrease observed for TrIdx spatial variance (Fig. 2b, 40%), indicating that SAHA correction in post-ER LE/Ly acidic compartments increases the overall SCV plasticity 1 of the variant fold to improve function to facilitate Chol homeostasis.
We next generated an output Chol-phenotype landscape that predicts cholesterol responses across the entire polypeptide sequence in the context of all VarSeqP-TrIdx spatial relationships (Fig. 3c, Supplementary Fig. 7b-e). Interestingly, the Cholphenotype landscape in the absence of SAHA reveals two SCV clusters in the top 25% confidence quartile that show class III TrIdx yet have severe cholesterol homeostasis defects (Fig. 3c, left, dashed ovals). One cluster is found in STMD4 (cluster 3) and the other spanning CLD5 and CTMD6 (cluster 4) (Fig. 3c, left panel). These SCV relationships suggest that clusters 3 and 4 are critical in mediating cholesterol management in the LE/Ly. Indeed, P691 in SCV cluster 3 has been shown to be involved in cholesterol binding 26,27,33,34 . Thus, the Chol-phenotype landscape reveals that regions in CLD5, CTMD6 and STMD4 contribute together to tune cholesterol flow (Fig. 3c, left panel).
The Chol-phenotype landscape (Fig. 3c, left panel) undergoes a striking change in the presence of SAHA (Fig. 3c, right panel) highlighting the ability of SAHA to improve cholesterol homeostasis for most of variants (Fig. 3c, red-orange in the left panel shifts to cyan and blue in the right panel). Moreover, the high confidence region (Fig. 3c, left panel, 25%; right panel, 5%, respectively) defined by the range in the molecular variogram ( Fig. 3b) are decreased substantially by SAHA indicating a loss of spatial interdependency of variant residues triggering disease. The correction of cholesterol homeostasis (z-axis) of variants found in the regions defined by SCV clusters 3 and 4 even in the absence of improvement of their TrIdx (y-axis) indicates that SAHA can independently adjust the function of NPC1 in the LE/ Ly to improve cholesterol homeostasis (Fig. 3c, Supplementary  Fig. 7f) 21 .
By projecting Chol-phenotype landscape at atomic resolution onto the structure of NPC1, the variation seen in the NPC1 population can now be used to map a potential path of cholesterol flow in NPC1 (Fig. 3d, e). Based on class III variants (Fig. 3d, CTL panel, medium size balls) that are primarily defective in cholesterol homeostasis (Fig. 3d, CTL panel, orange-red balls), the Chol-functional structure reveals (Fig. 3d, CTL panel) the critical residues for cholesterol export that include SCV cluster 4 residues in CLD5 and CTMD6, as well as SCV cluster 3 residues in STMD4, the later recently proposed to form a second cholesterol binding site 27 . Moreover, the proline-rich linker between SNLD1 and the TM region that has been suggested to facilitate cholesterol transfer 35 is now shown by SCV analysis to have little impact on trafficking yet contribute to the flow of cholesterol in NPC1 (Fig. 3d, CTL panel). The structural mapping of Chol-phenotype landscapes based on trafficking values from the Hela and U2OS cell lines yields similar pattern (Supplementary Fig. 7g, h), indicating VSP is able to capture a common phenotypic feature using different cell and tissues harboring variant genotypes 1 , suggesting that these are conserved genetic and epigenetic relationships. Based on our SCV analyses, a potential cholesterol flow path is highlighted in the TM region (Fig. 3e). The SCV relationships of variants found along the flow path lead to improved cholesterol homeostasis in response to SAHA (Fig. 3e, SAHA panel, green-cyan-blue), suggesting a role for CLD5 and CTMD6 in export of cholesterol from the LE/Ly by increasing NPC1 plasticity (i.e., a decrease in the SCV dependence) (Fig. 3b).
HDACi responsive delta (Δ)-functional structures. Given the differences in the basal and HDACi condition (Figs. 2 and 3), we can generate functional structures that directly capture the delta (Δ) values from TrIdx and Chol-phenotype landscapes. The Δ TrIdx structural state (Fig. 4a, b) highlights that the trafficking properties conferred by MLD3 and CLD5 residues are corrected by SAHA (Fig. 4a, cyan to blue). In contrast, SNLD1 is largely resistant to SAHA (Fig. 4a, yellow to red in left panel; gray in right panel). The TM region is also largely resistant to SAHA (Fig. 4a, yellow to red in left panel; gray in right panel) as the trafficking of the variants involved in the predicted path of cholesterol flow are not changed by SAHA (Fig. 4b, yellow balls). In contrast to the Δ TrIdx-functional structures, the Δ Cholfunctional structure highlights the fact that the residues distributed throughout the polypeptide chain can participate in the response of NPC1 to SAHA (Fig. 4c, cyan to blue). These changes are particularly evident in residues involved in the predicted path of cholesterol flow (Fig. 4d) TM2   TM3   TM4   TM5   TM6   TM7   TM9   TM10   TM11   TM12   TM13   TM8   R1077Q   Y1088C  W1145R   P691S   Y634C   L1191F  TM2   TM3   TM4   TM5   TM6   TM7   TM9   TM10   TM11   TM12   TM13   TM8   R1077Q   Y1088C  W1145R   P691S   Y634C   L1191F   TM2   TM3   TM4   TM5   TM6  TM7   TM8   TM9   TM10   TM11   TM12   TM13  Y634C   P691S   L1191F   W1145R   Y1088C   R1077Q   TM2   TM3   TM4   TM5   TM6  TM7   TM8   TM9   TM10   TM11   TM12   TM13   cholesterol homeostasis for TM3-4 in STMD4 and TM9-13 in CTMD6 (Fig. 4d, cyan and blue regions) is achieved without significant improvement of TrIdx (Fig. 4b, yellow regions), indicating that the dynamics of these TM helices in response to SAHA contribute uniquely to the flow of cholesterol in the LE/Ly.

Mechanism of action of SAHA.
HDACi have been shown to modulate the post-translational acetylation state of numerous cellular proteins 5,8,36 , thereby altering their functional states impacting many cellular pathways including altered stability and targeting for degradation 21,24,[37][38][39][40] . To examine the impact of HDACi on the acetylation of NPC1, we immunoprecipitated cellular acetylated proteins with an acetylated-lysine (AcK) antibody in the absence or presence of SAHA and probed for the presence of acetylated NPC1. Intriguingly, SAHA promoted hyperacetylation for both WT-NPC1 and I1061T-NPC1 (Fig. 5a), consistent with our recent observations that hyperacetylation of I1061T-NPC1 variant by treatment with the low affinity (mM) HDACi valproic acid (VPA), that targets similar Zn +2 -dependent HDACs to SAHA, prevents the proteosomal degradation of I1061T-NPC1 41 . To explore this result, we analyzed the heat shock response (HSR) pathway 42,43 , under the control of the heat shock factor 1 (Hsf1) transcription factor. We have previously shown that the chronic expression of misfolded proteins, such as I1061T-NPC1, results in the sustained activation of the HSR termed the maladaptive stress response (MSR) 44 , and that silencing Hsf1 promotes the stability and trafficking of I1061T-NPC1 44 . We observed that SAHA induced a significant reduction in the level of both Hsf1 and Hsf1-P (Fig. 5b) indicating a reduced MSR. Furthermore, the cellular expression of BAGs 1-3, nucleotide exchange factors regulating the activity of heat shock protein 70 (Hsp70) induced by HSR are also significantly reduced by SAHA (Fig. 5b). These results are consistent with the observation that siRNA-mediated silencing of BAGs or chemical disruption of BAG interactions using JG-98, an Hsp70 allosteric regulator 45 , can correct the trafficking defect associated with multiple NPC1 disease-causing variants 11 . Moreover, we found that SAHA treatment results in a significant reduction of HDAC7, a principle HDAC that mediates correction of NPC1 41 Fig. 8a, b), we found that class III variants containing patients have a significant correlation with a late age of neurological onset (ANO) when compared with all other patients ( Supplementary Fig. 8c, middle panel). To map these epigeneticsensitive SCV relationships that direct ANO, we used the TrIdx as the y-axis coordinate to predict the phenotype landscape for the ANO (z-axis) ( Fig. 6a; Supplementary Fig. 8d). Strikingly, we found in the ANO-phenotype landscape a prominent region defined by a SCV cluster in CLD5 with class III TrIdx properties that shows a significant late ANO (Fig. 6a, b, dashed oval, cyanblue), likely due to the ability of these residues to maintain a higher ratio (TrIdx) of post-ER functional protein. The known variants that contributed to this predicted age-dependent SCV cluster (V950M, S954L, P1007A, and T1036M) are highly responsive to SAHA treatment. These results illustrate how VSP can be used to quantitate SCV set-point tolerance on a residueby-residue basis to predict strong candidates for clinical trials (Fig. 6c,~70% percentile). Moreover, because SAHA also improves nearly all class II CLD5 variants to a class III phenotype with the resultant improvement in cholesterol management (Fig. 6d, arrow), VSP predicts that improving even class II CLD5 variants (e.g., I1061T) to a class III trafficking by a folding corrector, as has been found for CFTR 48 , through either a small molecule chemical chaperone or a proteostasis interventional strategy, may significantly reduce disease progression and reduce the impact of disease in early onset patients.

Discussion
We have proposed 1 that natural variants distributed across the world-wide population comprise a library of evolutionary tuned information that can be used through GPR-ML based SCV to define the functional properties of the protein fold crucial for human health. We have now shown that epigenetic modulation of NPC1 variants reveal distinct regions of SCV relationships that clearly separate residues involved in trafficking from those responsible for cholesterol homeostasis in downstream LE/Ly. SCV relationships revealed that CLD5 is likely a pivotal module where it forms a biological handshake with MLD3 to direct trafficking from the ER. Given that residues in the STMD4-CTMD6 module have less impact on trafficking from the ER, the genome based phenotype landscapes and corresponding functional structures lead us to suggest that the CLD5 may largely contribute to the organization of residues in the transmembrane spanning STMD4-CTMD6 modules to serve as a tunnel to  TM2   TM3   TM4   TM5   TM6   TM7   TM9   TM10   TM11   TM12   TM13   TM8   R1077Q   Y1088C  W1145R   P691S   Y634C   L1191F   TM2   TM3   TM4   TM5   TM6  TM7   TM8   TM9   TM10   TM11   TM12   TM13   Y634C   P691S   L1191F   W1145R   Y1088C   R1077Q   TM2   TM3   TM4   TM5   TM6   TM7   TM9   TM10   TM11   TM12   TM13   TM8   R1077Q   Y1088C  W1145R   P691S   Y634C   L1191F   TM2   TM3   TM4   TM5   TM6  TM7   TM8   TM9   TM10   TM11   TM12   TM13   Strikingly, many variants contributing to NPC1 disease do not exhibit severe trafficking defects yet are critical for function in downstream LE/Ly, consistent with the notion that export from the ER is not about quality control-rather about governing the SCV tolerance value of a set-point assign to each residue in the fold that we now show can be sensitive to epigenetic control. This conclusion is consistent with the many observations that~30% of regions contributing to the protein fold are not rigidly ordered, rather are highly dynamic in design, in many cases requiring a cyclic interaction with multiple downstream partners or ligands binding of cholesterol in an acidic LE/Ly in the case NPC1, to achieve function. Mechanistically, we suggest that GPR-ML reveals the potential for amino acid residues in the polypeptide chain to individually or as a collective, contribute to fold management for function through a weighted epigenetic-sensitive SCV matrix that can be discovered using GPR-ML 1 . This matrix response is likely sensitive to proteostasis buffering folding pathways, to UPS or autophagic pathways promoting degradation 55 , or to HSR management of post-ER stability and function reflecting each residue's role in NPC1 function.
Mechanistically, whether HDACi sensitive events are due to direct alteration of the acetylation balance of Lys residues in the NPC1 polypeptide chain 41,56 , more indirectly through transcriptional and/or post-translational mechanisms affecting HDAC-sensitive proteostasis pathways as shown herein 24,43,57,58 , and/or other HDAC-sensitive events facilitating endomembrane trafficking (ER-Golgi-LE/Ly) compartment function 24,59 , remains to be determined. Recent controversial results [60][61][62][63][64] of HDACi effects in mouse models emphasize the need for use of the human genome sequence and patient related models as captured herein, to point the way 1 , particularly in response to epigenetic modifications that are sensitive to the evolutionary trajectory and the local environment. Consistent with this prediction, from a natural history perspective, we have shown that epigenetic-sensitive SCV relationships can predict residues that are more likely to be responsive to HDACi in human clinical disease.
The sensitivity of SCV relationships dictating the NPC1 fold in disease to HDAC is generalizable given that HDACi correct a broad range of misfolding disorders that are triggered by genetic variation in the population, including cystic fibrosis 38 , alpha-1-antitypsin deficiency 39,40,47 , Gaucher's disease 65 , and neurodegenerative are controlled by the histone deacetylase activity. As such, VSP provides us with an quantitative platform to not only mechanistically understand the impact of genetic variation on function and structure 1 , but as shown herein, its epigenetic plasticity. The striking remodeling effect of HDACi on SCV relationships linking genomics to phenomics in the NPC1 disease population reveals a potential role for the acetylation sensitive epigenome and epiproteome to (re)tune global flow of information from the genome to the functional proteome (Fig. 6e). We have previously proposed 1 that information flow through central dogma can be best captured by changing our current linear (DNA → RNA → Protein) perspective to a SCV-based matrix view (Fig. 6e, SCV [DNA|RNA|Protein] where the vertical lines are defined by GPR-ML matrices that capture the SCV-based information flow through central dogma that generates biology. We now suggest that GPR-ML describes the epigenetic-sensitive SCV matrices (Fig. 6e, epi-SCV[DNA|RNA|Protein]) to reveal the environment sensitive complexity in the individual that make each of us who we are in health and disease 1 over a lifespan (Fig. 6e).
Human fibroblast cells. Human NPC1 fibroblasts were obtained from Coriell Institute (Camden, NJ) (Supplementary Table 1). All human skin fibroblasts were maintained in modified Eagle medium containing FBS.
Stable silencing of NPC1. NPC1-deficient stable HeLa-shNPC1 and U2OS-shNPC1 cells were generated by silencing the endogenous npc1 gene with 3′-UTR shNPC1 lentivirus. Mission shRNA clone was purchased from Sigma against 3′ UTR (TRCN000000542) of human NPC1 (pLKO-shNPC1) for knockdown of npc1 gene in human cell lines. Stable clones of npc1-deficient cells were selected with 5 μg/ml puromycin for 3 weeks. Stable HeLa-shNPC1 cells were cultured in DMEM medium supplemented with 10% FBS and 5 μg/ml puromycin. Stable U2OS-shNPC1 cells were cultured in McCoy's 5A medium supplemented with 10% FBS, 50-units/ml penicillin, 50 μg/ml streptomycin, 5 μg/ml puromycin, and 1 mg/ml G418. ΔU3hNPC1-WT has been used as a WT construct in precious publications 21,75 . It was derived from pSV-SPORT/NPC1 that has been used as the WT control protein in previous publications 56,76,77 . The site-mutagenesis was generated by Quick-Change XL Site-directed Mutagenesis Kit (Stratagene, La Jolla, CA) using pMIEG3-NPC1 as template, and all the variants used in this study contain the four background variants derived from human ΔU3hNPC1-WT (i.e., WT-V) construct.
Endoglycosidase H (Endo H) assay. In Hela cells, NPC1 variants were transiently expressed for 72 h using Fugene transfection reagents. Cells were washed with 1× PBS twice and lysed the cells with RIPA lysis buffer (10 mM Tris pH 8.0, 140 mM NaCl, 1% NP-40, 0.1% sodium-deoxycholate, 0.1% SDS, 1× protease inhibitor cocktail (PIC), 1 mM PMSF) on ice for 30 min. Cell lysates were obtained by 16,000 × g centrifugation at 4°C. Protein concentration was measured using the standard BCA assay. Around~300-500 mg of cell lysates were incubated with 1-2 mg of rabbit polyclonal anti-NPC1 antibody or rat monoclonal anti-NPC1 antibody for overnight at 4°C. In total, 20-30 μl of protein A/G-Sepharose or Gam-maBind G-Sepharose beads (GE Healthcare) were added to the cell lysates to capture the antibody-bound NPC1 for 1 h at 4°C. Beads were washed with RIPA lysis buffer 3× with final wash with 1× PBS. Immunoprecipitated NPC1 was eluded with 1× denaturation buffer (0.5% SDS, 40 mM DTT) at 95°C for 10 min. Eluted NPC1 was divided into equal parts and incubated with absence (−) or absence (+) of 1000 units Endo H enzyme for overnight at 37°C. Endo H digested samples were subjected to 4-12% Bis-Tris gradient (Invitrogen) or 4-20% gradient (Bio-Rad) SDS-PAGE and immunoblotted with rat monoclonal anti-NPC1 antibody (1:3000). Endo H sensitive (Endo H S ) and Endo H resistant (Endo H R ) were identified by their unique migration on SDS-PAGE gels. In U2OS cells, cells were seeded at 2.0 × 10 5 cells/ml in 12-well plates and incubated overnight. Transfections were performed using the reagent FuGENE6 from PROMEGA (Madison, WI) with a 1:4 ratio, 2 μg DNA:8 μL FuGENE6 in Opti-MEM (1×) + Hepes + sodium bicarbonate + L-glutamine + 5% FBS. Samples were then incubated for 5 h at 37°C and the medium was replaced with normal growth medium. Cells were then incubated for another 48 h and visualized under the fluorescent microscope for GFP-positive transfected cells. These variants were all transiently expressed in the U2OS-SRA-shNPC1 cells. After 72 h from initial plating, cells were lysed with 50 μL/well of 1× RIPA (150 mM NaCl, 1.0% IGEPAL Ca-360, 0.5% Na-deoxychlorate, 0.1% SDS, 50 mM Tris pH 8.0, 1× protease inhibitors, 1× benzonase) on ice for 30 min. Samples were collected and incubated at 37°C for 20 min, centrifuged at 20,817 × g at 4°C for 20 min. Supernatant was collected and the BCA assay performed. Endo H was performed by addition of 1× glycoprotein denaturing buffer (New England Biolabs) to 20 μg of lysate and incubated at 50°C for 30 min. For each sample, 10,000 U/ml Endo H and 1× Glycobuffer (New England Biolabs) were added and incubated overnight at 37°C. Protein samples were separated on 4-12% Bis-Tris BOLT protein gel (Invitrogen) and transferred to nitrocellulose membrane. Membranes were probed with NPC1 1:1000 and tubulin 1:25,000 diluted in 5% milk solution.
SAHA treatment on NPC1 variants. Human NPC1 variants were transiently expressed in HeLa-shNPC1 cells. After 24 h posttransfection, cells were treated with control DMSO or 10 μM SAHA for 72 h. Cell lysates were processed and Endo H treatment of the NPC1 variants were performed as described above.
Probing proteostasis components in response to SAHA. Patient I1061T/I1061T fibroblasts were seeded in a 6-well plate and grown until confluent. Once confluent, cells were treated with DMSO or SAHA at 10 μM for 48 h at 37°C in normal growth media. Cells were then washed twice with 1× PBS and then lysed with 75 μL/well of 1× RIPA (150 mM NaCl, 1.0% IGEPAL Ca-360, 0.5% Na-deoxychlorate, 0.1% SDS, 50 mM Tris pH 8.0, 1× protease inhibitors, 1× benzonase) on ice for 30 min. Samples were collected and then incubated at 37°C for 20 min. Samples were then centrifuged at 20,817 × g at 4°C for 20 min. Supernatant was  Table 3) are used as input value for VSP to generate the ANO-phenotype landscape that uses variant position information (x-axis) and cell-based TrIdx measurement (y-axis) to predict the ANO in the clinic (z-axis, color scale). The SCV cluster with late ANO (cyan-blue) in top 25% quartile of prediction confidence is highlighted by dashed oval. b The predicted ANO with highest confidence for each residue is mapped on the structure with the color code the same as panel (a). Variants with late ANO are shown in balls and labeled. c Distribution of the delta value of cholesterol homeostasis in response to SAHA for all the variants tested 21 . The variants with late ANO in the SCV cluster are shown in blue and labeled. d The Chol-phenotype landscape of CLD5 (Fig. 3c, asterisks) are shown with the TrIdx classes highlighted by dash lines. Black arrow indicates the TrIdx and Chol shift of class II CLD5 variants to class III in response to SAHA. e HDACi remodels the SCV relationships to manage genotype to phenotype transformation. Addition of HDACi to the same collection of variants from NPC1 population creates a new SCV matrix linking genotype to phenotype. VSP captures the impact of these epigenetic changes through SCV matrices to interpret and predict the dynamics of protein fold for its function in the context of local epigenetic environment. The epigenetic modulation of SCV (epi-SCV in the figure) provides a quantitative platform to characterize the ability of the environment to manage the information flow through central dogma (epi-SCV[DNA|RNA|Protein]) collected and BCA assay was performed. Once samples were prepared, 40 μg of total protein was separated on a 10% SDS-PAGE. Proteins were transferred to nitrocellulose and probed with antibodies of HSF1 (1:10,000 dilution), HSF1 phosphorylated (1:5000 dilution), Bag1 (1:1000 dilution), Bag2 (1:1000 dilution), and Bag3 (1:1000 dilution), and HDAC7 (1:1000 dilution). GAPDH is probed by GAPDH antibody (1:100,000 dilution) as loading control. Each was further probed with their specific secondary antibodies and detected by chemiluminescence.
Natural history analysis. NPC1 subjects were evaluated at the National Institutes of Health Clinical Center in Bethesda, Maryland in an observational/natural history protocol (NCT00344331; https://clinicaltrials.gov/ct2/show/NCT00344331? term=niemann+pick&rank=11). This study was approved by the Eunice Kennedy Shriver National Institute of Child Health and Human Development Institutional Review Board. Written consent, and when feasible assent, were obtained. Clinical information related to disease onset was collected from guardians and record review. Neurological severity scores were determined by clinical evaluation of nine major and eight minor clinical areas, such as eye movement, ambulation, speech, swallow, fine motor skills, cognition, hearing, memory, seizures and so on 78 . Ageadjusted severity scores were determined by dividing the neurological severity scores by the age of the subject at the time of evaluation. The age of neurologic onset of each variant is calculated by averaging the age of neurologic onset of the patients (Supplementary Table 3) who contain that variant.
VSP methods. VSP uses a biological adaptation of the general principles of geostatistics 1,9 , a well-established Gaussian-process regression (GPR)-based machine-learning (ML) approach that provides a means to predict the distribution of a large range of geological, epidemiological, anthropological and environmental measures in complex geologic landscapes using sparse sampling methods. Below we described how we apply classical GPR-based ML concepts through VSP to understand the role of variation in human disease and its application to precision medicine.
The initial step for VSP is molecular variogram analysis. Assume we have a variable z (z-axis value) which is positioned by x and y coordinates (x-and y-axis values) that describe the phenotype landscape. A molecular variogram is first used to describe how the spatial variance (i.e., the degree of dissimilarity) of z changes according to the separation distance defined by the x and y coordinates which enables the calculation of the spatial covariance (SCV) relationships in the dataset. Suppose the ith (or jth) observation in a dataset consists of a value z i (or z j ) at coordinates x i (or x j ) and y i (or y j ). The distance h between the ith and jth observation is calculated by and the γ(h)-variance for a given distance (h) is defined by γ(h)-variance is the semivariance of z value between the two observations, which is also the whole variance of z value for one observation at the given separation distance h. Here in the paper, we refer γ(h)-variance as spatial variance as indicated in the y-axis of molecular variogram (main text Figs. 2b and 3b). By Eqs. (1) and (2), the distance (h) and γ(h)-variance for all the data pairs are generated. Then, the average values of γ(h)-variance for different distance intervals are calculated to plot γ(h) versus h used in the molecular variogram. The distance where the model first flattens out is known as the range. Sample locations separated by distances closer than the range are spatially correlated, whereas locations farther apart than the range are not. The spatial covariance (SCV) at the distance (h) is calculated by is the covariance at zero distance representing the global variance of the data points under consideration (i.e., the plateau of the variogram). GPR-based ML is an interpolation/regression method providing optimal unbiased prediction based on the modeled SCV relationships generated by the molecular variogram 9 . In our study, we used a GPR-based regression approach referred to as ordinary Kriging that has the least assumptions and is the form most commonly used 79 . Essentially, SCV can predict the unknown value by local weighted averaging the surrounding known values, where the weight associated with the known value is determined according to their positions both in relation to the unknown point and to one another.
According to variogram, observations within close distance are usually highly correlated and have more weight for prediction. To solve the optimum and unbiased weights of SCV relationships, molecular geostatistics aims to minimize the variance associated with the prediction of the unknown value at location u, which is generated according to the expression where z Ã u is the prediction value while z u is the true but unknown value, C i,j and C i,u are SCV between data points i and j, and data points i and u, respectively, and C u,u is the SCV within location u. ω i is the weight for data point i. The SCV is obtained from the above molecular variogram analysis. To ensure an unbiased result, the sum of weight is set as one.
Equations (3) and (4) not only solved the set of weights associated with input observations, but also provided the minimized 'molecular Kriging variance' at location u which can be expressed as where C u,u is the SCV within location u, ω i is the weight for data point i, and C i,u are SCV between data points i and u. µ is the Lagrange Parameter that is used to convert the constrained minimization problem in Eq. (3) into an unconstrained one. The resulting minimized Kriging variance provides a weighted SCV score that represents the confidence for using the SCV relationships both within the input data points and in relation to the unknown locations. The confidence level is tightly linked with the distance range in the variogram. The shorter distance between the unknown point to the input data points, the higher confidence for using the SCV relationships for prediction. With the solved weights W, we can calculate the prediction of all unknown values to generate the complete phenotype landscape by the equation where z Ã u is the prediction value for the unknown data point u, ω i is the weight for the known data point, and z i is the measured value for data point i 9 .
The parameters, for example, range and sill (i.e., plateau) of the variogram in this study are indicated in Fig. 2b and Fig. 3b, respectively. They were chosen by minimizing the residual sums of squares of the fitting. Spherical model was chosen for variogram modeling because it yielded the best leave-one-out cross-validation result for the datasets when compared with linear, exponential and gaussian models. In the final prediction, for each unknown position, minimal 5 nearby variants and maximal 19-25 nearby variants were included in the weighted averaging calculation (i.e., Eq. (6)). All the procedures were performed in the ordinary Kriging module in GS+, version 10 from gammadesign software (https:// geostatistics.com/). We also tried the ordinary Kriging module in gstat package 80 , which gave identical results. The source code for running gstat package in R is uploaded to https://doi.org/10.17632/ycw667nv5f.1 (see Code availability). The multivariate linear regression and random forest regression were performed in R by using basic lm function or randomForest package, respectively.
We used leave-one-out and k-fold cross-validation to assess the performance of VSP strategy. In the leave-one-out cross-validation all data are initially used to build the molecular variogram and Kriging models. Then, we remove each data point, one at a time and use the rest of the data points to predict the missing value. We repeat the prediction for all data points and compare the prediction results with the measured value to generate the Pearson's r-value and its associated p-value (ANOVA test). For the k-fold cross-validation, samples are randomly partitioned into k = 48, 24, 12, 8, 4, 3, or 2 sets. Of the k sets, a single set is used as validation data and the remaining k-1 sets are used as training data. The size of training and validation subsamples are indicated for each k-fold in the figures. The crossvalidation process is repeated k times and every set is used as validation once. The prediction of each sample is collected. For k < 48, the partition process is repeated five times and the averaged Pearson's r-and p-value of the correlation between predicted value and actual value is reported. The error bars associated with each prediction is the prediction confidence. In the correlation analyses, we took the confidence level into account as weight. A prediction with small uncertainty will have a larger weight because it is more precise than prediction with larger uncertainty. The weight is calculated as: ω i ¼ 1 i is the variance for i. Quantitative correlation analyses and p-value calculations were performed using the software Originpro 2016 or R. A p-value < 0.05 was considered to indicate statistical significance.
Phenotype landscapes predicted based on a sparse collection of variants contain experimental information that comprises the full range of values on function y-and z-axis for the entire polypeptide sequence (x-axis). To map the function predictions onto structure, we assign the prediction value with highest confidence to each residue to generate a functional structure that illustrates all values interpolated from the sparse collection variants used to generate the phenotype landscape displaying at atomic resolution. PDB:3JD8 and PDB: 5U73 are used for NPC1 structure maping 25,29 . All the structural presentations were produced by the software of PyMOL.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.