Galectins are a family of glycan-binding proteins characterized by their affinity for β-galactosides and the presence of one or more structurally conserved carbohydrate recognition domains (CRDs)1. With fifteen members identified in vertebrates, galectins display diversity in ligand specificity and can be found in both intracellular and extracellular environments2,3. Notably, galectins have been shown to act as modulators of cell behaviour by regulating signalling processes as well as inflammatory and immune responses4. Galectins are promising candidates as diagnostic markers and novel drug targets for a number of human diseases4,5.

To date, three subtypes of galectins have been identified, based on the number and structural arrangement of the CRDs: prototype, chimera and tandem-repeat6. While high-resolution structures of many full-length galectins remain elusive, crystallographic studies have revealed a significant structural similarity among CRDs. Common to most CRDs is a conserved β-sandwich fold with an overall jellyroll topology as well as a signature sequence for carbohydrate recognition7.

The tandem-repeat subtype of galectins contains two distinct CRDs (galectin-4N at the N-terminus and galectin-4C at the C-terminus) connected in a single polypeptide chain by a linker region6. Studies with tandem-repeat galectins have shown that the linker’s role, likely mediating the intramolecular interactions of CRDs, is associated with potency in inducing a specific biological response8,9,10,11,12,13. Other proposed roles for the linker region include protein-protein interactions, membrane insertion and positioning the CRDs10,11,13.

Despite the importance of the linker, structural studies of galectins have thus far been limited to the individual CRDs or to engineered tandem-repeat galectins where the linker has been truncated. Furthermore, the anticipated flexibility of the linker and its susceptibility to proteolysis have made structural characterizations of full-length tandem-repeat galectins particularly challenging. In order to unravel the structural mechanisms that govern signalling modulation by tandem-repeat galectins, we chose human galectin-4 as our model of study. Galectin-4 belongs to the tandem-repeat category of galectins, together with galectins -6, -8, -9 and -12. Galectin-4 is largely expressed by intestinal epithelial cells and shows opposing effects depending on the type of cancer.

Galectin-4 functions as a tumour suppressor of human colorectal and pancreatic cancer14,15,16. By contrast, in liver and lung cancer, the leading types of cancer that cause death worldwide, galectin-4 expression leads to increased metastasis and cancer progression17,18, suggesting its use as a promising target for drug development5. Here, we provide the first structural characterization of the full-length human galectin-4 using X-ray crystallography, small- and wide-angle X-ray scattering (SAXS/WAXS), molecular modelling, molecular dynamics simulations and differential scanning fluorimetry assays. Our findings reveal that full-length galectin-4 folds as a compact structure and provide insight into the process by which the linker-peptide mediates recognition through correlated movements and transient interactions. These results shed light on the structural role of galectin-4’s linker-peptide and its biological function in this important class of proteins. Moreover, the generated knowledge and experimental tools described here can be exploited to investigate the role of galectin-4 under different pathological conditions.


Protein production and thermal analysis of galectin-4, galectin-4N and galectin-4C

Galectin-4 is composed of 323 amino acid residues, which can be divided into an N-terminal domain (aa 1–150; galectin-4N), a linker-peptide (aa 151–178) and a C-terminal domain (aa 179–323; galectin-4C)19 (Supplementary Fig. S1). The full-length protein and its individual domains, galectin-4N and galectin-4C, were cloned, overexpressed and purified as described in the methods section. First, the folding stability of each construct was examined by differential scanning fluorimetry (Thermofluor), a methodology used to monitor protein unfolding. By measuring the fluorescence-probe intensity as a function of temperature, thermofluor assays allow for the comparison of melting temperatures (Tm), transition profiles and thermal shift (ΔTm) values relative to the reference curves (obtained in buffer) under different conditions. Here, a positive ΔTm indicates thermal stabilization induced by changes in the physicochemical environment.
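
As a minimal sketch of how a melting temperature and a thermal shift can be read from such curves, the Python snippet below takes Tm as the temperature at which a normalized unfolding curve crosses 0.5 (by linear interpolation) and computes ΔTm as Tm(condition) minus Tm(reference). The Boltzmann-like curve shape, its slope parameter and the two example Tm values are illustrative assumptions, not the fitting procedure or the data used in this study.

```python
import math

def sigmoid_curve(temps, tm, slope=1.5):
    """Synthetic normalized unfolding curve (assumed Boltzmann-like shape)."""
    return [1.0 / (1.0 + math.exp((tm - t) / slope)) for t in temps]

def estimate_tm(temps, signal):
    """Temperature at which the normalized signal first crosses 0.5."""
    lo, hi = min(signal), max(signal)
    norm = [(s - lo) / (hi - lo) for s in signal]
    for i in range(1, len(norm)):
        if norm[i - 1] < 0.5 <= norm[i]:
            # Linear interpolation between the two bracketing points
            frac = (0.5 - norm[i - 1]) / (norm[i] - norm[i - 1])
            return temps[i - 1] + frac * (temps[i] - temps[i - 1])
    raise ValueError("curve never crosses the midpoint")

temps = [25.0 + 0.5 * i for i in range(141)]       # 25 to 95 °C in 0.5 °C steps
reference = sigmoid_curve(temps, tm=56.0)          # hypothetical buffer-only curve
with_additive = sigmoid_curve(temps, tm=59.0)      # hypothetical stabilizing additive

tm_ref = estimate_tm(temps, reference)
tm_add = estimate_tm(temps, with_additive)
delta_tm = tm_add - tm_ref                         # positive value = stabilization
```

With real data, the midpoint estimate would typically be replaced by a fit of the full transition, but the sign convention for ΔTm stays the same.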

Reference curves resulted in sigmoidal profiles with respective Tm values of 55.92 ± 0.05 °C for galectin-4, 56.8 ± 0.1 °C for galectin-4N and 68.12 ± 0.05 °C for galectin-4C (Fig. 1a). The thermal behaviour of galectin-4 and its domains was also evaluated against the 94 additives from the Solubility & Stability Screen kit (Hampton Research) (Supplementary Table S1). Analysis of thermal shift (ΔTm) values in the presence of additives revealed that galectin-4C displays the largest ΔTm values and the most distinctive behaviour under changes in the physicochemical environment (Fig. 1b). Lower ΔTm values are observed for the full-length protein than for its CRDs, suggesting that galectin-4 gains stability from the interaction between the CRDs.

Figure 1

Thermofluor assays.

(a) Normalized thermal denaturation curves for galectin-4, galectin-4N and galectin-4C. Measured apparent unfolding temperatures were 55.92 ± 0.05 °C for galectin-4, 56.8 ± 0.1 °C for galectin-4N and 68.12 ± 0.05 °C for galectin-4C. (b) Evaluation of thermal shift profile for galectin-4, galectin-4N and galectin-4C across different categories of additives. Bars show all additives that contribute to interpretable transitions with positive and/or negative thermal shift for the three proteins. Compounds and the respective thermal shift values are listed in Supplementary Table S1. (c) Thermal shift profile as a function of lactose concentration.

The thermal shift of the three constructs was also evaluated in the presence of lactose, a low affinity β-galactoside ligand for galectin-4. A hyperbolic dependence on lactose concentration was observed, allowing for the estimation of saturating ΔTm values of 9.4 ± 0.4 °C, 9.3 ± 0.4 °C and 9.3 ± 0.6 °C for galectin-4, galectin-4N and galectin-4C, respectively (Fig. 1c). Fitting of the apparent binding constant, k, for lactose yields similar values for galectin-4 and galectin-4N of 53 ± 6 and 50 ± 5 mM, respectively, and a k of 78 ± 10 mM for galectin-4C. Apparent affinities obtained by thermofluor, which are proportional to the dissociation constants20, are in agreement with previous findings which described lactose as a weak ligand with galectin-4C displaying 1.5 times lower affinity than galectin-4N (1.3 mM and 1.9 mM for galectin-4N and galectin-4C, respectively21,22).
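
The saturation behaviour described above can be sketched with a one-site hyperbola, ΔTm = ΔTm,max·[L]/(k + [L]). This functional form is an assumption: the text reports only a hyperbolic dependence, not the exact equation that was fitted. The snippet uses the saturating shifts and apparent constants reported above and compares the galectin-4C/galectin-4N ratio of apparent constants with the ratio of the literature dissociation constants.

```python
def hyperbolic_shift(conc_mm, dtm_max, k_mm):
    """Assumed one-site hyperbola: dTm = dTm_max * [L] / (k + [L])."""
    return dtm_max * conc_mm / (k_mm + conc_mm)

# (saturating dTm in °C, apparent constant k in mM) as reported in the text
params = {
    "galectin-4":  (9.4, 53.0),
    "galectin-4N": (9.3, 50.0),
    "galectin-4C": (9.3, 78.0),
}

# At [L] = k the model returns half of the saturating shift
half = {name: hyperbolic_shift(k, dmax, k) for name, (dmax, k) in params.items()}

# Ratio of apparent constants, C-terminal versus N-terminal domain
ratio_thermofluor = params["galectin-4C"][1] / params["galectin-4N"][1]  # 78 / 50
# Ratio of the dissociation constants quoted from refs 21,22 (1.9 mM / 1.3 mM)
ratio_literature = 1.9 / 1.3
```

Both ratios are close to 1.5, which is the sense in which the thermofluor constants agree with the previously reported affinities.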

Additionally, melting curves for full-length galectin-4 were evaluated at different ionic strengths and pH values using the Solubility & Stability Screen 2 kit (Hampton Research). Although the Tm of full-length galectin-4 decreased with decreasing pH, the melting curves consistently showed a single transition, as expected for the denaturation of a single-domain protein, suggesting that the global structure of galectin-4 remains stable as a compact unit over a wide range of conditions.

Structural models for galectin-4N, galectin-4C and full-length galectin-4

To elucidate the full structural architecture of galectin-4, we solved the crystal structures of galectin-4N and galectin-4C at 1.48 Å and 1.78 Å resolution, respectively23,24 (Table 1). The final models for galectin-4N and galectin-4C comprise residues 5 to 152 and 184 to 323, respectively, and share the same structural features previously described by Bum-Erdene and co-workers21,22. Both structures show the canonical β-sandwich fold arranged in a jellyroll topology, in which the monomer is formed by two antiparallel β-sheets, each composed of six (F0-F5/F0′-F5′ and S1-S6/S1′-S6′) β-strands (Fig. 2a).

Table 1 Data collection and refinement statistics.
Figure 2

Crystal structures of galectin-4N and galectin-4C.

(a) Overall β-sandwich fold of galectin-4N (blue) and galectin-4C (pink) structures. The antiparallel β-sheets are shown in blue (F0-F5) and cyan (S1-S6a/b) for galectin-4N and pink (F0′-F5′) and light pink (S1′-S6a′) for galectin-4C. (b) Electrostatic potential surface for both the galectin-4N and galectin-4C structures. Front view (β-sheet S1-S6/S1′-S6′) and back view (β-sheet F0-F5/F0′-F5′). The circle marks the canonical binding site. (c) Canonical (pink) and extended (yellow) binding sites of galectin-4 domains. The main residues involved in binding interactions are represented as sticks. (d) Sequence alignment of galectin-4N and galectin-4C showing secondary structure elements. Conserved residues are marked in bold. Residues of the canonical carbohydrate-binding site are highlighted in pink; the star marks the only conservative substitution in the binding site residues between both domains. Extended binding site residues are shown in yellow.

Structural analysis of both galectin-4N and galectin-4C domains, which share a root mean square deviation (RMSD) of 1.2 Å between Cα atoms, reveals a large difference in charge distribution when the electrostatic potential surface is calculated at the physiological pH 7.4 (Fig. 2b). The galectin-4C surface charge distribution is mostly positive, whereas the galectin-4N surface displays a more heterogeneous distribution with a positive region localized in the binding site.

The carbohydrate-binding site is located in a shallow pocket composed of residues present in the S4/S4′, S5/S5′ and S6/S6′ strands and the S5/S5′ adjacent loop. The residues involved are His63/236, Asn65/238, Arg67/240, Asn77/249, Trp84/256, Glu87/259 and Arg89/Lys261 in the galectin-4N/galectin-4C structures, respectively (Fig. 2c,d). The S2/S2′ and S3/S3′ strands, thought to contribute to the selectivity between galectin-4N and galectin-4C domains, form an extended cleft that permits interaction with different ligands. The main amino acid substitutions in the galectin-4N/galectin-4C structures are His135/Thr309, Gln137/Glu311 and Asp139/Gln313 for the S2/S2′ strand and Arg45/Ser220, Phe47/Ala222 and Val51/Lys226 for the S3/S3′ strand (Fig. 2c,d). Arg45 in the S3 strand from galectin-4N has been identified as the main residue to interact with a cholesterol sulphate ligand25 and to contribute weakly to lactose-3′-sulfate interaction22. Asn224 and Lys226 (S3′ strand), as well as Glu311 and Gln313 (S2′ strand) from galectin-4C have been shown to establish additional interactions with lacto-N-tetraose and lacto-N-neotetraose ligands21. Also in galectin-4C, Ser220 was identified as responsible for A-type saccharide preference21. Additional differences are observed in the loops between strands S3/S3′-S4/S4′ and S4/S4′-S5/S5′, where insertions are observed when comparing galectin-4N and galectin-4C amino acid sequences (Fig. 2d).

A structural model for full-length galectin-4 was obtained by combining molecular modelling and molecular dynamics (MD) simulations. First, ab initio prediction was used to generate different models of the linker-peptide. The best models, which share a compact structure and the presence of a short helix segment, were selected based on geometry and agreement between observed and predicted content in secondary structure. The linkers were combined as a single polypeptide chain with the X-ray structures of galectin-4N and galectin-4C, which were randomly arranged in relation to each other, giving rise to six different starting models for full-length galectin-4. The model with the lowest potential energy (Fig. 3a) was submitted to a conformational refinement by MD. We began with a standard backbone-restrained solvation and thermalisation (2 ns) to achieve a pressure of 1 atm and a temperature of 37 °C (310 K) in the simulation box. A 30 ns production simulation was subsequently performed to ensure that the system reached and maintained proper equilibrium. The resulting trajectory was then analysed by principal component analysis (PCA), allowing us to select the lowest energy frame, which was designated as the starting point for all further rounds of MD simulations described in this work (Fig. 3b).

Figure 3

Model of full-length galectin-4.

(a) Cartoon representation of the initial model for the full-length protein. (b) Overall fold of the galectin-4 model after equilibrium dynamics and geometry optimization. (c) Representation of inter-domain interactions mediated by hydrogen bonds.

The galectin-4 model displays four antiparallel β-sheets connected by a linker-peptide that can be described as a proline-rich hinge followed by a short α-helix (amino acids 170–173) and an extended region (Fig. 3b). We observe a compact structure, having overall dimensions of 74 Å × 55 Å × 45 Å, in which the CRDs interact with each other and with the linker-peptide. These interactions are stabilized by 10 hydrogen bonds and 152 non-bonded contacts (Fig. 3c). The contact areas between interfaces were determined to be 465 Å² (galectin-4N/linker), 349 Å² (galectin-4N/galectin-4C) and 418 Å² (linker/galectin-4C).

Solution conformation of human galectin-4

To evaluate the energy-minimized full-length galectin-4 model obtained by MD (Fig. 3b), the overall conformation of the protein was examined in solution by X-ray scattering, a technique that is ideally suited for probing ligand-induced conformational changes and for examining dynamic proteins that are challenging to crystallise. In-line size exclusion chromatography (SEC) was used to separate any mixtures as well as to ensure accurate background subtractions. Scattering was measured over a wide range of scattering angles on galectin-4 both in the absence of any ligands and in the presence of 30 mM lactose (Supplementary Fig. S2). For each sample, approximately 500 exposures were collected as the elution flowed directly into a continuous-flow cell. In each case, sample homogeneity was confirmed in the central region of the elution peak (Supplementary Fig. S2, blue regions) by singular value decomposition (SVD) and Guinier analysis (Supplementary Fig. S2)26 and thus, the scattering profiles within these regions were averaged (Fig. 4a, gray circles). A comparison of the experimental curve with the theoretical scattering of a model of galectin-4, in which the CRDs are non-associating (Fig. 3a, dotted curve) shows a poor fit, whereas a comparison with the theoretical scattering calculated from the full-length model described above (Fig. 3b, black curve) shows remarkable agreement. Consistent with this result, the ab initio shape reconstruction of galectin-4 derived from the SAXS data also suggests a compact conformation in which galectin-4N and galectin-4C are associated (Fig. 4b). Interestingly, the scattering of galectin-4 in the presence of lactose is nearly superimposable with that of ligand-free galectin-4. Only a subtle difference is apparent at low angles, corresponding to features at large length scales. 
Consistent with this, Guinier analysis yields slightly different radii of gyration for galectin-4 without and with lactose of 23.7 ± 0.1 Å and 24.9 ± 0.1 Å, respectively. The subtle expansion in the conformation upon addition of lactose is best visualized by an increase in the width of the pair-distance distribution function, P(r) (Fig. 4c).
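
For readers less familiar with Guinier analysis, the sketch below shows the underlying idea: at low angles the scattering follows I(q) ≈ I0·exp(−q²Rg²/3), so a straight-line fit of ln I against q² gives Rg from the slope. The data here are noise-free synthetic intensities generated from the two Rg values reported above, and the q·Rg ≤ 1.3 cut-off and function names are illustrative assumptions; this is not the software or the data used in the study.

```python
import math

def guinier_fit(q, intensity):
    """Least-squares line through ln I versus q^2; returns (Rg, I0)."""
    x = [qi * qi for qi in q]
    y = [math.log(ii) for ii in intensity]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # equals -Rg^2 / 3
    intercept = my - slope * mx            # equals ln I0
    return math.sqrt(-3.0 * slope), math.exp(intercept)

def synthetic_guinier(rg, i0=1.0, q_rg_max=1.3, n=20):
    """Noise-free Guinier-law intensities up to q*Rg = q_rg_max (illustrative)."""
    q = [q_rg_max / rg * (i + 1) / n for i in range(n)]
    return q, [i0 * math.exp(-(qi * rg) ** 2 / 3.0) for qi in q]

rg_apo, _ = guinier_fit(*synthetic_guinier(23.7))   # ligand-free value from the text
rg_lac, _ = guinier_fit(*synthetic_guinier(24.9))   # value with lactose from the text
expansion = rg_lac - rg_apo                         # about 1.2 Å
```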

Figure 4

Solution conformation of full-length galectin-4 examined by X-ray scattering.

(a) The experimental scattering of galectin-4 in the absence of ligand (gray) is well fit by the theoretical scattering of the full-length model in Fig. 3b (solid line), confirming that the two CRDs associate in solution. In contrast, a comparison of the experimental scattering to the theoretical scattering of the model found in Fig. 3a in which the CRDs are non-associating (dotted), shows a poor fit. (b) An ab initio shape reconstruction generated from ligand-free galectin-4 scattering data also shows good agreement with the full-length model. (c) Addition of lactose leads to a subtle expansion in the width of the pair-distance distribution function, P(r) and a slight increase in radius of gyration.

Molecular dynamics simulations

We performed molecular dynamics simulations of both galectin-4 and the galectin-4-lactose complex to investigate the behaviour of the protein in the presence and absence of a ligand. For each system, we performed four independent trajectories of 100 ns using different seeds (named MD 1, MD 2, MD 3 and MD 4). Analysis of the RMSD for backbone atoms showed that all simulation systems reached equilibrium before 100 ns (Supplementary Figs S3 and S4). Variations among the MD simulations showed that the apo structure adopts two main conformations: an “open” conformation with an average Rg of 23 Å and a “closed” conformation with an average Rg of 22 Å (Supplementary Fig. S3). The Rg histogram for the MD simulations also revealed that in the protein-lactose complex, galectin-4 is stabilized in the “open” conformation (Supplementary Fig. S4).
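
The radius of gyration used to distinguish the two conformations can be computed directly from atomic coordinates as Rg = sqrt(Σ mᵢ|rᵢ − r_cm|² / Σ mᵢ). The following is a minimal standard-library sketch with a toy coordinate set; the function name and the cube example are illustrative and do not reproduce the analysis tool used for the trajectories.

```python
import math

def radius_of_gyration(coords, masses=None):
    """Mass-weighted Rg: sqrt(sum m_i |r_i - r_cm|^2 / sum m_i)."""
    if masses is None:
        masses = [1.0] * len(coords)       # unit masses if none are given
    total = sum(masses)
    center = [sum(m * c[k] for m, c in zip(masses, coords)) / total
              for k in range(3)]
    sq = sum(m * sum((c[k] - center[k]) ** 2 for k in range(3))
             for m, c in zip(masses, coords))
    return math.sqrt(sq / total)

# Toy check: eight unit-mass points at the corners of a cube of edge 2;
# every point lies sqrt(3) from the centre, so Rg equals sqrt(3)
cube = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
rg_cube = radius_of_gyration(cube)
```

Applied frame by frame to a trajectory, the same function yields the Rg time series from which a histogram such as the one described above could be built.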

Analyses of RMSD plots for each independent domain (Supplementary Figs S3 and S4) reveal that galectin-4C remained stable throughout the MD trajectory. Galectin-4N was shown to converge to similar structures with an average deviation of 1.4 Å. Larger conformational fluctuations were observed in the linker-peptide, as expected for this type of disordered secondary structural element (Supplementary Figs S3 and S4).
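
As a reminder of what the per-domain RMSD values measure, the sketch below computes the root mean square deviation between matched atoms of two coordinate sets. It assumes the structures have already been superimposed; the optimal superposition step used in practice is omitted for brevity, and the coordinates are toy values rather than galectin-4 data.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between matched atoms of two pre-superimposed structures."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must have the same length")
    sq = sum((a[k] - b[k]) ** 2
             for a, b in zip(coords_a, coords_b) for k in range(3))
    return math.sqrt(sq / len(coords_a))

# Toy example: three atoms, each displaced by exactly 1 Å in the second frame
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
frame = [(0.0, 1.0, 0.0), (3.8, 0.0, 1.0), (7.6, -1.0, 0.0)]
value = rmsd(reference, frame)
```

Restricting the atom lists to the residues of one domain gives the per-domain curves discussed above.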

Inter-domain communication in galectin-4

To ensure that the analysis was performed on a well-thermalized system, we extended the MD 1 simulation to 250 ns and compared the 150 ns interval between 100 and 250 ns for both simulations (with and without lactose). RMSD plots (Fig. 5a,b) consistently showed differing galectin-4 behaviour in the absence and presence of lactose. In both cases, the linker-peptide generally showed the highest deviation values, which are correlated with conformational changes associated with the full-length structure (Fig. 5a,b). Moreover, in the presence of the ligand, the galectin-4N domain showed a higher structural variability than galectin-4C.

Figure 5

RMSD plots for molecular dynamics simulation with (+lactose) and without (−lactose) lactose, 150 ns trajectories.

RMSD by domain for (a) (−lactose) and (b) (+lactose). (c) RMSF box chart for MD simulations without and with lactose and cartoon putty representation of mobility throughout the trajectory (inset); the blue-white-magenta scale represents the calculated B-factor from 0 to 250 Å². Porcupine plot of the first eigenvector generated through principal component analysis of the representative structure with lactose in (d) front view and (e) bottom view. The vectors, represented as blue arrows, show the tendency of movement. Plot of atomic correlations of MD (f) without lactose and (g) with lactose. Correlated movements are shown in pink and anticorrelated movements in blue, according to the scale bar. The bars indicate the portion of the graph relating to each domain: white for galectin-4N, light gray for linker and dark gray for galectin-4C.

For both MD simulations, we evaluated mobility using root mean square fluctuation (RMSF) box charts (Fig. 5c). The average RMSF was 1.0 ± 0.4 Å for galectin-4 and 2.0 ± 0.8 Å for the galectin-4-lactose system. Overall, the highest B-factors were in the galectin-4-lactose system, indicating greater flexibility than galectin-4 without lactose (Fig. 5c, inset). In both cases, the flexible regions were mainly found at the N-terminus, linker-peptide and regions between β-strands, with an emphasis on seven loops of galectin-4 (S3-S4, S5-S6, S3′-S4′, S4′-S5′, S5′-S6′, F4′-F5′ and F5′-S2′) and sixteen loops of galectin-4-lactose (F0-S1, F2-S3, S3-S4, S4-S5, S5-S6, S6-F3, S2-F1, F0′-S1′, F2′-S3′, S3′-S4′, S4′-S5′, S5′-S6′, S6′-F3′, F4′-F5′, F5′-S2′ and S2′-F1′).
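
The per-atom RMSF can be sketched as the root of the time-averaged squared deviation of each atom from its mean position. The snippet below is a standard-library illustration on a toy trajectory, assuming frames have already been aligned. The conversion to a B-factor, B = (8π²/3)·RMSF², is the commonly used isotropic relation and is included as an assumption, since the text does not state how the B-factors shown in Fig. 5c were derived.

```python
import math

def rmsf(trajectory):
    """Per-atom RMSF; trajectory[frame][atom] is (x, y, z), frames pre-aligned."""
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for i in range(n_atoms):
        mean = [sum(frame[i][k] for frame in trajectory) / n_frames
                for k in range(3)]
        msf = sum(sum((frame[i][k] - mean[k]) ** 2 for k in range(3))
                  for frame in trajectory) / n_frames
        result.append(math.sqrt(msf))
    return result

def b_factor(rmsf_value):
    """Assumed isotropic conversion: B = (8 * pi^2 / 3) * RMSF^2, in Å^2."""
    return 8.0 * math.pi ** 2 / 3.0 * rmsf_value ** 2

# Toy trajectory: atom 0 is fixed, atom 1 oscillates by ±2 Å along x
toy = [
    [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (14.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (14.0, 0.0, 0.0)],
]
values = rmsf(toy)
b_mobile = b_factor(values[1])
```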

This protein flexibility is related to the nature of intramolecular interactions. Hydrogen bond pairs with more than 10% occupancy were analysed between domains (Supplementary Table S2). For the MD simulation without lactose, we observed four H-bond pairs between galectin-4N/linker, five between galectin-4C/linker and four between galectin-4N/galectin-4C, of which only five had greater than 50% occupancy. With lactose, there are seven H-bond pairs between galectin-4N/linker, nine between galectin-4C/linker and five between galectin-4N/galectin-4C; however, only eight pairs interacted more than 50% of the time. Although the two MD simulations share only one H-bond pair, 148ASN(D22)-171HIS(ND1), eight common residues are involved in different H-bonding interactions. Moreover, a structural comparison between simulations at 250 ns revealed that the main interactions are non-bonded contacts, among which many residues are the same in both systems.
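
Hydrogen-bond occupancy is the fraction of trajectory frames in which a given donor-acceptor pair satisfies the bond criteria. The sketch below illustrates the 10% and 50% thresholds used above on made-up per-frame detections (the pair names and presence lists are hypothetical), and then tallies the pair counts reported in the text.

```python
def occupancy(presence):
    """Fraction of frames in which a hydrogen bond is detected."""
    return sum(presence) / len(presence)

# Hypothetical per-frame detections (1 = bond present) for three made-up pairs
frames = {
    "pair_A": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1],
    "pair_B": [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    "pair_C": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
occ = {name: occupancy(p) for name, p in frames.items()}
reported = sorted(n for n, o in occ.items() if o > 0.10)     # above 10% occupancy
persistent = sorted(n for n, o in occ.items() if o > 0.50)   # above 50% occupancy

# Pair counts reported in the text for each interface
counts_apo = {"4N/linker": 4, "4C/linker": 5, "4N/4C": 4}
counts_lac = {"4N/linker": 7, "4C/linker": 9, "4N/4C": 5}
total_apo = sum(counts_apo.values())    # 13 pairs, 5 of them above 50%
total_lac = sum(counts_lac.values())    # 21 pairs, 8 of them above 50%
```

In both systems, fewer than half of the listed pairs persist for more than 50% of the simulation (5 of 13 and 8 of 21), which illustrates the transient character of the inter-domain contacts.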

Due to its more compact structure, the model without ligand showed larger interface areas than the galectin-4-lactose complex (Supplementary Fig. S5). The contact areas between surfaces in galectin-4 were determined to be 540 Å² (galectin-4N/linker), 481 Å² (galectin-4N/galectin-4C) and 334 Å² (linker/galectin-4C). For the structure with lactose, these values were 325 Å², 202 Å² and 428 Å², respectively. These interface areas suggest that in the first system the linker-peptide is shifted towards galectin-4N, while in the system with lactose it is shifted towards galectin-4C. The dynamic nature of the interface, where the interactions are sustained by transient contacts, gives this region an intrinsic flexibility.
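
The comparison of interface areas can be made explicit with a short calculation on the values reported above. The dictionary keys and the "linker bias" quantity (galectin-4N/linker area minus linker/galectin-4C area) are illustrative names introduced here, not terms from the analysis software.

```python
# Interface areas in Å^2, as reported in the text
areas = {
    "apo":     {"4N/linker": 540, "4N/4C": 481, "linker/4C": 334},
    "lactose": {"4N/linker": 325, "4N/4C": 202, "linker/4C": 428},
}

# Total buried interface per system
totals = {state: sum(v.values()) for state, v in areas.items()}

# Positive: linker contacts favour galectin-4N; negative: they favour galectin-4C
linker_bias = {state: v["4N/linker"] - v["linker/4C"] for state, v in areas.items()}

# Loss of direct CRD-CRD contact area upon lactose binding
crd_contact_loss = areas["apo"]["4N/4C"] - areas["lactose"]["4N/4C"]
```

The totals (1355 Å² without and 955 Å² with lactose) reflect the more compact ligand-free model, and the sign change of the bias (+206 Å² to −103 Å²) corresponds to the shift of the linker-peptide from galectin-4N towards galectin-4C.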

Principal component analysis (PCA) was used to estimate the primary domain motions (Fig. 5d,e). The results indicate that only a portion of the linker showed significant movement in the simulation without lactose. In contrast, both CRDs showed opposing rotational movements in the presence of lactose (Fig. 5d,e). According to the RMSD plot (Fig. 5b), the structural rearrangement in the linker is associated with a movement that pushes the CRDs in opposite directions (Fig. 5d,e).

Additionally, correlation plots showed that the two structures, galectin-4 and galectin-4-lactose, have different structural correlation patterns (Fig. 5f,g). Galectin-4 mainly showed positive intra-domain correlations, with few anti-correlated movements between CRDs. Although the linker showed high flexibility, its movement was not correlated with any domain (Fig. 5f). The galectin-4-lactose complex, in contrast, showed a larger number of positive and negative correlations (Fig. 5g), involving residues of all domains.
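
Correlation maps of this kind are commonly built from the normalized covariance of atomic displacements, C_ij = ⟨Δrᵢ·Δrⱼ⟩ / sqrt(⟨Δrᵢ²⟩⟨Δrⱼ²⟩), which ranges from −1 (anti-correlated) to +1 (correlated). The following sketch assumes this definition, which the text does not state explicitly, and applies it to a toy three-atom trajectory with pre-aligned frames.

```python
import math

def cross_correlation(trajectory, i, j):
    """Normalized covariance of the displacements of atoms i and j (-1 to 1)."""
    n = len(trajectory)

    def mean_pos(a):
        return [sum(f[a][k] for f in trajectory) / n for k in range(3)]

    mi, mj = mean_pos(i), mean_pos(j)
    dot = sii = sjj = 0.0
    for f in trajectory:
        di = [f[i][k] - mi[k] for k in range(3)]   # displacement of atom i
        dj = [f[j][k] - mj[k] for k in range(3)]   # displacement of atom j
        dot += sum(a * b for a, b in zip(di, dj))
        sii += sum(a * a for a in di)
        sjj += sum(b * b for b in dj)
    return dot / math.sqrt(sii * sjj)

# Toy trajectory: atoms 0 and 1 move together, atom 2 moves the opposite way
toy = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (6.0, 0.0, 0.0), (9.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(-1.0, 0.0, 0.0), (4.0, 0.0, 0.0), (11.0, 0.0, 0.0)],
]
c01 = cross_correlation(toy, 0, 1)   # fully correlated pair
c02 = cross_correlation(toy, 0, 2)   # fully anti-correlated pair
```

Evaluating the function for every pair of Cα atoms would give a matrix analogous to the plots in Fig. 5f,g.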

Despite this movement, the low RMSD of each domain throughout the trajectory (Fig. 5b) indicates low structural variability. Even so, galectin-4N and galectin-4C show long-range anti-correlated movements with respect to each other (Fig. 5g). The combination of these two behaviours reflects a correlated movement of rigid bodies mediated by the exchange of weak interactions with the linker.


It is well known that CRDs share a conserved β-sandwich fold and that there is a sequence signature for carbohydrate recognition and binding (Supplementary Fig. S1)7. However, one of the most notable properties of galectins and their CRDs is the meticulous way in which they discriminate among different glycans, resulting in a variable and complex biological response27,28.

Studies have demonstrated that the tandem-repeat galectins are more potent than galectins-1 and -3 in activating signalling in T cells and neutrophils9,12,13. In addition, they display a broad spectrum of biological activities as major signalling modulators both inside and outside the cell. This characteristic suggests that a combination of two distinct CRDs and a linker-peptide brings together chemical, structural and dynamic diversity able to impact on potency and on the plurality of carbohydrate-dependent events involved in their signalling ability and adhesive properties10.

The impact of tandem-repeat galectins on biological response has been associated with structural flexibility, relative orientation and spacing between CRDs9. However, structural and dynamic characteristics of tandem-repeat galectins, including the type of interactions between CRDs and the linker-peptide, remain elusive and thus merit concentrated investigative efforts. Despite the importance of this class of proteins in both physiological and pathological processes, the flexibility imposed by the linker and its susceptibility to proteolysis29 have made these studies very challenging.

As an important step toward assessing the underlying mechanisms that govern the function of tandem-repeat galectins acting on multiple targets, we present for the first time a structural model of human galectin-4 based on a combination of theoretical and experimental approaches. The final model of galectin-4, constructed based on X-ray crystallography, molecular modelling and MD simulations and further supported by SAXS experiments, reveals that galectin-4 folds as a compact structure in which the CRDs interact both with each other and with the linker-peptide (Fig. 3b). The galectin-4 domains, galectin-4N, galectin-4C and the linker-peptide, were found to be mainly connected by weak (hydrogen bonds and other non-bonded interactions) and transient contacts, revealing the dynamic nature of the interfacial interactions (Supplementary Table S2).

Experimental evidence for interaction between the CRDs was also observed when comparing the thermal denaturation profiles of the full-length galectin-4 with its independent domains (Fig. 1a). Although there was an 11 °C difference between the melting temperatures of the CRDs, large enough to be distinguished if the unfolding process were characterized by sequential (non-cooperative) unfolding of the CRDs, the melting curve obtained for full-length galectin-4 was consistent with a single-domain protein denaturing event (Fig. 1a). The same profile was observed when galectin-4 was subjected to different pH values, ionic strengths and additives. This result reinforces the hypothesis that the CRDs are not only associated under physiological conditions, but also remain together under diverse conditions, including those that mimic acidic extracellular microenvironments characteristic of tumour tissue30 in which the protein is often present.

Corroborating the idea of a compact structure, full-length galectin-4 was also shown to be more stable than its independent domains (Fig. 1b). In fact, a comparison of the melting curves of galectin-4, galectin-4N and galectin-4C allowed us to compare the behaviour of isolated CRDs with full-length galectin-4 and infer the individual contribution of each CRD for galectin-4 structure.

Differences between the galectin-4N and galectin-4C melting curves under the different conditions are notable (Fig. 1b, Supplementary Table S1) and can be explained as a consequence of variation in their chemical properties, i.e., number and charge distribution of amino acids among CRDs (Fig. 2b). Galectin-4C was shown to be more sensitive to changes in the chemical environment, displaying larger thermal shift (ΔTm) values, but it appears more stable than galectin-4N overall (Fig. 1b, Supplementary Table S1). In agreement, MD data show that galectin-4C is more rigid (Fig. 5b), a requirement to compensate for increased thermal fluctuations. In contrast, the larger RMSD values observed during simulation reveal that galectin-4N can be more plastic (Fig. 5b), a characteristic that allows this domain to be more promiscuous in carbohydrate recognition and binding, as well as more potent in achieving a biological response.

Careful analysis of melting curves and thermal shift values under different chemical environments reveals that galectin-4 takes advantage of the stability of both domains to remain stable over a larger range of chemical conditions, i.e., the most stable domain governs the denaturation process of galectin-4 (Fig. 1b). This combined response is a reflection of its compact structure and of the ability of the linker-peptide to switch back and forth between CRDs that allows for transient interactions to stabilize the more susceptible domain (Supplementary Table S2).

The similarity between the hyperbolic dependence on lactose concentration for galectin-4 and galectin-4N indicates that the response for the full-length protein is governed by a single binding site with similar properties to those of the galectin-4N domain (Fig. 1c). The lack of clear evidence for a contribution of the galectin-4C binding site to the behaviour of the full-length protein (Fig. 1c) can be explained by the contribution of the linker, as observed in our MD simulations (Supplementary Fig. S5). Whether the cross talk between galectin-4N and galectin-4C has a positive or a negative impact on galectin-4C lactose recognition remains to be elucidated.

Thermofluor studies complemented by our MD data provide insight into protein flexibility under different conditions. These results demonstrate that the sequence variation among the galectin-4 CRDs, although preserving the integrity of the CRD β-sandwich fold and sequence signature for carbohydrate recognition, enables the CRDs to respond differently to a given chemical environment. Thus, physiologically, the CRDs not only work as agents of glycan recognition, but can also be considered biochemical sensors of the microenvironment, important for adapting the lectin properties of galectin-4 to different conditions and thereby assuring its biological impact in distinct physiological and pathological processes.

In contrast to the apo protein, the galectin-4-lactose complex is stabilized in an open conformation, characterized by a hinge-bending motion (Fig. 5d,e) and a decrease in contact areas between domains (Supplementary Fig. S5). Consistent with our MD results (Supplementary Fig. S4), an increase in radius of gyration is observed by SAXS in the presence of lactose. Covariance analysis showed that the movement between linker and CRDs is directly correlated (Fig. 5g), whereas analysis of both RMSD and RMSF distributions demonstrates that both CRDs move as rigid bodies, without any significant intra-domain distortion or disruption of the carbohydrate-binding site (Fig. 5b,c).

Together, thermofluor, SAXS and MD analyses associate this lactose-stabilized, elbow-hinged switch in the full-length galectin-4 with a gain of thermal stability in each individual CRD (Fig. 1c) and of flexibility (Fig. 5c). In other words, the enthalpy gain associated with lactose binding is compensated by an entropy loss within the CRDs and is correlated with an entropy gain in the full structure.

Our work also sheds light on the role of the linker-peptide as a key element in tandem-repeat galectins. In the galectin-4 model, the linker was observed to function as a molecular hinge that mediates the interaction between the CRDs (Fig. 3c), thanks to its high content of proline residues (28.6%), which imposes severe restrictions on the conformation and movement of this region. In fact, a comparison among the five known tandem-repeat galectins and their isoforms reveals the existence of ten different linker-peptides characterized by high variability in length and amino acid distribution, but sharing a high content of proline residues (Supplementary Fig. S1). This feature affects the global structure of tandem-repeat galectins and the manner in which the linker-peptide coordinates the movement and distance between CRDs. Thus, it is reasonable to predict that each member of the tandem-repeat galectin subfamily possesses a structural arrangement that depends on features of all individual domains. Galectin-4 and its homologue galectin-6, for example, share high sequence identity, but have very distinct linker-peptides capable of offering unique structural and dynamic features for each protein and, in turn, unique biological roles. Our model for galectin-4 provides the basis for further investigation.
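
As a quick consistency check on the proline content, the calculation below converts the reported percentage into a residue count, assuming the percentage is computed over the linker-peptide boundaries given earlier (aa 151–178). The inferred count of eight prolines is a derived value, not a number stated in the text.

```python
linker_start, linker_end = 151, 178            # linker boundaries from the text
linker_length = linker_end - linker_start + 1  # inclusive residue count

proline_fraction = 0.286                       # 28.6% reported for the linker
implied_prolines = round(proline_fraction * linker_length)

# Recompute the percentage from the inferred count to confirm it rounds to 28.6
check_percent = round(100 * implied_prolines / linker_length, 1)
```

Under this assumption, 28.6% corresponds to 8 prolines in a 28-residue linker.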

Notably, all tandem-repeat galectin linker-peptides share proline-rich regions (PRRs). Besides their influence on protein structure and stability, PRRs are also described as binding domains31. In particular, they have a unique architecture which allows them to participate in molecular interactions that rely on multiple weak binding sites31. This architecture is characterized by restricted mobility, which reduces the unfavourable entropy loss of peptides upon binding. It is further influenced by the flat hydrophobic surface of prolines and the characteristics of the amide bond preceding proline, which make it a strong hydrogen bond acceptor. The unique architecture of PRRs can be particularly important in protein-protein and protein-nucleic acid interactions involved in intracellular signalling dependent on tandem-repeat galectins4. In particular, the continuous surface observed in galectin-4, as a consequence of its single domain arrangement, may favour protein-protein interactions including galectin-4 dimerization, as previously observed25. This is in contrast to a scenario in which the CRDs are flexible and move independently of each other.

In summary, a multi-technique approach has allowed us to investigate the structure of galectin-4 and its thermal and dynamic behaviours. Our results suggest that changes in the physicochemical environment have a direct effect on the ability of the CRDs to reach different conformational states and, in turn, modulate ligand recognition. The relative positions of the CRDs and the extent of cross talk between them depend on the structural features of the linker-peptide, in an orchestrated mechanism of detection and response to a cellular stimulus.


Protein cloning, expression and purification

The human galectin-4 open reading frame (GenBank: CR536544.1), coding for amino acids 1–323, was amplified from a previously constructed plasmid encoding galectin-4 and was cloned into the EcoRI/XhoI sites of the pET-28a (Novagen) modified vector, pET-28a-SUMO. This vector was designed to produce an N-terminal His-tagged SUMO fusion protein via the insertion of a carrier ubiquitin-like protein, SMT3 from Saccharomyces cerevisiae (UniProtKB/Swiss-Prot: Q12306.1), between the NheI and BamHI sites. DNA sequencing confirmed proper insertion of the galectin-4 gene fragment into the pET-28a-SUMO vector. Escherichia coli Rosetta (DE3) cells (Novagen), transformed with the expression vector, were cultured in LB medium containing 34 μg ml−1 chloramphenicol and 30 μg ml−1 kanamycin at 37 °C. Overproduction of recombinant galectin-4 was induced by adding 50 μM isopropyl β-D-1-thiogalactopyranoside once the optical density OD600 reached 0.5. Growth continued for 24 h at 25 °C and 180 rev min−1. Cells were harvested by centrifugation at 10,000 g for 10 minutes at 4 °C. The cell pellet was kept on ice and suspended in lysis buffer (50 mM monosodium phosphate pH 8.0, 600 mM NaCl, 14 mM β-mercaptoethanol and 1 tablet of EDTA-free SIGMAFAST™ protease inhibitor cocktail). Cells were subsequently disrupted by ten 30 s, 10 W sonication pulses applied at 30 s intervals. The lysate was then clarified by centrifugation at 4 °C and 16,000 g for 30 minutes. The resulting supernatant was loaded onto a Ni-NTA column pre-equilibrated with buffer A (50 mM monosodium phosphate pH 8.0, 600 mM NaCl and 14 mM β-mercaptoethanol). The column was washed with a step gradient of 0 and 25 mM imidazole added to buffer A, at ten column volumes each. The His6-SUMO-galectin-4 fusion was eluted with ten column volumes of buffer A plus 500 mM imidazole. Protein fractions were identified by their absorbance at 280 nm, pooled, concentrated using a 10 kDa cut-off centrifugal filter unit Amicon® Ultra-15 (Millipore) and dialyzed against buffer A. The His6-tagged SUMO was cleaved by ULP1 protease (ubiquitin-like-specific protease 1) for 16 h at 8 °C. The sample was subsequently loaded onto a Ni-NTA resin column where galectin-4 was separated from ULP1 and SUMO through elution with buffer A plus 25 mM imidazole.

Galectin-4N (N-terminal domain from human galectin-4, residues 1–152)23 and galectin-4C (C-terminal domain from human galectin-4, residues 179–323)24 were cloned, expressed and purified as previously described. All three proteins were further submitted to size exclusion chromatography using a Superdex200 10/300 column (GE Healthcare) pre-equilibrated with 50 mM HEPES pH 7.2, 150 mM NaCl and 14 mM β-mercaptoethanol. Purity of the resultant fractions was analysed by SDS-PAGE stained with Coomassie Brilliant Blue.

Thermofluor for galectin-4, galectin-4N and galectin-4C

Thermofluor was used to map the response to chemical environments of galectin-4 and its domains galectin-4N and galectin-4C. The experiments were conducted in an Mx3005P RT-PCR (Agilent Technologies) using SYPRO® orange (492/610 nm) (Invitrogen) as a fluorescent probe to detect exposed hydrophobic regions of the proteins. Samples were filtered through 0.2 μm membranes (Millipore) and quantified at 280 nm based on the theoretical molar extinction coefficient. Analysis of the proteins’ thermal denaturation profiles was performed using a 96-well PCR plate (Agilent Technologies). The samples were heated from 25 °C to 95 °C at 1 °C/min and fluorescence measurements were taken. Thermal melting curves were processed as in the protocol described by Niesen and co-workers32 and the melting temperature was obtained using GraphPad Prism software. For a comparison of the galectin-4, galectin-4N and galectin-4C denaturation profiles, we initiated a 20 μl reaction containing 10 μM protein in 25 mM HEPES pH 7.2, 75 mM NaCl, 7 mM β-mercaptoethanol and 5X SYPRO® orange. In the same conditions, the behaviour of galectin-4 and its domains was assessed using the Solubility and Stability Screen (Hampton Research). Evaluation of the proteins’ behaviour in the presence of lactose was performed using serial dilution from a parent solution of 409.6 mM lactose. The behaviour of galectin-4 at different pHs and ionic strengths was assessed using the Solubility and Stability Screen 2™ (Hampton Research). Here, we initiated a 20 μl reaction containing 2.8 μM protein in 2.5 mM HEPES pH 7.2, 7.5 mM NaCl, 0.7 mM β-mercaptoethanol and 5X SYPRO® orange.
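As an illustration of how a melting temperature can be read from a thermofluor curve, the following minimal sketch locates the temperature at which the fluorescence rises most steeply. This first-derivative estimate is a simpler stand-in for curve fitting and is not the procedure used here, where melting temperatures were obtained with GraphPad Prism. The function names are hypothetical, and the Boltzmann sigmoid is included only to generate an idealized synthetic transition for checking the estimator.

```python
import math

def boltzmann(temp, f_min, f_max, tm, slope):
    """Idealized two-state unfolding curve (Boltzmann sigmoid)."""
    return f_min + (f_max - f_min) / (1.0 + math.exp((tm - temp) / slope))

def melting_temperature(temps, fluorescence):
    """Temperature at which dF/dT (central differences) is largest."""
    if len(temps) != len(fluorescence) or len(temps) < 3:
        raise ValueError("need two equally long series with at least 3 points")
    best_temp, best_slope = None, float("-inf")
    # End points are skipped because central differences need both neighbours
    for i in range(1, len(temps) - 1):
        slope = (fluorescence[i + 1] - fluorescence[i - 1]) / (temps[i + 1] - temps[i - 1])
        if slope > best_slope:
            best_temp, best_slope = temps[i], slope
    return best_temp
```

The resolution of this estimate is limited by the temperature step of the ramp, and real curves usually need the post-transition decay of the signal to be trimmed before analysis.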

Protein crystallisation, data collection and structural analysis

The galectin-4N and galectin-4C domains were crystallised as previously described23,24. Cryogenic X-ray diffraction data for galectin-4N and galectin-4C were collected at the Diamond Light Source (beamline I04-1) and the SSRL/SLAC National Accelerator Laboratory (beamline BL12-2), respectively. The data were indexed with MOSFLM33 and reduction was performed with Scala34 and Aimless35 in the CCP4 suite36. The structure of galectin-4N was determined to 1.48 Å resolution using the previous solution23 as a search model in Phaser37, implemented in the PHENIX suite38. The galectin-4C structure was determined to 1.78 Å resolution as described24. Model building and refinement were performed with Coot39 and phenix.refine38. The quality of the final models was validated by MolProbity40, where Ramachandran statistics indicate that 98.1% of residues lie in the favoured regions with no outliers for both galectin-4N and galectin-4C final models. Figures were prepared with PyMOL41. Diffraction data and refinement statistics are shown in Table 1. Structures were analysed with Coot39, PyMOL41 and PDBsum42.

Modelling of linker-peptide and full-length galectin-4 construction

A sequence of 33 amino acid residues (from 153 to 185, QPLRPQGPPMMPPYPGPGHCHQQLNSLPTMEGP in which the underlined region corresponds to the linker-peptide) from galectin-4 was submitted to the ROBETTA server43 for ab initio structure prediction. Geometry idealization was performed for all resulting models using the phenix.geometry_minimization program38 and results were evaluated based on model quality with the MolProbity server. Crystallographic structures of galectin-4N and galectin-4C together with the top two linker-peptide models were used to build six different structures for galectin-4 using MODELLER v9.1444. Two steps of optimization were implemented in the model-generating script: the Variable Target Function Method (VTFM) and molecular dynamics simulations (MD). Conjugate gradient and simulated annealing were implemented between the VTFM and MD routines. The resultant full-length models were also submitted to geometry idealization and analysed with the MolProbity server. As with the linker-peptide, the structures were compared and the best model was used for preliminary molecular dynamics simulations.
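The proline content of a segment such as this one can be checked with a minimal sketch like the following. The function and constant names are hypothetical. Note that the count is taken over the whole 33-residue segment submitted for modelling, so it need not equal the figure quoted for the linker-peptide alone, whose exact boundaries are not assumed here.

```python
def residue_fraction(sequence, residue="P"):
    """Fraction of positions in a one-letter sequence occupied by `residue`."""
    # Remove any whitespace left over from line wrapping before counting
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        raise ValueError("empty sequence")
    return cleaned.count(residue.upper()) / len(cleaned)

# Residues 153-185 of human galectin-4, as given in the text
SEGMENT_153_185 = "QPLRPQGPPMMPPYPGPGHCHQQLNSLPTMEGP"

proline_fraction = residue_fraction(SEGMENT_153_185)  # 10 prolines in 33 residues
```

The same function can be applied to the linker-peptides of other tandem-repeat galectins to compare their proline content on an equal footing.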

Molecular dynamics simulations

Molecular dynamics simulations were carried out using the GROMACS package45 along with the AMBER99sb-ILDN force field parameters46. The temperature and pressure were set to 310 K and 1 atm and controlled by the Nosé-Hoover47 and Parrinello-Rahman48 algorithms, respectively. The electrostatic interactions of each atom were treated with the Particle Mesh Ewald scheme and, like the non-bonded interactions (described by the Lennard-Jones potential), were limited to a cut-off radius of 1.0 nm. All water-bonded interactions were constrained by the SETTLE algorithm49, whereas LINCS50 was used to constrain the bonded interactions of the protein. The integration time step of the leap-frog algorithm was set to 2 fs.

Galectin-4 starting MD model

The homology model was enclosed and centred in a dodecahedron box within a distance of 1.2 nm from the faces and the system was explicitly solvated with the TIP3P water model51. The pH of each system was set indirectly to neutral according to the corresponding ionization states of the amino acid side chains of the protein52. Therefore, the addition of counter ions Na+ and Cl− was controlled to neutralize the protein charges and reach an ionic strength of 150 mM. In order to remove spurious molecular contacts, a steepest descent energy minimization was carried out, levelling the total potential energy of the system to a value smaller than 2000 kJ mol−1 nm−1. Then a restraining potential of 1000 kJ mol−1 nm−2 was applied to the xyz coordinates of the backbone atoms for 2 ns in order to adjust the solvation layer on the surface of the protein. Afterwards, we produced a 30 ns trajectory, which allowed us to thermalize the system as well as adapt the protein structure to an aqueous environment. From the resulting trajectory, we performed principal component analysis using a covariance matrix and obtained the set of eigenvectors in order to sample its conformational space. We then selected the first and second projections and fed the values to generate a trajectory on the average structure. The potential energy of the resulting model was minimized using the method of steepest descent.
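The principal component step described above can be sketched with NumPy as follows. This is a minimal illustration, assuming a trajectory that has already been superimposed on a reference and is stored as an array of shape (n_frames, n_atoms, 3); the function names are hypothetical and the sketch does not reproduce the GROMACS workflow used in the study.

```python
import numpy as np

def trajectory_pca(trajectory):
    """PCA of a pre-aligned trajectory with shape (n_frames, n_atoms, 3)."""
    traj = np.asarray(trajectory, dtype=float)
    n_frames = traj.shape[0]
    flat = traj.reshape(n_frames, -1)          # one row of 3N coordinates per frame
    mean = flat.mean(axis=0)                   # average structure
    centred = flat - mean
    cov = centred.T @ centred / (n_frames - 1) # 3N x 3N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # re-sort so the largest variance comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    projections = centred @ eigvecs            # coordinates of each frame along each mode
    return mean, eigvals, eigvecs, projections

def filter_on_components(mean, eigvecs, projections, n_components=2):
    """Rebuild frames from the average structure plus the first n components."""
    kept = projections[:, :n_components] @ eigvecs[:, :n_components].T
    return (mean + kept).reshape(len(projections), -1, 3)
```

Calling `filter_on_components` with `n_components=2` gives a trajectory that contains only the motion along the first two eigenvectors, superimposed on the average structure.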

Galectin-4 molecular dynamics: equilibrium and production

The final galectin-4 model from MD energy minimization was subjected to four 100 ns trajectories in the absence and presence of the lactose ligand (β-D-galactopyranosyl-D-glucose), using different seeds. The starting complex model was built by three-dimensional superimposition of each CRD from galectin-4 with the CRDs from galectin-8 (PDB ID 3VKL). The side chains of residues from the binding site of galectin-4 were positioned as in galectin-8 complexed with lactose. Next, lactose was transferred into the binding site of galectin-4. The ligand was built and parameterized with the Glycam53 server54. We performed the solvation, energy minimization and restriction steps in the same way as described above for the protein model. The resulting structure and topology files were converted to the GROMACS notation with acpype55 and the runs were analysed with GROMACS tools, Bio3D56, VMD57 and PyMOL41. Secondary structure was assessed with the PROMOTIF program58 implemented in PDBsum analysis59.
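Transferring a ligand by superimposition, as described above, amounts to finding the rigid-body transform that best maps the reference binding site onto the model and applying that same transform to the ligand. The sketch below uses the Kabsch algorithm with NumPy; it assumes matched lists of equivalent atoms have already been selected, the function names are hypothetical, and it is not necessarily the procedure or software used by the authors.

```python
import numpy as np

def kabsch_superpose(mobile, target):
    """Rotation R and translation t minimising the RMSD of mobile onto target.

    Both inputs have shape (n_atoms, 3) with atoms in matching order.
    Apply the result as: mobile @ R.T + t
    """
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    mobile_centre = mobile.mean(axis=0)
    target_centre = target.mean(axis=0)
    # Cross-covariance of the centred coordinate sets
    h = (mobile - mobile_centre).T @ (target - target_centre)
    u, _, vt = np.linalg.svd(h)
    # Guard against an improper rotation (reflection)
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = target_centre - rotation @ mobile_centre
    return rotation, translation

def transfer_ligand(reference_site, reference_ligand, model_site):
    """Place the reference ligand into the model using the site superposition."""
    rotation, translation = kabsch_superpose(reference_site, model_site)
    return np.asarray(reference_ligand, dtype=float) @ rotation.T + translation
```

In this setting `reference_site` would hold binding-site atoms of the lactose-bound template and `model_site` the equivalent atoms of the galectin-4 model, so the returned coordinates place lactose in the frame of the model.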

X-ray Scattering of full-length galectin-4

X-ray scattering measurements were performed at the G1 Station of the Cornell High Energy Synchrotron Source (CHESS) using 11.75 keV X-rays with a flux of 10^11 photons per second at a beam size of 250 × 480 μm2. Small-angle and wide-angle X-ray scattering (SAXS/WAXS) images were collected simultaneously on two photon-counting detectors (Pilatus 100K) at sample-to-detector distances of 1.47 m and 0.42 m, respectively. The SAXS detector covered a q-range of 0.014 to 0.336 Å−1 and the WAXS detector covered a q-range of 0.338 to 0.960 Å−1, where q is the momentum transfer, defined as q = (4π/λ) sin θ, where λ is the X-ray wavelength and 2θ is the scattering angle. Samples were passed continuously through an in vacuo X-ray sample cell60 via an in-line size exclusion column (GE Superdex 200 5/15GL) operated by a room-temperature GE Äkta Purifier using a flow rate of 0.075 ml min−1. The column was pre-equilibrated with the running buffer, consisting of 50 mM HEPES pH 7.2, 140 mM NaCl and 9 mM DTT (−lactose), or the same buffer with 30 mM lactose added (+lactose). Protein samples were injected into a 50 μl loop at concentrations of 22.6 mg ml−1 (+lactose) and 20 mg ml−1 (−lactose). Approximately 500 eight-second exposures were collected per sample. Images were integrated and normalized by the incident X-ray intensity as measured by an N2-filled ion chamber located after the beam-defining slits. Data were processed and analysed following established protocols61 using the ATSAS suite of programs62 and custom code written in MATLAB. Predicted SAXS profiles were calculated using CRYSOL63 with maximum order of harmonics equal to 35 and Fibonacci grid of order 18. The SAXS and WAXS regions were merged prior to pair distance distribution analysis in GNOM64. Ab initio shape reconstructions were performed in GASBOR65. Ten models were generated with 323 dummy residues and subsequently aligned and averaged in DAMAVER66. The final, most probable model had a normalized spatial discrepancy (NSD) of 1.07 with a standard deviation of 0.03.
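The definition of the momentum transfer given above can be made concrete with a short sketch that converts between scattering angle and q at a given photon energy. The function names are hypothetical; the only physical input beyond the text is the standard conversion λ (Å) = 12.398 / E (keV).

```python
import math

HC_KEV_ANGSTROM = 12.39841984  # Planck constant times speed of light, in keV·Å

def wavelength_angstrom(energy_kev):
    """X-ray wavelength in Å for a photon energy in keV."""
    return HC_KEV_ANGSTROM / energy_kev

def momentum_transfer(two_theta_rad, energy_kev):
    """q = (4π/λ) sin θ in Å^-1, where two_theta_rad is the full scattering angle 2θ."""
    wavelength = wavelength_angstrom(energy_kev)
    return 4.0 * math.pi / wavelength * math.sin(two_theta_rad / 2.0)

def scattering_angle(q, energy_kev):
    """Inverse relation: full scattering angle 2θ (radians) for a given q in Å^-1."""
    wavelength = wavelength_angstrom(energy_kev)
    return 2.0 * math.asin(q * wavelength / (4.0 * math.pi))
```

At 11.75 keV the wavelength is about 1.055 Å, so `scattering_angle(0.014, 11.75)` and `scattering_angle(0.960, 11.75)` give the angular limits that correspond to the combined SAXS/WAXS q-range reported here.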

Additional Information

Accession codes: Atomic coordinates and structure factors have been deposited in the Protein Data Bank under accession codes 4XZP (galectin-4N) and 5CBL (galectin-4C).

How to cite this article: Rustiguel, J. K. et al. Full-length model of the human galectin-4 and insights into dynamics of inter-domain communication. Sci. Rep. 6, 33633; doi: 10.1038/srep33633 (2016).