Structural insight for chain selection and stagger control in collagen

Collagen plays a fundamental role in all known metazoans. In collagens three polypeptides form a unique triple-helical structure with a one-residue stagger to fit every third glycine residue in the inner core without disturbing the poly-proline type II helical conformation of each chain. There are homo- and hetero-trimeric types of collagen consisting of one, two or three distinct chains. Thus there must be mechanisms that control composition and stagger during collagen folding. Here, we uncover the structural basis for both chain selection and stagger formation of a collagen molecule. Three distinct chains (α1, α2 and α3) of the non-collagenous domain 2 (NC2) of type IX collagen are assembled to guide triple-helical sequences in the leading, middle and trailing positions. This unique domain opens the door for generating any fragment of collagen in its native composition and stagger.

coiled coil domain observed in lung surfactant protein D 21. In each class, specific regions within the domain are responsible for the formation of homo- or hetero-trimers 12,20. These domains also serve as the nucleus for the zipper-like folding of the triple-helical domain from the C- to the N-terminus of a molecule. However, how the stagger of the triple helix is determined is not clear from the isolated structures of the collagen trimerization domains. Until now, no structural information was available on how the triple-helical domain is linked to the collagen trimerization domain or whether the triple-helix stagger is determined or influenced by it.
There are other collagens that use different types of trimerization domains. Such domains vary greatly in size, position and structure; the smallest known domain, of only ~35 residues, was found in type IX collagen 22. This domain, NC2 (the second non-collagenous domain, Fig. 1A), was demonstrated to be responsible for stagger control in the adjacent triple helix 23. Here we report the structural basis of this control.

Results
Previously, to test whether the stagger control resides in the NC2 domain, we used short native sequences of type I collagen, which is a hetero-trimer of two α1 chains and one α2 chain. Two host-guest collagen peptides (GPP)4-(GXY)4-(GPP)3, where the (GXY)4 sequences are from the α1 and α2 chains of human type I collagen, were recombinantly linked to chains A, B and C of the NC2 domain (corresponding to α1, α2 and α3 in collagen nomenclature, which is omitted here to avoid confusion) in the following combinations: α1A α1B α1C (111 for short), α1A α1B α2C (112), α1A α2B α1C (121), α2A α1B α1C (211) and α2A α2B α2C (222) (for details see Fig. 1B). The collagenous portions of type I collagen in these complexes formed stable triple helices but differed in thermal stability and binding affinity to the von Willebrand factor A3 domain 23.
All five constructs were screened for crystallization, but only three of them yielded crystals of sufficient quality. Crystal structures for 111, 211 and 121 were independently solved by MAD phasing from selenomethionine derivatives (hereafter designated 111sm, 211sm and 121sm) to 2.25 Å, 2.10 Å and 1.6 Å, respectively (Fig. 2, Table 1). In addition, the structure of 121 with regular methionines (hereafter designated 121nat) was solved by molecular replacement to 1.9 Å. Whereas the crystal packing of 111sm and 211sm is similar (two trimers per asymmetric unit), it differs from that of 121nat and 121sm (one trimer per asymmetric unit) (Table 1). In total, we obtained six crystal models of the trimer of the NC2 domain and the adjacent triple helix (Fig. 2).
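The count of six models follows directly from the number of trimers per asymmetric unit in each crystal form. A minimal sketch of that tally, using only the values stated in the text (the dictionary and function names are illustrative, not taken from the source):

```python
# Trimers per asymmetric unit for each crystal form, as stated in the text.
# The names below are illustrative only.
trimers_per_asymmetric_unit = {
    "111sm": 2,
    "211sm": 2,
    "121sm": 1,
    "121nat": 1,
}


def total_models(per_crystal):
    """Sum the independent trimer models over all crystal forms."""
    return sum(per_crystal.values())


total = total_models(trimers_per_asymmetric_unit)
print(total)
```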
The structure of the NC2 domain of the hetero-trimeric type IX collagen. In accord with the secondary structure prediction 22,24, the NC2 domain assumes a predominantly α-helical conformation. Three unique chains form a parallel, right-handed α-helical bundle. Whereas the α2 chain contains a single α-helix, α1 and α3 have a short kink and a bend, respectively (Figs 3B and 4). As predicted 22,23, a disulfide bond connects α1 and α3. An overall superimposition of 121nat and 121sm (r.m.s.d. of 0.35 Å) confirmed the identity of the two structures. Overall superimposition of all trimers showed some drastic deviations (Fig. 2A). The most deviated pair is the first trimer (chains A, B, C in the asymmetric unit) of 121sm versus the second trimer (chains D, E, F) of 111sm    2B). Most of the deviations observed in the overall superimposition of trimers are attributed to distal flexibility of the triple-helical fragments caused by crystal packing (Fig. 2A vs B). It was suggested that many collagens contain α-helical coiled coil domains that might help in trimerization and stagger formation 25. The NC2 domain of type IX collagen was counted among such domains, but discontinuities in its heptad repeat pattern (a characteristic feature of the coiled coil) were pointed out 25. Although the crystal structure of the NC2 domain shows a high content of α-helices in a roughly parallel organization, it does not resemble a coiled coil because of multiple violations of coiled-coil geometry (Fig. 2B). The NC2 domain forms a right-handed bundle of helices, as opposed to the left-handed superhelix of classical coiled coils, e.g. in lung surfactant protein D 21. Moreover, α-helical coiled coils are normally blunt-ended and do not embody the stagger needed to accommodate the triple helix.
Nevertheless, the inter-chain interface is stabilized by numerous hydrophobic interactions similar, but not identical, to those observed in coiled coils. In addition, a set of hydrogen bonds and ionic interactions contributes to the specificity and structural integrity of the trimer (Fig. 4).

Structure of the triple-helical domain.
The overall structure of the triple-helical sequences is that of a typical collagen triple helix. Despite the variations in composition (homo-trimeric for 111) and stagger (121 or 211), which might not represent the native stagger, the regions derived from type I collagen are well structured. A set of unique side chain interactions is observed for each composition (Fig. 3A). The most important observation is that the triple-helical chain stagger is entirely determined by the NC2 domain (Fig. 5A). Namely, the triple-helical chain linked to chain B of the NC2 domain is always in the leading position, the one linked to chain A is in the middle, and the one linked to chain C is in the trailing position. We propose a BAC-translation rule: chain B is leading, A is middle, C is trailing. Under this rule, construct 121 (or α1A α2B α1C) translates into the staggering order α2α1α1, whereas 211 translates into α1α2α1.
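The BAC-translation rule is a fixed reordering of the construct designation, so it can be written as a small lookup. The sketch below is illustrative only: the function and constant names are assumptions, and the only inputs are the three-digit construct codes defined in the text (digits give the type I collagen chain fused to NC2 chains A, B and C, in that order).

```python
# Illustrative sketch of the BAC-translation rule; names are hypothetical.
NC2_CHAINS = ("A", "B", "C")      # order of digits in a construct code
STAGGER_ORDER = ("B", "A", "C")   # leading, middle, trailing


def stagger_from_construct(code):
    """Return (leading, middle, trailing) guest chains for a code such as '121'."""
    if len(code) != 3 or not code.isdigit():
        raise ValueError("expected a three-digit construct code such as '121'")
    # Map each NC2 chain to the guest chain number fused to it.
    guest_on = dict(zip(NC2_CHAINS, code))
    # Read the guests out in B, A, C order.
    return tuple("α" + guest_on[chain] for chain in STAGGER_ORDER)


for construct in ("111", "112", "121", "211", "222"):
    print(construct, "".join(stagger_from_construct(construct)))
```

For example, `stagger_from_construct("121")` gives the order α2, α1, α1, and `"211"` gives α1, α2, α1, matching the two translations stated above.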
Interface between the triple helix and trimerization domains. The right-handed bundle of α-helices of the NC2 domain (not a very common structure) continues congruently into the right-handed superhelix of the collagenous part. Visual analysis of the interface region suggests a broadening of the triple-helical end before the NC2 domain starts. To analyze and quantify the opening of the triple helix and the transition into the NC2 domain, we identified and plotted the ladder of recurrent N-H(G)…O=C(X) hydrogen bonds (the characteristic collagen bonds between a glycine in one chain and the residue in the X position of an adjacent chain) that form within the triple helix and the beginning of the NC2 domain (Fig. 5B). Remarkably, no opening was identified within the host-guest collagen peptide (GPP)4-(GXY)4-(GPP)3 sequences linked to the NC2 domain. Moreover, the first glycine residues originally assigned to the beginning of the NC2 domain are still part of the triple helix, without any sign of disturbance. Even an alanine residue (Ala39, +3 position from the first glycine) in the leading chain (chain B) still forms a reliable hydrogen bond with lysine 37 in the trailing chain (chain C) (Fig. 5B).
The actual opening of the "triple helix" occurs only within the non-collagenous sequence of the NC2 domain (Fig. 6), where residues such as Ala39, Thr40 and His43 of chain B, Pro39 of chain A and Ala39 of chain C are the capping residues of the hydrophobic core of the NC2 domain (starting from Ile44). In other words, these capping residues constitute a pyramid that connects the "zero" hydrophobic core of the triple helix (formed by glycines) to the real hydrophobic core of the NC2 domain. Interestingly, whereas His43 of chain B is involved in intra-chain (Thr40, chain B) and inter-chain (Ala39, chain C) hydrogen bonding within the capping core (Fig. 6), the solvent-exposed His43 of chain A interfaces with Asp41 of chain B (Fig. 4), further emphasizing the asymmetric nature of collagen.
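One way to picture how such a hydrogen-bond ladder could be tabulated is a distance-based search for inter-chain contacts between glycine backbone nitrogens and backbone carbonyl oxygens. The sketch below is not the procedure used in this study: the 3.5 Å donor-acceptor cutoff, the `Atom` record, the function name and the toy coordinates are all assumptions for illustration, and the search does not check that the acceptor sits in the X position of a triplet.

```python
import math
from typing import NamedTuple, Tuple


class Atom(NamedTuple):
    """Minimal atom record (hypothetical; not a real PDB parser type)."""
    chain: str
    resnum: int
    resname: str
    name: str
    xyz: Tuple[float, float, float]


def gly_n_to_o_contacts(atoms, cutoff=3.5):
    """List inter-chain Gly N ... O contacts within an assumed cutoff (in Å)."""
    donors = [a for a in atoms if a.resname == "GLY" and a.name == "N"]
    acceptors = [a for a in atoms if a.name == "O"]
    hits = []
    for donor in donors:
        for acceptor in acceptors:
            if acceptor.chain == donor.chain:
                continue  # only bonds between different chains are of interest
            distance = math.dist(donor.xyz, acceptor.xyz)
            if distance <= cutoff:
                hits.append((donor.chain, donor.resnum,
                             acceptor.chain, acceptor.resnum,
                             round(distance, 2)))
    # Sort by donor residue number so the contacts read as a ladder.
    return sorted(hits, key=lambda hit: hit[1])


# Toy coordinates, invented purely to exercise the function.
toy_atoms = [
    Atom("B", 10, "GLY", "N", (0.0, 0.0, 0.0)),
    Atom("C", 11, "PRO", "O", (2.9, 0.0, 0.0)),   # other chain, within cutoff
    Atom("B", 9, "PRO", "O", (1.0, 0.0, 0.0)),    # same chain, ignored
    Atom("A", 12, "PRO", "O", (10.0, 0.0, 0.0)),  # other chain, too far
]
ladder = gly_n_to_o_contacts(toy_atoms)
print(ladder)
```

A gap in such a ladder along the sequence would mark where the triple helix opens; a stricter treatment would also test the N-H…O angle and the triplet register of the acceptor.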

Discussion
Collagen is the most abundant protein in the human body, fulfilling structural and biologically active roles in multiple physiological processes as well as in pathology. Numerous heritable and acquired diseases are associated with collagen. Atherosclerosis, fibrosis, osteoarthritis, rheumatoid arthritis, diabetes and cancer are just a few of the diseases in which collagen function is adversely affected. The 28 collagen types are formed from polypeptides encoded by 42 distinct genes, frequently in several isoforms. In addition, more than 20 other proteins adopt collagen-like structures, such as collectins, ficolins and scavenger receptors 26. Our knowledge of the structural and functional organization of this universe is very fragmented and limited to just a few homo-trimeric collagenous fragments and some non-collagenous domains. The only example in which a triple helix has been crystallized with an adjacent non-triple-helical domain is the structure of the (GPP)10-foldon construct 27. Foldon, a trimeric nucleation domain for a classical coiled coil, leads to a severe kink and disturbance of the triple helix attached to it. Until now there was no robust method to produce fragments of hetero-trimeric collagenous regions; this has significantly limited the repertoire of reagents available to study the role of collagens in development, remodeling and cell signaling.
As revealed by the structural analysis of the host-guest system reported here, such a method is now available. A collagenous sequence connected to the α2 chain of the NC2 domain will take the leading position, whereas collagenous sequences linked to the α1 and α3 chains will be in the middle and trailing positions, respectively. To avoid confusion, we suggest labeling the α1, α2 and α3 chains of the NC2 domain as chains A, B and C. Connecting collagenous sequences to the B, A and C chains will place them in the leading, middle and trailing positions, respectively (the BAC translation rule). The small size of the NC2 domain and the ability to recombinantly express individual chains in bacteria and later assemble them in vitro make this system easy to adopt in any laboratory with basic molecular biology techniques. If needed, the expression system can be transferred to eukaryotic cells to obtain certain post-translational modifications of proline and lysine residues. Moreover, peptide synthesis with specifically modified residues is still feasible for chains of this size.
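Read as a design recipe, the rule answers the inverse question: given a desired staggering order, which NC2 chain should each collagenous sequence be fused to? A minimal sketch under that reading (the function names and the string labels for the guest sequences are hypothetical):

```python
# Illustrative inverse of the BAC rule; names and labels are hypothetical.
def assign_fusions(leading, middle, trailing):
    """Map each NC2 chain (A, B, C) to the guest sequence to fuse to it."""
    return {"A": middle, "B": leading, "C": trailing}


def construct_code(leading, middle, trailing):
    """Build a three-digit designation (digits in A, B, C order) from labels like 'α1'."""
    fusions = assign_fusions(leading, middle, trailing)
    return "".join(fusions[chain].lstrip("α") for chain in ("A", "B", "C"))


# Desired stagger α1 (leading), α1 (middle), α2 (trailing).
plan = assign_fusions("α1", "α1", "α2")
print(plan)
print(construct_code("α1", "α1", "α2"))
```

With the labels above, the plan places α1 on chains A and B and α2 on chain C, which corresponds to the 112 designation used for the host-guest constructs.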
Crystal structures of the 111, 121 and 211 constructs demonstrated that the staggering order can be manipulated, at least for short native collagenous sequences, meaning that such sequences adapt readily to various non-native conformations. Nevertheless, these alternative conformations differed in thermal stability and affinity to a ligand 23. Namely, we have shown previously that the 112 construct (with the α1α1α2 staggering order of the triple-helical portion of type I collagen) had the highest binding affinity to the von Willebrand factor A3 domain and the highest thermal stability of the constructs tested. If high binding affinity and high thermal stability are indicative of the native state, then the stagger of type I collagen has an α1 chain in the leading position, an α1 chain in the middle position and the α2 chain in the trailing position. Further experiments with other fragments of type I collagen, as well as with other hetero-trimeric types of collagen, are anticipated to clarify this general problem.
At least for type IX collagen, we can now conclude that the stagger of the central collagenous domain (COL2) is α2α1α3 and that it is determined by the NC2 domain. The staggers and staggering mechanisms of other collagenous domains in type IX and other collagens remain to be elucidated. The most intriguing structural studies would include the junctional region between a triple-helical and a trimerization domain in hetero-trimeric collagens such as types I and IV, as well as in the hetero-trimeric collagen-related complement protein C1q.
In summary, these data detail the structural organization of the triple-helix-to-trimerization domain interface of type IX collagen and the mechanism of staggering. The current constructs provide a straightforward tool to produce any collagen fragments of interest with a controlled composition and stagger.

Methods Summary
All constructs were expressed, assembled and purified as described 23. Only polypeptides containing the α1 chain of the NC2 domain (chain A) were labeled with selenomethionine for phasing, using the methionine-deficient E. coli strain B834 (DE3) and a medium composed of SelenoMet Medium Base and SelenoMet Nutrient Mix (Athena Enzyme Systems).
The complexes were crystallized by vapor diffusion under the following conditions: 111sm: 0.1 M BisTris pH 6.4, 14% PEG MME 5,000 + 20% glycerol (cryo); 121nat and 121sm: 0.1 M HEPES pH 7.5, 50 mM Na-Acetate, 17% PEG 3,350 + 20% glycerol (cryo); 211sm: 0.1 M BisTris pH 6.0, 16% PEG MME 5,000 + 20% glycerol (cryo). Diffraction data were collected at the Advanced Light Source beamline 4.2.2. The diffraction images were indexed, integrated and scaled using HKL2000 28. Selenomethionine crystals were used for three-wavelength MAD data collection. The program PHENIX 29 was used for the determination of Se atom positions, phasing, density modification and automatic building of partial models. Iterative cycles of model extension/correction and refinement were performed using the programs COOT 30 and PHENIX 29, respectively. The model from the 121sm selenomethionine crystal was used directly for refinement of the native structure 121nat.
Crystal diffraction data, phasing and refinement statistics are presented in Table 1.