The heptad repeat region is a major selection target in MERS-CoV and related coronaviruses

Middle East respiratory syndrome coronavirus (MERS-CoV) originated in bats and spread to humans via zoonotic transmission from camels. We analyzed the evolution of the spike (S) gene in betacoronaviruses (betaCoVs) isolated from different mammals, in bat coronavirus populations, as well as in MERS-CoV strains from the current outbreak. Results indicated several positively selected sites located in the region comprising the two heptad repeats (HR1 and HR2) and their linker. Two sites (R652 and V1060) were positively selected in the betaCoVs phylogeny and correspond to mutations associated with expanded host range in other coronaviruses. During the most recent evolution of MERS-CoV, adaptive mutations in the HR1 (Q/R/H1020) arose in camels or in a previous host and spread to humans. We determined that different residues at position 1020 establish distinct inter- and intra-helical interactions and affect the stability of the six-helix bundle formed by the HRs. A similar effect on stability was observed for a nearby mutation (T1015N) that increases MERS-CoV infection efficiency in vitro. Data herein indicate that the heptad repeat region was a major target of adaptive evolution in MERS-CoV-related viruses; these results are relevant for the design of fusion inhibitor peptides with antiviral function.

camel to human transmission rather than vice versa 10 . Therefore, the most likely scenario envisages that a bat-derived MERS-CoV spread to humans via the zoonotic transmission from dromedary camels.
Coronaviruses use their spike (S) protein to bind a host receptor and to promote membrane fusion. The spike protein assembles as a trimer on the viral surface and belongs to the class I fusion protein family 11 . Class I fusion proteins are found in many other virus genera including retroviruses, orthomyxoviruses, paramyxoviruses, and filoviruses and share similar domain organization, as well as common functional properties 12 .
Host proteases cleave the CoV spike protein into two functionally distinct domains: the N-terminal region (usually referred to as the S1 subunit) contains the receptor binding domain (RBD), whereas the C-terminal portion (S2 subunit) includes the fusion peptide, two heptad repeats (HR1 and HR2), and the transmembrane (TM) domain 12 (see Fig. 1A). Following receptor binding, membrane fusion is mediated by a major conformational rearrangement that exposes the fusion peptide and results in the formation of a six-helix bundle (6HB) 13,14 . The core of the 6HB is a triple-stranded coiled coil formed by the HR1s of the three spike subunits forming the trimer; the HR2 elements pack within the grooves of the coiled coil in an antiparallel direction 13,14 .
Because of its central role in membrane fusion, a number of antiviral peptides that interfere with the 6HB formation have been developed as potential therapeutic compounds against HIV 15 , Ebola virus 16 , SARS-CoV 17,18 , and MERS-CoV 14 .
Although the RBD of spike proteins is generally considered the major determinant of host range, several reports have suggested that variation in the C-terminal portion of spike proteins, particularly in the HR1 and HR2, determine host range expansion 12 . Moreover, recent works indicated that MERS-CoV and Ty-BatCoV HKU4 bind DPP4 both of human and of bat origin [7][8][9] . In particular, although MERS-CoV binds human DPP4 with higher affinity than Ty-BatCoV HKU4, which shows a preference for the bat receptor, the RBDs of the two viruses engage human DPP4 via a similar binding mode 9 . These observations suggest that MERS-CoV and related viruses have the potential to shift host range with little adaptation of the RBD.
Motivated by the notion that evolutionary analyses can provide information on the molecular events that underlie host shifts and, more generally, host-pathogen interactions 19 , we investigated the evolutionary history of S proteins in MERS-CoV and related betaCoVs. Specifically, we aimed to determine whether natural selection drove the evolution of specific regions and sites that may contribute to variation in host range or replication efficiency. Thus, using different strategies, we analyzed MERS-CoV strains isolated from human and camels, as well as MERS-CoV-related viruses from other mammals. Data indicate the HR1 to HR2 region as a major target of adaptive evolution in these viruses.

Results
Positive selection shaped the evolution of clade c betaCoV spike protein. We first investigated whether positive selection drove the evolution of MERS-CoV-related coronavirus spike proteins. Previous phylogenetic analyses of S genes of viruses isolated from humans/camels (MERS-CoV), bats, and hedgehogs (Supplementary Table. S1) indicated that the S1 and S2 regions display different tree topologies (Fig. 1B), possibly as a result of recombination 6 . Because recombination can inflate estimates of positive selection 20 , we separately analyzed the two regions.
We pruned the S1 and S2 multiple sequence alignments (MSAs) from unreliably aligned codons and we screened them for evidence of additional recombination events using GARD (Genetic Algorithm Recombination Detection) 21 , which detected no breakpoint.
The saturation of substitution rates represents a major problem in the detection of positive selection among distantly related sequences. Computation of the nonsynonymous (dN) and synonymous (dS) substitution rates over whole phylogenies allows breaking of long branches, resulting in improved rate estimation. Thus, evidence of episodic positive selection was searched for by using the codeml branch-site test 22 , which is relatively insensitive to the saturation of substitution rates 23 . In the S1 and S2 regions, 3 and 4 branches yielded statistically significant evidence of positive selection under different codon frequency models (Fig. 1B, Table 1 and Supplementary Table S3). Positively selected sites along these branches were detected using the BEB (Bayes empirical Bayes) procedure and validated using the Mixed Effects Model of Evolution (MEME) 24 .
One positively selected site was found in the S1 region, 7 sites in the S2 subunit (Fig. 1A, Table 1). The R652 selected site (in S1) almost corresponds to two mutations that independently arose in the SARS-CoV spike gene as a result of in vitro adaptation of zoonotic strains to primate cells 25 (Fig. 1A). In S2, most selected sites are located in the HR1, HR2, and in the intervening linker. Remarkably, position 1060 is the almost exact counterpart of aminoacid changes that expand the host range or cell-type tropism of infectious bronchiolitis virus (IBV, L857F) and murine hepatitis virus (MHV, E1035D) ( Fig. 1A) 26,27 .
No positively selected site was found to be located in the RBD (Fig. 1A). Nevertheless, the pruning of unreliably aligned codons operated on the MSA left a minority of RBD sites available for analysis. We thus repeated the branch-site test on a subset of more closely related sequences (Fig. 1B). This procedure decreased divergence and pruning in the RBD and allowed analysis of most codons; even with this procedure, no positively selected site was detected (not shown).

Minor effect of positive selection for the S genes of Ty-BatCoV HKU4 and Pi-BatCoV HKU5. Previous analysis of Ty-BatCoV HKU4 and Pi-BatCoV HKU5 viruses isolated in Hong Kong
indicated that positive selection targeted the spike protein and particularly its S1 region 28 . Nevertheless, recombination was not accounted for in that analysis. We thus analyzed the sequence alignments of the Hong Kong isolates for the presence of recombination breakpoints using GARD. The algorithm detected 9 recombination breakpoints for the Pi-BatCoV HKU5 alignment (Fig. 1A) and none for Ty-BatCoV HKU4. The Ty-BatCoV HKU4 spike gene was therefore analyzed using the codeml site models, which test the hypothesis that a subset of codons evolve with dN/dS > 1. No evidence of positive selection was found, even using the relatively non-conservative model M7/model M8 comparison (Supplementary  Table S4). As for the Pi-BatCoV HKU5 spike gene, rampant recombination prevents application of a similar approach. We thus resorted to the simultaneous estimation of selection and recombination using omegaMap 29 . This analysis confirmed high recombination along the whole gene ( Supplementary Fig. S1) and detected a single positively selected site in the S1 region (Fig. 1C). The same site (codon 198, position relative to the MERS-CoV spike sequence) was also detected by MEME, which was run by incorporating the alternative phylogenies detected by GARD. Most of the previously reported selected sites 28 were not detected using other methods that account for recombination (Supplementary Table S5). We thus consider that robust inference of positive selection in Pi-BatCoV HKU5 spike genes can only be made for position 198, which lies outside the RBD (Fig. 1A).
Positive selection in the MERS-CoV heptad repeat 1. We next wished to determine whether positive selection occurred at the spike gene of MERS-CoV viruses circulating in the recent outbreak. A previous study suggested that positive selection drove the evolution of two codons in the MERS-CoV spike gene (positions 509 and 1020) 30 . Nevertheless, in that study only one method was used to infer selection and sequences isolated from camels were not included.
We thus retrieved 54 fully or almost fully spike sequences of MERS-CoV isolated from camels or humans (Supplementary Table S1). Alignments for the S1 and S2 regions were separately analyzed and screened for the presence of recombination. No breakpoint was detected and the codeml site models were applied. For the S2 region, two models of gene evolution that allow a class of codons to evolve with dN/dS > 1 (NSsite models M2a and M8) showed better fit to the data than the null models (NSsite models M1a and M7), strongly supporting the action of positive selection ( Table 2, Supplementary Table  S6). No evidence of selection was detected for the S1 portion ( Table 2, Supplementary Table S6). In S2, both BEB and MEME detected one selected site: position 1020 in the HR1 (Fig. 1A). Interestingly, three different residues are observed at this site in both camel-and human-derived viruses (Fig. 1A). MERS-CoV is thought to have spread from camels to humans; the presence of the three alternative residues in viruses isolated from camels suggests that adaptive evolution at this site occurred prior to the infection of humans.
Interestingly, the 1020 variant is in proximity to a mutation (T1015N) (Figs 1A and 2A) that arose during tissue-culture adaptation of MERS-CoV (strain EMC2012) and increases replication efficiency 31 . through its side chain, the Q1020 residue forms hydrogen bonds with D1024 and interacts with M1266 in HR2 ( Fig. 2A,B). Replacement of the glutamine residue with histidine or arginine (observed in the camel-and human-derived viruses) results in loss of side chain interactions with M1266 and variably affects hydrogen bonds with D1024 (Fig. 2B).
To gain further insight into the effect of adaptive evolution at position 1020, we performed a stability computational analysis after in silico mutagenesis. Q1020 was replaced with all other possible aminoacids: even if changes of different magnitude in Δ G were obtained using three different methods 32-34 , trends were very consistent (Fig. 2C). In particular, replacement with histidine or arginine residues resulted in mild destabilization (Fig. 2C). As a comparison, the same analysis was performed for position 1015. Replacement of the threonine residue with asparagine, which was previously associated with increased replication efficiency in vitro 31 , resulted in a similar level of destabilization as observed for Q1020H/R (Fig. 2C).

Discussion
We analyzed the evolution of the S protein in betaCoVs, in bat Ty-BatCoV HKU4 and Pi-BatCoV HKU5 viral populations, as well as in MERS-CoV isolates from the current outbreak. Different strategies were applied, as appropriate depending on divergence and recombination.
Results indicated that several adaptive changes are located in the S2 region, with fewer in the S1 domain and none of these within the RBD. It should be noted, though, that in all analyses we applied quite conservative approaches and we intersected two different methods to declare a site as positively selected. Whereas this approach was meant to limit the false positive rate it may have yielded some false negative results. In particular, the branch-site test we used to analyze the S1 and S2 regions of betaCoVs is robust to saturation issues and has a minimal false positive rate, but lacks power 23 . Moreover, due to the high divergence and the consequent need of alignment pruning, analysis of the RBD was performed on a shallower phylogeny compared to the other regions. This procedure is expected to reduce power, but is nonetheless necessary. In fact, alignment errors, together with unrecognized recombination, inflate estimates of positive selection and represent major sources of false positive results in evolutionary analyses 20,35 . Consistently, when we accounted for recombination in Pi-BatCoV HKU5 sequences most previously described selection signals disappeared, including those in the RBD 28 . Thus, whereas we cannot exclude that adaptive variants in the RBD of betaCoVs were missed by our approach, we conclude that the more recent evolution of MERS-CoV, Ty-BatCoV HKU4, and Pi-BatCoV HKU5 was not driven by positive selection in this domain. Conversely, as previously shown for SARS-CoV 25 , our data support a role for the S1 region separating the RBD and fusion peptide as a determinant of betaCoV host range expansion.
Analysis of both betaCoVs and MERS-CoV strains revealed evidence of positive selection in the S2 region. Most positively selected sites were found to be located either in the heptad repeats or in the intervening linker. Among these sites, position 1060 is particularly interesting, as it almost corresponds to substitutions that modify the host range and/or cell tropism in MHV and IBV 26,27 . Both viruses belong to the coronavirus genus. The Beaudette strain of IBV has been adapted to embryonated chicken eggs; following passages in culture, the strain was further adapted to infect Vero cells (from African Green monkey) and primary chicken kidney cells 26 . The L857F mutation was shown to represent a major determinant of the fusogenic activity in these cell types 26 Table 2. Likelihood ratio test statistics for models of variable selective pressure among sites in MERS-CoV isolates. 1 M1a is a nearly neutral model that assumes one dN/dS (ω ) class between 0 and 1, and one class with ω = 1; M2a (positive selection model) is the same as M1a plus an extra class of ω > 1. 2 M7 is a null model that assumes that 0 < ω < 1 is beta distributed among sites; M8 (positive selection model) is the same as M7 but also includes an extra category of sites with ω > 1. 3 2Δ lnL: twice the difference of the natural logs of the maximum likelihood of the models being compared. 4 Positions are relative to the MERS-CoV sequence (EMC/2012).
mutation was recovered after passages in mouse liver and was shown to contribute significantly to the hepatotropism and hepatic virulence of a previously attenuated strain (MHV-A59) 27 . Additional variants in the HR1 and fusion peptide of MHV strains cooperate with changes in the S1 region, resulting in a broadening of receptor usage (to heparan sulfate) and, consequently, an extension of the host range 36 . Finally, in the MHV-A51 strain the ability to bind human CAECAM receptors is strongly influenced by mutant residues located in the fusion peptide, HR1 and HR1/HR2 linker 37 . Similar observations have been reported for viruses that do not belong to the coronavirus genus, but that use a class I fusion protein.
For instance, one single mutation in the HR1 of a simian-human immunodeficiency virus (SHIV) strain (KB9) increases by two-to three-fold infection efficiency in cells expressing the marmoset cellular receptors 38 , whereas a nearby HR1 change in SIV contributes to macrophage tropism 39 . Overall, these findings pinpoint the relevance of changes in the HR1 and HR2 as modifiers of host range and cell-type tropism. The molecular mechanisms underlying the altered phenotype of HR1 and HR2 mutants remain to be determined in all these instances, although changes in conformational structure have been suggested as a possible explanation. Unfortunately, no coronavirus S protein fusion intermediate or pre-fusion state has been solved to date, hampering investigation of molecular interactions. We therefore analyzed the effect of variation at the 1020 position of MERS-CoV on the stability of the 6HB in the post-fusion conformation 14 . The presence of a Q1020 had previously been suggested to confer higher stability to the MERS-CoV 6HB compared to SARS-CoV 14 . Indeed, replacement of this residue results in loss of intraand inter-helical interactions. In line with these observations, three different methods used for stability analysis were concordant in showing that the alternative arginine and histidine residues at position 1020 result in a moderate and similar level of destabilization. Although the observed Δ Δ G is relatively small, it was calculated on the single monomer, and is expected to be multiplied in the trimer. The observation whereby mildly destabilizing variants are favored by selection may seem counterintuitive. Nonetheless, we show that a similar level of destabilization is observed for a mutation in MERS-Cov HR1 (T1015N) that increases infection efficiency, at least in vitro 31 . Mutagenesis of HR1 in retroviral type I fusion proteins has indicated that, whereas strong destabilization of the 6HB (as measured by circular dichroism) almost inevitably results in reduced infectivity, a minor stability decrease is not necessarily associated with defects in cell fusion and infection efficiency [40][41][42] . For instance, different aminoacid replacements at the same HR1 position in HIV-1 gp41 result in decreased stability, but unaffected or even increased infectivity 40 . In the case of EIAV (equine infectious anemia virus), destabilized HR1 mutants were found to display different infection phenotypes depending on temperature 42 . This observation may be extremely interesting in the context of MERS-CoV, as both bats and dromedary camels display adaptive heterothermy (i.e. sensible daily or season variation in body temperature).
Coronavirus spike proteins are highly exposed on the virus surface and represent major targets for antibody response 43 , raising the possibility that adaptive evolution of the S protein is driven by the host immune system. This hypothesis is difficult to address due to the paucity of information concerning the specific MERS-CoV epitopes recognized by human antibodies. Recently, different studies identified human antibodies againts MERS-CoV from non-immune human antibody libraries: all of them were directed against the RBD, suggesting that this region represents a major target for the host immune system 43 . Nevertheless, data on the humoral immune response to MERS-CoV in infected subjects are presently lacking, whereas such information is richer for SARS-CoV. Indeed, analysis of antibodies derived from a patient who recovered from SARS indicated that some of them recognize epitopes in the HR2 region 44 . Their neutralizing effect was ascribed to interference of the interaction between HR2 and HR1. Whether antibodies against the S2 region also arise in human subjects (or other mammalian hosts) infected with MERS-CoV and related betaCoVs remains to be determined; if this were the case, some of the selected sites we identified may be under selective pressure to evade recognition.
Finally, it is worth noting that HRs have been studied in different viruses because synthetic peptides interfering with 6HB formation are promising antiviral molecules [15][16][17][18] . This is also the case for MERS-CoV, and HR2-like peptides were recently shown to be effective in vitro 14 . These peptides were tested against a MERS-CoV strain carrying Q1020 and all include the interacting M1266 residue 14 . These antivirals may display decreased activity depending on the MERS-CoV strain and its aminoacid status at the selected 1020 position.

Materials and Methods
Sequences and alignments. Virus sequences were retrieved from the NCBI database and a list of accession numbers is provided as Supplementary Table S1. Sequences of Ty-BatCoV HKU4 and Pi-BatCoV HKU5 isolated in Hong Kong were derived from a previous work 28 .
Errors in the inferred multiple sequence alignment (MSA), which may be common when highly divergent sequences are analyzed, can inflate estimates of positive selection. We therefore used PRANK 45 for building the MSA and GUIDANCE 46 for filtering unreliably aligned codons (i.e. we masked codons with a score < 0.90), as suggested 35 . Detection of recombination and positive selection. To detect positive selection at the S gene of clade c betaCoVs we applied the branch-site test from the PAML suite 22 . The test compares a model (MA) that allows positive selection on one or more lineages (foreground lineages) with a model (MA1) that does not allow such positive selection. Twice the difference of likelihood for the two models (Δ lnL) is then compared to a χ 2 distribution with one degree of freedom 22 . Specifically, the internal branches of previously reconstructed 6 Bayesian phylogenies of the S1 and S2 regions were set as the foreground lineages in independent tests. A false discovery rate (FDR) correction was applied to account for multiple hypothesis testing (i.e. we corrected for the number of tested branches), as suggested 47 .
Positively selected sites were identified through the BEB analysis (with a p value cutoff of 0.90), which calculates the posterior probability that each site belongs to the site class of positive selection on the foreground branch(es). Sites were validated using MEME (with the default cutoff of 0.1), which allows the distribution of dN/dS (also referred to as ω ) to vary from site to site and from branch to branch at a site, therefore allowing the detection of episodic positive selection 24 .
The site models implemented in PAML were applied -independently-for the analysis of HKU4 and MERS-CoV sequences, which display very limited divergence and do not suffer from saturation problems. To detect selection, site models that allow (M2a, M8) or disallow (M1a, M7) a class of sites to evolve with ω > 1 were fitted to the data 48 . Trees were generated by maximum-likelihood using the PhyML program 49 with a GTR model of nucleotide substitution and γ distributed rates. Positively selected sites were identified using the BEB analysis (from model M8) 50 . Again, sites were validated using MEME.
To assure consistency, all models were run using the F3 × 4 and the F61 codon frequency models. MSAs were screened for the presence of recombination using GARD. Recombination breakpoints were considered significant if the HK (Kishino-Hasegawa) p value was < 0.01.
Simultaneous inference of selection and recombination for analysis of positive selection was performed using omegaMap 29 , a program for detecting natural selection and recombination based on a model of population genetics and molecular evolution. The model uses a population genetic approximation to the coalescent with recombination. This latter is estimated from patterns of linkage disequilibrium assuming that recombination events occur only between codons and not within them. OmegaMap applies reversible-jump Markov Chain Monte Carlo (MCMC) to perform Bayesian inferences of both ω and the recombination parameter ρ , allowing both parameters to vary along the sequence. An average block length of 10 and 30 codons was used to estimate ω and ρ , respectively. To determine the influence of the choice of the priors on the posteriors, analyses were repeated with alternative sets of priors (Supplementary Table S2). Three independent omegaMap runs, each with 500,000 iterations and a 50,000 burn-in iteration, were compared to assess convergence and merged to obtain the posterior probability estimate.
The REL (random effects likelihood) analysis models variation in both dN and dS across sites according to a predefined distribution with different rate classes; positively selected sites are identified through an empirical Bayes method 51 . The default criterion of a Bayes Factor > 50 was used to identify positively selected sites.
For GARD, MEME, and REL the nucleotide substitution models were chosen using a Genetic Algorithm implemented in the dataMonkey suite 52 . All analyses were performed through the DataMonkey server 53 (http://www.datamonkey.org).
In silico analysis of HR1 variants. The crystal structure of MERS-CoV HR1 and HR2 region was obtained from PDB (PDB ID: 4MOD).
Histidine or arginine residues were introduced at positions 1020 and suitable rotamers were sampled through the rapid torsion scan utility in Maestro (Maestro. 9.1; Schrodinger). Intraprotein interactions were calculated with PIC (protein interaction calculator) 54 .
Because programs that calculate stability changes achieve only moderate accuracy 55 , we used three different methods to assure reliability. These approaches are based on different principles. Specifically, PoPMuSiC uses statistical potentials and takes into account amino acid volume variation upon mutation 33 ; FoldX uses an empirical force field and evaluates the energetic effect of point mutations and the interactions contributing to the stability of proteins 32 . Finally, I-Mutant 2.0 is based on a neural network approach to evaluate the free energy change after a single point mutation with incorporation of information on the three-dimensional structure of the protein 34 .
In FoldX and I-Mutant the Δ Δ G values are calculated as follows: ΔΔG = ΔG mutant − ΔG wild-type . In FoldX and I-Mutant Δ Δ G values > 0 kcal/mol indicate mutations that decrease protein stability, whereas in PoPMuSiC Δ Δ G values > 0 kcal/mol are mark of mutations increasing protein stability. Therefore, PoPMuSiC Δ Δ G values were multiplied by − 1 to obtain homogeneous results.
In the analysis carried out with FoldX 3D, the three-dimensional structure of the protein was repaired using the < RepairPDB> command. Mutations were introduced using the < BuildModel> command with < numberOfRuns> set to 5 and < VdWdesign> set to 0. Temperature (298K), ionic strength (0.05 M) and pH (7) were set to default values and the force-field was used to predict the water molecules on the protein surface.