Genes Affecting Vocal and Facial Anatomy Went Through Extensive Regulatory Divergence in Modern Humans

Regulatory changes are broadly accepted as key drivers of phenotypic divergence. However, identifying regulatory changes that underlie human-specific traits has proven very challenging. Here, we use 63 DNA methylation maps of ancient and present-day humans, as well as of six chimpanzees, to detect differentially methylated regions that emerged in modern humans after the split from Neanderthals and Denisovans. We show that genes affecting the face and vocal tract went through particularly extensive methylation changes. Specifically, we identify widespread hypermethylation in a network of face- and voice-affecting genes (SOX9, ACAN, COL2A1, NFIX and XYLT1). We propose that these repression patterns appeared after the split from Neanderthals and Denisovans, and that they might have played a key role in shaping the modern human face and vocal tract.


71
The advent of high-coverage ancient genomes of archaic humans (Neanderthal and Denisovan) 72 introduced the possibility to identify the genetic basis of some unique modern human traits 1 . A 73 common approach is to carry out sequence comparisons and detect non-neutral sequence 74 changes. However, out of ~30,000 substitutions and indels that reached fixation in modern 75 humans, less than 100 directly alter amino acid sequence 1 , and as of today, our ability to estimate 76 the biological effects of the remaining ~30,000 noncoding changes is very restricted. Whereas 77 many of them are probably nearly neutral, many others may affect gene function, especially 78 those in regulatory regions such as promoters and enhancers. Such regulatory changes may have 79 a sizeable impact on human evolution, as alterations in gene regulation are thought to underlie 80 much of the phenotypic variation between closely related groups 2 . Because of the limited ability 81 to interpret noncoding variants, direct examination of regulatory layers such as DNA methylation 82 has the potential to enhance our understanding of the evolutionary origin of human-specific traits 83 far beyond what can be achieved using sequence comparison alone 3 . 84 In order to gain insight into the regulatory changes that underlie human evolution, we previously 85 developed a method to reconstruct DNA methylation maps of ancient genomes based on analysis 86 of patterns of damage to ancient DNA 4 . We used this method to reconstruct the methylomes of a 87 Neanderthal and a Denisovan, which were then compared to a partial methylation map of a 88 present-day osteoblast cell line. However, the ability to identify differentially methylated regions 89 (DMRs) between the human groups was constrained by the incomplete reference map (providing 90 methylation information for ~10% of CpG sites), differences in outputs of sequencing platforms, 91 lack of an outgroup, and a restricted set of skeletal samples (see Methods). 92 8 To acquire a more precise understanding of the possible functional consequences of these 161 DMGs, we used Gene ORGANizer, which links human genes to the organs they phenotypically 162 affect 11 . Unlike tools that use GO terms or RNA expression data, Gene ORGANizer is based 163 entirely on curated gene-disease and gene-phenotype associations from monogenic diseases. It 164 relies on direct phenotypic observations in human patients whose conditions result from known 165 gene perturbations. Using Gene ORGANizer, we found 11 organs that are over-represented 166 within the 588 AMH-derived DMGs, eight of which are skeletal parts that can be divided into 167 three regions: the face, larynx (voice box), and pelvis ( Fig. 2c, Extended Data Table 4). The 168 strongest enrichment was observed in the laryngeal region (x2.11 and x1.68, FDR = 0.017 and 169 0.048, for the vocal folds (vocal cords) and larynx, respectively), followed by facial and pelvic 170 structures, including the teeth, forehead, jaws, and pelvis. Interestingly, the face and pelvis are 171 considered the most morphologically divergent regions between Neanderthals and AMHs 12 and 172 our results reflect this divergence through gene regulation changes. To gain orthogonal evidence 173 for the enrichment of the larynx and face within these AMH-derived DMGs, we carried out a 174 number of additional analyses: First, we analyzed gene expression patterns and found that the 175 supralaryngeal vocal tract (the pharyngeal, oral, and nasal cavities, where sound is filtered to 176 specific frequencies) is the most enriched body part (1.7x and 1.6x, FDR = 5.6 x 10 -6 and FDR = 177 7.3 x 10 -7 , for the pharynx and larynx, respectively, Extended Data Table 3). Second, 44 of the 178 AMH-derived DMRs overlap previously reported putative enhancers of human craniofacial 179 developmental genes (5.1x compared to expected, P < 10 -4 , permutation test) 13,14 . Third, Palate 180 development is the third most enriched GO term among AMH-derived DMGs (Extended Data 181 Table 3). Fourth, DMGs significantly overlap genes associated with craniofacial features in the 182 13 and covers almost the entire COL2A1 gene, from its 1 st intron to its 54 th exon and 3'UTR 276 region); and (d) an upstream lincRNA (LINC02097). Notably, regions (a), (b), and (d) overlap 277 the longest DMR on the AMH-derived DMR list, spanning 35,910 bp (DMR #11, Fig. 4). 278 Additionally, a more distant putative enhancer, located 345 kb upstream of SOX9, was shown to 279 bear strong active histone modification marks in chimpanzee craniofacial progenitor cells; 280 whereas, in humans these marks are almost absent (~10x lower than chimpanzee, suggesting 281 down-regulation, Fig. 4b) 13 . Importantly, human and chimpanzee non-skeletal tissues (i.e., brain 282 and blood) exhibit very similar methylation patterns in these genes, suggesting the DMRs are 283 skeleton-specific. Finally, the amino acid sequence coded by each of these genes is identical 284 across the hominin groups 1 , suggesting that the observed changes are purely regulatory. 285 Together, these observations support the idea that SOX9 became down-regulated in AMH 286 skeletal tissues, likely followed by down-regulation of its targets: ACAN and COL2A1. 287 XYLT1, the 4 th highest skeleton-related DMG, is an enzyme involved in the synthesis 288 of glycosaminoglycan. Loss-of-function mutations, hypermethylation of the gene and its 289 consequent reduced expression underlie the Desbuquois dysplasia skeletal syndrome, which was 290 shown to affect the cartilaginous structure of the larynx, and drive a retraction of the face 29,30 . 291 Very little is known about XYLT1 regulation, but interestingly, in zebrafish it was shown to be 292 bound by SOX9 [ 31 ]. 293 To quantitatively investigate the potential phenotypic consequences of these DMGs, we tested 294 what fraction of their known phenotypes are also known as traits that differ between modern and 295 archaic humans. We found that four of the top five most differentially methylated genes (XYLT1, 296 NFIX, ACAN, and COL2A1) are in the top 100 genes with the highest fraction of divergent traits 297 between Neanderthals and AMHs. Remarkably, COL2A1, the most divergent gene in its 298 14 methylation patterns, is also the most divergent in its phenotypes: no other gene in the genome is 299 associated with as many divergent traits between modern humans and Neanderthals (63 traits, 300 Extended Data Table 7, see Methods). This suggests that these extensive methylation changes 301 are possibly linked to phenotypic divergence between archaic and AMHs. 302

NFIX methylation patterns suggest downregulation in AMHs 303
In order to investigate how methylation changes affect expression levels, we scanned the DMRs 304 to identify those whose methylation levels are strongly correlated with expression across 22 305 human tissues 32 . We found 90 such AMH-derived DMRs (FDR < 0.05, Extended Data Table 2). 306 DMRs in voice-affecting genes are significantly more likely to be correlated with expression 307 compared to other DMRs (2.05x, P = 6.65 x 10 -4 , hypergeometric test). Particularly noteworthy 308 is NFIX, one of the most derived genes in AMHs (ranked 5 th among DMGs affecting the 309 skeleton, Fig. 3a,b). NFIX contains two DMRs (#24 and #167, Fig. 5a), and in both, methylation 310 levels are tightly linked with expression (correlation of 81.7% and 73.8%, FDR = 3.5 x 10 -6 and 311 8.6 x 10 -5 , respectively, Fig. 5b). In fact, NFIX is one of the top ten DMGs with the most 312 significant correlation between methylation and expression in human. The association between 313 NFIX methylation and expression was also shown previously across several mouse tissues 33,34 . 314 To further examine this, we investigated a dataset of DNMT3A-induced methylation of human 315 MCF-7 cells. Forced induction of methylation in this study was sufficient to repress NFIX 316 expression by over 50%, placing NFIX as one of the genes whose expression is most affected by 317 hypermethylation 35 (ranked in the 98 th percentile, FDR = 1.28 x 10 -6 ). We further validated the 318 hypermethylation of NFIX across the skeleton by comparing four human cranial samples to four 319 chimpanzee cranial samples through bisulfite-PCR (P = 0.01, Extended Data Fig. 3, Extended 320 Data Table 1, see Methods). Together, these findings suggest that the observed hypermethylation 321 of NFIX in AMHs reflects down-regulation that emerged along the AMH lineage. Indeed, we 322 found that NFIX, as well as SOX9, ACAN, COL2A1, and XYLT1 are hypermethylated in human 323 femora compared to baboon 36 (P = 1.4 x 10 -5 and P = 8.1 x 10 -9 , compared to baboon femora 324 bone and cartilage, respectively, t-test). Also, all five genes show significantly reduced 325 expression in humans compared to mice (Fig. 5c). Taken together, these observations suggest 326 that DNA methylation is a primary mechanism in the regulation of NFIX, and serves as a good 327 proxy for its expression. Interestingly, NFI proteins were shown to bind the upstream enhancers 328 of SOX9 [ 37 ], hence suggesting a possible mechanism to the simultaneous changes in the five top 329 genes we report. 330

331
We have shown here that genes affecting vocal and facial anatomy went through extensive 332 methylation changes in recent AMH evolution, after the split from Neanderthals and Denisovans. 333 The extensive methylation changes are manifested both in the number of divergent genes and in 334 the extent of changes within each gene. Notably, the DMRs we report capture substantial 335 methylation changes (over 50% between at least one pair of human groups), span thousands or 336 tens of thousands of bases, and cover promoters and enhancers. Many of these methylation 337 changes are tightly linked with changes in expression. We particularly focused on changes in the 338 regulation of the five most derived skeletal genes on the AMH lineage: SOX9, ACAN, COL2A1, 339 XYLT1, and NFIX, whose downregulation was shown to underlie a retracted face, as well as 340 changes to the structure of the larynx 20,29,38-41 . The results we report, which are based on ancient 341 DNA methylation patterns, provide novel means to analyze the genetic mechanisms that underlie 342 the evolution of the human face and vocal tract.       To overcome these obstacles, a major goal of the current study was to significantly extend the 735 span of our skeletal methylome collection, covering as many individuals, sexes, and bone types 736 as we could. This included the generation of many new samples, including the high-coverage 737 sequencing of additional ancient genomes, as listed below. 738

Present-day human bone DNA methylation maps 739
We generated full DNA methylation maps from two femur head bones from present-day humans 740 using whole-genome bisulfite sequencing (WGBS). Femora were chosen because of their 741 43 abundance in present-day human samples, as well as in ancient DNA samples 5,6,59 . In addition, 742 we collected 53 publicly available partial skeletal methylation maps. 743

Sample collection 745
Trabecular bone tissue from femur heads were taken from two patients with osteoarthritis during 746 a total hip replacement surgery, and after filling in a consent form as per Helsinki approval 747 #0178-13-HMO. Importantly, the effects of osteoarthritis processes on trabecular bone are much 748 less substantial than those on the synovium, cartilage, and subchondral bone. sequenced on an Applied Biosystems 3730xl Genetic Analyzer (Extended Data Fig. 3a,b). Analysis was performed similarly to Bone1 and Bone2, with the exception that the BSreads were 891 mapped to bisulfite converted chimpanzee (panTro4) reference genome, and we ignored the first 892 5bp of read1 and the first 44 bp of read2 in the chimpanzee sample (Extended Data Fig. 6). 893 Methylation data were deposited in NCBI's Gene Expression Omnibus and are accessible 894 through GEO accession number GSE96833. 895

Sample collection 897
We used two unidentified long bone fragments that belonged to a newborn wild chimpanzee 898 infant who died during a documented infanticide event at Gombe National Park on 9 March 899 2012. The infant was known to be the offspring of a chimpanzee called Eliza and was partially 900 50 eaten by an adult female and her family. The sample was collected from the ground about 48 901 hours after the infant's death and stored in RNAlater solution until arrival at Arizona State 902 University (ASU). At ASU the sample was stored at 4°C until extraction. 903

DNA Extraction 904
Sampling and DNA extractions were conducted at the ASU Ancient DNA Laboratory, a Class 905 10,000 clean-room facility in a separate building from the Molecular Anthropology Laboratory. 906 Precautions taken to avoid contamination included bleach decontamination and UV irradiation of 907 tools and work area before and between uses, and use of full body coverings for all researchers. 908 The bone samples were pulverized together in December 2012 using a SPEX CertiPrep Freezer 909 Mill. Three DNA extractions were conducted using 50-100 mg of bone powder (Extended Data 910 Table 9) and following the extraction protocol by Rohland and Hofreiter 64 . Two extraction blank 911 controls were included to monitor contamination of the extraction process. The probes on the arrays were designed to specifically hybridize with human DNA, so our use of 975 chimpanzee DNA required that probes non-specific to the chimpanzee genome, which could 976 produce biased methylation measurements, be computationally filtered out and excluded from 977 downstream analyses. This was accomplished using methods modified from 68 . Briefly, we used 978 blastn to map the 866,837 50bp probes onto the chimpanzee genome (Assembly: Pan_tro_3.0, 979 Accession: GCF_000001515.7) using an e-value threshold of e -10 . We only retained probes that 980 successfully mapped to the genome, had only 1 unique BLAST hit, targeted CpG sites, had 0 981 mismatches in 5bp closest to and including the CpG site, and had 0-2 mismatches in 45bp not 982 including the CpG site. This filtering retained 622,819 probes. 983 Additionally, β values associated with cross-reactive probes, probes containing SNPs at the CpG 984 site (either human or chimp), probes detecting SNP information, probes detecting methylation at 985 non-CpG sites, and probes targeting sites within the sex chromosomes were removed using the 986 minfi package in R 67 . This filtering retained a final set of 576,505 probes. 987

Sample collection 989
Postmortem frontal skull bones from two different chimpanzees (chimpanzee 1 and chimpanzee 990 2) were provided by the Biomedical Primate Research Centre (BPRC, The Netherlands). Bones 991 56 of one of the same UDG-treated libraries from this individual on a NextSeq500 instrument using 1038 2 x 76bp paired end sequences 71 . Following the mapping protocol described previously 9 , we 1039 trimmed adapter sequences, only processed read pairs whose ends overlapped by at least 15 bp 1040 (allowing for one mismatch) so that we could confidently merge them, and then mapped to the 1041 human reference sequence hg19 using the command samse in BWA (v0.6.1). We Neolithic individual came from a community that practiced farming, and was anthropologically 1053 determined to be a male aged 6-10 years at the time of death (the sex was confirmed genetically date; however, these were sequenced to a relatively low coverage (<5x), and thus, only crude 1074 methylation maps could be reconstructed from them. CàT ratio was computed for every CpG 1075 position along the hg19 (GRCh37) human genome assembly, for each of the samples, as 1076 previously described 4 . 1077 In order to exclude from the analyses positions that potentially represent pre-mortem CàT 1078 mutations rather than post-mortem deamination, the following filters were applied: (i) Positions  1079 where the sum of A and G reads was greater than the sum of C and T reads were excluded. (ii) 1080 For genomes that were produced using single-stranded libraries (i.e., Ust'-Ishim, Altai 1081 Neanderthal, Denisovan, Vindija Neanderthal and ~1/3 of the Loschbour library), positions 1082 where the GàA ratio on the opposite strand was greater than 1/(average single strand coverage) 1083 58 were excluded. This fraction represents a threshold of one sequencing error allowed per position. 1084 For Loschbour, this was performed only on the fraction of reads that came from the single 1085 stranded library. (iii) For all genomes, positions with a CàT ratio > 0.25 were discarded. For the 1086 Vindija Neanderthal, this threshold was raised to 0.5, due to its relatively low coverage (~7x). 1087 (iv) Finally, a maximum coverage threshold of 100 reads was used to filter out regions that are 1088 suspected to be PCR duplicates. 1089 In all genomes, excluding Vindija, a fixed sliding window of 25 CpGs was used for smoothing of 1090 the CàT ratio. This allowed for an unbiased scanning of differentially methylated regions 1091 (DMRs) that is not affected by the size of the window. Due to its relatively low coverage, we 1092 extended the sliding window used on the Vindija genome to 50 CpGs. This extended window is 1093 not expected to introduce a bias, as this genome was not used for DMR detection, but only for 1094 subsequent filtering that was applied equally to all genomes (see later). 1095 As previously described, CàT ratio was translated to methylation percentage using linear 1096 transformation determined from two points: zero CàT ratio was set to the value 0% 1097 methylation, and mean CàT ratio in completely methylated (100% methylation) CpG positions 1098 in modern human bone reference (hereinafter μ100) was set to the value 100% methylation. 1099 Positions where CàT ratio > μ100 were set to 100% methylation. For genomes that were 1100 extracted from bones, the modern Bone 2 WGBS map, which is the one with the higher coverage 1101 between the two WGBS modern bone maps, was used to determine μ100. For genomes that were 1102 extracted from teeth, there was no available modern reference methylation map, and therefore, 1103 we transformed the CàT ratio into methylation percentage based on the assumption that the 1104 genome-wide mean methylation is similar to bone tissue. Thus, the genome-wide mean CàT 1105 ratio represents 75% methylation, which is the genome-wide mean of measured methylation in 1106 59 the Bone 2 reference map. This was accomplished by setting μ100 to 1.33 x mean genome-wide 1107 CàT ratio. 1108

1109
The DMR detection algorithm is comprised of five main steps. We hereby provide an overview 1110 of the algorithm followed by a detailed description of each step. The overall goal of this pipeline 1111 is to detect differential methylation, assign it to the lineage on which it arose and filter out 1112 within-lineage variation. 1113 Overview 1114 Step 1: Two-way comparisons. To avoid artifacts that could potentially be introduced by 1115 comparing DNA methylation maps that were produced using different technologies, our core 1116 analysis relied on the comparison of the three reconstructed maps of the Altai Neanderthal, 1117 Denisovan, and Ust'-Ishim. Each of the samples was compared to the other two in a pair-wise 1118 manner, as a raw CàT ratio map against a reconstructed methylation map, and vice versa. This 1119 reciprocal comparison insured that the reconstruction process does not introduce biases to one of 1120 the groups. The minimum methylation difference threshold was set to 50%, spanning >50 CpGs. 1121 Step 2: Three-way comparisons. This step classifies to which of the three hominins the DMR 1122 should be attributed. This step is done by overlapping the three lists of DMRs found in Step 1. 1123 For example, a DMR that is detected between the Neanderthal and Ust'-Ishim and also between 1124 the Denisovan and Ust'-Ishim is considered specific to Ust'-Ishim. 1125 Step 3: FDR filtering. Various factors could introduce noise to the reconstruction process, 1126 including the stochasticity of the deamination process, the use of a sliding window, and 1127 variations in read depth within a sample. We ran simulations that mimic the post-mortem 1128 61 that is originally C, in the context of a CpG dinucleotide. We similarly use : to count the total 1152 number of C's that appear in a position that is originally C, in the context of a CpG dinucleotide. 1153 The CàT ratio is defined as : / : , where : = : + : . Let : and : (both between zero and 1154 one) be the methylation of this position in the reference genome and in the reconstructed one, 1155 respectively. If we denote by the deamination rate, assumed to be constant throughout the 1156 genome, and if we assume that deamination of C into T is a binomial process with probability of 1157 success : , we get 1158 : ∼ ( : , : ). (1) Our null hypothesis is that the th CpG is not part of a DMR, namely that : = : . The 1159 alternative hypothesis states that this CpG is part of a DMR. The definition of this statement is 1160 that | : − : | ≥ Δ, where Δ is some pre-specified threshold. In other words, under the 1161 alternative hypothesis we get that : ≥ : + Δ if the site has low methylation in the reference 1162 genome, and : ≤ : − Δ if it has high methylation in the reference genome. 1163

Per-site statistic 1164
Let us start with the first option, testing whether : ≥ : + Δ when : is low. A log-likelihood-1165 ratio statistic would be 1166