The structure, function and evolution of a complete human chromosome 8

The complete assembly of each human chromosome is essential for understanding human biology and evolution (refs. 1, 2). Here we use complementary long-read sequencing technologies to complete the linear assembly of human chromosome 8. Our assembly resolves the sequence of five previously long-standing gaps, including a 2.08-Mb centromeric α-satellite array, a 644-kb copy number polymorphism in the β-defensin gene cluster that is important for disease risk, and an 863-kb variable number tandem repeat at chromosome 8q21.2 that can function as a neocentromere. We show that the centromeric α-satellite array is generally methylated except for a 73-kb hypomethylated region of diverse higher-order α-satellites enriched with CENP-A nucleosomes, consistent with the location of the kinetochore. In addition, we confirm the overall organization and methylation pattern of the centromere in a diploid human genome. Using a dual long-read sequencing approach, we complete high-quality draft assemblies of the orthologous centromere from chromosome 8 in chimpanzee, orangutan and macaque to reconstruct its evolutionary history. Comparative and phylogenetic analyses show that the higher-order α-satellite structure evolved in the great ape ancestor with a layered symmetry, in which more ancient higher-order repeats locate peripherally to monomeric α-satellites. We estimate that the mutation rate of centromeric satellite DNA is accelerated by more than 2.2-fold compared to the unique portions of the genome, and this acceleration extends into the flanking sequence.


Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.

Data analysis
Custom code for the SUNK-based assembly method is available at https://github.com/glogsdon1/sunk-based_assembly. Other software used in this study is publicly available and includes the Pacific Biosciences CCS algorithm (v3
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability
The complete CHM13 chromosome 8 sequence and all data generated and/or used in this study are publicly available and listed in Supplementary Table 9 with their BioProject, accession number, and/or URL. For convenience, we list their BioProjects and/or URLs here: complete CHM13 chromosome 8 sequence (PRJNA559484);

Sample size
We generated a whole-chromosome assembly of human chromosome 8 and assembled the chromosome 8 centromere in a diploid human cell line and three diploid nonhuman primates in order to perform phylogenetic and comparative analyses. For phylogenetic tree reconstruction of the centromeric satellite, we used 150 data points from each genome, which resulted in a bootstrap value of 100 for all major branches of the tree (that is, the same branch was observed in that clade in 100 out of 100 repetitions of the phylogenetic reconstruction on resampled data). For the centromeric mutation rate computation, we compared 1,002 regions of 10 kbp each from across the chimpanzee, orangutan, and macaque genomes to the corresponding human region, which spans approximately 1.65 Mbp of sequence. This is the maximum number of data points that can be analyzed within this region (assuming 10 kbp windows), and the analysis is strengthened by the comparison across three different species (rather than just one). For gene copy number estimation, we analyzed 1,105 published high-coverage datasets spanning nine human superpopulations. These were all the datasets available for this analysis, and they provide a sufficiently high number of genomes to determine a median and standard deviation of gene copy number for each superpopulation with confidence. For droplet digital PCR (ddPCR), we performed seven technical replicates, which is four more than the standard three technical replicates used in such experiments. For the chromatin fiber-FISH, we generated three slides, which served as technical replicates, and identified multiple fibers showing the indicated CENP-A and methylation patterns. For the pulsed-field gel Southern blots, each experiment was performed twice with different restriction enzymes, and each result confirmed the expected banding pattern.
For FISH on metaphase chromosome spreads, experiments were performed >3 times and generated several spreads with chromosome 8 FISH probes hybridized in the expected order. This number of FISH replicates meets or exceeds the standard number of experimental replication commonly accepted by the field.
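
The bootstrap value described above (the share of reconstructions on resampled data that recover a given branch) can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the alignment encoding, the function names `resample_columns` and `bootstrap_support`, and the toy sequences and clades are all assumptions made for the example, and the tree-inference step itself is omitted.

```python
import random


def resample_columns(alignment, rng):
    """Draw one bootstrap pseudo-replicate by sampling alignment columns
    with replacement. `alignment` maps a taxon name to an aligned sequence
    (a hypothetical encoding); all sequences must have the same length."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have equal length")
    n = lengths.pop()
    columns = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def bootstrap_support(replicate_trees, clade):
    """Return the percentage of replicate trees that contain `clade`.
    Each tree is represented as a set of clades, and each clade is a
    frozenset of taxon names (a simplified, hypothetical tree encoding)."""
    if not replicate_trees:
        raise ValueError("need at least one bootstrap replicate")
    hits = sum(1 for tree in replicate_trees if clade in tree)
    return 100.0 * hits / len(replicate_trees)


# Toy inputs, invented for illustration only (not data from the study).
toy_alignment = {
    "human": "ACGTAC",
    "chimpanzee": "ACGTAT",
    "orangutan": "ACGAAT",
}
pseudo_replicate = resample_columns(toy_alignment, random.Random(0))

# Pretend a tree was inferred from each of 100 pseudo-replicates and that
# every one of them grouped human with chimpanzee.
human_chimp = frozenset({"human", "chimpanzee"})
replicate_trees = [{human_chimp} for _ in range(100)]

print(bootstrap_support(replicate_trees, human_chimp))  # 100.0
```

In this sketch a support of 100 means every replicate tree contained the clade, which matches the "100 out of 100 times" reading given in the text.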

Data exclusions
No data were excluded.

Replication
Computational experiments are deterministic and are, therefore, reproducible. Despite this expected reproducibility, computational experiments were run multiple times with different parameters to improve the experimental analysis. All attempts at replication were successful for both computational and wet-lab experiments.

Randomization
Randomization is not applicable to this study because we did not perform any experiments where there are treatment and control groups that would necessitate randomization between the subjects.

Blinding
Blinding is not applicable to this study because we did not perform any experiments where there are treatment and control groups that would necessitate blinding.

Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.

Antibodies used
- Rabbit monoclonal anti-5-methylcytosine antibody (RevMAb, RM231)
- Alexa Fluor 488 goat anti-rabbit (Thermo Fisher Scientific, A-11034)
- Alexa Fluor 594 conjugated to goat anti-mouse (Thermo Fisher Scientific, A-11005)

Validation
The anti-CENP-A antibody was generated against a synthetic peptide consisting of amino acids 3-19 of CENP-A, and mutation of this epitope in human cells prevents antibody binding (Logsdon et al., JCB, 2015).

Mycoplasma contamination
The CHM13hTERT cell line is negative for mycoplasma contamination (Miga et al., Nature, 2020). The other cell lines used in this study have not been assessed for mycoplasma contamination to our knowledge.

Commonly misidentified lines (See ICLAC register)
No commonly misidentified cell lines were used in this study.

ChIP-seq
Data deposition
Confirm that both raw and final processed data have been deposited in a public database such as GEO.
Confirm that you have deposited or provided access to graph files (e.g. BED files) for the called peaks.