Chromosomal-scale de novo genome assemblies of Cynomolgus Macaque and Common Marmoset

Cynomolgus macaque (Macaca fascicularis) and common marmoset (Callithrix jacchus) have been widely used in human biomedical research. Long-standing primate genome assemblies used the human genome as a reference for ordering and orienting the assembled fragments into chromosomes. Here we performed de novo genome assembly of these two species without any human genome-based bias observed in the genome assemblies released earlier. We assembled PacBio long reads, and the resultant contigs were scaffolded with Hi-C data, which were further refined based on Hi-C contact maps and alternate de novo assemblies. The assemblies achieved scaffold N50 lengths of 149 Mb and 137 Mb for cynomolgus macaque and common marmoset, respectively. The high fidelity of our assembly is also ascertained by BAC-end concordance in common marmoset. Our assembly of cynomolgus macaque outperformed all the available assemblies of this species in terms of contiguity. The chromosome-scale genome assemblies produced in this study are valuable resources for non-human primate models and provide an important baseline in human biomedical research. Measurement(s) sequence_assembly • genome • DNA Technology Type(s) sequence assembly process • DNA sequencing Sample Characteristic - Organism Macaca fascicularis • Callithrix jacchus Measurement(s) sequence_assembly • genome • DNA Technology Type(s) sequence assembly process • DNA sequencing Sample Characteristic - Organism Macaca fascicularis • Callithrix jacchus Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.14665314

www.nature.com/scientificdata www.nature.com/scientificdata/ mate-pair sequencing were commonly used to join the contigs into longer scaffolds, albeit with sequence gaps in between. Long-standing non-human primate (NHP) genome assemblies released earlier used the human genome for ordering and orienting the assemblies into chromosomes, which prevents the observation of intrinsic structural differences between the primate genomes 5 . For example, a large inversion of around 20 Mb was observed in chromosome 16 of the earlier marmoset genome assembly, which should have been the result of the 'humanization' bias 6 .
Recent technological advancements allow us to obtain chromosome-scale assemblies without relying on existing genome assemblies, where such errors or bias can be avoided. Single-molecule long-read sequencing (Pacific Biosciences [PacBio] and Oxford Nanopore Technologies) have drastically increased the contiguity of assemblies, and chromatin contact profiling with Hi-C and other techniques such as optical mapping have paved the way to reconstructing chromosome-scale sequences. Taking advantage of these recent advancements, the genome sequences of some non-human primates (NHP) including gorilla, orangutan, and chimpanzee were largely improved 5,7 , followed by the ones for bonobo (Bioproject accession: PRJNA526933), and Northern white-cheeked gibbon (PRJNA369439). High-quality genome assemblies of old world monkeys were also recently reported, such as the ones for Rhesus macaque by three different research groups (PRJNA476474 8 , PRJNA509445 9 , and PRJNA514196 10 ), the ones for olive baboon (PRJNA527874 11 ), golden snub-nosed monkey (PRJNA524949 12 ), and Francois's langur (PRJNA488530 13 ). We had also previously produced pseudo-chromosome assembly by using PacBio long-reads for common marmoset 6 , where 'humanized' sequences were still used as a reference.
In this study, we focus on cynomolgus macaque and common marmoset to establish a solid baseline for human biomedical research. We performed genome sequencing and de novo assembly for both species by using PacBio long-reads, along with Hi-C for chromosome-scale scaffolding through optimized Hi-C data acquisition 14 . We corrected misjoins of the scaffolds through examination of the Hi-C contact maps. We further investigated misjoins through cross-checking alternate contigs made from PacBio data and corrected them. Lastly, we assessed the quality of the resultant assemblies based on the contiguity of assembled sequences (N50), completeness of conserved protein-coding genes, and consistency to BAC clone sequences. The results indicate that our genome assemblies are of chromosomal-scale contiguity, with nearly complete coverage of gene space, and highly concordant to the conventional genomic resource obtained independently to our data set. Above all, for cynomolgus macaque, our genome assembly achieved the optimal quality in comparison with other resources available for this species.

Methods
Sample preparation, sequencing, and de novo assembly. A  For common marmoset, the 43x PacBio RSII read dataset from our earlier sequencing effort (PRJDB8242), was used to reassemble the genome ( Fig. 1a; Table 1).
To assemble both the NHP genomes, we employed a series of assembly programs (supplementary fig. S1). Based on the performances from the earlier marmoset genome assembly study, we chose the assemblers Flye 15 , Redbean (wtdbg2) 16 , SMARTdenovo 17 , and miniasm 18,19 . After de novo assembly, all the assemblies were processed by PacBio's polishing tool, arrow, to eliminate possible sequencing errors. After considering the results from multiple de novo assembly tools, Redbean was chosen as the base assembler, as it produced the most contiguous assemblies for both the common marmoset (contig N50 of 8.04 Mb; Table 2) and the cynomolgus macaque (contig N50 of 5.84 Mb) genomes (Table 3).
Hi-C data acquisition and scaffolding. Hi-C libraries were constructed with the restriction enzyme DpnII following the iconHi-C protocol 14 , using the skeletal muscle tissues of an adult female marmoset and a cynomolgus macaque that were kept frozen in −80 °C after dissection and snap-freezing in liquid nitrogen. Fixed tissue materials containing 2 μg of DNA were used for the preparation of Hi-C DNA by in situ restriction digestion and ligation. Library preparation was performed using 1 μg of Hi-C DNA with 5 cycles of PCR for the marmoset and 6 cycles of PCR for the cynomolgus macaque. Quality control of the Hi-C DNA and the Hi-C library was performed as described in the iconHi-C protocol 14 . Quality control of the Hi-C DNA showed an expected pattern of a shift in size -shortening after digestion and elongation after ligation, indicating successful preparation of the Hi-C DNA (Fig. 1b). Quality control of the Hi-C library by restriction digestion confirmed the existence of expected ligation junction sequences inside the library molecules at a high proportion and successful generation of the library (Fig. 1c).
Sequencing to obtain Hi-C reads was performed on an Illumina HiSeq X in paired-ends with 151 cycles. The obtained reads were processed using Trim Galore to remove low-quality regions and adapter sequences. Post-sequencing quality control of the Hi-C libraries was performed as described previously 14 . Hi-C read statistics obtained by HiC-Pro 20 , using one million subsampled read pairs from the large scale sequencing data mapped to the PacBio contigs, confirmed high quality of the libraries while showing a high proportion of valid interaction read pairs, a low proportion of invalid ligation products (dangling end pairs), and a low proportion of contiguous restriction fragments (re-ligation pairs) (Supplementary table S1). Approximately 408 and 377 million Hi-C reads for the marmoset and the cynomolgus macaque were mapped to the Redbean contig sequences of the common marmoset and the cynomolgus macaque respectively using Juicer 21 . With the Juicer output files, Hi-C scaffolding was performed using 3d-dna 22 . Inversions and misjoins in the assemblies that occurred during the Hi-C scaffolding process were corrected by using Juicebox based on the frequency of Hi-C contacts 23 .
As a result of the genome assembly workflow involving de novo assembly followed by Hi-C scaffolding (Fig. 2a), the N50 lengths increased from approximately 8 to 135 Mb for the marmoset ( Table 2) and from 5.8 to 149 Mb for the cynomolgus macaque (Table 3), and the number of scaffolds longer than 10 Mb presumed to be at the chromosome level became close to the actual number of chromosomes (Fig. 3). For the cynomolgus macaque, two pairs of the scaffold sequences longer than 10 Mb were merged into two pseudo-chromosomes, referred to as chromosomes 2 and 8 later, based on alignment concordance with the 3d-dna scaffold sequences of alternate assemblies, as well as the previous reference sequence.
Misjoin detection and gap-filling. All the alternate assemblies were also independently scaffolded using 3d-dna. These were later aligned against the Hi-C scaffolds from Redbean assembly to observe whether any misjoins were introduced by Hi-C scaffolding. Because the contiguity profiles of the assemblies are different across the employed assemblers, contigs which are broken into two or more sequences in one assembly could have been assembled into a single contiguous piece in another assembler (Supplementary fig. S1). This contiguity information from alternate assemblies was used to identify misjoins in the scaffold sequences from the Redbean assembly (Fig. 2b). When the alignments of the alternate assemblies against Redbean scaffolds were visualized using D-genies dot plots 24 , occasionally the scaffolds did not exhibit consistency in chromosomal structure among the 3d-dna scaffolds. These indicated the presence of misjoins introduced by Hi-C scaffolding (Supplementary fig. S2).
In case that such discrepancies in contig orders were observed in the alignments, a two-step check was performed, A) whether gaps are present in the Redbean scaffolds, and B) whether at least two of the alternate assemblies are consistent in those regions and discrepant against the Redbean scaffolds. If the above case is true, instead  www.nature.com/scientificdata www.nature.com/scientificdata/ of re-ordering the misjoined scaffolds, we replaced the alignment block with one of the alternate assemblies making it a hybrid assembly. A similar procedure was also performed for gap-filling in regions without misjoins also. When a gap-containing region in Redbean scaffolds aligned to a gap-free region of contigs from alternate assemblies, those gap-containing regions were replaced with the corresponding blocks from alternate assemblies (Fig. 2b). When two or more alternate assemblies can be used to fill the gaps, one of them was randomly chosen to fill the gaps, followed by additional two rounds of polishing using arrow. This gap-filling procedure filled 1,147 gaps of length 573.94 kb in cynomolgus macaque, and 1,061 gaps of length 522.16 kb in common marmoset respectively.

Data records
The N50 length of the contigs increased from 8.04 Mb to 24.82 Mb, and from 5.84 Mb to 26.27 Mb for the final assemblies of common marmoset and cynomolgus macaque genomes, respectively. The final genome assemblies displayed chromosome-sized sequences, designated here as 'pseudo-chromosomes' , which exhibited scaffold N50 lengths of 132.27 Mb and 149.88 Mb, for the cynomolgus macaque and the common marmoset genomes, respectively ( Table 2; Table 3).
The raw genome sequence data of the cynomolgus macaque are deposited at the DDBJ under the accession, DRA009584 25 . The Hi-C raw sequence data are deposited under the accessions, DRA009641 26 , and DRA009987 27 , for the cynomolgus macaque and the common marmoset genomes, respectively. The assembled genome sequences are deposited under the accessions, BLPH01000001-BLPH01000400 28 , and BLSI010000 01-BLSI01001872 29 , for the cynomolgus macaque and the common marmoset genomes, respectively.

technical Validation
Gene space completeness assessment. The web server gVolante 30 , in which two established pipelines CEGMA 31 and BUSCO 32 are implemented, was used to assess the sequence length distributions and gene space completeness in a uniform environment. The latter was based on the coverage of one-to-one reference orthologues with the ortholog search pipeline CEGMA and the gene set CVG that is specifically optimized to assess vertebrate genome sequences 33 . CEGMA identified pre-selected conserved protein-coding genes by using a  Table 2. Properties of the obtained and existing assemblies of common marmoset. *The current reference assemblies and the other ones deposited recently. + The completeness assessment was performed by the computational pipeline CEGMA and the reference gene set CVG.
www.nature.com/scientificdata www.nature.com/scientificdata/ combination of tblastn, and GeneWise for sequence alignment 34,35 , and GeneID for gene prediction 36 , in addition to HMMER3 for further homology search 37 , which are used to assess the completeness of the genome assemblies. The analysis of gene space completeness revealed a smaller number of missing CVG genes for both species, in comparison with the scores for the genome assemblies released earlier ( Table 2; Table 3).
BaC-end alignment. BAC-end read pairs of the common marmoset from our earlier sequencing effort 4 , were aligned against the constructed assembly using bowtie2 38 . After the misjoin detection and gap-filling  Table 3. Properties of the obtained and existing assemblies of cynomolgus macaque. *The current reference assemblies and the other ones deposited recently. + The completeness assessment was performed by the computational pipeline CEGMA and the reference gene set CVG. Misjoin correction and gap-filling using contigs from genome assemblies using alternate assembly tools.
www.nature.com/scientificdata www.nature.com/scientificdata/ procedure, the number of concordantly aligned read pairs increased by a count of 684, while the number of discordantly aligned read pairs decreased by a count of 401 ( Table 2).
Comparison of common marmoset assembly against previous assemblies. For the common marmoset, NCBI hosts nine genome assemblies (https://www.ncbi.nlm.nih.gov/assembly/organism/9483/latest/), with cj1700 being the representative reference genome. The assembled genome in this study has a total scaffold length of 2.81 Gb with a contig N50 length of 24.82 Mb, in comparison to the total scaffold length of 2.79 Gb with the contig N50 length of 6.38 Mb from our earlier effort (CJ2019). Our newly assembled contigs have slightly lesser contiguity than the recently submitted assembly cj1700 which has a total scaffold length of 2.9 Gb with a contig N50 length of 26.62 Mb (Fig. 4a). Notably, our assembly is 23 Mb longer than cj1700 after excluding gap regions, and more than half of the sequence gaps in the pseudo-chromosomal sequences of the cj1700 assembly  www.nature.com/scientificdata www.nature.com/scientificdata/ were identified as contiguous regions in our assembly. These results imply that these assemblies are similarly useful and, in some difficult regions, complementary to each other (Supplementary table S2).
Comparison of cynomolgus macaque assembly against previous assemblies. NCBI hosts five cynomolgus macaque genome assemblies (https://www.ncbi.nlm.nih.gov/assembly/organism/9541/latest/), with the macFas5 genome assembly being the representative reference genome ( Table 3). Three of the assemblies including the representative genome have contig N50 lengths shorter than 100 kbp. In contrast, the recently submitted Macaca_Fascicularis_6.0 assembly had produced a much larger contig N50 length of 21.34 Mb. It has a total scaffold length 2.91 Gb, which is smaller than that of the representative reference genome's 2.95 Gb. The assembly from our present study has a contig N50 length of 26.27 Mb and a total scaffold length 2.91 Gb. It is 65 Mb shorter than Macaca_Fascularis_6.0 even after excluding gap regions, and the ideal total length of the assembly is not yet conclusive. Notably, our assembly outperformed all the assemblies in terms of contiguity (Fig. 4b) and its coverage of orthologous genes in our assembly achieved the highest ratio, 93%, whereas the one in Macaca_Fascicularis_6.0 was 90% (Table 3). It indicates that the assembly produced here outperformed the rest not only in terms of contiguity, but also in base accuracy.
Comparison against other non-human primate genome assemblies. The contiguity of the two genomes assembled in this study was compared against other non-human primate genomes such as those of Rhesus macaque, Francois langur, chimpanzee, gorilla, orangutan, white-cheeked gibbon, golden snub-nosed monkey, bonobo, and olive baboon. Although there were three different sub-species genome assemblies for Rhesus macaque, we herein considered only the best assembly out of them in terms of contiguity. In comparison to the other non-primate genome assemblies, it was evident that the common marmoset and cynomolgus macaque assemblies produced the second and the third-best contiguities, with the Rhesus macaque genome assembly 8 , being the best among all in terms of contiguity (Supplementary fig. S3).