To the Editor:

In recent years, the number of published genome sequences has increased substantially owing to major developments in next-generation sequencing (NGS) technologies, a concomitant reduction in sequencing costs and improvements in assembly strategies. In 2011, your journal published the genome of Chinese hamster ovary (CHO)-K1 cells, the most frequently used mammalian production cell line for biopharmaceutical products (ref. 1). In this issue, the genomes of several related CHO cell lines, as well as the genome of the Chinese hamster, are also presented (ref. 2). Although this information provides long-awaited and necessary insights for scientists working with these important production hosts, it also highlights a major drawback of short-read NGS technology, namely, the difficulty of assembling short-read data and scaffolding these sequences into a fully structured genome. This is especially critical for CHO cells, which are known to be genomically unstable, with frequent chromosome rearrangements and loss (refs. 3,4). In the following correspondence, we describe how a chromosome sorting approach can facilitate genome assembly from short-read sequences.

The effects of chromosome rearrangements on the bioprocess-relevant behavior of different CHO cell lines are not clear and will require more detailed analysis in the future. Although it seems less likely that large segments of genomic DNA are lost completely, which would entail the loss of necessary cellular functions, presumably leading to cell death, rearrangements probably lead to subtle changes in transcription patterns. These may affect cellular properties relevant to bioprocessing, such as growth, robustness and productivity of CHO cell lines and clones. For future studies on these changes and their impact on cell behavior in industrial cell lines, it is thus of prime importance to have, on the one hand, a reference genome that includes the allocation of scaffolds and contigs to chromosomes and, on the other hand, a method that enables characterization of chromosomal translocations present in CHO cell lines being sequenced.

Current NGS technology yields short-read sequences, typically in the range of 100–500 bp, so that common repeats cannot be assembled and the precise location of duplicated sequences is likely to be missed (ref. 5). De novo assembly generates, on average, scaffolds of 1–2 Mb if genome coverage is sufficiently high (50- to 100-fold). As chromosomes are many times larger (typically 90–200 Mb), chromosomal rearrangements and translocations can be captured only in part.
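The size mismatch can be made concrete with a little arithmetic. The following is an illustrative sketch only: the function name is hypothetical, and the inputs are simply the ranges quoted above (scaffolds of 1–2 Mb, chromosomes of 90–200 Mb).

```python
MB = 1_000_000  # base pairs per megabase


def scaffolds_needed(chromosome_bp: float, scaffold_bp: float) -> float:
    """Approximate number of average-sized scaffolds needed to span one chromosome."""
    return chromosome_bp / scaffold_bp


# Best case: a small chromosome tiled by large scaffolds.
fewest = scaffolds_needed(90 * MB, 2 * MB)
# Worst case: a large chromosome tiled by small scaffolds.
most = scaffolds_needed(200 * MB, 1 * MB)

print(f"{fewest:.0f} to {most:.0f} unordered scaffolds per chromosome")
```

Under these assumptions, each chromosome is represented by tens to hundreds of scaffolds whose order and chromosomal origin are unknown, so a rearrangement whose breakpoints fall outside a single scaffold cannot be placed.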

Here, we address this dilemma by isolating individual chromosomes by flow cytometric cell sorting, followed by NGS of the obtained material in separate sequencing reactions. After curation and assembly, the resulting scaffolds can be assigned to specific chromosomes. We applied our approach to cells from the Chinese hamster strain 17A/GY and encountered several challenges, most notably cross-contamination between chromosomes whose peaks lay too close together in the flow histogram, which required a bioinformatic curation procedure (Fig. 1). The most severely affected chromosomes in this respect were chromosomes 5 and 6. Chromosomes 9 and 10 could only be separated as a pool and chromosome Y was not sorted at all. For library construction, we obtained 80–620 ng of DNA for each sorted chromosome and prepared, in addition, a 5,000-bp mate-pair sequencing library from whole-genome DNA. We sequenced the libraries on an Illumina (San Diego) Genome Analyzer IIx, using TruSeq PE Cluster Kit v5-CS-GA and TruSeq SBS Kit v5-GA, and generated 70-fold genome coverage, assuming a genome size of 2.8 Gb for the Chinese hamster (ref. 6). Subsequently, 1.4 billion reads were assembled into a draft sequence for the separated chromosomes using ALLPATHS-LG (ref. 7). As mentioned above, sequencing libraries from separated chromosomes might be contaminated with sequences from other hamster chromosomes. The separated chromosome assemblies were therefore analyzed to identify and eliminate contaminating scaffolds from the data. This filtering led to high-quality assemblies of separated Chinese hamster chromosomes with the total number of scaffolds ranging from 517 for chromosome 8 to 5,348 for chromosomes 9+10, and a total genome size of 2.33 Gb (Table 1).
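As a rough consistency check on these figures, fold coverage relates to read count and read length by coverage = reads × read length / genome size. The sketch below rearranges that relation; it assumes that the 1.4 billion reads account for the full 70-fold coverage, which the text does not state explicitly, and the function name is hypothetical.

```python
def mean_read_length(coverage: float, genome_bp: float, n_reads: float) -> float:
    """Mean read length implied by coverage = n_reads * read_length / genome_bp."""
    return coverage * genome_bp / n_reads


GENOME_BP = 2.8e9      # assumed Chinese hamster genome size, as quoted in the text
ASSEMBLED_BP = 2.33e9  # total size of the filtered chromosome assemblies

# Assumption: all 1.4 billion reads contribute to the 70-fold coverage.
implied_length = mean_read_length(coverage=70, genome_bp=GENOME_BP, n_reads=1.4e9)
assembled_fraction = ASSEMBLED_BP / GENOME_BP

print(f"implied mean read length: about {implied_length:.0f} bp")
print(f"assembled fraction of assumed genome size: {assembled_fraction:.0%}")
```

Under that assumption the implied mean read length is about 140 bp, within the 100–500 bp range typical of short-read data, and the filtered assemblies cover roughly 83% of the assumed genome size.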

Figure 1: Bivariate flow cytometric analysis of Chinese hamster chromosomes.

Fibroblast cultures were established from strain 17A/GY. Staining was performed with Hoechst 33258 and chromomycin A3. Fluorescence intensity is plotted for 30,000 events. Numbers and letters refer to the respective chromosomes. The X chromosome and all autosomes except chromosomes 9 and 10 (sorted as a pool) show individual peaks. The Y chromosome peak was very close to chromosome 5 and therefore not sorted.

Table 1 Assembly statistics of separated Chinese hamster chromosomes

We mapped scaffolds of the separated hamster chromosome libraries to the mouse genome together with the published CHO-K1 genomic sequence (ref. 1) (Supplementary Fig. 1). This revealed that, in principle, the entire genome of the mouse can be covered by Chinese hamster sequences, even though complex chromosomal rearrangements have occurred. The only exceptions are mouse chromosomes 7, 14, 17 and X, which are incompletely covered by both Chinese hamster and CHO-K1 sequences. Gaps detected between the Chinese hamster scaffolds and mouse chromosomes occur primarily in regions with a high frequency of interspersed repeats and low complexity regions, which cannot be assembled properly from short sequence reads. As the missing regions on mouse chromosomes 7 and 12 are in part covered by short scaffolds and as the corresponding CHO-K1 genome has even more sequences mapping to these locations, it seems likely that these sequences are not missing in the Chinese hamster, but might have been difficult to assemble owing to sequence repeats. Also notable is that despite the severe chromosomal rearrangements that have occurred in CHO-K1 (refs. 3,4), no major parts of the genome are completely missing: gaps relative to the mouse chromosomes occur at the same positions of high repeat density as for the Chinese hamster reference genome, and only very small regions are missing in CHO-K1 that are present in the Chinese hamster genome. Homologies between the Chinese hamster chromosome sequences and mouse chromosomes identified by sequence mapping compare well to reciprocal chromosome painting results of hamster and mouse chromosomes (ref. 8).

The sequence of the Chinese hamster provides a reference for future research of sufficient quality and precision to enable characterization and study of chromosomal rearrangements and stability in CHO cell lines. In addition, the results of this study suggest that the approach of using sorted chromosomes for library generation may prove beneficial for sequencing of complex reference genomes of other eukaryotes.

Accession code. GenBank: APMK00000000. The version described in this paper is the first version, APMK01000000.

Author contributions

N.B., J.G., J.E.M. and A.P. originated the concept of the study. H.L. contributed the chromosome sorting strategy. The project was further developed by W.E.B., D.M., T.J., M.L. and B.H. K.B., T.N., A.T. and A.G. carried out the sequencing project design. W.E. and F.H. contributed to study planning and generated samples of cells and genomic DNA of the Chinese hamster. R.K. and J.W. sorted Chinese hamster chromosomes. H.L. and S.R. prepared DNA from sorted chromosomes. O.R., F.K. and B.L. performed data analysis. All authors contributed to drafting and reviewing the manuscript.