Thousands of high-quality sequencing samples fail to show meaningful correlation between 5S and 45S ribosomal DNA arrays in humans

The ribosomal RNA genes (rDNA) are tandemly arrayed in most eukaryotes and exhibit vast copy number variation. There is growing interest in integrating this variation into genotype–phenotype associations. Here, we explored a possible association of rDNA copy number variation with autism spectrum disorder and found no difference between probands and unaffected siblings. Because short-read sequencing estimates of rDNA copy number are error prone, we sought to validate our 45S estimates. Previous studies reported tightly correlated, concerted copy number variation between the 45S and 5S arrays, which should enable the validation of 45S copy number estimates with pulsed-field gel-verified 5S copy numbers. Here, we show that the previously reported strong concerted copy number variation may be an artifact of variable data quality in the earlier published 1000 Genomes Project sequences. We failed to detect a meaningful correlation between 45S and 5S copy numbers in thousands of samples from the high-coverage Simons Simplex Collection dataset as well as in the recent high-coverage 1000 Genomes Project sequences. Our findings illustrate the challenge of genotyping repetitive DNA regions accurately and call into question the accuracy of recently published studies of rDNA copy number variation in cancer that relied on diverse publicly available resources for sequence data.

The genes encoding the ribosomal RNAs are present in long tandem arrays in most eukaryotes, often referred to as the ribosomal DNA (rDNA). Most eukaryotic genomes contain two types of rDNA arrays: the 45S, encoding the 18S, 5.8S, and 28S rRNAs, and the 5S, encoding the 5S rRNA. Because of their repetitive nature, both rDNA arrays are susceptible to expansion and contraction, which can lead to vast copy number differences among individuals 1,2 . The phenotypic consequences of natural variation in rDNA copy number remain largely unexplored 3 , although a number of human disease phenotypes have been linked to rDNA copy number.
Changes in rDNA copy number have been identified in some cancer and aging studies. The 45S and 5S copy numbers change in some human cancers, in which cancerous tissue has a higher or lower rDNA copy number than a matched control [4][5][6][7][8] . In some tissues such as the brain, heart, and adipose tissue, 45S rDNA copy loss has been observed with age [9][10][11] . However, one study argues that rDNA instability is only observed in brains of individuals affected by dementia, and is absent in aging brains of healthy individuals 12 . In contrast, a recent study on aging mice found that 45S rDNA copy number increases in the blood with age in mice 13 . Additional studies in mice and rat cell lines report a lack of change in rDNA copy number with age [14][15][16] . In both cancer and aging studies, not all cancers and not all aging tissues or organisms appear to display changes in rDNA copy number 1,4,17,18 . In short, there is no universal agreement on whether rDNA copy number changes with age or cancer.
In addition, an individual's inherited rDNA copy number may be of interest for GWAS studies. In humans, rDNA copy number is associated with differences in global gene expression and mitochondrial abundance 1 . These differences may modify the effects of trait-associated variation. There are a few studies that report potential effects of rDNA copy number on disease: Higher 45S rDNA copy numbers are found in people with schizophrenia

Results
Meaningful concerted copy number variation is not present in the Simons simplex collection. Our initial goal was to estimate rDNA copy number in the Simons simplex collection to determine if rDNA copy number is associated with autism spectrum disorder. Autism spectrum disorders associate with variants in hundreds of genes, including single nucleotide, tandem repeat, and copy number variants [32][33][34] . rDNA copy number variation has not been assessed in autism spectrum disorder; however, it has been hypothesized that higher rDNA copy number associates with a more severe intellectual disability due to the increased potential for rRNA translation 35 . The Simons Simplex Collection has sequenced hundreds of families with a child affected by an autism spectrum disorder [36][37][38][39][40] . We estimated rDNA copy number in 7,268 individuals from families in which both of the parents, the proband, and an unaffected sibling were sequenced. We detected no difference in rDNA copy number based on autism status (Fig. 1a), using read coverage estimates (Student's paired t-test, p = 0.9365) 23 . We asked whether within probands, rDNA copy number associates with degree of intellectual disability by comparing individuals with an IQ of < = 50 to those with an IQ of > = 100. We detected a statistically significant difference (nominal p = 0.03, Welch two-sample t-test) of individuals with an IQ < = 50 having on average eleven more rDNA copies than those with IQ > = 100, consistent with the published hypothesis (Fig. 1b). However, eleven additional rDNA copies seem unlikely to be biologically relevant, given an average copy number of ~ 250 and epigenetic silencing of many copies. Testing additional cutoffs for IQ-including assessing individuals with IQ < 70 versus those with IQ > = 70, the cutoff for severe versus not severe intellectual  Fig. 1). With more stringent IQ cutoffs, nominal significance was abolished. Because we previously reported that rDNA copy number estimates based on short-read whole genome sequencing data can be error prone, with eleven copies certainly being within the range of error 30 , we sought to validate rDNA copy number in a subset of samples by alternate methods. Our preferred alternate method to estimate rDNA copy number is the CHEF gel. CHEF gels are the gold standard to estimate rDNA copy number, but they have limitations. In humans, the 45S rDNA array makes up the short arm of each of the 5 acrocentric chromosomes, so when assessing rDNA copy number by CHEF gel ten distinct bands should be observed. No previous study has ever observed ten distinct bands in a 45S CHEF gel, possibly because some rDNA loci are too large to be resolved 31 . In contrast, the 5S rDNA, residing in a single locus, can be readily measured by CHEF gels. Given the previously reported concerted copy number variation between the 5S and 45S rDNA arrays 23 , we planned to validate the 5S rDNA copy number estimates by CHEF gels to infer the accuracy of the sequencing-based 45S rDNA copy number estimates.
To this end, we first estimated 5S rDNA copy number in the Simons Simplex Collection to determine the strength of concerted copy number variation. We found weaker correlation than what was reported previously: The 18S and 5S copy numbers correlate with a Spearman coefficient of 0.24, while in the initial study using 1000 Genomes Project data the Spearman coefficient was 0.61 (Fig. 1c, Table 1). As we were primarily interested in  www.nature.com/scientificreports/ predicting 45S rDNA copy numbers from 5S rDNA copy numbers, we tested a linear model relating the copy numbers. This linear model is not predictive; the R 2 value is 0.061. As expected, we observe similar trends when analyzing the 28S and 5.8S copy numbers in relation to the 5S number (Fig. 1d,e, Table 1).
Concerted copy number variation in high-coverage 1000 Genomes Project data is weak. We next tested if weak concerted copy number variation was specific to the Simons Simplex Collection dataset. The 1000 Genomes Project recently released a new dataset of higher coverage sequencing data for approximately 2500 samples (Michael Zody, personal communications). The sequencing data from the high-coverage dataset were generated by a single sequencing center with a single library preparation and sequencing method and a target genome coverage of ~ 30×. This sequencing center also generated the Simons Simplex Collection dataset. We estimated rDNA copy number in 2,419 of the high-coverage 1000 Genomes Project samples which displayed normal karyotypes. We observe a weak but significant correlation between each the 18S, 5.8S, and 28S copy numbers with the 5S copy number: The Spearman coefficients are 0.084, 0.111, and 0.118, respectively (Fig. 2a, Table 2, and Supplemental Fig. 2). Despite the significance of the correlation, there is no predictive power to the  www.nature.com/scientificreports/ relationship between the 45S and the 5S copy numbers: For example, a linear model relating the 18S and 5S copy numbers has an R 2 value of 0.005 ( Table 2). The weak concerted copy number variation signal in the high-coverage 1000 Genomes Project dataset is not simply due to an increased number of samples. If we exclusively analyze the high-coverage samples that were a part of the low-coverage study (n = 163), we still observe a much weaker correlation than previously reported: the 18S and 5S correlate with a Spearman coefficient of 0.194 (Fig. 2b, Table 3, and Supplemental Fig. 2). Our results raise the question whether differences in analysis methods or differences in data quality are responsible for the observed discrepancy with the previously reported findings 23 .
Our pipeline reproduces concerted copy number variation observed in low-coverage 1000 Genomes Project data. A key difference between our analysis and the original study in which concerted copy number variation was reported lies in the alignment pipelines and post-alignment corrections. We used BWA for alignment 41 , while bowtie2 was used in the original study 23,42 . Additionally, the original study used a correction for pseudogene content, which we did not perform because no significant differences in concerted copy number variation were observed with or without corrections 23 . To ensure that our analysis pipeline identifies the previously reported concerted copy number variation, we applied it to 163 of the 168 low-coverage 1000 Genomes Project samples previously studied.
Consistent with the original study, our analysis pipeline detected strong concerted copy number variation between the 45S and 5S loci in the low-coverage 1000 Genomes Project data. The 18S, 5.8S, and 28S rRNA gene copy numbers correlate to the 5S copy number with Spearman coefficients of 0.56, 0.79, and 0.69 (Table 4, Fig. 2c, Supplemental Fig. 2). These coefficients are similar to those previously published, which are 0.61, 0.80, and 0.73, respectively 23 . We conclude that the failure to detect strong concerted copy number variation does not arise from differences in copy number estimation methods but is likely due to differences in the datasets used for analysis.
Indeed, we find that the low-coverage and high-coverage data yield different rDNA copy number estimates for the same cell lines. In comparing 18S estimates of the same cell lines in the two different datasets, the rDNA copy numbers only correlate with Spearman coefficients of 0.49. The 5S locus shows a Spearman correlation of 0.52 between the two datasets. These values are far lower than would be expected when analyzing the same cell lines (Fig. 2d,e, Supplemental Fig. 3). The scenario that rDNA copy numbers have changed between samplings for low-and high-coverage data generation is unlikely as there are several reports documenting that rDNA copy number is largely stable in cell lines 14,43 . Taken together, our results are consistent with previous reports that different library preparations of the same samples often yield different rDNA copy number estimates 30 .
Low-coverage 1000 Genomes Project sequencing data come from multiple sources. We wanted to further explore which differences in the low-and high-coverage datasets lead to different rDNA copy number estimates and the lack of meaningful concerted copy number variation. One difference between the datasets is that the low-coverage 1000 Genomes sequencing data were produced by any of seven sequencing centers, while the high-coverage data were produced by a single center. The seven sequencing centers used various library preparation methods 44 , and library preparation methods and batch effects can influence rDNA copy number estimates 30 . Moreover, some samples included reads from multiple library preparations and/or sequencing centers, turning the low-coverage copy number estimates into composite estimates from different libraries. Table 3. Correlations between 45S and 5S rDNA copy numbers in the subset of high-coverage 1000 Genomes Project data also analyzed in the low-coverage dataset. The Spearman correlation coefficient and linear models that describe the relationships between the 45S and 5S rDNA copy numbers are shown. www.nature.com/scientificreports/ For each of the 163 low-coverage 1000 Genomes Project samples, we split the sequencing files by library preparation ID. Library depths varied by nearly two magnitudes: chromosome 1 coverage for individual libraries ranged from 0.24× to 14.4×, with an average of 5.4×. When we analyze rDNA copy number estimates from individual libraries by sequencing center, we find that some centers produced sequence data that biased rDNA copy estimates toward higher or lower values (Supplemental Fig. 4). As it is unlikely that certain centers were assigned samples with abnormally high or low rDNA copy number, the observed bias likely arose through differences in library preparation methods. This bias was not observed in the high-coverage data for the same samples, which supports the notion that sample assignment was not biased.
Of the 163 low-coverage samples, rDNA copy number estimates for 54 samples were based on sequencing data generated from multiple libraries (Fig. 3). Of these 54 samples, some samples, such as NA12154 and NA0700, contained multiple libraries that yielded 18S estimates very similar to both each other and our high-coverage data estimate. Others, such as NA11892 or NA12778, contained one library with an estimate very similar to our high-coverage data estimate but also other libraries with estimates near zero copies. The libraries with severe underestimates of rDNA copy number tended to have lower coverage, suggesting that read depth may influence rDNA copy number estimates.
Sequencing coverage does not account for the magnitude of rDNA copy number differences observed between the high and low-coverage 1000 Genomes Project datasets. We next investigated if depth of read coverage drives the differences in rDNA copy number estimation between the highand low-coverage 1000 Genomes Project datasets. We performed downsampling experiments on four samples spanning the range of rDNA copy numbers estimates from the high-coverage 1000 Genomes Project dataset. Randomly downsampling of the high-coverage data to 400, 300, 200, 100, 50, and 10 million reads ten times each reveals that reducing coverage does affect rDNA copy number estimates to a small degree. Reassuringly, the average of ten independent downsamplings for a sample was close to the copy number estimate of the full dataset (Fig. 4). For example, NA07357 had an estimated copy number of 252 in the high-coverage dataset. Its average copy number from downsampling ten times to 10 million reads was 249. Nevertheless, the range of copy number estimates increased with decreased coverage. At 10 million reads, the range of 18S estimates for NA07357 varied from 222 to 272, while at 100 million reads it varied from 244 to 260.
For comparison, the individual libraries for the low-coverage 1000 Genomes Project samples have a mean coverage of 5.4× for chromosome 1. The comparable samples in the downsampling experiment would be those with 100 million reads, which have a chromosome 1 coverage of approximately 4.5×. For the four samples analyzed, at ~ 4.5× coverage, the copy number estimates are at most ± 13 copies of the full dataset estimate. Even at 10 million reads (~ 0.45× coverage), copy number estimates are at most ± 43 copies of the full dataset estimate. Meanwhile, comparing the high-and low-coverage estimates, the low-coverage libraries underestimated the 18S by an average of 57 copies. The different libraries ranged from underestimating by 371 copies to overestimating by 110 copies as compared to the high-coverage data. These differences are much larger than what can be explained by differences in sequencing coverage alone. We conclude that the vast discrepancies in copy number estimates between datasets are not due to sequencing coverage, and hypothesize that the discrepancies may have arisen through batch effects or technical differences in library preparation.
Simons Simplex Collection and high-coverage 1000 Genomes Project data are likely higher quality than the low-coverage 1000 Genomes Project data. To test the above hypothesis, we analyzed the correlation of copy number estimates of the three 45S components to each other, an approach previously used as a quality metric 1 . Because these components are part of the same array, they should correlate www.nature.com/scientificreports/ highly. As expected, the high-coverage data showed somewhat higher intra-45S correlations than the low-coverage data. For example, the 18S and 28S copy numbers in the high-coverage dataset have a Spearman correlation of 0.97 (Fig. 5a, Table 5). In the 163 samples analyzed in both the high-and low-coverage data, the 18S and 28S copy numbers in the high-coverage dataset showed a Spearman correlation of 0.97 but has a Spearman correlation of 0.92 in the low-coverage dataset (Fig. 5b,c, Tables 6 and 7). This trend is reinforced by analysis of a linear model relating the two copy numbers. For the 163 samples analyzed in both 1000 Genomes Project datasets, the R 2 value for the high-coverage dataset was 0.95 while it was 0.85 for the low-coverage dataset (Tables 6 and  7). Analysis of the correlation of the 18S to the 5.8S and the 28S to the 5.8S show the same trends between the three datasets (Supplemental Fig. 5) The improved intra-45S correlations suggest that the rDNA copy number estimates in the high-coverage 1000 Genomes Project dataset are of higher quality. There are three different data quality metrics that can be analyzed with the Simons Simplex Collection data. As with the 1000 Genomes Project data, we first assessed intra-45S correlations, finding similarly high Spearman correlations: The 18S and 28S copy numbers showed a Spearman correlation of 0.943, and a linear model relating the two had an R 2 value of 0.90 (Fig. 5d). The family structure of the Simons Simplex Collection permits analysis of rDNA copy number heritability. We found high heritability of both 45S and 5S rDNA array components (h 2 = 0.94 for the 18S, h 2 = 0.92 for the 5S) (Fig. 5e,f, Supplemental Fig. 6).
Unique to this study, we can also assess reproducibility of rDNA copy number estimates through use of monozygotic twins and duplicate samples. The Simons Simplex Collection includes four pairs of monozygotic twins that showed perfect correlation of 18S copy number. Additionally, some individuals in the Simons Simplex Collection enrolled in other studies and were therefore sequenced twice, albeit by the same sequencing facility. These duplicate samples were identified from shared SNVs, and would be expected to share rDNA copy numbers. Although duplicate samples can show considerable deviation from one another due to technical issues 30 , we found reasonably high correlation between the 18S copy number estimates arising from the duplicated samples (Spearman correlation 0.81, Fig. 5g, Table 8). The 5S copy number estimates also showed high correlation between the twin and duplicated samples (Spearman correlation 0.902, Fig. 5h). Similar to the comparisons between the high-and low-coverage 1000 Genomes Project data, the correlation of 28S and 5.8S estimates in the twin and duplicate samples of the SSC was lower than that of the 18S (Supplemental Fig. 6). These values are still substantially higher than the correlation values between the same regions in the high-and low-coverage 1000 Genomes Project, indicating that the replicate sequencing events in the Simons Simplex Collection produce more similar rDNA copy number estimates. Together, the heritability data, intra-45S correlations, and duplicated samples give confidence in the Simons Simplex Collection rDNA copy number estimates.
We conclude that co-variation between the 45S and 5S rDNA arrays in both high-coverage datasets is weak and does not allow prediction of the copy number at one array based on the copy number at the other. This lack of meaningful concerted copy number variation holds true regardless of whether data were generated using   Table 6. Correlations between the three rRNA genes encoded in the 45S repeat unit in the subset of highcoverage 1000 Genomes Project data also analyzed in the low-coverage dataset. The Spearman correlation coefficient and linear models that describe the relationships between the 18S, 5.8S, and 28S copy numbers are shown. www.nature.com/scientificreports/ lymphoblastoid cell lines or whole blood samples. The previously observed concerted copy number variation in the low-coverage dataset appears to be an artifact of lower data quality.

Discussion
In this study, we sought to further explore the relationship between the copy numbers of the 45S and 5S rDNA arrays. We have previously reported that short-read sequencing estimates of rDNA copy number genotypes are error-prone 30 . Given the previously published concerted copy number variation of the 5S and 45S rDNA arrays in humans, we thought this co-variation may provide a useful metric to predict 45S rDNA copy number from 5S copy number, the latter of which can be readily obtained by pulsed-field gel electrophoresis. Working with two new, high-coverage sequencing datasets, we found weak concerted copy number variation that was not predictive. Sequencing coverage alone did not explain the discrepancy in copy number estimates or concerted copy number variation between samples sequenced in both the original low-coverage 1000 Genomes dataset and the newer high-coverage 1000 Genomes dataset. Available sequence data for many samples of the low-coverage dataset were derived from multiple library preparations from multiple sequencing centers. We suspect that differences in library preparation methods lead to these considerable discrepancies in rDNA copy number estimates. Because we were able to recapitulate the previously observed concerted copy number variation from the low coverage 1000 Genomes Project samples, the differences in results between the datasets stem from sample preparation methods, not analysis methods.
In addition to the promise of a predictive model, the previously reported concerted copy number variation had fascinating implications for biology, in particular for genome maintenance and evolution. Selection for strongly concerted copy number variation suggests the existence of mechanisms that "count" and adjust copy numbers accordingly, operating across several genomic loci on separate chromosomes. It is not fully understood how quantities of the rRNA products of the 45S and 5S arrays are balanced to facilitate ribosome biogenesis. Concerted copy number variation could have provided a possible mechanism of maintaining proper rRNA dosage by means of balancing their genomic templates. However, vast stretches of 45S rDNA repeats are typically silenced, resulting in the total number of 45S rDNA copies not necessarily reflecting the number of transcribed copies. Less is known about 5S regulation in mammals, though 5S silencing is prevalent in species such as A. thaliana and Xenopus species 27,[45][46][47][48] . The abundance of 45S silencing suggests that maintaining concerted copy number variation of the rRNA genes is not crucial for balancing rRNA levels.
In support of our findings, this lack of a meaningful correlation between 45 and 5S arrays was previously observed in model organism studies. Co-variation between 45 and 5S rDNA copy numbers was weak in a large mutant collection of over 2000 strains and 40 natural isolates of Caenorhabditis elegans, which estimated rDNA copy number from high-coverage sequence data 49 . Meaningful covariation between these arrays was also found to be lacking in a study of various ecotypes of A. thaliana, and separately in several species of fish 50,51 . In short, our finding that copy numbers at the 45S and 5S arrays show little correlation is biologically plausible and supported by studies of model organism rDNA.
Some may argue that the observed differences in rDNA copy number between the low-and high-coverage 1000 Genomes datasets stem from biological differences; i.e. that the studied samples have acquired changes in rDNA copy number. We consider this an unlikely scenario. The DNA used for each sequencing effort was extracted from lymphoblastoid cell lines, which were propagated for an unknown number of generations from a common stock between each study. However, rDNA copy number has been reported as stable both in cell lines 14,43 Table 7. Correlations between the three rRNA genes encoded in the 45S repeat unit in the low-coverage 1000 Genomes Project dataset. The Spearman correlation coefficient and linear models that describe the relationships between the 18S, 5.8S, and 28S copy numbers are shown.  30,49 , except for when rDNA copy number is reduced to a level that causes fitness defects such as in Saccharomyces cerevisiae or Drosophila melanogaster 52,53 . We present our results as a cautionary tale about the challenges of genotyping repetitive DNA. While our results suggest that the high-coverage uniform sequencing performed by the New York Genome Center for the updated 1000 Genomes Project and the Simons Simplex Collection likely yielded more accurate rDNA copy number estimates, there is no certainty without validation through alternate methods. We are reassured by the fact that the observed lack of meaningful covariation between 45 and 5S rDNA copy numbers is consistent with current biological knowledge. We therefore trust our finding that there is no association of rDNA copy number with autism spectrum disorder. However, in light of our findings, we posit that care should be taken when drawing biological conclusions based on rDNA copy number estimates from short-read whole genome sequencing data. Changes in rDNA copy number have been reported for some cancers and in aging, prompting speculation about their role as drivers or essential players in both processes. A subset of these findings rely on rDNA copy number estimates from short read sequencing data 1,8,23 . However, even studies that develop their own rDNA copy number estimation methods still compare to short read sequencing data to comment on their accuracy 4,13 . Some of these findings may need to be re-evaluated by applying multiple data quality metrics to the analyzed sequence data, by conducting uniform, high-coverage re-sequencing, or by validating rDNA copy number through alternative approaches. Even still, better sequencing and bioinformatics methods must be developed for the accurate assessment of repetitive DNA regions so that repeat copy numbers can be calculated unambiguously. Going forward, we hope that these results will caution researchers about drawing firm conclusions from rDNA copy number estimates based on short-read sequencing data, as public data repositories often contain samples generated with various methods and by different sources.

Methods
Alignment and copy number estimation. Sequence analysis. Samtools version 1.9 and bwa version 0.7.15 were used for all analyses.
Reference sequences for the 45S ribosomal DNA (U13369.1), 5S ribosomal DNA (X12811.1), mtDNA, and chromosome 1 of the human genome (GRCh38 reference) were downloaded from the NCBI nucleotide database with GenBank IDs as indicated in parenthesis. CRAM files were converted to fastq by Samtools fastq and aligned to the appropriate reference sequence by bwa mem with default parameters and converted to CRAM files with samtools view. Per-base read depth was calculated with Samtools depth outputting all positions, with the-d 0 flag to eliminate a maximum read depth cutoff.
To estimate ribosomal DNA copy number, average read depth across the whole 5S or 5.8S coding sequence was divided by the average chromosome 1 read depth. These regions correspond to positions 271-391 of X12811.1 for the 5S and positions 6623-6779 of U13369.1 for the 5.8S For the 18S and 28S subunits, segments of these genes previously used for concerted copy number variation analysis were used, and the average read depth at these regions was divided by the chromosome 1 depth 23