Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

SkewDB, a comprehensive database of GC and 10 other skews for over 30,000 chromosomes and plasmids

Abstract

GC skew denotes the relative excess of G nucleotides over C nucleotides on the leading versus the lagging replication strand of eubacteria. While the effect is small, typically around 2.5%, it is robust and pervasive. GC skew and the analogous TA skew are a localized deviation from Chargaff’s second parity rule, which states that G and C, and T and A occur with (mostly) equal frequency even within a strand. Different bacterial phyla show different kinds of skew, and differing relations between TA and GC skew. This article introduces an open access database (https://skewdb.org) of GC and 10 other skews for over 30,000 chromosomes and plasmids. Further details like codon bias, strand bias, strand lengths and taxonomic data are also included. The SkewDB can be used to generate or verify hypotheses. Since the origins of both the second parity rule and GC skew itself are not yet satisfactorily explained, such a database may enhance our understanding of prokaryotic DNA.

Measurement(s) Imbalances in the use of DNA nucleotides
Technology Type(s) Next Generation Sequencing
Factor Type(s) Position within DNA sequence • Organism type
Sample Characteristic - Organism bacterium • archaea
Sample Characteristic - Environment Varying
Sample Characteristic - Location World

Background & Summary

The phenomenon of GC skew1,2,3 is tantalizing because it enables a simple numerical analysis that accurately predicts the loci of both the origin and terminus of replication in most bacteria and some archaea4,5.

Bacterial DNA is typically replicated simultaneously on both strands, starting at the origin of replication6. Both replication forks travel in the 5’ to 3’ direction, but given that the replichores are on opposite strands, topologically they are traveling in opposite directions. This continues until the forks meet again at the terminus. This means that roughly one half of every strand is replicated in the opposite direction of the other half. The forward direction is called the leading strand. Many plasmids also replicate in this way7.

The excess of G over C on the leading strand is in itself only remarkable because of Chargaff’s somewhat mysterious second parity rule8, which states that within a DNA strand, there are nearly equal numbers of G’s and C’s, and similarly, T’s and A’s. This rule does not directly follow from the first parity rule, which is a simple statement of base pairing rules.

Depending on who is asked, Chargaff’s second parity rule is so trivial that it needs no explanation, or it requires complex mathematics and entropy considerations to explain its existence9.

The origins of GC skew are still being debated. The leading and lagging strands of circular bacterial chromosomes are replicated very differently; it is at least plausible that this leads to different mutational biases. In addition, there are selection biases that are theorized to be involved10. No single mechanism may be exclusively responsible.

This article does not attempt to explain or further mystify11 the second parity rule or GC skew, but it may be that the contents of the SkewDB can contribute to our further understanding.

The SkewDB also contains some hard to explain data on many chromosomes. These include highly asymmetric skew, but also very disparate strand lengths. Conversely, the SkewDB confirms earlier work on skews in the Firmicute phylum12, and also expands on these earlier findings.

GC skew has often been investigated by looking at windows of DNA of a certain size. GC skew is computed as (G−C)/(G + C) in a window of N bases, where G is the number of guanines and C the number of cytosines in that window. It has been found that the choice of window size impacts the results of the analysis. The SkewDB is therefore based exclusively on cumulative skew13, which sidesteps window size issues. For example, the sequence GGGCCC has a cumulative GC skew of zero, and as we progress through the sequence, this skew takes on values 1, 2, 3, 2, 1, 0. If the window size N is 6, the non-cumulative skew is also 0.

The result of a cumulative GC skew analysis is shown in Fig. 1. The analysis software fits a linear model on the skews, where it also compensates for chromosome files sequenced in the non-canonical direction, or where the origin of replication is not at the start of the file.

Fig. 1
figure 1

Sample graph showing SkewDB data for Lactiplantibacillus plantarum strain LZ95 chromosome.

GC skew analysis is not new. As noted below, the DoriC database for example contains related data that is more precise for its stated purpose (finding the Origin of replication). The SkewIT database4 similarly provides a metric of skew, and also comes with an online analysis tool.

Other work, like the Comparative Genometrics Database14 and the Z Curve Database15 has been foundational, but by dint of their age lack an analysis of the tens of thousands of DNA sequences that have become available since the initial availability of these databases.

SkewDB is funded to be updated monthly with the latest sequences from NCBI until 2026.

Other software that calculates GC skew is available, like for example GraphDNA16, GC Skewing17 and GenSkew. The SkewDB delivers far more metrics however, also because it involves annotation data in its calculations. For ease of use, SkewDB is made available as a ready to use database, as well as in software form that reproduces this database exactly.

Methods

The SkewDB analysis relies exclusively on the tens of thousands of FASTA and GFF3 files available through the NCBI download service, which covers both GenBank and RefSeq. The database includes bacteria, archaea and their plasmids.

Furthermore, to ease analysis, the NCBI Taxonomy database is sourced and merged so output data can quickly be related to (super)phyla or specific species.

No other data is used, which greatly simplifies processing. Data is read directly in the compressed format provided by NCBI. All results are emitted as standard CSV files.

In the first step of the analysis, for each organism the FASTA sequence and the GFF3 annotation file are parsed. Every chromosome in the FASTA file is traversed from beginning to end, while a running total is kept for cumulative GC and TA skew. In addition, within protein coding genes, such totals are also kept separately for these skews on the first, second and third codon position. Furthermore, separate totals are kept for regions which do not code for proteins.

In addition, to enable strand bias measurements, a cumulative count is maintained of nucleotides that are part of a positive or negative sense gene. The counter is increased for positive sense nucleotides, decreased for negative sense nucleotides, and left alone for non-genic regions. A separate counter is kept for non-genic nucleotides.

Finally, G and C nucleotides are counted, regardless of if they are part of a gene or not.

These running totals are emitted at 4096 nucleotide intervals, a resolution suitable for determining skews and shifts.

In addition, one line summaries are stored for each chromosome. These lines include the RefSeq identifier of the chromosome, the full name mentioned in the FASTA file, plus counts of A, C, G and T nucleotides. Finally five levels of taxonomic data are stored.

Chromosomes and plasmids of fewer than 100 thousand nucleotides are ignored, as these are too noisy to model faithfully. Plasmids are clearly marked in the database, enabling researchers to focus on chromosomes if so desired.

Fitting

Once the genomes have been summarised at 4096-nucleotide resolution, the skews are fitted to a simple model.

The fits are based on four parameters, as shown in Fig. 1. Alpha1 and alpha2 denote the relative excess of G over C on the leading and lagging strands. If alpha1 is 0.046, this means that for every 1000 nucleotides on the leading strand, the cumulative count of G excess increases by 46.

The third parameter is div and it describes how the chromosome is divided over leading and lagging strands. If this number is 0.557, the leading replication strand is modeled to make up 55.7% of the chromosome.

The final parameter is shift (the dotted vertical line), and denotes the offset of the origin of replication compared to the DNA FASTA file. This parameter has no biological meaning of itself, and is an artifact of the DNA assembly process.

The goodness-of-fit number consists of the root mean squared error of the fit, divided by the absolute mean skew. This latter correction is made to not penalize good fits for bacteria showing significant skew.

GC skew tends to be defined very strongly, and it is therefore used to pick the div and shift parameters of the DNA sequence, which are then kept as a fixed constraint for all the other skews, which might not be present as clearly.

The fitting process itself is a downhill simplex method optimization18 over the four dimensions, seeded with the average observed skew over the whole genome, and assuming there is no shift, and that the leading and lagging strands are evenly distributed. To ensure that the globally optimum fit is (very likely) achieved, ten optimization attempts are made from different starting points. This fitting process is remarkably robust in the sense that even significant changes in parameters or fitting strategies cause no appreciable change in the results.

For every chromosome and plasmid the model parameters are stored, plus the adjusted root mean squared error.

Both for quality assurance and ease of plotting, individual CSV files are generated for each chromosome, again at 4096 nucleotide resolution, but this time containing both the actual counts of skews as well as the fitted result.

Some sample findings

To popularize the database, an online viewer is available on https://skewdb.org/view. While this article makes no independent claims to new biological discoveries, the following sections show some results gathered from a brief study of the database. Some of these observations may be of interest for other researchers.

GC and TA skews

Most bacteria show concordant GC and TA skew, with an excess of G correlating with an excess of T. This does not need to be the case however. Figure 2 is a scatterplot that shows the correlation between the skews for various major superphyla. Firmicutes (part of the Terrabacteria group) show clearly discordant skews.

Fig. 2
figure 2

Scatter graph of 25,000 chromosomes by superphylum, GC skew versus TA skew.

Firmicute prediction

In many bacteria, genes tend to concentrate on the leading replication strand. If the codon bias of genes is such that they are relatively rich in one nucleotide, the “strand bias” may itself give rise to GC or TA bias. Or in other words, if genes contain a lot of G’s and they huddle on the leading strand, that strand will show GC skew. As an hypothesis, we can plot the observed GC and TA skews for all Firmicutes for which we have data.

Mathematically the relation between the codon bias, strand bias and predicted GC skew turns out to be a simple multiplication. In Fig. 3, the x-axis represents this multiplication. The y-axis represents the GC and TA skew ratio.

Fig. 3
figure 3

Predicted versus actual GC/TA skew for 4093 Firmicutes.

It can clearly be seen that both GC and TA skew correlate strongly with the codon/strand bias product. TA skew goes to zero with the two biases, but GC skew appears to persist even in the absence of such biases.

Figure 4 shows the situation within an individual chromosome (C. difficile), based on overlapping 40960-nucleotide segments. On the x-axis we find the strand bias for these segments, running from entirely negative sense genes to entirely positive sense genes. The skew is meanwhile plotted on the y-axis, and here too we see that TA skew goes to zero in the absence of strand bias, while GC skew persists and clearly has an independent strand-based component.

Fig. 4
figure 4

Scatter graph of codon/strand bias versus GC/TA skew for C. difficile.

Asymmetric skew

The vast majority of chromosomes show similar skews on the leading and the lagging replication strands, something that makes sense given the pairing rules. There are however many chromosomes that have very asymmetric skews, with one strand sometimes showing no skew at all. In Fig. 5 four chromosomes are shown that exhibit such behavior. The SkewDB lists around 250 chromosomes where one strand has a GC skew at least 3 times bigger/smaller than the other one.

Fig. 5
figure 5

Chromosomes with asymmetric skews.

Asymmetric strands

Bacteria tend to have very equally sized replication strands, which is also an optimum for the duration of replication. It is therefore interesting to observe that GC skew analysis finds many chromosomes where one strand is four times larger than the other strand. In Fig. 6 four such chromosomes are shown. The SkewDB lists around 100 chromosomes where one strand is at least three times as large as the other strand.

Fig. 6
figure 6

Chromosomes with differing strand lengths.

Anomalies

In many ways, GC skew is like a forensic record of the historical developments in a chromosome. Horizontal gene transfer, inversions, integration of plasmids, excisions can all leave traces. In addition, DNA sequencing or assembly artifacts will also reliably show up in GC graphs, as elucidated with examples in4.

Figure 7 shows GC and TA skews for Salmonella enterica subsp. enterica serovar Concord strain AR-0407 (NZ_CP044177.1), and many things could be going on here. The peaks might correspond to multiple origins of replication, but might also indicate inversions or DNA assembly problems.

Fig. 7
figure 7

GC and TA skew for Salmonella enterica subsp. enterica serovar Concord strain AR-0407.

Data Records

The SkewDB is available through https://skewdb.org, where it is frequently (& automatically) refreshed. A snapshot of the data has also been deposited on Dryad19.

The SkewDB consists of several CSV files: skplot.csv, results.csv, genomes.csv and codongc.csv. In addition, for each chromosome or plasmid, a separate _fit.csv file is generated, which contains data at 4096-nucleotide resolution.

skplot.csv contains all the 4096-nucleotide resolution data as one big file for all processed chromosomes and plasmids. The parameters are described in Table 1.

Table 1 Fields of skplot.csv.

results.csv meanwhile contains the details of the fits. In this Table 2, all marked out squares exist. The actual fields are called alpha1gc, alpha2gc, gcRMS, alpha1ta, alpha2ta etc. DNA sequence shift and div are also specified, and they come from the GC skew. gc0-2, ta0-2 refers to codon position. gcng and tang refer to the non-coding region skews. Finally sb denotes the strand bias.

Table 2 Skew metrics.

Table 3 documents the data on codon bias, also split out by leading or lagging strand found in codongc.csv.

Table 3 Fields in codongc.csv.

Table 4 documents the fields found in genomes.csv:

Table 4 Fields in genomes.csv.

Finally, the individual _fit.csv files contain fields called “Xskew” and “predXskew” to denote the observed X = gc, ta etc skew, plus the prediction based on the parameters found in results.csv.

Technical Validation

This database models the skews of many chromosomes and plasmids. Validation consists of evaluating the goodness-of-fit compared to the directly available numbers.

The SkewDB fits skews to a relatively simple model of only four parameters. This prevents overfitting, and this model has proven to be robust in practice. Yet, when doing automated analysis of tens of thousands of chromosomes, mistakes will be made. Also, not all organisms show coherent GC skew.

Fig. 8 shows 16 equal sized quality categories, where it is visually clear that the 88% best fits are excellent. It is therefore reasonable to filter the database on RMSgc < 0.16. Or conversely, it could be said that above this limit interesting anomalous chromosomes can be found.

Fig. 8
figure 8

SkewDB fits for 16 equal sized quality categories of bacterial chromosomes.

The DoriC database5 contains precise details of the location of the origin of replication. 2267 sequences appear both in DoriC and in the SkewDB. The DoriC origin of replication should roughly be matched by the “shift” metric in the SkewDB (but see Usage notes). For 90% of sequences appearing in both databases, there is less than 5% relative chromosome distance between these two independent metrics. This is encouraging since these two numbers do not quite measure the same thing.

On a similar note, the DnaA gene is typically (but not necessarily) located near the origin of replication. For over 80% of chromosomes, DnaA is found within 5% of the SkewDB “shift” metric. This too is an encouraging independent confirmation of the accuracy of the data.

Finally, during processing numbers are kept of the start and stop codons encountered on all protein coding genes on all chromosomes and plasmids. These numbers are interesting in themselves (because they correlate with GC content, for example), but they also match published frequencies, and show limited numbers of non-canonical start codons, and around 0.005% anomalous stop codons. This too confirms that the analyses are based on correct (annotation) assumptions.

Usage Notes

The existential limitation of any database like the SkewDB is that it does not represent the distribution of organisms found in nature. The sequence and annotation databases are dominated by easily culturable microbes. And even within that selection, specific (model) organisms are heavily oversampled because of their scientific, economic or medical relevance.

Because of this, care should be taken to interpret numbers in a way that takes such over- and undersampling into account. This leaves enough room however for finding correlations. Some metrics are sampled so heavily that it would be a miracle if the unculturable organisms were collectively conspiring to skew the statistics away from the average. In addition, the database is a very suitable way to test or generate hypotheses, or to find anomalous organisms.

Finally it should be noted that the SkewDB tries to precisely measure the skew parameters, but it makes no effort to pin down the Origin of replication exactly. For such uses, please refer to the DoriC database5. In future work, the SkewDB will attempt to use OriC motifs to improve fitting of this metric.

On https://skewdb.org an explanatory Jupyter20 notebook can be found that uses Matplotlib21 and Pandas22 to create all the graphs from this article, and many more. In addition, this notebook reproduces all numerical claims made in this work. The SkewDB website also provides links to informal articles that further explain GC skew, and how it could be used for research.

Code availability

The SkewDB is produced using the Antonie DNA processing software (https://github.com/berthubert/antonie2), which is open source. In addition the pipeline is fully automated and reproducible, including the retrieval of sequences, annotations and taxonomic data from the NCBI website. The software has also been deposited with Zenodo23,24.

A GitHub repository is available for this article on https://github.com/berthubert/skewdb-articles, which includes this reproducible pipeline, plus a script that regenerates all the graphs and numerical claims from this paper.

References

  1. Frank, A. C. & Lobry, J. R. Asymmetric substitution patterns: a review of possible underlying mutational or selective mechanisms. Gene 238, 65–77 (1999).

    CAS  Article  Google Scholar 

  2. Marín, A. & Xia, X. GC skew in protein-coding genes between the leading and lagging strands in bacterial genomes: New substitution models incorporating strand bias. Journal of Theoretical Biology 253, 508–513, https://doi.org/10.1016/j.jtbi.2008.04.004 (2008).

    ADS  CAS  Article  PubMed  MATH  Google Scholar 

  3. Quan, C.-L. & Gao, F. Quantitative analysis and assessment of base composition asymmetry and gene orientation bias in bacterial genomes. FEBS Letters 593, 918–925, https://doi.org/10.1002/1873-3468.13374 (2019).

    CAS  Article  PubMed  Google Scholar 

  4. Lu, J. & Salzberg, S. L. SkewIT: The Skew Index Test for large-scale GC Skew analysis of bacterial genomes. PLOS Computational Biology 16, e1008439, https://doi.org/10.1371/journal.pcbi.1008439 (2020).

    ADS  CAS  Article  PubMed  PubMed Central  Google Scholar 

  5. Luo, H. & Gao, F. DoriC 10.0: an updated database of replication origins in prokaryotic genomes including chromosomes and plasmids. Nucleic Acids Research 47, D74–D77, https://doi.org/10.1093/nar/gky1014 (2019).

    CAS  Article  PubMed  Google Scholar 

  6. ODonnell, M., Langston, L. & Stillman, B. Principles and concepts of DNA replication in bacteria, archaea, and eukarya. Cold Spring Harbor Perspectives in Biology 5, a010108–a010108, https://doi.org/10.1101/cshperspect.a010108 (2013).

    CAS  Article  Google Scholar 

  7. Lilly, J. & Camps, M. Mechanisms of theta plasmid replication. Microbiology Spectrum 3, https://doi.org/10.1128/microbiolspec.plas-0029-2014 (2015).

  8. Rudner, R., Karkas, J. D. & Chargaff, E. Separation of B. subtilis DNA into complementary strands. 3. Direct analysis. Proceedings of the National Academy of Sciences 60, 921–922, https://doi.org/10.1073/pnas.60.3.921 (1968).

    ADS  CAS  Article  Google Scholar 

  9. Fariselli, P., Taccioli, C., Pagani, L. & Maritan, A. DNA sequence symmetries from randomness: the origin of the Chargaff’s second parity rule. Briefings in Bioinformatics bbaa041, https://doi.org/10.1093/bib/bbaa041 (2020).

  10. Tillier, E. R. & Collins, R. A. The Contributions of Replication Orientation, Gene Direction, and Signal Sequences to Base-Composition Asymmetries in Bacterial Genomes. Journal of Molecular Evolution 50, 249–257, https://doi.org/10.1007/s002399910029 (2000).

    ADS  CAS  Article  PubMed  Google Scholar 

  11. Zhang, R. & Zhang, C.-T. A Brief Review: The Z-curve Theory and its Application in Genome Analysis. Current genomics 15, 78–94, https://doi.org/10.2174/1389202915999140328162433 Publisher: Bentham Science Publishers. (2014).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  12. Charneski, C. A., Honti, F., Bryant, J. M., Hurst, L. D. & Feil, E. J. Atypical AT Skew in Firmicute Genomes Results from Selection and Not from Mutation. PLOS Genetics 7, e1002283, https://doi.org/10.1371/journal.pgen.1002283 (2011).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  13. Grigoriev, A. Analyzing genomes with cumulative skew diagrams. Nucleic Acids Research 26, 2286–2290, https://doi.org/10.1093/nar/26.10.2286 (1998).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  14. Roten, C.-A. H. Comparative genometrics (CG): a database dedicated to biometric comparisons of whole genomes. Nucleic Acids Research 30, 142–144, https://doi.org/10.1093/nar/30.1.142 (2002).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  15. Zhang, C.-T., Zhang, R. & Ou, H.-Y. The z curve database: a graphic representation of genome sequences. Bioinformatics 19, 593–599, https://doi.org/10.1093/bioinformatics/btg041 (2003).

    CAS  Article  PubMed  Google Scholar 

  16. Thomas, J. M., Horspool, D., Brown, G., Tcherepanov, V. & Upton, C. GraphDNA: a java program for graphical display of DNA composition analyses. BMC Bioinformatics 8, https://doi.org/10.1186/1471-2105-8-21 (2007).

  17. Grigoriev, A. Analyzing genomes with cumulative skew diagrams. Nucleic Acids Research 26, 2286–2290, https://doi.org/10.1093/nar/26.10.2286 (1998).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  18. Nelder, J. A. & Mead, R. A simplex method for function minimization. The Computer Journal 7, 308–313, https://doi.org/10.1093/comjnl/7.4.308 (1965).

    MathSciNet  Article  MATH  Google Scholar 

  19. Hubert, B. Skewdb: A comprehensive database of gc and 10 other skews for over 28, 000 chromosomes and plasmids. Dryad https://doi.org/10.5061/DRYAD.G4F4QRFR6 (2021).

  20. Kluyver, T. et al. Jupyter notebooks – a publishing format for reproducible computational workflows. In Loizides, F. & Schmidt, B. (eds.) Positioning and Power in Academic Publishing: Players, Agents and Agendas, 87 – 90 (IOS Press, 2016).

  21. Hunter, J. D. Matplotlib: A 2d graphics environment. Computing in Science & Engineering 9, 90–95, https://doi.org/10.1109/MCSE.2007.55 (2007).

    ADS  Article  Google Scholar 

  22. Reback, J. et al. pandas-dev/pandas: Pandas 1.3.2. zenodo https://doi.org/10.5281/zenodo.5203279 (2021).

  23. Hubert, B., Beaumont Lab. berthubert/antonie2: Skewversion 1.0. zenodo https://doi.org/10.5281/ZENODO.5516524 (2021).

  24. Hol, F. J. H., Hubert, B., Dekker, C. & Keymer, J. E. Density-dependent adaptive resistance allows swimming bacteria to colonize an antibiotic gradient. The ISME Journal 10, 30–38, https://doi.org/10.1038/ismej.2015.107 (2016).

    CAS  Article  PubMed  Google Scholar 

Download references

Acknowledgements

I would like to thank Bertus Beaumont for helping me to think like a biologist, and Jason Piper for regularly pointing me to the relevant literature. In addition, I am grateful that Felix Hol kindly allowed me to field test my software on his DNA sequences. Twitter users @halvorz and @Suddenly_a_goat also provided valuable feedback.

Author information

Authors and Affiliations

Authors

Contributions

B.H. did all the work.

Corresponding author

Correspondence to Bert Hubert.

Ethics declarations

Competing interests

The author declares no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Hubert, B. SkewDB, a comprehensive database of GC and 10 other skews for over 30,000 chromosomes and plasmids. Sci Data 9, 92 (2022). https://doi.org/10.1038/s41597-022-01179-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1038/s41597-022-01179-8

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing