Omics databases are widely used in life sciences research. Scientific investigators, some with limited bioinformatics experience, perform analyses with omics databases under the assumption that they are reliable, although that may not always be the case. For example, two COVID-19 research articles were retracted because analyses were based on an unreliable data registry1,2. Concerningly, omics resources rarely provide sex annotation or allow for sex-specific analysis. This diminishes the value of these resources as we increasingly strive to incorporate sex as a biological variable in research. Here we aim to bring attention to the innate bias of omics resources and provide recommendations for addressing this limitation.

The problem

Sex differences in molecular, cellular and organismal biology accrue from the time of fertilization and broadly influence normal development3. Studying merged male and female datasets can mask differences that are only revealed when each sex is considered individually4. Historically, male subjects have been over-represented in animal and human research owing to concerns that the hormonal variability of females confounds results5, and the chromosomal sex of cell lines has largely been ignored6. Without justification, results from these male-dominant or sex-agnostic studies are assumed to apply equally to both sexes. When comparing female or mixed-sex data to a male standard, false negatives can arise or results may be misinterpreted (Fig. 1). Conversely, there are instances when female subjects are over-represented (for example, breast cancer and autoimmune diseases), which results in bias against males. This inattention to sex in basic science studies has, in some cases, harmed patients7,8 and may slow scientific progress.

Fig. 1: Analyzing disaggregated male and female data through the perspective of databases that were built upon sex-biased studies (prism) could give rise to misleading results.

a, If the database is male-biased but there are truly no sex differences in the system, the output will be accurate for both male and female. b–d, If sex differences exist and the database is male-biased, results could be accurate for males but have lower significance in females (b), be incomplete in females (c), or be uninformative in females (d). e, If sex differences exist and the database is female-biased, results may be uninformative for male data. f, If the database annotates for sex, thereby allowing for truly sex-specific analyses, male and female outputs can be both different and accurate.

Some organizations have raised awareness of the importance of considering sex in research. The US National Institutes of Health (NIH) now requires the incorporation of sex as a biological variable in the design of all funded studies9, and the Horizon Europe program intends to do the same10. Some journals follow ARRIVE guidelines11 and mandate disclosure of the sex of subjects used in the study. While these initiatives are important steps toward ensuring sex equity in research, they are not universally adopted and do not rectify the decades of biased work on which current omics resources are built.

The state of sex annotation in omics resources

Omics resources compile the results of thousands of studies to summarize biological relationships. While some investigators regularly consider sex as a biological variable, the NIH has determined that basic and preclinical research continues to suffer from the over-representation of males9. This in turn gives rise to bias in primary data repositories (for example, GEO (Gene Expression Omnibus)12) unless the resource requires sex annotation upon submission (for example, TCGA (The Cancer Genome Atlas)13, GTEx (Genotype–Tissue Expression)14).

There are currently 702 cataloged resources that collectively document all known biological pathways and molecular interactions across 24 organisms15. Of these, 370 (53%) provide references to the primary publications that originally described the knowledge. Among five of the most-cited resources, from which several third-party analysis tools are built, all provide citations but none annotate the sex of the subjects that generated the results (Table 1).

Table 1 Five of the most highly cited public omics resources do not annotate terms by sex

While some resources with niche interests (for example, DICE16 (the Database of Immune Cell Expression, Expression Quantitative Trait Loci (eQTLs) and Epigenomics)) acknowledge the biological importance of sex and have incorporated it into their querying tools, most have yet to adopt this practice. These resources are often used for functional genomic analyses, so research that employs them, even when sex is considered in the experimental design, discounts the many molecular mechanisms by which males and females fundamentally differ. It is important to recognize that using these databases as a standard to evaluate both sexes may give rise to misleading results.

Mechanisms by which sex differences arise

At the most fundamental level, X inactivation and the presence or absence of a Y chromosome drive sex determination. However, sex chromosomes alone cannot explain the innumerable differences between males and females. A striking example of this is androgen insensitivity syndrome, a condition in which individuals have an XY karyotype but female characteristics as a result of a nonfunctional androgen receptor.

Across the genome, there are no sex differences in the frequency of single nucleotide polymorphisms17, and only a few sex differences in rare copy number variations have been described18. There are conflicting reports of sex differences in telomere length, telomere attrition rate, and the relationships between telomeres and aging. Males and females accumulate nuclear and mitochondrial DNA mutations at different rates and loci, which may contribute to differences in aging and oncogenesis19. While there is some sex-based variance in DNA, differences are largely thought to arise at the level of gene expression20,21.

When males and females have different fitness optima for the same trait, divergent evolutionary selection can cause sexual dimorphism in a characteristic that was once shared. These selective pressures may act on regulatory factors that can profoundly influence phenotype. Divergent evolution of regulatory factors is increasingly recognized as a contributor to sex differences22, but their variability and poor characterization make them challenging to identify. Still, sex differences in both coding and regulatory regions have been identified across 29 normal human tissues21.

Similar gene expression does not prove the absence of sex differences since the same gene can give rise to two distinct phenotypes in males and females. For example, the male and female glioblastoma transcriptomes are similar, yet cell-cycle- and integrin-related genes are associated with survival in a sex-specific manner4. Similarly, modeling approaches have revealed that chronic obstructive pulmonary disease in males and females is driven by distinct metabolism and mitochondrial networks in the absence of differential expression23. Conversely, the same phenotype can be driven by distinct genetic pathways. In a study of over 100,000 humans, 13 complex phenotypes showed genetic heterogeneity between males and females, and genomic prediction using sex-specific models outperformed a sex-agnostic model24.

Further complexity arises from the effects of environmental exposures and hormonal interactions on molecular phenotypes3,17,25. In response to endogenous and exogenous factors, epigenetic modifications regulate the accessibility of DNA to transcriptional machinery26,27. This sex-influenced chromatin remodeling can cause differential gene expression in response to the same stimulus28,29. Sex hormones can directly modulate the function of transcription factors and other proteins, thereby giving rise to sex-specific regulatory networks23,30,31. In this way, identical phenotypes could be generated by two distinct networks in males and females, and diverse transcriptional responses could be generated by the same signal. Network modeling and systems-based approaches have an elevated sensitivity to sex differences21,23,31, so the consequences of neglecting sex in these analyses can be more profound than when considering genes individually.

The importance of incorporating demographic information into primary databases is illustrated by considering immunology research. Women exhibit greater immune responsiveness to acute infection and vaccines than men, even when matched for pathogen load32. This heightened antigen-specific immune response contributes to the female bias in autoimmune diseases32 and may protect young women from cancer33. Sex differences in the immune response are not evident in infants and children, suggesting that immunity is modified over the lifespan as a function of age, gonadal and adrenal steroid hormones, and environmental exposures32. Thus, analytical tools that are based upon pooled gene-expression data, without regard to the sex or age of the donor, are not necessarily sensitive or specific when applied to smaller datasets like those queried by most investigators. Furthermore, they undermine our ability to understand complex biological processes and regulatory mechanisms in their totality32.

Conclusions, recommendations, and challenges

Sex differences are a cumulative effect of genetics, epigenetics, transcriptomics, proteomics, environment, social factors, hormonal influences and network-level modulation. Our understanding of the underlying bases of biological systems requires us to acknowledge and disentangle these complex interactions. Several foundational questions will remain unanswered until omics resources with sex annotation are developed. While sex-unique pathways and networks likely exist across nearly all tissues and species, it is impossible to quantify the error associated with current, sex-agnostic methods. We suspect that databases rooted in gene and protein interactions may suffer disproportionately from this inattention compared to DNA-centric resources as sex differences seem to be most profound at the network level21. Despite the uncertainty regarding the degree to which current practices have affected the quality of past results, it is clear that sex is a critical factor to be considered in omics analyses moving forward. As starting points, we recommend the following:

For scientists

  • Perform omics analyses in both combined-sex cohorts and separate male and female cohorts. Simply adding sex as a covariate to a combined-sex analysis is insufficient, but combined-sex analyses remain valuable for placing sex-specific results in the context of previous literature (for example, if results of previous studies were driven by an over-representation of one sex).

  • Design studies to represent males and females equally and in sufficient numbers to detect sex differences, or provide a justification as to why this is not possible. Although the sex of cell lines is often not available, efforts should be made to conduct experiments on those derived from both sexes. When cell lines are passaged within animals, attention should be given to the evolution of those cells in the sex-matched vs. sex-mismatched settings.

  • Follow the ARRIVE11 and MIAME34 guidelines when describing omics studies or depositing data in a public database. When comparing self-generated and public data, report the sex composition of both.

  • When using a database that references primary studies, determine the sex of the subjects in the studies that gave rise to any statistically significant pathways or terms, compare it to the sex composition of the experimental cohort, and report it as part of the results.

  • If sex is missing from a tool or database, suggest that curators require subject sex reporting from contributors going forward to facilitate prospective annotation.

  • If the terms in an omics database were generated by studies that are sex-incongruent with the experimental design, evaluate the literature for alternative signatures that are sex-specific and may not have been incorporated into the database yet.
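
The first recommendation above, analyzing combined-sex and sex-separated cohorts side by side, can be illustrated with a minimal sketch. The data below are invented for illustration only: a single hypothetical gene whose response to treatment runs in opposite directions in males and females. The helper names (`welch_t`, `values`, `compare`) are our own, and a real omics analysis would use an established differential-expression framework with multiple-testing correction rather than a bare test statistic.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

# Hypothetical records (invented for illustration): (sex, group, expression).
# Treatment raises expression in males and lowers it in females.
samples = [
    ("M", "control", 5.0), ("M", "control", 5.2), ("M", "control", 4.8),
    ("M", "treated", 7.0), ("M", "treated", 7.2), ("M", "treated", 6.8),
    ("F", "control", 7.0), ("F", "control", 7.2), ("F", "control", 6.8),
    ("F", "treated", 5.0), ("F", "treated", 5.2), ("F", "treated", 4.8),
]

def values(group, sex=None):
    """Expression values for one group, optionally restricted to one sex."""
    return [v for s, g, v in samples if g == group and (sex is None or s == sex)]

def compare(sex=None):
    """Treated-versus-control t statistic for the combined or a single-sex cohort."""
    return welch_t(values("treated", sex), values("control", sex))

results = {
    "combined": compare(),
    "male": compare("M"),
    "female": compare("F"),
}

for cohort, t in results.items():
    print(f"{cohort:>8}: t = {t:+.2f}")
```

In this toy case the combined-sex comparison shows no treatment effect, whereas each single-sex comparison shows a large effect of opposite sign. Because the sexes share the same overall mean here, an additive sex covariate would also leave the estimated treatment effect at zero; the difference only becomes visible when the cohorts are separated or a sex-by-treatment interaction is modeled.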

For databases

  • Provide references to the primary literature from which the information was originally derived.

  • Annotate entries with the sex of the subjects from which the data originated, and allow users to filter results by the sex that matches their experimental design.

  • Actively caution users about the risks of applying female or mixed-sex data to historically male-biased standards.

  • Prospectively curate new databases to bring attention to known sex differences and explicitly reference the data that support these conclusions.
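
As a rough sketch of what the first two database recommendations could look like in practice, the example below attaches a primary reference and a subject sex to each piece of evidence behind a term, then filters terms by sex. The record layout, the sex labels, and all names (`Evidence`, `TermEntry`, `filter_by_sex`, `sex_composition`) are hypothetical and do not describe the schema of any existing resource; the entries are placeholders, not real pathways or studies.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Evidence:
    """One primary study supporting a term (hypothetical schema)."""
    reference: str  # citation of the primary publication
    sex: str        # assumed labels: "male", "female", "mixed" or "unknown"

@dataclass
class TermEntry:
    """A pathway or annotation term together with its supporting evidence."""
    term: str
    evidence: list = field(default_factory=list)

def filter_by_sex(entries, sex):
    """Keep terms with at least one study in the requested sex,
    retaining only the matching evidence for each term."""
    kept = []
    for entry in entries:
        matching = [ev for ev in entry.evidence if ev.sex == sex]
        if matching:
            kept.append(TermEntry(entry.term, matching))
    return kept

def sex_composition(entry):
    """Count the supporting studies of a term by subject sex."""
    return Counter(ev.sex for ev in entry.evidence)

# Placeholder entries, invented for illustration.
entries = [
    TermEntry("hypothetical pathway 1", [
        Evidence("hypothetical study A", "male"),
        Evidence("hypothetical study B", "male"),
    ]),
    TermEntry("hypothetical pathway 2", [
        Evidence("hypothetical study C", "female"),
        Evidence("hypothetical study D", "unknown"),
    ]),
]

female_view = filter_by_sex(entries, "female")
for entry in female_view:
    print(entry.term, [ev.reference for ev in entry.evidence])
print(dict(sex_composition(entries[0])))
```

Keeping the sex label on each piece of evidence, rather than on the term as a whole, lets the same record serve both purposes: users can filter to studies that match their design, and a summary such as `sex_composition` can be shown alongside a term to flag when it rests entirely on one sex.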

For funding agencies

  • Provide opportunities for individuals to determine the problem’s scope, annotate resources, and use illustrative cases to quantify the impact of sex annotation (or lack thereof) on results.

  • Support the generation of data and tools that directly characterize sex differences, as well as novel statistical or computational approaches that retrospectively address sex differences in data that are not currently amenable to such comparisons.

Challenges

We recognize the hurdles to implementing these recommendations, including:

  • Financial burden of running both male and female experiments with the statistical power to detect differences.

  • Time to explore the primary publications that contributed to databases and tools to determine the sex composition of these sources.

  • Effort to annotate existing and future databases with sample donor sex, race and age.

  • Flexibility to continually expand the numbers of features accounted for in our primary datasets as we learn more about systems-level influences on molecular phenotypes.

Cognizance of sex bias in omics resources and the bioinformatics tools built on these databases will enhance scientific rigor and improve the quality of work across all biological disciplines. Embracing these recommendations will finally bring attention to a fundamental variable that has been long overlooked.