Date: 2023-09-07

Overview of the Quartet Project

The Quartet Project provides the community with multi-omics reference materials and reference datasets for QC and data integration in increasingly large-scale multi-omics studies (Fig. 1a). Suites of large quantities of multi-omics reference materials (DNA, RNA, protein and metabolites) were simultaneously established from the same immortalized LCLs of a Chinese Quartet family from the Fudan Taizhou Cohort65 (Extended Data Fig. 1), including the father (F7), mother (M8) and monozygotic twin daughters (D5 and D6). As summarized in Extended Data Table 1, each reference material was stocked in more than 1,000 vials. These reference materials are suitable for a wide range of multi-omics technologies, including DNA sequencing, DNA methylation analysis, RNA sequencing (RNA-seq), microRNA sequencing (miRNA-seq), and liquid chromatography and tandem mass spectrometry (LC–MS/MS)-based proteomics and metabolomics. Notably, the DNA and RNA reference material suites have been approved by China’s State Administration for Market Regulation as the First Class of National Reference Materials (GBW 099000–GBW 099007) and are being used extensively for proficiency testing and method validation.

Fig. 1: Overview of the Quartet Project. a, Design and production of Quartet family-based multi-omics reference material suites. b, Data generation across multiple platforms, labs, batches and omics types. DDA, data-dependent acquisition; DIA, data-independent acquisition; WGS, whole-genome sequencing. c, QC metrics for horizontal (within-omics) integration include the Mendelian concordance rate and SNR, which are also applicable to wet-lab proficiency testing. Two types of QC metrics for vertical (cross-omics) integration were developed that assess the ability to detect cross-omics feature relationships that follow the central dogma and the ability to classify samples into either four phenotypically different groups (D5–D6–F7–M8) or three genetically driven clusters (daughters–father–mother). d, Ratio-based scaling using common reference materials empowers horizontal and vertical integration.

For comprehensive performance evaluation, the Quartet multi-omics reference material suites were profiled across commonly used multi-omics platforms, including seven DNA sequencing platforms, one DNA methylation platform, two RNA-seq platforms, two miRNA-seq platforms, nine LC–MS/MS-based proteomics platforms and five LC–MS/MS-based metabolomics platforms (Fig. 1b). In each lab, three technical replicates of each reference material were measured, except on the long-read DNA sequencing platforms, for which only one replicate was sequenced per platform. Supplementary Table 1 summarizes the Quartet multi-omics datasets for the real-world assessment of commonly used multi-omics technologies. All the data can be accessed from the Quartet Data Portal (https://chinese-quartet.org/), which provides a landscape of data quality for each type of omics profiling.

The Quartet Project provides a set of metrics for QC and data integration in multi-omics profiling. For generation and horizontal integration of data from each omics type, the Quartet built-in QC metrics, that is, the Mendelian concordance rate for genomic variant calls and signal-to-noise ratio (SNR) for quantitative omics profiling, enable proficiency testing on a whole-genome scale using the Quartet reference materials. In addition, the Quartet multi-omics design provides two types of QC metrics to evaluate the reliability of vertical integration. One assesses the ability to correctly classify the Quartet samples into both four different individuals (daughter1–daughter2–father–mother) and three genetically driven clusters (daughters–father–mother), which is related to the multi-omics research purpose of sample clustering. Another QC metric assesses the ability to correctly identify cross-omics feature relationships that follow the central dogma (information flow from DNA to RNA to protein) and can be used to evaluate the reliability of correlation-based multi-omics integration (Fig. 1c). We propose ratio-based profiling using common reference materials to empower horizontal and vertical omics data integration. Ratio-based data were derived by scaling the absolute feature values of study samples (such as D5, F7 and M8) relative to those of a concurrently measured reference sample (such as D6) on a feature-by-feature basis (Fig. 1d).
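
The ratio-based scaling described above can be sketched as follows. This is a minimal illustration rather than the project's pipeline. It assumes that feature values are already log2-transformed, so that scaling becomes a subtraction, and that the reference profile is the per-feature mean of the D6 replicates measured in the same batch. The function name `ratio_scale`, the feature names and the toy numbers are hypothetical.

```python
from statistics import mean

def ratio_scale(batch, reference="D6"):
    """Convert log2 feature values to log2 ratios relative to a reference sample.

    batch maps sample name -> list of replicates, each a dict feature -> log2 value.
    Assumption: the reference profile is the per-feature mean of the reference
    replicates measured in the same batch.
    """
    ref_reps = batch[reference]
    ref_profile = {f: mean(rep[f] for rep in ref_reps) for f in ref_reps[0]}
    return {
        sample: [{f: rep[f] - ref_profile[f] for f in ref_profile} for rep in reps]
        for sample, reps in batch.items()
    }

# Toy data: batch B equals batch A plus a per-feature technical offset.
batch_a = {
    "D5": [{"g1": 5.0, "g2": 8.0}],
    "D6": [{"g1": 4.0, "g2": 8.5}, {"g1": 4.2, "g2": 8.3}],
}
offset = {"g1": 3.0, "g2": -2.0}
batch_b = {
    s: [{f: v + offset[f] for f, v in rep.items()} for rep in reps]
    for s, reps in batch_a.items()
}

ratio_a = ratio_scale(batch_a)
ratio_b = ratio_scale(batch_b)
```

In this toy case the absolute D5 values differ between the two batches by the offset, whereas the ratio-scaled D5 values agree, because the offset is shared by D5 and the concurrently measured D6.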

In this article, we provide an overview of the Quartet Project and propose a ratio-based quantitative profiling approach for multi-omics data integration using Quartet reference datasets across multiple omics types, platforms, batches and labs (Extended Data Fig. 2). Four accompanying papers detail the establishment of the DNA66, RNA67, protein68 and metabolite69 reference materials, reference datasets and QC methods for each type of omics profiling. Haplotype-resolved assemblies and a variant benchmark have also been provided70. Another paper71 is dedicated to benchmarking batch effect correction algorithms (BECAs) using the Quartet multi-omics data. We have also developed the Quartet Data Portal (https://chinese-quartet.org/)72 for the community to conveniently access and share the Quartet multi-omics resources according to the regulations of the Human Genetic Resources Administration of China.

Wet-lab proficiency in omics data generation varies

Before data integration, the proficiency in data generation for each type of omics data was assessed. Except for the long-read sequencing platforms, the reference materials were profiled within a batch in a lab in three replicates for each of the four samples (donors). For long-read sequencing, one replicate for each reference material was sequenced and the resulting data were analyzed using 11 pipelines; therefore, the performance evaluation was conducted only at the level of analytical procedures. Details on data generation and analysis are provided in the Methods.

QC metrics for evaluation of objective performance are critically important. The number of measured features, coefficient of variation (CV) and technical reproducibility are widely used QC metrics across different omics platforms and were used in our study for cross-omics performance comparisons. As shown in Fig. 2, the number of features measured for each omics type varied by several orders of magnitude, from 60 metabolites to 4.8 million small DNA variants (single-nucleotide variants (SNVs) and indels) per sample (Fig. 2a). Within each omics type, the number of features detected varied among batches and labs. The reproducibility of detected features in each omics profiling type was evaluated using the number of replicates supporting a variant call for genomics and the CV among technical replicates within a batch in quantitative omics profiling (Fig. 2b). Most SNVs were supported by all three library replicates within the batch, whereas the number of analytical repeats supporting a structural variant (SV) call greatly varied. For quantitative omics profiling, the CVs of most quantified features were below 30%. In addition, the reproducibility of technical replicates was also evaluated at the individual sample level. Reproducibility was calculated as the Jaccard index from three library repeats within a batch. For the short-read sequencing platforms, all Jaccard index values were above 93%. Moreover, the reproducibility of SVs from 11 call sets using different analytical pipelines was between 80% and 90%. Nanopore was found to be more reproducible than PacBio among the long-read sequencing platforms. The reproducibility of quantitative omics profiling was calculated as the Pearson correlation coefficient (Pearson’s r) of technical replicates within a batch. The r values from all labs and metabolomic platforms were above 95%, indicating high reproducibility in metabolomic profiling for the same sample. 
However, the r values for repeated measurements of the same sample were between 88.42% and 97.62% for transcriptomics and between 82.37% and 99.34% for proteomics (Fig. 2c).
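
Two of the single-sample reproducibility metrics used above can be sketched in a few lines. This is a hedged illustration only: it assumes that, for three replicates, the Jaccard index is the number of calls shared by all replicates divided by the number of distinct calls across them, and that the CV is the sample standard deviation divided by the mean on the non-log scale. The function names and variant records are hypothetical.

```python
from statistics import mean, stdev

def jaccard_index(call_sets):
    """Shared calls divided by all distinct calls across replicate call sets.

    Assumption: for more than two replicates, intersection and union are
    taken over all replicates at once.
    """
    sets = [set(s) for s in call_sets]
    union = set.union(*sets)
    if not union:
        return 0.0
    return len(set.intersection(*sets)) / len(union)

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (non-log scale values)."""
    return stdev(values) / mean(values)

# Hypothetical variant calls (chrom, pos, ref, alt) from three library replicates.
rep1 = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
rep2 = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
rep3 = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 75, "T", "C")}

jaccard = jaccard_index([rep1, rep2, rep3])      # 2 shared calls out of 4 distinct
cv = coefficient_of_variation([10.0, 12.0, 14.0])
```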

Fig. 2: Wet-lab proficiency in omics data generation varies. a, The number of features detected from each dataset generated in different labs using different platforms. b, Distribution of the number of experiments supporting genomic variant calling or CV in quantitative omics profiling from technical replicates (analytical repeats in SV calling and library repeats for the others) within a batch. c, Technical reproducibility from three replicates within a batch, calculated as the Jaccard index for small variant calling and Pearson correlation coefficient (r) for quantitative omics profiling (n = 12). For SV call sets, technical reproducibility was defined as the Jaccard index between different analytical repeats (Oxford Nanopore, n = 28; PacBio Sequel, n = 55; PacBio Sequel2, n = 55). The box plots display the distribution of data, with the median represented by the line inside the box and the interquartile range represented by the box. Whiskers extend to 1.5× the interquartile range. d, SNR based on the Quartet multi-sample design (4 samples × 3 replicates per batch). e, RMSE of high-confidence DEFs. Dots represent RMSE values for the D5–F7, D5–M8 and F7–M8 pairs in each batch (n = 3), while the bar plots present the corresponding mean values.

On the basis of the Quartet multi-sample design, we defined two QC metrics to measure the ability to identify intrinsic biological differences among various groups of samples, a key objective of omics profiling. The Quartet-based SNR metric is the ratio of inter-sample differences (that is, ‘signal’) to intra-sample differences among technical replicates (that is, ‘noise’). It is calculated as the ratio of the average distance between the Quartet samples to the average distance between technical replicates of the same sample (see Methods for details). For a measurement method with high resolution in differentiating biologically different groups of samples, the inter-sample differences of the Quartet samples should be much larger than the variation among technical replicates for the same sample. Principal-component analysis (PCA) showed clear separation among the Quartet samples (D5, D6, F7 and M8) for high-quality profiling experiments (Supplementary Fig. 1a) but not for low-quality profiling experiments (Supplementary Fig. 1b). Strikingly, high variabilities in intra-batch data quality were observed in each omics platform (Fig. 2d), especially for the quantitative omics platforms, including for methylomics (SNR range of 15.5–27.1, s.d. = 4.5), transcriptomics (SNR range of 8.7–31.0, s.d. = 7.1), miRNA profiling (SNR range of 1.9–20.5, s.d. = 6.8), proteomics (SNR range of 0.9–30.5, s.d. = 7.5) and metabolomics (SNR range of 4.6–27.1, s.d. = 5.1). Moreover, high variabilities of proficiency in data generation were evident for each technology platform. For example, both high and low SNRs were observed in RNA-seq for the Illumina and BGI platforms, but the average SNRs across multiple batches were very close for the two sequencing platforms (20.39 versus 19.54, P = 0.84). Similarly, high variabilities in SNR were observed within each MS platform for proteomics or metabolomics profiling. 
These results implied that the inherent proficiency of an individual wet lab, instead of a specific platform itself, was a more important factor affecting the reliability of data generation for each omics type.
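
The idea behind the Quartet-based SNR can be illustrated with a simplified sketch. The exact definition is given in the Methods. The version below assumes plain Euclidean distance on the supplied feature vectors and returns the unscaled ratio of the average between-sample distance to the average within-sample distance. The function name `snr_ratio` and the two-feature toy profiles are hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import mean

def snr_ratio(profiles):
    """Average between-sample distance over average within-sample distance.

    profiles maps sample name -> list of replicate feature vectors.
    Assumption: plain Euclidean distance on the supplied vectors.
    """
    labelled = [(s, vec) for s, reps in profiles.items() for vec in reps]
    between, within = [], []
    for (s1, v1), (s2, v2) in combinations(labelled, 2):
        (within if s1 == s2 else between).append(dist(v1, v2))
    return mean(between) / mean(within)

# Toy batches: same sample centers, tight versus widely scattered replicates.
good_batch = {
    "D5": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
    "D6": [(1.0, 0.0), (1.1, 0.0), (1.0, 0.1)],
    "F7": [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)],
    "M8": [(5.0, -5.0), (5.1, -5.0), (5.0, -4.9)],
}
noisy_batch = {
    "D5": [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)],
    "D6": [(1.0, 0.0), (4.0, 0.0), (1.0, 3.0)],
    "F7": [(5.0, 5.0), (8.0, 5.0), (5.0, 8.0)],
    "M8": [(5.0, -5.0), (8.0, -5.0), (5.0, -2.0)],
}

snr_good = snr_ratio(good_batch)
snr_noisy = snr_ratio(noisy_batch)
```

Because the metric needs several biologically distinct samples measured together, it cannot be computed from replicates of a single reference sample, which is the point of the multi-sample design.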

In addition, we constructed high-confidence reference datasets (Supplementary Table 2) of differentially expressed features (DEFs) in terms of the level of differential expression between pairs of samples (D5–F7, D5–M8 and F7–M8 pairs) for each quantitative omics profiling type using a consensus-based integration strategy (Extended Data Fig. 3a). Root mean square error (RMSE) was used to quantitatively evaluate the consistency of a test dataset with the high-confidence reference dataset (Fig. 2e).
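
A minimal sketch of the RMSE comparison follows. It assumes that both the test dataset and the reference dataset are expressed as log2 fold differences per feature for one sample pair (for example, D5–F7) and that only features present in both are compared. The function name `rmse` and the values are hypothetical.

```python
from math import sqrt

def rmse(test_lfc, reference_lfc):
    """RMSE over features present in both the test and reference datasets."""
    shared = sorted(set(test_lfc) & set(reference_lfc))
    if not shared:
        raise ValueError("no shared features")
    squared_error = sum((test_lfc[f] - reference_lfc[f]) ** 2 for f in shared)
    return sqrt(squared_error / len(shared))

# Hypothetical log2 fold differences for one sample pair.
reference = {"g1": 1.0, "g2": -2.0, "g3": 0.5}
test = {"g1": 1.5, "g2": -2.5, "g4": 3.0}

value = rmse(test, reference)  # only g1 and g2 are shared
```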

We further explored the relationships between SNR and the number of detected features, the reproducibility of features, the reproducibility of technical replicates and the RMSE of DEFs to evaluate data quality in quantitative omics profiling (Supplementary Fig. 2). These data suggested that none of the widely used QC metrics (number of measured features, CV of measured features and correlation of technical replicates) based on a single reference sample guarantee high resolution (SNR) in identifying inherent differences (that is, biological signals) among various biological sample groups. Therefore, multi-sample-based QC metrics are needed to identify labs with low proficiency in detecting intrinsic biological differences among sample groups.

Ratio-based scaling enables horizontal integration

In large-scale omics studies, the reliability of horizontal integration of omics datasets across different platforms, labs or batches for the same omics type is essential. We propose a ratio-based scaling approach (for example, D5, F7 and M8 as study samples) using common reference material(s) (for example, D6) to enable horizontal integration of diverse datasets from the same omics type.

Technical variations are dominant during horizontal integration of the Quartet data at the absolute level. For methylation array data represented as M value, miRNA-seq data represented as log2(counts per million mapped reads (CPM)), RNA-seq data represented as log2(fragments per kilobase of transcript per million mapped reads (FPKM)), proteomics data represented as log2(fraction of total (FOT)) and metabolomics data represented as log2(intensity), systematic deviations were observed between two technical replicates (D5) from different batches after horizontal integration (Fig. 3a). The intercepts for the fitted lines for each scatterplot ranged from –0.084 to –11, with integrability being the worst for absolute metabolomics profiling. However, after scaling the absolute feature values of D5 relative to those for a concurrently measured D6 sample on a feature-by-feature basis, the systematic deviations for each omics profiling type were reduced. The intercepts for all the fitted lines decreased; in particular, the intercept decreased markedly from –11 (absolute) to –0.069 (ratio) for metabolomics profiling (Fig. 3b). In addition, the CV values of six technical replicates of D5 samples from an exhaustive combination of two batches of datasets were mostly decreased at the ratio level except for some combinations of metabolomics profiling from the same lab (Fig. 3c and Supplementary Fig. 3).

Fig. 3: Ratio-based scaling enables horizontal integration. a,b, Scatterplots of the feature abundance of inter-batch D5 samples in methylation, miRNA-seq, RNA-seq, proteomics and metabolomics datasets at the absolute level (raw data; a) and ratio level (ratio scaling to the D6 sample; b). The x and y axes show the average expression of the three D5 technical replicates from the two best quality batches from different labs (ranked by SNR). At the absolute level, features with a CV less than 0.2 for the technical replicates of D5 in both batches were retained; at the ratio level, features with a CV less than 0.2 for the technical replicates of D5 and D6 in both batches were retained. r denotes the Pearson correlation coefficient, and m denotes the number of features. Linear fits were performed on the basis of the feature abundance. c, Lollipop plots of CV in feature abundance for six D5 samples across two batches. The x axis represents the exhaustive two-by-two combination of all batches for each omics type. d,e, PCA plots of horizontal integration of all batches of methylation, miRNA-seq, RNA-seq, proteomics and metabolomics datasets at the absolute level (d) and ratio level (e). n denotes the number of samples, and m denotes the number of features. f, Scatterplots between SNR and degree of sample class-batch balance. Blue, absolute level; red, ratio level.

We further compared the sources of variability in the Quartet data at the absolute and ratio levels. Technical factors dominated the variability in the absolute data, and the proportional contribution of each factor to the total variability is dependent on omics type (Extended Data Fig. 4). On the contrary, principal-variance component analysis (PVCA) results for the ratio data showed that the biological factor (‘donor/sample’) dominated the data variability in most omics types and its relative contribution over technical factors was markedly higher when compared to the absolute data. The PVCA results clearly demonstrated the effectiveness of feature-by-feature ratio data in removing technical noise present in absolute multi-omics data, enabling the identification of true biological signals (that is, true differences among donors).

The reliability of horizontal integration can also be assessed using the Quartet-based SNR metric. The aforementioned five types of quantitative omics data all showed obvious batch-dominant clustering at absolute expression levels in horizontal integration (Fig. 3d). However, after converting the absolute omics data to a ratio scale relative to the same reference material (D6) within a batch on a feature-by-feature basis, PCA plots showed clear separation of the four types of reference samples (D5, D6, F7 and M8) and the strong batch effects seen at the absolute scale were largely absent (Fig. 3e). We further quantitatively measured the quality of horizontal data integration using the Quartet multi-sample-based SNR as the metric. A method of good quality for horizontal data integration at each omics level would clearly separate the four Quartet sample groups; that is, the inter-sample differences of the Quartet samples should be much larger than the variation among technical replicates of the same sample. As shown in Fig. 3d,e, the SNR values after horizontal integration of datasets for each omics type at the absolute level were all close to zero except for methylation data (Fig. 3d), whereas the SNR values of the integrated datasets were markedly higher at the ratio level for each omics type (Fig. 3e). Notably, these conclusions remain the same if one chooses D5, F7 or M8 instead of D6 as the reference sample (Extended Data Fig. 5), indicating the universal applicability of the ratio-based scaling approach.

In addition, we characterized the impact of the level of batch effects on the SNR for horizontal integration by randomly selecting samples from different batches and using the average of the Jaccard index for batches from the four sample groups as a measure of group-batch balance (see Methods for details). As shown in Fig. 3f, regardless of the level of balance of sample classes across batches, horizontal integration at the ratio level resulted in much better discrimination between sample classes, that is, much higher SNR. However, the corresponding SNR values at the absolute level were all close to zero except for methylation data, whether there was group-batch balance or not. These results clearly demonstrate that quantitative omics profiling data at the ratio level are much more comparable and suitable for horizontal integration than those at the absolute level.

Ratio-based profiling allows for more accurate determination of the subtle differences between any two Quartet samples on a feature-by-feature basis. For all three comparisons (D5–F7, D5–M8 and F7–M8), compared to the log2-transformed fold differences in the absolute-based integration data, those for the ratio-based integration data showed a much higher level of agreement (and lower RMSE) with the corresponding reference dataset for each omics type (Extended Data Fig. 3b,c). Furthermore, the level of balance of sample groups across batches was helpful for the accurate detection of DEFs. This was reflected in the negative correlation between RMSE and the level of group-batch balance. It was also clear that a lack of group-batch balance affected absolute data integration much more severely than it did ratio-based data integration, with the former showing a much larger slope than the latter (Extended Data Fig. 3c).

The pervasiveness of batch effects in quantitative analysis techniques at the absolute expression level presents a real challenge for horizontal integration. Our results demonstrate that the conversion of quantitative omics data to a ratio scale relative to a common reference sample (for example, the Quartet D6 sample) can effectively mitigate the detrimental impact of batch effects on downstream analyses such as sample classification and differential feature identification.

Improved reliability of cross-omics feature correlations

One advantage of multi-omics studies is the ability to systematically discover cross-omics relationships from multiple interconnected biological layers. The correlation coefficient is one of the simplest ways to estimate the pairwise relevance for two types of omics features, which is the foundation of multi-omics integration for network analysis. In large multi-omics studies, the multi-omics datasets are usually generated in multiple batches, platforms and labs. Vertical integration of multi-omics datasets from various omics types is typically performed after horizontal integration of the same omics type73; thus, performance of the final integration is influenced by both horizontal and vertical dimensions. Therefore, we evaluated the reliability of vertical integration using horizontally integrated ratio-based data under different scenarios.

Cross-omics feature relationships calculated from multiple batches of data integrated at the ratio level (inter-batch) correlated much more strongly with those calculated from single batches (intra-batch) than did relationships calculated at the absolute level (Fig. 4a). These cross-feature correlations of methylation–miRNA, methylation–RNA, miRNA–RNA, RNA–protein and protein–metabolite types were derived from features of both omics types associated with the same genes, which may more closely follow the principle of the central dogma. In particular, for the relationships between proteins and metabolites, direct integration of multi-batch data at the absolute level could not easily identify true correlations for cross-omics feature pairs.

Fig. 4: Improved reliability of cross-omics feature correlations. a, Scatterplots of the cross-omics feature relationships of intra- and inter-batch (horizontally integrated) data at the absolute level (blue) and ratio level (red). The solid lines represent fitted curves from linear regression along with the Pearson correlation coefficient (r). b, Workflow for the construction of reference datasets of cross-omics feature relationships according to the following steps: (1) identification of detectable multi-omics features and per-sample normalization; (2) intra-batch QC by filtering out features that are not detectable or have low technical reproducibility; (3) identification of cross-omics feature pairs associated with the same genes or pathways; (4) cross-batch QC by retaining reliable feature pairs identified in a sufficient number of batches; (5) calculating Pearson correlation coefficients for each feature pair in each batch combination and classifying the relationships into positive (r ≥ 0.5, P < 0.05) and negative (r ≤ –0.5, P < 0.05) categories; and (6) voting based on the direction of the correlations (negative or positive) to screen the high-confidence cross-omics feature relationships. c, Chord plot of the reference dataset of cross-omics feature relationships. Each chord represents a positive (red) or negative (blue) correlation of any two cross-omics features. d, Scatterplots of the expression abundance of 224 positively correlated RNA–protein pairs at the absolute level (blue) and ratio level (red). Data were selected from the best quality batch in the RNA-seq and proteomics datasets. r denotes the Pearson correlation coefficient, and m denotes the number of features. e,f, Bar plots of RMSE of cross-omics feature relationships identified from different quality datasets (e; bad versus good) and different scenarios (f; confounded versus balanced) at the absolute level (blue) and ratio level (red) based on the reference datasets. 
The number of data sampling instances (n) used to derive statistics was as follows: bad, n = 10; good, n = 10; confounded, n = 200; balanced, n = 100. Data are presented as mean values ± s.d. The P values were calculated using unpaired two-tailed Wilcoxon rank-sum tests with false discovery rate (FDR) correction. ****P < 0.0001, ***P < 0.001, **P < 0.01, *P < 0.05; not significant, P ≥ 0.05. Specific P values are listed in Supplementary Data 1 and 2.

To evaluate the performance of vertical integration at the feature relationship level, we constructed Quartet cross-omics reference datasets (Supplementary Table 3) using a consensus voting approach, as depicted in Fig. 4b. This reference dataset consisted of the Pearson correlation coefficients between the expression levels of two different types of omics features. By exhaustively enumerating all batch combinations of the above five cross-omics types, feature pairs identified in more than a predetermined number of batch combinations were selected for further analysis. The cross-omics relationships were classified as positive (r ≥ 0.5, P < 0.05) or negative (r ≤ –0.5, P < 0.05) on the basis of the outcomes of Pearson correlation analysis conducted for each feature pair. Feature pairs demonstrating positive or negative correlations in more than 70% of all batch combinations were included in the high-confidence dataset, and the mean correlation coefficient for that category (that is, positive or negative) was used as the reference Pearson correlation coefficient.
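
The voting step can be sketched as below, using the thresholds stated in the text (|r| ≥ 0.5, P < 0.05 and more than 70% of batch combinations). The sketch assumes that the per-combination Pearson r and P values have already been computed, and that the reference r is the mean of the r values that voted for the winning direction. The function name `vote_relationship` and the example values are hypothetical.

```python
from statistics import mean

def vote_relationship(results, r_cut=0.5, p_cut=0.05, min_fraction=0.7):
    """Classify one feature pair from its per-batch-combination (r, p) results.

    Returns (direction, reference_r), or None when no direction reaches consensus.
    Assumption: the reference r is the mean of the r values that voted for
    the winning direction.
    """
    if not results:
        return None
    positive = [r for r, p in results if r >= r_cut and p < p_cut]
    negative = [r for r, p in results if r <= -r_cut and p < p_cut]
    for direction, votes in (("positive", positive), ("negative", negative)):
        if len(votes) / len(results) > min_fraction:
            return direction, mean(votes)
    return None

# Hypothetical (r, P) results for two feature pairs across batch combinations.
pair_a = [(0.8, 0.01), (0.7, 0.02), (0.9, 0.001), (0.6, 0.04), (0.2, 0.5)]
pair_b = [(0.8, 0.01), (-0.7, 0.02), (0.1, 0.8)]

result_a = vote_relationship(pair_a)  # 4 of 5 combinations vote positive
result_b = vote_relationship(pair_b)  # no direction exceeds 70%
```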

The reference dataset comprises a comprehensive selection of high-confidence correlation feature pairs, consisting of 1,054 methylation–miRNA pairs, 1,134 methylation–RNA pairs, 637 miRNA–RNA pairs, 224 RNA–protein pairs and 29 protein–metabolite pairs (Fig. 4b). Within this dataset, a subset of 59 genes showed regulation influenced by both methylation and miRNA, alongside a strong positive association with protein (P < 0.05, r > 0.5), as depicted in Fig. 4c. This finding highlights the intricate interplay between different omics types and offers valuable insights into the coordinated regulation of gene expression.

The principle of the central dogma was well reflected in the Quartet multi-omics data, as it could be seen that the abundance of RNAs was almost exclusively positively correlated with that of proteins in the reference dataset (224 RNA–protein pairs were positively correlated while no RNA–protein pair was negatively correlated). The positive RNA–protein correlations were better identified at the ratio level (r = 0.8) than at the absolute level (r = 0.39; Fig. 4d). The same phenomenon was demonstrated for other inter-omics associations; that is, ratio-based scaling improved the confidence of the identification of cross-omics feature relationships in the reference datasets (Extended Data Fig. 6).

In large-scale cohort studies involving multi-omics quantitative analyses, issues related to uneven data quality and unbalanced sample groupings across batches often arise36,37. Confounded scenarios, characterized by substantial confounding between biological factors and batch effects, are frequently encountered in longitudinal and multicenter cohort studies, presenting challenges in disentangling the influences of biological factors from batch effects. Although balanced scenarios, where samples from the biological group of interest are evenly distributed across batches, represent an ideal situation, they are rarely achievable in practical settings. In this context, we further investigated the performance of the ratio-based approach under both scenarios.

In agreement with Fig. 4a, the concordance of correlation coefficients of cross-omics features with the reference Pearson r was higher (as indicated by lower RMSE values) in the horizontally integrated data based on the ratio level than those based on the absolute level (Fig. 4e,f). The ratio-based profiles exhibited lower RMSE values when detecting cross-omics feature relationships from datasets of different quality (as indicated by the SNR values). The performance of identifying cross-omics feature relationships on the basis of ratio-based data is improved when the single-batch dataset is of higher quality (Fig. 4e). Furthermore, in different experimental scenarios, that is, balanced or confounded batch groups, the ratio-based data showed essentially the same good performance, whereas the absolute level was more sensitive to batch effects (Fig. 4f).

Facilitating vertical integration for sample classification

Another advantage of vertical integration of multi-omics data is the ability to distinguish subtypes of clinical samples with subtle differences that cannot be identified on the basis of a single type of omics data. Therefore, the ability to discover the true biological differences between sample groups is a key metric to measure the performance of multi-omics integration tools and procedures. The multi-sample and multi-omics design of the Quartet Project provides unique resources for assessing the reliability of vertical integration. Here we included six horizontal integration methods for evaluation, that is, ratio-based scaling (Ratio), ComBat74, Harmony75, RUVg76, z score and direct integration of the normalized values (Absolute). Five widely accepted vertical integration tools were subsequently used, that is, SNF5, iClusterBayes77, MOFA+78, MCIA79 and intNMF80, generating 30 combinations of horizontal and vertical integration for performance assessment.

The adjusted Rand index (ARI)81 is a widely used QC metric to compare clustering results against external criteria. To quantitatively evaluate the reliability of vertical data integration at the multi-omics level, we used Quartet-based ARI (daughter1–daughter2–father–mother (that is, D5–D6–F7–M8) as four independent sample groups or clusters) as the metric.
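
For readers unfamiliar with the ARI, the following self-contained sketch computes it from pair counts for a Quartet-style design of four samples with three replicates each. It is a generic implementation of the standard formula, not code from the Quartet pipeline. The cluster assignments are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(truth)
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = pairs_truth * pairs_pred / comb(n, 2)
    maximum = (pairs_truth + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single cluster
    return (pairs_joint - expected) / (maximum - expected)

# Four sample groups with three technical replicates each.
truth = ["D5"] * 3 + ["D6"] * 3 + ["F7"] * 3 + ["M8"] * 3

# Hypothetical clustering outputs.
perfect = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
twins_merged = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2]

ari_perfect = adjusted_rand_index(truth, perfect)
ari_merged = adjusted_rand_index(truth, twins_merged)
```

A clustering that merges the twin daughters scores below 1 against the four-group truth, which illustrates why the four-group criterion is the more demanding one.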

Ratio-based scaling data largely outperformed the absolute-level data with a much higher ARI when the same vertical integration algorithm was used (Fig. 5a). Furthermore, there was no significant difference between high- and low-quality groups, regardless of the methods used (Extended Data Fig. 7a). This may be due to the relatively simple classification task and the integration of multi-omics data that effectively improves the discrimination between different samples.

Fig. 5: Facilitating vertical integration for sample classification. a,b, Bar plots of the ARI of vertically integrated multi-omics datasets of different quality (a; bad versus good) and different scenarios (b; confounded versus balanced) at the absolute level (blue) and ratio level (red) using SNF, iClusterBayes, MOFA+, MCIA and intNMF. The number of data sampling and integration instances (n) used to derive statistics was as follows: bad, n = 10; good, n = 10; confounded, n = 200; balanced, n = 100. Data are presented as mean values ± s.d. The P values were calculated using unpaired two-tailed Wilcoxon rank-sum tests with FDR correction. ****P < 0.0001, **P < 0.01, *P < 0.05; not significant, P ≥ 0.05. Specific P values are listed in Supplementary Data 3 and 4. c, Scatterplots of the degree of sample class-batch balance versus ARI with different data preprocessing methods. d, Scatterplots of the degree of sample class-batch balance versus SNR with different data preprocessing methods. SNR was calculated on the basis of a sample-to-sample similarity matrix. e, Curves of ARI and SNR with the degree of balance between sample classes and batches at the absolute level (blue, solid line), ratio level (red, solid line), absolute level combined with BECAs (blue, dotted line) and ratio level combined with BECAs (red, dotted line). Each point represents an instance of data sampling and integration. The solid lines correspond to fitted curves obtained from local regression, and the shading indicates the 95% confidence interval around the smoothing.

In particular, ratio-based data showed an obvious advantage over absolute-level data in confounded scenarios (Fig. 5b and Extended Data Fig. 7b). Vertical integration based on the ratio-level profiles exhibited an ARI close to 1 at different levels of batch-group balance. Most of the popular batch correction methods (ComBat, Harmony and z score) showed lower ARI in the confounded scenario, and their performance with all five vertical integration algorithms steadily improved as the degree of batch-group balance increased. RUVg, which is theoretically suitable for processing confounded datasets, had excellent performance in extremely confounded and balanced scenarios. However, it was less effective between these extremes, when the batch and sample groups were partially confounded. Interestingly, the absolute-level data exhibited a pattern of variation similar to that of RUVg, with the highest ARI under extreme confounding. In that case, however, it was batch information rather than sample identity that was actually distinguished (Fig. 5c and Extended Data Fig. 7c). These results suggest that vertical integration for sample classification based on ratio-scaled profiles is essentially unaffected by the degree of batch-group balance in the experimental design.
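To make the last point concrete, the following is a minimal sketch of the adjusted Rand index (the standard Hubert and Arabie form) applied to a hypothetical clustering that groups samples purely by batch. The toy designs, batch names and replicate counts are illustrative assumptions and are not taken from the study. When every sample group sits in its own batch, a batch-driven clustering scores a perfect ARI against the sample labels even though only batch was recovered; in a balanced design the same clustering scores close to zero.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: partitions agree trivially

    # Contingency counts and the marginal cluster sizes.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical fully confounded design: each sample group in its own batch.
confounded_samples = [g for g in ("D5", "D6", "F7", "M8") for _ in range(3)]
confounded_batches = [b for b in ("b1", "b2", "b3", "b4") for _ in range(3)]
ari_confounded = adjusted_rand_index(confounded_samples, confounded_batches)

# Hypothetical balanced design: two batches, each containing all four groups.
balanced_samples = confounded_samples * 2
balanced_batches = ["b1"] * 12 + ["b2"] * 12
ari_balanced = adjusted_rand_index(balanced_samples, balanced_batches)
```

Here the second argument plays the role of a clustering result that follows batch exactly. A high ARI alone therefore cannot show that sample identity was recovered unless the design separates batch from sample group.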

It is worth noting that ARI only qualitatively measures whether the clustering results and the external criteria have a similar clustering structure; it does not indicate the degree of difference between clusters. When the ARI is the same, the biological features of sample groups after vertical integration may still differ. We therefore extended the idea of SNR to quantitatively evaluate the vertically integrated results and thereby assess the accuracy of sample classification at higher resolution (see Methods for details). In line with the previous findings, ratio-based scaling resulted in higher SNR values in scenarios ranging from confounded to balanced, regardless of the vertical integration method used (Fig. 5d).
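The difference between the two metrics can be illustrated with a small sketch. The exact SNR formula is given in Methods and is not reproduced in this section, so the code below assumes one plausible form: the ratio of the mean squared distance between samples of different groups to the mean squared distance between replicates of the same group, expressed in decibels. The function name, the toy coordinates and the use of a distance matrix (a similarity matrix would first need converting, for example as 1 minus similarity) are all assumptions made for illustration.

```python
from itertools import combinations
from math import log10


def snr_from_distances(dist, groups):
    """Signal-to-noise ratio (dB) from a square sample-to-sample distance matrix.

    signal = mean squared distance between samples of different groups
    noise  = mean squared distance between replicates of the same group
    """
    within, between = [], []
    for i, j in combinations(range(len(groups)), 2):
        bucket = within if groups[i] == groups[j] else between
        bucket.append(dist[i][j] ** 2)
    if not within or not between:
        raise ValueError("need both within-group and between-group pairs")
    noise = sum(within) / len(within)
    signal = sum(between) / len(between)
    if noise == 0:
        raise ValueError("within-group distances are all zero")
    return 10 * log10(signal / noise)


def distance_matrix(points):
    """Pairwise absolute distances for hypothetical one-dimensional samples."""
    return [[abs(a - b) for b in points] for a in points]


# Two hypothetical integrated results with the same perfect grouping
# (replicates of each sample sit next to each other in both layouts).
groups = ["D5", "D5", "F7", "F7"]
tight = distance_matrix([0.0, 0.1, 5.0, 5.1])  # well-separated groups
loose = distance_matrix([0.0, 1.0, 2.5, 3.5])  # barely separated groups

snr_tight = snr_from_distances(tight, groups)
snr_loose = snr_from_distances(loose, groups)
```

Both layouts would be clustered correctly into two groups and so would receive the same ARI, but the tight layout yields a much higher SNR than the loose one, which is the extra resolution the SNR is meant to provide.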

The ultimate performance of integration was influenced by both the horizontal integration methods and the vertical integration algorithms. For example, regardless of the chosen horizontal integration method, MOFA+ performed better than MCIA in subsequent vertical integration. These results indicate that ratio-based scaling improves the vertical integration of sample clusters by providing reliable horizontal integration.

By comprehensively comparing ratio-based profiling with BECAs, we aimed to provide a more robust depiction of the effectiveness of direct quantification at the ratio level during data generation. Direct ratio quantification consistently produced high ARI values, indicating accurate sample classification, as well as high SNR values, indicating strong discriminatory power, regardless of whether the sample classes were balanced across batches. Furthermore, applying BECAs on top of ratio quantification produced superior outcomes compared to batch correction based on absolute quantification (Fig. 5e). Therefore, it is imperative to incorporate ratio-based profiling at the experimental measurement stage rather than relying on data massaging alone (for example, normalization and/or batch effect correction) after data generation.
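As a rough illustration of why ratio-level profiles resist batch effects, the sketch below converts the absolute values of each batch into log2 ratios against the reference sample measured in the same batch. D6 is used as the reference because the paper scales to D6; the log2 transform, the function name, the data layout and the toy numbers are assumptions for illustration only.

```python
from math import log2
from statistics import mean


def ratio_scale(batch, reference="D6"):
    """Convert one batch of absolute values to log2 ratios against the reference.

    batch: list of (sample_group, {feature: absolute_value}) measured together.
    Returns the same structure with log2(value) minus the mean log2 value of
    the reference replicates in that batch.
    """
    ref_profiles = [profile for group, profile in batch if group == reference]
    if not ref_profiles:
        raise ValueError(f"batch contains no {reference} replicates")
    features = list(ref_profiles[0])
    ref_mean = {f: mean(log2(p[f]) for p in ref_profiles) for f in features}
    return [
        (group, {f: log2(profile[f]) - ref_mean[f] for f in features})
        for group, profile in batch
    ]


# Hypothetical data: batch B carries a 4-fold multiplicative batch effect.
batch_a = [("D6", {"gene1": 8.0}), ("D6", {"gene1": 8.0}), ("F7", {"gene1": 16.0})]
batch_b = [("D6", {"gene1": 32.0}), ("D6", {"gene1": 32.0}), ("F7", {"gene1": 64.0})]

scaled_a = ratio_scale(batch_a)
scaled_b = ratio_scale(batch_b)
f7_ratio_a = scaled_a[2][1]["gene1"]
f7_ratio_b = scaled_b[2][1]["gene1"]
```

In this toy case the absolute F7 values differ 4-fold between batches, whereas the F7-to-D6 log2 ratios agree, because a multiplicative effect shared by the study sample and the reference cancels in the ratio. Real batch effects are not purely multiplicative, so the cancellation is only approximate in practice.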

Quartet design for genetics-driven ground truth

Multi-omics integration of molecular-level information and phenotypic characteristics holds great promise for advancing the understanding of intricate genotype–phenotype relationships. Beyond straightforward differentiation of the four individuals (daughter1–daughter2–father–mother, or D5–D6–F7–M8), the Quartet monozygotic twin family design offers a unique opportunity as well as a more challenging task: classification into three genetically distinct, family-based groups (daughters–father–mother, or D–F–M). Here we integrated multi-omics data of moderate quality (SNR in the range of the top 20% to 80%), including DNA variants, methylation, miRNA, RNA, protein and metabolites. For each vertical integration method, only one batch of data was selected for each omics type to prevent the influence of batch effects during horizontal integration. In addition, we conducted partitioning around medoids (PAM) clustering82 for each type of single-omics data and calculated ARI as a control to assist in assessing the performance of the vertical integration.

The inter-sample similarity networks built using data from a single omics type (top) and multi-omics data integrated using SNF, iClusterBayes, MOFA+, MCIA and intNMF (bottom) are visualized in Fig. 6a. At the DNA level, the samples for the identical twins (D5 and D6) were tightly clustered together owing to their near-identical DNA sequences. On the other hand, these samples showed no clear tendency to cluster together for the five types of quantitative omics data (methylation, miRNA, RNA, protein and metabolites) and could even appear relatively far apart (for example, D6 and F7 appeared closer in miRNA, RNA and protein data). This distinction in clustering tendency between DNA variants and quantitative omics data implies that the classification task (D–F–M) can be used to assess whether a vertical integration approach can reveal the intrinsic, built-in genetic truth in the Quartet family with identical twins.

Fig. 6: Quartet design for genetics-driven ground truth. a, Networks of six types of omics profiling based on the similarity between 12 samples within one batch (top) and sample similarity networks obtained with SNF, iClusterBayes, MOFA+, MCIA and intNMF (bottom), which integrated the six types of multi-omics data. b, Bar plots of the ARI when clustering samples into three (D–F–M) or four (D5–D6–F7–M8) groups by single-omics clustering (yellow) versus multi-omics integration (orange). c, Bar plots of the ARI for multi-omics data integration using SNF, iClusterBayes, MOFA+, MCIA and intNMF. Light green represents data when the true labels of the samples were set to three clusters (D–F–M), while dark green represents four clusters (D5–D6–F7–M8). In b,c, data are presented as mean values ± s.d. A total of 60 batches of multi-omics datasets were used for single-omics PAM clustering, on the basis of which 100 cross-omics combinations were used for multi-omics integration with five algorithms. d, The number of multi-omics features associated with DNMs, DEFs identified from profiles and their intersections. e, Enrichment pathway maps for differential multi-omics features between D5 and D6, that is, the intersection of DNMs and DEFs. Darker colors indicate pathways and lighter colors indicate genes. The percentage of each circle of a specific color corresponds to the proportion of features associated with each omics type. f, Box plots of the similarity between D5 and D6 for integration of different types of omics data with 50 iterations. The multi-omics data were integrated starting with DNA (red) and ending with metabolites (gray) by using SNF. The box plots display the distribution of data with the median represented by the line inside the box and the interquartile range represented by the box. Whiskers extend to 1.5× the interquartile range.

Vertical integration reduced technical noise and improved sample clustering, as indicated by the fact that the ARIs for both the three clusters (D–F–M) and four clusters (D5–D6–F7–M8) from multi-omics integration were higher than those with direct clustering of single-omics data (Fig. 6b). Nevertheless, there were still differences in performance between the vertical integration algorithms when distinguishing the three sample categories (D–F–M). SNF, iClusterBayes, MOFA+ and intNMF correctly classified the samples into the three Quartet family-based groups (D–F–M), whereas MCIA did not perform well (Fig. 6c). This demonstrates that the integration algorithms could be prioritized by whether they find potential genetic truth (identical twins) behind the four individuals with distinct differences in molecular phenotypic data.

To better decipher what influences D–F–M clustering, we annotated the genomic coordinates of de novo and somatic small variants (abbreviated as DNMs) in addition to directly calculating DEFs for each omics type. The intersection of these indicates highly plausible multi-omics features affected by genomic-level differences between the Quartet identical twins (Fig. 6d). Further enrichment analysis yielded pathways and features with specific molecular insights into the impact of genomic variants on D–F–M clustering (Fig. 6e). Identification of the primary immunodeficiency signaling pathway (IGHM, IGHD, IGLL1 and IGLL5) indicated potential differences in immune system functions between the cell lines derived from the twins that could affect immunoglobulin synthesis and secretion, likely resulting from the process of immortalization of B cells with Epstein–Barr virus (EBV). The Hippo signaling pathway (DLG2, PPP2R2C and SMAD1) is associated with cell proliferation, polarity and tissue morphology, suggesting that there could be structural and morphological differences between the two cell lines from the twins. The p70-S6K signaling pathway (IGHM, IGHD, IGHV1-2, IGLV3-21, PLCL2 and PPP2R2C) is associated with protein synthesis, cell proliferation and metabolic regulation and could potentially account for variations in the culturing status of the two cell lines. The PI3K signaling pathway in B lymphocytes (IGHM, IGHD, IGHV1-2, IGLV3-21 and PLCL2) is specific to these cells and is also associated with protein synthesis, cell proliferation and metabolic regulation. Finally, identification of DNA methylation and transcriptional repression signaling (CDK14 and TET1) suggested that there may be differences in these processes between the twins. Taking these findings together, it is possible that some of the multi-omics differences between the Quartet identical twins at the immune, cellular and metabolic levels are due to genetic variation. Additional differences may be caused by environmental or random factors.

The similarity between the identical twins (D5 and D6) during vertical integration can be quantified to illustrate the impact of adding different layers of omics information on the clustering of the Quartet samples (see Methods for details). As shown in Fig. 6f, the similarity between D5 and D6 decreased both when gradually adding downstream omics data starting with genomics data (left, red) and when integrating upstream omics data starting with metabolomics data (right, blue; except for the eventual addition of DNA). This phenomenon again demonstrates that the genetic relationships between the Quartet identical twins are only reflected at the DNA level, and it also underscores the need to incorporate genomic data when using the three clusters (D–F–M) as a QC metric for vertical integration.

Best practices for QC using Quartet reference materials

QC comprises procedures to ensure the reliability of multi-omics profiling using defined QC metrics and thresholds to meet the requirements of different research purposes. Large-scale multi-omics studies involve multicenter and long-term measurements for which unified QC metrics and universal integration strategies are needed to ensure quality during data generation and integration. We recommend including the Quartet reference materials (for example, four samples × three replicates) or a similar strategy when profiling each batch of study samples and propose best-practice guidelines for QC and data integration in three aspects, including intra-batch data generation, horizontal integration and vertical integration (Extended Data Table 2).

We have provided both reference dataset-free and reference dataset-based QC metrics to assess the wet-lab proficiency of data generation for the same omics type in terms of the capacity to identify subtle differences between sample groups. Without relying on the reference datasets, the Quartet-based SNR (D5–D6–F7–M8) can be calculated for quality assessment for all types of omics data. SNR calculated on the basis of the four Quartet sample groups was more sensitive when assessing wet-lab proficiency than generic QC metrics based on multiple technical replicates of a single sample (Fig. 2). We also recommend use of the Mendelian concordance rate based on the pedigree of the Quartet family as a QC metric for assessing the quality of genomic data66. With the reference datasets, the wet-lab proficiency was assessed by the concordance between the evaluated batch of data and the reference datasets. Precision, recall and F1-score are recommended for qualitative omics (small variants and SVs), and RMSE at the ratio level (scaling to D6) for feature expression and the differential expression between groups (D5–F7, F7–M8 and M8–D5) is recommended for quantitative omics (DNA methylation, transcriptomics, proteomics and metabolomics). In addition, more comprehensive proficiency tests or inter-lab comparisons can be performed by obtaining the relative quality ranking among the cumulative datasets within the Quartet Data Portal72.
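The reference dataset-based metrics named above can be sketched in a few lines. This is an illustrative simplification, not the project's benchmarking pipeline: variants are represented as hypothetical (chromosome, position, ref, alt) tuples and matched exactly, whereas real comparisons normally require variant normalization and restriction to benchmark regions; the feature names and values are made up.

```python
from math import sqrt


def precision_recall_f1(called, reference):
    """Precision, recall and F1-score of a call set against a reference set."""
    called, reference = set(called), set(reference)
    true_positives = len(called & reference)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def rmse(observed, reference):
    """Root mean square error over the features shared by both profiles.

    Both arguments map feature name to a ratio-level value (for example a
    log2 ratio relative to D6).
    """
    shared = observed.keys() & reference.keys()
    if not shared:
        raise ValueError("no shared features to compare")
    squared = sum((observed[f] - reference[f]) ** 2 for f in shared)
    return sqrt(squared / len(shared))


# Hypothetical small-variant calls versus a hypothetical reference set.
reference_variants = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr3", 10, "T", "C")}
called_variants = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
precision, recall, f1 = precision_recall_f1(called_variants, reference_variants)

# Hypothetical ratio-level profiles for one sample pair.
reference_ratios = {"gene1": 0.0, "gene2": 0.0}
observed_ratios = {"gene1": 1.0, "gene2": -1.0}
error = rmse(observed_ratios, reference_ratios)
```

Restricting the RMSE to shared features is a design choice of this sketch; a batch that detects few of the reference features would additionally need a coverage check, since a low RMSE on a handful of features says little about overall quality.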

For horizontal integration of multi-batch data, a paradigm shift from absolute to ratio-based profiling by incorporating common reference materials is essential and improves the reproducibility and resistance to batch effects. QC metrics used in intra-batch data generation can still be used in the quality assessment of horizontal integration. The reliability of further exploratory studies can be ensured as long as the horizontally integrated dataset can still distinguish the different Quartet samples with subtle built-in differences.