Identification of novel genes whose expression in adipose tissue affects body fat mass and distribution: an RNA-Seq and Mendelian Randomization study

Many studies have shown that abdominal adiposity is more strongly related to health risks than peripheral adiposity. However, the underlying pathways are still poorly understood. In this cross-sectional study using data from RNA-sequencing experiments and whole-body MRI scans of 200 participants in the EPIC-Potsdam cohort, our aim was to identify novel genes whose gene expression in subcutaneous adipose tissue has an effect on body fat mass (BFM) and body fat distribution (BFD). The analysis identified 625 genes associated with adiposity, of which 531 encode a known protein and 487 are novel candidate genes for obesity. Enrichment analyses indicated that BFM-associated genes were characterized by their higher than expected involvement in cellular, regulatory and immune system processes, and BFD-associated genes by their involvement in cellular, metabolic, and regulatory processes. Mendelian Randomization analyses suggested that the gene expression of 69 genes was causally related to BFM and BFD. Six genes were replicated in UK Biobank. In this study, we identified novel genes for BFM and BFD that are BFM- and BFD-specific, involved in different molecular processes, and whose up-/downregulated gene expression may causally contribute to obesity.


MRI scans
MRI scans were obtained to assess body compartments from 594 participants.The measurements were performed with a 1.5T MRT scanner ("Magnetom Avanto", Siemens, Erlangen, Germany) and the Vibe Dixon sequence.The Vibe-Dixon sequence is a special MRI protocol for body fat analysis which separates the fat from tissue water.Automated segmentation algorithms of the MRI scans were used to quantify the fat mass in different body compartments with high repeatability and reproducibility.

RNA extraction from SAT biopsies
Subcutaneous adipose tissue biopsies were taken from 278 participants using a needle aspiration method with sufficient material extracted from 200 participants.For details regarding the subcutaneous AT biopsies, see Konigorski et al. (2019)  1 .From the SAT biopsies, the total RNA, genomic DNA, and total protein from the fat tissue samples were purified and separated using the Qiagen All Prep DNA/RNA Mini Kit.The purified genomic DNA has an average length of 15-30 kb.Regarding the RNA, only RNA molecules longer than 200 nucleotides were purified and with the employed standard protocol, all short RNA molecules with length less than 200 nucleotides were removed.These removed small molecules include most of the ncRNA and short mRNA.The quantity and integrity of the purified DNA and RNA was verified using the NanoDrop 1000 Spectrophotometer V3.7 (PeqLab) and the 2100 Bioanalyzer (Agilent), which uses an on-chip electrophoresis.For the assessment of gene expression of candidate adipokines with PCR, 2µg RNA were reverse transcribed to cDNA using the High Capacity cDNA Reverse Transcription Kit (Applied Biosystems).

Library preparation & multiplexing
The extracted RNA was prepared for sequencing using the TruSeq RNA and DNA Sample Preparation Kit v2 (Illumina).First, it was polyA-selected to purify and enrich for mRNA, fragmented into small pieces and primed (with random hexamers) for cDNA synthesis.The cDNA products were then enriched with PCR and ligated to adapters to multiplex samples.In the ligation, single indexes were used.Finally, the cDNA libraries were created, validated, normalized (so that they had equal volume) and pooled.The resulting pooled single-indexed paired-end libraries of different samples were then applied to the flow cells (containing 8 lanes) on cBot (Illumina), so that multiple samples could be sequenced together.

Sequencing and demultiplexing
The multiplexed probes were sequenced on the Illumina HiSeq 2000 platform in 201 sequencing cycles.Of the 200 samples with total RNA extracted from the SAT biopsies, 198 samples were sequenced with 6 samples per lane, yielding on average 64,095,856 raw reads (SD=7,518,970), with minimum 43,373,110 and maximum 85,591,020 raw reads per probe.Two samples were sequenced each on one lane as a sensitivity check and to assess how many more genes can be detected with a greater sequencing depth.After the alignment and QC filtering (see description below), approx.50 million single reads (i.e., about 25 million paired reads) were available at high quality per sample, and about 330 million single reads of the two deeply sequenced samples.
The raw data containing the sequences of quality-scored base calls was saved in .bclfiles.Next, the CASAVA (Illumina) software was used for converting the .bclfiles to .fastqfiles.In the same step, the multiplexed samples were demultiplexed and the raw reads of each sample are extracted and saved.

Quality control of the raw sequencing reads
For an assessment of the sequencing quality, the Q-score was used as quality scoring method, which is a prediction of the probability of an incorrect base call.All probes had a high percentage of bases with high quality (Q>30), on average 88.0%(SD=1.9%).The mean quality score of bases in a probe was 32.5 (SD=0.5).
More quality checks were performed by investigating the sequence quality, GC content, sequence base content, the presence of adapters and duplicated reads using the FASTQC tool 2 .The sequence quality was high for all probes, across all base pair positions of all reads.The GC distribution across all reads was close to the expected theoretical distribution for all probes, with only minor deviations from some probes which didn't indicate systematic biases or deviations.There was no indication for any problems with adapters.Due to the PCR-steps involved in the cluster generation in RNA-Seq, duplicate sequences were naturally observed, around 50% for all probes.There were no specific sequences that were reported to be systematically duplicated, hence there didn't seem to be any systemic problems.
As a summary, the quality control checks of the raw reads showed a consistently high read quality without indication for systematic sequence biases, presence of adapter sequences, or duplication levels beyond what can be expected from RNA-Seq.Hence, the fastq files were parsed to the alignment stage with TopHat2 without trimming or deletion of reads.

Read alignment
The reads were aligned to the human reference genome GRCh38 (Homo_sapiens.GRCh38.78)using TopHat2 (version 2.0.12) with Bowtie2 (version 2.0.6.0) and samtools (version 0.1.18.0), which maps the RNA reads in the presence of insertions, deletions and gene fusions.In the alignment, all reads were used without trimming or discarding reads with low-quality calls.

Quality control of the aligned reads
Low-quality reads, reads with multiple alignments, and not properly aligned pairs were filtered.In more detail, on average, 90.3% (SD=1.2%) of all reads could be aligned.Furthermore, on average, 85.7% (SD=1.9%)or all raw reads could be aligned in pairs and on average 91.8% (SD=1.3%) of the mapped reads had a high mapping quality.Hence, in addition to the high sequencing quality, the alignments had high quality.

Read counts as raw measures of gene expression levels
The aligned reads were first sorted using samtools and then htseq-count was used to obtain counts as a raw measure of the gene expression levels.Counts were obtained for genes with respect to Ensembl gene identifiers.Default settings were used to discard aligned reads with mapping quality smaller than 10.Since all mapped reads had either mapping quality higher than 30 or lower than 10, only high-quality-mapped reads were used to obtain counts.Reads overlapping multiple genes were not counted for any gene (based on the default mode "union").The median absolute number of counted reads was 23,850,000.The percentage of mapped and counted reads relative to the raw reads was on average 56.7% (SD=3.0%).

Quality control and normalization of read counts
The Ensembl database 3,4 lists in total 64,253 genes for the reference genome GRCh38.78(obtained from http://dec2014.archive.ensembl.org/index.html).Based on the obtained raw gene expression levels, 48,126 genes were expressed in at least one probe.Interestingly, only 107 of the 48,126 genes (≈0.2%) were uniquely observed in the 2 deeply sequenced probes which yielded 6 times as many reads.On average, the expression of 27,690 genes was observed per sample (SD=1,274, min=25,430, max=33,280), 19,460 genes had at least 5 counts per sample (SD=1,042, min=17,600, max=25,910), 17,500 genes had at least 10 counts per sample (SD=940, min=15,890, max=23,580), and 13,630 genes had at least 50 counts per sample (SD=696, min=12,420, max=18,470).When investigating the changes when probes were sequenced at the depth of 1 sample per lane, 3 samples per lane, and 6 samples per lane, as can be expected, the number of genes without observed counts decreases, and the average count per gene as well the maximum number of counts per gene increases, but the shape of the average observed counts per gene and the overall pattern don't change with sequencing depth.For further details regarding the comparison of different sequencing depths, the relative frequency of gene classes that are detected with the different sequencing depths was investigated.As could be expected, this percentage is always higher for the deeper sequenced probe, but again, the differences are not substantial especially for protein coding genes.
In the analysis of gene expression measures from RNA-Seq, the observed counts are often dependent on the gene length (the longer the gene, the higher the counts), which can affect downstream analyses. 5To normalize gene expression measures we computed TPM 6 (transcripts per million: counts per gene length adjusted for total counts per gene lengths in million) as withinsample normalizations to correct for gene length and library size (i.e., total number of reads).TPMnormalized counts were computed for the 48,019 genes which had non-zero counts in at least 1 sample.For this, the gene lengths were obtained from the Ensembl database through biomaRt 7 at dec2014.archive.ensembl.org(accessing GRCh38.78).In addition to the within-sample normalization, the TMM 8 method was used for a between sample normalization to account for potential sequencing biases.TMM computes the trimmed mean of M-values between each pair of samples and thereby normalizes for RNA composition using scaling factors for the total number of reads (i.e., library sizes), which minimizes the log-fold changes between samples.The effective library size is computed and used instead of the observed library size.
We checked the GC content of a gene, which can bias downstream analyses. 5,9The results indicated that the mean gene expression (i.e., TMM-normalized TPM value over all samples) was not associated with GC content of all genes.As a result, the gene expression measures were not further normalized for GC content in order not to remove true biological variance and the TMMnormalized TPM values were used as primary measure of gene expression.All genes with observed counts for at least one individual were passed on to the following stages of the data processing.
As further validity check of the obtained gene expression measures that are analyzed in the following, correlations between RNA-Seq and qPCR gene expression estimates were computed for 6 candidate genes that were investigated in Konigorski et al. ( 2019) 1 .The (Pearson) correlations were r=0.36 for FABP4, r=0.56 for leptin receptor, r=0.58 for adiponectin, and r=0.85-0.87 for leptin, interleukin-6, and resistin, in line with previous reports in the literature. 10,11,12

Transformations of normalized read counts
Next, the distribution of the gene expression measures was investigated.Transformations (such as a log-transformation) lead to problems for very lowly expressed genes, where almost everyone has an observed count of 0. Hence, the following analyses and transformations were restricted to those 30,917 genes, where at least 25% of the people (i.e., at least 50 probes) had non-zero observed counts.Most lowly expressed genes had a very skewed distribution: of the 30,917 genes, 15,569 had skewness greater than 1, and 5498 had skewness greater than 2.
The skewed genes (i.e., with skewness greater than 1) were transformed using the Yeo-Johnson transformation (which is the Box-Cox transformation of U + 1 for nonnegative values, and of |U| + 1 with parameter 2-λ for U negative) using the yjPower() function in the car R package 13 , where the parameter λ for the transformation is determined in a first step using the powerTransform() function.This approach seemed to be successful in removing the skewness for all genes.Still, 7,953 genes didn't seem to be normally distributed according to the Kolmogorov-Smirnoff test with a p-value smaller than 0.001, for example, due to bimodal or other forms of distributions.However, for those genes, there wasn't a clear indication which transformation could yield normally distributed measures.Hence, the Yeo-Johnson transformed gene expression measures were used as final measure for the analysis.To incorporate that the normality of some genes is questionable, robust analyses incorporating multiple genes (e.g. the GO term enrichment analysis) were performed, and top associated genes with highest significance in the main analyses were inspected manually.

Assessment of genetic variation
In order to investigate genetic variants, SNVs (in coding regions) were called from the RNA-Seq reads using the mpileup tool of bcftools version 1.9 14,15 and further quality-controlled and trimmed.

GO-term enrichment analysis
For the GO-term enrichment analysis, the topGO R package 18 was used with the biological processes ("BP") sub-ontology, pruning GO terms with less than 10 annotated genes before the enrichment analysis, and computing gene-GO term mappings based on the Ensembl gene identifiers.As enrichment tests, classic Fisher's exact test and the classic as well as adapted "elim"

Biological regulation 158
Metabolic process 49 Immune system process 84

Response to stimulus 84
Localization 27

Biological adhesion 16
Cell proliferation 14 Cellular component organization or biogenesis 30

Multicellular organismal process 27
Signaling 33 Table S1.Results of the GO term enrichment analysis in the obesity study.
Shown are the number of (highest-level parent GO terms of those) GO terms that were under-/overrepresented in the 441 genes associated with SAT (left panel) and 225 genes associated with SAT/TAT (right panel), compared to the full pool of 30,917 genes, using the classical KS test.Table S3.Results of the GO term enrichment analysis for SAT-associated genes in the obesity study.

GO ID GO term
Shown are the 15 GO terms that were (statistically significantly after multiple testing correction for all 6287 analyzed GO terms) overrepresented in the 441 genes associated with SAT compared to the full pool of 30,917 genes, using Fisher's exact test.Shown are the GO terms with their description, the number of genes that were annotated with the respective term in all 30,917 genes ("Annotated"), the number of genes that were annotated with the respective term in the 441 SAT-associated genes ("Significant"), which is contrasted with the expected number of annotated genes in this pool of 441 genes ("Expected"), and the p-value from Fisher's exact test.Table S4.Results of the GO term enrichment analysis for SAT/TAT-associated genes in the obesity study.

GO.ID
Shown are the 36 GO terms that were (statistically significantly after multiple testing correction for all 6287 analyzed GO terms) overrepresented in the 225 genes associated with SAT/TAT compared to the full pool of 30,917 genes, using Fisher's exact test.Shown are the GO terms with their description, the number of genes that were annotated with the respective term in all 30,917 genes ("Annotated"), the number of genes that were annotated with the respective term in the SAT/TAT-associated genes ("Significant"), which is contrasted with the expected number of annotated genes in this pool of 225 genes ("Expected"), and the p-value from Fisher's exact test.
see file "TableS6.xlsx"Table S6.Results and annotations for the 215 autosomal genes associated with SAT/TAT.
Shown are the Ensembl ID of the gene ("Ensembl"), the gene symbol ("Gene"), the genomic position in form of the chromosome number ("Chr") as well as the start ("Gene_start") and end ("Gene_end") position in base pairs of the gene; whether the gene is known to be associated with obesity (based on the NCBI gene and GWAS Catalog databases), the corresponding protein encoded by the gene if it is known (from UniProt); the estimates of the effect size ("C-JAMP_beta"; estimate of beta_j in the marginal model of C-JAMP in equation ( 2) in the main text), its standard error estimates ("C-JAMP_SE") and p-value ("C-JAMP_pvalue") from the C-JAMP association analysis of SAT and SAT/TAT conditional on the gene expression and covariates; the results from the Mendelian randomization (MR) analysis in form of the number of SNVs in the gene that were incorporated in the MR analysis ("MR_no_var"), the explained variance of the gene expression based on a multiple linear regression model containing these SNVs ("MR_R2"), and the p-value from the inverse-variance weighted MR method ("MR_pvalue"); as well as the results of the replication analysis in form of the number of SNVs in the gene that were incorporated in the MR analysis ("no_var"), the explained variance of the gene expression based on a multiple linear regression model containing these SNVs ("R2"), and the p-value from the linear regression analysis ("pvalue").

Table S7 .
Sex-stratified characteristics of the study population from the UK Biobank (n=4904).Values are relative frequencies, mean and SD, or *median and median absolute deviation.aSAT, abdominal subcutaneous adipose tissue; BMI, body mass index; CSE, Certificate of Secondary Education; GCSE, General Certificate of Secondary Education; HNC, Higher National Certificate; HND, Higher National Diploma; NVQ, National Vocational Qualifications; VAT, visceral adipose tissue.