Geographical genomics of human leukocyte gene expression variation in southern Morocco

Idaghdour, Youssef; Czika, Wendy; Shianna, Kevin V; Lee, Sang H; Visscher, Peter M; Martin, Hilary C; Miclaus, Kelci; Jadallah, Sami J; Goldstein, David B; Wolfinger, Russell D; Gibson, Greg

doi:10.1038/ng.495

Article
Published: 06 December 2009

Nature Genetics volume 42, pages 62–67 (2010)

Abstract

Studies of the genetics of gene expression can identify expression SNPs (eSNPs) that explain variation in transcript abundance. Here we address the robustness of eSNP associations to environmental geography and population structure in a comparison of 194 Arab and Amazigh individuals from a city and two villages in southern Morocco. Gene expression differed between pairs of locations for up to a third of all transcripts, with notable enrichment of transcripts involved in ribosomal biosynthesis and oxidative phosphorylation. Robust associations were observed in the leukocyte samples: cis eSNPs (P < 10⁻⁸) were identified for 346 genes, and trans eSNPs (P < 10⁻¹¹) for 10 genes. All of these associations were consistent both across the three sample locations and after controlling for ancestry and relatedness. No evidence of large-effect trans-acting mediators of the pervasive environmental influence was found; instead, genetic and environmental factors acted in a largely additive manner.

Main

The human transition from pastoral and rural to urban lifestyles has been accompanied by an increase in the incidence of numerous chronic diseases such as asthma, diabetes and cancer¹. Environmental contributors, which are likely to include dietary shifts, pollution and psychological factors, are the subject of continuing epidemiological research. It is equally interesting to determine whether genetic influences on disease susceptibility change across environments.

Because disease risk is commonly thought to involve differential gene expression², we have assessed the robustness of transcript abundance to environmental variation by performing a genome-wide association study (GWAS) on leukocyte gene expression profiles across two ancestries in three locations. Previously, we demonstrated that environmental geography³ has a substantial effect on gene expression in Moroccan Amazigh individuals; here, we add the contrast with people of Arab descent, enabling us to test whether geography and/or ancestry affects each of several hundred robust associations between genotype and transcript abundance.

Results

Population structure of southern Morocco

The Souss region in southern Morocco is home to several million people of two dominant ancestries who live in either cities or rural villages (Fig. 1). The Amazigh Berbers are descendants of the first modern humans who populated north Africa 35,000 years or more ago⁴, and many still live in traditional villages in the low Atlas Mountains. The Arabs, by contrast, moved into southern Morocco between the seventh and eleventh centuries and tend to occupy lowland villages. The cities are inhabited by both groups, often retaining their linguistic and cultural identities.

**Figure 1: Map of the Souss region of southern Morocco showing the sampling locations.**

In June and July of 2008, we collected peripheral blood samples from 284 healthy adults from four locations, including approximately equal numbers of men and women, and of Amazigh and Arabs. Half of the sample was from two high-density, low- to middle-income, urban communities, Anza and Dchiera, located on either side of the city of Agadir. The other half was from two rural villages near Tiznit, 120 km to the south. Boutroch is predominantly Amazigh and remains relatively isolated, whereas Ighrem is predominantly Arab and (on the basis of self-reported information and our observations at the collection site) many of the men, in particular, commute into the cities.

Leukocytes were isolated from serum, platelets and erythrocytes at the time of blood sampling by depletion filter technology⁵ and fixed in RNALater solution within minutes of blood collection. Gene expression profiles were obtained from 208 high-quality RNA samples by using Illumina HumanHT12 bead arrays that include 48,804 probes, of which 22,300 RefSeq probes for 16,738 genes were deemed to have signal above background. To minimize batch effects, all samples were processed in the same week, and the extraction, labeling and hybridization steps were performed in accordance with a randomized block design. Whole-genome genotypes were obtained from whole-blood samples by using Illumina Human 610-Quad arrays. After quality control filters were applied, 516,972 SNPs were available for 194 of the individuals who also had gene expression profiles.

Population structure was assessed by examining the principal components of the variance of the genotype profiles using Eigenstrat software⁶. Initial examination revealed several clusters of siblings and other close relatives (cousins or similar), whose similarity skewed the axes; where data were available, these identities were in agreement with participant records. After removal of these relatives, analysis of 163 unrelated individuals revealed seven significant eigenvectors (or genotypic principal component axes, gPCs). None of these explained more than 5% of the variance, and gPC3–gPC7 were heavily weighted by large clusters of SNPs on one or a few chromosomes. Such axes are commonly observed and do not provide reliable genome-wide estimates of population structure^7,8, but notably gPC3 distinguishes Ighrem from the other locations (Supplementary Fig. 1a).

A plot of the first two eigenvectors highlights the main historical influences on population structure in southern Morocco (Fig. 2a). gPC1 separates only a dozen individuals, and we inferred that this axis represents a sub-Saharan African contribution, consistent with expected levels of admixture in Morocco, by performing an analysis including 21 Yoruban individuals (Supplementary Fig. 1b). gPC2 is highly correlated with both location and self-reported ancestry; thus, we inferred that it captures the main component of Arab-Amazigh ancestry.

**Figure 2: Population structure in southern Morocco.**

An unexpected aspect of this analysis is the positioning of Ighrem Arabs between Boutroch Amazigh and half of the Agadir Arabs along gPC2. Structure analysis⁹ of 16,000 randomly chosen autosomal SNPs assuming admixture of two ancestral populations (Fig. 2b) confirmed that Ighrem residents tend to be a mixture, whereas most Amazigh are derived from one population, and only a few Agadir Arabs represent the other. Thus, there has probably been considerable admixture between these two groups over an extended period of time, possibly with recent movement of Arabs from other locations into Agadir. A slight shift of Ighrem Arabs toward the Amazigh pole of gPC2, relative to Agadir Arabs, would also be consistent with genetic exchange between the villages over 50 generations. Further sampling of villages in the region may reveal subtle population structure across southern Morocco^10,11,12,13.

Regional differentiation in gene expression

Next, we tested whether region, location and ancestry affect gene expression profiles, and if they do so in a gender-specific manner. Because location and ancestry are confounded in the villages, several parallel analyses were undertaken to tease apart these influences. Transcript abundance data were transformed by median centering on the log₂ scale (Supplementary Fig. 2), which results in maximal overlap of profiles without altering their variance.

Gene-specific analysis of variance (ANOVA)¹⁴ with expression as a function of region, gender and their interaction identified 1,521 probes that were significant at a false discovery rate (FDR) of 1% (P < 0.0007). Region, namely the rural (Boutroch plus Ighrem) versus city (Anza plus Dchiera) comparison, is the main effect in this joint analysis. Almost 7% of all expressed genes differentiate these individuals by this conservative criterion, whereas considerably less than 1% of the probes show gender differences (see Supplementary Table 1 for a full list of genes). Among several classes of genes overrepresented in this lifestyle comparison, small nucleolar RNA genes stand out: 5 of the top 8 overall and 15 of 29 members of the SNORD family are in the highly significant list, as compared with only 1 of 10 SNORA genes. There is little in the literature to indicate why this is the case or what the physiological consequences may be, but epigenetic modification has been observed for many small nucleolar RNA genes¹⁵.
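The text reports a 1% FDR threshold but does not name the procedure used to control it. As a minimal sketch, assuming a Benjamini-Hochberg step-up procedure (an assumption, not a statement of the authors' method), the mapping from a list of per-probe P values to a significance cutoff can be illustrated as follows. The function name and the toy P values are hypothetical.

```python
def benjamini_hochberg(pvalues, fdr=0.01):
    """Benjamini-Hochberg step-up procedure (assumed here, not stated in the text).

    Returns (sorted indices of rejected tests, largest rejected P value or None).
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    # Find the largest rank k whose ordered P value satisfies p_(k) <= fdr * k / m.
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= fdr * rank / m:
            cutoff_rank = rank
    rejected = sorted(order[:cutoff_rank])
    threshold = pvalues[order[cutoff_rank - 1]] if cutoff_rank else None
    return rejected, threshold


# Hypothetical P values for ten probes (not data from the study).
toy_p = [0.0001, 0.0004, 0.0019, 0.03, 0.2, 0.5, 0.7, 0.9, 0.95, 0.99]
significant, p_cut = benjamini_hochberg(toy_p, fdr=0.01)
```

Under this procedure the nominal P value cutoff depends on how many tests are significant, which is why a fixed FDR corresponds to different P thresholds in different analyses.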

Even more differentiation was observed when we fitted ANOVA models including location, gender and their interaction. Because exploratory analyses indicated that the Anza and Dchiera samples are indistinguishable for either gene expression or genotype, these samples were combined into a single location, Agadir, in all subsequent analyses. In the three-way comparison, 8,459 probes (38%) were significant at the 1% FDR threshold for location (Supplementary Table 2). Boutroch differs from Ighrem and also Agadir at over 7,000 probes in each contrast, with a high degree of overlap (Fig. 3a and Table 1). Ighrem and Agadir are much more similar to one another, in part because there is considerably more diversity in the Ighrem sample that reduces the significance of the location contrast. Women are much more differentiated among locations than men (Table 1). These results confirm our previous report of substantial differentiation between Bedouin nomads, urban Anza and another remote Amazigh village, Sebt Nabor³.

**Figure 3: Location affects gene expression across the transcriptome.**

Table 1 Number of transcripts significant at 1% FDR

To evaluate the possible independent contribution of ancestry more carefully, we carried out variance component analysis of the expression variation. In Agadir alone, neither ancestry (modeled as the second eigenvector of the genotype data, gPC2) nor gender has a noteworthy impact on the principal components of the expression variation (Fig. 3b). In the total data set, however, there is evidence of a contribution: when fitted jointly with location, the ancestry and ancestry- and gender-by-location interaction terms make a substantial contribution to the expression profiles (Fig. 3c).

Although gender and ancestry affect the expression of fewer genes as compared with location, the plot of expression PC1 by PC2 for the most differentially expressed 1,500 genes indicates that for many genes the interaction between these three factors is complex (Fig. 4). This complexity is also seen in the expression profiles of characteristic individual genes (Supplementary Fig. 3). In general, Boutroch and Ighrem villagers separate along PC1, whereas high values of PC2 are obtained for all Boutroch residents (cluster 1) and for Arab women in Ighrem (cluster 2). Amazigh women from Ighrem (cluster 3) and the Ighrem men (cluster 4) have lower values of PC2, similar to those observed for all Agadir residents. The simplest interpretation is that cultural or behavioral differences, probably including time spent outside the village, contribute strongly to the observed gender and ancestry effects. Deeper sampling would be required to establish firmly whether intrinsic biological differences between the sexes and/or populations also make significant contributions to expression divergence in lymphocytes, as they appear to do for lymphoblast cell lines grown in culture^16,17,18,19.

**Figure 4: Principal component plot for the most differentially expressed genes.**

Two classes of genes stand out as significantly differentially expressed among locations: namely, those encoding ribosomal proteins of the small and large subunits, as well as the cytoplasmic and mitochondrial compartments; and those encoding proteins involved in oxidative phosphorylation, which are highly upregulated in half of the Agadir residents (Supplementary Fig. 4a). All of the transcripts encoding these proteins form a module of co-regulated genes, but notably this module is not coexpressed with the SNORD family, which tends to be relatively downregulated in Agadir individuals but particularly highly expressed in the Arab women from Ighrem (Supplementary Fig. 4b). These differences may reflect differential abundance of leukocyte cell types, but ribosomal biosynthesis is also related to response to viral infection, and seems to be involved in tumorigenesis in conjunction with mitochondrial activity^20,21. Oxidative phosphorylation is correlated with renal health and the production or disposal of free radicals²²; thus, our data suggest that deeper evaluation of health risks associated with lifestyle transitions may be revealing.

Genome-wide association with gene expression variation

The genetic contribution to expression variation was evaluated by genome-wide association with expression of all 22,300 probes. Starting with a simple test of the correlation between each transcript abundance and each genotype, and filtering to retain only eSNPs with a minor allele frequency of >0.05, we observed 3,430 associations at P < 10⁻⁸. Further filtering of eSNPs to retain only autosomal associations with annotated genes, and imposing the additional stringency of P < 10⁻¹¹ for putative trans associations between an eSNP on one chromosome and a probe on another chromosome, reduced this number to 1,636 associations: 1,569 (96%) of these associations are intra-chromosomal linkages and most are within 50 kb and hence cis-acting (Supplementary Fig. 5); only three are clearly in different chromosomal intervals. Facsimile associations were observed for 39 of the target genes represented by a second probe (37 cis, 2 trans). Reducing the data set further to exclude linked associations within haplotype blocks left 346 unique cis and ten unique trans associations at the stringent genome-wide significance level of 5%. These proportions are in good agreement with most other GWAS expression studies on blood or lymphocyte cell lines^16,17,23,24,25,26, and a 30-fold or greater excess of cis over trans associations is also supported by 1% FDR estimates of 600 and 20 genes, respectively (see Supplementary Table 3 for a complete list of peak cis and trans associations).
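The first-pass test described above (correlation of each transcript with each genotype, after a minor allele frequency filter) can be sketched for a single SNP-probe pair. This is an illustrative sketch only: genotypes are assumed to be coded as 0, 1 or 2 copies of one allele, the function names and toy values are hypothetical, and the conversion of the t statistic to a P value is omitted.

```python
import math


def minor_allele_frequency(dosages):
    """Minor allele frequency from 0/1/2 counts of one allele."""
    freq = sum(dosages) / (2 * len(dosages))
    return min(freq, 1 - freq)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def trend_statistic(dosages, expression, maf_min=0.05):
    """Return (r, t) for one SNP-probe pair, or None if MAF is not above maf_min.

    t has n - 2 degrees of freedom; computing its P value is left out here.
    """
    if minor_allele_frequency(dosages) <= maf_min:
        return None
    r = pearson_r(dosages, expression)
    n = len(dosages)
    return r, r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical genotypes and median-centered log2 expression for eight people.
dosages = [0, 0, 1, 1, 2, 2, 1, 0]
expression = [-0.9, -1.1, 0.1, -0.1, 1.2, 0.8, 0.0, -1.0]
result = trend_statistic(dosages, expression)
```

Because the filter rejects any SNP with a minor allele frequency at or below the cutoff, monomorphic SNPs never reach the correlation step, where they would cause a division by zero.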

Given the high degree of population structure for gene expression, we addressed the possibility that differentiation of eSNP allele frequencies may contribute to the associations observed by estimating the fraction of variation among subpopulations (F_ST) for each pair-wise comparison of location for the 516,972 SNPs and 16,500 of the genes. No fixed differences were observed, and plots of the F_ST comparisons (Supplementary Fig. 6a) indicate only moderate overall genetic differentiation with a few SNPs having F_ST values between 0.12 and 0.3. There is no tendency for these outliers to have increased differentiation in expression, and in fact almost all of the top 10% most differentially expressed genes are among the least genetically differentiated. Nor is there any correlation between F_ST and significance of gene expression divergence (Supplementary Fig. 6b), confirming that the expression differences observed between locations are for the most part not attributable to gene-specific allelic frequency differences between locations.
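The F_ST estimator is not specified in this section. As an illustration only, assuming a simple Wright-style definition based on expected heterozygosity, F_ST = (H_T − H_S) / H_T, a per-SNP value for one pair of locations could be computed as below. The function name and the example frequencies are hypothetical.

```python
def fst_two_populations(p1, n1, p2, n2):
    """Wright-style F_ST for one biallelic SNP in two samples.

    p1, p2: frequency of the same allele in each sample.
    n1, n2: number of individuals in each sample, used as weights.
    This simple estimator is an assumption, not the authors' stated method.
    """
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    p_bar = w1 * p1 + w2 * p2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0  # monomorphic in both samples
    h_within = w1 * 2 * p1 * (1 - p1) + w2 * 2 * p2 * (1 - p2)
    return (h_total - h_within) / h_total


no_difference = fst_two_populations(0.3, 50, 0.3, 50)
moderate = fst_two_populations(0.2, 50, 0.5, 50)
fixed_difference = fst_two_populations(0.0, 50, 1.0, 50)
```

In this formulation a fixed difference gives F_ST = 1 and identical frequencies give 0, so the range of 0.12 to 0.3 reported for the outlying SNPs corresponds to substantial but far from complete divergence in allele frequency.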

The robustness of the 3,430 associations to environmental sources of variance and population structure was further evaluated by fitting two additional linear trend models to the data. The first included location, gender and the interaction between them. The second included two measures of ancestry (the first three genotype eigenvectors and a four-way categorical ancestry cluster, see Online Methods), a matrix of relatedness based on an identity-by-descent measure²⁷, and gender interactions with ancestry cluster and genotype. Figure 5a,b shows the Manhattan plot of associations by chromosomal location for the second of these models, and the cis-trans plot of target against eSNP location, respectively. The logarithm of the genotype significance term is highly correlated (r > 0.95) between both of these models and the original correlation test (Fig. 5c and Supplementary Fig. 7). In addition, there is no evidence for significant genotype-by-location interactions in any of the association trend tests (Fig. 5d). Neither the ancestry nor the relatedness variance components explain an appreciable amount of the expression variation for any of the transcripts (Supplementary Fig. 8).

**Figure 5: Genome-wide association with transcript abundance.**

The absence of interaction effects can be visualized by plotting expression as a function of genotype, with color coding of each location, for each association. An example of a trans association in Supplementary Figure 9 shows the clear trend of increased expression of AMY1A (chromosome 1) in homozygotes for the A allele of ACTG1 gamma actin (chromosome 17) consistently across the three locations despite slight overall location effects. Expression of AMY1A is highly correlated with that of AMY1B (r > 0.8) and many other genes in a coexpression module, but the eSNP regulates only AMY1A, increasing expression of the gene twofold in an additive manner. A similar plot for another representative gene (C21ORF57) shows highly significant location and genotype effects in cis (Fig. 6a) (see Supplementary Fig. 9c for further examples).
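The additive pattern described above, in which location shifts the overall expression level while the genotype effect stays the same, can be illustrated by fitting a separate regression line in each location and comparing slopes. The sketch below uses hypothetical values, not data from the study, and is not the linear trend model fitted by the authors.

```python
def fit_line(dosages, expression):
    """Ordinary least-squares slope and intercept of expression on 0/1/2 genotype."""
    n = len(dosages)
    mx, my = sum(dosages) / n, sum(expression) / n
    sxx = sum((x - mx) ** 2 for x in dosages)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, expression))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical genotypes and log2 expression values for one transcript.
toy = {
    "Agadir": ([0, 1, 2, 0, 1, 2], [0.1, 0.6, 1.1, 0.0, 0.5, 1.0]),
    "Boutroch": ([0, 1, 2, 0, 1, 2], [0.5, 1.0, 1.5, 0.4, 0.9, 1.4]),
    "Ighrem": ([0, 1, 2, 1, 1, 0], [0.3, 0.8, 1.3, 0.7, 0.8, 0.2]),
}
fits = {loc: fit_line(g, e) for loc, (g, e) in toy.items()}
```

In this toy example the slopes are close to 0.5 in every location while the intercepts differ, which is the signature of a location effect and a genotype effect acting additively. A genotype-by-location interaction would instead appear as slopes that differ between locations.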

**Figure 6: Relationship among genotype, expression and phenotype.**

Novel associations with potential disease alleles

Expression associations detected in one tissue can identify regulatory variants that may be active in other tissues that are directly engaged in the etiology of disease^23,25,26. For example, cis linkages in peripheral blood are associated with the type 1 diabetes (T1D) susceptibility locus at chromosome 12q13. The strongest expression association is with transcription of the ribosomal protein gene RPS26, and network analyses have been used to argue that this gene is a more likely candidate for diabetes than is the initially reported²⁸ gene ERBB3. However, the strongest T1D association involves a SNP that differs from that associated with expression and/or splicing²⁴ of RPS26. We further found that the same linkage group of eSNPs, centered on rs10876864 in the SUOX gene 35 kb from RPS26, is also associated in trans with other RPS26 paralogs (probably owing to cross-hybridization), and with CCDC4 on chromosome 4, albeit at the suggestive significance level of P = 3.5 × 10⁻¹⁰. Intriguingly, expression of RPS26 is only weakly correlated with that of the module of ribosomal proteins that differentiate locations (Supplementary Fig. 4b); therefore, this association does not contribute to the environmental effect on transcription of ribosomal protein genes.

Another trans association involves rs11987927 in MYOM2 at 8p23, which interacts with the zinc finger transcription factor gene ZNF71 at 19q13 and also with its own MYOM2 transcript. Logic would suggest that the cis association probably affects the abundance of the MYOM2 myomesin protein, which in turn regulates ZNF71; however, the trans association is significantly stronger, and conditional dependence analysis^29,30 points in the opposite direction: that is, the MYOM2 regulatory site influences ZNF71, which then feeds back on the MYOM2 transcript (Supplementary Fig. 10). This example may be a cautionary tale concerning the interpretation of conditional dependence results. Notably, four of the seven strongest trans associations involve regulation by loci that include genes encoding structural proteins; the others are the laminin gene LAMA5 (20q13) with the oxysterol binding protein gene OSBPL2, and the pleckstrin homology domain gene PLEKHM1 (17q21) with the kinase gene MAPK8IP1.

One further trans association deserves attention. Prolongation of fetal gamma hemoglobin expression in adults is often observed in individuals with thalassemia. We found association of two probes that detect transcripts of the hemoglobin genes HBG1 and HBG2 at 11p15 with rs766432 in the second intron of the zinc-finger proto-oncogene BCL11A at 2p16. This same SNP has been associated with the fraction of erythrocytes that contain measurable fetal hemoglobin³¹, and alteration of BCL11A activity has been shown to drive differences in globin switching between mice and humans³². Another SNP in BCL11A, rs4671393, has been associated with abundance of two BCL11A transcript isoforms in the CEU (CEPH; Utah residents with ancestry from northern and western Europe) and YRI (Yoruba in Ibadan, Nigeria) HapMap lymphoblast cell lines³³, but is not associated with BCL11A transcript abundance in our leukocyte data, suggesting that regulation of BCL11A translation or protein activity is more likely to be affecting HBG1 and HBG2 expression in our sample.

Numerous cis associations are likely to be of interest. We scanned the GWAS association database for overlap between our study and established disease associations at P < 10⁻⁵. Of 1,628 entries, ten involve cis associations observed in our data set that explain between 15 and 55% of the transcript variance (Supplementary Table 4). Five of the associations are with disease conditions (rheumatoid arthritis, celiac disease, T1D, ulcerative colitis and systemic lupus erythematosus) and five are with endophenotypes (levels of the proteins PAFAH1B2 and ICAM-1, triglycerides, low-density lipoprotein cholesterol and hip bone mineral density). The two serum protein associations^34,35 are with the same SNPs that we detected and hence suggest that protein abundance is largely regulated at the transcriptional level.

Discussion

Genetic and environmental contributions to transcript variation

Our geographical genomic survey of gene expression variation in southern Morocco has highlighted two parallel and for the most part non-overlapping insights. On the one hand, it is evident that as much as a third of the transcriptome is influenced by the environment in a highly coordinated manner such that where a person lives explains up to a quarter of the variation for a substantial fraction of the transcripts. The environmental influences are probably a combination of biotic and abiotic factors, in addition to cultural and behavioral ones, whereas genetic differences between the two north African ancestries are relatively minor. On the other hand, the genome is littered with strong genetic associations, mainly in cis, that explain between 15 and 60% of the variance of 5% of the transcripts. Impressive as these associations are, particularly because they are apparent in a sample of just under 200 individuals, they have essentially no bearing on most of the transcriptional variation and are not informative of the genetic basis of the environmental response.

The robustness of the associations observed to the environmental effect raises the issue of whether genotype-by-environment interactions influence the peripheral blood transcriptome at all. Genome-wide significant interaction effects are generally unlikely to occur in the absence of significant main genotype effects³⁶. The only circumstances in which they will occur are when the genotype effect is in the opposite direction in two locations, and if the genetic effect in these locations is at least the same magnitude as the main effects detected in this GWAS — in other words, if the effect can explain >30% of the variance of a particular transcript. Although a few such interactions may exist, it would take a study comparing several thousand individuals from each location to reveal weaker genotype-by-environment interactions. If the genetic architecture of transcription is similar to that of visible phenotypes such as height and body mass^37,38, then even such a study will be underpowered to explain most transcriptional variance.

A related issue is whether or not genotype-by-environment interactions at the level of transcription are necessary to explain genotype-by-environment interactions for disease. It is possible that small interactions beneath the level of detection of GWAS are prevalent, or alternatively that disease arises primarily as a result of rare alleles of major effect, whose penetrance may be modulated in an environment-specific manner. However, transcriptional interactions are not required to explain the increased incidence of chronic disease. It is not difficult to imagine that individuals who fall into the chief categories of transcriptome profiles (such as those implicated in Fig. 4 and Supplementary Fig. 4) have different distributions of disease susceptibility that alter the genotype-disease association matrix across the genome, thereby inducing environment-by-genotype interactions for disease. Transcription of genes that contribute to this expression component may also correlate directly with disease, effectively uncovering cryptic variation and resulting in environment-specific eSNP disease associations without any interaction effect at the level of transcription³⁹ (Fig. 6). A corollary of this is that gene expression profiling might be used to stratify individuals at higher risk for disease, thereby increasing the resolution of GWASs by focusing attention on the subset of individuals in whom genetic effects on disease are most pronounced.

Methods

Study population.

Sampling was designed so that four localities representing two main lifestyles and including both genders were sampled, and both Arab and Amazigh ancestries were represented in each locality. Sampling of the two ancestries relied originally on self-reported information. The urban group consisted of residents sampled from two low-income districts, Anza and Dchiera, located seven miles apart on the north and south sides of Agadir, respectively. All of these individuals live a typical urban lifestyle characterized by a relatively dense human population, frequent traffic and the presence of industrial activities. The rural group consisted of villagers sampled from two sites, Ighrem and Boutroch, located 26 miles apart and 80 miles south of Agadir. Both villages are characterized by a traditional lifestyle based on agriculture and herding, but the villagers in Boutroch are more isolated and have very limited exposure to urban activities relative to the villagers in Ighrem. Obtaining samples from males from either village was challenging, and most of the males make occasional, or in some cases frequent, trips to neighboring cities. Boutroch is known to be a predominantly Amazigh village and is in the low Atlas mountains (latitude, 29.346; longitude, −9.368; altitude, 1335 m), whereas Ighrem is located in the foothills of the low Atlas mountains (latitude, 29.459; longitude, −9.672; altitude, 720 m) and is historically Arab with a small fraction of Amazigh residents; self-report confirmed these ancestry differences.

All study participants were between the ages of 18 and 50 yr, and the mean age of the three locations was similar (31–34 yr). The effect of age on gene expression was minimal; only 30 probes were significant at 1% FDR by ANCOVA with location and gender as fixed effects.

Collection protocol.

The study was approved by the ethical review committees of the Moroccan Ministry of Health, North Carolina State University and the University of Queensland. Under informed consent, 284 peripheral blood samples were collected in the field; 215 and 209 of these samples were profiled for gene expression and genotype, respectively, but several were later discarded for quality control purposes (see below). The subjects reported that they were in good health at the time of sampling. Peripheral blood samples (∼8 ml) were collected over the course of 6 d during the months of June and July 2008. The same collection protocol was followed for all samples to minimize heterogeneity due to technical reasons. All samples were collected within 4 h between 8:00 and 12:00. The total leukocyte population was isolated from ∼6 ml, and within minutes its total RNA was stabilized by using a Leukolock Total RNA Isolation System⁵ (Ambion). This system incorporates depletion filter technology to isolate leukocytes and to eliminate plasma, platelets and red blood cells and uses RNALater® to stabilize the RNA in the cells captured in the filter. The remaining blood was stored in EDTA tubes for DNA extraction. The filters and blood samples were kept on ice and then frozen at −45 °C within hours of collection at all study sites.

RNA and DNA preparation.

Total RNA extraction, and cDNA and cRNA synthesis were performed with an Illumina TotalPrep RNA Amplification kit (Ambion) in accordance with the manufacturer's instructions. Total RNA samples were checked for quality with an RNA 6000 Nano LabChip kit and 2100 Bioanalyzer (Agilent). We retained 215 samples with high RNA quality (RNA integrity number > 8) for expression profiling. We extracted 209 DNA samples with a QIAamp DNA kit (Qiagen) and quantified them by using an ND-1000 instrument (NanoDrop Technologies). All DNA samples had 260/280 and 260/230 ratios of optical density within the range 1.70–2.05.

Gene expression profiling.

HumanHT-12 beadchips (Illumina) were used to generate expression profiles of >48,000 transcripts by using 500 ng of labeled cRNA for each of the 208 samples in accordance with the manufacturer's recommended protocols. The order in which the samples were processed was randomized to minimize chip effects. The beadchips were hybridized and scanned with an Illumina BeadArray reader by K.S.'s laboratory at the Duke University Institute for Genome Sciences and Policy (IGSP). The raw intensities were extracted with the Gene Expression Module in BeadStudio software (Illumina). Expression intensities were log₂-transformed and median-centered by subtracting the median value of each array from each intensity value. This procedure preserves the variance of each sample, and inspection of the residuals indicated that they were reasonably distributed for ANOVA; in addition, an outlier filtering procedure provided further quality control. The top 22,300 transcripts with expression above background levels averaged across all of the arrays were retained for further analyses as described³. All array data have been submitted to GEO according to MIAME compliance guidelines and are available under accession number GSE17065.
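The log₂ transformation and median centering described above can be sketched in a few lines. The function name and the intensity values are hypothetical. The example shows that two arrays with the same profile but different overall brightness become identical after centering, and that the within-array variance is unchanged.

```python
import math
from statistics import median, pvariance


def log2_median_center(intensities):
    """Log2-transform one array's raw intensities and subtract the array median."""
    logged = [math.log2(v) for v in intensities]
    centre = median(logged)
    return [v - centre for v in logged]


# Hypothetical raw intensities for two arrays (not data from the study).
array_a = [120.0, 480.0, 960.0, 240.0, 1920.0]
array_b = [60.0, 240.0, 480.0, 120.0, 960.0]  # same profile, half as bright

centred_a = log2_median_center(array_a)
centred_b = log2_median_center(array_b)

# Subtracting a constant from every value leaves the variance untouched.
variance_before = pvariance([math.log2(v) for v in array_a])
variance_after = pvariance(centred_a)
```

Because centering only shifts each array by a constant on the log scale, it removes array-wide intensity differences without rescaling, in contrast to normalization schemes that also equalize the spread of each array.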

Genome-wide genotyping.

We assayed 209 samples with Infinium Human 610-Quad beadchips (Illumina) by following standard procedures, also at the Duke University IGSP. The Human 610-Quad SNP Chip contains over 610,000 markers based on HapMap release 23. The beadchips were imaged by using a BeadArray Reader (Illumina), and genotype calls were extracted with the Genotyping Module in BeadStudio software. Six samples with low intensity or a low call rate as assessed by the Illumina cluster measure (<95%) were removed, and all SNPs that had a call frequency of <99% were deleted. SNPs with a cluster separation value of <0.3 were checked manually, and those that could not be fixed manually were removed. Next, to screen for departure from Hardy-Weinberg equilibrium, we checked the quality of the raw and normalized data of autosomal SNPs with heterozygosity excess values from −1.0 to −0.1 and from 0.1 to 1.0, and any SNP cluster that was not clean was removed. The process of quality control checks resulted in retention of 579,144 SNPs in 203 individuals for the population structure analysis; this value was reduced to 516,972 for the association studies after removing SNPs with a minor allele frequency of <0.05.
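The call-rate and allele-frequency filters described above can be sketched as a sequence of simple thresholds. This is a simplified illustration under stated assumptions: genotypes are coded 0, 1 or 2 with `None` for a no-call, the 95% sample cutoff is treated as a plain call rate, and the manual cluster inspection and heterozygosity checks are not modeled. The function name and toy matrix are hypothetical.

```python
def qc_filter(genotypes, sample_min=0.95, snp_min=0.99, maf_min=0.05):
    """Apply sample call-rate, SNP call-frequency and MAF filters in that order.

    genotypes[i][j] is the 0/1/2 dosage for sample i at SNP j, or None for a no-call.
    Returns (kept sample indices, kept SNP indices).
    """
    n_snps = len(genotypes[0])
    samples = [
        i for i, row in enumerate(genotypes)
        if sum(g is not None for g in row) / n_snps >= sample_min
    ]
    snps = []
    for j in range(n_snps):
        calls = [genotypes[i][j] for i in samples if genotypes[i][j] is not None]
        if len(calls) / len(samples) < snp_min:
            continue  # too many no-calls among retained samples
        freq = sum(calls) / (2 * len(calls))
        if min(freq, 1 - freq) >= maf_min:
            snps.append(j)
    return samples, snps


# Hypothetical matrix: five samples by four SNPs.
toy_genotypes = [
    [0, 0, 2, 1],
    [1, 0, 2, 0],
    [2, 0, 1, 0],
    [1, 0, 2, None],     # one no-call
    [None, None, 1, 1],  # two no-calls
]
strict = qc_filter(toy_genotypes)
relaxed = qc_filter(toy_genotypes, sample_min=0.7)
```

Filtering samples before SNPs matters: a SNP that looks poorly called only because of one low-quality sample is retained once that sample has been removed, as the difference between the strict and relaxed results for the last SNP shows.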

Population structure, ancestry inference and F_ST.

Principal component analysis (PCA) and a Bayesian approach were implemented in Eigenstrat⁶ and Structure⁹, respectively, to explore genetic structure among the samples. Relatedness between all pairs of individuals was estimated indirectly from identity-by-state measures using PLINK⁴¹, and 65 of the individuals appeared to be related by virtue of having pi-hat scores of >0.125. We observed 15 pairs or triplets of full siblings (0.451 < pi-hat < 0.595, a range similar to that described for full siblings⁴²), six clusters of lesser relatives (0.125 < pi-hat < 0.3) and four mixed clusters of 4–5 relatives of both types. By these criteria, 138 individuals did not appear to be related to any other individuals in the sample, and were combined with one randomly chosen member from each of the 25 clusters to result in 163 unrelated individuals for the population structure analysis. PCA was used to infer the extent of global genotypic variation in this set, retaining the first seven eigenvectors according to the Tracy-Widom test statistic. Close inspection of axes 3–7 indicated that they were dominated by a few SNPs that mapped to the same region of the genome (data available from the authors on request). The sub-Saharan contribution to PC1 was established by including matching genotypes for 21 Yoruban HapMap individuals (provided by J. Akey and S. Biswas, University of Washington) in an expanded analysis. Structure⁹ was used to infer population structure with a subset of 16,000 autosomal SNPs (randomly selected and approximately uniformly distributed on the 22 autosomes) at k = 2–5 using the admixture model with correlated allele frequencies and 20,000 iterations after a burn-in length of 20,000.
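The selection of unrelated individuals described above can be sketched as follows. The sketch assumes that clusters of relatives are the connected groups formed by linking every pair with pi-hat > 0.125; the text does not state how clusters were delineated, so this is an assumption, and the identifiers and pi-hat values are hypothetical.

```python
import random

def relatedness_clusters(ids, pi_hat, threshold=0.125):
    """Group individuals into connected clusters linked by pi-hat > threshold."""
    neighbours = {i: set() for i in ids}
    for (a, b), value in pi_hat.items():
        if value > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters, seen = [], set()
    for start in ids:
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:  # depth-first walk over one cluster
            node = stack.pop()
            members.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(sorted(members))
    return clusters

def pick_unrelated(clusters, rng):
    """Keep one randomly chosen member per cluster (singletons keep themselves)."""
    return sorted(rng.choice(cluster) for cluster in clusters)

# Hypothetical pairwise pi-hat estimates.
ids = ["A", "B", "C", "D", "E"]
pi_hat = {("A", "B"): 0.50, ("B", "C"): 0.20, ("D", "E"): 0.05}
clusters = relatedness_clusters(ids, pi_hat)
print(clusters)  # [['A', 'B', 'C'], ['D'], ['E']]
unrelated = pick_unrelated(clusters, random.Random(0))
print(len(unrelated))  # 3
```

Applied to the counts in the text, 138 singletons plus one member from each of 25 clusters gives the 163 unrelated individuals used for the population structure analysis.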

Subsequently, relatedness was recalculated more formally²⁷ for all individual pairs by using Â_ij averaged over l = 1 to n loci:

where x_il = 0, 1 or 2 according to whether individual i has genotype aa, Aa or AA at locus l, p (q) is the allele frequency of A (a), and 2p is the mean of x_l.
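As an illustration of these definitions, the sketch below computes a per-locus scaled cross-product averaged over loci, Â_ij = (1/n) Σ_l (x_il − 2p_l)(x_jl − 2p_l) / (2 p_l q_l). This is a commonly used genomic relationship estimator that is consistent with the symbols defined here, but treating it as the exact expression used in the study is an assumption. Allele frequencies are estimated from the sample itself, monomorphic loci are skipped, and only pairs of distinct individuals (i ≠ j) are considered; the genotype matrix is hypothetical.

```python
def allele_frequencies(genotypes):
    """genotypes[i][l] is the count (0, 1 or 2) of allele A; returns p_l per locus."""
    n_individuals = len(genotypes)
    n_loci = len(genotypes[0])
    return [sum(row[l] for row in genotypes) / (2 * n_individuals)
            for l in range(n_loci)]

def relatedness(genotypes, i, j, p):
    """Average over loci of (x_il - 2p_l)(x_jl - 2p_l) / (2 p_l q_l), for i != j."""
    total, used = 0.0, 0
    for l, p_l in enumerate(p):
        q_l = 1.0 - p_l
        if p_l == 0.0 or q_l == 0.0:
            continue  # a monomorphic locus carries no information
        total += ((genotypes[i][l] - 2 * p_l) * (genotypes[j][l] - 2 * p_l)
                  / (2 * p_l * q_l))
        used += 1
    if used == 0:
        raise ValueError("no polymorphic loci available")
    return total / used

# Hypothetical genotypes: 4 individuals x 2 loci.
genotypes = [[2, 0], [2, 0], [0, 2], [0, 2]]
p = allele_frequencies(genotypes)
print(p)                                 # [0.5, 0.5]
print(relatedness(genotypes, 0, 1, p))   # 2.0
print(relatedness(genotypes, 0, 2, p))   # -2.0
```

The extreme values arise only because the toy example uses two loci with opposite homozygotes; with hundreds of thousands of SNPs the estimate for unrelated pairs is expected to be close to zero.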

F_ST estimates between locations were calculated for each of the 516,972 SNPs included in the association study by using PROC ALLELE in SAS version 9.2 (SAS Institute). This implementation uses the method of moments approach in an ANOVA framework and expected mean squares to estimate F_ST. The method assumes 'random' (in contrast to 'fixed') populations and accounts for common evolutionary history. Gene-specific F_ST estimates were calculated by averaging F_ST measures of all SNPs in each gene and in flanking 5′ and 3′ UTR regions. Plots of F_ST by SNP and gene show typical upper values of 0.08, 0.10 and 0.12 for comparisons of Agadir with Ighrem, Boutroch with Ighrem, and Agadir with Boutroch, respectively (Supplementary Fig. 6a). A few SNPs exceed these values, the maximum being 0.3: no fixed differences between the locations were observed. To test for a possible influence of divergence in allele and genotype frequencies on gene expression divergence between locations, we examined the correlation between F_ST and fold change in expression, or significance of differential expression for each pair-wise comparison. There was no relationship between these measures (P values for all correlations > 0.047, percentage variance explained < 0.1%), nor was there an excess of outliers with high F_ST and high expression divergence (Supplementary Fig. 6b). Genetic differentiation thus does not significantly contribute to the location effects.

Principal variance component analyses, ANOVA and ANCOVA.

Principal variance component analyses were performed on gene expression data by using JMP Genomics v3.2 (SAS Institute). Expression principal components (ePCs) were modeled as a function of various effects, assuming that each is a random term. A series of models was used to partition variance components into different combinations of the following factors and their pair-wise combinations: location (or lifestyle), gender and gPC2 (the second principal component of the genotypic variance, corresponding to the Arab-Amazigh axis of diversity). The magnitude and significance of differential expression of individual transcripts were evaluated by ANOVA and analysis of covariance (ANCOVA) through JMP Genomics using PROC MIXED as implemented in SAS and incorporating an outlier removal algorithm with a 5% false positive rate criterion. The following ANOVA models were used for differential expression analysis:

and gPC2 was added as a covariate for ANCOVA. Location (Agadir, Ighrem or Boutroch), lifestyle (urban or rural) and gender (male or female) were considered fixed effects. The error ε was assumed to be normally distributed with mean zero.

A marked feature of the PCA of the total data set is the presence of such a strong correlation structure in the data that ePC1 explains 21% and ePC1–ePC5 combined explain 50% of the transcriptional variance. In addition, almost half (47.6%) of the variation captured by ePC1–ePC5 can be decomposed into effects of the Arab-Amazigh axis of variation (gPC2), location, gender, and pair-wise interactions among these factors (Fig. 3c). This analysis is described in detail in ref. 43. It is substantially in agreement with the gene-specific ANOVA, which revealed similar magnitudes of contribution of the various effects. Taken together, the two modes of analysis imply that genetic and non-genetic effects both contribute significantly to transcriptional variation in our human data set. In addition, to evaluate possible environmental effects on alternative splicing, we fitted a mixed model for each gene targeted by more than one probe in the array and found evidence for 245 transcriptome-wide significant (P < 1.2 × 10⁻⁵) location-specific differences in transcript isoform abundance (Supplementary Note).

The absence of a relationship between transcript size (and GC content) and significance of differential expression (Supplementary Fig. 12) shows that there is no tendency for shorter transcripts to be differentially expressed between locations or lifestyles, indicating that enrichment for short transcripts such as the SNORD gene family is not due to degradation or technical artifacts.

Clustering and functional enrichment annotation.

Clustering was generated with Ward's method in JMP Genomics v3.2. The gene ontology and pathway analyses were generated through the use of Panther⁴⁴ and KEGG⁴⁵. Genes whose expression was significantly differentially regulated were included by using stringent cutoffs as described in the Results. Enrichment analysis was used to calculate the probability that the number of genes in each biological function, pathway and/or disease assigned to that data set was greater or less than expected by chance given the numbers of genes expressed in the samples. Corrections for multiple testing were achieved using Bonferroni or Benjamini-Hochberg methods depending on the analysis.
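The enrichment calculation and the two multiple-testing corrections mentioned above can be sketched in a few lines. The hypergeometric upper tail used here is an assumed stand-in for the test applied by the annotation tools, which the text does not specify, and all counts and P values are hypothetical. The probability of under-representation would be obtained analogously from the lower tail.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k): k or more annotated genes among n selected, given K of N annotated."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1)) / total

def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical example: 10 expressed genes, 4 in a pathway, 3 selected, 2 of them in the pathway.
print(round(hypergeom_upper_tail(2, 4, 3, 10), 4))  # 0.3333

p_values = [0.01, 0.04, 0.03, 0.005]
print([round(p, 3) for p in bonferroni(p_values)])          # [0.04, 0.16, 0.12, 0.02]
print([round(p, 3) for p in benjamini_hochberg(p_values)])  # [0.02, 0.04, 0.04, 0.02]
```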

Genome-wide association tests.

Tests for association of gene expression levels with each genotype were performed by both ANOVA (to test for genotype effects irrespective of allelic trends) and regression (to test for a linear trend, where heterozygotes are intermediate in phenotype owing to additive allelic effects) as implemented in PROC MIXED with SNP as a class variable or continuous variable, respectively, using SAS 9.2 and JMP Genomics v3.2. First, the whole allelic data set was coded as 0, 1 or 2, where each number represents the number of copies of the minor allele. Each of 516,972 SNPs was tested for association with each of the 22,300 expressed transcripts. This analysis gave rise to a genome-wide Bonferroni threshold of 4 × 10⁻¹² for trans associations (negative log₁₀ P (NLP) > 11.4, which is likely to be conservative given the linkage disequilibrium (LD) structure across the genome) and, assuming that 200 common SNPs are within 100 kb of each transcript probe, a threshold of 0.05/(22300 × 200) = 1 × 10⁻⁸ for cis associations (this value is also likely to be conservative because the median number of linked SNPs is <100). Note that a small fraction of putative cis eSNPs are more distant from the transcription start site than 50 kb on either side. We pragmatically distinguished cis from trans effects by plotting the eSNP and probe coordinates for each chromosome. Only three associations on the same chromosome were clearly off the diagonal; the remainder were within 1% of the chromosome arm length of the target probe and operationally likely to be cis-acting. The 1% FDR threshold was estimated by using the relationship FDR = m × alpha/(number of positives at alpha), where m is the total number of comparisons. Assuming 10⁶ independent cis tests and 2 × 10⁹ independent trans tests allowing for LD, approximate 1% FDR thresholds were found with 600 and 20 associations, respectively, at P < 6 × 10⁻⁶ and P < 10⁻¹⁰. Although the complex dependency structure of the genotype and expression data cautions against too literal an interpretation of these numbers, similar relative numbers of the two types of association are obtained with different assumptions about non-independence of the tests.
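The threshold arithmetic in this paragraph can be checked directly. The short sketch below uses only numbers quoted in the text (516,972 SNPs retained after quality control, 22,300 transcripts, 200 assumed cis SNPs per probe, and the stated numbers of independent tests and positives); the variable and function names are hypothetical.

```python
import math

n_snps = 516_972          # SNPs retained for the association study
n_transcripts = 22_300    # expressed transcripts
alpha = 0.05

# Bonferroni thresholds for trans (all SNP-transcript pairs) and cis (200 SNPs per probe).
trans_threshold = alpha / (n_snps * n_transcripts)
cis_threshold = alpha / (n_transcripts * 200)

def fdr(m, alpha_level, positives):
    """FDR = m x alpha / (number of positives at alpha)."""
    return m * alpha_level / positives

print(f"{trans_threshold:.1e}")                # 4.3e-12
print(f"{-math.log10(trans_threshold):.1f}")   # 11.4
print(f"{cis_threshold:.1e}")                  # 1.1e-08
print(f"{fdr(1e6, 6e-6, 600):.3f}")            # 0.010 (cis)
print(f"{fdr(2e9, 1e-10, 20):.3f}")            # 0.010 (trans)
```

The computed values round to the 4 × 10⁻¹² and 1 × 10⁻⁸ thresholds reported above, and both FDR calculations return the stated 1%.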

Tests of association were carried out with three models. First, we used the following basic correlation model, where μ is the mean measure of transcript abundance and the error ε is assumed to be normally distributed with a mean of zero:

The 10,000 most significant associations from this model were brought forward for two further analyses. Model 2 assessed the effects of location (Agadir, Ighrem or Boutroch) and gender (male or female):

We also accounted for location, ancestry, relatedness and gender in a third model:

where gPC1–3 correspond to genotypic principal component eigenvectors of axes 1, 2 and 3 computed with Eigenstrat; and gCluster represents clustered ancestry, where the 194 samples were clustered into four groups corresponding largely to Agadir Arabs, Ighrem Arabs, Boutroch Amazighs and admixed individuals from Agadir and Ighrem, which accounts for location in an unbiased manner relative to ancestry. Relatedness was fitted as a random effect. Considerable overlap was observed between our set of GWAS-significant hits and highly significant eSNP associations reported in four other expression GWASs on peripheral blood or its derivatives, depending on the stringency adopted (Supplementary Note).
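For illustration, the simplest form of the additive test in this section, in which transcript abundance is regressed on a single SNP coded as 0, 1 or 2 copies of the minor allele, can be sketched as an ordinary least-squares fit. This is not the PROC MIXED implementation used in the study: it omits location, gender, ancestry and relatedness terms, it reports a t statistic rather than a P value, and the data and function name are hypothetical.

```python
import math

def additive_regression(genotypes, expression):
    """Least-squares fit of expression = intercept + slope * genotype; returns slope, intercept, t."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    rss = sum((e - intercept - slope * g) ** 2 for g, e in zip(genotypes, expression))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    t_stat = slope / standard_error if standard_error > 0 else math.inf
    return slope, intercept, t_stat

# Hypothetical data: six individuals, log2 expression rising by about one unit per allele copy.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope, intercept, t_stat = additive_regression(genotypes, expression)
print(round(slope, 3), round(intercept, 3))  # 1.0 1.1
```

Treating the SNP as a class variable instead, as in the ANOVA form of the test, would estimate a separate mean for each genotype and so would also detect non-additive patterns.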

Accession codes.

NCBI GEO: Gene expression data from this study have been deposited under series GSE17065.

References

1. Abegunde, D.O., Mathers, C.D., Adam, T., Ortegon, M. & Strong, K. The burden and costs of chronic diseases in low-income and middle-income countries. Lancet 370, 1929–1938 (2007).
2. Cookson, W., Liang, L., Abecasis, G., Moffatt, M. & Lathrop, M. Mapping complex disease traits with global gene expression. Nat. Rev. Genet. 10, 184–194 (2009).
3. Idaghdour, Y., Storey, J.D., Jadallah, S.J. & Gibson, G. A genome-wide gene expression signature of environmental geography in leukocytes of Moroccan Amazighs. PLoS Genet. 4, e52 (2008).
4. Arredi, B. et al. A predominantly Neolithic origin for Y-chromosomal DNA variation in North Africa. Am. J. Hum. Genet. 75, 338–345 (2004).
5. Feezor, R.J. et al. Whole blood and leukocyte RNA isolation for gene expression analyses. Physiol. Genomics 19, 247–254 (2004).
6. Price, A.L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
7. Fellay, J. et al. A whole-genome association study of major determinants for host control of HIV-1. Science 317, 944–947 (2007).
8. Biswas, S., Scheinfeldt, L.B. & Akey, J.M. Genome-wide insights into the patterns and determinants of fine-scale population structure in humans. Am. J. Hum. Genet. 84, 641–650 (2009).
9. Pritchard, J.K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
10. Kéfi, R., Stevanovitch, A., Bouzaid, E. & Béraud-Colomb, E. Diversité mitochondriale de la population de Taforalt (12.000 ans bp - Maroc): Une approche génétique à l'étude du peuplement de l'Afrique du nord. Anthropologie 43, 1–11 (2005).
11. Coudray, C. et al. Population genetic data of 15 tetrameric short tandem repeats (STRs) in Berbers from Morocco. Forensic Sci. Int. 167, 81–86 (2007).
12. Ennafaa, H. et al. Mitochondrial DNA haplogroup H structure in North Africa. BMC Genet. 10, 8 (2009).
13. Bosch, E. et al. Population history of North Africa: evidence from classical genetic markers. Hum. Biol. 69, 295–311 (1997).
14. Wolfinger, R.D. et al. Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol. 8, 625–637 (2001).
15. Royo, H. & Cavaillé, J. Non-coding RNAs in imprinted gene clusters. Biol. Cell 100, 149–166 (2008).
16. Dixon, A.L. et al. A genome-wide association study of global gene expression. Nat. Genet. 39, 1202–1207 (2007).
17. Stranger, B.E. et al. Population genomics of human gene expression. Nat. Genet. 39, 1217–1224 (2007).
18. Cheung, V.G. et al. Mapping determinants of human gene expression by regional and genome-wide association. Nature 437, 1365–1369 (2005).
19. Storey, J.D. et al. Gene-expression variation within and among human populations. Am. J. Hum. Genet. 80, 502–509 (2007).
20. Kao, C.F., Chen, S.Y. & Lee, Y.H. Activation of RNA polymerase I transcription by hepatitis C virus core protein. J. Biomed. Sci. 11, 72–94 (2004).
21. Ruggero, D. & Pandolfi, P.P. Does the ribosome translate cancer? Nat. Rev. Cancer 3, 179–192 (2003).
22. Shah, S.V., Baliga, R., Rajapurkar, M. & Fonseca, V.A. Oxidants in chronic kidney disease. J. Am. Soc. Nephrol. 18, 16–28 (2007).
23. Göring, H.H. et al. Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nat. Genet. 39, 1208–1216 (2007).
24. Heinzen, E.L. et al. Tissue-specific genetic control of splicing: implications for the study of complex traits. PLoS Biol. 6, e1000001 (2008).
25. Emilsson, V. et al. Genetics of gene expression and its effect on disease. Nature 452, 423–428 (2008).
26. Heap, G.A. et al. Complex nature of SNP genotype effects on gene expression in primary human leucocytes. BMC Med. Genomics 2, 1 (2009).
27. Hayes, B.J., Visscher, P.M. & Goddard, M.E. Increased accuracy of artificial selection by using the realized relationship matrix. Genet. Res. 91, 47–60 (2009).
28. Schadt, E.E. et al. Mapping the genetic architecture of gene expression in human liver. PLoS Biol. 6, e107 (2008).
29. Chen, L.S., Emmert-Streib, F. & Storey, J.D. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biol. 8, R219 (2007).
30. Rockman, M.V. Reverse engineering the genotype-phenotype map with natural genetic variation. Nature 456, 738–744 (2008).
31. Menzel, S. et al. A QTL influencing F cell production maps to a gene encoding a zinc-finger protein on chromosome 2p15. Nat. Genet. 39, 1197–1199 (2007).
32. Sankaran, V.G. et al. Developmental and species-divergent globin switching are driven by BCL11A. Nature 460, 1093–1097 (2009).
33. Sankaran, V.G. et al. Human fetal hemoglobin expression is regulated by the developmental stage-specific repressor BCL11A. Science 322, 1839–1842 (2008).
34. Melzer, D. et al. A genome-wide association study identifies protein quantitative trait loci (pQTLs). PLoS Genet. 4, e1000072 (2008).
35. Paré, G. et al. Novel association of ABO histo-blood group antigen with soluble ICAM-1: results of a genome-wide association study of 6,578 women. PLoS Genet. 4, e1000118 (2008).
36. Culverhouse, R., Suarez, B., Lin, J. & Reich, T. A perspective on epistasis: limits of models displaying no main effect. Am. J. Hum. Genet. 70, 461–471 (2002).
37. Visscher, P.M. Sizing up human height variation. Nat. Genet. 40, 489–490 (2008).
38. Soranzo, N. et al. Meta-analysis of genome-wide scans for human adult stature identifies novel Loci and associations with measures of skeletal frame size. PLoS Genet. 5, e1000445 (2009).
39. Gibson, G. Decanalization and the origin of complex disease. Nat. Rev. Genet. 10, 134–140 (2009).
40. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B. 57, 289–300 (1995).
41. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
42. Visscher, P.M. et al. Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2, e41 (2006).
43. Idaghdour, Y. Genetic and Environmental Components of Human Leukocyte Gene Expression Variation in Morocco. PhD thesis, North Carolina State Univ. (2009).
44. Thomas, P.D. et al. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 13, 2129–2141 (2003).
45. Okuda, S. et al. KEGG Atlas mapping for global analysis of metabolic pathways. Nucleic Acids Res. 36, W423–W426 (2008).

Acknowledgements

We thank all of the study participants in Agadir, Ighrem and Boutroch, and numerous individuals who facilitated sample collection, in particular the Idaghdour family. D. Ge and A. Motsinger-Reif provided timely computational support, and we also thank S. Biswas and J. Akey for providing HapMap genotypes. Funding for the study was provided by the University of Queensland. Y.I. was supported by a Fulbright Fellowship and G.G. by an ARC Australian Professorial Fellowship.

Author information

Authors and Affiliations

Department of Genetics, North Carolina State University, Raleigh, North Carolina, USA
Youssef Idaghdour & Greg Gibson
SAS Institute Inc., Cary, North Carolina, USA
Wendy Czika, Kelci Miclaus & Russell D Wolfinger
Institute for Genome Science and Policy, Duke University, Durham, North Carolina, USA
Kevin V Shianna & David B Goldstein
Queensland Institute of Medical Research, Brisbane, Queensland, Australia
Sang H Lee & Peter M Visscher
School of Biological Sciences, University of Queensland, Queensland, Australia
Hilary C Martin & Greg Gibson
HRH Prince Sultan International Foundation for Conservation and Development of Wildlife, Agadir, Morocco
Sami J Jadallah

Contributions

Y.I. collected the samples with the assistance of S.J.J. and processed them with K.V.S. and D.B.G.; K.M., S.H.L., D.B.G., P.M.V. and R.D.W. provided statistical and conceptual support for analysis of the data by Y.I., W.C., H.C.M. and G.G.; and Y.I. and G.G. conceived the study and wrote the paper. All authors read and contributed to the manuscript.

Corresponding author

Correspondence to Greg Gibson.

Ethics declarations

Competing interests

Coauthors Russell D Wolfinger, Wendy Czika and Kelci Miclaus are employees of SAS Institute Inc., which produces the commercial JMP Genomics software used in the analysis of the data.

About this article

Cite this article

Idaghdour, Y., Czika, W., Shianna, K. et al. Geographical genomics of human leukocyte gene expression variation in southern Morocco. Nat Genet 42, 62–67 (2010). https://doi.org/10.1038/ng.495

Received: 06 July 2009
Accepted: 13 October 2009
Published: 06 December 2009
Issue Date: January 2010
DOI: https://doi.org/10.1038/ng.495
