Introduction

Biological evolution has been defined as “biological change through or over time” [1]. It is a continuous process that has given rise to contemporary populations. These have evolved from ancestral populations through adaptation and selection mechanisms [1], in which organisms with traits that confer a competitive advantage survive and pass those traits on to their offspring. The variations observed within contemporary human populations and between different ethnic subpopulations are therefore consequences of this process of evolution. Each ethnic subpopulation or group can be characterized by specific morphological, anatomical, and physiological characteristics (i.e., the phenotype) that are encoded within the genome (i.e., the genotype) of the subpopulation. These genetic polymorphisms, if advantageous, are transmitted from one generation to the next [2]. As many diseases in humans involve apparently aberrant physiological processes, there is an underlying genetic basis for most diseases. Although many inherited disorders are regarded as defects in contemporary populations, the underlying phenotypes were selected for in ancestral populations because they conferred specific advantages. For example, the iron overload disease hereditary hemochromatosis is believed to have conferred a selective advantage during times of famine. As the relative content of dietary iron has increased over history, this excess now causes cirrhosis in patients who are homozygous for a series of variants in the HFE gene on chromosome 6, in particular the C282Y variant [3].

The first human genome project was completed through a collaborative international effort [4]. The subsequent 1000 Genomes Project, an extension of the International HapMap Project, aimed to sequence the whole genomes of 1000 individuals from various populations [5]. The principal objective was to provide a comprehensive description of genetic variation in humans and its distribution throughout the genome, and to make these data publicly available. In 2016, Popejoy and Fullerton reported on the status of genome data available in the public domain after completing an extensive audit of databases, and discussed how the Arab genome was sadly underrepresented [6]. At 0.08% of the total amount of data, its representation is higher only than that of native populations [6]. This has to change if individuals of Arabian ancestry are to benefit from the new paradigm of precision medicine and the associated improvements in standards of healthcare.

Arab effort in sequencing the human genome

During the fossil fuel era, a number of countries in the Gulf Cooperation Council (GCC) have thrived and have built competencies and proficiencies in the financial sector, telecommunications, and aviation. In recent years, these countries have looked to innovate in other technology sectors, in anticipation that oil reserves will run out and as the world's energy needs shift to renewable sources. As these nations seek to diversify their economies and revenue bases, health projects are becoming more important.

Although the genome effort in this part of the world was largely ignored until recently, it is slowly but surely gaining momentum. A number of governments in the region have initiated projects. National genome efforts have commenced in Saudi Arabia [7], Qatar, and Kuwait [8]. These projects have mainly focused on understanding the unique genetic makeup of the citizens of the respective nations. Furthermore, the studies were designed primarily to identify or verify specific disease-associated genetic variants that are common in their populations. For example, the study of the native population of Kuwait was developed to identify deleterious polymorphisms associated with obesity and asthma [9].

Two conflicting points of view necessitate genome studies of the indigenous populations and ethnic subpopulations from Arab nations. Firstly, it has been reported that these populations are relatively homogeneous groups with a highly conserved gene pool [8]. This homogeneity has presumably arisen from a number of social factors, including the high rates of consanguineous marriages (especially between first cousins) [8], endogamous unions [8], the tribal structure of the society, and large family sizes. Contrasting this view is the suggestion that the Arab people are more varied than thought [10]. The Middle East lies at the crossroads of Africa, Asia, and Europe, and its present population has arisen through intercontinental migration between these three continents. Migration of Semitic tribes throughout the region, interaction with traders from the Far East along the Silk Road, and immigrants moving in and out of the African continent [11] have contributed to the diversity in the region. Other factors contributing to the increase in the contemporary diversity of the Arab population include the expansion and spread of Islam between the 7th and 14th centuries, the gradual transfer of power from the Ottoman Dynasty to European rule, the Crusades, and, more recently, migration patterns facilitated by mass air travel [12]. The history of the region suggests close social interactions between different ethnic groups.

These conflicting views provide a compelling reason to study the genomic structures of Arab populations. It is important to resolve this matter, and gathering data on the Arab genome will make it possible to establish the relationship between Arabs and other populations.

The lack of Arab genome representation

Although the various ethnic groups of the Arab world (e.g., Arabs, Persians, Armenians, Assyrians, Bakhtiyarians, the Baluch, Beja, Berber, Copts, Gilaki, Jews, Kurds, Lurs, Mazandaranis, Nubians, Talyshs, and Turks) share some characteristics, they have arisen from distinct demographic groups and geographical locations. Therefore, the genetic makeup of each population is potentially very different. This difference is evident in the divergent conclusions drawn from a series of genome-wide association studies (GWAS). For example, the rs5219 SNP of the KCNJ11 gene is known to be associated with the incidence of type 2 diabetes. Studies conducted on populations from Oman [13] and Saudi Arabia [14] showed a positive association with this variant, while weak or no associations were found in the Arab populations of Tunisia [15] and Morocco [16]. To understand the significance of these differences, genome studies for each specific ethnic group in the region are needed.

Unfortunately, despite the significant efforts put into compiling genome data of ethnic groups from around the globe, information on the genomes of populations from the nations of the Middle East remains underrepresented. There is increasing evidence that the genome structure of individuals of Arabian descent is different from that of the populations studied to date [12]. For instance, in its December 9, 2016 edition, the journal Science reported on the efforts of the Saudi Human Genome Project, which in the preceding 5 years had reported nearly 200 genes that contribute to human pathologies within the Saudi population [17]. The report further pointed out that the rate of inherited genetic diseases is nearly double the rate in Europe and the United States and 10 times higher for certain disorders [17]. In addition, despite the relatively similar genetic structure of the different Arab groups, differences can be found. Another study, by Garcia-Bertrand et al. (2014), used admixture analysis to establish the phylogenetic relationships and ancestral populations of different Arab groups, showing the uniqueness of different Middle Eastern populations, especially those in the Gulf region [18].

The National Arab Genome project in the UAE

In spite of these genome efforts in the Arabian region, the amount of available data remains low. As previously mentioned, only 0.08% of the information in the public domain is of Arabian origin [6]. Consequently, the National Arab Genome project planned for the United Arab Emirates (UAE) will specifically focus on the ethnic groups of the Gulf region by sampling Emirati citizens. The project is intended to create reference sequences and establish databases for developing customized molecular diagnostic assays and personalized medicine strategies for individuals of Arabian descent. From a pharmacogenomics standpoint, this should in turn help maximize drug efficacy and identify the optimal dose for each patient.

The contemporary population structure of the UAE is diverse, a result of the high percentage of expatriates (80–90%, source: CIA World Factbook) residing in the country alongside a relatively small but diverse indigenous population admixed with immigrants from Yemen, Oman, North Africa, Iran, Baluchistan, India, and other neighboring regions [19]. Of the approximately 11 million residents in the UAE, around 10% are Emirati citizens. The proposed study aims to sequence the genomes of Emirati citizens using next-generation sequencing technology and to analyze the data using advanced bioinformatics tools. The project is a collaborative effort between academia (Khalifa University of Science and Technology) and clinicians in local hospitals. The medical and health records of the participants will be stored in a secure database. The project is intended to start as a population-based study and then shift to a healthcare or disease association study to identify predisposing pathogenic variants.

The expected outcomes of this study include:

  1.

    The creation of an Emirati genome reference that identifies point and segmental (indel) polymorphisms. These polymorphisms will be examined to assess possible associations with diseases that are common in Arabian populations (e.g., diabetes and cardiovascular disease). Variations in gene copy number will also be examined.

  2.

    The identification of novel and rare or low-frequency genetic variations (defined as those with minor allele frequencies of <1% and 1–5%, respectively) in the indigenous Emirati population. These variations will be compared with published findings. Comparative analyses may provide clues to the underlying factors behind the molecular associations and cellular mechanisms of the genetic disorders within the particular ethnic groups of Arabian descent in the UAE.

  3.

    The identification of unique associations is expected to give rise to customized DNA-based assays that screen for genetic polymorphisms. Such assays may assist in the early detection of disease, allowing earlier intervention, and may improve diagnostic applications such as histocompatibility matching for transplantation.

  4.

    The data collected will also provide opportunities to study the UAE population in detail. The data could potentially be used as a genome panel that will serve as a reference for future imputation studies.
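
The frequency thresholds given in the second outcome can be illustrated with a minimal sketch. It assumes a biallelic site for which genotype counts are already available. The function names and example counts are hypothetical and are not taken from the project, and the handling of frequencies falling exactly on a boundary (1% or 5%) is an assumption made here. Novelty, which depends on comparison with published variants rather than on frequency, is not addressed.

```python
def minor_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Minor allele frequency at a biallelic site, from genotype counts."""
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    if total_alleles == 0:
        raise ValueError("no genotyped individuals")
    # Each heterozygote carries one alternate allele, each alt homozygote two.
    alt_freq = (n_het + 2 * n_hom_alt) / total_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def classify_variant(maf):
    """Bin a variant using the thresholds stated in the text.

    Assumption: a MAF of exactly 1% or exactly 5% is treated as low-frequency.
    """
    if maf < 0.01:
        return "rare"
    if maf <= 0.05:
        return "low-frequency"
    return "common"


if __name__ == "__main__":
    # Hypothetical genotype counts for three sites in 1000 individuals.
    for counts in [(990, 10, 0), (950, 48, 2), (900, 90, 10)]:
        maf = minor_allele_frequency(*counts)
        print(counts, round(maf, 4), classify_variant(maf))
```

Taking the minimum of the two allele frequencies keeps the classification independent of which allele happens to be labeled as the reference.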

The project was conceived to provide genome data for future healthcare applications in the UAE. The deliverables of this project will influence future medical practice in the UAE, which is expected to include a focus on correlating phenotype and genotype for many complex diseases that afflict the local population. It is hoped that future collaborations and training opportunities will develop between the UAE and other efforts in Arabia, as well as with the international research community. Improvement in the quality and quantity of genome data of Arab origin is also expected to improve our understanding of the relationships between genotype and phenotype in other ethnic groups around the world, since greater diversity provides more opportunities for comparison.

The data generated from the UAE National Genome Project are expected to present clinicians, specifically in the UAE and more broadly across the Middle East region, with detailed information that is relevant to the health status of their people. Consequently, improved diagnosis can be expected, and more informed decisions could be made. This project will provide the opportunity to improve our understanding of genetics, including a more in-depth appreciation of the pharmacogenomic factors that affect UAE and Arab patients. A patient’s genetic makeup will be used to determine optimal treatment doses and to predict the therapeutic response to certain drugs in the presence of the appropriate candidate loci [20]. By ensuring that drugs are used efficiently, it may be possible to reduce or even eliminate adverse side effects [20] and wastage, thus reducing the cost of healthcare in the United Arab Emirates.

Although the region has been largely ignored to date, the rise of next-generation sequencing technologies has provided an opportunity for smaller nations to join the genome effort. Cost and time are no longer the prohibitive factors they once were, and the National UAE Genome project is poised to deliver information not only to the local effort but also to the international community.