A scientometric review of genome-wide association studies

This scientometric review of genome-wide association studies (GWAS) from 2005 to 2018 (3639 studies; 3508 traits) reveals extraordinary increases in sample sizes, rates of discovery and traits studied. A longitudinal examination shows fluctuating ancestral diversity, still predominantly European Ancestry (88% in 2017) with 72% of discoveries from participants recruited from three countries (US, UK, Iceland). US agencies, primarily NIH, fund 85% and women are less often senior authors. We generate a unique GWAS H-Index and reveal a tight social network of prominent authors and frequently used data sets. We conclude with 10 evidence-based policy recommendations for scientists, research bodies, funders, and editors. Melinda Mills and Charles Rahal discuss genome-wide association studies published in the last 13 years, finding increases in sample sizes, rates of discovery, and traits studied over time. They discuss limitations, including sample diversity, and make recommendations for scientists and funding bodies.

Funding bodies and universities have likewise noted lower levels of women and ethnic minorities in senior biomedical positions and implemented policies to counteract these trends, but there are limited metrics for evaluation across the genomic landscape [7][8][9].
We first study participant demographics, sample sizes, ancestry, the geographic distribution of participant recruitment, the number and p values of genetic associations, journal diversity, and disease focus. We draw on over 13 years of GWAS discoveries (March 2005 to October 2018) from the NHGRI-EBI GWAS Catalog (hereafter, the Catalog) produced by the US National Human Genome Research Institute (NHGRI) in conjunction with the European Bioinformatics Institute (EBI) 10,11. We then link the Catalog to external PubMed and United Nations (UN) population data and manually curate the most frequently used data sets, which cover over 85% of all GWAS by cumulative sample size across approximately a third of all papers. We rank and map top funders by ancestry and disease, isolate key consortiums, analyze gender and authorship, create a unique GWAS H-Index and undertake a social network analysis of author centrality. This unique overview allows us to formulate 10 concrete evidence-based policy recommendations. Our accompanying Supplementary Methods and Supplementary Note 1 describe the methods and data used to produce the results and to dynamically pull in new data, which will regularly update our analyses, creating an open, live database over time.
Sample sizes, associations found, diseases studied, and journal diversity
Figure 1 shows the explosion in GWAS research since 2007. Although the first entry within the GWAS Catalog is dated 10 March 2005, only 10 entries were made in 2005 and 2006. A major breakthrough occurred in 2007, with a widely heralded paper published by the Wellcome Trust Case Control Consortium 12, later termed a masterwork of diplomacy owing to the aggregation of the data involved 6. As of 29 October 2018, the Catalog records 3639 individual research papers, which span 5849 unique Study Accessions (unique identifiers ascribed to studies of specific traits within a paper) across 3508 unique diseases/traits, which map to 2532 unique Mapped Experimental Factor Ontology traits. The average number of associations or hits per study is 15.3, with an average p value of 1.3729 × 10⁻⁶. Only 49,451 out of 89,588 (55.20%) reported associations meet the heralded p ≤ 5 × 10⁻⁸ threshold, with most of the remainder falling within or below the borderline level, and recent work suggests a possible relaxation of the current threshold 13. Nature Genetics has been the most frequent publisher over time, although in 2017, GWAS were most frequently published by Nature Communications. At the time of writing, the largest study in the Catalog contains 1,030,836 subjects.
Ancestral diversity, geographical concentration, and data sets used
Considerable attention has been paid to the disparities underlying the ancestral diversity of study participants for technical reasons such as population stratification 14, reduced linkage disequilibria 15, genetic diversity and admixture 16, cultural distrust and social misuses, and interpretations 17,18. Including diverse participants is crucial for understanding genetic heterogeneity in disease phenotypes and the creation of an equitable distribution of personalized medicine 19. There is also a limited portability of polygenic scores across populations, which we return to in our final discussion 20. Figure 2 visualizes a customized Broader Ancestral Category 21 field, which subsumes hundreds of combinations of seventeen different broad ancestral categories mapped to seven unique broader categories. Our results (when dropping rows of the Catalog that contain any unrecorded ancestries) concur with existing estimates 21,22, showing that on aggregate, ancestry in genetic discovery has been highly unequal and dominated by participants of European ancestry (86.03% discovery, 76.69% replication, 83.19% combined). Other prominently studied ancestries are Asian (9.92% discovery, 17.97% replication, 12.37% combined), African American or Afro-Caribbean (1.96% discovery, 1.96% replication, 1.96% combined), Hispanic or Latin American (1.30% discovery, 1.33% replication, 1.30% combined), Other or Mixed (0.48% discovery, 1.77% replication, 0.87% combined) and African (0.31% discovery, 0.28% replication, 0.30% combined) ancestry.
Table 1 shows that the percent per annum of European ancestry samples fluctuates considerably and has been as high as 90.76% in 2016 and as low as 71.98% in 2012. In 2008, not a single study utilized participants of African ancestry. By partitioning the data into discovery and replication samples, we show that the percent of European ancestry samples used for initial discovery is substantially higher than for replication, and that samples of Asian ancestry make up a considerably higher share of replications than for initial discovery.
A regular expression-based exercise to extract information from the free text related to discovery and replication sample descriptions identifies 212 and 150 unique terms, respectively, for classifying participants in terms of their race, region, country, ethnicity, or ancestry. These range from the most common term, European, to hybrid terms such as Caucasian Eastern Mediterranean, along with multiple other examples of polyvocality. Our accompanying replication material provides a more empirically transparent and rigorous evidence base compared with previous research, which reported that around a fifth of papers use classification schemes in logically ambiguous ways 23 and estimated that there were up to 26 terms to describe participants of African ancestry 22.
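The kind of regular expression-based extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the reported counts: the pattern, the function names `extract_ancestry_terms` and `tally_terms`, and the example descriptions are hypothetical, and the Catalog's real free-text fields are considerably less regular.

```python
import re
from collections import Counter

# Hypothetical pattern: a count (optional thousands separators), a free-text
# descriptor, an optional "ancestry" keyword, then a role word.
SAMPLE_PATTERN = re.compile(
    r"(?P<n>\d[\d,]*)\s+(?P<term>[A-Za-z][A-Za-z \-]*?)\s+"
    r"(?:ancestry\s+)?(?P<role>cases|controls|individuals)\b"
)


def extract_ancestry_terms(description):
    """Return (term, count) pairs found in one free-text sample description."""
    pairs = []
    for match in SAMPLE_PATTERN.finditer(description):
        count = int(match.group("n").replace(",", ""))
        pairs.append((match.group("term").strip(), count))
    return pairs


def tally_terms(descriptions):
    """Sum participant counts per descriptor across many descriptions."""
    totals = Counter()
    for description in descriptions:
        for term, count in extract_ancestry_terms(description):
            totals[term] += count
    return totals


# Invented example strings, for illustration only.
examples = [
    "1,200 European ancestry cases, 3,400 European ancestry controls",
    "150 Caucasian Eastern Mediterranean individuals",
]
totals = tally_terms(examples)
```

Counting the distinct keys of such a tally is one way to arrive at a list of unique descriptors; any real application would need additional normalization (spelling variants, abbreviations, nested descriptions) that this sketch omits.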
This decomposition of the free text field also allows us to examine categorizations of Native or Indigenous populations. These groups have had a particularly complex relationship with genomics research, but have also revealed some key genetic  25 , this number increases to 0.022%. Uniquely, we also provide the first systematic breakdown of recruitment of GWAS subjects by examining the Country of Recruitment field 21 provided by the Catalog for studies where only a single country was recruited from (Fig. 3). We show that 71.80% of participants are recruited from only three countries: the US, UK, and Iceland. Although participants from the United States are most frequently the basis for the largest number of studies (41.01% of all studies), the United Kingdom dominates in terms of the number of participants (40.50% of all participants) analyzed. Conversely, although only 1.13% of recorded studies involve Icelandic participants, the small Icelandic population (around 334,000) represents 11.52% of all participants contributed to GWAS research. In terms of the ratio of the number of observations contributed by a country relative to the population of the country 26, Iceland is by far the largest (19.13), followed by the United Kingdom (0.32). Note that owing to the way in which data on recruitment from multiple countries are curated, these numbers can only be used to compare between countries, rather than in absolute terms. This result is predominantly driven by data from deCODE genetics, a major biotech company founded in 1996 in Reykjavík, Iceland. Aggregating to the continental level, Table 2 illustrates a similar but distinct global picture of genomic research: European countries contribute 58.54% of recruited participants and North America a further 19.99% (29.09% and 42.57% of all studies, respectively).
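The observations-to-population ratio discussed above is a simple per-capita calculation, sketched below. The function name `participants_per_capita`, the country labels, and all numbers in the example are placeholders invented for illustration; they are not values from the Catalog or from the UN population data.

```python
def participants_per_capita(participants, population):
    """Map each country to recruited observations divided by its population.

    Countries without a usable population figure are skipped. The result is
    ordered from the highest ratio to the lowest.
    """
    ratios = {}
    for country, n_observations in participants.items():
        pop = population.get(country)
        if not pop:  # missing or zero population: ratio is undefined
            continue
        ratios[country] = n_observations / pop
    return dict(sorted(ratios.items(), key=lambda item: item[1], reverse=True))


# Placeholder inputs (hypothetical countries and counts).
participants = {"Country A": 500_000, "Country B": 2_000_000, "Country C": 10}
population = {"Country A": 250_000, "Country B": 40_000_000}

ratios = participants_per_capita(participants, population)
```

A ratio above 1, as in the placeholder "Country A", can arise when observations are summed across studies, presumably because the same individuals contribute to more than one study. This is consistent with the caution that the figures are suited to comparisons between countries rather than to absolute interpretation.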
We manually extracted a list of the most frequently used data sets (sometimes referred to as cohorts) across the majority of the largest 1250 GWAS as of 29 August 2018, with the objective of providing the first systematic estimate of the frequency and identification of data sources used in GWAS (Table 3). The most frequently used data sets have several key distinguishing features 27. First, echoing our geographic analysis, frequently used data are from industrialized countries (Netherlands, US, UK, Ireland, Germany, Iceland), which share similar rates of disease prevalence and population profiles. Second, most engaged in random probability or population sampling to gain as representative a sample as possible, something that is not characteristic of emerging large data sets such as the healthy, older and higher socioeconomic status participants in the UK Biobank 28 or direct-to-consumer genetic data. Third, they are cohorts that are deeply and richly phenotyped across many diseases, future-proofing them for multiple needs. Fourth, many are older populations with disease diagnosis aimed at unraveling the pathways to disease and disability in old age. In this respect, they miss the longer-term development of disease and intervention possibilities that an asymptomatic younger population might afford (except for the 1958 British Birth Cohort or additional data collection in cohorts such as the FHS). Fifth, they are all prospective longitudinal data sets, following individuals or birth cohorts over a longer period, thus facilitating a life-course approach to understanding the pathways to certain diseases, disability, and mortality. Sixth, all but one of these cohorts is composed of predominantly female participants (ranging from 48 to 100%). This sex ratio imbalance is rarely addressed, yet sexual dimorphism or sex differences in disease are highly relevant 29,30.
Finally, although many started as focused hypothesis-driven clinical samples to study one type of disease, most have expanded to contain a breadth of phenotypes and document a trend of adding new samples or generations over time.
GWAS researchers: impact, networks, and gender bias
In total, we estimate that there have been 122,141 authorship contributions made by 39,893 unique authors. GWAS meta-analysis has traditionally involved a collaboration of many authors contributing a data set or expertise, with 33.71 authors on average per paper returned from the PubMed database. The highest number of authors on one paper is 559, who collaborated on a study of type 2 diabetes and metabolic traits 31. Table 4 shows the 10 authors with the highest score in our newly derived GWAS H-Index (Supplementary Methods), which goes beyond a standard H-Index to estimate the importance, significance and impact of a scientist's cumulative GWAS-related research contributions (the replication material outlined in Supplementary Note 1 provides a full ranking of all authors who have been involved in more than one GWAS and have more than 10 citations). These key authors share several striking traits. Many (Stefánsson, Thorsteinsdóttir, and Thorleifsson) are from deCODE Genetics, pioneers in terms of large sample size, detailed genetic and medical information and the development of new statistical tools. The upper realms of the table also feature key academics at the center of prominent data sets such as Uitterlinden, Hofman, van Duijn, and Rivadeneira, who are key investigators of The Rotterdam Study and Generation R Study. In a recent Nature article describing hyperprolific authors, Uitterlinden provides a candid explanation of his authorship. In addition to working long hours, he attributes his success to the richness of the phenotypes and diseases available in the data at his disposal. Regarding his high number of co-authorships, he argues that it is not problematic, but rather reflects the sheer magnitude of the network and effort required to achieve these types of scientific discoveries (Supp Mat) 32.
A third group of authors are individuals who have led multiple key consortiums (e.g., CHARGE) focused on prominent traits such as obesity, type 2 diabetes and cardiovascular disease. Their high GWAS H-Index comes in part from their ability to contribute the same data sets to examinations of multiple traits and renewed rounds of study on the same trait which incorporate larger and larger sample sizes. Nine of the top 10 researchers are based at European institutions (and Albert Hofman was at the Erasmus Medical Center, Netherlands until 2016).
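For orientation, the standard H-Index that the GWAS H-Index extends can be computed as in the sketch below. It implements only the conventional definition (the largest h such that an author has h papers with at least h citations each); the GWAS-specific variant is defined in the Supplementary Methods and is not reproduced here. The function name `h_index` and the citation counts are illustrative assumptions.

```python
def h_index(citations):
    """Conventional H-Index: largest h with h papers cited at least h times."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Invented citation counts for one hypothetical author.
example_citations = [10, 8, 5, 4, 3]
example_h = h_index(example_citations)
```

Because the index rewards many moderately cited papers, contributing the same data set to successive studies of multiple traits would tend to raise it, which is in line with the pattern described for consortium leaders above.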
We also examined the most frequently returned Consortiums (termed Collectives in the PubMed database). Of all unique PubMed IDs queried, 844 refer to at least one consortium, with an estimated total of 1654 contributions from 681 unique consortia. The top five consortiums ordered by the number of (cleaned and harmonized) returns are: Wellcome Trust Case Control Consortium (49 returns), CHARGE (46), Wellcome Trust Case Control Consortium 2 (36), the LifeLines Cohort Study (30), and DIAGRAM (29).
In Table 4, only two of the 10 senior authors are female, leading us to explore different aspects of gender imbalance. A growing number of studies have flagged gender imbalance in scientific publications and funding 8,33,34 . We estimate that men contribute 63.03% of all authorships and represent 59.62% of all unique authors. This allows us to naively infer that men contribute more papers on average (per author) than women. These results are best examined in the context of recent work 35,36 based on the entire JSTOR corpus, which estimates that 27.27% of academic authorships between 1990 and 2011 are on aggregate female. This figure increases to 29.3% when filtering for authorships in the field of Molecular and Cell Biology (and to 32.4% for the specific subdiscipline of Human Genomics). Our estimate of 36.97% is higher than these figures, and even more so when compared with the historical average of women undertaking research in Molecular and Cell Biology (20.7% between 1665 and 1989).
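The naive inference about papers per author follows from dividing each group's share of authorships by its share of unique authors. The sketch below works through that arithmetic using the two percentages quoted above; it assumes, as a simplification, that authorships and authors are partitioned into exactly two groups, and the function name `authorships_per_author_ratio` is hypothetical.

```python
def authorships_per_author_ratio(share_of_authorships, share_of_authors):
    """Ratio of mean authorships per author: one group versus its complement.

    Both arguments are fractions strictly between 0 and 1 for the first group;
    the complement group is assumed to hold the remaining shares.
    """
    per_author_group = share_of_authorships / share_of_authors
    per_author_rest = (1 - share_of_authorships) / (1 - share_of_authors)
    return per_author_group / per_author_rest


# Shares quoted in the text: men contribute 63.03% of authorships
# and represent 59.62% of unique authors.
ratio = authorships_per_author_ratio(0.6303, 0.5962)
print(f"{ratio:.2f}")  # prints 1.15
```

Under these assumptions, men have roughly 15% more authorships per author than women. The total numbers of authorships and authors cancel out of the ratio, so it depends only on the two shares.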
We build on work showing the historical under-representation of women in the first and last authorship positions 36-40. Our analysis shows that 44.04% of the authors in the first author or junior position are female: substantially higher than the all positions estimate. This decreases to just 29.66% for authorships in the senior last author position: substantially lower than the all positions estimate (albeit still higher than other estimates spanning 1990-2011 in the Human Genome subdiscipline 36 or for first authors of commentaries in Nature (20.0% in 2016) 33). This is potentially owing to a historic gender imbalance in educational attainment in scientific fields, with fewer women obtaining doctorates in the past than today. We found similar average GWAS H-Indexes for female (4.85) and male (5.34) authors and compared the average number of papers published by females (6.15) and males (7.17).
Most of the funding acknowledgments are to US agencies (85.11%) and primarily relate to programs funded by the NIH (apart from the Public Health Service). This is followed by the UK (14.37% of total), with a high number of acknowledgments not just to the MRC, but also to the Wellcome Trust (3.73%) and Cancer Research UK (1.23% of total). This contrasts with other returned countries including: Canada (0.36%), 'International' (0.14%), Austria (0.01%), and Italy (0.01%).
Table 3: The top 10 most frequently utilized cohorts across the majority of the largest third of all GWAS studies as of 29 August 2018 (with studies ranked by the number of times they are involved in a GWAS), manually extracted and harmonized. Additional fields (country of recruitment, age range, and study design) manually curated from web searches. * denotes originally 30-62 years, ** denotes variation by country, *** denotes full sample, including non-genotyped participants.
We also summarize the broad ancestral patterns and the distribution across broad disease categories studied when tabulated across various funding agencies in Fig. 4. The NIH Revitalization Act of 1993 (Subtitle B, Part I) 41 implemented a policy regarding the inclusion of minorities as subjects in clinical research (where a minority is defined as a readily identifiable subset of the US population that is distinguishable by racial or cultural heritage) 42. The Medical Research Council (the largest UK funder) has no similar requirement, although one fund that forms part of the UK's Wellcome Trust solicits proposals that promote diversity and inclusion and engage people and communities who are affected by social and economic disadvantage 43. An important feature of the figure is the comparatively lower ratio of European to non-European ancestries in NIH-funded research in comparison with UK-funded research, which is not legislated to diversify participants. In terms of traits, we see the expected clustering around terms corresponding to the missions of each respective funder. For example, the most frequently funded term from the National Cancer Institute (NCI) is Cancer.

Future directions
Recommendation One: prioritize the inclusion of multiple types of diversity. These findings lead us to 10 evidence-based policy recommendations. Recommendation One is that researchers, editors, funders, and commercial companies prioritize the inclusion of multiple types of diversity in data, namely: ancestral, geographical, environmental, temporal and demographic, and recognize the impact that this lack of diversity has on research findings. First, ancestral diversity needs to increase beyond the replication phase to include more non-European ancestry populations. Significantly extending previous comparisons 22, we show that diversity levels fluctuated markedly. Following the full release of the UK Biobank and increased reliance on large direct-to-consumer data, we predict that diversity in GWAS ancestry may decrease even further, given that 94.23% of the 488,377 genotyped UK Biobank participants are in the white ethnic group 44 and 23andMe has a sample with 77% European ancestry 45. The benefits of increased ancestral diversity are multiple: GWAS that utilize data from diverse populations will provide more accurately targeted therapeutic treatments to more of the world's population, extend insights into the architecture of traits and uncover rare variants with significant effect sizes, which replicate across ancestries. Isolated populations (owing to bottleneck events, genetic drift, adaptation, and selection) are of importance owing to higher frequencies of rare variants, which increase the power to detect associations with clinically important phenotypes 46. Discovery is often boosted in populations with high rates of homozygosity such as those with a tradition of consanguineous marriage. A recent study of exomes of British Pakistani adults with high parental relatedness, for instance, discovered rare-variant homozygous genotypes that predicted "knockouts" (loss of gene function) in hundreds of genes 47.
Although the focus has primarily been on increasing ancestral diversity, we also call for an expansion of both geographical and environmental diversity. Although approximately 76.2% of the current world population reside in Asia or Africa 48, we estimate that 72% of genetic discoveries emanate from participants recruited from only three countries (US, UK, Iceland). By examining only genotype-phenotype associations, GWAS have largely ignored the fact that complex traits have a strong geographical component involving genetic predisposition and environmental exposure. There is little reflection on how environmental variation or Gene-Environment (G×E) interaction impacts results or even shapes the traits that are prioritized for research 49. The US, UK, and Iceland have distinct histories and social systems that have fundamentally shaped exposure to certain disease factors or traits. Those predisposed to obesity, for instance, face radically different environmental stimuli in the US than in other nations. Or, those with a higher genetic predisposition to skin cancer would have their risk exacerbated if they resided in areas with higher sunlight exposure. GWAS regularly combine data sets from vastly different countries and historical periods with little recognition of the consequences, implicitly assuming the impact of genetic loci on traits is universal across time and place. A recent study shows that for complex traits, a large proportion of genetic effects are hidden or watered-down when disparate data across different countries and historical periods are combined 50.
We also advocate an increased temporal diversity of individuals across different birth cohorts, historical periods and life-course stages. We estimate that the most frequently used data sets are disproportionately populated by older individuals, yet the prevalence and measurement of disease vary considerably with age. There is only a moderate positive correlation between midlife and old-age measures for body mass index, glucose, and systolic blood pressure, for instance, which all increase with age 51. Samples of older individuals also suffer from mortality selection and exclude a non-random subset of the population 52. This issue is compounded by healthy volunteer selection and participants with a high socioeconomic status, both of which occur disproportionately in prominent large data sets such as the UK Biobank 28. Finally, we call for more discussion related to the gender diversity of GWAS participants, particularly regarding specific diseases, as there is growing evidence of sexual dimorphism in traits linked to obesity 29, reproduction 30,53, and others.
Recommendation Two: monitoring with funding consequences. We recommend moving beyond policy formation regarding diversity or gaps in research to intensive monitoring with consequences for funding. Our scientometric approach, which links funders, researchers, and grant IDs to ancestral and geographical coverage, provides a cost-effective first step toward transparent monitoring in this direction, with the potential to expand and locate knowledge gaps in research into certain clinical traits.
Recommendation Three: careful interpretation of genetic differences. European ancestry-based polygenic scores derived from GWAS explain only half as much of the variability in the phenotype for non-Hispanic Black samples as compared with non-Hispanic Whites 20,54 and many cancer associations fail to replicate in other populations 55. There is a danger that the inability to apply polygenic scores from European ancestry studies to other groups is misinterpreted to reflect biological differences between different ethnic or racial groups. This risk of misinterpretation was carefully discussed, for instance, in a recent GWAS of educational attainment 56. Genetic variation needs to be distinguished from the social, cultural, and political meanings ascribed to different human groups 57,58. Race is not a biological category, as genetic variation is traced to geographical locations and does not map onto our perpetually evolving and socially defined racial or ethnic groups. Dictionary-based exercises herein have revealed categorizations that often combined geographical, migration, and ancestral background. Populations are the product of repeated mixtures over tens of thousands of years 20. Although we use the dominant broad ancestral categories common in the field, by noting these issues we recognize that a more sophisticated categorization scheme is required.
Recommendation Four: local participant and researcher involvement. Previous research has noted a lack of local participant and researcher involvement when collecting genetic material in underrepresented communities 57,59. There are encouraging endeavors to increase genotyping outside of North America and Europe such as the African Genome Variation Project 60. Many projects that collect non-European samples have funding from large research bodies such as the NIH or Wellcome Trust, granted primarily to researchers working in those countries. The danger, however, is that helicopter science (collecting and then exporting genetic data) may compound existing inequalities, with participants and researchers from those countries not being the main beneficiaries. African researchers have recently noted that many have accepted restrictive terms offered by foreign partners owing to a lack of resources to handle large genomic data sets 61. We recommend the inclusion of meaningful local intellectual contributions and, if required (in addition to data collection), the supply of training, computational resources, and infrastructure development to enable local scientists to build the capacity to work independently.
Recommendation Five: action to reduce inequalities in authorship and investigators. We estimate that women author on average fewer GWAS papers, have fewer citations than men, are more frequently junior first authors and less frequently senior authors. The latter observation is remarkably similar to NIH figures, where women constitute only 30% of principal investigators on grants 62. This suggests a relationship between acting as a senior author and functioning as a PI on grants and may contribute to women's lower peer review scores on funding panels 8. The NIH has established initiatives such as the Women in Biomedical Careers Working Group and the 2017 Next Generation Research Initiative. Policies such as these, which target early career researchers, are more likely to reach this goal since these groups are more often ethnically diverse and populated by a higher percentage of women 9. Female researchers themselves need to be cognizant of these disparities, as should those who conduct research appraisals and funding reviews.
We were unable to control for maternity or care leaves, which may have a role in productivity and serving as a PI, particularly in some European countries where women may take up to 1 year leave 63 . This echoes recent findings that women had a lower longevity in funding, witnessed by a lower likelihood to renew projects, lower submission rates, and lower funding per year 8 . Women face distinct work-life reconciliation issues and may require additional mentoring and support to encourage them to submit and renew applications or serve as a PI. Increased gender diversity in science may also lead to fundamentally new discoveries. That can have real clinical consequences: consider for instance that symptoms of cardiac arrest in women were ignored and misdiagnosed for decades. This has been attributed to the notion that coronary disease was considered a male only health concern, largely studied in male subjects by male scientists.
Recommendation Six: reform incentive structures that intertwine the role of authorship, data ownership, and data sharing. GWAS demand collaboration through the formation of large consortiums, resulting in multiple authorships. As illustrated (Fig. 1 and Fig. 2), large samples are required owing to the relatively small effect sizes, with the number of detected associations typically increasing with sample size. Central authors within the GWAS network are the holders of large longitudinal data sets or those who lead large consortiums, with many top GWAS scientists classified as hyperprolific 32. We reinforce the necessity of conventions related to author transparency in contributions, such as via the Vancouver Regulations, which describe the contributions of individual authors 32. With hundreds of authors, full transparency and reporting remain a challenge. A related suggestion could be to distinguish between authors and contributors who provide data. Another could be to provide data producers with a 6-12-month grace period before making data publicly available to similarly interested researchers. This, however, has the potential to generate its own incentive-based anomalies and pressures.
These solutions, however, do not align with current incentive and reward structures. When the PI and participating researchers are evaluated, it occurs at the individual level. In the UK's national Research Excellence Framework (which ranks departments and institutions according to research excellence), for instance, authorship is a key return. To remove individuals from GWAS authorship demands a broader discussion of incentive systems applicable to data generators. Some observers argue that the authorships of scientists who obtained the funding, designed the study, supervised staff and students, and often supervise data collection and analyses should be removed. Yet, without such labor-intensive endeavors, GWAS would not exist. We also call for the careful application of research metrics such as the H-Index, particularly when comparing scientists and academics across scientific disciplines. As a leading GWAS author and holder of one of the most used GWAS data sets carefully warns: "…for comparing these authorships across different scientific disciplines (biomedical and beyond) I think we should revisit this issue with a critical appraisal to create a better understanding among fellow scientists". (p. 104 Supp Mat) 32 .
Recommendation Seven: create digital object identifiers (DOIs) for data sets and enforce ORCID iDs for authors. An implicit part of this, related to Recommendation Six, is the invitation to publish Data Resource style articles, which generate DOIs for each data source to reward data collection. Surprisingly, our manual curation of data sets revealed a striking lack of transparency and inconsistency in describing the basic data source or additional sample restrictions utilized in many papers. Even in the most eminent journals, descriptions of data were cryptic and sources unclear or untraceable, raising issues of transparency and reproducibility of research. The opening of publicly funded databases has enabled this review to take place, and newly emerging Application Programming Interfaces represent just one small part of the sweeping advancements. However, the implementation of DOIs for common data sets, and the encouraged use of ORCID iDs for authors (in the same way that PubMed IDs identify papers and EFO terms represent experimental variables), would enable better scientometrics and a more accurate reflection of genomic science.
Recommendation Eight: coordinated governance from multiple stakeholders. There have been repeated calls to remove barriers and increase trans-border cooperation, such as UNESCO's reiteration that it is a human right to benefit from shared scientific advancements 64. There are striking differences in national regulations for data sharing and a patchwork of Institutional Review Board (IRB) positions. International models of genomic data sharing do exist, such as those pioneered by the International Cancer Genome Consortium. A recent evaluation of genomics data sharing across multiple countries reveals complexity, contradiction, and confusion 64. Data transfer to third countries outside of China, for instance, is prohibitive owing to overlapping and complex data regulations. The US has a fragmented data protection regime with oversight across IRBs and data access committees 65. Europe's recent General Data Protection Regulation (GDPR) brought new restrictions related to the transfer of data across borders, complicated by additional unique country- and institution-specific interpretations 66. An international genomics group could create a more transparent code of conduct and shape the interpretation of GDPR's rules. Closely related to this is the further development of regulatory protection and data sharing across borders in relation to cloud-based storage providers. Those who store the data are dependent on cloud providers who often shift data across geographical locations with limited notification or oversight 67.
Recommendation Nine: enforce the sharing of GWAS summary results. Just as data can serve as a valuable commodity, so can summary results. Although such sharing is a requirement of many major journals, it remains a policy gray area and summary results are regularly not released, even after directly contacting authors. Others share only when co-authorship is granted. An effective deterrent could be the threat of retraction of the article unless summary results are shared, or prohibiting applications for, or the granting of, future funding until past discoveries are made publicly available.
Recommendation Ten: utilize influence for the good of more people. Our last recommendation highlights the fact that data sharing, ethics, and transparency are frequently discussed with the implicit assumption that funders, ethics boards, and universities are the only bodies with the power to govern this ecosystem. But what if researchers do not need funding or operate outside of universities and their incentive systems? Direct-to-consumer companies such as 23andMe and biomedical companies, many of which hold the largest genomic data sets, are growing and often fall outside the regulations of funders or universities. By virtue of their position, their data sharing and release of results often follow different rules than those of publicly funded data sets. Some impose the restricted release of GWAS summary statistics (i.e., the information that is used by other researchers to create polygenic scores and additional analyses). Considering the recent sales of blocks of direct-to-consumer data to pharmaceutical companies 68, scientific collaboration also has the potential to be restricted. Although commercial genomics companies generally operate with different demands and incentive structures, most still require external validation of their results published in top scientific journals, placing editors and journals in a key position of power. We conclude thus by calling upon all parties in the genomics ecosystem to utilize their influence for the good of more people as part of the ongoing genomic revolution.
Conclusions. Our systematic scientometric review of genomic discovery quantifies multiple known and unknown assumptions about this domain. We observe considerable fluctuation in the ancestral diversity of participants over time. By ranking the most frequently used data sets, we also went beyond ancestral diversity to show other types of selectivity. We mapped the geographical recruitment of GWAS participants and core funders by ancestry and disease coverage, explored gender disparities in authorship and provided evidence of a tightly knit social network of researchers and consortiums. A central finding was that our results once again emphasized the potential for a cycle of disadvantage for underrepresented communities and despite continued efforts, infusing diversity into genomics remains challenging.