Reporting of race in genome and exome sequencing studies of cancer: a scoping review of the literature

Purpose: Minorities are often underrepresented in clinical cancer research yet the frequency of reporting of race in genomic sequencing studies of cancer is unknown. This scoping review determines the rate at which race is reported as a demographic variable, the factors associated with reporting of race, and the participation rates of minority populations.
Methods: PubMed was systematically searched from 1 January 2010 through 15 November 2018 and 11,014 studies were assessed for eligibility. Publications reporting genome or exome sequencing data for patients with one of the ten most common cancers in the United States were included.
Results: A total of 231 publications containing sequencing data from 15,721 unique patients met inclusion criteria. Race was reported in 37% of studies compared with 84% of studies reporting age and 85% reporting gender. Reporting of race was associated with cohort size, sequencing method, familial cancer, cancers with disparities, and reporting of age and gender. Minority populations were significantly underpowered to detect recurrent pathogenic variants in most cancers.
Conclusion: Race is underreported as a demographic variable in genomic sequencing studies of cancer. Substantially increased efforts are needed to sequence patients from underrepresented populations to reduce health disparities in patients of non-European ancestry.


INTRODUCTION
Genome sequencing (GS) and exome sequencing (ES) have transformed the clinical ability to identify pathogenic variants in cancer. As these next-generation sequencing (NGS) technologies have become more cost-effective and ubiquitous, the volume of genetic sequencing data in the literature has increased exponentially.
One aspect frequently overlooked in cancer NGS studies is the racial composition of the patient cohort. 1,2 The cancer burden in the United States disproportionately affects minorities due to numerous factors including access to care, socioeconomic status, and genetics. The impact of ancestry-related genetic variation on cancer incidence and mortality disparities in minority populations has been documented for breast, lung, prostate, colorectal, melanoma, and kidney cancer. 3,4 The National Institutes of Health (NIH), the Cancer Moonshot Initiative, and major cancer research organizations advocate for diversity in clinical cancer research, 5 yet minority populations are frequently underrepresented in clinical trials 6,7 and genome-wide association studies. 8,9 Despite the abundance of clinical sequencing studies performed in the past decade, the accrual of minorities and reporting of race in NGS studies of cancer remain unknown.
We have performed a scoping review of the literature to comprehensively describe the representation of minority populations in publications containing GS/ES sequencing data for the ten most common cancers in the United States. We chose to perform a scoping review to identify the available evidence, clarify concepts, and identify gaps in knowledge concerning race reporting and minority inclusion in NGS cancer studies that are contributing to disparities in clinical cancer research. Scoping reviews aim to identify characteristics of studies to examine how research is conducted at an overview level but do not assess risk of bias as in a systematic review. 10 The objectives of this scoping review were to measure the frequency of race reporting as a demographic variable in NGS cancer studies, quantify the rate of minority participation in these study cohorts, and explore factors associated with race reporting.

MATERIALS AND METHODS
This review was conducted using the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR). 11

Search strategy
PubMed was systematically searched for studies published between 1 January 2010 and 15 November 2018 using search strategies designed by an experienced medical science librarian (E.M.B.S.) and stated in the Supplemental Appendix. Additional sources were identified through screening of The Cancer Genome Atlas (TCGA) publications and related studies.

Study selection
The studies included in this review performed GS or ES on tissue from patients diagnosed with one of the ten most common cancers in the United States at sufficient depth to identify rare somatic or germline variants. Low-pass GS studies were excluded. All studies had senior authors from a US institution and a clinical cohort that included at least one patient from the US for whom sequencing data had not previously been reported. Because GS or ES of patient tissue was required, a total of 399 publications performing only targeted gene panel sequencing were excluded. Sequencing data from cell lines, xenografts, single cells, circulating tumor DNA, and nonhuman subjects were excluded. Case reports, review articles, meeting abstracts, and pancancer and multicancer studies were also excluded. A.N. reviewed studies for inclusion in November 2018 and consulted with J.T.N. and L.J.R.S. to resolve uncertainties. The study selection process was performed according to the PRISMA-ScR guidelines and additional details are provided in the Supplemental Appendix.

Data extraction
Data extraction was performed independently by two of four investigators (A.N., K.R.C., L.L.T., J.T.N.) for each publication. Discrepancies were resolved through communication between the reviewers. The extracted data included relevant bibliographic details (study title, first author, year of publication, journal, and journal impact factor), demographic characteristics of the patient population (race, ethnicity, age, and gender), and study details (cancer type, cohort size, sequencing technique, NIH funding, data availability, and associated clinical study). Race and ethnicity were defined by NIH notice NOT-OD-15-089 and encompassed five racial categories (White, Black or African American, Asian, American Indian or Alaska Native [AI/AN], and Native Hawaiian or Other Pacific Islander [PI]) and two ethnic categories: (1) Hispanic or Latino and (2) non-Hispanic or non-Latino. TCGA and Therapeutically Applicable Research to Generate Effective Treatments (TARGET) studies were considered to include race, ethnicity, age, and gender even when not explicitly stated, as these data are available in online databases referenced in the publications. For the quantification of minority populations, patient data were retrieved from included publications and the Genomic Data Commons (GDC) Data Portal accessed on 18 January 2019. Although not all patients in the GDC Data Portal were from publications included in this review, these patients were included in the quantitative analysis because their data are publicly available to researchers.

Statistical analyses
Statistical analyses were performed using StataSE 12.0 software (StataCorp LP, College Station, TX). Variables not specifically stated in the study were coded as missing. Individual study characteristics were compared between studies reporting or not reporting race using the Chi-square test for proportions and two-sided Mann-Whitney U test for medians. Logistic regression was performed to estimate adjusted odds ratios and 95% confidence intervals for the association between race reporting and study-level characteristics. Power calculations were performed using the www.tumorportal.org power calculator and previously defined cancer variant frequencies. 12,13
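As an illustration of the comparison of proportions described above, the following is a minimal Python sketch of a Pearson chi-square test on a 2x2 table (study characteristic present or absent versus race reported or not reported). It is not the Stata procedure used in the review: the function name chi_square_2x2 is hypothetical, no continuity correction is applied, and the example counts are invented for illustration rather than taken from Table 1.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Layout assumed here:
        a = characteristic present, race reported
        b = characteristic present, race not reported
        c = characteristic absent,  race reported
        d = characteristic absent,  race not reported
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one study")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts for illustration only; they are NOT values from Table 1.
stat, p = chi_square_2x2(30, 20, 55, 126)
print(f"chi-square = {stat:.2f}, P = {p:.4f}")
```

The closed-form P value relies on the table having a single degree of freedom; larger tables would need a general chi-square distribution function from a statistics library.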

RESULTS
Search results
A total of 11,014 studies were assessed for eligibility and 10,615 articles were excluded based on review of the title and abstract. An additional 168 full-text articles were excluded for reasons shown in the Supplemental Appendix. In all, 198 studies passed screening and were included in this review, in addition to 33 studies identified from other sources, totaling 231 publications.
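The selection counts above can be checked with simple arithmetic. The short Python sketch below does so under one assumption that the text does not state explicitly: that the 11,014 records assessed for eligibility already include the 33 studies identified from other sources. The variable names are illustrative.

```python
# Counts as stated in the search results paragraph.
assessed = 11_014
excluded_title_abstract = 10_615
excluded_full_text = 168
from_pubmed_screening = 198
from_other_sources = 33

# Records remaining after title and abstract review.
full_text_reviewed = assessed - excluded_title_abstract
# Publications remaining after full-text review.
included = full_text_reviewed - excluded_full_text

print(full_text_reviewed)                          # 399
print(included)                                    # 231
print(from_pubmed_screening + from_other_sources)  # 231
```

Under that assumption, both routes arrive at the same total of 231 included publications.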

Study characteristics
The 231 included publications contained sequencing data from 15,721 unique patients. Study characteristics are summarized in Table 1 and detailed descriptions are provided in Table S1. A mean of 23 studies were retrieved for each cancer type (range 11-32), 17 publications were from TCGA, 3 were from TARGET, and 41 reported data from patients enrolled in clinical studies. A total of 52 studies reported both GS/ES and targeted gene panel sequencing data, of which only GS/ES data were included in this analysis.

Factors affecting reporting of race
Publications reporting race were more likely to be familial studies (P = 0.001), have larger patient cohorts (P = 0.007), report gender (P < 0.001), report age (P = 0.002), and include GS (P = 0.02) (Table 1). Race was more likely to be reported in studies of cancers with a known ancestry-related genetic disparity (breast, lung, prostate, colorectal, kidney cancer, and melanoma) compared with those without (bladder cancer, non-Hodgkin lymphoma [NHL], thyroid cancer, and leukemia) (P = 0.005). NIH funding, journal impact factor, publication year, depositing of sequencing data in publicly available databases, and enrollment of patients in a clinical study were not significantly associated with reporting of race. In a multivariate logistic regression model, studies of familial cancers, studies of cancers with known genetic disparities, cohort size, inclusion of GS, and reporting of gender remained significantly associated with reporting of race (Table 1).
Table 1 note: Adjusted odds ratios and P values were estimated with use of a multivariate logistic regression model with reporting of race as the dependent variable. Values were adjusted for gender reporting, age reporting, familial versus sporadic disease, cohort size, cancers with known ancestral genetic disparities, inclusion of GS, patient enrollment in clinical studies, journal impact factor, NIH funding, publication year, and data availability.

Analysis of race in publications
Race and ethnicity were rarely discussed in the publications. Of 85 studies that reported race, 36 (42%) analyzed or commented on the role of race in the context of their findings. Only 18/85 (21%) publications included a description of how race was determined, with 9 studies using self-reported race, 2 using physician-reported race, and 7 performing ancestry analysis by single-nucleotide polymorphism array or ancestry-informative markers (AIMs).

Inclusion of minorities in sequencing studies
Race was provided for patients in study publications or the GDC Data Portal for 7790 of 15,721 (50%) patients (Fig. 1b). A total of 5042 (65%) patients for whom race was reported were from the 20 TCGA and TARGET studies included in this analysis. Of 85 publications that reported race, 24 (28%) reported patients from only one race (18 White, 6 Black). Black and Asian/PI patients comprised a greater percentage of sequencing study participants than of incident cancer patients for 6/10 and 7/10 cancers, respectively (Supplemental Appendix). AI/AN patients comprised a smaller percentage of sequencing study participants than of incident cancer patients for all cancer types.

Power to detect pathogenic variants by race
Of patients with race reported, 6373 (82%) were White, 1064 (14%) were Black, 316 (4%) were Asian/PI, 15 (0.2%) were AI/AN, and 22 (0.3%) were Other, similar to previous findings. 2,8 The number of patients who must be sequenced to achieve 90% power to detect a recurrent pathogenic variant present in at least 10% of patients was determined based on previously identified somatic pathogenic variant frequencies of individual cancers. 12,13 The total number of published genomes and exomes from White patients exceeded this minimum threshold for 9 of 10 cancer types (Fig. 1c). However, among Black patients, only breast and prostate cancer had a sufficient number of cases to achieve this power (Fig. 1d). Asian/PI and AI/AN populations did not achieve this power for any cancer type.
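To give a sense of how a minimum cohort size follows from a target power, the Python sketch below uses a deliberately simplified binomial model: it asks how many patients must be sequenced so that, with 90% probability, at least a chosen number of them carry a variant present in 10% of patients. This is not the model behind the www.tumorportal.org calculator, which is based on cancer-specific variant frequencies; the function names and the min_carriers threshold are assumptions made for illustration, so the numbers it produces are not comparable to the thresholds in Fig. 1.

```python
from math import comb


def detection_power(n, freq, min_carriers):
    """Probability that at least `min_carriers` of `n` patients carry a variant
    with population frequency `freq` (simple binomial model, an assumption)."""
    p_fewer = sum(
        comb(n, k) * freq**k * (1 - freq) ** (n - k) for k in range(min_carriers)
    )
    return 1 - p_fewer


def min_cohort_size(freq, power=0.90, min_carriers=1, max_n=100_000):
    """Smallest cohort size whose detection power reaches `power`."""
    if not 0 < freq <= 1 or not 0 < power < 1:
        raise ValueError("freq must be in (0, 1] and power in (0, 1)")
    for n in range(min_carriers, max_n + 1):
        if detection_power(n, freq, min_carriers) >= power:
            return n
    raise ValueError("target power not reached within max_n patients")


# Variant in 10% of patients, 90% power, at least one carrier observed.
print(min_cohort_size(0.10, power=0.90, min_carriers=1))  # 22
```

Requiring more carriers, or modeling a background mutation rate as dedicated calculators do, raises the required cohort size, which is why realistic thresholds are far larger than this toy figure.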

DISCUSSION
This scoping review and systematic analysis of genome and exome sequencing studies of the ten most common cancers in the United States found that race was significantly underreported as a demographic variable compared with age and gender. Previous analyses quantifying the inclusion of minorities in cancer NGS have been limited to TCGA 2 and 23 single-race studies in the Database of Genotypes and Phenotypes. 8 Here, we found that Black and Asian/PI, but not AI/AN, patients were included in sequencing studies at higher rates than incident patients for the majority of cancer types. However, the total number of minority patients with sequencing data remains significantly underpowered to detect pathogenic variants in all minority populations.
As the patient populations represented in research studies directly inform clinical decision-making and outcomes, substantially increased efforts are needed to sequence patients from minority populations to reduce health information disparities in patients of non-European ancestry. A more complete understanding of ancestral genetics has already yielded positive outcomes, such as beginning to explain why African American women have more aggressive triple negative breast cancer than Caucasian women, 14 why African Americans with renal cell carcinoma are less likely to respond to treatment, 15 and why children with acute lymphoblastic leukemia and >10% Native American ancestry are more likely to relapse. 16 On the other hand, ancestry bias in clinical databases has resulted in genomic testing that is less informative and more costly in non-European patients 17 and the failure to properly control for variants in minority populations has led to false positives and inaccurate conclusions of the genetic causes of cancer. 18,19
One limitation of this scoping review is that the included publications are restricted to GS/ES and therefore other types of massively parallel sequencing are not considered. In addition, some publications include international cohorts with patients from other countries in addition to US patients. Finally, the ten cancers included in this study do not capture the full variation of cancer types and disparities in the United States.
A major benefit from systematically identifying these studies is that the opportunity exists to retroactively determine the ancestry of individual patients using AIMs. AIMs are more accurate than self-reported race and enable fine-scale resolution of admixture. 20 Reanalysis of patients in these studies and inclusion of AIMs in future studies will enable deeper understanding of the contribution of ancestral genetics to identify population-specific subgroups, prognoses, drug responses, and treatment. Full characterization of these molecular subgroups will inform clinical decision-making and reduce racial disparities in cancer.
The role of ancestry-related genetic variation is an important yet understudied component of cancer genomic sequencing studies. Increasing minority participation and reporting in sequencing studies will help to define ancestry-related differences in the cancer genetic landscape to reduce the biological basis of racial disparities in cancer and improve clinical precision oncology in all patients.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.