Health systems worldwide may collect data on people’s age and sex but, with a few exceptions, they do not ask patients about ethnicity or race. The reasons are many, ranging from historical abuses of data on race or religion in the early twentieth century to simple inertia. Yet countries that collect these data comprehensively have shown that COVID-19 has been characterized by wide inequalities, with some ethnic groups disproportionately affected by the direct and indirect effects of the pandemic.

Data collection

To researchers who study ethnicity and health, these inequalities came as no surprise1. It has long been recognized that people from certain groups experience substantial barriers to accessing health care, as well as disproportionate rates of disease and its consequences, compounding their many other disadvantages in sectors such as employment, education, and housing, themselves risk factors for poor health2,3. Therefore, from a health equity perspective, the case for measuring and understanding variation in health and its determinants in different ethnic groups is obvious. Effective planning of health, social care and public health services requires data on ethnicity to ensure that services are culturally appropriate, to allocate resources equitably, and to evaluate the impact of policy, with the ultimate aim of reducing variations in outcomes4.

A lack of such data is a particular problem in research, where participants are often unrepresentative of those who will ultimately receive interventions being evaluated. This has been apparent throughout the COVID-19 pandemic, with many clinical trials failing to record or report results by ethnicity5, in part because some countries prohibit the collection of data on ethnicity, citing concerns about privacy6.

Ethnicity is a complex concept, incorporating notions of race (the classification of people based on physical appearance), religion, and culture, especially for individuals born into families with multiple heritages. When asked their ethnicity, a patient’s first instinct may be to question why they are being asked. Their answer may differ according to their previous experience, especially if they have several possible identities, as is increasingly the case, and where they perceive that answering the question may place them at risk of discrimination or even violence, as described, for example, among Roma in central Europe7. There are also particular challenges in international comparisons, as context influences the response. For example, it may be quite different to be a Gujarati in India, Kenya, or the UK. Even where such data are collected, the quality may be variable. It has been suggested recently that inconsistencies in data on race (which is related to, but distinct from, ethnicity) in the USA delayed identification of groups at greatest need during the COVID-19 pandemic8.

The experience in countries that collect ethnicity data shows its importance, with many examples from the UK, which is unique in that data on ethnicity are collected routinely in most interactions with public authorities. The National Health Service (NHS) is legally required to collect these data. Some other countries instead limit collection to ad hoc research studies or, as in France, rely on proxy measures such as country of birth, which will miss second- and subsequent-generation immigrants.

Defining ethnic groups

The recording of ethnicity principally involves collecting data on an individual’s membership of an ethnic group, but related information on primary language, religion and country of birth may also be collected, depending on the country and healthcare sector. In the UK, information on ethnicity is collected across a wide range of routine electronic health record data sources, primarily captured from patients within NHS systems at the point of care. These data are predominantly self-reported by patients, and are recorded with or by healthcare practitioners.

Ethnic groups are socially constructed and distinct from genetic ancestry. Ethnicity is defined on the basis of a society’s norms, attitudes and expectations, rather than being a readily measurable biological variable (like blood pressure) where there is widespread agreement on measurement9,10. For this reason, the categorization of ethnic groups can differ over time and place. In the UK, the broad ethnic group Asian largely comprises people from the Indian subcontinent, whereas in the USA the term often implies people from East Asia. Similarly, classifications often evolve over time: the ‘one-drop’ rule, which defined a person as Black if they had any Black ancestry, was used for the purposes of segregation in the twentieth-century USA11. Terminology reflects social factors such as experiences of migration and broader historical processes.

Establishing meaningful ethnic groups to analyse health disparities is not a straightforward task. On the one hand, it is often preferable to study narrow ethnic groups lest important heterogeneity be masked10,12. On the other hand, some minority ethnic communities may be relatively small, which can prevent robust statistical analysis and raise concerns around maintaining confidentiality. It is increasingly appreciated that ethnicity intersects with other characteristics, such as gender, sex, or socioeconomic position13. There can be considerable value in adopting an intersectionality perspective, but this may again require a trade-off against studying more disaggregated ethnic groups. In the UK, a pragmatic classification of 18 ethnic categories has been chosen14, which, where available, provides standardized categories across government and healthcare settings, allowing for the monitoring of inequalities across health, policy and social care spheres.

Patient involvement

Although information on ethnicity is typically collected through patient self-report, minority ethnic communities have rarely been involved in the design, implementation, collection, and use of routine healthcare data on ethnicity15,16. The active involvement of minority groups in the various processes of collecting and utilizing ethnicity data should be prioritized as a way to build trust in communities that are often hesitant to provide data due to wrongful use or past abuse of gathered information. The quality and accuracy of ethnicity data and the evolving relevance of ethnic categories will likely be improved by community members validating or sense-checking ethnic group coding standards15.

Community members have been mobilized in some inclusive ethnicity data collection or design processes. For example, in the 2018 census in Colombia, the national statistical office undertook a wide-ranging consultative exercise with various indigenous and minority ethnic groups. This consultation resulted in revised question wording and ethnic group response options, to align with the consulted communities’ needs. Such public involvement should be undertaken by all organizations collecting routine ethnicity information in health and social care systems, who should also work with ethnic minorities on governance of data repositories and ownership of data17.

Boosting collection

One barrier to ethnicity data collection is healthcare professionals’ lack of knowledge about the importance and use of the data, alongside a reluctance to ask for ethnicity data and a lack of demonstrated need for their collection (Box 1; ref. 18). Experiences of providing ethnicity information within healthcare have been reported as acceptable, although in some studies, participants have expressed dissatisfaction about being asked to provide their ethnicity on repeat visits19. The healthcare provider also needs to explain clearly why the data are being collected and how they will be used19.

At the organizational level, a comparison of self-reported ethnicity and Hospital Episode Statistics (HES)-coded ethnicity in England found that misclassification varied only by a small amount between ethnic groups, but varied by a greater degree between hospitals (the accuracy of coding of ethnicity across hospitals ranged from 67% to 100%)20. This suggests that processes within hospitals may influence coding accuracy, although whether this is driven by staff or organizational issues is unclear.
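
The kind of comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the cited study: the function name `accuracy_by_hospital`, the record layout and the sample records are all hypothetical, and a missing coded value is assumed to count as a mismatch.

```python
from collections import defaultdict

def accuracy_by_hospital(records):
    """Return {hospital: share of records whose coded ethnicity matches self-report}."""
    matches = defaultdict(int)
    totals = defaultdict(int)
    for hospital, self_reported, coded in records:
        totals[hospital] += 1
        # Assumption: a missing code (None) is treated as a mismatch.
        if coded == self_reported:
            matches[hospital] += 1
    return {h: matches[h] / totals[h] for h in totals}

# Hypothetical records: (hospital, self-reported ethnicity, coded ethnicity)
records = [
    ("Hospital A", "Indian", "Indian"),
    ("Hospital A", "White British", "White British"),
    ("Hospital A", "Black Caribbean", "Black Caribbean"),
    ("Hospital B", "Indian", "Other Asian"),
    ("Hospital B", "White British", "White British"),
    ("Hospital B", "Black Caribbean", None),
]

result = accuracy_by_hospital(records)
for hospital, accuracy in sorted(result.items()):
    print(f"{hospital}: {accuracy:.0%} of codes match self-report")
```

Grouping the same matches by self-reported ethnic group rather than by hospital would give the between-group comparison.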

Incentives to record ethnicity can be effective. Ethnicity recording was introduced into UK primary care in 1991 and into Hospital Episode Statistics in 1995, but this was only financially incentivized under the Quality and Outcomes Framework (QOF) between 2006 and 2011. The completeness of ethnicity recording rose from 27% for individuals registered 1990–2012 to 78% for individuals registered 2006–2012, after it was incentivized, and the ethnic breakdown of Clinical Practice Research Datalink participants was comparable to the UK population21. Challenges remain, however: although accuracy is high for people who self-identify as white British (97% accurate), it is poorer for minority ethnic groups (59% accurate)20.

Training and standardization

Two key factors that affect collection of ethnicity data are staff training and knowledge, and variation in data collection procedures18,22. Standardized data collection protocols and the use of standardized ethnicity categories that can be harmonized across sectors would reduce this variation. Harmonization across countries or continents may not be appropriate, as ethnicity is a social construct and varies significantly between countries.

Self-identification of ethnicity by the patient will avoid errors and emphasize the value of an individual’s lived experiences, as will self-completion of data collection forms developed for this purpose. This will require comprehensive staff training to address barriers such as lack of time, or patient capacity for self-report, which can present in pressurized clinical situations4. Training should be developed with patients and public members to ensure that protocols are acceptable to patients, and that the reasons for collecting ethnicity information are clear and justifiable to both patients and staff members and are conveyed in appropriate languages and spoken or written formats19.

Public involvement will avoid pitfalls. During the COVID-19 pandemic, the initial practice in the UK was to use the ethnic grouping Black, Asian and Minority Ethnic (BAME) for early data analyses on outcomes, until this was discontinued following public input and feedback that this was not an appropriate description, as BAME groups together disparate ethnicities23. Community members should also be involved when disseminating information to the public on the importance of data linkage and collection more generally, as well as communications on how health and ethnicity data in particular are used and interpreted.

As well as collecting ethnicity information, data on the wider determinants of health will help to understand inequalities. Most of the differences between ethnic groups in COVID-19 outcome analyses are due to wider structural factors, such as housing and intergenerational living, poor-quality employment and occupational exposure, and environmental support for health behaviours, that are imperfectly collected in electronic health records, if collected at all24.

Policy and professionalism

Newly developed protocols, guidance and ethnic grouping standards will require support across sectors. Policy changes will be needed to enshrine regular reviews of guidance, as well as routine monitoring and publication on the quality of ethnicity coding data4 (Box 2). There is an opportunity for ethnicity to be an exemplar for improving data quality overall; the power of ethnicity data will be increased if all data recording is also improved. Minimum standards for electronic health records should be introduced; health informatics, including the work of the Professional Record Standards Body and the Faculty of Clinical Informatics in these areas, should be professionalized; and research using electronic health records should be increased.

Data quality should be regularly reported at regional and national levels, with cycles of improvement for both completeness and quality. In addition, ethnicity collection reporting could include cross-disease and cross-country comparison; there has been little or no mention of ethnicity in the Global Burden of Disease studies and other similar inter-country or global data collection exercises25. Efforts to collate and improve the comparability of ethnicity data globally are needed, as comparison of health inequalities data between countries is limited by inconsistent ethnicity data collection methods and variation in ethnic group categorizations. Addressing these issues will be complex, as self-defined ethnicity will differ depending on the social and cultural context in which the individual is responding26. In addition, any international comparison must be clear about the different experiences and cultural contexts of comparable ethnic groups in different countries, such as whether and how they are minoritized.
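
As a rough illustration of what routine regional reporting of completeness might involve, the Python sketch below computes the share of records with a usable ethnicity code in each region and flags regions that fall below a target. Everything here is an assumption made for illustration: the function name `completeness_report`, the set of values treated as missing, the 90% threshold and the sample records do not come from any NHS standard or from the studies cited.

```python
from collections import defaultdict

# Assumed placeholder values that count as "ethnicity not recorded".
MISSING = {None, "", "Not stated", "Unknown"}

def completeness_report(records, threshold=0.9):
    """Map each region to (completeness, below_threshold)."""
    recorded = defaultdict(int)
    totals = defaultdict(int)
    for region, coded_ethnicity in records:
        totals[region] += 1
        if coded_ethnicity not in MISSING:
            recorded[region] += 1
    report = {}
    for region, total in totals.items():
        share = recorded[region] / total
        # Flag regions whose completeness is under the (hypothetical) target.
        report[region] = (share, share < threshold)
    return report

# Hypothetical records: (region, coded ethnicity)
records = [
    ("North", "Indian"),
    ("North", "White British"),
    ("North", "Not stated"),
    ("North", "Pakistani"),
    ("South", "White British"),
    ("South", "Chinese"),
]

report = completeness_report(records)
for region, (share, flagged) in sorted(report.items()):
    status = "below target" if flagged else "meets target"
    print(f"{region}: {share:.0%} complete ({status})")
```

Completeness of this kind says nothing about whether the recorded codes are accurate, so a report would need to track both measures side by side.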

Ultimately, funding is required for health care systems and researchers, with supporting policy to ensure continued implementation and monitoring. These are not new arguments, but they are receiving renewed attention following the impact of the COVID-19 pandemic. Action is required to ensure that existing health inequalities based on ethnicity are not maintained or exacerbated. Improving the collection and reporting of ethnicity information in routine health data should be one part of a wider process to tackle health inequalities.