How ethics combine with big data: a bibliometric analysis

The term Big Data is becoming increasingly widespread throughout the world. Its use is no longer limited to the IT industry, quantitative scientific research, and entrepreneurship; it has also entered everyday media and conversation. The prevalence of Big Data is simply a result of its usefulness in searching, downloading, collecting and processing massive datasets. It is therefore not surprising that the number of scientific articles devoted to this issue is increasing. The vast majority of research papers, however, deal with purely technical matters. Yet large datasets coupled with complex analytical algorithms pose risks of non-transparency, unfairness (e.g., racial or class bias), cherry-picking of data, or even intentional misleading of public opinion and policymakers, for example by tampering with the electoral process in the context of ‘cyberwars’. This work therefore implements a bibliometric analysis to investigate the development of ethical concerns in the field of Big Data. The investigation covers articles obtained from the Web of Science Core Collection Database (WoS) published between 1900 and July 2020. A sample of 892 research papers was evaluated using HistCite and VOSviewer software. The results shed light on the evolution of the junction of two concepts: ethics and Big Data. In particular, the study shows that the topic is relatively poorly represented in the scientific literature and that interest in it is growing relatively slowly. In addition, ethical issues in Big Data are discussed mainly in the fields of health and technology.


Introduction
The concept of Big Data has emerged in recent years and has become an active field of research with great interest from academics and practitioners. An extensive body of literature exists concerning the technical potential and challenges of Big Data since, as data volume increases, the velocity, variety, and veracity of data analysis become more sophisticated (Díaz et al., 2012; Michael and Miller, 2013; Hashem et al., 2015). Despite the apparent interest in the use of Big Data tools in the scientific literature and the well-established field of the ethics of technology, the two themes are not often combined in scientific research. Figure 1 presents the number of newly published scientific articles containing, respectively, "Big Data" (red bar), "ethic*" (green bar), and "Big data and ethics" (purple bar) in the topic search of the WoS database search engine. Figure 1 starts in 1993, the year in which publications in the field of Big Data began to appear. It is clear that the papers dealing simultaneously with Big Data and ethics are a minor fraction of the overall discussion around Big Data. Are there indeed only a few ethical doubts appearing in the context of Big Data systems?
In our paper, we decided to narrow down the analysis to Big Data because of the specificity of this phenomenon. The meaning of 'Big Data' must be clarified in relation to the notion of Data Science. Traditionally, Data Science is a broad notion, which encompasses mathematics, computer science and relevant expertise in the application domain (health, policing, insurance, etc.). Data Science applies scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured (Ley and Bordas, 2018). Three main fields in Data Science can be distinguished (Song and Zhu, 2016):
• Data analytics: data is extracted and categorised to obtain useful patterns and behavioural data.
• Machine learning: focuses on the development of computer programs that can access data and use it to learn for themselves.
• Big Data: concentrates on mining useful information from large volumes of data.
Thus, 'Big Data' is a term that describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. But it is not the amount of data that matters; what is truly important is what organisations do with the data. Big Data can be analysed for insights that lead to improved decisions and strategic business moves. By employing Big Data, companies and organisations can capture and analyse ample information about products, services, buyers, suppliers, consumer preferences, etc. Therefore, the central question is what ethical issues are associated with the use and analysis of Big Data.
In 2012, Boyd and Crawford (Boyd and Crawford, 2012) claimed that "very little is understood about the ethical implications underpinning the Big Data phenomenon." The literature maps out several ethical dilemmas that evolve in the Big Data context, which are well summarised in the book of O'Neil (2016), from discriminatory behaviour toward minorities and people living in poor neighbourhoods, to abusive labour practices, to the exploitation of consumers. A discussion of these dimensions is offered in other works of the present special issue, e.g., (Sareen, Rommetveit and Saltelli, 2020). Perhaps the most visible form of alarm against algorithms concerns their use in various types of cyberwarfare, which is often militarily directed and deployed against industrial and military infrastructures (Halpern, 2019). Also alarming is the use of social media to disrupt elections (McNamee, 2019) and to operate forms of ethical sabotage such as amplifying discord in social conflicts. No aspect of public life is spared, from political figures to the use of vaccines (Broniatowski et al., 2018), from gun control and mass shootings to migration, and, at the time of writing, even COVID-19 is caught in the crossfire (Rankin, 2020). The dangers to liberty in the form of digital dictatorship are among the challenges identified by historian Yuval N. Harari (Harari, 2018), while many fear the contribution of Big Data to the deployment of autonomous lethal weapons. A race seems to be taking place between the scripts of dystopian science fiction (e.g., the series Black Mirror) and what happens in reality. In the age of Jules Verne, fiction limited itself to anticipating the technology of the coming few decades; now it becomes a reality during the period of crafting and producing the script, in a process which has been called "rapidification" (Pope Francis, 2015).
Due to the complex ethical concerns and high relevance of Big Data, it becomes increasingly difficult or even impossible to understand the overall structure and development of this field without more in-depth analytical approaches. Not many papers have addressed the evolution of the concept of combining ethics with Big Data. That is the gap in the nascent literature that we aim to fill by providing extensive insights into publication patterns. To the best knowledge of the authors, a study of ethics in Big Data using bibliometric methods has not yet been carried out. However, bibliometric methods have been successfully used to analyse the relationship between ethics and entrepreneurship (Vallaster et al., 2019).
Fig. 1 Number of research papers on the topic of "Big Data", "Ethic*" and "Big Data and ethic*". Search on www.webofknowledge.com using the search string: TOPIC ("Big Data"); TOPIC (ethic*); TOPIC ("big data" AND ethic*) (July 6th, 2020).
HUMANITIES AND SOCIAL SCIENCES COMMUNICATIONS | https://doi.org/10.1057/s41599-020-00638-0

Methodology
Bibliometric analysis, which involves the application of mathematical and statistical methods to scholarly publications (Pritchard, 1969), is the cornerstone of modern literature research (Bornmann, 2017). It allows researchers to investigate knowledge structures, track developing research fields, and capture the interdisciplinarity of research topics (Reuters, 2008; Pauna et al., 2018; Zou et al., 2018). The goal of traditional citation analysis is to investigate two issues: (i) whether two articles are connected through citations, and (ii) how many citations an article has accrued. It is assumed that scientific impact is defined as the extent to which given research papers have been used by other researchers (Bornmann et al., 2008), so citation is taken as the main channel of communication between scientists. The number of citations and the average citation count are often, though not always (Osterloh and Frey, 2020), correlated with the quality and influence of scholars (Tang et al., 2018). However, it is also said that a high number of citations is a necessary, but not sufficient, condition of 'being influential' (Small, 1978). It should also be noted that different scientific fields have different citation rates (Radicchi et al., 2008). Therefore, measurements of performance based on citation counts cannot be directly compared across research fields.
In our research, we are focusing on a numerical feature of citation, i.e., we are assuming that research impact is not intangible, but measurable in a quantitative way (Zhang et al., 2013). We have decided to apply three different bibliometric methods to investigate the development of the relationship between ethics and Big Data: (i) descriptive analysis, (ii) network-citation analysis, and (iii) co-occurrence analysis.
The first approach concerns a descriptive analysis of fundamental indicators, such as the number of research papers over time and the number of global and local citations. The difference between local and global citations lies in the set from which citations are counted. The local cited reference (LCR) shows the number of citations in a paper's reference list to other manuscripts within the created collection. In comparison, the global citation score (GCS) presents the total number of citations to an article in the Web of Science Core Collection. In our study, we focus on local citations, which should be understood as a contribution to the development of the field being analysed. For example, we are not interested in how many geographers mentioned research on the ethics of Big Data in their studies; we are interested in how many scientists writing about ethics in Big Data used a given article on this subject. Thus, a paper with a large number of global citations (GCS), which has reached many researchers from other fields, but with a low LCR has made only a small contribution to the development of the field related to the topic of the article.
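The distinction between counting citations inside and outside the created collection can be illustrated with a minimal Python sketch. It is not the HistCite implementation; the function name `local_citation_counts`, the paper identifiers, and the toy reference lists are all hypothetical. For each paper, the sketch counts how often it is cited by other papers in the collection and how many of its own references point to papers in the collection.

```python
from collections import Counter


def local_citation_counts(collection):
    """Count local citations in a toy collection.

    collection maps a paper id to the list of ids it cites; cited ids
    may lie outside the collection (hypothetical data layout).
    Returns (cited_by_local, refs_to_local).
    """
    in_collection = set(collection)
    cited_by_local = Counter({pid: 0 for pid in collection})
    refs_to_local = {}
    for pid, refs in collection.items():
        # Keep only references that point to other papers in the collection.
        local_refs = {r for r in refs if r in in_collection and r != pid}
        refs_to_local[pid] = len(local_refs)
        for r in local_refs:
            cited_by_local[r] += 1
    return dict(cited_by_local), refs_to_local


# Hypothetical example: X1..X3 are papers outside the collection.
papers = {
    "A": [],
    "B": ["A", "X1"],
    "C": ["A", "B", "X2"],
    "D": ["C", "X3"],
}
lcs, lcr = local_citation_counts(papers)
print(lcs)  # times cited within the collection
print(lcr)  # references pointing inside the collection
```

A global count, by contrast, would need citation data for the whole Web of Science Core Collection and cannot be derived from the local collection alone.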
In the second step, we employ network-citation analysis to disclose the relationship between the most-cited publications (Small, 1973). At this stage of our investigation, we use HistCite software (2005) to generate a historiograph, a graphical representation of the network between the most-cited works (based on the LCR indicator). In a historiograph, the vertical axis represents time, and the horizontal axis shows citation network nodes. Each node refers to a single research paper with its unique number, while the size of the node reflects the number of citations in the local database. The arrows express citation relationships, pointing from the analysed manuscript to the previously published one it cites. Moreover, this visualisation allows us to present the timeline of the publications under consideration. According to Griffith (Griffith et al., 1974), the top forty research items with the highest number of citations are the optimal number for creating the historiograph.
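The construction of such a citation network can be sketched as follows. This is a simplified illustration under our own assumptions, not HistCite's algorithm: the function `historiograph_edges`, the record layout, and the five toy papers are hypothetical. The sketch ranks papers by the number of citations received within the collection, keeps the top `top_n`, orders them by year, and lists arrows from each citing paper to the earlier paper it cites.

```python
def historiograph_edges(papers, top_n):
    """Select the top_n locally most-cited papers and the arrows between them.

    papers maps an id to {"year": int, "refs": [ids]} (hypothetical layout).
    Returns (nodes ordered by year, edges as (citing, cited), local counts).
    """
    ids = set(papers)
    local_count = {pid: 0 for pid in papers}
    for pid, rec in papers.items():
        for ref in set(rec["refs"]):
            if ref in ids and ref != pid:
                local_count[ref] += 1
    # Rank by local citations (descending); break ties by year, then id.
    ranked = sorted(papers, key=lambda p: (-local_count[p], papers[p]["year"], p))
    keep = set(ranked[:top_n])
    edges = sorted(
        (pid, ref)
        for pid in keep
        for ref in set(papers[pid]["refs"])
        if ref in keep and ref != pid
    )
    nodes = sorted(keep, key=lambda p: (papers[p]["year"], p))
    return nodes, edges, local_count


toy = {
    "P1": {"year": 2012, "refs": []},
    "P2": {"year": 2015, "refs": ["P1"]},
    "P3": {"year": 2016, "refs": ["P1", "P2"]},
    "P4": {"year": 2018, "refs": ["P2", "P3"]},
    "P5": {"year": 2019, "refs": ["P1"]},
}
nodes, edges, counts = historiograph_edges(toy, top_n=3)
print(nodes)
print(edges)
```

In a drawing, node size would be scaled by the local count and nodes placed along the time axis by year.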
In the third step, we use VOSviewer software (Van Eck and Waltman, 2009) to conduct a co-occurrence analysis of terms in order to ascertain trends and to identify "hotspot" domains (Cho and Khang, 2006; Williams and Plouffe, 2007). The co-occurrence method measures the distance between two terms: the more often two phrases co-occur in the same line of text, the smaller the distance between them. VOSviewer applies a natural language processing (NLP) algorithm (Van Eck and Waltman, 2011) to identify the strength of association among noun phrases (see Note 1). The software creates a distribution function for each second-ordered phrase and compares it with the overall distribution function of co-occurrences over noun phrases (Van Eck and Waltman, 2010). The lower the distance between phrases in a semantic context, the higher the expected association strength. Based on the word count and association strength, VOSviewer creates a co-occurrence map, allowing us to distinguish main clusters characterised by strong association. To construct the map, VOSviewer uses the SMACOF algorithm (Borg and Groenen, 2005), which minimises the function

$$V(X_1, \ldots, X_n) = \sum_{i<j} s_{ij} \, \lVert X_i - X_j \rVert^2$$

under the constraint

$$\frac{2}{n(n-1)} \sum_{i<j} \lVert X_i - X_j \rVert = 1,$$

where $n$ is the number of nodes in the network, $X_i$ is the location of node $i$ in a two-dimensional space, $\lVert X_i - X_j \rVert$ is the Euclidean distance between nodes $i$ and $j$, and $s_{ij}$ is the association strength between nodes $i$ and $j$, following the formulation of Van Eck and Waltman (2010). VOSviewer builds clusters of nodes by maximising the function

$$V(c_1, \ldots, c_n) = \sum_{i<j} \delta(c_i, c_j) \, (s_{ij} - \gamma),$$

where $c_i$ is the cluster to which node $i$ is assigned, $\delta(c_i, c_j)$ is a function that equals one if $c_i = c_j$ and zero otherwise, and $\gamma$ is a resolution parameter that determines the level of detail of the clustering (the higher $\gamma$ is, the higher the number of clusters).
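To make the notions of co-occurrence, association strength, and clustering quality more concrete, the following Python sketch works through a toy example. It rests on several assumptions of ours: the association strength is taken as the co-occurrence count divided by the product of the two terms' occurrence counts (a common normalisation; the constants VOSviewer applies may differ), the clustering quality is taken as the sum of (association strength minus γ) over all pairs of terms in the same cluster, and the documents and cluster assignment are invented. The sketch only evaluates the quality of a given assignment; it does not perform the optimisation itself.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count term occurrences and pairwise co-occurrences per document."""
    occ, pair = Counter(), Counter()
    for terms in docs:
        unique = sorted(set(terms))
        occ.update(unique)
        pair.update(combinations(unique, 2))  # keys are (a, b) with a < b
    return occ, pair


def association_strength(occ, pair):
    """Assumed normalisation: co-occurrences / (occurrences_a * occurrences_b)."""
    return {(a, b): c / (occ[a] * occ[b]) for (a, b), c in pair.items()}


def clustering_quality(strength, clusters, gamma):
    """Sum of (strength - gamma) over all term pairs sharing a cluster."""
    total = 0.0
    for a, b in combinations(sorted(clusters), 2):
        if clusters[a] == clusters[b]:
            total += strength.get((a, b), 0.0) - gamma
    return total


# Hypothetical documents, each reduced to its set of noun phrases.
docs = [
    ["ethics", "privacy", "health"],
    ["ethics", "privacy"],
    ["health", "medicine"],
    ["ethics", "health"],
]
occ, pair = cooccurrence(docs)
strength = association_strength(occ, pair)
clusters = {"ethics": 0, "privacy": 0, "health": 1, "medicine": 1}
quality = clustering_quality(strength, clusters, gamma=0.1)
print(strength)
print(quality)
```

Raising `gamma` penalises every same-cluster pair more heavily, which is why higher values favour a larger number of smaller clusters.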
Although there is a significant overlap between the content of the Scopus and WoS databases (Norris and Oppenheim, 2007), we have decided to use the Web of Science Core Collection Database (WoS) because it does not share the following disadvantages of Scopus. First, in Scopus, the citation matching algorithm seems to need improvement (Valderrama-Zurián et al., 2015). Second, duplicate publications in Scopus represent a vital data quality problem that requires serious attention (Van Eck and Waltman, 2017).
As far as WoS is concerned, a general limitation is that its coverage of the social sciences and humanities is still limited (Mingers and Leydesdorff, 2015). This is connected with the relatively small coverage of book publications, even though the number of indexed books has been increasing during the last five years. Non-English language journals are also underrepresented in the WoS database. Despite this, the Web of Science Core database is regarded by scholars, e.g., Byl et al. (2016), as a suitable tool for bibliometric evaluation.

Empirical results
In the initial stage of our research, we surveyed the literature on ethics in Big Data. Documents were collected on July 6th, 2020, by searching the Web of Science Core Collection. We searched for the topic "Big Data" and "Ethic*" in all categories for the period 1900-2020. In the WoS database, the fields mined to return results in a common 'topic search' are:
• The title of the article, review, proceedings, book, etc.
• The abstract, which is the work's summary containing the key points discussed, such as research question, methodology, discussion, and conclusion. This field is supplied by the author(s) of the paper.
• The keywords and keywords plus fields. The keywords field is the one supplied by the author(s) and "tags" the main and sub-topics of the paper's content. The keywords plus field is generated by an algorithm that provides expanded terms stemming from the record's cited references or bibliography.
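The logic of such a topic search, including the right-hand truncation in "Ethic*", can be approximated on locally stored records with a short Python sketch. This is not how the WoS search engine is implemented; the helper names `wildcard_to_regex` and `topic_match`, the record fields, and the two sample records are hypothetical. A record matches when every search term is found in its title, abstract, or keywords.

```python
import re


def wildcard_to_regex(term):
    """Turn a phrase with an optional trailing * into a whole-word regex."""
    body = re.escape(term.rstrip("*"))
    suffix = r"\w*" if term.endswith("*") else ""
    return re.compile(r"\b" + body + suffix + r"\b", re.IGNORECASE)


def topic_match(record, terms):
    """True if all terms occur in the title, abstract, or keywords."""
    text = " ".join(record.get(field, "") for field in ("title", "abstract", "keywords"))
    return all(wildcard_to_regex(term).search(text) for term in terms)


# Hypothetical records.
rec_a = {
    "title": "Ethical challenges of Big Data in public health",
    "abstract": "We discuss consent and privacy.",
    "keywords": "privacy; consent",
}
rec_b = {
    "title": "Scaling Big Data pipelines",
    "abstract": "A storage benchmark.",
    "keywords": "throughput",
}
query = ["big data", "ethic*"]
print(topic_match(rec_a, query), topic_match(rec_b, query))
```

Because the wildcard only extends the word to the right, "ethic*" matches "ethics" and "ethical" but not words that merely contain the string, such as "unethical".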
The total number of obtained documents was 892 (Table 1). The data were evaluated with the bibliometric software HistCite and VOSviewer. Based on the collected information, we aim to show where the topic of ethics in Big Data began and to identify the seminal papers and authors. Basic statistics on the resulting local bibliometric database are presented in Table 1.
The analysis of the 892 records showed a substantial dispersion of publications, measured as the number of articles per journal (on average, each journal in the database was represented by 1.5 scientific papers on the topic under consideration). A moderate concentration was observed in the average number of authors per journal (on average, 4.5 authors per journal). There is a substantial difference between the global and local numbers of citations: the selected publications were cited 8621 times in the whole WoS database, but only 798 times within the database created for the study. It can be assumed that researchers from fields other than data science were more willing to use work related to ethics in Big Data. Thus, the field itself was developing quite slowly (a relatively small number of connections in the local database).
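The gap between global and local citations can be made explicit with a few lines of arithmetic. The sketch below uses only the three figures reported above (892 records, 8621 global citations, 798 local citations); the variable names are ours, and the derived ratios are rounded illustrations rather than values reported in Table 1.

```python
records = 892
global_citations = 8621
local_citations = 798

# Share of all citations that come from within the local collection.
local_share = local_citations / global_citations
# Average citations per paper, globally and locally.
global_per_paper = global_citations / records
local_per_paper = local_citations / records

print(f"local share of citations: {local_share:.1%}")
print(f"global citations per paper: {global_per_paper:.2f}")
print(f"local citations per paper: {local_per_paper:.2f}")
```

Under these figures, fewer than one in ten citations to the collected papers originates from another paper in the collection.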
The distribution of the analysed publications over time is presented in Fig. 2. The pioneering work in the context of ethics in Big Data was an article published in 2011 by Helbing and Balietti, in which one of the goals was to "elaborate ethical standards regarding the storage, processing, evaluation, and publication of social and economic data" (Helbing and Balietti, 2011). However, Danah Boyd and Kate Crawford are considered the mothers of the field. In their paper from 2012, they raised the issue of data privacy in social media and the issue of ignoring research ethics because "data is seemingly public" (Boyd and Crawford, 2012). In the same year, three other publications that met the criteria of our search were published, but they did not achieve much success as measured by the number of citations. Within a year, three more articles were published that drew attention to the use of online data for social research (Loader and Dutton, 2012; Wright, 2012; Nunan and Di Domenico, 2013). In subsequent years, a slow increase in interest in research on ethical issues in Big Data can be seen. So far, interest peaked in 2018 and 2019, when 173 and 196 scientific articles were published, respectively. As of July 6th, 2020, 107 research papers on this subject had been published in that year, which indicates that research on ethics in Big Data is still slowly entering the field of scientific research. Taking into account the current epidemic and voices about the questionable reliability of some publications on COVID-19, a sharp increase in research on ethics in Big Data can be expected.
The importance of individual authors for the development of research on ethics in Big Data can be assessed based on the number of citations of their publications in the created database. Table 2 presents a list of the ten most frequently cited authors. Crawford, with 181 citations of her works in the local database, leads the ranking. The second place goes to Boyd, whose work was cited 160 times in the analysed database. It is worth mentioning, however, that Boyd is a coauthor of two of Crawford's works. Crawford, with her seven publications, is the second most productive author within the created database. The most prolific author is Vayena, with eleven manuscripts published in the field of ethics in Big Data. Analysing the data contained in Table 2, one can see that the ratio of the number of local citations (LCS) to the number of publications (Q) is not particularly high for any of the authors, which may indicate that research on this subject is so far at an initial stage. Interestingly, if one adds up the citations by gender, one finds that female scholars total 429 citations against male scholars' 211, a rare partition in a usually male-dominated academy. We also compiled a ranking of the most frequently cited publications in the field of ethics in Big Data (Table 3). It is worth recalling that the generated database contains mainly scientific articles and conference publications, with only some books, book chapters and reports.
The most significant publication in the evolution of ethics in Big Data is the previously mentioned work of Boyd and Crawford (2012) entitled "Critical questions for big data. Provocations for a cultural, technological, and scholarly phenomenon". This research paper focuses on ethical problems concerning data privacy in social media and on the lack of understanding among ethical boards of "the processes of mining and anonymising Big Data" (Boyd and Crawford, 2012). The second most cited work is the paper by Mittelstadt and Floridi, "The Ethics of Big Data: Current and Foreseeable Issues in Biomedical Contexts" (Mittelstadt and Floridi, 2015). In this paper, as the title suggests, the authors investigate a biomedical context, without ignoring such issues as privacy, ownership, epistemology, and objectivity. They also note that "as is often the case with the cutting edge of scientific and technological progress, understanding of the ethical implications of Big Data lags behind" (Mittelstadt and Floridi, 2015). The third article with the highest LCS is "The ethics of algorithms: Mapping the debate", which clarifies the ethical importance of algorithmic mediation (Mittelstadt et al., 2016). A thought-provoking element of this article is a conceptual map of the ethics of algorithms, allowing a more rigorous diagnosis of algorithms' ethical challenges. Among normative concerns, the authors distinguish unfair outcomes and transformative effects, while epistemic concerns consist of inconclusive, inscrutable and misguided evidence. In turn, the fourth most cited publication, "Big Data, Big Problems: Emerging Issues in the Ethics of Data Science and Journalism," pointed out the uncertain status of data collected through telemetry or public submission (Fairfield and Shtein, 2014). The authors indicate the growing ethical problems of media and research using Big Data techniques, clearly observable now in the era of the COVID-19 pandemic and the spread of dubiously ethical studies. The main conclusion of their paper is the need for a framework combining stability with flexibility as the best way to achieve the original purpose of fundamental ethical principles. The content of Tables 2 and 3 is, of course, correlated, i.e., the most cited researchers are the authors of the most important publications on the discussed issue. It may be surprising that none of the works by Vayena, the most prolific author, is included in Table 3. However, it should be noted that despite publishing 11 research papers on the analysed topic, they were cited only 25 times, which is an average of 2.7 per manuscript (LCS/Q score). Therefore, none of them managed to significantly impact the development of the field being analysed.
Table 3 The main papers related to ethics and Big Data.
The crucial part of this analysis is not to identify the most frequently cited publications but to establish a network of connections between them. Thus, using the HistCite software, we have prepared a historiograph (Fig. 3) involving, typically, around 5% of the publications that are the most cited in the local database (Garfield et al., 2003). As can be seen in Fig. 3, the biggest node (no. 27) represents the very first publication of Boyd and Crawford (Boyd and Crawford, 2012). Their research turned out to be innovative and groundbreaking enough to contribute to further ethical considerations in the context of Big Data. In principle, all subsequent publications relate directly or indirectly to this particular study. For example, the work of Vayena (node no. 145) refers to the work of Mittelstadt and Floridi (node no. 201), which was inspired by Boyd and Crawford (node no. 27).
It is not surprising that the second-largest node (no. 201) refers to the second most cited work, Mittelstadt and Floridi (2015). They refer to the first publication, and their paper became an essential source of inspiration for papers issued after 2016.
Interestingly, none of the research published between 2013 and 2015 gained as much popularity as the work of Mittelstadt and Floridi (2015). The success of this work probably results from the authors' explicit embedding of ethics in the context of biomedical research. The vast majority of later publications contain references to the two articles mentioned above. It is also worth noting that none of the papers published in 2018 or later is included in the top most-cited list. The hypothesis arises that ethical issues in Big Data relate more to the biological and medical sciences than other disciplines.
At the last stage of our analysis, we use the co-occurrence map, which helps to identify the various areas of research and to understand the direction in which ethics combines with Big Data. We used the information included in the title, abstract, and keywords as term sources, obtaining 156555 unique terms extracted from the local database. We applied the text mining functionality of VOSviewer to identify the noun phrases in the text and then to convert all plural noun phrases into singular ones. The minimum number of occurrences was set to 20, and 172 terms met this threshold. For those 172 terms, a relevance score was calculated by VOSviewer, and we then selected the 60% most relevant phrases. We thus ended up with 103 terms, from which we excluded terms not germane to the goals of the analysis, such as specific place names, general statistical terms, or measures reflecting such things as time, quantity, and rate. VOSviewer was also used to construct a bibliometric diagram visualising the co-occurrence of the extracted terms. Figure 4 presents the co-occurrence term map. Each term is represented by a blurred circle, where the size of the label represents the term's frequency, the colour characterises the cluster to which it conceptually belongs, and proximity to another phrase indicates the degree of relatedness between them. The analysis of Fig. 4 showed science (the biggest font size) to be the most frequently mentioned phrase, followed by the words health, medicine, governance, artificial intelligence, and knowledge.
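The two-stage term selection described above (an occurrence threshold followed by a relevance cut-off) can be sketched in Python. This is an illustration under our own assumptions, not VOSviewer's code: the function `select_terms`, the toy occurrence counts, and the toy relevance scores are hypothetical, and the number of terms kept is obtained by rounding the fraction to the nearest integer, which reproduces 103 for 60% of 172 terms.

```python
def select_terms(occurrences, relevance, min_occurrences=20, keep_fraction=0.6):
    """Keep terms above an occurrence threshold, then the most relevant share."""
    eligible = [t for t, n in occurrences.items() if n >= min_occurrences]
    # Most relevant first; ties broken alphabetically for reproducibility.
    eligible.sort(key=lambda t: (-relevance[t], t))
    n_keep = round(len(eligible) * keep_fraction)  # assumed rounding rule
    return eligible[:n_keep]


# Hypothetical counts and relevance scores.
occurrences = {"privacy": 40, "consent": 25, "health": 60, "year": 30, "rate": 22, "blockchain": 3}
relevance = {"privacy": 2.1, "consent": 1.8, "health": 1.2, "year": 0.2, "rate": 0.3, "blockchain": 2.5}
selected = select_terms(occurrences, relevance)
print(selected)
```

In this toy case "blockchain" is dropped by the occurrence threshold despite its high relevance score, while the generic terms "year" and "rate" are dropped by the relevance cut-off.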
There are three clusters in Fig. 4. The red cluster can be called the 'legal cluster.' This cluster groups terms associated with governance, regulation, law, and rights concerning gathering health and biomedical data, but also the ethical issues of obtaining private data on the Internet. The green cluster, which can be called the 'scientific cluster,' shows the ethical concerns and implications in data sharing and access to knowledge and research results. The blue cluster, named the 'medical cluster,' points to the importance of ethics in medicine, healthcare, and artificial intelligence. As was shown in the previous part of the analysis, ethics in the biomedical context are one of the biggest worries in the implementation of Big Data analysis. All clusters are located close to each other, proving a strong relationship between the topics covered within each group. One may even be tempted to say that the main phrases in clusters are located on their borderlands, demonstrating the interpenetration of the discussed phenomena. In fact, the subject of obtaining and processing medical data is the most pressing ethical issue related to Big Data, and references to this topic are undoubtedly visible in each of the clusters.

Conclusion
Big Data is a rapidly developing research area that attracts a lot of interdisciplinary attention, including attention to the ethical issues which arise in the course of the implementation of this new technology. The results of this study reveal that current studies on ethics and Big Data are dominated by Boyd, Crawford, Mittelstadt, and Floridi, and that the thematic scope itself mainly relates to health and medical issues. It seems that these trends will also be maintained during the COVID-19 outbreak, as many ethical questions related to tracking the spread of the virus are raised (Jamrozik and Selgelid, 2020; Robert et al., 2020; WHO, 2020). In this particularly difficult period, attention is being paid to the issue of individual freedom, both in terms of the traceability of movement and social networks and in terms of the voluntariness of vaccination. Though not covered by the present investigation, growing ethical attention is focusing on so-called "challenge studies," in which healthy subjects are given a prospective vaccine and then infected with the coronavirus (Elliot, 2020), and on the fact that participants in medical research studies such as these are often minorities or ex-detainees. Closer to the topic of this work, the ethics of Big Data also comes into play in the issue of contact-tracing applications for fighting the pandemic. While these were apparently a success in some countries (Holmes, 2020), they were less so in others, and concerns about the privacy and security risks of the technologies led to an intense ethical debate (Singer, 2020). Ethical assessments of the potential benefits and risks of each action should be made in light of the best available empirical data and models. The expected harms and benefits of different proposed research programmes, concerning not only COVID-19 but also all other areas, should be taken into consideration. A nascent debate has sprung up about which numbers are being used to decide which policies to adopt to fight the pandemic (Caduff, 2020; Didier, 2020), a topic which relates to the ethics of quantification (Saltelli, 2020) and to the present special issue (Sareen et al., 2020; Saltelli and Di Fiore, 2020).
We found that there is a lack of well-recognised literature on ethical issues in Big Data related to micro- and macroeconomic, political and sociological analyses. For example, of the 892 papers reviewed, only 12 are from Economics, 42 from Management, 38 from Business, 27 from Sociology and 20 from Political Science. The small share of papers related to ethics in Big Data in the total number of published scientific studies is striking. The individual works appearing on those topics still form a somewhat limited, fringe research area. However, we realise that the contextual and multi-level phenomenon of ethics and Big Data is a demanding research area, requiring extensive knowledge, both philosophical and purely technical. These factors may contribute to the relatively low popularity of the issue in the scientific literature. In the case of medical research, the subject is also industry-specific. The enormous emphasis on ethical issues in the medical sciences results mostly from working with sensitive data, but also from the perilous consequences of unreliable studies, e.g., those linking autism with vaccinations.
Despite the restrictions arising from the very nature of bibliometric research and the database used (including only some books and book chapters), the analysis allowed us to reconstruct the effects of scientific productivity in concrete and historical terms. Our main contributions in this work are the analysis of statistical patterns and the provision of an informative overview of the different contexts and intersections between ethics and Big Data, at a moment when the field is likely to experience transformation and acceleration.
HUMANITIES AND SOCIAL SCIENCES COMMUNICATIONS | (2020) 7:137 | https://doi.org/10.1057/s41599-020-00638-0

Data availability
The datasets analysed during this study are available in the Harvard Dataverse repository: https://doi.org/10.7910/DVN/RU8KTN (Ethics and Big Data: bibliometric analysis).
Received: 7 April 2020; Accepted: 20 October 2020

Note
1 A noun phrase consists of a head, which is typically a noun, and of elements which (either obligatorily or optionally) determine the head and (optionally) modify the head, or complement another element in the phrase. A noun phrase 'consists of a noun and all the words and word groups that belong with the noun and cluster around it' (Stagaberg, 1979).