Introduction

Given advancements in the digital economy worldwide and the rapid development of related technologies, such as 5G and artificial intelligence, data have become an important resource globally. However, numerous potential risks of data breaches accompany such developments in information technology (IT). The sensitive personal data of more than 100 million Android users were reportedly exposed in May 2021 because of several misconfigurations. In the same year, a database containing the personal information of 533 million Facebook account users across 106 countries was exposed, potentially leading to further social engineering attacks or hacking attempts (Henriquez, 2021). The frequent incidents reported in the media reflect the severity of these data breaches and merely represent the “tip of the iceberg.” Despite the related laws and data breach notification requirements enacted by governments worldwide, such as the General Data Protection Regulation (GDPR) of the European Union, National Security and Personal Data Protection Act of 2019 (NSPDPA) of the United States, and Data Security Law and Personal Information Protection Law of China, data breach incidents continue to occur. Statista reported that the annual number of data compromises increased from 2005 to 2022 (footnote 1). According to statistics from the Privacy Rights Clearinghouse, the occurrence of data breaches has been high since 2010 (footnote 2). As shown in Fig. 1(a), from 2005 to 2018, the number of reported data breach incidents increased by 4.2 times, presenting a significant upward trend. Figure 1(b) indicates that the number of data breaches reached a record high in 2021 (Verizon, 2022). The cost of a data breach has also increased significantly. The global average total cost of a data breach was $4.35 million in 2022, which was the highest in the history of the report, increasing by 2.6% from 2021 and 12.7% from 2020.
As the COVID-19 outbreak forced remote work and digital transformation in recent years, data breach costs increased by $1.07 million in 2021 and $0.97 million in 2022 (IBM, 2021, 2022). Moreover, Meng et al. (2022) suggested that the online spread of public opinion can have severe consequences. Once disclosed, information breaches may damage the image of the related organization, its industry, or even its supervisors.

Fig. 1: Yearly number of data breaches.

A description of the annual number of data breach incidents. Panel (a) presents statistics using data from the Privacy Rights Clearinghouse (PRC), and panel (b) describes data breaches in recent years using data from Verizon.

Data breach risks remain prevalent in universities. The number of reported data breach incidents in higher education institutions is increasing (Bongiovanni, 2019). In a university, which involves multiple stakeholders (Borgman, 2018), individuals exhibit diverse activities in both physical and cyber spaces (Li T, Li Y, Hoque MA, 2022) and interact through the internet, thereby leaving digital footprints (Qin et al., 2022). Students, faculty, staff, and visitors frequently access a university’s information technology infrastructure and generate data in various ways, such as via personal mobile devices, laboratory sensors, and swipe card access systems. These large-scale data interactions and flows among organizations and users inadvertently and continuously expand the digital footprints of universities, potentially leading to information security concerns by increasing the risk of data breaches. Moreover, insufficient security awareness and a lack of attention to data security place universities in a dangerous position. According to a survey by the Joint Information Systems Committee (JISC), only 39% of students indicated that they were informed of how universities store and use their personal data. Only 15% of the staff scored their organizations as eight or more out of ten in terms of data protection (JISC, 2018). Notably, JISC had a 100% track record of gaining access to the most valuable data in universities and research centers using spear phishing (Chapman, 2019). Data breaches may also be caused by human errors, such as sloppy data handling and negligent security procedures, due to insufficient awareness of data security (Ulven and Wangen, 2021). For example, almost 44,000 student records were obtained from the storage of secure information at Arden University in 2022 because of human errors (footnote 3).
Moreover, according to Verizon (2022), the education sector has been facing additional challenges because the pandemic made it mandatory to hold classes online, providing opportunities for malicious hackers and increasing the risk of data breaches.

Universities with plentiful personal and research data, intellectual property, and insufficient awareness of data security are enticing from a hacker’s perspective, making higher educational institutions primary targets (Hina and Dominic, 2020). It has been observed that the number of information security breach incidents reported by higher education institutions worldwide is increasing rapidly (Borgman, 2018). For example, the University of California announced a malicious cyberattack in 2021, and the stolen personal information (e.g., social security numbers, email addresses, phone numbers, and home addresses) was found on the dark web (Ying, 2021).

The same holds true for data breach risks in universities in China. Figure 2 presents the monthly statistics regarding data breach incidents in universities in China as reported by the Education Industry Vulnerability Reporting Platform, a resource-sharing platform for collection and notification of system vulnerabilities in the country’s education industry (footnote 4). The number of reported data breaches was relatively high, with a significant upward trend on a monthly basis, reflecting that Chinese universities are also at severe risk of data breaches, which should not be underestimated.

Fig. 2: Monthly statistics of data breaches in universities.

A description of data breach incidents in universities in China using data from the Education Industry Vulnerability Reporting Platform.

Despite the increasing trend in information breach incidents, previous studies have rarely focused on such incidents in universities (Okibo and Ochiche, 2014). As Hina and Dominic (2020) have reported, only a few studies have focused on the security risks of sensitive information from higher educational institutions. Information security management in universities is a poorly investigated topic (Bongiovanni, 2019).

Hence, in this study, which focuses on universities’ data breach incidents, we aim to investigate the determinants of data breach risks to better understand the underlying impact mechanisms. The research framework is at the university level, and the samples used in empirical analyses were obtained from China. The aim of this study is to answer the following research questions: (i) What factors impact data breaches in universities and how do these factors interact? (ii) By what mechanism does academic strength impact data breaches? (iii) What is the influence of emerging information technologies, such as cloud storage, on the impact mechanism?

Based on crime opportunity theory and routine activity theory, we examine how public vulnerability disclosures, cross-border data flow, academic strength, and the adoption of cloud storage affect the possibility of data breaches, thereby analyzing the interactions between these variables. It is observed that an increase in the number of public disclosures of vulnerabilities increases the frequency of data breaches. In addition, cross-border data flow decreases the number of data breaches. Subsequently, using two mediators, the mechanisms through which academic strength affects data breaches are identified. Universities with higher academic achievements have relatively higher cross-border data flow and vulnerabilities. Furthermore, cloud storage is better than local storage when considering data breaches, and a public cloud has better performance than a private cloud in data security protection. Furthermore, our study shows that cloud adoption negatively moderates the impacts of vulnerabilities and positively moderates cross-border data flows.

This study contributes to the literature in several ways. First, the factors influencing data breach incidents related to universities are empirically examined. Prior studies focusing on data breach risks have primarily considered the medical industry and enterprises. The higher education industry—particularly universities—is presently subject to severe data breach risks, but has received relatively limited attention. Second, as risk management has become a research focus in the context of cross-border data flow, we investigate the effects of cross-border data flow on data breaches and provide a new perspective for understanding the value of such data transfers. Third, the impact of the cloud on data breaches is identified, distinguishing between the effects of different types of cloud adoption. Finally, we contribute to the literature on data breaches and theories on data security, indicating several managerial implications for the control of data security risks in universities to further optimize data protection strategies.

The remainder of this paper is organized as follows. First, the relevant literature is reviewed. Second, related theories are outlined prior to proposing a research framework with hypotheses. Third, the data and variable measurements are described, followed by empirical analyses and main results. Subsequently, several robustness checks are performed. Finally, the results are discussed, and conclusions are drawn.

Literature review

Prior research has analyzed the motives behind cybersecurity and the influencing factors of data breaches. Factors such as organizational attributes, economic indicators, and information technologies have been empirically explored. In this section, we first review the literature related to information breaches in universities and then summarize the literature according to types of influencing factors.

Information breaches in universities

As demonstrated by Bongiovanni (2019), regarding security management, information in universities is the least secure. Data breaches in higher education are becoming increasingly common (Chapman, 2019). One of the most urgent threats faced by higher education is from cybercriminals or hackers seeking to profit from the theft of the sensitive personal and financial information of the students, faculty, and staff (FireEye, 2016). Verizon (2022) noted that monetary gain was the primary motive for approximately 95% of data breach incidents observed in higher education in 2021. In general, the intention of cybercriminals is to steal data that can be quickly monetized.

The open and collaborative environment in a university and the typical access to many portable devices make it easier to gain unauthorized access to sensitive information (Coleman and Purcell, 2015). Web users are highly mobile and accustomed to accessing the web from any device, at any time, and from anywhere. This open-design architecture commonly used by universities undoubtedly facilitates the exchange of information (Okibo and Ochiche, 2014); however, the existence of numerous connected devices across organizations, the coexistence of different security cultures, and the tendency to outsource security controls make universities more vulnerable to information security risks (Borgman, 2018). Additionally, the academic culture of openness and the unencumbered access make it particularly difficult for universities to maintain security. The lax security that facilitates open access and the sharing of cutting-edge academic research and content on the network makes higher education an attractive target for attackers (Roman, 2014). In conclusion, universities that hold sensitive personal data and intellectual property of many researchers are ideal targets (Chabrow, 2015).

One of the factors affecting information security in universities is the increasing difficulty of security management. Noghondar et al. (2012) pointed out that high turnover rates and general complacency toward information security also increase the exposure of university information. Magura et al. (2021) highlighted issues affecting database security that could lead to data breaches and data theft, including human factors, work environments, and the technologies used. Liu et al. (2020) studied how centralized IT decision-making affects the likelihood of cybersecurity breaches in higher education, especially in institutions with a more heterogeneous IT infrastructure. Iriqat et al. (2019) explored the compliance of staff with information security policies at the Palestine University. Other studies have concluded that a lack of security awareness is directly related to how the faculty value the information system assets of their universities (Nyblom et al., 2020). To address these concerns, artificial neural network techniques have been utilized to improve cybersecurity in higher education (Saad AL-Malaise AL-Ghamdi et al., 2022).

Data breach influencing factors

There are three typical types of research on data breaches: (1) analysis of the consequences of data breaches, such as that of Foerderer and Schuetz (2022), who studied the influence on stock market reactions, Ali et al. (2022), who focused on the long-term effects on equity risk, and Bachura et al. (2022), who investigated the emotional response after a data breach and identified breach concepts most relevant to each emotion; (2) research on response strategies, such as user compensation (Goode et al., 2017; Hoehle et al., 2022) and corrective action (Nikkhah and Grover, 2022); and (3) analysis of the causes of data breaches, which we focus on primarily in this paper. The most relevant existing studies on the influencing factors of data breaches from different sectors listed in Table 1 provide a comparative analysis primarily from an industry perspective. From the listed studies, it can be concluded that when organizations at risk of data breaches have more commercial attributes, the interests involved can be more complex; thus, social perception can significantly affect information security, especially the likelihood of cyberattacks. However, when an organization has fewer commercial attributes, the defining attributes of the organization and IT management are dominant factors that influence data security. In the case of companies, due to their special nature as business organizations, researchers are more concerned about the impact of a company’s performance and image, which is likely to cause dissatisfaction among stakeholders (D’Arcy et al., 2020). In addition, the management practices of employees and the personal characteristics of top managers are important factors related to information security (Ifinedo, 2016; Haislip et al., 2021; Burns et al., 2022). Studies related to the health care industry have largely focused on organizational features. 
Scholars have paid more attention to the impact of IT management systems and organizational characteristics on data security (Angst et al., 2017; Dolezel and McLeod, 2019; Kim and Kwon, 2019). The same holds true for higher education, especially for Chinese universities as they are generally public universities with fewer commercial features. Therefore, following the spirit of prior research, this study focused on organizational features and IT measures.

Table 1 Comparative analysis.

The risks of data breaches can differ based on the main industry, geographic location, and types of breaches occurring in the past (Sen and Borle, 2015). Lee and Hess (2022) found that demographic variables (gender, age, race, ethnicity, income, and location) and political ideology are associated with data security. Schlackl et al. (2022) summarized the antecedents of data breaches identified in prior research, including technology measures, information disclosure, organization attributes, etc. In an enterprise, corporate social performance (measured by participation in socially responsible or irresponsible activities) has been proven to affect the likelihood of computer attacks leading to data breaches (D’Arcy et al., 2020). Corporate reputations were found to be important assets in protecting corporate value after a data breach (Gwebu et al., 2020). Wang and Ngai (2022) explored the negative association between firm diversity and data breach risks, delineating the boundary conditions. Ifinedo (2016) discussed how top management support, the severity of sanctions, and cost‒benefit analyses have significantly impacted employee compliance with information systems security policies. Burns et al. (2022) studied personal motives and controls for insider computer abuse, which could lead to costly and severe data breaches. Regarding the medical industry, Wasserman and Wasserman (2022) focused on cybersecurity risks in hospitals. Dolezel and McLeod (2019) studied employee behavior, safety culture, training, supplier selection and handling of personal health information, and strong risk management procedures as data breach factors. Another study found that data breach risks differ according to type and scale of a hospital (Gabriel et al., 2018). Regarding the banking industry, Ali et al. (2020) investigated the effects of socio-factors on the banking sector’s systematic risks.

Given the emerging developments of new information technologies, such as artificial intelligence and intelligent robots (Ban et al., 2022; Lu et al., 2023), IT factors are attracting more attention in related research streams. IT investments have been found to be effective in reducing the risk of data breaches (Sen and Borle, 2015); however, this does not necessarily translate into fewer data breaches. Institutional factors create conditions under which IT security investments can perform more effectively. When considering the impact of information security investments on data breaches, companies must consider the impact of institutional factors and balance them. Li et al. (2021) found that IT security investments have different effects on security breaches in organizations with different approaches to digitalization. Li W, Leung ACM, Yue WT (2022) stated that there is a dynamic interrelationship between IT investments and data breaches. Haislip et al. (2021) found that executives’ IT expertise could be an effective factor influencing reported data security breaches. Additionally, the increase in vulnerabilities adds to the risk of data breaches but is mitigated by an increase in expired vulnerabilities (Sen and Borle, 2015). Regarding new ITs, Fried (1994) discussed both new threats and potential new defenses for information systems security brought about by new products and information technologies. For example, Kim and Kwon (2019) found that electronic medical records and medical management department plans increase the risk of accidental and malicious data breaches, especially in larger hospitals. For emerging cloud services, although people generally believe that cloud services are more vulnerable to security breaches, cloud services in fact reduce the average expected losses of consumers relative to internal software in a high-security loss environment during an attack (Zhang et al., 2020).
Moreover, cloud storage is a type of centralized storage (Bandara et al., 2021; Ouf and Nasr, 2015; Wu et al., 2014) and may be safer when considering the emergence of end-user computing. The task of ensuring information security becomes more complex as information systems become increasingly distributed (Fried, 1993), and the integration of security and IT-related processes can reduce data breaches (Angst et al., 2017). Pang and Tanriverdi (2022) found that cloud migration of legacy IT systems significantly reduces cybersecurity risks for public clouds through the internal and external guardianship provided by the cloud service, which has more resources for establishing effective information protection.

As the digital economy develops, additional discussions on the security and development of cross-border data flow have emerged. Cross-border data flow brings both economic and social benefits. Ten percent of the average profit growth of various industries is attributed to cross-border data (China Academy of Information and Communications Technology, 2021). Bauer et al. (2013) found that limiting the free flow of data leads to a reduction in gross domestic product (GDP). In terms of social benefits and public welfare value, the Organization for Economic Co-operation and Development (OECD) (2019) insists that it is necessary for data to flow domestically and internationally, as this can provide significant developmental benefits. The “public good” nature of data beyond national borders has been emphasized, prompting calls for international data sharing. For example, the COVID-19 pandemic clearly demonstrated the importance of the global sharing of health data for research purposes (United Nations Conference on Trade and Development, 2021). However, cross-border data flow and international storage are associated with perceived risks, such as those concerning surveillance and unwarranted data mining (Meltzer, 2015). To assess risk, Li et al. (2022) developed a risk index system for cross-border data flow and applied it to biomedical organizations. There is evidence that localized data are unlikely to provide better results in terms of data breaches, and the domestic storage of data poses risks to many poorly managed and costly data centers (Chander and Lê, 2014). Indeed, data localization does not contribute to data security but makes it more vulnerable to destruction, especially by hackers (Chander and Lê, 2015).

In summary, the current literature has shown that great progress has been made in research on the factors that influence data breaches, thereby drawing a basic outline of the problem and providing a thorough comprehension of data breaches. Based on this, we focus on identifying the influencing factors related to universities.

Theories and hypotheses

Relevant theories

The routine activity theory proposes three factors leading to crimes (in this case, cybersecurity crimes): (i) potential attackers or malicious insiders with crime motives; (ii) suitable, accessible, and valuable targets; and (iii) a lack of competent guardianship (Cohen and Felson, 1979). In this context, offenders can be predominantly potential attackers, malicious insiders, or insiders who disclose sensitive information unintentionally (Pang and Tanriverdi, 2022). The motive is mainly financial (Verizon, 2022). The target could be accessible IT systems that manage universities’ critical information. Universities can strengthen their guardianship by investing in security protection technology (Liao et al., 2017; Luo et al., 2020; Wang et al., 2015) or by seeking external governance from vendors (Pang and Tanriverdi, 2022).

The central assumption of crime opportunity theory is that criminal behavior is driven by human rationality and that the conditions for committing a crime require a vulnerable victim in addition to motive and the lack of restraint (Hannon, 2002). Thus, criminals are more likely to take opportunistic actions and choose victims who are more vulnerable. In criminal cases that lead to data breaches, vulnerabilities in information systems, software, and firmware present opportunities for potential intruders, that is, the more system vulnerabilities there are, the greater the chances of attracting intruders will be, resulting in a higher risk of data breaches.

Hypotheses development

Based on relevant theories and the related literature, we propose the research framework shown in Fig. 3 and the following research hypotheses.

Fig. 3: Research framework.

The framework shows the relationships among the variables and indicates the theories applied to each.

Coordinated vulnerability disclosure (CVD) is an efficient approach to finding and fixing flaws in IT systems. Through this approach, after finding a vulnerability in an IT system, a white-hat hacker (an ethical hacker who uses his or her ability to discover security vulnerabilities and helps protect organizations) reports it to the platform to warn the system manager. Details such as the titles of the vulnerabilities and their brief descriptions, ratings, and comments are visible to all registered white hats. However, the vulnerability details are only visible to relevant organizational administrators and vulnerability submitters. According to crime opportunity theory, criminals are more likely to engage in speculative behavior and choose victims who are more vulnerable (Hannon, 2002). In data breach incidents, “vulnerable” represents the public disclosure of computer security vulnerabilities in information systems, software, and firmware (Sen and Borle, 2015), which enhances the accessibility of sensitive information, thereby increasing the data breach risk according to routine activity theory. It has also been noted in the literature that public disclosures of relevant vulnerabilities increase the frequency of attacks (Browne et al., 2001). The more vulnerabilities there are, the more vulnerable the information system is to malicious attackers. Therefore, we propose the following hypothesis.

H1: Public disclosure of vulnerabilities increases data breaches.

Given the development of globalization, cross-border data flow has become an essential part of the global digital economy. The necessity for cross-border data flow has been emphasized considering its significant economic and social benefits (Bauer et al., 2013; OECD, 2019), especially in the context of academic research on international collaborations and data exchanges. It is also evident that localizing data storage is unlikely to provide better results in terms of data breaches (Chander and Lê, 2014) and does not contribute to data security; instead, it makes the data more vulnerable to destruction, especially by hackers (Chander and Lê, 2015). Rather than reducing data security risks, suppressing cross-border data flow places universities at a disadvantage. Thus, universities with greater cross-border data flow may have fewer data breach incidents. Therefore, we propose the following hypothesis.

H2: There is a negative relationship between the frequency of cross-border data flow and occurrences of data breaches.

According to the Data Breach Investigation Report by Verizon (2022), more than 75% of breach incidents in the education industry are caused by external attackers. Financial motives account for 95% of attacker motives, meaning that hackers mostly attack for money (e.g., by selling personal information and through blackmail). Academics are the heart of a university, and the performance of the faculty affects the quality of student learning and the strength of the university, which in turn impacts the contributions of academic institutions to society (Shrand and Ronnie, 2019). Many indicators of research success are significantly associated with a university’s reputation (Linton et al., 2011). In higher-ranked universities, the volume of research is larger. According to routine activity theory, offenders tend to choose more valuable targets. Therefore, hackers who hack for money are more likely to aim for academically stronger universities, as they are more famous and perform better in industry. Similar concerns have been raised in previous research. Liu et al. (2020) considered the impact of research grants on cybersecurity attacks since the valuable intellectual properties generated in research and development activities are at risk of being stolen and misappropriated, which makes universities particularly attractive targets for cybersecurity attacks. In other words, academically stronger universities are more likely to be attacked, leading to additional data breaches. Hence, we propose the following hypothesis.

H3: There is a positive relationship between academic strength and the number of data breaches.

According to Weulen Kranenbarg et al. (2018), one motive for white-hat hackers’ CVD reporting is to gain status in the hacker community, as they expect recognition and acknowledgment. The other motive is cash bounties, which account for 15% of motives. However, considering that no such bounty programs exist on the Education Industry Vulnerability Reporting Platform and that only gifts can be redeemed, we assume that the main motivation for CVDs by hackers is to gain status and acknowledgment for their skills and actions. CVDs are therefore more likely to target famous and influential universities than “normal” universities. Accordingly, the vulnerabilities of universities with higher academic achievements and greater social impact are more likely to be reported or disclosed to attract more social attention. Based on this, we propose that such vulnerabilities mediate the relationship between academic strength and data breaches.

H4(a): The number of vulnerabilities has a mediating effect on the relationship between research strength and the number of breaches.

Considering that universities with stronger academic strength have broader worldwide influence and more academic communication with foreign institutions and individuals, they may commit to larger-scale, global data flow around the world. Therefore, we propose that the scale of cross-border data flow mediates the relationship between academic strength and data breaches.

H4(b): The amount of cross-border data flow has a mediating effect on the relationship between research strength and the number of breaches.
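
The mediation logic behind H4(a) and H4(b) can be illustrated with a minimal sketch. It assumes a simple linear, single-mediator setup estimated by ordinary least squares on synthetic data; the variable names, effect sizes, and the linear specification are illustrative assumptions and are not taken from the study's data or estimation method. The indirect effect is the product of the path from academic strength to the mediator (a) and the path from the mediator to breaches (b).

```python
import numpy as np


def ols(y, *columns):
    """Least-squares fit of y on an intercept plus the given columns.

    Returns the coefficient vector [intercept, slope_1, slope_2, ...].
    """
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Synthetic, illustrative data only (NOT the study's sample).
rng = np.random.default_rng(0)
n = 2000
strength = rng.normal(size=n)                  # X: academic strength (hypothetical scale)
vulns = 0.6 * strength + rng.normal(size=n)    # M: disclosed vulnerabilities (mediator)
breaches = 0.5 * vulns + 0.2 * strength + rng.normal(size=n)  # Y: data breaches

c_total = ols(breaches, strength)[1]             # total effect of X on Y
a = ols(vulns, strength)[1]                      # path X -> M
_, c_direct, b = ols(breaches, strength, vulns)  # direct effect c' and path M -> Y
indirect = a * b                                 # mediated (indirect) effect

print(f"total={c_total:.3f} direct={c_direct:.3f} indirect={indirect:.3f}")
```

In a linear model fitted on the same sample, the total effect decomposes exactly into the direct effect plus the indirect effect, which is why a nonzero product a·b is read as evidence of mediation. The same sketch applies to H4(b) by substituting cross-border data flow as the mediator.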

Millions of companies and institutions use the cloud to store data remotely and run applications and services, thereby reducing costs and accelerating operations (Rawding and Sacks, 2020). According to the Cloud Usage and Digital Economy Development Report (2018) of the Tencent Research Institute, the degree of “cloudification” is an important indicator of digital economy development. Zhang et al. (2020) investigated the security risks posed by cloud computing services and found that cloud adoption can significantly reduce losses from data breaches. Pang and Tanriverdi (2022) concluded that performing cloud migration for legacy IT systems significantly reduced cybersecurity risks. It has been shown that maintaining information security in a distributed environment brings challenges to information security management (Fried, 1993) and that the integration of IT-related processes can further reduce data breaches (Angst et al., 2017). Therefore, cloud storage, as a type of centralized storage (Bandara et al., 2021; Ouf and Nasr, 2015; Wu et al., 2014), may be the safer option for avoiding data breaches. Thus, we propose the following hypothesis.

H5: Universities adopting cloud storage present a relatively lower frequency of data breaches.

Cloud storage could be safer because of the centralization and integration of related processes, making maintenance and management easier and resulting in fewer vulnerabilities. Therefore, cloud storage has a moderating effect on the relationship between the number of disclosed vulnerabilities and the number of data breaches. When cloud storage is adopted, there will be fewer vulnerabilities and thus fewer data breaches. Thus, we propose the following hypothesis.

H6(a): Adopting cloud storage plays a negative moderating role in the relationship between the number of vulnerabilities and the number of data breaches.

Moreover, cloud computing enables a larger volume of data from across the world to flow over a larger area. Cloud services accessed remotely over the internet can serve customers across national boundaries and achieve cross-border data flow, which enables reduced unit costs and prices over time and flexible and technologically sophisticated services (Coyle and Nguyen, 2019). Thus, cloud storage makes it easier to achieve data transmission without geographical restrictions/changes; accordingly, it may promote data flow in universities. Therefore, cloud storage has a positive moderating effect on the relationship between the number of cross-border data flows and data breaches. When cloud storage is adopted, there is greater cross-border data flow and thus fewer data breaches. Hence, we propose the following hypothesis.

H6(b): Adopting cloud storage plays a positive moderating role in the relationship between the number of cross-border data flows and the number of data breaches.

Data and variables

Sample data collection

The data used in this study were drawn from Chinese universities, where data breaches pose a risk to personal privacy, intellectual property rights, and even national security, especially during the COVID-19 epidemic when most courses were delivered online.

The study samples were collected as follows. First, 21,135 records from January 1, 2020, to January 1, 2021, related to breach incidents and public disclosures of vulnerabilities were collected from the Education Industry Vulnerability Reporting Platform (https://src.sjtu.edu.cn/). After removing the data for education departments of provincial governments and junior colleges, a dataset comprising 681 universities with a total of 9,916 records was obtained. Each record contained information on the university/institution name, author, event description, vulnerability type, and risk level. The types of vulnerabilities were classified as sensitive information breaches, structured query language (SQL) injection vulnerabilities, vertical/horizontal permission bypasses, weak passwords, or file upload vulnerabilities. The statistical results of the vulnerability types are shown in Fig. 4. Breaches of sensitive information accounted for more than 30% of all vulnerabilities. Apart from these breaches, the other vulnerability types were mainly caused by technical failures.

Fig. 4: Numbers of different types of vulnerabilities.

A description of the types of vulnerabilities and the distribution of each type of vulnerability using data from the Education Industry Vulnerability Reporting Platform.

Second, we obtained records of cross-border data flow from the experimental logs of collaborating institutions. After excluding the data of research institutes and enterprises, we obtained valid cross-border data records for 110 universities, all of which are among the 681 universities already collected.

Third, we collected other university-related information using different methods based on the list of 110 universities. Data on research projects, used to measure research strength, were collected from the LetPub Fund Project Inquiry System (http://www.letpub.com.cn/); university-related information, used to measure the control variables and cloud adoption, was collected from the universities’ official websites; and economic statistics were collected from the website of the National Bureau of Statistics.

Finally, after matching the data, a valid dataset comprising 110 universities and more than 900 valid data breach records was obtained for further analysis.

Variables and measurements

The number of data breaches in universities in 2020 was used as the dependent variable to reflect universities’ data security status. For the independent variables, the number of data transmissions abroad was used as a measure of cross-border data flow. Following a prior study (Sen and Borle, 2015), the number of publicly reported vulnerabilities was used to measure vulnerability disclosure. The number of research projects was used to measure the academic strengths of the universities. Cloud storage utilization was coded “1” if a university had cloud storage at the beginning of 2020 and “0” otherwise.

According to Gartner, using contextual information, such as location and time data, can help users understand dynamic information security environments and make more accurate security decisions (Footnote 5). According to Say and Vasudeva (2020), experiencing a failure can expose an organization’s potential problems and weaknesses, whereby the organization acquires important experience and lessons to reduce the possibility of subsequent failures. For universities aiming to prevent future breach incidents, training in information security is one of the best remedial measures. Therefore, in this study, to capture the differences caused by training, we controlled for the relevant IT security training conducted at universities. It has also been empirically shown that institutional scale is positively related to the risk of data breaches (Gabriel et al., 2018; Kim and Kwon, 2019). Therefore, the scale of the universities, as measured by the number of undergraduate majors, was controlled. In addition, economic indicators were found to be positively correlated with the risk of data breaches (Sen and Borle, 2015); thus, we also controlled for the GDP of the city where each university was located. Further control variables were added for the number of national key disciplines, master’s programs, doctoral programs, time of establishment, type of university, attributes of university, and number of universities in the same city. The detailed definitions of the variables are provided in Table 2.

Table 2 Definitions of variables.

Descriptive statistics

Table 3 describes the statistics calculated for the main variables. Although the data security risks of universities appeared uneven, they generally faced a severe risk of data breaches, with a mean value of 8.5 breaches in 2020. Among the 110 universities, in terms of attributes (only the highest title of the university was taken), 20% were universities in “Project 985” and 35% were universities in “Project 211” (Footnote 6). In addition, 48% of the universities were comprehensive universities, whereas 52% were noncomprehensive universities (such as those limited to medicine, finance and economics, normal education, or science and engineering). Regarding the urban distribution of universities, 23% were in the most developed first-tier cities; 40% were located in new first-tier cities (Footnote 7); and the rest were from less-developed areas.

Table 3 Descriptive statistics.

Table 4 presents the correlation matrix. Considering that some correlations were high, and that multicollinearity may have existed among the variables, we conducted a variance inflation factor (VIF) test. Except for the largest VIF value of 3.39 (Num_Research_Project), the remaining VIF values were no higher than 3, indicating no significant multicollinearity issues.
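For intuition on what a VIF of 3.39 means, recall the standard identity VIF_j = 1 / (1 − R_j²), where R_j² is obtained by regressing predictor j on the remaining predictors. The following minimal sketch inverts that identity for the reported maximum; the function names are illustrative and are not taken from the study.

```python
def vif_from_r2(r_squared: float) -> float:
    """Standard identity: VIF = 1 / (1 - R^2) for one predictor vs. the rest."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def r2_from_vif(vif: float) -> float:
    """Inverse of the identity above: R^2 = 1 - 1 / VIF."""
    if vif < 1.0:
        raise ValueError("VIF is never below 1")
    return 1.0 - 1.0 / vif


# Largest VIF reported in the text (Num_Research_Project).
implied_r2 = r2_from_vif(3.39)
print(f"Implied R^2 with the other predictors: {implied_r2:.3f}")
```

Under this identity, a VIF of 3.39 corresponds to roughly 70% of the variance in Num_Research_Project being explained by the other predictors, which is below the commonly used rule-of-thumb cutoffs of 5 or 10.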

Table 4 (a) Correlation matrix. (b) Correlation matrix.

Empirical analyses

Main results

To empirically test the proposed hypotheses, we constructed the following baseline model for analysis.

$$\begin{array}{l}\log \left( {E\left( {Num\_Breach_i\left| {X_i} \right. + Controls} \right)} \right)\\ \quad = \alpha + \beta _1Num\_Vulnerability_i + \beta _2Num\_Research\_Project_i\\ \qquad+ \,\beta _3Num\_Data\_Flow_i + \beta _4Ind \_Cloud\_Storage_i + \gamma Controls\end{array}$$
(1)

Poisson regression was employed because the dependent variable, Num_Breach_i, is a discrete count: the number of reported data breaches for university i. X_i is a vector of the independent variables; Num_Vulnerability_i is the total number of disclosed vulnerabilities of university i; Num_Research_Project_i is the number of research projects of university i; Num_Data_Flow_i is the number of cross-border data flows of university i in 2020; Ind_Cloud_Storage_i indicates whether university i had adopted cloud storage (1) or not (0); and Controls denotes a series of control variables, including training commitment, number of doctoral programs and master’s programs, number of national key disciplines, years since university establishment, type of the university, attributes of the university, GDP, and number of universities in the city where the university is located.
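To illustrate the log-link structure of such a model, the following minimal sketch fits a Poisson regression with a single binary predictor by Newton-Raphson. It is not the estimation code used in the study: the function `fit_poisson` and the toy counts are hypothetical, and the full specification contains many more regressors.

```python
import math


def fit_poisson(x, y, iterations=25):
    """Fit log E[y] = a + b * x by Newton-Raphson on the Poisson log-likelihood."""
    a, b = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(iterations):
        mu = [math.exp(a + b * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix [[h00, h01], [h01, h11]].
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: (a, b) += H^{-1} g.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# HYPOTHETICAL toy data: cloud adoption indicator and breach counts.
cloud = [0, 0, 0, 1, 1, 1]
breaches = [10, 12, 8, 6, 7, 8]
a_hat, b_hat = fit_poisson(cloud, breaches)
print(f"intercept = {a_hat:.4f}, cloud coefficient = {b_hat:.4f}")
```

With a single binary predictor, the fitted means equal the group averages (10 and 7 in the toy data), so the coefficient is the log of their ratio, log(0.7).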

The empirical results are presented in Table 5. Column (1) presents the results for the control variables. Column (2) presents the full Poisson regression model. A multiple linear regression model and a negative binomial regression model were further run for robustness tests, and the results are presented in Columns (3) and (4), respectively. All three regressions show similar estimation results for the main variables. The full Poisson regression model displays the best goodness-of-fit among all models, with the smallest Akaike information criterion (AIC) and Bayesian information criterion (BIC) values in Column (2).
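The two information criteria follow the standard definitions AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of estimated parameters, n the number of observations, and ln L the maximized log-likelihood. The sketch below applies these definitions; the log-likelihood and parameter count are hypothetical placeholders, and only n = 110 comes from the sample described above.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln L (smaller is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# HYPOTHETICAL log-likelihood and parameter count; n_obs = 110 universities.
example_ll, example_k, n_obs = -300.0, 15, 110
print(f"AIC = {aic(example_ll, example_k):.1f}")
print(f"BIC = {bic(example_ll, example_k, n_obs):.1f}")
```

Because ln(110) exceeds 2, BIC penalizes each additional parameter more heavily than AIC in a sample of this size.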

Table 5 Main empirical results.

The results in Column (2) of Table 5 show that the public disclosure of a vulnerability has a positive and significant effect on data breaches (Num_Vulnerability: β = 0.005, s.e. = 0.001, p < 0.001), indicating that the more disclosed vulnerabilities there are, the more breach incidents occur and the greater the risks of such data breaches are. Thus, H1 is supported. The effect of the cross-border data flow on the breach is negative and significant (Num_Data_Flow: β = −0.021, s.e. = 0.008, p = 0.002), which supports H2. This shows that the higher the frequency of data flow is, the fewer reported breach incidents there are. First, data flow reflects the fluidity and mobility of data to a certain extent. In universities with strong data fluidity, data security management generally receives greater attention and thus provides a higher level of data protection. Moreover, universities with strong data flows have more open data systems. Their data security protection and high openness reduce the motivation for potential attackers. These findings provide insights for possible future research directions. For new IT utilization, the effect of cloud adoption is statistically significant (Ind_Cloud_Storage: β = −0.335, s.e. = 0.088, p < 0.001) and shows that universities adopting cloud storage are less likely to have breach incidents. Thus, H5 is supported. Notably, the direct effect of academic strength (proxied by Num_Research_Project) on data breaches is not significant, as shown in Column (2). Additional analyses and explanations are presented in the next section.
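Because the model uses a log link, each coefficient can be read as a multiplicative effect on the expected number of breaches: a one-unit increase in a regressor multiplies the expected count by exp(β). The sketch below converts the three coefficients reported above into approximate percentage changes; the helper name `percent_change` is illustrative.

```python
import math

# Coefficients reported in Column (2) of Table 5.
coefficients = {
    "Num_Vulnerability": 0.005,
    "Num_Data_Flow": -0.021,
    "Ind_Cloud_Storage": -0.335,
}


def percent_change(beta: float) -> float:
    """Percentage change in the expected count for a one-unit increase."""
    return (math.exp(beta) - 1.0) * 100.0


for name, beta in coefficients.items():
    print(f"{name}: exp(beta) = {math.exp(beta):.3f}, change = {percent_change(beta):+.2f}%")
```

Read this way, holding the other variables fixed, cloud storage adoption is associated with roughly 28% fewer expected breaches, each additional cross-border data flow with about 2% fewer, and each additional disclosed vulnerability with about 0.5% more.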

Although not the main focus of our study, the coefficients of the other control variables also merit consideration. The scale of the university, as measured by the number of undergraduate majors, increases the risk of data breaches, similar to the results of previous research (Gabriel et al., 2018). As expected, relevant training is associated with fewer data breaches. Noncomprehensive universities generally face more severe risks than comprehensive universities. Universities in “Project 985”, as first-tier universities in China, face a higher risk of data breaches. Interestingly, the GDP of a city has a positive effect, whereas the number of universities in the city has a negative effect. This indicates a higher risk of data breaches in developed cities and a lower risk in cities where higher education is well developed.

Mediating effect

Contrary to our expectations, the direct relationship between academic strength and data breaches is not significant. In this section, we investigated possible mediating effects that could explain this result. First, the public disclosure of vulnerabilities was considered as a mediator. Universities with higher academic achievement and greater social impact attract more social attention and are therefore more likely to have their vulnerabilities reported and exposed. Thus, we addressed the mediating effect of the public disclosure of vulnerabilities. The models were constructed as follows:

$$\begin{array}{l}Num\_Vulnerability_i = \alpha + \beta _1Num\_Research\_Project_i\\ \qquad\qquad\qquad\qquad\qquad\,\,\,+ \;\gamma Controls + \varepsilon \end{array}$$
(2)
$$\begin{array}{ll}\log \left( {E\left( {Num\_Breach_i\left| {X_i} \right. + Controls} \right)} \right)\\ \quad = \alpha + \beta _1Num\_Research\_Project_i + \beta _2Num\_Data\_Flow_i\\ \qquad +\, \beta _3Ind\_Cloud\_Storage_i + \gamma Controls\end{array}$$
(3)

Equation (2) verified the relationship between the number of research projects and the disclosed vulnerabilities. Equation (3) was employed to address the existence of a mediating effect based on Eq. (2). The estimation results are presented in Table 6. The first two columns show the results obtained through Eq. (2): Column (1) is for Num_Research_Project only, and Column (2) incorporates the related controls. Unsurprisingly, academic strength has a positive effect on the disclosure of vulnerabilities. Column (3) presents the results without Num_Research_Project and Num_Vulnerability. In Column (4), the number of research projects is positively related to the number of data breaches, without interference from the mediating variable. Column (5) replicates the main result of Column (2) in Table 5, where the effect of Num_Research_Project is insignificant. Therefore, we concluded that the number of research projects indirectly affects the increase in data breach incidents through the corresponding vulnerabilities, thus supporting H4(a).

Table 6 Mediation effect: vulnerability and academic strength.

Similarly, we specified a model for investigating the mediating effect of the number of cross-border data flows as follows:

$$\begin{array}{l}Num\_Data\_Flow_i = \alpha + \beta _1Num\_Research\_Project_i\\ \qquad\qquad\qquad\qquad\quad\quad+ \gamma Controls + \varepsilon \end{array}$$
(4)
$$\begin{array}{l}\log \left( {E\left( {Num\_Breach_i{{{\mathrm{|}}}}X_i + Controls} \right)} \right)\\ \quad= \alpha + \beta _1Num\_Research\_Project_i + \beta _2Num\_Vulnerability_i\\ \qquad\,+ \beta _3Ind\_Cloud\_Storage_i + \gamma Controls\end{array}$$
(5)

Table 7 shows the estimation results. As expected, cross-border data flow increased with the number of research projects. Column (3) presents the results without Num_Research_Project and Num_Data_Flow. The number of research projects is positively correlated with the number of data breaches, without interference from the mediating variable, as shown in Column (4). We concluded that the number of research projects indirectly affects the increase in data breach incidents through cross-border data flow. Universities with higher academic achievements tend to communicate more with academics worldwide. Thus, H4(b) is supported.

Table 7 Mediation effect: cross-border data flow and academic strength.

Moderating effect

To further investigate how new IT utilization (i.e., cloud storage) influences data breaches, we specified an econometric model with cloud storage adoption as a moderating variable. First, we addressed the moderating role of cloud storage in the relationship between vulnerabilities and data breaches. The analytical model was constructed as follows.

$$\begin{array}{l}\log \left( {E\left( {Num\_Breach_i{{{\mathrm{|}}}}X_i + Controls} \right)} \right)\\ \quad = \alpha + \beta _1Num\_Vulnerability_i + \beta _2Num\_Research\_Project_i\\ \qquad+\, \beta _3Num\_Data\_Flow_i + \,\beta _4Ind\_Cloud\_Storage_i \\ \qquad+\,\beta _5Num\_Vulnerability_i \ast Ind\_Cloud\_Storage_i + \gamma Controls\end{array}$$
(6)

The empirical results are presented in Table 8. Column (1) shows all controls without Num_Vulnerability, Ind_Cloud_Storage, and the interaction term. Column (2) reproduces Column (2) of Table 5 to allow for an easy comparison. Column (3) shows the estimates considering the moderating effect, where the interaction term is negatively related to data breaches (β = −0.011, s.e. = 0.001, p < 0.001). Thus, as a moderating variable, cloud storage mitigates the positive relationship between vulnerabilities and data breaches. Cloud storage enables a more integrated consolidation of distributed data stored in different systems, thus making the data easier to maintain and manage. Therefore, adopting cloud storage could reduce the possibility of breaches caused by vulnerabilities. Accordingly, H6(a) is supported.
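In a model with an interaction term such as Eq. (6), the effect of Num_Vulnerability on the log expected count depends on cloud adoption: it equals β1 without cloud storage and β1 + β5 with it. The sketch below shows this arithmetic. The interaction coefficient is the value reported above, whereas the main-effect value is a HYPOTHETICAL placeholder, because the Column (3) main effect is not quoted in the text.

```python
def vulnerability_slope(beta_main: float, beta_interaction: float, cloud: int) -> float:
    """Slope of Num_Vulnerability on the log expected count, given cloud (0 or 1)."""
    return beta_main + beta_interaction * cloud


BETA_INTERACTION = -0.011  # reported for the interaction term in Table 8, Column (3)
BETA_MAIN = 0.012          # HYPOTHETICAL main effect, for illustration only

for cloud in (0, 1):
    slope = vulnerability_slope(BETA_MAIN, BETA_INTERACTION, cloud)
    print(f"cloud = {cloud}: slope of Num_Vulnerability = {slope:.3f}")
```

A negative interaction coefficient therefore flattens the positive slope for adopters, which is the sense in which cloud storage mitigates the relationship between vulnerabilities and breaches.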

Table 8 Moderating effect: vulnerability and cloud storage.

Next, we addressed the moderating role of cloud storage in the relationship between cross-border data flow and data breaches.

$$\begin{array}{l}\log \left( {E\left( {Num\_Breach_i\left| {X_i} \right. + Controls} \right)} \right)\\ \quad = \alpha + \beta _1Num\_Vulnerability_i + \beta _2Num\_Research\_Project_i\\ \qquad+ \,\beta _3Num\_Data\_Flow_i + \beta _4Ind\_Cloud\_Storage_i\\ \qquad+ \,\beta _5Num\_Data\_Flow_i \ast Ind\_Cloud\_Storage_i + \gamma Controls\end{array}$$
(7)

Table 9 shows the estimation results. Column (1) represents the controls only, and Columns (2) and (3) report the results without and with a moderating effect, respectively. The interaction term in Column (3) is negatively related to data breaches (β = −0.055, s.e. = 0.014, p < 0.001), indicating that cloud storage strengthens the negative relationship between cross-border data flow and data breaches, as cloud storage makes it easier to transfer data worldwide. Thus, H6(b) is supported.

Table 9 Moderating effect: cross-border data flow and cloud storage.

Robustness checks

To ensure the robustness of the conclusions, this section discusses several robustness checks from four perspectives. First, we tested the significance of the mediating effects. Second, we expanded the time window for several variable measurements to mitigate the impact of COVID-19. Third, we explored whether the effects of specific cloud adoptions differ by redefining the cloud services and classifying them into private and public cloud storage. Finally, because data breach incidents have different risk levels, we considered the effects of various factors at different levels of risk.

Significance test for mediating effect

Three other methods (Aroian, 1947; Goodman, 1960; Sobel, 1982) were used to test the significance of the mediation effects. As shown in Table 10, Row (1) reports the test results based on Column (1) of Tables 6 and 7, which estimate the relationship between the independent and mediating variables. Row (2) reports the results based on Column (2) of Tables 6 and 7. All p values are less than 0.01, except for the Aroian test of Num_Data_Flow in Row (2), where p = 0.01006, indicating that the mediation effects are highly significant.
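The three tests share the same numerator, the product a·b of the path from the independent variable to the mediator (a) and the path from the mediator to the outcome (b), and differ only in the variance term: Sobel uses b²·s_a² + a²·s_b², Aroian adds s_a²·s_b², and Goodman subtracts it. The sketch below implements these textbook formulas; the input values are HYPOTHETICAL and are not the estimates from Tables 6 and 7.

```python
import math
from statistics import NormalDist


def mediation_z_tests(a: float, se_a: float, b: float, se_b: float) -> dict:
    """Return {test name: (z, two-sided p)} for Sobel, Aroian and Goodman tests."""
    base = b**2 * se_a**2 + a**2 * se_b**2
    extra = se_a**2 * se_b**2
    variances = {"sobel": base, "aroian": base + extra, "goodman": base - extra}
    results = {}
    for name, variance in variances.items():
        if variance <= 0:
            # The Goodman variance can become non-positive for very noisy paths.
            results[name] = (float("nan"), float("nan"))
            continue
        z = (a * b) / math.sqrt(variance)
        p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
        results[name] = (z, p)
    return results


# HYPOTHETICAL path estimates and standard errors, for illustration only.
for name, (z, p) in mediation_z_tests(a=0.5, se_a=0.1, b=0.02, se_b=0.005).items():
    print(f"{name}: z = {z:.3f}, p = {p:.4f}")
```

Because the Aroian variance is the largest of the three, it always yields the most conservative p value, which is consistent with the Aroian test being the one that narrowly exceeds 0.01 above.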

Table 10 Robustness check: significance test for mediation effect.

We then investigated the proportions of the mediation effects and direct effects, as shown in Table 11. Using Num_Data_Flow as the mediator, there is a significantly negative mediation effect between the independent and dependent variables. The average direct effects are insignificant. For Num_Vulnerability as the mediator, there is a significantly positive mediation effect, and the direct effects are insignificant. This indicates that the effects of vulnerabilities on data breaches going through the mediator account for almost the entire total effects. The mediation and direct effects have different signs, explaining why the proportion of the effects going through the mediator exceeds one.
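A proportion mediated above one arises mechanically when the indirect and direct effects have opposite signs, because the total effect (their sum) is then smaller in magnitude than the indirect effect. The sketch below shows the arithmetic with HYPOTHETICAL effect sizes; it does not reproduce the values in Table 11.

```python
def proportion_mediated(indirect: float, direct: float) -> float:
    """Share of the total effect (indirect + direct) carried by the mediator."""
    total = indirect + direct
    if total == 0:
        raise ValueError("total effect is zero; the proportion is undefined")
    return indirect / total


# HYPOTHETICAL effects with opposite signs: the ratio exceeds one.
opposite_signs = proportion_mediated(indirect=-0.03, direct=0.01)
# HYPOTHETICAL effects with the same sign: the ratio stays below one.
same_sign = proportion_mediated(indirect=-0.03, direct=-0.01)
print(f"opposite signs: {opposite_signs:.2f}, same sign: {same_sign:.2f}")
```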

Table 11 Robustness check: bootstrap mediation tests.

Varied length of time window

Expansion of the time window length of data breach incidents

For the main analysis, we collected data on breach incidents in 2020. Given the global outbreak of COVID-19, the incidents in 2020 may have been affected by fluctuations in the epidemic, making them unrepresentative of typical data security issues in universities. Therefore, the data breach incidents reported during 2017–2019 were used as an alternative measure of the level of universities’ data security protection.

Column (1) in Table 12 presents the estimated results, where Num_Breach’ denotes the number of data breach incidents in universities reported during 2017–2019, and Num_Vulnerability’ measures the number of publicly disclosed vulnerabilities in universities during 2017–2019. Column (3) replicates the original results in Table 5 for ease of comparison. The impacts of the main variables on breaches are consistent, thereby confirming the key findings.

Table 12 Robustness check on time window length of data collection.

Expansion of the time window length of cross-border data flow

In addition to the COVID-19 outbreak itself, 2020 saw a surge in global medical data sharing, particularly of coronavirus-related data. This may have caused abnormal fluctuations in cross-border data flow at universities. To alleviate this concern, we used the cross-border data flow collected in 2019. Column (2) of Table 12 shows the results, where Num_Data_Flow’ measures the number of cross-border data flows in universities during 2019. The results remain consistent, and the significance of Num_Data_Flow’ is even higher, providing further evidence of robustness.

Different cloud service types

As discussed in the empirical results section, adopting cloud storage can result in fewer vulnerabilities and improve data fluidity. However, considering that different types of cloud storage may have different effects, we further classified cloud storage into two types, namely, private and public clouds, as defined by cloud providers in the market. According to Alibaba Cloud, a private cloud provides a corporation or organization with a dedicated cloud environment that can be operated internally by the IT team to better control its computing resources (Li and Li, 2017). A private cloud can be physically located in the organization’s data center or hosted by a service provider. A public cloud is a cloud infrastructure provided by service suppliers for users, individuals, or enterprises. Users can access these servers by purchasing public cloud services and data storage. On a public cloud, all users share the same hardware, storage, and network equipment (Footnote 8).

The effects of three variables, Ind_Cloud_Storage, Ind_Cloud_Private, and Ind_Cloud_Public, were investigated. Table 13 shows the results, where Ind_Cloud_Storage is coded as “1” if the university adopted any type of cloud storage and “0” otherwise; Ind_Cloud_Private is coded as “1” if the university adopted a private cloud and “0” otherwise; and Ind_Cloud_Public is coded as “1” if the university adopted a public cloud and “0” otherwise. Notably, public clouds have the most significant negative effects. According to routine activity theory, guardianship is essential to cybersecurity, and universities can enhance their guardianship by seeking external governance from external vendors (Pang and Tanriverdi, 2022). Public clouds enable external guardianship provided by cloud service vendors who are more capable of effective information protection (Pang and Tanriverdi, 2022). In addition, outsourcing vendors can achieve economies of scale and scope when offering IT services to clients, making it more economically feasible for vendors with professional security teams to protect their information systems (Levina and Ross, 2003). Therefore, in terms of cloud adoption, public clouds may be a better choice for cybersecurity.

Table 13 Robustness check on different cloud types.

Different risk levels of data breach

We further tested whether the factors related to breach risk also affect the severity of breaches, beyond their mere occurrence. The Education Industry Vulnerability Reporting Platform scores the risks of all data breach incidents on a scale of 0–10. This scale is further categorized as low (0–4), medium (4–7), high (7–9), and severe (9–10) risks. Table 14 presents the descriptive statistics of breaches with different risk levels.
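The categorization can be expressed as a simple binning rule. The quoted ranges share their endpoints (4, 7, and 9), and the text does not state which category a boundary score belongs to; the sketch below ASSUMES that each boundary score falls into the higher category. The function name and the example scores are hypothetical.

```python
from collections import Counter


def risk_level(score: float) -> str:
    """Map a 0-10 risk score to a category (boundary handling is an assumption)."""
    if not 0 <= score <= 10:
        raise ValueError("score must lie in [0, 10]")
    if score < 4:
        return "low"
    if score < 7:
        return "medium"
    if score < 9:
        return "high"
    return "severe"


# HYPOTHETICAL scores for one university's incidents, for illustration only.
scores = [2.0, 3.5, 4.0, 5.5, 6.9, 7.0, 8.2, 9.5]
counts = Counter(risk_level(s) for s in scores)
print(dict(counts))
```

Counting incidents per category in this way yields the per-level dependent variables that replace the total number of breaches in the risk-level regressions.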

Table 14 Descriptive statistics of breaches at different risk levels.

Instead of the total number of breaches, the number of breaches with different risk levels was counted for universities in 2020 and regressed onto independent variables. As there were relatively few instances of severe incidents, only three risk levels were considered. The results are presented in Table 15. For low- and medium-risk breach incidents, the cross-border data flow still has a significant negative effect, and the number of publicly disclosed vulnerabilities still has a significant positive effect. The effects of the main variables are highly consistent with the previous research results. For high-risk breach events, a good fit is not achieved because of the small number of observations; however, the coefficient signs of the main variables are consistent. The results show that cross-border data flow only affects the occurrence of medium-risk breaches and that vulnerabilities tend to mostly increase the occurrence of high-risk breaches, as they have the highest significance in the regression results. The adoption of cloud storage may only influence the occurrence of low-risk breaches. Despite the few observations, this finding provides insights into possible future research directions.

Table 15 Robustness check on risk levels for data breaches.

Discussion and conclusion

In this study, we identified and analyzed the key elements of data security incidents in the context of higher education from an empirical perspective. Based on crime opportunity theory and routine activity theory, we constructed a conceptual model and proposed hypotheses to investigate the underlying mechanisms that impact data breaches. The key findings were obtained through a series of empirical analyses and robustness checks. First, it was determined that the public disclosure of vulnerabilities increased data breaches, which complements the conclusion of Sen and Borle (2015) in the context of universities. Second, we found that cross-border data flow, which reflects data fluidity and mobility, was associated with fewer data breaches. Third, academic strength influenced the occurrence of data breaches in different ways. Academically stronger universities tended to have more data flow and more publicly reported vulnerabilities, both of which played a mediating role in the relationship between academic strength and data breaches. Fourth, new information technologies such as cloud storage could help reduce data breaches and moderate the effects of vulnerabilities and data flow. In addition, public clouds were found to be relatively safer than private clouds in terms of data breach issues, which complements the research focusing on cloud services and data security.

Theoretical contributions and implications

This study makes theoretical contributions to the literature. First, we contribute to the data security literature by exploring a new context. According to the available literature, this study is among the first to examine the factors influencing data breach risks in the context of universities. Prior studies on data breach risk have focused on several other industries, such as the medical industry and companies. Relatively few studies have focused on universities even though they are at great risk of data breaches. In this study, the increased risk associated with the number of public disclosures of vulnerabilities is highlighted. The underlying mechanisms explaining how academic strength affects the risk of data breaches are investigated.

Second, we contribute to research on data breaches by discussing the effects of cross-border data flow, which are valued and regulated by numerous countries and regions for their contributions to the digital economy and potential security risks. Prior research has barely considered cross-border data flow in the context of data breaches and has mostly focused on developing and managing relevant policies to prevent potential risks incurred by cross-border data flow. We investigate the effects of cross-border data flow on data breaches and provide another perspective for understanding the value of cross-border data flow.

Third, we contribute to the information security literature by identifying the impacts of clouds on data breaches, distinguishing between the effects of different types of cloud adoption on the risk of data breaches. IBM (2022) reported that the cost of a data breach incident in organizations with public, private, or hybrid clouds can differ significantly. Our findings further reveal and strengthen the difference in terms of the impact on data breaches, which has implications for studies focusing on data security in cloud environments. Future research could break down the types of clouds and explore their effects in different contexts to support decision-making relevant to clouds.

Fourth, this study has implications for research on data breaches. Although we focus on a specific industry, and some of the identified factors and key findings are industry specific, they nonetheless provide an impetus for analyzing the causes of data breaches in other contexts, thereby enriching the literature by identifying the factors influencing data security incidents and, in particular, data breaches in the context of universities.

Practical implications

Our study provides a basis for improving the data security of universities and other scientific research institutions in the higher education industry, which has practical implications for universities aiming to shape their data security strategies to mitigate data security risks. First, regular system maintenance and timely discovery and repair of technical vulnerabilities can reduce opportunities for attackers and create a secure and stable information environment. Second, strengthening data fluidity and openness is conducive to creating more valuable data. Third, when embracing new information technologies, such as cloud storage, universities may consider the possibility of data breaches resulting from different service types, thereby weighing the advantages and disadvantages. Fourth, strengthening the intensity of data security training and improving the data security awareness of relevant personnel can help prevent problems and information breaches caused by human errors before they occur.

Limitations and future research

Certain limitations and future research directions are summarized as follows. First, the data were confined to universities in China and had a relatively short time series. A dynamic panel integrating data analysis along the time dimension for institutions of higher education in different countries could be empirically created and analyzed in future research. Second, the scale of a data breach was not incorporated in this analysis for an assessment of the risk impact, as the measurements for the numbers and types of leaked data were not accessible from the Education Industry Vulnerability Reporting Platform. Providing risk quantification of data breach incidents could be an important future research direction. Third, higher education institutions invest heavily in IT (Nash, 2007), which plays a key role in data-security management. Thus, the effects of information security investments and new IT utilization, such as biometric identification technologies, could be quantitatively valued in future research.