Introduction

Healthcare around the world is under high pressure due to limited financial resources, over-population, and disease burden. In this modern technological age the healthcare paradigm is shifting from a traditional, one-size-fits-all approach to a focus on personalized individual care1. Additionally, healthcare data varies in both type and amount. Healthcare providers deal not only with patients' historical, physical, and nominal information, but also with imaging information, labs, and other digital and analogue information such as ECG and MRI. This data is voluminous, varies in type and format, and differs in structure. Big Data can handle not only these different types and forms of data, but also a 10 V structure comprising volume, variety, venue, varifocal, varmint, vocabulary, validity, volatility, veracity, and velocity. Meanwhile, doctors face an increasing burden of rising patient numbers coupled with progressively less time to spend with each patient. In other words, we are facing more patients, more data, and less time.

Big data has attracted researchers across many fields, including healthcare, banking, imaging, smart cities, internet of things (IoT) based smart applications, and tracking and transportation systems, among others2. Software engineers constantly develop new applications for patients' health and well-being. Both government and non-government organizations develop infrastructure using big data analytics to improve the decision-making capabilities of both doctors and managers3. It has been reported that 80% of the increase in big data is due to cloud sources, big data analytics, mobile technology, and social media technologies4. A number of research articles have proposed using big data analytics in varying domains, especially healthcare. For example, Kumar et al.5 proposed a cognitive technology-based healthcare evaluation system using big data analytics. Chen et al.6 presented an intelligent healthcare application for brain hemorrhage detection using big data analytics and machine learning (ML) techniques. Liang and Zhao7 developed a smart health appointment system using big data analytics.

Some researchers have explored big data analytics in the healthcare domain through survey and review papers. Galetsi and Kasaliasi reviewed healthcare big data analytics8, while Lindell defined big data analytics from accounting and business perspectives9. Alharthi reviewed the healthcare challenges facing Saudi Arabia by analysing the available literature10. Lee et al.11 presented a survey of the applications and challenges of healthcare big data analytics. From the literature it is concluded that multiple new applications have been developed for big data analysis. Review and survey papers outline the published literature, but most are region specific or limited to a small number of papers. A systematic review, by contrast, formulates multiple research questions and identifies keywords to explore the available literature from different angles. Systematic analyses of the available literature have been presented in many fields, such as the PMIPv6 domain12, smart homes13, and navigation assistants14, but no significant systematic analysis has been reported for the healthcare big data domain to find gaps in the available literature and suggest future research directions.

The motivation for this systematic analysis was the pervasive and ubiquitous nature of big data. Efficient management and timely execution are pressing needs of big data, in order to extract enriched information regarding a problem of interest15. Many factors lie behind this systematic research work, but the most prominent reasons are:

  i. The existing research on big data does not provide significant information about the key features that should be considered to integrate both structured and unstructured big data in the healthcare domain. The pervasiveness of big data features challenges researchers pursuing work in this specialized domain. Research on finding the key features will not only help in integrating big data in the healthcare domain, but will also assist in finding new avenues for future research.

  ii. The digital transformation of healthcare systems, following the integration of information systems, medical technology, and imaging systems, has posed a big barrier for the research community in the form of a vast amount of information to deal with. At the same time, over-population, limited data access, and disease burden force doctors and practitioners to see more patients in limited time. Finding a suitable model that can efficiently process healthcare big data to extract information about the symptoms of a given disease will therefore not only help practitioners to suggest accurate medication and see more patients in a timely manner, but will also open future research directions for industrialists and policymakers to develop optimal healthcare big data processing models.

  iii. Accurate disease diagnosis by processing a gigantic amount of data, and especially a plethora of data types, within a processing domain of interest is a key concern for both researchers and practitioners. Developing an efficient model that can accurately diagnose a given disease by classifying images or other historical details of patients will not only help doctors to diagnose disease in a timely manner and suggest medicine accordingly, but will also encourage researchers and developers to develop accurate disease identification models.

The remainder of the paper is organized as follows. Section 2 outlines the related work reported in the field. Section 3 presents the research framework followed for this systematic research work. Quality assessment is detailed in Sect. 4. Section 5 discusses the findings of the systematic research work. Section 6 provides the limitations of this systematic study, followed by the conclusion and future work in Sect. 7.

Literature review

Over the last few decades, we have experienced an unprecedented transformation of traditional healthcare systems into digital and portable healthcare applications with the help of information systems, medical technology, and imaging resources16. Big data is radically changing the healthcare system by encouraging healthcare organizations to extract relevant information from imaging data and other clinical records. This information can deliver accurate disease diagnosis, lower treatment cost, and increased availability. The term ‘big data’17, first introduced in the data visualization context in 1997, has posed an ambitious and exceptional challenge for both policy-makers and doctors, with special emphasis on personalized medicine. Nonetheless, data gathering moves faster than both data analysis and data processing, which highlights the widening gap between rapid technological progress in data acquisition and the comparatively slow functional characterization of healthcare information. In this regard, the historical information (phenotypic and genomic information) of an individual patient from electronic health records (EHR) is becoming critically important. Figure 1 represents the primary sources of big data.

Figure 1 Main steps of the research protocol.

Significant research work has been reported in the domain of healthcare big data analytics. Processing this vast amount of information in a timely manner and identifying a person's health condition from it is difficult. Researchers have proposed numerous applications to address this problem. Syed et al.18 proposed a machine learning-based healthcare system for providing remote healthcare services to both diseased and healthy populations using big data analytics and IoT devices. Venkatesh et al.19 developed a heart disease prediction model using big data analytics and the Naïve Bayes classification technique. Kaur et al.20 suggested a machine learning (ML) based healthcare application for disease diagnosis and data privacy restrictions. This model works by considering aspects such as activity monitoring, granular access control, and mask encryption. Some researchers presented review and survey papers to outline recently published work in a specific direction. Patel and Gandhi reviewed the literature to identify the machine learning approaches proposed for healthcare big data analytics21. Rumbold et al.22 reviewed the literature to find the research work reported on diabetes diagnosis using big data analytics.

From the above discussion, it is worth noting that most researchers and industrialists have focused on developing new computational models or have surveyed the literature in a specific research direction (heart disease detection, diabetes detection, storage and security analysis, etc.), but no significant work has been reported that systematically analyzes the literature from different perspectives. To address this problem, this research work presents a systematic literature review (SLR) of the literature reported in the healthcare big data analytics domain. This systematic analysis will not only find gaps in the available literature but will also suggest new directions for future research.

Research framework

Systematic literature reviews and meta-analyses have gained significant attention and become increasingly important in the healthcare domain. Clinicians, developers, and researchers follow SLR studies to stay updated on new knowledge reported in their fields23,24, and such studies are often used as a starting point for preparing basic records. Granting agencies often require SLR studies to justify further research25, and some healthcare journals follow the same direction26. Keeping these SLR applications in mind, the proposed systematic analysis is performed following the guidelines presented by Moher et al.27 (PRISMA) and Kitchenham et al.28. This SLR accumulates the most relevant research work from primary sources. These papers are then evaluated and analyzed to obtain the best results for the selected research problem. Figure 2 represents the results after following the PRISMA guidelines. This systematic analysis is performed using the following preliminary steps:

  • Identification of research questions to systematically analyze the proposed field from different perspectives.

  • Selection of relevant keywords and queries to download the most relevant research articles.

  • Selection of peer-reviewed online databases to download relevant research articles published in the healthcare big data domain from 2011 to 2021.

  • Inclusion and exclusion of articles based on title, abstract, and the contents presented in the article, and removal of duplicate records.

  • Assessment of the finalized relevant articles to identify gaps in the available literature and suggest new research directions to explore.

Figure 2 PRISMA process model for articles accumulation, screening, and final selection.
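The screening steps listed above include removing duplicate records that are retrieved from more than one repository. The following is a minimal sketch of that step only, assuming duplicates are matched on a normalized title. The matching key, the record fields, and the sample records are hypothetical and are used purely for illustration; they are not taken from the study.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation/whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def remove_duplicates(records):
    """Keep the first record seen for each normalized title."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_title(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records, as they might be exported from different repositories.
records = [
    {"title": "Big Data Analytics in Healthcare", "source": "IEEE Xplore"},
    {"title": "Big data analytics in healthcare.", "source": "Science Direct"},
    {"title": "A Smart Health Appointment System", "source": "ACM"},
]

unique_records = remove_duplicates(records)
print(len(unique_records))
```

Matching on title alone is a simplification; a real screening pass might also compare DOI, authors, or publication year before discarding a record.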

Research questions

Selecting well-constructed research questions is essential for a successful review process. We formulated a set of five research questions based on the Goal Question Metric approach proposed by Van Solingen et al.29. The formulated research questions are listed in Table 1.

Table 1 Set of Research questions.

Search strategy

The search strategy is a key step in any systematic review because it ensures that the most relevant articles reach the analysis and assessment process. To define a well-organized search strategy, a search string is developed from the identified keywords. Keywords alone are not sufficient to accumulate the most relevant articles for a given research problem, so they are concatenated into different strings for searching articles in multiple online repositories30. Inspired by the SLR work of Achimugu et al.31 in the software requirements domain, our search strategy consists of three main steps: identification of keywords relevant to the selected research problem, formulation of a search string based on those keywords, and selection of online repositories from which to accumulate relevant articles.

Selection of keywords

A list of keywords is defined for each research question to download all relevant articles. Some researchers define a generic query32 and start downloading articles. Although this simplifies the accumulation of articles from online databases, it tends to miss some highly relevant articles. The better option is therefore to define keywords for each research question. This is laborious, but it helps ensure the retrieval of each relevant article from online databases for a given research problem.

Formulation of search string

Search strings (queries) are formulated using the keywords identified from the selected research questions. The search string was tested in online databases and modified accordingly to retrieve the relevant articles from these databases. Following the guidelines proposed by Wohlin33, the key steps undertaken to develop an optimal search string are:

  i. Identification of key terms from the formulated topic and research questions

  ii. Selection of alternate words or synonyms for key terms

  iii. Use of the “OR” operator between alternate words or synonyms during query formation

  iv. Linking of all major terms with the Boolean “AND” operator so that every key term is required.

Following all these preliminary steps a generic query/search-string is developed that is depicted in Table 2. This generic query is further refined for each research question as depicted in Table 3 to retrieve each relevant article.
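The query-formation steps above (synonyms joined with “OR”, major terms joined with “AND”) can be sketched as follows. This is only an illustration of the Boolean structure: the key terms and synonyms shown are hypothetical placeholders, not the actual query of Table 2, and the function name `build_query` is introduced here for the example.

```python
def build_query(term_groups):
    """Join synonyms with OR inside each group, then join the groups with AND."""
    clauses = []
    for synonyms in term_groups:
        quoted = [f'"{term}"' for term in synonyms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Hypothetical key terms with their synonyms (illustrative only).
term_groups = [
    ["big data", "large-scale data"],
    ["analytics", "analysis"],
    ["healthcare", "eHealth"],
]

query = build_query(term_groups)
print(query)
```

Individual repositories differ in their query syntax (field tags, wildcard support, length limits), so a string built this way would normally be adapted for each repository before use.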

Table 2 Generalized query.
Table 3 List of online datasets.

Selection of online repositories

After identifying keywords and formulating search strings, the next step is to download relevant articles specific to the research problem of interest. For this purpose six well-known, peer-reviewed online repositories are selected, as depicted in Table 3.

Articles accumulation and final database development

For relevant article accumulation and final database development we followed the guidelines suggested by Kable et al.34. After specifying the research questions, identifying keywords, formulating search queries, and selecting online repositories, the next key step is to develop a database of relevant articles for analysis and assessment. This involves two prime steps: (1) identification of inclusion/exclusion criteria for research articles, and (2) development of the relevant articles database. These steps are discussed in detail below.

Inclusion and exclusion criteria

After selecting the online databases and starting the download process, the most tedious task the authors face is deciding whether a given paper should be included in the final database. To address this, inclusion and exclusion criteria are defined for admitting an article to the final set. Table 4 presents the inclusion and exclusion criteria followed for this systematic research work.

Table 4 Inclusion and exclusion criteria.

The authors followed a manual process for the inclusion and exclusion of each article. Articles were evaluated based on title, abstract, and the information provided in the overall paper. If more than half of the authors agreed on the inclusion of an article based on these parameters, the paper was counted in the final database; otherwise it was rejected. A total of 134 relevant primary studies were selected for the final assessment process. To avoid missing relevant articles, snowballing was then applied.

  • Snowballing To extract each relevant primary article, snowballing is applied in the proposed research work33. Both backward and forward snowballing are applied. The snowballing process retrieved 145 articles. Filtering by title left 53 articles, filtering by abstract left 19, and filtering by the contents presented in the paper left only 5 relevant articles. This overall process is depicted in Fig. 3. Adding these 5 articles to the 134 already accumulated gives a total of 139 articles in the final database.

Figure 3 Extraction of each relevant article using snowballing.
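The selection rule and the snowballing counts reported above can be checked with a short sketch. The counts (134, 145, 53, 19, 5) come from the text; the function name `include_article` and the example votes are hypothetical, and the handling of a tied vote (rejected, because a tie is not "more than half") is an interpretation of the stated rule.

```python
def include_article(votes):
    """Return True when more than half of the reviewing authors vote to include."""
    return sum(votes) > len(votes) / 2


# Snowballing stages and article counts as reported in the text.
snowball_funnel = [
    ("retrieved by snowballing", 145),
    ("after filtering by title", 53),
    ("after filtering by abstract", 19),
    ("after filtering by contents", 5),
]

primary_selected = 134  # primary studies selected before snowballing
final_total = primary_selected + snowball_funnel[-1][1]

for stage, count in snowball_funnel:
    print(f"{stage}: {count}")
print("final database:", final_total)
```

With three reviewers, for example, `include_article([True, True, False])` accepts the article, while a two-reviewer split `include_article([True, False])` rejects it.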

Relevant articles database development

After accumulating each primary article reported in the proposed field, a database of relevant articles is developed for assessment and analysis, to find the current trends in the healthcare big data analytics domain and to investigate gaps in these articles that open new avenues for future research. A total of 139 relevant articles are added to the final database. The overall contribution of the selected online repositories to the database is depicted in Fig. 4.

Figure 4 Distribution of primary studies.

From Fig. 4, it is concluded that IEEE Xplore and Science Direct contribute the most, which reflects the research community's interest in presenting their work in these venues.

Articles accumulation and final database development

After developing a database of relevant articles, it is evaluated using different parameters like type of article (conference proceedings, journal article, book chapter etc.), publication year, and contribution of individual library. Figure 5 represents the information regarding the total contribution of articles by type in the final database.

Figure 5 Evolution of final database by type of article and year.

Figure 5 indicates that researchers paid significant attention to the development of new healthcare systems rather than to finding gaps in the available systems and developing enhanced solutions accordingly. Such enhanced solutions could accurately identify and diagnose a disease based on a patient's historical medical information. A small amount of work is reported as review articles and survey papers, but none follows a systematic mechanism to analyse the work in a specific range of years guided by a set of research questions. The same pattern can be seen in Fig. 6, where the highest percentage contribution exceeds that of book sections, conference papers, etc.

Figure 6 Percentage contribution by type of paper.

Figure 7 depicts the percentage contribution of each library in the proposed assessment work.

Figure 7 Percentage contribution of each library.

Figure 8 represents the annual distribution of articles selected for analysis and assessment. From Fig. 8 it is evident that the number of articles increases over time, which shows the maturity of the domain and the growing interest of researchers in it.

Figure 8 Annual distribution of articles.

From Fig. 8, it is concluded that IEEE Xplore contributes the most to the final database of relevant articles, which shows the tendency of researchers to present healthcare-related work in IEEE journals. Figure 9 represents the total number of journal articles, survey papers, conference papers, and book sections in the selected relevant articles database.

Figure 9 Evolution of database by number of articles by type.

From Fig. 9 it is concluded that significant attention is given to the development of new healthcare models, which shows the maturity of the field. Dealing with such a mature field and extracting useful information is a demanding job for researchers. A systematic analysis of this research field is needed to provide an overview of the work reported during a specific range of years. This analysis will not only save researchers' time, but will also open avenues for future research in this field.

Table 5 represents the annual contribution of studies in the final relevant database.

Table 5 Inclusion or exclusion criteria.

Overall information regarding type of paper, publication year and number of records is depicted in Fig. 10 below.

Figure 10 Evolution of final database.

Quality assessment

After executing the inclusion and exclusion process, all relevant articles in the database are manually assessed by the authors to check the relevance of each article to the selected research problem. Quality criteria are defined to check every research article against the formulated research questions. These quality criteria are defined in Table 6.

Table 6 Quality criteria.

Weighted values are assigned to each quality criterion to check the relevance of an article to a given research question. These weighted values and their descriptions are depicted in Fig. 11.

Figure 11 Quality criteria for the proposed SLR work.

After the assessment process, the relevance of each article is decided based on its aggregated weighted score. A score greater than 3 indicates that the article is highly relevant to the selected research topic. Figure 12 represents the aggregate score values of each article based on the defined quality assessment criteria.

Figure 12 Quality assessment process.
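The scoring rule described above (sum the weighted values, then compare the aggregate with 3) can be sketched as below. Only the threshold of 3 comes from the text. The criterion labels `QC1` to `QC5`, the weight scale of 0, 0.5, and 1, and the sample scores are assumptions made for illustration, since the actual weights are defined in Fig. 11 and are not reproduced here.

```python
RELEVANCE_THRESHOLD = 3  # from the text: a score greater than 3 marks high relevance


def aggregate_score(scores):
    """Sum the weighted values assigned to an article across all quality criteria."""
    return sum(scores.values())


def is_highly_relevant(scores, threshold=RELEVANCE_THRESHOLD):
    """An article is highly relevant when its aggregate score exceeds the threshold."""
    return aggregate_score(scores) > threshold


# Hypothetical scores: one weight per quality criterion (assumed scale 0, 0.5, 1).
article_a = {"QC1": 1.0, "QC2": 0.5, "QC3": 1.0, "QC4": 0.5, "QC5": 1.0}
article_b = {"QC1": 1.0, "QC2": 0.5, "QC3": 0.5, "QC4": 0.5, "QC5": 0.5}

for name, scores in [("article_a", article_a), ("article_b", article_b)]:
    print(name, aggregate_score(scores), is_highly_relevant(scores))
```

Under these assumed weights, an article scoring exactly 3 does not pass, because the rule in the text requires a score strictly greater than 3.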

Results and discussion

After the quality assessment, the next key step of an SLR is to analyse all the relevant articles to identify the techniques proposed for efficient communication between patient and practitioner, for accurate feature extraction from healthcare big data, and for putting these into practical use.

This section performs a descriptive analysis of each article based on the five research questions. In this systematic review process, a total of 139 research articles published from 2011 to 2021 are analysed.

Healthcare big data

Researchers and data analysts have suggested no contextual name for “big data” in healthcare, but for implementation and interpretation purposes they divide it into a 5 V architecture. Figure 13 depicts the 5 V architecture of big data.

Figure 13 Big Data 5Vs15.

The exponential increase in IoT-based smart devices and information systems has resulted in a plethora of information in the healthcare domain, and this information increases exponentially on a daily basis. Smart IoT-based healthcare devices produce a huge amount of data, for which the term “Big Data” is used. This is data whose scale, diversity, and complexity require innovative structure, variables, design, and analytics for efficient utilization and management, accurate data extraction and visualization, and the discovery of hidden information regarding a specific problem of interest. The main idea behind healthcare big data analytics is to retrieve enriched information from a huge amount of data using different machine learning and data mining techniques191. These techniques help improve the quality of care, reduce the cost of care, and help practitioners to suggest medicines based on clinical historical information.

RQ1. What are the key features adapted to integrate the structured and unstructured data in healthcare big data domain?

Big data comprises a huge amount of data, and especially a plethora of data types, to be processed in order to extract enriched information regarding a problem of interest. Several features have been assessed and analyzed, especially in the healthcare domain, to integrate both structured and unstructured data. Multiple researchers analyzed semantic-based big data features for integration purposes, while others proposed behavior-based and structure-based features for patient monitoring and activity management151,192. Some performed real-time analysis using a group of people for data integration and clustering purposes. Table 7 lists the research work published on structured and unstructured data integration.

Table 7 List of key features used for data integration purposes.

After analysing the available literature in Table 8, it was concluded that mostly semantic-based, structure-based, and real-time activity-based features are considered for information extraction and organization. If geometric-based features are considered and a clustering mechanism is adopted for data organization, this will not only integrate structured and unstructured data efficiently but also improve the simulation capabilities of different applications.

Table 8 List of techniques proposed for easy and timely data-access interface.

RQ2. What are different techniques proposed to provide an easy and timely data-access interface for doctors?

The digital transformation of healthcare systems through information systems, medical technology, and handheld and smart wearable devices has posed many challenges for both researchers and caretakers in terms of storage, reducing the cost of care, and processing time (to extract relevant information for refining quality of care and reducing waste and error rates). The prime goal of healthcare big data analytics is to process this vast amount of data using machine learning and other processing models, extract problem-relevant information, and use it for human well-being195. Several supervised and unsupervised classification techniques are followed for these purposes. ML-based architectures and big data analytical techniques are integrated in the healthcare domain for efficient information retrieval and exchange, risk analysis, optimum clinical decision-support systems, and suggesting precise medicines using genomic information196. Table 8 presents the literature reported on providing an easy and timely data-access interface for practitioners.

RQ3. What are different ways to improve communication between the doctor and patient?

Healthcare around the world is under high pressure due to limited financial resources, over-population, and disease burden. In this modern technological age, the healthcare paradigm is shifting from a traditional, one-size-fits-all approach to a focus on personalized individual care1. Additionally, healthcare data varies in both type and amount. Healthcare providers deal not only with patients' historical, physical, and nominal information, but also with imaging information, labs, and other digital and analogue information such as ECG and MRI. This data is voluminous, varies in type and format, and differs in structure. Big Data can handle not only these different types and forms of data, but also the 5 V structure comprising volume, variety, value, veracity, and velocity. Meanwhile, doctors face an increasing burden of rising patient numbers coupled with progressively less time to spend with each patient. In other words, we are dealing with more patients, more data, and less time.

Different techniques are proposed in the literature to provide an easy and timely communication interface for both doctors and patients. Table 9 depicts different information exchange tools/techniques reported in the literature.

Table 9 List of techniques proposed for efficient communication and information exchange.

RQ4. What are different types of classification models proposed for accurate disease diagnosing using patient historical information?

This research question aims to outline the disease diagnosis models proposed in the literature using healthcare big data. Researchers around the world have proposed diverse approaches to healthcare big data analysis to ensure accurate disease diagnosis, provide healthcare facilities at the doorstep, develop eHealth and mHealth applications, and more. Multiple statistical and ML-based approaches have been proposed for accurate diagnosis. Figure 14 represents multiple techniques proposed for automatic disease diagnosis using healthcare big data.

Figure 14 Multiple disease diagnosing techniques proposed in the literature.

All these techniques perform the diagnosis using semantic-based or structure-based features. No attention is given to geometric feature extraction techniques, which are prominent in extracting enriched information from data and result in high identification rates. Also, no advanced hybrid neural network and shallow architectures are proposed for automatic diagnosis. Keeping these gaps in mind, an optimum eHealth application can be developed by applying these hybrid techniques.

RQ5. What are different applications of big data analytics in healthcare domain?

Big data analytics has revolutionized our lives by enabling many state-of-the-art applications in domains ranging from eHealth to mHealth, weather forecasting to climate change, traffic management to object detection, and many others. This research question focuses on listing the applications of big data analytics in healthcare, which are given in Table 10.

Table 10 Applications of big data analytics in healthcare domain.

Limitations

This article has a number of limitations. Some of these limitations are listed below.

  • For this systematic analysis, articles are accumulated from only six peer-reviewed libraries (ACM, SpringerLink, Taylor & Francis, Science Direct, IEEE Xplore, and Wiley Online Library), although a number of other multi-disciplinary databases exist.

  • This systematic analysis covers a specific range of years (2011 to 2021), while new articles are reported on a daily basis.

  • Articles are accumulated from online libraries using search queries, so a paper with no words matching the query was skipped during the search process.

  • Google Scholar is skipped during the articles accumulation phase to shorten the searching time. Also, it gives access to both peer-reviewed and non-peer-reviewed journals and we only focused on peer-reviewed journals for the relevant articles.

  • Being a systematic literature work it can be broadened to grab the knowledge about other varying topics such as healthcare data commercialization, health sociology etc.

Despite these limitations, we hope that this systematic research work will inspire future research in the recommended fields and open avenues for both industrialists and policymakers.

Conclusion and future work

In this research article, the research reported from 2011 to 2021 is thoroughly analysed for the efforts made by researchers to help caretakers and clinicians make sound decisions in disease diagnosis and suggest medicines accordingly. Based on the research problem and underlying requirements, researchers have proposed several feature extraction, identification, and remote communication frameworks to support timely doctor and patient communication. These real-time or near real-time applications mostly use big data analytics and computational devices. This research work identified several key features and optimum management designs proposed in the healthcare big data analytics domain to achieve effective outcomes in disease diagnosis. The results of this systematic work suggest that advanced hybrid machine learning-based models and cloud computing applications should be adopted to reduce treatment cost and simulation time and to improve quality of care. The findings will not only help policymakers to encourage researchers and practitioners to develop advanced disease diagnosis models, but will also assist in providing improved treatment mechanisms for patients.

Advanced hybrid machine learning architectures for cognitive computing are considered the future toolbox for the data-driven analysis of healthcare big data. Geometric-based features should also be considered for feature extraction instead of semantic and structure-based features. These geometric-based feature extraction techniques will not only reduce simulation time, but will also improve the identification and disease diagnosis capabilities of smart health devices. Additionally, these features can help in the accurate identification of Alzheimer's disease and of tumours in PET or MRI images using upgraded machine learning and big data analytics. A cluster-based mechanism should be considered for data organization to improve timely access to, and easy management of, big data. Promoting research in these areas will be crucial for future innovation in the healthcare domain.