## Background & Summary

The pharmaceutical industry is known for innovation, development of life-saving products, and a high level of quality1. On the other hand, it is also relatively reluctant to embrace changes, new technologies, and approaches that could improve ways of working in the production of medicines2,3. Why? Primarily because the field is highly regulated, which calls for a long and document-heavy process of change implementation4.

Regulatory bodies such as FDA (Food and Drug Administration, US) and EMA (European Medicines Agency) have acknowledged that a shift towards more data-oriented medicine manufacturing needs to occur5,6. The pharmaceutical industry is being encouraged to embrace new digital technologies and better utilize the data collected for the demonstration of the quality of their products and improvement of manufacturing efficiency7,8.

Medicine manufacturing processes are equipped with numerous sensors monitoring and controlling critical process and equipment parameters9. Every product manufactured for the market therefore has a large amount of data describing its every step, from incoming raw materials entering the process, through process parameters and intermediate product characteristics, to final product quality (Table 1). These production-related datasets range from simple laboratory analysis of incoming raw materials to complex time series outputs available for every second of the manufacturing process10,11. The data are stored in several different databases and servers and are normally used only to confirm the predefined quality of incoming raw materials, intermediate products, and final products.

The present data collection study focuses on a high-volume pharmaceutical product (i.e., medicine) intended for lowering blood cholesterol. The pharmaceutical dosage form is film-coated tablets with an immediate release drug profile. The product was chosen due to its simple formulation and straightforward manufacturing process.

The composition of the selected medicine consists of excipients and an active pharmaceutical ingredient (API). The excipients that were considered for this case study have a potential impact on final product quality and are as follows: lactose, silicified microcrystalline cellulose, and starch. Before their use, quality control analysis is performed for each incoming material.

The manufacturing process includes direct compression, which, based on expert knowledge, has a significant impact on final product quality. The process-related data therefore consist of tablet compression time series and intermediate product (i.e., tablet cores) testing data.

After the process is finalized, product quality is tested for every manufactured batch on a representative sample of film-coated tablets. The product is only released for use if all specification criteria for final product testing are met.

The data presented in this study were collected from three main databases: “Laboratory Sample Manager Database”, production database, and process time series database connected with tablet press SQL database.

Incoming materials have a significant impact on final product quality, and when combined with process data, additional insight into the product may be obtained. The dataset presented in detail below offers an opportunity for researchers to find alternative ways to determine product quality.

## Methods

Due to the sensitivity of industry data, all data descriptors (i.e., batch number, product code, excipient batches, etc.) were anonymized.

### Dataset scope

The product or product family in the scope of the research has several product sub-families, which are defined by product code. Product sub-families differ in strength and manufacturing batch size. There are four different strengths and nine different batch sizes present in the research dataset. Products of different strengths within the scope have proportional or semi-proportional formulations and only differ in the weight of the final tablet, keeping formulation ratios the same. In order to account for the differences between product sub-families, categorical data are also included in the research dataset.

The data collected for the present research range from November 2018 to April 2021. The time interval exceeding one year ensures that seasonal variation, changes in incoming raw materials, the impact of operator shift work, holidays, and other common process and equipment variability, are all taken into account. It is thus safe to assume that the presented dataset is robust and representative of the selected product.

### Data sources

The primary data sources are laboratory analysis results of incoming raw materials (excipients and API), of the intermediate product (tablet cores), and of the final product. The analyses were performed by trained laboratory technicians specialized in the corresponding tests. Devices used for the analyses ranged from HPLC (high-performance liquid chromatography), GC (gas chromatography), moisture and particle size analyzers to an automatic tablet core analyzer.

The second primary data source is the tablet compression process time series. Time series outputs, such as tablet press speed, compaction force, fill depth, etc., are generated by tablet press sensors (Table 2). A time series value is recorded for every second of the process and stored in the tablet press SQL database. From there, the time series are uploaded to a server that allows domain experts to visualize or extract the data. These data are semi-structured and require cleaning and organizing before use.

### Data collection methods

Before securely accessing and exporting the stored laboratory and process data, so-called batch genealogy was performed. All laboratory and process data in the above-mentioned databases are stored using batch identifiers. In order to extract the relevant data from the databases, it was necessary to determine the corresponding raw material batches that entered into each of the 1,005 final product batches included in this data descriptor study. Only after this initial information was known did the process of data collection begin.

We exported the data by product material code (i.e., product sub-family), which groups all the batches that have been manufactured under that particular code. The export filter settings, therefore, included the time interval, product code, and laboratory analysis range.

The process time series export was more challenging compared to the laboratory data, due to the quantity of the data. The tablet compression process typically runs between 2 hours and 20 hours, depending on product sub-family (i.e., product code), which defines the batch size (i.e., the target number of tablets produced). The larger the batch size, the longer the process. We exported the data for every 10 seconds of the process and thus slightly reduced complexity, while still keeping all important process information.
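The 10-second export described above can be thought of as simple downsampling of a 1 Hz signal. The sketch below is illustrative only; the list-of-samples layout and function name are assumptions, not the export tool actually used:

```python
# Sketch of the 10-second downsampling step, assuming the raw export is a
# list of (seconds_elapsed, value) samples recorded once per second.

def downsample(samples, step=10):
    """Keep every `step`-th sample from a 1 Hz time series."""
    return [s for i, s in enumerate(samples) if i % step == 0]

raw = [(t, 100.0) for t in range(60)]   # one minute of 1 Hz data
reduced = downsample(raw)               # samples at t = 0, 10, 20, 30, 40, 50
```

This keeps the overall process dynamics while reducing the data volume tenfold.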

The exported batch time series data provided datasets of several thousand rows (see details in the Data Records section). The main identifier of this type of data is a timestamp, which was unstructured and needed preprocessing due to different time formats present in the primary data.
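Standardizing mixed timestamp formats of this kind can be sketched as follows; the two concrete input formats shown are assumptions for illustration, not necessarily those present in the primary data:

```python
# Illustrative normalization of mixed timestamp formats to a single
# "YYYY-MM-DD HH:MM:SS" representation.
from datetime import datetime

CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",      # 24-hour format (assumed)
    "%m/%d/%Y %I:%M:%S %p",   # 12-hour format with AM/PM (assumed)
]

def normalize_timestamp(raw: str) -> str:
    """Try each known format and emit a single canonical one."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).isoformat(sep=" ")
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {raw!r}")

normalize_timestamp("03/05/2020 02:15:30 PM")  # "2020-03-05 14:15:30"
```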

### Preprocessing of laboratory data

The laboratory data include the results from the incoming raw material analysis (independent variables), intermediate product quality (independent variables), and final product quality (dependent variables). We visualized every exported parameter (i.e., laboratory analysis results) to check for any outliers.

Another aspect that needed to be evaluated was whether the data had the expected spread based on product and process knowledge. The following preprocessing steps were applied:

• Batches with no final product quality results available were excluded from the dataset.

• Four different product “sub-families” have four different target tablet core weights, which would have affected further analysis. For this reason, a normalized parameter, weight relative standard deviation (RSD), was included instead. However, for the purpose of alternative future analysis approaches, the original weight data for every batch is kept in the provided dataset.

• Every product code (sub-family) also has a different target thickness, diameter and hardness. We therefore decided to prepare a new parameter normalized with respect to tablet dimensions and shape – tensile strength (Eq. 1)12,13. Unlike hardness, tensile strength is a normalized parameter comparable across product sub-families. Since all product sub-families have the same cylindrical tablet shape, no alterations to Eq. 1 are needed. A new parameter was created by applying the following equation for every batch:

$$Tensile\;strength:\;\sigma =2\ast \frac{F}{\pi \ast t\ast d};$$
(1)

where F represents tablet hardness in Newton (N), t is tablet thickness in millimeters (mm) and d represents tablet diameter in millimeters. The average hardness value and the maximum thickness and diameter for both tablet cores and film-coated tablets were considered in the calculation.

• Product quality parameters included in the dataset are final product impurities, residual solvents and drug release results.
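The two normalized parameters introduced above, weight RSD and tensile strength (Eq. 1), can be computed as in this minimal sketch (units follow the text; function names are illustrative):

```python
# Weight RSD and Eq. 1 tensile strength for cylindrical tablets.
import math
import statistics

def weight_rsd(weights):
    """Relative standard deviation of tablet weights, in percent."""
    return statistics.stdev(weights) / statistics.mean(weights) * 100

def tensile_strength(hardness_n, thickness_mm, diameter_mm):
    """Eq. 1: sigma = 2F / (pi * t * d), in N/mm^2."""
    return 2 * hardness_n / (math.pi * thickness_mm * diameter_mm)

tensile_strength(100.0, 4.0, 10.0)  # ~1.59 N/mm^2
```

Because both parameters are dimension-normalized, they are comparable across product sub-families with different target tablet weights and sizes.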

### Preprocessing of time series data

We performed visualization on the time series to evaluate data quality, spot any unusual events, and identify any need for special preprocessing or complete removal of certain batches. Among all the parameters included in the time series dataset, tablet press speed and the number of rejected tablets, as counted by the press, were the most descriptive of batch dynamics. The first parameter indicates when batch tablet production started and whether there were any unusual process interruptions. The second parameter, the number of rejected tablets, helps to determine whether a tablet press speed of 0 means that the process is finished or that start-up or other, more challenging issues occurred during the process: if the number of rejected tablets does not increase, the process is finished. Furthermore, the trajectory of rejected tablets in a batch cannot fluctuate; it can only increase with time. The following preprocessing and cleaning steps were taken:

• The process time series were quality checked by applying the above explained visualization of the two parameters for every single batch included in this dataset (1,005 batches in total).

• We observed that some batches were paused over weekends or bank holidays. Due to this, a considerable part of the time series displays the value of 0, as the tablet press was paused. This particular characteristic needs to be considered when dealing with these data, as it could hinder future analysis. These parts have been deliberately kept as part of the time series dataset, because leaving blended powdered material to rest for prolonged periods of time could potentially impact the quality of compressed tablets.

• The tablet press SQL database includes unstructured data, and the transfer of these into a readily available database (iHistorian) resulted in a mixed, 12- and 24-hour time structure. Standardization of the time format was performed.

• Visualization enabled us to detect a drop in the rejected tablet parameter for some batches. This required a detailed data investigation for those particular batches. We discovered that the issue occurred in batches that spread over two or more days, because of an incorrect date structure in some batches. Corrections were made for all batches where the issue was detected.

• Some time series datasets had a so-called “gap”, meaning that for a period of several minutes, there were no data available, i.e., empty cells. This occurs when the tablet press is shut down for whatever reason, e.g., weekend, shift change, calibration of the tablet press, calibration of the automatic weight and hardness control system, etc. Since these gaps of missing data have no value for analysis, they were excluded.

• Once all the above steps had been applied, the data cleaned, and the time structure corrected, another visualization cycle was performed to verify that all the preprocessing steps had worked.

Only the correctly preprocessed, high quality time series data were kept and are presented in this research.
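Two of the cleaning steps above, gap removal and the monotonicity check on the cumulative rejected-tablet count, can be sketched as follows (the row layout and field names are assumptions, not the authors' actual code):

```python
# Hypothetical sketch of two cleaning steps: dropping empty "gap" rows
# and flagging batches where the cumulative rejected-tablet count drops,
# which per the text can only happen through a date-structure error.

def drop_gaps(rows):
    """Remove rows where no data was logged (empty cells)."""
    return [r for r in rows if r["rejected"] is not None]

def rejected_count_is_monotonic(rows):
    """The cumulative rejected count must never decrease over time."""
    counts = [r["rejected"] for r in rows]
    return all(a <= b for a, b in zip(counts, counts[1:]))

batch = [
    {"t": 0, "rejected": 0},
    {"t": 10, "rejected": None},   # gap: press shut down
    {"t": 20, "rejected": 5},
]
clean = drop_gaps(batch)
rejected_count_is_monotonic(clean)   # True for this batch
```

A batch failing the monotonicity check would be routed to the detailed date-structure investigation described above.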

### Time series data reduction

The time series presented a vast amount of data compared to laboratory result entries for each batch. The aim was to create new parameters for each batch that would best describe the original time series data for a particular batch.

These new parameters were then combined with the laboratory data linked with the same product batch. Such data manipulation allows for further in-depth analysis, process understanding, and product quality predictions. The time series data are provided for this data descriptor paper both intact and in a reduced format: as new attribute vectors tailored separately to each process parameter based on expert knowledge.

The following feature extraction steps were applied and new attributes created for every batch time series dataset:

• Average tablet press speed: the average speed of the tablet press machine, excluding the tablet press values of 0 due to known longer interruptions, such as weekends and holidays, which would mask the true process characteristics.

• Number of tablet press speed changes: tablet press speed is generally set in the beginning of the process and usually only changes or drops to 0 if issues with tablet compression arise. The best process is a constant process without interruptions or changes to the parameters. This attribute was normalized by batch size, because larger (and therefore longer) batches will naturally have more interruptions than smaller ones.

• Tablet press speed of 0: the total time the tablet press speed was 0. This parameter, too, was normalized by batch size for the same reason as explained above. It was necessary to consider if a batch was run or paused during the weekends and holidays. For this reason, we created an additional categorical attribute to note which batches had a “weekend run”. If the tablet press was stopped during a weekend or holiday, this time was subtracted from the total time the tablet press speed was 0.

• Total number of rejected tablets: this parameter is cumulative and tells us for every moment of the process how many tablets have been rejected up to a certain point. It was normalized with batch size.

• Number of rejected tablets during start-up: this gives an indication of how much time and effort was needed to prepare the tablet press parameters for a particular blend. The higher the number of rejected tablets, the more effort was needed, due to more challenging material properties or potentially an inexperienced operator. Either way, both have a potential impact on tablet quality. Normalization is not needed, because the size of the batch does not have an impact on the set-up of the tablet press.

• Average filling device speed: the filling device speed average excluding 0 values.

• Filling device speed changes: it is very uncommon to change the filling device speed during the compression process itself, so it makes more sense to consider only those changes during tablet press set-up, when the “production” parameter is 0, meaning that no good tablets have yet been made and we are therefore in the start-up phase of the process.

• Average SREL start-up: the average standard relative deviation of the main compression force (SREL) during compression start-up, excluding values where the tablet press is not running (i.e., the tablet press speed or the filling device speed is 0).

• Average SREL: the average SREL during a compression run, excluding values where the tablet press is not running (i.e., the tablet press speed or the filling device speed is 0).

• SREL max: the maximum value of SREL during a compression run. It is essential to eliminate any excessive values, i.e., anything above 15, because such values are unrealistic for the product based on expert product knowledge and only appear in the beginning of the tablet press run due to the main compression force transitioning from 0 to the target value. The difference is naturally large, which shows in the SREL parameter.

• Simple statistical methods were applied to other process time series parameters for the compression run as well as the start-up phase of the compression process: min, max, standard deviation, mean, median.

Based on batch sizes, normalization factors were calculated, which are included in a separate file within the enclosed dataset.
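A few of the extracted attributes described above can be sketched as follows. The exclusion of zeros, the SREL cut-off of 15, and the batch-size normalization follow the text, while function names and data layout are assumptions:

```python
# Illustrative extraction of three per-batch attributes from the
# tablet compression time series.

def average_excluding_zeros(values):
    """Average of a parameter, ignoring zeros from known interruptions."""
    nonzero = [v for v in values if v != 0]
    return sum(nonzero) / len(nonzero)

def speed_changes_per_tablet(speeds, batch_size):
    """Count of speed changes, normalized by batch size."""
    changes = sum(1 for a, b in zip(speeds, speeds[1:]) if a != b)
    return changes / batch_size

def srel_max(srel_values, cap=15.0):
    """Maximum SREL, ignoring unrealistic start-up values above the cap."""
    return max(v for v in srel_values if v <= cap)

speeds = [0, 0, 50, 50, 60, 60, 0]
average_excluding_zeros(speeds)           # 55.0
speed_changes_per_tablet(speeds, 100000)  # 3 changes / 100,000 tablets
srel_max([20.0, 3.2, 4.1, 2.9])           # 4.1 (the 20.0 start-up spike is dropped)
```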

## Data Records

The data records are available at figshare14: https://doi.org/10.6084/m9.figshare.c.5645578.v1. The data are divided into the laboratory data (“Laboratory.csv”), the time series data (“Process time series”), the extracted features of the time series data (“Process.csv”), and the normalization factors (“Normalization.csv”). Each of these is presented below.

### Laboratory data

This file combines genealogy information, the incoming raw materials analysis, the intermediate product analysis and the final product analysis. There are altogether 1,005 rows of data, each row presenting one final product batch (Table 3). The rows are identified by final product batch numbers; the batches can be grouped into the so-called product sub-families by product codes. Laboratory data parameters are presented and explained in Table 4.
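A minimal sketch of how the laboratory table can be joined with the per-batch extracted process features, using the standard csv module. The batch-identifier column name used here is an assumption; see Table 4 for the actual naming:

```python
# Join "Laboratory.csv" and "Process.csv" rows on the batch identifier.
# The "batch" key name is illustrative, not the dataset's actual header.
import csv

def read_by_batch(path, key="batch"):
    """Index a CSV file's rows by their batch identifier."""
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

def join_on_batch(lab, proc):
    """Inner join of the two tables on the batch identifier."""
    return {b: {**lab[b], **proc[b]} for b in lab.keys() & proc.keys()}

# Toy usage with in-memory rows instead of the actual files:
lab = {"B001": {"assay": "99.8"}}
proc = {"B001": {"avg_speed": "55.0"}, "B002": {"avg_speed": "60.0"}}
merged = join_on_batch(lab, proc)
```

An inner join is the natural choice here, since only batches with both laboratory results and process features are usable for supervised analysis.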

### Process time series data

The time series data are arranged in one file per product code (i.e., product sub-family). Each product code combines all final product batches manufactured in the selected period. The process time series data includes the most relevant tablet compression process parameters based on product history and expert knowledge collected for every 10 seconds of the process.

The following process parameters were considered: tablet press speed, filling device speed, main and pre- compression force mean value, tablet fill depth, standard relative deviation of main compression force (SREL), good and bad production, cylindrical height for main and pre- compression, bottom punch stiffness and ejection force. The effect of these parameters on tablet characteristics is summarized in Table 5, along with the shortened naming convention used in the uploaded data files as well as units for each parameter. Each time series file grouped by product code contains the same process parameters with the same naming convention.

It should be noted that all potential variations of the process parameters described in the table below led to a process within specification limits for all batches included in this research dataset. The variation of process parameters and product characteristics within the specification limits is normal and expected.

### Extracted features of time series data

The time series data needed reduction and creation of new attributes before they could be readily used for the prediction analysis selected for the product. An example of attribute preparation is detailed in the Methods section, Preprocessing of time series data. A list of new attributes replacing the whole time series per each batch is detailed in Table 6.

Based on batch sizes, normalization factors were calculated, which are included in a separate file within the enclosed dataset.

## Technical Validation

The laboratory data included in the current study are generated by trained laboratory technicians. Each analysis is performed following approved operating procedures for a particular test. The equipment used for the analysis is qualified by the supplier and the engineering team before release for use in the company. The maintenance of these analytical instruments demands periodic services by an external certified company and regular calibration before use for analysis. Analytical instruments need to comply with strict international and internal industry standards and are subject to regular audits, which confirm the robustness and reliability of these devices. Every analytical result generated as described above is then transcribed into a dedicated database by the laboratory technician who performed the analysis. This entry needs to be verified and signed off by a second person in order for it to be uploaded into the database. The entire process follows good laboratory, manufacturing and documentation practice as defined by pharmaceutical industry standards (international and internal), resulting in data we can trust.

The laboratory data for 1,005 batches have been collected from databases and verified further to check for potential outliers and observe whether the scatter of the data is expected for the selected product. We applied statistical analysis to all collected laboratory parameters. Furthermore, the process capability index was calculated for measurements where laboratory data follows normal distribution and the parameter includes both the upper and the lower specification limit. (Note: quality parameters often only have either the upper or the lower specification limit, not both.) Overall process capability (Ppk) tells us whether the process is under control and within the limits (i.e., upper and lower specification limits). Ppk above 1 indicates a well-controlled process and is expected for the selected product (e.g., Figs. 1, 2). The laboratory parameters that are applicable for Ppk calculation are dissolution average (Fig. 1) and dissolution min (Fig. 2), which both demonstrate normal distribution. This is confirmed by a comparison of the actual process data (blue histogram bars) with a normal distribution curve (solid red curve on the graph). Ppk calculation applied to these data is as follows:

$$Ppk=\min \left[\frac{USL-\bar{X}}{3\sigma },\frac{\bar{X}-LSL}{3\sigma }\right];$$
(2)

where USL stands for the upper specification limit, LSL for the lower specification limit, $$\bar{X}$$ for the average value and σ for the standard deviation. Statistical evaluation results (average, standard deviation, minimum and maximum values) are gathered in Table 7 and indicate that the data follow the expected spread, without extreme outliers from the average.
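Eq. 2 translates directly into code; the sketch below uses the sample mean and standard deviation, as in the text:

```python
# Overall process capability (Ppk) per Eq. 2.
import statistics

def ppk(values, usl, lsl):
    """min of the upper- and lower-sided capability ratios."""
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)
    return min((usl - mean) / (3 * sigma), (mean - lsl) / (3 * sigma))

ppk([99, 100, 101], usl=106, lsl=94)  # 2.0: well-controlled process
```

A result above 1, as expected for the selected product, indicates that the process comfortably fits within its specification limits.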

All measurements are within the specification limits for each parameter. The greatest variation occurs for API total impurities, residual solvent and total product impurities (RSD of 50.1%, 91.1% and 71.3%, respectively). This variation is acceptable due to the small absolute spread of the data. Furthermore, the upper specification limit for these three parameters is at least double the maximum measured value, which further confirms the acceptability of the results.

The time series data included in the current study are generated by the tablet press machine. Tablet presses are installed and qualified by the supplier. Before they are released for use in production, an internal engineering team performs extensive qualifications and verifies all operational functionalities that need to comply with the predefined user requirements. Tablet presses are then approved for use in production and are regularly serviced, re-qualified and calibrated with the frequency defined by international pharmaceutical standards for production equipment. Software and server functionalities are separately verified on a regular basis by dedicated IT teams. Tablet presses are also subject to regular audits, which demand the highest industry standards for every production equipment. The data generated by the tablet press are thus a reliable source of information about process dynamics.

The time series were extracted from a server linked with the tablet press database and visualized. In total, 1,005 visual inspections were carried out as part of initial data validation. The batches where process profiles did not follow the expected trend were inspected in detail and preprocessed if needed. An example of a preprocessing requirement due to unstructured time may be seen in Figs. 3 and 4.

Besides unstructured time for some batches (e.g., Fig. 3), there was a so-called “data gap” observed. These are short intervals of several minutes, where, for some batches, no data is logged. The reason for this is loss of tablet press power (i.e., shut down). These are very rare events and were removed because they do not bring any added value to the process understanding, but could potentially impact further data analysis.

Time format correction led to time series profiles that conform with the nature of the manufacturing process in question (Fig. 4).

### Missing data

The laboratory data that are missing for some batches are the active pharmaceutical ingredient (API) analysis results: total impurities, impurity L, water content and particle size. These are missing for a particular source of API material, i.e., all missing data belong to the same API code. It is recommended either to attempt further analysis with the parameters that have missing data or, in case they are not relevant for the chosen type of analysis, to drop these parameters altogether.
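Both recommended options, dropping the affected parameters or restricting the analysis to batches where they are present, can be sketched as follows (row layout and column name are illustrative, not the dataset's actual headers):

```python
# Two simple options for handling the missing API parameters:
# drop the affected column, or keep only rows where it is present.

def drop_column(rows, column):
    """Remove one parameter (column) from every batch row."""
    return [{k: v for k, v in r.items() if k != column} for r in rows]

def complete_rows(rows, column):
    """Keep only batches where the parameter was measured."""
    return [r for r in rows if r.get(column) not in (None, "")]

rows = [
    {"batch": "B001", "api_water_content": "0.4"},
    {"batch": "B002", "api_water_content": None},   # missing API source
]
len(complete_rows(rows, "api_water_content"))   # 1
```

Which option is preferable depends on whether the affected API parameters matter for the chosen analysis, as noted above.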

## Usage Notes

Pharmaceutical product quality analysis is mandatory before any batch of medicine is released to the market11,15. This process is currently predominantly laboratory based and thus very time consuming for most products across the industry. Each product batch has a wealth of data from incoming raw materials to intermediate product analysis and detailed process time series for every equipment sensor1. These data have been described and provided in the present data descriptor publication for the selected pharmaceutical product. We recommend using the data further for the prediction of final product quality (for the parameters in Table 4, last 6 rows).

As described in the example analysis dataset, we used the time series datasets to extract parameters based on expert knowledge and on the impact the tablet compression process has on the product in question (Table 6). If this approach does not lead to sufficiently reliable prediction models, it would be better to use the whole time series datasets with a deep learning methodology in order to find attributes that better describe the selected product quality.