Introduction

Over the last few years, the complexity of clinical and biological data for a proper diagnosis and prognostication of onco-hematological diseases has remarkably increased, especially in the field of lymphomas1,2. In parallel, novel therapeutics found continue approvals from large, controlled trials, but missed parallel validation in the Real-World (RW) settings3. This major controversy claims for an urgent improvement of the capability to collect and share RW data with the final goal to support clinical and translational research4. Frequently, RW data are derived from fragmented sources as medical registries, electronic records, computerized patient order entries, individual databases, paper notes, as well as monocentric bio-banking-related annotations. Moreover, the common dearth of specialized data-entry professionals and the uneasy accessibility to data-extraction systems for most physicians accentuate the need for tools that facilitate the process of health data recording 5.

Natural Language Processing (NLP) is a consolidated technique to extract essential unstructured data from text, e.g. from diagnostic and prognostic notes6,7,8,9,10, widely adopted also in onco-hematology11,12,13,14,15,16. REDCap (Research Electronic Data CAPture) is a recognized platform of electronic case report forms (eCRFs) enabling rapid, high-quality and standardized annotation of data17,18. A potential bridge between NLP and eCRFs population is interposed by Optical Character Recognition (OCR), namely a state-of-the-art technology able to convert paper-based reports into digital formats to be further structured—possibly through NLP techniques—in electronic health records (EHR). Thus, OCR overcomes the need of integration between textual reporting and digital storage systems19,20.

Here, we describe the development of a NLP-based tool, named ARGO (Automatic Record Generator for Onco-hematology), to automatically convert RW reports in standardized eCRFs for data collection. To test its generalizability, we applied ARGO to a multicentric set of RW pathology reports of Non-Hodgkin Lymphomas (NHL) and validated its functionality, performance, and suitability for future translation into the daily practice.

Results

Electronic data collection workflow

The capacity of ARGO to effectively automize eCRF generation was tested on both internal and external cohorts of NHL paper-based pathology reports, including Diffuse Large B-Cell Lymphoma (DLBCL), Follicular Lymphoma (FL), and Mantle Cell Lymphoma (MCL). ARGO read all the words in each template currently adopted at the Pathology Unit of the IRCCS Istituto Tumori “Giovanni Paolo II” as well as at Pathology Units of six additional Italian Hospitals. As illustrated in Fig. 1A, each histopathology report included several data organized in four main sections: (1) biopsy date and Identification (ID) report number; (2) patient demographical information; (3) specimen characteristics; and (4) biomarker and diagnosis description.

Figure 1
figure 1

Graphical description of the framework. (A) Each paper-based report is manually transformed into an image file by a common digital scanner (right upside, an example of paper-based report from the Pathology Unit of the IRCCS Istituto Tumori “Giovanni Paolo II” of Bari, Italy). Then, the image is uploaded into ARGO through a web interface (black block), transformed in structured text through OCR and saved (by an NLP approach) as structured data in a database via webserver. “Diagnosis” attribution is carried out via API connecting ARGO with SEER servers (blue block). Finally, ARGO automatically populates eCRFs via API (red block). (B) Representative picture of REDCap dashboard for a single case report including “Demography” and “Disease parameters” forms (red bullets). Abbreviations. ARGO: Automatic Record Generator for Onco-hematology, OCR: Optical Character Recognition, NLP: Natural Language Processing, SEER: Surveillance, Epidemiology, and End Results, eCRFs: electronic Case Report Forms, API: Application Programming Interface, REDCap: Research Electronic Data-Capture, ID: Identification.

The first implementation step of ARGO pipeline consisted in the advantageous transformation of each paper-based report into an image file (.jpg extension) by using a common digital scanner. Thus, each report was uploaded on the ARGO application, which saved structured text into a support database, retrieved all the relevant data from the text, and transferred them directly into dedicated eCRFs. 233 out of 239 reports of the internal series and all those belonging to the external series (n = 93) were successfully converted in eCRF records. ARGO failed in converting six reports due to either low optical quality or their length (> 1 paper page). Figure 1B shows the main sections of each eCRF, which included both “demographic” and “disease” modules (see also Figure S1 from Supplementary Appendix), in a way consistent with the content of the corresponding original paper report. A video demonstrates ARGO’s functionality in Multimedia Appendix.

Data retrieved from diagnostic reports

Among the 239 paper-based reports retrieved from the internal series, 106, 79, and 54 were conclusive for a diagnosis of DLBCL, FL, and MCL, respectively (Fig. 2A). Overall, 110 diagnostic specimens were obtained from a tissue extracted from a lymph-node (LN), 76 were extra-nodal (EN), and 39 from bone marrow (BM), 2 from peripheral blood (PB), while for 12 cases this information was not available (Fig. 2B). In 85% of cases, a matched bone marrow biopsy was not available (Fig. 2C). Results from immunohistochemistry (IHC) staining for MYC, BCL2, BCL6, cluster of differentiation (CD)10, CD20, and Cyclin-D1 were available in 227 out of 239 cases and included a qualitative (positive/negative) assessment for the most relevant markers (Fig. 2D-E). A FISH (Fluorescent in situ hybridization) analysis (for MYC, BCL2, BCL6 or Cyclin-D1) appeared in the 29% of reports (Fig. 2F), whereas Cell of Origin (COO) categorization was reported in nearly 18% of cases (Fig. 2G). Of note, 186 out of 239 reports included the quantitative value of the Ki-67. Among these, 54 reported a value lower than 30% (Fig. 2H). Table 1 shows the full reports characteristics from the internal series.

Figure 2
figure 2

Characteristics of data retrieved from diagnostic reports. Graphical representation of diagnostic features, subdivided into specific fields, captured by ARGO from a total of n. 239 paper-based pathology reports of the internal series. Abbreviations. ARGO: Automatic Record Generator for Onco-hematology, FCL: Follicular Lymphoma, DLBCL: Diffuse Large B-cell Lymphoma, MCL: Mantle Cell Lymphoma, BM: Bone Marrow, LN: Lymph-Node, EN: Extra-Nodal, PB: Peripheral Blood, NA: Not Available, IHC: Immunohistochemistry, N: number, CD: Cluster of Differentiation, FISH: Fluorescent in situ hybridization, GCB: Germinal Center B-like.

Table 1 Characteristics of pathology reports.

Among the 93 paper-based reports retrieved externally from other six Italian centers, 49, 24, and 20 were conclusive for a diagnosis of DLBCL, FL, and MCL, respectively (Table 1). Overall, 53 diagnostic specimens were obtained from LN, 28 were EN, and 12 from BM. In 85% of cases, a matched bone marrow biopsy was not available. Results from IHC staining for MYC, BCL2, BCL6, CD10, CD20, and Cyclin-D1 were available in 93 out of 93 cases and in external series included a qualitative (positive/negative) assessment for the most relevant biomarkers. A FISH analysis (for MYC, BCL2, BCL6 or Cyclin-D1) appeared in the 5.4% of reports, whereas COO categorization was reported in nearly 35.5% of cases. Of note, 93 out of 93 reports included the quantitative value for Ki-67. Among these, 17 reported a value lower than 30%. Table 1 shows the full reports characteristics from the external series.

Internal vs external validation

Overall, ARGO detected 127,578 terms of interest and successfully generated EHR from 326 out of 332 processed histopathology reports. Figure 3 shows the comparative (internal vs. external series) post-hoc validation of ARGO performance for all the study data fields. For the “DIAGNOSIS” field, ARGO reached 88.9% vs. 87.9% (p > 0.05) of accuracy and recall, 93.5% and 93.7% of F1-score, also achieving 100% of precision in both series. For the “BIOPSY DATE” and the “ID NUMBER” fields of the internal series, all the applied metrics were > 90%. In comparison, accuracy, F1-score, and recall of external series for the “BIOPSY DATE” field were 94.6%, 95.0%, and 90.6%, respectively (P > 0.10), whereas “ID NUMBER” field ranged between 77.3% (recall from the external series) and 100.0% (precision from both internal and external series) (P > 0.10). For the “SPECIMEN TYPE” field, ARGO reached 86.6% vs. 91.4% of accuracy, 98.5% vs. 100.0% of precision, 92.5% vs. 95.5% of F1-score and 87.2% vs. 91.4% of recall (P > 0.10 in all instances). Similar high performance was observed for the “IHC EXECUTION” field (95.4% vs. 97.8% of accuracy and recall, 97.6% vs. 98.9% F1-score, and 100.0% of precision [P > 0.10]), although accuracy, recall and F1-score, but not precision (100.0%), slightly decreased as for single biomarker analyses (Supplementary Table S1). Similar results were also recorded the “BM AND FISH EXECUTION” fields. Finally, ARGO allowed the detection of “Ki-67”-related information with 85.4% vs. 80.6% (P > 0.10), 99.4% vs. 100.0% (P > 0.10), 81.9% vs. 76.3% (P > 0.10), and 89.8% vs. 86.6% (P > 0.10) of accuracy, precision, recall and F1-score, respectively. Overall, no significant differences between internal vs. external series were found in 14 out of 15 tested data fields (Supplementary Table S1). Of note, there is significant improvement (P < 0.01) in detecting the CD10 biomarker from the internal (67.4% of accuracy, 96.9% of precision, 55.9% of recall, and 70.9% of F1-score) compared to the external series (82.8% of accuracy, 98.5% of precision, 81.0% of recall, and 88.9% of F1-score).

Figure 3
figure 3

ARGO performance. Radar graphs indicate the performance metrics as percentage of accuracy, precision, recall and F1-score in different data fields for both internal and external series of pathology reports. Abbreviations. ID: Identification, IHC: Immunohistochemistry, BM: Bone Marrow, FISH: Fluorescent in situ hybridization.

To assess potential weaknesses of OCR in detecting data regarding single biomarkers, we selected 50 reports from the internal series with higher image resolution and reassessed the validation metrics (Table 2). Overall, recall and F1-score metrics improved of 12.9% and 9.1%, respectively. Moreover, we assessed the NLP performance on the internal series, independently of OCR. Interestingly, we observed an incremental trend of the recall for 7 of the 8 variables analyzed (“DIAGNOSIS” 87.9% vs. 90.0%; “ID NUMBER” 92.1% vs. 96.2%, “SPECIMEN TYPE” 87.2% vs. 92.7, “IHC EXECUTION” 95.4% vs. 95.8%, “FISH EXECUTION” 93.7% vs. 97.5%, “BM EXECUTION” 92.9% vs. 97.1%, and “KI-67” 81.9% vs. 89.4%). Only the field “BIOPSY DATE” showed a slight decrease from 97.1 to 96.2%, which we considered not relevant (Table 3).

Table 2 Comparison of ARGO performance in the whole vs. the top 50 reportsa.
Table 3 Comparison of ARGO performance using OCR + NLP and NLP alone (internal series).

Discussion

In the study, we aimed at designing a pipeline to automate the collection of RW onco-hematological data, about lymphoma diagnoses. Leveraging well-recognized technologies as OCR and NLP we developed a new tool, called ARGO, and provided a “proof” for its reliability in generating eCRFs directly from unstructured histopathology reports. We successfully tested ARGO performance, in terms of accuracy, precision, recall and F1 score, on a multicentric cohort of 326 lymphoma cases including DLBCL, FL, and MCL from seven independent centers.

ARGO generalizability stands on three assumptions: (1) the implementation of a function fully dedicated to recognize each input template independently of the clinical features collected in the report (a new template form an additional center might be easily read by adding few NLP regular expressions to the header_function.py); (2) ARGO is able to detect the set of clinically relevant terms for the diagnosis definition by matching standard criteria, that might be tailored to every clinical field (for example, other subtypes of lympho-proliferative disorders); (3) the choice at developing eCRFs on the College of American Pathologist (CAP) templates conferring high level of standardization to the clinical content.

In comparison with other applications in oncology, ARGO confirmed super-imposable performances in data field detection6,11,14,16,21,22, while overcoming some limitations. For instance, in the work by Nguyen et al., each metric decreases as the number of classes describing a certain data field increases11. This trend is globally confirmed in our experience, and even for data fields with high number of classes, such as “SPECIMEN TYPE”, we achieved a very high precision level. Also, to potentiate the OCR performance, we created three separate thesauri for “BIOMARKERS”, “SPECIMEN TYPE” and “DIAGNOSIS”. As shown in Tanenblatt et al.22, we first included officially-recognized nomenclatures in the “biomarkers” and “diagnosis” dictionaries, referring to the “International Statistical Classification of disease and related health problems 10th revision” (ICD10, version 2019, World Health Organization) classification23. Then, we manually added synonyms, abbreviations and other uncommon expressions noticed in our set of reports. Nevertheless, ARGO failed in converting six reports as a direct consequence of OCR-based limitations in reading reports with both low-quality optical resolution and describing IHC analyses from multiple samples. At this regard, the improvement of ARGO performance observed excluding OCR from the pipeline indicates a potential pitfall, which can be easily overcome by a manual supervision by a dedicated data-entry/manager.

From a more applicative point of view, ARGO might maximize the use of clinical data in translational research by boosting the adoption of EHR. Especially in onco-hematology, the public healthcare system still lacks standardized models of RW data collection, and several gaps exist concerning how to electronically collect unstructured information. Application of a computerized approach to extract data from paper-based reports and directly populate eCRFs provides two main advantages, such as the standardization of data collection and the data integration between Institutions and research networks. Finally, our system takes advantage from two levels of personalization related to REDCap, i) the designing of graphic interfaces directly by the clinical investigators according to specific clinical endpoints; and ii) the easily population of eCRFs via Application Programming Interface (API). Therefore, ARGO appeared as a valid tool for a precise and time-saving recording of clinical data when compared to manual abstraction16. Our approach results feasible in the daily practice, facilitating consultation, filtering, and management of RW data. This step is crucial to study wide proportions of onco-hematological patients who have no access to clinical trials and support national research networking.

Main limitations of the study could be the language of histopathology reports. However, current pathology reporting systems allow the use of personalized data fields according to shared templates and translating software, e.i. as “MyMemory” software, enable the easy switch up across languages. Moreover, providing the set of regular NLP rules used into ARGO might easily address this issue simply translating from Italian to other languages all words researched in the text included in each report.

Given the accuracy and efficiency in generating correct electronic records for multicentric subsets of different lymphoma types, our approach could be tailored to additional disease models in oncology and could set the basis to validate novel biomarkers for translational research.

Methods

Data collection

Overall, 332 histopathology paper-based reports were collected between 2014 and 2020 at the Pathology Unit of the IRCCS Istituto Tumori 'Giovanni Paolo II' in Bari, Italy (239) and from six different Italian centers (93) from Unit of Hematology, Azienda Ospedaliero-Universitaria Policlinico Umberto I in Rome, Italy, Hematology, AUSL/IRCCS of Reggio Emilia in Reggio Emilia, Italy, Division of Hematology 1, AOU “Città della Salute e della Scienza di Torino” in Turin, Italy, Division of Hematology, Azienda Ospedaliero-Universitaria Maggiore della Carità di Novara in Novara, Italy, Department of Medicine, Section of Hematology, University of Verona in Verona, Italy, and Division of Diagnostic Haematopathology, IRCCS European Institute of Oncology in Milan, Italy. The internal series included 106 DLBCL, 79 FL, and 54 MCL, while the external one comprised 49 DLBCL, 24 FL, and 20 MCL.

A unique ID code was assigned to each report. According to the diagnostic criteria for each lymphoma subtype, reports included IHC results obtained from LN, EN, BM or PB specimens. Qualitative and quantitative information for IHC markers including MYC, BCL2, BCL6, CD10, CD20, Cyclin-D1 were reported. Some reports also included molecular data from FISH analysis, while some reports included either FISH results or the level tumor cell infiltration as addendum. For DLBCL, molecular classification according to the COO estimated by the Hans algorithm was also included24. Ki-67 proliferation index was also reported as quantitative value ranging from 5 to 100%.

The work was approved by the Institutional Review Board of the IRCCS Istituto Tumori “Giovanni Paolo II” hospital in Bari, Italy. All methods were carried out in accordance with relevant local regulations and after obtainment of dedicated informed consent.

Automated detection of relevant terms in paper-based reports

We aimed this step of the workflow at automating the detection of relevant terms to be extracted from the text fields of paper-based reports. ARGO exploits OCR25 and NLP26 techniques to convert images of reports into text and detect relevant words in the text based on an “ad-hoc” thesaurus.

The conversion from image to text has been implemented in Tesseract OCR© (version 4.1.1-rc2-20-g01fb). To improve conversion performance, each pathology report was firstly converted from pdf to image through Poppler library (version 0.26.5). Then, the image was translated in a grey scale of 8 bits (from 0 to 255 levels of grey).

Image transformation was developed in Python by OpenCV© software (version 4.2.0).

In ARGO, NLP techniques were adopted to automatically extract relevant terms for the disease diagnosis, to be transferred into the digitalized eCRFs. Thus, a set of NLP regular expressions were applied to extract information concerning the diagnosis, date of the report, report ID, type of the specimen, execution of BM biopsy, IHC, and FISH analyses, as well as quantitative and qualitative data of selected IHC markers (MYC, BCL2, BCL6, CD10, CD20, Cyclin-D1), COO subtypes and Ki-67 proliferation index (paragraph “ARGO function and NLP rules”).

The disease nomenclature was assigned based on the highest match between the pattern of detected biomarkers in each report and a reference pattern, as reported in the “Hematopoietic and Lymphoid Neoplasm Coding Manual guidelines” from the “Surveillance, Epidemiology and End Results (SEER) program” of the National Institute of Health27. The final diagnosis nomenclature was referred to the ICD10 classification23. Communication between ARGO and SEER official servers was flexibly dealt via API.

ARGO was developed in Flask©, version 1.1.2, the webserver was an Oracle© Linux Server 7.8 with kernel 4.14.35–1902.303.5.3.el7uek.x86_64. We used MariaDB© 5.5.68 as database. NLP algorithms were developed in Python 3.6.8. Translation from English to Italian language was dealt via API tool MyMemory© (version 3.5.0). To increase the detectability of biomarkers in the reports we also built three thesauri in Phyton with NLP regular expressions (Supplementary Appendix Source code S1 and Table S2). Despite the domain specificity of such thesauri, the technique of knowledge extraction by flexibly introducing a new thesaurus is a general feature of ARGO.

ARGO functions and NLP rules

ARGO was developed according to three functions: function_read.py, header_info.py, and params.py. Function_read.py was the main function and incorporated (1) the call to the header_info.py function to recognize the report template as input, (2) the set of NLP expressions to identify both biomarker and diagnosis description, and (3) the call to the params.py function which included two API tokens, the first to take data on biomarkers and diagnosis from the SEER database and the second provided from the REDCap project ID to allow automatic data entry. Supplementary Fig. S2A details the pseudocode to process a pathology report. ARGO embedded two main activities, namely i) the recognition of the template from the header section including the fields “BIOPSY DATE” and “ID NUMBER”, the demographical patient information (“NAME”, “SURNAME”, “DATE OF BIRTH”, “PLACE OF BIRTH”, “SEX”, and “SSN” [Social Security Number]), and the “SPECIMEN TYPE” (via header_info.py), and ii) the recognition of the “IHC MARKERS” (“POSITIVITY/NEGATIVITY” or “QUANTITY”) from the biological samples, the fields “FISH”, “DIAGNOSIS”, and “CELL OF ORIGIN” from the disease section (via function_read.py). Supplementary Fig. S2B shows an example of NLP input from the internal series. The regular expressions used to automatically recognize the header section for internal reports are reported in Table 4. Those for the external reports are detailed in Supplementary Table S3.

Table 4 Set of NLP regular expressions embedded into the header_function.py for the internal reports.

Concerning function_read.py, we identified the set of pathological description patterns according to the following four scenarios:

  1. 1.

    description of qualitative markers by symbolic qualifiers in a free text form (e.g. “ + ” for positivity and “-” for negativity);

  2. 2.

    description of qualitative markers by textual qualifiers in a free text form (e.g. “positive”, “reactive” or “immunoreactive” for positivity and “negative” or “immunonegative” for negativity);

  3. 3.

    description of both qualitative and quantitative markers by symbolic or textual qualifiers in a bullet form;

  4. 4.

    description of pure quantitative markers (as Ki-67).

Table 5 shows three representative patterns of description with their relative NLP pseudocodes and expected results. The whole set of patterns is detailed in Supplementary Table S4.

Table 5 Representative sets of NLP rules embedded into the function_read.py for patterns 1.1, 3.2, and 4.1.

Data-mapping and automatic population of eCRFs

For a systematic collection of the diagnostic variables in this study, we designed dedicated eCRFs on REDcap17,18. eCRFs were suited to the synoptic templates provided and approved by the CAP. We referred to DLBCL, FL, and MCL templates28,29. The data-mapping between ARGO and the eCRFs was performed by providing the relevant data fields from the REDCap dictionary as a flexible input to the application (Supplementary Table S5). Finally, we used API technology for the automatic data entry and final upload of the information of interest into the eCRFs.

Validation metrics

ARGO performance, regarded as the level of consistency between data included in the original pathology reports and those automatically transferred into eCRFs, was assessed in terms of accuracy, precision, recall and F1 score30. To calculate each measure, we defined the cases in the following (1) true-positive: cases in which ARGO detected correctly the expected variables; (2) false-positive: cases in which ARGO detected variables even if not present in the original report; (3) true-negative: cases in which ARGO did not detect a variable not present in the original report; and (4) false-negative: cases in which ARGO failed in detecting a variable present in the original report.

Results for each data-field of internal and external series were statistically compared by a chi-square test.