Electronic case report forms generation from pathology reports by ARGO, automatic record generator for onco-hematology

The unstructured nature of Real-World (RW) data from onco-hematological patients and the limited accessibility of integrated systems restrict the use of RW information for research purposes. Natural Language Processing (NLP) can help convert unstructured reports into standardized electronic health records. We exploited NLP to develop an automated tool, named ARGO (Automatic Record Generator for Onco-hematology), that recognizes information in pathology reports and populates electronic case report forms (eCRFs) pre-implemented in REDCap. ARGO was applied to hemo-lymphopathology reports of diffuse large B-cell, follicular, and mantle cell lymphomas and assessed for accuracy (A), precision (P), recall (R) and F1-score (F) on internal (n = 239) and external (n = 93) report series. 326 (98.2%) reports were converted into the corresponding eCRFs. Overall, ARGO showed high performance in capturing (1) the report identification number (all metrics > 90%), (2) the biopsy date (all metrics > 90% in both series), (3) the specimen type (86.6% and 91.4% A, 98.5% and 100.0% P, 92.5% and 95.5% F, and 87.2% and 91.4% R for the internal and external series, respectively), and (4) the diagnosis (100% P, with A, R and F of 90% in both series). We developed and validated a generalizable tool that generates structured eCRFs from real-life pathology reports.
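For reference, the four reported metrics can all be derived from standard confusion-matrix counts. A minimal sketch (the counts below are illustrative, not the study's actual data):

```python
def metrics(tp: int, fp: int, fn: int, tn: int):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative example: 90 correct extractions, 0 false positives, 10 misses
a, p, r, f = metrics(tp=90, fp=0, fn=10, tn=0)
```

With no false positives, precision is 100% even when accuracy and recall are lower, which is the pattern reported above for the diagnosis field.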


SUPPLEMENTARY APPENDIX
Figure S1. Figure 1B from main manuscript in high resolution format.
Figure S2. Logical description of the NLP rules implemented in ARGO.
Table S1. Performance metrics from n. 239 internal and n. 93 external pathology reports.
Table S2. Referred thesaurus for biomarker recognition and for the diagnosis definition.
Table S3. Set of NLP regular expressions embedded into header_function.py.
Table S4. Set of NLP rules embedded into function_read.py for the whole set of patterns identified according to each scenario.
Table S5. Data dictionary extracted from REDCap for the data fields used to map each word detected by the NLP.
Source code S1. Source code developed in Python for the application of the thesaurus rules.

This supplementary material has been provided by the authors to give readers additional information about their work.

Figure S1. Figure 1B from main manuscript in high resolution format.
Representative picture of the REDCap dashboard for a single case report, including the "Demography" and "Disease parameters" forms (red bullets).

Figure S2. Logical description of the NLP rules implemented in ARGO.
header_info.py
1) ARGO recognized the hospital template in the header section (NLP regular expressions reported in Supplementary Table S3). Then:
1.1) ARGO sought words related to the reported date to initialize the BIOPSY DATE data field.
1.2) ARGO sought words related to the report ID to initialize the ID NUMBER data field.
1.3) ARGO sought words related to the patient's identification (e.g. "Cognome" [surname], "Nome" [first name]) to initialize the NAME, SURNAME, DATE OF BIRTH and PLACE OF BIRTH data fields, and the SSN: i. if the SSN code was present, ARGO initialized the SSN data field; ii. if the SSN code was not present, ARGO automatically calculated the SSN code via an external web service from the NAME, SURNAME, DATE OF BIRTH and PLACE OF BIRTH data fields.
1.4) ARGO sought words related to the specimen material to identify the SPECIMEN TYPE data field.
function_read.py
A. IHC MARKERS. For each marker recognized in the text (Supplementary Table S2): i. ARGO prompted the biomarker to the SEER database via API key (via params.py); ii. the SEER database responded providing the corresponding biomarker; iii. if either step i) or ii) failed, ARGO internally prompted the biomarker to the "in-house" thesaurus (Supplementary Table S2); iv. if step iii) also failed, the corresponding data field of the eCRF was not initialized.
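The three-step fallback described above (SEER lookup, then the in-house thesaurus, then leaving the field empty) can be sketched as follows. The function names and thesaurus entries are illustrative assumptions, not ARGO's actual code:

```python
from typing import Optional

# Hypothetical in-house thesaurus mapping report wording to eCRF marker codes
IN_HOUSE_THESAURUS = {"cd20": "CD20", "bcl-2": "BCL2"}

def query_seer(term: str) -> Optional[str]:
    """Placeholder for the SEER API lookup made via params.py.

    Returns the matched biomarker name, or None when the lookup fails.
    In this offline sketch the remote call is assumed to fail.
    """
    return None

def resolve_marker(term: str) -> Optional[str]:
    # i)-ii) ask the SEER database first
    hit = query_seer(term)
    if hit is not None:
        return hit
    # iii) fall back to the in-house thesaurus
    hit = IN_HOUSE_THESAURUS.get(term.lower())
    if hit is not None:
        return hit
    # iv) no match anywhere: leave the eCRF data field uninitialized
    return None
```

For example, `resolve_marker("BCL-2")` resolves through the thesaurus, while an unrecognized term returns `None` and the corresponding eCRF field stays empty.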
B. IHC MARKERS (POSITIVITY/NEGATIVITY). A marker was assumed positive if the nearest adjective/noun reported on its left was "positivo/positività" [positive/positivity] or if a '+' (plus) was appended to the marker. A marker was assumed negative if the nearest adjective/noun reported on its left was "negativo/negatività" [negative/negativity] or if a '-' (dash) was appended to the marker.
FISH MARKERS. ARGO sought FISH markers (MYC, BCL2, BCL6, and CYCLIN D1) and whether they were positive or negative, applying the same rule: positive if the nearest adjective/noun on the left was "positivo/positività" or a '+' (plus) was appended to the marker, and negative if the nearest adjective/noun on the left was "negativo/negatività" or a '-' (dash) was appended to the marker.
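A minimal illustration of this positivity/negativity rule, assuming simplified patterns (the regular expressions and word lists are sketches, not ARGO's actual NLP rules):

```python
import re
from typing import Optional

POSITIVE_WORDS = {"positivo", "positività"}
NEGATIVE_WORDS = {"negativo", "negatività"}

def marker_status(text: str, marker: str) -> Optional[str]:
    """Classify a marker as 'positive'/'negative' using an appended
    '+'/'-' sign or the nearest preceding adjective/noun."""
    # appended sign: e.g. "BCL2+" or "MYC -"
    m = re.search(re.escape(marker) + r"\s*([+-])", text)
    if m:
        return "positive" if m.group(1) == "+" else "negative"
    # otherwise scan the words on the left of the marker, nearest first
    idx = text.find(marker)
    if idx == -1:
        return None
    for word in reversed(text[:idx].split()):
        w = word.lower().strip(",.;:")
        if w in POSITIVE_WORDS:
            return "positive"
        if w in NEGATIVE_WORDS:
            return "negative"
    return None
```

On Italian-style report snippets such as "positività per CD20" or "BCL2+", this sketch returns "positive"; "negativo per MYC" returns "negative".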
E. CELL OF ORIGIN. ARGO sought in the report the words "Germinal Center B-like" or "GCB". The cell of origin (COO) was assumed negative if the nearest word reported on the left of the COO term was "non" or "no".
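The COO negation rule could be sketched as follows (an illustrative simplification, not ARGO's actual code):

```python
from typing import Optional

def coo_is_gcb(text: str) -> Optional[bool]:
    """Detect 'Germinal Center B-like'/'GCB' and negate the finding when
    the nearest word on its left is 'non' or 'no'. Returns None when the
    cell of origin is not mentioned at all."""
    t = text.lower()
    for phrase in ("germinal center b-like", "gcb"):
        idx = t.find(phrase)
        if idx == -1:
            continue
        left_words = t[:idx].split()
        negated = bool(left_words) and left_words[-1] in ("non", "no")
        return not negated
    return None
```

For instance, "fenotipo non GCB" [non-GCB phenotype] yields a negative COO, while "profilo GCB" [GCB profile] yields a positive one.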
F. DIAGNOSIS. i. ARGO prompted the diagnosis to the SEER database via API key (via params.py); ii. the SEER database responded providing the corresponding diagnosis; iii. if either step i) or ii) failed, ARGO internally prompted the diagnosis to the "in-house" thesaurus (Supplementary Table S2); iv. if step iii) also failed, the corresponding diagnosis field of the eCRF was not initialized.
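Once every data field is resolved, the record can be pushed into the REDCap eCRF through REDCap's standard record-import API. A hedged sketch using only the standard library; the URL, token and field names are placeholders, not those of the actual project:

```python
import json
from urllib import parse, request

def build_redcap_payload(token: str, record: dict) -> bytes:
    """URL-encoded body for REDCap's 'record' import API action."""
    fields = {
        "token": token,          # project-specific API token
        "content": "record",     # record import/export action
        "format": "json",
        "type": "flat",          # one row per record
        "data": json.dumps([record]),
    }
    return parse.urlencode(fields).encode()

def push_record(api_url: str, token: str, record: dict) -> int:
    """POST one record to a REDCap project; returns the HTTP status code."""
    req = request.Request(api_url, data=build_redcap_payload(token, record))
    with request.urlopen(req) as resp:  # network call, not executed here
        return resp.status

# Hypothetical eCRF record mirroring the data fields described above
record = {
    "record_id": "1",
    "biopsy_date": "2020-01-15",
    "specimen_type": "lymph node",
    "diagnosis": "Diffuse large B-cell lymphoma",
}
# push_record("https://redcap.example.org/api/", "API_TOKEN", record)
```

The flat JSON layout matches REDCap's data dictionary model, where each key is a field name defined in the project (cf. Supplementary Table S5).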
A) The pseudocode describes all logical phases executed by ARGO in recognizing each data field from the header and disease sections of a paper-based report. B) Application of each NLP phase to an example of a paper-based report from the internal series (Pathology Unit of the IRCCS Istituto Tumori "Giovanni Paolo II" of Bari, Italy).