A novel automated label data extraction and database generation system for herbarium specimen images using OCR and NER

Digital extraction of label data from natural history specimens, along with more efficient procedures for data entry and processing, is essential for improving documentation and global information availability. Herbaria have recently made great advances in this direction. In this study, using optical character recognition (OCR) and named entity recognition (NER) techniques, we have made further progress towards fully automatic extraction of label data from herbarium specimen images. The system can be developed and run on a consumer-grade desktop computer with standard specifications, and can also be applied to extracting label data from diverse kinds of natural history specimens, such as those in entomological collections. It can thus facilitate the digitization and publication of natural history museum specimens around the world.

Natural Language Processing (NLP) is a technology aimed at determining and analyzing contextual nuances of language within a document. A major part is automated "information extraction", by which specific text is recognized and presented in a structured form 22. Named Entity Recognition (NER) 23 is a technique that extracts named entities such as place names, proper nouns, and dates from unstructured text. Since the information recorded with specimens in natural history collections can be tagged as named entities, such as scientific names, localities, collector names, and dates, we considered labeling the data extracted by OCR with specific named entity codes from which a database would automatically be created.
The idea of applying natural language processing technology to biodiversity science for automatically extracting specific information from various kinds of documents had previously been proposed 24. A feasibility study has also been conducted to determine whether label data from natural history collections can be extracted semi-automatically from digital images of plant specimens using OCR and natural language processing 25. Using 250 scanned sample images along with label data manually input by humans (the gold standard), an attempt was made to extract country names, personal names, and place names from texts extracted from the sample images using a system trained with Stanford NER (The Stanford Natural Language Processing Group 2018). That research tested the feasibility of using OCR and NER to extract and structure specimen label information, but a corresponding system was not actually developed at that time.
We here report the successful development of a system that automatically generates a specimen database by first creating a corpus of herbarium specimen labels and then structuring the text data obtained from specimen images using OCR and NER.

Results
When we searched the text for the parts that matched the scientific name dictionary and the place name dictionary, fewer than half of the parts could be matched (Table 1). We therefore decided to extract named entities using machine learning. Table 2 shows the results of k-fold cross validation (k = 5, 10) for the three natural language processing libraries BERT, Albert, and SpaCy. Among them, SpaCy had the highest F-value for both k = 5 and k = 10, followed by BERT by a very small margin, with Albert giving the lowest result. Although the difference between BERT and SpaCy is negligible, we decided to use SpaCy for subsequent application development for versatility, since SpaCy runs on ordinary desktop machines or servers, whereas BERT requires a GPU.
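The F-value used to compare the libraries is presumably the standard entity-level F1 score (the harmonic mean of precision and recall over predicted versus gold entity spans). A minimal sketch, with illustrative spans rather than real evaluation data:

```python
# Sketch of an entity-level F-value (F1 score), assuming the standard
# definition over (label, start, end) entity spans. The spans below are
# hypothetical examples, not actual evaluation data from this study.

def f_value(predicted, gold):
    """Entity-level F1: harmonic mean of precision and recall."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # exact span-and-label matches
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("SCIENTIFIC_NAME", 0, 13), ("COLLECTOR", 44, 53)}
pred = {("SCIENTIFIC_NAME", 0, 13), ("PREFECTURE", 32, 43)}
print(f_value(pred, gold))  # precision 0.5, recall 0.5 -> 0.5
```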
Table 3 shows the results of k-fold cross validation with three types of data (manual corpus data, manual + 10,000 artificial data, and 10,000 artificial data only) given to SpaCy. The best results were obtained when the manual corpus + artificial data were given, followed by the manual corpus data only; the lowest F-value was obtained when artificial data only were given. A demonstration of the developed application can be viewed at https://youtu.be/2jt_GMUqrWQ.
Label data entry using the new data entry system resulted in 19-20 labels per hour on average. As the operators became more familiar with the system, they were able to input data even faster.
Interested users who would like to test the system for a limited amount of time can apply to the corresponding author by e-mail to receive a password for database access.

Discussion
We show that data augmentation indeed improves the results of our model. As seen in Table 3, the F-value was highest when the manual corpus plus artificial data was used, followed by training with the manual corpus alone. Machine learning requires a large amount of training data, but it is often difficult to prepare such data. In such cases, the use of artificial data is beneficial 26,27. In our case, artificial data alone could not substantially improve the learning effect, but data augmentation improved the results of the model. It is thus necessary to manually create a high-quality corpus, as well as appropriately generated artificial data, to improve the accuracy of the model.
Application of the new system developed and described here will help to increase the mobilization of herbarium specimens in Japan. Many systems for label data input by OCR have been developed in Europe and the USA [13][14][15][16]. However, it has so far been impossible to apply this technology in Japan, the reason being the uniqueness and complexity of Japanese specimen labels. Specimen labels produced by Japanese collectors are written in several languages/scripts, with the locality and collector's name in Japanese or in both Japanese and English, and the scientific name of the plant in Latin (Fig. 1). Automated text extraction therefore requires multilingual OCR, and it was necessary to develop a label data extraction system specifically for images of Japanese herbarium specimens. Japan has 178 active herbaria with a total of 15 million specimens 28. Although Japan has lagged behind other countries in digitizing herbarium specimens 10, specimen photography equipment developed by Takano et al. 2019 is now used at the University of Tokyo (TI), Kyoto University (KYO), the Osaka Museum of Natural History (OSA), and others, and specimen digitization is now performed in several large herbaria. However, for specimen mobilization it is essential to generate metadata as well as digital images. The automated label data input system developed here can help to promote metadata extraction and further advance the mobilization of herbarium specimens in Japan and beyond. While the application presented here is specifically designed for Japanese herbarium sheet labels, it can support automated label reading for any country or language if a suitable, well-trained NLP library is applied along with the corresponding dictionaries and data sets, since the code of the application is available on GitHub. Applying this system in other countries may help to promote the digitization of plant specimens all around the world.

Methods
Images of 20,000 herbarium specimens were prepared at the Museum of Nature and Human Activities, Hyogo (HYO), and used together with the database of scientific names of Japanese vascular plants maintained at HYO and the Japanese postal code database (https://www.post.japanpost.jp/zipcode/download.html). Labels for named entity recognition were set as shown in Table 4.

Named entity recognition (NER) using dictionary-based text matching
Using 10,000 sample images, we first tried a dictionary-based text matching method. This method extracts from the input text the parts that match entries in a dictionary and assigns them the classes of the matching dictionary entries. The label part was detected in each sample image 18, the text was extracted using Google OCR, and the text information was stored in a column.
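The dictionary-based matching step can be sketched as follows. This is a minimal illustration with hypothetical dictionaries; the actual system drew on HYO's scientific-name database and the Japan Post place-name data:

```python
# Minimal sketch of dictionary-based entity matching: every dictionary
# entry found as a substring of the OCR text is returned together with
# the class of its dictionary. Dictionaries here are tiny illustrative
# stand-ins for the real HYO / Japan Post databases.

def match_entities(text, dictionaries):
    """Return (class, entry) pairs for every dictionary entry found in text."""
    matches = []
    for entity_class, entries in dictionaries.items():
        for entry in entries:
            if entry in text:
                matches.append((entity_class, entry))
    return matches

dictionaries = {
    "scientific_name": ["Acer palmatum", "Quercus serrata"],
    "place_name": ["Hyogo", "Kyoto"],
}

ocr_text = "Acer palmatum Thunb. Mt. Rokko, Hyogo Pref., 12 May 1998"
print(match_entities(ocr_text, dictionaries))
# -> [('scientific_name', 'Acer palmatum'), ('place_name', 'Hyogo')]
```

As reported in Table 1, fewer than half of the text parts could be matched this way, which motivated the switch to machine-learning-based NER.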

Creation of a corpus
In order to perform NER based on classification by machine learning, we manually created a corpus of herbarium specimen labels used as test and training data. As a strategy, we gave top priority to covering the variation of specimen labels found in the 20,000 HYO specimen images. Usually, specimen labels are created by the person who collected the plants. The information written on the specimen labels is roughly the same: plant name, collection site, name of the collector, collection date, etc. However, the format of the labels differs depending on the collector (Fig. 1). For example, the collector's number may be at the top or at the bottom right of the label, and the collector's name may be at the bottom right or bottom left. As the format of the labels varied, we decided to let the machine learning models learn the various label formats in order to achieve accurate data extraction. Collectors tend to continue to create labels in the same format; therefore, we grouped the data by collector, ranked the groups randomly, and selected the first- to third-ranked ones as targets for creating learning data. In order to create an efficient corpus, we created a learning data creation tool (a web application) (Fig. 2). The application uses Laravel, a PHP framework, and runs both in a graphical user interface (GUI) and on the command line (CLI). The project files are available on GitHub (https://github.com/HajimeKonagai/HitohakuAI-Laravel). Laravel version 10.24.0 is required to use the application. Given a set of specimen images and metadata, the application can prepare teaching data that covers variations in label description style by grouping specimen images by collector and randomly selecting one or more images from each group. The image data after detection from each collector group can be found at the following URL: https://data.hitohaku-ai.jp/images.zip. Since the extraction was done randomly for each collector, the results obtained in this study will not be fully reproducible in a replication. In addition, specimens of endangered species were excluded to avoid disclosing information on their collection sites. The specimen images selected above were subjected to OCR. Based on our past experience comparing various OCR engines 17, we chose Google Cloud Vision OCR. The text resulting from OCR was manually annotated based on the tags in Table 4. The annotated data were used to generate training/teaching data in JSON format. From the number of collector names and the number of herbaria exchanging specimens with HYO, we estimated that there were about 1,000 kinds of label formats among the 20,000 specimens. We therefore annotated the named entities using an annotation tool and created 893 training data records.
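The conversion from annotated label text to JSON training data might look as follows. This is a hedged sketch: the helper name, tag names, and record layout are illustrative (loosely following the character-offset format SpaCy-style NER training expects), not the exact schema used by the tool:

```python
import json

# Hypothetical conversion of an annotated label into a training record:
# the raw OCR text plus character-offset entity spans. Tag names are
# illustrative stand-ins for the labels of Table 4.

def to_training_record(text, annotations):
    """annotations: list of (surface_string, tag); returns a JSON-ready dict."""
    entities = []
    for surface, tag in annotations:
        start = text.find(surface)  # first occurrence of the annotated span
        if start != -1:
            entities.append([start, start + len(surface), tag])
    return {"text": text, "entities": entities}

record = to_training_record(
    "Acer palmatum Thunb. Mt. Rokko, Hyogo Pref. Coll. T. Tanaka",
    [("Acer palmatum", "SCIENTIFIC_NAME"),
     ("Hyogo Pref.", "PREFECTURE"),
     ("T. Tanaka", "COLLECTOR")],
)
print(json.dumps(record, ensure_ascii=False))
```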

Data augmentation
Since the manual annotation method is labor-intensive, 10,000 artificial data records were generated as a supplement. Data augmentation was carried out for each item. Family names (Japanese and English), scientific names, and Japanese names were randomly extracted from the botanical name database of HYO. Each place name item (prefecture, city, etc.) was randomly extracted from the national address database of Japan Post or from the 20,000 input data of HYO, with equal probability. For the collection date, a random date was generated in various formats (e.g., March 28). Altitude is usually given either as a single value in meters or as a range from xx m to xx m (e.g., in the right center of Fig. 1, Takahashi 3911); therefore, a number was generated randomly between 0 and 8,000, and either a single notation or a range of xx to xx m was produced. Any additional memo information on the labels was not considered, as it is too diverse.
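The date and altitude generators described above can be sketched roughly as follows. The exact formats and probabilities are assumptions for illustration; the real system additionally drew names and places from the HYO and Japan Post databases:

```python
import random

# Illustrative sketch of artificial-data generation for two of the items:
# collection date (random date in varying formats) and altitude (a random
# value 0-8000 m, written either as a single figure or as a range).
# Month spellings and format choices are assumptions, not the exact set
# used in the study.

MONTHS = ["Jan.", "Feb.", "March", "Apr.", "May", "June",
          "July", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."]

def random_date(rng):
    """e.g. 'March 28, 1998' or '28 March 1998'."""
    month = rng.choice(MONTHS)
    day, year = rng.randint(1, 28), rng.randint(1950, 2020)
    if rng.random() < 0.5:
        return f"{month} {day}, {year}"
    return f"{day} {month} {year}"

def random_altitude(rng):
    """Single notation ('350 m') or a range ('350-420 m'), within 0-8000 m."""
    low = rng.randint(0, 8000)
    if rng.random() < 0.5:
        return f"{low} m"
    high = rng.randint(low, 8000)
    return f"{low}-{high} m"

rng = random.Random(0)  # seeded for repeatability
print(random_date(rng), "|", random_altitude(rng))
```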

Evaluation of three NLP libraries
We prepared an application for machine learning (https://github.com/HajimeKonagai/HitohakuAI-Python), in which we trained three Natural Language Processing (NLP) libraries and used k-fold cross validation to measure the generalization ability of each algorithm and determine which one to adopt. After loading the pre-trained Japanese data into the three NLP libraries (SpaCy, BERT, and Albert), the three algorithms were evaluated using k-fold cross validation: the data are divided into k groups, each group serves once as the test data with the remaining data as training data, and the accuracy of the resulting model is measured. The k-fold cross validation was run with k = 5 and k = 10. The pre-trained data and the model sizes at the time of parameter storage were as follows: 1. SpaCy: 'ja-ginza' (5.55 MB); 2. BERT: 'cl-tohoku/bert-base-japanese-whole-word-masking' (419 MB); 3. Albert (a lightweight variant of BERT): 'ken11/albert-base-japanese-v1-with-japanese-tokenizer' (43.4 MB).
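The k-fold scheme described above can be sketched in a few lines (model training itself is stubbed out; this only shows how each of the k groups serves once as the held-out test set):

```python
# Minimal sketch of k-fold cross validation: split the corpus into k
# groups; each group is used once as test data while the rest is used
# for training. The corpus here is a stand-in list, not the real
# annotated label data.

def k_fold_splits(data, k):
    """Yield (train, test) partitions for k-fold cross validation."""
    fold_size = len(data) // k
    for i in range(k):
        start = i * fold_size
        end = (i + 1) * fold_size if i < k - 1 else len(data)
        test = data[start:end]
        train = data[:start] + data[end:]
        yield train, test

corpus = list(range(10))  # stand-in for the annotated labels
for train, test in k_fold_splits(corpus, 5):
    print(len(train), len(test))  # each fold: 8 training items, 2 test items
```

In the actual evaluation, each fold's model is trained on `train`, evaluated on `test`, and the F-values are averaged over the k folds.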
After deciding on the NLP library, we again performed k-fold cross validation (k = 5 and 10) with three types of data: 1. the manually generated corpus only, 2. the artificially generated corpus of 10,000 cases only, and 3. the manual corpus plus the artificially generated corpus of 10,000 cases.

Evaluation of the automated data entry application developed in this study
The speed of label data input using the label data entry system developed here was examined. We asked workers who normally input label data using the OCR-based label data input system 17 to input data using the newly developed system, and measured how many labels could be input per hour.

Figure 1. Different kinds of labels on herbarium specimen sheets at HYO.

Figure 2. Example of the application for labeling OCR-extracted label text. The text extracted from the specimen image in the top left-hand corner is displayed in the left-hand column. Labels suggested by the system are displayed in different colors. When labels are modified or changed in the text in the left-hand column, the changes are reflected in the right-hand column.

Table 1. Number of matched cases using OCR text extracted from 10,000 herbarium specimen images.

Table 2. Result of k-fold cross validation (k = 5, 10) applied to the three NLP libraries.

Table 3. Result of k-fold cross validation (k = 5, 10) applied to SpaCy with the manual corpus, artificial data, and combined data.

Table 4. Labels used in Named Entity Recognition (NER) in this study. An asterisk (*) indicates labels assigned by deep learning only.