A dataset of oracle characters for benchmarking machine learning algorithms

Wang, Mei; Deng, Weihong

doi:10.1038/s41597-024-02933-w

Data Descriptor
Open access
Published: 18 January 2024

Scientific Data volume 11, Article number: 87 (2024)

Abstract

Oracle bone script is an ancient Chinese writing system engraved on turtle shells and animal bones, serving as a valuable resource for interpreting ancient culture, history, and language. We introduce the Oracle-MNIST dataset, comprising 28 × 28 grayscale images of 30,222 ancient characters from 10 categories, designed for benchmarking pattern classification, with particular challenges related to image noise and distortion. The training set consists of 27,222 images, and the test set contains 300 images per class. Oracle-MNIST follows the same data format as the original MNIST dataset, enabling direct compatibility with all existing classifiers and systems, but it constitutes a more challenging classification task than MNIST. The images of ancient characters suffer from (1) severe and distinctive noise caused by three thousand years of burial and aging and (2) widely varying writing styles used by the ancient Chinese, both of which make them realistic for machine learning research.

Background & Summary

In the last few years, machine learning (ML) has progressed rapidly, driven in part by the release of specialized datasets that serve as experimental testbeds and public benchmarks and thus focus the efforts of the research community. The most widely known dataset in computer vision is the MNIST dataset, which was first introduced in 1998 by LeCun et al.¹. MNIST is a 10-class digit classification dataset consisting of 60,000 grayscale images for training and 10,000 grayscale images for testing. The entire dataset is relatively small, free to access and use, and encoded and stored in an entirely straightforward manner, all of which have almost certainly contributed to its widespread use.

However, with the discovery of improved learning algorithms, performance on MNIST has saturated. For example, Convolutional Neural Networks (CNNs)²,³ can easily achieve an accuracy above 99%. This is partly because the benchmark does not capture the requirements of many real-world scenarios. To avoid saturated performance and to challenge improved ML algorithms, several modified MNIST datasets have been constructed, e.g., EMNIST⁴ and Fashion-MNIST⁵. EMNIST extends the number of classes by introducing uppercase and lowercase letters, but the extra classes require changes to the deep neural network architectures used for MNIST. Fashion-MNIST contains 70,000 grayscale images of fashion products from 10 classes. These product images are taken from the website of Zalando, Europe’s largest online fashion platform (http://www.zalando.com). They are shot by professional photographers and are therefore clear and standardized. As a result, Fashion-MNIST fails to capture the wide range of variations found in the real world.

The purpose of this paper is to provide a realistic and challenging dataset, called Oracle-MNIST, to facilitate easy and fast evaluation for ML algorithms on the real-world images of ancient characters. Oracle-MNIST contains 30,222 images of oracle characters belonging to 10 categories.

Real-world challenge. Unlike handwritten digits, oracle characters are scanned from real oracle-bone surfaces. Oracle-MNIST therefore suffers from severe and distinctive noise caused by thousands of years of burial and aging, and contains various writing styles in each category, all of which make it more realistic and difficult for ML research.
Ease-of-use. Following the original MNIST, the images in Oracle-MNIST have 28 × 28 grayscale pixels. Because it shares the same data format, it is immediately compatible with any ML package capable of working with the MNIST dataset. In fact, the only change one needs to make to use this dataset is to change the URL from which the MNIST dataset is fetched.
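The URL swap described above can be sketched as follows. This is a minimal illustration, not the authors' loader: it assumes Oracle-MNIST reuses the four standard MNIST archive names, and both base URLs are hypothetical placeholders rather than real mirrors.

```python
from urllib.parse import urljoin

# The four standard MNIST archive names (assumed to be reused by Oracle-MNIST).
MNIST_FILES = {
    ("train", "images"): "train-images-idx3-ubyte.gz",
    ("train", "labels"): "train-labels-idx1-ubyte.gz",
    ("t10k", "images"): "t10k-images-idx3-ubyte.gz",
    ("t10k", "labels"): "t10k-labels-idx1-ubyte.gz",
}


def dataset_urls(base_url):
    """Map each (split, kind) pair to a download URL under base_url."""
    if not base_url.endswith("/"):
        base_url += "/"  # urljoin would otherwise replace the last path segment
    return {key: urljoin(base_url, name) for key, name in MNIST_FILES.items()}


# Hypothetical base URLs, for illustration only.
mnist = dataset_urls("https://example.org/mnist")
oracle = dataset_urls("https://example.org/oracle-mnist")

print(oracle[("t10k", "images")])
```

Everything downstream of `dataset_urls` stays the same for both datasets; only the base URL differs.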

We introduce this dataset, made specifically for machine learning research, to serve as a direct drop-in replacement for the original MNIST dataset and to engage the community with the field of ancient Chinese literature, contributing not only to technology but also to cultural heritage preservation and the understanding of oracle characters and ancient civilization.

Methods

Discovery of oracle characters

Ancient history relies on the study of ancient characters. As the oldest hieroglyphs in China, oracle characters⁶,⁷, with a history spanning nearly three millennia, have contributed greatly to modern civilization, enabling Chinese culture to be passed on from generation to generation and to become the only civilization to last up to the present. As shown in Fig. 1, oracle characters are engraved on tortoise shells and animal bones, and record the life and history of the Shang Dynasty (around 1600–1046 B.C.), including divination practices, war expeditions, hunting, medical treatments, and childbirth. They were first discovered by a merchant called Wang Xirong in 1899, during the Qing Dynasty (1644–1911). In the early 20th century, Chinese researchers excavated numerous oracle bones at Xiaotun Village in Anyang, Henan Province, the capital of the Shang Dynasty. Since then, research on oracle characters has attracted much attention. It is of vital importance for Chinese etymology and calligraphy, as well as for learning about the culture and history of ancient China and even the world.

Most oracle characters are stored as scanned images, which are produced by placing a sheet of paper over the oracle-bone surface and rubbing the paper with rolled ink, as shown in Fig. 2a. Recognizing these oracle characters is difficult for both experts and machines. Thus far, nearly 4,500 different oracle characters have been discovered, but only about 2,200 have been successfully deciphered. The reasons are as follows. (1) Abrasion and noise. Many oracle-bone inscriptions have been damaged over the centuries, and their texts are now fragmentary. The aging process has also made the inscriptions less legible, resulting in broken characters with serious noise. (2) Large variance. Different writing styles lead to a high degree of intra-class variance. Characters belonging to the same category vary widely in stroke and even topology, as shown in Fig. 2b. Some characters belonging to different categories are similar to each other, which makes recognition very difficult. For example, the characters of the ‘wood’ and ‘cattle’ categories differ only in small details, as shown in Fig. 2c,d. Providing the Oracle-MNIST benchmark to the ML community should facilitate research on oracle character recognition and help to address these challenges from the perspective of computer vision. We also hope that archaeologists and paleographers can benefit from the progress achieved by the ML community, so that their workload in identifying characters can be lightened.

Construction of Oracle-MNIST

Oracle-MNIST is based on the collection of the YinQiWenYuan website (http://jgw.aynu.edu.cn/ajaxpage/home2.0), a large oracle-bone platform constructed by AnYang Normal University. The raw images of oracle characters are collected from eight authoritative oracle-bone publications, e.g., Jiaguwen heji⁸. Oracle characters are then cropped from these raw images such that each cropped image is centered on a single character. Most of the per-character images have gray or black backgrounds and vary in resolution. Since these oracle characters are scanned from real oracle-bone surfaces, they are broken and suffer from serious noise. The meanings of the characters are used as their class labels. The labels are manually annotated by experts in archaeology or paleography.

To build Oracle-MNIST, we selected 30,222 images of commonly used characters from 10 classes. The selected images are then passed through the following conversion pipeline, which converts them to 28 × 28 pixel 8-bit grayscale images that match the characteristics of the digits in the MNIST dataset. An overview of the conversion process is visualized in Fig. 3.

Grayscaling. The original RGB images are converted to 8-bit grayscale pixels as shown in Fig. 3b.
Negating. Most of these scanned images contain white characters on black backgrounds, whereas a few consist of black characters on white backgrounds. For consistency, we negate the intensities of an image if its foreground is darker than its background, as shown in Fig. 3c. Negation is performed as p_new = 255 − p_old, where p_old and p_new are the intensity values before and after negating.
Resizing. With its aspect ratio preserved, the longest edge of the image is resized to 28 using a bicubic interpolation algorithm, as shown in Fig. 3d.
Extending. We extend the shorter edge to 28 by padding it with 0, and place the image at the center of the canvas. The range of intensity values is then scaled to [0, 255], resulting in the 28 × 28 pixel grayscale images shown in Fig. 3e.
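The negating, resizing and extending steps can be sketched in dependency-free Python as below. This is an illustration under stated assumptions, not the authors' code: the input is taken to be already grayscale (a list of rows of 8-bit values), nearest-neighbour sampling stands in for the bicubic interpolation described above, and the caller supplies a hypothetical `dark_foreground` flag because the text does not say how foreground polarity is detected.

```python
def negate(img):
    """Invert 8-bit intensities: p_new = 255 - p_old."""
    return [[255 - p for p in row] for row in img]


def resize_longest_edge(img, size=28):
    """Resize so the longest edge equals `size`, preserving the aspect ratio.

    Nearest-neighbour sampling is a stand-in for bicubic interpolation.
    """
    h, w = len(img), len(img[0])
    scale = size / max(h, w)
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    return [
        [img[min(h - 1, int(r * h / new_h))][min(w - 1, int(c * w / new_w))]
         for c in range(new_w)]
        for r in range(new_h)
    ]


def pad_to_square(img, size=28):
    """Zero-pad the shorter edge and center the image on a size x size canvas."""
    h, w = len(img), len(img[0])
    top, left = (size - h) // 2, (size - w) // 2
    canvas = [[0] * size for _ in range(size)]
    for r in range(h):
        canvas[top + r][left:left + w] = img[r]
    return canvas


def rescale(img):
    """Linearly stretch intensities to the full [0, 255] range."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]


def convert(gray, dark_foreground, size=28):
    """Negate if needed, resize the longest edge, pad to a square, rescale."""
    if dark_foreground:
        gray = negate(gray)
    return rescale(pad_to_square(resize_longest_edge(gray, size), size))


# Toy 4 x 2 image: a black stroke (0) on a white background (255).
toy = [[255, 0]] * 4
out = convert(toy, dark_foreground=True, size=4)
print(out[0])  # [0, 0, 255, 0]
```

In practice an imaging library would handle the grayscale conversion and the bicubic resize; the sketch only makes the order of operations and the centering arithmetic explicit.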

We also attempted to process the images with image enhancement techniques, e.g., gray stretch and histogram equalization. Although the visual quality of the images improved, recognition performance degraded slightly. Therefore, no image enhancement is applied to Oracle-MNIST. We also make the original RGB images available and leave the data processing to algorithm developers.

We chose to resize the images to a resolution of 28 × 28 to follow the same data format as the original MNIST dataset, ensuring direct compatibility with all existing classifiers and systems. However, considering that today’s hardware allows for deep learning to operate on a larger scale, we also provide a version with a resolution of 224 × 224.

Data Records

The Oracle-MNIST dataset contains 30,222 samples from 10 classes, where each class represents a unique oracle bone glyph character. Figure 4 gives a summary of all class labels in Oracle-MNIST with examples for each class. The dataset is divided into a training set and a test set, and we make sure that they are disjoint. The training set consists of 27,222 randomly selected images belonging to the 10 categories. It is class-imbalanced, reflecting how often each character appears in the source books, with between 2,328 and 3,399 examples per class. The test set contains the same 10 classes with 300 images per class.
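As a quick consistency check on the split sizes quoted above, the arithmetic can be written out as follows. All numbers come from the text; the imbalance ratio is simply the largest training class divided by the smallest.

```python
NUM_CLASSES = 10
TEST_PER_CLASS = 300
TRAIN_TOTAL = 27_222
LARGEST_TRAIN_CLASS = 3_399
SMALLEST_TRAIN_CLASS = 2_328

test_total = NUM_CLASSES * TEST_PER_CLASS        # size of the test set
dataset_total = TRAIN_TOTAL + test_total         # size of the whole dataset
mean_train_per_class = TRAIN_TOTAL / NUM_CLASSES
imbalance_ratio = LARGEST_TRAIN_CLASS / SMALLEST_TRAIN_CLASS

print(test_total, dataset_total, mean_train_per_class, round(imbalance_ratio, 2))
# 3000 30222 2722.2 1.46
```

The training set is therefore only mildly imbalanced, with the largest class about 1.46 times the size of the smallest.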

Oracle-MNIST can be accessed at Figshare⁹, Science Data Bank¹⁰, and GitHub (https://github.com/wm-bupt/oracle-mnist). We grant free access to the dataset, without the need for user registration. The dataset is distributed as GZIP archives with a total size of 13.8 MB. Images and labels are stored in the same IDX file format as the MNIST dataset, which is designed for storing vectors and multidimensional matrices. The resulting files are listed in Table 1. Although our test set consists of only 3,000 images, its files are named ‘t10k’ rather than ‘t3k’ for consistency with the original MNIST dataset, so that it remains compatible with any ML package.
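For readers unfamiliar with IDX, the sketch below reads a gzip-compressed IDX file using only the standard library. It is not the `mnist_reader.py` script shipped with the dataset; `read_idx` and `write_idx_ubyte` are hypothetical helper names, and the demo writes and reads back a tiny synthetic file rather than a real Oracle-MNIST archive. The layout assumed is the usual one for MNIST-style files: two zero bytes, a type code (0x08 for unsigned byte), the number of dimensions, then one big-endian 32-bit size per dimension, followed by the data.

```python
import gzip
import os
import struct
import tempfile

# IDX type code -> struct format character (0x08, unsigned byte, is what MNIST uses).
_IDX_FORMATS = {0x08: "B", 0x09: "b", 0x0B: "h", 0x0C: "i", 0x0D: "f", 0x0E: "d"}


def read_idx(path):
    """Read a gzip-compressed IDX file and return (dims, flat_values)."""
    with gzip.open(path, "rb") as f:
        data = f.read()
    zero1, zero2, type_code, ndim = struct.unpack_from(">BBBB", data, 0)
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX file: magic number must start with two zero bytes")
    dims = struct.unpack_from(">" + "I" * ndim, data, 4)
    count = 1
    for d in dims:
        count *= d
    fmt = ">" + str(count) + _IDX_FORMATS[type_code]
    values = struct.unpack_from(fmt, data, 4 + 4 * ndim)
    return dims, values


def write_idx_ubyte(path, dims, values):
    """Write unsigned-byte values as a gzip-compressed IDX file (used for the demo)."""
    header = struct.pack(">BBBB", 0, 0, 0x08, len(dims))
    header += struct.pack(">" + "I" * len(dims), *dims)
    with gzip.open(path, "wb") as f:
        f.write(header + bytes(values))


# Round-trip a synthetic 2 x 2 x 2 "image" file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "toy-images-idx3-ubyte.gz")
    write_idx_ubyte(demo_path, (2, 2, 2), range(8))
    dims, values = read_idx(demo_path)

print(dims, values[:4])  # (2, 2, 2) (0, 1, 2, 3)
```

For an MNIST-style image file, `dims` would be `(N, 28, 28)` and image `i` occupies `values[i * 784:(i + 1) * 784]`; a label file has the single dimension `(N,)`.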

Table 1 Files contained in the Oracle-MNIST dataset.

The 224 × 224 images are also available at Figshare⁹ and Science Data Bank¹⁰. The original RGB images can be downloaded from GitHub. These images are also split into a training set and a test set. All images are in BMP format and are grouped into folders named 0–9 after the class labels, so that images in the same folder belong to the same category. Each image is named ‘******_#.bmp’, where ‘******’ is the 6-digit class code provided by the YinQiWenYuan website and ‘#’ is the image name.
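The naming scheme above can be parsed with a short helper such as the following sketch. The function name `parse_image_path`, the `train/3/` directory layout and the example code `123456` are all hypothetical; only the `<6-digit code>_<name>.bmp` pattern and the 0–9 folder labels come from the text.

```python
import re
from pathlib import PurePosixPath

# '<6-digit class code>_<image name>.bmp', as described in the text.
_NAME_PATTERN = re.compile(r"^(?P<code>\d{6})_(?P<name>.+)\.bmp$")


def parse_image_path(path):
    """Return (folder_label, class_code, image_name) for '<label>/<code>_<name>.bmp'."""
    p = PurePosixPath(path)
    match = _NAME_PATTERN.match(p.name)
    if match is None:
        raise ValueError(f"unexpected file name: {p.name}")
    return int(p.parent.name), match.group("code"), match.group("name")


# Hypothetical path, for illustration only.
print(parse_image_path("train/3/123456_42.bmp"))  # (3, '123456', '42')
```

The class code is kept as a string so that leading zeros in the 6-digit code are preserved.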

Technical Validation

We evaluate several algorithms with different parameters on Oracle-MNIST and report the results in Tables 2–10. For each algorithm, the reported classification accuracy is the average over three repeated experiments. Benchmarks on the MNIST and Fashion-MNIST datasets are also included for a side-by-side comparison.
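The reporting protocol (classification accuracy averaged over three repeated experiments) amounts to the computation sketched below. The labels and predictions are made-up toy values chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy ground truth and predictions from three hypothetical runs.
y_true = [0, 1, 2, 3]
runs = [
    [0, 1, 2, 3],  # all correct
    [0, 1, 2, 0],  # one mistake
    [0, 1, 0, 0],  # two mistakes
]

per_run = [accuracy(y_true, y_pred) for y_pred in runs]
average_accuracy = mean(per_run)

print(per_run, average_accuracy)  # [1.0, 0.75, 0.5] 0.75
```

An error rate, as quoted later for the CNN, is simply one minus the accuracy.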

Table 2 Benchmark results using CNN on Oracle-MNIST, Fashion-MNIST and MNIST.

Table 3 Benchmark results using GradientBoostingClassifier on Oracle-MNIST, Fashion-MNIST and MNIST.

Table 4 Benchmark results using SVC on Oracle-MNIST, Fashion-MNIST and MNIST.

Table 5 Benchmark results using MLPClassifier on Oracle-MNIST, Fashion-MNIST and MNIST.

Table 6 Benchmark results using RandomForestClassifier on Oracle-MNIST, Fashion-MNIST and MNIST.

Table 7 Benchmark results using KNeighborsClassifier on Oracle-MNIST, Fashion-MNIST and MNIST.

Table 8 Benchmark results using LogisticRegression on Oracle-MNIST, Fashion-MNIST and MNIST.

Table 9 Benchmark results using SGDClassifier on Oracle-MNIST, Fashion-MNIST and MNIST.

Table 10 Benchmark results using LinearSVC on Oracle-MNIST, Fashion-MNIST and MNIST.

From the results, we make the following observations. First, classic (shallow) ML algorithms can easily achieve 97% accuracy on the MNIST dataset, which suggests that MNIST is too easy to discriminate between algorithms. Our Oracle-MNIST dataset provides 10-class images of ancient characters and captures a wider range of real-world variation, posing a more challenging classification task than the MNIST digits data and the Fashion-MNIST data. All classic (shallow) ML algorithms perform best on MNIST, followed by Fashion-MNIST, and worst on Oracle-MNIST. For example, the random forest classifier achieves accuracies of 97.1%, 87.1% and 64.9% on MNIST, Fashion-MNIST and Oracle-MNIST, respectively. This is because the high degree of intra-class variance and inter-class similarity described above makes classification very difficult. Moreover, the scanned oracle images are seriously degraded by blur, noise and occlusion, and some have completely lost their discriminative glyph information.

Second, the CNN outperforms all of the classic (shallow) ML algorithms on Oracle-MNIST. Benefitting from local receptive fields and spatial or temporal subsampling, a CNN can force the extraction of local features and reduce the sensitivity of the output to shifts and distortions¹¹. As a result, real-world challenges such as different writing styles, noise, and occlusion can be tackled to some extent, leading to better performance on oracle characters¹². However, performance on Oracle-MNIST has not saturated. The CNN used in this paper achieves an error rate of 6.2% on Oracle-MNIST, and there is still room for improvement. Despite the powerful representation ability of CNNs, the problem of recognizing these ancient characters remains to be fully solved.

Usage Notes

We provide a Python script, mnist_reader.py, that can be used to read the images and labels from the Oracle-MNIST files. It is provided together with the dataset on GitHub. Since Oracle-MNIST is converted to a format that is directly compatible with classifiers built to handle the MNIST dataset, the only change one needs to make to use this dataset is to change the URL from which the MNIST dataset is fetched. We also provide a Python script, train_pytorch.py, to enable researchers to reproduce the results of the CNNs used in this paper.

Code availability

Oracle-MNIST is freely available online at GitHub (https://github.com/wm-bupt/oracle-mnist). Tutorials for loading the dataset and code for training and testing oracle character recognition models are also publicly available without restriction.

References

1. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
2. Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, 1–9 (2012).
3. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
4. Cohen, G., Afshar, S., Tapson, J. & Van Schaik, A. EMNIST: extending MNIST to handwritten letters. In Proceedings of the international joint conference on neural networks, 2921–2926 (2017).
5. Xiao, H., Rasul, K. & Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. Preprint https://arxiv.org/abs/1708.07747 (2017).
6. Flad, R. K. et al. Divination and power: a multiregional view of the development of oracle bone divination in early china. Current Anthropology 49, 403–437 (2008).
7. Keightley, D. N. Graphs, words, and meanings: three reference works for shang oracle-bone studies, with an excursus on the religious role of the day or sun. Journal of the American Oriental Society 117, 507–524 (1997).
8. Guo, M. & Hu, H. Jiaguwen heji: the comprehensive dictionary of oracle characters (Zhonghua Book Company, Beijing, China, 1978).
9. Wang, M. & Deng, W. A dataset of oracle characters for benchmarking machine learning algorithms. Figshare https://doi.org/10.6084/m9.figshare.c.6786852.v1 (2024).
10. Wang, M. & Deng, W. A dataset of oracle characters for benchmarking machine learning algorithms. Science Data Bank https://doi.org/10.57760/sciencedb.10146 (2023).
11. LeCun, Y. & Bengio, Y. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361, 1–14 (1995).
12. Chen, S., Xu, H., Weize, G., Xuxin, L. & Bofeng, M. A classification method of oracle materials based on local convolutional neural network framework. IEEE computer graphics and applications 40, 32–44 (2020).

Acknowledgements

This work was supported by China Postdoctoral Science Foundation under Grant 2022M720517 and National Natural Science Foundation of China under Grant 62236003 and 62306043.

Author information

Authors and Affiliations

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Mei Wang & Weihong Deng

Authors

Mei Wang
Weihong Deng

Contributions

Mei Wang collected the data, conducted the experiments, and drafted the manuscript. Weihong Deng guided the content write-up and revision of the manuscript before submission.

Corresponding author

Correspondence to Weihong Deng.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wang, M., Deng, W. A dataset of oracle characters for benchmarking machine learning algorithms. Sci Data 11, 87 (2024). https://doi.org/10.1038/s41597-024-02933-w

Received: 14 August 2023
Accepted: 05 January 2024
Published: 18 January 2024
DOI: https://doi.org/10.1038/s41597-024-02933-w