Background & Summary

Oracle bone script, etched onto turtle shells and animal bones, is one of the earliest forms of writing discovered in China (see Fig. 1). These inscriptions, dating back about 3,000 years, offer a window into the human geography of the Shang Dynasty (1600 BCE - 1046 BCE), an ancient feudal dynasty that ruled in the Yellow River valley. Their content covers a range of topics including astrology, meteorology, animal husbandry, religion, and ritual practices1,2. As with other ancient scripts, the meanings of many Oracle Bone Characters (OBCs) have been lost over time. The roughly 160,000 pieces unearthed so far contain more than 4,600 distinct OBCs, yet only about a thousand of these have been deciphered, with their meanings and corresponding modern Chinese characters confirmed3.

Fig. 1

3,000-year-old Shang Dynasty oracle bones. They were all unearthed at Huayuanzhuang East in Yinxu, China, and are currently housed at the Anyang workstation of the Institute of Archaeology, Chinese Academy of Social Sciences. These oracle bones date back to the reign of King Wu Ding of the Shang Dynasty.

The goal of deciphering these ancient inscriptions is to translate OBCs into modern Chinese, mapping each OBC one-to-one to a modern character with the same meaning. However, the deciphering task at the character level is complicated by several factors. Historically, the methods of preservation and excavation were not always ideal, so many of the oracle bones were damaged. This damage often results in partial, unclear, or illegible inscriptions that are arduous to interpret. Therefore, most of the images used in current OBC research are either scanned images that have undergone denoising and other processing or manually transcribed images. In addition, as an early writing system, oracle bone script underwent significant evolution. The form of characters varies considerably, with many characters appearing in multiple, sometimes radically different forms4 that nevertheless correspond to the same Chinese character. This variability adds another layer of complexity to the deciphering process. Together, these factors make a full understanding of OBCs both challenging and rare, attracting the keen interest of scholars and historians in the field of ancient Chinese studies.

The past decades have witnessed the widespread application of Artificial Intelligence (AI) in various fields. Notably, the success of handwritten text recognition (HTR)5,6,7 technology in processing modern texts has sparked interest in using AI to aid in deciphering OBCs. Modern AI algorithms, particularly those built on deep learning models such as artificial neural networks, typically require a large volume of training data. With enough data, such models can match and sometimes surpass human-level performance in specific tasks, as AlphaGo did when it defeated the world champion at Go8. A fundamental step towards employing these models to decipher OBCs is therefore the creation and annotation of a comprehensive, high-quality dataset of OBCs. In such a dataset, each OBC is labeled with its modern counterpart, and all OBCs sharing the same label form one category. There have been some pioneering efforts in this area. For instance, Li et al.9 built the HWOBC dataset by engaging experts from diverse academic backgrounds to handwrite OBCs. Fu et al.10 and Yue et al.11 subsequently proposed the OBI-100 and OBI-125 datasets, with the OBC images collected from books related to OBC research. Additionally, Guo et al.12 collected more than 20k OBC images from various websites to build the Oracle-20k dataset. These efforts lay a solid foundation for the digitization and recognition of OBCs. In addition, Li et al.13 created the Oracle Bone Inscriptions Multi-modal Dataset (OBIMD), but this dataset focuses on entire rubbings and lacks rich data on individual OBCs. However, these datasets have limitations that hinder their use in AI-assisted OBC decipherment:

  • They often have limited categories and samples of OBCs due to data collection from a single source.

  • The annotation of the categories might not be deduplicated. As shown in Table 2, the same OBCs are categorized into different classes.

    Table 1 Comparison of the scale of HUST-OBC with other oracle bone character datasets.
    Table 2 Comparison of oracle bone character annotations using Modern and Variant characters as category labels.
  • The lack of cross-validation from multiple sources casts doubt on the accuracy of some data.

  • The datasets comprise only deciphered oracle bone images, making them unsuitable for deciphering tasks.

  • Some datasets contain unprocessed images, filled with noise or blur.

To address these issues, we propose the high-quality HUST-OBC14 dataset. HUST-OBC was collected from three types of sources: books, websites, and existing databases. It includes two types of OBC sample images: a) OBC images obtained from processed scans of rubbings of the original oracle bones; and b) handwritten OBC images based on the original oracle bones, further subdivided into traced images based on rubbings and manually drawn images based on the glyphs. To build HUST-OBC, we designed a semi-automatic pipeline that collects and annotates data from the various sources, and had OBC experts review the dataset. As shown in Table 1, HUST-OBC contains over 10k deciphered and undeciphered OBC categories and more than 140k images, making it one of the largest datasets for OBC recognition and deciphering to date. We hope HUST-OBC will aid and inspire future AI-assisted OBC research.

Methods

To construct a diverse dataset, we gathered images of OBCs from three distinct types of sources: books, websites, and databases. To organize and merge data from these varied origins, as shown in Fig. 2, we designed a semi-automated pipeline comprising four key steps: Data Acquisition, Automatic Annotation, Data Integration, and Data Validation. This section details each step.

Fig. 2

Flowchart of building the HUST-OBC dataset.

Data acquisition

OBCs were inscribed on turtle shells and animal bones and buried underground for over 3,000 years. These precious artifacts are dispersed in museums and private collections worldwide, where they are meticulously preserved, making direct access to the text inscribed on the original oracle bones quite challenging. Thankfully, most of the publicly available oracle bones have been transcribed by experts, making them accessible in various forms for scholarly research. Specifically, for most authoritative books or websites, the images are processed or traced based on rubbings of oracle bones by experts. Building on this, HWOBC hired experts to manually draw each oracle bone character glyph, thus expanding the dataset with handwritten OBCs. As illustrated in Table 3, the HUST-OBC dataset is constructed by gathering data from these diverse sources.

Table 3 Statistics of oracle bone character data collected from various sources.

To ensure the diversity of the dataset, the HUST-OBC was built using data collected from various sources, including books, websites, and databases. As shown in Fig. 2, we designed specific pipelines for each data source to process and extract OBC images and their corresponding labels, detailed as follows.

Books

Books remain the predominant medium for documenting OBCs, with most characters discovered to date collected and interpreted in volumes like the New Compilation of Oracle Bone Scripts, which ensures accuracy by incorporating the latest research in this field. Specifically, we used the following books as data sources while constructing the HUST-OBC dataset.

  A. New Compilation of Oracle Bone Scripts (新甲骨文编 https://books.google.com/books?id=S0RergEACAAJ)15 encompasses samples of the OBCs found since their initial discovery, as presented in all public materials.

  B. Oracle Bone Script: Six Digit Numerical Code (甲骨文六位数字码检索字库 https://books.google.com/books?id=pgvaxQEACAAJ)16 assigns digital codes to OBCs, annotating each code with its corresponding oracle bone character, modern Chinese character form, provenance, and other relevant details.

Since these books do not provide electronic databases or original image data, we manually scanned the pages of these books, obtaining 1,054 and 700 pages from books A and B, respectively. An example of scanned pages is presented in Fig. 3(a).

Fig. 3

Extraction of OBC images from books. New Compilation of Oracle Bone Scripts (left)15 and Oracle Bone Script: Six Digit Numerical Code (right)16.

Websites

With the widespread adoption of the Internet, websites have emerged as an alternative for hosting oracle bone data, offering more convenient retrieval capabilities. We have designed a web crawler program to collect data from the following websites:

  C. GuoXueDaShi (国学大师 https://www.guoxuedashi.net/jgwhj/) is a website initiated and maintained by enthusiasts of Chinese classical studies that hosts various historical texts, including dictionaries and histories. A screenshot of the website is shown in Fig. 4(b).

    Fig. 4

    Screenshots of example website pages from YinQiWenYuan and GuoXueDaShi.

  D. YinQiWenYuan (殷契文渊 https://jgw.aynu.edu.cn) is a data platform maintained by the Key Laboratory of Oracle Bone Inscriptions Information Processing that archives various types of data, including photos of the original oracle bones, transcribed characters, and related research articles. A screenshot of the website is shown in Fig. 4(a).

These websites feature well-organized collections of OBC images, which have been meticulously scanned, cropped, and aligned. They are systematically categorized across various web pages, facilitating the use of web crawler technology to download these images in a categorized format efficiently.

Databases

In recent years, the digitalization of ancient manuscripts and advancements in handwritten text recognition technology have opened new avenues in the study of OBC, which has led to the proposal of relevant datasets. We have included the following databases in HUST-OBC.

  E. HWOBC (https://jgw.aynu.edu.cn/home/down/detail/index.html?sysid=2) is a database specifically designed for the study and recognition of handwritten OBCs9. Whereas the books and websites above process or trace rubbings, HWOBC aimed at more extensive handwritten samples: experts were hired to manually draw each character glyph on a 400 × 400 pixel white background using a PC or smartphone and then upload the drawings, yielding a set of 83,245 handwritten OBC images.

Automatic annotation

Through data acquisition, we have gathered raw data from diverse sources. However, this data, in its current format, is not immediately usable. Hence, it necessitates further processing, including tasks such as cropping, annotating, and filtering.

Books

The raw data for the books consists of scanned images of pages, each displaying several OBCs along with their corresponding annotations in modern standard Chinese. As shown in Fig. 3(a), despite differences in the layout of the New Compilation of Oracle Bone Scripts (left) and Oracle Bone Script: Six Digit Numerical Code (right), they both employ a table-like vertical format. This arrangement facilitates the use of computer vision algorithms, such as edge detection, to automatically extract content from these pages. Specifically, as shown in Fig. 3(b), we employed edge detection and other techniques from the OpenCV toolkit17 to crop the scanned pages into individual slices, one per oracle bone character. These slices are then grouped according to the layout rules, with each group assigned a category ID. For example, as illustrated on the left side of Fig. 3(b), the top of each column in the book is marked with the modern Chinese character equivalent to the OBC. If a column lacks such a marking, it belongs to the same category as the adjacent column on its right. In the figure, we used different colors of dashed lines to distinguish between categories. Using this method, we extracted 24,558 and 14,053 OBC images from books A and B, respectively.
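The column rule described above (an unmarked column inherits the category of the marked column to its right) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `assign_column_categories`, the right-to-left input order, and the per-page numbering are hypothetical and not taken from the authors' pipeline, and a column that continues a category from the previous page is not handled.

```python
from typing import List, Optional


def assign_column_categories(headers: List[Optional[str]]) -> List[int]:
    """Assign a page-local category ID to each column of one scanned page.

    `headers` lists the columns from right to left. An entry is the header
    mark cropped from the top of that column, or None when the column has
    no mark and therefore continues the category of the column on its right.
    """
    category_ids: List[int] = []
    current = -1  # no category has been opened yet
    for mark in headers:
        if mark is not None:
            current += 1  # a marked column opens a new category
        elif current < 0:
            # Assumption of this sketch: the rightmost column is always marked.
            raise ValueError("the rightmost column must carry a header mark")
        category_ids.append(current)
    return category_ids


# Columns listed right to left: two columns under the first header,
# one under the second, three under the third.
page = ["A", None, "B", "C", None, None]
print(assign_column_categories(page))  # [0, 0, 1, 2, 2, 2]
```

Every slice cropped from a column would then carry that column's category ID, which is what allows the header mark to be recognized once per category rather than once per slice.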

Although the slices are grouped during the cropping process, the modern Chinese character corresponding to each category remains unknown. A straightforward way to determine the character for each category is to use OCR techniques to recognize the marks in the books. However, most off-the-shelf OCR engines were trained only on commonly used modern Chinese characters and struggle to recognize the uncoded Liding and unknown characters that may be present in these books. To address this issue, we trained a category assigner (see Fig. 5) to automatically identify these labels. The training procedure is detailed as follows:

  1. Training Data Generation: The Chinese character labels we need to identify are all in standard print typeface (as seen in the Chinese characters on the outside of the table at the top of Fig. 3(b), left side), and each cropped image contains only a single, individual character. Thus, we can conveniently generate corresponding training samples using a similar SimSun font. As shown in the block on the left side of Fig. 5, we generated font images for all realistic Chinese characters according to the Ideographic Description Sequence (IDS) and assigned each a unique category ID. Additionally, to address the recognition of uncommon characters that may appear in books, we randomly synthesized Liding text using components like radicals of Chinese characters to serve as the Liding category for training purposes. Practically, we generated one image for each of the total 88,899 Chinese characters included in the IDS and randomly synthesized α Liding character images in each training epoch.

  2. Training: Since each sample image contains only a single character, it is sufficient to train a simple classifier for recognition. For this purpose, we employed ResNet-5018 as the backbone network to train the classifier. Additionally, we utilized a weighted balanced cross-entropy loss L to address the issue of the imbalance in the number of training samples across different categories:

    $$L=-\frac{1}{N+1}\left[\sum_{i=1}^{N} y_{i}\log(p_{i})+\frac{1}{\alpha}\,y_{N+1}\log(p_{N+1})\right]$$
    (1)

    where N is the number of categories, $p_i$ is the predicted probability of the ith category, $y_i$ takes the value 1 if the true label is the ith category and 0 otherwise, index N+1 denotes the synthesized Liding category, and α is the number of synthesized Liding samples in each training epoch.

  3. Inference: During the inference phase, we input the Chinese character label images, which are cropped from the books, into the classification model trained in the second step. This process helps us determine the corresponding Chinese character for each category ID or whether it is an uncoded Liding character.

Fig. 5

Schematic of the proposed category assigner.

After completing the aforementioned procedures, all OBC images contained within the scanned pages of sourced books acquired during the data acquisition phase have been automatically extracted and accurately categorized according to their respective classes.
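To make the weighting in Eq. (1) concrete, the following minimal sketch evaluates the loss for a single sample with a one-hot target. It assumes, as the equation suggests, that the Liding class occupies the last of the N + 1 positions; the function name `weighted_ce` and the toy probabilities are hypothetical, and this is not the authors' training code.

```python
import math
from typing import Sequence


def weighted_ce(probs: Sequence[float], true_index: int, alpha: float) -> float:
    """Evaluate Eq. (1) for one sample whose target is one-hot.

    `probs` holds N + 1 predicted probabilities: N ordinary character
    classes followed by the Liding class at index N (0-based).
    """
    n = len(probs) - 1
    log_p = math.log(probs[true_index])
    if true_index == n:
        # The Liding term is scaled by 1 / alpha because alpha Liding
        # samples are synthesized in every epoch.
        log_p /= alpha
    return -log_p / (n + 1)


probs = [0.5, 0.25, 0.25]  # N = 2 ordinary classes plus the Liding class
print(f"{weighted_ce(probs, 0, alpha=4):.4f}")  # 0.2310
print(f"{weighted_ce(probs, 2, alpha=4):.4f}")  # 0.1155
```

With α = 4, a Liding sample contributes a quarter of the unweighted log-loss, so the α synthesized Liding images together weigh about as much as the single image available for each ordinary character.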

Websites & Databases

The images of the OBCs collected from websites and databases have already been preprocessed by scanning, cropping, and alignment. Therefore, there is no need to design automatic annotation algorithms for this data, unlike the approach required for data from book sources. However, the following essential processes are still required:

  1. Filtering: A portion of the data on the GuoXueDaShi website is contributed by enthusiasts of ancient Chinese culture, so its reliability cannot be fully guaranteed.

    Specifically, these OBC images are of higher resolution and quality than those from other sources, but their labels are unreliable. Currently, only about 1,500 categories of OBC have been deciphered, whereas the GuoXueDaShi website has 2,756 categories. This indicates that some undeciphered OBCs have been labeled by enthusiasts without expert verification. Consequently, in our filtering process, we cross-referenced these categories with other sources, which identified 1,390 categories of OBC images that were unique to GuoXueDaShi and could not be verified. We therefore retained only 1,366 of the initial 2,756 categories. The samples of the 1,390 excluded categories are not classified as either deciphered or undeciphered and are stored separately in the dataset.

  2. Code Matching: The OBC images from websites and databases are marked with specific codes, which we mapped to modern Chinese characters. For YinQiWenYuan and HWOBC, the HUST-OBC dataset includes only individual oracle bone characters, not compound characters, i.e., oracle bone characters that correspond to two or more words. Moreover, HWOBC is classified by character form, so multiple character forms can correspond to the same Chinese character; we merge these forms into a single category based on the corresponding Chinese character.

Integration

In the Data Acquisition and Automatic Annotation stages, images of OBCs from distinct sources were collected and annotated. However, the annotation conventions for a given OBC may vary depending on the source. For instance, as shown in Table 2, some sources use standard modern Chinese characters for annotation, while others prefer the corresponding Variant Chinese characters19 (https://en.wikipedia.org/wiki/Variant_Chinese_characters). As a result, images of OBCs that should belong to the same category are classified into different categories, creating redundant categories. Table 2 illustrates this with examples of duplicate annotations, where each row shows how the same OBC image is categorized differently under the Modern Character Category and the Variant Character Category. To eliminate these redundancies, we integrate the data from different sources. For this purpose, we trained a widely used unsupervised visual representation learning model, MoCo20, with OBC images from all sources. Subsequently, each OBC image was encoded into a feature vector by the model. As illustrated in Fig. 6, by calculating the similarity of these feature vectors, we merged similar samples into the same categories. In this way, we reduced the original 1,781 categories obtained from different sources to 1,588, eliminating redundant categories.
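The text does not specify the similarity measure or merging rule, so the following is only one plausible sketch: it assumes each category is summarized by a mean feature vector, compares categories by cosine similarity, and merges any pair above a threshold with a union-find structure. The function names, the threshold value, and the two-dimensional toy vectors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def merge_similar_categories(
    centroids: Dict[str, Sequence[float]], threshold: float
) -> List[List[str]]:
    """Group categories whose centroid similarity reaches `threshold`."""
    names = sorted(centroids)
    parent = {name: name for name in names}

    def find(x: str) -> str:
        # Follow parent links to the group representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(centroids[a], centroids[b]) >= threshold:
                parent[find(b)] = find(a)

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


toy = {
    "modern_1": [1.0, 0.0],
    "variant_1": [0.99, 0.1],  # nearly parallel to modern_1
    "modern_2": [0.0, 1.0],
}
print(merge_similar_categories(toy, threshold=0.95))
# [['modern_1', 'variant_1'], ['modern_2']]
```

In practice, candidate merges produced this way would still need the expert review described in the Validation step, since visually similar glyphs are not always the same character.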

Fig. 6

Oracle bone character images classified into redundant categories exhibit higher similarity in the feature vector space.

Validation

After undergoing all the procedures, we obtained a preliminary dataset. However, due to potential errors that might occur in the automated data acquisition and annotation process, we enlisted the expertise of OBC scholars from Anyang Normal University to meticulously review our dataset. Using authoritative books and the HWOBC database fonts as reference standards, they compared and evaluated OBC data in the HUST-OBC, discarding samples with errors and retaining the relatively accurate ones. This review produced the HUST-OBC dataset.

Data Records

The HUST-OBC14 comprises a total of 140,053 images sourced from five different origins, divided into deciphered and undeciphered sections. The deciphered section contains 77,064 images spanning 1,588 categories of individual characters, and the undeciphered section features 62,989 images across 9,411 categories of characters. Due to the lack of annotations for undeciphered categories, there may be duplicates among these 9,411 categories of undeciphered OBCs, which can only be merged once they are deciphered. Table 3 provides detailed statistics of the OBC images obtained from these sources. Additionally, Fig. 7 presents a distribution histogram showing the number of sample images per category in our dataset. It reveals that most categories have fewer than 10 sample images, with the largest category boasting over 300 images.

Fig. 7

Histogram illustrating the distribution of sample counts across various categories in the HUST-OBC deciphered split.

For efficient retrieval, the HUST-OBC is organized and stored by category. Each image file is named following the format <source>_<label>_<filename>, which encodes its origin, category number, and sequence number, and is stored in a folder named after its category number. For the deciphered categories, the mapping between category numbers and the corresponding Chinese characters is stored in a UTF-8 encoded JSON file. Fig. 8 shows some deciphered and undeciphered OBCs from the HUST-OBC.
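A file name in the <source>_<label>_<filename> format can be split back into its three fields as sketched below. The example path, the four-digit label, and the layout of the JSON mapping (category number as key, character as value) are assumptions made for illustration; the repository should be consulted for the actual conventions.

```python
import json
from pathlib import Path
from typing import NamedTuple, Optional


class ImageName(NamedTuple):
    source: str
    label: str
    filename: str


def parse_image_name(path: str) -> ImageName:
    """Split '<source>_<label>_<filename>' (extension removed) into fields."""
    stem = Path(path).stem
    # Split on the first two underscores only, in case the trailing
    # <filename> part itself contains underscores.
    parts = stem.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected file name: {path!r}")
    return ImageName(*parts)


def lookup_character(label: str, mapping_json: str) -> Optional[str]:
    """Look up a category number in a JSON object of label -> character."""
    mapping = json.loads(mapping_json)
    return mapping.get(label)


name = parse_image_name("deciphered/0001/B_0001_12.png")  # hypothetical path
print(name)
print(lookup_character(name.label, '{"0001": "人"}'))  # hypothetical mapping
```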

Fig. 8

Samples of deciphered and undeciphered Oracle Bone Characters from the HUST-OBC dataset.

Technical Validation

One of the primary objectives in creating the HUST-OBC is to facilitate future AI-assisted tasks in deciphering OBCs. To this end, we further assessed the quality of the dataset by using it to train AI models. Specifically, we divided the deciphered section of the HUST-OBC dataset into training, validation, and test sets using stratified sampling with proportions of 8:1:1, and used them for an image classification task. Because a classification model cannot recognize classes it has not seen during training, we allocated all classes with only one sample to the training set. The accuracy of image classification reflects the quality of the dataset to some extent: if the images are of poor quality or many labels are wrong, the classifier’s accuracy will be low, and vice versa. We employed the widely used ResNet-5018 as the backbone network. The model with the highest validation accuracy was then evaluated on the test set, achieving a classification accuracy of 94.6% and a macro-average F1 score of 0.914, which supports the dataset’s quality and potential academic value. Table 4 shows the model’s recognition accuracy in some categories and provides example input images from different sources.
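A stratified 8:1:1 split of this kind can be sketched as follows. The rounding rule for small categories is an assumption: floor division here sends every category with fewer than ten images entirely to the training set, which covers the single-sample case described above but may differ from the split the authors actually used. The function name and the toy sample IDs are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    samples: Sequence[Tuple[str, str]], seed: int = 0
) -> Dict[str, List[str]]:
    """Split (image_id, category) pairs roughly 8:1:1 within each category."""
    by_category: Dict[str, List[str]] = defaultdict(list)
    for image_id, category in samples:
        by_category[category].append(image_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for category in sorted(by_category):
        items = sorted(by_category[category])
        rng.shuffle(items)
        n_val = len(items) // 10
        n_test = len(items) // 10
        splits["val"].extend(items[:n_val])
        splits["test"].extend(items[n_val:n_val + n_test])
        # Everything left over, including whole single-sample categories,
        # goes to the training set.
        splits["train"].extend(items[n_val + n_test:])
    return splits


toy_samples = [(f"a_{i}", "a") for i in range(20)] + [("b_0", "b")]
toy_split = stratified_split(toy_samples)
print({name: len(ids) for name, ids in toy_split.items()})
# {'train': 17, 'val': 2, 'test': 2}
```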

Table 4 Validation accuracy of ResNet-50 on selected categories of the HUST-OBC dataset.

Licenses

The dataset is released under a non-commercial license, CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/deed.en), which permits users to reuse and reproduce the dataset for research purposes.

Usage Notes

The HUST-OBC is available as a compressed archive comprising three folders of OBC images: the first holds characters that have already been deciphered, the second holds those still awaiting interpretation, and the third holds the unreliable data from GuoXueDaShi. Within each folder, subfolders are organized by category and contain the OBC images of the respective category. For more information, please see https://github.com/Pengjie-W/HUST-OBC.
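Given the folder layout described above, the number of images per category can be tallied with a short directory walk. The sketch below builds a small temporary tree to stay self-contained; the category folder names and file names in it are hypothetical, and the real folder names should be taken from the repository.

```python
import tempfile
from pathlib import Path
from typing import Dict


def count_images(split_dir: str) -> Dict[str, int]:
    """Count the files in each category subfolder of one top-level folder."""
    counts: Dict[str, int] = {}
    for category_dir in sorted(Path(split_dir).iterdir()):
        if category_dir.is_dir():
            counts[category_dir.name] = sum(
                1 for p in category_dir.iterdir() if p.is_file()
            )
    return counts


# Build a tiny stand-in for one top-level folder and count its contents.
with tempfile.TemporaryDirectory() as root:
    layout = {"0001": ["A_0001_1.png", "B_0001_2.png"], "0002": ["E_0002_1.png"]}
    for category, files in layout.items():
        folder = Path(root) / category
        folder.mkdir()
        for file_name in files:
            (folder / file_name).touch()
    demo_counts = count_images(root)

print(demo_counts)  # {'0001': 2, '0002': 1}
```

Such a tally gives the per-category sample counts that Fig. 7 summarizes as a histogram.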