DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding

Recent advances in computer vision (CV) and natural language processing have been driven by exploiting big data on practical applications. However, these research fields are still limited by the sheer volume, versatility, and diversity of the available datasets. CV tasks such as image captioning, which has primarily been carried out on natural images, still struggle to produce accurate and meaningful captions on sketched images often included in scientific and technical documents. The advancement of other tasks, such as 3D reconstruction from 2D images, requires larger datasets with multiple viewpoints. We introduce DeepPatent2, a large-scale dataset providing more than 2.7 million technical drawings with 132,890 object names and 22,394 viewpoints extracted from 14 years of US design patent documents. We demonstrate the usefulness of DeepPatent2 with conceptual captioning. We further discuss the potential of our dataset to facilitate other research areas such as 3D image reconstruction and image retrieval.


Background & Summary
Technical illustrations, sketches, and drawings are images constructed to convey information more straightforwardly to humans than using text alone [1,2,3]. One goal of computer vision is to build models to understand information contained in these images. Specific tasks include recognizing objects and determining their attributes, capturing the relationships between sub-images, and understanding the context of objects [4]. Natural images such as those contained in MS COCO [5] and ImageNet [6] have been used to develop solutions to these tasks using deep neural networks. Different from natural images [5,6], technical drawings are a type of image frequently found in design patents. Although they usually do not contain many features of natural images, such as various colors, gradients, and environmental detail, drawings often abstract away unnecessary distractions, leaving strokes, lines, and shading that are sufficiently detailed for drawn objects and aspects to remain recognizable by humans. Compared with natural images, technical drawings are under-studied by the computer vision and information retrieval communities.

Table 1: DeepPatent2 compared with major sketch-based and natural image datasets. #Categories means the number of object names.
DeepPatent2: 2M+ drawings; #Categories 132,890; Yes; Yes
In this paper, we introduce a new dataset, DeepPatent2, consisting of more than 2 million automatically segmented and tagged technical drawings from more than 300,000 design patents granted from 2007 to 2020. Figure 1 shows an example of the figures segmented and tagged from a single patent. Our approach is reminiscent of large-scale collections of real-world images with real human tags, such as WebVision [16], which consists of 2.4 million images crawled from the Web with associated metadata. It was demonstrated that noisy web images like those in WebVision were sufficient for training a good deep learning model for visual recognition [16]. We propose a novel pipeline that uses natural language processing (NLP) models to extract object names and viewpoints from figure captions, uses computer vision (CV) methods to segment compound figures, and aligns text information with the corresponding figures. The pipeline builds on our previous results on label extraction [20], visual descriptor extraction [21], and image segmentation [22]. To demonstrate the usefulness of our dataset, we train baseline deep-learning models on conceptual captioning and demonstrate that the performance of the models benefits significantly from an increasing size of training data. We expect that DeepPatent2 will facilitate training robust neural models on other tasks such as 3D image reconstruction and image retrieval for technical drawings.

Methods: Building the DeepPatent2 Dataset

Overview
DeepPatent2 contains over 2 million technical drawings from United States Patent and Trademark Office (USPTO) [23] design patent documents published from 2007 to 2020. Our dataset extends the DeepPatent dataset [14] in three aspects.
1. DeepPatent2 is more than 5 times larger than DeepPatent;
2. DeepPatent2 contains both original and segmented patent drawings;
3. The metadata of each drawing contains object name and viewpoint information, automatically extracted using a supervised sequence-tagging model with high accuracy.

The pipeline to create the dataset is illustrated in Figure 2. The pipeline includes three major components, namely, data acquisition, text processing, and image processing. A patent document includes an XML file containing textual content and TIFF files containing associated images. One patent TIFF file may contain several patent figures, as shown in Figure 1, which we refer to as a compound figure. In the following, we refer to a "figure file" as an image file containing a single or a compound figure. We refer to a "figure" as an individual figure (e.g., a figure with a unique label such as "FIG. 1").
The text processing step aims at automatically tagging individual figures with text that a person would use to describe them. In addition to extracting the design category information included in the patent XML file, we extract human-readable object names from figure captions. One challenge is that although the XML document contains captions and inline references for individual figures, the document does not directly map figures to figure files, because a figure file may contain multiple figures. For example, Figure 1 contains 7 individual figures in 4 figure files. To overcome this challenge, we develop modules that first segment compound figures, resolve figure labels, and link them to text processing results to associate captions with their respective figures. The final data includes JSON files containing metadata, automatically extracted descriptions (object names and viewpoints), and images of individual figures, as well as original figure files.
In the following subsections, we elaborate on each module in the pipeline. All computation was performed on a Dell server with 4 NVIDIA GTX 2080 Ti GPUs, 24 hyperthreaded Intel Xeon Silver 4116, 300GB RAM, and 7TB disk space. In addition, we applied AWS Rekognition's DetectText service on all patent figure files. The total size of the dataset is approximately 380GB before compression.
According to USPTO, design patent drawings are illustrations of a manufactured object's design, which include detailed information about contours, shapes, material texture, properties, and proportions. Drawings and text of patent documents are in the public domain and are created by writers, artists, and inventors with the knowledge that these works will become public domain [24].

Patent Data Acquisition
We collected patent data from 2007 to 2020 from the USPTO website in the form of zip and tar files. These files consist of both the full text (in XML format) and figures (in TIFF format). Each TIFF file contains one or multiple figures. The gap between the number of figures parsed from XML and the number obtained by segmentation (#Figure XML and #Figure Segmented in Table 3) reflects the imperfection of the text and image processing methods, which will be incorporated in assessing the data quality. Figure 3 shows the average number of figure files per patent and the average number of figures per patent in our dataset, demonstrating the increasing popularity of figures in patent applications over the last 14 years.

The details of the text processing pipeline are elaborated in Wei et al. (2022) [21]. Here, we highlight the best model and its performance. We treated the extraction of object names and viewpoints from figure captions as an entity recognition problem. The text was first tokenized. Each token was encoded as a feature vector by a pre-trained word embedding model. Word vectors were then fed to a sequence-tagging neural network. We compared the BiLSTM-CRF (bidirectional long short-term memory, or BiLSTM, followed by a conditional random field, or CRF) with a transformer model. Both are widely regarded as state-of-the-art sequence-tagging models [25]. We compared several word embedding models, including GloVe [26], RoBERTa fine-tuned on GPT-2 [27], the original RoBERTa [28], BERT [29], ALBERT [30], and DistilBERT [31]. The sequence-tagging model incorporated context into the initial feature vector of each word. The CRF layer classified each token under the IOB schema. The best performance was achieved using DistilBERT [31] with the BiLSTM-CRF architecture. The average F1-measures for the overall entity recognition, object name, and viewpoint were 0.960, 0.927, and 0.992, respectively.
The ground truth corpus was created by randomly selecting 3300 patent figure captions from the US design patents. Each caption was manually annotated by two researchers independently using BRAT, a web-based text annotation tool [32]. An example of an annotated caption is shown in Figure 4. The ground truth was split into training, validation, and test sets consisting of 2700, 300, and 300 captions, respectively. We then applied this model to all patent figure captions and extracted a total of 132,890 unique object names and 22,394 unique viewpoints. The result of post-extraction validation using a random sample of 100 figure captions was consistent with the evaluation based on the benchmark.
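To make the IOB tagging step concrete, the following minimal sketch shows how per-token IOB tags can be turned back into object-name and viewpoint spans. The tag names `OBJ` and `VIEW` and the example tags are assumptions made for illustration; they are not taken from the released model or its annotation scheme.

```python
from typing import List, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_type, text) spans from IOB tags such as B-OBJ, I-OBJ, O."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A new entity starts here (a stray I- tag is treated like B-).
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O": the token lies outside any entity, so close the open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical tags for a shortened caption; OBJ and VIEW are illustrative names.
tokens = "FIG. 1 is a front perspective view of a pet treat".split()
tags = ["O", "O", "O", "O", "B-VIEW", "I-VIEW", "I-VIEW", "O", "O", "B-OBJ", "I-OBJ"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

In this toy caption, the viewpoint span is "front perspective view" and the object span is "pet treat".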

Image Processing: Figure Segmentation and Metadata Alignment
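The goal of the image processing pipeline is to segment compound figures and recognize figure labels, which is necessary because the patent documents do not contain information that maps individual figures' captions to figure files. This process has four steps (Figure 2): (1) figure label detection, including text and positions; (2) segmenting compound figure files into individual figures; (3) associating labels with individual figures; (4) aligning metadata with individual figures.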

Figure Label Detection
The original figure files are in TIFF format, which is not compatible with many optical character recognition (OCR) engines and computer vision packages. Therefore, we converted TIFF to PNG format, which is widely used in computer vision and does not introduce compression artifacts as JPEG files do. We use AWS Rekognition's DetectText service, a commercial OCR engine, to recognize figure labels, including text content and bounding boxes [37].
To evaluate the quality of extraction results, we extended our previous work [20] and compared several top-performing OCR engines using a corpus consisting of 100 randomly selected design patent figures from the USPTO dataset in 2020, including both single and compound figures. The evaluation metrics included precision, recall, and F1. The precision was calculated as the number of correctly recognized labels divided by the total number of labels captured. The recall was calculated as the number of correctly recognized labels divided by the total number of labels shown on sampled figures. The results (Table 2) indicate that AWS Rekognition achieves the highest F1 (96.80%) compared with other OCR engines. In particular, Rekognition achieves the highest recall (96.03%). Google Vision API shows an excellent F1 (94.10%), but the recall score is relatively low (88.90%). Tesseract, an open-source OCR engine, achieves a relatively high precision (96.60%) but a poor recall (44.40%). Therefore, we adopted Rekognition into our pipeline.
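The metric definitions above can be written out in a few lines. The sketch below is illustrative only: the label counts in the first example are hypothetical, while the second example plugs in the Tesseract precision and recall quoted in the text to show how a low recall pulls F1 down.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(correct: int, captured: int, shown: int):
    """correct: correctly recognized labels; captured: all labels the OCR
    engine returned; shown: all labels visible on the sampled figures."""
    precision = correct / captured if captured else 0.0
    recall = correct / shown if shown else 0.0
    return precision, recall, f1_from_pr(precision, recall)


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(correct=90, captured=100, shown=120)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")

# Precision and recall quoted above for Tesseract (96.60% and 44.40%).
tesseract_f1 = f1_from_pr(0.966, 0.444)
print(f"Tesseract F1 = {tesseract_f1:.3f}")
```

With the quoted Tesseract figures, the harmonic mean comes out at roughly 0.61, well below the F1 scores quoted for Rekognition and Google Vision API.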
The errors made by Rekognition are due to several reasons. Errors may happen when the labels are rotated by 90 degrees, resulting in a reversed order of label tokens. For example, "Fig.3" in a single figure could be recognized as ["3", "Fig."]. This is likely because Rekognition scans the image from top to bottom. In addition, unlike many other OCR engines that output a single result, Rekognition outputs multiple results scored from 0 to 1. To address these two issues, we developed a script that keeps the Rekognition output when its confidence score is above a threshold θ_0 = 0.5 and then reconstructs the labels by "gluing" tokens. This rectifier improved the F1 score of figure label recognition by about 16 percentage points, from 0.806 to 0.968 (Table 2).
The results of this step are figure labels and their bounding boxes for each figure file. The figure label regions in the images are whitened out before figure files are passed to the segmentation module.
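The rectification idea described above (drop low-confidence tokens, then glue the remaining tokens into a label in reading order) can be sketched as follows. This is not the authors' script: the dictionary keys `text` and `confidence`, the 0-to-1 confidence scale, and the normalized output form "FIG. n" are assumptions made for illustration, and the input is not the actual Rekognition response format.

```python
import re
from typing import Dict, List

THETA_0 = 0.5  # confidence threshold quoted in the text


def rectify_label(detections: List[Dict], threshold: float = THETA_0) -> str:
    """Keep confident tokens and glue them so the 'FIG.' prefix comes first."""
    kept = [d["text"].strip() for d in detections if d["confidence"] > threshold]
    # Tokens that look like the figure prefix ("Fig", "FIG.", "Figure", ...).
    prefix = [t for t in kept if re.fullmatch(r"fig(ure)?s?\.?", t, re.IGNORECASE)]
    rest = [t for t in kept if t not in prefix]
    if not prefix:
        return " ".join(rest)
    return " ".join(["FIG."] + rest)


# A rotated label read in reversed order, plus one low-confidence stray token.
detections = [
    {"text": "3", "confidence": 0.98},
    {"text": "Fig.", "confidence": 0.95},
    {"text": "8", "confidence": 0.20},
]
label = rectify_label(detections)
print(label)
```

Here the stray token falls below the threshold and the reversed pair is reordered into a single label.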

Compound Figure Segmentation
A significant number of US patents use compound figures, containing more than one individual figure, so they need to be segmented before they are associated with captions. We proposed a transfer learning method, with a transformer-based neural model called MedT [38] pre-trained on compound medical images and then fine-tuned on a small set of annotated patent figures. We compared our model against several state-of-the-art image segmentation models on a test set containing annotated patent figures. We then associated segmented figures with labels using a proximity-based method. The details of our approach can be found in Hoque et al. (2022) [22]. Here, we outline the pipeline and highlight the performance of the best models.
To the best of our knowledge, there was no labeled dataset that could be used for patent figure segmentation, so we developed an in-house ground truth dataset consisting of 500 figure files randomly selected from design patents. Here, we assume the number of figures in each figure file equals the number of labels detected by the OCR engine. If the two numbers are not equal, the segmentation is flagged as an error. The results are still included in the final dataset as challenging cases for future models, but these cases can be easily excluded using the flags. In our ground truth of 500 figures, there were 480 compound figures, each containing up to 12 individual figures. Each figure was annotated by a graduate student using the VGG Image Annotator [39] by drawing bounding boxes around individual figures. We performed an independent human validation of the bounding boxes.
As a baseline, we proposed an unsupervised method called Point-shooting [22]. The idea is to randomly "shoot" open dots onto a compound figure. Dots that overlay any strokes are filled with a single solid color and the other dots are removed, so the remaining dots create a region filled with single-color pixels. A contour can be drawn around the region, outlining the profile of individual figures. A bounding box can then be drawn around the contour. Using this method, we can obtain a mask of individual figures and their bounding boxes. The bounding boxes can be directly used for comparing against the ground truth. Point-shooting achieves an accuracy of 92.5%, calculated as the percentage of individual figures that are correctly segmented.
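The point-shooting idea can be illustrated on a toy binary image. The sketch below is a simplified stand-in rather than the implementation in [22]: it assumes square dots, uses 4-connected regions of kept dots in place of contour tracing, and the function name, dot count, and radius are illustrative choices.

```python
import random
from collections import deque


def point_shooting(image, n_dots=20000, radius=1, seed=0):
    """image: 2D list with 1 for stroke pixels and 0 for background.
    Returns (top, left, bottom, right) boxes of regions covered by kept dots."""
    h, w = len(image), len(image[0])
    rng = random.Random(seed)
    mask = [[0] * w for _ in range(h)]
    for _ in range(n_dots):
        r, c = rng.randrange(h), rng.randrange(w)
        dot = [(y, x)
               for y in range(max(0, r - radius), min(h, r + radius + 1))
               for x in range(max(0, c - radius), min(w, c + radius + 1))]
        if any(image[y][x] for y, x in dot):
            # The dot overlays a stroke: keep it and fill it with one value.
            for y, x in dot:
                mask[y][x] = 1
    # Connected regions of kept dots stand in for the contours in the text.
    boxes, seen = [], [[False] * w for _ in range(h)]
    for r0 in range(h):
        for c0 in range(w):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            top, left, bottom, right = r0, c0, r0, c0
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def draw_outline(image, top, left, bottom, right):
    """Draw a rectangular stroke outline, used here as a toy 'figure'."""
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            if y in (top, bottom) or x in (left, right):
                image[y][x] = 1


canvas = [[0] * 40 for _ in range(20)]   # toy compound figure
draw_outline(canvas, 3, 2, 12, 10)       # first individual figure
draw_outline(canvas, 5, 25, 15, 35)      # second individual figure
boxes = point_shooting(canvas)
print(len(boxes), boxes)
```

Because dots that miss every stroke are discarded, the blank gap between the two outlines stays empty in the mask and the two figures end up in separate regions.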
MedT was designed as a transformer-based semantic segmentation framework [38]. The core component is a gated position-sensitive axial attention mechanism, designed to overcome the limitation of a vanilla transformer so that a more robust model can be trained using relatively small training sets. Because the point-shooting method was able to generate masks of individual figures with relatively high accuracy, we used the output of the point-shooting method as the training data for MedT. As our results show, the MedT model achieved higher performance than point-shooting despite being trained on this noisy data. The training figures were first passed through a convolution block before passing through a global branch, which captures dependencies between pixels and the entire image. The same figure was broken down into patches, which were passed through a similar convolutional block before passing through a local branch, which captures dependencies among neighboring pixels. A re-sampler aggregates the outputs from the local branch and generates the output feature maps. The outputs from both branches are finally aggregated, followed by a 1 × 1 convolutional layer that pools these feature maps into a segmentation mask. We fine-tuned the pre-trained MedT model on our training set and achieved an accuracy of 97% with a much shorter runtime (≈ 1/35) on the test set compared with the point-shooting method. MedT outperformed point-shooting and other deep learning baselines including U-Net [40], HR-Net [41], and DETR [42].

Label Association and Metadata Alignment
We have two sources of metadata: labels recognized by the OCR engine and semantic information parsed from XML files.To generate the contextualized figures, two alignments were conducted.
Label Association
This step associates labels output by the OCR engine with segmented figures. This was treated as a bipartite matching problem. A general solution is the Ford-Fulkerson method [43]. Because of the specific structure of our problem, we used a simple heuristic that matches each individual figure segmented from a compound figure with the closest label recognized by the OCR engine. The proximity was calculated as the Euclidean distance between the geometric centers of the bounding boxes of labels and figures. Despite its simplicity, the method achieves an accuracy of 97%, evaluated on a set of 200 randomly selected figures from the matching results. The errors in the label-figure alignment are reported in Table 3. Here, #Figure Mismatch refers to the number of figures that could not be aligned with labels using the proximity method above. Another type of error is attributed to a compact and irregular arrangement of labels. The output of this step is a set of individual figures with labels. We mark cases in which the number of labels recognized does not equal the number of segmented figures. For completeness, we still include these cases in the data product. Users can easily remove these cases in downstream training or analysis, or use them to design better algorithms to automatically correct errors.

Metadata Alignment
This step aligns labeled figures with captions extracted from XML files by matching figure labels parsed from XML files with the labels associated with individual figures in the previous step. The result of this step contains segmented figures, each having metadata, including the label, the caption, the bounding box coordinates, and document-level metadata, including patent ID, year, title, number of figures, etc. The method used here is based on strict integer matching, so the errors in the final data are caused by errors propagated from upstream processes.
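The proximity heuristic for label association can be sketched in a few lines. The box format `(left, top, right, bottom)`, the function names, and the coordinates in the example are assumptions for illustration; note that this greedy version may assign the same label to two figures, which is one reason a count mismatch is flagged separately.

```python
import math


def center(box):
    """Geometric center of a (left, top, right, bottom) bounding box."""
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


def associate_labels(figure_boxes, label_boxes):
    """Match each segmented figure with the closest OCR label.

    figure_boxes: list of boxes, one per segmented figure.
    label_boxes: dict mapping label text to its bounding box.
    Returns (assignment, mismatch), where mismatch flags files in which the
    number of labels differs from the number of segmented figures.
    """
    assignment = {}
    for index, figure_box in enumerate(figure_boxes):
        figure_center = center(figure_box)
        assignment[index] = min(
            label_boxes,
            key=lambda name: math.dist(figure_center, center(label_boxes[name])),
            default=None,
        )
    mismatch = len(figure_boxes) != len(label_boxes)
    return assignment, mismatch


# Hypothetical layout: two figures side by side, each with a label underneath.
figures = [(0, 0, 100, 100), (120, 0, 220, 100)]
labels = {"FIG. 1": (40, 105, 60, 115), "FIG. 2": (160, 105, 180, 115)}
assignment, mismatch = associate_labels(figures, labels)
print(assignment, mismatch)
```

A full bipartite matching would enforce one label per figure; the nearest-center rule trades that guarantee for simplicity.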

Data Records
The final data product contains about 2 million compound PNG figures, 2.7 million segmented PNG figures, and metadata in JSON format. The dataset is organized by year (Table 3). The total size of the compressed dataset is 314 GB. The size of each year's data is roughly proportional to the number of figure files (Figure 3).
The JSON files contain document-level and figure-level metadata (one JSON file per year). Each entry in the JSON file represents a segmented figure, which is tagged with the patent ID, the original figure file, the extracted object name and viewpoints, the figure labels, and the bounding boxes of the segmented figures and their labels. The schema and a sample record of this JSON file are shown in Table 4. The patent captions contain an average of 12 words with a vocabulary of about [...]

The perspective view, front view, and top plan view are the top 3 viewpoints used in design patents. The object distribution color-coded by the top three viewpoints (Figure 7) indicates that objects are depicted with diverse and disproportionate viewpoints, which poses challenges for 3D reconstruction from 2D sketches. The complete distribution is provided with the dataset.
We observe that one object name, such as "bottle", may appear in multiple patents. Figure 8 shows the distribution of the number of patents per object. Similarly, Figure 10 shows the distribution of the number of individual figures per object. The peak of Figure 8 occurs at an x-axis value of 1, meaning that most frequently there is only 1 patent per object name.

Figure 8: One object name may be described by multiple patents. This figure shows a truncated distribution of the number of patents in one object category. For example, when the x-axis value is 7, the y-axis value is 774, meaning each of the 774 objects is described by 7 patents. The x-axis is truncated because data points beyond the truncation point are sparse.
The full dataset [33] is publicly available in the Harvard Dataverse repository at https://doi.org/10.7910/DVN/UG4SBD. For each year, the original and segmented figures are in separate tarball files, with a JSON file attached.
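As a rough illustration of how the figure-level metadata might be consumed, the sketch below groups viewpoints by object name. It assumes each yearly file is a JSON array of records carrying the fields that survive in Table 4 (`patentID`, `figid`, `caption`, `object`, `aspect`); the actual files may be laid out differently or contain more fields, and the second record is invented for the example.

```python
import json
from collections import defaultdict

# Two records in the assumed layout; the first mirrors the Table 4 sample,
# the second is hypothetical.
sample = """[
  {"patentID": "861215", "figid": "1",
   "caption": "FIG. 1 is a front, top, and left side perspective view of a pet treat according to the new design;",
   "object": "Pet treat",
   "aspect": "front, top, and left side perspective view"},
  {"patentID": "861215", "figid": "2",
   "caption": "FIG. 2 is a front view thereof;",
   "object": "Pet treat",
   "aspect": "front view"}
]"""

records = json.loads(sample)

# Group the extracted viewpoints by (lower-cased) object name.
views_by_object = defaultdict(list)
for record in records:
    views_by_object[record["object"].lower()].append(record["aspect"])

for name, views in views_by_object.items():
    print(f"{name}: {len(views)} figures, viewpoints = {views}")
```

For the full yearly files, a streaming parser is preferable to `json.loads`, since loading an entire file at once may not fit in memory.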

Technical Validation
Data Validation, Evaluation, and Copyright
Our data was automatically generated by high-performance machine learning and deep learning methods [21,22] (see the Methods section for a detailed description). However, errors can still occur, propagate, and accumulate, leading to errors in the final data product. There are four possible error sources, namely, figure label detection, compound image segmentation, label association, and entity recognition (ER). Label association depends on image segmentation and OCR, but label association errors can occur even when image segmentation and OCR are correct. Assuming the label association (LA) error can be approximated by the mismatch errors (Mismatch% in Table 3), which are 7.5% on average (so the precision P_LA = 92.5%), the overall error rate is calculated using Eq. (1). All figures are preserved in the final dataset, but mismatched figures are marked in their file names, so they can be used if needed.

Table 5: The top-level class codes and descriptions of the Locarno International Classification designation adopted by USPTO for design patents.
29 Devices and equipment against fire hazards, for accident prevention and for rescue
30 Articles for the care and handling of animals
31 Machines and appliances for preparing food or drink, not elsewhere specified
32 Graphic symbols and logos, surface patterns, ornamentation, arrangement of interiors and exteriors

To validate the estimated error rates, we built a verification dataset by randomly sampling 100 compound figures each year from 2007 to 2020, for a total of 1400 compound figures. We manually inspected the quality of the final data product. Specifically, we inspected whether a compound figure is correctly segmented and whether the label, object name, and viewpoint of an individual figure are correctly extracted. We calculated the error rates by dividing the number of individual figures containing any errors (404) by the total number of individual figures (3464). Figure 11 (Left) shows the error rates calculated for each year and the average error rate. The verified error is consistent with the error estimated by Eq. (1). Table 6 shows the breakdown of the verified errors by source. Note that the Verified Overall row is calculated using direct counts and should not be calculated from previous rows using Eq. (1), because a fraction of figures have more than one type of error. We also calculated error rates for each class. The verification dataset contains 23 classes, of which 10 classes contain more than 40 individual figures. Figure 11 (Right) shows the error rates across classes that have more than 40 individual figures in the verification dataset.
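The arithmetic behind the verified error rate can be checked directly from the counts above. Eq. (1) itself is not reproduced in this text, so the estimate in the sketch below rests on an assumption: that label-association and entity-recognition errors combine independently, using P_LA = 92.5% and the overall entity-recognition F1 of 0.960 as a stand-in for P_ER. This form is a guess that happens to agree with the reported discrepancy of about 0.5%; it should not be read as the authors' equation.

```python
# Verified error rate from the manual inspection counts quoted in the text.
figures_with_errors = 404
total_figures = 3464
verified_error = figures_with_errors / total_figures

# ASSUMPTION: independent error sources, estimate = 1 - P_LA * P_ER.
p_la = 0.925   # label association precision (100% - 7.5% mismatch)
p_er = 0.960   # overall entity recognition F1, used here in place of P_ER
estimated_error = 1 - p_la * p_er

discrepancy = abs(verified_error - estimated_error)
print(f"verified  = {verified_error:.4f}")
print(f"estimated = {estimated_error:.4f}")
print(f"gap       = {discrepancy:.4f}")
```

Under this assumption the estimate is about 11.2% against a verified rate of about 11.7%.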
Our estimated error rate is on par with other computer vision datasets, especially considering that our methods produce automated tags. Other auto-tagged computer vision datasets like WebVision estimate error rates of up to 20% [16], while curated datasets like ImageNet-1k [15] have been estimated to have error rates of up to 6% [45].
USPTO advocates open data, which can be freely used, reused, and redistributed [46]. The text and drawings of a patent are typically not subject to copyright restrictions [24]. Our dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Table 6: A breakdown analysis of the errors estimated using Eq. (1) and the errors obtained using the verification dataset in the Technical Validation section. The results show that they are consistent, with a discrepancy of 0.5%.
Error Type; Error
Estimated - Label (Segmentation+OCR)

Usage Notes
In this section, we demonstrate the usefulness and value of the dataset on a Conceptual Captioning task, which generates short descriptive text that captures the main objects and their viewpoints in a given image. This task was originally carried out on a dataset (hereafter CC18) consisting of natural images collected from the Web [47]. In this work, we carry out this task on the technical drawings presented in our dataset. We demonstrate how to use the dataset and advance the visual-to-text conversion task by running a state-of-the-art model on data in the technical drawing domain, potentially opening up research opportunities in other topics on patent drawings. In addition, we outline other tasks in which our dataset could potentially be adopted. Here, we focus on demonstrating the potential benefit of improving model performance by scaling up the training set using our dataset, rather than on developing a new method.

Conceptual Captioning
The task of conceptual captioning involves generating a short textual description of an image. State-of-the-art models usually employ an encoder-decoder architecture [47]. We employed ResNet-152 [48], a CNN-based model pre-trained on ImageNet [15], and fine-tuned it on technical drawings selected from DeepPatent2. We varied the training data size from 500 and 1,000 up to 63,000 and tested the performance of each model on a corpus consisting of 600 figures with manually validated object names and viewpoints. The training sets were randomly selected from the parent sample. We selected the test set so that it did not overlap with any samples from the training set. The input to the CNN encoder was an individual figure. The decoder generated a template-based caption including the object name and the viewpoint. The dimension of the encoder output was 2048, and we reduced it to 512 to match the input dimension of the LSTM decoder. We trained the image captioning network with the cross-entropy loss for 30 epochs with a mini-batch size of 32. We adopted the ADAM [49] optimizer with a learning rate of 0.001. The models were evaluated using standard metrics including METEOR [50], NIST [51], Translation Error Rate (TER [52]), ROUGE [53], and accuracy. The accuracy was defined as the fraction of the ground-truth patent object names and viewpoints that were correctly predicted by the model. The results in Table 7 indicate that increasing the training size for automatically tagged data improved the performance of image captioning models.

The computing environment includes a Linux server with an Intel Xeon Silver CPU and an Nvidia GTX 2080 Ti GPU. Because of the large size of the data, we recommend a disk capacity of at least 500GB for all data. To obtain the subfigures from compound images, we used PyTorch >= 1.4.0, torchvision >= 0.5.0, and Python 3.8. The OpenCV package was used to load PNG images. To load large JSON files, we recommend using the ijson package (version > 3.1), which helps to conserve memory.
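The accuracy metric defined above for template-based captions can be read as a field-level match rate over object names and viewpoints. The sketch below follows that reading, which is an assumption: the case-insensitive exact match, the function names, and the toy predictions are illustrative and may differ from the evaluation code actually used.

```python
def normalize(text: str) -> str:
    """Lower-case and collapse whitespace before comparing fields."""
    return " ".join(text.lower().split())


def caption_accuracy(predictions, references) -> float:
    """Fraction of ground-truth object names and viewpoints predicted correctly.

    Both arguments are lists of (object_name, viewpoint) pairs in the same order.
    """
    correct = total = 0
    for (pred_obj, pred_view), (ref_obj, ref_view) in zip(predictions, references):
        correct += normalize(pred_obj) == normalize(ref_obj)
        correct += normalize(pred_view) == normalize(ref_view)
        total += 2
    return correct / total if total else 0.0


# Hypothetical predictions for two test figures.
references = [("pet treat", "front view"), ("bottle", "perspective view")]
predictions = [("Pet treat", "front view"), ("bottle", "top plan view")]
accuracy = caption_accuracy(predictions, references)
print(accuracy)
```

In this toy case three of the four reference fields are matched.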

Potential Usages of the Dataset
• Technical drawing image retrieval and semantic understanding. Investigation of effective and efficient retrieval methods for technical drawings has attracted the attention of the computer vision community. The Diagram and Abstract Imagery competition uses the DeepPatent dataset [54]. The rich semantic information of DeepPatent2 can potentially help build a new multimodal (text+image) ground truth and enable tasks such as semantic understanding of abstract drawings and technical document classification [55].
• Summarization of scholarly and technical corpora. Many existing summarization methods [56,57,58] use only text. However, it has been shown that combining multimodal content improves a reader's understanding of each document's content while reducing the amount of content they must consume [59,60]. For example, the automatic selection of relevant images [61] has focused primarily on web resources and natural images. DeepPatent2, with its segmented compound figures, can be used to evaluate summary image selection in technical corpora.
• 3D image reconstruction. Although humans are good at perceiving 3D objects from technical drawings, the task is challenging for computers. Delanoy et al. developed neural methods to reconstruct 3D images from 2D sketches using training datasets corresponding to procedural shapes, vases, and chairs [62]. Our dataset contains more diverse object types with multiple viewpoints and can potentially be used for training high-fidelity models.
• Figure segmentation. Segmenting compound figures is a common preprocessing step before individual figures are used for analysis, retrieval, or machine learning tasks. Using 4000 DeepPatent2 figures, we achieved an accuracy of 99.5% at both IoU = 0.7 and IoU = 0.9, evaluated on the technical drawings in [22]. Our data can potentially be used for training a large-scale base model that is fine-tunable for scientific or medical figure segmentation problems.
• Technical drawing classification. Sketch classification methods [63,64,65] have been proposed to recognize sketch images from the Web. However, the datasets they rely on contain a limited number of object categories and viewpoints. Our dataset contains more diverse object types and viewpoints and can potentially be used for training robust technical drawing classification models.
• Generative and multimodal design models for innovation. Generative Adversarial Networks (GANs) and diffusion models perform remarkably well on text-to-image synthesis tasks [66], and large language models (LLMs) have achieved superior performance on many generative NLP tasks [67]. Whether it is possible to combine LLMs with diffusion models to automatically generate multimodal design models for innovation is an open question. Diffusion models and LLMs are known to be inaccurate in their details, so the detailed technical drawings of DeepPatent2, together with their object name and viewpoint information, could provide the training data needed for multimodal generative models that are accurate enough for design innovation.
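The figure segmentation item above reports accuracy at fixed intersection-over-union (IoU) thresholds. A minimal sketch of that kind of evaluation follows, assuming axis-aligned boxes in `(left, top, right, bottom)` form that are already paired one-to-one with their ground truth; the function names and the example boxes are illustrative.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


def accuracy_at_iou(predicted, ground_truth, threshold: float) -> float:
    """Share of paired boxes whose IoU reaches the threshold."""
    if not ground_truth:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)


# Hypothetical boxes: the first prediction is tight, the second is loose.
ground_truth = [(0, 0, 100, 100), (0, 0, 100, 100)]
predicted = [(0, 0, 100, 95), (0, 0, 100, 80)]
print(accuracy_at_iou(predicted, ground_truth, 0.7))
print(accuracy_at_iou(predicted, ground_truth, 0.9))
```

A stricter threshold only counts predictions that overlap the ground truth almost exactly, so the loose box passes at 0.7 but not at 0.9.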

Figure 1: An example from DEEPPATENT2 extraction results for US Design Patent #0836880.

Figure 2: The pipeline to create the dataset. Text recovered from the figure: figure label from XML (FIG. 1); figure caption (FIG. 1 is a front, top, and left side perspective view of a pet treat according to the new design;).

Figure 3: Average numbers of figures and subfigures per patent each year from 2007 to 2020.

Figure 4: An example of a figure caption and its corresponding inline reference in a patent XML file. Orange highlighted text stands for object names, and blue highlighted text stands for viewpoints.

Table 4: The schema and a sample record of the JSON metadata file.
Field: description. Sample value.
[...] compound figure. Sample: 1
patentID. Sample: 861215
figid: Segmented/individual figure label. Sample: 1
caption: The figure caption directly extracted from the patent XML file. Sample: FIG. 1 is a front, top, and left side perspective view of a pet treat according to the new design;
object: The object name automatically extracted from the figure caption. Sample: Pet treat
aspect: The viewpoint automatically extracted from the figure caption. Sample: front, top, and left side perspective view

Figure 5: Distribution of viewpoints identified in DeepPatent2. Only the top 20 viewpoints are shown.

Figure 6: Distribution of object names identified in DeepPatent2. Only the top 20 object names are shown.

Figure 7: Distribution of top 20 objects with the top three viewpoints color-coded.Note that the bar heights are different from Figure 6 because not all aspects were included.The object names along the abscissa are sorted by the total number of figures.

Figure 10: One object name may appear in multiple individual figures. This figure shows the distribution of the number of individual figures with one object name. For example, when the x-axis value is 7, the y-axis value is 23367, meaning each of the 23367 objects appears in 7 individual figures. The x-axis is truncated because data points beyond the truncation point are sparse.

Figure 9: Distribution of patents, figures per patent, and figures per object identified over the patent classes. Only the top 32 classifications are shown. The remaining classifications are not listed because they contain fewer than 5 figures. The class codes in the abscissa correspond to the Locarno International Classification described in Table 5.

Figure 11: Left: Error rates calculated using the verification dataset over years; Right: Error rates across classes. The classes are ordered by the number of individual figures in each class. Class 6 (Furnishing) has the highest number of individual figures. Class 1 (Foodstuffs) has the highest error rate (0.39).
Table 3 shows statistics of the raw data. XML files before 2006 had a different schema and did not include separate XML tags for individual figures and subfigures, which may introduce further errors in figure label parsing and label-figure alignment, so we only include patents from 2007 onward. The discrepancy between figure numbers parsed from XML (#Figure XML) and figure numbers from segmentation (#Figure Segmented) reflects the imperfection of the text and image processing methods.

Table 2: A comparison of OCR engines on extracting labels and bounding boxes in patent drawings.

The four steps of the image processing pipeline are: (1) figure label detection, including text and positions; (2) segmenting compound figure files into individual figures; (3) associating labels with individual figures; (4) aligning metadata with individual figures. The final dataset product is publicly available at the Harvard Dataverse repository [33].

Table 5 presents a list of top-level classes including the codes and descriptions. Figure 9 illustrates the distribution of patents across different classes (top panel), the average number of figures per patent (middle panel), and the average number of figures per object (bottom panel) in DeepPatent2. The analysis reveals that, on average, each patent contains approximately 5.48 figures and each object is illustrated by 13.83 individual figures.

Table 7: A comparison of image captioning models with different training sizes. The best performance metrics are highlighted in bold. Increasing the training size for automatically tagged data improved the performance of the image captioning models.