Typhoon disaster state information extraction for Chinese texts

Ye, Peng; Zhang, Chunju; Chen, Mingzhu; Li, Shengcai

doi:10.1038/s41598-024-58585-8

Download PDF

Article
Open access
Published: 04 April 2024

Typhoon disaster state information extraction for Chinese texts

Peng Ye^1,2,
Chunju Zhang³,
Mingzhu Chen^4,5 &
…
Shengcai Li²

Scientific Reports volume 14, Article number: 7925 (2024) Cite this article

411 Accesses
Metrics details

Subjects

Abstract

Typhoon disasters undergo a complex evolutionary process influenced by temporal changes, and investigating this process constitutes the central focus of geographical research. As a key node within the typhoon disaster process, the state serves as the foundation for gauging the dynamics of the disaster. The majority of current approaches to disaster information extraction rely on event extraction methods to acquire fundamental elements, including disaster-causing factors, disaster-bearing bodies, disaster-pregnant environment and the extent of damage. Due to the dispersion of various disaster information and the diversity of time and space, it is a challenge for supporting the analysis of the typhoon disaster process. In this paper, a typhoon disaster state information extraction (TDSIE) method for Chinese texts is proposed, which aims to facilitate the systematic integration of fragmented typhoon disaster information. First, the integration of part-of-speech tagging with spatio-temporal information extraction is employed to achieve the tagging of typhoon disaster texts. Second, within the framework of spatio-temporal semantic units, the typhoon disaster semantic vector is constructed to facilitate the identification of information elements of typhoon disaster states. Third, co-referential state information fusion is performed based on spatio-temporal cues. Experimental analysis, conducted using online news as the data source, reveals that the TDSIE achieves precision and recall rates consistently surpassing 85%. The typhoon disaster state information derived from the TDSIE allows for the analysis of spatio-temporal patterns, evolutionary characteristics, and activity modes of typhoon disasters across various scales. Therefore, TDSIE serves as valuable support for investigating the inherent process properties of typhoon disasters.

Studying the narrative of US policy towards China: introducing China-related political texts in Congress

Article Open access 19 July 2023

Natural language processing of multi-hospital electronic health records for public health surveillance of suicidality

Article Open access 14 February 2024

A manually annotated corpus in French for the study of urbanization and the natural risk prevention

Article Open access 22 November 2023

Introduction

Typhoons are one of the most destructive natural disasters in the world¹. Over the past 40 years, the frequency of typhoons in eastern and southeastern Asia has increased 2–3 times². This has a serious impact on natural ecosystems, industrial and agricultural production, transportation, and even human livelihoods³. In light of the escalating threat of disasters, it is crucial to acquire comprehensive disaster information and investigate the patterns of disaster evolution. This is essential for advancing research in disaster prevention and mitigating the risks associated with disasters⁴. According to “Sendai Framework for Disaster Risk Reduction 2015–2030”, traditional technologies and modern means should be fully utilized to enhance disaster monitoring, risk assessment, and service capabilities through methods such as big data, social media, and mobile internet⁵. Within the realm of disaster big data, a myriad of data types are implicated, with textual data particularly standing out due to its abundance and ubiquity. Accordingly, the extraction of typhoon-related disaster information from textual sources has emerged as a focal point within the domain of emergency management^6,7.

Event extraction refers to the automatically extracting user interested event information from unstructured text and presenting it in a structured form⁸. The investigation into event extraction related to typhoon disasters is still in its early stage^9,10. There are two main methods of event extraction: (1) pattern matching method. By formulating the sentence expression pattern of the extracted information, various pattern matching algorithms are used to match the text to be extracted with the extracted template¹¹. This method is relatively accurate, but it often relies on specific domain knowledge and has poor portability. (2) machine learning method, especially deep learning. It mainly includes convolutional neural networks¹², recurrent neural networks¹³, generative adversarial networks¹⁴, graph convolutional networks¹⁵, and other machine learning models. The machine learning method regards the event extraction task as a classification problem, focusing on classifier construction and extracting features. It requires less manual intervention and offers good portability. However, the method makes it may struggle to extract general text features and is highly reliant on the corpus, resulting in lower overall precision.

As a typical geographical event, the typhoon disaster has significant spatio-temporal dynamic features^16,17. Typhoon disaster-oriented event extraction differs from generalized event extraction. It not only identifies the elements inherent in the event (entities, time, attribute values, roles, etc.), but also places heightened emphasis on the entire process of a typhoon disaster, spanning from its start to end. Existing event extraction methods can achieve typhoon disaster detection. For instance, within typhoon disaster texts, occurrences such as <Time1, Place1, infrastructure damage> and <Time2, Place2, airport closure> represent two outcomes of event extraction. However, these fragmented disaster details fail to illustrate the dynamic spatio-temporal evolution of typhoon disasters¹⁸. There are the following key problems to be solved: (1) at present, there is a lack of information modeling for typhoon disaster process, making it difficult to standardize various information of typhoon disaster process. A systematic information modeling and classification significantly impact the accuracy of information extraction. (2) The process is scale-dependent, and there are differences in the spatial and temporal range and evolution sequence of typhoon disasters at different scales. In emergency management, it is essential to comprehend the processes of typhoon disasters from multiple spatiotemporal scales. Consequently, acquiring information of different granularities to describe these processes is also imperative. (3) Various disaster elements undergo dynamic changes during the process of typhoon disasters. For instance, in the <Time1, Place1, infrastructure damage> , <Time1, Place2, house collapse> , and <Time2, Place2, airport closure> , the subjects of these three event extraction results are infrastructure, houses, and airport respectively. However, the general event extraction method ignores the differences in information representation elements and their hierarchical relationships. Henceforth, this study regards states as the fundamental units for measuring processes, and proposes an information method for typhoon disaster states. The aim is to provide multi-granular information support for analyzing typhoon disaster processes at different spatio-temporal scales in emergency management.

At present, the scarcity of annotated corpora for typhoon disaster states poses challenges in directly applying relevant event extraction methods like pattern matching and machine learning. In addition, the abundance of text expressions featuring the co-occurrence of various typhoon disaster states complicates the extraction process. In response, this paper propose a method of typhoon disaster state information extraction (TDSIE). This method identifies diverse types of information characterizing typhoon disaster states and integrates information from co-referential states. The main contributions include:

(1)
Leveraging diverse spatio-temporal characteristics, including spatio-temporal information elements, spatio-temporal semantic units, and spatio-temporal clues, this approach extracts typhoon disaster state information from Chinese text. This addresses challenges related to the dispersion of disaster information and the diverse spatio-temporal granularity inherent in conventional disaster information extraction methods.
(2)
By employing Typhoon Lekima as a case study, the extracted typhoon disaster state information is utilized for multiscale analysis of spatio-temporal patterns and evolution characteristics. This study offers valuable insights into the application scenarios and advantages of the research results.

The main chapters of this paper are as follows: “Related works” introduces the related works, “Basic ideas” explains the basic ideas of typhoon disaster state, “Methodology” proposes the method of TDSIE; “Results and discussion” presents the experimental evaluation and the case study, and “Conclusion” presents the conclusions and future work.

Related works

The event extraction method needs to rely on the results of natural language processing tasks such as named entity identification, coreference resolution, and relationship extraction, but they are not the focus of the event extraction itself. Presently, the research on event extraction predominantly adheres to the evaluation criteria of conference ACE 2005, comprising four sub-tasks: (1) event trigger identification. This task involves discerning whether the words are event triggers or event types; (2) semantic role labeling. The task is to examine the relationship between various components and the predicate in the event, taking the predicate as the focal point; (3) event attribute classification. The task entails determine event attributes and describe objects; (4) event coreference resolution. The task is to identify whether two event instances belong to the same event¹⁹.

To realize the event extraction, two methods are mainly used: (1) pattern matching method. The key to this method is the construction of event templates, which is mainly used to indicate the context constraint information that constitutes the target event²⁰. The original template construction mainly relies on manual summary, which requires special professional knowledge²¹. Researchers have endeavored to employ machine learning for the automated construction of event templates, demonstrating commendable performance in domain-specific event extraction tasks. However, these templates exhibit limitations in cross-domain applications, leading to suboptimal portability²². (2) machine learning method. Based on the statistical model, this method transforms event detection and argument identification into classification problems. The machine learning method can be divided into pipeline method and feature union method. In the pipeline method, since the trigger identification always precedes the argument identification, the event argument cannot be considered when the event trigger identification is performed, which limits the accuracy of the trigger identification^23,24. Therefore, researchers have proposed the feature union method. The feature union method constructs a joint learning model for tasks such as trigger identification and argument identification, so that trigger word and argument information can promote each other’s extraction effect^25,26.

In recent times, machine learning methods, particularly deep learning and neural networks, have emerged as primary techniques for event extraction. Fully connected neural networks, convolutional neural networks²⁷ and recurrent neural networks^28,29 have been applied to event extraction. In addition, the weakly supervised method, which can automatically generate tagged corpus and alleviate the problem of data sparsity, is gradually gaining traction in event extraction applications.

Because of the significant spatio-temporal dynamic features of typhoon disasters, it is difficult to achieve the event extraction of typhoon disasters on multiple spatio-temporal nodes. More importantly, the results of events extraction are all static information elements related to typhoon disasters, which are difficult to reflect the dynamic nature of the process of typhoon disaster^30,31. The state is the detailed representation of a continuous process, and the state of typhoon disaster is the existing form of typhoon disaster under specific spatio-temporal conditions³². Thus, the extraction of different states of typhoon disaster becomes the premise to perceive the dynamic characteristics of typhoon disaster. The existing event extraction methods can be used for reference: (1) it is essential to ascertain whether the information within the text constitutes a state element and to determine its type. This is crucial for achieving the identification of the typhoon disaster states. (2) The text may encompass multiple states simultaneously, and a particular state may be referenced multiple times within the text. It is imperative to discern whether the recognized state pertains to the co-referential state of a typhoon disaster under the same spatio-temporal conditions.

Nonetheless, the characteristics of typhoon disaster states present unique challenges that render standard event extraction methods unsuitable for their extraction. (1) The classification of typhoon disaster states exhibits fine granularity and encompasses various types, with certain similarities in the features of state contexts. The application of pattern matching methods can potentially result in trigger template errors, hindering the formulation of accurate matching templates. (2) Supervised machine learning and deep learning heavily depend on extensive labeled corpora. Otherwise, the issue of data sparsity becomes pronounced. Currently, there is a scarcity of tagged Chinese corpora for the states of typhoon disasters, posing a challenge to the effective training of relevant machine learning models. (3) Events, being conceptual units more complex than entities, pose a challenge for existing extraction methods that rely on deep semantic understanding. Presently, deep semantic understanding technology is imperfect, exhibiting low accuracy and lacking universality in open fields. This limitation may impede the information extraction of typhoon disaster states. Hence, it becomes imperative to adeptly integrate the Chinese text description characteristics with the evolution process features of typhoon disasters. This approach is essential for devising a method tailored to the effective information extraction of typhoon disaster states.

Basic ideas

A typhoon disaster constitutes a catastrophic event resulting from a typhoon, giving rise to casualties, economic losses, and environmental damage. Regarding component objects, typhoon disasters encompass the disaster-pregnant environment, disaster-causing factors, and the disaster-bearing body. In terms of dynamic characteristics, the interaction among these components delineates the developmental process of motion. Consequently, information related to typhoon disasters can be categorized into three levels: object level, process level, and state level. The typhoon disaster information representation model is detailed in Fig. 1.

In the representation model of typhoon disaster information, the process directly refers to the lifecycle composed of stages such as initiation, development, persistence, and decay. State sequences constitute the fundamental units of the process. State sequences are formed by connecting states to each other, thus states represent relatively stable forms of existence in the typhoon disaster process. Under specific time and place conditions, combined with the attributes, behaviors and influence, different states are formed.

(1)
Time and place serve as the basic framework. Time and place describe the spatio-temporal features of states, providing the prerequisite conditions for the existence of other dimensional features. Specifically, time measures the sequence of state occurrences, while place records the spatial or positional context of state occurrences.
(2)
Attributes and behaviors serve as the core features. Attributes and behaviors describe the intrinsic features of typhoons, representing important manifestations of their developmental stages. Attributes, which are relatively static, depict the morphology of typhoons, while dynamic behaviors illustrate their actions.
(3)
Influence serves as the peripheral interaction feature. Influence describes the interaction between typhoons and the surrounding environment. The genesis, development, and evolution of typhoons are all influenced by their surrounding environment. Moreover, typhoons can cause significant damage to the surrounding environment after occurrence, all of which fall under the category of influence features.

Furthermore, based on the differences in attribute and behavior features, states can be categorized into different types. States can be regarded as highly generalized representations of various features of typhoon disasters under specific spatio-temporal conditions. For instance, on August 10, 2019, at 1:45 am, Typhoon Lekima was located in Wenzhou city, Zhejiang province, with maximum wind speeds reaching level 16. By August 11, at 8:50 pm, it had moved to Qingdao city, Shandong province, with maximum wind speeds decreasing to level 9. Typhoon Lekima was in two different states at <1:45am on August 10th, Wenzhou city> and <8:50 pm on August 11th, Qingdao city>. If we categorize states based on wind force levels, the types of the two states would be “super typhoon level” and “tropical storm level,” respectively.

In this paper, the structure of the typhoon disaster state is defined as a five-tuple of <time, place, attribute, behavior, influence>. The classification system and coding of typhoon disaster state information are shown in Table 1.

Table 1 Classification system and coding of typhoon disaster state information.

Full size table

Methodology

To extract dynamic typhoon disaster information from extensive text within a big data environment, a method called typhoon disaster state information extraction (TDSIE) is proposed, and its technical framework is depicted in Fig. 2. The TDSIE process unfolds in several steps:

(1)
Part-of-speech tagging, time, and place extraction.

These tasks are executed independently, and their results are amalgamated to achieve comprehensive word tagging within the Chinese text.
(2)
Text segmentation into spatio-temporal semantic units.

Based on time and place labels, the Chinese text is segmented into distinct spatio-temporal semantic units. The embedding features of word vectors are then extended, incorporating text characteristics specific to the typhoon disaster states. Then, utilizing vector clustering, various elements defining the typhoon disaster states are identified within each spatio-temporal semantic unit.
(3)
Identification of state coreference relationships.

Leveraging time and place elements within the states of typhoon disasters as cues, a model is employed to identify coreference relationships across different spatio-temporal semantic units. Subsequently, relevant information pertaining to state coreference is fused.

This methodology aims to effectively extract and integrate information regarding the typhoon disaster states from massive textual data in a big data environment.

Part-of-speech tagging considering spatio-temporal information

Part-of-speech tagging entails assigning grammatical categories to words based on their linguistic attributes. However, when it comes to time and place information, conventional part-of-speech tagging often labels them as nouns, pronouns, and similar categories, making it challenging to distinguish them from other types of information. Within the constituent elements defining the state of typhoon disasters, time and place serve as the foundational framework. As such, the spatio-temporal features embedded in Chinese text become crucial for extracting information related to typhoon disaster states. Therefore, it becomes imperative to consider the spatio-temporal semantics of content during the part-of-speech tagging process.

Time and place information extraction can effectively make up for the deficiency of part-of-speech tagging^34,35. Building upon the initial part-of-speech tagging, the extraction of time and place information is carried out separately. The outcomes of these extractions are then annotated with corresponding time and place tags, subsequently replacing the original tags within the part-of-speech tagging results (Table 2). The meaning and explanation of part of speech³⁶, time and place tags are shown in Table 3.

Table 2 Example of text tagging.

Full size table

Table 3 Main tags and their meanings.

Full size table

State element identification based on spatio-temporal semantic unit

Construction of spatio-temporal semantic unit

Words, phrases, clauses, sentences or paragraphs are all linguistic units, and the basic structure of the text is formed through semantic relations between units³⁷. If some linguistic units or different linguistic units can be combined to express a complete semantic connotation, it is a semantic unit. For instance, “At 6:00 on August 10, the center of Typhoon Lekima, the ninth typhoon of the year, was located in Huangyan District, Taizhou City, Zhejiang Province” is a semantic unit about a typhoon disaster state.

Through text analysis of typhoon events, changes in spatio-temporal information serve as transition markers in describing the state of typhoon disasters. Therefore, the spatio-temporal information can be used to track the state transitions recorded in the text. It is evident that the semantic units of typhoon disaster states contain spatio-temporal information, which is the important basis for classifying the semantic units of typhoon disaster states. Based on the spatiotemporal characteristics of the text, a framework for represnting spatio-temporal semantic units is constructed, and the text is divided into different spatio-temporal semantic units (Fig. 3).

State element identification

Each spatio-temporal semantic unit contains various elements about the typhoon disaster states. Since the time and place information has been tagged, the identification of state elements mainly aims at attribute, behavior and influence information. In this paper, an extended semantic vector-based approach for recognizing state elements of typhoon disasters is proposed.

The first step involves extending the feature set of word vector embedding specifically tailored for typhoon disaster states. Currently, word vector construction predominantly focuses on modeling words, grammar, and contextual aspects within text. However, the specific domain features related to typhoons are often overlooked in tasks such as state element identification. To augment discrimination, the extended feature set for word vector embedding in this study incorporates feature words, feature word types, and part-of-speech information. This comprehensive approach aims to enhance the representation of typhoon-related content and improve the accuracy of state element identification.

(a)
Feature words and their types. By sorting out the semi-structured typhoon event knowledge provided in the form of information boxes in the online encyclopedia knowledge base, several different types of typhoon disaster state feature words are summarized and refined to form a feature lexicon (Table 4). The word w_i in the text is matched with the feature lexicon, and the lexical tag is encoded r_i according to the different types of matching. If all the words do not match, r_i = 0.
(b)
Part of speech. As a basic grammatical attribute, part of speech can judge the component words act as in the text. According to the tagging results, the part of speech of the tagged words is si.

The words are combined with various extended features to form the word-feature pairs < w_i–r_i–s_i > . Word-feature pairs are used to train word vectors, and each word is mapped to a k-dimensional real number vector to get the typhoon disaster semantic vector v_i.

The second step is trigger word generation based on extended vector clustering. The typhoon disaster state needs to be characterized by a series of state elements, and the high-frequency words that appear repeatedly in the text are more likely to be state elements. In particular, because the semantics of numerals and quantifiers are clear to distinguish, the trigger words of state elements of typhoon disasters are mined from nouns and verbs. (a) The word frequency statistics of nouns and verbs appearing in the text are carried out, and the high-frequency words are selected to form a candidate set. (b) The hierarchical clustering method is used to cluster the words in the candidate set. Based on the typhoon disaster semantic vector v_i, the semantic similarity can be determined by the cosine distance between the vectors:
$${\text{cos}}\left({w}_{i},{w}_{j}\right)=\frac{\sum_{k=1}^{N}({w}_{i}^{k}\times {w}_{j}^{k})}{\sqrt{\sum_{k=1}^{N}{({w}_{i}^{k})}^{2}}\times \sqrt{\sum_{k=1}^{N}{({w}_{j}^{k})}^{2}}}$$
(1)

In the formula, ${v}_{i}^{k}$ denotes the kth dimension of the word vector of the word w_i, and N denotes the dimension of the vector.
(c)
The noise of the clustering results is filtered, includes: non-feature class cluster, and the whole class cluster has no feature words in Table 4; the non-triggering candidate words in the characteristic cluster. The similarity between the center word of the cluster and each candidate word in the cluster is calculated, and the candidate words with great differences in similarity are filtered out. Then, the filtered clusters are calculated for similarity with different types of feature words in Table 4, and the clusters are divided into the feature types with the highest similarity to form the attribute, behavior and influence trigger lexicon of typhoon disaster states.

Table 4 Feature words of state elements of typhoon disasters.

Full size table

The third step involves the identification of state elements through the fusion of trigger words and rules. The expression forms of state element information include two types: key-value type and representation type. The key-value type presents the form of “element name → element value”, while the representation type directly expresses the element value without prompting the element name. For key-value type, not only the element information needs to be identified, but also the relationship between “element name → element value” needs to be further extracted. The state element information has the following distribution law: (a) co-occurrence law: element names and element values often appear in the same sentence; (b) positional law: element names and element values are positionally adjacent, and their positional distributions are usually not more than three words; (c) part of speech law: element values are often composed of “numerals + quantifiers “. Due to the territoriality of typhoon texts, the applicable quantifiers are also relatively fixed, it mainly includes kilometers, meters, millimeters, meters/seconds, kilometers/hour, level, hectopascal, etc. Through pattern matching of text and trigger rules, the state elements of typhoon disasters are identified (Table 5). Furthermore, the type of state elements (attribute, behavior, and influence) is determined based on the category of trigger words.

Table 5 Trigger rules for state elements of typhoon disasters.

Full size table

Structured representation of state elements

According to the tuple structure of typhoon disaster states, the typhoon disaster state is reconstructed through the process of slot filling. Formally, the typhoon disaster state can be represented as TS:

$$TS=<T, P, A, B,I>$$

(2)

In the formula, T is the time element, P is the place element, A is the attribute element, B is the behavior element and I is the influence element.

Because the text has been divided into many spatio-temporal semantic units, the information of state elements of typhoon disasters is distributed in each unit. Therefore, the information of state elements can be structured according to the units they belong to. In each spatio-temporal semantic unit, the time and place information of part-of-speech tagging and the identified attributes, behaviors and influence information are filled according to the slot of formula (2) (Fig. 4). The description of typhoon disaster state in a spatio-temporal semantic unit may be limited to a certain aspect, and the attribute, behavior and influence elements may be missing in the structural representation.

State information fusion based on spatio-temporal cues

The same state of a typhoon disaster may be repetitively mentioned at various locations within the text. The spatio-temporal scales of different typhoon disaster states are also inconsistent. Moreover, certain types of state elements may be absent, resulting in incomplete and semantically ambiguous information obtained through state extraction³⁸. Therefore, after identifying the elements of typhoon disaster states, further integration is required for co-referential information pertaining to typhoon disaster states. On one hand, aligning the spatio-temporal scales and element content of different states is possible. On the other hand, efforts should be made to avoid redundant information in co-referential states.

Typhoon disaster states exist in a certain range of time and space, and time and place become an important basis for distinguishing different typhoon disaster states. It is worth noting that the information of time and place in the text is often inconsistent and incomplete, and a lot of information cannot be fused directly by hard matching. This paper takes time and space as a clue to explore the important features of co-referentiality among multiple spatio-temporal semantic units. In the fusion process, it is assumed that the typhoon disaster states TS₁ and TS₂, <T₁, P₁, A₁, B₁, I₁> and <T₂, P₂, A₂, B₂, I₂> are the elements of the typhoon disaster states, and the relevant rules in Table 6 are followed.

Table 6 Fusion rules of state information based on spatio-temporal cues.

Full size table

In Table 6, the same element name means the same type of state element in TS₁ and TS₂. For instance, if both TS₁ and TS₂ have attribute elements about wind scale, then TS₁ and TS₂ have the same element names. Element value equality is applicable to text-type element values; if two text-type element values exhibit a reference relationship, they are deemed equal. For instance: B₁ is wind enhancement, and B₂ is grade enhancement. The semantic meaning refers to the typhoon level rising, then element values of B₁ and B₂ are equal. Element value compatibility pertains to numeric-type element values; if two numeric-type element values share an inclusion relationship within the numerical range, they are considered compatible. For instance, A₁ is the wind speed of 30.2 m/s, and A₂ is the wind speed of 30.23 m/s. From the perspective of data accuracy, A₂ ⊆ A₁, then element values of A₁ and A₂ are compatible.

Results and discussion

Experimental environment

Compared to traditional official disaster statistics, online news not only provides access to more sources of typhoon event information beyond official media but also exhibits the following characteristics:

(1)
Timeliness: various news media utilize the internet as a medium for instant information dissemination, offering a convenient channel for the publication and dissemination of typhoon event information.
(2)
Massive volume: numerous online news sources swiftly aggregate and spread during or after typhoon events, increasingly manifesting characteristics of big data in terms of information quantity, dissemination speed, content diversity, and application value.
(3)
Diversity: news coverage of typhoon events encompasses various aspects such as typhoon development, weather conditions, disaster situations, and rescue operations. Online news has become an essential media resource that cannot be ignored in the field of disaster management.

Major domestic websites, such as People’s Daily Online, CCTV News, The Paper, and the China Meteorological Administration portal website, have dedicated columns for special series reporting, covering official media, information platforms, and government portal websites. In this study, 1012 news reports on Typhoon Lekima from August 7th to August 13th, 2019, are selected as experimental data.

In the experiment, the accuracy of the state element identification and the effect of state information fusion are analyzed. The ANSJ tool (https://github.com/NLPCHINA/ansj_seg) is used for word segmentation, part-of-speech tagging, and time and place information extraction. Based on the Word2Vec vector (https://github.com/NLPchina/Word2VEC _java), the experimental results of typhoon disaster semantic vector are compared to verify the contribution of extended embedding features in the word vector to state element identification in this method. Compared with the results of information fusion based on text similarity, the effect of spatio-temporal features as clues in state information fusion is verified. In the experiment, the dimension of the word vector is 200, and the CBOW model is used to train the word vector with a window size of 5. The distance between clusters is not less than 0.5 in hierarchical clustering. The evaluation of text similarity adopts the Levenshtein Distance³⁹, with a similarity threshold of [0.7, 1).

The precision rate (P), recall rate (R) and F1 value are selected as the evaluation indexes of the state element identification effect, and the precision rate (P) is used as the evaluation index of the state information fusion effect.

$${\text{P}}\left({\text{i}}\right)=\frac{{N}_{TP,i}}{{N}_{FP,i}+{N}_{TP,i}}\times 100\%$$

(3)

$${\text{R}}\left({\text{i}}\right)=\frac{{N}_{TP,i}}{{N}_{TP,i}+{N}_{FN,i}}\times 100\%$$

(4)

$${\text{F}}1\left({\text{i}}\right)=\frac{2\times P(i)\times R(i)}{P\left(i\right)+R(i)}\times 100\mathrm{\%}$$

(5)

In the formula, N_TP,i represents the number of correctly judged samples in class i; N_FP,i represents the number of samples mistakenly classified as class i; N_FN,i represents the number of i-th class samples that are mistakenly classified as other classes.

Experimental results

Comparison of state identification effect

To assess the impacts of different word vector models, the experimental texts is randomly divided into five groups. Four of these groups serve as training data for word vector models, while the remaining one is designated as the test data. The final identification result is derived from the average value obtained through cross-validation. Figure 5 illustrates the precision (P), recall (R), and F1 values for various types of state elements related to typhoon disasters. In this experiment, the Word2Vec is chosen as the baseline method, and it is compared with the state semantic vector proposed in this study.

The experimental results show that the state elements can be identified accurately based on typhoon disaster semantic vector, and the spatio-temporal semantic unit framework establishes the relationship between different state elements and forms structured typhoon disaster state information. The overall extraction precision of attribute elements is higher than that of other elements, reaching P = 93.85%, R = 98.00%. This is because the attribute types of the typhoon disaster state are limited and easier to express explicitly in text. In contrast, the identification of the influence elements is not good, with a recall rate of just 80%. Typhoons will have a destructive impact on the natural environment, social activities and other aspects. The existing types of influence feature words based on the online encyclopedia knowledge base are not comprehensive, which makes it impossible to correctly judge the missing types of related words in word vector training and vector clustering. The identification effect can be improved by increasing the type of feature words and expanding the scale of training text.

Comparing the identification effects under different input vectors, the precision of the identification results based on typhoon disaster semantic vector is significantly better than the Word2Ver, and the recall rate is also improved to a certain extent, resulting in a corresponding increase in the F1 value. Word2Ver reflects the importance of words in a sentence and their implicit relationships with other words. However, these features are still relatively sparse, ignoring the semantic features of typhoon disaster states contained in relevant words. The typhoon disaster semantic vector can well integrate the typhoon disaster state features and context characteristics of the words, and the precision of various state elements is generally balanced. In general, TDSIE uses unsupervised methods to reduce the processing cost of typhoon disaster corpus, and achieves a more accurate identification effect by mining the spatio-temporal features and language knowledge in the text.

Comparison of state fusion effect

200 extracted typhoon disaster states are randomly selected, forming 100 sets of state pairs for information fusion. Randomly selecting a total of five iterations, the average value after verification is considered as the final fusion result. Figure 6 displays the precision (P) and numerical statistics of different fusion methods in this experiment. In this experiment, the text similarity-based method is chosen as the baseline method, and it is compared with the spatio-temporal cues-based method proposed in this study.

The results show that there are 93.4 sets of state pairs correctly merged by the fusion method based on spatio-temporal cues, with a precision rate of 93.40%, and all indicators are higher than those based on the text similarity method. It is effective to use spatio-temporal features as the key strategy for state information fusion. The change of time and space is the fundamental sign of state transition. It is not accurate to judge whether the state changes by the change of attribute, behavior or influence.

Analyze the typhoon disaster states that are fused incorrectly in the experiment, mainly due to two aspects: (a) coreference resolution error. The relevant time and place information is not accurately disambiguated. For instance, place information “Jilin” does not accurately resolve whether it is “Jilin Province” or “Jilin City, Jilin Province”; (b) The state element identification error, or the state elements are incomplete. This will lead to the lack of key information on the typhoon disaster states, and thus cannot correctly analyze the fusion results.

Experimental result analysis

The TDSIE method exhibits precise and well-balanced identification of various typhoon disaster state elements, resulting in an average 29% increase in the F1 value. Moreover, it accurately combines co-referential states, demonstrating the effectiveness of TDSIE. TDSIE does not rely on extensive tagging corpora for model training, nor does it necessitate deep semantic analysis of text details. Therefore, it achieves accurate typhoon disaster state information extraction. The scope of information extraction is typhoon disaster information in Chinese text that conforms to the state tuple structure, without limitations to a specific type of state, making TDSIE a universal approach.

Case study

In the present big data environment, extracting disaster information from text serves as an effective supplement to traditional disaster monitoring methods. TDSIE can proficiently extract information regarding the state of typhoon disasters across various spatio-temporal scales, effectively addressing challenges related to the dispersion of disaster information and spatio-temporal granularity diversity. Conducting quantitative classification based on the results of typhoon disaster state information extraction enables the detection of dynamic characteristics within the typhoon disaster process and provides a comprehensive global reference. This paper utilizes the results of typhoon disaster state information extraction from August 10 to 12, 2019, as a case study.

Typhoon disaster states at various spatio-temporal scales

The abstraction and description of typhoon events across various granularities provide extensive support for disaster early warning, monitoring, command, evaluation, and other decision-making processes. For instance, when analyzing the development trend of a typhoon at the municipal scale, extracting the typhoon disaster state information for different cities at a specific time becomes crucial. To meet the demands of more detailed analysis, reducing the temporal scale enables the examination of typhoon disaster state changes in shorter time intervals. Similarly, reducing the spatial scale facilitates the assessment of disparities in typhoon disaster states among various districts and counties within the city (Fig. 7).

Spatial patterns of typhoon disaster states

The typhoon disaster state information for each city was assessed at a temporal scale of 1 h and a spatial scale at the municipal level (Fig. 8). Throughout the period spanning from August 10 to 12, 2019, corresponding to the advancement of Typhoon Lekima’s life cycle, the impact range of the typhoon exhibited a general pattern of expansion followed by contraction. In comparison with the typhoon track and wind circle issued by the China Meteorological Administration, the distribution range of the detected typhoon state appears more irregular. This suggests non-uniform wind force within the typhoon’s wind circle in different directions.

Evolution characteristics of typhoon disaster states

Adopting a temporal scale of 2 h, the extraction of typhoon disaster state information for Hangzhou city and Qingdao city is conducted, followed by the identification of the typhoon state type (Fig. 9). The time periods impacted by Typhoon Lekima in these two cities exhibit significant differences. Hangzhou city, being close to the typhoon’s landing point, experiences a prolonged impact as the typhoon moves slowly with high wind strength. On the other hand, when Typhoon Lekima affects Qingdao city, it is already in the late stages of its life cycle, resulting in a substantial weakening of the typhoon’s wind strength. Consequently, the impact on Qingdao city is shorter, and the wind strength is lower compared to Hangzhou city.

Application prospects

The method proposed in this paper provides robust support for enhancing the service capability of textual big data resources in disaster management. TDSIE, from the perspectives of static composition and dynamic process, constructs the basic framework for acquiring and reconstructing typhoon disaster information, thereby offering methodological insights for the extraction of other types of persistent disaster information. Extracting disaster state information from multiple sources of textual data enhances the comprehensiveness of disaster detection, thereby supporting various aspects of disaster management such as warning, monitoring, command, and assessment. In TDSIE, spatio-temporal features serve as crucial foundations for element identification and information fusion. However, this method overlooks the vagueness inherent in time and space information, thus it cannot handle metaphorical time expressions (e.g. when the house collapses) or spatial relationships that require reasoning (e.g. opposite to the building). The natural language expression in textual data often exhibit vagueness, and the existing TDSIE method are not suitable for social media texts with obvious colloquialism.

Conclusion

The states represent segments in the development process of typhoon disasters. Extracting state information of typhoon disasters is a prerequisite for exploring their dynamic characteristics and analyzing evolutionary patterns. Due to the scarcity of annotated corpora specifically for typhoon disaster states, prevalent methods such as pattern matching and machine learning for event extraction cannot be directly applied. The research conclusions encompass two aspects: (1) This study comprehensively analyzes the features of spatio-temporal information elements, spatio-temporal semantic units, and spatio-temporal clues contained within the text. By integrating spatio-temporal features with methods such as part-of-speech tagging, state element identification, and state information fusion, it achieves the information extraction of typhoon disaster states from Chinese texts. The TDSIE addresses issues faced by generic disaster information extraction methods, such as dispersed disaster information and diverse spatio-temporal granularities. (2) Taking Typhoon Lekima as a case study, the analysis of extracted typhoon disaster state information from perspectives such as spatial patterns, temporal relationships, and evolutionary trends can provide a holistic reference for the timely detection of the dynamic processes of typhoon disasters.

In future research, the precision of information extraction of typhoon disaster states will be enhanced by expanding the core set of state elements, optimizing the effectiveness of semantic analysis in coreference resolution, and improving the judgment of timeliness. Additionally, studying reasoning methods for vagueness spatio-temporal information in typhoon states enhances the applicability of TDSIE to other social media textual data.

Data availability

The datasets analysed during the current study are not publicly available due to private proprietary but are available from the corresponding author on reasonable request.

References

Vandermeer, J. Hurricane disturbance and tropical tree species diversity. Science 290, 788–791 (2000).
Article ADS CAS PubMed Google Scholar
Mei, W. & Xie, S. Intensification of landfalling typhoons over the northwest Pacific since the late 1970s. Nat. Geosci. 9, 753–757 (2016).
Article ADS CAS Google Scholar
Li, K. & Li, G. Vulnerability assessment of storm surges in the coastal area of Guangdong province. Nat. Hazards Earth Syst. Sci. 11, 2003–2010 (2011).
Article ADS CAS Google Scholar
Alexander, D. World disasters report 2005: Focus on information in disasters. Disasters 30, 377–379 (2006).
Article Google Scholar
Fan, Y. The future challenge of disaster risk management in China: Interpretation “Sendai Framework for Disaster Risk Reduction 2015–2030”. Disaster Rep. China 7, 18–21 (2015).
Google Scholar
Li, W. & Wen, J. Development in disaster information extraction from web pages. J. Catastrophol. 25, 119–123 (2010).
CAS Google Scholar
Zhao, Q., Chen, Z., Liu, C. & Luo, N. Extracting and classifying typhoon disaster information based on volunteered geographic information from Chinese Sina microblog. Concurr. Comp.-Pract. E 31, e4910.1-e4910.13 (2019).
Article Google Scholar
Ahn, D. The stages of event extraction. In Proceedings of the Workshop on Annotations and Reasoning about Time and Events. ACL, Sydney. 1–8 (2006).
Chen, Z. & Lim, S. Collecting typhoon disaster information from twitter based on query expansion. ISPRS Int. J. Geo-Inf. 7, 139 (2018).
Article Google Scholar
Yu, J., Zhao, Q. & Chin, C. S. Extracting typhoon disaster information from VGI based on machine learning. J. Mar. Sci. Eng. 7, 318 (2019).
Article Google Scholar
Xu, F. & Ren, A. A method for automatic recognition of typhoon disaster information based on classification rule. In Proceedings of 2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Shanghai. 1888–1891 (2011).
Nguyen, T.H. & Grishman, R. Event detection and domain adaptation with convolutional neural networks. In Proceedings of International Joint Conference on Natural Language Processing. 365–371 (2015).
Nguyen, T.H., Cho, K. & Grishman, R. Joint event extraction via recurrent neural networks. In Proceedings of North American Chapter of the Association for Computational Linguistics. 300–309 (2016).
Wang, R., Zhou, D.Y. & He, Y.L. Open event extraction from online text using a generative adversarial network. In Proceedings of the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Hong Kong. 282–291 (2019).
Yan, H.R., Jin, X.L, Meng, X.B., Guo, J.F. & Cheng, X.Q. Event detection with multi-order graph convolution and aggregated attention. In Proceedings of the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Hong Kong. 5766–5770 (2019).
Liang, C. Y. et al. Assessing the effectiveness of social media data in mapping the distribution of typhoon disasters. J. Geo-Inform. Sci. 20, 807–816 (2018).
Google Scholar
Yang, T. F., Xie, J. B., Li, Z. Y. & Li, G. Q. A method of typhoon disaster loss identification and classification using microblog information. J. Geo-Inform. Sci. 20, 906–917 (2018).
Google Scholar
Liu, S., Chen, Y., He, S., Liu, K. & Zhao, J. Leveraging FrameNet to improve automatic event detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2134–2143 (2016).
Missingham, R. Access to Australian Government information: A decade of change 1997–2007. Gov. Inform. Q. 25, 25–37 (2008).
Article Google Scholar
Siddharth, P. & Ellen, R. A unified model of phrasal and sentential evidence for information extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP'09). Vol. 1. Association for Computational Linguistics. 151–160 (2009).
Huang, R. & Ellen, R. Modeling textual cohesion for event extraction. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, AAAI 2012. 1664–1670 (AAAI Press, 2012).
Emanuela, B., Romaric, B., Olivier, F. & Brigitte, G. Event role extraction using domain-relevant word representations. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1852–1857. Association for Computational Linguistics, Doha (2014).
Han, R.J., Ning, Q. & Peng, N.Y. Joint event and temporal relation extraction with shared representations and structured prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Hong Kong. 434–444 (2019).
Zhang, J.C., Qin, Y.X., Zhang, Y., Liu, M.C. & Ji, D.H. Extracting entities and events as a single task using a transition-based neural model. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. IJCAI.org, Macao. 5422–5428 (2019).
Zhang, Z.S., Kong, X., Liu, Z.Z., Ma, X.Z. & Hovy, E. A two-step approach for implicit event argument detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7479–7485. Association for Computational Linguistics (2020).
Huang, K.H., Yang, M. & Peng, N.Y. Biomedical event extraction with hierarchical knowledge graphs. In Proceedings of the 2020 Findings of the Association for Computational Linguistics. Association for Computational Linguistics. 1277–1285 (2020).
Feng, X., Qin, B. & Liu, T. A language-independent neural network for event detection. Sci. China Ser. F Inf. Sci. 61, 81–92 (2018).
Google Scholar
Li, P., Zhu, Q. & Zhou, G. Joint modeling of argument identification and role determination in Chinese event extraction with discourse-level information. In Proceedings of International Joint Conference on Artificial Intelligence. 2120–2126 (2013).
Zeng, Y., Yang, H., Feng, Y., Wang, Z. & Zhao, D. A convolution BiLSTM neural network model for Chinese event extraction. In Natural Language Understanding and Intelligent Applications (eds Zeng, Y. et al.) 275–287 (Springer, 2016).
Chapter Google Scholar
Yu, L., Lu, F. & Zhang, H. Extracting geographic information from web texts: Status and development. J. Geo-inform. Sci. 17, 127–134 (2015).
Google Scholar
Lv, X. F. & Chen, S. Y. Review of natural disaster network public opinion information analysis and management. Geogr. Geo-inform. Sci. 32, 49–56 (2016).
Google Scholar
Ye, P., Zhang, X., Huai, A. & Tang, W. Information detection for the process of typhoon events in microblog text: A spatio-temporal perspective. ISPRS Int. J. Geo-Inf. 10, 174 (2021).
Article Google Scholar
Ye, P. et al. TKRM: A formal knowledge representation method for typhoon events. Sustainability 12, 2030 (2020).
Article CAS Google Scholar
Zhang, C., Zhang, X., Li, M. & Wang, S. Interpretation of temporal information in Chinese text. Geogr. Geo-Inform. Sci. 30, 1–7 (2014).
Google Scholar
Wang, S., Zhang, X., Ye, P. & Du, M. Deep belief networks based toponym recognition for Chinese text. ISPRS Int. J. Geo-Inf. 7, 217 (2018).
Article Google Scholar
Yu, S., Zhu, X. & Duan, H. The guideline for segmentation and part of speech tagging on very large scale corpus of contemporary Chinese. J. Chin. Inform. Proc. 6, 58–64 (2000).
Google Scholar
De Robert-Alain, B. & Wolfgang, D. U. Introduction to Text Linguistics (Longman Paperback, 1981).
Google Scholar
Mann, G.S. & Yarowsky, D. Multi-field information extraction and cross-document fusion. In Proceedings of Association for Computational Linguistics. Association for Computational Linguistics (2005).
Soukoreff, R.W. & MacKenzie, I.S. Measuring errors in text entry tasks: An application of the Levenshtein string distance statistic. In Proceedings of the ACM Conference on Human Factors in Computing Systems-CHI 2001. ACM. 319–320 (2001).

Download references

Funding

This research was funded by the National Natural Science Foundation of China (Grant Nos. 42301522 and 42171453) and the Key Project of Philosophy and Social Science Research in Colleges and Universities in Jiangsu province (Grant No. 2022SJYB2128).

Author information

Authors and Affiliations

Urban Planning and Development Institute, Yangzhou University, Yangzhou, 225127, China
Peng Ye
College of Architectural Science and Engineering, Yangzhou University, Yangzhou, 225127, China
Peng Ye & Shengcai Li
School of Civil Engineering, Hefei University of Technology, Hefei, 230009, China
Chunju Zhang
Key Laboratory of Virtual Geographic Environment, Ministry of Education, Nanjing Normal University, Nanjing, 210023, China
Mingzhu Chen
Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing, 210023, China
Mingzhu Chen

Authors

Peng Ye
View author publications
You can also search for this author in PubMed Google Scholar
Chunju Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Mingzhu Chen
View author publications
You can also search for this author in PubMed Google Scholar
Shengcai Li
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Conceptualization, Peng Ye and Chunju Zhang; Methodology, Peng Ye; Validation, Chunju Zhang; Formal analysis, Chunju Zhang; Investigation, Peng Ye and Shengcai Li; Data curation, Mingzhu Chen; Writing—original draft, Peng Ye; Writing—review & editing, Chunju Zhang and Shengcai Li; Visualization, Mingzhu Chen; Project administration, Peng Ye.

Corresponding author

Correspondence to Shengcai Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Ye, P., Zhang, C., Chen, M. et al. Typhoon disaster state information extraction for Chinese texts. Sci Rep 14, 7925 (2024). https://doi.org/10.1038/s41598-024-58585-8

Download citation

Received: 07 January 2024
Accepted: 01 April 2024
Published: 04 April 2024
DOI: https://doi.org/10.1038/s41598-024-58585-8

Keywords

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Studying the narrative of US policy towards China: introducing China-related political texts in Congress

Natural language processing of multi-hospital electronic health records for public health surveillance of suicidality

A manually annotated corpus in French for the study of urbanization and the natural risk prevention

Introduction

Related works

Basic ideas

Methodology

Part-of-speech tagging considering spatio-temporal information

State element identification based on spatio-temporal semantic unit

Construction of spatio-temporal semantic unit

State element identification

Structured representation of state elements

State information fusion based on spatio-temporal cues

Results and discussion

Experimental environment

Experimental results

Comparison of state identification effect

Comparison of state fusion effect

Experimental result analysis

Case study

Typhoon disaster states at various spatio-temporal scales

Spatial patterns of typhoon disaster states

Evolution characteristics of typhoon disaster states

Application prospects

Conclusion

Data availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher's note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Comments

Search

Quick links