Introduction

The nature of grammatical categories (i.e. parts-of-speech; PoS), especially the two binary features that account for the major PoSs (i.e. [±N] and [±V], or the noun–verb bifurcation), has been a topic of debate at various stages in the history of scholarly pursuits and over a wide range of disciplines. The debate can be traced back to the Greek philosophical discourse on ontology and epistemology (e.g. Plato’s Sophist and Aristotle’s On Interpretation) and remains a foundational issue in different linguistic theories. More recently, it has been one of the central issues in neuro-cognitive studies of human conceptualisation (Gentner and France, 1988), as well as in computational approaches to knowledge representation and processing (Redington et al., 1995; Redington et al., 1998). Given this historical depth and disciplinary breadth, the terms used and their definitions may well be confusing or misleading when framing an interdisciplinary discussion. Hence, we start by establishing common ground for the issues to be addressed. This common ground will be built on the philosophical distinction between ontological and epistemological beings, as well as on the recent theory of embodied cognition, which has emerged as a tenet of the theories of mind and cognition in the new century (cf. Clark, 1997; Wilson, 2002).

To begin, and perhaps to risk oversimplification, we view ontology as the study of “beings,” i.e. the “nature” of existence, and epistemology as the study of how to identify, verify, or represent the nature of these beings. In contrast, for embodied cognition, the “embodied” refers to the existing physical world, and the “cognition” refers to how human beings, based on their interaction with the outside world, construe and represent in their minds a system of knowledge that is sharable with other human beings. In the context of the theoretical premises given above, the study of language provides a unique opportunity to bridge these two foundations of scientific knowledge. Human languages are viewed as knowledge systems shared by their speakers for sharing and integrating knowledge (Huang et al., 2010a). However, given the vast diversity of human languages, especially in terms of the differences in how nouns and verbs are encoded, our study of the categorial noun–verb bifurcation will not focus on the roles and functions of nouns and verbs in the linguistic systems or human cognitive processing. In addition, given the meta-theoretical dilemma that the nature of beings cannot be discussed without invoking the mapping from beings to linguistic representations, it is also not feasible to tackle the nature of beings with language as the primary data. Instead, we will focus on the following less explored questions that are both intriguing and not directly constrained by the data issues discussed above: what is the basic ontological concept, with null or minimal prior knowledge, that allows for the conceptualisation of shared human experiences to form a foundation of human languages? More specifically, in terms of the system of grammatical categories in human languages, does this shared foundation of conceptualisation lead to a noun–verb bifurcation, creating two of the most basic grammatical categories? 
Alternatively, in terms of embodied cognition, what is the fundamental characteristic of the physical world that allows human beings, without a priori concepts, to form a shared principle of conceptualisation on which to build the basic noun vs. verb categorial bifurcation? In general, we are looking for the most basic, and ideally, only one, ontological concept that can be experienced and/or perceived without prior knowledge that would, in turn, facilitate aspects of human cognition, such as those reflected in the shared categorial system of human languages.

The experience and knowledge of the interactions between human bodies and their environments are typically acquired through the five sensory modalities: namely, the visual, auditory, gustatory, olfactory, and tactile senses. These sensory modalities receive the five sensations, i.e. vision, hearing, taste, smell, and touch, respectively. These five senses, also known as the Aristotelian senses, are the conventional categories of fundamental human perception in the literature of various sensorial studies (e.g. Lynott and Connell, 2013; Zhao et al., 2019). To explore how human beings conceptualise sensory experiences of the outside world through language, we can turn to the sensory lexicon, a repository of sensory input that is organised by, and encompasses, the main lexical categories in human languages, including verbs such as look and hear, adjectives like red and sweet, and nouns exemplified by sight and sound. One interesting fact is that lexical categories may have different tendencies toward particular sensory modalities (Strik Lievers and Winter, 2018). For example, in the repertoire of sensory words in English, verbs were found to be over-represented for the auditory and tactile senses but under-represented for vision, whereas adjectives favoured the visual and olfactory senses and disfavoured touch and sound. Such tendencies could further differentiate and reflect the nature of human senses; for instance, the verb-inclination feature carried by the auditory lexicon reveals the dynamic nature of sound. Therefore, the close relationship between lexical categories and sensory modalities is believed to provide empirical evidence for exploring the cognitive basis of grammatical categories, using the sensory lexicon as the dataset.

Apart from tackling the foundational issue of conceptualisation by looking at the cognitive motivation of the primary noun–verb dichotomy in human language, through the body-and-world interaction as mediated by sensory perception, we also need a robust and versatile framework of meaning representation that is felicitous in terms of cognitive studies, linguistics, and ontology. Aristotle’s qualia structure, as the representation of experiential knowledge, presents itself as a natural candidate. In order to be better equipped to deal with modern theories of cognitive sciences and formal ontology, we will adopt the updated version of the qualia structure as proposed and formalised by Pustejovsky (1991, 1995) in the Generative Lexicon Theory (henceforth, the GL theory, or GL). The GL theory, through its inclusion of telicity and agentivity in the qualia structure following Aristotle, provides a rigorous and empirically sound model for the encoding of eventive information by nominals. In addition, the substantial literature on GL-based research on the Chinese language can support the current study (e.g. Song and Huang, 2018).

This paper presents a comprehensive investigation of sensory nouns, given that the sensory lexicon is the repository of integrated human perceptual information. Thus, these nouns are the results of the conceptualisation of relatively well-defined physical contacts of human beings with the physical world. By exploring the correlation between the physical world and sensory concepts, this study aims to explicate the cognitive foundation of grammatical categories. One important precedent is Strik Lievers and Winter’s (2018) study of the English sensory lexicon, which found evidence for the cognitive representation of “verby” and suggested that this was the cognitive foundation of the noun–verb categorisation. We seek to substantiate and explicate this proposed cognitive motivation in order to establish the conceptual and/or ontological motivation for the encoding of grammatical categories in other languages, such as Chinese. We will argue that it is the ontological spatio-temporal continuum (and not the nature of an entity) that provides the most robust accounting of the classic concrete–abstract dichotomy.

Theoretical constructs

Grammatical categories: based on the feature system

In this section, we lay a foundation for our study by reviewing the standard linguistic theory of PoSs. We first review the two binary features, [±N] and [±V], that account for the four major PoSs (Noun, Verb, Adjective, and Adverb) in formal generative theories (e.g. Chomsky, 1970). These two features are intuitively defined as being “nouny” (i.e. noun-like) and being “verby” (i.e. verb-like), respectively, so as to account for the feature value assignments of nouns as [+N, −V] and verbs as [−N, +V]. Given the four major categories, the other two combinations are assigned to adjectives [+N, +V] and adverbs [−N, −V] (e.g. Baker, 2003; Haegeman, 1994). However, following the intuitive “verby” and “nouny” definitions of the two features, the two assignments for adjectives and adverbs are not clearly justified. In addition, by the same rationale, should not both the deverbal nouns (nouns derived from verbs) and the denominal verbs (verbs formed from nouns) be the most natural candidates for the [+N, +V] assignment, as they are attested to have both noun-like and verb-like behaviours? The fact that they are assigned features according to their derived categories, e.g. [+N] for deverbal nouns and [+V] for denominal verbs, confirms the tautological nature of the intuitive definition. This widely accepted and practised feature system in linguistics underlines one of the common dilemmas of previous attempts to define grammatical categories by assuming a priori knowledge of them.
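For illustration only, the feature assignments described above can be written down as a small lookup table. The sketch below is a minimal rendering under our own naming assumptions (FeatureBundle, CATEGORY_BY_FEATURES, and category are hypothetical names, not part of any cited theory or toolkit); it simply encodes the four assignments as stated in the text.

```python
from typing import NamedTuple


class FeatureBundle(NamedTuple):
    """A pair of binary features; True stands for '+', False for '-'."""
    n: bool  # the [±N] ("nouny") feature
    v: bool  # the [±V] ("verby") feature


# The four assignments exactly as given in the text above.
CATEGORY_BY_FEATURES = {
    FeatureBundle(n=True, v=False): "noun",
    FeatureBundle(n=False, v=True): "verb",
    FeatureBundle(n=True, v=True): "adjective",
    FeatureBundle(n=False, v=False): "adverb",
}


def category(n: bool, v: bool) -> str:
    """Return the major PoS assigned to a feature bundle."""
    return CATEGORY_BY_FEATURES[FeatureBundle(n, v)]


# The circularity noted in the text: a deverbal noun shows both noun-like
# and verb-like behaviour, so the intuitive definitions suggest [+N, +V],
# but that slot is already taken by adjectives.
intuitive_label_for_deverbal_noun = category(n=True, v=True)
```

Written this way, the table makes the problem visible: the two features can only be filled in once one already knows which words count as nouns and verbs.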

Given this circularity, the question then becomes: can grammatical categories be learned without the knowledge of any grammatical categories? The answer is a definite yes. Redington et al. (1995, 1998) demonstrated not only that PoSs can be automatically learned from an untagged corpus (Redington et al., 1995) but also that there are psychologically feasible mechanisms to support such learning without prior knowledge (Redington et al., 1998). Most current computational language processing systems with powerful machine-learning algorithms also attest to the plausibility of learning many different grammatical categories. Contrary to the widely held assumption that what these powerful learning mechanisms learn solely from distributional information is not interpretable, Chersoni et al. (2021) showed that what is learned by automatic machine-learning mechanisms can be interpreted in terms of semantic features. These studies showed that linguistic categories can be learned (from distributional information) without a priori concepts of the category and that it is plausible to account for such categorial learning in terms of the learning of more basic features. Taking the lead from these studies, we explore the conceptual a priori’s that could lead to the conceptualisation of grammatical categories, in order to overcome the common tautological flaws in previous attempts and to provide a clear account of the nature of grammatical categories.
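To make the idea of learning from distributional information concrete, the following toy sketch compares words by the neighbours they occur with in an untagged corpus. It is only loosely in the spirit of the distributional studies cited above and is not a reconstruction of their procedure; the corpus, the function names, and the choice of immediate left and right neighbours with cosine similarity are all our own assumptions.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus; no PoS labels are supplied anywhere.
CORPUS = [
    "the cat sleeps", "the dog sleeps",
    "the cat runs", "the dog runs",
    "a cat eats", "a dog eats",
]


def context_vectors(sentences):
    """Count the immediate left and right neighbours of every word form."""
    vectors = {}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            vec = vectors.setdefault(tokens[i], Counter())
            vec[("L", tokens[i - 1])] += 1
            vec[("R", tokens[i + 1])] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


VECTORS = context_vectors(CORPUS)

# Words with the same distribution pattern end up close together,
# without the procedure ever being told what a noun or a verb is.
same_class = cosine(VECTORS["cat"], VECTORS["dog"])
different_class = cosine(VECTORS["cat"], VECTORS["sleeps"])
```

In this toy setting, cat and dog share all of their contexts, whereas cat and sleeps share none, so a clustering step over such similarities would separate the two groups of words.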

The challenge, of course, remains the identification of such conceptual a priori’s without relying on any currently held linguistic knowledge. There are, in fact, several potentially viable and conceptually related proposals to account for the binary noun–verb bifurcation. Givón (2001), for instance, argued that nouns could be identified by their “temporal stability.” Gentner (1982) and Ahrens (1999), among others, showed that verbs are more mutable (i.e. have a higher propensity to change) than nouns. More recently, Strik Lievers and Winter (2018), as mentioned above, suggested that “eventivity” is the cognitive motivation for verbs. Among these proposals, of special interest to the current study is Aristotle’s elegant definition that “…a noun…has no reference to time” and “a verb…carries with it the notion of time.” (Footnote 1) Aristotle’s definition in terms of the presence or absence of the notion of time has more recently been adopted in several formal ontologies as the foundation of knowledge systems, such as DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering; Gangemi et al., 2002) and BFO (The Basic Formal Ontology; Arp et al., 2015). In these ontologies, all entities are construed either as enduring continuants, i.e. entities that “existed in time and have no temporal parts,” or as perduring occurrents, i.e. entities that consist of “temporal parts and [that] have phases and temporal slices corresponding to the intervals and moments through which [the parts] perdure” (Simons and Melia, 2000, pp. 59–60).

Note that a fundamental challenge to theories of conceptualisation or categorisation is the a priori knowledge required to form a concept or to define a category. Thus, the question we pose is: given the ultimate tabula rasa of a single spatio-temporal continuum, what is the minimal prior knowledge needed to support conceptualisation? The endurant–perdurant bifurcation requires the knowledge of “reference to time,” which may or may not be inherent to the nature of the spatio-temporal continuum. In contrast, concrete–abstract or entity–eventivity bifurcations require a priori knowledge of two categories and likely additional knowledge of how they differ. Similarly, and with even higher requirements, the [±N] and [±V] theories of the grammatical system are built on prior knowledge of what nouns and verbs are like. By Occam’s razor, “reference to time” is the minimal premise required given the tabula rasa of the spatio-temporal continuum, and hence our null hypothesis until proven otherwise. Therefore, we seek evidence to support the idea that this foundational concept of reference to time also underlies the formation of a categorial system in human languages.

In what follows, we will delineate the problems of using an entity–eventivity dichotomy to represent the noun–verb bifurcation and further explicate our proposal to adopt time-independent properties, i.e. an endurant–perdurant dichotomy, so as to capture the knowledge representation manifested in one of the major PoSs that is under examination herein, i.e. the noun.

Classification of nouns: based on the entity–eventivity dichotomy

It is commonly believed that PoSs have cognitive origins (e.g. Gentner, 1978; Langacker, 1987). Prototypical verbs mostly represent events and processes, while prototypical nouns are concrete, abstract, or imaginary entities (e.g. Langacker, 1987). However, although entities are typically encoded as nouns while events surface as verbs, not all “verbs” represent events, much like not all “nouns” signify entities. (Footnote 2) One set of linguistic facts that could shed light on the nature of the noun–verb bifurcation is the deverbal nominals. These are nouns derived from verbs and thus retain the notion of eventivity. In particular, we take data from Chinese, a language with minimal morphological marking of inflection and derivation (Hsieh et al., 2022). As there are typically no overt markers to differentiate derived and non-derived forms, Chinese provides a unique data set and an associated opportunity to examine the nature of the ontological changes, independent of the effect of morpho-lexical rules. For example, we show in (1)–(4) that bàogào ‘report; to report,’ which has the original lexical category of a verb (1), can also function as a process nominal (2), a result nominal (3), and an event nominal (4) with the same wordform.

(1) Tā    bàogào  le                sān    ge  xiǎoshí.
    s/he  report  ASP (Footnote 3)  three  CL  hour (Footnote 4)
    ‘S/he reported for three hours.’

(2) Tā    zhèngzài   zuò   bàogào.
    s/he  currently  make  report
    ‘S/he is (making) reporting.’

(3) Tā    jiāo    le   yī   fèn  bàogào.
    s/he  submit  ASP  one  CL   report
    ‘S/he submitted a report.’

(4) Zhè   chǎng  bàogào(huì) (Footnote 5)  chíxù     le   sān    ge  xiǎoshí.
    this  CL     report(meeting)           continue  ASP  three  CL  hour
    ‘This report (meeting) lasted for three hours.’

The differences between “process nominal” and “result nominal” are well established in past literature on nominalisation (cf. Fu, 1994; Grimshaw, 1990). They can often be differentiated by derivational affixes in many languages, such as English (e.g. giving versus gift), but not always (e.g. building a building). In Chinese, process nominals and result nominals select different types of classifiers (Huang and Ahrens, 2003). For instance, fèn ‘a copy (for documents, newspapers, periodicals)’ in sentence (3) is an individual classifier that marks bàogào in (3) as a result nominal. Moreover, bàogào can also follow the event classifier cì ‘occurrence; time,’ which enumerates events, e.g. zuò le yī cì bàogào ‘made one time/instance of reporting.’ Another determining factor in differentiating the two is whether the noun phrase allows durative time expressions and time points; for example, sān ge xiǎoshí ‘three hours’ can only go with a process nominal but not with a result nominal.

As for event nominals (or event nouns), as in sentence (4), they basically name an event as an entity (Chierchia, 1986). Event nouns can be differentiated from nouns referring to physical objects by several empirical criteria, to name a few: event nouns (a) can be selected by event classifiers (e.g. chǎng [for sporting or recreational activities], cì [for enumerated events], dùn [for meals, beatings, scoldings, etc.]), (b) can provide argument structure information when serving as an object of light verbs (e.g. kāishǐ ‘begin,’ jìxù ‘continue,’ tíngzhǐ ‘stop’), (c) can allow temporal noun suffixation to denote temporal duration orientations (e.g. qián ‘before,’ hòu ‘after,’ zhōng ‘in the course of’), and (d) can allow durative temporal expressions (e.g. sān ge xiǎoshí ‘three hours,’ shí tiān ‘ten days’) (Han, 2010; Shao and Liu, 2001; Wang, 2013). It should be noted that previous literature did not converge on a consensus definition or scope of event nouns in Chinese. For example, Deng’s (2021) “event noun” is rather similar to the aforementioned “process nominals.” He suggested that “simple event nominals” (Grimshaw, 1990) or “(pure) event nouns” (Wang, 2013) could be considered a sub-category of the “process nominals.” Nevertheless, in order to differentiate derived from non-derived eventive nouns, in this paper, we follow Wang (2013) and refer to nouns naming events that are not derived from verbs as “event nouns,” while those nouns that are derived from verbs are referred to as “deverbal nominals” or “deverbal nouns.”

In general, nouns that make reference to physical objects, deverbal nominals (process nominals and result nominals), and nouns referring to events, can be differentiated based on their denotations of entity and/or eventivity, as shown in Table 1.

Table 1 Classification of nouns based on entity–eventivity dichotomy.

As seen from the above summary, the entity–eventivity dichotomy does not align with the noun-verb PoS classification. In other words, nouns encompass entities and events, and considerable noun-verb categorical fluidity exists in Chinese (Kwong and Tsou, 2003a, 2003b). Note that although verbal meanings are more mutable than their nominal counterparts (Ahrens, 1999; Gentner, 1982), the degree of ambiguity of noun-verb dual category words is the same for each category unless the direction of change can be identified. Given that the dichotomy of entity–eventivity cannot reliably predict the noun-verb category assignment, an alternative conceptual dichotomy that does not require the intuitive entity–eventivity bifurcation is thus needed.

Re-classification of nouns: based on the endurant–perdurant dichotomy

Recall our earlier discussions on the endurant–perdurant bifurcation in formal ontologies, which will serve as the theoretical foundation of the current account. All concepts can be classified according to whether they exist independent of time (i.e. endurant or continuant) or are dependent on time (i.e. perdurant or occurrent). Ontologically, information is described as attributing properties to some constant objects, but for such encoding to work, the “sameness” of the object must be maintained (i.e. its being endurant or continuant); on the contrary, in order to describe changes, there must be properties that can be identified as a variable of time (i.e. being perdurant or occurrent). When such ontological bifurcation is reflected in the language, a noun is simply the default way to encode an endurant concept, as it is a “rigid designator,” while verbs, requiring interpretation of their values depending on time, are the default way to encode perdurant concepts (Huang, 2015, 2016).

From this ontological point of view, deverbal nouns and denominal verbs involve type-shifting. Deverbal nouns represent eventive information that is disassociated from temporally bound interpretations. An apt example is the way we refer to a scheduled flight (Huang, 2015). A 3:30 flight, with the deverbal noun flight derived from the verb to fly, does not need to reference any specific point in the spatio-temporal continuum. In other words, being a 3:30 flight is not linked to any specific time point at which the event takes place. Instead, a 3:30 flight is defined by belonging to a class of eventive entities that share the scheduled flight time and can happen at any time. That is, eventive information, such as the time of the scheduled flight, is treated as constant and can be assigned a time-sensitive interpretation when co-occurring with a verb, as in The 3:30 flight took off at 4:30 today. It is, therefore, possible to use the fundamental conceptual bifurcation of time dependency to conceptualise linguistic lexical categories and to view entities and their eventive readings from an ontological perspective.

To sum up, ontologically speaking, a noun can be classified as consisting of enduring or perduring features (i.e. maintaining constant referent or not through different times). The current study will focus on sensory nouns in Mandarin Chinese to test the cognitive foundation of grammatical categories differentiated by the endurant–perdurant dichotomy. We will elaborate on the analytical framework (Generative Lexicon Theory) in the next section, followed by data and methodology (corpus-based). Results, as well as the discussion and conclusion, will be provided in the final sections. The main objective of this research is to provide empirical evidence of the linguistic encoding of human sensory experiences as reflected by their heterogeneous nature (endurant or perdurant) within a particular lexical category (the noun) and further shed light on the cognitive foundation of grammatical categories.

Analytical framework

The Generative Lexicon (GL) theory is chosen as the framework for our study based on two important considerations. First, GL has a fully formalised qualia structure that is linked to an ontology (Pustejovsky et al., 2006). As such, we can treat it as a fully implemented version of Aristotle’s qualia and leverage its theoretical constructs to link to the primitive concept of endurant/perdurant, as well as to represent experiential information of the physical world. Second, a methodologically critical motivation is that GL provides a theory of argument selection based on experiential knowledge (i.e. qualia structure) instead of other theories that rely on grammatical categories or related features. We noted earlier that the failure to fully understand the nature of nouns and verbs in previous studies is probably due to the fact that each of them requires a certain degree of knowledge of grammatical categories. Even the well-designed study reported in Strik Lievers and Winter (2018) relies on prior knowledge of grammatical categories, that is, the PoS assigned by the dictionary and by corpus annotation. Conversely, GL predicts argument realisation in terms of semantic typing (based on qualia information) and by the semantic process of selection, exploitation, and coercion without referring to grammatical categories. The basic structure of GL and its relevance for our current study is explicated below.

In the GL theory, the semantic representation of a word can be represented at four levels: argument structure, event structure, qualia structure, and lexical typing structure (Footnote 6) (Pustejovsky, 1995, 2013; Pustejovsky and Jezek, 2008), as shown in Fig. 1. Argument structure (ARGSTR) mainly specifies the number and nature of the arguments a nominal phrase can take, while event structure (EVENTSTR) identifies the event type and any sub-eventual structure a word or a phrase may have. Since this study targets nouns, only qualia structure (see the section “Qualia structure”) and lexical typing structure (see the section “Lexical typing structure”) will be consulted, as these two are more effective in explaining the semantic representations of nouns.

Fig. 1: Lexical representation of a word (Pustejovsky, 2013, p. 26).

Note. α = a word; ARG1 = argument 1; E1 = subevent 1.

Qualia structure

The qualia structure in GL is adapted from the Aristotelian qualia with four causes: material cause, formal cause, efficient or moving cause, and final cause (Pustejovsky and Jezek, 2014). As depicted in Fig. 1, the constitutive role (CONST) corresponds to the material cause, as it describes the relationship between an entity and its constitutive parts, as well as the relation between the parts and the entire entity, by referring to what the entity is made of. The formal role (FORMAL) focuses on how the specific entity is distinguished from other objects within a larger domain; in other words, it encodes taxonomic information and carries information about the basic conceptual category (Pustejovsky and Jezek, 2014). The telic role (TELIC), corresponding to the final cause, deals with the purpose and function of the entity and includes direct telic and indirect telic. Finally, the agentive role (AGENTIVE), corresponding to the efficient or moving cause, involves factors related to the entity’s origin that bring the entity into being. As mentioned earlier, we treat qualia structure as the collection of lexically conventionalised experiential knowledge, following Aristotle’s original design.
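As a minimal sketch, the four qualia roles can be held in a simple record in which a role that is not exploited is left empty. The class name Qualia, the method exploited_roles, and the example values for piano are our own illustrative assumptions; only the four role names and their correspondence to the four causes come from the description above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Qualia:
    """One slot per qualia role; None means the role is not exploited."""
    constitutive: Optional[str] = None  # CONST: material cause
    formal: Optional[str] = None        # FORMAL: formal cause
    telic: Optional[str] = None         # TELIC: final cause
    agentive: Optional[str] = None      # AGENTIVE: efficient or moving cause

    def exploited_roles(self):
        """Return the names of the roles that carry a value."""
        return {f.name for f in fields(self) if getattr(self, f.name) is not None}


# Illustrative values only (assumed, not taken from an annotated lexicon).
PIANO = Qualia(formal="physical object", telic="to be played")
```

A record of this kind is enough to ask which roles a given use of a noun draws on, which is the information the lexical typing structure relies on.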

Lexical typing structure

Pustejovsky (2001, 2013) further divided the domain of individuals into three types based on the four fundamental qualia roles in the qualia structure, as illustrated in Fig. 2.

Fig. 2: Lexical typing structure with reference to qualia roles.

As shown in Fig. 2, natural types differ from artefactual types in that they refer to formal and constitutive roles only. Noise is an example of a natural type, as its lexical meaning is inherited directly from its superordinate, and only the formal and constitutive role of noise (i.e. sound) is exploited in the common use of this word. Conversely, an artefactual type refers to telic and agentive roles, especially emphasising the function or purpose of the object. For example, piano is considered an artefact, given that the purpose of a piano lies in its telic role, which is to be played and to allow people to listen to the melody being played. Last but not least, a complex type (or dot object) makes reference to the relation between the above two types. Song is an instance of a complex type because it is composed of sound and information (i.e. [sound·info]). For instance, in sentence (5), the meaning facet of sound is elicited because melodic describes the melody of the song, while in (6), the information facet is mainly activated, as inspirational indicates that the lyrics of the song, which convey information, are encouraging.

(5) This is a melodic song.

(6) This is an inspirational song.

By referring to the qualia structure and the lexical typing structure, we can assign endurant features to those natural types in which only their formal and/or constitutive roles are exploited in their meaning representations. As for artefactual types, since they focus on the function or purpose of the objects and how these objects come into being, they may be either endurant or perdurant, depending on which meaning facet has been selected via type coercion (Pustejovsky, 2001; Pustejovsky and Jezek, 2008). For instance, in the sentence I saw a piano, the endurant feature of the piano is selected because the meaning of piano in such a visual event refers to a physical object that does not reference time points; however, when an event (mostly due to a different verb) emphasises the function of the object, such as in I heard (someone is playing) the piano, the perdurant, or temporally bounded, properties of the piano take effect in order to participate in the meaning coercion. Complex types, referring to natural and artefactual types, shall be considered perdurant when a time concept is involved. This is because co-predication for these dot objects is always allowed, e.g. This book is long and interesting. In other words, this type of noun does not need to go through the type-shifting process in order to get its meaning across, and its eventive meaning is always present. We summarise the correspondence between the three lexical typing structures and the endurant–perdurant dichotomy in Table 2.

Table 2 Relations between lexical typing structures and endurant–perdurant dichotomy.
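The correspondence between lexical types and the endurant–perdurant dichotomy, as described in the preceding paragraph, can be sketched as a small decision function. This is a simplification under stated assumptions: the enum and function names are hypothetical, the flag function_coerced stands in for whatever the predicate does to select the telic or agentive facet, and complex types are treated as always perdurant, following the statement that their eventive meaning is always present.

```python
from enum import Enum


class LexicalType(Enum):
    NATURAL = "natural"
    ARTEFACTUAL = "artefactual"
    COMPLEX = "complex"


class TimeDependency(Enum):
    ENDURANT = "endurant"
    PERDURANT = "perdurant"


def time_dependency(lexical_type, function_coerced=False):
    """Map a lexical type (plus the predicate's coercion) to time dependency.

    function_coerced is an assumed flag: True when the predicate selects
    the functional (telic/agentive) facet of an artefactual noun.
    """
    if lexical_type is LexicalType.NATURAL:
        # Only formal and/or constitutive roles are exploited.
        return TimeDependency.ENDURANT
    if lexical_type is LexicalType.ARTEFACTUAL:
        # Endurant or perdurant, depending on the facet selected by coercion.
        if function_coerced:
            return TimeDependency.PERDURANT
        return TimeDependency.ENDURANT
    # Complex types (dot objects): the eventive meaning is always present.
    return TimeDependency.PERDURANT


# "I saw a piano" versus "I heard (someone is playing) the piano".
seen_piano = time_dependency(LexicalType.ARTEFACTUAL, function_coerced=False)
heard_piano = time_dependency(LexicalType.ARTEFACTUAL, function_coerced=True)
```

The single boolean flag is a deliberate simplification; in the analysis itself, the decision is made case by case from the interaction between the noun's qualia and the selecting predicate.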

Method

Data collection

The relevant data will be collected following the motto of corpus linguistics as proposed by Firth (1962): “you shall know a word by the company it keeps.” To study sensory nouns, we will start with the words accompanying them, especially the objects selected by the perceptual predicates. Thus, this study partly follows Pustejovsky and Jezek’s (2008) corpus-based investigation, which identified mechanisms of semantic coercion in predicate-argument constructions.

To examine the largest possible number of sensory nouns, we used the basic sensory verbs as the predicates of the five sensory modalities to extract the sensory events in the corpus, including visual events (indicated by the visual verbs kàn ‘to look,’ jiàn ‘to see,’ and kàn/jiàn-dào ‘saw’), auditory events (indicated by the auditory verbs tīng ‘to listen,’ and tīng-dào ‘heard’), gustatory events (indicated by the gustatory verbs cháng ‘to taste,’ and cháng-dào ‘tasted’), olfactory events (indicated by the olfactory verbs wén and xiù ‘to smell; to sniff,’ and wén/xiù-dào ‘smelt’), and tactile events (indicated by the tactile verbs mō and chù ‘to touch,’ gǎnjué ‘to feel,’ mō/chù-dào ‘touch; feel,’ and gǎnjué-dào ‘felt’). Note that although gǎnjué ‘to feel’ is usually not considered a typical tactile verb, it is defined as “to perceive and distinguish external stimuli via bodily sensations” in the Chinese WordNet 2.0 (Huang et al., 2010b); (Footnote 7) therefore, this word is highly tactile-related and is believed to trigger bodily feelings.

The data extraction and analysis procedures are the following:

  a. To extract the set of nouns from the corpus that typically co-occur with the verb in a specified grammatical relation. For our current purposes, we restrict our investigation to the relation of object-of the sensory verbs.

  b. To annotate the selected nouns with the foregoing qualia structures.

  c. To classify those nouns into three types and analyse their respective characteristics with reference to their qualia values.
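Step (a) of the procedure above can be sketched as a simple filter over dependency triples. The sketch assumes that the corpus has already been exported as (verb, relation, noun) triples; the triples below are invented toy data, the relation label "object_of" and the function name are hypothetical, and no corpus-tool API is used. Steps (b) and (c) involve manual annotation and are not modelled here.

```python
from collections import Counter

# Sensory verbs per modality, following the inventory given in the text
# (mō is read off the compound form mō/chù-dào).
SENSORY_VERBS = {
    "visual": {"kàn", "jiàn", "kàn-dào", "jiàn-dào"},
    "auditory": {"tīng", "tīng-dào"},
    "gustatory": {"cháng", "cháng-dào"},
    "olfactory": {"wén", "xiù", "wén-dào", "xiù-dào"},
    "tactile": {"mō", "chù", "gǎnjué", "mō-dào", "chù-dào", "gǎnjué-dào"},
}


def objects_by_modality(triples):
    """Count nouns occurring as objects of sensory verbs, per modality."""
    counts = {modality: Counter() for modality in SENSORY_VERBS}
    for verb, relation, noun in triples:
        if relation != "object_of":
            continue  # keep only the object-of relation
        for modality, verbs in SENSORY_VERBS.items():
            if verb in verbs:
                counts[modality][noun] += 1
    return counts


# Invented toy triples, for illustration only.
TOY_TRIPLES = [
    ("kàn", "object_of", "fēngjǐng"),
    ("kàn", "object_of", "fēngjǐng"),
    ("kàn", "object_of", "diànyǐng"),
    ("kàn", "subject_of", "rén"),
    ("tīng", "object_of", "bàogào"),
    ("jiāo", "object_of", "bàogào"),
]

COUNTS = objects_by_modality(TOY_TRIPLES)
```

The resulting per-modality noun lists would then be the input to the qualia annotation in step (b).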

All of the data and sentence examples presented in what follows, unless otherwise specified, were extracted from a Chinese online corpus, Chinese Web 2011 (zhTenTen11) in the Sketch Engine (Kilgarriff et al., 2014). (Footnote 8)

Data cleaning

Since objects that the sensory modalities can perceive are the primary concerns in this study, we will only look at those nouns that elicit perceptual information. In what follows, we will use visual perception (through the visual verb kàn ‘to look; see’) as an example to demonstrate how we identify and select sensory nouns from the corpus.

First, we use the Word Sketch function in the Sketch Engine to generate an exhaustive list of objects collocated with the keyword kàn ‘to look; see.’ (Footnote 9) Next, as described in the Chinese WordNet 2.0, among the various meanings of kàn ‘to look; see,’ two are associated with visual perception, namely ‘to perceive through sight’ and ‘to understand and appreciate through sight,’ as shown in sentences (7) and (8), respectively:

(7)

Nǐ   zài  qiáo-shàng  kàn fēngjǐng,  kàn fēngjǐng  de  rén     zài  lóu-shàng    kàn nǐ
you  on   bridge_on   look scenery   look scenery  DE  person  on   building_on  look you

‘You are enjoying the scenery on the bridge while the people enjoying the scenery are looking at you from upstairs.’

(8)

Yuánběn  shǔyú   dàzhòng  yúlè           de  kàn diànyǐng  biànchéng  le   gāo   xiāofèi.
before   belong  public   entertainment  DE  watch movie   become     ASP  high  consumption

‘Watching movies was previously deemed an entertainment for the general public, but it is an expensive consumption nowadays.’

The above two examples show that kàn ‘to look; see’ can take objects such as fēngjǐng ‘scenery’ and a person (e.g. nǐ ‘you’) in (7), which evoke the meaning of ‘to perceive through sight’; kàn ‘to look; see’ also co-occurs with objects like diànyǐng ‘movie’ in (8), with the meaning ‘to understand and appreciate through sight.’ Therefore, we only consider the objects selected by these two meanings in the visual events. Extended and/or metaphorical meanings of the sensory verbs are not considered.

Results

Visual nouns

A total of 310 nouns selected by kàn ‘to look; see’ were identified from the corpus after data cleaning. The overall distribution of the three lexical typing structures in the two meanings of kàn ‘to look; see’ is presented in Table 3. Note that categorising lexical typing structures is strictly pertinent to the specific sensory domain being examined. For example, fēngjǐng ‘scenery’ is considered a natural type in visual events rather than other possible types in other perceptual events. The time-dependency conceptualisation of each cell is determined by the interaction of the lexical types and the predicate. For easy reference, perdurant terms are shown in bold in Table 3.

Table 3 Distribution of lexical typing structures in meanings of kàn ‘to look; see’.

To perceive through sight

Among the 160 nouns that elicit the meaning of “to perceive through sight,” the natural type is the majority, constituting 77.5% of the three structures. The natural type exploits the formal and constitutive roles of a noun; in most circumstances, the formal role plays a key role in the collocative meaning between kàn ‘to look; see’ and its objects. After attributing ontological categories to the visual nouns related to the meaning of “to perceive through sight,” the formal roles of the natural types mainly fall into appearance (e.g. yàngzi ‘appearance; look’), colour value (e.g. báisè ‘white’), lights (e.g. guāngxiàn ‘light’), location (e.g. zhōuwéi ‘surrounding’), natural things (e.g. tiānkōng ‘sky’) and scene (e.g. fēngjǐng ‘scenery’) based on their ontological taxonomies. Apart from the large number of natural types, 16.9% of the nouns are labelled as artefactual types. This group primarily contains images (e.g. túpiàn ‘picture’) and artefacts (e.g. yānhuā ‘fireworks’). Note that the complex type (5.6%) is not salient in the visual events related to “to perceive through sight.” Since these nouns are dot objects, they refer to both the physical objects and the eventive information those objects carry. For example, a shǒubiǎo ‘watch’ is made for a particular purpose, that is, for people to check the time. Hence, “taking a look at a watch” necessarily involves perceiving the physical watch as well as telling the time, which refers to the telic role of a watch.

To understand or to appreciate through sight

The second meaning of kàn ‘to look; see’ is “to understand or to appreciate the content of or information about the object being looked at.” Since this type of object, in most cases, contains a specific function and is artificially created rather than naturally existing, the artefactual type (57.3%) and, especially, complex type (42.7%) far outweigh the natural type (0%). The artefactual type in the meaning of “to understand through sight” mainly consists of texts and writings, such as wénzhāng ‘article’ and xīnwén ‘news,’ while entertainment-related items such as diànyǐng ‘movie’ and jiémù ‘programme’ give rise to the meaning of “to appreciate through sight.” The main difference here lies in the purpose of the entity—the former is created to meet information needs, whereas the latter is mostly used to satisfy entertainment or leisure demands.

Complex type comprises words that have physical references but also maintain information or entertainment functions. Some examples include shū ‘book’ and bàozhǐ ‘newspaper’ (to understand), as well as diànshì ‘television’ and zhǎnlǎn ‘exhibition’ (to appreciate). Also, note that both the physical entity (9) and the information carried in the entity (10) can be exploited in the visual events, as exemplified in the following two examples:

(9)

Wǒ  kàn   zhe  shūguì     shàng  lín-láng-mǎn-mù                       de  shū
I   look  ASP  bookshelf  on     a_dazzling_array_of_beautiful_things  DE  book…

‘When I was looking at the dazzling array of books on the bookshelf….’

(10)

Nǐ   zhēn    shi  gè  ài    kàn-shū  de  háizi!
you  really  be   CL  love  read     DE  child

‘You are a child who likes reading!’

In the meanings related to “to appreciate through sight,” there exist a few words that not only contain information but also involve events and affairs ([event·info]), including yǎnchànghuì ‘concert,’ zúqiúsài ‘football game,’ chēzhǎn ‘motor show,’ etc. Nouns of this type are categorised as event nouns. Although event nouns are not numerous among visual nouns, their existence hints that sensory nouns may also denote eventive information to a certain extent.

Summary

In sum, we have shown above that the visual verb kàn ‘to see’ has two senses: one for perception and one for integration of visual information (understanding or appreciating using cognitive skills). The most frequently attested instances are the perception of natural types, which can be considered prototypical visual cognition. Interestingly, the ratio of natural type versus artefactual type is only roughly 11/10 (124/113). Overall, the distribution suggests that vision is a dominant and versatile sensory domain.
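The 124/113 ratio can be checked by converting the reported percentages back into noun counts. The sketch below is a consistency check only; it assumes that the 310 visual nouns divide exactly into the 160 nouns of the first sense and the remaining 150 of the second sense, a figure that is derived here rather than stated in the text. The helper name is hypothetical.

```python
# Figures reported in the text for kàn 'to look; see'.
TOTAL_VISUAL_NOUNS = 310
PERCEIVE_NOUNS = 160
# Assumption: every remaining noun belongs to the second sense.
UNDERSTAND_NOUNS = TOTAL_VISUAL_NOUNS - PERCEIVE_NOUNS

perceive_pct = {"natural": 77.5, "artefactual": 16.9, "complex": 5.6}
understand_pct = {"natural": 0.0, "artefactual": 57.3, "complex": 42.7}


def counts_from_percentages(total, percentages):
    """Convert reported percentages back into whole-noun counts."""
    return {t: round(total * p / 100) for t, p in percentages.items()}


perceive = counts_from_percentages(PERCEIVE_NOUNS, perceive_pct)
understand = counts_from_percentages(UNDERSTAND_NOUNS, understand_pct)

# Totals per lexical type across both senses.
natural_total = perceive["natural"] + understand["natural"]
artefactual_total = perceive["artefactual"] + understand["artefactual"]

print(natural_total, artefactual_total, round(natural_total / artefactual_total, 2))
```

Under these assumptions, the reconstructed totals are 124 natural and 113 artefactual nouns, which agrees with the ratio quoted above.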

Auditory nouns

As sketched in the Chinese WordNet 2.0, apart from its original meaning, “to perceive sound through hearing” (e.g. tīng shēngyīn ‘listen to the sound’), tīng ‘to listen; hear’ also has the semantic facets “to appreciate (sound) through hearing” (e.g. tīng yīnyuè ‘listen to music’) and “to receive information through hearing” (e.g. tīng yǎnjiǎng ‘listen to the speech’). After data cleaning, a total of 385 nouns related to the above three meanings of tīng ‘to listen; hear’ were collected. Table 4 presents an overview of the distributions of the three lexical types in the three meanings of tīng ‘to listen; hear.’

Table 4 Distribution of lexical typing structures in meanings of tīng ‘to listen; hear’.

To perceive sound through hearing

In the natural type that evokes the meaning “to perceive sound through hearing,” the words mainly relate to different aspects and physical qualities of sound (e.g. shēngyīn ‘sound,’ yīnliàng ‘volume,’ yīnzhì ‘sound quality,’ jiézòu ‘rhythm’). However, the majority of the perceived sound objects fall into the complex type (79%), given that all the words in this category concern both the sound facet and the event facet ([sound·event]). We further identified three primary categories in this group. The first consists of nouns for sounds induced by events, which are primarily compound nouns, including qínshēng ‘the sound of playing instruments,’ gēshēng ‘the sound of singing,’ and jiǎobù-shēng ‘the sound of footsteps.’ The second type is (simple) event nouns, e.g. fēng ‘wind,’ yǔ ‘rain,’ hǎilàng ‘waves,’ and liúshuǐ ‘running water.’ The third type is deverbal nouns, including xuānxiāo ‘shouting,’ hūxī ‘breathing,’ and xīntiào ‘heartbeat.’

To appreciate (sound) through hearing

The second meaning, “to appreciate (sound) through hearing,” takes an artefactual type that is typically part of music, such as yīnyuè ‘music’ and xuánlǜ ‘melody,’ in which the telic role of being listened to is exploited. Complex types far outnumber the other types (83%) and can be further divided into facets denoting sound and information ([sound·info], e.g. gēqǔ ‘song’), physical object and sound ([object·sound], e.g. yuèqì ‘musical instrument’), human being and sound ([human·sound], e.g. Bèiduōfēn ‘Beethoven’), and event and sound ([event·sound], e.g. yǎnchàng ‘sing’). The [sound·info] type not only comprises sounds but also incorporates content and information, allowing listeners to appreciate the melody as well as the content that the melody holds. In the second type, [object·sound], hearing events select the sound facet and mainly use its telic role to generate auditory-related meaning. In the third type, [human·sound], although the referents are human beings, tīng ‘to listen; hear’ can resort to their telic roles of singing and performing arts or their agentive roles of writing and composing music.Footnote 10 The last type, [event·sound], by contrast, contains event nouns such as yǎnchànghuì ‘vocal concert’ and yīnyuèhuì ‘musical concert.’

To receive information through hearing

The last meaning examined is “to receive information through hearing” by virtue of the activity tīng ‘to listen; hear.’ In a similar vein, the nouns are mainly categorised into two types, i.e. artefactual (47.6%) and complex (52.4%). The telic role of the artefactual type, including to speak, to listen, and to communicate, is mainly exploited. As for the complex type, the major category involves information and events ([event·info]) since the facet of sound is less prevalent here. Two types of nouns are also found in this category, namely, event nouns (e.g. kè ‘class,’ jiǎngzuò ‘seminar’) and deverbal nouns (e.g. liáotiān ‘chatting,’ tánhuà ‘talking,’ and huìbào ‘reporting’).

Summary

As shown above, the hearing verb tīng ‘to listen/hear’ has three senses: one for perception and two for integration of auditory information. The most frequently attested instances are the perception and integration of complex types. This suggests that hearing involves strong integration of physical and abstract information. This is expected as the perception of music and speech as described requires either explicit or implicit knowledge of systems of abstract concepts such as loudness, melody, pitch, prosody as well as phoneme and tone, based on the physical properties of amplitude, articulation, duration, frequency, etc. Another significant feature is the very low percentage of natural types as the target of perception (5.5%). Overall, the distribution suggests that hearing is a sensory domain that is crucial to the integration of information, especially eventive information.

Gustatory nouns

When the same method is applied, the objects selected by the gustatory verb cháng ‘to taste’ are scarce compared to the nouns of the above two sensory modalities. The reason may lie in the single meaning of cháng ‘to taste,’ which is only “to distinguish or taste the flavour of food.” Of the 42 gustatory nouns, none was of the complex type, and the artefactual type (81%) was more prevalent than the natural type (19%), as shown in Table 5. The natural type mainly consists of attributes or attribute values of the flavour of food, for example, zīwèi ‘flavour,’ fēngwèi ‘flavour,’ and xiāngwèi ‘fragrance,’ whereas the artefactual type consists of food that is made to nourish the body or satisfy the appetite, e.g. càiyáo ‘dishes,’ měijiǔ ‘fine wine,’ and xiǎochī ‘snacks.’

Table 5 Distribution of lexical typing structures in the meaning of cháng ‘to taste’.

In sum, the gustatory verb cháng ‘to taste’ has one single sense. There are no attested instances of the perception of complex types. Moreover, the ratio of artefactual type to natural type is roughly 4 to 1 (34/8). On the one hand, this seems unusual given that taste typically involves embodied objects. On the other hand, this should be expected as most of the food we ingest is artefactual in the sense of being processed. Overall, the distribution suggests that taste is a sensory domain that has evolved to interact primarily with man-made/packaged ingestible objects and only rarely with the natural environment (e.g. via personal farming) (Table 6).

Table 6 Distribution of lexical typing structures in the meaning of wén/xiù ‘to smell; to sniff’.

Olfactory nouns

Since two olfactory verbs, i.e. wén and xiù ‘to smell; to sniff,’ are commonly used to depict olfactory experiences, nouns collocated with both verbs were examined. A total of 52 olfactory objects were generated, all of which were categorised as natural types. Odour and odour values are the most common components of olfactory nouns. The most distinctive feature of these nouns is that they contain either the morpheme wèi ‘taste; smell’ or the morpheme xiāng ‘fragrance’ (e.g. qìwèi ‘odour,’ chòuwèi ‘bad smell,’ fāngxiāng ‘fragrance,’ and qīngxiāng ‘faint scent’). The pattern and structure of the artefactual type are also fairly consistent. They mainly consist of compounds with wèi ‘taste; smell’ and xiāng ‘fragrance’ as the stems and are used to describe the smell or fragrance of the artefacts. Examples include yóuyānwèi ‘the smell of fuel fume,’ fǔchòuwèi ‘a rancid smell,’ jiǔxiāng ‘aroma of wine,’ and fànxiāng ‘rice fragrance,’ to name a few.

Similar to the gustatory verb, the olfactory verbs wén and xiù ‘to smell; to sniff’ have one single sense. The only attested instances involve the perception of natural types; given the low frequency and the strong connection between the gustatory and olfactory senses, the lack of artefactual type is of interest. This may be the result of insufficient data or might be due to the fact that smell, unlike taste, is often not volitional.

Tactile nouns

Finally, tactile nouns were collected by examining the collocations with two tactile-related verbs, i.e. mō ‘to touch’ and gǎnjué ‘to feel.’ In the Chinese WordNet 2.0, mō ‘to touch’ is defined as “to use hands to touch the object” while gǎnjué ‘to feel’ is “to perceive and distinguish external stimuli via bodily sensations;” hence, two distinct categories of tactile nouns are expected because of the distinct meanings of the tactile predicates.

As presented in Table 7, a total of 58 tactile nouns were generated. The number of nouns related to mō ‘to touch’ (79.3%) far outweighed the number collocated with gǎnjué ‘to feel’ (20.7%). The natural type related to the meaning of mō ‘to touch’ embraced body parts (e.g. nǎodai ‘head,’ dùzi ‘belly’), body substance (e.g. jīfū ‘skin,’ pífū ‘skin’), and salient substance on the body (e.g. yìngwù ‘hard substance,’ zhǒngkuài ‘lump’). The natural type for the meaning of gǎnjué ‘to feel’ mainly consists of temperature-related items, especially more abstract experiences such as nuǎnyì ‘warmth’ and liángyì ‘coolness.’ The artefactual type is mostly associated with mō ‘to touch.’ It primarily comprises technology products, such as píngmù ‘screen’ and jiànpán ‘keyboard.’ In summary, the results showed that nouns selected by mō ‘to touch’ are more related to the tactile perception of physical objects, whereas nouns collocated with gǎnjué ‘to feel’ tend to indicate bodily feelings.

Table 7 Distribution of lexical typing structures in the meanings of mō ‘to touch’ and gǎnjué ‘to feel’.

In sum, Table 7 shows two tactile verbs: one, mō, for perception and one, gǎnjué, for integration of tactile information. The touch nouns are dominated by natural types. This is expected as the tactile sense is commonly considered the most embodied sense modality. Compared with vision, another sensory domain with a high frequency of natural-type objects, touch has a significant portion of artefactual-type objects and is also much lower in total numbers. Overall, this suggests that the tactile sense is well-grounded but less versatile.

Discussion

This section synthesises the above results for sensory nouns according to their involvement of endurant and/or perdurant properties. Note that in the GL theory, natural type objects are concrete entities (i.e. formal and constitutive); thus, they are endurant by nature. Artefactual type objects (i.e. agentive and telic) can be either endurant or perdurant, depending on which meaning facet is coerced by the predicates. In perceptual events, we suggest that endurant properties will mostly be elicited when these artefacts are selected by pure perceptual events (e.g. the objects that are perceived through sight, hearing, etc.), whereas perdurant features will be resorted to when integration of perceptual information is needed (e.g. objects that are appreciated or understood through sight, hearing, etc.). Taking piano as an example again, as demonstrated above, in a visual event, a piano is considered an artefact that is perceived via pure perception (i.e. an object perceived through sight only); it is a concrete object that does not involve any time points. In contrast, if the sound of (someone playing) the piano is appreciated in an auditory event, then piano is bounded by temporal features because the sound of playing contains temporal intervals, even though a piano per se is an object. As for the complex types, they are a combination of entities and events. Thus, we can assume that the complex type following perceptual verbs would be dominated by their natural type meaning, while the same dot objects following information integration type verbs would be dominated by their artefactual type meaning. Obviously, the nature of complex type objects means that the two aspects of their meaning are always accessible regardless of the context. However, it is also reasonable to assume that the selection of the verb will make one aspect comparatively more accessible.

Generally, we mark all the natural types, and the artefactual types perceived via pure perception, as containing endurant features, while artefactual types perceived via integration of perceptual information and all the complex types are marked as containing perdurant features. Based on the results in the above section (Tables 3–7), we further summarise the sensory nouns’ involvement of endurant or perdurant properties in Table 8.
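The marking rule just stated can be written as a small decision function. This is a sketch for illustration only: the enum and function names are hypothetical, and the rule encodes nothing beyond the sentence above (natural types are always endurant, complex types are always perdurant, and artefactual types depend on whether the verb sense is pure perception or integration).

```python
from enum import Enum


class LexicalType(Enum):
    NATURAL = "natural"
    ARTEFACTUAL = "artefactual"
    COMPLEX = "complex"


class SenseKind(Enum):
    PERCEPTION = "pure perception"
    INTEGRATION = "integration of perceptual information"


def temporal_profile(lexical_type, sense_kind):
    """Return 'endurant' or 'perdurant' for a noun type under a verb sense."""
    # All complex types are marked as perdurant, whatever the verb sense.
    if lexical_type is LexicalType.COMPLEX:
        return "perdurant"
    # Artefactual types are perdurant only under an integration sense.
    if (lexical_type is LexicalType.ARTEFACTUAL
            and sense_kind is SenseKind.INTEGRATION):
        return "perdurant"
    # Natural types, and artefactual types under pure perception.
    return "endurant"


# The piano example: seen as an object versus appreciated as sound.
print(temporal_profile(LexicalType.ARTEFACTUAL, SenseKind.PERCEPTION))
print(temporal_profile(LexicalType.ARTEFACTUAL, SenseKind.INTEGRATION))
```

Applying such a function to each cell of the per-modality tables would give the endurant and perdurant totals that the summary table reports.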

From Table 8, we can see that tactile perception strongly prefers endurant/time-independent objects, while auditory perception strongly prefers perdurant/time-dependent objects. Lastly, visual perception is the most versatile in usage, given that the sense has no strong preference or inclination for either endurant or perdurant properties. Note that we only take sensory nouns as targets of perception and use the lexical typing theory of GL to examine the distribution of sensory objects. Both steps presuppose no knowledge of grammatical categories. Interestingly, when the above results are combined with our proposal that endurant entities are linguistically represented as nominal units, and perdurant entities as verbs, the current results imply that tactile sensory properties are more likely to be instantiated as nominal elements. The strong perdurant dominance means that the auditory sensory properties are more likely to be expressed as verbal elements. The above results corroborate the results of the category counting study of English by Strik Lievers and Winter (2018). Recall that they showed that touch is over-represented by nouns, and hearing is over-represented by verbs. They also report an over-representation of vision by adjectives, which was not accounted for. Our results, however, can be interpreted as showing that the versatility of vision allows it both to modify nominal categories as adjectives and to serve as a source domain in linguistic synaesthesia (cf. Zhao et al., 2019), which also favours vision occurring in pre-nominal modifying positions.

Table 8 Distribution of endurant–perdurant types for sensory nouns in three sensory modalities.

Note that we only list three sensory domains in the above table. There are three reasons to omit the olfactory and gustatory senses from this comparison. The first is the relative sparseness of data, as observed above. The second is that the lack of an information integration verb in the data for these two sensory domains makes it impossible to perform a reliable direct comparison. Third, we observe that neuro-cognitive studies more typically involve the three senses of vision, hearing, and touch (using the more updated term of somatosensory). For instance, Sanchez et al. (2020) compared these three senses in terms of late latency and interpreted the results in terms of conscious perception. We compare our findings below with theirs, which were obtained using a different methodology.

The comparative study of three sensory modalities based on MEG measurement of brain activities reported by Sanchez et al. (2020) shows some interesting parallels with our results. Their study aimed to establish a supramodal brain network for processing all senses, as well as to differentiate these three senses, given that they share the same supramodal network for processing. After establishing the shared use of a supramodal network, Sanchez et al. (2020) showed significant differences in late latency among the three senses. In particular, they showed that hearing and vision have later latency overlapping with the P300 area, and the somatosensory system has relatively earlier latency than the other two. The proposed account of this difference is based on the assumption that the somatosensory system does not involve conscious perception, while vision and hearing do. Conscious perception means that more effort is needed to integrate and represent sensory information, as opposed to the simple recording of sensory data in non-conscious perception. Sanchez et al.’s (2020) results are compatible with the results of our study of sensory nouns, and we speculate that there are several possible interpretations. One of the most straightforward hypotheses is that the perception of endurant targets is instantaneous (as it does not have to involve time), while the perception of perdurant targets requires experiencing time, which naturally adds to processing latency. The difference may also be attributed to taking the SNAP perspective (which is instantaneous) or the SPAN perspective (which requires a time course) (Grenon and Smith, 2004). Lastly, it may also be accounted for in terms of the qualia-based lexical types. That is, perception of natural type properties (i.e. formal and constitutive qualia roles in GL) involves simple (i.e. non-conscious) selection of classes. In contrast, perception of artefactual and complex types (i.e. telic and agentive qualia roles in GL) involves recalling and integrating experiential eventive information (i.e. conscious integration).

In terms of the foundations of language, our theory predicts that all languages should have the noun–verb bifurcation at the foundation of their system of grammatical categories or that there must be two basic categories to instantiate the binary contrast motivated by conceptual (in)dependency of time. A well-known linguistic fact that could pose a serious challenge to this claim involves tense-marked nouns.Footnote 11 The challenge is that if nouns are indeed endurant and defined without reference to time, how can they be marked for time through tense and aspect?

Nordlinger and Sadler (2004) provided a comprehensive set of data from various languages showing that nouns can be marked by tense/aspect and introduced several important theoretical issues. A debate ensued on how to account for this phenomenon and its associated theoretical implications (e.g. Nordlinger and Sadler, 2004, 2008; Tonhauser, 2007, 2008). The debate focused on issues such as whether nominal tense can be applied to the whole sentence or if it is limited to the local context of the tense-marked noun (e.g. clause/phrase). Bertinetto (2020) proposed a comprehensive account that treats such tense markers as part of the more comprehensive set of nominal semantic features that were originally thought to be verbal features and specifically brings out the temporality of nouns, such as being engaged in a specific event or possessing a specific ability.

Recall that our basic claim of nouns being endurant is based on the fact that each occurrence of the same noun is considered an instantiation of the same entity despite obvious changes in the significant features and/or environment of that entity. In contrast, verbs being perdurant and being occurrents has to do with the fact that each occurrence of the same verb (at different spatio-temporal locations) is considered a separate event. Tense-marked nouns do not affect this fundamental dichotomy at all. In fact, tense markers only underline the endurant features of nouns by showing that instances of the same noun retain the same reference regardless of the differences in the explicit information of temporality that they may be carrying. Consider the following example sentence from Huang (2015):

(11)

Yī   gōngjīn   ròu   zhǔ shú  hòu    zhǐ   shèng  bù   dào     600  gōngkè.
one  kilogram  meat  cook-ed  after  only  left   not  arrive  600  gram

‘One kilogram of meat only weighs less than 600 grams after being cooked.’

The above example can be viewed in the same spirit as Bertinetto’s (2020) account and demonstrates that the nominal classifier system (i.e. gōngjīn ‘kilogram’ and gōngkè ‘gram’ used to modify meat) can also be used to mark temporality. The classifier system in Chinese is generally considered to consist of two main subcategories, individual classifiers and measure words (e.g. Ahrens and Huang, 2016). Huang (2015) showed that the individual classifiers mark the endurant feature of nouns, being required to remain identical for the same entity. Measures (hence measure words) are, however, sensitive to spatio-temporal context. Thus, the same noun/entity can be modified by different measurements even though the modified nouns remain “the same,” as in retaining the same reference, as shown in the example above. In sum, tense markers of nouns, just like measure words and DE-insertion discussed in Huang (2015, 2016), are simply linguistic devices that allow a language to highlight the variations of the same endurant entity and which would not be interpretable if the modified noun is not endurant.

Conclusion

This paper is the first to establish the cognitive motivation of noun vs. verb bifurcation without presupposing any prior knowledge of grammatical categories. In particular, inspired both by Aristotle’s definition that nouns make no reference to time and the more recent Aristotelian primary ontological bifurcation of endurant vs. perdurant, we propose that the manipulation of the ontological perspectives to obtain time (in)dependent conceptualisation is the foundation of human cognition and of grammatical categories.

To verify this hypothesis, this study focuses on the sensory nouns of the five sensory modalities, and the analysis was carried out according to the three lexical types and their associated qualia structures as elaborated in the GL theory. Our findings demonstrate that the time-independent/endurant or time-dependent/perdurant concepts are encoded differently in sensory nouns. Such disparity further differentiates cognitive properties of sensory modalities in the light of embodied cognition. Tactile entities, nearly always endurant, support the intuition that touch is the most embodied (i.e. closely related to bodily contact and involvement), or the most concrete, sense among the sensory modalities (e.g. Zhao et al., 2019). Hearing is the least embodied or the most abstract sense, and its sensory properties are dominated by perdurant objects. Although vision is also less embodied, the versatility and dominance of the visual sense mean that visual objects evenly encompass endurant and perdurant properties. Because of the sparseness of the data found for the gustatory and olfactory senses, we are not able to propose a more explicit account of the two senses. This, in fact, echoes the lower accessibility and less frequent embodied encoding of olfactory experiences to some extent (Shen, 1997; Shen and Aisenman, 2008). However, the language-specific situation is worth noting because sensory modalities may exhibit different codability patterns in different languages. For example, some languages encode olfactory experiences much more frequently than other senses (Majid et al., 2018), and olfaction plays a critical role in everyday communication in these communities and languages (e.g. Levinson and Majid, 2014; Majid and Burenhult, 2014).

The findings in this study also have implications for other related sensory language studies, e.g. linguistic synaesthesia and modality exclusivity norms. For example, words for auditory concepts appear to be the most “exclusive” as found in previous modality exclusivity norms studies (e.g. Chen et al., 2019; Lynott et al., 2020; Zhong et al., 2022), meaning auditory experiences may have little in common with other perceptual experiences; moreover, the auditory sense is considered the most frequent target domain on the scale of the mapping tendency in linguistic synaesthesia (Zhao et al., 2019). We hypothesise that the time-dependent nature of the auditory sense can account for these results to some extent because the auditory sense is the most fluid in terms of its categorical ambiguity among all the sensory modalities (from verbs to nouns). In sum, we propose that the concept of time dependency may drive a possible synergetic account incorporating diverse approaches such as the categorical dependency of meaning mutability, the cognitive basis of parts-of-speech, and the ontological motivation for differences in the linguistic representation of sensory meanings.