A large dataset of semantic ratings and its computational extension

Evidence from psychology and cognitive neuroscience indicates that the human brain’s semantic system contains several specific subsystems, each representing a particular dimension of semantic information. Word ratings on these different semantic dimensions can help investigate the behavioral and neural impacts of semantic dimensions on language processes and build computational representations of language meaning according to the semantic space of the human cognitive system. Existing semantic rating databases provide ratings for hundreds to thousands of words, which can hardly support a comprehensive semantic analysis of natural texts or speech. This article reports a large database, the Six Semantic Dimension Database (SSDD), which contains subjective ratings for 17,940 commonly used Chinese words on six major semantic dimensions: vision, motor, socialness, emotion, time, and space. Furthermore, using computational models to learn the mapping relations between subjective ratings and word embeddings, we include the estimated semantic ratings for 1,427,992 Chinese and 1,515,633 English words in the SSDD. The SSDD will aid studies on natural language processing, text analysis, and semantic representation in the brain.

www.nature.com/scientificdata
using a data-driven approach and found that social-emotional and sensory-motor semantics are associated with the opposite ends of the most important data-driven semantic dimension. Therefore, the social and emotional dimensions can serve as important supplements to the visual and motor dimensions to reflect semantic representation. The time and space dimensions are especially important for the representation of events and situations 36-38. Dissociable neural correlates of these dimensions have also been indicated by neuropsychological and neuroimaging research 37,39. The representativeness of the six dimensions has been reflected by a comprehensive review of experiential semantic attributes by Binder et al. 1. Binder et al. 1 summarized 65 semantic dimensions belonging to 14 domains, among which more than 2/3 of the dimensions belong to the domains of vision, motor, socialness, emotion, time, and space. The SSDD treats these six domains as coarse-grained semantic dimensions and provides general ratings for each of them.
The SSDD contains two datasets. The first comprises the subjective ratings for 17,940 commonly used Chinese words on the six semantic dimensions. The second is a computational extension of the subjective rating data: we combined the subjective ratings with computational models and then estimated the semantic ratings of 1,427,992 Chinese and 1,515,633 English words. The SSDD makes it possible to analyze the semantic components of various natural language materials, such as natural texts, speeches, and the language produced by neurological and psychiatric patients.

Methods
Subjective rating dataset. Participants. A total of 85 healthy undergraduate and graduate students (52 women, mean age = 22.73 years, SD = 2.24) participated in the rating experiment. All participants were native Chinese speakers. No participant had suffered from psychiatric or neurological disorders or sustained a head injury. Each participant read and signed the informed consent form before the experiment. All experiments were approved by and performed in accordance with the guidelines and regulations of the Institutional Ethics Committee at the Institute of Psychology of the Chinese Academy of Sciences. Participants were asked to complete at least one rating experiment session (see Procedure of the rating experiments) and were compensated with 30 RMB per session. Each participant could complete as many sessions as they wanted, as long as they passed the quality evaluation every time. Those who failed the quality evaluation once were not allowed to complete more sessions. Following exclusions (see Procedure of the rating experiments), the final sample comprised 80 participants (49 women, mean age = 22.88 years, SD = 2.21) who provided at least one session of valid data.
Stimuli. The stimuli were 17,940 items that could be separated into three sets based on their sources. The first set of items was 12,814 high-frequency Chinese words selected from the Wikipedia Chinese corpus (https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2). These items were selected based on four inclusion criteria: (1) they are among the 20,000 most frequent items of the Wikipedia Chinese corpus; (2) they are also included in at least one of two supplementary Chinese corpora, that is, the Contemporary Chinese Dictionary 40 and the Chinese Linguistic Data Consortium (2003) corpus (https://catalog.ldc.upenn.edu/LDC2003T09); (3) they do not contain any non-Chinese characters; and (4) they were judged to be Chinese words, rather than phrases or nonwords, and were not judged to be proper nouns by at least two of three independent raters (two authors, Nan Lin and Weiting Shi, and a graduate volunteer). We used supplementary Chinese corpora and subjective assessments because the boundaries between words and phrases in Chinese are vague, and there is often a discrepancy between different corpora and between corpora and subjective judgments 41. We excluded proper nouns because participants' knowledge of them may depend heavily on personal experiences and interests.
The second set of items was 4,915 Chinese words selected from the stimuli of two recently published fMRI datasets 42,43, a published study 10, and several unpublished experiments of ours. Items were excluded if they contained non-Chinese characters or were evaluated as nonwords, phrases, or proper nouns by at least two of three independent raters.
The last set of items was 211 Chinese translations for the English stimuli of the semantic rating experiments from Binder et al. 1 and Tamir et al. 5. The two studies included 535 and 166 English words, respectively. Most of their Chinese translations had already been included in the first two sets of items. The remaining 211 translations (which include a small number of phrases) were included as the last set of items. The rating data of these items were used to validate the results (see Technical Validation).
Procedure of the rating experiments. We conducted six rating experiments on the 17,940 items, each focusing on one semantic dimension. Each experiment was separated into 18 sessions of 1,000 words each (the last session contained 940 words). The data were collected through the free-access online platform "Wen Juan Xing" (https://www.wjx.cn/). Except for the rating experiment on the semantic dimension of emotion, which used a 13-point scale (−6 = very negative, 0 = neutral, and 6 = very positive), all rating experiments used 7-point scales (7 = very high, and 1 = very low). Before each rating session, participants read instructions containing the working definition of the semantic dimension to be rated (see Table 1) plus a few example words with high and low ratings. For the semantic dimension of motor, we further specified the working definition based on the charade/pantomime rating from previous studies 18,44-46: "Please rate the extent to which the meaning of a word can easily and quickly trigger corresponding body actions in your mind. Specifically, suppose you were playing a pantomime game in which one person had to identify a word based on how another person mimicked various actions that might be associated with its meaning. The easier a word is for the game, the higher its rating score should be; the harder a word is, the lower its rating score should be."
To control the quality of the rating data, after each session of rating, we calculated the correlation between the ratings of each participant and the mean ratings of the remaining participants using Jamovi (https://www.jamovi.org/). For a given session, if the correlation between the ratings of a participant and those of the others was lower than 0.5, then the data of this participant would be excluded 1, and the participant would be excluded from the rest of our experiment. This criterion resulted in the rejection of 28 sessions or 0.87% of the data. If the data of a participant were excluded, a new participant was recruited to complete the rating session. For each session of each experiment, 30 valid participants were recruited.
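The leave-one-out check described above can be illustrated with a minimal sketch. The authors ran this analysis in Jamovi; the Python below is only an illustration under stated assumptions, and the function names (`pearson`, `leave_one_out_check`) and the toy ratings are hypothetical. Only the 0.5 threshold comes from the text.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def leave_one_out_check(ratings, threshold=0.5):
    """ratings maps a participant id to that participant's ratings (same word order).

    Each participant is correlated with the mean ratings of all remaining
    participants; ids whose correlation falls below the threshold are flagged.
    """
    ids = list(ratings)
    n_words = len(ratings[ids[0]])
    correlations = {}
    for pid in ids:
        others = [ratings[o] for o in ids if o != pid]
        mean_others = [sum(r[i] for r in others) / len(others) for i in range(n_words)]
        correlations[pid] = pearson(ratings[pid], mean_others)
    excluded = [pid for pid, r in correlations.items() if r < threshold]
    return correlations, excluded

# Toy session with four hypothetical participants rating five words (7-point scale).
session = {
    "p1": [1, 2, 6, 7, 4],
    "p2": [2, 2, 5, 7, 3],
    "p3": [1, 3, 6, 6, 4],
    "p4": [7, 6, 2, 1, 4],  # reversed pattern, expected to fail the check
}
correlations, excluded = leave_one_out_check(session)
print(excluded)
```

In this toy session only the participant with the reversed rating pattern falls below the threshold; in the actual procedure, such a participant's session would be replaced by data from a newly recruited participant.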
Data analysis. For each experiment, we calculated the average rating for each word to represent its value on the rated semantic dimension. In addition to the six rated dimensions, a seventh semantic dimension was obtained by calculating the absolute value of the average emotion rating for each word. We believe this dimension (i.e., valenced vs. neutral) reflects the relatedness of word meanings to emotion (see Technical Validation for evidence of this argument). We added 1 to this measure to match its scale with that of the five nonemotion ratings.
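As a small worked example of the seventh dimension, the sketch below averages the emotion ratings of a word, takes the absolute value, and adds 1 so that the result spans 1 to 7. The function names and the toy ratings are hypothetical and serve only to illustrate the arithmetic.

```python
def mean_rating(scores):
    """Average rating of one word across participants."""
    return sum(scores) / len(scores)

def emotion_relatedness(emotion_scores):
    """Absolute value of the mean valence (-6 to 6), shifted by 1 to span 1 to 7."""
    return abs(mean_rating(emotion_scores)) + 1

# Hypothetical valence ratings from four participants.
negative_word = [-5, -6, -4, -5]
neutral_word = [0, 1, -1, 0]

print(emotion_relatedness(negative_word))  # 6.0
print(emotion_relatedness(neutral_word))   # 1.0
```

A strongly negative and a strongly positive word therefore receive similarly high values, whereas a neutral word stays near 1.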
As shown in Fig. 1, the distributions of the word ratings on all dimensions for the 17,940 items are skewed, indicating that for each dimension, only a small proportion of words contain rich semantic information. Because our experimental stimuli are composed of the most commonly used Chinese words, these distributions should be representative of the Chinese vocabulary. Figure 2 shows the correlations between the seven dimensions of rating data. Most correlations are low, indicating that the semantic dimensions are mostly independent. The highest correlations were found between the dimensions Vision and Motor (r = 0.49) and Vision and Space (r = 0.40). These correlations are reasonable because the visual system plays an important role in perceiving and acquiring motor and spatial information.
Computational extension dataset. Chinese. By combining the subjective rating data with computational models, we estimated the semantic ratings of a vocabulary of 1,427,992 Chinese words. This vocabulary was constructed by including the words that consist of Chinese characters and have counts of no less than 5 in the Xinhua news corpus (19.7 GB in total, collected from http://www.xinhuanet.com/whxw.htm).
We first tested a variety of context-insensitive models (GloVe and Word2vec, whose word embeddings are static) and context-sensitive models (GPT2, BERT, ERNIE, and MacBERT 47, whose word embeddings vary according to context) in predicting semantic ratings. The results show that Word2vec and MacBERT achieved the best performance in their respective categories in the cross-validation analysis (average Pearson correlation between the predicted and actual rating scores across all dimensions: 0.613, 0.782, 0.850, 0.877, 0.881, and 0.886 for GloVe, Word2vec, GPT2, BERT, ERNIE, and MacBERT, respectively). Therefore, in the following experiments, we utilized these two representative models to extract word representations for Chinese words. Specifically, for Word2vec, we used the default parameters with the Skip-Gram architecture and an embedding dimension of 300. To obtain the word embeddings from MacBERT, following Chersoni et al. 48, we extracted 10 to 1,000 sentences for each word (depending on the counts of the word) from the Xinhua corpus and used MacBERT to calculate the representations of the sentences. We then calculated the averaged sentence representation and used it as the word representation. We obtained 1,427,992 word representations from Word2vec and 900,243 from MacBERT (for which we only used words with counts greater than 10).
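The averaging step for the context-sensitive model can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `word_vector_from_sentences` is hypothetical, the sentence vectors are assumed to have been computed by an encoder beforehand, and taking the first 1,000 sentences is an assumption, since the text does not state how sentences were sampled.

```python
import numpy as np

def word_vector_from_sentences(sentence_vectors, min_sentences=10, max_sentences=1000):
    """Average the sentence representations collected for one word.

    Returns None when fewer than min_sentences vectors are available;
    at most max_sentences vectors contribute to the average.
    """
    if len(sentence_vectors) < min_sentences:
        return None
    sample = np.asarray(sentence_vectors[:max_sentences], dtype=float)
    return sample.mean(axis=0)

# Toy stand-in for encoder outputs: 1,500 four-dimensional vectors.
toy_vectors = [np.full(4, float(i)) for i in range(1500)]
word_vector = word_vector_from_sentences(toy_vectors)
print(word_vector)
```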
Afterward, for each of the seven semantic dimensions and each word embedding method, we trained a ridge regression model with 10-fold cross-validation to learn the mapping function from word representations to the mean semantic ratings of the corresponding words. We then used the best-trained regression model (the one that achieved the lowest validation error among the 10 models from the 10-fold cross-validation) to estimate the semantic ratings for the extended Chinese vocabulary.
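The training and selection procedure can be sketched with a closed-form ridge regression in NumPy. This is a sketch under stated assumptions rather than the released code: the regularization strength `alpha`, the random fold assignment, the mean squared error criterion, and all function names are hypothetical choices, and the embeddings and ratings are synthetic stand-ins.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def predict(model, X):
    w, b = model
    return X @ w + b

def best_fold_model(X, y, n_folds=10, alpha=1.0, seed=0):
    """Train one model per fold and keep the one with the lowest validation error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best_model, best_mse = None, np.inf
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit_ridge(X[train], y[train], alpha)
        mse = np.mean((predict(model, X[val]) - y[val]) ** 2)
        if mse < best_mse:
            best_model, best_mse = model, mse
    return best_model, best_mse

# Synthetic stand-ins: 500 "rated" words with 20-dimensional embeddings.
rng = np.random.default_rng(42)
X_rated = rng.normal(size=(500, 20))
true_w = rng.normal(size=20)
y_rated = X_rated @ true_w + 4.0 + rng.normal(scale=0.1, size=500)

model, val_mse = best_fold_model(X_rated, y_rated)

# Estimate ratings for "unrated" words from their embeddings.
X_unrated = rng.normal(size=(5, 20))
estimated = predict(model, X_unrated)
print(estimated.shape, round(float(val_mse), 4))
```

One such model would be fit per semantic dimension and per embedding method, and the selected model would then be applied to every word in the extended vocabulary.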
English. We also extended the computational dataset to an English vocabulary. This extension is based on the assumption that the Chinese and English semantic spaces share the coarse-grained semantic dimensions that we studied. The assumption received direct support from the high cross-language validity of our Chinese dataset: for all semantic dimensions, the ratings of Chinese words were strongly correlated with the ratings of their English translations from previously published English rating datasets (see Technical Validation). The English vocabulary includes 1,515,633 words. This vocabulary was constructed by including the words with counts of no less than 5 in the Wikipedia corpus (13 GB, downloaded from https://dumps.wikimedia.org/enwiki/latest/). Consistent with the methods for constructing the extensional Chinese dataset, we utilized Word2vec with default parameters and the pretrained BERT model (which has been shown to achieve the best performance at predicting semantic features compared with other variants 49) to extract word representations for each English word. Specifically, to obtain word embeddings from BERT, we first extracted 10 to 1,000 sentences (depending on the word counts) from the Wikipedia corpus for each word. We used BERT to calculate the sentence representations and then used the averaged sentence representation as the word representation. We obtained 1,515,633 word representations from Word2vec and 930,668 from BERT (for which we only used words with counts greater than 10).

Dimension: Working definition
Vision: the extent to which the meaning of a word can easily and quickly trigger corresponding visual images in your mind
Motor: the extent to which the meaning of a word can easily and quickly trigger corresponding body actions in your mind
Socialness: the extent to which the meaning of a word relates to relationships or interactions between people
Emotion: the extent to which the meaning of a word relates to positive or negative emotions
Time: the extent to which the meaning of a word relates to time, including early or late, length, sequence, frequency, etc.
Space: the extent to which the meaning of a word relates to spatial information, including location, direction, distance, path, scene, etc.
Table 1. Working definition of each semantic dimension.
To estimate the semantic ratings of English words, we first trained a model to align the English embedding space to the Chinese embedding space. Specifically, we extracted all single-word translation pairs (i.e., we removed Chinese-English pairs in which the English side contains more than one word) from a Chinese-to-English dictionary, CC-CEDICT (https://www.mdbg.net/chinese/dictionary?page=cc-cedict), which is, to our knowledge, the largest open-source Chinese-to-English dictionary, and obtained a Chinese-English bilingual lexicon of 19,424 word pairs. Next, we trained a ridge regression model to learn the mapping between the English embeddings and the Chinese embeddings based on the bilingual word pairs. Finally, the semantic ratings for the English words were estimated in two steps. First, we projected each English word representation from the English semantic space to the Chinese semantic space. Second, the projected word representation was taken as the input to the semantic rating prediction models for Chinese words, and the output of the model was taken as the estimated semantic rating for the English word.
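The two-step estimation for English words can be illustrated with a toy example. The sketch below is written under strong assumptions and is not the released code: the English space is simulated as a rotation of the Chinese space, the rating model is a stand-in linear function, and `fit_linear_map`, `predict_rating`, and the value of `alpha` are hypothetical.

```python
import numpy as np

def fit_linear_map(source, target, alpha=1.0):
    """Ridge-regularized linear map W such that source @ W approximates target."""
    d = source.shape[1]
    return np.linalg.solve(source.T @ source + alpha * np.eye(d), source.T @ target)

rng = np.random.default_rng(0)
dim = 8

# Toy bilingual lexicon: Chinese vectors and their "English" counterparts,
# simulated here as a rotation of the Chinese space.
zh_pairs = rng.normal(size=(200, dim))
rotation = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
en_pairs = zh_pairs @ rotation

# Alignment model: maps English embeddings into the Chinese space.
W_en_to_zh = fit_linear_map(en_pairs, zh_pairs, alpha=1e-3)

# Stand-in for a rating model trained on Chinese embeddings.
rating_weights = rng.normal(size=dim)
rating_bias = 4.0

def predict_rating(zh_vectors):
    return zh_vectors @ rating_weights + rating_bias

# New English words, simulated from hidden Chinese-space vectors.
zh_new = rng.normal(size=(5, dim))
en_new = zh_new @ rotation

projected = en_new @ W_en_to_zh          # step 1: project into the Chinese space
estimated = predict_rating(projected)    # step 2: apply the Chinese rating model
print(np.round(estimated, 2))
```

In this idealized setting the projection recovers the Chinese-space vectors almost exactly; with real embeddings the two spaces only partially match, so the projection is approximate.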

Data records
The SSDD 16 is available on the OSF repository at https://doi.org/10.17605/OSF.IO/N5VKE. The data are sorted into two main folders. The first main folder is "Main_Data," in which we provide the final subjective and estimated rating results. The second main folder is "Supplementary_Data," in which we provide the information of participants, the instructions for the rating experiments, the raw rating data, the validation data for the subjective ratings and computational extension ratings, the word embeddings, and the code for calculating and validating the estimated ratings. More details about the data are provided below.
Main Data. The average ratings across participants for the 17,940 Chinese words on the six rated semantic dimensions are provided in the file "Rated_semantic_dimensions.csv." We also provide the absolute value of the average emotion rating for each word as the seventh dimension, which is called "emotion_abs + 1" in the file. The estimated semantic ratings for the extensional Chinese and English vocabularies using different computational models are provided in four files named "Estimated_semantic_dimensions_word2vec_Chinese.csv", "Estimated_semantic_dimensions_macbert_Chinese.csv", "Estimated_semantic_dimensions_word2vec_English.csv", and "Estimated_semantic_dimensions_bert_English.csv".

Supplementary data. Information of participants.
The file named "Information_Participants.xlsx" provides the age and sex of the participants, the number of valid and invalid sessions and words that the participants' data contain, and which of the six experiments the participants participated in and provided valid data.
Instructions for the rating experiments. At the start of each session of each rating experiment, participants were provided an instruction that contained the working definition of the semantic dimension and a few examples. These instructions are provided in the file named "Instructions.docx."
Raw rating data. The raw data of the six rating experiments are provided under the subfolder named "Raw_rating_data." The data are sorted into six folders named by the rated dimensions (Vision, Motor, Socialness, Emotion, Time, and Space). Under each folder, 18 files (named "session*.csv," in which * is 1 to 18) correspond to the 18 sessions. In each file, the column named "Word" provides the items for which the semantic ratings were collected, for example, '花朵' (meaning "Flower" in English). The remaining columns are named by the initials of the 30 participants who rated the words and show the rating scores from each participant.
Validation data for the subjective ratings. In the file "Validation_Ratings.xlsx," we provide the data and results of the validation analyses for our ratings. In addition to the validation analyses and results mentioned in the section "Validity of the subjective rating dataset" for each dimension of ratings, we also provide the correlations of our ratings with several fine-grained semantic dimensions of ratings provided by Binder et al. 1.
Validation data for the computational extension ratings.In the file "Validation_Computational_Ratings.xlsx," we provided the data and results of the validation analyses for our computational extension dataset.
Word embeddings. The word vectors used to compute the computational extension datasets are provided in the subfolder named "Word embeddings," including the Word2vec and MacBERT embeddings for Chinese words and the Word2vec and BERT embeddings for English words.
Code for calculating and validating the estimated ratings.See the section "Code Availability."

Technical Validation
Reliability of the subjective rating dataset. We examined the reliability of the ratings by computing the intraclass correlation coefficients (ICCs) for each experiment and each session. For each experiment, we calculated the one-way random ICC because different participants rated different items; for each session, we calculated the two-way random ICC because the same 30 participants rated all items 50,51. The results are summarized in Tables 2 and 3. For all experiments and all sessions, the ICCs were above 0.9, which indicates good reliability of the ratings. In addition, in the SSDD, we rerated the socialness of 945 words from a prior study of ours 10 and obtained a cross-study correlation of 0.955.
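The two ICC variants can be sketched from their ANOVA mean squares. The text does not state whether single-measure or average-measure coefficients were reported; the sketch below assumes the average-measure forms, ICC(1,k) for the one-way random model and ICC(C,k) for the two-way consistency model, and the function names and toy ratings are hypothetical.

```python
import numpy as np

def icc_one_way_average(ratings):
    """ICC(1,k): rows are items, columns are the k ratings obtained for each item."""
    n, k = ratings.shape
    row_means = ratings.mean(axis=1)
    grand = ratings.mean()
    ms_between = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_within = np.sum((ratings - row_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / ms_between

def icc_two_way_consistency_average(ratings):
    """ICC(C,k): rows are items, columns are the same k raters for every item."""
    n, k = ratings.shape
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    grand = ratings.mean()
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    residual = ratings - row_means[:, None] - col_means[None, :] + grand
    ms_error = np.sum(residual ** 2) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows

# Toy data: four items rated by three raters who differ only by a constant offset.
item_effects = np.array([1.0, 3.0, 5.0, 7.0])
rater_offsets = np.array([0.0, 0.5, -0.5])
toy_ratings = item_effects[:, None] + rater_offsets[None, :]

icc_one_way = icc_one_way_average(toy_ratings)
icc_two_way = icc_two_way_consistency_average(toy_ratings)
print(round(float(icc_one_way), 4), round(float(icc_two_way), 4))
```

Because the consistency form removes constant rater offsets, it reaches 1 in this toy case, whereas the one-way form treats the offsets as error and stays slightly below 1.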

Validity of the subjective rating dataset.
We examined the validity of the ratings by calculating the correlations between the ratings obtained in the current study and those provided in several previous studies. The results are shown in Table 4. The full set of validation data is provided in the Supplementary Data of the database.
For the semantic dimension of vision (visual imageability), the ratings were validated based on Binder et al. 1, Liu et al. 22, and Su et al. 52. The rating instructions used in the four studies are similar. Binder et al. 1, Liu et al. 22, and Su et al. 52 obtained their ratings using English words, single-character Chinese words, and two-character Chinese words, respectively. Their correlations with the current study are 0.756, 0.627, and 0.821. The relatively lower correlation with Liu et al. 22 than with Su et al. 52 might arise because many Chinese characters have multiple meanings, so that single-character words are more often semantically ambiguous than two-character words.
For the semantic dimension of motor, the ratings were validated based on Heard et al. 45 and Binder et al. 1. The rating instructions used in the current study and Heard et al. 45 are very similar, both focusing on how easily a word's referent can be pantomimed. Similar motor-semantic ratings have been used to reflect the general impact of motor-semantic representation on cognition and neural activities in several previous studies 18,44,46,53,54. The correlation between Heard et al. 45 and the current study is 0.806. Binder et al. 1 did not set any general rating for the motor dimension. We therefore correlated our ratings with the four fine-grained motor ratings of Binder et al. 1, i.e., Head, UpperLimb, LowerLimb, and Practice. The correlations are in the range of 0.133 to 0.342. We further correlated our ratings with the mean of the four motor ratings and obtained a correlation of 0.426. These relatively low correlations indicate that the four motor dimensions rated by Binder et al. 1 may not be able to fully explain the content of our ratings. It is likely that our ratings reflect more dimensions of motor knowledge than those included in Binder et al. 1, such as postures and gestures. For example, the word '怀孕' meaning "pregnant" was rated high on the motor dimension. Being pregnant is associated with specific postures and whole-body motor features but not with specific motor features of the head, feet, or hands. In addition, people often use gestures to represent some particular concepts, especially when performing pantomimes or playing charades. These gestures should also be viewed as a type of motor knowledge as long as people can reach a consensus on their meanings. The motor-rating instructions used in Heard et al. 45 and the current study should be more sensitive in detecting these additional types of motor knowledge than those used by Binder et al. 1.
For the semantic dimension of socialness, the ratings were validated based on Diveica et al. 3 and Binder et al. 1. The core ideas of the instructions used in the three studies are all centered on interpersonal interactions and relationships. However, the instructions used in the current study and Binder et al. 1 were both brief, while those used by Diveica et al. 3 were much more detailed, that is, "a social characteristic of a person or group of people, a social behavior or interaction, a social role, a social space, a social institution or system, a social value or ideology, or any other socially relevant concept." The correlations of our ratings with those of Diveica et al. 3 and Binder et al. 1 are both 0.724.
For the semantic dimension of emotion (valence), the ratings were validated based on Xu et al. 55 and Binder et al. 1. The instructions used in the current study and Xu et al. 55 are similar, and the correlation between the two studies is 0.935. The emotion ratings of the current study are closely associated with two dimensions of Binder et al. 1, that is, Pleasant and Unpleasant. Therefore, we calculated composite scores of the two dimensions by subtracting the ratings of Unpleasant from those of Pleasant and correlated the scores with our ratings. The correlation is 0.795.
We also validated the absolute values of our emotion ratings. As mentioned above, this measure, which can be referred to as the dimension of "valenced vs. neutral", can reflect the relatedness of word meanings to emotion. To validate this measure, we correlated the absolute values with the emotion ratings collected by Tamir et al. 5. The correlation is 0.617, indicating that the absolute values of our emotion ratings can reflect the general emotional relatedness of words. Additionally, the absolute value of valence is related to another important dimension of emotion, called arousal. Arousal increases as a function of both positive and negative valence 56. The absolute value of the valence rating has been used to represent arousal in some previous studies 57. We correlated the absolute values of our emotion ratings with the arousal ratings provided in Xu et al. 55 and Binder et al. 1. The correlations are 0.585 and 0.532, respectively, which is consistent with the findings in the literature.
Finally, for the semantic dimensions of time and space, we validated the ratings based on Binder et al. 1. Binder et al. 1 did not set any general rating for these dimensions. Therefore, we averaged the ratings of two time-related dimensions (Time and Duration) to correspond to our time ratings and averaged the ratings of six space-related dimensions (Landmark, Path, Scene, Near, Toward, and Away) to correspond to our space ratings. The correlations are 0.715 and 0.716 for time and space ratings, respectively.
human annotated data. Furthermore, the mismatch between English and Chinese semantic spaces is a potential limitation of our method because we projected English words from the English semantic space to the Chinese semantic space to accomplish our estimation.

Fig. 1 Distribution of ratings for the seven semantic dimensions.

Fig. 2 Pearson correlation coefficients between the seven semantic dimensions of ratings.

Table 2. ICCs for each experiment (One-Way Random).

Table 3. ICCs for each session of each experiment (Two-Way Random, consistency).

Table 4. Results of the validation analysis. The language of the stimuli rated in each study is indicated in parentheses. Note: The following dimensions are calculated based on the original ratings from Binder et al. 1. (1) The scores of Motor_General are calculated by averaging the ratings of the dimensions belonging to the domain of Motor, which include Head, UpperLimb, LowerLimb, and Practice. (2) The scores of Pleasant_minus_Unpleasant are calculated by subtracting the ratings of Unpleasant from those of Pleasant. (3) The scores of Time_General are calculated by averaging the ratings of Time and Duration. (4) The scores of Space_General are calculated by averaging the ratings of the dimensions belonging to the domain of Spatial, which include Landmark, Path, Scene, Near, Toward, and Away.

Table 5. Results of the cross-validation analysis for the estimated ratings of Chinese words.

Table 6. Results of the validation analysis for the estimated ratings of Chinese words by Word2vec.

Table 9. Results of the validation analysis for the estimated ratings of English words by BERT.