Background & Summary

Research in the broad and multidisciplinary field of language (e.g., perception, production, processing, acquisition, learning, disorders, and multilingualism, among others) frequently uses pictures of objects as stimuli for different paradigms such as naming or classification tasks. Importantly, experimenters need to have access to normative data on diverse properties of the pictures (e.g., naming agreement, familiarity, or complexity) to be able to compare and generalise their results across studies. Crucially, in a world in which multilingualism is the norm (it has been estimated that more than half of the world’s population speaks two or more languages1,2), it is essential for researchers to be able to access such normative information on experimental items for different languages.

Snodgrass and Vanderwart3 created the first normalised picture dataset for the American English language, which has been adapted to other languages in order to conduct cross-linguistic research (e.g., British English4; Chinese5; Croatian6; Dutch7; French8; Argentinian Spanish9; Italian10; Japanese11; Spanish12). However, these datasets involve black and white line-drawings, which have been shown to generate weaker recognition than coloured pictures13,14. Considering these findings, researchers have developed coloured image datasets, also in different languages (e.g., English15; French13; Italian16; Russian17; Modern Greek18; Turkish19; Spanish20).

Despite all these efforts to develop standardised and open datasets of pictures and their properties in different languages, there are still some limitations. First, these datasets typically only include around 300 images (except for English15 and Canadian French15), which greatly restricts experimental designs. Second, these datasets were created independently of one another, and hence, they were normalised using different protocols. To overcome these limitations, we have created a database of 500 coloured pictures of concrete objects for 32 different languages or language varieties (i.e., American English, Australian English, Basque, Belgium Dutch, British English, Catalan, Cypriot Greek, Czech, Finnish, French, German, Greek, Hebrew, Hungarian, Italian, Korean, Lebanese Arabic, Malay, Malaysian English, Mandarin Chinese, Netherlands Dutch, Norwegian, Polish, Portuguese, Quebec French, Rioplatense Spanish, Russian, Serbian, Slovak, Spanish, Turkish, Welsh) using the same procedure for data collection and preprocessing. To this end, we developed a procedure similar to that reported in Duñabeitia et al.21, who created the initial dataset, which included 750 coloured images standardised for six commonly spoken European languages (i.e., British English, Dutch, French, German, Italian, and Spanish).

This Data Descriptor describes, in a comprehensive manner, the experimental method, the preprocessing protocol, and the structure of the data. Our aim is to make this database freely available to all researchers so that they can conduct empirical studies in any language. This is especially interesting for researchers concerned with any multilingual issue, since it offers them the opportunity to design studies for which the properties of the materials have been tested in a parallel manner for all the languages in their study. The datafile containing the whole dataset has been stored in a public repository22, and we encourage any researcher to use it for their studies.

Methods


We selected 500 coloured pictures with the highest name agreement across languages from a set of 750 pictures created by Duñabeitia et al.21. These pictures were in PNG format with a resolution of 300 × 300 pixels at 96 dpi and they have been stored in the public repository in a compressed folder for the convenience of readers and potential users. Additionally, given that some users may want to opt for different versions of the PNG pictures13,14, the same public repository includes a folder containing black and white and grey scale versions of the same drawings.

The same experimental software was used across sites. To this end, a custom program was generated using Gorilla Experiment Builder23 and replicated across languages with exactly the same instructions to ensure homogeneity in the protocols. Participants were told that they would see a series of images, and that they should type in the name of the entity represented in each picture. Each of the pictures was presented individually in the centre of the display of a computer or tablet. Participants were asked to make sure they spelled the word correctly, and to try not to use more than one word per concept. If they did not know the name of the element depicted, they could indicate this by typing “?”, and this would then be considered as an “I don’t know” response (see below). After typing the name, they were asked to indicate their self-perceived familiarity with the concept, using a 100-point scale slider (with the lowest value indicating “not familiar at all” and the highest value representing “very familiar”). Participants were asked to use the whole scale during the experiment and avoid using only the extreme values. In order to get used to the procedure, they completed two practice trials before starting the experiment. The entire experiment lasted about one hour, and breaks were inserted every 50 trials.

The data were collected during 2020 and 2021 in the context of a large-scale crowdsourcing study. Ethical approval for conducting the general study was obtained from the Ethics Committee of Universidad Nebrija (approval code JADL02102019), and from the participating institutions that required individual extensions or ethics approval from their local ethics boards. The data preprocessing procedure included checking the answers for spelling errors by native speakers of each language and merging variants of the same response, following the procedure described in Duñabeitia et al.21.

These datasets were then combined with the data for the 500 pictures extracted from the original study21 regarding Belgium Dutch, British English, French, German, Italian, Netherlands Dutch, and Spanish. In the original study, speakers of different languages were also asked to rate the visual complexity of the drawings on a 1-to-5 scale, and results showed very high cross-linguistic correlations (with r-values larger than 0.90). For this reason, and considering that those visual complexity scores, which are readily available from the original study, can be applied to the new set of languages reported here, in the current multi-centre study we decided to focus on familiarity as a different dimension that could vary across cultures. In this regard, it is worth noting that even if the original set of languages reported in Duñabeitia et al.21 did not include familiarity ratings, these could be easily obtained from published databases (e.g., British English24, Dutch25, French26, German27, Italian10, Spanish28). Together, data from a total of 2,573 participants are reported. See Supplementary Table for a full description of the dataset.

Data Records

The dataset resulting from the online testing is freely available in CSV and XLSX formats22. Each row in the file represents the aggregated data for one specific item across all participants who completed the test in each language, and each column represents a variable of interest. The column labelled Language includes a string referring to the specific language or variety out of the 32 tested to which the data refer (American English, Australian English, Basque, Belgium Dutch, British English, Catalan, Cypriot Greek, Czech, Finnish, French (standard), German, Greek (standard), Hebrew, Hungarian, Italian, Korean, Lebanese Arabic, Malay, Malaysian English, Mandarin Chinese, Netherlands Dutch, Norwegian, Polish, Portuguese, Quebec French, Rioplatense Spanish, Russian, Serbian, Slovak, Spanish, Turkish, or Welsh). The column labelled Code includes a number between 1 and 747 corresponding to the picture to which the data refer, numbered according to the number sequence used in the original MultiPic dataset21. The column Number of Responses corresponds to the number of individual responses collected for each item in each language (namely, the number of participants who provided an answer). The column named H Statistic includes the level of agreement in the responses for a given item in a given language across participants as measured by the H index29, which increases as a function of response divergence. The column Modal Response includes the strings corresponding to the most frequent response for each item in each language; note that in cases in which the same level of agreement was found for two different responses, both are presented separated by a “/” symbol (e.g., response1/response2). The column labelled Modal Response Percentage corresponds to the percentage of responses corresponding to the modal response out of all valid responses (namely, responses for each item in each language that do not correspond to “I don’t know” or idiosyncratic responses).
The column “I don’t know” Response Percentage provides the percentage of participants in each language who did not know the name of the displayed element and selected the corresponding button. The Idiosyncratic Response Percentage column includes the percentage of responses to each item in each language that were provided only by a single participant (N = 1). Finally, the column labelled Familiarity includes the mean familiarity score calculated from the total responses to each item using the 0-to-100 scale of all participants in each language or language variety. Supplementary Table presents a summary of the descriptive statistics of these measures for each language or variety, with the only exception being familiarity measures for those included in the original study21, since their items were not normed for this factor.
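The measures described above can be illustrated with a minimal Python sketch for a single item in a single language. It is not the authors' preprocessing code. It assumes the usual definition of the H index, H = Σ p_i · log2(1/p_i) over the distinct names given, and it further assumes that H is computed over valid responses only and that the “I don’t know” and idiosyncratic percentages are taken over all responses; the function name, dictionary keys, and example responses are hypothetical.

```python
from collections import Counter
from math import log2

DONT_KNOW = "?"  # marker participants typed when they did not know the name


def naming_measures(responses):
    """Name-agreement measures for one item in one language (illustrative)."""
    total = len(responses)
    counts = Counter(responses)
    dont_know = counts.pop(DONT_KNOW, 0)
    # Idiosyncratic responses: names provided by exactly one participant.
    idiosyncratic = sum(1 for n in counts.values() if n == 1)
    # Valid responses: neither "I don't know" nor idiosyncratic.
    valid = {name: n for name, n in counts.items() if n > 1}
    n_valid = sum(valid.values())
    # Assumption: H is computed over valid responses only.
    # H = sum_i p_i * log2(1 / p_i), with p_i = n_i / n_valid.
    h = sum((n / n_valid) * log2(n_valid / n) for n in valid.values())
    top = max(valid.values(), default=0)
    # Tied modal responses are joined with "/", as in the data file.
    modal = "/".join(sorted(name for name, n in valid.items() if n == top))
    return {
        "number_of_responses": total,
        "h_statistic": h,
        "modal_response": modal,
        "modal_response_pct": 100 * top / n_valid if n_valid else 0.0,
        "dont_know_pct": 100 * dont_know / total if total else 0.0,
        "idiosyncratic_pct": 100 * idiosyncratic / total if total else 0.0,
    }


# Invented responses from ten hypothetical participants.
example = naming_measures(["dog"] * 6 + ["puppy"] * 2 + ["hound", "?"])
```

In this toy case H is positive because two valid names compete; an item named identically by every participant would yield H = 0.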

Technical Validation

First, a descriptive analysis was performed to validate that the resulting datasets per language or variety were of sufficient quality. To this end, two measures were analysed across languages or varieties: the mean H statistic and the mean modal response percentage. All analyses were done using Jamovi30 and R31. The mean H statistic of the current general dataset was 0.53 (standard deviation = 0.58), with values ranging between the lower bound of 0.30 (Spanish) and the upper limit of 1.07 (Mandarin Chinese). The mean value of the H statistic is in line with those reported in earlier normative studies with different materials (e.g., 0.67 in17; 0.55 in18; 0.68 in9; 0.32 in13), and not surprisingly, is lower than the mean H statistic of 0.74 reported for the general set of 750 drawings normed in21. (Note in this regard that stimuli selection for the current study considered the 500 items with the highest name agreement from the original study in the 6 languages or varieties tested.) The mean modal response percentage of the general dataset was 86.8% (standard deviation = 16.5). The language with the lowest percentage of modal responses is Mandarin Chinese (73.30%), and the language with the highest percentage is Spanish (93%). These values are somewhat higher than the 80% reported in the original study21, and closely approach the mean modal response percentages provided in earlier studies with different sets of stimuli (e.g., 85% in8; 87% in18; 87% in3). Together, the relatively low mean H statistics and the high mean modal response percentages of the current dataset suggest a high name agreement across items, languages and varieties, validating the materials for their use in different kinds of experiments and tests. Fig. 1 illustrates the density plots of the H Statistic and the Modal Response Percentage in each language/language variety.
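As a rough illustration of this descriptive step, the following Python sketch computes the mean and standard deviation of a measure per language from rows laid out as described under Data Records. The analyses reported here were run in Jamovi and R, so this is only an equivalent sketch. The rows are invented for illustration and are not values from the dataset, the function name is hypothetical, and the use of the sample (n − 1) standard deviation is an assumption.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, stdev

# Toy rows using column labels from the Data Records section.
# The numbers are invented; they are not taken from the published file.
SAMPLE_CSV = """Language,Code,H Statistic,Modal Response Percentage
Spanish,1,0.00,100.0
Spanish,2,0.40,90.0
Spanish,3,0.50,86.0
Welsh,1,0.20,96.0
Welsh,2,1.00,70.0
Welsh,3,0.90,74.0
"""


def summarise_by_language(csv_file, column):
    """Return {language: (mean, sample SD)} for one numeric column."""
    values = defaultdict(list)
    for row in csv.DictReader(csv_file):
        values[row["Language"]].append(float(row[column]))
    # stdev() needs at least two items per language.
    return {lang: (mean(v), stdev(v)) for lang, v in values.items()}


h_summary = summarise_by_language(io.StringIO(SAMPLE_CSV), "H Statistic")
modal_summary = summarise_by_language(
    io.StringIO(SAMPLE_CSV), "Modal Response Percentage"
)
```

To apply the same function to the real data, the `io.StringIO` object would be replaced by an open handle on the CSV file downloaded from the repository.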

Second, a series of correlation analyses were conducted to validate individual dataset quality. To that end, and considering that there is no a priori reason to expect cross-language similarities in name agreement measures, since each language has its own particular lexicon, initial focus was on familiarity values. While the specific name or names used to refer to an entity can easily vary across languages, yielding heterogeneous name agreement scores, the way the materials were created and selected pointed to high familiarity with the entities depicted across cultures. Consequently, reasonably high cross-language correlation coefficients were expected between familiarity scores. A correlation analysis performed on the different familiarity scores obtained for each item in each language showed that all the Pearson pairwise correlation coefficients were significant at the p < 0.001 level, with r-values ranging between 0.351 (Catalan vs. Turkish) and 0.919 (Greek vs. Cypriot Greek), and a very high mean correlation coefficient of 0.702 across tests.
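The pairwise analysis described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis script used for the reported values: the language labels and familiarity scores are invented, the function names are hypothetical, each pair is correlated over the picture codes the two languages share, and the significance tests (p-values) are not computed.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def pairwise_correlations(scores):
    """Correlate every pair of languages over the items they share."""
    results = {}
    for lang_a, lang_b in combinations(sorted(scores), 2):
        shared = sorted(set(scores[lang_a]) & set(scores[lang_b]))
        results[(lang_a, lang_b)] = pearson_r(
            [scores[lang_a][code] for code in shared],
            [scores[lang_b][code] for code in shared],
        )
    return results


# Invented mean familiarity scores (0-to-100 scale), keyed by picture code.
familiarity = {
    "Language A": {1: 90.0, 2: 60.0, 3: 75.0, 4: 40.0},
    "Language B": {1: 85.0, 2: 55.0, 3: 80.0, 4: 35.0},
    "Language C": {1: 70.0, 2: 65.0, 3: 60.0, 4: 50.0},
}

correlations = pairwise_correlations(familiarity)
mean_r = sum(correlations.values()) / len(correlations)
```

With k languages the function returns k(k − 1)/2 coefficients, and `mean_r` is their simple average.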

Fig. 1 Density plots of the H Statistic and the Modal Response Percentage across items in each of the languages and varieties. Dots represent mean values for individual items and vertical black lines represent mean values across items.

As a final validation analysis, we took a close look at the pool of varieties from the same language, since it was expected that different dialectal forms or varieties of a given language would elicit similar responses across measures. To this end, the name agreement in the 4 different varieties of English that were included in the dataset (i.e., American English, Australian English, British English, and Malaysian English) was analysed. A correlation analysis of the H statistic showed that responses overlapped highly across varieties, with the lowest r-value being 0.579 (American English vs. Malaysian English) and the highest being 0.772 (American English vs. Australian English), and all correlations being significant at the p < 0.001 level. Similarly, the mean percentage of modal responses was also significantly correlated across varieties, with r-values ranging between 0.551 (American English vs. Malaysian English) and 0.759 (American English vs. Australian English), again with all p-values being below 0.001.