## Introduction

Languages differ markedly in the number of colour terms in their lexicons. The languages of some remote populations are reported to have no or as few as two colour terms whereas most European languages have many more; at least 11 according to most definitions of a colour term (Berlin and Kay, 1969/1991; Wierzbicka, 2015). We wish to consider the mechanisms whereby the number of colour terms increases in a language and to report data where this has occurred in a remote population. Furthermore, we wish to show that understanding of colour term acquisition may be enhanced by considering machine learning.

The assessment of colour terms in remote populations is not without its problems. Using paper swatches, as in all previous studies, obtaining colour name responses to a large number of samples can be time-consuming. Thus, studies vary greatly in the number of colour samples they have used, from as few as 23 (Lindsey et al., 2015) to the much more extensive World Color Survey (Kay et al., 2010), where responses were obtained for 320 samples. Even here, the colour terms were obtained from only highly saturated colour samples as is general in cross-cultural studies of colour naming (Berlin and Kay, 1969/1991, Bimler and Uusküla, 2017; Gibson et al., 2017; Kay et al., 2010; Lindsey et al., 2015; Roberson et al., 2005), despite worries about the outcomes being affected by the variation in that dimension (Paramei, 2005; Roberson et al., 2005; Witzel, 2016; Witzel, 2018; Witzel et al., 2015). To our knowledge, there have been few, if any, studies with remote populations that have used computerised colour presentations to overcome these problems, perhaps because of worries about uncontrolled colour reproduction and the unfamiliarity of computer screens to indigenous populations. The latter objection has been shown to be of no concern for the Himba (Biederman et al., 2009; Linnell et al., 2018) and the former is now easily overcome (Mylonas et al., 2019; Paggetti et al., 2016). In consequence, we revisited the Himba who, when tested in 2004 (Roberson et al., 2005), were found to use a 5-colour-term grue language, by which is meant that the same word is used for green and blue regions of colour space (Kay, 1975). We will not only examine the colours tested in 2004 but also those from the inner core of colour space. Here, we may find colour terms that could not have been found in ours and all other previous cross-cultural research.

We wish to consider the augmenting of colour terms to a colour lexicon by which we do not mean the addition of any colour term but only those that can be considered to be colour categories (Berlin and Kay, 1969/1991; Davidoff, 2015; Gibson et al., 2017; Levinson, 2000; Lindsey et al., 2015; Mylonas and MacDonald, 2016). For example, in English, the term crimson denotes a type of red—not a different category—so would be an addition rather than an augmentation of the colour lexicon. There have been many attempts to answer the augmentation question. The earliest extensive attempt was by Berlin and Kay (1969/1991), who used the term “basic” for the major colour terms and declared an order for the acquisition of colour categories. In their view, augmentation occurred by partitioning existing colour terms that were previously able to name all colour space. Though behavioural rules were given by which a term could be considered basic, their origin was thought to be in universal colour physiology (Kay and McDaniel, 1978). Their physiological account proposed six colour terms that align with the postulated opponent channels of Hering and all other colour terms arise from their mixture (Hering, 1878/1964; Kay and McDaniel, 1978). The idea that primary colours are associated with the opponent-process cells in early vision has been disputed (Abramov and Gordon, 1994; Valberg, 2001; Wuerger et al., 2005). Nevertheless, these primary perceptual categories contain examples that are held to be unique in that they are perceived to contain no other colour and have been considered by some important in the development of colour categories (Forder et al., 2017; Kuehni, 2005; Philipona and O’Regan, 2006, but see also Jameson, 2010).

A notable subsequent alternative explanation is that colour categories appear maximally spaced within the 3D sub-volume of perceptual colour space. Colour lexicons, it is argued, develop by optimising the division of an irregular perceptual colour space to maximise similarity within a category and minimise similarity across categories (Boynton and Olson, 1987; Jameson and D’Andrade, 1997; Regier et al., 2007; Regier et al., 2015; Zaslavsky et al., 2018). These accounts are based on discrimination data and, though this could be considered independently of the underlying physiology (Jameson and D’Andrade, 1997), they are presumed reliant on early perceptual mechanisms (Kay and McDaniel, 1978; Zaslavsky et al., 2019). An optimal division of the surface of the colour solid into six well-formed categories by Regier et al. (2007) corresponded to the English terms: white, black, red, yellow, green and blue-purple. However, a subsequent study showed that the optimal criterion produced inadequate results for colour lexicons with more than 6 terms (Jraissati and Douven, 2017) and it is unclear whether the optimal partition principle can hold across the colour space (Lindsey and Brown, 2014).

There is a somewhat similar proposal, though from a clearly cultural perspective, whereby a category is achieved through language (colour terms) rather than early level colour physiology (Roberson et al., 2000). In that approach, the greater similarity for within-category colours is known as Categorical Perception (CP) though, in practice, to determine CP a colour category needs to have a large extension in colour space and is unavailable, say, for assessing a category such as yellow (see Davidoff, 2015).

There are other culturally determined views on augmentation; for example, the emergence hypothesis (Levinson, 2000; Lyons, 1995) proposed that new colour categories emerge in regions of colour space that previously were not named or were named inconsistently (Everett, 2005; Gooyabadi et al., 2019; Kay and Maffi, 1999; Levinson, 2000; Lindsey and Brown, 2014; Lindsey et al., 2015). Yet, a further culturally based alternative for augmentation is that terms are simply borrowed from other cultures. Such loanwords are a more than “plausible” alternative for augmentation (Lindsey and Brown, 2009).

Turning to existing empirical data, there are at least two accounts of changes over time to a colour lexicon (Kuriki et al., 2017; Mylonas and MacDonald, 2016). Kuriki et al. found differences between current Japanese colour naming and that recorded by Uchikawa and Boynton (1987). Most of the changes were in the form of additions or clustering of colour names, but there was also the report of a new basic term for light blue, as found in many European and other Asian languages (Androulaki et al., 2006; Paramei, 2005; see Paramei & Bimler, 2021 for a review). Through a crowdsourcing colour naming experiment, Mylonas and MacDonald suggested the augmentation of the English inventory from the 11 basic colour terms reported by earlier studies (Boynton and Olson, 1987; Sturges and Whitfield, 1995) to 13 terms including lilac and turquoise. The candidacy of turquoise as a basic colour term is further supported by the recent category insertion hypothesis, in which an incipient basic colour category is added at the BLUE-GREEN category boundary (Paramei et al., 2018; Roberson et al., 2009; for a review see Paramei & Bimler, 2021).

For remote groups with smaller colour lexicons, all previous studies offer only a single snapshot of the development of colour lexicons on the surface of the colour solid (see Lindsey and Brown, 2006, Regier et al., 2015). It is, therefore, of great interest to return to a remote society and ask whether there have been augmentations to their colour lexicon. We note that the Himba, while still outwardly similar to the population of 15 years ago, now have more contact with other cultures. These contacts are not great, yet we have already documented that they affect local/global processing (Caparos et al., 2012, 2013), the perception of geometric illusions (Bremner et al., 2016) and lightness perception (Linnell et al., 2018). To consider whether there might also be changes to their colour categories, we clearly need and thus introduce a procedure to identify the minimal number of independent colour categories that can name all colours.

We wish to consider, from the perspective of machine learning, the long-standing debate on whether perceptual or linguistic similarity is the critical force driving the augmentation of colour lexicons. This debate is also relevant for other modalities of perceptual learning, such as acoustic (Goudbeek et al., 2009), emotion (Azari et al., 2020) and object (Khaligh-Razavi and Kriegeskorte, 2014) category acquisition. In a computational framework, the question can be viewed as whether unsupervised learning or supervised learning is the most appropriate strategy to train an artificially intelligent system that can communicate about colour with speakers of different languages (Mylonas et al., 2010; Mylonas et al., 2013). In unsupervised learning, algorithms deduce some inherent structure in the data using only unlabelled samples and produce a set of universal categories based on perceptual similarity between colours (Kuriki et al., 2017; Lindsey and Brown, 2006; Regier et al., 2007; Yendrikhovskij, 2001; Zaslavsky et al., 2019). In supervised learning, algorithms learn colour categories from labelled data in different languages based on linguistic similarity between colours. There are many different supervised colour-naming models, using variants of Gaussian or Gaussian-Sigmoid distributions (Benavente et al., 2008; Lammens, 1994; Mylonas et al., 2010), multinomial conditional probability distributions (Chuang et al., 2008; Heer and Stone, 2012) and deep neural networks (Cheng et al., 2017). Here, we use criteria expressed as an ensemble of random decision trees (Mylonas, 2020), which have been shown to be highly effective for many diverse supervised classification or regression problems (Breiman, 2001; Cutler et al., 2007; Gislason et al., 2006).

Effective algorithms for constructing ensembles of decision trees from training examples ensure that the models are strongly diversified by infusing randomisation into the learning algorithm and by exploiting a different random subset of the training data at each run. One advantage of such ensembles is that they do not assume commensurate feature dimensions or normally distributed feature values. To generalise our observations to any colour and identify the indispensable colour terms in the Himba language, we will use a Rotated Split Trees (RST) approach in regression mode that predicts probabilities and provides state-of-the-art performance in computational colour naming models (Mylonas, 2020). Given our training set of colour points $$X = x_1, \ldots, x_n$$ with their Himba naming responses $$Y = y_1, \ldots, y_T$$, and a free parameter of $$B = 100$$ trees, RST ensembles $$B$$ random split trees, $$f_b = \{T_1, \ldots, T_B\}$$, using the full training dataset for each tree $$T$$ (Geurts et al., 2006). Prior to any splitting, a proper rotation matrix $$R$$ is generated using Householder QR decomposition (Andrews et al., 2017; Blaser and Fryzlewicz, 2016; Householder, 1958). The rotated trees have different orientations and vastly dissimilar data partitions, and are capable of producing smoother, non-axis-parallel decision boundaries than un-rotated trees. Regular, un-rotated trees make splits parallel to the axes of the feature space of the dataset by selecting at each node the split on the attribute that produces the maximum gain. In contrast, rotated trees combine attribute values and produce a rotated hyperplane with a smaller number of splits; they tend to outperform un-rotated trees by better separating instances of the dataset that pertain to distinguishable categories. To grow a tree, RST splits the training data at each node fully at random and independently of the target variable, unlike the optimum criterion of Random Forests (Breiman, 2001).
Top-down binary recursion continues until no further splits are possible, that is, until all samples have been partitioned into their own leaf node. The predictions of each tree are then aggregated to predict the distribution of colour names for each test colour sample $$\widetilde x$$. In practice, the RST estimator favours colour names with high probability to maintain congruence between observed and predicted data. So, more frequent and consistent colour categories tend to subsume less common and inconsistent terms. In Table S1, we show a comparison of computational colour naming models on the Munsell array (n = 320 chips) against Sturges and Whitfield’s (1995) results in English (Mylonas, 2020). RST performed equally well (100% classification accuracy) with other state-of-the-art colour naming models (SFKM, TSMES and NICE, see also Parraga and Akbarinia, 2016) on Sturges and Whitfield’s results but RST also identified five additional terms on the Munsell array. Thus, RST better determines the number of colour categories from the data compared to previous models that constrained their lexicons only to the 11 basic colour terms. In addition, RST produces perfect performance on estimated distributions of the 11 basic terms.
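The rotation step can be illustrated with a minimal sketch. It assumes a three-dimensional colour space and, for brevity, orthonormalises random Gaussian vectors with Gram-Schmidt instead of the Householder QR decomposition named above. The function names `random_rotation`, `rotate` and `det3` are hypothetical, and this is not the RST implementation of Mylonas (2020).

```python
import math
import random


def random_rotation(rng):
    """Return a random proper 3x3 rotation matrix (orthonormal rows, det = +1).

    Assumption: Gram-Schmidt is used here as a simple stand-in for the
    Householder QR decomposition mentioned in the text.
    """
    rows = []
    while len(rows) < 2:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        # Remove the components along the rows accepted so far.
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # reject degenerate draws
            rows.append([a / norm for a in v])
    a, b = rows
    # The cross product completes a right-handed basis, so det = +1.
    rows.append([a[1] * b[2] - a[2] * b[1],
                 a[2] * b[0] - a[0] * b[2],
                 a[0] * b[1] - a[1] * b[0]])
    return rows


def rotate(R, point):
    """Apply rotation matrix R to a colour coordinate triple."""
    return [sum(r * p for r, p in zip(row, point)) for row in R]


def det3(M):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


R = random_rotation(random.Random(0))
rotated = rotate(R, [50.0, 20.0, -30.0])  # a hypothetical CIELAB point
```

Because the matrix is a proper rotation, distances between colour points are preserved; only the orientation of the axes along which a tree would split changes.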

In contrast, perception-based-learning methods process unlabelled colours to group them into clusters based on statistical regularities of the data. k-means is the most common clustering method where colour samples are assigned into a predefined number of k categories based on the Euclidean distance from each category’s centroid in CIELAB (Kuriki et al., 2017; Lindsey and Brown, 2006; Yendrikhovskij, 2001; Zaslavsky et al., 2019). The number of k clusters needs to be defined in advance based on the number of colour terms in the observed data or through statistical analysis. Then the k-means algorithm can be used to construct a set of imaginary colour naming systems without any colour naming observations.
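As an illustration of the clustering described above, the following is a minimal sketch of k-means (Lloyd's algorithm) on colour coordinates, using Euclidean distance to each centroid. The function name `kmeans`, the seed handling and the example coordinates are assumptions made for illustration and do not reproduce the procedures of the cited studies.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Cluster colour coordinates (e.g. CIELAB triples) into k groups.

    Returns (centroids, labels). Stops early once assignments are stable.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct samples
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical CIELAB-like samples: three reddish and three bluish points.
samples = [(50, 60, 40), (52, 58, 42), (48, 62, 38),
           (50, -60, -40), (51, -58, -42), (49, -61, -39)]
centroids, labels = kmeans(samples, k=2)
```

Note that the clusters carry no names: the algorithm only returns indices, which is why an additional matching step is needed before its output can be compared with observed colour terms.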

To evaluate the performance of unsupervised and supervised machine-learning methods we will measure classification accuracy against observed data collected at different time periods and in different colour spaces, namely CIELAB and sRGB. The reasons for examining model performance in different colour coordinate systems are twofold. First, we would desire models to be neutral about the approximately perceptually uniform structure of CIELAB and the non-uniform perceptual structure of sRGB. We selected CIELAB over CIELUV for consistency against earlier studies and because the latter can only marginally improve the accuracy of machine-learning methods over CIELAB (Table 5.2 in Mylonas, 2020). Second, we wish to quantify, against observed data, the divergence of colour categories produced by unsupervised perceptual learning in different colour spaces reported earlier using computer simulations (Steels and Belpaeme, 2005) that we feel has not been given appropriate attention by recent studies (Chaabouni et al., 2021).

Comparing different machine-learning methods and selecting the most effective model for communicating with humans at different time periods provides a new framework to advance our understanding of colour categorisation and helps identify the crucial factors that determine category acquisition.

## Methods

### Participants

There are several groups of Bantu origin in north-west Namibia but the Himba are the most remote; they still have very few Western artifacts including clothes and the women cover themselves daily in ochre. Himba is part of the Niger-Congo language family (Zone B). Himba is a dialect of Herero and they can communicate with speakers of that language. They are no longer entirely pastoralists and grow some maize (Bollig, 2010, p. 206).

Fifty-five native Himba speakers (female: 23, male: 32, mean age = 27.4, age range = 16–60, mean years of schooling = 1.4; schooling range 0–10) from remote villages in north-west Namibia completed the experiment. Of these, 31 were below the mean age (i.e., young) and 38 never went to school. The educational attainment among Himba people remains low with 65% of the adults found to be non-literate in this region (Ndimwedi, 2016). Participants were compensated for their time with gifts of flour. The study received ethical approval from Goldsmiths University of London (N°1390, 4th of June 2018).

### Stimuli and apparatus

Test stimuli were 2 degrees uniformly coloured discs with a black outline of 1 pixel. The stimuli consisted of 589 simulated samples approximately uniformly distributed in the Munsell Renotation Data and restricted in the sRGB gamut plus 11 achromatic samples (Mylonas and MacDonald, 2010; Newhall et al., 1943). To achieve an approximately uniform sampling within the Munsell colour solid, we followed the suggestions of Billmeyer in Sturges and Whitfield (1995). Specifically, a variable number of hues were sampled at different levels of Value and Chroma. At Chroma 2, 10 hues were sampled, whilst at each successive Chroma step the sampled hues were increased by 10. That means from Chroma 8 to the boundaries of the sRGB gamut, all 40 hues were sampled.
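The hue-sampling rule can be written out as a short sketch. It assumes that Chroma increases in steps of 2, which is consistent with all 40 hues being reached at Chroma 8; the function name `hues_sampled` is hypothetical.

```python
def hues_sampled(chroma, step=10, max_hues=40):
    """Number of Munsell hues sampled at a given Chroma level.

    Assumption: Chroma levels are even (2, 4, 6, ...), so each successive
    level adds `step` hues until all `max_hues` hues are sampled.
    """
    if chroma < 2 or chroma % 2:
        raise ValueError("Chroma levels are assumed to be even and >= 2")
    return min(step * (chroma // 2), max_hues)


# Number of hues per Chroma level under the stated assumption.
schedule = {chroma: hues_sampled(chroma) for chroma in (2, 4, 6, 8, 10)}
```

Under this reading, 10, 20 and 30 hues are sampled at Chroma 2, 4 and 6, and all 40 hues from Chroma 8 outwards.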

Because of the limits of the sRGB gamut, the overlap between our surface colours and the sampling of the World Color Survey is 91%, but we have shown in earlier studies (Mylonas and MacDonald, 2010) and in Table S1 that we can estimate the distribution of basic colour terms in English with 100% accuracy on the surface of the Munsell solid. The 600 colour samples in total were presented one at a time, in a random order for each observer, against a neutral grey background with a luminance of 40 cd/m².

Two Asus Transformer Mini T102HA (10.1”) were calibrated using a ColorCal CRS colorimeter (Cambridge Research Systems, Rochester, UK) and a RadOMA spectroradiometer (Gamma Scientific, San Diego, California). The measured CIE 1931 chromaticity coordinates of the white point of the monitors were x = 0.3067, y = 0.3318, and x = 0.3055, y = 0.3296, with correlated colour temperatures of 6816 K and 6907 K, respectively. Repeating the spectro-radiometric measurements of the monitors a month after the fieldwork showed only a minimal drift of their white point over time (<0.003). The stimulus presentation was controlled by PsychoPy version 1.84.2 software (Peirce, 2007).

### Procedure

Observers were seated inside a tent during daytime approximately 80 cm away from a monitor. The task of the observers was to name out loud the colour of the stimulus so that others would know to which colour they were referring. Observers were free to use as broad or narrow names as they liked. They were not instructed to answer “don’t know” if unsure but such answers were always accepted without question. Under direction, responses were both audio-recorded and typed by a research assistant who spoke the native language of the participants.

## Results

The new Himba colour naming dataset included in total 33,000 raw naming responses for 600 colour samples from 55 observers (mean age = 27.38, SD Age = 9.79, age range = 16–60; Female = 41.82%, Male = 58.18%). For the data analysis, we considered colour names given by two or more observers. Unique responses (0.8%) from single observers were excluded because we could not be confident that other observers would understand the colour name and, therefore, these responses were considered idiosyncratic. Contrary to Lindsey et al. (2015) where the Hadza observers were explicitly instructed to use a specific don’t know response to cluster together difficult to name regions, Himba observers did not offer a name to a colour in 665 responses. These were sparsely distributed to 434 samples of colour space (maximum 4 don’t know responses to any sample) and were excluded from further analysis (see Fig. S1). This filtering resulted in a dataset of 32,087 responses. Before considering outcomes from machine learning, we first analyse our data in terms of frequency and modal naming as is usual in previous studies.
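The filtering described above can be sketched as follows. The data layout (tuples of observer, sample and name, with `None` marking a response where no name was offered) and the function name `filter_responses` are assumptions made for illustration, not the format of the actual dataset.

```python
from collections import defaultdict


def filter_responses(responses, min_observers=2):
    """Keep responses whose colour name was used by at least `min_observers`.

    `responses` is a list of (observer_id, sample_id, name) tuples, where
    name is None when the observer offered no name (hypothetical layout).
    """
    observers_per_name = defaultdict(set)
    for observer, _sample, name in responses:
        if name is not None:
            observers_per_name[name].add(observer)
    shared = {name for name, obs in observers_per_name.items()
              if len(obs) >= min_observers}
    # Drop unnamed responses and names used by a single observer only.
    return [r for r in responses if r[2] in shared]


# Toy example with invented responses.
raw = [(1, "s1", "serandu"), (2, "s1", "serandu"),
       (1, "s2", "idiosyncratic"), (2, "s2", None)]
kept = filter_responses(raw)
```

In the toy example only the two `serandu` responses survive, mirroring the exclusion of single-observer names and unnamed colours.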

### Frequency of colour terms

In line with the analysis in Roberson et al. (2005), we report the frequency of each colour term in our new Himba data and the number of observers producing each term. The meaning of frequency for our data needs clarification. It is, as usual, the number of times that each colour name is used to describe our colour stimuli but, as we sample from the whole of colour space, frequency of any colour term is likely to increase with the extent of a category in colour space. Figure 1 shows the centroids of the more frequent Himba colour names given by two observers or more scaled by their frequency of occurrence. Serandu (19.6%; reddish) was the most frequent colour name followed by burou (19.3%; bluish). Both terms were offered by all 55 Himba speakers. These were followed by grine (12.4% by 51 observers; greenish). Zoozu (6.9% by 39 observers; blackish) was the fourth most frequent term followed closely by dumbu (5.9% by 45 observers; yellowish), vapa (5.3% by 54 observers; whitish), pinge (5.1% by 32 observers; pinkish), zorondu (3.7% by 20 observers; a second blackish term), ranje (3.2% by 28 observers; orange-ish), ngara (3.2% by 33 observers; pale yellowish) and vinde (3.1% by 33 observers; brownish). Colour terms with relatively lower frequency of use (<3%) include worindja, peese, honi, pepera, vahe, baraona, kuze, dovazu, otji, siriva, hurune, gerei and mbambi. We found no purple terms in the Himba colour lexicon. The nearest terms to purple, peese (2.1% by 15 observers) and pepera (1.6% by 4 observers), both describe an overlapping magenta-ish region. The lack of a Himba purple term rather questions the claim that indigenous cultures share the same categories as infants (Skelton et al., 2017).

In total, the Himba offered 24 colour terms shown in Fig. S2 for naming the 600 approximately uniformly distributed Munsell samples of the computerised experiment, a large increase over the 9 colour terms (5 frequent and 4 infrequent) that were reported in our previous study for naming the 160 fully saturated samples of the physical Munsell Book of Color (Roberson et al., 2005) and the 10 colour terms elicited in a list task (Grandison et al., 2014). In agreement with earlier studies, Himba speakers did not make use of modifiers in their responses.

The high frequency of grine offered by 51 out of 55 observers (mean age = 27.38; SD = 9.84) requires further investigation because in our previous study, grine was an infrequent term (0.5%) that was offered by only 2 observers out of 31 (Roberson et al., 2005), though Grandison et al. (2014) in their list task found it was offered by 43 out of 62 observers. For stimuli (n = 98) where grine was now the most frequent term, the ratio of using grine over burou was higher, t(53) = 3.1, p < 0.003, for younger (n = 31; younger mean age = 21.06, SD = 4.09, age range = 16–27; Mratio = 0.83, SD = 0.28) than for older (n = 24; older mean age = 35.54, SD = 9.07, age range = 28–60; Mratio = 0.55, SD = 0.37) Himba. Those who had attended school (n = 17; educated mean age = 22, SD = 7, age range = 16–41; Mratio = 0.91, SD = 0.1) also used grine more than burou for these stimuli than did Himba who had not been to school (n = 38; non-educated mean age = 29.79, SD = 10.05, age range = 16–60; Mratio = 0.61, SD = 0.38), t(53) = 3.2, p = 0.002. However, education by itself was not the main determinant of colour term change as, considering only the young observers (n = 31), there was no difference between the educated (n = 14; younger educated mean age = 19.57, SD = 4.16, age range = 16–26; Mratio = 0.91, SD = 0.1) vs. non-educated (n = 17; younger non-educated mean age = 22.29, SD = 3.70, age range = 16–27; Mratio = 0.76, SD = 0.36) young Himba; t(29) = 1.5, p = 0.14. Furthermore, we found no significant difference, t(53) = 1.2, p = 0.24, between male (Mratio = 0.66, SD = 0.36) and female (Mratio = 0.77, SD = 0.33) Himba in using grine over burou for greenish samples. We found very large colour differences between the centroid of the GRUE category in the earlier data and the centroid of the new BLUE (burou) category, ΔΕab = 45.1, and also against the centroid of the new GREEN category, ΔΕab = 17.95, which provide support for the augmentation of the Himba colour lexicon. 
Yet, colour differences above 10 CIELAB ΔE units cannot be trusted (Xu et al., 2001) and alternative methods are required to demonstrate the augmentation of the new colour terms.
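The group comparisons above can be checked approximately from the reported summary statistics. The sketch below assumes an equal-variance (pooled) two-sample Student t-test, which the text does not state explicitly; the function name `pooled_t` is hypothetical, and rounding of the reported means and SDs limits the precision of any such check.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance, from summary statistics.

    Returns (t, degrees_of_freedom). Assumes equal population variances.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / standard_error, df


# Educated (n = 17) vs. non-educated (n = 38) ratios of grine over burou.
t_value, df = pooled_t(0.91, 0.10, 17, 0.61, 0.38, 38)
```

With the rounded values for the educated and non-educated groups, this gives 53 degrees of freedom and a t value close to the reported 3.2.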

### Consensus of modal terms

Continuing with analyses of our new data in line with the previous ones, we consider consensus, which describes the agreement among observers in naming colour samples (Brown and Lenneberg, 1954; Boynton and Olson, 1987). The modal term with the highest consensus for each colour sample was determined as the peak of the conditional probability P(n | c) that each of the n = 24 names was reported for each of the c = 600 colours (Chuang et al., 2008; Heer and Stone, 2012). Figure 2 shows the 600 colour samples named by most Himba using 10 modal terms. 73% of all stimuli were named with an agreement of 0.5 or above (max = 1), all involving the major modal terms; the number c of colour samples for which each of these terms was modal was: serandu, c = 219; burou, c = 131; grine, c = 98; zoozu, c = 69; dumbu, c = 42 and vapa, c = 35. The high consensus of the frequent grine term for a large number of samples confirms its status as an important new colour category in Himba. There were also a few areas with lower agreement (>0.2 and <0.5) that involved colour samples of the above terms plus those of the less frequent ngara, c = 3; vinde, c = 1; pinge, c = 1 and peese, c = 1. These minor modal terms could not have been found without sampling the interior of the colour space, but being the most frequent term for an inconsistently named colour sample was not a sufficient condition to show high consensus. Similarly, previous studies using thresholds for a colour sample being named with consensus gave undefined results for rarely named colours (Boynton and Olson, 1987; Conway et al., 2020; Davies and Corbett, 1995; Lindsey et al., 2015). Our consensus analysis supports earlier conclusions that the modality of colour terms is not a simple dichotomy but a continuous, gradual characteristic that can reveal potentially important colour terms in further investigations (Witzel, 2019). 
For example, while we used almost double or greater number of samples than earlier studies, the mean colour difference of the four nearest neighbours across our 600 stimuli was ΔΕab = 7.14 and it could be that the density of our sampling was still not sufficient to capture the regions of these minor modal terms with higher consensus. In later sections we will examine the indispensability of these terms using computational models but first we will compare naming differences from 2005 to the present and then consider the naming that occurs when using the full colour gamut.
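The modal-term computation described in this section, taking the peak of P(n | c) for each colour sample, can be sketched as follows. The input layout and the function name `modal_terms` are assumptions made for illustration.

```python
from collections import Counter


def modal_terms(responses):
    """Return {sample_id: (modal_name, consensus)} from (sample_id, name) pairs.

    Consensus is the estimate of P(name | sample): the share of responses
    to that sample that used the modal name. Ties are broken by the order
    in which names were first encountered.
    """
    counts = {}
    for sample, name in responses:
        counts.setdefault(sample, Counter())[name] += 1
    result = {}
    for sample, counter in counts.items():
        name, n = counter.most_common(1)[0]
        result[sample] = (name, n / sum(counter.values()))
    return result


# Toy example with invented responses to two samples.
toy = [("s1", "burou"), ("s1", "burou"), ("s1", "grine"), ("s1", "burou"),
       ("s2", "vinde"), ("s2", "serandu"), ("s2", "vinde"), ("s2", "zoozu")]
modal = modal_terms(toy)
```

In the toy data, `s1` is named `burou` with a consensus of 0.75, whereas `s2` has `vinde` as its modal term with a consensus of only 0.5, illustrating how a term can be modal without being named consistently.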

### Comparison of Himba naming in 2005 and 2019

To illustrate the change in naming behaviour between the earlier and new studies, in Fig. 3 we show the estimated categories on the traditional Mercator projection of the surface of the Munsell Book of Color (n = 320 samples) by RST using the Himba data of our earlier study (Roberson et al., 2005) and the Himba data of this study. The classification of the Munsell array into the five Himba colour terms (serandu, dumbu, zoozu, burou, and vapa) by RST when trained on the earlier Himba data was overall consistent with the distributions of the five modal terms reported in Fig. 1 of Roberson et al. (2005), with a classification accuracy of 88% for the n = 160 samples. We note that using our probabilistic approach for identifying modal terms (peak of P(n|c)) in the earlier data, we found an additional minor modal term (vinde) for a few inconsistently named brownish and purplish samples, but RST classified them to the major modal terms. RST with our new data shows six colour terms on the Munsell array (the earlier five terms plus grine). There was a large area (19.4%) on the surface of the colour solid named consistently across observers as grine. RST found no minor terms on the surface of the Munsell system.

We considered if the grine term could have been latent in our 2005 data but there is no evidence of separate green and blue foci within the GRUE category. In fact, the confidence for the boundary chips (7.5BG) between GREEN and BLUE in the new data (Mconf = 0.6, SD = 0.1) is significantly lower, t(14) = 3.5519, p = 0.003, than in the old data (Mconf = 0.8, SD = 0.1). In other words, the chips with the highest confidence of the GRUE category have become boundary colours between GREEN and BLUE categories. Similarly, the location of the colour chips with the highest confidence for grine (Pconf = 0.8, h = 7.5GY, V = 6) and for burou (Pconf = 0.9, h = 7.5PB, V = 5) in our new data were boundary colours in our earlier data. We also note that our highest confidence stimuli correspond roughly with the locations of the green and blue foci in English (Sturges and Whitfield, 1995).

### Colour naming across the full colour gamut

To explore the confidence of Himba colour naming across the full colour gamut we classified a grid in CIELAB of 4-unit bins at 8 lightness levels constrained in the sRGB gamut (Heer and Stone, 2012) using RST and the recent Himba data. Figure 4 shows categories with high confidence (of various shapes due to the non-parametric nature of RST), separated by boundaries of lower confidence where there was higher naming confusion across the colour gamut.
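To make the idea of a CIELAB grid constrained to the sRGB gamut concrete, the sketch below converts CIELAB coordinates to linear sRGB and keeps only the grid points whose components fall in [0, 1]. It assumes a D65 white point and the standard XYZ-to-sRGB matrix; the function names, the grid extent and the single lightness level in the example are hypothetical, so the sketch does not reproduce the n = 5693 test samples.

```python
def lab_to_linear_srgb(L, a, b):
    """Convert CIELAB (assumed D65 white) to linear sRGB components."""
    fy = (L + 16.0) / 116.0
    fx = fy + a / 500.0
    fz = fy - b / 200.0
    delta = 6.0 / 29.0

    def f_inv(t):
        # Inverse of the CIELAB compression function.
        return t ** 3 if t > delta else 3.0 * delta * delta * (t - 4.0 / 29.0)

    X, Y, Z = 0.95047 * f_inv(fx), 1.0 * f_inv(fy), 1.08883 * f_inv(fz)
    r = 3.2406 * X - 1.5372 * Y - 0.4986 * Z
    g = -0.9689 * X + 1.8758 * Y + 0.0415 * Z
    bl = 0.0557 * X - 0.2040 * Y + 1.0570 * Z
    return r, g, bl


def in_srgb_gamut(L, a, b, eps=1e-6):
    """True if the colour is reproducible within the sRGB gamut."""
    return all(-eps <= c <= 1.0 + eps for c in lab_to_linear_srgb(L, a, b))


def gamut_grid(lightness_levels, step=4, extent=128):
    """Grid of (L, a, b) points at `step`-unit spacing inside the sRGB gamut."""
    axis = range(-extent, extent + 1, step)
    return [(L, a, b) for L in lightness_levels for a in axis for b in axis
            if in_srgb_gamut(L, a, b)]


grid = gamut_grid([50])  # one hypothetical lightness level
```

Each retained grid point would then be passed to the trained classifier to obtain its predicted colour name and confidence.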

The test samples (n = 5693) were classified into 7 colour categories: the six most common terms that we identified on the surface of the colour solid, serandu (assigned to 34.5% of samples), grine (23.2%), burou (22.9%), zoozu (8.1%), dumbu (7.7%) and vapa (3.5%), and the additional but smaller brownish category vinde (0.2%) in the interior of the colour space. The colour samples of the other minor modal terms, ngara, pinge and peese, were replaced by major neighbouring modal terms. Although vinde is assigned to fewer samples and with relatively lower confidence than the major Himba terms, unlike the other minor modal terms it is an indispensable term, retained at two different lightness levels (L* = 33–44) by our RST analysis in this dataset. Interestingly, there is no neutral category at mid lightness levels nor a category between red and blue, as the Himba are missing both consistent grey and purple terms.

### Comparing supervised and unsupervised learning

To gain some leverage on whether cultural (i.e., linguistic) or perceptual similarity is the critical driver for effective colour communication with the Himba, we evaluated the classification accuracy of supervised (RST) learning in a leave-one-out cross-validation mode and of unsupervised (k-means) learning in predicting the observed modal terms for each colour stimulus in both Himba datasets. First, we consider the 160 colour samples on the surface of the Munsell system used by Roberson et al. (2005), in the approximately perceptually uniform colour space of CIELAB; second, the 600 colour stimuli used in this study, also in CIELAB; and finally, we compare the performance of both methods, again for the new dataset, but in a different colour space, namely sRGB, which is perceptually non-uniform but the most widely used colour space on the Internet.

To investigate unsupervised learning, we used a k-means algorithm. The output of the k-means algorithm is not labelled; hence to compare model performance and avoid local optima, we constructed a distance matrix between the centroids of the perceptual k categories, and the centroids of the observed categories obtained in the colour naming experiments and then we assigned optimally the first to the latter categories using the Munkres assignment algorithm (Kuhn, 1955). We repeated this process 100 times and retained the optimal imaginary k system of categories that produced the smallest mean Euclidean colour difference in CIELAB (Zaslavsky et al., 2019). We confirmed that a larger number of iterations did not significantly change the results.
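The matching of unlabelled clusters to observed categories can be sketched as an assignment problem over centroid distances. For brevity, the sketch below finds the optimal one-to-one assignment by exhaustive search over permutations, which is only practical for small k; the Munkres algorithm reaches the same optimum in polynomial time (for example via SciPy's `linear_sum_assignment`). The function name `assign_clusters` and the centroid values are hypothetical.

```python
import math
from itertools import permutations


def assign_clusters(cluster_centroids, named_centroids):
    """Optimally match k unlabelled centroids to k named category centroids.

    Minimises the summed Euclidean distance between matched centroids.
    Returns ({cluster_index: name}, mean_distance). Assumes equal counts.
    Brute force is a stand-in for the Munkres algorithm named in the text.
    """
    names = list(named_centroids)
    best_total, best_perm = None, None
    for perm in permutations(names):
        total = sum(math.dist(c, named_centroids[name])
                    for c, name in zip(cluster_centroids, perm))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return dict(enumerate(best_perm)), best_total / len(best_perm)


# Hypothetical centroids in CIELAB.
clusters = [(50, -60, -40), (50, 60, 40)]
observed = {"serandu": (48, 58, 41), "burou": (52, -59, -38)}
mapping, mean_distance = assign_clusters(clusters, observed)
```

Repeating the clustering from different random starts and retaining the run with the smallest mean distance corresponds to the selection step described above.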

Culture-based learning approaches learn colour categories directly from labelled data, but their output is not always correct because the labels can be noisy: the naming process can be long, costly and prone to error. Moreover, evaluating supervised classifiers on the data on which they were trained is generally misleading. To ensure that the predicted classes generalise to unseen samples, we employed a leave-one-out cross-validation strategy: we trained our models on all other chips and predicted the name of the held-out test sample, which was not known to the model in advance. For each Himba dataset that we assessed, we built c separate RST classifiers, where c is the number of colour samples in the dataset (c = 160 and c = 600). Each RST classifier was trained on c − 1 labelled colour samples, with a different colour sample left out each time. The communication accuracy for each colour sample was then computed with the classifier that was trained with that sample left out. Finally, we aggregated the results of all these classifiers over all left-out test colour samples.
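The leave-one-out loop itself is independent of the classifier and can be sketched briefly. In the sketch below a nearest-neighbour rule is used purely as a stand-in, since the RST classifier is not reproduced here; the function names and the toy data are hypothetical.

```python
import math

def nearest_neighbour_predict(train_points, train_labels, query):
    """Stand-in classifier (NOT RST): label of the closest training sample."""
    i = min(range(len(train_points)),
            key=lambda j: math.dist(train_points[j], query))
    return train_labels[i]

def leave_one_out_accuracy(points, labels, predict=nearest_neighbour_predict):
    """Train on c - 1 samples, predict the held-out one, repeat for all c."""
    hits = 0
    for i in range(len(points)):
        train_points = points[:i] + points[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        hits += predict(train_points, train_labels, points[i]) == labels[i]
    return hits / len(points)

# Made-up two-dimensional samples; label "c" occurs only once.
loo_points = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5)]
loo_labels = ["a", "a", "b", "b", "c"]
loo_accuracy = leave_one_out_accuracy(loo_points, loo_labels)
```

In this toy case the single sample labelled "c" can never be predicted correctly, because its label is absent from the training set whenever it is held out. With any classifier, a term attached to very few samples is therefore hard to recover in leave-one-out mode.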

For the data of Roberson et al. (2005), on the surface of the Munsell solid, the classification accuracy of k-means with k = 5 (equal to the number of reported modal terms) was 64%, while RST produced a better accuracy by classifying 89% of the samples correctly. A McNemar test (Dietterich, 1998) showed that the two proportions were significantly different, χ²(1, N = 160) = 32, p < 0.01, with Yates’ correction. RST identified five colour categories (serandu, dumbu, zoozu, burou and vapa); the vinde term was not identified on the surface of the colour space by RST in the leave-one-out mode. Given that we found a 6th, dispensable modal term in Roberson’s data, we also tested the k-means algorithm with k = 6, but its performance deteriorated further, to a classification accuracy of 54%.
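For readers unfamiliar with the test, the McNemar statistic with Yates’ correction depends only on the samples on which the two classifiers disagree: with b samples classified correctly by the first method only and c by the second only, χ² = (|b − c| − 1)² / (b + c), with one degree of freedom. The sketch below is illustrative; the function name and the counts are hypothetical and are not the counts of the study.

```python
import math

def mcnemar_yates(correct_first, correct_second):
    """McNemar chi-square (1 df) with Yates' correction and its p-value.

    Each argument is a sequence of booleans, one per sample, stating
    whether that classifier labelled the sample correctly.
    """
    pairs = list(zip(correct_first, correct_second))
    b = sum(1 for x, y in pairs if x and not y)  # only the first is right
    c = sum(1 for x, y in pairs if y and not x)  # only the second is right
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Made-up outcomes for 20 samples: b = 10 and c = 2 discordant pairs.
first_correct = [True] * 10 + [False] * 2 + [True] * 5 + [False] * 3
second_correct = [False] * 10 + [True] * 2 + [True] * 5 + [False] * 3
chi2, p_value = mcnemar_yates(first_correct, second_correct)
```

Samples on which both methods are right, or both are wrong, do not enter the statistic, which is why the test suits paired comparisons of two classifiers on the same stimuli.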

For the new Himba data, which also cover the interior of the colour space, k-means with k = 10 (equal to the number of modal terms) classified 40% of the 600 samples correctly, while RST again produced superior performance, with a classification accuracy of 93%, including the major modal terms plus a smaller vinde category in the interior of the colour space. A McNemar test showed that the two proportions were significantly different, χ²(1, N = 600) = 296, p < 0.01, with Yates’ correction. It is also possible to set k = 24, equal to the number of all Himba colour terms given by two or more observers, by using an automatic process for determining the number of clusters, the Elbow method (Thorndike, 1953). When this was carried out, the performance of k-means dropped to a classification accuracy of just 16%. Considering earlier results that showed the unfitness of perceptual-learning methods (Jraissati and Douven, 2017; Regier et al., 2007) to capture more than 6 categories, we also tested the k-means approach with k = 6, equal to the number of major modal terms, but again its classification accuracy of 60% was significantly lower than that of RST, χ²(1, N = 600) = 170, p < 0.01, with Yates’ correction.

In our third comparison of the two learning approaches, in a different coordinate system, we set k = 7, equal to the number of indispensable terms identified by RST, for a direct comparison and tested both methods on the 600 stimuli specified in the sRGB colour space (Fig. 5). The k-means algorithm achieved 49% classification accuracy, while RST retained its performance with 93% accuracy. Again, a McNemar test showed that the two proportions were significantly different, χ²(1, N = 600) = 243, p < 0.01, with Yates’ correction. The k-means approach with k = 7 produced significantly diverging results in the CIELAB (59%) and sRGB (49%) colour spaces, χ²(1, N = 600) = 25, p < 0.01, confirming with empirical data an earlier report, based on simulations, that explanations of colour categories in terms of statistical regularities of the data are spurious (Steels and Belpaeme, 2005).
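Why a distance-based method can diverge between the two coordinate systems can be seen from the standard sRGB to CIELAB conversion. The sketch below uses the published sRGB transfer function and D65 matrix; the function names and the example colours are hypothetical, and whether the study used exactly these settings is an assumption.

```python
import math

# Linear sRGB to CIE XYZ (D65), and the D65 reference white.
SRGB_TO_XYZ = [(0.4124564, 0.3575761, 0.1804375),
               (0.2126729, 0.7151522, 0.0721750),
               (0.0193339, 0.1191920, 0.9503041)]
WHITE_D65 = (0.95047, 1.0, 1.08883)

def srgb_to_lab(rgb):
    """Convert sRGB components in [0, 1] to CIELAB (L*, a*, b*)."""
    linear = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
              for c in rgb]
    xyz = [sum(m * c for m, c in zip(row, linear)) for row in SRGB_TO_XYZ]

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = (f(v / w) for v, w in zip(xyz, WHITE_D65))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def lab_distance(rgb1, rgb2):
    """Euclidean distance in CIELAB between two sRGB colours."""
    return math.dist(srgb_to_lab(rgb1), srgb_to_lab(rgb2))

# Two pairs of greys separated by the same Euclidean distance in sRGB.
dark_pair = ((0.0, 0.0, 0.0), (0.2, 0.2, 0.2))
light_pair = ((0.8, 0.8, 0.8), (1.0, 1.0, 1.0))
srgb_dark = math.dist(*dark_pair)
srgb_light = math.dist(*light_pair)
lab_dark = lab_distance(*dark_pair)
lab_light = lab_distance(*light_pair)
```

The two pairs are equally far apart in sRGB but not in CIELAB, where the dark pair is separated by about 21 units and the light pair by about 18. Because k-means partitions the data by Euclidean distance, the same stimuli can be clustered differently depending on the coordinate system.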

Overall, our results show that supervised learning significantly outperformed unsupervised learning for effective colour communication with the Himba. It did so for data collected at different time periods, for various numbers of colour terms, for surface and interior colours, and in different colour spaces.

## Discussion

Theories of augmentation have been essentially conjectural, as there are few actual augmentation data, but we now provide, along with Kuriki et al. (2017) for Japanese and Mylonas and MacDonald (2016) for English, information concerning the language of a remote group where we have seen just such a development. In 2005, the Himba were reported to use a 5-colour-term grue language. Today, they have seven colour categories. If a regular pattern of colour term evolution exists, as suggested by Kay et al. (1991), Himba has evolved from a Stage IV to a Stage VI language with 7 colour terms under their classification scheme. One of these new categories (the brown term vinde) could have been present in 2005, since it is only with our present methodology that we can investigate the desaturated interior of colour space. In fact, the term was then in inconsistent (non-categorical) use for a variety of saturated brownish and purplish colours. The other new term has arisen because Himba is no longer a grue language, possibly through a gradual change (Grandison et al., 2014). It might be thought that 15 years would provide insufficient time to observe the introduction of a GREEN category. However, there has been an increase in tourism, with accompanying roads and infrastructure, and therefore in contact with other cultures, and it might not require much contact to produce cognitive change (see Caparos et al., 2012). In any case, laboratory studies suggest otherwise, given that the acquisition of a new colour category can take a matter of days with sufficient practice (Özgen and Davies, 2002). It is easy to imagine that studies of many of the languages in the World Color Survey would have reported more important colour names had they also examined desaturated colours and revisited these preindustrial societies.

Before considering any claims about augmentation from our machine-learning data, we need to ensure that the arrival of the new colour terms is not a result of the new methodologies. The computational procedures of the new method gave the same outcome of a 5-colour-term language for our 2005 data, so any differences cannot be due to the change to RST. Could it be that the new procedure produced, for the many additional colours, somewhat different hues that were greener than their equivalent paper swatches? Our own studies (Mylonas and MacDonald, 2016; Paramei et al., 2018) show that this is not so but, in any case, the Himba can only name the stimuli as green if they have that colour term and clearly, in 2004, they did not. It could still be that the colours were slightly different and that the new boundary between their green and blue terms would be in a different place if naming had been measured with paper swatches. However, the boundary between GREEN and BLUE in our new data is where it would be either if the category had been imported from another culture or if it were perceptually driven. So, there is no reason to believe that the new computerised procedures by themselves would have produced outcomes any different from those obtained with paper swatches. We therefore turn to the driving forces for the augmentation of colour terms.

One hypothesis for the Himba grue term having split in the last 15 years is that there might be optimal ways of dividing colour space, which would predict how a 5-colour-term grue language might form 6 categories (Regier et al., 2007; see also Kay and Regier, 2007). It was argued that these proposed optimal partitions could be based on the uneven shape of perceptual colour space, where several large “bumps” of saturation presumably produce areas with greater consensus among speakers across languages (Jameson and D’Andrade, 1997; Jameson, 2005). Indeed, the speed with which the augmentation has taken place might argue that the change “was waiting to happen”. However, the new bluish (burou) and greenish (grine) categories were not latent in the GRUE category of the old data (Roberson et al., 2005), as has been suggested (Kay and McDaniel, 1978; Lindsey et al., 2015; Regier and Kay, 2004). The colour chips with the highest confidence for the GRUE category in our earlier data have become boundary colours between the new grine and burou categories in our current data. Likewise, the chips with the highest confidence for grine and for burou in our new data were boundary colours in our earlier data. So, there is no evidence that latent categories were responsible for augmenting the colour lexicon in the adult Himba, nor indeed that they drive the development of colour naming in Himba children (Roberson et al., 2004).

Further insight into augmentation can be obtained from our machine-learning data, but it is important to issue a caveat. All the machine-learning models are statistical techniques and by themselves do not identify the underlying mechanisms of change. Even the k-means models do not necessarily entail that there is a physiological underpinning of the perceptual structure that they discover (Davidoff, 2015; Jameson and D’Andrade, 1997). However, our evaluation of supervised (RST) and unsupervised (k-means) machine-learning approaches suggests that perceptual similarity alone cannot explain colour categorisation and that linguistic similarity is the driving force for effective colour communication with the Himba at different time periods, for any number of categories, on the surface and across the colour gamut, and in different colour spaces. Even for a language with only five terms, the unsupervised (perceptual structure) approach is suboptimal compared with the consistent performance of supervised learning.

In contrast to supervised cultural learning, the communication accuracy of unsupervised perceptual learning falls short, mainly because minimising the variance in the data tends to produce clusters of equal size that fail to fit human colour categories of various sizes, and also because exploiting statistical regularities found in data is sensitive to scaling and produces diverging results in different coordinate systems (Steels and Belpaeme, 2005). As a result, perceptual learning produces a suboptimal universal hypothetical scheme that ignores the variation in the number and distribution of human colour categories in different languages (Mylonas and MacDonald, 2012). Generally, the main advantage of unsupervised learning is that no human-labelled data are required, but in the case of colour naming this does not hold, as unsupervised methods still need human feedback to define the number of categories and optimise their performance (Lindsey and Brown, 2006; Zaslavsky et al., 2019). Contextual information found in sufficiently large and representative sets of natural images could boost their performance (Yendrikhovskij, 2001), but again that performance will be sensitive to the scaling of different colour spaces (Steels and Belpaeme, 2005).

It is not certain which of the supervised (cultural) explanations for the need for additional terms (Gibson et al., 2017) is most appropriate for the augmentation of the colour lexicon in the Himba. However, the emergence hypothesis (Levinson, 2000), which highlights regions of colour space with less consensus and without a name, would seem an inadequate explanation for the augmentation of the Himba green term. Observers in our 2005 study were sure about their responses; they were consistent and left very few, sparsely distributed, unnamed stimuli. In any case, as the Himba did not have unnamed areas of blue/green colour space, the emergence hypothesis could not explain the splitting of the grue term. Indeed, augmentation of colour terms in English is also indifferent to whether the new region lies on a boundary between two existing categories or is inserted within an existing category. Mylonas and MacDonald (2016) reported the augmentation of colour terms in British English from the 11 terms of Berlin and Kay (1969/1991) to 13 terms by the addition of turquoise and lilac. Neither the emergence nor the partitioning hypothesis alone could explain these results, as turquoise emerged at the boundary region between blue and green while lilac partitioned the large colour category of purple, in agreement with the results of Lindsey and Brown (2014), who showed both processes in the development of modern American colour terms (teal and lavender). The category insertion hypothesis could explain the addition of turquoise but not that of lilac. Nevertheless, it would be possible to propose a case for emergence for other terms in the Himba colour lexicon that are obviously related to objects (Conklin, 1973; Davidoff et al., 1999; Levinson, 2000) and where there were areas of inconsistent colour naming. The vinde (brown) Himba term, which is present at the boundaries between the GREEN (grine), BLACK (zoozu) and YELLOW (dumbu) categories in the interior of the colour space, derives from cattle appearance, and the pale yellowish term ngara derives from a flower and may be borrowed from a neighbouring Bantu language (Nurse and Philippson, 2006). A similar desaturated LIGHT BROWN category (color de coyuche), borrowed from Spanish and denoting the colour of organic cotton, was reported by MacLaury (2007) in Zapotec (see Jameson, 2018 for a digitised archive of MacLaury’s dataset).

The augmentation that we have shown for the Himba seems unlikely to be simply the result of schooling, as suggested by Grandison et al. (2014). Rather, it appears very much as if the Himba, especially the younger ones (Griber et al., 2021), have imported a green term, probably from Herero, whose speakers recently started using the term ngirine (Nguaiko, 2010) instead of the earlier tarazu (Kolbe, 1883), a process we refer to as cultural transfer or simply loanwords. The centroid of the newly acquired Himba word for GREEN was indeed located at much the same place in colour space as it is in English, but it is also in the same place as in the neighbouring language of Herero, whose words for green and blue come from European languages (Roberson et al., 2005). Inspection of the less frequent Himba terms (e.g., ngara~light yellow, pinge~pink and ranje~orange) suggests that there are other loanwords on the way to becoming independent colour terms.

Of course, we cannot claim that loanwords have been the mechanism for augmentation in all languages. However, it is important to note (see Witzel and Gegenfurtner, 2018) that all colour names in all languages derive from the colours of objects. This is important because one might argue that there could be a different, perhaps physiological, origin for the colour term in the language from which the term was borrowed. Loanwords would seem to play an important part in colour lexicons, much more so than has been widely accepted. Even as apparently elementary a colour term as red can be traced back to the Proto-Indo-European *reudh- for red ochre and copper and is related to the Sanskrit word rudhira for blood (Alexander and Kay, 2014; Jones, 2013). Other major colour terms have similar roots in naming: green has its roots in the Germanic word *ghro-, which refers to growing and flourishing (Jones, 2013; Welsch and Liebmann, 2004, p. 64) and links green to plants. Orange originates in the Sanskrit word narangah for orange trees (Jones, 2013). So, it could be that new colour categories largely arrive in a lexicon simply by being imported from other languages, and it could even be that perceptual similarity had no role in the origin of the colour terms.

In conclusion, our findings showing the augmentation of a green term provide further evidence against the claim that primary colour categories are constrained by early perceptual mechanisms (Abramov and Gordon, 1994; Bosten and Boehm, 2014; Emery et al., 2017; Malkoc et al., 2005; Mylonas and Griffin, 2020; Valberg, 2001; Wool et al., 2015; Wuerger et al., 2005) and challenge explanations based on this claim (Berlin and Kay, 1969/1991; Kay and McDaniel, 1978; Kuehni, 2005; Philipona and O’Regan, 2006; Regier et al., 2007). Our findings from machine learning give priority to linguistic similarity as the mechanism for augmentation. While we recognise that there is some commonality in the organisation of colour categories across the world’s languages that could be due to perceptual similarity expressed as variation of saturation on the surface of the Munsell system (Lindsey and Brown, 2009; Olkkonen et al., 2010; Regier et al., 2007; Witzel, 2016; Witzel et al., 2015), one needs to look for complete explanations elsewhere (Davidoff, 2015; Gibson et al., 2017). To explain the augmentation of colour lexicons, we need to address how colour naming functions are learned by individuals in communities through interactions with other cultures, context and technological development.