General intelligence disentangled via a generality metric for natural and artificial intelligence

Success in all sorts of situations is the most classical interpretation of general intelligence. Under limited resources, however, the capability of an agent must necessarily be limited too, and generality needs to be understood as comprehensive performance up to a level of difficulty. The degree of generality then refers to the way an agent’s capability is distributed as a function of task difficulty. This dissects the notion of general intelligence into two non-populational measures, generality and capability, which we apply to individuals and groups of humans, other animals and AI systems, on several cognitive and perceptual tests. Our results indicate that generality and capability can decouple at the individual level: very specialised agents can show high capability and vice versa. The metrics also decouple at the population level, and we rarely see diminishing returns in generality for those groups of high capability. We relate the individual measure of generality to traditional notions of general intelligence and cognitive efficiency in humans, collectives, non-human animals and machines. The choice of the difficulty function now plays a prominent role in this new conception of generality, which brings a quantitative tool for shedding light on long-standing questions about the evolution of general intelligence and the evaluation of progress in Artificial General Intelligence.

In this supplementary material, we first include more details about the experimental results and the case studies. Then we perform a more thorough interpretation of the newly introduced notion of generality from the psychometric perspective (person-fit, IRT, SLODR and the c factor), an evolutionary perspective (g and G factor in animals, cognitive resources, selective pressure) and a computational interpretation (AGI, competitions and benchmarks in AI), properly covering the related work in these areas. Finally, we give further insights into the formal definitions, the properties of generality and the proof of the theoretical results.

A Case studies: details and extended experimental results
The following subsections extend the details about the experimental results shown in the paper.

A.1 Elithorn's mazes
Elithorn's Perceptual Mazes [1] take the form of a V-shaped triangle, as shown in Fig. S1, where coloured dots are superimposed at the intersections of the pathways. The imposed structure of the V-shaped maze reduces the number of uncontrollable variables and holds the shape of the task constant. The goal is to collect as many yellow dots as possible using a pathway up to the top. Three conditions must be fulfilled in order to complete the task successfully: (1) the pathway must lie along the lattice lines; (2) the pathway can only move upwards and must stay connected at all times; (3) the pathway must pass through the route with the maximum number of yellow dots. Therefore, it is critical that the subjects strategise before they begin working on the task.
Figure S1: Maze of rank 10 with 30% saturation.
Given the structural nature of the task, several parameters of the maze have been considered to influence task difficulty. Previous research found that the two physical properties with the most pronounced influence on the difficulty of the maze were its size (rank) and the density of the pattern (saturation level) [2,3]. The assumption is that maze difficulty typically increases with size, and increases monotonically with saturation up to about 50 percent, after which it decreases monotonically [3]. In separate research, Davies & Davies (1965) proposed that a maze would be found more difficult if the number of steps required to pass through the optimal route was greater. Thus, this was considered as a third variable for calculating maze difficulty.
The physical parameters used to calculate maze difficulty led several authors to propose different mathematical models of difficulty [4,2], which are intrinsic difficulty metrics, in contrast to extrinsic (i.e., psychometric or populational) difficulty. In particular, we will use the method proposed by [2]. Following their notation, $2^R$ is the total number of possible paths through the maze and $U_m$ is the number of distinct paths that achieve the maximum number of dots for a given maze. For example, if the maximum number of dots that can be collected is 3, then $U_m$ is the number of unique routes through the 3 dots. The saturation level $S$ is an expression of maze density. Finally, the number of steps $l$ required to achieve the greatest score for any given maze was included as well. The final version of the difficulty function is:

$$\hbar = \log_{10}\left(\frac{2^R \times S^a \times l^b}{U_m}\right)$$

We used the recommended parameters a = 4 and b = 4 in the mazeGen R package [5], following [2]. Using this package, we generated 23 items with a range of difficulties. The difficulties are shown in Table S1. We administered them to 530 participants via the Cambridge Psychometrics Centre testing website¹. The testing website has over 20 different psychological tests that are publicly available to anyone with access to the Internet. Participants can stop a test at any time. They do not receive any form of financial incentive, but they are encouraged to complete the test by receiving immediate feedback at the end. Responses are binary: the respondent either found an optimal path or not. Of the 530 participants who completed the test, only 496 provided their demographic information (Table S2), and we restricted our analysis to these.
Fig. S2 shows the means of the 496 responses for each item vs difficulty. The Pearson correlation is −0.49, which is reasonable (more difficult items get worse responses). This is sufficiently high to consider the difficulty metric fit for our purposes.
There are several ways in which we can analyse whether the respondents behave in a general way. One simple option is to calculate the mean result per respondent and its variance. Fig. S4 shows that this is completely uninformative in this case: for binary outcomes the variance is determined by the mean, since the variance of a Bernoulli distribution is p(1 − p), a function of the mean response p. In any case, even if the responses were quantitative, one would not expect respondents to have low variance if a wide range of difficulties is considered, because most would fail at the difficult items and succeed at the easy ones. Indeed, as respondents are expected to fail on the difficult items, if there are some of these, very high generality (understood as the reciprocal of variance) would be impossible.
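As a minimal sketch of the two quantities discussed above, the following Python snippet computes the intrinsic maze difficulty in log space and checks that the variance of binary responses is fully determined by their mean. The function and argument names are illustrative and are not taken from the mazeGen package; treating R as the maze rank and S as a fraction in (0, 1] are assumptions, and the example maze values are hypothetical.

```python
import math
from statistics import pvariance


def maze_difficulty(rank, saturation, steps, n_optimal_paths, a=4, b=4):
    """Intrinsic difficulty log10(2^R * S^a * l^b / U_m), computed in log space.

    Assumptions: `rank` plays the role of R and `saturation` is a fraction in (0, 1].
    """
    if saturation <= 0 or steps <= 0 or n_optimal_paths <= 0:
        raise ValueError("saturation, steps and n_optimal_paths must be positive")
    return (rank * math.log10(2)
            + a * math.log10(saturation)
            + b * math.log10(steps)
            - math.log10(n_optimal_paths))


def bernoulli_variance(responses):
    """Variance p * (1 - p) implied by the mean p of binary responses."""
    p = sum(responses) / len(responses)
    return p * (1 - p)


# Hypothetical maze: rank 10, 30% saturation, 8 steps, 2 optimal paths.
h = maze_difficulty(rank=10, saturation=0.3, steps=8, n_optimal_paths=2)

# For binary outcomes the (population) variance adds nothing beyond the mean.
responses = [1, 1, 0, 1, 0]
assert math.isclose(pvariance(responses), bernoulli_variance(responses))
```

Working in log space avoids very large intermediate values of 2^R for high ranks; the result is the same as evaluating the formula directly.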
A common way of analysing "generality" of a population is through factor analysis, trying to identify a dominant factor that would explain why a result for one item usually entails a similar result for other items.
Figure S4: Subjects' means vs subjects' standard deviations of 496 respondents and 23 items each of Elithorn's Mazes. As responses are binary, the plot just shows the standard deviation of the Bernoulli distribution: $\sqrt{p(1-p)}$.
We perform factor analysis using the R package psych and the function fa. By just limiting to one factor, we get the 23 loadings for the items between 0.123 and 0.518 (average 0.342). This factor, however, only explains a proportion of 0.123 of the variance. The scree plot in Fig. S5 shows that for this population and items a single factor is not enough. This contrasts with a reasonable degree of generality that we found using the generality analysis (see top left of Fig. 4 in the paper). Factor analysis simply ignores difficulties. Also, factor analysis just tells us whether there is a dominant factor in the population, but it cannot inform us on whether some particular respondents are more general than others.
Another common option to analyse this kind of result is to infer IRT models. Using the mirt function of the homonymous R package and a 2PL model we get the item response curves, all of them with positive discrimination parameters. The (Pearson) correlation between the estimated difficulties (the positions of the logistic models) and the intrinsic difficulties is 0.39. The (Pearson) correlation between these estimated difficulties and the response means, shown in Fig. S6, is −0.94, close to −1 as expected, which is stronger than in the corresponding plot with the intrinsic difficulties (Fig. S2). We now see an almost perfect monotonic relation, which illustrates that IRT difficulty is almost a monotonic transformation of response means, and this makes it hard, if not impossible, to differentiate generality from capability. Nevertheless, we are interested in having a look at the two measures when using IRT difficulties, even if this analysis is a bit circular (we get the difficulties from the results and then we use them to analyse the results). Fig. S7 shows a higher dispersion than for the intrinsic difficulties and, as expected, we see higher normalised generalities on average, as the 2PL models assume this for ability. Still, it may be useful to compare individual agents with this capability and generality.
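For reference, the 2PL item response function mentioned above can be sketched in a few lines of Python. This is the standard textbook form with a discrimination and a difficulty (location) parameter; it is an illustration only, the function name is hypothetical, and fitting packages may internally use a different but equivalent parameterisation.

```python
import math


def irf_2pl(theta, discrimination, difficulty):
    """Probability of a correct response under the 2PL model.

    theta: ability of the respondent.
    discrimination: slope of the logistic curve at its midpoint.
    difficulty: location of the curve, i.e. the ability at which p = 0.5.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# At theta == difficulty the success probability is exactly one half.
p_mid = irf_2pl(theta=0.7, discrimination=1.3, difficulty=0.7)
```

With a positive discrimination the curve increases with ability, so items with a higher estimated difficulty have lower mean responses for the same population, which is why these difficulties track the response means so closely.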

A.2 Letter series
Thurstone letter series is a common cognitive test that was introduced for his Primary Mental Abilities theory [6,7]. Given a sequence of letters from the Latin alphabet (where the successor of 'z' is 'a' and the predecessor of 'a' is 'z'), the goal is to predict the next letter in the series. Fig. S8 shows some examples.
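The cyclic alphabet described above can be illustrated with a short Python sketch. The helper names are hypothetical, and the constant-step predictor is only a toy example of continuing a series; it is not the implementation of any system evaluated in this study.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def shift(letter, k):
    """Move k positions along the cyclic alphabet (k may be negative)."""
    return ALPHABET[(ALPHABET.index(letter) + k) % len(ALPHABET)]


def continue_arithmetic(series):
    """Predict the next letter assuming a constant step between the last two letters."""
    step = (ALPHABET.index(series[-1]) - ALPHABET.index(series[-2])) % len(ALPHABET)
    return shift(series[-1], step)


# The successor of 'z' is 'a' and the predecessor of 'a' is 'z'.
assert shift("z", 1) == "a"
assert shift("a", -1) == "z"
```

For example, `continue_arithmetic("xyz")` wraps around the end of the alphabet and returns `'a'`.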
In [9,10], a sequence generator based on computational principles and algorithmic information theory was introduced. The generator was built to ensure the stability of the sequences (they should not have exceptions or noise) and their unquestionability (the shortest program that generates the sequence should not be rivalled by another program of similar complexity that gives a different continuation to the series). The generator also calculated a difficulty for each sequence x, derived as Kt(x), its Levin complexity [11,12] over a minimal instruction set computer with 14 instructions. Fig. S9 shows a sample of sequences and their intrinsic difficulties.
Figure S9: Examples of some of the sequences and their intrinsic difficulties, as generated for the C-test [9,10].
Reusing this study, we take 35 instances: 20 for prediction, as shown in Fig. S9, and 15 for imputation (the gap is in the middle of the sequence), with difficulties ranging from 7 to 14. All these sequences were administered to 48 young humans with ages between 14 and 18 [9,10]. Fig. S10 shows the means of the 48 human responses for each item vs difficulty. The correlation is −0.77, which is quite strong (more difficult items get worse responses). For our study, we also added new results from 12 artificial systems, as shown in Table S3. The generality analysis was shown in Fig. 4 (top right). We show a human-machine split in Fig. S11.
Table S3 (system: description)
[name missing]: Predicts or imputes the letter just before the end or the gap.
x, y, z, a, b, c: Constant models always predicting or imputing the same letter.
repdiff: Predicts or imputes according to an arithmetic series.
alternate: Predicts or imputes using an arithmetic series using only the even or odd letters depending on whether the gap is at an even or odd location.
halfright: A baseline model predicting or imputing correctly on about half of the sequences (0/1 chosen randomly).
magic: Using the web version (nautilus.cs.miyazaki-u.ac.jp/~skata/MagicHaskeller.html) of MagicHaskeller, an inductive programming system [13,14,15].
We also perform factor analysis on this data, using the same methodology as in the previous case, focusing on the 48 human respondents. As a result, we get the loadings for the 35 items with values between −0.212 and 0.593 (average 0.215). This is very poor, and the first factor only explains a proportion of 0.093 of the variance. The scree plot in Fig. S12 shows that for this population and items a single factor is not enough. Again, this contrasts with a reasonable degree of generality that we found using the generality analysis (Fig. S11, left).
Finally, we also infer IRT models for this case. The Pearson correlation between the estimated difficulties (the positions of the logistic models) and the intrinsic difficulties is 0.02, but the Spearman correlation is 0.50. The correlation between the IRT difficulties and the column means (each item's mean response, a proxy for difficulty) is 0.00 (Pearson) and −0.60 (Spearman), as we can see in Fig. S13. While difficulties should correlate negatively with responses by construction in IRT, the near-zero Pearson value is explained by an outlier difficulty value of −150. Compare with the same plot with intrinsic difficulties (Fig. S2). In this case, we should analyse what the scale of these difficulties means, whether outliers should be excluded and whether we can derive generality from here. We do not explore this option further for this case.

A.3 Object recognition
For the object recognition scenario we used a sample of the results involving 24 objects with 10 variations each, totalling 240 cases, referred to as "primary test images" and shown in [16, Fig.1A]. Each result in our dataset is the aggregation of each object (e.g., a zebra with a given distortion) being confronted against the other objects (e.g., the non-distorted zebra image vs a non-distorted dromedary, a non-distorted zebra vs a non-distorted guitar, etc.). Because agents have to choose between two images, expected accuracy by chance is 0.5, so we normalised accuracy x as 2x − 1. The agents are six deep convolutional artificial neural networks (ALEXNET, ZEILER, VGG, GOOGLENET, RESNET and GOOGLENETv3 (inception)), plus monkeys (macaques) and humans. In total we have a non-binary response matrix of 8 rows and 240 columns. Fig. S14 shows the (unnormalised) average accuracy of the eight agent groups for the 24 objects in this study. We see that some objects are harder than others, but we will not use this information initially (just the psychophysical attributes).
Figure S14: (Unnormalised) accuracies for each of the 8 agent groups (2 primate species and 6 ANN architectures) against the 24 objects we use in our study.
The psychophysical distortions are created by modifying four image attributes: eccentricity, size, pose and contrast. Not all of them affect the problem monotonically. For instance, pose has a minimum of performance at intermediate values, as was shown in [16, Fig.7C]. We derived a formula of difficulty that takes this into account, in which s(·) performs a min-max scaling and pose′ = −|1 − pose|, as the minimum distortion is around pose = 1. We binned difficulties into 15 equal-width bins and removed those bins with fewer than 4 difficulties, retaining 233 results from the original 240 cases and filtering the response matrix accordingly.
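The preprocessing steps named above (min-max scaling, folding pose around 1, equal-width binning and dropping sparse bins) can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and the snippet does not attempt to reproduce the way the four attributes are combined into the difficulty value.

```python
from collections import Counter


def min_max(values):
    """Scale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pose_prime(pose):
    """Fold pose around 1: the result is 0 at pose == 1 and negative elsewhere."""
    return -abs(1 - pose)


def equal_width_bins(difficulties, n_bins=15):
    """Return the bin index (0 .. n_bins - 1) of each difficulty."""
    lo, hi = min(difficulties), max(difficulties)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in difficulties]
    # The maximum value falls on the upper edge, so it is clamped into the last bin.
    return [min(int((d - lo) / width), n_bins - 1) for d in difficulties]


def keep_populated(bin_ids, min_count=4):
    """Indices of the cases whose bin holds at least min_count cases."""
    counts = Counter(bin_ids)
    return [i for i, b in enumerate(bin_ids) if counts[b] >= min_count]
```

The indices returned by `keep_populated` can then be used to filter the columns of the response matrix, in the same spirit as the filtering described in the text.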
The ACCs for the eight agent groups can be seen in Fig. S15. Humans and monkeys have smoother curves. As the population size is high for humans but small for macaques, this is not clearly explained by sample size, and may indicate that the effect of the psychophysical distortions is more gradual for primates than for ANNs.
Figure S15: ACCs for the eight agent groups in the object recognition problem. Intrinsic difficulty derived from the psychophysical parameters.
The results of the generality analysis are shown in the main part of the article. In this case, as we only have 8 agent groups, we cannot do factor analysis or IRT. A populational estimation of difficulty is outlined in [16, Fig.2] by analysing the "behavioural signatures" (very much like what we show in Fig. S14) or approximated by the "ones with performance below the 25th percentile". Note that these difficulties are per object: they determine, for instance, that recognising a camel is more difficult than recognising a watch. With the difficulties derived from the psychophysical parameters, we are not using the information about the object, just the level of distortion, regardless of what object has to be classified. As a result, a great deal of the difficulty is left out of the metric we are using. This is consistent with the fact that the proportion of variance that is explained by the psychophysical parameters is very small (< 0.1) [16, Fig.7D].

A.4 Odour span task
There were two main reasons for choosing the Odour (or Olfactory) Span Task (OST), and experiment 2 in [17] in particular. The first one is that this task has items with an associated difficulty that is very natural (number of scents to remember). Since capability and spread are on the same scale as the difficulty metric, the interpretation is very intuitive and it even has units: number of scents. Note that many other metrics of performance are unitless, such as accuracy. The second reason for choosing this task is that span length is a very interesting metric, and does have units (also number of scents). Actually, if the generality of an agent were maximum, span length would be exactly equal to capability (and exactly equal to longest run), as the agent would be able to perform perfectly until the number of stimuli reaches its capability, falling sharply to zero afterwards.
However, this extreme does not seem to happen for this experiment. Actually, longest run is usually larger than span length, which means that some good streaks are obtained when the rat has already seen many stimuli. Looking back to Fig. 4 (middle right), we can notice that rats have very high capabilities (the average is 50.04 scents), considering that this really means the number of scents they are able to remember. In particular, five of the ten rats (T12, T13, S17, S1, V20) had no previous exposure to the stimuli before the experiment, so whenever they had to choose between the new scent and the old scent in the experiment, the new scent was totally new. This means that these rats did not have to distinguish between the old scents, but just had to tell whether the scent was new, which could be linked to some basic olfactory physiology rather than memory. These rats perform better than the other five. However, the other five rats, which had been exposed to all scents before the task, while worse, still have a capability above 45 scents on average. So this 'novelty' interpretation of the task can only explain part of this exceptional performance. Another partial explanation is combinatorial. For a very large number of stimuli, the probability that an old scent reappears is high, as it has been in the pool for a long time. On average, the probability that any scent that appeared before as new is picked again as old is close to one in expectation (it is a series adding 1/n each time to the probability, with n being the number of stimuli seen so far). In particular, scent 1 appears 4.85 times in expectation in an experiment with 72 odours ($\sum_{i=1}^{71} 1/i$). These reappearances basically serve as reminders that may help to remember the old scents. Taking all this into consideration, we can look at some particular examples in Fig. S16. Rat D2 has an almost monotonically decreasing curve, going down slowly from 0 to 65, where the data is discontinued.
On the contrary, the curve of rat S17 is much flatter, but as it is almost perfect up to difficulty 65 (and the data is continued assuming the best possible case), we get a lower spread and higher generality. Rat F16 has a behaviour that is hard to interpret, and its performance seems to increase for a higher number of stimuli. Finally, rat S1 has a very sharp decrease in performance around difficulty 50. It seems to have reached a saturation point, which would have given this rat a high generality, but the curve also sharply recovers between 55 and 65. Notwithstanding, its spread is still the smallest.
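The expected number of reappearances quoted above for scent 1 (the sum of 1/i for i from 1 to 71) can be checked with a few lines of Python. The function name is illustrative.

```python
def expected_reappearances(n_odours):
    """Sum of 1/i for i = 1 .. n_odours - 1 (a partial harmonic sum)."""
    return sum(1.0 / i for i in range(1, n_odours))


# For an experiment with 72 odours this evaluates to about 4.85.
value = expected_reappearances(72)
print(round(value, 2))
```

Because the harmonic sum grows only logarithmically, making the session much longer would increase this expected number of reminders quite slowly.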
The key element to understand this experiment is that we have ceiling issues. We do not really reach a number of stimuli where accuracy falls to zero. This may be the case for larger numbers of stimuli, but as [17] recognise, that would make the experiment very hard to conduct, as 72 is already a very high number of trials. But it may well be that "rats can recognize the relative familiarity of odors in some fashion that is largely independent of the number of stimuli to remember. In other words, within a given session, choices may involve an assessment of the 'newness' or 'oldness' of a stimulus in relation to the other stimuli present on a given trial, with the ultimate choice being made towards the least familiar option". Overall, independently of the ultimate cause, we have shown that generality analysis is able to detect when generality takes place, and which rats are more or less general.

A.5 Feature-based classification
We started with 473 different machine learning classifiers from study number 7306 from OpenML [18]. We removed all classifiers with accuracy below 0.35, which is roughly what a random classifier would achieve for this problem, since there are three balanced classes. After this filter, we end up with 419 classifiers. Again, as in the previous cases, we do not consider any predefined partition of the instances into subdomains (apart from the three classes). Each machine learning technique can be better or worse for some subgroups (better in some regions, in dense areas, for some classes over others, etc.). The interesting point of our analysis is precisely this: we can derive generality without considering any a priori separation of the examples into subdomains. Fig. S17 shows the points of the aggregated ACC (without connecting the points of the 419 classifiers, for clarity). The left plot, with KDN difficulty, shows points that are largely arranged in a steplike manner, with difficulty 0.4 being the tipping point. The aggregated ACC on the right corresponds to TD_U difficulty. In this case, it is also very steplike, but the inflection point takes place at difficulty 4, with not much happening at difficulties 1 to 3.
Figure S16: ACCs for four selected rats in the Odour Span Task. As there are only two choices, the accuracy of a random guess should be 0.5, so the values on the y-axis are scaled to $2(r_{j,i} - 0.5)$.
We can delve further into the classifiers shown in Fig. 4 (bottom). Let us identify one technique whose underlying mechanism is simple and well understood, such as logistic regression, a linear model based on the original attributes. Given some non-linearities in iris, this classifier does not reach exceptional values in capability (0.63) for difficulty KDN. However, it turns out to be very general (spread 0.17); as it is based on hyperplanes, it cannot really specialise too much. Similar results are obtained for difficulty TD_U: capability (4.78) and generality (0.42).
In order to have a more complete understanding of how several techniques behave under different difficulty metrics, we can look at the three best classifiers in terms of capability or generality. For KDN, the best three classifiers in Table S4 are algebraic (linear or non-linear), while the best three in generality in Table S5 are based on decision rules or decision stumps in ensembles. For a difficulty metric based on multivariate distances, high generality is achieved by those methods based on univariate partitions. The distinction gets more blurred in Tables S6 and S7 for difficulty TD_U, but models with high generality are in this case based on a metric space (except the decision tree ensemble, which samples on the features), while the metric is based on short descriptions. Further analysis would be needed to find consistent patterns between the type of difficulty and the generality observed for related and unrelated techniques.
Overall, we have seen that the difficulty metric has a strong effect on the capability and generality measures, and orders classifiers differently (both in terms of capability and generality), and ultimately differently too from the order given by the model accuracies.

A.6 Chess (with Opp transformation)
The Opp transformation in Table 1 takes place in situations where the difficulty of the task depends on the opponent, as happens in many one-vs-one or team-vs-team competitions. To illustrate this case, we analyse the results of the World Computer Chess Championship (WCCC), usually part of the Computer Olympiad, where several computer chess players compete against each other. The transformation takes the final score of the opponent as difficulty, with responses being 1 (wins), 0.5 (draws) and 0 (losses). If two or more opponents ended up with the same score, they are considered together as "tasks" of the same difficulty. This is why we see values (grey circles) on 1 (wins), 0.5 (draws) and 0 (losses), but also some other values. In the end, if we have all possible pairwise matches, we have a difficulty scale that goes between 0 (the opponent has lost all matches) and the total number of participants (the opponent has won all matches). After this transformation, we can apply generality analysis (GA) as usual.
Figure S19: ACCs for all the participants in the World Computer Chess Championship using the final score of the opponent as difficulty. Data from https://www.game-ai-forum.org/icga-tournaments/game.php?id=1. Left: Reykjavik 2005 with 12 participants. The winner (Zappa) and the last one (Fute) won and lost all matches respectively except the one between them, which was surprisingly a draw. Right: Leiden 2015 with 9 participants. Here, no low-rank participant beat any high-rank participant, and draws were usually between participants with close scores. Accordingly, the average generality is higher in this case. (Plot annotation: Avg. γ 0.87, Avg. spread 1.51.)
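The transformation described above can be sketched in Python as follows. This is a minimal illustration and not the code used for the analysis: the function name and the nested-dictionary input format are assumptions, and the three-player tournament is hypothetical data rather than results from the championship.

```python
from collections import defaultdict


def opp_transformation(results):
    """results[a][b] is a's score against b: 1 (win), 0.5 (draw) or 0 (loss).

    Returns, per player, a sorted list of (difficulty, mean response) pairs,
    where difficulty is the opponent's final score and opponents sharing the
    same final score are pooled into one "task".
    """
    final_score = {p: sum(games.values()) for p, games in results.items()}
    curves = {}
    for player, games in results.items():
        pooled = defaultdict(list)
        for opponent, score in games.items():
            pooled[final_score[opponent]].append(score)
        curves[player] = sorted(
            (difficulty, sum(scores) / len(scores))
            for difficulty, scores in pooled.items()
        )
    return curves


# Hypothetical three-player round robin.
results = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 0, "C": 0.5},
    "C": {"A": 0, "B": 0.5},
}
curves = opp_transformation(results)
# Player A beat both opponents, who share the final score 0.5,
# so A has a single pooled point with mean response 1.0.
```

Pooling opponents with equal final scores is what produces averaged responses other than 1, 0.5 and 0 in the resulting curves.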
The plots show that both editions have good generality values, with the 2015 edition actually being more general (although in this case the correlation between generality and capability is negative). We can also conclude that the degree of transitivity is high, in light of the generality values. Note that in adversarial settings like chess, agent transitivity and task transitivity are two sides of the same coin, as tasks are opponents, which are also participants (if a participant A beats B and B beats C, how likely is it that A beats C?).

A.7 ALE (with Aref and Rnk transformations)
ALE [19] is a collection of old Atari video games that has become very popular for the evaluation of general-purpose reinforcement learning algorithms interacting with the game at the raw pixel level (screen shots) [20]. These video games have no difficulty metric and each game has different score magnitudes. In order to see progress in these games, AI researchers have compared their results against humans, using some recorded data about human performance on all games. This is exactly the Aref case in Table 1, where our Ref Agent would be humans. We start with a response matrix R with M = 24 subjects (23 AI systems plus the human reference) and N = 45 games. The transformation of R is straightforward in this case. We simply discretise results to 0 or 1 depending on whether the particular cell in the matrix is greater or lower than the corresponding result for a human for that game. Then, in order to derive a metric of difficulty for each game, we take a populational approach: the percentage of AI systems that are below the human score. The higher this proportion, the more difficult the game is. After this transformation, we can apply generality analysis (GA).
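The discretisation and the populational difficulty just described can be sketched as below. The function name, the input format and the toy scores are hypothetical; how ties with the reference are handled is not specified in the text, so treating a tie as 0 (and as not being below the reference) is an assumption of this sketch.

```python
def aref_transformation(scores, reference):
    """scores[agent][game] are raw game scores; reference[game] is the human score.

    Returns (binary, difficulty): binary[agent][game] is 1 when the agent scores
    above the reference and 0 otherwise; difficulty[game] is the proportion of
    agents scoring below the reference.
    """
    games = list(reference)
    binary = {
        agent: {g: int(row[g] > reference[g]) for g in games}
        for agent, row in scores.items()
    }
    difficulty = {
        g: sum(row[g] < reference[g] for row in scores.values()) / len(scores)
        for g in games
    }
    return binary, difficulty


# Toy data: three hypothetical AI systems on two hypothetical games.
scores = {
    "agent1": {"game1": 20, "game2": 10},
    "agent2": {"game1": 5, "game2": 50},
    "agent3": {"game1": 15, "game2": 5},
}
reference = {"game1": 10, "game2": 30}
binary, difficulty = aref_transformation(scores, reference)
# game2 is the harder game here: two of the three agents fall below the reference.
```

Using the reference agent to binarise the results removes the problem of games having different score magnitudes, since only the comparison with the reference matters.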
When we do the generality analysis, we can see the measures for the 23 systems and the human reference (human.noop) in Fig. 5 (second row, left). In this case, as the human reference is always equal to itself, we used the convention of setting a score of 0.5 for humans (they are borderline with the human limit), so we see humans (human.noop) more or less in the middle (and on the constant isometric, as humans are borderline with humans independently of the difficulty of the problem). Overall, for all other agents generality is quite high, and R2D2 (the best AI system for this benchmark) only seems to fail (be worse than the human reference) on one of the 45 games included here, which is also a very difficult one.
But can we analyse the data without a human reference? Yes, but the transformation is more convoluted. Following case Rnk in Table 1, we use a value of c = 100, getting the result in Fig. 5 (second row, right). We see that the new right plot is very similar to the left plot, despite using two different transformations. But now, as humans have been considered as any other agent, their place in the plot, and their generality, is more meaningful. Actually, we see that the capability is more or less average (as in the left plot), but generality is very low in comparison. Relative to the population of AI agents, humans do very well on some games and poorly on others, being less general than most of the AI agents.
Overall, we can trace the dates when the AI agents were introduced for any of the plots of Fig. 5 (second row). In this way we can analyse the temporal evolution of AI systems in both generality and capability, from the early systems being worse than humans for many games (usually on the left of the plots) to current systems being better than humans for almost all games (more on the right). Figs. S20 and S21 provide the names of the supplementary files where this evolution is animated.
The animation for the temporal evolution (by month) of AI systems for the ALE benchmark in terms of capability and generality can be found in the supplementary file: ale.refHUMAN.capability-vs-spread.evolution.months.mp4
Figure S20: Temporal evolution (by month) of AI systems for the ALE benchmark.
The animation for the temporal evolution (by year) of AI systems for the ALE benchmark in terms of capability and generality can be found in the supplementary file: ale.refHUMAN.capability-vs-spread.evolution.years.mp4
Figure S21: Temporal evolution (by year) of AI systems for the ALE benchmark.
The evolution videos show that the improvement has not always been incremental in terms of capability and generality, until reaching the final point (R2D2), better than humans for all but one game, with high generality and capability. However, it is important to bear in mind that here the algorithm is retrained for each game. In other words, it is not the same system solving all these games, but the same algorithm (after specific training) solving each of them separately, with no memory or transfer between the games. Of course, humans need some training in each of the games, but the number of required episodes is not comparable, mostly because they do transfer and generalise across games. Because of these important differences, significant research in AI is taking place on systems that can transfer and switch between games with little retraining.

A.8 GVGAI video games (with Rnk transformation)
In the case of the Atari games above, we see that the positions in terms of generality and capability are similar with the Aref and Rnk approaches, which gives us support for the use of this methodology in cases where we do not have a reference agent, such as GVGAI. General video game AI (GVGAI) is another popular video game competition in AI [21,22], with a benchmark that comprises a large number of real-time 2D grid games. In this case, we use 23 systems from a competition in which they confronted 49 games [23]. Each game has 5 variations. We will explore the case where we consider each variation as an independent game (so having 245 items) and a grouped version where we aggregate the five variations of each game into one (so having 49 items).
Before doing the transformation and the generality analysis, we performed factor analysis. In the ungrouped case, there were some constant columns, leading to problems in the correlations, but in the grouped case, the mean of the FA loadings for one factor was 0.60 and the accounted variance was 0.43, which suggests the existence of a possible general factor in this population. Now we perform the Rnk transformation (see Table 1), deriving difficulty as the ranks of the AI systems from which we have collected results, using a generation of new columns with c = 100. The capabilities and spreads are shown in Fig. 5 (third row, left ungrouped, right grouped). What we see is that the systems are quite general (which is consistent with what we got from FA), but they are distributed in a quite thin flat band, where the agents with middle capabilities are more general (at least according to normalised generality) than those with low or high capability. The results for the ungrouped and grouped cases are very similar, with a little more dispersion in the right plot (probably due to more robustness in the estimation). In the right plot we can see more clearly that, with about the same capability, MnMCTS is more general than TeamTopBug. The two use different approaches, but the latter may have a tendency to specialise in some types of games (independently of their difficulty).

A.9 Physical cognition tasks (with Rnk transformation)
We now explore the results from [24], an empirical analysis of orangutans (Pongo abelii and Pongo pygmaeus) on five physical cognition tasks: Box Task, Detour Reaching, TubeTrap Task, Honey Tool Task and Reversal Learning. The study analyses "general cognitive abilities", so it is interesting to see what the generality measure can show in this case.
There are 53 orangutans and 5 items. We start with factor analysis, where the mean of the FA loadings is 0.36 and the accounted proportion of the variance is 0.19. This might be partially explained by the low number of items (5), but in any case we cannot conclude (or rule out) the existence of a general factor.
Let us try generality analysis instead. Again, we use the Rnk transformation with $c = 100$ (this generates 500 columns, giving a sufficient resolution of difficulties). The capabilities and spreads are shown in Fig. 5 (bottom row, left). We see important differences in capability and generality for the 53 orangutans (shown with their names in grey). Some of them, such as Bella, are more general than other more capable ones, such as Julius. The aggregates (shown in coloured symbols) are very similar for different groups (species, age, etc.), where only the background results (bkg-*) deviate from their centre (but their subsamples are smaller too). Finally, we separated the results of low-capability orangutans and high-capability orangutans, with average generalities of 0.73 and 0.75 respectively. Again, no SLODR in the context of GA appears in this case.

A.10 The Primate Cognition Test Battery (with DRef transformation)
Finally, we are going to analyse an increasingly popular battery for the evaluation of a wide range of cognitive capabilities. Initially introduced for primates in [25], it has also been used or adapted for non-primate animals. We will use the original data from [25], which evaluates human infants, chimpanzees and orangutans. We process the results aggregated into six categories: "Space", "Quantities", "Causality", "Social Learning", "Communication" and "Theory of Mind". In this case, we compare populations and not individuals, but interestingly we have the standard deviation of the results for each species in the six categories. So in this case we are going to use a different transformation, DRef, where we reconstruct a distribution of results for humans. In order to do this, we use the mean and standard deviation for each category to derive the quantiles using a beta distribution (more appropriate than a normal distribution as the values are between 0 and 1). With this, we generate $c = 100$ new columns for each category, as we did in the Rnk approach, but in this case looking at the quantiles of the reference distribution rather than the ranks. Also differently, we derive the difficulty as the quantiles of the distribution. In other words, results and difficulties are transformed using the human distribution as a reference.
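The reconstruction of the reference distribution can be sketched as follows. This is only a minimal illustration under assumptions that the text does not state: the beta parameters are obtained here by the method of moments from the mean and standard deviation, the quantiles are approximated by sorting random draws (the Python standard library has no beta inverse CDF), and the function names, the example mean 0.7 and the example standard deviation 0.15 are all hypothetical.

```python
import random

def beta_params(mean, sd):
    """Method-of-moments beta parameters for a variable in (0, 1) (assumed fit)."""
    var = sd ** 2
    if not 0.0 < mean < 1.0 or var >= mean * (1.0 - mean):
        raise ValueError("mean and sd are not compatible with a beta distribution")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

def reference_quantiles(mean, sd, c=100, draws=20000, seed=0):
    """Approximate c evenly spaced quantiles of the fitted beta by sampling."""
    alpha, beta = beta_params(mean, sd)
    rng = random.Random(seed)
    sample = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    # Quantile levels (k + 0.5) / c for k = 0..c-1, read off the sorted sample.
    return [sample[int((k + 0.5) / c * draws)] for k in range(c)]

# Hypothetical category summary: mean 0.7, standard deviation 0.15.
quantiles = reference_quantiles(0.7, 0.15)
```

Each of the `c` values would play the role of one generated column for the category; how results and difficulties are then derived from them follows the DRef definition in the paper and is not reproduced here.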
What we see in Fig. 5 (bottom row, right) illustrates that the generality of humans is higher than that of chimpanzees (and their normalised generality is also higher than that of orangutans). Of course we are using humans as a reference here, but the location of the points is basically given by non-human apes not being good at the non-physical categories, while humans are consistently good in the six categories. While the data in this case is too aggregated to allow for a more refined analysis, the picture in Fig. 5 (bottom row, right) is sufficiently simple and clear to understand how these apes compare in terms of capability and generality (for the PCTB).

B Psychometric interpretation: generality, the g factor, SLODR and the c factor
In this section we will analyse the interpretation of the notion of generality in the context of the science and literature of human intelligence and psychometrics [26]. We will first flesh out the clear connections and inspirations, and then we will explore some other more profound implications.

B.1 Related metrics and models: person-fit, Guttman scales, reliability and variable-θ models
The connection between ability and difficulty has been frequently vindicated as "the foundation of [...] measures of mental ability" [27]. Accordingly, psychometricians will find the curves and indicators familiar in some ways. The use of two parameters, difficulty for items and ability for subjects, is common in classical test theory and especially in item response theory [28,29]. Also, plotting the performance, or the probability of correct response, against ability on the x-axis leads to the item characteristic curves. Similarly, plotting this against difficulty on the x-axis leads to subject or person response curves [30,31]. It is important to note, however, that in IRT, both ability (usually denoted by θ) and difficulty (usually denoted by b in logistic models) are latent factors, which are estimated by making several assumptions: "1) local independence, 2) unidimensionality, and 3) a specified shape for the item characteristic curve" [31]. The shape is determined by a model, which is usually a decreasing monotonic function of $b - \theta$, such as a logistic function. Then the parameters are estimated from a response matrix $r_{j,i}$. In our case, we are not considering a measurement problem (yet), and we are not (necessarily) plotting latent variables. Difficulty could be a notion derived from the items themselves, and capability (the metric we use instead of the psychometric 'ability') is not the parameter of any function. Actually, we define capability (and we use a different term on purpose) as an area, and not the location of the steepest point of any curve. For models that are symmetric at $y = 0.5$, such as 1PL or 2PL logistic models, the area equals this location. However, for irregular curves not following a model at all, it is the area that is really meaningful. Also, we are not plotting correct response for each item, but aggregated or expected response per difficulty.
The key question about the assumptions in IRT is that even if some models allow for a discrimination parameter for the items, so that the correlation between correct response and ability for all items is relaxed (it might even be negative), this is not usually the case for ability. For many models, IRT is actually assuming a strong (negative) correlation between correct response and difficulty for all agents. Note, by the way, that a fully steplike ACC does not maximise (negative) correlation. In this extreme case, correlation is actually $-0.866$ if the step is in the middle of the range of abilities, but it may even be 0 in the limit if the difficulty is not bounded. The models (including the variable-θ ability models we will mention below) consider that a subject being better at difficult items than easy items is an aberration, mostly because the models and estimations are done in such a way that this is assumed not to happen (or should just show a bad fit to the model).
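The correlation figure for a step-like ACC can be checked numerically with a small sketch. It assumes, purely for illustration, that difficulties are spread uniformly over a bounded range [0, h_max] and that the agent succeeds exactly when the difficulty is below a threshold; the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def step_correlation(threshold, h_max, n=10000):
    """Correlation between difficulty and response for a step ACC.

    Difficulties lie on a uniform grid over [0, h_max] (an assumption);
    the response is 1 below the threshold and 0 at or above it.
    """
    hs = [h_max * (k + 0.5) / n for k in range(n)]
    responses = [1.0 if h < threshold else 0.0 for h in hs]
    return pearson(hs, responses)

middle_step = step_correlation(threshold=0.5, h_max=1.0)    # step in the middle
distant_bound = step_correlation(threshold=0.5, h_max=100.0)  # range far beyond the step
```

Under these assumptions the correlation works out to $-\sqrt{3p(1-p)}$, where $p$ is the fraction of the difficulty range below the step, which gives $-\sqrt{3}/2 \approx -0.866$ for $p = 0.5$ and tends to 0 as the range grows.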
This has actually led to a myriad of person-fit metrics [32], which is a way of analysing subjects at the individual level. This aims at identifying cases such as "low-ability examinees who copy answers to several difficult items from a much more able neighbor and very high-ability examinees fluent in another language but not yet fluent in English, who misunderstand the wording of several relatively easy questions" [31]. But in the end, all this is about whether the observed curve matches the expected curve. This was not meant to measure generality.
As there are so many person-fit metrics, some of them are relatively similar to Γ, as defined in this paper. Especially relevant are those that compare the person response curve with a Guttman conformal curve, which is a curve that is right for the first r items of lowest difficulty and wrong for the rest (a step function). In this setting, the closest metric seems to be the norm conformity index [33], which basically counts how many ranking mismatches there are between a Guttman curve and an observed curve. Another closely related metric is the disagreement index [34], where the agreement index (the sum of the results multiplied by the difficulty index for all items) is compared with the score of the Guttman conformal curve with the same number of correct responses (NC score). Since all these metrics are ordinal, and convert the difficulty of the items to ranks (index), the correspondence to Γ is only direct when we have a uniform distribution of items per difficulty. In other words, all these metrics take all instances as equally valuable (the NC score is simply the number of correct responses), while the agent characteristic curves shown in Fig. S30 sum with the assumption of difficulties being uniformly distributed. So, if there are more items for some difficulty values than others, the count (the NC score) and the area (the capability) would be different. This is intentional, as we are not interested in a capability according to a set of items, but according to different levels of difficulty. Actually, for many difficulties, the number of items might be infinite. Assuming an uneven number of items per difficulty does not have more support than assuming them uniform. Capability is invariant to this.
Still, because many of these metrics take the step function as a reference, it is important to look at the Guttman scale or, more precisely, the deterministic model [35,36], which can be considered a precursor of IRT. A deterministic model just captures the item response curve as a step, i.e., the probability of correct response is 0 for values below the ability θ and 1 elsewhere. This model produces agent response curves that are also a step (the Guttman conformal curves) and, hence, they would have infinite generality. Several properties derive when items (and hence agents) follow this model. In particular, task transitivity and agent transitivity are true under this model, as shown in §E.2 (properties #6 and #7).
The Guttman scale assumes monotonicity (higher probability of response for higher ability), but there are many other models (some non-parametric [37,38] and some parametric [28,29]) assuming this. The Guttman model has been used in cases where solving one item means all items of lower levels of difficulty have to be solved as well. For instance, in arithmetic, at the lowest level of difficulty one might have addition and then at the next level we can have multiplication. Arguably, one cannot do any multiplications without knowing addition (although there are very simple cases such as multiplications by zero or by one that do not require any addition in the process). In general, the Guttman model does not hold for practical sets of items, and it is mostly used because of its simplicity.
It is important, hence, to say that our notion of generality is not assuming the Guttman model for items (or a non-ordinal version of it) or a conformal Guttman curve, but just measuring how far the expected responses of an agent are from that theoretical situation.
Finally, there is a clear resemblance of the notion of generality with "person reliability", as introduced by Lumsden [39]. The notion of reliability wants to capture "tremor effects", i.e., each person has a variability on its ability θ. Actually, Lumsden models this reliability with a normal distribution and then the agent characteristic curve turns out to be its CDF. For constant-θ IRT models, like the traditional logistic models or the Guttman model, the theoretical agent characteristic curve has the same slope for all respondents. This changes for variable-θ IRT models, where reliability is introduced as an extra parameter (sometimes sacrificing the discrimination parameter, depending on the degrees of freedom).
In general, without considering any particular model, an agent can get constant θ, with no reliability issues at all, and still have a flat curve. Simply, the agent is consistently bad at easy problems, like the two top plots on Fig. S30. It is only when we limit ourselves to some particular models that we can understand the slope of the curve as a reliability. In other words, variable-θ models assume "that the person trait level varies during test administration" [40]. By using expected values and thresholds transforming them into accomplishment values we exclude the reliability component and we focus exclusively on generality. Tremor effects can also be discounted as they should appear for all difficulty levels.
Perhaps because of this confusion between reliability and generality, the agent reliability metrics are not as widespread as the person-fit metrics commonly used for constant-θ IRT models. But we have to be careful about person-fit: "From a constant-θ point of view, person reliability can be considered as a source of misfit or overfit at the individual level. Thus, the imprecise, highly unreliable respondent [...] will produce an almost random pattern that will be regarded as misfitting. At the other extreme, the highly reliable respondent is expected to produce a highly scalable response pattern that fits the stochastic model too well and that will be regarded as overfitting". Here, in contrast, with the individual metric of generality, we are not considering any model to fit. For generality we just examine the distribution of the expected responses in terms of difficulty.
Once the differences between generality and reliability are clarified at the conceptual level, we may be interested in the connections at the formulaic level. For instance, if we generate expected responses according to a normal distribution (like the middle right and the two bottom plots in Fig. S30), with a standard deviation σ, we have the following²:
Proposition 1. Assuming a normal distribution on capability, with standard deviation σ, the slope of the ACC will be $-\frac{1}{\sigma\sqrt{2\pi}}$.
Less trivially, we can show the following lemma and proposition:
Lemma 2. Assuming a normal distribution on capability, with mean µ and standard deviation σ, such that the location is sufficiently beyond 0 to have negligible mass below 0 (i.e., $\mu/\sigma \gg 0$), we have that $M_j = \frac{\sigma^2 + \mu^2}{2}$.
Proposition 3. With the same assumptions as Lemma 2, we have that spread $S_j = \sigma$ and $\Gamma = 1/\sigma$.
The definition of person reliability was just $1/\sigma$ [40], so we see the equivalence between reliability and generality if the agent had an ACC that were the complement of a normal CDF. However, in our case we do not understand σ as the standard deviation of capability or its measurement and there is no special reason why this should be normal. A different interpretation appears if we consider the estimation of difficulty to have an error that is normally distributed, with a variance that is constant for all difficulties. We will come back to this when we discuss possible ways of estimating the difficulty function.
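The three statements can be checked numerically with a short sketch. It relies on definitions that are not restated in this section and are therefore assumptions here: the ACC is taken as $\psi_j(h) = 1 - \Phi((h-\mu)/\sigma)$, capability as $\Psi_j = \int \psi_j(h)\,dh$, $M_j = \int h\,\psi_j(h)\,dh$, spread as $S_j = \sqrt{2M_j - \Psi_j^2}$ and generality as $\Gamma_j = 1/S_j$. The function name, the integration grid and the example values µ = 10, σ = 1 are hypothetical.

```python
import math
from statistics import NormalDist

def normal_acc_summary(mu, sigma, steps=20000):
    """Capability, M, spread and generality of an ACC that is 1 - normal CDF.

    Integrals use the midpoint rule on [0, mu + 10 * sigma]; the definitions
    of M, spread and generality are the assumed ones given in the lead-in.
    """
    dist = NormalDist(mu, sigma)
    h_max = mu + 10.0 * sigma
    dh = h_max / steps
    capability = 0.0
    m_value = 0.0
    for k in range(steps):
        h = (k + 0.5) * dh
        psi = 1.0 - dist.cdf(h)          # expected response at difficulty h
        capability += psi * dh
        m_value += h * psi * dh
    spread = math.sqrt(2.0 * m_value - capability ** 2)
    return capability, m_value, spread, 1.0 / spread

capability, m_value, spread, generality = normal_acc_summary(mu=10.0, sigma=1.0)
# Slope of the ACC at its steepest point h = mu: minus the normal density there.
steepest_slope = -NormalDist(10.0, 1.0).pdf(10.0)
```

With µ/σ = 10 the mass below 0 is negligible, so the sketch should return a capability close to µ, an $M_j$ close to $(\sigma^2+\mu^2)/2 = 50.5$, and a spread and generality both close to 1.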

B.2 From individual generality to populational generality: manifolds and the g factor
As we mentioned in the paper, Charles Spearman found an important phenomenon: when he analysed a set of different tests taken by the same population, and calculated the correlations between tasks³, he found a positive average correlation ($\rho \gg 0$). A person obtaining good results on a test usually obtained good results on the others. This phenomenon was known as the 'positive manifold' [41,42], and was stronger the more culture-fair and abstract the tests were. The correlation does not emanate from the tests only, but depends on the agents in the population. Despite this dependence on the population, the positive manifold has appeared repeatedly for different human populations and different sets of tests, provided they are not too linked to particular cultural or educational backgrounds. Spearman introduced a rudimentary factor analysis to extract a dominant latent factor, which he called the g factor, explaining a significant part of the subjects' variance. Since then, this factor has been found systematically in different populations [43,44] and has predictive value about many facets of human life, from academic performance to (lack of) religiosity. The controversy appears when g is associated with general intelligence, and is said to underlie all other factors and facets of intelligence. In other words, it is not g which has been called into question, but its interpretation. Note that the theory behind g allows psychometricians to estimate how much of this factor an individual has, the g score, giving us a latent factor that can characterise and distinguish individuals. But this score is not generality, but (general) ability. For two different people with the same g score, one person could achieve good results consistently across many cognitive tests while the other shows a more uneven performance on the same set.
In other words, the predictability of g scores is analysed globally, but still some individuals may be less predictable than others. One possible reason may be reliability⁴, but another reason is simply that some individuals are less general than others. In the end, g was the result of observing a general factor emerging from human performance on a range of tests. But where does this general factor come from in the first place? Is it a necessary result if the individuals are really general? This new question is what we try to explore below.
Let us first analyse the situation where the positive manifold is extreme, with a mean correlation of 1. This means that all columns in the response matrix $r_{j,i}$ have a correlation of 1. Now let us assume that all columns are normalised (same mean and standard deviation). The only possible situation for this to happen is that all columns are equal, $\forall j, i : r_{j,i} = r_{j,1}$. So clearly, for each individual we have zero variance in the tests. But this zero variance between numeric values for a test does not mean infinite generality (which must be calculated from an ACC). Consequently, we cannot relate this to generality as we do not know the instances in each test and their difficulty. But let us assume that we convert each response in the response matrix into a step ACC with capability at $r_{j,i}$, or in other words, we assume that $\psi_j(h) = 1$ iff $r_{j,i} = r_{j,1} \geq h$, and 0 otherwise, and merge them into a single ACC for each agent $j$ as all of them are equal. Clearly we have that $\Psi_j = r_{j,1}$ as all of them are steps by definition, also having $\Gamma_j = \infty$. In brief, only under this "thresholding" approach that constructs step ACCs for a range of tasks would a mean correlation of 1 imply infinite generality. Actually, this transformation approach is analysed in §F.4, where we get an interesting and straightforward connection between generality and row variance in the general case (no extreme manifold).
Secondly, let us now analyse the situation where all agents have maximum generality. Without loss of generality, we can consider that the rows of the response matrix $R_{M \times N}$ are ordered by increasing capability (the columns may also be ordered by increasing difficulty but this is irrelevant here). The values for each and every column in the response matrix $r_{j,i}$ would be of the form $0^p 1^q$, with $p + q = M$, i.e., the item response curves would follow a Guttman model. If $p > 0$ and $q > 0$ the correlations will be well defined and will be strictly greater than 0 and there will be a positive manifold. Depending on the distributions of capabilities and difficulties the magnitude of the average correlations will vary. For instance, it is easy to see that if we consider a normal distribution of difficulties and an equal normal distribution of capabilities, the mean correlations will be around 0.47, which is a very important positive manifold. In this situation, we see that individual generality implies a positive manifold. We do not even need to do factor analysis to check whether individual generality extends as a populational generality. As Guttman points out, a notion of populational generality can just be defined "as having all correlations positive or zero", without the need of "a common factor" [45].
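The figure of about 0.47 can be reproduced with a small sketch. Instead of random sampling it places capabilities and difficulties on evenly spaced quantiles of the same normal distribution, which is an assumption made here only to keep the result deterministic; agents are maximally general, so an agent solves an item exactly when its capability is at least the item difficulty. All names and sizes are hypothetical.

```python
import math
from statistics import NormalDist

def guttman_matrix(n_agents, n_items, mu=0.0, sigma=1.0):
    """Binary response matrix of step-ACC agents (rows) on items (columns)."""
    dist = NormalDist(mu, sigma)
    caps = [dist.inv_cdf((j + 0.5) / n_agents) for j in range(n_agents)]
    diffs = [dist.inv_cdf((i + 0.5) / n_items) for i in range(n_items)]
    return [[1 if c >= d else 0 for d in diffs] for c in caps]

def phi(col_a, col_b):
    """Pearson correlation of two binary columns, computed from counts."""
    n = len(col_a)
    na, nb = sum(col_a), sum(col_b)
    nab = sum(a & b for a, b in zip(col_a, col_b))
    return (n * nab - na * nb) / math.sqrt(na * (n - na) * nb * (n - nb))

def mean_item_correlation(matrix):
    """Mean correlation over all pairs of non-constant item columns."""
    cols = [list(col) for col in zip(*matrix)]
    cols = [col for col in cols if 0 < sum(col) < len(col)]
    total, pairs = 0.0, 0
    for a in range(len(cols)):
        for b in range(a + 1, len(cols)):
            total += phi(cols[a], cols[b])
            pairs += 1
    return total / pairs

manifold = mean_item_correlation(guttman_matrix(n_agents=400, n_items=60))
```

Every pairwise correlation in such a matrix is strictly positive, and the mean should land in the neighbourhood of the value quoted above; the location and scale of the normal distribution do not matter, since only the ordering of capabilities and difficulties enters the responses.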
Spearman, and most of the literature after him, analysed the positive manifold for tests instead of items. Tests group a number of items that are considered to be related (e.g., a maths test) and include a range of difficulties so that we get diversity of results for the test according to the population it is going to be applied to. So let us consider that items or tasks $\mu_i$ are grouped into tests $\tau_k$. Now we can construct a new response matrix where columns $k$ are tests and rows are agents $j$. We can show that, by aggregating items into tests, mean correlations may get much higher under different scenarios.
For instance, let us consider both item difficulties and capabilities following the same normal distribution (sufficiently far from 0 so that there is negligible mass below 0). In this case, we have that if we group the items randomly, we can get mean Pearson correlations above 0.99. If the new groups preserve the item difficulty distributions, for each agent $\pi_j$ we will have exactly the same results for all tests on expectation. As the agents have different capabilities, we will have a mean Spearman correlation equal to 1 and, if the distributions are normal, a very high mean Pearson correlation.
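The effect of random grouping can be illustrated with a short simulation. It is a sketch under assumptions that go beyond the text: agents have step ACCs, the score on a test is the share of its items solved, and the sizes (200 agents, 4 tests of 500 items), the normal parameters and the seed are arbitrary, hypothetical choices.

```python
import math
import random
from bisect import bisect_right

def pearson(xs, ys):
    """Plain Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def scores_on(capabilities, difficulties):
    """Share of items each step-ACC agent solves (difficulty <= capability)."""
    ordered = sorted(difficulties)
    return [bisect_right(ordered, c) / len(ordered) for c in capabilities]

def mean_test_correlation(n_agents=200, n_tests=4, items_per_test=500, seed=0):
    rng = random.Random(seed)
    caps = [rng.gauss(10.0, 1.0) for _ in range(n_agents)]
    items = [rng.gauss(10.0, 1.0) for _ in range(n_tests * items_per_test)]
    rng.shuffle(items)  # random assignment of items to tests
    tests = [items[k * items_per_test:(k + 1) * items_per_test]
             for k in range(n_tests)]
    scores = [scores_on(caps, t) for t in tests]
    pairs = [(a, b) for a in range(n_tests) for b in range(a + 1, n_tests)]
    return sum(pearson(scores[a], scores[b]) for a, b in pairs) / len(pairs)

mean_corr = mean_test_correlation()
```

With many items per test, each test score approximates the same increasing function of capability, so the correlations between tests approach 1; with fewer items per test the sampling noise in each test's difficulty distribution lowers them.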
Other similar connections can be obtained with some other distributions, assuming that each test preserves a range of difficulties such that it ensures the differences in capabilities to be represented per each test. This is actually a very natural condition for a test to be informative (if all respondents got similar values for the test, then the test would not be very informative). As a result, if this variance is preserved, the only strong sufficient condition for a high manifold to appear is individual generality⁵.
Figure S22: Four agent characteristic curves, with the bands showing two possible tests (red and green) grouping subsets of tasks. Top: We see two maximally general agents. Independently of how the groups are made for the two tests, provided the same range of difficulties is covered, the curves for each subset would be the same, and so would the effect on the populational generality. Bottom: groups can be made in such a way that the green test gets all positives for the bottom left plot but all negatives for the bottom right plot, and the opposite for the red test. As a result, the manifold might even be negative. Compare with Fig. 1 (bottom).
Only the maximally general ACCs can ensure that for every possible partition or sampling of instances, provided the range/distribution of difficulties is kept, the manifold is created, since the capabilities are preserved for each subtest. This is illustrated at the top of Fig. S22 in contrast with the bottom of the same figure. If individuals have low generality, choosing sets where a difficulty range is preserved is less important, and the positive manifold could still appear if the tests are not splitting the items by pockets of speciality. In other cases, the manifold might even be negative. On the other hand, if individuals have high generality, any partition of items into tests provided the range of difficulties is preserved would lead to high positive manifolds. In any case, negative manifolds would never appear.
Negative manifolds are very rare in the literature of human intelligence. Also, having sets of cognitive items for which difficulty does not play a role seems very unnatural. But still, the evidence might be compatible with some moderate degrees of generality or some individuals being more general than others. The plausibility (or necessity) of a particular scenario in light of a positive manifold will depend on a series of assumptions. Of course, the sufficiency direction is clear: if we are able to measure generality of the individuals in a population and we know how tests are formed, we can predict the manifold.

B.3 Spearman's Law of Diminishing Returns (SLODR) and individual generality
There is another source of evidence that can help us with the analysis of the plausibility of individual generality in light of a positive manifold. This evidence was also first gathered by Spearman. He calculated the correlations and g factors for the results of two different groups over the same tests: a group of humans A with normal abilities and a group B with low abilities. The mean correlation for group A was 0.47 while the mean correlation for group B was 0.78. The proportion of the variance explained by g for the low-ability group (scoring worse) was much higher than for the normal-ability group (scoring better). The more intelligent a population is, the less variability is explained by g. Spearman associated this with a possible saturation of g for most able groups, such as the law of diminishing returns in economics, and since then it became known as Spearman's Law of Diminishing Returns (SLODR). The finding has been replicated in many different experimental settings since then [46,47,48].
Spearman argued that the items that were only solved by the most able individuals required the combination of many skills, so g would play a smaller role. Several other explanations have been introduced, such as [46] and [47]. These explanations have been contested, and even the experimental evidence itself has been called into question. One explanation describes the phenomenon as an artefact of the way items are selected, choosing narrow ranges of difficulty. Basically, there are no sufficiently challenging items, and very able individuals do not need to use general intelligence, as they can solve the problems more mechanically. Using the same tests for both groups, together with a ceiling effect caused by an insufficient range of difficulties, would create the observed effect. Relatedly, Jensen [43, p. 587] discovered that the most able group showed lower variance than the less able group. Finally, when the more able group was asked to solve problems of higher difficulty, the SLODR vanished, and the more able group even showed higher correlations [49].
This observation is more consistent with individuals having generality, such that if the distribution of difficulties of items is not adjusted for the two subpopulations (or sufficiently wide to accommodate both), the items would be on the left of the step ACCs for many individuals of the more able group, so the correlations of the most able group would be smaller. Note that this would not appear for flat ACCs (with very low individual generality).
In other words, the SLODR, without adjusting the difficulties, would not appear if the individuals were not general at all. However, it appears if the individuals are highly general. And it is also easy to see that if we adjust the difficulties, so that the distributions are the same for both groups (and the relative distributions of abilities are the same), then we would have exactly the same manifold, so no diminishing or increasing returns.
Indeed, there is a pressure about resources when trying to achieve capability. This may make the ACCs more compact for higher capability, leading to more individual generality for the more able group. That would entail an augmenting return, as postulated with the so-called Universal Law of Augmenting Returns (ULOAR) [50]. We will return to these issues under an evolutionary framework (pressure of resources) and also under a computational framework, by looking at the invariance theorem and the stability of difficulty.

B.4 Individual generality, collective intelligence and the c factor
Finally, let us comment very briefly on collective intelligence. Collective intelligence appears in the natural world in many ways: as groups of individuals of the same species, as a species taken as a whole, and even as ecosystems with complex interactions between several species. Here we just limit our analysis to the view of collective intelligence as teams that have to work together to solve tasks, usually in quite controlled conditions [51,52]. While maximum generality is optimal for an individual, it is not for groups [53]. If all the individuals in a group have maximum generality, the one with highest ability will dominate the rest. By agent transitivity (property #7), everything any other agent solves will be solved by the dominant agent. Consequently, the result of the group will be at most the result of the best agent in the group. With more specialised agents, there could be more options to exceed the results of the most capable individual. Of course, this depends on many assumptions about the dynamics of the groups, with the exact outcomes easier to derive when groups just combine their capabilities by voting or weighted voting (if confidence is used) [54,55,56]. In more realistic scenarios, redundancy may be needed in a group, and hence more generality or more overlap in the abilities of the components may be beneficial.
One straightforward way of considering the aggregation of several agents is to consider that each agent is answering a proportion of the questions, chosen randomly. In the particular case of having N agents with step ACCs (maximum generality) then we have the conditions of lemma 19 and we have average capability and spread equal to the standard deviation of the original capabilities. Randomly choosing from general agents with different capabilities basically creates a non-general composite agent.
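A small worked example may make the composite agent concrete. The sketch assumes definitions that are not restated in this section: the composite ACC is the average of the step ACCs, capability is $\Psi = \int \psi(h)\,dh$, $M = \int h\,\psi(h)\,dh$ and spread is $S = \sqrt{2M - \Psi^2}$. For a step at capability $c$ these integrals are exactly $c$ and $c^2/2$. The function name and the four example capabilities are hypothetical.

```python
import math
from statistics import mean, pstdev

def composite_summary(capabilities):
    """Capability and spread of the average of step ACCs.

    A step ACC with capability c has area c and first moment c**2 / 2, so the
    averaged ACC has area mean(c) and first moment mean(c**2) / 2 (using the
    assumed definitions of capability, M and spread from the lead-in).
    """
    psi = mean(capabilities)
    m_value = mean(c * c / 2.0 for c in capabilities)
    spread = math.sqrt(2.0 * m_value - psi ** 2)
    return psi, spread

caps = [2.0, 4.0, 4.0, 6.0]          # four maximally general agents
capability, spread = composite_summary(caps)
generality = 1.0 / spread            # finite, unlike each component agent
```

Here the composite capability is the average of the individual capabilities and the spread coincides with their population standard deviation (`pstdev(caps)`), so the composite generality is finite even though every component has a step ACC.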
Interestingly, the aggregation of several curves for which we do not know the shape could be understood as a normal distribution on the reliability of the capability, transformed into a sigmoidal cumulative density function for the ACC. Consequently, findings such as the c factor [51], could be re-analysed by looking at the individual generalities first, rather than looking at the individual g scores (or IQ scores).

C Evolutionary interpretation: generality and general intelligence in the animal kingdom
The study of intelligence in animals (including humans) usually distinguishes between domain-general and domain-specific kinds of cognition. Much debate has been held on the degree to which these are present in humans and non-human animals. This ultimately relates to discussions around modular views of the mind [57] versus developmental domain-general learning [58]. It is also common to analyse whether social species are associated with more domain-general cognition, the so-called social hypothesis (see, e.g., [59,60,61,62]).
The definitions of what is general and what is specialised also vary in the literature, but it is usually understood as coping with a wide range of cognitive tasks, or flexibility for changing cognitive demands in an unpredictable environment [63]. Note that this view is similar to the notion of generality we are discussing in this paper, except for the explicit use of difficulty. In our case, we say that an animal or a species is cognitively general if it is able to perform equally well on a wide range of problems up to a limited difficulty. This contrasts with specialised animals or species that display a hardwired fixed repertoire of domain-specific functionalities where they excel, but are unable to cope with even the simplest tasks beyond the repertoire.

C.1 The g and G factors and intelligence convergence in animal cognition
A data-driven approach to the issue of general intelligence in animal cognition has usually been conducted with population analyses performed on several non-human species [64]. Burkart et al. [65] provide the most comprehensive review to date of the study of the correlation manifold in non-human animals, both intra-species (denoted by g) and inter-species (denoted by G). The main conclusion is that "there is increasing evidence for g in nonhuman animals, particularly in mice and primates [...] At the interspecific level [...], studies of primates and birds provide a robust pattern consistent with G" [65]. Although some methodological caveats exist [66], these factors are understood to be stronger the higher the proportion of the variance of results they are able (on their own) to explain for a battery of tests in a population (individuals in a species for g, or species averages in a multi-species analysis for G).
If we represent the performance of several individuals or species for several domains, as shown in Fig. S23, the evidence for g would be more in alignment with the plot in the middle, which shows that when one individual displays some performance in one domain the individual tends to display similar performance in the other domains. This is much in alignment with the early notions of general intelligence in humans and the positive manifold, as per our early motivation around Fig. 1 (bottom, especially if we rotate the plots) and what we have just discussed about Fig. S22.
So we are in a very similar situation to the human case. We cannot directly derive individual generality from these findings unless we postulate further assumptions, especially in terms of the difficulties used for the items in the domains. Of course, items are wisely chosen with the right difficulty such that there is variability of results to explain. This variability is basically what is being looked for (a factor that explains a great proportion of the variance). For instance, if we take Fig. S23 (left) and add more items of low difficulty to domains $D_2$ and $D_4$, we would simply even out the results, as there would be more correct responses for all individuals in these domains. The contrary is also true: there are spurious ways, by adding further easy or hard instances of particular domains, to go from the middle plot to something that resembles the leftmost plot. This is one of the key reasons why ignoring difficulty and considering low variability (as seen in the middle plot) as an indication of generality is wrong, especially when we think of the spurious ways of achieving it. Despite all these problems, it is not customary to perform a systematic analysis of difficulty (for instance, using cognitive demands for each item, or applying IRT). Also, in the first place, the identification of domains (such that they are actually diverse) is one of the fundamental methodological issues in the analysis of general intelligence in animals.

Figure S23: [...] Left: The individuals behave better for some domains than others with very small differences between the individuals for each particular domain. Middle: The individuals behave equally well for all domains, but some individuals show higher performance than others, also in a consistent way. Right: At the species level, there seems to be no difference between domains, but individuals perform differently for some domains, either by individual differences or by "heterogeneous developmental conditions". [Adapted from [65, Fig.1].]
"The issue of task selection is thus closely linked to the identification of domains in animal cognition, which in fact is part of the empirical question that needs to be addressed in intelligence research in animals in general, by using batteries as diverse as possible and statistical procedures that are a priori agnostic to the underlying factor structure" [65]. This is linked to the problem of dimensionality reduction, which has been highlighted by [66] as possibly being the cause of spurious results.

C.2 Cognitive resources and generality
In animal cognition, the references to resources (cognitive demands) required for the tasks in several domains are usually part of the discussions. Burkart et al. [65], for instance, set the question around how much extra neural tissue is needed, taking into account that domain-specific cognitive adaptations may require much less additional expensive brain tissue [67] than domain-general cognitive ability, which is also less directly linked to fitness-relevant benefits. They face "the puzzle that domain-general cognitive ability apparently evolved in at least some lineages, or perhaps even in birds and mammals in general, even though its evolution has had to overcome more obstacles compared to the emergence of domain-specific cognitive adaptations" [65]. One possible theory that explains this puzzle is the cognitive buffer hypothesis [68], which states that this extra effort in domain-general cognitive processes in larger brains buffers animals against environmental variation, and pays off for a wider range of behavioural patterns given by innovation, learning and, most especially, cultural transmission [25,67,69,70].
Evolution usually finds a trade-off between specialised functions and more general capabilities, according to the effort that has to be invested in evolutionary innovations and in the energy consumption of bigger brains on the one hand, and how predictable and regular the tasks faced by the species are in their environments on the other. In particular, Del Giudice and Crespi [71] include a series of trade-offs over four key properties: performance, robustness, efficiency and flexibility. Clearly, performance, efficiency and flexibility are closely related to capability, difficulty and generality, although the connections are qualitative rather than quantitative at the measurement level. The trade-offs between efficiency vs. flexibility and performance vs. flexibility are both associated with the "generalist-specialist trade-off". We can see this trade-off in Fig. S24, where we compare the gains and the efforts of a domain-general cognitive enhancement versus a domain-specific cognitive enhancement.
Of course, how meaningful these numbers are depends on how well we can estimate the effort for general solutions versus specialised solutions and how likely the specific tasks are versus all other tasks. Actually, Fig. S24 assumes that all tasks are equally likely (or, more precisely, all difficulties are equally likely). When some specific behaviours or domains are very likely in the environment of the species, then specialisation will of course pay off. It is when the environment is unpredictable, in the sense that many tasks are similarly likely, that the pressure for more general intelligence takes us to the kind of increase shown by the violet band in the figure rather than the orange one. Actually, in an environment where most tasks change in a few generations we would have an ACC closer to the maximally compacted one and maximum generality (as given by the compactness property, #2), as this would be resource-optimal in order to obtain maximum capability (and maximum success if tasks are so unpredictable). Of course, many tasks requiring cognition, such as navigating and eating (but not foraging or hunting), might still be linked to a few particular specialised skills, as they are more constant in the evolutionary history of many species⁶.

Figure S24 (axes: difficulty h, ψ): Using ACCs to represent two different ways in which the capability of a species can be enhanced, with a domain-general cognitive enhancement (vertical violet rectangle) or a domain-specific cognitive enhancement (horizontal orange rectangle). Left: both rectangles cover the same area ($1 \times 1$ and $0.1 \times 10$), and hence increase capability by a similar amount, but the orange one has a higher expected difficulty, and hence effort ($M_j = \int_{10}^{11} h \cdot 1 \, dh = 10.5$ vs $M_j = \int_{10}^{20} h \cdot 0.1 \, dh = 15$, according to Eq. 2). Right: both rectangles imply the same extra effort ($M_j = \int_{10}^{11} h \cdot 1 \, dh = 10.5$ vs $M_j = \int_{10}^{17.6} h \cdot 0.1 \, dh \approx 10.5$, according to Eq. 2), but the specialised one (orange rectangle) now has a smaller area, and hence less increase in capability than the general one (violet rectangle).
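The two comparisons in the caption of Fig. S24 can be checked numerically. The following Python sketch assumes that each enhancement is a rectangle of constant success rate over a difficulty interval, and that effort is the integral of h times that rate, as in the integrals quoted in the caption. The function names are ours and purely illustrative.

```python
def rect_capability(rate, h_lo, h_hi):
    """Area added to the ACC by a rectangle of height `rate` over [h_lo, h_hi]."""
    return rate * (h_hi - h_lo)


def rect_effort(rate, h_lo, h_hi):
    """Closed form of the caption's integral: int_{h_lo}^{h_hi} h * rate dh."""
    return rate * (h_hi ** 2 - h_lo ** 2) / 2.0


def upper_bound_for_effort(rate, h_lo, effort):
    """Solve rect_effort(rate, h_lo, h_hi) == effort for h_hi."""
    return (h_lo ** 2 + 2.0 * effort / rate) ** 0.5


# Left panel: equal area, different effort.
general = rect_effort(1.0, 10.0, 11.0)        # violet, domain-general
specialised = rect_effort(0.1, 10.0, 20.0)    # orange, domain-specific

# Right panel: equal effort, different area.
h_hi = upper_bound_for_effort(0.1, 10.0, general)
area_specialised = rect_capability(0.1, 10.0, h_hi)

print(general, specialised, round(h_hi, 1), round(area_specialised, 2))
```

Under these assumptions the specialised rectangle needs to stop at about h = 17.6 to match the effort of the general one, and it then adds roughly three quarters of the capability.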
How difficult a domain-specific functionality is, or how much effort it requires, can be analysed in different ways. A first way is to look at the energy effort, by examining the neurological modules that are dedicated to that functionality and mapping them to energy consumption. A second approach is to estimate evolutionary effort by looking at the changes in DNA that make the functionality possible (from an ancestor that did not have it), contrasting this with the ecological pressures and other similar functionalities. A third pathway is to identify these tasks, have them learnt by systems that do have general capabilities, and estimate their difficulty from those systems. In this case, extreme care has to be taken with the many confounding factors. Finally, a fourth possibility is to determine the difficulty of tasks intrinsically (e.g., working memory requirements, pattern complexity, etc.).

C.3 Looking at evolutionary selective pressure through observable scores: capability and generality
Some less general species are able, by specialisation, to solve very complex problems that other more general species cannot. We can finally look at this in terms of the two observable indexes we have introduced here: capability and generality. Plotting generality and capability against the level of social interaction (intra-specific and by diversity of predators), cultural inheritance, neural tissue mass, etc., with octopuses, hyenas, koalas, raccoons, primates and corvids, among other species, is expected to scatter points over very different locations. As a result, this could also help us see whether these traits are related, or whether there might be one-directional causalities. Both capability and generality are observable variables: the first is aggregated performance (the area under the ACC curve, eq. 1) and the second is a metric of how compact this performance is (how steplike it looks over difficulty, eq. 5). [...] Fig. S26 shows a similar simulation where, in addition, items have a uniform range of difficulty and success for the tasks is randomly proportional to the difficulty. In both figures the correlations can get very high since the pressure goes in the same direction: more capability and less effort. This is simply the result of the compactness property (#3). But again, it is important to notice that as generality and capability become more correlated (especially in humans) there is a tendency to confound them, ending up talking about general intelligence without knowing clearly whether the emphasis is on generality or capability.
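To make the two observable indexes concrete, the following Python sketch computes them for a piecewise-linear ACC given by its breakpoints. Capability is the area under the curve, as stated above. For generality, eq. 5 is not reproduced in this section, so the sketch assumes it is the reciprocal of a spread $\sqrt{2M - \Psi^2}$, with $M = \int h\,\psi(h)\,dh$ as in the effort integrals of Fig. S24. This form, and all function names, are assumptions made for illustration.

```python
import math


def _segments(hs, acc):
    """Consecutive breakpoints (h0, a0, h1, a1) of a piecewise-linear ACC."""
    return zip(hs[:-1], acc[:-1], hs[1:], acc[1:])


def capability(hs, acc):
    """Area under the ACC (trapezoid rule, exact for piecewise-linear curves)."""
    return sum((h1 - h0) * (a0 + a1) / 2.0 for h0, a0, h1, a1 in _segments(hs, acc))


def difficulty_weighted_area(hs, acc):
    """Integral of h * acc(h); Simpson's rule per segment is exact here."""
    total = 0.0
    for h0, a0, h1, a1 in _segments(hs, acc):
        hm, am = (h0 + h1) / 2.0, (a0 + a1) / 2.0
        total += (h1 - h0) * (h0 * a0 + 4.0 * hm * am + h1 * a1) / 6.0
    return total


def generality(hs, acc):
    """ASSUMED form: 1 / sqrt(2M - Psi^2); infinite for a perfect step."""
    psi = capability(hs, acc)
    spread_sq = max(0.0, 2.0 * difficulty_weighted_area(hs, acc) - psi ** 2)
    return math.inf if spread_sq == 0.0 else 1.0 / math.sqrt(spread_sq)


# Three hypothetical agents with the same capability (area 10).
step = ([0.0, 10.0, 10.0, 20.0], [1.0, 1.0, 0.0, 0.0])   # succeeds up to h = 10
ramp = ([0.0, 5.0, 15.0, 20.0], [1.0, 1.0, 0.0, 0.0])    # gradual decline
flat = ([0.0, 20.0], [0.5, 0.5])                         # half success everywhere

for name, (hs, acc) in (("step", step), ("ramp", ramp), ("flat", flat)):
    print(name, capability(hs, acc), generality(hs, acc))
```

With capability held fixed, the steplike curve gets the highest generality and the flat curve the lowest, which is the decoupling between the two indexes that the text describes.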
The theory of general intelligence, the positive manifold and the g and G factors have all (to different degrees) raised bitter controversies. Setting aside the interpretation issues, one of the major arguments against these theories is that they might be considered statistical artefacts, produced as the result of making some choices on the items and test batteries. Some of these choices are critical: they should fit the population of individuals (not too easy, not too difficult), so that there is variance to explain. In a very insightful way, Woodley of Menie et al. [72] break the species groupings by considering humans and chimpanzees together as a single population and then correct for these "ceiling or floor effects" by reducing the number of tests to those that have higher coefficients of variance. Fig. S27 shows the correlation of scores (d) and g loadings on the y-axis against different values of the variance produced by progressively selecting the tests with higher variance. Although not mentioned in [72], this analysis is of course closely related to the SLODR (and the alternative ULOAR hypothesis) discussed in the previous section, where, by adjusting the variance, g and scores can be made to grow together, as we see in Fig. S27. By looking at generality, as an individual observable measure, we can simplify the analysis in many ways. First, the measure does not depend on a population of individuals, so we do not need to determine what is a sensible group or even group individuals into species for the analysis. Second, because we can apply these metrics to a single individual, we do not need to obtain results for a large and unbiased sample of one species, which is usually very costly or problematic. Third, generality is algebraically independent from capability, and any correlation must then come from evolutionary or other kinds of efficiency pressures applied to the individual or the species.
For instance, actual plots like Figures S25 and S26 can be used as an alternative to Fig. S27, and done for individuals of many species together.
Another kind of criticism around the study of general intelligence is about whether "the positive manifold provides little or no constraint on the possible architectures of cognition" [73]. General intelligence may then originate from primary specialised modules being boosted by more general secondary modules (or evolved in this more compressed/abstract way for the economy of the brain), by a wide range of specialised modules that are switched on or off depending on the task at hand or by a truly general system helped by particular biases according to what environment demands are most frequent for a species. All this diversity of explanations could be extended to generality, as a high value of Γ can be obtained in many ways (but not as many as g, as we discussed in the previous section). As we will see in the following section, looking at individuals that have gone through no selective pressure, or a different engineering one (i.e., AI systems), can give us a wider theoretical and empirical scenario to exclude some interpretations of the existing findings. [...] research to come, which is looking at the values of Γ and Ψ in the animal kingdom.

Figure S27 (x-axis: average coefficient of variance in tests retained; series: human, chimpanzee and combined g loadings): Correlations between task g loadings and the scores d on the y-axis as a function of the average coefficient of variance in the tests retained, choosing them by removing those with smallest variance first. Trends shown for chimpanzees, humans and a combined population. [Adapted from [72].]

D Computational interpretation: generality and artificial (general) intelligence
The debate along the spectrum between general intelligence and specialised (or narrow) intelligence has also pervaded artificial intelligence since its inception. The very early attempts were directed towards a General Problem Solver [74] and the goal of "generality" [75]. In the following decades, many of these programs failed to fully realise the complexity of intelligence, while other more narrow applications started to be successful. In 1987, John McCarthy published a new version of his 1971 Turing Award Lecture on "Generality in Artificial Intelligence" [76], recognising that one of the major problems was that, if behaviour was represented by programs, these programs could only cover a finite set of domains or problems.
Of course, these were the times when machine learning was not a dominant paradigm in artificial intelligence. Nowadays, the use of machine learning techniques, coupled with sufficient data, allows systems to be adapted to different domains, using the same algorithm, which generalises from the data. Generalisation is an intrinsic (if not definitional) part of learning. Learning is hence the way in which AI systems (and human and non-human animals) can adapt to unseen situations. In other words, when considering a large and diverse number of tasks, coding particular solutions for all of them is infeasible, and hence learning becomes the solution.
Consequently, it may seem that (machine) learning systems are then general by definition: give a learning system sufficient examples and it will learn any possible task. The goal of machine learning, and AI, would be to define this universal machine learning system. While this idea is still behind some of the narratives in machine learning and artificial intelligence, there is an important objection to this universal generality: efficiency. Some systems can potentially learn any function, given a sufficiently large number of examples. The question is how many examples, how much time and how large the model might be. The answer to (or cause of) this problem is known as bias. By embedding a particular bias in a learning algorithm, one can accelerate learning for some problems while making it harder for some other problems.
There are many ways of explicitly or implicitly introducing strong bias to a learning algorithm: specialised architectures, hyper-parameters, background knowledge, and the very algorithm itself. By using these particular biases, we can have AI systems that can solve particular pockets of problems: speech recognition, machine translation, robot navigation, medical diagnosis, face recognition, etc. Interestingly, by a shrewd use of more and more computing power, some of these algorithms are requiring less physical time (and occasionally fewer examples) to learn these tasks, approaching, at least in some areas, the flexibility of some animals.
Still, there is a view that artificial intelligence does not produce general systems [77,78]. Some prominent positions even deny the possibility of general intelligence at all. As Yann LeCun puts it: "There is no such thing as AGI. Intelligence is always specialized" [79]. While the same deep reinforcement learning algorithm can learn to play Go or Chess by just changing the rules [80], the same algorithm cannot learn to navigate a room and play poker [81,82]. Of course, there are algorithms that can learn to navigate a room and have similar principles (and even shared modules underneath) to those playing poker. In either case there is a great amount of hyperparameter tuning, input and output transformation, and other changes to the architectures and the optimisation operators to make them work for a different domain. However, recent progress in deep reinforcement learning and transformers is producing systems where some sort of generality is undeniable [83,84].
The area known as Artificial General Intelligence [85], where the same system should be able to solve a range of problems, is still seen as a counterpoint to a bevy of systems that are successful for more narrow domains, even if they are fuelled by machine learning, and built upon general principles looking for abstract representations.
Unfortunately, to the dismay of some members of the AGI community, the term AGI is now commonly used as a synonym for ill-defined buzzwords such as human-level machine intelligence, human-level artificial intelligence or even superintelligence, without a proper analysis of what the 'G' in AGI actually means, and how it can be distinguished from mainstream AI [86].

D.1 Generality and all possible tasks
The reduction of AGI to anthropocentric views of intelligence has an intuitive appeal. We are interested in those tasks humans can solve. But which are these tasks? Or, more conspicuously, what are the tasks that humans (the hominids characterised by their general intelligence) cannot do? We can analyse this question and put the notion of generality to its limits by considering all possible tasks. One possible way of doing this is by defining the set of all computable tasks, with tasks being framed in a testing scenario, where agents can learn from experience. In other words, we can consider all possible learning tasks (see [87,88,89,10,90,91] for different ways of doing this). Apart from the particular formulation and setting, the most relevant issue comes when we realise that, as we now have an infinite set of tasks, we need a distribution over them, giving more weight to some over others.
Let us start with Solomonoff, who defined all possible sequential prediction tasks and an associated distribution, the algorithmic probability [87,88]. The set of tasks is just defined by the problem of estimating the next bits of all the sequences that can be produced by a universal Turing machine (UTM). While all sequences can be generated, their distribution (the algorithmic probability) depends on the reference UTM. In a way, this was an elegant way of representing the notion of bias in machine learning. Depending on the chosen UTM, some concepts will be easier to learn than others. Still, the great contribution by Solomonoff was that he showed that the same algorithm can be used for all UTMs (biases), and convergence can be obtained. A universal learning algorithm exists and it always works, but it will work more or less efficiently depending on the chosen bias, the reference UTM. In other words, each UTM assumes a prior about the world, and observations whose underlying pattern is simpler for the chosen UTM (smaller Kolmogorov complexity) are more likely than those observations with more complex patterns. Solomonoff integrates Occam's razor and Epicurus, as his theory considers the combination of all theories that are compatible with the evidence, weighted by their Kolmogorov complexity.
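The idea of combining all theories compatible with the evidence, with simpler ones weighted more heavily, can be illustrated with a toy predictor. The sketch below is not Solomonoff induction, which is incomputable. It assumes a small, computable hypothesis class (a bit pattern repeated forever) and uses pattern length as a stand-in for Kolmogorov complexity, with an illustrative prior of $2^{-2n}$ for a pattern of length n. All names and the choice of prior are ours.

```python
from itertools import product


def consistent(pattern, observed):
    """True if repeating `pattern` forever reproduces the observed prefix."""
    return all(bit == pattern[i % len(pattern)] for i, bit in enumerate(observed))


def prob_next_is_one(observed, max_len=8):
    """Posterior probability that the next bit is '1' under the toy mixture."""
    weight_one = 0.0
    weight_all = 0.0
    for n in range(1, max_len + 1):
        prior = 2.0 ** (-2 * n)            # shorter patterns get more weight
        for bits in product("01", repeat=n):
            pattern = "".join(bits)
            if consistent(pattern, observed):   # keep every compatible theory
                weight_all += prior
                if pattern[len(observed) % n] == "1":
                    weight_one += prior
    if weight_all == 0.0:
        raise ValueError("no hypothesis up to max_len explains the observation")
    return weight_one / weight_all


print(prob_next_is_one("010101"))    # dominated by the short pattern "01"
print(prob_next_is_one("0101010"))
```

Longer patterns that also fit the data are not discarded, but their combined weight is small, so the prediction follows the shortest compatible pattern.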
On the other extreme for the choice of a distribution we find the assumption that every possible problem's output is equally likely. In a sequential prediction problem this would be expecting all sequences to be equally likely or, in classification problems, to consider all combinations of inputs and outputs equally likely. This is referred to as "block uniformity" [92], a broader type of distribution than the uniform distribution. Block uniformity is one of the conditions for the famous no-free-lunch (NFL) theorems [93,89,94]. They show that, under these conditions, no learning method can be better than any other on average. A general-purpose learning system and hence the very notion of 'general intelligence' would be simply impossible [95]. Moreover, every agent would solve exactly the same number of tasks, so there would not be any variability in capability, effort and of course generality.
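The averaging argument behind the NFL theorems can be checked exhaustively on a tiny problem. The following Python sketch enumerates all 256 Boolean functions on three input bits, trains two different learners on five inputs and tests them on the remaining three. It uses the plain uniform distribution over target functions, a special case of the conditions discussed above. The two learners and all names are hypothetical illustrations.

```python
from itertools import product

INPUTS = list(product([0, 1], repeat=3))      # the 8 possible inputs
TRAIN, TEST = INPUTS[:5], INPUTS[5:]          # fixed train/test split


def majority_learner(train_pairs, x):
    """Predict the most frequent training label, ignoring x."""
    ones = sum(y for _, y in train_pairs)
    return 1 if 2 * ones > len(train_pairs) else 0


def nearest_neighbour_learner(train_pairs, x):
    """Predict the label of the training input closest in Hamming distance."""
    def hamming(a, b):
        return sum(u != v for u, v in zip(a, b))
    return min(train_pairs, key=lambda pair: hamming(pair[0], x))[1]


def average_test_accuracy(learner):
    """Off-training-set accuracy averaged over every possible target function."""
    correct = 0
    total = 0
    for labels in product([0, 1], repeat=len(INPUTS)):
        target = dict(zip(INPUTS, labels))
        train_pairs = [(x, target[x]) for x in TRAIN]
        for x in TEST:
            correct += int(learner(train_pairs, x) == target[x])
            total += 1
    return correct / total


print(average_test_accuracy(majority_learner))
print(average_test_accuracy(nearest_neighbour_learner))
```

Both learners obtain the same average on unseen inputs, since for every training labelling all labellings of the test inputs occur equally often.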
The NFL theorems are very relevant, because our observation that learning systems exist and work (in animals and computers) can only happen if the assumption is not true. This is the important corollary of the theorems. Apart from a pragmatic or ad absurdum rationale, there are more epistemological ones: choosing all perceptions as equally likely is difficult to reconcile with a world with physical laws and other agents around (plants, animals, conspecifics) that do not behave randomly. Actually, if we consider all these subsystems computable, Solomonoff's view is more natural, as the output of a UTM fed with random bits is not random. In other words, what we perceive, our world, is filtered through many machines (laws, mechanisms and agents), making those patterns that are produced from systems with limited resources more likely.
From this view of all possible tasks, one can define a (universal) distribution according to the complexity of the generator of tasks, such as the complexity of the task description. However, one can also define the distribution by looking at the complexity of the solution for the task, which can be seen as its difficulty. This way of weighting solutions by their difficulty is common in psychometrics, but was first introduced in the context of all (sequential) tasks in [10], recently extended to bidimensional grids [96]. When one goes from sequential tasks to interactive tasks (such as reinforcement learning [97,98]), the difference between the smallest program that generates a task and the smallest program that solves the task becomes illuminating. Setting the distribution according to the former led to the notion of universal intelligence [99]. Setting the distribution according to the latter led to the notion of policy-general intelligence, assuming a uniform distribution over solutions for each task difficulty [100,101,91]. We can see some of these choices in Fig. S28.

Figure S28: Different ways of generating tasks (or deriving their distribution). Top: the probability of a task is given by its generator. If the number of tasks is infinite, but countable, a uniform distribution is not a viable option, and a universal distribution must be used instead, making this equal to universal intelligence [99]. Middle: we first define a distribution of difficulties and then we define tasks according to that difficulty. In cases where the difficulty of a task can be derived from the definition of the task, this is a good option, as in [9,10]. Bottom: again, we first define a distribution of difficulties and then we derive solutions matching that difficulty. Finally, tasks are generated according to the solution. This is actually an option when the definition of a task does not say much about the difficulty of the solution, such as interactive tasks, as used in [100,101,91]. Note that for the two bottom rows in the figure, if the difficulty distribution is uniform, the expected success on a random task drawn from the distribution is equal to the area under the ACC, which is capability, as per eq. 1. [Adapted from [91, Fig.9.7].]

D.2 The choice of diversity and difficulty
The important thing about a theoretical account of all possible tasks, especially if we know how we generate them, is that we can control for two things that are crucial for generality: the diversity and the difficulty of the tasks. If we look at diversity first, the schema on the top of Fig. S28 makes it very hard to ensure that the set of tasks is going to be diverse, as we generate tasks according to a distribution on their definition, but not on their solutions. Besides, if the choice is a universal distribution as in [99], then the distribution is dominated by a few tasks, which take up most of its probability mass [102,103]. For the schema in the middle of Fig. S28, we have at least some range of difficulties but, still, that does not rule out that the solutions all end up being of the same kind. Finally, it is the choice at the bottom of Fig. S28 that ensures diversity by the most entropic choice of a distribution for each difficulty (assuming the number of solutions per difficulty is finite). This choice is the uniform distribution.
For instance, Fig. S29 shows an ACC where instances have been generated according to the bottom schema in Fig. S28. If we consider all difficulties as equally likely, and assume the curve is 1 for h < 7 and 0 for h > 14, then we have the ACC shown in the figure, with capability Ψ = 9.86 and generality Γ = 0.39.
A theoretical view also allows us to consider different options for difficulty. Having all tasks sliced by difficulty provides us with a way to understand the success of an individual in relation to the resources used. For instance, we can consider difficulty as the complexity of the simplest solution. In order to do this, we assume responses are binary, or are binarised using a threshold into successful or unsuccessful results, represented by $A^{\pi}_{\mu}$, which is 1 if π solves µ and 0 otherwise. There are a few interesting consequences of this interpretation of difficulty. First, for every agent there is a difficulty from which its ACC is always zero, so the area is always finite⁷. Second, we can precisely determine how many solutions of a given difficulty there might be. For instance, we can calculate the resources according to different situations:
• We can consider difficulty as the length of the solution with lowest Kolmogorov complexity, i.e., $\hbar(\mu) \overset{\text{def}}{=} \min_{\pi : A^{\pi}_{\mu} = 1} L(\pi)$, where $L(\pi)$ is the length of the solution π. Then the number of solutions for a given difficulty h would be $2^h$. In this situation, we can derive from the compactness property (#2) that the optimal curve is again one with $\Gamma = \infty$. To achieve capability Ψ, a non-learning system having predefined solutions for a large number of tasks would require a minimum of $\sum_{h=0}^{\Psi} h\,2^h = (\Psi - 1)\,2^{\Psi+1} + 2$ bits, plus the necessary code or neural wiring for making the switch among the $2^{\Psi+1} - 1$ solutions (assuming the solutions have nothing in common, because exhaustiveness here makes it difficult to compress this into a more hierarchical or reusable architecture). According to this situation, we can see that the "size" of the "brain" would grow more than exponentially. On the contrary, if instead of predefined solutions we consider a learning system, the size could be reduced as much as we would like, but we would need to consider the availability of data and the learning effort instead.
• We can understand difficulty with Levin's Kt complexity, as advocated for in [9,10,104,105,91] because of its connection with Levin's optimal universal search [11,12]. In this case, we define $LS(\pi, \mu) \overset{\text{def}}{=} L(\pi) + \log S(\pi, \mu)$, where L is the length of the solution π and S the number of computational steps π uses to solve the task µ. Difficulty would be $\hbar(\mu) \overset{\text{def}}{=} \min_{\pi : A^{\pi}_{\mu} = 1} LS(\pi, \mu)$. With this, we could still consider that the number of solutions for a given difficulty would be less than (but still approximately equal to, or linear in) $2^h$. The result for a non-learning system would then be similar, but now we would have to take into account the time to determine which problem we are facing, which requires choosing among the $2^{\Psi+1} - 1$ solutions. On the contrary, the resources needed for a learning system using this schema would simply be the expected difficulty $H = \Psi/2$. This is measured in the logarithm of computational steps⁸, so the expected number of computational steps using a universal search would be $2^{\Psi/2}$.
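The quantities in the two cases above are easy to tabulate. The following Python sketch checks the closed form for the size of a predefined repertoire against the direct sum, counts the solutions to switch among, and evaluates the Kt-style quantity and the expected search steps for a learning system. It assumes integer capability values and a base-2 logarithm in LS (the text writes only "log"). The function names are ours.

```python
import math


def repertoire_bits(psi):
    """Direct sum over difficulties: 2**h solutions of h bits each, h = 0..psi."""
    return sum(h * 2 ** h for h in range(psi + 1))


def repertoire_bits_closed_form(psi):
    """Closed form quoted in the text: (psi - 1) * 2**(psi + 1) + 2."""
    return (psi - 1) * 2 ** (psi + 1) + 2


def number_of_solutions(psi):
    """Solutions a non-learning system must switch among: 2**(psi + 1) - 1."""
    return 2 ** (psi + 1) - 1


def levin_ls(length_bits, steps):
    """LS = L + log S, with the logarithm ASSUMED to be base 2."""
    return length_bits + math.log2(steps)


def expected_search_steps(psi):
    """Steps for a learning system at expected difficulty psi / 2."""
    return 2 ** (psi / 2)


for psi in (5, 10, 20):
    print(psi, repertoire_bits(psi), number_of_solutions(psi), expected_search_steps(psi))
```

Even for small capability values the stored repertoire grows far faster than the expected search effort, which is the trade-off between size and time discussed in the text.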
From the above, we see the difference between a system with a predefined repertoire of solutions and a system that learns those solutions⁹. Even if the above ignores the training examples or interaction needed to learn the concepts, we see that there might be a trade-off between pre-wired and learned solutions, depending on the size limitations and the speed of the system.
The cases above are important to clarify the distinction between nature-vs-nurture and general-vs-specialised. Whereas we have the tendency to associate inherited functions with specialised functions, this does not have to be the case a priori, according to the definition of generality we are considering here. This may be a consequence depending on what resources are most relevant. Note that in the two analyses above, we derive the minimum resources following the compactness property (#2). Assuming all difficulties equally likely, one should focus on those policies that require fewer resources. Of course, if some particular pockets of problems of high difficulty are more likely than many problems of low difficulty then there is a rationale to cover those pockets specifically, so having less generality.
In all these cases we are using a distribution of tasks that is not based on a particular species or environment -they are not the tasks a human or animal would find in their lifetime. Accordingly, these distributions can be criticised as arbitrary. However, it is not true that all humans (and much less all animals) face the same fixed set of tasks. Precisely because of this, many psychometric tests include very abstract tasks, in an effort to be independent of particular human groups, and some (like Raven's matrices) may even look very unrelated to the natural (ancient or modern) environments humans face. However, it is well known that IQ tests lack measurement invariance when applied to other groups (e.g., people with some disabilities, children, etc.), nonhuman animals and, most especially, computers. In the latter case, it is not that they are particularly unfair for computers, but that AI systems can specialise for these tasks [107,108,109,110]. In a way, we can get generality inside the test, but inability to extrapolate beyond the test. Restricting testing to a particular kind of tasks facilitates systems that specialise on them, and this is particularly exploited in AI.
Hence the relevance of diversity, and this idea of using all tasks with which we started. But how much will the task distribution depend on the representational language or mechanisms used to derive the set of tasks? The invariance theorem, independently introduced by Solomonoff, Kolmogorov and Chaitin (see, e.g., [111]) says that any universal representational mechanism (language) can code any program as efficiently (in size) as any other up to a constant that is bounded by (but generally smaller than) the sizes of the definitions of both languages. This makes the concept of Kolmogorov complexity machine-independent, at least up to an additive constant. However, the definition of "universal intelligence" [99] has been criticised for this dependence on the reference machine, which actually leads to different definitions according to which UTM is used to generate the universal distribution [102,103,91]. The main reason is that the invariance theorem appears in the exponent of the distribution ($2^{-K(x)}$), and the additive constant becomes a multiplicative factor that is exponential in that constant. In contrast, the two versions on the bottom of Fig. S28 put back the invariance theorem as an additive constant on the scale of difficulty. This means that the scale upon which all other measures are derived is relatively more stable. For instance, given the spread for an individual using a notion of difficulty on a reference machine, this spread will be increased by at most a constant that does not depend on the individual. Also, as capability grows, the invariance theorem starts having more relevance. This can also be seen in the opposite way: systems with very limited resources (or capabilities) will be more dependent on the reference machine.
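The contrast between the two uses of the invariance theorem can be written out explicitly. The following is only a sketch: $U$ and $V$ stand for two reference machines and $c_{U,V}$ for a constant that does not depend on $x$; this notation is introduced here for illustration and is not taken from the paper.

```latex
\begin{align}
  |K_U(x) - K_V(x)| &\le c_{U,V} \\
  2^{-c_{U,V}}\, 2^{-K_V(x)} \;\le\; 2^{-K_U(x)} &\le 2^{c_{U,V}}\, 2^{-K_V(x)}
\end{align}
```

The first line is the additive bound that applies when the complexity is used as a difficulty scale; the second shows the same constant acting as a multiplicative factor $2^{c_{U,V}}$ when the complexity sits in the exponent of a universal distribution.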
Still, using two different reference machines might lead to very different difficulties for the x-axis in the ACC and hence different capability and generality scores, which is of course what underlies many discussions about whether tests are biased against or in favour of a group. But there are many "bias equalisers", especially in testing, that can be used to determine capability and generality more independently [112,113,114]: 1) introducing a testing apparatus that is novel for all subjects, 2) analysing differences after ensuring that individuals are raised in or adapted to the same core knowledge, 3) presenting problems that have to be solved by combining or using a set of constructs or elements that are abstract and new. These procedures are common in animal cognition and human intelligence testing, but much less so in AI research [115,116,117].
In practice, we do not need to consider all possible tasks to derive metrics of generality in AI. We can do this for any test battery or benchmark for which we are interested in deriving the generality of a particular AI algorithm or agent, be it in machine learning, planning or machine translation. In order to start we only need a metric of difficulty. It does not have to be a universal metric, as described above, but a customised one instead. It can be derived in many ways:
• Anthropocentric difficulty: we can use human performance as a reference for the difficulty of a set of tasks. This can be obtained as an indicator that is inversely related to the success of average humans in each task.
• Populational difficulty: this can be derived by using a population of AI techniques for the range of problems. For instance, [118] apply IRT to derive the difficulty of machine learning instances. This idea can be applied to datasets and other kinds of problems in AI (e.g., the ALE benchmarks, [119]).
• Intrinsic difficulty: any meaningful characterisation of difficulty can be used here, as we have done in several cases in this paper. For instance, the difficulty of a planning problem can be based on a series of features of the problem, such as the number of components, its structure, the degree of noise, etc. Note that difficulty is different from computational complexity, but time complexity may be an important factor.
• Integral difficulty: some other notions of difficulty can integrate space resources, computational time, energy consumption, data required, etc., [120] especially when including very different tasks. For instance, [121] aligns difficulty with the number of trees used by a random forest classifier, providing a very clean mapping to resources and effort.
• Opponent difficulty: in those cases where other agents compete or cooperate, we can use the capability of the opponents (or a measure inversely related to the capability of cooperators). Note that this makes this option populational as well.
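As an illustration of the anthropocentric and populational options above, the following minimal sketch derives a difficulty per task from the success of a reference population (humans or AI systems). The function name, the toy data, the negative-log transform and the Laplace smoothing are all assumptions made for this example; it is not the IRT-based procedure of [118], only one simple indicator that is inversely related to success.

```python
import math

def populational_difficulty(results):
    """results[j][i] is 1 if reference agent j solves task i, else 0.

    Returns one difficulty per task, higher when fewer reference agents succeed.
    The -log(rate) form is an illustrative choice, not the paper's definition.
    """
    n_agents = len(results)
    n_tasks = len(results[0])
    difficulties = []
    for i in range(n_tasks):
        successes = sum(results[j][i] for j in range(n_agents))
        # Laplace smoothing (an assumption) avoids log(0) when nobody succeeds.
        rate = (successes + 1) / (n_agents + 2)
        difficulties.append(-math.log(rate))
    return difficulties

# Hypothetical reference population: 4 agents, 3 tasks.
reference = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
difficulty = populational_difficulty(reference)
```

Any monotonically decreasing function of the success rate would serve the same purpose; the choice affects the scale of the x-axis of the ACC and therefore the resulting capability and spread.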
Note that we are working with a notion of difficulty for an expected response above acceptability ($\mathbb{E}[r]_{j,i} \ge 1-\epsilon$), as defined in §E. So, for all of the above approaches to difficulty, one can vary the threshold $\epsilon$, as a way of generating variants of the same instance or task, from those with difficulty $0$ ($\epsilon = 1$) to those with difficulty $\infty$ ($\epsilon = 0$, assuming no agent is 100% perfect). The estimation of difficulty will usually have some associated error. If we assume this error to be constant independently of the magnitude of the difficulty, then we can have a similar result as for proposition 3, and then the theoretical ACC will be sigmoidal in shape (complementary to the error function). In the case of a view of difficulty in terms of simplest description, the error can be linked to the constants of the invariance theorem.
In general, whenever an evaluation procedure is established in AI, there is a selection of tasks from a certain domain and for a particular range of difficulties. For instance, one rarely finds Hofstadter's "Gödel, Escher, Bach" [122] as an instance for a machine translation benchmark. It is too hard to be discriminative for AI. Usually, the benchmark tasks are selected to cover an application area (usually of scientific or industrial interest) and the difficulty of the items is chosen such that they are neither too easy nor too difficult for the state-of-the-art algorithms. This is natural, but it implicitly assumes a type of ACC nobody checks in the first place, and a very malleable notion of difficulty, adapted to the situation. This also makes progress in AI hard to assess, as the tasks in the domain and their difficulty are changing, like a moving target.

D.3 Generality in competitions and benchmarks in AI
The options for difficulty seen above can be applied to an increasing range of AI competitions and benchmarks [123], especially those that are aiming at more general-purpose AI. Some of these are the general game playing AAAI Competition [124,125], the reinforcement learning competition [126,127] (which featured the 'polyathlon', with several domains), the genetic programming benchmarks [128,129], the general video game competition [130,21], and the arcade learning environment (ALE) [19,130], a collection of Atari 2600 video games, which "has incentivized the AI community to build more generally competent agents" [131]. It is important to note that the introduction of new platforms and benchmarks where hundreds of tasks can potentially be implemented [132,133] is not usually accompanied by a verification of whether the agents that have the highest performance are also more general (exceptions, [134,135,136]). Recognising that the diversity and difficulty of the tasks must be explicitly determined is one important outcome of our analysis so far, and one that a metric of generality in these terms would help to flesh out.
The GVGAI competition [21,23,22], whose results we used in our experiments, aims at general video game playing, hence the name. Even so, the focus is on finding non-transitivity, such that metalearning through hyper-heuristics and algorithm portfolios is effective [137,138], by choosing different agents for different problems. This is also common in ensemble methods [54], where diversity is positive if results are to be combined. The notion of transitivity is vindicated or assumed in indicators such as the Elo rating [139], which depend on this transitivity, although more complex game topologies exist [140,141,142,143].
We saw in Fig. 5 (top) in the chess scenario that difficulty is taken from the performance of the opponent. This is also especially interesting for systems that improve with self-play, like AlphaZero [80]. In these settings, it is important to check that the system does not get better and better against more competitive opponents while ending up losing (or drawing more frequently) against weak opponents. This leads us to the more general question of whether a system that develops over time becomes more or less general [144,145]. As the system evolves, we may experience less flexibility but a wider covering of tasks, and this can be studied using metrics of capability and generality. We can see this in the series of systems from AlphaGo to MuZero. AlphaGo [146] was the first AI system to defeat professional players at the ancient game of Go. The system relied on human data, domain knowledge and game rules to master the game. AlphaGo is a clear specialist (low generality), and the actual capability only appears for one task. Then, its successor, AlphaGo Zero [147], initially had no capability at Go, but it learnt from self-play alone, given only the game rules. Its potential capability for Go is thus very high (and very high actual capability after training), but it has very low actual and potential generality. AlphaZero [80] was then introduced with no actual capability on any game initially. But after self-play, it excelled at a diversity of board games: Go, chess and shogi. AlphaZero thus has potential capability and some potential generality if we consider board games. Finally, we find MuZero [83], in the pursuit of more general-purpose algorithms. This system does not even require the rules of each game. MuZero combines AlphaZero's lookahead-search approaches with new planning abilities. MuZero is able to match the performance of AlphaZero in the above classic board games, but it also demonstrates significant success in environments without known dynamics such as the Atari 2600 video games. MuZero has similar or higher potential capability than AlphaZero but higher potential generality, as it can cover more games.
Another good example of increasing generality in AI is represented by large language models, which have achieved remarkable performance on a wide range of tasks [84,148]. Their generality is actual rather than potential: they cover all these tasks without retraining, using zero-shot or few-shot inference. Of course, they still have many limitations and fail catastrophically at some very simple examples [149]. Nevertheless, there is great promise that with more parameters or jointly with other technologies in the future they could become general (yet not necessarily very capable) systems: consistently good at a very wide range of very simple tasks.
In the end, the progress of some techniques in AI can be made in such a way that generality is preserved, and the ACCs are just translated to the right as the technology improves. We now have tools to check whether this is the case, or whether, on the contrary, some new techniques solve more challenging problems at the cost of being worse at simpler problems. This is particularly relevant as progress in AI can be attained by combining several approaches, in areas such as ensemble learning or portfolios, where a big switch approach determines which technique is most appropriate for a particular instance. This modular approach to solving problems may well end up in specialised solutions and create gaps, where some relatively simple problems are not solved, with a resulting lack of generality. But this modular approach, combining many specialised solutions, may increase generality (and capability) if the set of tasks remains constant, especially if the combination covers more of the easy ones than the difficult ones. Again, we see that generality measures how capability is distributed in terms of difficulty, but it does not impose constraints on how this is done. It may even include human computation, collective systems, cognitive services or hybrids, in the same way that humans can be enhanced by personal assistants or other devices, and increase or decrease their generality because of this.
Of course, if a modular solution requires hundreds of specialised subsolutions, the cost of keeping all of them and designing an appropriate and efficient switch to determine which one to use may end up making it less efficient than a more integrated solution, as we have discussed above. The relation between generality and resources is another way of looking at compression and generalisation, Occam's razor, the MML principle, etc., in machine learning, genetic programming and other areas in AI (e.g., [150,151,152]). Actually, the issues of generalisation and difficulty were usual (although from a different perspective) in the early days of genetic programming [153] (using the term 'generality' as 'generalisation power' or 'avoiding overfitting'). The notion of generalisation is usually linked to whether a model extrapolates from the training data to the test data [154], and a proper validation will just equate this with performance [155]. But generality, as introduced in this paper, just measures the distribution of success across difficulties, and can be applied to learning problems, planning problems and deductive problems, more in the original spirit of McCarthy [76].

E Formal setting and properties: Further detail
We will consider the evaluation of a set of $M$ agents on a set of $N$ tasks, with results or responses $r_{j,i}$ for each agent $\pi_j$ and task $\mu_i$, as represented in the example matrix (numbers in black) in Fig. 1 (top). For each agent we have its response mean $\bar{r}_j \overset{\text{def}}{=} \mathrm{Mean}_i[r_{j,i}]$, also referred to as agent average performance, and its response variance $\sigma^2_j \overset{\text{def}}{=} \mathrm{Var}_i[r_{j,i}]$, also referred to as agent variance¹⁰. From here, we could simply define one notion of regularity as the reciprocal (inverse) of the variance. This would give us $1/\sigma^2_a = 1/0.016 = 64.0$ and $1/\sigma^2_b = 1/0.203 = 4.92$ for agents $a$ and $b$ respectively in Fig. 1.
But is the variance produced by unreliability in the measurement, instability in the agent or is it because the agent really performs much better at some problems than others? We could try to exclude all sources of unreliability and work at the definitional level. In order to do this, instead of actual responses, we could work with expected (or ideal) responses. For each agent $\pi_j$ and an instance or task $\mu_i$, the expected response is given by $\mathbb{E}[r]_{j,i}$. We assume $0 \le \mathbb{E}[r]_{j,i} \le 1$ with 0 meaning worst possible performance and 1 meaning best possible performance. We would then discretise expected responses as $A_{j,i} = 1$ ('acceptable' or 'accomplished') if $\mathbb{E}[r]_{j,i} \ge 1-\epsilon$ and 0 otherwise ('unacceptable' or 'non-accomplished'), where $1-\epsilon$ would be just a threshold¹¹. For instance, for dichotomous tasks (where agents can only be right or wrong), with $\epsilon = 0.3$, we could have that $A_{j,i}$ is 1 if the agent is expected to be correct on the instance at least 70% of the time.
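The discretisation just described can be sketched in a few lines. The function name and the toy matrix of expected responses are hypothetical and serve only to illustrate the thresholding rule $A_{j,i} = 1$ iff the expected response is at least $1-\epsilon$.

```python
def accomplishment(expected, epsilon):
    """Map expected responses in [0, 1] to binary acceptability.

    A[j][i] = 1 if expected[j][i] >= 1 - epsilon, and 0 otherwise.
    """
    threshold = 1.0 - epsilon
    return [[1 if e >= threshold else 0 for e in row] for row in expected]

# Hypothetical expected responses: rows are agents, columns are tasks.
expected_responses = [
    [0.95, 0.72, 0.40],
    [0.69, 0.99, 0.10],
]
A = accomplishment(expected_responses, epsilon=0.3)
```

With $\epsilon = 0.3$ only expected responses of at least 0.7 count as accomplished; raising $\epsilon$ towards 1 makes every task acceptable, which is how variants of the same task with lower difficulty can be generated.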
This simple transformation would eliminate reliability issues in our analysis of generality. But still, could we then define generality as being good for all possible problems? First, for many sets of $N$ tasks it is not possible to have acceptable results for all of them, as some may be very complex or may require more resources than the agent has. Second, by using binary acceptability, we would have a Bernoulli distribution, and the variance would just be derived from the agent's average performance ($\bar{r}_j \cdot (1-\bar{r}_j)$), thus making it impossible to compare the generality of agents with equivalent performance. Part of this connection can be translated to spread. Consider that all the items are in some interval of difficulties, and hence capabilities, $[0, q]$. If we have $N$ items and an agent is right on all of them, then the ACC would be completely packed on the left, and spread would be 0. If an agent is wrong on all of them, then the ACC would be zero from 0 to $q$, but assuming that the curve is full (saturated) on the left (translation property) then we would still have a spread of 0. Between these two extreme cases in capability (both with spread 0), as we sort responses by difficulties in the ACC, and we calculate spread in a different way from variance, we can have results that deviate significantly from the Bernoulli distribution. For more discussion about this and its relation to the metrics we are about to introduce, see Appendix F.5.
The way out of these two problems is to look at responses in terms of their difficulty. Actually, agents might be better for easy problems than for hard ones. The quantification of difficulty appears at the core of generality.
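The Bernoulli issue can be made concrete with a small numerical sketch. The two agents below are hypothetical: both solve half of six tasks, listed by increasing difficulty, but one solves exactly the easiest ones and the other has scattered successes.

```python
from statistics import pvariance

# Binary accomplishments on six tasks sorted by increasing difficulty.
general_agent = [1, 1, 1, 0, 0, 0]      # solves exactly the easiest tasks
specialised_agent = [0, 1, 0, 1, 0, 1]  # same mean, successes scattered

def bernoulli_variance(a):
    """Variance implied by the mean alone: mean * (1 - mean)."""
    mean = sum(a) / len(a)
    return mean * (1 - mean)

var_general = pvariance(general_agent)
var_specialised = pvariance(specialised_agent)
```

Both variances coincide with `bernoulli_variance`, since the row variance of binary values ignores where on the difficulty axis the successes fall, which is the information the spread is meant to capture.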

E.1 Agent characteristic curves (ACCs) and capability: detail
Let us then consider a difficulty function, $\hbar$, mapping each task $\mu_i$ to a real value $\hbar(\mu_i) \ge 0$. We define an agent characteristic plot for agent $\pi_j$ as a scatter plot showing accomplishment $A_{j,i}$ in terms of the difficulty $\hbar_i$. In other words, we plot difficulty on the x-axis and accomplishment on the y-axis.
We can convert these scatter plots (as the dots are always 0s and 1s) into more interpretable curves. In order to do this, we define $\psi_j(h) \overset{\text{def}}{=} P(A_{j,i} = 1 \mid \hbar(\mu_i) = h)$, or equivalently, the mean of the accomplishment of agent $j$ on all problems of difficulty $h$. We then define an agent characteristic curve (ACC) as a plot of $\psi_j(h)$ as a function of $h$. Fig. S30 shows six scatter plots (grey circles, all either 0 or 1) and their corresponding ACC (blue line).
We can look at the leftmost part of the curve. We say that an agent characteristic curve is $s$-saturated if $\forall h \le s : \psi_j(h) = 1$. We see that the two first ACCs are not even saturated for $s = 0$. On the rightmost part, we want agent characteristic curves to ensure that the area under these curves is finite. We will assume difficulty functions that meet this property¹².
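As a minimal sketch of this conversion, the function below groups binary accomplishments by difficulty and averages them, which gives an empirical $\psi_j(h)$ for each observed difficulty. The function name and the toy data are hypothetical.

```python
from collections import defaultdict

def agent_characteristic_curve(accomplishments, difficulties):
    """Return {h: psi(h)}, the mean accomplishment among tasks of difficulty h."""
    by_difficulty = defaultdict(list)
    for a, h in zip(accomplishments, difficulties):
        by_difficulty[h].append(a)
    return {h: sum(v) / len(v) for h, v in sorted(by_difficulty.items())}

# Hypothetical agent: eight tasks, two per difficulty level.
difficulties    = [1, 1, 2, 2, 3, 3, 4, 4]
accomplishments = [1, 1, 1, 1, 1, 0, 0, 0]
acc = agent_characteristic_curve(accomplishments, difficulties)
```

In this toy case the curve is saturated up to difficulty 2, drops to 0.5 at difficulty 3 and reaches 0 at difficulty 4.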
We copy the definition of capability we gave in Eq. 1, i.e., the sum of all the mean responses per difficulty, which is the area under the ACC (see Fig. S30 for the calculated capabilities). Note that in a discrete way, capability is a weighted sum of all tasks according to a prior uniform distribution (or weight) of difficulties. The area will of course change even with a monotonic transformation on difficulty, such as a change to a logarithmic scale. Some scales make more sense than others and give a more meaningful notion of capability, especially if the x-axis can be associated with an additive unit, as we will discuss later on. These weights can be derived if we know the posterior, i.e., how many tasks we have for each difficulty.

E.2 Definition of spread and generality: detail
3. Maximum generality (minimum spread): given a fixed capability $\Psi_j$, the minimum expected difficulty $H_j$ and the maximum generality $\Gamma_j$ are obtained with a decreasing step agent characteristic function abruptly falling from 1 to 0 on $h = \Psi_j$, where the capability is double the expected difficulty (i.e., $\Psi_j = 2H_j$), spread $S_j = 0$ and generality $\Gamma_j = \infty$.

4. Constant interval: given a constant function $\psi_j(h) = c$ from 0 to $q$, we have $\Psi_j = cq$, $S_j = q\sqrt{c(1-c)}$ and $\Gamma_j = 1/(q\sqrt{c(1-c)})$. In the particular case of $c = 0.5$ we have $S_j = q/2$ and $\Gamma_j = 2/q$. Any spread greater than the value for this constant curve is considered 'abstruse'.
5. Minimum generality (maximum spread): given a fixed capability $\Psi_j$, and assuming $\psi_j(h) = 0$ beyond difficulty $q$, then the maximum expected difficulty $H_j$ and the minimum generality $\Gamma_j$ (the most abstruse result in a bounded interval) are obtained with an increasing step agent characteristic function going sharply from 0 to 1 on $h = q-\Psi_j$ (until $q$), leading to $S_j = \sqrt{2\Psi_j(q-\Psi_j)}$ and $\Gamma_j = 1/\sqrt{2\Psi_j(q-\Psi_j)}$. With the same capability, the square spread for this case is exactly twice the square spread for the constant case.
6. Task transitivity: if an agent $\pi_j$ is $s$-saturated then for every task $\mu_b$ such that $A_{j,b} = 1$ in the saturated area then for all other tasks $\mu_a$ with $\hbar(\mu_a) \le \hbar(\mu_b)$ we have that $A_{j,a} = 1$. In other words, if this agent solves a task in the saturated area then it also solves any other easier task. Agents with maximum generality $\Gamma_j = \infty$ are $s$-saturated with $s = \Psi_j$, so if a task of a given difficulty is solved we know all other simpler tasks are solved too.
7. Agent transitivity: if two agents $\pi_a$ and $\pi_b$ have maximum generality $\Gamma_a = \Gamma_b = \infty$ and $\Psi_a \le \Psi_b$ then for every task $\mu_i$ such that $A_{a,i} = 1$ we have that $A_{b,i} = 1$. That means that $\pi_b$ dominates $\pi_a$ or, in other words, that an agent would solve all tasks a less capable agent solves, provided both have maximum generality. Note that if generality is not infinite, it is not sufficient to have a curve for $\pi_b$ that covers the curve for $\pi_a$. We need to check that $\pi_b$ is $s$-saturated for at least the maximum value where $\pi_b$ gets non-zero accomplishment.
8. Same units: if we introduce a unit for difficulty, let us call it witts, then capability is also (additively) measured in witts, spread is also measured in witts and hence generality is measured in 1/witts.
Some of these properties (especially the transitivities) have been shown when assuming a Guttman (or deterministic) response model [35,36], as we discuss in §B. Looking again at Fig. S30 we see that the bottom left and bottom right are basically a translation of each other by $k = 4$. We see that the capability is increased by approximately 4, and the spread and generality are not significantly affected.
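Several of the properties above can be checked numerically. The sketch below assumes, as the proofs in §F suggest, that effort is $M_j = \int h\,\psi_j(h)\,dh$, spread is $S_j = \sqrt{2M_j - \Psi_j^2}$ and generality is its reciprocal; the function `metrics`, the midpoint-rule integration and the four example curves are hypothetical choices for illustration.

```python
import math

def metrics(psi, upper=20.0, steps=20_000):
    """Midpoint-rule estimates of (capability, spread, generality) on [0, upper]."""
    dh = upper / steps
    capability = effort = 0.0
    for n in range(steps):
        h = (n + 0.5) * dh
        capability += psi(h) * dh        # area under the ACC
        effort += h * psi(h) * dh        # assumed definition of effort M
    spread = math.sqrt(max(0.0, 2.0 * effort - capability ** 2))
    generality = math.inf if spread == 0.0 else 1.0 / spread
    return capability, spread, generality

q = 10.0
results = {
    # Decreasing step at h = 3: maximum generality for capability 3.
    "step": metrics(lambda h: 1.0 if h < 3 else 0.0),
    # Constant curve c = 0.3 on [0, q]: capability c * q = 3.
    "constant": metrics(lambda h: 0.3 if h < q else 0.0),
    # Increasing step on [q - 3, q]: minimum generality for capability 3.
    "increasing_step": metrics(lambda h: 1.0 if q - 3 <= h < q else 0.0),
    # The constant curve translated to the right by k = 4.
    "translated": metrics(lambda h: 1.0 if h < 4 else (0.3 if h < q + 4 else 0.0)),
}
```

Under these assumptions the step curve has a spread of numerically zero, the squared spread of the increasing step is twice that of the constant curve with the same capability, and the translated curve gains 4 units of capability while keeping the spread of the constant curve.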

F Theoretical results
In this appendix we include the proofs of the properties, their associated lemmata and propositions, the connection between spread and variance for a normal distribution, the aggregation of step curves, the derivation of Guttman curves through thresholding, and the derivation of the isometrics.

F.1 Proofs of the properties
Despite being straightforward, for completeness we include all the proofs of the properties presented in §E.2.
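Proposition 4. Given an agent with capability $\Psi_j$, any positive translation by $k$ implies that capability becomes $\Psi_j + k$.
Proof. A translation creates a new function $\psi'_j$ such that $\psi'_j(h) = \psi_j(h-k)$ for $h \ge k$ and $\psi'_j(h) = 1$ for all $h < k$, so the new capability $\Psi'_j$ is now:
$$\Psi'_j = \int_0^k 1\,dh + \int_k^\infty \psi_j(h-k)\,dh = k + \Psi_j.$$
Proposition 5. Given an agent with capability $\Psi_j$ where the $|l|$-leftmost part of the original curve was saturated, any negative translation by $k \le l$ implies that capability becomes $\Psi_j - k$.
Proof. As the left part of the curve is saturated, $\Psi_j$ can be decomposed into
$$\Psi_j = \int_0^l 1\,dh + \int_l^\infty \psi_j(h)\,dh = l + \int_l^\infty \psi_j(h)\,dh.$$
Now the translation removes part of the first term, so the new capability is:
$$\Psi'_j = (l-k) + \int_l^\infty \psi_j(h)\,dh = \Psi_j - k.$$
Proposition 6. With the same conditions as the above two propositions, generality is invariant to translation.
Proof. For a positive translation, we have that the new effort $M'_j$ equals:
$$M'_j = \int_0^k h\,dh + \int_k^\infty h\,\psi_j(h-k)\,dh = \frac{k^2}{2} + \int_0^\infty (u+k)\,\psi_j(u)\,du = \frac{k^2}{2} + M_j + k\Psi_j.$$
From proposition 4 we have that $\Psi'_j = k + \Psi_j$. Putting both things together into the definition of spread (Eq. 9), we have:
$$S'_j = \sqrt{2M'_j - \Psi'^2_j} = \sqrt{k^2 + 2M_j + 2k\Psi_j - (k+\Psi_j)^2} = \sqrt{2M_j - \Psi_j^2} = S_j.$$
As generality is the reciprocal of spread, and spread does not change, it is invariant to positive translation. The proof for the negative translation is similar.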
Proof. (of proposition 7) Clearly, $\Psi'_j = \Psi_j$, since the same mass $q$ is included in the integral one way or the other. For the new effort $M'_j$, since $h_2 < h_3$, we have that
$$\int_{h_1}^{h_2} h\,\psi'_j(h)\,dh + \int_{h_3}^{h_4} h\,\psi'_j(h)\,dh < \int_{h_1}^{h_2} h\,\psi_j(h)\,dh + \int_{h_3}^{h_4} h\,\psi_j(h)\,dh,$$
and hence $M'_j < M_j$. Now, from the definition of spread (Eq. 9), we have:
$$S'_j = \sqrt{2M'_j - \Psi'^2_j} < \sqrt{2M_j - \Psi_j^2} = S_j.$$
As spread is smaller, and generality is the reciprocal, this completes the proof.
Corollary 8. Maximum: given a fixed capability $\Psi_j$, the minimum expected difficulty $H_j$ and the maximum generality $\Gamma_j$ are obtained with a step agent characteristic function on $h = \Psi_j$.
Proof. By proposition 7, generality is increased as far as we move mass of the function from right to left, while keeping the area constant. This means that the maximum area with highest generality is obtained by a step function, whose location must be on $h = \Psi_j$.
Proof. (of proposition 9) The area of a step function with location $l$ is:
$$\Psi_j = \int_0^l 1\,dh = l.$$
The effort in this situation is:
$$M_j = \int_0^l h\,dh = \frac{l^2}{2}.$$
Expected difficulty is just:
$$H_j = \frac{M_j}{\Psi_j} = \frac{l}{2}.$$
So $\Psi_j = 2H_j$ and $S_j = \sqrt{2\frac{l^2}{2} - l^2} = 0$, so its reciprocal is $\infty$.
Proposition 10. Constant curve in an interval: given a constant function $\psi_j(h) = c$ from 0 to $q$, we have $\Psi_j = cq$, $S_j = q\sqrt{c(1-c)}$ and $\Gamma_j = 1/(q\sqrt{c(1-c)})$.
Proposition 11. Minimum generality (maximum spread): given a fixed capability $\Psi_j$, and assuming $\psi_j(h) = 0$ beyond difficulty $q$, then the maximum expected difficulty $H_j$ and the minimum generality $\Gamma_j$ (the most abstruse result in a bounded interval) are obtained with an increasing step agent characteristic function on $h = q-\Psi_j$ (until $q$), leading to $S_j = \sqrt{2\Psi_j(q-\Psi_j)}$ and $\Gamma_j = 1/\sqrt{2\Psi_j(q-\Psi_j)}$.
Proof. We have:
$$M_j = \int_{q-\Psi_j}^{q} h\,dh = \frac{q^2 - (q-\Psi_j)^2}{2} = \frac{\Psi_j(2q-\Psi_j)}{2}, \qquad S_j = \sqrt{2M_j - \Psi_j^2} = \sqrt{2\Psi_j(q-\Psi_j)}.$$
Proposition 12. With the same capability, the square spread for the minimum case (most abstruse case in an interval) is exactly twice the square spread for the constant case (in the same interval).
Proof. For the constant case we had that $\Psi_j = cq$. Plugging this into the result of $S_j$ for the constant case, we have:
$$S_j = q\sqrt{c(1-c)} = \sqrt{cq\,(q-cq)} = \sqrt{\Psi_j(q-\Psi_j)}.$$
Comparing with $S_j$ for the minimum case, and squaring them, we see that one doubles the other. So we have shown that the square spread for this minimum case is exactly twice the square spread for the constant case.
Proposition 13. Task transitivity: if an agent $\pi_j$ is $s$-saturated then for every task $\mu_b$ such that $A_{j,b} = 1$ in the saturated area then for all other tasks $\mu_a$ with $\hbar(\mu_a) \le \hbar(\mu_b)$ we have that $A_{j,a} = 1$.
Proof. If an agent $\pi_j$ is $s$-saturated then $A_{j,b} = 1$ for all tasks such that $\hbar(\mu_b) \le s$. If $\mu_b$ is in the saturated area, any other task $\mu_a$ of lower difficulty also is.
Proposition 14. Agent transitivity: if two agents $\pi_a$ and $\pi_b$ have maximum generality $\Gamma_a = \Gamma_b = \infty$ and $\Psi_a \le \Psi_b$ then for every task $\mu_i$ such that $A_{a,i} = 1$ then $A_{b,i} = 1$.
Proof. It is sufficient to see that both agents will have step functions.
Note that if the generality of $\pi_b$ is not infinite, it is not sufficient to have a curve for $\pi_b$ that covers the curve for $\pi_a$. The reason is that there might be values of $h$ for which $0 < \psi_a(h) < \psi_b(h) < 1$, and in these cases some of the tasks that make the non-zero value in $\psi_a(h)$ might not be among the tasks that make the value of $\psi_b(h)$.
Proposition 15. Same units: if we introduce a unit for difficulty, let us call it witts, then capability is also measured in witts, spread is also measured in witts and hence generality is measured in 1/witts.
Proof. As $\Psi_j$ is an integral over difficulty and the values of the function are unitless (accomplishment, which is a proportion), $\Psi_j$ has the same units as difficulty. As $M_j$ includes the factor $h$ in the integral, the result is in witts$^2$. Finally, from the definition of spread, $\sqrt{2M_j - \Psi_j^2}$, we get $\sqrt{\text{witts}^2}$, which means that spread is measured in witts, and generality in its reciprocal.

F.2 Proofs when using a normal distribution for capability
Here we include the proofs about the case where the ACC derives from assuming a normal distribution on capability. This can be interpreted as an aggregation of step curves where the capability of each is distributed normally or an agent such that the probability of a correct response depends on the CDF of a normal distribution using the difference between the capability and the instance difficulty.
Proposition 16. (proposition 1 in the paper) Assuming a normal distribution on capability, with standard deviation $\sigma$, the slope of the ACC will be $-\frac{1}{\sigma\sqrt{2\pi}}$.
Proof. (of proposition 1) We know that a normal distribution with mean $\mu$ and standard deviation $\sigma$ will lead to the following agent characteristic curve:
$$\psi_j(h) = 1 - \Phi\left(\frac{h-\mu}{\sigma}\right)$$
with $\Phi$ being the CDF of the standard normal distribution and $\phi$ being the density function of the standard normal distribution. The steepest slope of this curve is the first derivative at $\mu$, which is:
$$\text{slope} = -\frac{1}{\sigma}\,\phi\left(\frac{\mu-\mu}{\sigma}\right) = -\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(0)^2}{2}} = -\frac{1}{\sqrt{2\pi\sigma^2}}$$
and hence the slope of the ACC will be $-\frac{1}{\sigma\sqrt{2\pi}}$.
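Lemma 17. (lemma 2 in the paper) Assuming a normal distribution on the capability, with mean $\mu$ and standard deviation $\sigma$, such that the location is sufficiently beyond 0 to have negligible mass below 0 (i.e., $\frac{\mu}{\sigma} \gg 0$), we have that $M_j = \frac{\sigma^2+\mu^2}{2}$.
Proof. (of lemma 2) As in proposition 1, we know that a normal distribution with mean $\mu$ and standard deviation $\sigma$ will lead to the following agent characteristic curve:
$$\psi_j(h) = 1 - \Phi\left(\frac{h-\mu}{\sigma}\right)$$
We plug this into the definition of effort and operate a little bit on it in order to put the expression in terms of the cumulative distribution function $\Phi$ of the normal distribution:
$$M_j = \int_0^\infty h\,\psi_j(h)\,dh = \int_0^\infty h\left(1-\Phi\left(\frac{h-\mu}{\sigma}\right)\right)dh = \int_0^\infty h\,\Phi\left(\frac{\mu-h}{\sigma}\right)dh$$
Fortunately, we can find the following integral of the moment of the CDF on page 402 (second last, entry 10,001) in [156]:
$$\int x\,\Phi(a+bx)\,dx = \frac{1}{2b^2}\Big((b^2x^2-a^2-1)\,\Phi(a+bx) + (bx-a)\,\phi(a+bx)\Big) + C$$
where $\phi$ is the density function. In our case, $a = \frac{\mu}{\sigma}$ and $b = -\frac{1}{\sigma}$, so we can put all things together into:
$$M_j = \left[\frac{1}{2b^2}\Big((b^2x^2-a^2-1)\,\Phi(a+bx) + (bx-a)\,\phi(a+bx)\Big)\right]_0^\infty = [0+0] - \left[\frac{1}{2b^2}\Big((-a^2-1)\,\Phi(a) - a\,\phi(a)\Big)\right] = \frac{1}{2\left(\frac{1}{\sigma}\right)^2}\left(\left(\left(\frac{\mu}{\sigma}\right)^2+1\right)\Phi\left(\frac{\mu}{\sigma}\right) + \frac{\mu}{\sigma}\,\phi\left(\frac{\mu}{\sigma}\right)\right)$$
Since we are assuming that $\frac{\mu}{\sigma} \gg 0$, we have that $\Phi(\frac{\mu}{\sigma}) \approx 1$ and $\phi(\frac{\mu}{\sigma}) \approx 0$, so we get:
$$M_j \approx \frac{\sigma^2}{2}\left(\frac{\mu^2}{\sigma^2}+1\right) = \frac{\sigma^2+\mu^2}{2}$$
Proposition 18. (proposition 3 in §B.1) With the same assumptions as lemma 2, we have that spread $S_j = \sigma$ and $\Gamma_j = \frac{1}{\sigma}$.
Proof. (of proposition 3) As the normal distribution is symmetric, the location of the CDF is of course $\mu$, so the capability is $\Psi_j = \mu$, and plugging in $M_j$ from lemma 2, we have:
$$S_j = \sqrt{2M_j - \Psi_j^2} = \sqrt{\sigma^2+\mu^2-\mu^2} = \sigma$$
And by the definition of generality we have $\Gamma_j = \frac{1}{\sigma}$.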
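The results of this subsection can be checked numerically. The sketch below builds the ACC $\psi_j(h) = 1 - \Phi((h-\mu)/\sigma)$ with the standard-library error function and integrates it with a midpoint rule; the function names, the integration range and the values $\mu = 10$, $\sigma = 1.5$ are assumptions for illustration, with spread taken as $\sqrt{2M_j - \Psi_j^2}$.

```python
import math

def normal_acc(h, mu, sigma):
    """psi(h) = 1 - Phi((h - mu) / sigma), with Phi the standard normal CDF."""
    return 1.0 - 0.5 * (1.0 + math.erf((h - mu) / (sigma * math.sqrt(2.0))))

def capability_and_spread(mu, sigma, steps=50_000):
    """Midpoint-rule estimates of capability and spread for the normal ACC."""
    upper = mu + 10.0 * sigma  # the curve is numerically zero beyond this point
    dh = upper / steps
    capability = effort = 0.0
    for n in range(steps):
        h = (n + 0.5) * dh
        p = normal_acc(h, mu, sigma)
        capability += p * dh
        effort += h * p * dh
    return capability, math.sqrt(2.0 * effort - capability ** 2)

# mu / sigma is well above 0, so the mass below difficulty 0 is negligible.
capability, spread = capability_and_spread(mu=10.0, sigma=1.5)
```

With these values the estimated capability is close to $\mu$ and the estimated spread close to $\sigma$, in line with lemma 2 and proposition 3; choosing $\mu$ close to 0 relative to $\sigma$ would break the approximation.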

F.3 Aggregation
We show the following result:
Lemma 19. Given $N$ step ACCs with capabilities $a_1, a_2, \ldots, a_N$, if we average them into a single ACC, we have a descending staircase ACC with average capability $\Psi_j = \frac{1}{N}\sum_{i=1}^{N} a_i$ and spread $S_j$ being the standard deviation of the original capabilities.
Proof. We first consider the straightforward claim that $\Psi_j = \frac{1}{N}\sum_{i=1}^{N} a_i$. As each component has capability $a_i$, their average will make the capability of the composite, as areas are additive. Now, let us work with the definition of moment. Let us choose an index such that the capabilities $a_1, a_2, \ldots, a_N$ of the $N$ components are sorted by increasing capability, and for ease of notation, let us consider a value $a_0 = 0$. This means that for difficulties $h$ from $a_0 = 0$ to $a_1$ all the $N$ components are correct, so that the response $\psi_j(h)$ is $\frac{N}{N} = 1$; from $a_1$ to $a_2$ we will have that $N-1$ components are correct, so that the response is $\frac{N-1}{N}$, and so on, until the segment beyond $a_N$, whose response is zero. So,
$$M_j = \sum_{k=1}^{N} \frac{N-k+1}{N}\int_{a_{k-1}}^{a_k} h\,dh = \sum_{k=1}^{N} \frac{N-k+1}{N}\cdot\frac{a_k^2-a_{k-1}^2}{2} = \frac{1}{2N}\sum_{k=1}^{N} a_k^2,$$
where the last step follows by telescoping, and therefore
$$S_j = \sqrt{2M_j - \Psi_j^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} a_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} a_i\right)^2},$$
which is the expression of population standard deviation for the values $a_1, a_2, \ldots, a_N$.
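A small numerical sketch of lemma 19 follows. It integrates the staircase obtained by averaging step ACCs and compares the result with the mean and the population standard deviation of the component capabilities; the function name and the sample capabilities are hypothetical, and spread is again taken as $\sqrt{2M_j - \Psi_j^2}$.

```python
import math
from statistics import mean, pstdev

def staircase_metrics(capabilities):
    """Capability and spread of the average of step ACCs located at `capabilities`."""
    a = sorted(capabilities)
    n = len(a)
    capability = effort = 0.0
    previous = 0.0
    for k, a_k in enumerate(a):
        height = (n - k) / n                                   # fraction of components still correct
        capability += height * (a_k - previous)                # area of this stair
        effort += height * (a_k ** 2 - previous ** 2) / 2.0    # integral of h * height over the stair
        previous = a_k
    return capability, math.sqrt(max(0.0, 2.0 * effort - capability ** 2))

components = [2.0, 4.0, 4.0, 5.0, 9.0]
capability, spread = staircase_metrics(components)
```

For these components the staircase capability equals `mean(components)` and its spread equals `pstdev(components)`, the population (not sample) standard deviation.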

F.4 Scale transformations: from incommensurate values to thresholding
For many existing and future results in comparative cognition, psychometrics and AI we start with a matrix of results $r_{j,i}$ for task $i$ and agent $j$. Each response can be a numeric score, or an aggregation of several items, as the values in black in Fig. 1. In very controlled scenarios, where the magnitudes of each task are binary¹³ or correspond to a percentage, or are at least on the same scale, we can extract some insight by looking at the variance of rows [157], but remember the issues of the variance of a Bernoulli distribution as seen in §E, which we will discuss further in the following subsection. In general, however, looking at the variance for rows could be very misleading, as the original magnitudes can increase or decrease this variance in a spurious way. For instance, what if task $\mu_3$ had values in the interval between 0 and 1, and $\mu_7$ ranged over a much larger interval? With this problematic proxy for generality task $\mu_7$ would simply have more weight. If no further information about the tasks is available, it is hard to make the several columns of this matrix commensurate¹⁴.
The commensurability problem is related to the issue of deriving difficulties when no intrinsic difficulty is available. When the magnitudes are not commensurate and difficulty is not given, we need a point of reference. One possibility in these cases is to use a reference agent. For instance, in AI, it is quite common to use humans as a reference, and convert the score for a game or a task into a binary value that simply represents whether the agent is above human level. This is the usual approach in collections such as the Atari Games in ALE [19,158]. In fact, for this scenario, we have used the ARef transformation in Table 1, which simply converts the score to binary values representing whether the AI system reaches human performance or not. But then we calculate generality as we have introduced it in this paper, using task difficulty, and not as the non-informative variance of the Bernoulli distribution that appears as a result of the binarisation.
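As an illustration of the binarisation step only, the following minimal sketch thresholds an agent's scores against a reference agent. The function name and the scores are hypothetical and are not taken from ALE or from the accompanying repository. It also shows why the row variance of the binarised values is non-informative: it is fully determined by the mean response.

```python
import numpy as np

def aref_binarise(agent_scores, reference_scores):
    """Return 1 where the agent reaches the reference score on a task, else 0."""
    agent = np.asarray(agent_scores, dtype=float)
    reference = np.asarray(reference_scores, dtype=float)
    return (agent >= reference).astype(int)

# Hypothetical scores on five tasks with very different scales.
ai_scores = [410.0, 12.0, 3050.0, 0.5, 88.0]
human_scores = [300.0, 30.0, 2500.0, 0.9, 88.0]

binary = aref_binarise(ai_scores, human_scores)
mean_response = binary.mean()
bernoulli_variance = binary.var()           # population variance of 0/1 values

print(binary)
# The variance carries no information beyond the mean response:
print(bernoulli_variance, mean_response * (1 - mean_response))
```

Computing generality from these binary values still requires a difficulty for each task, which this sketch does not attempt to derive.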
Another common option to make the row variance more meaningful is to normalise the columns (to the same column mean and variance) or to use ranks; both cases imply a transformation that depends on the particular population. More precisely, the normalisation of $r_{j,i}$ consists of subtracting the column mean and dividing by the column standard deviation. The conversion of $r_{j,i}$ into ranks consists of replacing each value in column $i$ of the matrix by its rank in that column, with values in $1, 2, \ldots, M$, where $M$ is the highest rank and 1 is the lowest. In either case, we can now calculate a variance or standard deviation per row that can be more meaningful, as was attempted in the top matrix of Fig. 1.
If this variance is more meaningful now, can we compare it with generality? In order to do this, we can convert each new value into an actual (step) ACC using the value as a threshold. Namely, we convert each response in the response matrix into a step ACC with capability at $r_{j,i}$ or, in other words, we assume that $\psi_j(h) = 1$ iff $r_{j,i} \geq h$, and 0 otherwise. With this trick, we use the population results as difficulty and we convert each test into a series of instances of different difficulty. With this configuration, we can now calculate the generality of each row by simply averaging the step ACCs, following the results of Lemma 19, and we get a generality that is simply the reciprocal of the row standard deviation using the normalised or ranked values. This approach corresponds to the Rnk method in Table 1 when the $r_{j,i}$ are converted into rank values.
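The rank-and-threshold procedure can be sketched as follows. This is an illustrative reading of the text rather than the repository's implementation: the function names and the result matrix are hypothetical, ties are broken by row order (an assumption, since the text does not specify tie handling), and a row with zero spread is given infinite generality.

```python
import numpy as np

def column_ranks(results):
    """Rank each column from 1 (lowest) to M (highest); ties broken by row order."""
    results = np.asarray(results, dtype=float)
    order = np.argsort(results, axis=0, kind="stable")
    ranks = np.empty_like(order)
    positions = np.arange(results.shape[0])
    for col in range(results.shape[1]):
        ranks[order[:, col], col] = positions + 1
    return ranks.astype(float)

def row_capability_and_generality(values):
    """Average of step ACCs per row: capability is the row mean and
    generality the reciprocal of the population row standard deviation."""
    values = np.asarray(values, dtype=float)
    capability = values.mean(axis=1)
    spread = values.std(axis=1)             # population std (ddof=0)
    with np.errstate(divide="ignore"):
        generality = np.where(spread > 0, 1.0 / spread, np.inf)
    return capability, generality

# Hypothetical results: 4 agents (rows) on 3 tasks with incommensurate scales.
results = [[0.2, 150.0, 3.0],
           [0.9, 900.0, 9.0],
           [0.5, 100.0, 7.0],
           [0.7, 400.0, 1.0]]

ranks = column_ranks(results)
capability, generality = row_capability_and_generality(ranks)
print(ranks)
print(capability)
print(generality)
```

Because the ranks depend on who else is in the matrix, the resulting capability and generality are relative to this particular population, as noted above.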
Indeed, this rank normalisation is done in [24] (as described in Table S2 and then used in Table S5 of their supplementary material). This can be seen as a way of circumventing the problem of not having a commensurate difficulty for all tasks. Under this rank transformation, we can use the rank as a metric of difficulty that is commensurate for all tasks (for this population of 53 orangutans). With this particular configuration, we could apply the thresholding approach above and connect row variance with generality.
Of course, this thresholding method assumes a step model, which is too simplistic. As an alternative, we could use IRT instead. For instance, [159] normalise the data and then use IRT models to work with a different metric of generality, based on the variance and estimated with a proxy: the slope of the curves. This is related to, but significantly different from, applying IRT and then using GA as suggested by the IRT transformation in Table 1.
The minimum sizes in Table 1 (approximate rules of thumb in some cases or statistical analysis in others) are derived from [160,161]. Finally, the individual generality score can be applied to situations where the difficulty of a task depends on other agents taking part in a competing or cooperating role, using the Opp transformation in Table 1, which is especially necessary for social and adversarial situations [162,163,164].

F.5 Bernoulli limits and Interval Relative Squared Spread (IRSS)
In §E, we saw that considering the row variance as some kind of (inverse) proxy for generality did not work: if we were using a binary acceptability, we would have a Bernoulli distribution, and the variance would be determined by the average of results, as $r_j \cdot (1 - r_j)$. Variance would be a quadratic function of $r_j$, with minima (0) for the minimum and maximum average responses (0 and 1 respectively) and a maximum (0.25) at 0.5, as we saw in Fig. S4.
From Propositions 9, 11 and 10, we identified the highest generality (lowest spread), the lowest generality (highest spread) and the generality and spread for a constant curve of height $c$ in this interval. In Fig. 4 (top left), we show an example of 496 results going from a minimum capability of 8.88 to a maximum capability of 19.41, as shown on the x-axis. The figure shows the curves for the maximum spread (in red), the constant-curve spread (in blue) and the minimum spread (in green). We see how the points are all located under the red curve and most of them (all except 13.31%) below the blue curve.
Note that the interval does not start at 0, but the calculations can be adapted by subtracting the minimum value (in the example, 8.88) from the capabilities, because of the translation property. Then, in Proposition 12, we showed that the squared spread for the minimum-generality case (the most abstruse case in an interval), $S^2_{\max} = 2\Psi_j(q - \Psi_j)$, is exactly twice the squared spread for the constant case (in the same interval), $S^2_{\mathrm{cnst}} = \Psi_j(q - \Psi_j)$, which is therefore at the same distance from the minimum, $S^2_{\min} = 0$.
of low difficulty, this would change average performance but not capability. It is the proportion of each difficulty that matters. Nevertheless, they do not introduce task difficulty in the analysis, and end up using several percentiles [135, Table 1], suggesting a column-wise correlational analysis instead, even if this requires some other assumptions and is performed at the level of the population.
That means that the expression in the Bernoulli constraint is now translated into squared spreads in a band that goes from 0 to twice the value $\Psi_j(q - \Psi_j)$, and a constant ACC (one for which correct or incorrect responses do not depend at all on difficulty) would be in the middle. This suggests that we could use a normalisation of spread, as was given in Eq. 6. This Interval Relative Squared Spread (IRSS), or simply normalised spread, $s$, goes from $-1$ (minimum spread, maximum generality) to $1$ (maximum spread, minimum generality). Because of this reverse relation of spread to generality, for convenience, we use the term normalised generality for the negative value of normalised spread, i.e., $\gamma \overset{\mathrm{def}}{=} -s$, which goes from $-1$ (minimum generality) to $+1$ (maximum generality). Normalised generality does not have units, and we can use it to compare situations with different difficulty scales.
Note that normalised generality depends on the width of the interval ($q$). For the same agent, with a wider range of difficulties, the normalised generality would be more positive, while the unnormalised generality (and spread) would not change. While the range $(-1, 1)$ might resemble a correlation between difficulty and accomplishment, the right interpretation of the normalised generality must be as follows: for the items inside a range of difficulties, a high number means that the system is good at those with low difficulty and bad at those with high difficulty. Again, values of $\gamma$ closer to the maximum value (1) would mean that the ACC is close to a step curve, also meaning that difficulty is highly predictive of the behaviour of the system.
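The band described above can be explored numerically. The sketch below is illustrative only and rests on two assumptions that are not restated in this section: that the squared spread is twice the first moment of the ACC minus the squared capability, and that the IRSS is the linear rescaling of the band $[0, 2\Psi_j(q - \Psi_j)]$ to $[-1, 1]$, i.e. $s = S_j^2 / (\Psi_j(q - \Psi_j)) - 1$. Both function names are hypothetical.

```python
import numpy as np

def capability_and_squared_spread(psi, h):
    """Trapezoidal estimates on a difficulty grid h.
    ASSUMPTION: squared spread = 2 * integral(h * psi) - capability**2."""
    def integrate(y):
        return float(np.sum((y[1:] + y[:-1]) * np.diff(h)) / 2.0)
    capability = integrate(psi)
    squared_spread = 2.0 * integrate(h * psi) - capability ** 2
    return capability, squared_spread

def normalised_generality(capability, squared_spread, q):
    """ASSUMPTION: s = S^2 / (Psi * (q - Psi)) - 1 and gamma = -s."""
    if not 0.0 < capability < q:
        raise ValueError("capability must lie strictly inside (0, q)")
    s = squared_spread / (capability * (q - capability)) - 1.0
    return -s

q, target = 10.0, 4.0                       # interval [0, q], capability 4
h = np.linspace(0.0, q, 200001)
curves = {
    "step": (h <= target).astype(float),            # easiest items only
    "constant": np.full_like(h, target / q),         # independent of difficulty
    "abstruse": (h >= q - target).astype(float),     # hardest items only
}
for name, psi in curves.items():
    cap, s2 = capability_and_squared_spread(psi, h)
    print(name, round(cap, 2), round(s2, 2),
          round(normalised_generality(cap, s2, q), 2))

# Same capability and spread, evaluated over a wider difficulty interval.
gamma_narrow = normalised_generality(4.0, 6.0, 10.0)
gamma_wide = normalised_generality(4.0, 6.0, 20.0)
print(gamma_narrow, gamma_wide)
```

Under these assumptions the three curves land near the top, middle and bottom of the range of $\gamma$, and widening the interval for a fixed capability and spread moves $\gamma$ towards 1, in line with the dependence on $q$ noted above.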

G Reproducibility: Code and Data
Code and data for reproducibility are available at https://github.com/jorallo/generality, under the GNU General Public License (GPL). We use the R programming language [165], which allows us, and everyone else, to easily compare results with a number of packages for factor analysis, item response theory, etc. Unlike factor analysis and item response theory, the code for generality analysis does not use any random component or any estimation algorithm, and does not rely on parameters (other than the kind of study or options for representation), so the metrics are completely deterministic and perfectly reproducible. Running the code again should lead to the same numerical results and representations.
The library includes many functions for analysis and representation. Some of them plot individual ACCs or perform the full generality analysis, with optional comparison with factor analysis, difficulty extracted from IRT, etc. All the transformations in Table 1 are also available. The functions are implemented in generality functions v.X.X.X.R with a wrapper in generality.R that is independent of the version. We suggest starting with demo v.X.X.R, which includes a simple example of how the functions work.
The library includes particular R files for the following scenarios:
• ale: the Atari video games Arcade Learning Environment (§A.7).
• ctest: Thurstone letter series, with results on the C-test instances (main paper and §A.2).
• lambda: Lambda-One, a benchmark used to compare humans and simple reinforcement learning algorithms, as used in Fig. 2 in the main paper.
There is also an evolution.R file that runs the experiments and plots in the supplementary material dealing with evolutionary selective pressure (§C.3). The repository includes data folders for many of the above studies, and the data to run the generality analysis for each scenario.