Ensemble perception in the time domain: evidence in favor of logarithmic encoding of time intervals

Although time perception is based on the internal representation of time, whether the subjective timeline is scaled linearly or logarithmically remains an open issue. Evidence from previous research is mixed: while the classical internal-clock model assumes a linear scale with scalar variability, there is evidence that logarithmic timing provides a better fit to behavioral data. A major challenge for investigating the nature of the internal scale is that the retrieval process required for time judgments may involve a remapping of the subjective time back to the objective scale, complicating any direct interpretation of behavioral findings. Here, we used a novel approach, requiring rapid intuitive, ‘ensemble’ averaging of a whole set of time intervals, to probe the subjective timeline. Specifically, observers’ task was to average a series of successively presented, auditory or visual, intervals in the time range 300–1300 ms. Importantly, the intervals were taken from three sets of durations, which were distributed such that the arithmetic mean (from the linear scale) and the geometric mean (from the logarithmic scale) were clearly distinguishable. Consistently across the three sets and the two presentation modalities, our results revealed subjective averaging to be close to the geometric mean, indicative of a logarithmic timeline underlying time perception.


Introduction
What is the mental scale of time? This is one of the fundamental issues in timing research that has long been posed, yet remains only poorly understood. The well-known internal-clock model implicitly assumes linear coding of time: a central pacemaker generates ticks and an accumulator collects those ticks in a process of linear summation 1,2 . However, the neuronal plausibility of such a coding scheme has been called into doubt: large time intervals would require an accumulator with (near-) unlimited capacity 3 , rendering it very costly to implement neuronally 4,5 . Given this, alternative timing models have been proposed that use oscillatory patterns or neuronal trajectories to encode temporal information [6][7][8][9] . For example, the striatal beat-frequency model 6,9,10 assumes that time intervals are encoded in the oscillatory firing patterns of cortical neurons, with the length of an interval being discernible, for time judgements, by the similarity of an oscillatory pattern with patterns stored in memory. Neuronal trajectory models, on the other hand, use intrinsic neuronal patterns as markers for timing. However, owing to the 'arbitrary' nature of neuronal patterns, encoded intervals cannot easily be used for simple arithmetic computations, such as the summation or subtraction of two intervals. Accordingly, these models have been criticized for their lack of computational accessibility 11 . Recently, a neural integration model 12-14 adopted stochastic drift diffusion as the temporal integrator which, similar to the classic internal-clock model, starts the accumulation at the onset of an interval and increases until the integrator reaches a decision threshold. To avoid the 'unlimited-capacity' problem encountered by the internal-clock model (see above), the neural integration model assumes the ramping activities reach a fixed decision barrier, though with different drift rates -in particular, a lower drift rate for longer intervals. 
However, this proposal encounters a conceptual problem: the length of the interval would need to be known at the start of the accumulation. Thus, while a variety of timing models have been proposed, there is no agreement on how time intervals are actually encoded.
There have been many attempts, using a variety of psychophysical approaches, to directly reveal the subjective timeline that underlies our time judgments. However, it turned out that distinguishing between linear and logarithmic timing is largely constrained by the experimental paradigms adopted 5,[15][16][17] . In temporal bisection, for example, the bisection point was often observed at the geometric mean 18,19 , which led to the earliest speculation that the subjective timeline might be logarithmic in nature: If time were coded linearly, the midpoint on the subjective scale should be equidistant from both (the lower and upper) time references, yielding their arithmetic mean. By contrast, given logarithmic coding of time, the midpoint equidistant to both time references on the logarithmic scale would turn out to be their geometric mean, as observed. However, Gibbon and colleagues offered an alternative account for the geometric mean obtained with the bisection task, namely: the midpoint is determined by computing the ratio of the elapsed time T to the Short and Long reference durations (i.e., a comparison between Short/T and T/Long), which also yields the geometric mean 20,21 . Using a modified temporal-bisection procedure and signal-detection theory (SDT), Yi found that logarithmic timing with fixed variability (i.e., timing accuracy is constant in log-space) provided a better fit to the data than a linear representation of time with scalar variability (i.e., accuracy is relative to the interval being timed) 22 . However, Dehaene showed that "linear coding with scalar variability and logarithmic coding with fixed variability lead to the same metric of number similarity, and therefore to the same behavior" (p. 244) 23 . Thus, to date, it remains unsolved whether interval timing is based on a logarithmic or a linear internal representation.
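The ambiguity of the bisection result can be made concrete with a short sketch. The function names and the 200/800 ms reference durations below are illustrative assumptions, not values from any of the cited studies; the point is only that the log-scale midpoint and the ratio-comparison rule (Short/T = T/Long) coincide at the geometric mean, whereas a linear midpoint gives the arithmetic mean.

```python
import math

def bisection_point_linear(short_ms, long_ms):
    """Midpoint equidistant from both references on a linear scale."""
    return (short_ms + long_ms) / 2

def bisection_point_log(short_ms, long_ms):
    """Midpoint equidistant from both references on a logarithmic scale."""
    mid_log = (math.log(short_ms) + math.log(long_ms)) / 2
    return math.exp(mid_log)

def bisection_point_ratio(short_ms, long_ms):
    """Duration T at which Short/T equals T/Long (ratio-comparison rule)."""
    return math.sqrt(short_ms * long_ms)

# Hypothetical reference durations, chosen only for illustration.
short_ms, long_ms = 200, 800
print(bisection_point_linear(short_ms, long_ms))
print(bisection_point_log(short_ms, long_ms))
print(bisection_point_ratio(short_ms, long_ms))
```

Because the last two functions return the same value for any pair of references, a bisection point at the geometric mean cannot by itself discriminate logarithmic coding from a ratio-based decision rule on a linear scale.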
One main factor responsible for the inconsistent findings is that the observer's response to a time interval may not directly reflect the internal representation. For example, an external interval might be encoded and stored (in memory) in a compressed, logarithmic format internally. When that interval is retrieved, it may first have to be decoded (i.e., transformed from logarithmic to linear space) in working memory before any further comparisons. The involvement of decoding processes would complicate drawing direct inferences from empirical data. However, it may be possible to escape such complications by examining basic 'intuitions' of interval timing, which may bypass complex decoding processes. One fundamental perceptual intuition we use all the time is 'ensemble perception' -which refers to the idea that our sensory systems can rapidly extract statistical (summary) properties from a set of similar items, such as their sum or mean magnitude. For example, Dehaene and colleagues 24 used an individual number-space mapping task to compare Mundurucu, an Amazonian indigenous culture with a reduced number lexicon, to US American educated participants. They found that the Mundurucu group, across all ages, mapped symbolic and nonsymbolic numbers onto a logarithmic scale, whereas educated western adults used linear mapping of numbers onto space -favoring the proposal that the initial intuition of number is logarithmic 24 . And they surmised that the educated Western participants had acquired sophisticated mapping knowledge (i.e., decoding the internal representation) to overwrite the basic intuitive mapping. Moreover, kindergarten and pre-school children also exhibit a non-linear representation of numbers close to logarithmic compression (e.g., they place the number 10 near the midpoint of the 1-100 scale), and this nonlinearity becomes 'linearized' after some 3 to 4 years of schooling [25][26][27] .
Our perceptual intuition works very fast. For example, we quickly form an idea about the average size of apples from just taking a glimpse at the apple tree. In a seminal study by Ariel 28 , participants, when asked to identify whether a presented object belonged to a group of similar items, tended to automatically respond with the mean size. Intuitive averaging has been demonstrated for various features in the visual domain 29 , from primary ensembles such as object size 30,31 , color and grayness 32,33 , to high-level ensembles such as facial expression and lifelikeness [34][35][36] . Rather than being confined to the (inherently 'parallel') visual domain, ensemble perception has also been demonstrated for sequentially presented items, such as auditory frequency, tone loudness, tone duration, and weight [37][38][39][40] . In a cross-modal temporal integration study, Chen and colleagues 41 showed that the average interval of a train of auditory intervals can quickly capture a subsequently presented visual interval, influencing visual motion perception.
In brief, our perceptual systems can automatically extract overall statistical properties using very basic intuitions to cope with sensory information overload and the limited capacity of working memory. Thus, given that ensemble perception operates at a fast and low-level stage of processing (possibly bypassing many high-level cognitive decoding processes), using ensemble perception as a tool to test time perception may provide us with new insights into the internal representation of time intervals.
Against this background, we designed an interval-duration averaging task in which observers were asked to compare the average duration of a set of intervals to a standard interval. We hypothesized that if the underlying interval representation is linear, the intuitive average should reflect the arithmetic mean (AM) of the sample intervals. Conversely, if intervals are logarithmically encoded internally and intuitive averaging operates on that level (i.e., without remapping individual intervals from logarithmic to linear scale), we would expect the readout of the intuitive average at the intervals' geometric mean (GM). Note, though, that subjective durations are known to differ between visual and auditory signals 5,43,44 , as our auditory system has higher temporal precision than the visual system. Accordingly, the processing strategies may potentially differ between the two modalities. Thus, in order to establish whether the internal representation of time is modality-independent, we tested both modalities using the same interval sets in separate experiments.

Ethics statement
The methods and experimental protocols were approved by the Ethics Board of the Faculty of Pedagogics and Psychology at LMU Munich, Germany, and are in accordance with the Declaration of Helsinki 2008.

Participants
A total of 32 participants (16 females, mean age of 25.7) were recruited from the LMU Psychology community, with 16 randomly allocated to each experiment. Prior to the experiments, participants gave written informed consent; they were paid 8 Euros per hour for their participation. All reported normal (or corrected-to-normal) vision, normal hearing, and no somatosensory disorders.

Stimuli
The experiments were conducted in a sound-isolated cabin, with dim incandescent background lighting. Participants sat approximately 60 cm from a display screen, a 21-inch CRT monitor (refresh rate 100 Hz; screen resolution 800 x 600 pixels). In both experiments, visual stimuli (all gray disks; luminance 21.4 cd/m²) were presented in the monitor center against a black screen background (1.6 cd/m²). In Experiment 1, auditory stimuli (i.e., intervals) were delivered via two loudspeakers positioned just below the monitor, with a left-to-right separation of 40 cm.
Brief auditory beeps (10 ms, 60 dB; frequency of 2500 or 3000 Hz, respectively) were presented to mark the beginning and end of the auditory intervals. In Experiment 2, the intervals were marked visually by brief (10-ms) flashes of a gray disk (5° of visual angle in diameter) in the display center, presented at the beginning and end of each interval (i.e., between successive intervals).
As for the length of the (five) successively presented intervals on a given trial, there were three sets: Set 1: 300, 550, 800, 1050, 1300 ms; Set 2: 600, 700, 800, 900, 1000 ms; and Set 3: 500, 610, 730, 840, 950 ms. These sets were constructed such that Sets 1 and 2 had the same arithmetic mean (800 ms), which was larger than the arithmetic mean of Set 3 (727 ms), whereas Sets 1 and 3 had the same geometric mean (710 ms), which was shorter than the geometric mean of Set 2 (787 ms). Of note, the order of the five intervals (of the presented set) was randomized on each trial.
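The dissociation between the two means can be checked directly from the listed durations. The following is a minimal sketch; the names `INTERVAL_SETS_MS`, `arithmetic_mean`, and `geometric_mean` are illustrative and not part of the original analysis. The geometric mean is computed as the exponential of the mean log-duration, which is exactly the readout expected if averaging operates on a logarithmic scale. Values computed from the durations as listed may differ from the reported means by a few milliseconds.

```python
import math

# Interval sets as listed in the Stimuli section (durations in ms).
INTERVAL_SETS_MS = {
    "Set 1": [300, 550, 800, 1050, 1300],
    "Set 2": [600, 700, 800, 900, 1000],
    "Set 3": [500, 610, 730, 840, 950],
}

def arithmetic_mean(durations_ms):
    """Average on the linear scale."""
    return sum(durations_ms) / len(durations_ms)

def geometric_mean(durations_ms):
    """Average on the logarithmic scale, mapped back to milliseconds."""
    mean_log = sum(math.log(d) for d in durations_ms) / len(durations_ms)
    return math.exp(mean_log)

for name, durations in INTERVAL_SETS_MS.items():
    am = arithmetic_mean(durations)
    gm = geometric_mean(durations)
    print(f"{name}: AM = {am:.0f} ms, GM = {gm:.0f} ms")
```

Under linear averaging, Sets 1 and 2 should yield the same estimate and Set 3 a lower one; under logarithmic averaging, Sets 1 and 3 should agree and Set 2 should be highest.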

Procedure
Two separate experiments were conducted, testing auditory (Experiment 1) and visual stimuli (Experiment 2), respectively. Each trial consisted of two presentation phases: successive presentation of five intervals, followed by the presentation of a single comparison interval.
Participants' task was to indicate, via a keypress response, whether the comparison interval was shorter or longer than the average of the five successive intervals. There was no time pressure on the response.
In Experiment 1 (auditory intervals), trials started with a fixation cross presented for 500 ms, followed by a succession of five intervals demarcated by six 10-ms auditory beeps. Along with the onset of the auditory stimuli, a '1' was presented on the display monitor, telling participants that this was the first phase of the comparison task. The series of intervals was followed by a blank gap (randomly ranging between 800 and 1200 ms), with a fixation sign '+' on the screen (indicating the transition to comparison phase 2). After the gap, a single comparison duration demarcated by two brief beeps (10 ms) was presented, together with a '2', indicating phase two of the comparison. Following another random blank gap (of 800-1200 ms), a question mark ('?') appeared in the center of the screen, prompting participants to report whether the average interval of the first five (successive) intervals was longer or shorter than the second, comparison interval (Figure 2A). Participants issued their response via the left or right arrow keys (on the keyboard in front of them) using their two index fingers, corresponding to either 'shorter' or 'longer' judgments. To make the two phases of the interval presentation clearly distinguishable, two different frequencies (2500 and 3000 Hz) were randomly assigned to the first and, respectively, the second set of auditory interval markers. Experiment 2 (visual intervals) was essentially the same as Experiment 1, except that the intervals were delivered via the visual modality and were demarcated by brief (10-ms) flashes of gray disks in the screen center (see Figure 2B). Also, the visual cue signals used to indicate the two interval presentation phases ('1', '2') in the 'auditory' Experiment 1 were omitted, to ensure participants' undivided attention to the judgment-relevant intervals.
In order to obtain, in an efficient manner, reliable estimates of both the point of subjective equality (PSE) and the just noticeable difference (JND) of the psychometric function of the interval comparison, we employed the updated maximum-likelihood adaptive procedure from the UML toolbox for Matlab 45 . The comparison interval was set to 500 ms initially, and the UML adaptive procedure then determined the next comparison interval based on the participant's response. Moreover, to mitigate habituation and expectation effects, the sequences of comparison intervals for three different sets were presented randomly intermixed across trials, though with concurrent tracking of the three separate adaptive procedures.
Prior to the testing session, participants were given verbal instructions and then familiarized with the task in a practice block of 30 trials (10 comparison trials for each set). Of note, upon receiving the instruction, most participants spontaneously voiced concern about the difficulty of the judgment they were told to make. However, after performing just a few trials of the training block, they all expressed confidence that the task was easily doable after all, and they all went on to complete the experiment successfully. In the formal testing session, each of the three sets was tested 80 times, yielding a total of 240 trials per experiment. The whole experiment took approximately 60 minutes to complete.

Results
[…] F(…) = 7.31, p < .01, η²g = .012. […] confirmed that the Set effect was mainly due to the mean being highest with Set 2. In more detail, for the auditory experiment (Figure 3a), the mean of Set 2 was larger than those of Set 1 (p = .01) and Set 3 (p < .001), while there was no significant difference between the latter (p = .42). The result pattern was the same for the visual experiment (Figure 3b), with Set 2 generating a larger mean than both Set 1 (p = .007) and Set 3 (p < .001), with no reliable difference between the latter (p = .08). These results are consistent with one of our predictions, namely, that the main averaging process for providing perceptual summary statistics is based on the geometric mean, in both the visual and auditory modalities.
In addition, for the three sets of auditory intervals (Experiment 1), the mean PSEs did not differ significantly from their corresponding geometric means (paired t-test pooled across the three sets), t(47) = 1.71, p = .093, but were significantly smaller than the arithmetic means.
The latter effect is consistent with previous reports that auditory intervals are often perceived as longer than physically equivalent visual intervals 43,46 .
Another key parameter providing an indicator of an observer's temporal sensitivity (resolution) is given by the just noticeable difference (JND), defined as the interval difference between the 50% and 75% thresholds estimated from the psychometric function. Figure 4 depicts the JNDs obtained in Experiments 1 and 2, separately for the three sets of intervals.
[…], however, revealed the JNDs to be significantly smaller for auditory than for visual interval averaging, F(1,29) = 8.93, p < .01; that is, temporal resolution turned out higher for the auditory than for the visual modality, consistent with the literature 47 .
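To illustrate how the PSE and the JND relate to a psychometric function, here is a minimal sketch. It assumes a logistic function of the comparison duration; the functional form actually fitted by the adaptive procedure is not stated above, and the parameter values (`pse_ms = 710`, `slope_ms = 100`) are purely hypothetical.

```python
import math

def p_longer(comparison_ms, pse_ms, slope_ms):
    """Probability of a 'longer' response under an assumed logistic function."""
    return 1.0 / (1.0 + math.exp(-(comparison_ms - pse_ms) / slope_ms))

def threshold(p, pse_ms, slope_ms):
    """Comparison duration at which 'longer' is reported with probability p."""
    return pse_ms + slope_ms * math.log(p / (1.0 - p))

def jnd(pse_ms, slope_ms):
    """Interval difference between the 50% and 75% thresholds."""
    return threshold(0.75, pse_ms, slope_ms) - threshold(0.50, pse_ms, slope_ms)

# Hypothetical parameters, for illustration only.
pse_ms, slope_ms = 710.0, 100.0
print(threshold(0.50, pse_ms, slope_ms))  # the 50% threshold is the PSE
print(jnd(pse_ms, slope_ms))
```

With this parameterization the 50% threshold coincides with the PSE, and the JND depends only on the slope parameter, so a shallower function (larger `slope_ms`) corresponds to lower temporal sensitivity.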
Thus, taken together, evaluation of both the mean and sensitivity of the participants' interval estimates demonstrated not only that ensemble coding in the temporal domain is accurate and consistent, but also that the geometric mean is used as the predominant averaging scheme for performing the task.

Discussion
[…] which the geometric and arithmetic means exhibit different patterns (see Figure 1). The pattern we found was essentially the same with both auditory (Experiment 1) and visual intervals (Experiment 2); and, critically, it matched the pattern expected for the geometric, rather than the arithmetic, mean (see comparison between the predictions in Figure 1 and the behavioral data in Figure 3). This supports our hypothesis that, regardless of the sensory modality, intuitive ensemble averaging of time intervals is based on logarithmically coded time.
In other words, at least in the short range of intervals (from 300 to 1300 ms), the subjective timeline is likely logarithmically scaled.
Unlike estimating the mean of visual ensembles (e.g., mean object size or mean facial expression), using an averaging task in the temporal domain raises a pragmatic concern: how do we actually average time across time, or, in Wearden and Jones's 48 words: 'can people do this at all?' (p. 1295). In their study, participants were asked to average three consecutively presented durations and compare their mean to that of three subsequently presented durations; Wearden and Jones 48 found that not only could participants accurately extract the arithmetic mean, but the estimated means also remained indifferent to variations in the spacing of the sample durations. In the current study, by adopting the averaging task for multiple temporal intervals (>3), we resolved the problem encountered with the temporal bisection task, namely, that, as outlined in the Introduction, the finding of the geometric (bisection) mean may be the outcome of a ratio comparison 20,21 , rather than reflecting the internal subjective timeline. Specifically, we hypothesized that temporal ensemble perception may be indicative of a fast and intuitive process […] processing of time appears to have opted for the latter, ensuring computational efficiency. Thus, given that the subjective scale is logarithmic, intuitive averaging would yield the geometric mean.
It could, of course, be argued that participants may adopt alternative weighting schemes to simple (equally weighted) arithmetic or geometric averaging. For example, the weight of an interval in the averaging process might be influenced by the length of that interval and/or the position of that interval in the sequence. Thus, for example, a long interval might engage more attention than a short interval, with weights assigned to the intervals in accordance with their lengths. We simulated this alternative averaging strategy using an ideal observer model, which assumes that the observer encodes the presented intervals veridically (i.e., linearly), but assigns weights according to the proportion of a given interval duration to the total duration. As illustrated in Figure 5 (left panel), if this linear weighted averaging process (WA) were adopted as the dominant strategy for computing the mean interval, the estimated mean for Set 1 should be the largest one among the three sets. However, the observed data clearly are at variance with this model.
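The prediction of the duration-weighted scheme can be reproduced with a few lines. This is a minimal sketch of the WA rule as described above (veridical encoding, weight equal to each interval's share of the total duration); the function name is illustrative, and the sketch is noise-free, so it is not a reconstruction of the full simulation shown in Figure 5.

```python
def duration_weighted_average(durations_ms):
    """Linear average in which each interval is weighted by its share of the total."""
    total = sum(durations_ms)
    weights = [d / total for d in durations_ms]  # weights sum to 1
    return sum(w * d for w, d in zip(weights, durations_ms))

# Interval sets from the Stimuli section (durations in ms).
interval_sets_ms = {
    "Set 1": [300, 550, 800, 1050, 1300],
    "Set 2": [600, 700, 800, 900, 1000],
    "Set 3": [500, 610, 730, 840, 950],
}

for name, durations in interval_sets_ms.items():
    print(f"{name}: WA = {duration_weighted_average(durations):.1f} ms")
```

Because longer intervals receive larger weights, the set with the widest spread (Set 1) yields the largest weighted mean, which is the ordering the text identifies as inconsistent with the observed data.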
As has previously been demonstrated, more recent events in a temporal sequence tend to be more heavily weighted than earlier ones, which is known as the recency effect 51 . […] Figure 5 (right panel). In EMA, we constructed a similar monotonic function by adding an exponential rate (α) in calculating the weights. That is, instead of progressing linearly, the weight assigned to each interval decreases exponentially as the time order increases, but never reaches zero. Figure 5 (middle panel) presents the EMA simulated results for the three interval sets (with α = 0.9 as the initial weight). Comparison with Figure 3 shows that none of those weighting schemes is consistent with the observed behavioral pattern. In short, we examined three alternative averaging schemes (WA, EMA, and WMA) and found that none of them could explain the patterns we observed in Experiments 1 and 2. We are, thus, fairly confident that we can exclude those alternatives. Of course, there are many other, unusual weighting schemes one could think of. However, based on Occam's razor, our observed data patterns favor the simple geometric averaging account.
Logarithmic representation of stimulus intensity, such as of loudness or weight, was proposed by Fechner more than a century and a half ago 52 , based on the fact that the JND measured at a given intensity is proportional to the stimulus intensity (Weber's law). It has been shown that, for the same amount of information (quantized levels), the logarithmic scale provides the minimal expected relative error that optimizes communication efficiency, under the constraint that neural storage of sensory or magnitude information in general is capacity-limited 53 . In this respect, logarithmic timing provides a good solution for coping with a limited STM capacity to represent longer periods of time. However, as argued by Gallistel 54 , logarithmic encoding makes valid computations problematic: "Unless recourse is had to look-up tables, there is no way to implement addition and subtraction, because the addition and subtraction of logarithmic magnitudes corresponds to the multiplication and division of the quantities they refer to" (p. 8). It is likely that the ensuing computational complexity pushed intuitive ensemble averaging onto the internal, subjective scale, rather than the external, objective scale, which would have required multiple nonlinear transformations. Thus, our results join the increasing number of studies suggesting that, like other magnitudes 23,55 , time is represented internally on a logarithmic scale, and that intuitive averaging processes likely bypass higher-level cognitive computations. It should be noted that higher-level computations based on the external, objective scale can be acquired through training and education, and this is linked to mathematical ability 23,56,57 .
Interestingly, a recent study also reported that, under dual-task conditions with an attention-demanding secondary task taxing visual working memory, the mapping of number onto space changed from linear to logarithmic 58 -providing convergent support for our hypothesis of an intuitive judgment strategy that can operate with a minimum of cognitive resources.
Another interesting finding of the present study concerns the overall underestimation of the (objective) mean interval durations, which was evident with all three sets of intervals and for both modalities (though it was particularly marked with visual intervals). The general underestimation is consistent with the subjective shortening effect, a source of bias reducing individual durations in memory 59,60 . The underestimation was more pronounced in the visual as compared to the auditory modality, which is consistent with the classic modality effect, of auditory events being judged as longer than visual events. This effect has been heavily investigated, the dominant idea being that temporal information is processed with higher resolution in the auditory than in the visual domain 43,61-63 . Thus, rather than examining whether the estimated mean was closer to the AM or GM for each ensemble set, we instead focused on the global estimation pattern across multiple sets and examined whether the patterns observed conformed with any of our hypotheses. With a consistent pattern found with all three sets (Figure 3) and underpinned by strong statistical power, it is reasonable to conclude that the obtained data genuinely reflect an intuitive process of averaging multiple durations, where the average lies close to their geometric mean.
In summary, the present study provides behavioral evidence supporting a logarithmic representation of subjective time, and indicating that intuitive ensemble averaging is based on the geometric mean. Although the validity of behavioral studies has been increasingly acknowledged in recent years, the lack of direct information about the neural basis makes it difficult to substantiate the inner timeline. We argue that achieving a full understanding of human timing requires a concerted research effort from both the psychophysical and neural perspectives.