## Introduction

As a cultural, historical, cognitive, and formal system, music is subject to a number of forces that have shaped its evolution over time. In the Western tradition, music from the 17th, 18th, and 19th centuries sounds characteristically different. One main aspect of that difference, apart from the development of performance and instrumentation, is characterized by the music-theoretical system of tonality. Based on historical reasoning, music theorists have proposed accounts of the process by which Western tonality emerged, stabilized, extended, and eventually destabilized over the course of its history (Aldwell and Schachter, 2003; Dahlhaus, 1968; Fetis, 1844; Kostka and Payne, 2013; Rameau, 1722; Riemann, 1893; Schoenberg, 1969). For instance, Fetis (1844) describes the history of tonality as a succession of stages that are determined by significant changes in the relation of notes to each other. For example, he locates the transition from the tonalité ancienne to the tonalité moderne–roughly the transition from the Renaissance to the Baroque–in a specific bar in Claudio Monteverdi’s motet Cruda amarilli. There, Monteverdi writes an unprepared dominant seventh chord, which would have been regarded as false by Renaissance composers. Since then, Fétis’s claim and the particular musical example have been the subject of musicological debate (Chafe, 1992; Dahlhaus, 1968; Long, 2020; McClary, 1976).

In recent years, the emerging field of musical corpus studies has turned to these questions from a new angle. By analyzing large datasets of digital encodings of musical pieces with computational and statistical methods, researchers aim at bridging the gap between qualitative-historical and quantitative-empirical paradigms (Neuwirth and Rohrmeier, 2016) and develop novel ways to study the old questions mentioned above (see Section “Related empirical research”). This paper participates in this endeavor by using computational modeling for a data-driven analysis of historical changes in tonality and focuses on a particular aspect: the historical development of modes in Western classical music.

### Related empirical research

Although definitions and models vary (Dahlhaus et al., 2001; Hyer, 2001; Lerdahl, 2001; Moss et al., 2019; Tymoczko, 2006), characterizations of tonality share (at least) some reference to the pitch level. For instance, a large number of previous empirical studies is based on a pitch-distribution model of tonality (e.g., Huron, 2006; Knopoff and Hutchinson, 1983; Krumhansl, 1990; Meyer, 1957; Snyder, 1990; Youngblood, 1958). Psychological and neuroscientific research often concentrates on the acquisition and mental representation of musical knowledge of pitch-class distributions (Bharucha and Krumhansl, 1983; Butler and Brown, 1994; Huron and Parncutt, 1993; Janata et al., 2002; Koelsch et al., 2013; Krumhansl, 1990; Krumhansl and Kessler, 1982; Large et al., 2016; Pearce and Rohrmeier, 2012; Tillmann et al., 2000, 2008; Zatorre and Salimpoor, 2013). Computational music-theoretical approaches aim at uncovering structures in the music itself by employing statistical methods to study distributions of pitch classes in musical corpora (Lieck et al., 2020; Moss, 2019; Parncutt, 2015; Savage et al., 2015; Temperley, 2009; White, 2013). The results of corpus-based research on tonality exhibit high similarities to the psychological findings (Huron, 2006; Krumhansl, 1990; Krumhansl and Cuddy, 2010): distributions of notes in musical corpora strongly resemble the patterns of mental representations and expectancy in behavioral experiments such as probe-tone studies (Krumhansl, 2004; Krumhansl and Kessler, 1982; Temperley and Marvin, 2008). Accordingly, it is assumed that implicit learning of statistical regularities in pitch distributions is a central factor for the cognition of tonality (Pearce, 2018; Rohrmeier and Cross, 2009; Rohrmeier and Rebuschat, 2012; Tillmann, 2005; Tillmann et al., 2000).

There are a number of previous empirical studies that investigate diachronic developments of tonality in Western classical music based on the statistics of pitch-class distributions. Rodrigues Zivic et al. (2013) study distributions of note bigrams (melodic intervals) in a large corpus of short melodies (Viro, 2011) and are able to distinguish four periods of music history based on the usage of these intervals in musical pieces. Several recent studies, as well as the present article use the music collection from the ClassicalArchives internet database (see Section “Data”). White and Quinn (2016) extracted a dataset of time slices of simultaneously sounding notes of approximately 13,800 pieces and report a diachronically increasing ratio of dominant-seventh chords to dominant triads. This finding is related to Dahlhaus’s (1968) assessment of the importance of dominant-seventh chords for the beginning of harmonic tonality, although the precise nature of this relation requires further investigation. A sub-sample of 33 composers from the time-slice dataset is used by Yust (2019) to study historical changes in pitch-class distributions using the discrete Fourier transform. His findings indicate that during the 18th and 19th centuries composers begin to explore modulations to more distantly-related keys, leading to a decrease in what he calls “diatonicity”. A similar finding is reported by Moss (2019) who shows that the tonal material used by Western classical composers expands historically, enabling them to write more chromatic and enharmonic compositions in the 18th and 19th centuries. Using approximately 24,000 pieces from ClassicalArchives, Huang et al. (2017) predict the composer of a musical piece based on the interval statistics of the pieces. Moreover, their clustering analysis shows that three groups of composers emerge which roughly correspond to three historical musical epochs. Weiß et al. (2019) use a dataset of 2000 audio recordings from which they computationally extract several features related to musical pitch in order to study a range of music historical questions. Their findings indicate a historical shift in the directionality of chord transitions (see also Moss et al., 2019; Rohrmeier and Cross, 2008), an increase in dissonant intervals, as well as an overall increase in tonal complexity.

Few studies focus exclusively on modes in Western classical music. A characterization of eight modes in medieval music is given by Huron and Veltman (2006), albeit only on the basis of a relatively small sample of 98 Gregorian chants. This work was recently enlarged by Cornelissen et al. (2020) who introduce and compare several mode classification methods, and apply them to this repertoire on the basis of more than 20,000 chants. The questions addressed by Albrecht et al. (2014) are particularly similar to the ones studied in this paper. Based on a dataset of 455 pieces sampled from various sources, they study the historical development of modes in Western classical music between 1400 and 1750, that is, in the transition between the Renaissance and Baroque periods. They acknowledge a particular difficulty: in order to classify musical pieces into different modes by calculating the similarity of their pitch-class distributions to pre-defined mode templates, one needs to know how many templates there are and what they look like. They propose to solve this problem by taking for granted that the major and minor modes were well-established in the Common-Practice period, and that they can be represented by the pitch-class distributions reported by Albrecht and Shanahan (2013). Their methodology can be summarized as follows: For the latest time period 1700–1750 they calculate the Euclidean distance of the given major and minor templates to all twelve transpositions of a piece and keep that transposed piece with minimal distance to either mode. These transposed pieces are then clustered together via a hierarchical clustering algorithm to determine the optimal number of modes for this time period. Subsequently, for each found cluster, the average pitch-class distribution is calculated and taken as new mode template for the earlier 50-year bin. By progressively moving backwards in time they find that the time period 1600–1650 suggest three rather than two modes. In the earliest period in their sample from 1400–1450 four modes emerge from this procedure. While their overall methodology bears resemblance to our proposed geometrical model (see Section “Geometric Model”), there is a crucial difference: our approach does not assume the existence and particular shapes of the major and minor modes a priori (see Section “Assumptions and contributions”) but demonstrates that both aspects can be inferred from data.

### Assumptions and contributions

The aforementioned approaches rely on several specific music-theoretical assumptions, in particular that Western classical music is governed by exactly two modes, the major mode and the minor mode. With the notable exception of Albrecht et al. (2014), this underlying assumption is rarely discussed. The present approach explicitly and systematically addresses this fundamental assumption with data-driven methods. We do neither assume a fixed number of modes nor specific characteristics of these modes, but rather infer them in an unsupervised manner for different time periods.

Empirical music research commonly presupposes three further assumptions that are adopted in this study. The first assumption is octave equivalence, according to which notes group into pitch classes. For example, all Cs (e.g., C1, C2, C3, …) are grouped into the pitch class C. Second, musical keys are commonly assumed to be invariant under transposition. This means that the relative importances of a key’s scale degrees are independent of the root. The third assumption is that there are only twelve distinct pitch-classes that are commonly represented as integers {0, …, 11}. This, together with the assumption of two modes, implies that there are exactly 2 × 12 = 24 musical keys. While the assumption of octave equivalence is supported by empirical evidence (Wright et al., 2000), transpositional invariance is debatable, in particular with respect to the historical appropriateness of this concept (Gárdonyi and Nordhoff, 2002; Lieck et al., 2020; Moss, 2019). For instance, the findings of Quinn and White (2017) indicate that the pitch-class distributions of different keys might not generally be related by mere transposition. For a detailed account of the music theoretical concepts see the Appendix.

The contributions of this paper are summarized as follows: (1) We propose two novel models: a geometric model of modes as clusters of musical pieces in a non-Euclidean space, and a probabilistic Bayesian model of modes as Dirichlet distributions; (2) using the geometric model, we determine the relation between modes for five historical epochs and infer their cardinality via unsupervised learning, and (3) we apply the probabilistic model to infer the characteristics of major and minor from data and compare them to previous findings. Using recent computational methods, our approach provides a data-driven analysis of the historical changes of tonality.

## Materials and models

### Data

This study uses a dataset of files in Musical Instrument Digital Interface (MIDI) format from the internet platform ClassicalArchives, consisting of approximately 21,000 pieces of Western classical music.Footnote 1 The files were generated by users either transcribing musical scores or recording themselves playing the pieces. We use this dataset because it contains entire musical works as opposed to only segments or short melodic fragments. Moreover, it covers a broad historical range and links this study to previous research mentioned above. While the data does not conform to the high standards of scholarly editions, its quantity statistically outweighs potential errors caused by encoding, transcribing, or recording the pieces. In particular, minor errors in the encodings of musical pieces only marginally affect their pitch-class distributions and associated statistics that are the basis for this study. We inspected the dataset at random and did not find major errors such as incomplete or mislabeled pieces. In order to show chronological changes in tonality, we use the subset of 12,625 pieces (over 55 million notes) for which the year of composition is given.

Since the ClassicalArchives dataset is crowd-sourced, it is prone to certain biases. For instance, the prevalence of composers and pieces is likely to reflect contemporary rather than historical popularity. Moreover, piano compositions might be over-represented because many files have been entered by users via a MIDI keyboard. This also affects the historical balance of the data. In particular, we observe a lack of data from pre-Baroque periods in which music was predominantly vocal. Therefore, we additionally included 777 Renaissance pieces from several scholarly resources, namely from the Lost Voices project (Freedman, 2014),Footnote 2 the Citations: The Renaissance Imitation Mass project (CRIM; Freedman et al., 2017),Footnote 3 and the Electronic Locator of Vertical Interval Successions project (ELVIS; McKay and Fujinaga, 2015),Footnote 4 resulting in 12,625 + 777 = 13,402 pieces in total.

The data was subsequently converted into lists of notes from which pitch-class counts (e.g., Youngblood, 1958) were calculated for each piece, weighted by the total duration of each pitch-class. For example, if a piece contains 200 times the note F with a duration of 1/8 each, and 50 times the note C with a duration of 1/2, both would have a total duration of 25 in this weighted representation. Figure 1 shows the distribution of the pieces over time. Based on the minima of a Gaussian kernel density estimate (Bishop, 2006), shown as a smooth curve, boundaries between five time periods are determined, resulting in splits at the years 1649, 1758, 1817, and 1856. The earliest pieces have been grouped together into a single period before 1649 to reduce data sparsity. This division approximately reflects the succession of the Renaissance, Baroque, Classical, early Romantic, and late Romantic musical periods (Burkholder et al., 2014).

A subset of 6655 pieces (49.7%) is annotated with a key label (e.g., C major or D minor) by the ClassicalArchive users. Albrecht and Shanahan (2013) estimate the accuracy of the labels to be 87.5%. We use these labels only for evaluating our model (see Section “Results and Discussion”) which operates on all 13,402 pieces (with or without key labels). In the following, we refer to key labels and years of composition as “metadata”, and to the musical pieces as “data”.

### Geometric model

Similar to the bag-of-words model used in Natural Language Processing (Manning and Schütze, 2003), we model a musical piece as the vector of its pitch-class counts weighted by duration. Figure 2 shows the weighted pitch-class counts for an example piece in C major.

This vector model is the basis of mathematical spaces in which pieces that have similar pitch-class counts are close to one another; different notions of similarity define different spaces. Mathematically, this representation allows for the application of geometrical approaches in $${{\mathbb{R}}}^{12}$$ and its quotient spaces. The proposed geometric model formalizes tonality in two stages using the concepts key and mode. First, we define the key space in which pieces are close if they have similar keys and denote it by $${\mathbb{K}}$$. Second, we define the mode space in which pieces are close if they have similar modes and denote it by $${\mathbb{M}}$$. This approach enables a conceptualization of key and mode as related but distinct relations between pieces without assuming a specific number of keys or modes and their respective characteristics.

Since the key of a piece does not depend on the absolute frequencies of pitch classes (the length and the key of a piece are independent), the pitch-class vectors of all pieces are normalized such that each vector sums to one. This is expressed mathematically by dividing each pitch-class vector p by its sum ∑ipi (i.e., the L1 norm). We define the key-space distance of two pieces p and q as

$${d}_{{\mathbb{K}}}(p,q)\left\Vert\frac{p}{{\sum }_{i}{p}_{i}}-\frac{q}{{\sum }_{i}{q}_{i}}\right\Vert_{2},$$
(1)

where 2 de notes the Euclidean norm, $$| | p| {| }_{2}=\sqrt{{\sum }_{i}{p}_{i}^{2}}$$. Using this notion of distance, different keys emerge as clusters of pieces. Moreover, it is to be expected that keys which are close on the circle of fifths are also close in the key space because they share similar pitch-class statistics.

Under the assumption that the internal structure of a key is invariant under transposition (see Section “Assumptions and contributions”), modes correspond to invariance classes of keys. In the mode space $${\mathbb{M}}$$, the distance of two pieces p and q is computed by choosing the minimum of the key distances from p to each of the twelve transpositions of q. If two pieces p and q are in the same mode, they are close with respect to the metric

$${d}_{{\mathbb{M}}}(p,q)=\mathop{\min }\limits_{i\,\in\, {{\mathbb{Z}}}_{12}}{d}_{{\mathbb{K}}}({\sigma }_{i}(p),q)=\mathop{\min }\limits_{i\,\in\, {{\mathbb{Z}}}_{12}}{d}_{{\mathbb{K}}}(p,{\sigma }_{i}(q)),$$
(2)

where σi(p) denotes the transposition (circular shift) of p by i semitones, $${({\sigma }_{i}(p))}_{k}={p}_{(k-i\;\text{mod}\;12)}$$ for $$k\in {{\mathbb{Z}}}_{12}$$. Note that the mode-space distance $${d}_{{\mathbb{M}}}$$ is symmetric. Analogously to the key space $${\mathbb{K}}$$, modes emerge as clusters in the mode space $${\mathbb{M}}$$. Note that two pieces that are far apart in the key space, such as pieces in C and F major, are close in the mode space. A mathematical proof that $${\mathbb{M}}$$ is a pseudometric space can be found in the Appendix.

### Bayesian model

Since the size and structure of the clusters in $${\mathbb{M}}$$ characterize the particular modes, we propose a generative Bayesian model for unsupervised mode learning and classification. Generally, Bayesian models are used in computational cognitive science to describe how the human mind acquires abstract symbolic knowledge inductively from the observation of examples (Chater et al., 2006; Griffiths et al., 2008; Tenenbaum et al., 2011). Bayesian statistics is expressive and powerful, because it provides a generic framework for the specification of scientific models and uses formalized plausible reasoning to implement them (Jaynes, 2003). In particular, this separation of modeling and inference facilitates the development of models tailored to the task at hand, instead of limiting scientific models to specific pre-existing algorithms (Bishop, 2013).

We propose and implement a Bayesian model that learns an abstract representation of modes from the observation of concrete musical pieces. Figure 3 shows the Bayesian network of the model that is briefly introduced in the following paragraphs; more detailed explanations can be found in the Appendix.

The Bayesian model analyzes the weighted pitch-class distributions of single pieces by re-generating them together with their roots and modes as latent variables. An inference algorithm (Gibbs sampling; described in the Appendix) is then used to infer the roots and modes of the pieces in the present dataset. In correspondence to the proposed geometric model (Section “Geometric Model”), all pieces are normalized to a common number of notes a priori. In contrast to the geometric model that represents key and mode as clusters of pieces, the Bayesian model operationalizes modes as probability distributions over pitch-class vectors. Keys are represented as pairs of a mode together with a root note, as it is also standard terminology in music theory, e.g., “A major”, “B minor”. The number of different modes m, however, is not fixed but a parameter of the Bayesian model that is to be inferred.

For each time period T, the proportions of how often modes and roots are used in this period are generated first; they are denoted by θ(M) and θ(R), respectively. These proportions are later used to randomly generate modes and roots of pieces. For each mode M, a pitch-class distribution $${\theta }_{M}^{(P)}$$ is generated. This distribution characterizes the internal structure of the mode (e.g., how much more or less likely a major third is than a minor third). For each piece P in the period, its mode M is sampled from θ(M) and its root R is sampled from θ(R). The pitch classes of the piece are generated by first drawing a pitch-class vector from $${\theta }_{M}^{(P)}$$ and then transposing it according to the piece’s root R.

Given the present dataset D and a time period T, the posterior distribution p(R, MP, T, D) describes how plausible it is that a piece P has root R and mode M.Footnote 5 Since the Bayesian model represents keys as pairs of roots and modes, this distribution is called the posterior of the piece’s key. It is obtained using a Gibbs sampling method adopted from Johnson et al. (2007) and Sato (2011); see the Appendix for further explanation. The key (R*, M*) that maximizes the piece’s posterior is given by

$$({R}^{* },{M}^{* })= \mathop{\mathrm{arg}\, {\mathrm{max} }}\limits_{(R,M)\in {{\mathbb{Z}}}_{12}\times \{1,\ldots ,m\}}p(R,M| T,P,D).$$
(3)

It is called the maximum-a-posteriori estimate in the terminology of Bayesian statistics. A piece P from a time period T is then classified into the mode M* of this key.

## Results and discussion

### Classification results

A dimensionality reduction of the mode space $${\mathbb{M}}$$ to the two-dimensional plane using t-distributed stochastic neighbor embedding (t-SNE, Van Der Maaten and Hinton, 2008) reveals a clear clustering of the pieces into several groups in the first three periods, namely four clusters in the Renaissance, two clusters in the Baroque and Classical periods, and a single large cluster in the later two periods (early and late Romantic). The columns of Fig. 4 show the different time periods, and the rows show different colorings that correspond to the classification into the modes according to different mode classification methods (metadata, Gaussian mixture model, Bayesian model) applied to the data. Each point corresponds to one piece in the dataset.Footnote 6

#### Clustering according to metadata labels

The first row displays the key labels from the metadata for the five periods. Note that the earliest epoch, the Renaissance, contains almost no pieces with key labels in the metadata; most pieces are colored in gray. For the following two periods, Baroque and Classical, the coloring by the labels in the metadata and the clustering obtained by t-SNE are aligned: one cluster is largely orange and the other one is largely blue. This correspondence induced by the metadata indicates that the clusters in the Baroque and the Classical period represent the major and the minor mode (shown in orange and blue, respectively). This bipartition into two modes reflects a pillar of Common-Practice tonality. In contrast, no clear distinction into different modes exist in the Romantic periods; pieces with major and minor metadata labels are mixed together, along with a relatively large number of unlabeled pieces. Hence, the metadata labels suggest that the binary classification of pieces into “major” and “minor” is less adequate for the 19th century.

#### Gaussian mixture model

The second row of Fig. 4 displays clusterings by mixtures of two-dimensional Gaussian distributions, a so-called Gaussian Mixture Model (GMM, see e.g., MacKay, 2003), for each period. Such models explain data by optimally covering the data points using multidimensional bell curves. In our application, each mode corresponds to one bell curve, and a piece is classified into the mode whose bell curve it is covered with.Footnote 7 In order to obtain the optimal number of bell curves and thus the optimal number of modes for a given historical period, silhouette scores (Rousseeuw, 1987) were calculated for all possible numbers of modes between 2 and 6. Silhouette scores are a standard measure to quantify the visual impression of the appropriateness of different numbers of clusters mode generally.Footnote 8 Table 1 lists the silhouette scores for all five periods. These numbers largely confirm the visual impression of Fig. 4. For the Renaissance, a clustering of the data into four modes is most appropriate. The division into two clusters achieves the highest silhouette scores in the Baroque, Classical and early Romantic periods, while the optimal number of clusters is three in the last period. Furthermore, the separation between clusters is clearest in the Classical era, as expressed by the largest difference between the highest two silhouette scores for this time period (0.460 − 0.402 = 0.058), in which the corresponding silhouette score reaches the maximum over all periods and number of modes. The indication for two modes is thus strongest in the Classical and the Baroque periods, again confirming that the Common-Practice period is largely determined by the usage of the major and the minor modes. Three modes achieve the highest silhouette score in the late Romantic period, questioning this clear separation. These results suggest that the two modes shape the tonal structure of pieces only for the Baroque and Classical eras, at least from the global perspective of pitch-class distributions that abstracts from local keys and modulations.

The results of the geometric model exhibit a strong similarity to the classification based on the metadata in the first row which suggests the same interpretation: in the Baroque and Classical periods, the space is well separable into two modes which does not hold for the Renaissance and the Romantic periods. The mixture of Gaussian distributions moreover classifies one of the larger clusters in the Renaissance as minor (blue) and the second large, as well as the two smaller clusters as major (orange). Since there is no corresponding metadata for this period, we can not evaluate the mode classification on that basis but will inspect the modes in this period in more detail below (Section “Characteristics of modes”). While music-theoretical conceptions of Common-Practice tonality are largely corroborated by these mode classifications according to the metadata and the GMM, it is important to recall that the former is highly uncertain because the labeling in the metadata might be erratic. The latter classification by the GMM is uncertain because it has been performed in a dimensionality-reduced space, and this transformation might have affected the underlying similarities between the pitch-class distributions of musical pieces. In order to overcome this limitation and to derive a more appropriate mode classification for all periods, we introduce a Bayesian classifier that operates in the original, unreduced mode space $${\mathbb{M}}$$.

#### Bayesian model

Based on the overall best silhouette scores for two modes (see Table 1), the mode predictions of the Bayesian model are shown by the coloring in the third row of Fig. 4; the appropriateness of major-minor classifications in the Renaissance will be considered below (Section “Characteristics of modes”). This classification into two modes closely resembles the partition by the GMM in all periods except in the Classical, where the Bayesian model is more exact. In other words, both the two-dimensional reduction via t-SNE (cluster shapes) and the unsupervised Bayesian mode classifier (coloring in the third row of Fig. 4) independently assign the musical pieces to one of the two modes in a similar manner. It is important to note that this agreement between the GMM and the Bayesian model is not trivial but a result of the close correspondence between the notion of distance in the mode space and the interaction of the Bayesian model’s distributions. The binary separation into major and minor is particularly strong in the Baroque and Classical periods for both methods, whereas the Renaissance and Romantic periods are somewhat more mixed for the Bayesian classifier.

### Mode clarity

To quantify the visual impression that the binary separation into major and minor is particularly strong in the Baroque and Classical periods, we define the mode clarity of a time period as the proportion of correctly predicted modes according to the metadata. Figure 5 visualizes the estimated mode clarity for each period. The mode clarity of the Renaissance period is not well quantifiable due to the small amount of key labels in the metadata. Nonetheless, it is lowest among all periods. The mode clarity is highest in the Classical period and significantly higher for the Baroque and the Classical period than for the Romantic periods. This finding underpins again the evidence for two clusters of major and minor pieces in the Baroque and the Classical period. Furthermore, it validates that the binary distinction between major and minor pitch distributions are less adequate to characterize the tonality of Romantic music. In other words, the concepts “major” and “minor” as distinct global modes of musical pieces is much less appropriate in the Romantic eras.

### Characteristics of modes

#### The Renaissance

As stated above, the mode clarity for the Renaissance is difficult to interpret because of lacking metadata. Moreover, the silhouette scores (see Table 1) clearly show that the pieces in this period group together in four and not only two clusters. We thus applied the Bayesian model to the pieces in this period separately, assuming the existence of four clusters. The result is shown in the coloring of Fig. 6, showing a large coherent cluster (red), a large mixed cluster (green and blue), as well as two small clusters (both violet). Recall that the clustering is a result of the dimensionality reduction and that the coloring expresses the similarity of pieces in the original mode space, in which the pieces of the two violet clusters are actually close to each other and the pieces in the green-blue cluster are actually separated. This provides further evidence that clustering the pieces in the reduced space with a GMM is not sufficient and again justifies the application of the Bayesian model for mode classification.

Since we cannot compare this clustering to the labels in the metadata, we inspect the pitch-class distributions of these four clusters separately and display them in so-called violin plots (see Fig. 7). Each of the violins depicts a kernel density estimation for the relative frequencies of the respective pitch classes in one of the four clusters. We manually added a threshold of 5% (dashed horizontal line) to distinguish between pitch classes belonging to our mode interpretations of the clusters and out-of-scale tones. The distributions of pitch classes within the respective clusters resemble music-theoretical descriptions of Renaissance modes (see Glareanus, 1547; Huron and Veltman, 2006; Judd, 2002). The first cluster (top row, violet) resembles the Mixolydian mode, the second cluster (second row, red) resembles the Ionian mode, the third cluster (third row, blue) resembles the Dorian mode, and the fourth cluster (bottom row, green) could be interpreted as a mode between the Aeolian and Dorian modes, lacking a clear expression of either pitch class 8 or 9 to distinguish them. Note, however, that our Bayesian classifier transposes all pieces to a common root, based on the assumption of transpositional invariance (see Section “Assumptions and contributions”). The results should therefore be interpreted with care since Renaissance composers did not transpose modes to arbitrary roots; rather the different modes were associated with specific absolute pitches as roots (Judd, 2002; Wiering, 2001).

#### The common-practice period

Both the silhouette scores (Table 1) and the mode clarities (Fig. 5) are clearest for the Common-Practice period, covering the Baroque and Classical periods. Figure 8 shows the distributions of each of the twelve pitch classes inferred via the Bayesian model for both modes (major and minor). The fact that the maxima of the distributions fall either above or below a threshold around five percent reflects the music-theoretical distinction between in-scale and out-of-scale notes (Lerdahl, 2001). The value of this threshold might be specific to the dataset we used. It is, however, interesting that there is a threshold separating in-scale from out-of-scale pitch classes so clearly. The variance of the distributions moreover differs across pitch classes and modes. Generally, it is smaller for of out-of-scale notes than for in-scale notes, expressing that out-of-scale notes are typically rarely used, while in-scale notes have a greater degree of freedom in their use in musical pieces.

The characteristics of a mode can be summarized in its average pitch-class distribution (henceforth mode template). For example, drawing on traditional definitions of major and minor (e.g., Aldwell and Schachter, 2003) one would expect that both the major and the minor mode have similar characteristics for their roots and fifths (pitch classes 0 and 7), but contrasting characteristics for their thirds (pitch classes 3 and 4) and sixths (pitch classes 8 and 9). Commonly, such mode templates are reported as line or bar plots and show absolute or relative estimated or counted frequencies. Unfortunately, they rarely show an appropriate measure of the underlying variance of the distributions as in Fig. 8. In order to directly compare our results with previously published mode templates, we reduce the representation of the characteristics of the major and minor mode from the full distributions to only their maxima. Figure 9 compares the mode templates of the unsupervised Bayesian classification (Bayesian Model) to templates presented in previous research, in particular to templates based on data from psychological experiments (Krumhansl and Kessler, 1982 and Temperley, 2001), as well as to templates derived from corpus statistics (Metadata and Albrecht and Shanahan, 2013). In contrast to the earlier studies, we propose to use so-called radar plots that capture the circular structure of the pitch space (Harasim et al., 2016; Lerdahl, 2001) as opposed to the more common linear arrangement of mode templates. Moreover, pitch classes are not ordered chromatically ascending (see Fig. 8) but along the circle of fifths. This makes it possible to better reflect that in-scale and out-of-scale pitch classes form consecutive parts of the circular space, respectively (Carey and Clampitt, 1989).

The corpus-driven templates assign less probability mass to out-of-scale notes than the experimental templates, supporting the music-theoretical concept of mode. Moreover, all corpus-driven templates further clearly exhibit the symmetry between major and minor mode. In all templates, the most distinctive pitch classes for the two modes are 3 and 4, the minor and major thirds, underpinning the music-theoretical definition of the two modes.

The close resemblance of the mode templates inferred by our Bayesian model to the other corpus-derived templates confirms that the two modes indeed correspond to the major and minor modes but establishes that neither their existence nor their particular shapes need to be assumed a priori; it is possible to infer them from a corpus of musical pieces in a data-driven, unsupervised fashion.

## Conclusion

We investigated the evolution of tonality over time by inspecting a large dataset of MIDI-encoded pieces, covering a historical range of approximately 500 years. Unlike previous music-theoretical studies, which draw on complex analytical methods and a host of prior assumptions, we adopt simple statistical models of pitch-class distributions and propose a novel Bayesian generative model that only takes into account a “bag-of-notes” representation of the pitch classes in the pieces. We find that basic pitch statistics suffice to reveal, in an unsupervised way, the number of modes and their characteristic shapes for a given period of time. We did not assume the number and shape of the modes in advance in order to remove potential biases towards the major-minor tonality as described by music theory (Aldwell and Schachter, 2003; Kostka and Payne, 2013). The results therefore provide strong support for cognitive theories on mode inference via statistical learning (Huron, 2006; Rohrmeier and Rebuschat, 2012; Saffran, 2003; Tillmann, 2012), based on pitch-class frequencies in musical pieces.

Both the geometric and the probabilistic model reveal a similar pattern: modes can be predicted best for the Classical period, second best for the Baroque, and worst for the 19th century and the Renaissance. The two modes that emerge and stabilize over time in the Common-Practice period indeed correspond to the well-known major and minor modes. Applying the same methodology to the Renaissance separately shows that a larger number of independent modes is more appropriate. However, this needs to be studied in more detail in future research, since our fundamental assumption of transpositional invariance seems least appropriate for this era. The poor prediction for the 19th century can be interpreted to reflect the substantial changes in tonality that took place over the course of the 19th century (Meyer, 1989), such as the use of new types of scales, as well as of pitch and harmonic relations (Aldwell and Schachter, 2003; Cohn, 2012; Haas, 2004). It moreover suggests that the concept of a single global mode for 19th-century pieces is much less suitable for this time period, likely due to more frequent local modulations to more distantly-related keys.

Our approach provides an avenue for data-driven corpus research to become a foundation for the empirical study of the historical evolution of tonality. The outcomes of this study open up several directions for future research. Especially the application of hierarchical models and mixture models seems to be very promising in order to draw more fine-grained conclusions about the relative importance of modes and the development of tonality not only on the global but also on more local and intermediate levels of a composition (Lieck and Rohrmeier, 2020). While the dataset used in this study only allows the distinction between twelve pitch classes, Lieck et al. (2020) have recently shown that the incorporation of algebraic models of tonal spaces can be used for more structured representations of of pitch-class distributions of musical pieces. They provide initial evidence that this is also relevant for historical changes of tonality which suggests to further explore the implications of this approach. The expansion of the historical extent of the database used for this study—potentially increasing the range from medieval to contemporary popular music—will broaden the horizon of the empirical historiography of tonality. The models proposed in this article can generally be applied to any musical style for which the representation by pitch classes is adequate and are thus not strictly limited to the study of Western classical music. They might thus be adopted to compare pitch-class distributions not only within but also between musical cultures. This might, in turn, potentially contribute to ongoing debates between biological (e.g., Honing et al., 2015; Merker et al., 2014; Wiggins et al., 2015) and cultural (e.g., Mehr et al., 2019; Nettl, 2001; Savage, 2019) approaches to the evolution of music.