Applying modern psychometric techniques to melodic discrimination testing: Item response theory, computerised adaptive testing, and automatic item generation

Harrison, Peter M. C.; Collins, Tom; Müllensiefen, Daniel

doi:10.1038/s41598-017-03586-z

Article
Open access
Published: 15 June 2017

Scientific Reports volume 7, Article number: 3618 (2017)

Abstract

Modern psychometric theory provides many useful tools for ability testing, such as item response theory, computerised adaptive testing, and automatic item generation. However, these techniques have yet to be integrated into mainstream psychological practice. This is unfortunate, because modern psychometric techniques can bring many benefits, including sophisticated reliability measures, improved construct validity, avoidance of exposure effects, and improved efficiency. In the present research we therefore use these techniques to develop a new test of a well-studied psychological capacity: melodic discrimination, the ability to detect differences between melodies. We calibrate and validate this test in a series of studies. Studies 1 and 2 respectively calibrate and validate an initial test version, while Studies 3 and 4 calibrate and validate an updated test version incorporating additional easy items. The results support the new test’s viability, with evidence for strong reliability and construct validity. We discuss how these modern psychometric techniques may also be profitably applied to other areas of music psychology and psychological science in general.

Introduction

Modern psychometric theory provides a remarkable array of tools for ability testing. For example, item response theory (IRT) allows scores to be compared between participants who take different tests. Computerised adaptive testing (CAT) produces efficient tests that dynamically tailor their difficulty levels to participants of different abilities. Automatic item generation (AIG) produces tests with effectively unlimited item banks. Techniques such as these can theoretically produce very powerful and flexible tests of abilities and other latent traits.

Unfortunately, modern psychometric theory has so far only had limited impact on mainstream psychological practice. A number of reasons have been suggested for this, including a lack of mathematically precise thinking in psychology, insufficient mathematical training in psychology education, and an absence of psychological theories sufficiently strong to form the basis of psychometric models¹. As a result, much psychological research still relies on outdated modes of test construction and validation. At best, this leads to suboptimal testing efficiency; at worst, it leads to incorrect psychological conclusions.

Music psychology is a case in point. There exists a long tradition of musical ability tests, spanning from Seashore’s Measures of Musical Talents² to Kirchberger and Russo’s Adaptive Music Perception Test³. However, the vast majority of these tests (e.g. refs 2, 4–9) are built on classical test theory¹⁰, which psychometricians abandoned long ago in favour of more advanced frameworks such as item response theory.

We propose that psychology, and music psychology in particular, stands to benefit from incorporating more of these psychometric tools. The present paper addresses this possibility. We took a well-established testing paradigm from the music-psychological literature – melodic discrimination – and we constructed a new test under this paradigm, making use of a number of modern psychometric techniques including IRT, CAT, and AIG. Together, these psychometric techniques carry substantial potential benefits including flexible test lengths, sophisticated reliability measures, improved testing efficiency, improved construct validity, avoidance of exposure effects, and improved test-construction efficiency. The resulting melodic discrimination test is, to our knowledge, the first musical ability test to incorporate all of these psychometric techniques. It should therefore provide a useful test case for examining how psychological research can benefit from these modern psychometric tools.

Background

Melodic discrimination testing

Melodic discrimination is the ability to detect differences between melodies. Many tests have been developed over the years to assess this ability, under names such as tonal memory tests, melodic memory tests, and melodic discrimination tests. However, they all share a common paradigm: participants are played several versions of the same melody, and have to distinguish differences between these versions. The precise task can vary, but most tests use a ‘same-different’ task, where the participant has to decide whether two melody versions are the same or different⁴,⁶,¹¹.

The earliest melodic discrimination tests formed part of musical aptitude test batteries, in the tradition of Seashore’s Measures of Musical Talents ² (see ref. 12 for a review). The purpose of a musical aptitude test is to assess an individual’s innate capacity for musical success. Musical aptitude tests are still often used as part of entrance exams for school music scholarships, in order to distinguish musical potential from learned ability. Research suggests that musical aptitude scores are indeed good predictors of musical achievement at school¹³ and that it is difficult to change musical aptitude scores through preparation¹⁴, but the degree to which musical aptitude dictates long-term musical success is still contested¹⁵.

Melodic discrimination tests are also often used in psychological research. Some of this research uses musical aptitude test batteries¹⁶,¹⁷,¹⁸, whereas other research uses test batteries made expressly for research¹⁹,²⁰. Recent examples of such test batteries include the Goldsmiths Musical Sophistication Index (Gold-MSI)¹¹, the Montreal Battery of Evaluation of Amusia (MBEA)²¹, the Musical Ear Test (MET)⁴, the Profile of Music Perception Skills (PROMS)⁶, and the Swedish Musical Discrimination Test⁵. By definition, melodic discrimination tests assess the ability to discriminate between melodies, but this ability is often interpreted as reflecting more general cognitive traits, such as perceptual sensitivity to melodies⁶, memory for melodies¹¹, and general musical competence⁴.

Melodic discrimination tests therefore have important roles to play both in educational assessment and in psychological research. However, these tests have historically possessed two important limitations: poor construct validity and poor efficiency.

Construct validity concerns how test scores relate to the underlying construct of interest²². It is of paramount importance in psychology, as it allows researchers to generalise their results from artificial measures (such as questionnaires or ability tests) to “real” human capacities (such as personality or ability). However, construct validity has received surprisingly little attention in melodic discrimination testing, despite the paradigm’s wide usage. Different studies propose different underlying abilities for these tests, including ‘audiation’²³, melodic memory¹¹, and tonal memory²⁴; however, the definitions of these abilities are usually cursory and unsubstantiated. This seriously compromises the construct validity of melodic discrimination testing.

Poor efficiency in melodic discrimination testing has a number of causes. First, in order to achieve sufficient reliability over a wide range of ability levels, the test must contain many items distributed over a wide range of difficulty levels. In traditional fixed-item tests, this means that any one participant will therefore take many items that are either much too easy or too difficult for their ability level, lowering psychometric efficiency²⁵. Second, most melodic discrimination tests use multiple-choice questions with only two options, meaning that even minimally able participants can score at least 50%, likewise resulting in lowered psychometric efficiency²⁶. Third, melodic discrimination items are slow to administer because of their inherently temporal nature²⁶. As a result of this low efficiency, melodic discrimination tests tend to be time-consuming and tiring, diminishing their practical utility.

Fortunately, modern psychometric techniques can make substantial contributions towards both construct validity and test efficiency, the two main limitations of historic melodic discrimination tests. This, coupled with the important role that melodic discrimination testing plays in educational assessment and psychological research, suggests that it may be worthwhile to construct a new melodic discrimination test using these modern psychometric techniques.

Psychometric theory

Item response theory (IRT)

Item response theory (IRT) represents the state of the art in modern test theory (see ref. 25 for a review). Each item is characterised by a finite set of item parameters, and each test-taker is characterised by a finite set of person parameters. Most common IRT models can be formulated as special cases of the four-parameter logistic (4PL) model, defined as

$$P({X}_{j}=1|\theta ,{a}_{j},{b}_{j},{c}_{j},{d}_{j})={c}_{j}+({d}_{j}-{c}_{j})\frac{\exp \,[{a}_{j}(\theta -{b}_{j})]}{1+\exp \,[{a}_{j}(\theta -{b}_{j})]}$$

where X_j denotes the scored response of the test-taker to item j (1 = correct, 0 = incorrect), θ is the ability parameter for the test-taker, and a_j, b_j, c_j, and d_j are the item parameters for item j, respectively termed discrimination, difficulty, guessing, and inattention parameters²⁷.

The four item parameters each capture different ways in which items might vary. Items with higher discrimination parameters are better at discriminating between test-takers of different abilities. Items with higher difficulty parameters are harder to answer correctly. The guessing parameter corresponds to the lower performance asymptote (chance success rate), and the inattention parameter corresponds to the upper performance asymptote. Person ability is measured on the same scale as item difficulty, with this scale typically being defined so that the distribution of abilities in the test-taker population has mean 0 and standard deviation 1.
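The 4PL response function can be written out directly. The following is a minimal sketch; the function name `p_correct_4pl` and the example parameter values are illustrative assumptions rather than anything taken from the paper.

```python
import math

def p_correct_4pl(theta, a, b, c=0.0, d=1.0):
    """Probability of a correct response under the 4PL model.

    theta: test-taker ability; a: discrimination; b: difficulty;
    c: guessing (lower asymptote); d: inattention (upper asymptote).
    """
    # exp(x) / (1 + exp(x)) is algebraically equal to 1 / (1 + exp(-x)).
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (d - c) * logistic

# When ability equals difficulty, the logistic term is 0.5, so the
# probability lies halfway between the two asymptotes.
print(p_correct_4pl(theta=0.0, a=1.0, b=0.0, c=0.25, d=1.0))  # 0.625
```

Setting `c=0` and `d=1` with `a=1` recovers the Rasch model described later in this section.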

Scoring test-takers’ responses using IRT requires knowledge of the test’s item parameters. This typically requires the IRT model to be calibrated on the basis of response data from a group of test-takers. Once item parameter estimates are obtained, the IRT model can be used to estimate test-taker abilities on a test-independent metric.

In practice, the full flexibility of the 4PL model is not always desirable. Simpler models are typically more efficient to calibrate, requiring less financial investment in the test construction phase. Common simplifications include the three-parameter logistic (3PL) model, where the inattention parameter is constrained to 1, and the Rasch model, where discrimination and inattention parameters are constrained to 1 and guessing parameters are constrained to 0. Whether or not a guessing parameter can plausibly be constrained to 0 depends on the testing paradigm; in particular, multiple-choice paradigms with small numbers of options are very likely to have non-zero chance success rates. In the present work, we constrain the guessing parameter to the reciprocal of the number of response options (n), the inattention parameter to 1, and the discrimination parameter to be equal for all items. The resulting model can be expressed as

$$P({X}_{j}=1|\theta ,a,{b}_{j},n)=\frac{1}{n}+(1-\frac{1}{n})\frac{\exp \,[a(\theta -{b}_{j})]}{1+\exp \,[a(\theta -{b}_{j})]}$$

(1)

where a is the shared discrimination parameter. Such models are sometimes termed constrained or modified 3PL models because they preserve the 3PL model’s non-zero guessing parameter while introducing other constraints on the model parameters (e.g. ref. 28).
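As a quick check of equation (1), consider a test-taker whose ability equals the item difficulty (θ = b_j) on an item with three response options (n = 3, a value chosen here for illustration). The logistic term is then exactly 1/2 whatever the value of a, giving

$$P({X}_{j}=1|\theta ={b}_{j},a,{b}_{j},n=3)=\frac{1}{3}+\left(1-\frac{1}{3}\right)\cdot \frac{1}{2}=\frac{2}{3}$$

In other words, the difficulty parameter marks the ability level at which the success probability lies halfway between chance (1/n) and certainty.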

IRT brings two important advantages. First, unlike classical test theory, IRT allows for the direct comparison of participant scores even when these participants answer different items. This makes it easy for researchers to shorten or lengthen tests without compromising score comparability. It is also a crucial prerequisite for adaptive testing, where participants are administered different items based on their performance levels.

A second advantage of IRT is its sophisticated treatment of reliability. Reliability refers to the consistency of a measurement: a reliable instrument is one that delivers consistent results under similar conditions. In classical test theory, reliability is a property of a test with respect to a given test-taker population, and is assessed using measures such as test-retest reliability and Cronbach’s alpha. These reliability measures have limited generalisability: they are sample dependent, meaning that they cannot be generalised from one test-taker population to another, and they are test dependent, meaning that they cannot be generalised from one test configuration to another. IRT addresses this problem by treating reliability as a function of the items administered and the ability level of the test-taker, with the resulting measure being termed information. Information is easy to compute as long as estimates are available for the item parameters. IRT therefore makes it easy to estimate test reliability for new test-taker populations and new test configurations.
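To make the notion of information concrete, the sketch below computes item information for the constrained model of equation (1), using the standard Fisher information expression for 3PL-type models with guessing parameter c = 1/n. The paper does not print this formula, so the code is an illustration under that assumption; the function names are hypothetical.

```python
import math

def p_correct(theta, a, b, n):
    """Constrained 3PL response probability from equation (1)."""
    c = 1.0 / n
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, n):
    """Fisher information of one item at ability theta (guessing c = 1/n)."""
    c = 1.0 / n
    p = p_correct(theta, a, b, n)
    return a ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

def standard_error(theta, a, difficulties, n):
    """Approximate standard error of the ability estimate for a set of items."""
    total = sum(item_information(theta, a, b, n) for b in difficulties)
    return 1.0 / math.sqrt(total)

# Ten items matched to the test-taker's ability measure more precisely
# than ten items that are far too hard.
print(standard_error(0.0, 1.0, [0.0] * 10, 3) < standard_error(0.0, 1.0, [3.0] * 10, 3))  # True
```

Because information depends only on the item parameters and the ability level, the same functions apply unchanged to any new selection of items, which is the sense in which IRT reliability is not tied to one test configuration.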

IRT-based ability tests are rare in music psychology, but notable exceptions include two melodic discrimination tests²⁴,²⁹, a test of Wagner expertise³⁰, a test of notational audiation³¹, and a test of music student competency³².

Computerised adaptive testing (CAT)

Computerised adaptive testing (CAT) is an approach to ability testing where item selection is algorithmically determined on the basis of the test-taker’s prior responses (see ref. 33 for a review). Item selection typically aims to maximise the information that each item delivers about the test-taker’s true ability.

Several frameworks exist for CAT. The simplest of these, such as the staircase method³⁴ and Green’s³⁵ adaptive maximum-likelihood procedure, require little pre-calibration but only work for simple tasks, such as psychophysical tests. In contrast, the IRT approach to adaptive testing is flexible enough to be applied to a much wider range of tasks, including melodic discrimination.

Under the IRT framework, adaptive tests typically comprise the following general stages:

1. Make an initial estimate of the test-taker’s ability;
2. Repeat the following steps:
   a. Select and administer an item that should deliver maximal information about the participant’s true ability, possibly subject to practical constraints (e.g. not administering the same item twice);
   b. Calculate a new estimate of the test-taker’s ability on the basis of their responses;
   c. Check whether the stopping criterion is satisfied (e.g. required test length reached); if so, terminate the test.
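The stages listed above can be sketched as a short loop. This is an illustration only, not the implementation used in the paper: the ability estimator (an expected a posteriori estimate on a grid with a standard normal prior), the fixed-length stopping rule, the item bank, and all function names are assumptions made for the example.

```python
import math
import random

def p_correct(theta, a, b, n):
    """Constrained 3PL response probability from equation (1)."""
    return 1.0 / n + (1.0 - 1.0 / n) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, n):
    """Standard 3PL Fisher information with guessing parameter 1/n."""
    c = 1.0 / n
    p = p_correct(theta, a, b, n)
    return a ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

GRID = [i / 10.0 for i in range(-40, 41)]  # candidate ability values

def estimate_ability(responses, a, n):
    """Posterior mean of ability on GRID (assumed standard normal prior)."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta ** 2)
        for b, correct in responses:
            p = p_correct(theta, a, b, n)
            w *= p if correct else 1.0 - p
        weights.append(w)
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)

def run_cat(bank, answer, a=1.0, n=3, test_length=10):
    """bank: list of item difficulties; answer: callable(difficulty) -> bool."""
    theta = 0.0                  # stage 1: initial ability estimate
    remaining = list(bank)
    responses = []
    while len(responses) < test_length and remaining:  # stage 2c: stopping rule
        # Stage 2a: most informative unused item at the current estimate.
        b = max(remaining, key=lambda x: item_information(theta, a, x, n))
        remaining.remove(b)      # never administer the same item twice
        responses.append((b, answer(b)))
        # Stage 2b: update the ability estimate from all responses so far.
        theta = estimate_ability(responses, a, n)
    return theta, responses

# Simulated test-taker with a true ability of 1.0 (hypothetical values).
rng = random.Random(1)
bank = [i / 4.0 for i in range(-12, 13)]
theta_hat, log = run_cat(bank, lambda b: rng.random() < p_correct(1.0, 1.0, b, 3))
print(round(theta_hat, 2), [b for b, _ in log])
```

Other stopping rules, such as terminating once the standard error falls below a threshold, would replace only the loop condition.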

The primary advantage of CAT is improved testing efficiency. Traditional non-adaptive tests must contain items at a wide range of difficulty levels so as to cater to a wide range of test-taker abilities. This means that any given test-taker will receive some items that deliver low information at their ability level, on account of them being too easy or too hard. In contrast, adaptive tests aim to deliver maximally informative items at each point in the test, resulting in great reliability improvements. As a result, adaptive tests can typically be shortened by 50–80% and still match the reliability of equivalent non-adaptive tests²⁵,³⁶. This effect is particularly pronounced when the test is targeted at a wide range of ability levels, as is common in music psychology.

Several recent musical ability tests incorporate CAT. These include the Adaptive Music Perception Test (AMP)³, the Harvard Beat Assessment Test (H-BAT)³⁷, and the Battery for the Assessment of Auditory Sensorimotor and Timing Abilities (BAASTA)³⁸. However, almost all of these tests use non-IRT procedures which do not generalise well to higher-level cognitive abilities. IRT is the ideal tool for such cases, but we are only aware of one IRT-based CAT in the music-psychological literature: Vispoel’s adaptive tonal memory test²⁴. Unfortunately, this test seems to be no longer available.

Automatic item generation (AIG)

Automatic item generation (AIG) is an approach to test construction where test items are generated algorithmically together with estimates of their psychometric parameters (see ref. 39 for an overview). It contrasts with traditional approaches to IRT, where items are constructed by hand and psychometric parameters are estimated separately for each item on the basis of empirical response data²⁵. To our knowledge, AIG and IRT have yet to be combined in musical ability testing.

One important benefit of AIG is improved efficiency of test construction. Generating items algorithmically avoids the time-consuming process of manual item design. Moreover, predicting psychometric parameters a priori means that items do not have to be individually calibrated on human test-takers before use. Both of these characteristics are particularly important for adaptive tests, whose large item banks can otherwise be very expensive to construct and calibrate.

A second important benefit of AIG is improved construct validity. Effective AIG typically relies on identifying the cognitive mechanisms that underlie task performance: this is called construct representation, and is an important part of construct validity⁴⁰. This construct representation is used to generate hypotheses about the relationships between structural item features and psychometric item parameters. These hypotheses are then tested on empirical response data. If the hypothesised relationships are supported by the data, this supports the test’s construct representation, and only then can these relationships be used to predict the psychometric characteristics of newly generated items. Construct validity therefore goes hand-in-hand with AIG techniques.

A third benefit of AIG concerns item exposure. Traditional tests have a limited number of items, meaning that participants may become familiar with those items if they take the test multiple times. This can be a problem in psychology research, since the same participant can easily take part in several studies that use the same ability test. However, tests that use AIG can benefit from an effectively unlimited pool of items, making it very unlikely that participants receive the same item again in subsequent test sessions.

This paper takes a top-down, weak theory approach to AIG. ‘Top-down’ means that item development is driven by an a priori theoretical model connecting item features to cognitive processes⁴¹. ‘Weak theory’ means that item development centres around constructing families of isomorphic items: items with differing surface characteristics but similar psychometric characteristics⁴².

Several item response models exist for weak-theory AIG, including the Identical Siblings Model, the Related Siblings Model, and the Random-Explanatory Model ⁴³. These models vary in the way that they treat within-family variation in item parameters: the Identical Siblings Model assumes no within-family variation, the Related Siblings Model treats within-family variation as a random effect in a mixed-effects model, and the Random-Explanatory Model treats within-family variation as a combination of fixed and random effects in a mixed-effects model. This paper adopts the Identical Siblings Model for three reasons: (a) its performance closely matches that of its competitors⁴³, (b) it is conceptually very simple, and (c) it can be estimated with standard IRT software packages. When combined with the constrained IRT model described in equation (1), the Identical Siblings Model trivially produces the following item response function:

$$P({X}_{ij}=1|\theta ,a,{b}_{j},n)=\frac{1}{n}+(1-\frac{1}{n})\frac{\exp \,[a(\theta -{b}_{j})]}{1+\exp \,[a(\theta -{b}_{j})]}$$

(2)

where X_ij denotes the test-taker’s scored response to item i from item family j, θ is the test-taker’s ability parameter, n is the number of response options, a is the shared discrimination parameter across all item families, and b_j is the difficulty parameter for item family j. Note that the assumption of zero within-family variation in item parameters means that the expression for the item response function is independent of i.

We approach our AIG task as follows. First, we describe the generic form of the items used in our melodic discrimination test. We then outline the cognitive model of melodic discrimination that forms the basis of our AIG system. We use this cognitive model to identify which item features should significantly affect item difficulty (radicals) and which should have minimal effect on difficulty (incidentals). Radicals and incidentals are then manipulated to define 20 (later 32) item families, constructed so as to cover a wide range of difficulty levels, with item difficulty being hypothesised to be constant within item families but to differ across families. We then develop a protocol for automatically generating melodic discrimination items within these families. Lastly, we calibrate the psychometric parameters of these item families in two empirical studies (Studies 1 and 3), and validate the performance of the resulting adaptive melodic discrimination test in two further empirical studies (Studies 2 and 4).

Test design

Generic item form

Most melodic discrimination tests use a ‘same-different’ paradigm, where the participant is played two versions of the same melody and is asked whether they are the same or different⁴,⁶,²⁹. This paradigm is appealing for its simplicity, but is problematic for IRT in that task performance depends both on task ability and on the participant’s individual decision threshold²⁹. Other tests use a paradigm where participants are played two melodies that differ by one note, and their task is to identify which note differed⁵,²⁴. This eliminates the decision threshold problem, but may introduce an unwanted task dependency with numerical fluency.

In the present research we therefore introduce a three-alternative forced-choice (3-AFC) melodic discrimination paradigm, which does not require such a decision threshold. In each trial, the participant is presented with three versions of the same melody. Two of these versions possess the same interval structure and are called lures, but one version has exactly one note altered, and is called the odd-one-out. These three versions can occur in any order, and the participant’s task is to identify which version was the odd-one-out. An example trial is illustrated in Fig. 1.

Cognitive model

Many previous studies in experimental psychology have used the melodic discrimination paradigm to explore melody perception and cognition²⁹,⁴⁴,⁴⁵,⁴⁶,⁴⁷,⁴⁸. These studies can provide a useful cognitive basis for melodic discrimination testing. Here we adopt the cognitive model of melodic discrimination proposed by Harrison and colleagues²⁹, which identifies four primary cognitive processes that underlie melodic discrimination: perceptual encoding, memory retention, similarity comparison, and decision-making.

Perceptual encoding occurs first, with the listener forming a cognitive representation of the melody as it is played. This involves extracting a number of different features from the melody, such as pitch content, interval content, melodic contour, and harmonic structure. Next, for all melodies aside from the last melody in the trial, the participant must retain the melodies’ cognitive representations in working memory. Similarity comparisons are then performed between these cognitive representations, making use of the different feature representations that were formed in perceptual encoding and stored in memory retention. In the 3-AFC paradigm, we suggest that each melody version is compared with every other version, producing three similarity comparisons in total. Lastly, the participant performs a decision-making process to determine which melody was the odd-one-out, on the basis of these similarity judgements. Two of these melody pairs must be different, and one must be the same; the participant’s task is therefore to identify the most similar pair, and then the odd-one-out must be the melody not contained within this pair. Other decision strategies are possible, but we suggest that these alternative strategies have similar psychometric implications.
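The decision rule suggested above (find the most similar pair, then pick the remaining melody) can be stated in a few lines. The sketch assumes that the three pairwise similarity judgements are already available as numbers; how a listener arrives at those judgements is outside its scope, and the function name is hypothetical.

```python
from itertools import combinations

def odd_one_out(similarity):
    """Return the index (0, 1, or 2) of the melody judged to be the odd-one-out.

    similarity maps frozenset({i, j}) to the perceived similarity of
    melody versions i and j; higher values mean more similar.
    """
    pairs = combinations(range(3), 2)  # the three pairwise comparisons
    best_pair = max(pairs, key=lambda pair: similarity[frozenset(pair)])
    # The odd-one-out is the one melody not in the most similar pair.
    return ({0, 1, 2} - set(best_pair)).pop()

judgements = {
    frozenset({0, 1}): 0.9,  # first and second versions sound alike
    frozenset({0, 2}): 0.4,
    frozenset({1, 2}): 0.5,
}
print(odd_one_out(judgements))  # 2
```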

Implications for item difficulty

Item features that impair any of the four primary cognitive processes in the melodic discrimination task should be expected to increase item difficulty. Particularly important item features are melodic complexity, melodic similarity, conformity to cultural schemata, and pitch transposition²⁹. More complex melodies place higher demands on the limited capacity of working memory, and hence result in more difficult items⁴⁷. Increased contour similarity and tonal similarity between melody versions make the similarity comparison task more demanding, hence increasing item difficulty⁴⁹. Conformity to cultural schemata aids perceptual encoding and memory retention, thereby decreasing item difficulty⁴⁷,⁴⁸. Transposition impairs perceptual encoding and similarity comparison, hence increasing item difficulty⁴⁸.

Melodic complexity, similarity, conformity to cultural schemata, and transposition can therefore all be described as radicals: they are features that should contribute to item difficulty. The aim of varying the radicals is to produce a suitable range of item difficulties for the adaptive test. In this research we manipulate the first two radicals (complexity and similarity) while keeping the last two (conformity to cultural schemata and transposition) constant.

Complexity is operationalised as the number of notes in the melody (termed length). Longer melodies are more complex, and should therefore result in higher item difficulties. Similarity is operationalised in terms of two dichotomous variables: whether the altered melody differs in pitch contour from the original (contour violation) and whether the altered note leaves the home key of the original melody (tonality violation). Contour and tonality violations should decrease similarity, hence decreasing item difficulty.

Incidentals are features that are expected not to contribute substantially to item difficulty. Manipulating incidentals introduces variation into test items, hence reducing exposure effects and improving generalisability. The primary incidental manipulated in the present research is the base melody used for each item. This ensures that the participant is always discriminating between unfamiliar melodies. We also treat the position of the odd-one-out as an incidental, and randomise it across all trials to prevent it from becoming a response cue.

Prior research usually treats melodic discrimination as a unidimensional ability, occasionally split into (typically correlated) rhythm and pitch subcomponents⁴,²³. To maintain a viable scope for the present paper, we focus on the detection of pitch differences, and leave rhythmic discrimination to future work. Rhythmic discrimination aside, the cognitive model described above still describes four primary cognitive processes behind melodic discrimination, and individual differences in each of these cognitive processes could lead to a multidimensional melodic discrimination ability. For example, an individual might be good at discriminating very similar melodies, but bad at retaining complex melodies. Later in this paper we examine this hypothesis empirically.

Item families

Version 1 of the adaptive melodic discrimination test comprises 20 item families. Each item family corresponds to a unique combination of the three radicals: two dichotomous radicals (contour violation, tonality violation), and five levels of length (6, 7, 9, 12, and 16 notes). Transposition is kept constant, with the same starting key being used for all items (D major), and each successive melody in the 3-AFC trial being transposed one semitone higher than its predecessor. Conformity to cultural schemata is also kept constant, with the chosen musical idiom being Irish folk melodies.

Version 2 of the adaptive melodic discrimination test expands Version 1 by introducing three more length levels (3, 4, and 5 notes). These new length levels are factorially combined with the remaining radicals, producing 12 new item families and bringing the total number of item families to 32.
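The family counts follow from a full factorial crossing of the radicals, which the short sketch below enumerates. The dictionary keys and function name are illustrative assumptions; the length levels are those stated above.

```python
from itertools import product

V1_LENGTHS = (6, 7, 9, 12, 16)   # Version 1 length levels (notes)
V2_EXTRA_LENGTHS = (3, 4, 5)     # additional easy levels in Version 2

def item_families(lengths):
    """Cross every length level with both dichotomous radicals."""
    return [
        {"length": length, "contour_violation": cv, "tonality_violation": tv}
        for length, cv, tv in product(lengths, (False, True), (False, True))
    ]

version_1 = item_families(V1_LENGTHS)
version_2 = item_families(V2_EXTRA_LENGTHS + V1_LENGTHS)
print(len(version_1), len(version_2))  # 20 32
```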

Item generation protocol

The purpose of the automatic item generation protocol is to provide an (effectively) unlimited supply of items for each item family. This protocol comprises three main steps: generating the base melodies, generating altered melodies, and synthesising the corresponding audio.

Base melodies are generated algorithmically by the computational model Racchman-Jun2015 (Random Constrained CHain of MArkovian Nodes)⁵⁰,⁵¹, which takes as input a corpus of source music in a particular musical style, calculates a matrix of transition probabilities between musical events, and uses this transition matrix to generate new musical extracts in the style of the source corpus. The source corpus used here is the collection of Irish folk melodies from the Essen collection⁵² in simple triple time. Two constraints are placed on base melody generation. One constraint is that melodies at a particular length level (e.g. 6 notes) should all occupy the same number of musical beats, hence keeping note density constant. Another constraint is that no more than two consecutive note events should come from the same melody, reducing the probability that the algorithm will replicate a segment of a source melody note for note. A pilot study with 20 participants (mostly university students with limited familiarity with Irish folk music) and 80 trials per participant found no significant difference in perceived stylistic success between melodies generated by the computational model and melody extracts from the source corpus (Welch t-test, t(74.2) = 0.71, p = 0.48). This suggests that the generated melodies should be sufficiently realistic for use in the melodic discrimination test.
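The general idea of generating melodies from corpus transition probabilities can be illustrated with a much simpler stand-in. The sketch below is a plain first-order Markov chain over MIDI pitch numbers; it is not Racchman-Jun2015, it ignores rhythm, and it omits both constraints described above (constant beat span and the limit on consecutive events from one source melody). The toy corpus and all names are hypothetical.

```python
import random
from collections import defaultdict

def build_transitions(corpus):
    """Map each pitch to the list of pitches that follow it in the corpus.

    Sampling uniformly from such a list is equivalent to sampling from
    the empirical transition probabilities.
    """
    transitions = defaultdict(list)
    for melody in corpus:
        for current, following in zip(melody, melody[1:]):
            transitions[current].append(following)
    return transitions

def generate_melody(transitions, start, length, rng):
    """Generate `length` pitches from `start`, or None at a dead end."""
    melody = [start]
    while len(melody) < length:
        options = transitions.get(melody[-1])
        if not options:  # no continuation was ever observed for this pitch
            return None
        melody.append(rng.choice(options))
    return melody

# Toy corpus of two short pitch sequences (MIDI note numbers in D major).
corpus = [[62, 64, 66, 67, 69], [69, 66, 64, 62, 64]]
table = build_transitions(corpus)
print(generate_melody(table, start=62, length=6, rng=random.Random(0)))
```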

Four altered melodies are produced for each base melody, one satisfying each combination of contour violation and tonality violation. Alterations are produced by modifying the relative pitch of exactly one note in the base melody, with the following constraints:

1. For melodies with lengths of 5 notes or fewer, neither the first nor the last note is available for alteration.
2. For melodies with lengths of 6 notes or more, neither the first two nor the last two notes are available for alteration.
3. The altered note must not differ from the original note by more than 6 semitones.
4. The altered note must be between an eighth note and a dotted half note in length, inclusive.

A search algorithm attempts to find alterations that satisfy these constraints while minimising the displacement distance between the altered note and the original note. If four altered melodies cannot be found for the base melody, the base melody is discarded and the process starts again with a new base melody.
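A brute-force version of such a search might look as follows. This is a sketch under several assumptions that the text does not specify: pitches are MIDI note numbers, durations are measured in quarter-note beats (so an eighth note is 0.5 and a dotted half note is 3.0), a contour violation means that the up/down/same pattern of successive intervals changes, and a tonality violation means that the altered pitch falls outside the D major scale. All names are hypothetical.

```python
D_MAJOR = {2, 4, 6, 7, 9, 11, 1}  # pitch classes D E F# G A B C#

def contour(pitches):
    """Sign of each successive interval: +1 up, -1 down, 0 repeated."""
    return [(b > a) - (b < a) for a, b in zip(pitches, pitches[1:])]

def find_alteration(pitches, durations, contour_violation, tonality_violation):
    """Return (index, shift) of the smallest valid alteration, or None."""
    n = len(pitches)
    margin = 1 if n <= 5 else 2                 # constraints 1 and 2
    best = None
    for i in range(margin, n - margin):
        if not 0.5 <= durations[i] <= 3.0:      # constraint 4
            continue
        for shift in range(-6, 7):              # constraint 3
            if shift == 0:
                continue
            altered = list(pitches)
            altered[i] += shift
            cv = contour(altered) != contour(pitches)
            tv = altered[i] % 12 not in D_MAJOR
            if cv == contour_violation and tv == tonality_violation:
                if best is None or abs(shift) < abs(best[1]):
                    best = (i, shift)           # keep the smallest displacement
    return best

def four_alterations(pitches, durations):
    """One alteration per violation combination, or None to discard the melody."""
    result = {}
    for cv in (False, True):
        for tv in (False, True):
            found = find_alteration(pitches, durations, cv, tv)
            if found is None:
                return None
            result[(cv, tv)] = found
    return result

base = [62, 64, 66, 67, 69, 71, 69, 66]  # D E F# G A B A F#
print(four_alterations(base, [1.0] * len(base)))
```

Returning `None` from `four_alterations` corresponds to discarding the base melody and starting again with a new one.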

All stimuli are synthesised from MIDI with identical piano timbre and a tempo of 120 beats per minute. The three melodies within each trial are always separated by 1 s of silence, and the first melody within a trial always takes the key of D major. Three audio stimuli can be generated from the same combination of altered melody and base melody, since the odd-one-out can come either first, second, or third in the 3-AFC trial. Example stimuli can be found in the supplementary materials.
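The assembly of a trial from a base melody and an altered melody can be sketched at the symbolic level, before audio synthesis. The example assumes melodies are lists of MIDI pitch numbers with the base melody already in D major; the function names and example pitches are hypothetical.

```python
def intervals(pitches):
    """Successive pitch intervals in semitones (unchanged by transposition)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def build_trial(base, altered, odd_position):
    """Return the three melody versions of a 3-AFC trial, in playing order.

    odd_position (0, 1, or 2) is where the altered melody is placed. Each
    successive version is transposed one semitone above its predecessor.
    """
    if odd_position not in (0, 1, 2):
        raise ValueError("odd_position must be 0, 1, or 2")
    versions = [altered if k == odd_position else base for k in range(3)]
    return [[pitch + k for pitch in version] for k, version in enumerate(versions)]

base = [62, 64, 66, 67]     # D E F# G
altered = [62, 64, 68, 67]  # third note raised by two semitones
# The same base/altered pair yields three distinct trials.
for position in range(3):
    print(position, build_trial(base, altered, position))
```

Because the lures share an interval structure but not absolute pitches, a listener cannot solve the task by comparing individual pitches across versions.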

Studies

Study 1: First calibration

The primary aim of this first study was to estimate psychometric parameters for Version 1 of the adaptive melodic discrimination test. This involved administering automatically generated items to a large number of participants and fitting an IRT model to the resulting response data. This IRT model could then be used to estimate psychometric parameters for new items.

The automatically generated items spanned a wide range of difficulties, and so effective estimation of item parameters required the participants to span a wide range of ability levels. We addressed this by recruiting both adults and schoolchildren. The schoolchildren were expected to be less cognitively developed than the adults, and therefore to have lower melodic discrimination abilities. We assumed that, otherwise, melodic discrimination should be similar in schoolchildren and in adults.

This study also provided an opportunity to investigate the viability of the proposed adaptive melodic discrimination test. The test’s underlying psychometric assumptions were assessed using various tools from the IRT literature, such as unidimensionality tests and model fit statistics. The reliability of item difficulty predictions was assessed through the inspection of item discrimination parameters. Construct validity was assessed in terms of construct representation and nomothetic span: construct representation was investigated by regressing item difficulties on structural item features, while nomothetic span was investigated by correlating participant abilities with other participant attributes⁴⁰. Lastly, item difficulties were compared to participant abilities to determine the test’s suitability for different ability levels.

Method

Participants

Two participant groups were used: one group of adults (N = 158) and one group of schoolchildren (N = 266). The adults were recruited via social media and word-of-mouth, and were rewarded by a prize draw for a £100 (≈$125) gift voucher as well as the chance to see feedback on their melodic discrimination skills. They were approximately evenly split by gender (65 males, 87 females, six anonymous) and ranged in age from 18 to 77 years (M = 34.4, SD = 14.6). The schoolchildren, meanwhile, participated as part of a wider study investigating a broader range of academic and musical skills⁵³. All schoolchildren were female, and ranged in age from 6 to 18 (M = 14.5, SD = 1.79).

Materials

Melodic discrimination test: This study used a non-adaptive instance of Version 1 of the melodic discrimination test. Twenty base melodies were generated for each of the five length levels, which were then crossed with contour violation (two levels), tonality violation (two levels), and position of the odd-one-out (three levels) to produce 1,200 items in total.
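
The item count follows directly from the crossing described above. The short sketch below enumerates the bank, assuming (as the later analyses suggest) that an item family is one combination of length level, contour violation, and tonality violation; the field names are hypothetical and length levels are represented only by their index.

```python
from itertools import product

N_LENGTH_LEVELS = 5      # five length levels (index only; note counts not assumed)
N_BASE_PER_LEVEL = 20    # base melodies generated per length level
ODD_POSITIONS = (1, 2, 3)

items = [
    {
        "length_level": level,
        "base": base,
        "contour_violation": contour,
        "tonality_violation": tonality,
        "odd_position": position,
    }
    for level, base, contour, tonality, position in product(
        range(N_LENGTH_LEVELS),
        range(N_BASE_PER_LEVEL),
        (False, True),
        (False, True),
        ODD_POSITIONS,
    )
]

# One family per combination of the three radicals.
families = {
    (item["length_level"], item["contour_violation"], item["tonality_violation"])
    for item in items
}
```

Under these assumptions the bank holds 5 × 20 × 2 × 2 × 3 = 1,200 items grouped into 5 × 2 × 2 = 20 families of 60 items each.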

Musical training questionnaire: The musical training questionnaire was sourced from the Goldsmiths Musical Sophistication Index (Gold-MSI)¹¹ self-report measure. It comprised seven items addressing the participant’s formal musical background as well as their performance ability. Scores on these seven items were aggregated to produce a numeric musical training score for each participant.

Procedure

Data were collected using the Concerto platform⁵⁴. Adults participated online, agreeing to wear headphones and to take the test in a quiet room free from interruptions. Schoolchildren participated in quiet classrooms wearing standardised headphones (Behringer, HPM1000).

Adult testing sessions lasted approximately 12 minutes each. Sessions began with the melodic discrimination test and concluded with two short questionnaires. The first was the Gold-MSI musical training questionnaire, described above; the second comprised some basic demographic questions (age, gender, occupational status). Upon completion of the questionnaires adult participants were presented with their total melodic discrimination scores.

Testing sessions for schoolchildren lasted approximately an hour each, and included the melodic discrimination test alongside a number of other listening tests and questionnaires. The questionnaires included the Gold-MSI musical training questionnaire and basic demographic questions. Results from the other listening tests and questionnaires are not reported here. The schoolchildren received no feedback on their scores.

For all participants, the melodic discrimination test began with a training phase, which included instructions, audio examples, and two practice trials. Participants were free to repeat the training phase if they felt unsure about the task procedure. After completing the training phase, all participants answered 20 randomly selected items, with the constraints that each item family was represented exactly once and that no base melody was heard more than once.
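
One way to satisfy both selection constraints is to draw four distinct base melodies per length level and assign one to each violation combination. The sketch below is an assumption about how such a draw could be implemented, not a description of the Concerto test code; it treats base melodies as specific to their length level and uses hypothetical field names.

```python
import random
from itertools import product


def draw_test(seed=None, n_levels=5, n_base=20, n_positions=3):
    """Draw 20 items: each item family once, no base melody repeated."""
    rng = random.Random(seed)
    violation_combos = list(product((False, True), repeat=2))
    items = []
    for level in range(n_levels):
        # Four distinct base melodies per length level, one per violation combination.
        bases = rng.sample(range(n_base), len(violation_combos))
        for base, (contour, tonality) in zip(bases, violation_combos):
            items.append(
                {
                    "length_level": level,
                    "base": base,
                    "contour_violation": contour,
                    "tonality_violation": tonality,
                    "odd_position": rng.randrange(n_positions),
                }
            )
    rng.shuffle(items)  # randomise presentation order
    return items
```

Because `random.sample` draws without replacement, no base melody can recur within a length level, and because base melodies belong to a single length level, none can recur across levels either.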

Ethics

All experimental protocols in this and subsequent studies were approved by the Ethics Committee of Goldsmiths, University of London, and all experiments were performed in accordance with the relevant guidelines and regulations. Informed consent was obtained from all participants prior to participation.

Results

Response data were modelled using the item response model described in equation (2) with n = 3. The model was fit with approximate marginal maximum likelihood, using the ltm package⁵⁵ in the statistical software environment R ⁵⁶.
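
Equation (2) is not reproduced in this section. As a rough illustration only, the sketch below assumes it has the common form of a logistic model whose lower asymptote is fixed at the chance level 1/n, with a single discrimination parameter shared by all items; the function and parameter names are hypothetical.

```python
import math


def p_correct(theta, difficulty, discrimination=1.0, n_alternatives=3):
    """Probability of a correct response under an assumed chance-floor logistic model."""
    chance = 1.0 / n_alternatives  # guessing floor: 1/3 for a 3-AFC trial
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return chance + (1.0 - chance) * logistic
```

Under this assumed form with n = 3, a participant whose ability equals the item difficulty succeeds with probability 1/3 + (2/3)(1/2) = 2/3, and a participant of very low ability approaches the chance rate of 1/3.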

Model quality was assessed in several ways. First, model fit for the different item families was assessed using Yen’s⁵⁷ Q₁ statistic with 10 ability groups, taking 500 Monte Carlo samples to estimate the distribution of the statistic under the null hypothesis, and calculating significance levels using Bonferroni correction. No item families exhibited statistically significant levels of poor fit, and this result proved robust to variation of the number of ability groups. Model fit and conditional independence were then assessed by computing the model fit on the two-way and three-way margins, after Bartholomew⁵⁸. Out of the 760 pairs of items and response patterns examined for the two-way margins, only one pair (0.13% of the total) was flagged for poor fit by Bartholomew’s⁵⁸ criterion (test statistic greater than 4.0). For the three-way margins, meanwhile, only 67 out of 9,120 triples (0.73%) were flagged for poor fit. Overall, these results suggested that the model fit well and satisfied the assumption of conditional independence.
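
The counts of margins quoted above are consistent with the fit being evaluated over the 20 item families, with every binary response pattern counted separately. The sketch below checks that arithmetic; it assumes the flagging statistic is the usual residual (O − E)²/E, and the function name is hypothetical.

```python
from math import comb

N_ITEM_FAMILIES = 20

# Each pair of families has 2^2 response patterns; each triple has 2^3.
two_way_cells = comb(N_ITEM_FAMILIES, 2) * 2 ** 2    # 190 pairs x 4 patterns
three_way_cells = comb(N_ITEM_FAMILIES, 3) * 2 ** 3  # 1140 triples x 8 patterns


def is_flagged(observed, expected, threshold=4.0):
    """Flag a cell whose residual (O - E)^2 / E exceeds the threshold."""
    return (observed - expected) ** 2 / expected > threshold


pct_two_way = 100 * 1 / two_way_cells
pct_three_way = 100 * 67 / three_way_cells
```

For example, a cell with 30 observed and 20 expected responses has a residual of 5.0 and would be flagged, whereas 25 observed against 20 expected gives 1.25 and would not.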

Unidimensionality was then tested using modified parallel analysis⁵⁹. This involved calculating the second eigenvalue of the tetrachoric correlation matrix for the response data, and comparing this eigenvalue to a Monte Carlo simulation of its distribution under the null hypothesis. The results showed no evidence for multidimensionality (500 Monte Carlo samples, p = 0.49).

Effective AIG using weak theory requires that items in the same family possess similar item difficulties. In the Identical Siblings Model, similar item difficulties lead to higher discrimination parameters⁴³. The observed global discrimination parameter for the melodic discrimination test was relatively high (1.31), suggesting similar difficulties within item families and hence suitability for AIG.

A linear regression model was then constructed to investigate the effects of the radicals on item difficulty (Fig. 2). The model predictors comprised the three radicals (melody length, contour violation, and tonality violation), as well as all pairwise interactions between these radicals. Melody length was treated as a continuous variable, and linearly scaled to take a mean of 0 and a standard deviation of 1. The resulting regression model was statistically significant, F(6, 13) = 21.99, p < 0.001, with an adjusted R² of 0.869 (Table 1). As hypothesised, longer melodies were significantly harder than shorter melodies, while contour and tonality violations significantly reduced item difficulty (tonality violations more so than contour violations). However, the interaction effects show that the impact of contour and tonality violations depended on melody length. As melody length increased, contour violations reduced difficulty less, but tonality violations reduced difficulty more.
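
The structure of this regression can be made explicit by building its design matrix. The sketch below uses placeholder length values (the actual note counts are not assumed here), standardises length with the sample standard deviation, and uses hypothetical names throughout; it is meant only to show where the degrees of freedom in F(6, 13) come from.

```python
from itertools import product
from statistics import mean, stdev

LENGTH_LEVELS = [1, 2, 3, 4, 5]  # hypothetical placeholders for the five length levels

# One row per item family: (length, contour violation, tonality violation).
families = list(product(LENGTH_LEVELS, (0, 1), (0, 1)))
lengths = [length for length, _, _ in families]
mu, sigma = mean(lengths), stdev(lengths)


def design_row(length, contour, tonality):
    """Intercept, three main effects, and the three pairwise interactions."""
    z = (length - mu) / sigma  # melody length scaled to mean 0, SD 1
    return [1.0, z, contour, tonality, z * contour, z * tonality, contour * tonality]


X = [design_row(*family) for family in families]
n_obs, n_coef = len(X), len(X[0])
df_model = n_coef - 1      # predictors excluding the intercept
df_resid = n_obs - n_coef  # observations minus estimated coefficients
```

With 20 item families and seven coefficients, the model has 6 numerator and 13 denominator degrees of freedom, matching the reported F statistic.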

Table 1 Linear regression model predicting item difficulty from the radicals (test Version 1).

Melodic discrimination ability scores (expected a posteriori) were calculated for all participants on the basis of the IRT model. A linear regression was then conducted to investigate how sample group and musical training contributed to task performance. Nine participants were excluded on account of missing data, and musical training scores were linearly scaled to a mean of 0 and a standard deviation of 1. The regression was statistically significant, F(3, 411) = 118, p < 0.001, and had an adjusted R² of 0.459 (Table 2). This model indicated that musical training and membership of the adult group were both positively associated with task performance, but that musical training was a stronger predictor of performance for the adult group than for the child group.
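
An expected a posteriori score is the mean of the posterior distribution over ability given a participant's responses. The sketch below approximates it on a grid, assuming a standard normal prior and an assumed response model with a chance floor of 1/n (equation (2) is not reproduced in this section). The default discrimination of 1.31 is the global estimate reported above; all names are hypothetical and the routine is not the one used by the ltm package.

```python
import math


def eap(responses, difficulties, discrimination=1.31, n_alternatives=3):
    """Grid approximation to the posterior mean and SD of ability."""
    grid = [-4.0 + 0.1 * k for k in range(81)]  # ability values from -4 to 4
    chance = 1.0 / n_alternatives
    weights = []
    for theta in grid:
        weight = math.exp(-0.5 * theta ** 2)  # unnormalised standard normal prior
        for correct, b in zip(responses, difficulties):
            logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - b)))
            p = chance + (1.0 - chance) * logistic
            weight *= p if correct else 1.0 - p
        weights.append(weight)
    total = sum(weights)
    post_mean = sum(t * w for t, w in zip(grid, weights)) / total
    post_var = sum((t - post_mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return post_mean, math.sqrt(post_var)
```

The posterior standard deviation returned alongside the mean gives a per-participant measure of uncertainty; with no responses the function simply recovers the prior (mean close to 0, SD close to 1).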

Table 2 Linear regression model predicting melodic discrimination ability (Study 1).

Ability scores for the two sample groups were then compared with the distribution of item difficulties for test Version 1 (Fig. 3). The distribution of item difficulties matched the adult ability distribution relatively well, but matched the children less well, who performed significantly worse than the adults (mean child ability = −0.364, mean adult ability = 0.614, Welch t-test, t(273.6) = −13.796, p < 0.001).

Discussion

The purpose of this first study was to estimate psychometric parameters for Version 1 of the adaptive melodic discrimination test. The results suggest that this process was successful. IRT assumptions of model fit, item conditional independence, and unidimensionality were satisfied, and the high item discrimination indicated that item difficulty could be predicted well by item-family membership.

Construct representation is the aspect of construct validity concerning the cognitive mechanisms that underlie task performance⁴⁰. We investigated construct representation by regressing item difficulties on structural item features. The regression model showed that longer melodies produced harder items, and that contour violations and tonality violations produced easier items, as hypothesised. The results support the cognitive model of the melodic discrimination task and hence its construct representation.

The regression model also found that the effect of contour violations decreased for longer melodies, whereas the effect of tonality violations increased for longer melodies. These results were not predicted in test construction, but are consistent with prior research. An interaction between contour violation and melody length has been described by Edworthy⁶⁰, who found that listeners are better at detecting contour violations than interval violations for short melodies, but better at detecting interval violations than contour violations for long melodies. Edworthy⁶⁰ suggested that this is because contour can be encoded independently of tonal context, unlike interval information, whose encoding benefits from the greater tonal context available in longer melodies. Meanwhile, the interaction between tonality violation and melody length is consistent with previous work showing that tonality violations are easier to detect with greater tonal context⁶¹. As melody length increases, tonal context increases, and hence tonality violations are more salient, as observed. Both of these effects are consistent with our cognitive model.

Nomothetic span is a complementary aspect of construct validity concerning how test-taker scores relate to other variables⁴⁰. We investigated nomothetic span by regressing test scores on other participant attributes. Musical training was positively associated with melodic discrimination ability, consistent with prior research^{4, 6, 11}. Membership of the adult group positively predicted melodic discrimination ability, perhaps because adults tend to possess better developed cognitive abilities, but also perhaps because the adults were better motivated (adults opted in for testing, whereas children could only opt out) and less tired (the total length of the testing session was shorter for adults than for children). These results are consistent with our conception of melodic discrimination ability, and hence support the test’s nomothetic span. However, the limited number of comparison variables limits the conclusions that can be drawn; nomothetic span was therefore explored further in Study 2.

It is interesting that musical training was more strongly associated with melodic discrimination performance for adults than for children. One explanation is that the test’s difficulty was better suited to the adults than to the children, resulting in higher discrimination power, higher reliability, and hence higher correlations between test scores and musical training for the adult group.

Our cognitive model describes four primary cognitive processes behind melodic discrimination. Individual differences in each of these cognitive processes could lead to a multidimensional melodic discrimination ability. However, the results showed no evidence for multidimensionality. There are several possible interpretations for this: (a) multiple abilities exist, but they are highly correlated and hence difficult to distinguish psychometrically; (b) multiple abilities exist, but some are more prone to individual differences than others; (c) multiple abilities exist, but not all are tested fully by the melodic discrimination task. For practical purposes, however, it seems that melodic discrimination can be treated as a unidimensional ability.

Comparing the distributions of ability scores and item parameters indicated that the item bank suited the adult sample group well, but did not contain enough easy items for the child group. This limitation was subsequently addressed in Study 3. First, however, a study was conducted to validate this first version of the adaptive melodic discrimination test.

Study 2: First validation

The purpose of this study was to gather three types of validation information about the adaptive melodic discrimination test. First, we aimed to gather population norms for the test, so that future test-takers could be evaluated with respect to the general population. The second aim was to investigate the test’s nomothetic span, the matter of how test-taker scores relate to other variables. The third aim was to investigate the test’s reliability.

Different participant populations are relevant for different scenarios in psychology research and in educational testing. One important population is that of the country as a whole. Here we estimated norms for this population by testing a nationally representative group sourced by a market research company. However, many music psychology studies do not randomly sample from the entire population, but instead use self-selected sample groups. Self-selected participants are more likely to be actively interested in music, and may correspondingly have better musical listening abilities. We therefore also collected norms for a sample group of self-selected participants recruited by word of mouth and by social media.

Nomothetic span had received preliminary analysis in the previous study, but we explored it further here. Three aspects of nomothetic span were assessed: concurrent validity, convergent validity, and divergent validity. Concurrent validity is demonstrated when test scores correlate well with test scores from a pre-established test of the same ability. Here we investigated concurrent validity with a shortened version of the Musical Ear Test⁴, which has been shown to discriminate reliably between professional musicians, amateur musicians, and non-musicians, as well as predicting various aspects of musical expertise. Convergent validity means that test scores correlate appropriately with measures of other related abilities. We assessed convergent validity by testing how melodic discrimination scores related to musical training; previous research indicated that musical training should be positively associated with melodic discrimination scores^{4, 6, 11}. Lastly, divergent validity is shown when test scores show an appropriate lack of correlation with measures of theoretically unrelated abilities. Here we assessed divergent validity using a low-level psychoacoustic task where the participant had to determine the order of short successive tones.

A reliable test is one that delivers similar results when administered under similar situations. In this study we measured test reliability in two ways. First, we used the IRT model to estimate the statistical uncertainty of its ability estimates. Second, we administered the test twice to a subset of participants, and investigated how well test scores correlated between the two administrations (test-retest reliability). We placed a special focus on investigating reliability at different possible test lengths, aiming to see whether the test could still perform well when shortened.