Introduction

Although the use of spectral karyotyping (Macville et al, 1997; Schrock et al, 1997; Veldman et al, 1997) is redefining the role of G-banding in chromosome analysis, analysis of chromosome banding patterns remains a cornerstone of karyotypic analysis both for routine diagnosis and for application in such techniques as comparative genomic hybridization (Piper et al, 1995). Chromosome classification and analysis are aided by the use of automated karyotyping systems that yield a preliminary classification for each chromosome, which may be corrected by hand as necessary. Automated karyotyping relies upon acquisition of a digital image, followed by extraction of chromosome features. Two general approaches to feature extraction are employed: gray level encoding of each chromosome and more complex extraction of distinctive features. These features may then be used in an algorithm that assigns the chromosome to one of 24 classes (autosomes 1–22, X, and Y). A variety of such algorithms has been proposed, based upon approaches such as Bayesian analysis (Lundsteen et al, 1986), Markov networks (Granum and Thomason, 1990; Guthrie et al, 1993), neural networks (NN) (Beksac et al, 1996; Errington and Graham, 1993; Graham et al, 1992; Jennings and Graham, 1993; Korning, 1995; Leon et al, 1996; Malet et al, 1992; Sweeney et al, 1994; Sweeney et al, 1997), and simple feature matching (Piper and Granum, 1989). The reported classification accuracy varies surprisingly little by approach. Most methods achieve approximately 90% correct classification of the Copenhagen chromosome data set; commercial implementations typically achieve approximately 80% correct classification in routine use.

Automated chromosome classification entails several steps. First, an image segmentation step is used to create distinct images of each chromosome in a metaphase. Then, salient features of the chromosome image are extracted. Typically, gray level encoding is employed to represent the chromosome by a vector of gray level values, which are obtained by sampling at evenly spaced intervals along the chromosome's medial axis. (See, for example, Errington and Graham, 1993.) Different vectors may contain a different number of samples, so vectors are typically stretched or compressed to a fixed number of entries via constant interpolation or downsampling. Because variations in lighting can cause the gray scale measurements to vary, all stretched vectors are normalized to Euclidean magnitude 1. Figure 1 illustrates this stretching, and Figure 2 illustrates the variations in measured values for chromosomes having the same identity. Chromosome 1 is usually the easiest to identify; it is physically the longest chromosome, and the banding pattern is particularly distinctive. The Y chromosome is among the hardest; it is physically relatively short, and the banding pattern is often rather indistinct.
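
The encoding step described above can be sketched as follows. The 93-pixel target length is taken from Figure 1, and linear interpolation stands in for whatever resampling a particular system uses; both choices are illustrative assumptions.

```python
import numpy as np

def encode_profile(gray_levels, target_len=93):
    """Stretch a medial-axis gray level trace to a fixed length and
    normalize it to Euclidean magnitude 1."""
    g = np.asarray(gray_levels, dtype=float)
    # Resample onto target_len evenly spaced positions along the axis.
    old_x = np.linspace(0.0, 1.0, len(g))
    new_x = np.linspace(0.0, 1.0, target_len)
    stretched = np.interp(new_x, old_x, g)
    # Normalize so the overall lighting level does not affect comparisons.
    return stretched / np.linalg.norm(stretched)

# A 64-pixel trace stretched to 93 pixels, as in Figure 1.
profile = encode_profile(np.random.rand(64))
```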

Figure 1

Stretching a chromosome. The ordinate (y-axis) shows the gray level (staining intensity) as a function of the position along the chromosome, shown on the x-axis (from the p-terminus on the left to the q-terminus on the right). The chromosome has been stretched from 64 pixels to 93 pixels in length.

Figure 2

Samples of chromosome 1 and the Y chromosome from the Edinburgh data set, with two samples of each highlighted for clarity. The ordinate (y-axis) shows the gray level (staining intensity) as a function of the position along the chromosome, shown on the x-axis (from the p-terminus on the left to the q-terminus on the right).

Feature extraction provides an alternative to gray level encoding. Piper and Granum (1989), for example, have proposed the use of 30 classification parameters derived from automated measurements. These features include the following:

  • physical length of the chromosome

  • location of the centromere (a narrowed region of the chromosome)

  • the area of the chromosome

  • the perimeter of the convex hull of the chromosome

  • the number of bands

  • inner products of the gray level values with various basis vectors resembling a set of wavelet “hat” functions.
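
As an illustration of the last feature, the sketch below forms inner products of a gray level profile with triangular "hat" vectors. The exact basis used by Piper and Granum is not reproduced here; the placement and widths of the hats are assumptions.

```python
import numpy as np

def hat(n, center, half_width):
    """Triangular 'hat' basis vector of length n."""
    x = np.arange(n)
    return np.maximum(0.0, 1.0 - np.abs(x - center) / half_width)

def hat_features(profile, n_hats=6):
    """Inner products of the gray level profile with hat vectors at
    evenly spaced centers (centers and widths are assumptions)."""
    n = len(profile)
    centers = np.linspace(0, n - 1, n_hats)
    width = n / n_hats
    return np.array([profile @ hat(n, c, width) for c in centers])

feats = hat_features(np.random.rand(93))
```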

In summary, the problem is to assign an identity (1–22, X, or Y) to a chromosome, given a vector containing its gray level measurements or other measured features and some training vectors with known identities. Helpful additional output would include degree of certainty in identification, identification of abnormal chromosomes, and automatic characterization of abnormalities.

In this paper we propose some new approaches for solving the problem of automated chromosome identification:

  • singular value decomposition

  • principal component analysis

  • Fisher discriminant analysis

  • hidden Markov models.

A brief description of each approach follows; more details may be found in the Appendix.

Singular Value Decomposition (SVD)

One way to pose our problem is to seek among all vectors in the training set the one that most closely matches the vector of unknown identity. We then assign the unknown vector the identity of this most closely matching chromosome. Viewed in this way, the problem resembles the retrieval of a document whose keywords most closely match those of a query. We represent each document and query by a vector indicating the relative importance of each keyword. Literal matching (eg, taking inner products of document vectors with the query and then choosing the maximum) is not usually the best strategy because latent relationships and document clusters are not revealed.

Instead, in the latent semantic indexing (LSI) method (Berry et al, 1995; Deerwester et al, 1990), the vectors characterizing the documents form the columns of the document matrix. We approximate this matrix by a low-rank matrix, and then scoring is done by inner products of the query with the low-rank approximation.
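
A minimal sketch of this scoring scheme, assuming the training chromosomes are stored as the columns of a matrix; the dimensions and data below are invented, and the rank k = 24 matches the value used in our experiments.

```python
import numpy as np

def svd_scores(train_matrix, query, k=24):
    """Score a query against the rank-k approximation of the training
    ('document') matrix, LSI style; columns are training chromosomes."""
    U, s, Vt = np.linalg.svd(train_matrix, full_matrices=False)
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]  # best rank-k approximation
    return low_rank.T @ query  # inner product of the query with each column

rng = np.random.default_rng(0)
train = rng.standard_normal((45, 100))  # 100 invented training vectors
scores = svd_scores(train, rng.standard_normal(45))
best_match = int(np.argmax(scores))  # index of the closest training chromosome
```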

Principal Component Analysis (PCA)

The SVD algorithm implicitly assumes that the measurements are independent and have similar standard deviations. If we wish to take covariances into account, then we need to use the SVD in a somewhat different way. Rather than finding the identity with the largest score, we find the one with the minimal Mahalanobis distance to the mean of the training chromosomes of that identity.
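
A sketch of classification by minimal Mahalanobis distance; the class means and covariances would in practice be estimated from the training data (the paper obtains them via the SVD), and the two classes below are invented.

```python
import numpy as np

def mahalanobis_classify(x, class_means, class_covs):
    """Assign x to the class whose mean is nearest in Mahalanobis
    distance, accounting for each class's covariance."""
    d2 = [(x - mu) @ np.linalg.solve(cov, x - mu)
          for mu, cov in zip(class_means, class_covs)]
    return int(np.argmin(d2))

# Two invented classes in 2 dimensions:
means = [np.zeros(2), np.full(2, 10.0)]
covs = [np.eye(2), np.eye(2)]
label = mahalanobis_classify(np.array([9.0, 9.0]), means, covs)
```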

Fisher Discriminant Analysis (FDA)

Fisher discriminant analysis (Mardia et al, 1979) is similar to principal component analysis in that it uses a multidimensional normal distribution to model the 24 clusters. Also, like PCA, it projects the data into a lower dimensional space. This projection is not done via the SVD but rather by solving a generalized eigenvalue problem. The projection is computed using the training data so as to maximize the ratio of the between cluster distances to the within cluster distances.
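
The generalized eigenvalue problem can be sketched as follows, reducing it to an ordinary symmetric eigenproblem via a Cholesky factor of the within-cluster scatter; the small ridge term on Sw is an assumption added for numerical safety, and the two-cluster data are invented.

```python
import numpy as np

def fisher_projection(X, labels, dim):
    """Projection maximizing between-cluster over within-cluster
    scatter: solve Sb v = lambda Sw v by whitening with a Cholesky
    factor of Sw."""
    n_feat = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((n_feat, n_feat))
    Sb = np.zeros((n_feat, n_feat))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - overall_mean, mc - overall_mean)
    L = np.linalg.cholesky(Sw + 1e-8 * np.eye(n_feat))
    # M = L^{-1} Sb L^{-T} is symmetric; its top eigenvectors give the
    # discriminant directions after back-transforming by L^{-T}.
    B = np.linalg.solve(L, Sb)
    M = np.linalg.solve(L, B.T).T
    _, vecs = np.linalg.eigh(M)
    top = vecs[:, ::-1][:, :dim]  # eigenvectors of the largest eigenvalues
    return np.linalg.solve(L.T, top)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
W = fisher_projection(X, y, dim=1)  # project via X @ W
```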

Hidden Markov Models (HMM)

One characteristic of speech problems as well as chromosome karyotyping is that the vectors can be of variable length. For instance, the duration of sound for a given phrase varies from speaker to speaker and even from trial to trial. Similarly, the number of gray levels sampled from a chromosome is variable. The SVD, PCA, and NN models all must normalize the input vector to a fixed number of entries, but hidden Markov models (HMM) (Baum and Eagon, 1967; Baum et al, 1970) have no such restrictions. (See Rabiner, 1989, and Rabiner and Juang, 1986, for an introduction to these methods.)

The models that we build work with a sequence of gray level “triples” (Fig. 3). From the vector of gray scale observations, we form a vector of first differences and a vector of second differences. Our 24 models output triples of observations approximating those of typical chromosomes of each identity. An observer, then, would see a sequence of gray level triples, each representing a single entry from each of the three vectors. Hidden from the observer is a Markov chain that is generating the output. The current state of the chain produces a single output triple, and then a new state is chosen according to probabilities specified in a transition probability matrix. For each sequence of gray level triples, we compute a probability that the sequence was generated by each of the 24 models. This 24-element vector of scores is used to classify each unknown chromosome. The details of the HMM classifier are given in the Appendix of this article.
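
The scoring step, computing the probability that a given model generated an observed sequence, is the standard forward algorithm; a log-domain sketch follows (training by Baum-Welch is omitted, and the tiny two-state example is invented).

```python
import numpy as np

def log_forward(log_A, log_pi, emis_loglik):
    """Total log probability that a model generated the observations.
    emis_loglik[t, i] = log likelihood of the t-th gray level triple
    under the output function B of state i."""
    T, _ = emis_loglik.shape
    alpha = log_pi + emis_loglik[0]
    for t in range(1, T):
        # Sum over the hidden previous state, in the log domain.
        alpha = emis_loglik[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)

# Tiny 2-state example with uniform emissions; classification would
# evaluate an unknown sequence under each of the 24 models and take
# the largest score.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
emis = np.log(np.full((5, 2), 0.5))  # 5 observations, 2 states
score = log_forward(np.log(A), np.log(pi), emis)
```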

Figure 3

Graphic illustration of a hidden Markov model with 10 states. If the model is currently in state 3, then it outputs a triple of values (gray level, first difference, and second difference) chosen according to the output function B. Then, with probability anext the model transitions to state 4, with probability askip it transitions to state 5, and otherwise it stays in state 3 for another cycle.
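
The left-to-right topology of Figure 3 corresponds to a banded transition matrix; the sketch below builds one, with illustrative values for anext and askip (the trained values in the models are not reproduced here).

```python
import numpy as np

def left_right_transitions(n_states, a_next=0.3, a_skip=0.1):
    """Transition matrix for the topology of Figure 3: from state i
    the chain stays, advances one state (a_next), or skips a state
    (a_skip).  The probability values are illustrative only."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = 1.0 - a_next - a_skip
        if i + 1 < n_states:
            A[i, i + 1] = a_next
        if i + 2 < n_states:
            A[i, i + 2] = a_skip
        A[i] /= A[i].sum()  # renormalize the truncated final rows
    return A

A = left_right_transitions(10)
```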

Results

Experiment 1

First we compare our results with those of the neural net model of Errington and Graham (1993). For each of the three data sets (Philadelphia, Edinburgh, and Copenhagen), we display in Table 1 the percentage of chromosomes classified correctly by each model. We note that HMM frequently gives the best performance. SVD performs well and is generally better than PCA. The second differences have little effect on the performance of the neural nets, although the first differences improve performance slightly.

Table 1 Results of Experiment 1

Experiment 2

In this experiment, we explore the robustness of the methods when there are “mild” chromosomal abnormalities present. We degrade the data from each scoring chromosome (but not the training chromosomes) by taking the sequence of gray level values in the middle 10% and reversing their order. This simulates an internal inversion of chromosomal material.
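
This degradation can be sketched as follows; how the middle window is rounded to whole samples is an implementation assumption.

```python
import numpy as np

def invert_middle(profile, fraction=0.10):
    """Reverse the order of the middle `fraction` of the gray level
    sequence, simulating a small internal inversion."""
    g = np.asarray(profile, dtype=float).copy()
    half = int(round(len(g) * fraction)) // 2
    mid = len(g) // 2
    # Reverse the centered window in place.
    g[mid - half:mid + half] = g[mid - half:mid + half][::-1].copy()
    return g

degraded = invert_middle(np.arange(100.0))
```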

The behavior of the best methods from Experiment 1 is shown in Table 2. The HMM performs the best, degrading by at most 8 percentage points. The other three methods do not behave as well, but each achieves at least 56% accuracy.

Table 2 Results of Experiment 2

Experiment 3

Next we degrade the data by truncating each of the scoring chromosomes (but not the training chromosomes) by deleting either the first or last 10% of the gray level values in each sequence. This simulates an artifact commonly encountered during the “editing phase” of semiautomated karyotype analysis, in which overlapping chromosomes are “cut apart,” in addition to those deletions of the terminal chromosome arms that occur “naturally.” In Table 3 we see that the HMM is quite robust on this data, degrading by, at most, 4 percentage points. The SVD is moderately successful, but PCA and NN methods classify most chromosomes incorrectly.
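
The truncation can be sketched as follows; the choice of rounding, and the "p"/"q" naming for the two ends, are illustrative.

```python
def truncate_profile(profile, fraction=0.10, end="q"):
    """Delete the first ('p') or last ('q') `fraction` of the gray
    level values, simulating the cut-apart artifact."""
    n_cut = int(round(len(profile) * fraction))
    if n_cut == 0:
        return profile
    return profile[n_cut:] if end == "p" else profile[:-n_cut]

shortened = truncate_profile(list(range(100)))
```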

Table 3 Results of Experiment 3

Experiment 4

We also tested several algorithms on the feature data in an experiment analogous to that of Errington and Graham (1993). From the data in Table 4 we conclude that the best methods were PCA and FDA, both of which performed slightly better than the NN of Errington and Graham (1993).

Table 4 Results of Experiment 4

Experiment 5

Under the assumption that each metaphase consists of chromosomes from a single cell, classification errors can be further reduced by adding the constraint that the slide produced for a single cell contains, at most, two copies of each autosome, and either two X's or one X and one Y. This assumption is valid in some special cases, such as chromosome spreads produced for comparative genomic hybridization. To illustrate how this information can be used, we consider the results of FDA for the feature data. Given the FDA scores we can form two likelihood matrices, F and M, where F corresponds to the assumption that the patient is female and M that the patient is male. The likelihood matrix F is formed from the FDA log likelihoods as specified by the following equation:

F(i, 2k − 1) = F(i, 2k) = L(i, k) for k = 1, …, 22, and F(i, 45) = F(i, 46) = L(i, X),

where L(i, k) denotes the FDA log likelihood that chromosome i has identity k.
The matrix M is defined analogously, with columns 45 and 46 corresponding to the likelihoods of chromosomes X and Y, respectively. The total likelihood of the label assignment is maximized by solving two linear programs of a special type known as the linear assignment problem, which finds a matching of the rows to the columns with maximum sum. A number of very efficient polynomial-time methods exist for solving this problem (Papadimitriou and Steiglitz, 1982); the method used here was the Hungarian algorithm, whose running time is proportional to the cube of the number of rows. Both the Copenhagen and Edinburgh data sets have the property that each metaphase consists of chromosomes from a single cell. (A number of the metaphases from the Philadelphia data set had 47 chromosomes identified on them.) Tables 5 and 6 give the results of this linear assignment given gray scale values or feature vectors. For the Edinburgh and Copenhagen data sets, linear assignment applied to the results of the HMM for gray levels with two differences improved the accuracy by 4 to 6 percentage points. It improved the accuracy by 6 to 8 percentage points for the truncated sequences and by 9 percentage points on the chromosomes in which the centers were inverted. For feature data, the results improved by 2 or 3 percentage points and were slightly better than those achieved by Sweeney et al (1994).
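
The constrained classification can be sketched as follows. For clarity, the tiny example enumerates assignments by brute force rather than using the Hungarian algorithm, and the log likelihood values are invented.

```python
import numpy as np
from itertools import permutations

def constrained_labels(log_lik, copies=2):
    """Maximum-likelihood labeling in which each class may be used at
    most `copies` times: duplicate each class column once per allowed
    copy, then find a maximum-sum assignment of rows (chromosomes) to
    columns.  Brute force here; the paper uses the O(n^3) Hungarian
    algorithm."""
    F = np.repeat(log_lik, copies, axis=1)  # one column per allowed copy
    best, best_perm = -np.inf, None
    for perm in permutations(range(F.shape[1]), log_lik.shape[0]):
        total = sum(F[i, j] for i, j in enumerate(perm))
        if total > best:
            best, best_perm = total, perm
    return [j // copies for j in best_perm]  # map columns back to classes

# Invented scores: unconstrained argmax would place three chromosomes
# in class 0; the constraint forces two copies of each class.
log_lik = np.log(np.array([[0.9, 0.1],
                           [0.8, 0.2],
                           [0.6, 0.4],
                           [0.1, 0.9]]))
labels = constrained_labels(log_lik)
```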

Table 5 Results of Experiment 5: Gray Level Data
Table 6 Results of Experiment 5: Feature Data

Discussion

Interpretation of G-banded chromosomes remains the cornerstone of both routine karyotyping and chromosome identification for such molecular biologic methods as comparative genomic hybridization. Although algorithms intended to speed up karyotypic analysis are widely available, their use has been limited by their modest accuracy in classifying even “normal” chromosomes. Methods sufficiently robust to accurately classify chromosomes that are abnormal as a result of disease, constitutional anomaly, or artifacts introduced during acquisition of chromosome images have not been previously published.

In this paper, we demonstrate that although a number of algorithms achieve 90% accurate classification of “normal” chromosomes, most of these algorithms perform poorly when as little as 10% at the end of a chromosome arm is truncated. This amount of truncation is commonly encountered in practice. Although it usually results from artifacts associated with the acquisition and processing of digital data, truncation may also be characteristic of disease. Although the performance of these algorithms is better for chromosomes in which a small internal inversion has been simulated, the rate of correct classification is reduced by 6% to 22% even for these chromosomes. Automated classification methods are thus significantly less useful in routine practice than the putative 90% classification accuracy would suggest. Our results show that one type of classification algorithm, HMM, is significantly better at correctly identifying chromosomes bearing these abnormalities than are either the other novel algorithms we explored or the neural network algorithms in common use. Introducing a commonly encountered artifact (truncation of the terminal 10% of either the p- or q-arm) resulted in only a 5% to 11% degradation in the HMM's classification accuracy. In contrast, the next best method (Rank 24 SVD with 2 differences) had a 22% reduction in accuracy on truncated chromosomes from the Copenhagen data set. Our results further demonstrate the utility of constrained classification algorithms that rely on the observation that normal cells carry, at most, 2 of each autosome, and either 2 X chromosomes, or an X and a Y. These algorithms achieve classification accuracies for normal chromosomes of 95% to 97% (Copenhagen data set), even for truncated sequences. This represents a reduction of approximately 50% in the classification error rate.
Although this constraint is inappropriate in cases where cytogenetic anomaly is being sought (such as prenatal diagnosis or cytogenetic characterization of tumor specimens), it is useful and appropriate in applications such as comparative genomic hybridization, in which metaphase spreads are prepared from cell cultures of “normal” individuals.

HMMs have previously proven useful in speech recognition (Jelinek, 1995) and in several areas of biological science, including EKG analysis (Koski, 1996), gene identification (Lukashin and Borodovsky, 1998), and protein structure prediction (Sonnhammer et al, 1998). These problems are similar in that all involve classification of data sequences (vectors) that can demonstrate substantial within-class variations in length and pattern. HMMs appear to be especially well-suited to solving such problems, because the classification resulting from the model does not depend upon the precise location of values within the data vector, but rather upon the relationships between adjacent or nearly adjacent data values. This feature of HMMs is very useful in chromosome classification, because the same chromosome (chromosome 5, for example) can vary substantially in length among various metaphase spreads. This feature alone gives HMMs a robustness that is not found in most other classification approaches. One result is that chromosome classification using HMMs trained on normal chromosomes is expected to remain reliable for substantially larger truncations/terminal deletions and internal inversions than were explored in this paper.

HMMs are also expected to be useful in the characterization of abnormal karyotypes for which training data is not available. For example, one can synthetically create HMMs characteristic of reciprocal t(14; 18) translocations with varying break points, based upon data obtained from normal chromosomes 14 and 18. (This can also be done with NN, SVD, PCA, and FDA.) By competitively scoring these models, we may expect to obtain a fairly precise localization of the break point if a chromosome bearing t(14; 18) is encountered in a test set. By creating such “synthetic” HMMs for chromosomes bearing truncations and deletions, we may expect to further improve the classification of chromosomes bearing these anomalies as well.

In summary, we have applied four mathematical approaches for automated chromosome identification: singular value decomposition (SVD), principal components analysis (PCA), Fisher discriminant analysis (FDA), and hidden Markov models (HMM). We have demonstrated that although all these approaches yield similar results for “perfect” normal chromosomes, the HMM approach is superior for the identification of imperfect and/or abnormal chromosomes. Finally, we expect that the HMM approach can be implemented in a way that allows highly accurate classifications to be made even when few data are available upon which to train new models.

Materials and Methods

Data Preparation

The Copenhagen, Edinburgh, and Philadelphia data sets were used in creating and validating the mathematical models. These data sets consist of vectors of the gray values obtained from linear axial traces of 5100 to 8100 chromosomes each, together with the chromosome assignment. Each of these data sets was divided into “training” and “scoring” parts as done by Errington and Graham (1993).

Classification Experiments

A more precise description of each of the mathematical models underlying our classification experiments is given in the Appendix. The data for our SVD, PCA, and FDA experiments were the gray level vectors, augmented by their first and second differences. We computed the differences between vector elements and then stretched all three vectors to equal lengths (the length of the longest training vector). The first and second differences were weighted by a factor of 5. These vectors were then used without further preprocessing.

For these experiments, we set the rank of the SVD approximations to K = 24. A rank of 36 improves the results slightly but does not seem to be worth the extra computational effort.

Back-propagation networks were created and run using Netmaker Professional for Windows and Brainmaker Professional for Windows (California Scientific Software, Nevada City, California). Networks consisted of an input layer of 15, 30, or 45 nodes, depending on whether gray levels alone, gray levels plus first differences, or gray levels plus first and second differences were used as network input. A single hidden layer of 200 nodes was used. Network output was a 24-element vector, in which each element represented one chromosome type. In the training phase, each chromosome was represented by both the 15-, 30-, or 45-element input vector and a 24-element classification vector in which all elements were set to 0 except for the element corresponding to the encoded chromosome. Training was accomplished using a constant “learning rate” of 0.05, with a training tolerance of 0.4. When 75% correct classification of the training set was achieved, the training tolerance was reduced by a factor of 0.9. Training was discontinued after 1000 iterations in which each chromosome in the training set was presented to the network. During the testing phase, a chromosome was considered to be correctly identified by the neural network if the largest element in the neural network output vector corresponded to the correct chromosome number.
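
A sketch of the described topology and testing-phase scoring rule; the sigmoid activation and the random placeholder weights below are assumptions, since the trained Brainmaker weights are not reproduced here.

```python
import numpy as np

def nn_forward(x, W1, b1, W2, b2):
    """One hidden layer of 200 sigmoid nodes and a 24-element output,
    one element per chromosome class (weights here are placeholders)."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))

def is_correct(output, true_class):
    """Testing-phase rule: correct if the largest output element
    corresponds to the true chromosome class."""
    return int(np.argmax(output)) == true_class

rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.standard_normal((200, 45)), np.zeros(200)
W2, b2 = 0.1 * rng.standard_normal((24, 200)), np.zeros(24)
out = nn_forward(rng.standard_normal(45), W1, b1, W2, b2)
```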

For the HMM, the training data was further subdivided by chromosome type. An HMM was then found for each chromosome type, as discussed in the Appendix. The number of states was set to be the median length of chromosomes in the training data.