Spike sorting based on shape, phase, and distribution features, and K-TOPS clustering with validity and error indices

Spike sorting is one of the most important data-analysis problems in neurophysiology. The precision of each step of the spike-sorting procedure critically affects the accuracy of all subsequent analyses. Once data preprocessing and spike detection have been carried out properly, feature extraction and spike clustering become the most critical remaining steps of the procedure. The proposed spike-sorting approach comprises a new feature extraction method based on the shape, phase, and distribution features of each spike (hereinafter the SS-SPDF method), which reveal significant information about the neural events under study. In addition, we applied an efficient clustering algorithm based on K-means and template optimization in phase space (hereinafter K-TOPS) that includes two integrative clustering measures (validity and error indices) to verify, respectively, the cohesion-dispersion among spike events during classification and the misclassification of the clustering. The proposed method/algorithm was tested on both simulated data and real neural recordings. The results obtained for these datasets suggest that our spike-sorting approach provides an efficient way of sorting both single-unit spikes and overlapping waveforms. By analyzing raw extracellular recordings collected from the rostral-medial prefrontal cortex (rmPFC) of behaving rabbits during classical eyeblink conditioning, we have demonstrated that the present method/algorithm performs better at classifying spikes and neurons, and at assessing their modulating properties, than other methods currently used in neurophysiology.

The goal of the spike classification, however, was to find a single optimal value for the number of clusters together with the best corresponding clustering.
For this purpose, a proper validity measure (the CD-index) was implemented in this work to obtain the optimal number of clusters among all the distance-metric combinations. The main goal was to make the three internal validation indices interact among themselves so as to produce the maximum cohesion-dispersion of the clustering across all the distance-metric combinations (see Methods in the main text). In combination with this validity index (CD-index) for determining the optimal number of clusters, a customized clustering error index (CE-index) was applied for determining the optimal clustering (see Methods in the main text and Supplementary Appendix S5 for details).
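The CD-index itself is defined in the Methods of the main text; purely as an illustration of the selection logic, it can be sketched in Python with a hypothetical combination rule that rewards high Silhouette and Dunn values while penalizing high Davies-Bouldin values. Both the function names and the combination rule here are assumptions, not the published definition:

```python
def cd_index(s, db, d):
    # Hypothetical cohesion-dispersion score (NOT the published formula):
    # reward high Silhouette (s) and Dunn (d) values, penalize high
    # Davies-Bouldin (db) values.
    return (s + d) / (1.0 + db)

def best_k(scores):
    # scores maps each candidate number of clusters k to its (S, DB, D)
    # triple; the k with the highest combined score is selected.
    return max(scores, key=lambda k: cd_index(*scores[k]))
```

For example, a candidate clustering with S = 0.9, DB = 0.2 and D = 5.0 would be preferred over one with S = 0.4, DB = 1.0 and D = 1.0.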

APPENDIX S2: Silhouette Index
The Silhouette (S) index value for each point i is a measure of the similarity of a point to the points in its own cluster compared with the points in other clusters 5. For a given grouping $C = \{C_1, \ldots, C_c\}$, with c = number of classes, the method assigns a quality measure $s(i)$ to each object $i = 1, \ldots, m_j$, where $m_j$ is the number of objects of $C_j$. The Silhouette index value is defined as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

In the above expression, a(i) is the average distance between the i-th object and all the other objects included in $C_j$, while b(i) is the minimum of the average distances between the i-th object and all the objects classified in $C_k$, $k = 1, \ldots, c$, $k \neq j$.
The index value will be in the range $-1 \leq s(i) \leq 1$, and any object being poorly grouped will have an index value $-1 \leq s(i) < 0$. The index for a group $C_j$ is defined as:

$$S_j = \frac{1}{m_j} \sum_{i=1}^{m_j} s(i)$$

Here $m_j$ is the number of objects within $C_j$ and c is the total number of classes. Therefore, the average value of the Silhouette index, which indicates whether the classification is optimal, adequate, and well separated so as to provide the optimal number of classes, is:

$$S = \frac{1}{c} \sum_{j=1}^{c} S_j$$
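As a concrete illustration, these quantities can be computed directly in pure Python. This is a minimal sketch assuming Euclidean distance and assigning s(i) = 0 to objects in singleton clusters (a common convention, not stated in the text):

```python
def euclid(p, q):
    # Euclidean distance between two points given as coordinate tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def silhouette(points, labels):
    # Group the points by their cluster label.
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:  # singleton cluster: s(i) taken as 0 by convention
            scores.append(0.0)
            continue
        # a(i): average distance to the other objects of the same cluster
        a = sum(euclid(p, q) for q in own) / len(own)
        # b(i): smallest average distance to the objects of any other cluster
        b = min(sum(euclid(p, q) for q in c) / len(c)
                for k, c in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    # Mean Silhouette value over all points.
    return sum(scores) / len(scores)
```

Two tight, well-separated clusters yield a value near 1, whereas mismatched labels drive the value toward (or below) 0.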

APPENDIX S3: Davies-Bouldin Index
This index measures the average similarity between groups of objects (classes) within a classification. A low value of the Davies-Bouldin (DB) index indicates that each class is composed of similar objects, is compact, and has its center far from the centers of the other classes 6. The average value of the index over the different classes determines the final value of DB. The index is defined with the following formulation:

$$DB = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} \left( \frac{\Delta(X_i) + \Delta(X_j)}{\delta(X_i, X_j)} \right)$$

In the above expression, $X_j$ and $X_i$ represent the classes j and i, respectively; $\delta(X_i, X_j)$ is the distance between the classes $X_i$ and $X_j$; the term $\Delta(X_i)$ represents the distances within the class $X_i$; and finally c is the number of classes of similar objects within the classification $C = \bigcup X_i$, $i \in \{1, \ldots, c\}$.
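A minimal pure-Python sketch of this index, assuming Euclidean distances, with Δ taken as the mean distance of a class's points to its centroid and δ as the distance between centroids (one common choice among several):

```python
def db_index(clusters):
    # clusters: list of classes, each a list of coordinate tuples.
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    def centroid(c):
        return tuple(sum(x) / len(c) for x in zip(*c))
    cents = [centroid(c) for c in clusters]
    # Delta(X_i): mean distance of the class's points to its centroid.
    spread = [sum(dist(p, m) for p in c) / len(c)
              for c, m in zip(clusters, cents)]
    c = len(clusters)
    total = 0.0
    for i in range(c):
        # Worst-case similarity of class i to any other class j.
        total += max((spread[i] + spread[j]) / dist(cents[i], cents[j])
                     for j in range(c) if j != i)
    return total / c
```

Compact classes with distant centers give a small value, consistent with the "lower is better" reading above.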

APPENDIX S4: Dunn Index
This index relates the minimum distance between classes to the maximum distance within each class of similar objects; i.e., it identifies compact, well separated classes 7. Therefore, high values of Dunn's index indicate the presence of very compact, well separated classes within a classification. The number of classes that maximizes the D-index is taken as the optimal number of classes. For any classification $C = \bigcup X_i$, $i \in \{1, \ldots, c\}$, the D-index is defined as:

$$D = \min_{1 \leq i \leq c} \; \min_{\substack{1 \leq j \leq c \\ j \neq i}} \left\{ \frac{\delta(X_i, X_j)}{\max_{1 \leq k \leq c} \Delta(X_k)} \right\}$$

Here, $\delta(X_i, X_j)$ denotes the distance between the classes $X_i$ and $X_j$; $\Delta(X_k)$ represents the distances within the class $X_k$; and c is the number of classes of similar objects within the classification $C = \bigcup X_i$, $i \in \{1, \ldots, c\}$.
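A minimal pure-Python sketch, assuming Euclidean distances, with Δ taken as the class diameter (its largest within-class pairwise distance) and δ as the closest pair of points between two classes (one common choice among several):

```python
def dunn_index(clusters):
    # clusters: list of classes, each a list of coordinate tuples.
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    # max_k Delta(X_k): largest within-class pairwise distance.
    diam = max(dist(p, q) for c in clusters for p in c for q in c)
    # min delta(X_i, X_j): closest pair of points taken from two classes.
    sep = min(dist(p, q)
              for i, ci in enumerate(clusters)
              for cj in clusters[i + 1:]
              for p in ci for q in cj)
    return sep / diam
```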

APPENDIX S5: CE-index
The CE-index quantifies the similarity between the observed classification matrix and the expected classification matrix. This index is an adaptation of the error index proposed by Letelier and Weber (2000) 8. In the present work, the observed classification matrix, based on the mean correlation coefficients (template vs. spikes), was defined as:

$$M_{obs} = \begin{pmatrix} \Delta_{11} & \Delta_{12} & \cdots & \Delta_{1c} \\ \Delta_{21} & \Delta_{22} & \cdots & \Delta_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ \Delta_{c1} & \Delta_{c2} & \cdots & \Delta_{cc} \end{pmatrix}$$

In the case of c clusters resulting from the classification, the i-th row represents the spikes belonging to the i-template, while the clusters represent the groups distinguished by the automatic clustering algorithm. Here, the diagonal elements $\Delta_{ii}$ are the mean correlation coefficients for the relationships between the template corresponding to the i-cluster and all its spike events (i.e., the template vs. spikes mutual-correlations), while the off-diagonal elements $\Delta_{ij}$ are the mean correlation coefficients for the relationships between the template corresponding to the i-cluster and all the spike events of another j-cluster (i.e., the template vs. spikes mixed-correlations, when i ≠ j). In this way, comparisons of all spike events with all available templates were carried out.
A perfect performance of the sorting procedure, at $\Delta_{ii} = 1$ (i.e., the strongest mutual-correlations possible) with $i = 1, \ldots, c$, should produce the identity matrix as the expected classification matrix:

$$M_{exp} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$$

In order to compare our approach with other spike-sorting methods 8,9, we also estimated the observed and expected classification matrices, as well as the CE-index and the customized Error Index based on the false-classified and unclassified spike events, where $d_{ii}$ and $nd_{ij}$ are the diagonal and nondiagonal elements, respectively, of the observed classification matrix. Here, $N_i$ is the total number of spikes that we know belong to the i-cluster.
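For illustration, an error count in the spirit of Letelier and Weber can be sketched in Python. This assumes ground-truth template memberships are available (as with the simulated data) and uses a simple normalization by the total number of spikes; the function and its normalization are assumptions, not the published CE-index formula:

```python
def error_index(assigned, truth):
    # assigned[i]: cluster label the sorter gave spike i, or None if the
    # spike was left unclassified; truth[i]: the true template label.
    n = len(truth)
    unclassified = sum(1 for a in assigned if a is None)
    misassigned = sum(1 for a, t in zip(assigned, truth)
                      if a is not None and a != t)
    # Fraction of spikes that were either misassigned or unclassified.
    return (misassigned + unclassified) / n
```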
Finally, the customized Error Index reported in Table S1 (for K-means, first step alone), in Table 4 (for the K-TOPS clustering algorithm) and in Table 5 (for the comparison among the different methods) of the main text, corresponding to the simulated data with three spike templates (T1, T2 and T3) and a substantial noise component, was calculated according to the equation above.

Cityblock
Sum of absolute differences.

Cosine
One minus the cosine of the included angle between points (treated as vectors).

Correlation
One minus the sample correlation between points (treated as sequences of values).

Hamming
Percentage of coordinates that differ.

Jaccard
Percentage of nonzero coordinates that differ.
(See Table 6 for description.)
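For reference, the listed dissimilarities can be written out in pure Python (a sketch following the usual pairwise-distance conventions; Hamming and Jaccard are returned as fractions rather than percentages, and the correlation distance is undefined for constant sequences):

```python
def cityblock(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine(p, q):
    # One minus the cosine of the angle between p and q as vectors.
    def norm(v):
        return sum(x * x for x in v) ** 0.5
    dot = sum(a * b for a, b in zip(p, q))
    return 1.0 - dot / (norm(p) * norm(q))

def correlation(p, q):
    # One minus the sample correlation: cosine distance of the
    # mean-centered sequences.
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    return cosine([a - mp for a in p], [b - mq for b in q])

def hamming(p, q):
    # Fraction of coordinates that differ.
    return sum(a != b for a, b in zip(p, q)) / len(p)

def jaccard(p, q):
    # Fraction of differing coordinates among those where at least one
    # of the two coordinates is nonzero.
    nz = [(a, b) for a, b in zip(p, q) if a != 0 or b != 0]
    return sum(a != b for a, b in nz) / len(nz)
```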
The values of the execution times are reported by the Mean ± SEM (standard error of the mean).
Here, the values of the statistics H (One-Way ANOVA on Ranks) and F (One-Way ANOVA F-test) are reported. The significance level (P-value) is indicated in each case.

Session FV
Total execution time (in minutes) for each session (60 trials) of recordings (Table S1), using seven clustering metrics (sqEuclidean, Euclidean, Cityblock, Cosine, Correlation, Hamming and Jaccard; Table S2) for all possible values of K.
(B) In this diagram, the K-means method (for a selected distance-metric combination) uses the internal validation indices Silhouette (S), Davies-Bouldin (DB) and Dunn (D). A general index measuring the cohesion and dispersion of the clustering (the CD-index) is also calculated for each classification. The highest value of the CD-index determines the optimal number of clusters. Note that the maximum number of clusters n should not exceed √s, where s is the number of detected spikes in the neural recording.
Neurons from the rostro-medial prefrontal cortex were recorded during classical eyeblink conditioning using a delay conditioning paradigm with 250 ms of inter-stimulus interval. All the illustrated raster displays were collected during the 8th conditioning session (C8, 60 trials). Representative examples of three different types of firing rates resulting from the application of our spike sorting method/algorithm are shown. The three firing rates were characterized by having their maximum peaks close to the beginning of the CS (blue raster and firing rate; time to peak 65.3 ms, maximum rate 47.99 spikes/s); in the center of the CS-US interval (red raster and firing rate; time to peak 231. ms, maximum rate 38.15 spikes/s); or next to the end of the US (green raster and firing rate; time to peak 348.99 ms, maximum rate 49.59 spikes/s).