Clustering algorithm for formations in football games

In competitive team sports, players maintain a certain formation during a game to achieve effective attacks and defenses. For the quantitative game analysis and assessment of team styles, we need a general framework that can characterize such formation structures dynamically. This paper develops a clustering algorithm for formations of multiple football (soccer) games based on the Delaunay method, which defines the formation of a team as an adjacency matrix of Delaunay triangulation. We first show that heat maps of entire football games can be clustered into several average formations: “442”, “4141”, “433”, “541”, and “343”. Then, using hierarchical clustering, each average formation is further divided into more specific patterns (clusters) in which the configurations of players are different. Our method enables the visualization, quantitative comparison, and time-series analysis for formations in different time scales by focusing on transitions between clusters at each hierarchy. In particular, we can extract team styles from multiple games regarding the positional exchange of players within the formations. Applying our algorithm to the datasets comprising football games, we extract typical transition patterns of the formation for a particular team.

is as follows. In order to quantify a formation using an adjacency matrix $A(t)$, the uniform numbers $\vec{U} = [a, b, \ldots, j]$ of players (player identities) need to be assigned to the indexes $\vec{I} = [1, 2, \ldots, 10]$ of $A(t)$. If we cluster Delaunay networks of a single game in which no player substitutions occur, then an arbitrary correspondence between $\vec{U}$ and $\vec{I}$ can be employed. However, clustering over multiple games requires the assignment of multiple sets of uniform numbers $\vec{U}_1, \vec{U}_2, \ldots$ for different games to one set of indexes $\vec{I}$, and such an assignment is not uniquely determined.
To apply the Delaunay method to real game analysis, we have to deal with this problem. In fact, the difference in formations among multiple games is essential information for the assessment of teams' styles or strategies. In this paper, we propose an extended algorithm that can cluster formations over multiple games, and demonstrate the formation analysis by applying our algorithm to datasets comprising football games of the Japan professional football league (J League). Our method first clusters heat maps of multiple football games into several average formations: "442", "4141", "433", "541", and "343". Then, we employ the role representation and hierarchical clustering, and each average formation is further divided into more specific patterns in which the configurations of players are slightly different. Based on the transition network between clusters, we extract typical transition patterns of the formation for a particular team.

Methods
Dataset and analysis. We employ datasets comprising 45 football games by 18 teams of the top league of J League (J1 League) second stage 2016, provided by DataStadium Inc., Japan. DataStadium has been authorized to collect and sell data under a contract with the J League. This contract also ensures that the use of the relevant datasets does not infringe any rights of players and clubs belonging to the J League. The datasets are not open. We have received permission from DataStadium to use them for this research. The 18 teams are as follows: "Fukuoka", "Hiroshima", "Iwata", "Kashima", "Kashiwa", "Kawasaki", "Kobe", "Kofu", "Nagoya", "Niigata", "Omiya", "Osaka", "Sendai", "Shonan", "Tokyo", "Tosu", "Urawa", "Yokohama".
There are five games per team, which took place on Sept. 25, Oct. 1, Oct. 22, Oct. 29, and Nov. 3 in 2016. In this paper, we refer to each game in a form such as the Sept. 25 game of "Sendai". Each dataset contains all players' absolute positions every 0.04 seconds (i.e., the frame rate is 25 fps), which are tracked automatically by multiple cameras fixed in each stadium; the spatial resolution of the data is on the centimeter scale. For simplicity, we focus on the 10 players ($N = 10$) other than the goalkeeper for each team and analyze the data of the first halves of games. Note that we exclude from the analysis several games with player substitutions in the first half.
Our analysis was performed using Python packages: for the calculation of Voronoi regions and Delaunay triangulations, the Voronoi and Delaunay classes in the scipy.spatial module were used; for the hierarchical clustering, the linkage function in the scipy.cluster.hierarchy module was used. All calculations were executed on a MacBook Pro with a 2 GHz Intel Core i5 processor and 16 GB of memory.
The absolute coordinates of the $j$-th player of a team at time $t$ are denoted as $\vec{R}_j(t)$. The centroid position and standard deviation of a team are respectively defined as follows:

$$\vec{r}_c(t) = \frac{1}{N}\sum_{j=1}^{N}\vec{R}_j(t), \tag{1}$$

$$\sigma(t) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left|\vec{R}_j(t) - \vec{r}_c(t)\right|^2}. \tag{2}$$

Using $\vec{r}_c(t)$ and $\sigma(t)$, the normalized coordinates $\vec{r}_j(t)$ for the $j$-th player are calculated as

$$\vec{r}_j(t) = \frac{\vec{R}_j(t) - \vec{r}_c(t)}{\sigma(t)}. \tag{3}$$

Delaunay method and clustering algorithm for a single game. Here, we summarize the Delaunay method and the clustering algorithm for a single team 15. In our method, we regard a football formation as the adjacency relationships of players, which are independent of the deviation $\sigma(t)$ of the team. Specifically, as shown in Fig. 1(a), a formation of a team at time $t$ is quantified using the adjacency matrix $A(t)$ of the Delaunay network, whose components $A_{ij}(t)$ are given by

$$A_{ij}(t) = \begin{cases} 1 & \text{if the Voronoi regions of players } i \text{ and } j \text{ are adjacent to each other at time } t, \\ 0 & \text{otherwise.} \end{cases}$$

Although there are other options for the definition of neighbors in 2D space, we choose the Delaunay triangulation because it is reasonable for the visualization and clustering of formations, as shown below.
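As an illustration of the two quantities used above, the following minimal Python sketch normalizes one frame of positions and builds the 0/1 adjacency matrix of its Delaunay network with `scipy.spatial.Delaunay`. The function names and the toy frame are hypothetical, and the team spread is assumed here to be the root-mean-square distance from the centroid; this is not the authors' code.

```python
import numpy as np
from scipy.spatial import Delaunay


def normalize_positions(R):
    """Shift positions to the team centroid and scale by the team spread.

    R is an (N, 2) array of absolute positions at one frame. The spread is
    taken as the root-mean-square distance from the centroid (an assumption).
    """
    R = np.asarray(R, dtype=float)
    r_c = R.mean(axis=0)
    sigma = np.sqrt(((R - r_c) ** 2).sum(axis=1).mean())
    return (R - r_c) / sigma


def delaunay_adjacency(R):
    """Return the N x N 0/1 adjacency matrix of the Delaunay network."""
    R = np.asarray(R, dtype=float)
    n = len(R)
    A = np.zeros((n, n), dtype=int)
    for simplex in Delaunay(R).simplices:
        # Each simplex is a triangle; connect its three vertex pairs.
        for a in range(3):
            i, j = simplex[a], simplex[(a + 1) % 3]
            A[i, j] = A[j, i] = 1
    return A


# Toy frame: four corner players and one central player (not real data).
frame = np.array([[0, 0], [2, 0], [2, 2], [0, 2], [1, 1]])
A = delaunay_adjacency(frame)
r = normalize_positions(frame)
```

Because a Delaunay triangulation is unchanged by translation and uniform scaling, the adjacency matrix is the same whether it is computed from absolute or normalized coordinates.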
Owing to this quantification, a dissimilarity measure between two formations at different times $t$ and $t'$ can be introduced as

$$D_{tt'} = \sum_{i=1}^{N}\sum_{j=1}^{N}\left[A_{ij}(t) - A_{ij}(t')\right]^2. \tag{4}$$

Here, we define $D_{tt'}$ as the squared Euclidean distance, with hierarchical clustering using Ward's method in mind. The dissimilarity $D_{tt'}$ becomes large when many edges are rewired due to the positional exchange of players within the formations.
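The dissimilarity can be sketched in a few lines of Python. This example assumes that the measure is the sum of squared element-wise differences over all matrix components, so each changed edge of the symmetric matrix is counted twice; the function names and the three-player matrices are hypothetical.

```python
import numpy as np


def dissimilarity(A_t, A_s):
    """Sum of squared element-wise differences between two adjacency matrices."""
    diff = np.asarray(A_t, dtype=int) - np.asarray(A_s, dtype=int)
    return int((diff ** 2).sum())


def dissimilarity_matrix(adjacency_list):
    """Pairwise dissimilarities for a sequence of sampled frames."""
    X = np.asarray(adjacency_list, dtype=int).reshape(len(adjacency_list), -1)
    # Squared Euclidean distance between every pair of flattened matrices.
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=2)


# Toy example with three players: the edge 0-1 is rewired to 0-2.
A1 = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
A2 = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]])
D = dissimilarity_matrix([A1, A2, A1])
```

In this toy case one edge is removed and one is added, so the dissimilarity between `A1` and `A2` is 4 under the stated convention.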
Based on this dissimilarity measure, we introduced a clustering algorithm for formations appearing in a single game through the following four steps (i)-(iv). (i) The Delaunay networks are computed every $\Delta f$ frames in a single game, where the frame rate is 25 fps. (ii) Hierarchical clustering is performed using Ward's method 16, where the input to the clustering is the dissimilarity matrix $D$ whose components are $D_{tt'}$ defined by Eq. (4). In Ward's method, the distance $h(C_1, C_2)$ between two clusters $C_1$ and $C_2$ is given by Eq. (5), where $n$ represents the size of a cluster $C$. From Eq. (5), $h(C_1, C_2)$ equals Eq. (4) when each cluster contains one Delaunay network. In addition, Ward's method yields clusters of comparable size at each hierarchy compared with other methods. For each Delaunay network in a cluster, the positional coordinates of each player are converted into normalized coordinates using Eq. (3). This transformation enables us to compare Delaunay networks independently of $\vec{r}_c(t)$ and $\sigma(t)$. Next, the time-averaged position of each player is visualized by an ellipse in the normalized coordinates. We note that the direction and magnitude of each ellipse are determined by the eigenvectors and eigenvalues of the covariance matrix of each player's normalized position, $\vec{r}_j(t)$. The use of hierarchical clustering enables us to control the number of clusters $N_c$ according to the desired resolution of formations: to characterize formation changes over short time intervals, a large $N_c$ is selected, and vice versa.

As an example, we demonstrate the above clustering process for the Sept. 25 game of "Sendai". Figure 1(b) is the dendrogram obtained from step (ii), where $\Delta f = 25$. The optimal number of clusters $N_c$ can roughly be determined as the point where the height increases rapidly as the number of clusters decreases (Fig. 1(c)); in this case, we chose $N_c = 3$. In Fig. 1(d), we show the clusters (coarse-grained formations) for $N_c = 3$ in the normalized coordinates, where the direction of offense is upward. Each cluster is distinguished by a cluster number from C1 to C3, and the difference between them is that several pairs of players exchange their positions: players 2 and 3, and players 5 and 6, for example.
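The clustering step can be sketched with `linkage` and `fcluster` from `scipy.cluster.hierarchy`. This is a minimal illustration under stated assumptions rather than the authors' pipeline: each adjacency matrix is flattened into a feature vector so that the Euclidean distance between rows corresponds to the square root of the dissimilarity, the helper name `cluster_formations` is hypothetical, and the input is toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def cluster_formations(adjacency_list, n_clusters):
    """Group sampled Delaunay networks into n_clusters flat clusters (Ward)."""
    X = np.asarray(adjacency_list, dtype=float).reshape(len(adjacency_list), -1)
    Z = linkage(X, method="ward")  # Euclidean distance between rows of X
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return Z, labels


# Toy example: two four-player patterns, three samples each (not real data).
P = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])
Q = np.array([[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]])
Z, labels = cluster_formations([P, P, P, Q, Q, Q], n_clusters=2)
```

The merge heights that SciPy stores in the third column of `Z` are on the non-squared Euclidean scale, so they need not coincide numerically with the height $h$ used in the text, although the merge order plays the same role when choosing $N_c$.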

Results
Clustering algorithm for multiple games. Let us consider the problem of clustering Delaunay networks over multiple games. The players of a team in the $j$-th game are identified by uniform numbers, $\vec{U}_j$. As we have shown above, the assignment of multiple sets of uniform numbers for different games to one set of indexes $\vec{I}$ of $A(t)$ is not uniquely determined. Here, we adopt the framework of "role representation" introduced by Bialkowski et al. 13,14. We assume that players play the same role if they occupy similar positions in a formation. Then, we label each player by a role number and identify the role numbers with the indexes $\vec{I}$ of $A(t)$. In the following, we propose an extended clustering algorithm based on this idea, consisting of three parts I, II, and III.
Part I: clustering into average formation. In part I, we assign the same index $i$ of $A(t)$ to players whose positions in a formation are approximately the same. To estimate the relative position of each player in a game, we compute the heat map of each game for each team in the normalized coordinates given by Eq. (3). We present the heat maps obtained for all teams and games in Supplementary Fig. S1. In this figure, the time-averaged position of each player is expressed by the region within each ellipse, where the direction of offense is upward. The direction and magnitude of an ellipse are determined by the eigenvectors and eigenvalues of the covariance matrix of the corresponding player's normalized position. These heat maps appear to fall into several patterns. In fact, we find from our data that each belongs to one of the following five patterns: "442", "4141", "433", "541", and "343" (these are referred to as "average formations" hereafter). A schematic representation of the five average formations is shown in Fig. 2(a). The frequency of such formations for each team in five games is shown in Fig. 2(b). It should be noted that we manually classified the heat maps into the average formations. Almost all teams, except the teams with player substitutions in the first half, can be classified into one of the average formations. Hence, no change of average formation during a game was observed in our data. We also note that the names of the average formations are not official, and other notations can also be considered.
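The ellipse that summarizes a player's time-averaged position can be computed from the covariance matrix of that player's normalized positions. The sketch below is one possible realization: the text does not state how eigenvalues are scaled into axis lengths, so the semi-axes are assumed here to be one standard deviation (the square root of each eigenvalue), and the function name and sample points are hypothetical.

```python
import numpy as np


def position_ellipse(points):
    """Mean, semi-axis lengths, and major-axis angle of a 2D point cloud."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    major = eigvecs[:, 1]  # eigenvector of the largest eigenvalue
    semi_axes = np.sqrt(eigvals[::-1])  # (major, minor), one standard deviation
    angle = np.degrees(np.arctan2(major[1], major[0]))
    return center, semi_axes, angle


# Toy positions spread mainly along the horizontal axis (not real data).
pts = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, -1.0], [0.0, 1.0]])
center, semi_axes, angle = position_ellipse(pts)
```

The sign of an eigenvector is arbitrary, so the returned angle is only defined up to a rotation of 180 degrees, which describes the same ellipse.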
For a certain average formation, the ellipses (average positions of players in a game) are distinguished by serial numbers from 1 to 10, as shown in Fig. 2(a). Players belonging to the same average formation with the same serial number are considered to play the same role in the team (e.g., player 1 in "4141" is interpreted as a "center forward"). Therefore, we identify these serial numbers with the indexes $\vec{I}$ of $A(t)$, and a one-to-one correspondence between $\vec{U}$ and $\vec{I}$ is obtained for each average formation.
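Once each player has been given a role index, the adjacency matrix of every game can be reordered so that row and column $k$ always refer to the same role. The sketch below illustrates this bookkeeping. Note that the paper assigns the serial numbers manually; the automatic matching shown here, which uses `scipy.optimize.linear_sum_assignment` against a hypothetical template of role positions, is a swapped-in technique for illustration only. Indexes are 0-based in the code, whereas the text counts roles from 1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_roles(mean_pos, template):
    """Map each player row to a role index by minimizing total squared distance.

    mean_pos: (N, 2) time-averaged normalized positions of the players.
    template: (N, 2) hypothetical role positions of an average formation.
    """
    mean_pos = np.asarray(mean_pos, dtype=float)
    template = np.asarray(template, dtype=float)
    cost = ((mean_pos[:, None, :] - template[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    role_of = np.empty(len(mean_pos), dtype=int)
    role_of[rows] = cols
    return role_of


def reindex_adjacency(A, role_of):
    """Reorder A so that row/column k refers to the player holding role k."""
    order = np.argsort(role_of)  # order[k] = player row holding role k
    return np.asarray(A)[np.ix_(order, order)]


# Toy example with three roles (not real data).
template = np.array([[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]])
mean_pos = np.array([[0.9, 0.1], [0.1, 0.9], [-1.1, 0.0]])
role_of = assign_roles(mean_pos, template)

A = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])  # edge between players 0 and 1
B = reindex_adjacency(A, role_of)
```

After reindexing, matrices from different games of the same average formation share one index set and can be compared entry by entry.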
Part II: hierarchical clustering of average formations. As shown in Supplementary Fig. S1, the ellipses of some players in a heat map overlap, indicating that these players exchange their positions or move close to each other in the game. Moreover, the configurations of players are slightly different even within the same average formation. In order to distinguish such patterns, in part II, we cluster all the Delaunay networks belonging to the same average formation using the clustering algorithm introduced in Methods. Figure 3 presents typical examples of clustering results for the five games of "Sendai", where $\Delta f = 25$, with $N_c = 5$ or 15. Because "Sendai" adopted "442" in all five games (see Fig. 2(b)), a coarse-grained formation obtained using this method is expressed as, e.g., "442-C1", where the former number denotes the average formation and the latter is the cluster number. Furthermore, each ellipse in a cluster in Fig. 3 consists of all the positions of players with the same index in the five games. We find that each cluster exhibits a more specific pattern than the corresponding average formation. The major difference between clusters is that players 2 and 3, or players 5 and 6, exchange their positions. We note that C3 for $N_c = 5$ and C7 for $N_c = 15$ include irregular patterns, which could be associated with transitional situations such as competition in front of the goal or counterattacks.
The value of $N_c$ depends on the cutting height $h_c$ of the dendrogram, where the height represents the distance between two merged clusters in the clustering process. As noted in Methods, we can control the degree of coarse-graining of formations by varying $N_c$: finer (coarser) patterns are obtained by increasing (decreasing) $N_c$. For example, C2 for $N_c = 5$ is divided into (C2, C3, C4, C5) for $N_c = 15$; C5 for $N_c = 5$ is divided into (C11, C12, C13, C14, C15) for $N_c = 15$. In addition, when $N_c = 15$, the positions of players 7 and 10 differ slightly between clusters compared with the case of $N_c = 5$. In particular, there are two patterns in which players 7 and 10 are in the middle line or in the back line; these two patterns appear to correspond to offense and defense scenes, respectively. We note that the special case $N_c = 1$ is the coarsest pattern, corresponding to the superposition of all average formations of the five games.
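The nesting of clusters at different values of $N_c$ follows from cutting one and the same dendrogram at different heights. The following sketch illustrates this with SciPy on random stand-in vectors (not formation data); the variable names are hypothetical, and the cluster labels returned by `fcluster` need not match the numbering C1, C2, ... used in the figures.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))  # stand-in feature vectors, not real formations

Z = linkage(X, method="ward")
coarse = fcluster(Z, t=5, criterion="maxclust")
fine = fcluster(Z, t=15, criterion="maxclust")

# Every fine cluster should sit inside exactly one coarse cluster.
parent = {int(c): sorted(set(coarse[fine == c].tolist())) for c in np.unique(fine)}

# Merge heights in decreasing order: heights[k - 1] is the height of the merge
# that reduces k + 1 clusters to k.
heights = Z[::-1, 2]
```

Plotting `heights` against the number of clusters gives a curve of the kind used to pick $N_c$ by eye, as the point where the height starts to rise rapidly.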
Part III: transition network between clusters. When a certain number $N_c$ of clusters is given, a continuous time series of formation changes can be regarded as discrete transitions between the clusters. In Fig. 4(a), we present transition networks for the five games of "Sendai", whose nodes represent clusters and whose edges represent the number of transitions between them. Here, each node in the networks corresponds to a coarse-grained formation shown in Fig. 3(d), and a transition from one cluster to another represents a change in the configuration of players in the formation; e.g., C1 → C2 indicates that players 2 and 3 exchange their positions. The nodes are placed using the Fruchterman-Reingold force-directed algorithm 17, which produces a layout that depends on the number of transitions between clusters (the weight of edges): two nodes with a large number of transitions between them are placed close to each other. In addition, we also visualize the adjacency matrices of the corresponding transition networks in Fig. 4(a).
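The weighted adjacency matrix of a transition network can be obtained by counting changes of cluster label between consecutive samples. The sketch below is a minimal example with 0-based labels and a hypothetical function name; it assumes that only actual changes of cluster are counted, so the diagonal stays zero, which the text does not state explicitly.

```python
import numpy as np


def transition_counts(labels, n_clusters):
    """Count transitions between different clusters in a label time series."""
    labels = np.asarray(labels)
    T = np.zeros((n_clusters, n_clusters), dtype=int)
    src, dst = labels[:-1], labels[1:]
    changed = src != dst  # ignore samples where the cluster stays the same
    np.add.at(T, (src[changed], dst[changed]), 1)
    return T


# Toy label sequence for three clusters (not real data).
labels = [0, 0, 1, 1, 0, 2, 2, 0]
T = transition_counts(labels, n_clusters=3)
```

A matrix of this kind can be passed to a graph library for drawing; for example, `spring_layout` in networkx implements the Fruchterman-Reingold algorithm and accepts edge weights.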
We find from Fig. 4 that the five games exhibit similar transition patterns, as follows. First, there are two communities consisting of clusters (C1, C2, C3, C4, C5) and (C9, C10, C11, C12, C13, C14, C15); the former (latter) community corresponds to the pattern in which player 5 is on the right (left) and player 6 is on the left (right). Second, cluster C6 is the coarse-grained formation connecting these two communities; in fact, players 5 and 6 are lined up vertically in this formation. Third, clusters C7 and C8 are somewhat irregular formations; e.g., the positions of players 1 and 4 in C8 are different from those in other clusters. Note that each community includes clusters corresponding to the position-exchanged pattern between players 2 and 3, i.e., C1 and (C9, C10). We further show the time series of the clusters for the Sept. 25 game of "Sendai" in Fig. 4(b). We find that the transition between the two communities occurs only a few times in the first half; namely, once players 5 and 6 exchange their positions, the formation persists for a while. On the other hand, the duration of the clusters C1 and (C9, C10) is not so long, and players 2 and 3 exchange positions more frequently. Because we confirmed that such features are common to the five games, this appears to reflect the strategy of "Sendai".
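Statements about how long a cluster persists can be made quantitative by splitting the label time series into runs of identical labels. The following sketch is one way to do this; the function names and the toy sequence are hypothetical, and the conversion to seconds assumes one sample every $\Delta f = 25$ frames at 25 fps, i.e., one sample per second.

```python
import numpy as np


def run_lengths(labels):
    """Split a label sequence into (label, number_of_samples) runs."""
    labels = np.asarray(labels)
    change = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(labels)]))
    return [(int(labels[s]), int(e - s)) for s, e in zip(starts, ends)]


def mean_duration_seconds(labels, delta_f=25, fps=25):
    """Mean run duration per label, in seconds."""
    step = delta_f / fps  # seconds between consecutive samples
    totals = {}
    for label, length in run_lengths(labels):
        totals.setdefault(label, []).append(length * step)
    return {label: float(np.mean(v)) for label, v in totals.items()}


# Toy label sequence (not real data).
labels = [0, 0, 0, 1, 1, 0, 0, 2]
runs = run_lengths(labels)
durations = mean_duration_seconds(labels)
```

Applying the same run-length computation after mapping each cluster to its community would give the persistence time of a community instead of a single cluster.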

Discussion
We have proposed an extended clustering algorithm based on role representation (part I) and hierarchical clustering (part II). Here, we compare our clustering algorithm with the method introduced by Bialkowski et al. 13,14. In that method, a 2D probability distribution $H(\vec{R})$ (heat map) for a team is divided into 10 heat maps, $H(\vec{R}) = \sum_{r=1}^{10} H_r(\vec{R})$, and the set $\{H_r(\vec{R})\}$ is computed to achieve a minimal overlap between its members, under the condition that each player belongs to a different $r$ at each frame. Because each player is labeled by a role number $r$ instead of a uniform number $u$ at each frame, this method is called "role representation". In the role representation approach, $H_r(\vec{R})$ consists of various players at different frames, and patterns in which two players exchange their positions are regarded as the same.
In contrast, our algorithm describes an entire heat map $H(\vec{R})$ as the sum of players' heat maps, $H(\vec{R}) = \sum_{u} H_u(\vec{R})$, where $u$ denotes the uniform number. The set $\{H_u(\vec{R})\}$ is called an "average formation". This decomposition does not achieve the minimal overlap; namely, players with different $u$ can exchange their positions during a game. Instead, our method distinguishes such position-exchanged patterns as different formations, based on the Delaunay method and hierarchical clustering; in particular, the quantification of a formation as the Delaunay triangulation is essential because it can incorporate the information of adjacency relationships of players. In this sense, our method realizes a more detailed characterization of formations than that by Bialkowski et al. 13,14. Although we have only shown the results for particular datasets, our method does not depend on the details of the data.
While our decomposition of the entire heat map $H(\vec{R})$ does not achieve the minimal overlap, the average positions of players, expressed by ellipses, are still clearly separated (see Supplementary Fig. S1). That is, each player carries out an individual role in a football game. This feature of football games allows us to label players not only by uniform numbers $\vec{U}$ but also by role numbers (role representation). Furthermore, it provides a criterion for the correspondence between multiple sets of uniform numbers and the indexes $\vec{I}$ of $A(t)$, and allows hierarchical clustering to be realized over multiple games. We note that our method is applicable to sports in which players' average positions are almost fixed, because it relies on the one-to-one correspondence between $\vec{U}$ and $\vec{I}$. The variation in average formations and switches among them are a reflection of teams' strategies 1,2. It has been reported that football teams adopt a so-called "win-stay lose-shift strategy" for formation changes between games 2: they tend to adopt the same (a different) formation after a win (loss). Our method has the potential to provide a more detailed characterization of strategies or game flow by focusing on formation changes within a game. As an example, we have introduced the transition networks between clusters in Fig. 4. While we have mentioned some common features in Results, a closer look at the adjacency matrices in Fig. 4 shows that each network exhibits slightly different transition patterns. In order to extract more specific patterns from them, a larger $N_c$ is needed. We expect that temporal network analysis of the cluster transitions for different $N_c$ values will provide insights into the characterization of team styles.
Regarding this type of analysis, a further extension of the Delaunay network could also be considered. In fact, the present Delaunay network lacks information on opposing teams. This means that the edges do not always represent pass routes, because opposing players may be located on these edges. We can address this problem by introducing a Delaunay triangulation that includes the opposing team. In this extended Delaunay network, edges connecting players in the same team represent secure pass routes. Further dynamical analyses of formation structures incorporating ball passes or interactions with opposing players by employing the extended Delaunay network will be a topic of future research.
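The extension described above is left as future work, so the following is only one possible reading of it, sketched under stated assumptions: both teams are triangulated together, and only the edges joining two players of the same team are kept. The function names and the toy positions are hypothetical.

```python
import numpy as np
from scipy.spatial import Delaunay


def delaunay_adjacency(points):
    """Return the 0/1 adjacency matrix of the Delaunay network of the points."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    A = np.zeros((n, n), dtype=int)
    for simplex in Delaunay(points).simplices:
        for a in range(3):
            i, j = simplex[a], simplex[(a + 1) % 3]
            A[i, j] = A[j, i] = 1
    return A


def same_team_edges(team, opponents):
    """Triangulate both teams jointly and keep edges within the first team."""
    team = np.asarray(team, dtype=float)
    joint = np.vstack([team, np.asarray(opponents, dtype=float)])
    n = len(team)
    return delaunay_adjacency(joint)[:n, :n]


# Toy example: one opponent stands between teammates 0 and 1 (not real data).
team = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0], [2.0, -3.0]])
opponents = np.array([[2.1, 0.2]])
alone = delaunay_adjacency(team)
extended = same_team_edges(team, opponents)
```

In this toy configuration the edge between teammates 0 and 1 exists when the team is triangulated alone and disappears once the opponent is included, which is the kind of blocked route the extension is meant to capture.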
Finally, the Delaunay method and the clustering algorithm using hierarchical clustering constitute a general framework for coarse-graining a many-particle system while incorporating its adjacency relationships. It realizes more detailed characterization and visualization than macroscopic quantities such as the centroid and the standard deviation for collective motions of various systems, including team sports 11, animals 18, and robots 19. We expect that our method will provide a common tool for formation analysis of team sports and new insights for research on collective motion in general.

Figure 1. A typical example of a clustering process for the Sept. 25 game of "Sendai". (a) A Delaunay network at a certain frame and its adjacency matrix. The unit of the horizontal and vertical axes is centimeters. (b) The dendrogram and (c) the relation between the number of clusters and height, which are obtained from the hierarchical clustering. The height on the vertical axes corresponds to the distance between two merged clusters. (d) Coarse-grained formations for $N_c = 3$ in the normalized coordinates, where the direction of offense is upward. The major difference between clusters is that players 2 and 3, and players 5 and 6, exchange their positions.
(iii) The clustering process in step (ii) is displayed as a dendrogram whose vertical axis (height) corresponds to $h(C_1, C_2)$ between two merged clusters $C_1$ and $C_2$. A particular number $N_c$ of clusters is extracted by cutting the dendrogram at a certain height $h_c$. (iv) Coarse-grained formations are visualized from each cluster as follows.

Figure 2. (a) Schematic representation of each average formation. The heat map of each team shown in Supplementary Fig. S1 belongs to one of these five patterns. (b) Average formations of each team throughout five games. The label "others" means that player substitutions occurred in the first half of the game, or that the average formation could not be identified.

Figure 3. Results of hierarchical clustering for the five games of "Sendai". (a) The dendrogram, (b) the relation between the number of clusters and height, and the visualization of coarse-grained formations where (c) $N_c = 5$ and (d) $N_c = 15$. Each cluster is visualized in the normalized coordinates, where the direction of offense is upward, and is distinguished by a cluster number.

Figure 4. (a) Transition networks between clusters (upper panels) and their adjacency matrices (lower panels) for all games of "Sendai", where $N_c = 15$. Each node, corresponding to a coarse-grained formation in Fig. 3(d), is arranged using the Fruchterman-Reingold force-directed algorithm 17. (b) Time series of the clusters for the Sept. 25 game of "Sendai".