SC2EGSet: StarCraft II Esport Replay and Game-state Dataset

As a relatively new form of sport, esports offers unparalleled data availability. Our work aims to open esports to a broader scientific community by supplying raw and pre-processed files from StarCraft II esports tournaments. These files can be used in statistical and machine learning modeling tasks and compared to laboratory-based measurements. Additionally, we open-sourced and published all the custom tools that were developed in the process of creating our dataset. These tools include PyTorch and PyTorch Lightning API abstractions to load and model the data. Our dataset contains replays from major and premier StarCraft II tournaments since 2016. We processed 55 “replaypacks” that contained 17930 files with game-state information. Upon publication, our dataset is one of the few large publicly available sources of StarCraft II data. Analysis of the extracted data holds promise for further Artificial Intelligence (AI), Machine Learning (ML), psychological, Human-Computer Interaction (HCI), and sports-related studies in a variety of supervised and self-supervised tasks.


Introduction
Electronic sports (esports) are a relatively new and exciting multidisciplinary field of study [1,2]. There are multiple groups of stakeholders involved in the business of esports [3]. The application of analytics to sports aims to optimize training and competition performance. New training methods are derived from an ever-increasing pool of data and research aimed at generating actionable insights [4][5][6][7][8][9]. Rule changes in sports come at varying time intervals and frequently with unpredictable effects on their dynamics. Sharing esports data is especially relevant for assessing changes in game design and their impact on professional players, as such changes can occur more rapidly due to the (yet) relatively unstructured nature of esports competition [10,11]. (Dataset API homepage: https://github.com/Kaszanas/SC2_Datasets)
Advancements in Artificial Intelligence (AI) and Machine Learning (ML) have shown that Reinforcement Learning (RL) agents are capable of outmatching human players in many different types of games [12][13][14][15]. Psychological research on neuroplasticity has also shown the great potential of video games to induce structural brain adaptation as a result of experience [16]. Further, previous studies have shown that playing video games can enhance cognitive functioning in a wide range of domains, including perceptual, attentional, and spatial ability [17,18]. Data obtained from esports titles, including data gathered from high-level tournament performance, may provide a path to improving the quality and reproducibility of research in this field, especially in contrast to the more variable data collected in laboratories and in less competitive settings. A lower technical overhead and more data available for modeling could assist further research in these areas [19][20][21].
The sparsity and methodological diversity of research on this topic remain roadblocks in the study of how video games can affect mental functioning. Some scholars have recommended further research on esports as a potential path forward [1]. Despite the digital nature of esports, which is their greatest asset with respect to data gathering, there seems to be a lack of high-quality pre-processed data published for scientific and practical use. The goal of our work is to mitigate this issue by publishing datasets containing StarCraft II replays and pre-processed data from esports events classified as "Premier" and "Major" by Liquipedia in the timeframe from 2016 until 2022 [22].
A brief summary of the contributions stemming from this work is as follows: (1) the development of a set of four tools for working with StarCraft II data; (2) the collection of esports data from various public sources; (3) the publication of a collection of raw replays after brief pre-processing [23]; (4) the processing of the raw data with our toolset and its publication as a dataset [24]; and (5) the preparation of an official API for interacting with our data using PyTorch and PyTorch Lightning for ease of experimentation in further research [25].

Related Work
While reviewing StarCraft II related sources, we were able to find two publicly available datasets: "SkillCraft1" (2013) [26] and "MSC" (2017) [27]. These datasets are related to video games and in that regard could be classified as "gaming" datasets. However, it is not clear what percentage of games included within these datasets contain actively competing esports players, and at what levels of skill. Using the SkillCraft1 dataset, the authors distinguished the level of players based on the data. They proposed a new feature in the form of the Perception-Action Cycle (PAC), which was calculated from the game data. This research can be viewed as the first step toward developing new training methods and analytical depth in electronic sports. It provided vital information describing different levels of gameplay and optimization in competitive settings [28]. In Table 1 we present a comparison of these two StarCraft II datasets to our own.
There are existing datasets in other games. Due to the major differences in game implementations, these could not be directly compared to our work. Despite that, such publications build upon a similar idea of sharing gaming or esports data with a wider scientific audience and should be mentioned. Out of all related work, the STARDATA dataset is notable in that it comes from the prior generation of the StarCraft game; it seems to be the largest StarCraft: Brood War dataset available [29]. Moreover, in League of Legends, a multimodal dataset including physiological data is available [30]. Further studies analyzed data from League of Legends [51] and Counter-Strike: Global Offensive [52][53][54][55][56], among others [57]. In some cases, a comparison between professional and recreational players was conducted.
Most studies did not provide data as a part of their publication. In other publications, the authors used replays that were provided by the game publishers or were publicly available online; these are unsuitable for immediate data modeling tasks without prior pre-processing. When dealing with StarCraft II, researchers used raw files in the MPQ (SC2Replay) format with their custom code [58,59]. Other studies solved technical problems that are apparent when working with esports data and different sensing technologies, including visualization, but published no data [60][61][62][63][64]. Some researchers have attempted to measure tiredness in an undisclosed game via electroencephalography (EEG) [65], and player burnout using a multimodal dataset consisting of EEG, electromyography (EMG), galvanic skin response (GSR), heart rate (HR), eye-tracking, and other physiological measures in esports [66].

Game Description
Many of the related works introduce and communicate the properties of the games that they analyze. In the case of StarCraft II, we recommend the following description: "StarCraft II: Legacy of The Void (SC2) contains various game modes: 1v1, 2v2, 3v3, 4v4, Archon, and Arcade. The most competitive and esports-related mode (1v1) can be classified as a two-person combat, real-time strategy (RTS) game. The goal of the game for each of the competitors is either to destroy all of the structures, or to make the opponent resign." Moreover, StarCraft II contains multiple matchmaking options: "Ranked game: players use a built-in system that selects their opponent based on Matchmaking Rating (MMR) points. Unranked game: players use a built-in system that selects their opponent based on a hidden MMR; such games do not affect the position in the official ranking. Custom game: players join the lobby (game room), where all game settings are set and the readiness to play is verified by both players; this mode is used in tournament games. Immediately after the start of the game, players have access to one main structure, which allows for further development and production of units." [67]

Dataset Sources and Properties
The files used in the presented information extraction process were publicly available due to a StarCraft II community effort. Tournament organizers for events classified as "Premier" and "Major" made the replays available immediately after the tournament to share the information with the broader StarCraft II community for research, manual analysis, and in-game improvement. Sources include Liquipedia, Spawning Tool, Reddit, Twitter, and tournament organizer websites. All replay packs required to construct the dataset were searched for and downloaded manually from the public domain. The critical properties of the presented dataset are as follows:
• To secure the availability of the raw replays for further research and extraction, the SC2ReSet: StarCraft II Esport Replaypack Set was created [23].
• The replays were processed under the licenses provided by the game publisher: End User License Agreement (EULA), and "Blizzard StarCraft II AI and Machine Learning License" which is available in subsection A.1 of the supplementary material.
• Our dataset was created using open-source tools that were published with separate Digital Object Identifiers (DOIs) minted for each of the repositories. These tools are indexed on Zenodo [68][69][70].
• We have made available a PyTorch [71] and PyTorch Lightning [72] API for accessing our dataset and performing various analyses. Our API is accessible in the form of a GitHub repository, which is available on Zenodo with a separate DOI. All of the instructions for accessing the data, as well as documentation of specific fields, are published there [25].
• The presented dataset is currently the largest that is publicly available, and contains information from prestigious StarCraft II tournaments.
• The dataset can be processed under CC BY-NC 4.0 to comply with the Blizzard EULA and the aforementioned Blizzard StarCraft II AI and Machine Learning License.

Dataset Pre-Processing
Dataset pre-processing required the use of a custom toolset. Initially, the Python programming language was used to process the directory structure, which held additional tournament stage information. We include this information in the dataset in a separate file for each tournament, effectively mapping the initial directory structure onto the resulting unique hashed filenames. Moreover, a custom tool for downloading the maps was used; only the maps that were used within the replays were downloaded [68]. Finally, to ensure proper translation to English map names in the final data structures, a custom C++ tool was used. Information extraction was performed on map files that contained all necessary localization data [69]. The entirety of our processing pipeline is visualized in Figure 1, and additional visualizations are available in the Appendix, subsection A.3.

Data Processing
Custom software was implemented in the Go programming language (Golang) and built upon authorized and public repositories endorsed by the game publisher [73]. The tool was used to perform information extraction from files in MPQ format with the SC2Replay extension. Information extraction was performed for each pre-processed directory that corresponded to a single tournament. Depending on the use case, different processing approaches are possible by providing command line arguments [70].

Data Parsing and Integrity
The parsing capabilities of the tooling were defined with a Golang high-level parser API available on GitHub [73].
After the initial data structures were obtained, the next step checked the integrity of the data. This was accomplished by comparing information available across different duplicate data structures, corresponding to: the number of players, the map name, the length of the player list, the game version, and the Blizzard map boolean (signifying whether a map was published by Blizzard). If a replay parser or custom integrity check failed, the replay was omitted.
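The actual checks live in the Golang tooling [70]; as a pure-Python sketch with hypothetical field names, such a cross-check over duplicated structures might look like:

```python
def integrity_ok(header: dict, details: dict) -> bool:
    """Cross-check fields duplicated across two parsed replay structures.

    Field names are illustrative, not the actual parser output.
    """
    checks = [
        # Player count must agree with the length of the player list.
        header.get("player_count") == len(details.get("player_list", [])),
        # Map name and game version must be duplicated consistently.
        header.get("map_name") == details.get("map_name"),
        header.get("game_version") == details.get("game_version"),
        # The Blizzard-map flag must be present and boolean.
        isinstance(details.get("is_blizzard_map"), bool),
    ]
    return all(checks)

# A replay failing any check would be omitted from the dataset:
header = {"player_count": 2, "map_name": "Jagannatha LE", "game_version": "5.0.6"}
details = {
    "player_list": [{"name": "A"}, {"name": "B"}],
    "map_name": "Jagannatha LE",
    "game_version": "5.0.6",
    "is_blizzard_map": True,
}
print(integrity_ok(header, details))  # True
```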

Data Filtering and Restructuring
Filtering for different game modes was omitted, as the collected replay files were a part of esports tournament matches. Most often, StarCraft II tournament matches are played in the form of one-versus-one player combat. Therefore, it was assumed that filtering for the number of players was not required at this step. Custom data structures were created and populated at this stage. This allowed for more control over the processing, summary generation, and final output.
Merging data structures containing duplicate information was performed where applicable.

Summarization and JSON Output to Zip Archive
Replay summarization was required to provide information that can be accessed without unpacking the dataset. Finally, the data was converted from Golang data structures into the JavaScript Object Notation (JSON) format and compressed into a zip archive.
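A minimal Python sketch of this packaging step (file names and summary contents are illustrative; the real pipeline performs it in Golang):

```python
import json
import zipfile

def pack_replays(parsed: dict, summary: dict, archive_path: str) -> None:
    """Serialize parsed replay data and a summary into one zip archive.

    `parsed` maps output file names to JSON-serializable replay data.
    """
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # The summary is stored alongside the data so that it can be
        # inspected without unpacking the whole dataset.
        zf.writestr("summary.json", json.dumps(summary))
        for name, data in parsed.items():
            zf.writestr(name, json.dumps(data))

pack_replays(
    {"replay_0001.json": {"map_name": "Eternal Empire LE"}},
    {"replay_count": 1},
    "tournament.zip",
)
with zipfile.ZipFile("tournament.zip") as zf:
    print(sorted(zf.namelist()))  # ['replay_0001.json', 'summary.json']
```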

Dataset Loading
Interacting with the dataset is possible via the PyTorch [71] and PyTorch Lightning [72] abstractions that we provide.

Listing 1: Example use of the SC2EGSetDataset with PyTorch with a synthetic replaypack prepared for testing.
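The access pattern follows the standard PyTorch map-style dataset contract (`__len__`/`__getitem__`). Below is a torch-free schematic with hypothetical names; for the real `SC2EGSetDataset` class and its constructor arguments, refer to the official API documentation [25].

```python
class ReplayDatasetSketch:
    """Schematic stand-in for a map-style dataset over pre-processed replays.

    The real loading code lives in the SC2_Datasets repository; the record
    fields used here are illustrative.
    """

    def __init__(self, replays, transform=None):
        self.replays = replays      # parsed replay records
        self.transform = transform  # optional record -> feature-vector fn

    def __len__(self):
        return len(self.replays)

    def __getitem__(self, index):
        record = self.replays[index]
        return self.transform(record) if self.transform else record

dataset = ReplayDatasetSketch(
    replays=[{"map_name": "Pillars of Gold LE", "game_length_s": 812}],
    transform=lambda r: [float(r["game_length_s"])],
)
print(len(dataset), dataset[0])  # 1 [812.0]
```

An object with this contract can be wrapped directly in `torch.utils.data.DataLoader` for batching and shuffling.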

Dataset Description
The collected dataset consisted of 55 tournaments. Within the available tournaments, 18309 matches were processed. The final processing yielded 17895 files. While inspecting the processed data, we observed three major game versions. Each tournament in the dataset was saved with an accompanying JSON file that contains descriptive statistics such as: (1) a game version histogram; (2) the dates at which the observed matches were played; (3) server information; (4) picked race information; (5) match length; (6) detected spawned units; (7) a race picked versus game time histogram. Figure 2 depicts the frequency with which each of the races played against the others and the distribution of races observed within the tournaments. Figure 4 depicts the distribution of match times that were observed.

Experiments and Future Research

Winner prediction and Player Performance Evaluation
Within section 2 we have referenced multiple articles that dealt with player performance evaluation. These works performed data mining tasks on game-engine-generated replays and other sources of player-related information.
Experiments regarding winner prediction can uncover interesting information about the optimal strategy of play.
Prior analyses in this task with a small sample of esports players have shown the importance of some key indicators. The proposed dataset can help with the reproduction and verification of various claims, some of which are based on anecdotal evidence [67]. The sample analysis below describes a basic attempt at predicting match outcome using only data related to player economy, to demonstrate the potential for gleaning insights from replay data.
Data Preparation Matches were initially filtered to only include those with a length greater than or equal to 9 minutes, which is approximately the 25th percentile of match length values. Next, a set of features was generated from the available economy-related indicators. Additional features were generated by combining mineral and vespene indicators into summed resource indicators. Match data were then aggregated by averaging across match time for each player, resulting in 22,230 samples of averaged match data (from 11,115 unique matches). Standard deviations were computed in addition to averaged values where applicable. Further features were then generated by computing ratios of resources killed to resources lost for the army, economy, and technology contexts, along with a ratio of food made to food used. As a final step, prior to feature standardization, each feature was filtered for outliers (replacing values with the median) that exceeded an upper limit of 3 standard deviations from the feature mean.
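The outlier-replacement step can be sketched with NumPy (a simplified stand-in for the actual preparation code; the data here are synthetic):

```python
import numpy as np

def clip_outliers_to_median(feature: np.ndarray, n_sd: float = 3.0) -> np.ndarray:
    """Replace values beyond n_sd standard deviations from the feature mean
    with the feature median, as in the described preparation step."""
    mean, sd = feature.mean(), feature.std()
    median = np.median(feature)
    out = feature.copy()
    out[np.abs(feature - mean) > n_sd * sd] = median
    return out

# Twenty typical values and one extreme value (synthetic example).
x = np.concatenate([np.ones(20), [100.0]])
cleaned = clip_outliers_to_median(x)
print(cleaned[-1])  # 1.0 -- the outlier was replaced by the median
```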

Feature Selection
The feature count was reduced by first computing point-biserial correlations between features and match outcome, selecting features with a statistically significant (α = .001) coefficient value exceeding ± .050. Next, a matrix of correlations was computed for the remaining features, and redundant features were removed. 17 features remained after this process, of which 8 were basic features (mineralsLostArmy, mineralsKilledArmy, mineralsLostEconomy, mineralsKilledEconomy, and the SD of each).
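The point-biserial correlation between a continuous feature and a binary outcome is equivalent to the Pearson correlation with the outcome coded 0/1, so the screening step can be sketched with NumPy alone (synthetic data; the original analysis would also need the associated p-values, e.g. via `scipy.stats.pointbiserialr`):

```python
import numpy as np

def point_biserial(feature: np.ndarray, outcome: np.ndarray) -> float:
    """Pearson correlation between a continuous feature and a 0/1 outcome."""
    return float(np.corrcoef(feature, outcome)[0, 1])

rng = np.random.default_rng(0)
outcome = rng.integers(0, 2, size=500)           # synthetic win/loss labels
feature = outcome * 2.0 + rng.normal(size=500)   # feature related to outcome

r = point_biserial(feature, outcome)
kept = abs(r) > 0.050  # the coefficient threshold used in the paper
print(kept)
```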
Modelling Modelling was conducted on features (economic indicators) that represented the global average gamestate, in which all time points were aggregated into a single state, and also as a time series in which the gamestate was represented at a sampling rate of approx. 7 seconds. Three algorithms were chosen for comparative purposes: Logistic Regression (sklearn.linear_model.LogisticRegression), Support Vector Machine (sklearn.svm.SVC) [74,75], and Extreme Gradient Boosting (xgboost.XGBClassifier) [76]. Each algorithm was initiated with settings aimed at binary classification and with typical starting hyperparameters. A 5-fold cross-validation procedure was implemented across the models.
Two sets of models were trained for the average gamestate and one for the gamestate as a time series. In the first averaged set of models, the input features represented the economic gamestate of a single player without reference to their opponent, with the model output representing outcome prediction accuracy for that player: a binary classification problem on scalar win/loss classes. The second averaged set of models differed in that it used the averaged economic gamestate of both players as input features and attempted to predict the outcome for "Player 1" of each match. Finally, the time series models used a feature input vector containing the economic gamestate at 7-second intervals; the task here was also to predict the outcome of a match based on only a single player's economic features, as in the single-player averaged set of models.
Label counts were equalized to the minimal label count prior to generating the data folds, resulting in 10,744 samples each of the "Win" and "Loss" labels for the single-player averaged models and the time series models. For the two-player set of averaged models (containing the features of both players in a given match), the total number of matches used was 10,440. Accuracy was chosen as the model performance evaluation metric in all three cases. Computation was performed on a standard desktop-class PC without additional resources.
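A condensed sketch of the cross-validation setup, with synthetic data in place of the real economic features (an `xgboost.XGBClassifier` can be slotted in alongside the scikit-learn estimators in the same loop):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 17))               # 17 selected features (synthetic)
y = (X[:, :8].sum(axis=1) > 0).astype(int)   # synthetic win/loss labels

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
}
for name, model in models.items():
    # 5-fold cross-validation with accuracy as the evaluation metric.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```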

Results
As the results indicate (see Table 2), good outcome prediction can be achieved from economic indicators alone, even without exhaustive optimization of each model's hyperparameters. For the one-player averaged set of models, SVM and XGBoost displayed similar performance, with the logistic classifier lagging slightly behind. For the two-player averaged set of models, all three algorithms performed essentially equally well. Feature importances were taken from a single-player XGBoost model (with identical hyperparameters) that was applied to the entire dataset for illustrative purposes. Figure 5 depicts the top five features by importance. It is interesting to note that importance was more heavily centered around mineral-related features than those tied to vespene, which is likely tied to how mineral and vespene needs are distributed across unit/building/technology costs. Further feature investigation is required to verify this tendency.
Table 2: Classification models and their performance metrics for two separate win prediction models. The "One Player Prediction" models attempt to correctly output whether one of the players won or lost. The "Two Player Prediction" models have access to the data of both players and attempt to output whether "Player 1" won or lost.

Each timepoint contains the average accuracy for 5-fold cross-validation, with a minimum match length of 9 minutes and a maximum match length of approx. 15 minutes. All three algorithms provided similar performance over time, although this may be an effect of the minimal hyperparameter optimization that was performed. Further, it is interesting to note that all three algorithms meet a classification performance asymptote at approx. the same match time (~550 seconds), which may indicate that this is where economic indicators begin to lose their predictive power and (presumably) other factors, such as army size, composition, and their application, become the primary determinants. The code for our experiments is available at a dedicated GitHub repository [77].

Game Style Analysis
Game style analysis can be treated as a task to be solved via supervised or self-supervised methods. Using algorithms such as Uniform Manifold Approximation and Projection (UMAP) [78] or t-Distributed Stochastic Neighbor Embedding (t-SNE) [79] on the data that we provide could uncover interesting insights, depending on the direction of the analysis. Such game style analysis could be investigated using sequence analysis methods or per-game statistics.
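As an illustrative sketch with synthetic per-game statistics in place of real ones, a t-SNE embedding could be produced as follows; UMAP follows the same fit/transform pattern via the `umap-learn` package:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic "play styles" (e.g. aggressive vs. economic), 25 games each,
# described by 10 hypothetical per-game statistics.
aggressive = rng.normal(loc=0.0, scale=1.0, size=(25, 10))
economic = rng.normal(loc=4.0, scale=1.0, size=(25, 10))
per_game_stats = np.vstack([aggressive, economic])

# Project the per-game statistics to 2D for visual inspection of clusters.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(per_game_stats)
print(embedding.shape)  # (50, 2)
```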

Combat Encounter Analysis
Combat analysis as a task can be researched using AI, ML, and classic algorithms in various esports [80]. There were some related works that analyzed unit encounters in StarCraft II [47]. Although our pre-processed dataset cannot be used to directly reproduce combat encounter analyses, we provide the raw replays published as SC2ReSet [23].

Limitations
We acknowledge that our work is not without limitations. The design and implementation of our dataset do not consider the ability to obtain StarCraft II data through game-engine simulation at a much higher resolution. Because of this, the extracted dataset cannot reflect exact unit positioning. Replays in their original MPQ (SC2Replay) format contain all necessary information to recreate a game using the game-engine API. Therefore, we plan to continue our research and provide more datasets that will expand the scientific possibilities within gaming and esports. Further, it should be noted that the experiments described here are more illustrative than investigative in nature and could be greatly expanded upon in future work. We recommend that further research use SC2ReSet [23] to compute game-engine-simulated information. We also do not provide simulation observation data, which would allow more detailed spatiotemporal information to be extracted at a higher computational cost. Moreover, it is worth noting that dataset completeness was dependent on which tournament organizers and tournament administrators decided to publish replay packs.

Discussion
Future authors may want to filter out replays that ended prematurely due to unknown reasons. Our dataset may contain replays that are undesirable for esports research. We have decided against the deletion of such replays to preserve the initial distributions of the data. Additionally, as filtering was omitted (besides that performed for the purposes of the described experiments), there is a risk that the dataset contains matches that were a part of the tournament itself but did not count towards the tournament standings. Due to the timeframe of the tournaments and game version changes, despite our best efforts, some information might be missing or corrupted and is subject to further processing and research.
To our knowledge, our dataset is the largest publicly available pre-processed esports dataset. Moreover, in preparing the data, we defined and published the software used for the data extraction process and other tasks. Future research on StarCraft II may be built upon these tools and our dataset [68][69][70].
The dataset may also serve to increase knowledge regarding the in-game behavior of players, i.e. the relationship between the variables and overall strategies used by the players at high levels of advancement.Such information can be used in comparisons to non-gamers or intermediate players in the process of studying the relationship between game proficiency, cognitive functioning, and brain structure [81].
Moreover, a report in the area of clinical medicine highlighted the lack of compliance of many authors with their data availability statements (DAS). It is clear that publishing the data and tools required for modeling is a key component of ensuring reproducible scientific work [82].
Other noteworthy applications of the dataset include comparing gameplay styles, action sequence classification, and their relation to victory.To that end, we encourage using different statistical methods and Machine Learning (ML) algorithms, including supervised and self-supervised approaches.
SC2DatasetPreparator [68] to collect the final .zip archives. The dataset post-processing and experiments are briefly visualized in Figure 9 and highlight the use of PyTorch [71] and Lightning [72].

Figure 7: Pre-processing steps of our pipeline that result in SC2ReSet [23]. We are using a custom data processing toolset, including SC2DatasetPreparator [68].

Figure 9: Using the SC2EGSetDataset [24] with the officially provided API [25] to conduct experiments.

A.4 Dataset Usage Examples
There are various ways to use our dataset; one way includes using the custom PyTorch dataset class which was briefly introduced in subsection 3.7, Listing 1. Due to the page limit, we were unable to visualize all of the potential uses of our infrastructure in the main text; Listing 2 showcases the most basic use of the custom Lightning DataModule class that we implemented for our dataset. For further information, please refer to the official documentation. It is important to note that, by default, our classes return a custom SC2ReplayData class, which is a serialization of the initial pre-processed JSON files. To construct a custom tensor required for further modeling, users should use the exposed keyword argument "transform", which should be a function that transforms the default SC2ReplayData into such a tensor.
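A minimal sketch of such a "transform" function is shown below; the field names are hypothetical stand-ins, and users should consult the official documentation for the actual attributes exposed by SC2ReplayData:

```python
def economy_transform(replay) -> list:
    """Turn a parsed replay record into a flat feature vector, as would be
    passed via the `transform` keyword argument.

    Field names below are illustrative, not the real SC2ReplayData schema.
    """
    return [
        float(replay["minerals_collected"]),
        float(replay["vespene_collected"]),
        float(replay["game_length_s"]),
    ]

replay = {
    "minerals_collected": 51240,
    "vespene_collected": 18310,
    "game_length_s": 812,
}
print(economy_transform(replay))  # [51240.0, 18310.0, 812.0]
```

In actual use, the returned list would typically be wrapped in a `torch.Tensor` before being fed to a model.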

A.5 Dataset Structure Examples
We include human-readable examples of various fields, showcased in Listings 3-24; these are a part of the SC2EGSet dataset JSON files. Users should refer to the respective parts of the official documentation for more information and a list of all of the available fields. Access to these can be used to define parsers in other programming languages.

...
{
    "evtTypeName": "PlayerSetup",
    "id": 9,
    "loop": 0,
    "playerId": 1,
    "slotId": 0,
    "type": 1,
    "userId": 1
},
...
]

Figure 2: Distribution of player races and race matchup information.

Figure 5: Percentages of feature importances based on XGBoost fit to all data.

Figure 6: Time series application of these models as an illustration of outcome prediction accuracy over time. It should be noted that these time series results are not based on any form of data aggregation; as such, only basic economic features could be used for classification (18 features in total).

Listing 3: Example header field containing a JSON object.

Table 1: Comparison of publicly available StarCraft II datasets to our own.

Custom validators that filter or verify the integrity of the dataset; (3) the ability of our abstractions to load and use any other dataset that was pre-processed using our toolset. The required disk space to successfully download and extract our dataset is approximately 170 gigabytes. We showcase the example use of our API in Listing 1. Please note that the API is subject to change; users should refer to the official documentation for the latest release features and usage information. An additional listing showcasing the use of SC2EGSetDataset is available in the Appendix, subsection A.4. Additionally, we include human-readable examples of JSON data structures in the Appendix, subsection A.5.