Deciphering Bitcoin Blockchain Data by Cohort Analysis

Bitcoin is a peer-to-peer electronic payment system that has rapidly grown in popularity in recent years. Usually, the complete history of Bitcoin blockchain data must be queried to acquire variables with economic meaning. This task has recently become increasingly difficult, as there are over 1.6 billion historical transactions on the Bitcoin blockchain. It is thus important to query Bitcoin transaction data in a way that is more efficient and provides economic insights. We apply cohort analysis that interprets Bitcoin blockchain data using methods developed for population data in the social sciences. Specifically, we query and process the Bitcoin transaction input and output data within each daily cohort. This enables us to create datasets and visualizations for some key Bitcoin transaction indicators, including the daily lifespan distributions of spent transaction output (STXO) and the daily age distributions of the cumulative unspent transaction output (UTXO). We provide a computationally feasible approach for characterizing Bitcoin transactions that paves the way for future economic studies of Bitcoin.


Background & Summary
Bitcoin is a peer-to-peer electronic payment system that has rapidly grown in popularity in recent years [1][2][3][4] .As a distributed ledger technology (DLT), Bitcoin records newly generated transactions in a decentralized way, eliminating the need for intermediaries like banks and reducing transaction costs [5][6][7] .
Bitcoin relies on recording the unspent transaction outputs (UTXO) to efficiently verify newly generated transactions [8][9][10][11][12] .An illustrative example of UTXO is shown in Figure 1.A UTXO can be generated either as block rewards or outputs of transactions.Block rewards are newly minted bitcoins (BTC) distributed to miners for their work to maintain the network, such as routing transactions and validating blocks.In fact, all UTXOs can be dated back to block rewards.The timestamp is recorded when a UTXO is generated.A UTXO is spent and converted into a spent transaction output (STXO) when it is used as the input of a transaction.A timestamp is again recorded when the UTXO is spent, and each UTXO can be spent only once.Such a unique feature allows us to calculate the age of each UTXO and the lifespan of each STXO as we do in population data.Take Figure 1 as an example.As of July 1, 2020, UTXOs 1-3 are 8.5-years, 1-year, and 1-day old, respectively.Immediately after Alice's payment to Bob on January 1, 2021, UTXOs 1-3 are converted to STXOs with ages of 9 years, 1.5 years, and 0.5 years and 1-day old, respectively.
Noticing the unique structure of the Bitcoin blockchain data, we apply cohort analysis [13][14][15][16][17] , originally developed for population data, to analyze it.To continue the analogy with the population data, we say a UTXO is born when it is generated as block rewards or the output of a transaction, and we say a UTXO is dead when it is spent as the input of another transaction.In this way, all UTXOs generated on the same day form a daily birth cohort, and all UTXOs spent on the same day form a daily death cohort.We define the age of a UTXO as the difference between "now" (the date on which we are working) and the time when it was born.We define the lifespan of an STXO as the difference between the time when the STXO was dead and the time when it was born.Thus, all UTXOs within an age range form an age cohort, and all STXOs within a lifespan range form a lifespan cohort.With this framework, we naturally replicate in Bitcoin blockchain data a trinity of birth, death, and age cohorts using population cohort analysis.
Usually, we need to query the complete history of Bitcoin blockchain data to acquire variables with economic meaning.With over 1.6 billion historical transactions on the Bitcoin blockchain, it has become increasingly difficult and computationally arXiv:2103.00173v3[econ.GN] 15 Jan 2022 intensive now to download the complete Bitcoin blockchain records.It is thus important to query Bitcoin transaction data in a way that is more efficient and provides economic insights 18 .Cohort analysis provides a new perspective from which we can analyze data within each cohort separately before integrating them into a time series.
Our workflow is displayed in Figure 2. We query and process Bitcoin transaction input and output data within each daily cohort.By doing so, we successfully create datasets and visualizations for some key Bitcoin transactions indicators, including the daily lifespan distributions of STXOs as percentages (Figure 3) and the cumulative daily age distributions of UTXOs (Figure 4).These visualizations can be used to study the functions of bitcoin (BTC) as a currency.The three functions of a currency include acting as a store of value, unit of account, and medium of exchange.For example, Figure 4 shows the number of BTCs in UTXOs (i.e., BTCs that have not been spent) by age distribution.By the end of 2020, approximately 2 million BTCs had not been transacted for more than 10 years.In the past 5-10 years, 2-5 years, and 1-2 years, approximately 2 million, 4.5 million, and 3 million BTCs, respectively, remained inactive.This equals approximately 11.5 million BTCs not having been transacted for more than 1 year.These BTCs serve as a time deposit and act as a store of value.Moreover, approximately 5 million BTCs are alive for 1 month to 1 year.These BTCs are similar to a demand deposit.Frequently transacted BTCs are those with ages between 1 day and 1 month (2 million) and less than 1 day (0.2 million).These BTCs act as a medium of exchange.
Our final datasets include one dataset that characterizes STXOs and one that characterizes UTXOs, which are both smaller than 1 MB.Moreover, cohort analysis keeps data querying and processing to a minimum for future updates and enables automated updates.We thus provide a computationally feasible approach for characterizing BTC transactions, which paves the way for future economic studies of Bitcoin.Our methods can be generally applied to other cryptocurrencies that adopt UTXO protocols, including Litecoin, Dash, Zcash, Dogecoin, and Bitcoin Cash.

Methods
While the Bitcoin transaction output data are publicly available on its blockchain, we find the size of the raw data (approximately 1.3 TB) overwhelming to process, even with cloud computing platforms.To improve the efficiency of computation, we first retrieve the data relevant to the study to create a more manageable data table of only 45 GB.By partitioning this data table into daily birth and death cohorts, we can analyze the STXOs and UTXOs in each cohort separately to summarize the daily characteristics of transaction outputs and create visualizations based on the cohort summary.Our method can be adapted to the creation of future blocks -we only need to process the transaction output data from the latest cohort and append the summary to the current version.

Creating Partitioned Tables
Our primary workplace is Google Colaboratory (Colab), a Jupyter Notebook hosted environment from Google, and BigQuery, a data warehouse from Google Cloud Platform.We first query the columns of interest from the public dataset crypto-bitcoin on BigQuery 19 , which includes the input and output data of Bitcoin.We then join the data queried from input and output data to create a data table that includes the value of UTXO (value), the timestamp when the UTXO was created (block_timestamp), and the timestamp when the UTXO was spent as an input of another transaction (spent_block_timestamp) (this column is left null if the transaction output is unspent).As the UTXO in a transaction is counted in satoshi (1 satoshi = 10 −8 BTCs), the actual number of BTCs in a UTXO can be computed by # UT XO = value * 10 −8 , where the value represents the number of BTCs in satoshi.We rely on this derived data table (1.6 billion rows, 45 GB) to conduct further analysis.
To save the cost of the query, we create two partitioned tables based on the derived data table, one by the date in block_timestamp and one by the date in spent_block_timestamp.This means that the data entries are partitioned either by the date when the UTXOs were created or by the date when the UTXOs were spent.In this way, the program queries only the entries with timestamps in a specific range, which saves a notable amount of computational power.This step can significantly improve query performance and reduce query cost 20 .

Querying and Processing Cohort Data
The data structure of partitioned tables coincides with our need to process cohort data.The table partitioned by date in block_timestamp naturally divides the derived data into birth cohorts that include the segment of transaction outputs created on the same date, and the table partitioned by date in spent_block_timestamp divides the derived data into death cohorts that include the segment of transaction outputs spent on the same date.
We query and process each birth cohort and each death cohort with a loop program following the procedure described in Figure 2.For each specific date after 2009-01-03, when the first block of Bitcoin was created, the birth cohort data and the death cohort data of that date are queried and imported to Colab from BigQuery.As in Task 1, we compute the total number of BTCs in UTXOs created and spent on that date by summing the number of BTCs in UTXOs in the birth cohort data and the death cohort data respectively.Task 2 focuses on the weighted average lifespan (WAL) on the date, defined as the average lifespan (the difference between the time when the output was spent and the time when the output was created) weighted by the number of BTCs contained in the transaction outputs.WAL can be computed from the death cohort data by the formula: where Li f espan = spent_block_timestamp − block_timestamp.
As in Task 3, we compute the distribution of lifespan with death cohort data on that date by first categorizing UTXOs based on lifespan and then summing the number of BTCs in UTXOs in each category.In Task 4, we apply a more complicated partitioning method to compute the age distribution for each specific date.The age of a UTXO is defined as age = working_date − block_timestamp, where working date means the date of interest for the data cohort being studied.Each UTXO that remains alive on a specific date must satisfy both conditions: a) its block_timestamp must be smaller than the end of the working date, which means that the UTXO was created sometime before or on the date, and b) its spent_block_timestamp must either be null, which means the UTXO was not spent before 2021-02-10, or be larger than the end of the working date, which means that the UTXO was spent sometime after the working date but before 2021-02-10.Thus, we cannot simply interpret this information as either birth or death cohort data.Instead, we must first query the data needed to compute the age distribution for a twelve-month or six-month period depending on the size of the data in each year and then split the queried data into daily cohorts in the Python program.We compute the age distribution of each daily cohort by categorizing the age of each UTXO and summing up the number of BTCs in UTXOs in each category.

Visualizing the Time Series
The result of our analysis is condensed into time-series data that include the number of BTCs in UTXOs created and spent, the weighted average lifespan, the lifespan distribution, and the age distribution on each date from 2009-01-03 to 2021-02-10.Many visualizations can potentially be generated from this informative time series.For example, BTC token velocity, which we define below as the number of BTCs spent in the last 30 days divided by the circulating supply of BTCs, can be computed by Our method can be adapted to the creation of future blocks.The time-series data for the past dates are not subject to changes as new blocks are created.As time goes on, we need only query and process the latest data cohorts to extend the time series.We will update the visualizations according to the latest development of Bitcoin, and researchers may easily repeat our work in part or in whole based on their needs.

Data Records
The final data records are stored and published on the Harvard Dataverse 21 .The records consist of the UTXO and the STXO datasets in csv format.The metadata information of the two datasets is presented in Appendix Tables 2 and 3. Data ranges from 2009-01-03 to 2021-02-10, and the data frequency is daily (n = 4421).The timezone used in the data is UTC+0.In addition to examining Bitcoin, we apply the same cohort analysis to five other cryptocurrencies and generate twelve datasets in total.Detailed information on these data files is presented in Appendix Table 4.

Technical Validation
To further verify the validity of our methods, we use our data to calculate other variables, including block reward and circulating supply of BTCs, and check whether the results are consistent with descriptions in the Bitcoin white paper 1 and external data sources.We compute the circulating supply of BTC by computing the cumulative net new UTXOs with the formula Figure 5 visualizes the block rewards and the circulating supply.Block rewards are the BTC awarded to the miner who wins the right to record a block of transactions by proof-of-work.Supply of the BTCs originates from the block rewards, so the cumulative sum of block rewards is the total number of BTCs in UTXOs, i.e., the circulating supply of BTC.The Bitcoin block reward was initially set at 50 BTCs per block in 2009, which means approximately 7,200 newly minted BTCs every 24 hours.The block reward halves every 210,000 blocks, roughly every four years, until the total BTC supply reaches 21 million 1 .As of the time of writing, the daily block reward amounts to approximately 900, and the circulating BTC supply is 18.9 million.
In addition, we calculate the circulating supply of BTCs by summing all UTXOs in different age cohorts because existing BTC are essentially just UTXOs of different ages.We then compare the circulating supply we compute with the circulating supply data obtained from CoinMetrics, a widely used blockchain database 22 .As shown in Figure 5, the two measures of circulating supply match exactly with each other.Hence, the validity of our data is verified.

Usage Notes Applicability
Our data can inspire research in finance, computer science, and macroeconomics.Our data can produce new technical indicators for financial studies to predict cryptocurrency bubbles 23,24 , measure cryptocurrency volatility and systematic risk 25,26 , design investment strategies [27][28][29] and implement portfolio managements 30,31 .For instance, Liu and Zhang 18 used our data to design automated trading strategies for BTC investment that outperform conventional approaches.In computer science, we can apply the UTXO and STXO data to evaluate blockchain security and scalability [32][33][34] .Wang et al. 35 cite our data to demonstrate the scalability issues of the BTC blockchain and propose an efficient storage scheme.Our data can also contribute to event studies that evaluate the effect of macro policies on BTC transactions 36 .

Limitations and future research
In this section, we identify the limitations of our current results and directions for future research.First, although the frequency of our data is on a daily level, our cohort analysis can produce data with higher frequencies.Table 1 shows several other cryptocurrencies to which our methodology can be easily applied.The granularity of the data can reach different levels (75 seconds to 10 minutes) depending on the block time of each cryptocurrency.
Second, the age distribution of UTXOs is a limited measure for BTC as a store of value.UTXOs might accumulate ages for at least two reasons other than being a store of value: First, the owner of the UTXOs has lost the private key, or second, the amount of UTXOs in the owner's account is less than the transaction fee.Owners do not transact these dust UTXOs for cost-benefit reasons.In neither case is age accumulation a sign that BTC acts as a store of value.However, scientific methods to identify the two types of UTXOs have yet to be found.
Third, the cohort analysis we designed and implemented was for UTXO-based blockchains.However, account-based blockchains, such as Ethereum, Polkadot, and Dfinity, adopt a different accounting method.In the UTXO model, crypto tokens are akin to banknotes issued by central banks; in the account model, crypto tokens are akin to balances in commercial bank accounts.Future research could extend the cohort analysis for account-based blockchains.Moreover, Ethereum, a Turing-complete blockchain, has two types of accounts: externally owned accounts (EOA) and contract accounts, which can be analogized to private and corporate accounts in commercial banks.A comparative study of the two accounts by cohort analysis could be an exciting direction for future research.2 and Table 3 present the meta data of the data we present on Bitcoin STXO and UTXO.

Figures & Tables
Table 4 lists all the data files published in our GitHub depository, including the STXO and UTXO data for six cryptocurrencies including Bitcoin.The figure shows that the WAL of BTCs in STXOs attains a peak value when the BTC price is volatile.For example, the 2014 peak of WAL value closely followed the rocketing of BTC price from $100 to $1000 and its subsequent price collapse.This implies that older BTC become more active during market turmoil.

Figure
Figure 1.An example of UTXO birth and death.UTXOs 1, 2, and 3 were spent in a transaction taking place between Alice and Bob and were transformed to UTXOs 4 and 5. UTXOs 1, 2, and 3 became STXOs after the transaction.

1 .
Figure 1.An example of UTXO birth and death.UTXOs 1, 2, and 3 were spent in a transaction taking place between Alice and Bob and were transformed to UTXOs 4 and 5. UTXOs 1, 2, and 3 became STXOs after the transaction.

Figure 2 .
Figure 2. Workflow of cohort analysis on BTC UTXO data

Figure 3 .
Figure 3. Lifespan distribution of BTC STXOs.The figure shows the log percentage of spent transaction outputs with different lifespans in each day until Feb. 2021.For example, by Feb. 2021, the STXOs with lifespans of less than one day accounted for 80% of all STXOs, while those with lifespans between 1 day and 1 month accounted for another 15%.

Figure 4 .Figure 5 .
Figure 4. Number of BTC UTXOs by age.The figure shows the total cumulative unspent transaction outputs by age.For example, by Feb. 2021, there were approximately 200k UTXOs less than 1 day old used as the medium of exchange and approximately 2 million UTXOs more than 10 years old lost or used as store of value.

Figure 6 .
Figure 6.Daily weighted average lifespan of Bitcoin STXOs and BTC price.The figure shows that the WAL of BTCs in STXOs attains a peak value when the BTC price is volatile.For example, the 2014 peak of WAL value closely followed the rocketing of BTC price from $100 to $1000 and its subsequent price collapse.This implies that older BTC become more active during market turmoil.

Table 1 .
Figures 7, 8, 9, 10, and 11are several additional visualizations of the data we present here.Block time summary of UTXO-based cryptocurrencies.Data sources from white papers and websites of these cryptocurrencies.

Table 2 .
cohort data were queried, in the format "%Y/%m/%d" newborn Number of UTXOs in BTC created on the given date dead Number of UTXOs in BTC spent as inputs on the given date WAL Weighted average lifespan of the UTXOs spent on the given date, defined as the average lifespan (the difference between the time when the output was spent and the time when the output were created) weighted by the number of UTXOs in BTC contained in the transaction outputs.Number of UTXOs in BTC spent on the given date that were created more than six months but less than one year (6m ∼ 1y) before 1 Number of UTXOs in BTC spent on the given date that were created more than one year but less than two years (1y ∼ 2y) before 3 Number of UTXOs in BTC spent on the given date that were created more than two years but less than three years (2y ∼ 3y) before 5 Number of UTXOs in BTC spent on the given date that were created more than three years but less than four years (3y ∼ 4y) before 7 Number of UTXOs in BTC spent on the given date that were created more than four years but less than five years (4y ∼ 5y) before 9 Number of UTXOs in BTC spent on the given date that were created more than five years but less than ten years (5y ∼ 10y) before 11 Number of UTXOs in BTC spent on the given date that were created more than ten years (> 10y) before Meta Data for STXO Dataset

Table 3 .
Number of UTXOs in BTC still alive by the end of the given date that were created more than six months but less than one year (6m ∼ 1y) before 1 Number of UTXOs in BTC still alive by the end of the given date that were created more than one year but less than two years (1y ∼ 2y) before 3 Number of UTXOs in BTC still alive by the end of the given date that were created more than two years but less than three years (2y ∼ 3y) before 5 Number of UTXOs in BTC still alive by the end of the given date that were created more than three years but less than four years (3y ∼ 4y) before 7 Number of UTXOs in BTC still alive by the end of the given date that were created more than four years but less than five years (4y ∼ 5y) before 9 Number of UTXOs in BTC still alive by the end of the given date that were created more than five years but less than ten years (5y ∼ 10y) before 11 Number of UTXOs in BTC still alive by the end of the given date that were created more than ten years (> 10y) before Meta Data for UTXO Dataset

Table 4 .
File names of Altcoins UTXO and STXO data on GitHub