Chinese provincial multi-regional input-output database for 2012, 2015, and 2017

Global production fragmentation generates indirect socioeconomic and environmental impacts throughout its expanded supply chains. The multi-regional input-output (MRIO) model is a tool commonly used to trace supply chains and understand spillover effects across regions, but it often cannot be applied because of data unavailability, especially at the sub-national level. Here, we present MRIO tables for 2012, 2015, and 2017 for 31 provinces of mainland China in 42 economic sectors. We employ hybrid methods to construct the MRIO tables according to the data available for each year. The dataset is a consistent collection of China MRIO tables that reveals the evolution of China's regional supply chains and economic structure during its recent economic transition, before the 2018 Sino-US trade war. The dataset can be further applied as a benchmark in a wide range of in-depth studies of production and consumption structures across industries and regions.


Methods
China's provincial MRIO tables for 2012, 2015, and 2017 were compiled using a partial survey approach, which combines official survey data with modelled outcomes 6,27. Under the partial survey approach, MRIO table construction amounts to linking provincial SRIO tables with a trade matrix for each sector. Provincial SRIO tables are often available from surveyed data, whereas the sectoral trade matrices are unavailable and have to be modelled. Two challenges must be addressed before compilation. First, because of the high cost of the ad hoc surveys needed for input-output table construction, China's official provincial SRIO tables are released only every five years, in years ending with 2 or 7 (2012 and 2017 in this case). SRIO tables for years ending with 5 (e.g. 2015) are not compulsory. Hence, SRIO tables for 2012 and 2017 were available for all provinces, whereas SRIO tables for 2015 were not, and provincial SRIO tables for 2015 had to be built before they could be linked to the trade matrix. Second, the provincial SRIO tables released by the National Statistics Bureau cannot be used directly because their trade flows between provinces are inconsistent, which is the case for 2012 and 2017. For a given product, total domestic exports should equal total domestic imports within an economy, but the officially released SRIO tables often fail to meet this condition. In this study, a cross-entropy model is therefore employed to address these problems. The model follows the minimal cross-entropy principle (or Kullback-Leibler divergence) to minimise the entropy distance between the target and the prior distribution 28,29. The outcome of the cross-entropy model ensures maximum similarity between the target and the known distribution.
Figure 1 illustrates the five steps involved in constructing provincial MRIO tables: (1) estimation of domestic demand and supply; (2) disaggregation of demand and supply; (3) adjustment of the provincial SRIO tables; (4) estimation of the interregional trade matrix; (5) linking of the adjusted provincial SRIO tables with the trade matrix. Table 2 lists the raw data required for MRIO table construction. Because of differences in data availability, we introduce two cases according to the data treatment process. Case 1 is based on comprehensive provincial SRIO tables for all 31 provinces (2012 and 2017), whereas case 2 relies on only a few SRIO tables for 2015. In compiling the model, output and value-added data by sector can be derived from provincial SRIO tables (for 2012 and 2017) or provincial statistical yearbooks (for 2015). However, provincial statistical yearbooks might not provide output for tertiary sectors or value-added for industrial sectors. In this case, we estimate the missing data by assuming that value-added and output have the same share structure. For example, we can estimate value-added for industrial sectors by multiplying the distribution of their output by the total provincial value-added of industrial sectors. To be consistent with the national SRIO table, aggregated provincial output and value-added by sector for all 31 provinces are scaled to the national values from the national SRIO table. In short, output for tertiary sectors is not available in the yearbooks, but value-added for tertiary sectors is, so we use the structure of value-added across all provinces to disaggregate the national output of the tertiary sectors (derived from the national IOT). Similarly, value-added for industrial sectors is not available in the yearbooks, but output is, so we use the output structure across all provinces to disaggregate the national value-added of the industrial sectors (derived from the national IOT).
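The share-based disaggregation described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `disaggregate_by_shares`, the province labels and all numbers are hypothetical.

```python
def disaggregate_by_shares(proxy, national_total):
    """Split a national total across provinces in proportion to a proxy.

    proxy: dict mapping province -> known proxy value (e.g. value-added)
    national_total: national figure to be distributed (e.g. sector output)
    """
    total = sum(proxy.values())
    if total <= 0:
        raise ValueError("proxy values must sum to a positive number")
    return {prov: national_total * value / total for prov, value in proxy.items()}


# Hypothetical tertiary sector: provincial value-added is known, output is not.
value_added = {"A": 30.0, "B": 50.0, "C": 20.0}
national_output = 250.0  # assumed to come from the national IO table

output = disaggregate_by_shares(value_added, national_output)
```

By construction the provincial estimates add up to the national figure, which is the consistency property the scaling step is meant to guarantee.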
Provincial trade flows (domestic imports and domestic exports) are derived from the China customs database for 2015 or from the official provincial SRIO tables for 2012 and 2017. To estimate the trade matrix, observed transport data and electricity transmission data were also obtained from the national railway statistics and China's electricity yearbook, respectively.

Estimation of domestic demand and supply. The compilation starts with the estimation of supply and demand (Fig. 2). From the supply perspective, the supply of a given province can be defined by destination and divided into self-supply, supply to other provinces, and supply to other countries (exports). We estimate the domestic supply $s_i^r$ (sector $i$ in province $r$ supplied to all provinces, including itself) as total output minus exports, as shown in Eq. 1:

$$s_i^r = x_i^r - ex_i^r \tag{1}$$

where $x_i^r$ is the output of commodity $i$ in province $r$, $ex_i^r$ is the export of commodity $i$ from province $r$, and $s_i^r$ is the domestic supply of commodity $i$ for province $r$. The demand of a given province can be defined by source and divided into self-demand, demand from other provinces, and demand from other countries (imports). Similarly, we estimate the domestic demand $d_i^r$ of a province as its total demand minus imports, where total demand is a function of intermediate demand ($z_i^r$) and final demand ($f_i^r$). In case 1, total demand and imports are available from the provincial SRIO tables, as shown in Eq. 2:

$$d_i^r = z_i^r + f_i^r - im_i^r \tag{2}$$

where $im_i^r$ is the import for sector $i$ of province $r$. In case 2, total demand is not available because provincial SRIO tables are lacking for 2015, so we estimated total demand on the basis of assumptions about the technical coefficients.
It is worth noting that the technical coefficients for 2012 were used because the sector classification in the 2012 tables is the same as used in the 2015 tables, while a different classification is used in the 2017 tables (discussed in Table S1). However, choosing different technical coefficients can generate different estimated total demands which lead to different MRIO tables. Therefore, more investigation is needed to address how total demand for each province can be estimated.
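For case 1, the supply and demand definitions of this step reduce to simple accounting identities. The sketch below uses hypothetical numbers for a single sector in a single province; the function names are illustrative only.

```python
def domestic_supply(output, exports):
    """Domestic supply: total output minus foreign exports."""
    return output - exports


def domestic_demand(intermediate, final, imports):
    """Domestic demand: intermediate plus final demand, minus foreign imports."""
    return intermediate + final - imports


# Hypothetical values for one sector in one province
x, ex = 120.0, 20.0           # output and foreign exports
z, f, im = 70.0, 45.0, 15.0   # intermediate demand, final demand, foreign imports

s = domestic_supply(x, ex)
d = domestic_demand(z, f, im)
```

Across all provinces, the sum of `s` and the sum of `d` for a product should coincide; where they do not, the later balancing steps reconcile them.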
Disaggregating demand and supply. Once domestic supply and demand are established in the step above, we disaggregate them with the cross-entropy (CE) model, as shown in Fig. 3. As mentioned above, the CE model is used to obtain the distribution that is closest to the prior information while satisfying the given constraints. For a given product or sector, several equations reflect the supply-demand balance and act as constraints in the CE model: (1) self-supply should equal self-demand for the same province; (2) the row sum of SD and SO should be identical to the domestic supply S and, correspondingly, the row sum of DD and DO should conform to the domestic demand D; (3) the column sum of SO over all provinces should equal the column sum of DO over all provinces, because all products sent out equal all products received within a given boundary. The model minimises the cross-entropy distance between $p_{ir}$ and $q_{ir}$ subject to the following conditions: the distribution of all supply sums to 1; the column sum of SO equals the sum of DO; the row sum of domestic supply equals the domestic supply by province; and the column sum of domestic demand equals the domestic demand by province. Here, $p_{ir}$ is the distribution of supply and demand for sector $i$ in province $r$, and $q_{ir}$ is the corresponding prior distribution. $s_i$ and $d_i$ are the aggregated domestic supply and demand for sector $i$, and $s_{ir}$ and $d_{ir}$ are the domestic supply and demand for sector $i$ in province $r$.
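To make the objective concrete, the sketch below evaluates the Kullback-Leibler (cross-entropy) distance between a candidate distribution and a prior. It covers only the objective function; the constrained minimisation itself is not shown. The function name and all numbers are hypothetical, and the prior is assumed to be strictly positive.

```python
import math


def cross_entropy_distance(p, q):
    """Kullback-Leibler divergence sum(p * ln(p / q)).

    p, q: sequences of shares that each sum to 1; q must be strictly positive.
    Terms with p == 0 contribute nothing, following the usual convention.
    """
    if abs(sum(p) - 1.0) > 1e-9 or abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("both distributions must sum to 1")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


prior = [0.5, 0.3, 0.2]            # hypothetical prior shares
candidate_a = [0.45, 0.35, 0.20]   # close to the prior
candidate_b = [0.20, 0.30, 0.50]   # far from the prior

dist_a = cross_entropy_distance(candidate_a, prior)
dist_b = cross_entropy_distance(candidate_b, prior)
```

Among candidates that satisfy the balance constraints, the CE model would select the one with the smallest distance, which here is `candidate_a`.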
Adjusting provincial single regional input-output tables. The above steps re-adjust domestic supply and demand to make sure that total domestic exports ($\sum_r so^r$) are equal to domestic imports ($\sum_r sd^r$) for any product. We then updated the intermediate demand (Z) and final demand (F) of the previous provincial SRIO tables, calibrated to the adjusted domestic exports and imports. We employed the generalised RAS (GRAS) model, a variant of RAS that allows for non-positive elements in iterative matrix balancing 30. For the SRIO table of a given province $r$, two balancing conditions need to be met. By row, the row sum of intermediate and final demand should equal total output minus net exports. By column, the column sum of intermediate demand should equal total output minus value-added. In the balancing, $q_{ij}^r$ is the prior distribution containing the matrix of intermediate demand $z_{ij}^r$ and final demand $f_i^r$, which can be derived directly from the provincial SRIO table, or from a proxy if the SRIO table is not available, as in 2015. For the proxy, we assume identical technical coefficients for 2012 and 2015 and multiply them by the 2015 input to obtain a preliminary intermediate demand. For final demand, we first calculate aggregated final demand as GDP minus net exports (NE), and then multiply it by the 2012 final demand distribution to obtain the proxy estimate. $ne_i^r$ represents the net export of product $i$ for province $r$, which is equal to foreign exports plus domestic exports minus foreign imports minus domestic imports, by product. Foreign exports and imports are directly available from the provincial SRIO tables (for 2012 and 2017) or from the customs dataset (for 2015). For 2012 and 2017, we used the trade data directly from the provincial SRIO tables, whereas for 2015 the customs dataset is used to estimate provincial exports and imports by sector, as there are no provincial SRIO tables.

$p_{ij}^r$ represents the ratio of the unknown distribution to the known prior distribution, which is the result of the GRAS; $e$ is the base of the natural logarithm. $X_j^r$ represents the total input of product $j$ for province $r$, while $X_i^r$ represents the total output of product $i$ for province $r$.
Intraregional matrix estimate. [...] Similarly, we can apply the import purchase coefficient (IPC), analogous to the purchase coefficient, to derive the demand supplied from other provinces. In these expressions, $x_i^r$ refers to the output of product $i$ in province $r$; $ex_i^r$ indicates the foreign export of product $i$ from province $r$; $so_i^r$ refers to the supply of product $i$ from province $r$ to other provinces; $im_i^r$ represents product $i$ imported from other countries to province $r$; $do_i^r$ represents product $i$ required in province $r$.

Interregional trade matrix estimate. To obtain the trade matrix, we apply the gravity model with observable trade data between provinces, which improves the accuracy and reliability of the interregional estimates 27,32,33. The gravity model has been widely adopted in previous Chinese MRIO table building 17,21. It is worth noting that the standard gravity model requires trade sample data to estimate its parameters. When sample data are unavailable, the doubly constrained gravity model can be chosen as a reliable alternative 34. The doubly constrained gravity model has also been used by IMPLAN to build a sub-national trade matrix for the US 35,36. The model assumes that trade between two regions is a function of supply, demand and the impedance in costs. The standard gravity model is therefore as follows:

$$t_i^{rs} = \frac{\left(e_i^{ro}\right)^{\beta_1} \left(m_i^{os}\right)^{\beta_2}}{\left(d^{rs}\right)^{\gamma}}$$

where $t_i^{rs}$ is the trade flow of commodity $i$ between province $r$ and province $s$; $e_i^{ro}$ and $m_i^{os}$ are the supply (or domestic export) of province $r$ and the demand (or domestic import) of province $s$, respectively. $d^{rs}$ is the distance between the two provinces, which is a proxy for transportation costs. $\beta_1$ and $\beta_2$ represent the weights of the origin and destination provinces, and $\gamma$ is the friction parameter. With sample trade data, the unknown coefficients for each sector ($\beta_1$, $\beta_2$, $\gamma$) can be estimated using regression.
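One common way to carry out such a regression, offered here as an assumption rather than the authors' documented procedure, is to take logarithms of a multiplicative gravity form so that the coefficients enter linearly and can be fitted by ordinary least squares. The error term $\varepsilon_i^{rs}$ is introduced only for this illustration.

```latex
\ln t_i^{rs} = \beta_1 \ln e_i^{ro} + \beta_2 \ln m_i^{os} - \gamma \ln d^{rs} + \varepsilon_i^{rs}
```

Each observed origin-destination pair with a positive flow contributes one observation; zero flows cannot enter the log-linear form and would need separate treatment.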
In this case, we use interregional railway commodity flows from the National Railway Statistical Data as sample data for shippable commodities. We use the sample data as the trade flow ($t_i^{rs}$) to estimate the unknown coefficients ($\beta_1$, $\beta_2$, $\gamma$). As we have trade data for 11 commodities from the railway statistics, some sectors in the gravity model have to share the same coefficients (see Table S2). We show the mapping relationship in the appendix. For non-shippable commodities (e.g. services and construction), we do not set transport costs and simply assume that they are evenly distributed based on supply and demand, as data are unavailable. For electricity transmission, we obtained an interregional electricity transmission matrix from the China Electricity Power Yearbook as electricity sample data 37. With the estimated coefficients, we can derive the initial trade matrix directly from Eq. 12. However, the initial trade matrix does not satisfy the row and column constraints, which are the domestic exports and imports from the updated provincial SRIO tables. We therefore apply the RAS model to balance the trade matrix and make it consistent with the provincial SRIO tables.
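The two operations described here, building an initial gravity matrix and then balancing it with RAS, can be sketched for three hypothetical provinces. The coefficients, distances and totals below are invented for illustration, and setting the diagonal to zero (interprovincial flows only) is an assumption of this sketch.

```python
def gravity_matrix(supply, demand, dist, b1, b2, gamma):
    """Initial trade flows supply^b1 * demand^b2 / dist^gamma, zero diagonal."""
    n = len(supply)
    return [[0.0 if r == s
             else supply[r] ** b1 * demand[s] ** b2 / dist[r][s] ** gamma
             for s in range(n)] for r in range(n)]


def ras_balance(matrix, row_totals, col_totals, tol=1e-9, max_iter=1000):
    """Alternately rescale rows and columns until both margins are met."""
    m = [row[:] for row in matrix]
    for _ in range(max_iter):
        for r, row in enumerate(m):            # row scaling
            total = sum(row)
            if total > 0:
                m[r] = [v * row_totals[r] / total for v in row]
        for s in range(len(col_totals)):       # column scaling
            total = sum(row[s] for row in m)
            if total > 0:
                for row in m:
                    row[s] *= col_totals[s] / total
        # columns are exact after the last pass; check the rows
        if max(abs(sum(row) - row_totals[r]) for r, row in enumerate(m)) < tol:
            break
    return m


domestic_exports = [30.0, 50.0, 20.0]   # row constraints (hypothetical)
domestic_imports = [40.0, 25.0, 35.0]   # column constraints (hypothetical)
distance = [[0.0, 100.0, 300.0],
            [100.0, 0.0, 200.0],
            [300.0, 200.0, 0.0]]

initial = gravity_matrix(domestic_exports, domestic_imports, distance,
                         b1=1.0, b2=1.0, gamma=1.5)
balanced = ras_balance(initial, domestic_exports, domestic_imports)
```

RAS preserves the zero pattern of the initial matrix, so the gravity estimate determines which flows exist and their relative sizes, while the SRIO margins fix the totals.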
Based on the balanced trade matrix, we calculate the proportion of total domestically imported products supplied from each province, defined as the purchase proportion (RP), where $rp_i^{rs}$ represents the ratio of domestic imports from province $r$ to province $s$ for product $i$, and $t_i^{jr}$ refers to the trade from province $j$ to province $r$ for sector $i$. The non-diagonal matrix in the MRIO table is then derived from these purchase proportions.
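A minimal sketch of the purchase proportion follows, under the assumption that it is the share of a destination province's total domestic imports coming from each origin province, i.e. a column normalisation of the balanced trade matrix. The trade matrix below is hypothetical.

```python
def purchase_proportions(trade):
    """Column-normalise a trade matrix: share of each origin in a
    destination's total domestic imports (assumed definition of RP)."""
    n = len(trade)
    col_sums = [sum(trade[r][s] for r in range(n)) for s in range(n)]
    return [[trade[r][s] / col_sums[s] if col_sums[s] > 0 else 0.0
             for s in range(n)] for r in range(n)]


# Hypothetical balanced trade matrix for one product (rows: origin, columns: destination)
trade = [[0.0, 10.0, 20.0],
         [35.0, 0.0, 15.0],
         [5.0, 15.0, 0.0]]

rp = purchase_proportions(trade)
```

In this example province 0 obtains 87.5% of its domestic imports from province 1, since 35 of its 40 units of domestic imports originate there.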

Data Records
Provincial MRIO tables illustrate the regional economic structure and interregional supply chains for 31 provinces with 42 sectors, and cover China's economic transition period in 2012, 2015 and 2017. The layout follows the standard MRIO table (Fig. 4). [...] Table 1) is not included in this comparison because it comprises only 30 sectors for 30 provinces. The format of MRIO-DRC and MRIO-CAS (42 sectors for 31 provinces) is compatible with our MRIO table (MRIO-CEADs).
Following previous work on MRIO table comparison 40, three indicators are employed in the comparison. Specifically, we calculate the mean absolute deviation (MAD), the Isard-Romanoff similarity index (DSIM) and the absolute entropy distance (AED). These indicators measure the similarity between matrices. MAD measures the absolute distance between corresponding elements of the two matrices; DSIM uses the relative distance instead of the absolute distance used in MAD; AED is based on information theory and measures the absolute difference in entropy between two matrices. The more similar the two matrices are, the closer AED is to zero. Here, we compare the intermediate demand matrix, which represents how each sector's production requires the production of other sectors. [...] with CAS. Overall, the three indicators suggest that our MRIO table lies between the two counterpart MRIO tables, while all three indicators might be similar for some provinces. For example, Chongqing in MRIO-CEADs is similar to MRIO-CAS, while Hubei in MRIO-CEADs is similar to MRIO-DRC.
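The three indicators can be sketched as below. The exact formulas used in the comparison are not reproduced in this text, so these are assumed textbook-style forms: MAD as the mean absolute element-wise difference, DSIM as the mean of |a - b| / (|a| + |b|), and AED as the absolute difference between the Shannon entropies of the two matrices after normalising each to sum to 1 (non-negative entries assumed). The matrices are hypothetical.

```python
import math


def _flatten(matrix):
    return [v for row in matrix for v in row]


def mad(a, b):
    """Mean absolute deviation between corresponding elements."""
    x, y = _flatten(a), _flatten(b)
    return sum(abs(u - v) for u, v in zip(x, y)) / len(x)


def dsim(a, b):
    """Mean relative distance |a - b| / (|a| + |b|); 0 where both are zero."""
    x, y = _flatten(a), _flatten(b)
    total = 0.0
    for u, v in zip(x, y):
        denom = abs(u) + abs(v)
        if denom > 0:
            total += abs(u - v) / denom
    return total / len(x)


def _entropy(values):
    """Shannon entropy of non-negative values normalised to sum to 1."""
    s = sum(values)
    return -sum((v / s) * math.log(v / s) for v in values if v > 0)


def aed(a, b):
    """Absolute difference between the entropies of two matrices."""
    return abs(_entropy(_flatten(a)) - _entropy(_flatten(b)))


table_a = [[10.0, 5.0], [2.0, 8.0]]   # hypothetical intermediate demand
table_b = [[9.0, 6.0], [2.0, 8.0]]

scores = (mad(table_a, table_b), dsim(table_a, table_b), aed(table_a, table_b))
```

All three measures are zero for identical matrices and grow as the matrices diverge; DSIM is additionally bounded by 1, which makes it less sensitive to a few large cells than MAD.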
We then compare the province-wise proportion of domestic intermediate input to total input, the proportion of domestic final demand to total output, and the value-added embodied in final demand (Fig. 5). The results show that MRIO-CEADs is generally similar to the other two tables, although the differences are larger for some provinces. For the proportion of domestic intermediate demand to total input, the biggest gap is found in Shanghai, where the proportion is 6% lower than in MRIO-DRC but 4% higher than in MRIO-CAS. The comparison of standard deviations (SD) between our MRIO table and the other two tables shows that our MRIO table is more similar to MRIO-DRC in intermediate demand, by a small margin. For the proportion of domestic final demand to total output, MRIO-CEADs is more similar to MRIO-DRC as a general trend. The biggest gap is found in Tibet, where the figure is 13% higher than in MRIO-CAS but only 4% higher than in MRIO-DRC. In terms of SD, our MRIO table deviates more from MRIO-CAS. As for the value-added embodied in final demand, all MRIO tables produce similar outcomes, which might indicate that the main deviation occurs in sectors with smaller value-added in final demand, such as agriculture and mining.

The results show that the deviation for most sectors is within ±5% in all three years, with a few exceptions. In 2012, processing of petroleum, coking, processing of nuclear fuel (S11), comprehensive use of waste resources (S23), and production and distribution of gas (S26) are outliers, being 30% higher than in the SRIO table. In 2015 and 2017, the most significant deviation is found in production and distribution of gas (S26 in 2015 and S25 in 2017). It is worth noting that these sectors in China are highly related to imports. The uncertainty stems from the assumption that the import ratio is identical when transforming the competitive table into the non-competitive table.

The accuracy of the ratio is therefore more sensitive for sectors with higher imports. The ratio can be adjusted to improve the model if more data become available. For 2012, the other datasets can also be compared with the domestic input from the SRIO table, but their deviation is far higher. For example, mining and washing of coal (S2) is 57% higher in MRIO-DRC and 62% higher in MRIO-CAS than in the SRIO table. The main reason for the deviation is that MRIO-DRC and MRIO-CAS are compiled from the provincial SRIO tables without being calibrated to the national one. The aggregate of the provincial data is not entirely equal to the national figure, as provincial data are compiled by provincial statistics agencies while national data are compiled by the national statistics bureau 41.

Code availability
The programs used for data generation are written in MATLAB and GAMS. The associated code can be found in the Zenodo repository 38.