Geolocated dataset of Chinese overseas development finance

China is now the world’s largest source of bilateral development finance and will likely continue to play a prominent role in sovereign lending through its multi-billion-dollar Belt and Road Initiative. This paper introduces major methodological enhancements in tracking this finance: the use of an original application programming interface (API) to gather news in multiple languages; double-verification of every record to ensure every finance commitment has been formalized; and visual geo-location to trace the precise footprint of every project. The resulting dataset enables economic, environmental, and social analyses with high-precision spatial accuracy, as well as spatiotemporal monitoring by project stakeholders and enhanced planning by project managers. It covers the years 2008–2019 to enable analysis before and after the announcement of the Belt and Road Initiative. It includes 862 finance commitments, 669 of which have geographic location, to 94 countries across the world.

of economic and policy trends. Under this rigorous and innovative approach to verification, users can rely on the existence of each project, its attributes, and its precise geolocation.
This new dataset aims to empower stakeholders to monitor ongoing projects and to weigh potential risks and benefits of proposals. As the world faces rapid biodiversity losses and an impending climate crisis, this type of monitoring is more crucial than ever 22 . The United Nations and other global bodies have called for international DFIs to make radical shifts in the types of projects supported and the locations chosen for them. This dataset will enable tracking of progress toward those goals by including major DFIs that, until now, have not been fully traceable 23 , and facilitating research to evaluate potential environmental and social impacts of global Chinese development finance using high-precision spatial analyses 24 .

Methods
Our aim has been to create a global, validated dataset of China's overseas development finance from 2008 to 2019. We include all sovereign lending commitments by CDB and ExImBank, China's two policy banks that are most active in overseas lending. As policy banks, they are differentiated from other sources of finance by their aim of supporting Chinese policy goals rather than commercial aims. The resulting trends can be interpreted as both economic and policy actions, akin to those of other policy-driven actors like multilateral development banks, national development banks, and export credit agencies 25. As the interest rates associated with individual loans are far from universally public, our approach allows users to have a high degree of confidence that the financing tracked here is extended in service of development policy aims.
To date, no official global aggregate or record of CDB and ExImBank overseas sovereign lending exists. Bank annual reports include lending for overseas projects, but those figures also include lending to Chinese companies for their work overseas and lending to private firms abroad, and in some cases count Chinese territories as overseas finance. As explained in more detail below, existing third-party aggregations largely fall into two categories: those that are not limited to DFIs and/or do not disaggregate by lender, and those that do not employ rigorous data validation to eliminate over-counting. For this reason, our methodology is bottom-up in nature, starting from individual loans and building to global aggregation.
We applied a uniform validation standard of double verification (described in detail below) to every record incorporated here. This process had three steps: 1) we compiled the limited number of existing datasets of Chinese development finance that already meet this double-verification standard; 2) we then applied this standard to 'clean' other existing datasets, and 3) we created our own algorithm to unearth projects in countries and years that were not discovered in steps 1 and 2, and then validated these newly-discovered records through the double-verification method. These steps are illustrated together in Fig. 1.
Step 1) Compiling existing doubly verified datasets. Three Figure 1 shows the total number of records considered from each of these sources, as well as the final dataset size. While thousands of loans are considered from the various input datasets, and an additional 146 were discovered through our in-house news collection algorithm, the double-verification method narrows that universe to the 862 validated projects included in the final dataset.
Step 2) Incorporating records from datasets without double-verification. We expand on these previous databases by applying the double-verification standard to Chinese development finance records from other existing databases. These datasets include: The double-verification method is explained in greater detail in the technical validation section, below.
Step 3) Identification of gaps and dataset completion through algorithmic data collection. Combining the existing datasets yields a patchwork of coverage, as Table 1 shows. Even if all of these observations could be validated, significant gaps would remain. To fill these gaps, and to ensure inclusion of projects that were not captured by existing efforts, we complemented these sources with our own independent data gathering.
As of 2019, there is no single source of truth for historical news access online. There are many public news aggregators, such as Google News and Apple News, which provide free or low-cost news searches. These aggregators also provide "real-time" feeds of news for users in their native language, with a preference for local results. For example, a search for "baseball scores" performed in Boston will yield different results than the same search performed in San Diego; these would yield scores for the Red Sox and the Padres, respectively. Many such news aggregators also provide historical searches, allowing users to select date ranges for a search term. Since these news aggregation services rely upon a combination of scraping web-based news sources and news licensing agreements, the quality and availability of these historical searches depend on the publication quality and availability of source news [30][31][32][33]. In addition, there are private companies that provide similar access to historical news, often sourcing news through distribution agreements. These, too, vary widely in quality and breadth of coverage. They include webhose.io, the GDELT project 34, Lexis Nexis Uni, and Factiva, among many others.
www.nature.com/scientificdata
In political science, economics and other social sciences, it is now commonplace to use digital media and online news databases for data analysis [35][36][37][38]. Indeed, the prior work on Chinese development finance mentioned above uses online news sources 39. However, there remain a number of open questions about how to use online news databases and how to select sources for validity and reliability. For one, relying on only one source may introduce any number of data issues. In prior work, Blatchford (2020) explores the potential methodological weaknesses of using a single online news database for analysis 40. Other issues may include discrepancies between news database sources 41,42, the possible gatekeeping nature of news aggregators, and inconsistent or incomplete coverage 43. We mitigate these issues through triangulation among multiple database sources, first-level human validation, and the subsequent double verification. This blending of manual and algorithmic methods typically yields superior results 44,45.
In order to algorithmically collect data, news aggregators and historical news services typically provide an Application Programming Interface (API) to facilitate programmatic access. This allows many thousands of individual searches (e.g. "China Development Bank loan" in "English" for "August-1-2015" yielding 50 results) to be performed by an algorithm. An algorithm collecting this data may take several hours to days to collect the entire corpus of search results for complex projects but will surpass a manual approach, which would take significantly longer and may introduce human errors 46 .
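The scale of such a search plan can be sketched as follows. This is a minimal illustration and not the authors' collection code: the search terms, the language list, and the `build_queries` helper are assumptions made for the example, and no real news API is called.

```python
from datetime import date, timedelta
from itertools import product


def daily_dates(start: date, end: date):
    """Yield every calendar day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def build_queries(terms, languages, start, end, max_results=50):
    """Return one search specification per (term, language, day) combination."""
    return [
        {"q": term, "lang": lang, "date": day.isoformat(), "count": max_results}
        for term, lang, day in product(terms, languages, daily_dates(start, end))
    ]


# Hypothetical inputs: two search terms, two languages, one month of daily searches.
terms = ['"China Development Bank" loan', '"Export-Import Bank of China" loan']
languages = ["en", "es"]
queries = build_queries(terms, languages, date(2015, 8, 1), date(2015, 8, 31))

print(len(queries))  # 2 terms x 2 languages x 31 days = 124 searches
print(queries[0])
```

Even this small grid produces more than a hundred searches for a single month, which is why the number of searches grows into the thousands across twelve years, many countries, and several languages.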
To select the news database sources, we chose sources that provided an API and then tested them as follows. Each was tested using search terms such as '"China Development Bank" +loan' to try to "organically" discover the news articles. This method is essential to verify that the approach will scale appropriately to other terms and still yield relevant results. To elaborate, we do not want to simply search for test terms such as '"China Development Bank" +"Thar Energy" +2018 +330 MW', as these return the desired results and imply 100% coverage. Rather, our goal is to identify terms that will maximize accuracy and coverage while also reducing the cost (most importantly, the human-hours required to verify and vet the data resulting from the algorithmic data collection). For example, using broad search terms, we encountered over 22,000 results from one year for one country. This would require over 360 hours of manual assessment at one minute per article. Our goal was to adjust the data collection parameters and post-process the collected data to reduce false positives and duplicates, so that the human time required is minimized without a drop in accuracy or coverage. We used an established dataset of 2018 Energy Financing Projects as a benchmark against which to measure the coverage of each news database.
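The two quantities traded off here, manual review time and benchmark coverage, can be made concrete with a short sketch. The function names and the project identifiers are hypothetical; only the 22,000-article figure and the one-minute-per-article rate come from the text.

```python
def review_hours(n_articles: int, minutes_per_article: float = 1.0) -> float:
    """Estimate the human time needed to read every collected article."""
    return n_articles * minutes_per_article / 60.0


def coverage(benchmark_ids: set, discovered_ids: set) -> float:
    """Share of benchmark projects that a news source managed to surface."""
    if not benchmark_ids:
        raise ValueError("benchmark must contain at least one project")
    return len(benchmark_ids & discovered_ids) / len(benchmark_ids)


# Figure from the text: 22,000 results for one country-year at one minute each.
hours = review_hours(22_000)
print(round(hours, 1))  # 366.7 hours, consistent with "over 360 hours"

# Hypothetical benchmark of four known projects and the projects one source found.
benchmark = {"P01", "P02", "P03", "P04"}
discovered = {"P01", "P03", "P04", "P99"}
share = coverage(benchmark, discovered)
print(share)  # 0.75: three of the four benchmark projects were found
```

Projects discovered outside the benchmark (here `P99`) do not raise the coverage score; they only add to the review workload.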
For this project, two data sources were selected and used in order to maximize data coverage. The first, Azure Cognitive Webservice 47, is provided by Microsoft, is accessed through an API, and has excellent coverage overall. Its weakness is that this coverage is strongest for English-language articles. To supplement it, we use the GDELT project 34, which has much greater coverage, especially for non-English languages, but yields significantly more false positive matches. These false positives must be vetted manually, so a primary reliance on this data source would be untenable. A manual sample of 500 GDELT results shows that more than 85% are not relevant to the project and are indirect references. By contrast, the Azure News service automatically sorts results by relevance; of the first 250 results, 95% are relevant to the project. These two databases combined provided over 95% coverage of known projects.
Our scraping algorithm collected over one million records. Additional algorithms further reduced the data by removing duplicates, poor news sources, and articles that did not mention the key terms above in close proximity to one another. This process yielded 98,978 records. These were first manually scanned for relevancy and further reduced. Then, researchers manually read and checked each of the remaining records against existing records already included in the datasets listed above, resolving any conflicts in project attributes. Where conflicts arise between verification sources, we give government sources top priority, followed by academic sources, civil society sources, and private press sources.
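The de-duplication and proximity filters described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the published code: the record fields (`id`, `title`, `text`), the ten-word window, and the choice to de-duplicate on normalized titles are all hypothetical, and the source-quality filter is omitted.

```python
import re


def tokens(text: str):
    """Lower-case a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_positions(words, phrase: str):
    """Return the start index of every occurrence of a phrase in a token list."""
    target = tokens(phrase)
    n = len(target)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == target]


def terms_in_proximity(text: str, term_a: str, term_b: str, window: int = 10) -> bool:
    """True if the two terms start within `window` words of each other."""
    words = tokens(text)
    pos_a = phrase_positions(words, term_a)
    pos_b = phrase_positions(words, term_b)
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)


def filter_records(records, term_a: str, term_b: str, window: int = 10):
    """Drop repeated titles, then keep records whose key terms are close together."""
    seen, kept = set(), []
    for rec in records:
        key = " ".join(tokens(rec["title"]))
        if key in seen:
            continue  # duplicate title already processed
        seen.add(key)
        if terms_in_proximity(rec["text"], term_a, term_b, window):
            kept.append(rec)
    return kept


# Hypothetical records: one relevant article, one duplicate, one indirect reference.
records = [
    {"id": 1, "title": "CDB signs loan for power plant",
     "text": "China Development Bank agreed a loan of $300 million for the plant."},
    {"id": 2, "title": "CDB Signs Loan for Power Plant",
     "text": "China Development Bank agreed a loan of $300 million for the plant."},
    {"id": 3, "title": "Bank news roundup",
     "text": "China Development Bank opened an office. "
             + "Unrelated filler sentence. " * 10
             + "Separately, a student loan scheme was announced."},
]
kept = filter_records(records, "China Development Bank", "loan")
print([rec["id"] for rec in kept])
```

In this sample only the first record survives: the second repeats its title, and in the third the bank name and the word "loan" are far apart, which is the pattern of an indirect reference.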
Steps 4-6) Technical validation. After we compiled these records, we subjected them to a multi-layered process of technical validation, described in more detail in the technical validation section, below. These validation stages apply harmonized definitions across the entire dataset, with particular focus on validation of finance commitment attributes (the data records described in the following section) and project location.
As many other authors have noted, this subject area is characterized by very low transparency 9,11,15. Thus, while our method of double verification precludes the possibility of over-counting, it is still possible that some projects may be omitted due to insufficient public information. For this reason, all of the above steps are repeated annually, and each update covers all years through the year prior to the update, in order to find projects that may not be reflected in public records until several years after financing is signed. In doing so, we join previous scholars including Brautigam et al. (2020) and Horn, Reinhart, and Trebesch (2019) in recognizing the importance of regular updates for maximum transparency in this intrinsically opaque field 8,15.

Data Records
The following information is gathered for inclusion in the final dataset.
Project index. This unique code differentiates projects from each other and corresponds to observations mapped in the accompanying shapefile.
Project name (English). Projects are named in English, giving priority to including reference to local place names where possible.
Country name and ISO. In the vast majority of observations, commitments are signed with individual countries, which are reflected here by name and by ISO 3166-1 alpha-3 codes. Thirteen additional finance commitments went to regional multilateral bodies, such as the Development Bank of Central Africa and the African Export-Import Bank.
As Fig. 2 shows, Chinese development finance is distributed widely across the world. Figure 2a shows the geographic footprints of the dataset, with special detail in three regions of particularly heavy representation: northern South America and the Caribbean; Sub-Saharan Africa; and Southeast Asia. Figure 2b shows national totals for finance commitments, and demonstrates that despite the broad coverage, a few countries comprise the bulk of the records. In fact, the top 10 recipients, labeled in Fig. 2b, comprise $277 billion in finance commitments, or 60 percent of the total.
Borrower. All of the commitments here are to governments or entities wholly or partially owned by governments, including state-owned enterprises (SOEs), multilateral organizations, sub-national governments, and public-private partnerships. Where entities other than national governments are known, these are listed individually. Otherwise, all national government bodies (including ministries of finance, central governments, and other specifications) are listed as "national government," for the sake of comparability among projects.

Borrower category. Borrowers are shown in one of five categories, or combinations thereof:
• National government
• Public-private partnership
• Regional body
• State-owned entity
• Sub-national government
Year signed. This column corresponds to the year that loan agreements were signed. It is important to note that this year does not necessarily correspond to the year when project construction was begun or completed. In many cases, financing is secured relatively late in the project cycle, long after letters of intent (LOIs) or memoranda of understanding (MOUs) are signed. In other cases, the reverse is true, and financing is secured well before a contractor is selected and final plans drawn up. Because the central actors in this data are the CDB and ExImBank, we reflect the year when they committed to financially support a project.
Lender. The finance institution is shown as CDB or ExImBank. In five cases, co-financiers are shown, including the Bank of China, Citic, and the Industrial and Commercial Bank of China. In these cases, it was impossible to disaggregate the finance commitment among these lenders.

Amount. In almost all cases, this amount shows the total committed by CDB and ExImBank, in millions of USD. However, in five cases of joint projects with other lenders, it was impossible to differentiate between the finance provided by these two and their co-financing partners. Those projects are noted with an asterisk in the next variable, Total Includes Co-Financing.
Flag for Co-Financing. The five projects for which the listed finance commitments include co-financing from other banks (as described above) are indicated here.

Sector. Projects are classified into one of ten major sectors, as follows.
• Agriculture/food: This category includes agriculture, fishing, and agricultural processing.
• Extraction/pipelines: This category includes mining, drilling, and pipeline transportation of extracted products. They are combined into one category because of the frequency of finance commitments that include both sub-sectors.
• Government: This category includes central bank operations (including foreign reserves), education, emergency response, environmental projects, government office construction, healthcare, national geological surveys, public housing, postal services, security, and national or regional development bank support.
• Manufacturing: This category includes refineries, smelters, and factories.
• Power: This category includes energy generation and distribution projects, which are further classified by the sub-sector or fuel (coal, gas/LNG, hydropower, nuclear, oil, solar, wind, distribution, other).
• Telecommunications: This category includes television, radio, wired and wireless communications networks, fiber optics and broadband data networks, satellite communications, and digitization and electronic dissemination of government services and records.
• Water/wastewater: This category includes irrigation, potable water, wastewater, and sewage projects.
• Other construction: This category includes industrial parks, business districts, shopping centers, athletic centers, conference centers, and general infrastructure commitments.
• Multi-sector/discretionary: This category includes finance commitments without specified purposes (for general budgetary support) or for projects that cross the sectors listed above.
As Table 2 shows, three sectors account for the bulk of China's overseas development finance: transportation, extraction, and energy. Commitments in these three sectors came to $336 billion, or 72% of the total. A fourth important category of finance commitments includes those that cross multiple sectors or are discretionary (and able to be used as the borrower sees fit). Each of these sectors also includes trade finance for government purchases of related equipment, machinery, or vehicles.
Precision level. Following AidData 11,12, we indicate the level of precision of our location data. As described above, our use of these codes differs from previous datasets, as follows:
1. Exact project footprint
2. Within 25 km, based on sources labeling a project as "near" a mapped place. This category also includes projects known at the second-order administrative division where that division has a radius of less than 25 km.
3. Second-order administrative division, such as municipality or county
4. First-order administrative division, such as state or province
5. Spanning multiple first-order administrative divisions
6. Country
7. Unknown
As Table 3 shows, the overwhelming majority of projects are located at the exact or "near" level: over 80% of commitments accounting for roughly 90% of committed finance.
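The precision codes lend themselves to a simple enumeration, and the two shares reported from Table 3 (by number of commitments and by committed finance) can be computed from it. The sketch below is illustrative only: the enum member names and the sample records are hypothetical and do not reproduce the dataset's values.

```python
from enum import IntEnum


class Precision(IntEnum):
    """Location precision codes as listed in the text (member names are assumed)."""
    EXACT_FOOTPRINT = 1
    WITHIN_25_KM = 2
    SECOND_ORDER_ADMIN = 3
    FIRST_ORDER_ADMIN = 4
    MULTIPLE_FIRST_ORDER_ADMIN = 5
    COUNTRY = 6
    UNKNOWN = 7


def site_level_shares(records):
    """Return (share of commitments, share of finance) at the exact or 'near' level.

    `records` is an iterable of (precision_code, amount_in_millions_usd) pairs.
    """
    records = list(records)
    site = [(p, a) for p, a in records if Precision(p) <= Precision.WITHIN_25_KM]
    count_share = len(site) / len(records)
    amount_share = sum(a for _, a in site) / sum(a for _, a in records)
    return count_share, amount_share


# Hypothetical sample: three site-level commitments and two coarser ones.
sample = [(1, 500.0), (1, 300.0), (2, 100.0), (4, 60.0), (7, 40.0)]
count_share, amount_share = site_level_shares(sample)
print(count_share, amount_share)  # 0.6 of commitments, 0.9 of finance
```

The two shares can differ substantially, as in this sample, whenever larger commitments are more likely to be precisely located than smaller ones.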
The resulting records are available in two datasets (project attributes and project locations) at the Open Science Framework repository (https://doi.org/10.17605/OSF.IO/GFWHJ). The former file, project attributes, includes all variables except for each project's location, and is freely available in CSV and XLSX formats. The latter dataset, project locations, is freely available upon completion of the data use agreement, which is also available in the same repository. The project locations dataset includes separate files for point-based, linear, and polygonal projects, in Esri shapefile (SHP) format 48.
In addition, two sets of code are available at the same repository. News aggregation code is available in Python and geolocation code (querying Google Maps and Open Street Maps APIs) is available in R, upon completion of the same data use agreement mentioned above 48 .
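For users who want to sanity-check the attribute file after download, the field definitions in this section translate into a few simple checks. The sketch below is an assumption-laden illustration: the class, its field names, and the `problems` helper are hypothetical and do not mirror the published column names, and the $25 million floor is the inclusion threshold described in the technical validation section.

```python
from dataclasses import dataclass
from typing import List, Optional

BORROWER_CATEGORIES = {
    "National government",
    "Public-private partnership",
    "Regional body",
    "State-owned entity",
    "Sub-national government",
}


@dataclass
class Commitment:
    project_index: str
    country: str
    iso3: Optional[str]             # None for regional multilateral bodies
    borrower_categories: List[str]  # one category or a combination
    year_signed: int
    lender: str
    amount_musd: float              # millions of USD
    precision: int                  # 1 (exact footprint) to 7 (unknown)


def problems(c: Commitment) -> List[str]:
    """Return a list of human-readable issues; an empty list means the record passes."""
    found = []
    if c.iso3 is not None and not (len(c.iso3) == 3 and c.iso3.isalpha() and c.iso3.isupper()):
        found.append("iso3 is not a three-letter uppercase code")
    if not c.borrower_categories or any(b not in BORROWER_CATEGORIES for b in c.borrower_categories):
        found.append("borrower category is missing or unrecognized")
    if not 2008 <= c.year_signed <= 2019:
        found.append("year signed is outside 2008-2019")
    if c.amount_musd < 25:
        found.append("amount is below the 25 million USD floor")
    if c.precision not in range(1, 8):
        found.append("precision code is outside 1-7")
    return found


# Hypothetical records for illustration only.
good = Commitment("X001", "Country A", "AAA", ["State-owned entity"], 2015, "CDB", 330.0, 1)
bad = Commitment("X002", "Country B", "bb", ["National government"], 2021, "ExImBank", 12.0, 3)
print(problems(good))  # []
print(problems(bad))
```

The ISO check only tests the shape of the code, not whether it is an assigned ISO 3166-1 alpha-3 value.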

Technical Validation
This dataset relies on two types of technical validation: ensuring the accuracy of (1) project attributes and, where applicable, (2) their geographic locations.
Project attribute validation: the double-verification method. Existing sources for Chinese overseas development finance rely on a variety of verification standards. The present dataset extends the most stringent approach of the existing "double verification" methods pioneered by the China Africa Research Initiative at the Johns Hopkins University School of Advanced International Studies (SAIS-CARI) to create a harmonized, global standard.
The double verification method is based on academic literature showing a tendency to overstate, rather than understate, finance commitments. For example, Ebeke and Ölçer 49 show that major infrastructure projects are often timed for announcements to coincide with political campaigns. Regional case studies 9,50 show patterns of planners avoiding the publication of projects' environmental and social risks, but simultaneously maximizing the visibility of the projects and their financial commitments, often before they are finalized. For this reason, earlier datasets have struggled to correctly identify and exclude projects that have been publicized but never materialized, resulting in sometimes significant over-estimations 51 .
The possibility of under-counting remains. As Horn, Reinhart, and Trebesch (2019) 15 point out, in reference to "hidden" Chinese finance, many overseas Chinese loans are never fully disclosed. For this reason, we cast the widest possible net for financing commitments and then narrow those findings by applying the standard of double-verification. It is for this reason also that we perform annual updates, and in each update include previous years' data, in order to include any additional projects that may not have been disclosed until a much later date.
Our aim is to provide the most strongly evidence-based data possible, in order to support a more empirically grounded understanding of Chinese overseas development finance. Erring on the side of caution, double verification admittedly yields a more conservative set of estimates, but it grants all scholars and stakeholders confidence that every record in the dataset does indeed exist.
Without public reporting by CDB and ExImBank of their lending operations, we are limited to reporting by government (and government-affiliated) sources, academic, civil society, and press reports. The system of double verification ensures accuracy in this context, requiring agreement on the core characteristics of each loan agreement between at least one Chinese source and at least one international source.
For China-side verification, we rely on official and quasi-official sources associated with the Chinese government or Chinese Communist Party. We include the following sources: For international verification, we rely similarly on government reports, supplemented with academic, civil society, and private press reports. As mentioned above, when differences emerge among sources, we resolve these conflicts by giving government sources top priority, followed by academic sources, civil society sources, and private press sources. Government press sources, such as the Chinese sources listed above, are given the weight of government sources. This method coincides with that of other datasets with double verification 7,8,21 .
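The verification rule and the source-priority ordering can be sketched as follows. This is a simplified illustration under stated assumptions: treating lender and borrower country as the "core characteristics" is an assumption made for the example, and the `Report` fields and all sample values are hypothetical.

```python
from dataclasses import dataclass

# Lower number means higher priority; government press counts as "government".
PRIORITY = {"government": 0, "academic": 1, "civil_society": 2, "press": 3}


@dataclass(frozen=True)
class Report:
    origin: str        # "china" or "international"
    source_type: str   # one of the PRIORITY keys
    lender: str
    country: str
    amount_musd: float


def is_double_verified(reports) -> bool:
    """True if a Chinese and an international report agree on lender and country."""
    china = {(r.lender, r.country) for r in reports if r.origin == "china"}
    intl = {(r.lender, r.country) for r in reports if r.origin == "international"}
    return bool(china & intl)


def resolve_amount(reports) -> float:
    """When sources disagree, take the amount from the highest-priority source."""
    return min(reports, key=lambda r: PRIORITY[r.source_type]).amount_musd


# Hypothetical reports about one commitment, with a conflict over the amount.
reports = [
    Report("international", "press", "CDB", "Country A", 480.0),
    Report("china", "government", "CDB", "Country A", 500.0),
]
print(is_double_verified(reports))      # True: both sides confirm lender and country
print(resolve_amount(reports))          # 500.0: the government source outranks the press
print(is_double_verified(reports[:1]))  # False: no China-side confirmation
```

A record with reports from only one side, however numerous, fails the check, which is what makes the standard conservative.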
Because of the stringency of the double-verification standard used here, we exclude the smallest finance agreements (those below $25 million). Excluding these low-level loans necessarily involves a small degree of under-counting. For example, Brautigam et al. (2020) 8 show that loans of less than $25 million each comprise just $389 million in total commitments, out of a total of $148 billion in financing commitments by CDB and ExImBank between 2008 and 2018 in Africa: approximately 0.26% of the total. However, including these loans would introduce significant geographic bias toward countries with particularly transparent governments and open media environments. As the purpose of the present effort is to enable more reliable geospatial analysis, the inclusion of this additional activity was not deemed worth the cost to the reliability of analysis using it.
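The size of the excluded share follows directly from the two cited totals, and the same calculation can be applied to any list of commitment amounts. In this sketch the helper name and the sample amounts are hypothetical; the $389 million and $148 billion figures are those cited in the text.

```python
def share_below_floor(amounts_musd, floor_musd: float = 25.0) -> float:
    """Share of total finance held by commitments smaller than the floor."""
    total = sum(amounts_musd)
    if total <= 0:
        raise ValueError("total finance must be positive")
    small = sum(a for a in amounts_musd if a < floor_musd)
    return small / total


# Cited figures, both expressed in millions of USD.
cited_share = 389 / 148_000
print(f"{cited_share:.2%}")  # 0.26%

# Hypothetical list of commitments, in millions of USD.
sample = [10.0, 20.0, 70.0, 400.0, 1500.0]
sample_share = share_below_floor(sample)
print(f"{sample_share:.1%}")  # 1.5%
```

On the cited figures the excluded loans amount to roughly a quarter of one percent of total finance, which is the under-count being traded for reduced geographic bias.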
It is worth comparing these results to those of other datasets for context. Among other independent datasets of Chinese lending, only AidData 11,12 and Horn, Reinhart, and Trebesch 15 have global coverage, and of those two, only AidData differentiates by lender, allowing a strict comparison. As Fig. 1 shows, AidData includes $463 billion in policy bank loans between 2008 and 2014 that would meet the standard for inclusion in the present dataset if they could be validated. However, in that same time period, our methodology found that only $271 billion of loans could pass the validation standards introduced here.
This process of double-verification results in a dataset that excludes some countries that appear in other datasets. For example, in the case of four countries, this process resulted in the present dataset having no loans listed, even though CDB and/or ExImBank loans appear in AidData, the largest global dataset, with loans that would qualify for inclusion here if they could be validated. Those four are: Central African Republic (for which we were unable to find double verification for the Boali No. 3 hydropower plant project), Dominica (for which we were unable to double verify the source of the loan for rehabilitation of State College), Turkey (whose Turk Telecom was privatized before the loan listed in AidData), and Yemen (for which we were unable to find Chinese validation for the Bajal cement factory project). In addition to these four countries, three others are included in AidData but with no loans of $25 million or more: Burundi, Colombia, and Sierra Leone.
As with other researchers in this space 7,8,21, we understand that individual projects within such funds can be hidden from public view until the line of credit or framework agreement is renewed or laid down unused. Thus, we include such financing agreements when they are initially drawn up, but then withdraw them from subsequent updates if it comes to light that they were unused. If they are renewed, as lines of credit frequently are, such renewals do not represent new financing but simply a relaxation of the time period for use of the original commitment. For this reason, renewals are not considered separately.
Finally, not all projects in this dataset have been completed as of this writing. We have removed all projects that have been publicly cancelled, but ongoing projects with active financing commitments remain, even if construction has not yet begun or has been suspended. For this reason, we refer to each observation as a commitment or agreement, rather than a loan. Funds may or may not have been disbursed as of this writing, but commitments have been made and remain valid. In all, this double-verification process resulted in a final dataset of 857 finance commitments in 93 countries from 2008 through 2019.

Location validation. Of the 857 finance commitments in the final dataset, 664 have a geographic footprint of some type. These projects (encompassing agriculture, extraction, manufacturing, utilities, infrastructure, and other installations) were located according to the following procedure.
Several of the existing datasets listed above include the location of financed projects: AidData, CSIS, Dayant and Pryke, and the World Bank 11,13,14,26 . Among these datasets, CSIS' Reconnecting Asia merits special mention, as it displays project locations through embedded Google Maps. For projects originating in this dataset, we queried CSIS for the coordinates in these maps (using code available in R as CSIS_to_coord_str.R on the project repository). For these observations, we used these reported locations as initial estimates, to be visually validated thereafter. For energy projects not listed in these project datasets, we used the following sources for initial estimates of project locations: For other observations, we developed an API to query Google Maps for the locations of each (available in R as GoogleMaps_OSM_API_query.R on the OSF project repository).
For all observations (those included in previous geolocated datasets, those located through querying Google Maps and Open Street Maps, and those with no query response) we validated the locations visually through the use of Google Maps, Open Street Maps, and Open Route Services, as shown in Fig. 3 below.
This process significantly raises the bar that projects must meet to be reported as having a precise location, in comparison to previous geocoded datasets. For example, AidData allows projects to be reported at the most precise location category based on the precise boundaries of an area of uncertainty around a project (including populated places or the political seats of geographic areas) rather than the precise point or boundaries of the true project site(s). The resulting high-precision category includes 579 sovereign finance commitments by CDB and ExImBank identified by AidData during our period of study, of which only 105 geotags are associated with specific sites of projects. The remaining projects' locations are defined by the administrative division or the political seats thereof. This is in contrast to the more stringent precision classification scheme in our dataset. Projects marked with a precision code of "1" in the present dataset have all been visually located as site-specific project footprints. The introduction of this new level of precision allows for linear and polygonal projects to be represented with their complete footprints, rather than representative points, which enables a more thorough analysis of environmental risks and impacts, including, for example, the impacts of the entire length of a highway or the entire area of a mine. Analysts using this dataset will be able to avoid the under-estimation of environmental impacts necessarily introduced by relying on representative points. Our first such analysis uses these precise footprints to compare location-based social and ecological risks of Chinese overseas development finance to those of World Bank projects, based on their proximity to the boundaries of national protected areas, possible critical habitats, and indigenous territories 48.
The dataset also supports holistic environmental analysis of interconnected networks of projects, based on their collective footprints. Yang et al. (2021) use these collective footprints to examine the environmental and social sensitivity of Chinese overseas development finance locations, and find that the total footprint is significantly more concentrated in sensitive territory than that of World Bank projects during the same time period 55.
To accurately reflect the variety of types of footprints across various types of finance projects, we classified each geolocated observation as a point (or collection of points), line (or collection of discontinuous lines), or polygon (or collection of discontinuous polygons). Points are used for individual buildings or installations. Lines are used for linear infrastructure including roads, rails, power distribution, wired communications networks, and pipelines. Polygons show projects with footprints that are larger than single buildings or installations, with well-defined boundaries, including dam reservoirs, oil and gas fields, and clusters of buildings such as housing or stadium complexes. The distribution of projects among footprint types is listed in Table 4.
A few examples merit further explanation regarding their classification of footprint type. First, wind farms are comprised of turbines along access roads; to accurately show the total geographic footprints, we show them as linear infrastructure comprised of their access roads. In addition, projects with lower levels of geographic precision (at the national level or first/second-level administrative division level) are shown as polygons that encompass these areas, showing the municipal, provincial, or national boundaries 48 .
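The classification rules above can be summarized in a small decision function. This is a sketch of the stated rules rather than the authors' procedure, which relied on visual validation: the project-kind labels are hypothetical, and treating precision code 5 (spanning multiple first-order divisions) as an administrative polygon is an assumption.

```python
from typing import Optional

# Kinds drawn as lines; wind farms are represented by their access roads.
LINE_KINDS = {"road", "rail", "power distribution", "wired communications",
              "pipeline", "wind farm"}
# Kinds with well-defined boundaries larger than a single installation.
POLYGON_KINDS = {"dam reservoir", "oil and gas field", "housing complex",
                 "stadium complex"}


def footprint_type(kind: str, precision: int) -> Optional[str]:
    """Return "point", "line", "polygon", or None when the location is unknown."""
    if precision == 7:
        return None       # unknown location: no geometry
    if 3 <= precision <= 6:
        return "polygon"  # shown as the enclosing administrative boundary
    if kind in LINE_KINDS:
        return "line"
    if kind in POLYGON_KINDS:
        return "polygon"
    return "point"        # individual building or installation


print(footprint_type("wind farm", 1))      # line
print(footprint_type("hospital", 1))       # point
print(footprint_type("dam reservoir", 2))  # polygon
print(footprint_type("road", 4))           # polygon (province-level precision)
print(footprint_type("road", 7))           # None
```

Note that precision takes precedence over project kind: a road known only to the provincial level is shown as the province's polygon, not as a line.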

Code availability
Two sets of code are available in conjunction with the resulting data, at https://doi.org/10.17605/OSF.IO/GFWHJ. News aggregation code is available in Python. Geolocation code (querying Google Maps and Open Street Maps APIs) is freely available, in R, upon completion of the data use agreement, which is also available in the same repository 48.
Table 4. Footprint types.