Assembling real networks from synthetic and unstructured subsets: the corporate reporting case

The analysis of interfirm business transaction networks provides invaluable insight into the trading dynamics and economic structure of countries. However, there is a general scarcity of data available recording real, accurate and extensive information for these types of networks. As a result, and in common with other types of network studies - such as protein interactions for instance - research tends to rely on partial and incomplete datasets, i.e. subsets, with less certain conclusions. Here, we make use of unstructured financial and corporate reporting data in Japan as the base source to construct a financial reporting network, which is then compared and contrasted to the wider real business transaction network. The comparative analysis between these two rich datasets - the proxy, partially derived network and the real, complete network at macro as well as local structural levels - provides an enhanced understanding of the non trivial relationships between partial sampled subsets and fully formed networks. Furthermore, we present an elemental agent based pruning algorithm that reconciles and preserves key structural differences between these two networks, which may serve as an embryonic generic framework of potentially wider use to network research, enabling enhanced extrapolation of conclusions from partial data or subsets.

www.nature.com/scientificreports
Firstly, FRn is built by automated text scanning of more than 3 million files within the first database, followed by a filtering method that eliminates false relationships. We then carry out a comparative data analysis of the FRn against the actual, real IBTn 6,7, in which structural similarities and differences between these networks are highlighted at both macro and local levels. Finally, we develop an elemental agent based pruning algorithm that reduces the much larger IBTn to a subset that is similar to the FRn both in terms of size and certain structural features.
By following the above steps, we address both a specific and a generic motivation for this research. From a narrow perspective, we intend to establish whether unstructured financial reporting data can be used by researchers as well as market agents as the basis to generate a synthetic IBTn where data for the latter is not available.
Unlike Japan, other countries do not benefit from years of data collection work by private firms that enables direct access to the real 'Interfirm Business Transaction Network', or IBTn. Therefore, a method that enables the generation of a structurally similar IBTn can provide significant benefits to researchers and analysts studying the dynamics of the trading networks within a selected country, or an economic block, for example.
From a wider perspective, we propose an embryonic generic framework that provides further insight into the effects and potential implications of incomplete and noisy network data, an important topic for networks in general, such as biological and social networks 8,9. Data collection for these networks tends to be carried out partially, through non-homogeneous processes, with contributions added over time from a number of different sources, distinct research efforts and varied methods. Essentially, these processes in aggregate may be seen as part of incrementally evolving dynamics which are conditioned by each earlier collection exercise. From such a perspective, it is possible to regard financial reporting data generation as a basic evolutionary process that may not be too distinct from those of other networks.
Previous works on the real Japanese trade network 6,10-12 are centred on the system dynamics and evolutionary aspects of the network. In contrast, this research is more concerned with the generation and formation of networks from large scale unstructured data 13, and the interaction between complete versus partial (proxy) network datasets and subsets 14,15. Furthermore, the work is also interested in comparing the results of high dimensional space measurement methods 16,17 to those of other more traditional network techniques when applied to large networks 18.

Results
Our results are presented in four distinct sections, in line with the structure described within the Introduction section above, namely: (a) the construction of the 'Financial Report Network' (FRn), (b) the comparative network data analysis, (c) the agent based pruning algorithm and (d) network comparison on high dimensional space. Whereas specific inferences are presented within this section, broader conclusions are included within the next section, 'Conclusions and Discussion'.
Formation of the financial report network. As described, FRn is built solely upon automated electronic scanning of all company names within the financial reporting and corporate credit reports database. For every instance where another company name is found within a given report, a relationship between the citing (i.e. reporting) and cited companies is recorded as an edge within the network.
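The scanning step can be illustrated with a minimal sketch. It assumes that reports are available as plain-text strings keyed by the ID of the reporting company and that a name-to-ID lookup exists; the names `build_citation_edges`, `reports` and `name_to_id` are hypothetical, and simple substring matching stands in for whatever text scanning was actually used.

```python
def build_citation_edges(reports, name_to_id):
    """Return directed (citing_id, cited_id) edges found by name scanning.

    reports:    dict mapping reporting company ID -> report text (assumed layout)
    name_to_id: dict mapping company name -> company ID (assumed layout)
    """
    edges = set()
    for citing_id, text in reports.items():
        for name, cited_id in name_to_id.items():
            # Skip self-citations: a report naturally names its own company.
            if cited_id != citing_id and name in text:
                edges.add((citing_id, cited_id))
    return edges


# Toy data, invented for illustration only.
reports = {
    1: "Main supplier is Beta Steel; exports to China are growing.",
    2: "Sells components to Alpha Motors.",
}
name_to_id = {"Alpha Motors": 1, "Beta Steel": 2, "China": 3}
edges = build_citation_edges(reports, name_to_id)
```

In this toy input the edge (1, 3) is a false relationship of the kind discussed next: the word 'China' refers to the country, yet it matches a company name.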
Such a data mining process, however, may generate incorrect, or false, edges, due to generic names being wrongly interpreted as company names (i.e. nodes), as described in detail within the Methods section. Such edges are largely concentrated within very few nodes with essentially predictable names. Unsurprisingly, the effect is most pronounced at the tail end of the node degree distribution. This effect can be clearly observed in the plots within Fig. 1. In plot (a), the orange dots, representing the unfiltered, raw FRn network, reach very large values and produce a slope which is not consistent with that of the intersection with the real trade network, IBTn, as observed within plot (b). Once the filtering procedure is applied, however, FRn produces a slope consistent with IBTn, as observed in Fig. 2. Most importantly, the overall effectiveness of the filtering procedure is substantiated by the fact that the cumulative degree distribution resulting from the intersection of edges between FRn and IBTn is substantially preserved before and after filtering, as seen within the right side plot. Effectively, this means that very few known existing nodes are incorrectly removed during the process. Whereas only the 2018 results are presented here, similar results were obtained for the years 2015, 2016 and 2017.
Comparative network data analysis. The cumulative degree distributions of firms as a function of their respective number of trading links, are set out for both networks, FRn and IBTn, in Fig. 2. Each large frame shows the link distributions among all companies within Japan for a given calendar year. By contrast, each inset relates exclusively to link distributions amidst companies within a selected prefecture.
It is noticeable that the slope coefficient of circa −1.5 7 is essentially similar for both networks across all studied years. However, whereas IBTn maintains its slope at prefecture level, thereby indicating a self-similar structure, FRn descends into a steeper grade for smaller prefectures. Such a difference in behaviour can be attributed to the small size effect, since the number and size of firms within these prefectures are much smaller.
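As an illustration of how such a slope can be read off an edge list, the sketch below computes a cumulative degree distribution and fits a straight line in log-log space. The ordinary least-squares fit is an assumption made here for simplicity (the text does not state how the slope was estimated), and the function names are hypothetical.

```python
import math
from collections import Counter


def cumulative_degree_distribution(edges):
    """Return sorted (k, P(K >= k)) pairs for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    counts = Counter(degree.values())
    points, remaining = [], n
    for k in sorted(counts):
        points.append((k, remaining / n))  # fraction of nodes with degree >= k
        remaining -= counts[k]
    return points


def loglog_slope(points):
    """Ordinary least-squares slope of log P against log k (assumed estimator)."""
    xs = [math.log(k) for k, _ in points]
    ys = [math.log(p) for _, p in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy network: node 0 is a hub linked to nodes 1..4, plus one edge 1-2.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
ccdf = cumulative_degree_distribution(edges)
slope = loglog_slope(ccdf)
```

Restricting `edges` to firms within one prefecture before calling the functions would give the per-prefecture counterpart shown in the insets.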
Whereas on a macroscopic, aggregated degree distribution level, both networks are very similar, upon other measurements, a more nuanced picture exists, and some differences start to emerge at structural level.
Firstly, as observed in the turquoise and blue circles within plot (a) in Fig. 3, whereas a giant connected component ('GCC') 19 is observed in both networks, IBTn is almost fully connected, with 99.7 per cent of the nodes within such component and all other clusters containing no more than five nodes. In contrast, FRn is much more fragmented, containing around 19 thousand isolated clusters, with the GCC representing circa 91 per cent of the total nodes. Both networks, however, present a similar power law scaling factor of around −3.2 for their respective cluster distributions.
Agent based pruning algorithm. Through the pruning algorithm described in the Methods section, we have carried out a total of 100 individual simulations where IBTn is trimmed down into a subset with the number of edges equivalent to that of FRn.
From a macroscopic degree distribution perspective, the resulting trimmed network is substantially similar to FRn, both in total and at fractal prefecture level. Figure 4 shows the average results of all 100 simulations for 2018, where: (a) relates to trade between all companies within Japan, (b) trades between companies located in Tokyo prefecture only, and (c) trades within Hiroshima only (i.e. a combination of a large and a small prefecture). Analyses for other prefectures, as well as for the years 2015, 2016 and 2017, yield similar results. Here, we highlight the existence of two dots within the tail of the Tokyo distribution, plot (b), that do not fit the overall simulation. Under granular inspection, we were able to verify that these relate to two specific large companies in Japan that have an unusual concentration of reporting to companies within the Tokyo prefecture and can therefore be regarded as outliers that cannot be specifically captured under a generic algorithm.
In addition, plot (d), and the metrics within its inset, show that the resulting pruned subset also compares favourably to FRn when these networks are analysed from a cluster size distribution and giant component (i.e. the largest cluster) perspective. Whereas IBTn tends to effectively equate to a single large component, its subset (the Pruned IBTn) and FRn are very fragmented. This type of fragmentation seems to be similar to that observed within a number of studies related to protein networks, and is typical of subsets of larger networks 14,15.
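The cluster size distribution and the giant component share referred to above can be obtained with a plain breadth-first search. The following is a minimal sketch on toy data; `component_sizes` is a hypothetical helper, and nodes without any edge are not counted because the input is an edge list.

```python
from collections import Counter, defaultdict, deque


def component_sizes(edges):
    """Sizes of connected components (largest first) of an undirected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        sizes.append(size)
    return sorted(sizes, reverse=True)


edges = [(1, 2), (2, 3), (3, 4), (5, 6), (7, 8)]
sizes = component_sizes(edges)
gcc_fraction = sizes[0] / sum(sizes)   # share of nodes in the largest cluster
size_histogram = Counter(sizes)        # cluster size -> number of clusters
```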
At a more micro, local neighbourhood connectivity level, the resulting Pruned IBTn also provides results in line with the structural features of FRn. This can be clearly observed in plot (e) of Fig. 4, where the node degrees k are plotted against the average degrees of their linked neighbours, k_nm(k). Agreement is particularly high at larger node degrees, whereas very small degrees (i.e. fewer than 4 edges) yield slightly higher deviation.
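The degree correlation measure used in plot (e) can be sketched as follows: for every node the mean degree of its neighbours is computed, and these values are then averaged over all nodes sharing the same degree k. The helper name is hypothetical and the data is a toy example.

```python
from collections import defaultdict


def average_neighbour_degree_by_k(edges):
    """Map each degree k to the mean neighbour degree of degree-k nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degree = {node: len(nbs) for node, nbs in adj.items()}
    per_k = defaultdict(list)
    for node, nbs in adj.items():
        # Mean degree of this node's neighbours.
        mean_nb = sum(degree[nb] for nb in nbs) / len(nbs)
        per_k[degree[node]].append(mean_nb)
    # Average over all nodes sharing the same degree k.
    return {k: sum(vals) / len(vals) for k, vals in sorted(per_k.items())}


# Star with hub 0: leaves see only the hub, the hub sees only leaves.
edges = [(0, 1), (0, 2), (0, 3)]
knn = average_neighbour_degree_by_k(edges)
```

A curve that decreases with k, as in this star, indicates that small nodes attach mainly to large ones.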
The pruning algorithm also performs better than two other more traditional statistical methods, namely the random node removal process and the static preferential attachment (see Methods section). Within the former, whereas the fitting at macro, degree distribution level provides reasonably good results (as shown by the degree correlation plot in Fig. 4(e)), the procedure fails to capture the essential dynamics of local interaction levels, with far too many connections between smaller and larger nodes being preserved. In contrast, the static preferential attachment method (i.e. one solely based on information from a single node, without taking into account the overall changes to the composition of all other nodes) provides unfavourable fittings, albeit preserving the basic power law slope.
It is also worth emphasising that if we were to select a source node in accordance with the cumulative advantage mechanism described in the Methods section (Network pruning algorithm) but randomly select a target node, the whole Pruned IBTn curve (in red) would shift significantly upwards and would therefore no longer be in agreement with FRn. The degree distribution at macro level shown in plot (a) would, however, remain substantially similar.
In summary, the importance of a more generalised, evolutionary, preferential attachment method is highlighted when the results of the pruning algorithm are evaluated in tandem with those of other statistically based network removal methods, as shown in all frames within Fig. 4. Indeed, the intrinsic information from a single node, together with the composition of all other nodes, plays a combined role in the selection process. This reflects a similar feature found in Jensen's Tangled Nature model 20, where the fitness of a node, or individual, is not solely based on its own features but also on the distribution of other individuals within a given system.
Network comparison on high dimensional space. The previous section makes use of typical network theory measurements to compare the results of different methods 18, from both a macro and a local level interaction perspective. Here, we further extend our analysis to assess and evaluate the similarities between the actual FRn and the synthetic networks, inspired by graph theory 16,17 principles and based on the general, high dimensional structures of the graphs' (or networks') continuous eigenspectrum distributions 21,22.
In Fig. 5 it is possible to observe broadly similar shapes with respect to the derived Lorentzian distribution of the Laplacian spectra of the eigenvalues related to the actual and simulated networks of four middle sized prefectures (Hiroshima, Kyoto, Nagano and Fukuoka). However, it is clearly noticeable that the pruning algorithm replicates the tail ends as well as the peaks far better than the other methods, i.e. statistical methods solely based on random removal or the preferential attachment method without evolutionary dynamics. The better fitting is also clearly reflected by the calculation of the Ipsen-Mikhailov distance 23 between each of the synthetically derived networks and the actual FRn, as shown by the insets within each graph.
We note here that we have limited our high dimensional analysis to prefecture level data, which are large datasets in isolation, due to the computational challenges associated with the calculation of all eigenvalues of large networks 24. Indeed, research to date on network measurements and comparisons using high dimensional space methods, such as the Ipsen-Mikhailov distance 23, tends to be restricted to much smaller networks. Within this context, the consistency of our results when applied to large networks, and compared to other more traditional network methods, provides further insight into the potential application of high dimensional space methods as network comparison tools.

Conclusions and Discussion
The results of this research indicate that it is feasible to derive a generically and structurally representative interfirm business transaction network based solely on unstructured data from financial and corporate credit analysis documents.
Although the resulting synthetic proxy network is only a subset of the wider (real) network, it contains some of its key features, and therefore provides a valuable insight into the business dynamics of a country via a subset when data is not available. However, accurate and informative extrapolation of larger synthetic networks in general, or of the related reverse process (i.e. sampling), is only likely to be successfully achieved if the essential features underpinning the dynamics and generation of the networks in study are well understood and modelled accordingly.
This study provides further support for the view that straightforward statistical selections and random methods are unlikely to yield fully representative results for a broader range of network features, both at macroscopic and at localised, neighbourhood based interaction levels 8,25. As a result, we believe our findings may help to inspire new methods by researchers working on social and biological networks who need to rely on sampled, incomplete and/or noisy network datasets 26.
Moreover, our results indicate that the broader use on large networks of the Lorentzian distribution of the Laplacian spectra and related Ipsen-Mikhailov distances provides a useful complement to traditional network measurements. Future work on pairing and adjusting such an approach with computational methods that estimate eigenvalues of very large networks 24 may enable the development of mathematically richer comparison measurements and methods.
More specifically to the existing networks in study, the current method can be potentially enhanced by making further use of scaling relationships and correlations with regards to geographical distance, company size, and economic sector. This can be combined with community detection and indirect relationships methods 13, which would significantly reduce the level of uncertainty of edge generation (in the case of a growth model) or removal (in the case of sampling). However, one needs to be aware that the formation and construction of the Financial Report Network is highly dependent on obtaining reports and documents of smaller and middle sized entities, since there is no reciprocity in the reporting from large entities. Such information is not always available to the general public, given that financial reporting and coverage of credit analysis tend to be limited to larger entities and corporations.

Methods
Construction of the networks. This research analyses two rich datasets provided by Teikoku Databank Ltd., namely: (a) the financial reporting and corporate credit report; and (b) the business trade network database. The information within each database is sourced independently from the other, containing records (electronically stored since 1974) for over 800,000 companies in Japan. Here, we make use of the four year period from 2015 to 2018.
The construction of the Interfirm Business Trade Network, IBTn, is a straightforward process, since the database contains an ID number for each company (which can be mapped to a similar government company registration number) together with the ID number of its customers and suppliers 10,11. Therefore, each company maps to a node within the network, with edges representing the customer/supplier relationship. Edges can be directed based on the flow of monies or, in reverse, by the delivery of goods and services. However, the direction of the edges is not relevant to the scope of this research.
In contrast, the Financial Reporting Network, FRn, is generated by a data mining process of financial and corporate credit reports. Within this process each company is represented by a node with a specific unique ID, with the source node being the reporting company (i.e. the one being analysed or reported on), whereas the target nodes relate to the companies cited within such report, mapped to their respective unique IDs. Therefore, edges essentially capture the relationships between cited and citing companies. Given that financial reporting standards and requirements state that key activities and relationships between customers and suppliers require specific disclosure 27, the citation links will encompass a fraction of, and be substantially equivalent to, the relationships recorded within IBTn, thereby enabling direct comparison from an undirected perspective.
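The undirected comparison between the two edge sets can be sketched in a few lines. The edge lists below are toy data and the variable names are hypothetical; the point is only that direction is dropped before the intersection is taken.

```python
def undirected(edges):
    """Drop direction: (a, b) and (b, a) become the same edge; ignore self-loops."""
    return {frozenset(e) for e in edges if e[0] != e[1]}


frn_edges = [(1, 2), (3, 1), (4, 5)]    # citing -> cited (toy data)
ibtn_edges = [(2, 1), (1, 3), (5, 6)]   # supplier -> customer (toy data)

# Edges present in both networks once direction is ignored.
common = undirected(frn_edges) & undirected(ibtn_edges)
# Fraction of the real trade edges that the report network recovers.
coverage = len(common) / len(undirected(ibtn_edges))
```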
At a very basic level, the formation of FRn is akin to that of citation networks 28,29. However, fundamental differences exist, since the data is fully unstructured and there is no time ordering of vertices and direction.
Filtering process. The database underpinning IBTn is comprehensive and consistent. It has also been extensively used in a number of research publications, and therefore it requires no cleansing. In contrast, FRn was generated by automated scanning processes of unstructured data, i.e. by finding names of companies within financial reports. Such computing processes inevitably lead to some degree of unreliable data. As a result, filtering and cleansing procedures are required before any rigorous analysis is carried out.
Our filtering method is based upon the existing scaling relationships within business transaction networks. Previous studies 5,6,10 show that the number of trading links within the IBTn is highly correlated to other quantities such as number of employees, total assets and income. Therefore, it is only natural to make use of such relationships to remove incorrect data within FRn.
The proposed filtering process is not to be applied universally to all types of networks. However, it can be generically applied if the networks in study are subject to scaling relationships, such as merger networks 30, food webs 31, metabolism 32, etc. Scaling is a common feature of a number of biological and social networks 33. This is a simple approach to ensure that the false relationships are substantially eliminated within FRn through automated means. We emphasise, however, that the filtering process is not central to the research, and other more advanced methods, as well as manual methods, could also be used. In any case, these are likely to have minimal impact on the structure of the network. Figure 6 illustrates the logic. Whereas the scaling relationship is maintained at lower numbers of links in plot (a), it starts breaking down for higher values. This is in clear contrast to the data shown within plot (b), i.e. the intersection set, where the scaling relationship is continuous. It follows that links that deviate from scaling by more than two standard deviations are deleted. The results in plot (c), read in conjunction with Fig. 1(b), show the elimination of outliers but the preservation of the known data points.
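A possible reading of the two standard deviation rule is sketched below. Several choices are assumptions made for illustration and are not taken from the text: the scaling is fitted by least squares in log-log space, the fit is made on a trusted reference set (for example the intersection set), firm size is measured by number of employees, and deviations are tested in absolute value. The function names and all numbers are hypothetical.

```python
import math


def fit_loglog(points):
    """Least-squares fit of log(links) = a + b * log(size); returns (a, b, sigma)."""
    xs = [math.log(size) for size, _ in points]
    ys = [math.log(links) for _, links in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sigma = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return a, b, sigma


def flag_outliers(candidates, a, b, sigma, n_sigma=2.0):
    """Return IDs whose link count deviates from the fitted scaling by > n_sigma."""
    flagged = []
    for node_id, (size, links) in candidates.items():
        residual = math.log(links) - (a + b * math.log(size))
        if abs(residual) > n_sigma * sigma:
            flagged.append(node_id)
    return flagged


# (employees, links) pairs for a trusted reference set (toy numbers).
reference = [(10, 3), (100, 11), (1000, 29), (10000, 105)]
a, b, sigma = fit_loglog(reference)

# "china" is a tiny firm with an implausibly large number of links.
candidates = {"acme": (100, 10), "china": (5, 4000)}
flagged = flag_outliers(candidates, a, b, sigma)
```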
On a granular inspection, we were able to verify that a significant part of the breakdown of the relationships was due to very small companies with generic names such as 'China', 'Tokyu' and 'Imperial' that are also referred to in financial reports with a totally different meaning ('China', for example, being a reference to the country rather than the company).
The granular inspection that validates the eliminations, together with the preservation of the known data points as shown in Fig. 1(b), provides appropriate comfort and assurance that the filtering process is an adequate method to eliminate such incorrect data without introducing any unintended bias.
Network pruning algorithm. The pruning algorithm is essentially inspired by Yule's classical work 34 on the evolution of species and distributions, Price's cumulative advantage theory 35, and Barabasi's application of the preferential attachment concept to networks 34,36.
However, unlike the classical preferential attachment equation, where weightings are solely dependent upon the intrinsic characteristic of each node 18,36, we introduce additional evolutionary dynamics where weightings are impacted, and adjusted, by the composition of other participating nodes. Such an approach is inspired by Jensen's Tangled Nature model framework 20, where the behaviour of each node is not solely dependent on its individual fitness but shaped in accordance with the other nodes within the environment. Furthermore, inspiration and measuring methods are also drawn from Ipsen-Mikhailov's evolutionary framework for reconstruction of networks 23.
In order to demonstrate and highlight the importance of both dynamics acting together, namely, the preferential attachment mechanism as well as the evolutionary mechanism, we generate and compare the results of three different pruning methods. Firstly, we adopt the statistical random node removal method as described by Stumpf 9,37. Secondly, we modify the random selection by implementing the classic, static, preferential attachment process 36. Essentially the method adds weights to the probabilities of selection within the random process. Lastly, we adopt our pruning algorithm where these weightings are further modified by the composition of all the nodes within the system at the time of removal.
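The difference between the first two methods can be made concrete with a short sketch of the selection probabilities. It assumes the classic degree-proportional form k_i / sum_j k_j for the static preferential weights, which is the textbook expression rather than one stated in this text; the third, evolutionary method is not reproduced here. The function name is hypothetical.

```python
def selection_probabilities(degree, scheme):
    """Return node -> selection probability under a baseline scheme.

    "random": every node equally likely (random removal baseline).
    "static": probability proportional to the node's own degree,
              i.e. classic preferential attachment weights k_i / sum_j k_j.
    """
    if scheme == "random":
        weights = {node: 1.0 for node in degree}
    elif scheme == "static":
        weights = {node: float(k) for node, k in degree.items()}
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


degree = {"a": 6, "b": 3, "c": 1}
p_random = selection_probabilities(degree, "random")
p_static = selection_probabilities(degree, "static")
```

Under the static scheme the weights depend only on each node's own degree; they are static in the sense that they ignore the composition of the remaining population.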
When constructing the algorithm, we address two specific issues. The first matter is applicable to all networks in general: a random sampling of a scale-free network does not lead to similar scale-free distributions for the subset 37. In contrast, the second matter relates to specific structural differences between the networks (i.e. the full IBTn set and the FRn subset). Within IBTn, the in and out degree distributions are balanced, resulting in equal power law probability distributions. In contrast, the FRn is unbalanced, since the in degree distributions follow similar power law behaviour but the out degrees do not. This means that whereas smaller and middle size companies tend to cite most relationships, large companies provide very limited citation information. Essentially this leads to an observed rich-club phenomenon within IBTn which is not present in FRn.
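One common way to quantify a rich-club effect is the rich-club coefficient, i.e. the density of links among nodes whose degree exceeds a threshold k. The sketch below uses that standard definition on toy data; the text does not state which measure was used, so this is offered only as an illustration, and the function name is hypothetical.

```python
from collections import Counter


def rich_club_coefficient(edges, k):
    """Density of links among nodes whose degree exceeds k (undirected graph)."""
    undirected = {frozenset(e) for e in edges if e[0] != e[1]}
    degree = Counter(node for edge in undirected for node in edge)
    rich = {node for node, d in degree.items() if d > k}
    if len(rich) < 2:
        return None  # coefficient undefined for fewer than two rich nodes
    links = sum(1 for edge in undirected if edge <= rich)
    return 2 * links / (len(rich) * (len(rich) - 1))


# Nodes 0, 1, 2 form a fully linked core; each also has one peripheral partner.
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 4), (2, 5)]
phi = rich_club_coefficient(edges, k=1)
```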
In order to address these two issues, the pruning algorithm is built in two stages, corresponding to sequencing the selection of two (connected) nodes, source and target, instead of directly selecting an edge to be removed.
The algorithm flow. The algorithm follows a straight flow where an edge is removed by selecting the source node, and subsequently the target node, after normalising the probabilities of all nodes available for selection:
$$P_n(k) = \frac{P(k)}{\sum_{i=1}^{n} P(k_i)} \qquad (1)$$
where $P_n(k)$ is the normalised probability of $P(k)$, and $P(k)$ is the probability for each node calculated in accordance with Eq. 2 below. The denominator consists of all nodes when selecting the source, and of all nodes connected to the source when selecting the target. The edge linking source and target is then removed, and the degrees of the nodes are updated. This process is repeated until the total number of remaining edges equals the total number of edges within FRn.
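The flow described above can be sketched as a short loop. The node weight is passed in as a caller-supplied function because the functional form of the selection probability is not reproduced in this sketch; the `weight=lambda k: k` used in the example is a placeholder, not the authors' formula. The function name `prune`, the fixed random seed and the restriction of source candidates to nodes that still have edges are likewise assumptions.

```python
import random
from collections import defaultdict


def prune(edges, target_edge_count, weight, seed=0):
    """Remove edges one at a time until target_edge_count remain.

    weight(k) is a caller-supplied placeholder for the per-node probability;
    its true functional form is not reproduced here.
    """
    rng = random.Random(seed)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n_edges = sum(len(nbs) for nbs in adj.values()) // 2

    def pick(nodes):
        # random.choices uses relative weights, so the normalisation
        # over the candidate set happens implicitly.
        nodes = list(nodes)
        weights = [weight(len(adj[n])) for n in nodes]
        return rng.choices(nodes, weights=weights)[0]

    while n_edges > target_edge_count:
        source = pick(n for n in adj if adj[n])  # nodes that still have edges
        target = pick(adj[source])               # only neighbours of the source
        adj[source].discard(target)              # remove the edge in both
        adj[target].discard(source)              # directions; degrees follow
        n_edges -= 1
    return {frozenset((u, v)) for u in adj for v in adj[u]}


edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (3, 4)]
pruned = prune(edges, target_edge_count=3, weight=lambda k: k)
```

Selecting two connected nodes in sequence, rather than an edge directly, is what allows different candidate sets to be used for the source and the target.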
The probability of selection of nodes and edges. Each node is given a probability of selection:
[Eq. 2: expression not recoverable from the source text]
where S(k) is the probability of selection of a node with degree k, and M is the middle of the Zipf rank of the degrees K available for selection. The above probability of node selection equates to an approximation of the cumulative mass function of a geometric distribution, where S(k) = P(K < k). Therefore, it can be written as:
[Eq. 3: expression not recoverable from the source text]
The definitions accompanying Eq. 3 are: (i) the cumulative number of successes for the set K of {0, 1, 2, ..., k} before the first failure; and (ii) the generic probability of addition of an edge to a node of degree k.


Here, we elaborate on the choice of the functional form of Eq. 2 and the equivalent Eq. 3. Firstly, the pruning mechanism is not time dependent, since both networks, set and subset, are frozen at time 't' . In this case the geometric distribution is appropriate as we are only preoccupied with the number of 'success' events (i.e. addition of an edge), and the order of their removal. Therefore, 'failures' can be ignored. The pruning algorithm removes an edge at every single step, and therefore no corresponding 'failure' events are to be selected.
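For reference, the cumulative mass function of a geometric distribution that counts successes before the first failure has a simple closed form, P(K < k) = 1 - p^k, where p is the probability of a single success. The sketch below checks this numerically. The value of p is purely illustrative, and no assumption is made here about how p relates to k and M in the authors' equations.

```python
def geometric_cdf_below(k, p):
    """P(K < k) when K counts successes (probability p each) before the first failure."""
    # P(K = j) = p**j * (1 - p): j successes followed by one failure.
    return sum((p ** j) * (1 - p) for j in range(k))


p = 0.8  # illustrative success probability, not a value from the paper
for k in (1, 5, 20):
    closed_form = 1 - p ** k
    assert abs(geometric_cdf_below(k, p) - closed_form) < 1e-12

check = geometric_cdf_below(3, 0.5)
```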
Secondly, the cumulative mass function is used since every single element of the set K < k needs to be removed in order for all 'success' events to be reversed.
Thirdly, k and M can be regarded as the endogenous and exogenous factors affecting the probability of a node being selected. Here, k represents the fact that a node with a large number of edges, in isolation, is intrinsically more likely to add further edges (i.e. the preferential attachment, or cumulative advantage principle). In contrast, M represents the effect of the overall environment on each node, where the more diverse the population (in terms of numbers of edges), the higher tends to be the general probability of origination of new edges. These two elements bear a conceptual resemblance to the reproduction dynamics of Jensen's Tangled Nature model 20, where the ability of an individual to reproduce is dependent upon (a) its strength of interactions and (b) the carrying capacity of the environment.
Network comparison and the Ipsen-Mikhailov distance. Extensive research exists in relation to network measures and comparison methods. Essentially, the selection of methods is fundamentally dependent on the size and structure of the network, the features that are important for the selected research, as well as the researcher's preferences.
Our method makes use of three traditional distribution measures derived from network theory 6,18 that are applicable to very large networks, namely: degree distributions and cluster and giant component distributions, to evaluate global features, as well as the average degree correlations, to assess local level interactions.
Furthermore, inspired by graph theory and statistical physics concepts, we also make use of the Lorentzian distribution of the Laplacian spectra 16,17 and the related Ipsen-Mikhailov distances 23, and apply them to large subsections, or prefecture levels, of our network.
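A compact numerical sketch of this spectral comparison is given below, following the standard Ipsen-Mikhailov construction: Lorentzians of common width are centred on the square roots of the Laplacian eigenvalues, the resulting density is normalised, and the distance is the square root of the integrated squared difference between two densities. It relies on NumPy. The width `gamma`, the finite frequency grid and the numerical normalisation are illustrative assumptions, the function names are hypothetical, and dropping a single zero eigenvalue presumes a connected graph.

```python
import numpy as np


def laplacian_frequencies(adjacency):
    """Square roots of the Laplacian eigenvalues, excluding the smallest one."""
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    eigenvalues = np.linalg.eigvalsh(laplacian)  # returned in ascending order
    # Drop the smallest (zero) eigenvalue; clip tiny negative round-off.
    return np.sqrt(np.clip(eigenvalues[1:], 0.0, None))


def spectral_density(freqs, omega, gamma):
    """Sum of Lorentzians centred on freqs, normalised on the omega grid."""
    rho = (gamma / ((omega[:, None] - freqs[None, :]) ** 2 + gamma ** 2)).sum(axis=1)
    step = omega[1] - omega[0]
    return rho / (rho.sum() * step)


def ipsen_mikhailov_distance(adj_a, adj_b, gamma=0.08, n_points=4000):
    """Square root of the integrated squared difference of the two densities."""
    fa, fb = laplacian_frequencies(adj_a), laplacian_frequencies(adj_b)
    omega = np.linspace(0.0, max(fa.max(), fb.max()) + 1.0, n_points)
    step = omega[1] - omega[0]
    diff = spectral_density(fa, omega, gamma) - spectral_density(fb, omega, gamma)
    return float(np.sqrt((diff ** 2).sum() * step))


path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
d_same = ipsen_mikhailov_distance(path, path)
d_diff = ipsen_mikhailov_distance(path, star)
```

Because every eigenvalue of the Laplacian is required, the cost grows quickly with network size.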
The Ipsen-Mikhailov metric is originally derived from modelling the frequencies of vibrating strings connecting a network system of N molecules. The vibrational frequencies $\omega_i$ of such a system can be derived from the eigenvalues of the Laplacian matrix of the network, and the spectral density of the network is the sum of Lorentz distributions, which in the standard Ipsen-Mikhailov form, with common width $\gamma$ and normalisation constant $K$, is defined as:
$$\rho(\omega) = K \sum_{i=1}^{N-1} \frac{\gamma}{(\omega - \omega_i)^2 + \gamma^2}$$
Once the spectral densities are generated for each distinct network, relative structural comparison is possible by overlaying the plots of the spectral densities, as exemplified in Fig. 5. Differences between the densities of two networks can be further synthesised into a single number, the Ipsen-Mikhailov distance (insets of Fig. 5), as follows: