Online division of labour: emergent structures in Open Source Software

The development of Open Source Software fundamentally depends on the participation and commitment of volunteer developers to progress on a particular task. Several works have presented strategies to increase the on-boarding and engagement of new contributors, but little is known about how these diverse groups of developers self-organise to work together. To understand this, one must consider that, on the one hand, platforms like GitHub provide a virtually unlimited development framework: any number of actors can potentially join to contribute in a decentralised, distributed, remote, and asynchronous manner. On the other hand, it seems reasonable that some sort of hierarchy and division of labour must be in place to meet human biological and cognitive limits, and also to achieve some level of efficiency. These latter features (hierarchy and division of labour) should translate into detectable structural arrangements when projects are represented as developer-file bipartite networks. Thus, in this paper we analyse a set of popular open source projects from GitHub, with emphasis on three key properties: nestedness, modularity and in-block nestedness, which typify the emergence of heterogeneities among contributors, the emergence of subgroups of developers working on specific subgroups of files, and a mixture of the previous two, respectively. These analyses show that projects indeed evolve into internally organised blocks. Furthermore, the distribution of sizes of such blocks is bounded, connecting our results to the celebrated Dunbar number both in off- and on-line environments. Our conclusions create a link between bio-cognitive constraints, group formation and online working environments, opening up a rich scenario for future research on (online) work team assembly (e.g. size, composition, and formation). 
From a complex network perspective, our results pave the way for the study of time-resolved datasets, and the design of suitable models that can mimic the growth and evolution of OSS projects.


Introduction
Open Source Software (OSS) is a key actor in the current software market, and a major factor in the consistent growth of the software economy. The promise of OSS is better quality, higher reliability, more flexibility, lower cost, and an end to predatory vendor lock-in, according to the Open Source Initiative 1. These goals are achieved thanks to the active participation of the community 2: indeed, OSS projects depend on contributors to progress 3,4.
The emergence of GitHub and other platforms as prominent public repositories, together with the availability of APIs to access comprehensive datasets on most projects' history, has opened up opportunities for more systematic and inclusive analyses of how OSS communities operate. In recent years, research on OSS has left behind a rich trace of facts. For example, we now know that the majority of code contributions are highly skewed towards a small subset of projects 5,6, with many projects quickly losing community interest and being abandoned at very early stages 7. Moreover, most projects have a low truck factor, meaning that a small group of developers is responsible for a large set of code contributions 8–10. This pushes projects to depend more and more on their ability to attract and retain occasional contributors (also known as "drive-by" commits 11) that can complement the few core developers and help them to move the project forward. Along these lines, several works have focused on strategies to increase the on-boarding and engagement of such contributors (e.g., by using simple contribution processes 12, extensive documentation 13, gamification techniques 14 or ad hoc on-boarding portals 15, among others 16). Other social, economic, and geographical factors affecting the development of OSS have been scrutinised as well; see Cosentino et al. 17 for a thorough review.
Parallel to these macroscopic observations and statistical analyses, social scientists and complex network researchers have focused, in far fewer papers, on analysing how a diverse group of (distributed) contributors work together, i.e. the structural features of projects. Most often, these works pivot on the interactions between developers, building explicit or implicit collaborative networks, e.g. email exchanges 18,19 and unipartite projections from the contributor-file bipartite network 20, respectively. These developer social networks have been analysed to better understand the hierarchies that emerge among contributors, as well as to identify topical clusters, i.e. cohesive subgroups that manifest strongly in technical discussions. However, the behaviour of OSS communities cannot be fully understood by accounting only for the relations between project contributors, since their interactions are mostly mediated through the editing of project files (no direct communication is present between group members). To overcome this limitation, we focus on studying the structural organisation of OSS projects as contributor-file bipartite graphs. Far beyond technical and methodological adaptations, the consideration of these two elements composing the OSS system allows retaining valuable information (as opposed to collapsing it on a unipartite network) and, above all, recognising both classes as co-evolutionary units that place mutual constraints on each other.
Our interest in the structural features of OSS projects stems from some obvious, but worth highlighting, observations. First, public collaborative repositories place no limits, in principle, on the number of developers (and files) that a project can host. In this sense, platforms like GitHub resemble online social networks (e.g. Twitter or Facebook), in which the number of allowed connections is virtually unbounded. However, we know that other factors (biological, cognitive) set well-defined limits to the number of active social connections an individual can have 21, also online 22. But do these limits apply in collaborative networks, where contributors work remotely and asynchronously? Does a division of labour arise, even when interaction among developers is mostly indirect (that is, via the files that they edit in common)? And, even if specialised subgroups emerge (as some evidence already suggests, at least in developer social networks 20), do these exhibit some sort of internal organisation?
To answer these questions, we will look at three structural arrangements which have been identified as signatures of self-organisation in both natural and artificial systems: nestedness 23,24 (i.e. do projects evolve in a way such that the emergence of generalists and specialists is favoured?); modularity 25–27 (i.e. do OSS projects split into identifiable compartments, thus avoiding Brooks's law 28 despite the addition of contributors? Are these compartments bounded?); and in-block nestedness 29–31 (i.e. if bio-cognitive limits and division of labour are in place, do the resulting specialised modules self-organise internally?).

Results
The projects that we analyse in the following were selected according to their popularity (quantified as the number of stars these projects had received on GitHub at the time of collection in 2016). This criterion mainly responds to two arguments: maturity and success. That is, here we purposefully pay attention to projects which have reached a reasonable degree of evolution, regardless of the absence (or presence) of any given structural organisation at the initial stages.
After pre-processing, formatting and discarding some of the top 100 public OSS projects hosted on GitHub, we ended up retaining 65 of them; see Materials and Methods for details. As can be seen in Table 1, we have a sufficiently broad distribution of project sizes and ages. Note also that popularity (number of stars) is not necessarily related to size or age. Each of these projects has been represented as a bipartite unweighted graph, where inter-class links (between contributors c and files f) are allowed, but intra-class links are forbidden. This bipartite network is thus encoded as an N × M rectangular binary matrix A, with entries $a_{cf} = 1$ if contributor c edited the file f at least once, and $a_{cf} = 0$ otherwise. The size of a project is S = N + M; the smallest project considered here is resume.github.com, with S = 82, and the largest one is foundation-sites, with S = 13,382.
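As an illustration of this encoding, the following minimal sketch builds the binary matrix A from a list of (contributor, file) edit records. The function name, the record format and the toy data are assumptions made for the example; they are not the pipeline used in this work.

```python
def build_bipartite_matrix(edits):
    """Return (contributors, files, A), with A[c][f] = 1 if contributor c edited file f.

    `edits` is an iterable of (contributor, file) pairs; this input format is
    hypothetical and chosen only for illustration.
    """
    contributors = sorted({c for c, _ in edits})
    files = sorted({f for _, f in edits})
    row = {c: i for i, c in enumerate(contributors)}
    col = {f: j for j, f in enumerate(files)}
    A = [[0] * len(files) for _ in contributors]
    for c, f in edits:
        # Repeated edits of the same file collapse into one unweighted link.
        A[row[c]][col[f]] = 1
    return contributors, files, A


# Toy edit history (hypothetical data).
edits = [
    ("ana", "core.py"),
    ("ana", "README.md"),
    ("bo", "core.py"),
    ("ana", "core.py"),  # second edit of the same file by the same contributor
    ("cy", "docs.md"),
]
contributors, files, A = build_bipartite_matrix(edits)
S = len(contributors) + len(files)  # project size S = N + M
```

Only links between the two classes can be stored in A, so the absence of intra-class links is enforced by construction.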

Preliminary observations
Before we focus on the structural arrangements of interest (nestedness, modularity, in-block nestedness), we explore whether a potentially unbounded interaction capability is mirrored in actual OSS projects across 4 orders of magnitude in size. To do so, we work on the projected contributor-contributor network, to measure the developers' implicit average degree $\langle k \rangle$, i.e. the average number of contributors with whom an individual shares at least one file. Figure 1 shows a scatter plot of $\langle k \rangle$ against the project size s (note the semi-log scaling). Besides the initial fluctuating pattern, it is clear that $\langle k \rangle$ presents an almost flat trajectory, which indicates that, on average, a contributor indirectly interacts with ∼ 70 peers, regardless of the size of the project. The flat pattern for the developers' implicit average degree in Figure 1 is interesting in two respects. First, it points to an inherent limitation to the number of connections (even indirect ones) that a contributor to a project can sustain. Notably, such limitation is below, but not far from, the celebrated Dunbar number (somewhere between 100 and 300), which is echoed as well in digital environments 22. Second, the result is consistent with the existence of some sort of mesoscale organisation in the projects. In Bird et al. 19, the authors find that developers in the same community have more files in common than pairs of developers from different communities. Reversing the argument, one may say that relatively small contributor neighbourhoods are indicative, though not a guarantee, of the presence of well-defined subgroups in OSS projects.
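The implicit average degree can be sketched as follows, assuming the project is available as a binary contributor-file matrix stored as a list of rows. The function name and the toy matrix are illustrative assumptions, not the code used to produce Figure 1.

```python
def implicit_average_degree(A):
    """Average number of peers with whom a contributor shares at least one file.

    A is a binary contributor-by-file matrix (list of rows). Two contributors
    are linked in the projected network if they edited a common file.
    """
    file_sets = [{j for j, a in enumerate(row) if a} for row in A]
    n = len(file_sets)
    degrees = [
        sum(1 for j in range(n) if j != i and file_sets[i] & file_sets[j])
        for i in range(n)
    ]
    return sum(degrees) / n


# Toy project with 4 contributors and 3 files (hypothetical data).
A = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
k_avg = implicit_average_degree(A)
```

The pairwise loop is quadratic in the number of contributors, which is acceptable for a sketch; a sparse matrix product would be preferable for the largest projects.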

Mesoscale patterns
From the previous encouraging result, we move on to the analysis of the projects at a larger scale. The specificities of the methods to calculate nestedness N, and to optimise modularity Q and in-block nestedness I, are detailed in the Materials and Methods section. For the sake of illustration, Figure 2 (top row) shows idealised examples of nested (left), modular (middle) and in-block nested (right) arrangements. The bottom row of the figure presents actual adjacency matrices of three projects with high values of each one of the structural measures. In them, rows and columns have been rearranged to highlight the different properties.
We start out with a general overview of the results for the three measures of interest. Figure 3 plots the obtained values for N, Q, and I over all the projects considered in this work. To ease visualisation, and considering that nestedness and modularity are antagonistic organisations 32, projects are sorted to maximise the difference between N and Q. In general, nestedness is the lowest of the three values at stake, and in-block nestedness is, more often than not, the highest. It can be safely said, thus, that a tendency to self-organise as a block structure is present: 90% of the projects exhibit either Q or I above 0.4, and values beyond 0.5 are not rare. This evidence is compatible with previous results regarding the division of labour: indeed, be they modular or in-block nested, most projects can be split into communities of developers and files, forming subgroups around product-related activities 19.
Just as there is virtually no technical limit to the overall size of a project, neither is there an explicit bound to the size that a sub-group should have. And yet, previous theory and evidence suggest that larger communities come at an efficiency cost: the dynamics of a group change fundamentally when it exceeds the Dunbar number, which is estimated at around 150. While most often the number refers to personal acquaintances, it has been (and still is) applied in the industrial sphere 33. Applied to the OSS environment, exceedingly small working sub-groups might hamper a project's advance, while too many contributors may not allow the group to converge towards a solution 34,35. We explore whether, indeed, size limitations arise in developer sub-groups, as they emerge from either Q or I optimisation procedures. Although partitions are hybrid, i.e. a community has both developers and files, in the following we will report the community sizes in terms of developers.
Figure 4 provides a global overview of the 65 projects studied here, with the distribution of their largest subgroup sizes as they are identified via Q (panel a) or I (panel b). In both cases the average (dashed orange vertical line) is below 200, and the histogram is evenly distributed around 100: most communities belong in the range from 80 to 200. For the sake of comparison, the solid red line represents a log-normal fit (notice the logarithmic scale in the x-axis), and the insets in both panels show the Q-Q plots, which compare the theoretical and empirical distributions and reveal that, indeed, the fit is accurate. Although Figure 4 evidences, on average, a well-defined maximum community size, we must ensure that the size of the largest communities detected for each project is independent of the size of the project, in order to validate such organisational limit. To do so, we go down to the project level. Figure 5 reports average (panels a and c) and maximum (panels b and d) subgroup sizes for both community identification strategies, as a function of the project size s. In general, results point at the existence of bounds to group size, which resemble the limits described by Dunbar's number: even the largest projects reflect that the maximum size of a community in them is between 100 and 200 (in the case of I-communities, panel d). This result is robust and stable beyond s > 2000. On the other hand, the largest community size is slightly above 200 in the case of Q-communities (panel b). These results are all the more striking, since such a trend towards the compartmentalisation of the workload is not only decentralised, in the sense that it does not emerge from a predefined plan, but also implicit, because the interaction between developers is most often indirect.
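A log-normal fit and the corresponding Q-Q points can be sketched as below. The fitting procedure used for Figure 4 is not stated in the text, so this example assumes a maximum-likelihood fit (mean and population standard deviation of the log-sizes) and the plotting positions (i + 0.5)/n; the sample sizes are hypothetical.

```python
import math
from statistics import NormalDist, mean, pstdev


def fit_lognormal(sizes):
    """Maximum-likelihood log-normal parameters (mu, sigma) of positive sizes."""
    logs = [math.log(x) for x in sizes]
    return mean(logs), pstdev(logs)


def qq_points(sizes, mu, sigma):
    """Pairs (theoretical quantile, empirical quantile) for a Q-Q plot."""
    n = len(sizes)
    normal = NormalDist(mu, sigma)
    empirical = sorted(sizes)
    # Assumed plotting positions: (i + 0.5) / n for i = 0..n-1.
    theoretical = [math.exp(normal.inv_cdf((i + 0.5) / n)) for i in range(n)]
    return list(zip(theoretical, empirical))


# Hypothetical largest-community sizes for three projects.
sizes = [50, 100, 200]
mu, sigma = fit_lognormal(sizes)
points = qq_points(sizes, mu, sigma)
median_size = math.exp(mu)  # median of the fitted log-normal
```

Points lying close to the diagonal of the Q-Q plot indicate that the empirical sizes are well described by the fitted distribution.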

Co-existing architectures and project maturity
As has been suggested 32, empirical evidence indicates that more than one structural pattern may concur within a network, each evincing different properties of the system. We take the same stance here: a network is not regarded, for example, as completely modular or completely nested; rather, it may combine structural features that reflect the evolutionary history of the system, or the fact that the system evolves under different dynamical pressures that favour competing arrangements.
A convenient way to grasp this mixture is a ternary plot (or simplex), see Figure 6. In the ternary plot, each project is located with three coordinates $f_{\mathcal{N}}$, $f_Q$ and $f_{\mathcal{I}}$, which are simply calculated from the original scores, e.g. $f_{\mathcal{N}} = \mathcal{N}/(\mathcal{N} + Q + \mathcal{I})$. The simplex can be partitioned according to "dominance regions", bounded by the three angle bisectors. These regions intuitively tell us which of the three patterns is more prominent for any given project. Note that certain areas of the simplex (in grey in Figure 6) are necessarily empty; see Materials and Methods, and Palazzi et al. 32 for further details. Figure 6 reveals that most projects lie in the nested regions, while the predominantly modular region is relatively empty.
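The ternary coordinates and the dominance region of a project can be computed as in the sketch below. The function names and the example scores are illustrative assumptions; the dominance rule simply picks the largest of the three coordinates, which is what the bisector partition of the simplex amounts to.

```python
def ternary_coordinates(n, q, i):
    """Normalise the scores (N, Q, I) so that the three coordinates sum to 1."""
    total = n + q + i
    if total <= 0:
        raise ValueError("the three scores must have a positive sum")
    return n / total, q / total, i / total


def dominant_pattern(n, q, i):
    """Name of the pattern with the largest score (the dominance region)."""
    scores = {"nested": n, "modular": q, "in-block nested": i}
    return max(scores, key=scores.get)


# Hypothetical scores for one project.
f_n, f_q, f_i = ternary_coordinates(0.2, 0.3, 0.5)
region = dominant_pattern(0.2, 0.3, 0.5)
```

Ties between scores are resolved here in favour of the first pattern listed, an arbitrary choice that only matters for projects lying exactly on a bisector.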
Together with their dominant architecture, points in Figure 6 are colour-coded according to the total number of commits that each project has received. We take this number as a proxy for the level of development or maturity of the project (note that a project's age may be misleading due to periods of inactivity). The distribution of colour on the simplex suggests that more mature projects tend to exhibit nested or in-block nested structures, whereas predominantly modular projects appear to be relatively immature (with exceptions, admittedly). Such a result resonates with the fact that topical conversations in online social networks ("information ecosystems") evolve through different stages: modular when the discussion is still brewing in a scattered way; nested when the discussion becomes mainstream to the group of interest 36. More relevant to OSS development, Figure 6 reconciles the idea of workload compartmentalisation (subcommunities forming around product-related activities) 19, and the emergence of hierarchies 8 or a rich club 18 of developers, at least in well-developed projects. This partial picture is, however, complemented by the fact that hierarchies emerge as well on the code class: the presence of generalists and specialists applies to both developers and files in a nested or in-block nested scenario.

Discussion
In summary, our analyses have unveiled that OSS projects evolve into a relatively narrow set of structural arrangements. At the mesoscale, we observe that projects tend to form blocks, a fact that can be related to the need of contributors to distribute coding efforts, allowing a project to develop steadily and in a balanced way. Focusing on the file class, the emergence of blocks is interesting as well, since a modular architecture (understood now as a software design principle) is a desired feature in any complex software project. Furthermore, those blocks or subgroups have a relatively stable size no matter how large a project is. Remarkably, such size fluctuates around the Dunbar number. Previous research reported that OSS projects are largely heterogeneous, in the sense that developers self-organise into hierarchical structures. Such a statement may seem to clash with a modular arrangement, to the extent that modularity Q does not make any assumption regarding the internal organisation of the subgroups. Our findings, however, point to the fact that more mature projects tend to present a nested organisation inside modules. Thus, the presence of workload compartmentalisation is compatible with the emergence of hierarchies, with generalists and specialists throughout a project. Paradoxically, a more evolved and structured architecture does not imply better overall performance here: the nested arrangement inside blocks can hamper a project's progress, since the occasional and least committed contributors (those acting upon a small part of the code) tend to edit precisely the most generalist files, neglecting the least developed ones.
These findings open up a rich scenario, with many questions lying ahead. On the OSS environment side, our results contribute to an understanding of how successful projects self-organise towards a modular architecture: large and complex tasks, involving hundreds (and even thousands) of files, appear to be broken down, presumably for the sake of efficiency and task specialisation (division of labour). Within this compartmentalisation, mature projects exhibit even further organisation, arranging the internal structure of subgroups in a nested way, something that is not grasped by modularity optimisation alone. More broadly, our results demand further investigation, to understand their connection with the general topic of work team assembly (size, composition, and formation), and with the (urgent) issue of software sustainability 37. OSS is a prominent example of the "tragedy of the commons": companies and services benefit from the software, but there is a gross imbalance between those consuming the software and those building and maintaining it. Indeed, by being more aware of the internal self-organisation of their projects, owners and administrators may design strategies to optimise the collaborative efforts of the limited number (and availability) of project contributors. For instance, they can make efforts to drive the actual project's block decomposition towards a pre-defined software architectural pattern, or ensure that, despite the nested organisation within blocks, all files in a block receive some minimal attention. More research on the derivation of effective project management and leadership strategies from the current division of labour in a project is clearly needed and would be impactful.
Closer to the complex networks and data analysis tradition, our results leave room to widen the scope of this research. To start with, future research should tackle a larger and more heterogeneous set of projects, even across different platforms such as Bitbucket. Admittedly, this work has focused on successful projects, inasmuch as we only consider a few dozen among the most popular. Beyond the richness of the analysed dataset, the relationship between maturity and structural arrangement (especially in regard to the internal organisation of subgroups) clearly demands further work. Two obvious, and intimately related, lines of research concern time-resolved datasets and the design of a suitable model that can mimic the growth and evolution of OSS projects. Such a model should lay down the necessary dynamical rules for both contributors and files which, presumably, differ largely.

Materials and Methods
Data. Our open source projects dataset was collected from GitHub 38, a social coding platform which provides source code management and collaboration features such as bug tracking, feature requests, task management and a wiki for every project. Given that GitHub users can star a project (to show interest in its development and follow its advances), we chose to measure the popularity of a GitHub project in terms of its number of stars (i.e. the more stars, the more popular the project is considered) and selected the 100 most popular projects. The construction of the dataset involved three phases, namely: (1) cloning, (2) import, and (3) enrichment.
Cloning and import. After collecting the list of the 100 most popular projects on GitHub (at the moment of collecting the data) via its API 39, we cloned them to obtain 100 Git repositories. We analysed the cloned repositories and discarded those not involving the development of a software artifact (e.g. collections of links or questions), rejecting 15 projects out of the initial 100. We then imported the remaining Git repositories into a relational database using the Gitana 40 tool to facilitate the query and exploration of the projects for further analysis. In the Gitana database, Git repositories are represented in terms of users (i.e. contributors with a name and an email); files; commits (i.e. changes performed to the files); references (i.e. branches and tags); and file modifications. For two projects, the import process failed to complete due to missing or corrupted information in the source GitHub repository.
Enrichment. Our analysis needs a clear identification of the author of each commit so that we can properly link contributors and the files they have modified. Unfortunately, Git does not control the name and email that contributors indicate when pushing commits, resulting in clashing and duplicity problems in the data. Clashing appears when two or more contributors have set the same name value (in Git the contributor name is manually configured), so that commits actually coming from different contributors appear with the same commit name (this often happens with common names such as "mike"). In addition, duplicity appears when a contributor has several emails, so that commits coming from the same person are linked to different emails, suggesting different contributors. We found that, on average, around 60% of the commits in each project were modified by contributors that involved a clashing/duplicity problem (affecting a similar number of files).
To address this problem, we relied on data provided by GitHub for each project (in particular, GitHub usernames, which are unique). By linking commits to unique usernames, we could disambiguate the contributors behind the commits. Thus, we enriched our repository data by querying the GitHub API to discover the actual username for each commit in our repository, and relied on those instead of on the information provided as part of the Git commit metadata. This method only failed for commits without an associated GitHub username (e.g. when the user that made that commit no longer existed on GitHub). In those cases we kept the email in the Git commit as contributor identifier. In this way, we considerably reduced the clashing/duplicity problem in our dataset. The percentage of commits modified by contributors that may involve a clashing/duplicity problem was reduced to 0.004% on average (σ = 0.011), and the percentage of files affected was reduced to 0.020% (σ = 0.042).
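The disambiguation rule described above (prefer the GitHub username, fall back to the commit email) can be sketched as follows. The commit records, the `username_by_sha` lookup table and all names in it are hypothetical stand-ins for the data retrieved from Gitana and the GitHub API; no real API call is made here.

```python
def resolve_contributor(commit, username_by_sha):
    """Return a contributor identifier for a commit.

    Prefers the GitHub username associated with the commit; if none is known,
    falls back to the email recorded in the Git commit metadata.
    """
    username = username_by_sha.get(commit["sha"])
    if username:
        return ("github", username)
    return ("email", commit["email"])


# Hypothetical commits: the name "mike" clashes, and one person uses two emails.
commits = [
    {"sha": "a1", "name": "mike", "email": "mike@work.example"},
    {"sha": "b2", "name": "mike", "email": "mike@home.example"},
    {"sha": "c3", "name": "mike", "email": "other@mail.example"},
    {"sha": "d4", "name": "sam", "email": "sam@mail.example"},
]
# Hypothetical result of the username lookup; commit d4 has no username.
username_by_sha = {"a1": "mike-dev", "b2": "mike-dev", "c3": "mikey42"}

identities = [resolve_contributor(c, username_by_sha) for c in commits]
n_contributors = len(set(identities))
```

Tagging each identifier with its origin ("github" or "email") avoids accidental collisions between a username and an email string.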
At the end of this process, we had successfully collected a total of 83 projects, 48,015 contributors, 668,283 files and 912,766 commits. A further 18 projects were rejected due to other limitations, leaving the 65 reported in this work. On the one hand, we discarded some projects that presented a very strong divergence between the number of nodes of the two sets, e.g. projects with a very large number of files but very few contributors. In these cases, although N, Q and I can be quantified, the outcome is hardly interpretable. An example of this is the project material-designs-icons, with 15 contributors involved in the development of 12,651 files. On the other hand, we considered only projects with a bipartite network size within the range $10^1 \le S \le 10^4$, as the computational costs to optimise in-block nestedness and modularity for larger sizes were too severe.
Matrix generation. We build a bipartite unweighted network as a rectangular N × M matrix, where rows and columns refer to contributors and source files of an OSS project, respectively. Cells therefore represent links in the bipartite network, i.e. if the cell $a_{ij}$ has a value of 1, it represents that the contributor i has modified the file j at least once; otherwise $a_{ij}$ is set to 0.
Nestedness. In structural terms, a nested pattern is observed when specialists (nodes with low connectivity) interact with proper nested subsets of those species interacting with generalists (nodes with high connectivity), see Figure 2 (left). Several works have shown that a nested configuration is a signature feature of cooperative environments, those in which interacting species obtain some benefit 41–43. Following this example in natural systems, scholars have sought (and found) this pattern in other kinds of systems 36,44,45. Here, we quantify the amount of nestedness in our OSS networks by employing the global nestedness fitness $\mathcal{N}$ introduced by Solé-Ribalta et al. 31:

$$\mathcal{N} = \frac{2}{N + M} \left\{ \sum_{i,j}^{N} \left[ \frac{O_{i,j} - \langle O_{i,j} \rangle}{k_j (N - 1)} \, \Theta(k_i - k_j) \right] + \sum_{l,m}^{M} \left[ \frac{O_{l,m} - \langle O_{l,m} \rangle}{k_m (M - 1)} \, \Theta(k_l - k_m) \right] \right\},$$

where $O_{i,j}$ (or $O_{l,m}$) measures the degree of link overlap between pairs of row (or column) nodes; $k_i$, $k_j$ correspond to the degrees of the nodes i, j; and $\Theta(\cdot)$ is a Heaviside step function that guarantees that we only compute the overlap between a pair of nodes when $k_i \ge k_j$. Finally, $\langle O_{i,j} \rangle$ represents the expected number of links between row nodes i and j in the null model, and is equal to $\langle O_{i,j} \rangle = k_i k_j / M$.
Modularity. A modular network structure (Figure 2, center) implies the existence of well-connected subgroups, which can be identified given the right heuristics to do so. Modularity has been reported in almost any kind of system: from food-webs 46 to lexical networks 47, to the Internet 27 and social networks.
A correct quantification is needed to settle the extent to which a given network does or does not have community structure. First, we look for the optimal modular partition of the nodes through a community detection analysis 26,27. To this end, we apply the extremal optimisation algorithm 48 (along with a Kernighan-Lin 49 refinement procedure) to maximise Barber's 26 modularity Q,

$$Q = \frac{1}{L} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \tilde{a}_{ij} - \tilde{p}_{ij} \right) \delta(\alpha_i, \alpha_j),$$

where L is the number of interactions (links) in the network, $\tilde{a}_{ij}$ denotes the existence of a link between nodes i and j, $\tilde{p}_{ij} = k_i k_j / L$ is the probability that a link exists by chance, and $\delta(\alpha_i, \alpha_j)$ is the Kronecker delta function, which takes the value 1 if nodes i and j are in the same community, and 0 otherwise.
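To make the definition concrete, the sketch below evaluates Barber's modularity for a given partition, following the description in the text: the observed link minus the chance probability $k_i k_j / L$, summed over contributor-file pairs in the same community and divided by L. It only scores a partition; it does not implement the extremal optimisation or the Kernighan-Lin refinement, and the function name and toy matrix are assumptions for illustration.

```python
def barber_modularity(A, row_labels, col_labels):
    """Barber's bipartite modularity Q of a given partition.

    A is a binary contributor-by-file matrix; row_labels[i] and col_labels[j]
    are the community labels of contributor i and file j.
    """
    L = sum(sum(row) for row in A)          # number of links
    k_rows = [sum(row) for row in A]        # contributor degrees
    k_cols = [sum(col) for col in zip(*A)]  # file degrees
    q = 0.0
    for i, row in enumerate(A):
        for j, a_ij in enumerate(row):
            if row_labels[i] == col_labels[j]:
                q += a_ij - k_rows[i] * k_cols[j] / L
    return q / L


# Two perfectly separated blocks (hypothetical data).
A = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
q_blocks = barber_modularity(A, [0, 0, 1, 1], [0, 0, 1, 1])
q_single = barber_modularity(A, [0, 0, 0, 0], [0, 0, 0, 0])
```

Placing every node in a single community gives Q = 0, since the observed and expected numbers of links then coincide.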
In-block nestedness. The third architectural pattern that we considered in our work consists of a mesoscale hybrid pattern in which the network presents a modular structure, but the interactions within each module are nested, i.e., an in-block nested structure, see Figure 2 (right). This type of hybrid or "compound" architecture was first described by Lewinsohn et al. 29. Although the literature covering this type of pattern is still scarce, the existence of such hybrid structures in empirical networks has been recently explored 30,31,50, and the results from these works seem to indicate that combined structures are, in fact, a common feature in many systems from different contexts.
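In order to compute the amount of in-block nestedness present in networks, in this work we have adopted a new objective function 31 that is capable of detecting these hybrid architectures, and employed the same optimisation algorithms used to maximise modularity. The in-block nestedness objective function can be written as

$$\mathcal{I} = \frac{2}{N + M} \left\{ \sum_{i,j}^{N} \left[ \frac{O_{i,j} - \langle O_{i,j} \rangle}{k_j (C_i - 1)} \, \Theta(k_i - k_j) \, \delta(\alpha_i, \alpha_j) \right] + \sum_{l,m}^{M} \left[ \frac{O_{l,m} - \langle O_{l,m} \rangle}{k_m (C_l - 1)} \, \Theta(k_l - k_m) \, \delta(\alpha_l, \alpha_m) \right] \right\},$$

where $C_i$ (or $C_l$) is the number of row (or column) nodes in the block to which node i (or l) belongs. Note that, by definition, $\mathcal{I}$ reduces to $\mathcal{N}$ when the number of blocks is 1. This explains why the right half of the ternary plot (Figure 6) is necessarily empty: $\mathcal{I} \ge \mathcal{N}$, and therefore $f_{\mathcal{I}} \ge f_{\mathcal{N}}$. On the other hand, an in-block nested structure necessarily exhibits some level of modularity, but not the other way around. This explains why the lower-left area of the simplex in Figure 6 is empty as well (see Palazzi et al. 32 for details).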
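The following sketch evaluates an in-block nestedness score for a given partition, under our reading of the objective function of Solé-Ribalta et al. 31: pairwise overlaps corrected by the null expectation, restricted to node pairs in the same block and normalised by the degree of the less connected node and by the block size minus one. Two choices are assumptions of this example rather than facts stated in the text: ties in degree are weighted by 1/2, i.e. Θ(0) = 1/2, so that each tied pair is counted once, and nodes of degree zero are skipped. With a single block the function returns the global nestedness. No optimisation over partitions is attempted.

```python
def _class_sum(X, labels, n_other):
    """Null-corrected, normalised overlap sum over same-block pairs of rows of X."""
    degrees = [sum(row) for row in X]
    block_size = {}
    for lab in labels:
        block_size[lab] = block_size.get(lab, 0) + 1
    total = 0.0
    for i, row_i in enumerate(X):
        for j, row_j in enumerate(X):
            if i == j or labels[i] != labels[j] or degrees[j] == 0:
                continue
            k_i, k_j = degrees[i], degrees[j]
            # Assumed Heaviside convention: Theta(0) = 1/2.
            theta = 1.0 if k_i > k_j else 0.5 if k_i == k_j else 0.0
            if theta == 0.0:
                continue
            overlap = sum(a * b for a, b in zip(row_i, row_j))
            expected = k_i * k_j / n_other  # null-model overlap
            total += theta * (overlap - expected) / (k_j * (block_size[labels[i]] - 1))
    return total


def in_block_nestedness(A, row_labels, col_labels):
    """In-block nestedness of a binary matrix A for a given partition."""
    n, m = len(A), len(A[0])
    columns = [list(col) for col in zip(*A)]
    rows_term = _class_sum(A, row_labels, m)
    cols_term = _class_sum(columns, col_labels, n)
    return 2.0 / (n + m) * (rows_term + cols_term)


def nestedness(A):
    """Global nestedness: the same score with all nodes in a single block."""
    return in_block_nestedness(A, [0] * len(A), [0] * len(A[0]))


# Two nested blocks on the diagonal (hypothetical data).
A = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
i_blocks = in_block_nestedness(A, [0, 0, 1, 1], [0, 0, 1, 1])
n_global = nestedness(A)
```

In this toy case the two-block partition scores higher than the single block, in line with the observation that an optimised in-block nestedness cannot fall below the global nestedness.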

Figure 1. Scatter plot of the developers' implicit average degree $\langle k \rangle$ (blue line) against the size S = N + M of a project (where N is the number of contributors, and M is the number of files). The shadowed grey area represents one standard deviation above and below the average, while circles represent each individual project. The plot is presented in semi-log scale.

Figure 2. Top row: idealised arrangements. Left: nestedness N; middle: modularity Q; right: in-block nestedness I. Bottom row: interaction matrices for three projects with high values for each one of the structural patterns of interest.

Figure 3. Values of N, Q, and I obtained for each project of our dataset. The projects were sorted to maximise the difference between N and Q.

Figure 4. Distribution of the largest community size for each project obtained after optimisation of modularity (panel a) and in-block nestedness (panel b). In both panels, the solid red line corresponds to the log-normal fit performed on each distribution, which are centred around 100. The dashed orange line indicates the average values of our dataset, and inset panels show the Q-Q plots of the empirical versus theoretical quantiles from the log-normal distribution fit.

Figure 5. The evolution of the average community size as a function of s presents differences for Q- and I-optimised partitions (panels a and c, respectively). Regarding the size, average Q-communities are in general larger than I-communities. Furthermore, the scaling behaviour is also different: the average community size for Q-optimised partitions grows moderately with s, while it remains fairly constant for I beyond s > 2000. Turning from average to maximum community size, Q- and I-optimised partitions (panels b and d, respectively) present very similar bounds, from 30 to 300 contributors. Again, the largest Q-community tends to grow slightly with s, while this size stabilises around 100 for the case of I. Note the semi-log scaling.

Figure 6. Distribution of the three architectural patterns for each of the projects on a ternary plot. The colorbar indicates the number of commits received by each project, normalised by its size.

Table 1. Statistics of our dataset.