Quantifying the relationship between specialisation and reputation in an online platform

Online platforms implement digital reputation systems in order to steer individual user behaviour towards outcomes that are deemed desirable on a collective level. At the same time, most online platforms are highly decentralised environments, leaving their users plenty of room to pursue different strategies and diversify behaviour. We provide a statistical characterisation of the user behaviour emerging from the interplay of such competing forces in Stack Overflow, a long-standing knowledge sharing platform. Over the 11 years covered by our analysis, we represent the interactions between users and topics as bipartite networks. We find such networks to display nested structures akin to those observed in ecological systems, demonstrating that the platform’s user base consistently self-organises into specialists and generalists, i.e., users who focus on narrow and broad sets of topics, respectively. We relate the emergence of these behaviours to the platform’s reputation system with a series of data-driven models, and find specialisation to be statistically associated with a higher ability to post the best answers to a question. We contrast our findings with observations made in top-down environments—such as firms and corporations—where generalist skills are consistently found to be more successful.

Online platforms implement a variety of incentive systems to foster trust between their users. In some cases (e.g., Twitter), these come as simple identity verification protocols. In other cases (e.g., sharing economy platforms such as Uber and Airbnb), trust is fostered with a reputation score that users develop through digital peer-review mechanisms (e.g., star ratings) 3,4.
In recent years, a number of studies have analysed the relationship between user behaviour and reputation. Experimental approaches have measured user response to different elements appearing on profiles in order to identify which ones are most conducive to trust 5. Other studies have instead looked at strategic behaviour as a driver of user reputation, focusing, e.g., on the cooperative and retaliatory mechanisms underlying the exchange of ratings 6,7 and on the incentives to commit review fraud 8. An understudied aspect in this stream of research concerns other types of strategic user behaviour, namely those related to specialisation and generalism.
In ecology, the term specialist (generalist) refers to species that prosper in a limited (wide) range of environmental conditions. Specialisation emerges as a natural response to competitive pressure, with the aim of securing an edge in specific circumstances. Conversely, generalism emerges as resilience against varying conditions. Such concepts have found plenty of applications in non-natural domains, and have been particularly helpful to conceptualise different strategic behaviours in large socio-economic systems.
The management literature has consistently found that individuals with broader sets of skills (i.e., generalists) enjoy greater success in top-down organisations. Generalist CEOs receive higher pay than their specialist counterparts, with the highest pay increases occurring when firms switch from a specialist to a generalist CEO 9. Similar results are found in 10, whose authors interpret them as a reflection of a higher demand for the generalist skills required to manage increasingly complex firms. Generalist CEOs are also more likely to engage in acquisitions outside a firm's main industry 11. Similarly, empirically tested theories of leadership support the idea that leaders in industry tend to be generalists rather than specialists 12.
Traces of such behaviours have also been observed in the bottom-up context of online platforms, with a wide range of strategies, from extreme specialisation to extreme generalism, being found, e.g., on Reddit and GitHub 13,14. Notably, such strategies are associated with different user archetypes, with specialists being more likely to stick to the online communities they contribute to and generalists being more likely to remain active on platforms as a whole. In the context of online gaming, generalists have been found to be more resilient to change (e.g., after the release of game patches), although specialists ultimately tend to outperform other players on average 15.
In this paper, we aim to quantify the relationship between specialisation/generalism and reputation in online platforms. To the best of our knowledge, this is an understudied relationship, which has only been looked at in contexts where reputation is developed through interactions that are external to platforms (e.g., the online ratings received by medical professionals on physician-rating websites 16). Our focus here, instead, is to look at such a link in contexts where user reputation is developed endogenously through interactions and peer-review taking place on the platform itself.
We do so by analysing data from Stack Overflow (SO), the flagship knowledge-sharing platform of the Stack Exchange network, which features questions and answers on a wide variety of topics in the area of computer programming (see Methods section). SO implements an elaborate reputation system, which is well known for its effectiveness in incentivising users to produce high quality posts 17.

Results
Platform growth
We begin our analysis by looking at the evolution of the Stack Overflow platform over time from an aggregate perspective. For each year in our dataset (2009-2019), we look at the monthly number of active users (i.e., users who posted at least once), the monthly number of tags (i.e., tags that appear in at least one post), and the monthly number of posts. These quantities are reported in Fig. 1, broken down by post type. The number of users posting questions rapidly overtakes the number of those posting answers (left panel), and both numbers settle at several thousand around 2013-2014. The numbers of tags featured in answers and in questions are roughly equal throughout the platform's lifetime (central panel), whereas the number of answers posted remains systematically higher than the number of questions (right panel), with both numbers settling on the order of tens of thousands of posts per month.
We then proceed to characterise the platform's growth by categorising its user base with respect to post types. We discard casual users by restricting our analysis to those who contribute at least 10 posts (answers and questions combined) in a given year. We characterise a user's activity based on the relative proportion of questions and answers. We denote by $A_i(y)$ ($Q_i(y)$) the number of answers (questions) posted by user $i$ during year $y$, and we characterise the user's profile with respect to post types in that year with the following score:
$$D_i(y) = \frac{A_i(y) - Q_i(y)}{A_i(y) + Q_i(y)} \qquad (1)$$
so that $D_i(y) = +1$ for users who only post answers (A-users) and $D_i(y) = -1$ for users who only post questions (Q-users). Overall, the proportions of users belonging to both groups grow over time (Figure 2, top left). However, the fraction of A-users remains relatively stable between 15% and 20% (even showing some decline in 2019), whereas the proportion of Q-users increases from less than 10% to almost 25%. After an initial phase where A-users are more numerous, Q-users become the relative majority in 2011, signalling the platform's transition from a 'supply-driven' to a 'demand-driven' knowledge marketplace.
The above transition is not driven by the addition of newcomers to a stable core of users, but rather by turnover. The top right and bottom left panels in Figure 2 show, respectively, the year-to-year survival and dropout rates for A- and Q-users. With the former, we indicate the empirically estimated probability that a user belonging to either group in a given year will again belong to the same group the following year, while with the latter we indicate the probability that a user either leaves the platform or falls below the minimum activity threshold to be included in our analysis (10 posts). Only a minority of A- and Q-users remain in such groups in consecutive years, and the dropout rates for both groups display a sharp increase over time. We can therefore conclude that the sub-populations of A- and Q-users grow over time through the replacement of users who drop out with larger numbers of new users.
Let us also mention that there is very little spillover between the two groups throughout the years, as testified by the fact that the transition rates between them (i.e., the empirically estimated probability that a Q-user will become an A-user the following year and vice versa) are both below 0.3%, as shown in the bottom right panel in Figure 2 .
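As an illustration of the quantities discussed in this subsection, the following minimal Python sketch computes an activity score and the year-to-year survival, dropout and transition rates from per-user post counts. It assumes the score takes the form (A − Q)/(A + Q), which equals +1 for answer-only users and −1 for question-only users. The function names, the 'mixed' label and the toy data are hypothetical and are not taken from our analysis pipeline.

```python
def activity_score(n_answers, n_questions):
    """Activity score D in [-1, 1]: +1 for answers only, -1 for questions only."""
    total = n_answers + n_questions
    if total == 0:
        raise ValueError("user has no posts in this year")
    return (n_answers - n_questions) / total


def user_group(n_answers, n_questions, min_posts=10):
    """Return 'A', 'Q' or 'mixed'; None for users below the activity threshold."""
    if n_answers + n_questions < min_posts:
        return None
    d = activity_score(n_answers, n_questions)
    if d == 1.0:
        return "A"
    if d == -1.0:
        return "Q"
    return "mixed"


def yearly_rates(groups_now, groups_next, group):
    """Empirical survival, dropout and transition rates for one group ('A' or 'Q').

    groups_now / groups_next map user ids to the label returned by user_group.
    A user who is missing (or labelled None) in groups_next counts as a dropout.
    """
    other = "Q" if group == "A" else "A"
    members = [u for u, g in groups_now.items() if g == group]
    if not members:
        raise ValueError("no users in the requested group")
    n = len(members)
    survived = sum(groups_next.get(u) == group for u in members)
    dropped = sum(groups_next.get(u) is None for u in members)
    switched = sum(groups_next.get(u) == other for u in members)
    return {"survival": survived / n, "dropout": dropped / n, "transition": switched / n}


# Toy data (hypothetical): user id -> (answers, questions) in two consecutive years.
counts_now = {"u1": (12, 0), "u2": (30, 0), "u3": (10, 0), "u4": (15, 0), "u5": (0, 11)}
counts_next = {"u1": (20, 0), "u2": (0, 14), "u3": (6, 6), "u4": (3, 0)}

groups_now = {u: user_group(a, q) for u, (a, q) in counts_now.items()}
groups_next = {u: user_group(a, q) for u, (a, q) in counts_next.items()}

print(yearly_rates(groups_now, groups_next, "A"))
```

In this toy example one of the four A-users stays an A-user, one switches to posting questions only, one becomes a mixed user, and one falls below the 10-post threshold.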

Specialist and generalist users
We then proceed to characterise user behaviour in terms of topics.
We do so by forming monthly bipartite user-tag networks restricted to 'pure' A- and Q-users (i.e., users whose activity score in Eq. (1) is D = 1 and D = −1, respectively, in the year of interest).
Namely, if a Q-user $i$ has posted $w^Q_{i\tau}$ questions featuring the tag $\tau$, we place a link from $i$ to $\tau$ with weight $w^Q_{i\tau}$. We construct a similar network for A-users, considering as weights the number of answers posted in response to questions featuring a certain tag. Following well established approaches to detect the coexistence of specialisation and generalism in ecosystems, we measure nestedness in such networks (see Fig. 3), and compare its values against those obtained under a null network model in order to establish its significance, following a procedure based on spectral radii 18 (see Methods section). We find nestedness to be statistically significant throughout the platform's history (see Fig. 5), which in turn suggests that the platform indeed self-organises into specialist and generalist users, on both its supply and demand sides. Based on this observation, we then quantify the level of specialisation attained by users in their activity when posting answers/questions with the Herfindahl index, a measure of concentration which (in the case of questions) reads
$$H^Q_i = \sum_{\tau} \left( \frac{w^Q_{i\tau}}{w^Q_i} \right)^2 \qquad (2)$$
where $w^Q_{i\tau}$ (as defined above) is the number of questions posted by the user on tag $\tau$, whereas $w^Q_i = \sum_{\tau} w^Q_{i\tau}$ is the total number of questions posted by the user. With the above definition, the Herfindahl index will approach one for users who are only active on a limited set of tags (with the limiting case $H^Q_i = 1$ for users active on just one tag), and will instead approach zero for users whose activity is uniformly spread over a large number of tags. We define an equivalent index $H^A_i$ in the case of answers and characterise users whose activity features both types of posts with both Herfindahl indices. Fig. 6 shows the annual distributions of the Herfindahl scores for both answers and questions.
Both distributions are remarkably stable throughout the years, signalling that, despite the increase in the number of tags (see the middle panel in Figure 1), the users' collective behaviour in terms of specialisation remains largely unchanged.
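The Herfindahl index described above can be sketched in a few lines of Python. The snippet below is illustrative only: it assumes that the index is the sum of squared tag shares, with the sum of per-tag weights as the normaliser, and the helper names (`tag_weights`, `herfindahl`) are hypothetical.

```python
from collections import Counter


def tag_weights(posts):
    """Count, for one user, how many posts feature each tag.

    `posts` is a list of tag lists, one list per post (up to five tags each).
    """
    weights = Counter()
    for tags in posts:
        weights.update(set(tags))  # each tag counted once per post
    return weights


def herfindahl(weights):
    """Herfindahl index: sum of squared shares of the per-tag weights."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("user has no tagged posts")
    return sum((w / total) ** 2 for w in weights.values())


# A user who only ever posts on one tag is a perfect specialist.
specialist = tag_weights([["python"]] * 10)
# A user whose activity is spread evenly over four tags is closer to a generalist.
generalist = tag_weights([["python"], ["java"], ["c++"], ["sql"]])

print(herfindahl(specialist))
print(herfindahl(generalist))
```

With activity spread uniformly over k tags the index equals 1/k, which is why it approaches zero for broad generalists.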
Reputation
We proceed next to investigate the users' reputation in the platform. For each user with at least 10 posts in a year, we build a profile based on the following features describing their activity: the number of posts (n), the number of tags associated with their posts (t), their Herfindahl indices ($H_A$ and $H_Q$, see Eq. (2)), and their activity score (D, see Eq. (1)). We use these features to build a number of linear models to characterise user reputation on the platform.
We begin by looking at the main source of user reputation, i.e., the ability to post accepted answers. These correspond to answers selected as the best one in response to a given question by the author of the very same question. Notably, posting an accepted answer is worth 15 reputation points (whereas an up-vote on an answer, for instance, is worth 10), and it is the result of a combination of skills (i.e., both competence and rapidity). In order to identify the factors that are conducive to a user's ability to post answers that may get accepted, we consider a logistic regression model for the log-odds $\log(\pi_A/(1 - \pi_A))$, where $\pi_A$ denotes the probability of a user having at least one accepted answer in a given year (see Methods section). We choose to do so, instead of modelling the acceptance rate of a user's answers, because we find the user population to be approximately split between those who have at least one accepted answer and those who have none. The full results of the calibration of the above model are shown in Table 1, with the corresponding ROC curves shown in Fig. 7. Throughout the years, the model delivers good discriminative performance (with an AUC ranging between 76% and 84%). The regression coefficients obtained for each covariate in each year of our analysis are illustrated in Fig. 4. For the first ten out of eleven years, specialisation ($H_A$) is found to be the leading contributor to a user's ability to post high-quality answers, reaching its maximum relative importance in the early years of the platform, with some mild decline in more recent years.
User activity (n) is the second main contributor, with an increasing trend suggesting that it may be overtaking specialisation (although the two coefficients are statistically indistinguishable in both 2018 and 2019). Notably, the number of tags t on which a user posts answers is the only covariate whose impact changes sign over time: before 2013-2014 it contributes positively to a user's ability to post accepted answers, while it contributes negatively thereafter. This further strengthens the importance of specialisation, as it suggests that the more successful users are those who specialise on narrower sets of tags. The activity score D remains instead negatively correlated with the ability to post accepted answers throughout the entire time window of our analysis.
We then build a multinomial logistic regression model to classify users (in each year) into three mutually exclusive categories: users whose posts have received zero votes, users whose posts have received only up-votes, and users whose posts have received both up- and down-votes. We neglect the case of users whose posts only receive down-votes, since in all years considered in our analysis they account for less than 0.1% of users. We calibrate multinomial logistic regression models for the log-odds associated with the probability of belonging to the three above categories, using the aforementioned features as covariates. The full results are reported in Tables 2 and 3, and show that no specific feature is systematically associated with a higher probability of attracting votes.
We then proceed to restrict our analysis to those users whose posts received at least one vote in a given year. To this end, we calibrate four stepwise regression models using as dependent variables the (logarithm of the) average number of up- or down-votes per post received by a user, separately for answers and questions.
Starting from a constant model, we use both forward and backward selection to select the best model (in terms of sum of squared residuals) based on the aforementioned covariates. With only a few exceptions, the stepwise selection procedure results in a very simple model where the users' activity, as quantified by their number of posts n, is the only statistically significant covariate. It is noteworthy that activity has a similar impact across the board, both in terms of sign and magnitude. Namely, we find activity to have a negative impact on the number of votes received per post, both in the case of up- and down-votes and regardless of the type of post. Remarkably, in the case of questions such minimalistic models explain 60% or more of the variance. The full results of the calibration are reported in Tables 4, 5, 6, and 7.

Discussion
In this paper we presented a number of analyses aimed at understanding the relationship between specialisation and reputation in the domain of online decentralised platforms. Thanks to the lack of monetary incentives and its already long history, Stack Overflow represents an ideal environment to observe the development of such a relationship 'in the wild' over an extended period of time.
The 11 years of history covered in our study reveal how the Stack Overflow platform's user base grew into a structured community, with different individuals taking on different roles. First, we documented how most of the platform's user base quickly evolved into well defined supply and demand sides, represented by two large sub-communities of users characterised by their willingness to answer or pose questions, respectively. Second, we provided ample evidence on the emergence of specialisation at the level of topic selection in the users' posts.
Should the above findings be attributed to self-organisation or should they instead be interpreted as a direct response to the platform's design and incentives? Plausibly, the very nature of Stack Overflow, a knowledge-sharing platform structured around questions and answers, is responsible for the emergence of sub-communities dedicated to posting answers and questions. Like other two-sided platforms, Stack Overflow naturally attracts users with markedly different needs (e.g., similarly to hosts and guests in accommodation platforms).
Specialisation with respect to topic selection is a more complicated phenomenon to unpack.
We do not find it to be correlated (in a statistically significant manner) with the likelihood of attracting up- or down-votes to generic posts, suggesting that the quality of a user's posts may be largely idiosyncratic. Conversely, we do find a statistically significant correlation between a user's specialisation and the likelihood of their answers being accepted as the best one in response to a question. This is a notable asymmetry, as an equivalent selection mechanism is lacking in the case of questions, and no other user-generated feedback awards more reputation points on Stack Overflow than an accepted answer. We interpret these findings as a clear consequence of the incentives set in place by the platform's reputation system.
It is interesting to relate our results to findings about the users' decision-making when choosing which answers to accept. Such decision-making has been found to be largely driven by heuristics, with selections being determined by factors such as the order in which answers appear or the amount of screen space they occupy 19. It is therefore tempting to speculate that the selection process that takes place on posted answers may contribute to optimising user behaviour with respect to such heuristics.
Our findings illustrated in Fig. 4 shed light on the above point by identifying the salient traits of successful users. These are, on average, highly active and specialised users, whose specialisation progressively focuses on a narrower set of topics (as testified by the change in sign of the coefficient associated with the number of tags). Notably, these are not users who specialise in posting answers only, as their activity score D (see Eq. (1)) is negatively correlated with the likelihood of having answers accepted, suggesting that developing some expertise on both sides of a two-sided platform may unlock positive reputational spillovers.
Overall, our findings are in rather stark contrast with observations made in top-down environments (such as firms and corporations), where generalists are usually found to enjoy greater success than specialists. However, we ought to acknowledge that the extent to which our findings may generalise to other decentralised online environments can only be the subject of speculation at this stage. Stack Overflow's reputation system and the sustained success it has brought to the platform, with relatively minimal policy changes throughout the years, are rather unique. Other successful knowledge-sharing platforms have taken radically different approaches to foster trust within their user base. For instance, Wikipedia holds elections to promote reliable users to administrators. Similarly, comparisons with different reputation/feedback systems (e.g., textual reviews) are not straightforward. We therefore believe our work represents a first step towards the data-driven modelling of the relationship between specialisation and online reputation, and a blueprint that subsequent studies may adapt to different environments and data sources.

Methods
Data
We analyse data from the Stack Overflow platform, the flagship site of the Stack Exchange Network, which features questions and answers on a wide variety of topics in the area of computer programming. The portion of the data used in our study spans 11 years, from January 2009 (shortly after the platform was started in 2008) to December 2019. Posts represent the main unit of activity in the platform. Posts are divided into three main categories: questions, answers, and accepted answers. An accepted answer is a post that has been identified as the best one in response to a question by the author of the same question. Users can classify the questions they post with up to five tags (e.g., C++, Python), which help other users identify the posts they might be able to reply to. Each individual post (i.e., both questions and answers) can generate a sub-thread in the form of comments. Any post or comment can be either up-voted or down-voted by other users.
Users develop a reputation score based on their activity. The main sources of points are accepted answers (+15 points) and up-votes (+5 for questions, +10 for answers). A down-vote penalises the user receiving it by −2 points. Down-voting posts is costly (−1 point) in order to suppress trolling. Upon reaching certain milestones users can also earn reputational badges.
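The point values listed above can be combined into a simple tally. The following Python sketch is a worked example of that arithmetic only. The event names and the `reputation_change` helper are hypothetical, and the sketch ignores badges and any other rule of the platform that is not described in the text.

```python
# Reputation points per event, as listed in the text above.
POINTS = {
    "accepted_answer": 15,    # user's answer is accepted
    "answer_upvote": 10,      # up-vote received on an answer
    "question_upvote": 5,     # up-vote received on a question
    "downvote_received": -2,  # down-vote received on a post
    "downvote_cast": -1,      # cost of down-voting someone else's post
}


def reputation_change(events):
    """Total reputation change for a dict mapping event name -> number of events."""
    unknown = set(events) - set(POINTS)
    if unknown:
        raise KeyError(f"unknown event types: {sorted(unknown)}")
    return sum(POINTS[name] * count for name, count in events.items())


# Example: two accepted answers, three answer up-votes and one down-vote received.
year_events = {"accepted_answer": 2, "answer_upvote": 3, "downvote_received": 1}
print(reputation_change(year_events))  # 2*15 + 3*10 - 2 = 58
```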
Nestedness
In ecological systems, nestedness refers to a property typically observed in the networks describing species-species interactions. Let us assume that such interactions in a given system are represented by a weighted bipartite adjacency matrix $W$, whose entry $w_{ij}$ quantifies the strength of interaction between species $i$ and $j$. In a perfectly nested matrix, an arrangement of rows and columns can be found such that the set of links in each row $i$ (column $j$) contains the set of links in row $i+1$ (column $j+1$), and such that matrix entries satisfy $w_{ij} \le \min(w_{i-1,j}, w_{i,j-1})$.
It can be shown that among all possible connected bipartite networks with a fixed number of nodes and links, the one yielding the highest spectral radius $\rho(W)$ corresponds to a perfectly nested matrix 20, where the spectral radius is defined as the largest singular value. Therefore, an ideal measure of nestedness in an empirical bipartite weighted matrix would be the ratio between its spectral radius and that of the corresponding perfectly nested matrix with the same number of nodes and links. This, however, is unfeasible in practice due to the prohibitively high computational cost of identifying the perfectly nested matrix in the set by exhaustive enumeration. Therefore, in our work we follow Staniczenko et al. 18, and quantify the nestedness of a matrix with the z-score $z(\rho) = (\rho(W) - \bar{\rho}(W)) / \sigma(\rho(W))$, where $\bar{\rho}(W)$ and $\sigma(\rho(W))$ represent, respectively, the mean and standard deviation of the spectral radii computed over a sampled population of bipartite matrices with the same nodes and edges as $W$, but with randomly reshuffled link weights.

Logistic regression model for user specialization
For each year in our analysis we calibrate the following logistic regression model:
$$\log\left(\frac{\pi_A}{1 - \pi_A}\right) = \beta_0 + \beta_n n_A + \beta_t t_A + \beta_H H_A + \beta_D D$$
where $\pi_A$ denotes the probability that at least one of the answers posted by a user (with at least 10 posts in the year under consideration) gets accepted, i.e., marked as the best one in response to a question. In the above expression $n_A$ denotes the number of answers posted by a user, $t_A$ the number of tags associated with the corresponding questions, $H_A$ the specialisation of the user as quantified by the Herfindahl index (see Eq. (2)), $D$ the user's activity score (see Eq. (1)), and the $\beta$ coefficients are the parameters to be estimated.
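To make the calibration step concrete, the sketch below fits a logistic regression by plain gradient ascent on the log-likelihood, using only the Python standard library. It is a didactic illustration under simplifying assumptions (a single covariate, a fixed learning rate, made-up data) and not the estimation routine behind Table 1. The function names and the toy values of $H_A$ are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, n_iter=5000):
    """Return [intercept, slope_1, ..., slope_k] maximising the log-likelihood."""
    k = len(X[0])
    beta = [0.0] * (k + 1)
    m = len(X)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            err = yi - sigmoid(z)  # derivative of the log-likelihood w.r.t. z
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        beta = [b + lr * g / m for b, g in zip(beta, grad)]
    return beta


def predict_proba(beta, x):
    """Probability of the positive class for one covariate vector x."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


# Toy data (hypothetical): one covariate, the Herfindahl index H_A of each user,
# and a binary outcome: 1 if the user has at least one accepted answer in the year.
X = [[0.1], [0.2], [0.3], [0.6], [0.4], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

beta = fit_logistic(X, y)
print(beta)
print(predict_proba(beta, [0.9]), predict_proba(beta, [0.1]))
```

In this toy example the fitted slope is positive, i.e., more specialised users are assigned a higher probability of having an accepted answer. Extending the covariate vector to the four features used in the model only requires passing longer rows in `X`.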

A Additional tables and figures
Table 1: Logistic regression model for the probability that a user's answer gets accepted. We calibrate a logistic regression model for the log-odds $\log(\pi_A/(1 - \pi_A))$ with covariates $n_A$, $t_A$, $H_A$ and $D$, where $\pi_A$ denotes the probability that a user has at least one accepted answer in a given year, $n_A$ indicates the number of answers posted by a user, $t_A$ the number of tags featured in the corresponding questions, $H_A$ the user's Herfindahl index (Eq. (2)), and $D$ the user's activity score (Eq. (1)). The three bottom rows report the number $N$ of users included in the model (i.e., users with at least 10 posts in the year of interest, of which at least one is an answer) and two summary statistics of the resulting regression model.

Table 2: Multinomial logistic regression model for the probability that a user's answers attract votes. We calibrate a multinomial logistic regression model for the log-odds associated with the probabilities $\pi^{(z,A)}$, $\pi^{(u,A)}$ and $\pi^{(v,A)}$, with covariates $n_A$, $t_A$, $H_A$ and $D$, where $\pi^{(z,A)}$, $\pi^{(u,A)}$ and $\pi^{(v,A)}$ indicate, respectively, the probability that a user's posted answers receive zero votes, only up-votes, and both up- and down-votes in a given year. $n_A$ indicates the number of answers posted by a user, $t_A$ the number of tags featured in the corresponding questions, $H_A$ the user's Herfindahl index (Eq. (2)), and $D$ the user's activity score (Eq. (1)). The four bottom rows report the total number of users ($N$), and the fractions of users whose posted answers received only up-votes ($N_u$), both up- and down-votes ($N_v$) and zero votes ($N_z$). Numbers in brackets indicate the standard errors of the estimated coefficients.

Table 3: Multinomial logistic regression model for the probability that a user's questions attract votes. We calibrate a multinomial logistic regression model for the log-odds associated with the probabilities $\pi^{(z,Q)}$, $\pi^{(u,Q)}$ and $\pi^{(v,Q)}$, with covariates $n_Q$, $t_Q$, $H_Q$ and $D$, where $\pi^{(z,Q)}$, $\pi^{(u,Q)}$ and $\pi^{(v,Q)}$ indicate, respectively, the probability that a user's posted questions receive zero votes, only up-votes, and both up- and down-votes in a given year. $n_Q$ indicates the number of questions posted by a user, $t_Q$ the number of tags featured in such questions, $H_Q$ the user's Herfindahl index (Eq. (2)), and $D$ the user's activity score (Eq. (1)). The four bottom rows report the total number of users ($N$), and the fractions of users whose posted questions received only up-votes ($N_u$), both up- and down-votes ($N_v$) and zero votes ($N_z$). Numbers in brackets indicate the standard errors of the estimated coefficients.

Table 4: Stepwise linear regression model for the number of up-votes received by a user's answers. We calibrate a stepwise linear regression model for the logarithm of the average number of up-votes per answer, $\log(v^{\uparrow A}/n_A)$, with candidate covariates $n_A$, $t_A$, $H_A$ and $D$, where $v^{\uparrow A}$ denotes the number of up-votes received by a user's posted answers, $n_A$ indicates the number of answers posted by a user, $t_A$ the number of tags featured in the corresponding questions, $H_A$ the user's Herfindahl index (Eq. (2)), and $D$ the user's activity score (Eq. (1)). The three bottom rows report, respectively, the number $N$ of users included in the model (i.e., users with at least 10 posts in the year of interest and with at least one up-vote to their posted answers), the resulting regression model's $R^2$ coefficient and the model's $F$ statistic. Numbers in brackets indicate the standard errors of the estimated coefficients.

Table 5: Stepwise linear regression model for the number of down-votes received by a user's answers. We calibrate a stepwise linear regression model for the logarithm of the average number of down-votes per answer, $\log(v^{\downarrow A}/n_A)$, with candidate covariates $n_A$, $t_A$, $H_A$ and $D$, where $v^{\downarrow A}$ denotes the number of down-votes received by a user's posted answers, $n_A$ indicates the number of answers posted by a user, $t_A$ the number of tags featured in the corresponding questions, $H_A$ the user's Herfindahl index (Eq. (2)), and $D$ the user's activity score (Eq. (1)). The three bottom rows report, respectively, the number $N$ of users included in the model (i.e., users with at least 10 posts in the year of interest and with at least one down-vote to their posted answers), the resulting regression model's $R^2$ coefficient and the model's $F$ statistic. Numbers in brackets indicate the standard errors of the estimated coefficients.

Table 6: Stepwise linear regression model for the number of up-votes received by a user's questions. We calibrate a stepwise linear regression model for the logarithm of the average number of up-votes per question, $\log(v^{\uparrow Q}/n_Q)$, with candidate covariates $n_Q$, $t_Q$, $H_Q$ and $D$, where $v^{\uparrow Q}$ denotes the number of up-votes received by a user's posted questions, $n_Q$ indicates the number of questions posted by a user, $t_Q$ the number of tags featured in such questions, $H_Q$ the user's Herfindahl index (Eq. (2)), and $D$ the user's activity score (Eq. (1)). The three bottom rows report, respectively, the number $N$ of users included in the model (i.e., users with at least 10 posts in the year of interest and with at least one up-vote to their posted questions), the resulting regression model's $R^2$ coefficient and the model's $F$ statistic. Numbers in brackets indicate the standard errors of the estimated coefficients.

Table 7: Stepwise linear regression model for the number of down-votes received by a user's questions. We calibrate a stepwise linear regression model for the logarithm of the average number of down-votes per question, $\log(v^{\downarrow Q}/n_Q)$, with candidate covariates $n_Q$, $t_Q$, $H_Q$ and $D$, where $v^{\downarrow Q}$ denotes the number of down-votes received by a user's posted questions, $n_Q$ indicates the number of questions posted by a user, $t_Q$ the number of tags featured in such questions, $H_Q$ the user's Herfindahl index (Eq. (2)), and $D$ the user's activity score (Eq. (1)). The three bottom rows report, respectively, the number $N$ of users included in the model (i.e., users with at least 10 posts in the year of interest and with at least one down-vote to their posted questions), the resulting regression model's $R^2$ coefficient and the model's $F$ statistic. Numbers in brackets indicate the standard errors of the estimated coefficients.

Fig. 1. Growth of Stack Overflow from 2009 to 2019. (a) Monthly number of active users. (b) Monthly number of tags featured in posts. (c) Monthly number of posts. In all panels the blue (red) symbols refer to answers (questions) on Stack Overflow.

Figure 2 (top left) shows the annual proportions of users who only post answers ($D_i = +1$) or questions ($D_i = -1$). Let us label such two groups as A-users and Q-users, respectively.

Fig. 2. Characterisation of Stack Overflow's user base. (a) Annual percentage of A- and Q-users (D = 1 and D = −1, respectively, see Eq. (1)). (b) Annual survival probabilities for A-users (blue) and Q-users (magenta), defined as the empirically estimated probabilities for users belonging to either group to belong to the same group in the following year. (c) Annual dropout rates for A-users (blue) and Q-users (magenta), defined as the empirically estimated probabilities for users belonging to either group to either leave the platform or fall below the minimum threshold of 10 posts per year to be considered in our analysis. (d) Annual transition rates from A- to Q-users (blue) and vice versa (magenta), defined as the empirically estimated probabilities for users belonging to one group to transition to the other one the following year.

Fig. 3. Evidence of nestedness in Stack Overflow's user-tag bipartite networks. (a) User-tag bipartite network for answers posted in January 2009 (size: 6960 × 4982). (b) User-tag bipartite network for questions posted in January 2009 (size: 6011 × 4906). In both panels blue dots represent non-zero entries, and the matrix rows and columns have been sorted from top to bottom for users and from left to right for tags.

Fig. 4. Logistic regression results for the probability that a user has at least one accepted answer in a given year. Dots represent the values of the regression coefficients estimated for the four covariates included in the model, shown in the legend. Error bars show three times the standard errors of the coefficients.

Fig. 5. Evidence of nestedness in Stack Overflow's user-tag bipartite networks. (a) z-scores of the spectral radius ρ calculated in the monthly user-tag networks of A-users (i.e., users with activity score D = 1, see Eq. (1)). (b) Same quantity calculated in the monthly user-tag networks of Q-users (D = −1). In both panels, z-scores are calculated with the procedure put forward in 18.

Fig. 6. Empirical density of the Herfindahl index. (a) Annual distribution of user specialisation with respect to tags in the case of answers. (b) Annual distribution of user specialisation with respect to tags in the case of questions.

Fig. 7. Model performance of the logistic regressions for accepted answers. ROC curves for the models in Table 1.