Quantifying social organization and political polarization in online platforms


Mass selection into groups of like-minded individuals may be fragmenting and polarizing online society, particularly with respect to partisan differences (refs. 1–4). However, our ability to measure the social makeup of online communities and, in turn, to understand the social organization of online platforms is limited by the pseudonymous, unstructured and large-scale nature of digital discussion. Here we develop a neural-embedding methodology to quantify the positioning of online communities along social dimensions by leveraging large-scale patterns of aggregate behaviour. Applying our methodology to 5.1 billion comments made in 10,000 communities over 14 years on Reddit, we measure how the macroscale community structure is organized with respect to age, gender and US political partisanship. Examining political content, we find that Reddit underwent a significant polarization event around the 2016 US presidential election. Contrary to conventional wisdom, however, individual-level polarization is rare; the system-level shift in 2016 was disproportionately driven by the arrival of new users. Political polarization on Reddit is unrelated to previous activity on the platform and is instead temporally aligned with external events. We also observe a stark ideological asymmetry, with the sharp increase in polarization in 2016 being entirely attributable to changes in right-wing activity. This methodology is broadly applicable to the study of online interaction, and our findings have implications for the design of online platforms, understanding the social contexts of online behaviour, and quantifying the dynamics and mechanisms of online polarization.
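As a rough illustration of what positioning communities along a social dimension can look like in an embedding space, the sketch below builds an axis from the averaged difference between seed pairs of community vectors and scores every community by its cosine similarity to that axis. The toy two-dimensional vectors, the community names and the seed-pair construction are all assumptions made for illustration; the abstract itself does not specify these details.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dimension_from_seeds(embedding, seed_pairs):
    """Average the normalized difference vectors of (low, high) seed pairs."""
    size = len(next(iter(embedding.values())))
    total = [0.0] * size
    for low, high in seed_pairs:
        diff = normalize([h - l for l, h in zip(embedding[low], embedding[high])])
        total = [t + d for t, d in zip(total, diff)]
    return normalize(total)

def score(embedding, axis):
    """Cosine similarity of each community vector with the dimension axis."""
    return {name: sum(a * b for a, b in zip(normalize(vec), axis))
            for name, vec in embedding.items()}

# Hypothetical toy embedding; real community embeddings are high-dimensional.
toy_embedding = {
    "left_seed": [-1.0, 0.2],
    "right_seed": [1.0, 0.2],
    "hobby": [0.0, 1.0],
    "leaning_right": [0.5, 1.0],
}
axis = dimension_from_seeds(toy_embedding, [("left_seed", "right_seed")])
scores = score(toy_embedding, axis)
```

In this toy setup a community unrelated to the seeds scores near zero, while communities closer to one seed receive scores of the corresponding sign.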

Fig. 1: Quantifying social dimensions on Reddit.
Fig. 2: Macroscale social organization of Reddit communities.
Fig. 3: Distribution of political activity on Reddit.
Fig. 4: Political polarization of new and existing users.
Fig. 5: Ideological asymmetry in online polarization.

Data availability

All data are available from the Reddit archive (ref. 28). Source data are provided with this paper. Reddit community embedding, social dimension vectors and community scores are available at

Code availability

All code is available at Analyses were performed with Python v3.7, pandas v1.3.3 and Spark v3.0.


  1. Sunstein, C. #Republic: Divided Democracy in the Age of Social Media (Princeton Univ. Press, 2018).

  2. Iyengar, S. & Hahn, K. S. Red media, blue media: evidence of ideological selectivity in media use. J. Commun. 59, 19–39 (2009).

  3. van Alstyne, M. & Brynjolfsson, E. Electronic communities: global villages or cyberbalkanization? In Proc. International Conference on Information Systems 5 (1996).

  4. van Dijck, J. The Culture of Connectivity: A Critical History of Social Media (Oxford Univ. Press, 2013).

  5. McLuhan, M. The Gutenberg Galaxy: The Making of Typographic Man (Univ. of Toronto Press, 1962).

  6. Farrell, H. The consequences of the internet for politics. Ann. Rev. Pol. Sci. 15, 35–52 (2012).

  7. Conover, M. D. et al. Political polarization on Twitter. Proc. Intl AAAI Conf. Web Soc. Media 133, 89–96 (2011).

  8. Bail, C. A. et al. Exposure to opposing views on social media can increase political polarization. Proc. Natl Acad. Sci. USA 115, 9216–9221 (2018).

  9. Martin, T. community2vec: vector representations of online communities encode semantic relationships. In Proc. 2nd Workshop on NLP and Computational Social Science 27–31 (2017).

  10. Garg, N., Schiebinger, L., Jurafsky, D. & Zou, J. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proc. Natl Acad. Sci. USA 115, E3635–E3644 (2018).

  11. Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V. & Kalai, A. T. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Adv. Neural Inf. Process. Syst. 29, 4349–4357 (2016).

  12. Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186 (2017).

  13. Kozlowski, A. C., Taddy, M. & Evans, J. A. The geometry of culture: analyzing the meanings of class through word embeddings. Am. Soc. Rev. 84, 905–949 (2019).

  14. Shi, F., Shi, Y., Dokshin, F. A., Evans, J. A. & Macy, M. W. Millions of online book co-purchases reveal partisan differences in the consumption of science. Nat. Hum. Behav. 1, 0079 (2017).

  15. Del Vicario, M. et al. Echo chambers: emotional contagion and group polarization on Facebook. Sci. Rep. 6, 37825 (2016).

  16. Pariser, E. The Filter Bubble: What the Internet is Hiding from You (Penguin, 2011).

  17. Flaxman, S., Goel, S. & Rao, J. M. Filter bubbles, echo chambers, and online news consumption. Public Opin. Q. 80, 298–320 (2016).

  18. Bakshy, E., Messing, S. & Adamic, L. A. Exposure to ideologically diverse news and opinion on Facebook. Science 348, 1130–1132 (2015).

  19. DiMaggio, P., Evans, J. & Bryson, B. Have Americans’ social attitudes become more polarized? Am. J. Sociol. 102, 690–755 (1996).

  20. Barberá, P., Jost, J. T., Nagler, J., Tucker, J. A. & Bonneau, R. Tweeting from left to right: is online political communication more than an echo chamber? Psychol. Sci. 26, 1531–1542 (2015).

  21. Adamic, L. A. & Glance, N. The political blogosphere and the 2004 US election: divided they blog. In Proc. 3rd International Workshop on Link Discovery 36–43 (2005).

  22. An Examination of the 2016 Electorate, Based on Validated Voters (Pew Research Center, 2018).

  23. Hawley, G. Making Sense of the Alt-Right (Columbia Univ. Press, 2017).

  24. Simmel, G. Conflict and the Web of Group Affiliations (Free Press, 1955).

  25. Breiger, R. L. The duality of persons and groups. Social Forces 53, 181–190 (1974).

  26. Bourdieu, P. Distinction: A Social Critique of the Judgement of Taste (Routledge, 1984).

  27. Crenshaw, K. W. On Intersectionality: Essential Writings (The New Press, 2017).

  28. Baumgartner, J., Zannettou, S., Keegan, B., Squire, M. & Blackburn, J. The Pushshift Reddit dataset. In Proc. International AAAI Conference on Web and Social Media 14, 830–839 (2020).

  29. Reddit privacy policy Reddit (2021).

  30. Kumar, S., Hamilton, W. L., Leskovec, J. & Jurafsky, D. Community interaction and conflict on the web. In Proc. 2018 World Wide Web Conference 933–943 (2018).

  31. Waller, I. & Anderson, A. Generalists and specialists: using community embeddings to quantify activity diversity in online platforms. In Proc. 2019 World Wide Web Conference 1954–1964 (2019).

  32. Levy, O. & Goldberg, Y. Dependency-based word embeddings. In Proc. 52nd Annual Meeting of the Association for Computational Linguistics 2, 302–308 (2014).

  33. Levy, O. & Goldberg, Y. Neural word embedding as implicit matrix factorization. Adv. Neural Inf. Process. Syst. 27, 2177–2185 (2014).

  34. Schlechtweg, D., Oguz, C. & im Walde, S. S. Second-order co-occurrence sensitivity of skip-gram with negative sampling. Preprint at (2019).


This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Foundation for Innovation (CFI) and the Ontario Research Fund (ORF).

Author information

Authors and Affiliations



I.W. performed the computational analysis. A.A. and I.W. designed the research, analysed the results and wrote the paper.

Corresponding author

Correspondence to Ashton Anderson.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature thanks Kenneth Benoit, Kate Starbird and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Distribution of community scores.

Left: distributions of communities on the age, gender, partisan, and affluence dimensions. Right: the most extreme communities and words on those dimensions. Word scores are calculated by averaging community scores weighted by the number of occurrences of the word in the community in 2017. Community descriptions can be found in the glossary (Supplementary Table 1).
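The word-score calculation described in this legend amounts to an occurrence-weighted average of community scores. A minimal sketch follows, assuming the counts are available as a mapping from (word, community) pairs to occurrence counts; the function name, data layout and toy values are hypothetical.

```python
from collections import defaultdict

def word_scores(community_scores, word_counts):
    """Average community scores per word, weighted by occurrence counts.

    word_counts maps (word, community) -> number of occurrences of the
    word in that community (an assumed data layout).
    """
    weighted_sum = defaultdict(float)
    total = defaultdict(float)
    for (word, community), count in word_counts.items():
        weighted_sum[word] += count * community_scores[community]
        total[word] += count
    return {w: weighted_sum[w] / total[w] for w in total if total[w] > 0}

# Hypothetical community scores and word counts.
example_scores = word_scores(
    {"r/a": -1.0, "r/b": 2.0},
    {("tax", "r/a"): 1, ("tax", "r/b"): 3, ("cat", "r/a"): 5},
)
```

Here "tax" appears three times as often in the higher-scoring community, so its word score is pulled towards that community's score.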

Extended Data Fig. 2 External validations of social dimensions.

Scatter plots of the external validations of the gender, partisan, and affluence axes. The gender scores for occupational communities are plotted against the percentage of women in that occupation from the 2018 American Community Survey. The partisan scores for city communities are plotted against the Republican vote differential for that metropolitan area in the 2016 presidential election. The affluence scores of city communities are plotted against the median household income for that metropolitan area from the 2016 US Census. The blue line is the best-fit linear regression for the data; the shaded area represents a 95% confidence interval for the regression estimated using a bootstrap. \(p\)-values for correlation coefficients were computed using a two-sided test of Pearson correlation assuming joint normality.
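The validation described in this legend combines a least-squares fit, a Pearson correlation and a bootstrap confidence band. The sketch below shows one way to compute these with the Python standard library, assuming a percentile bootstrap over resampled (x, y) pairs; the legend does not state which bootstrap variant was used, the data points are invented, and the significance test is omitted.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def bootstrap_band(xs, ys, x0, n_boot=2000, seed=0):
    """Percentile 95% interval for the fitted value at x0."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    while len(preds) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:  # skip degenerate resamples with a single x value
            continue
        slope, intercept = fit_line(bx, [ys[i] for i in idx])
        preds.append(slope * x0 + intercept)
    preds.sort()
    return preds[int(0.025 * n_boot)], preds[int(0.975 * n_boot) - 1]

# Invented example: external measure (x) against community score (y).
xs = [10, 20, 30, 40, 50, 60]
ys = [-0.9, -0.5, -0.2, 0.1, 0.6, 0.8]
slope, intercept = fit_line(xs, ys)
r = pearson_r(xs, ys)
low, high = bootstrap_band(xs, ys, x0=35)
```

Evaluating `bootstrap_band` over a grid of `x0` values would trace out a shaded band of the kind described in the legend.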

Extended Data Fig. 3 Further validations of social dimensions.

Clockwise from left: The gap between university and city communities on the age dimension. The distribution of university and city communities on the age dimension; the age score is strongly related to the label (\(r=0.91\), two-sided \(p < {10}^{-58}\), \(n=150\), Cohen’s \(d=4.37\)). The distribution of communities labelled left-wing and right-wing on the partisan dimension; the partisan score is strongly related to the label (\(r=0.92\), two-sided \(p < {10}^{-21}\), \(n=50\), Cohen’s \(d=4.89\)). The distribution of explicitly labelled left- and right-wing communities on the partisan-ness axis as compared to the general distribution; there is a large difference in their means (Cohen’s \(d=3.27\)). For violin plots, the white dot represents the median; the box represents the 25th to 75th percentiles; the whiskers represent 1.5 times the inter-quartile range; and the density estimate (‘violin’) extends to the minima and maxima of the data. \(p\)-values for correlation coefficients were computed using a two-sided test of Pearson correlation assuming joint normality.
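The effect sizes in this legend are reported as Cohen's \(d\). A minimal sketch of the statistic follows, assuming the common pooled-standard-deviation form; the legend does not say which variant was used, and the sample values are invented.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_b) - mean(group_a)) / math.sqrt(pooled_var)

# Invented scores for two labelled groups of communities.
d = cohens_d([1.0, 2.0, 3.0], [3.0, 4.0, 5.0])
```

A value of \(d\) above roughly 0.8 is conventionally read as a large effect, so the values reported in the legend indicate very strongly separated groups.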

Extended Data Fig. 4 Distributions of age, gender and partisan scores by cluster.

Distributions of raw age, gender and partisan scores, separated by cluster. Outlier communities that lie more than two standard deviations from the mean are annotated. Dashed lines represent the global mean on each dimension. Community descriptions can be found in the glossary (Supplementary Table 1).
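The outlier rule in this legend (more than two standard deviations from the mean) can be sketched as below. Whether the mean and standard deviation are taken within each cluster or globally is not stated, so this sketch simply applies the rule to whatever set of scores it is given; the community names are hypothetical.

```python
from statistics import mean, pstdev

def outliers(scores, k=2.0):
    """Return communities more than k standard deviations from the mean,
    mapped to their z-scores."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    return {name: (s - mu) / sigma
            for name, s in scores.items()
            if abs(s - mu) > k * sigma}

# Hypothetical cluster: nine communities at 0 and one far from the rest.
cluster = {f"r/{c}": 0.0 for c in "abcdefghi"}
cluster["r/j"] = 10.0
flagged = outliers(cluster)
```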

Extended Data Fig. 5 Distributions of affluence, time, sociality and edgy scores by cluster.

Outlier communities that lie more than two standard deviations from the mean are annotated. Dashed lines represent the global mean on each dimension. Community descriptions can be found in the glossary (Supplementary Table 1).

Extended Data Fig. 6 Relationships between online social dimensions.

The relationships between the partisan dimension and (a) gender, (b) age, (c) partisan-ness. Every bar represents a bin of communities with partisan scores a given number of standard deviations from the mean, and the distribution illustrates the scores on the secondary dimension (e.g. gender in (a)). From left to right, the bars represent highly left-wing, leaning left-wing, center, leaning right-wing, highly right-wing communities. The leftmost and rightmost bars are annotated with the number of communities, and examples of the largest communities, in each group. The hex-plot in (c) illustrates the joint distribution of partisan and partisan-ness scores. Labels correspond to the categorizations used in the polarization analysis.

Extended Data Fig. 7 Polarization robustness checks.

(a) The partisan distribution of deleted and non-deleted comments in political communities. (b) The proportion of activity that took place in very left-wing (\(z < -3\)) and very right-wing (\(z > 3\)) communities over time. (c) Alternate version of Fig. 3a generated using a dataset in which the authorship of all comments was randomly shuffled. Each individual bin distribution is extremely similar to the overall activity distribution, showing that the overall activity distribution is a useful reference point for what bin distributions would look like if there were no tendency for users to comment in ideologically homogeneous communities. (d) Average distributions of political activity for authors of comments in the 25 largest political communities on Reddit (by number of comments). (e) Correlation of users’ average partisan scores over time. Each \(\left(x,y\right)\) cell represents the correlation between scores of a user in month \({t}_{x}\) and that same user in month \({t}_{y}\), for all users active in both time periods. A user is only considered active if they make at least \(10\) comments in a month. (f) The relationship between the proportion of users who polarize and the polarization threshold. The polarization threshold is the number of standard deviations a user must increase in polarization to be considered polarized. Three lines are plotted corresponding to three pairs of months: the pairs of months with the minimum (blue), maximum (orange), and median (green) proportion of users polarized when using a threshold of \(1\). A threshold of \(1\) is used in all other calculations. (g) The relationship between the proportion of users who polarize and the comment threshold. The comment threshold is the value used to filter inactive users from the calculation. Users must have at least \(x\) comments in each of the two months to be included in the calculation of the proportion of users who polarize. The same three month pairs are plotted as in part (f). There are minimal differences between different thresholds. A threshold of \(10\) is used in all other calculations.
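The two thresholds discussed in panels (f) and (g) can be made concrete with a small sketch. It assumes each user's monthly activity has already been summarized as a comment count and a polarization value in standard-deviation units; that data layout and the function name are hypothetical.

```python
def proportion_polarized(month_a, month_b, threshold=1.0, min_comments=10):
    """Share of sufficiently active users whose polarization rose by at
    least `threshold` standard deviations between two months.

    month_a / month_b map user -> (n_comments, polarization in SD units).
    """
    eligible = [u for u in month_a
                if u in month_b
                and month_a[u][0] >= min_comments
                and month_b[u][0] >= min_comments]
    if not eligible:
        return float("nan")
    polarized = sum(1 for u in eligible
                    if month_b[u][1] - month_a[u][1] >= threshold)
    return polarized / len(eligible)

# Invented data: u3 is filtered out by the comment threshold of 10.
a = {"u1": (12, 0.2), "u2": (30, 0.5), "u3": (4, 0.1), "u4": (15, 1.0)}
b = {"u1": (11, 1.5), "u2": (25, 0.6), "u3": (20, 2.0), "u4": (10, 0.9)}
share = proportion_polarized(a, b)
```

Sweeping `threshold` or `min_comments` over a range of values and plotting the result would correspond to the robustness curves described in panels (f) and (g).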

Extended Data Fig. 8 Distribution of political activity by user group.

The distribution of political activity on Reddit over time by partisan score. Each bar represents one month of comment activity in political communities on Reddit, and is coloured according to the distribution of partisan scores of comments posted during the month (the partisan score of a comment is simply the partisan score of the community in which it was posted). The top plot includes all activity as in Fig. 3b, while the four following plots decompose this into the subsets of activity authored by particular groups of users. Users are classified based on the average partisan score of their activity in the month 12 months prior as left-wing (having a score at least one standard deviation to the left), right-wing (at least one standard deviation to the right), or center. Users with no political activity in the month 12 months prior use the label of the most recent month more than 12 months prior in which they had political activity; if they have never had political activity before, they fall into the new / newly political category (bottom).
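The user classification rule in this legend can be sketched as follows. The sketch assumes months are integer indices, that negative scores denote left-wing activity, and that a user whose only political activity falls within the last 12 months counts as new; these conventions and the function name are assumptions for illustration.

```python
def classify_user(monthly_scores, month, lag=12, cutoff=1.0):
    """Label a user from their average partisan z-score `lag` months ago.

    monthly_scores maps month index -> average partisan z-score, with
    entries only for months in which the user had political activity.
    """
    target = month - lag
    earlier = [m for m in monthly_scores if m <= target]
    if not earlier:
        return "new / newly political"
    # The month exactly `lag` months back if present, otherwise the most
    # recent earlier month with political activity.
    z = monthly_scores[max(earlier)]
    if z <= -cutoff:
        return "left-wing"
    if z >= cutoff:
        return "right-wing"
    return "center"
```

For example, a user active only in months 0 and 5, classified in month 14, falls back to their month 0 score because month 2 has no political activity.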

Extended Data Fig. 9 Additional measures of ideological asymmetry.

(a) Average polarization (absolute \(z\)-score) of activity in different ideological categories over time. (b) Volume of activity (number of comments) in different ideological categories over time. (c, d) Annual change in polarization in the two partisan activity categories, decomposed into the change attributable to new (\(\varDelta n\)) and existing (\(\varDelta e\)) users as done in Fig. 4.
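One simple way to split an annual change in average polarization into contributions from new (\(\varDelta n\)) and existing (\(\varDelta e\)) users is the accounting identity sketched below, in which each comment contributes its deviation from the previous year's mean. This is an illustrative assumption and not necessarily the decomposition used in Fig. 4; the function name and values are hypothetical.

```python
def decompose_change(prev_mean, new_user_scores, existing_user_scores):
    """Split (current mean - previous mean) into the parts contributed by
    comments from new and from existing users.

    Inputs are absolute z-scores of the current year's comments.
    """
    total = len(new_user_scores) + len(existing_user_scores)
    delta_n = sum(z - prev_mean for z in new_user_scores) / total
    delta_e = sum(z - prev_mean for z in existing_user_scores) / total
    return delta_n, delta_e

# Invented example: previous-year mean polarization of 1.0.
new_scores = [2.0, 2.0]
existing_scores = [1.0, 1.0, 1.5, 0.5]
delta_n, delta_e = decompose_change(1.0, new_scores, existing_scores)
```

By construction the two terms sum to the overall change, so in this toy case the whole increase is attributed to new users.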

Extended Data Fig. 10 Implicit polarization.

The relationship between explicitly partisan and implicitly partisan activity (left: left-wing activity; right: right-wing activity). Of users who were first active in an explicitly partisan community at time \({m}_{E}\), the proportion of them who were first active in an implicitly partisan community at time \({m}_{I}\) is denoted by the colour in cell \(\left({m}_{E},{m}_{I}\right)\). The line graphs at the top show the total proportion of users who were active in implicitly partisan communities before they were active in an explicitly partisan community (i.e. the sum of each column below the diagonal back to 2005, or the total proportion of users for whom \({m}_{I} < {m}_{E}\)).
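The quantity in the line graphs, the share of users with \({m}_{I} < {m}_{E}\) for each explicit-entry month, can be sketched as below. The sketch assumes months are integer indices and that first-activity months are available per user; the function name and data are hypothetical.

```python
from collections import defaultdict

def implicit_first_share(first_explicit, first_implicit):
    """For each month m_E, the share of users first explicitly active in
    m_E whose first implicit activity came strictly earlier (m_I < m_E)."""
    cohort_size = defaultdict(int)
    implicit_first = defaultdict(int)
    for user, m_e in first_explicit.items():
        cohort_size[m_e] += 1
        m_i = first_implicit.get(user)
        if m_i is not None and m_i < m_e:
            implicit_first[m_e] += 1
    return {m_e: implicit_first[m_e] / n for m_e, n in cohort_size.items()}

# Invented first-activity months for three users.
shares = implicit_first_share(
    {"a": 5, "b": 5, "c": 7},
    {"a": 2, "b": 6, "c": 7},
)
```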

Extended Data Table 1 Social dimension seeds

Supplementary information

Supplementary Information

This file contains Supplementary Tables 1 and 2.

Reporting Summary

Peer Review File

Source data

Cite this article

Waller, I., Anderson, A. Quantifying social organization and political polarization in online platforms. Nature 600, 264–268 (2021).
