How character limit affects language usage in tweets

Boot, Arnout B.; Tjong Kim Sang, Erik; Dijkstra, Katinka; Zwaan, Rolf A.

doi:10.1057/s41599-019-0280-3


Article
Open access
Published: 09 July 2019


Palgrave Communications volume 5, Article number: 76 (2019) Cite this article


Abstract

In November 2017 Twitter doubled the available character space from 140 to 280 characters. This provided an opportunity for researchers to investigate the linguistic effects of length constraints in online communication. We asked whether the character limit change (CLC) affected language usage in Dutch tweets and hypothesized that there would be a reduction in the need for character-conserving writing styles. Pre-CLC tweets were compared with post-CLC tweets. Three separate analyses were performed: (I) general analysis: the number of characters, words, and sentences per tweet, as well as the average word and sentence length. (II) Token analysis: the relative frequency of tokens and bigrams; (III) part-of-speech analysis: the grammatical structure of the sentences in tweets (i.e., adjectives, adverbs, articles, conjunctives, interjections, nouns, prepositions, pronouns, and verbs); pre-CLC tweets showed relatively more textisms, which are used to abbreviate and conserve character space. Consequently, they represent more informal language usage (e.g., internet slang); in turn, post-CLC tweets contained relatively more articles, conjunctions, and prepositions. The results show that online language producers adapt their texts to overcome limit constraints.


Introduction

Spontaneous linguistic communication is typically unrestrained in terms of the length of utterances, but in some situations there are constraints on utterance length. For example, there are word count limitations on newspaper headlines, advertisements, journalistic articles, student papers, and scholarly manuscripts. These limitations are sometimes so restrictive that they affect sentence structure, content, and word forms. For instance, the advent of the telegraph, in which words were literally at a premium, necessitated an elliptic style that has become known as telegram style or telegraphese, which is viewed as a normal expressive form of language (Barton, 1998; Isserlin, 1985; Tesak and Dittmann, 2009). A more contemporary example of an elliptic style is textese, which is often used in modern text messages (Drouin and Driver, 2014).

Textese and telegraphese are both characterized by an imposed limit constraint (Barton, 1998; Drouin and Driver, 2014; Isserlin, 1985; Tesak and Dittmann, 2009). However, a crucial difference is the nature of the length restriction: in telegrams, the costs are related to the number of words and not the number of characters. In other words, a cost-effective telegram contains as few words as possible. In text messages, on the other hand, one is obliged to conserve character space, which results in a different practice of economy (Frehner, 2008). Character reduction, as performed in textese, can be achieved not only by minimizing the number of words but also by abbreviating words and using shorter synonyms and symbols. Textese has been called ‘squeeze text’, a name that reflects its grammatical features well (Carrington, 2004).

The character-reducing strategies inherent to textese are referred to as textisms (Carrington, 2004; Lyddy et al., 2014). They evolved not only to save character space but also to reduce typing effort. Textisms reduce character use without compromising the conveyed meaning and even add meaning in some cases. They include acronyms (e.g., LOL for ‘laugh out loud’), emoticons (e.g., ☺ instead of ‘I am happy’), accent stylizations (e.g., slang terms such as gonna), nonconventional spellings (e.g., gudnite), homophones (e.g., gr8 and c u), shortenings (e.g., pic for ‘picture’), contractions (e.g., thx for ‘thanks’), and omission of punctuation (Carrington, 2004; De Jonge and Kemp, 2012; Ling and Baron, 2007; Plester et al., 2009; Tagliamonte and Denis, 2008; Thurlow and Brown, 2003; Varnhagen et al., 2010).

Another strategy to reduce character usage is the omission of certain part-of-speech (POS) categories. The basic elements of a sentence are subject, verb, and object (SVO or SOV; Koster, 1975). The SVO structure comprises (pro)nouns and a verb, for example, ‘Tom ate lunch’. The main components of the SVO structure are unlikely to be omitted. In contrast, the POS categories that modify the basic structure and introduce additional information are more likely to be excluded. In textese and telegraphese, articles and conjunctions are often excluded (Carrington, 2004; Oosterhof and Rawoens, 2017). Consistent with this intuition, eyetracking studies of reading have shown that function words such as articles and prepositions are often skipped in normal reading because these words are both short and highly predictable from context (Rayner et al., 2011). A reader can even fill in omitted articles and conjunctions, as in ‘car broke down stopped in middle of road’. Although the overall readability is compromised, the message is still clear. Therefore, if words have to be omitted to reduce character usage, they are likely to be function words. However, other words can also be omitted, leaving out information, as in ‘the car broke down’ instead of ‘the car broke down and stopped in the middle of the road’. In this case, additional information is being withheld. More generally, this means that limit constraints might also affect sentence structure.

An example of a contemporary platform that might necessitate elliptic writing strategies is Twitter, an online microblogging platform which imposes a message-length limit to its users. On November 8th 2017, Twitter doubled the character limit from 140 characters to 280 characters^{Footnote 1}; we will refer to this as the character limit change (CLC). After a trial period in September, Twitter observed that 9% of English tweets hit the previous limit of 140 characters, whereas only 1% of tweets reached the new 280-character limit (Rosen, 2017). Doubling the character limit was thought to prevent a group of users from ‘cramming their thoughts’ (Rosen and Ihara, 2017). Furthermore, only 2% of trial tweets surpassed 190 characters, indicating that many users used merely a few more characters than had previously been possible. When Twitter announced the upcoming CLC the community responded ambivalently. Some users appreciated the increased tweet length, having more space to express their thoughts, whereas others claimed it would harm the tweets’ brevity and to-the-point characteristics (Watson, 2017).

The doubling of the maximum tweet length provides for an interesting opportunity to investigate the effects of a relaxation of length constraints on linguistic messaging. What happened to the average length of tweets? And more interestingly, how did CLC impact the structure and word usage in tweets?

The need for an economy of expression decreased post-CLC. Therefore, our first hypothesis states that post-CLC tweets contain relatively fewer textisms, such as abbreviations, contractions, symbols, or other ‘space-savers’. In addition, we hypothesize that the CLC affected the POS structure of the tweets, such that post-CLC tweets contain relatively more adjectives, adverbs, articles, conjunctions, and prepositions. These POS categories carry additional information about the situation being described (i.e., the referential situation), such as features of entities, the temporal order of events, locations of events or objects, and causal connections between events (Zwaan and Radvansky, 1998). This structural change also entails that sentences will be longer, with more words per sentence.

Gligorić et al. (2018) compared pre and post-CLC tweets with a length of approximately 140 characters. They found that pre-CLC tweets in this character range comprise relatively more abbreviations and contractions, and fewer definite articles. In the current study, we used a different approach that adds complementary value to the previous findings: we performed a content analysis on a dataset of approximately 1.5 million Dutch tweets including all ranges (i.e., 1–140 and 1–280), instead of selecting tweets within a specific character range. The dataset comprises Dutch tweets that were created between 25 October 2017 and 21 November 2017, in other words two weeks prior to and two weeks after the CLC.

We performed a general analysis to investigate changes in the number of characters, words, sentences, emojis, punctuation marks, digits, and URLs. To test the first hypothesis, we performed token and bigram analyses to detect all changes in the relative frequencies of tokens (i.e., individual words, punctuation marks, numbers, special characters, and symbols) and bigrams (i.e., two-word sequences). These changes in relative frequencies could then be utilized to extract the tokens that were especially affected by the CLC. In addition, a POS analysis was performed to test the second hypothesis; that is, whether the CLC affected the POS structure of the sentences. An example of each investigated POS category is presented in Table 1.

Table 1 Part-of-speech (POS) categories of interest


Method

Apparatus

The data collection, pre-processing, quantitative analysis, figures, token analysis, bigram analysis, and POS analysis were performed using RStudio (RStudio Team, 2016). The R packages that were used are: ‘BSDA’, ‘dplyr’, ‘ggplot’, ‘grid’, ‘kableExtra’, ‘knitr’, ‘lubridate’, ‘NLP’, ‘openNLP’, ‘quanteda’, ‘R-basic’, ‘rtweet’, ‘stringr’, ‘tidytext’, ‘tm’ (Arnholt and Evans, 2017; Benoit, 2018; Feinerer and Hornik, 2017; Grolemund and Wickham, 2011; Hornik, 2016; Hornik, 2017; Kearney, 2017; R Core Team, 2018; Silge and Robinson, 2016; Wickham, 2016; Wickham, 2017; Xie, 2018; Zhu, 2018).

Period of interest

The CLC occurred on 8 November 2017 at 00:00 a.m. (UTC). The dataset comprises Dutch tweets that were created within two weeks pre-CLC and two weeks post-CLC (i.e., from 10-25-2017 to 11-21-2017). This period is subdivided into week 1, week 2, week 3, and week 4 (see Fig. 1). To analyze the effect of the CLC we compared the language usage in ‘week 1 and week 2’ with the language usage in ‘week 3 and week 4’. To distinguish the CLC effect from natural-event effects, a control comparison was devised: the difference in language usage between week 1 and week 2, referred to as Baseline-split I. Furthermore, the CLC could have initiated a trend in the language usage that evolved as more users became familiar with the new limit. This trend could be shown by comparing week 3 with week 4, referred to as Baseline-split II.
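The four-week split described above can be sketched as follows. This is a minimal illustration in Python (the study itself was carried out in R); the function names `week_of` and `comparison_groups` and the label strings are hypothetical, and week boundaries are assumed to fall at 00:00 UTC.

```python
from datetime import datetime, timedelta, timezone

# Start of the study window and the CLC moment, both taken from the text (UTC).
WINDOW_START = datetime(2017, 10, 25, tzinfo=timezone.utc)
CLC_MOMENT = datetime(2017, 11, 8, tzinfo=timezone.utc)


def week_of(created_at):
    """Return 1-4 for a tweet inside the four-week window, otherwise None."""
    offset = created_at - WINDOW_START
    if offset < timedelta(0) or offset >= timedelta(weeks=4):
        return None
    return offset.days // 7 + 1


def comparison_groups(week):
    """Map a week number to the comparisons it takes part in (hypothetical labels)."""
    period = "pre-CLC" if week <= 2 else "post-CLC"
    baseline = "Baseline-split I" if week <= 2 else "Baseline-split II"
    return [period, baseline]
```

Under these assumptions, weeks 1 and 2 feed both the pre-CLC set and Baseline-split I, and weeks 3 and 4 feed both the post-CLC set and Baseline-split II.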

Data collection

The website^{Footnote 2} twiqs.nl was used to collect tweet-ids^{Footnote 3}; this website provides researchers with metadata from a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). The tweet-ids allow for the collection of tweets from the Twitter API that are older than 9 days (i.e., the historical limit when requesting tweets based on a search query). The R package ‘rtweet’ and its ‘lookup_statuses’ function were used to collect tweets in JSON format. The JSON file comprises a table with the tweets’ information, such as the creation date, the tweet text, and the source (i.e., type of Twitter client).

Data cleaning and preprocessing

The JSON^{Footnote 4} files were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) we removed tweets that belonged to the top 0.5 percentile of user activity, such as those from users who created more than 2000 tweets within four weeks, because we considered them non-representative of the normal user population. (2) Tweets from users with early access to the 280 limit were removed. (3) Tweets from users who were not represented in both pre and post-CLC datasets were removed; this procedure ensured a consistent user sample over time (within-group design, N_users = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
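Two of the user-related criteria above, (1) and (3), can be sketched as below. This is an illustrative Python sketch, not the authors' R code: the function name `filter_users`, the dictionary keys `user` and `period`, and the rank-based way of cutting off the most active users are all assumptions. Criterion (2), early access to the 280 limit, is not covered because the text does not say how those users were identified.

```python
from collections import Counter


def filter_users(tweets, top_fraction=0.005):
    """Keep tweets from users seen both pre and post, minus the most active users.

    tweets: list of dicts with hypothetical keys 'user' and 'period'
            ('pre' or 'post').
    """
    counts = Counter(t["user"] for t in tweets)
    # Rank users by activity and drop the most active fraction (criterion 1).
    ranked = sorted(counts, key=counts.get, reverse=True)
    n_drop = int(len(ranked) * top_fraction)
    too_active = set(ranked[:n_drop])
    # Keep only users represented in both periods (criterion 3).
    pre_users = {t["user"] for t in tweets if t["period"] == "pre"}
    post_users = {t["user"] for t in tweets if t["period"] == "post"}
    keep = (pre_users & post_users) - too_active
    return [t for t in tweets if t["user"] in keep]
```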

Table 2 Dataset exclusions and inclusions


The tweet texts were converted to ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when located within the tweet. However, URLs do not add to the character count when they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users had to deal with, tweets with URLs (but not media URLs such as added pictures or videos) were excluded.
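The text-level cleaning can be illustrated with a small Python sketch (the study used R). The regular expressions, the function names `has_url` and `clean_text`, and the choice to drop rather than transliterate non-ASCII characters are assumptions made for illustration; media URLs, which the authors treated separately, are not distinguished here.

```python
import re

# Hypothetical patterns; the text does not specify the exact rules that were used.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def has_url(text):
    """True if the tweet text contains a URL (such tweets were excluded)."""
    return bool(URL_RE.search(text))


def clean_text(text):
    """Remove screen-name references and line breaks, and reduce text to ASCII."""
    text = MENTION_RE.sub(" ", text)
    # Assumption: non-ASCII characters are dropped, not transliterated.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Splitting on whitespace also removes line breaks and repeated spaces.
    return " ".join(text.split())
```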

Token and bigram analysis

The R package^{Footnote 5} ‘quanteda’ was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation marks, and numbers) and bigrams. In addition, token-frequency matrices were computed, containing the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC [P(token post)], and T-scores. The T-score is similar to a standard T-statistic and expresses the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in the analysis is presented as Eqs. (1) and (2). N is the total number of tokens per dataset (i.e., pre and post-CLC). This equation is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).

$$T = \frac{P(\mathrm{token\,post}) - P(\mathrm{token\,pre})}{\sqrt{\sigma^2\left(P(\mathrm{token\,post})\right) + \sigma^2\left(P(\mathrm{token\,pre})\right)}}$$

(1)

$$\approx \frac{\frac{f(\mathrm{token\,post})}{N_{\mathrm{post}}} - \frac{f(\mathrm{token\,pre})}{N_{\mathrm{pre}}}}{\sqrt{\frac{f(\mathrm{token\,post})}{N_{\mathrm{post}}^2} + \frac{f(\mathrm{token\,pre})}{N_{\mathrm{pre}}^2}}}$$

(2)
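The T-score can be computed directly from token counts. The following is a minimal Python sketch (the authors worked in R), taking each relative frequency as P(token) = f(token)/N for its own period; the function names `t_score` and `t_scores` are hypothetical, and both token lists are assumed to be non-empty.

```python
import math
from collections import Counter


def t_score(f_pre, f_post, n_pre, n_post):
    """T-score for one token; positive means relatively more frequent post-CLC."""
    p_pre = f_pre / n_pre
    p_post = f_post / n_post
    # Variance approximation from Eq. (2): f / N^2 for each period.
    variance = f_post / n_post**2 + f_pre / n_pre**2
    if variance == 0:
        return 0.0
    return (p_post - p_pre) / math.sqrt(variance)


def t_scores(tokens_pre, tokens_post):
    """Compute a T-score for every token seen in either period."""
    c_pre, c_post = Counter(tokens_pre), Counter(tokens_post)
    n_pre, n_post = len(tokens_pre), len(tokens_post)
    return {
        tok: t_score(c_pre[tok], c_post[tok], n_pre, n_post)
        for tok in c_pre.keys() | c_post.keys()
    }
```

A token with identical relative frequency in both periods scores 0, and the sign convention matches the text: negative for tokens that were relatively more frequent pre-CLC.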

Part-of-speech (POS) analysis

The R package^{Footnote 6} ‘openNLP’ was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctives, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger uses a maximum entropy (maxent) probability model to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on CoNLL-X Alpino Dutch Treebank data (Buchholz and Marsi, 2006; Van der Beek et al., 2002). The openNLP POS model has been reported to reach an accuracy of 87.3% when used for English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the reliability of the POS tagger. However, the same analysis was performed for both the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger should be consistent over both datasets. Therefore, we assume there are no systematic confounds.

Statistical interpretation

The large sample size (N = 1,516,425) is an approximation of the population size; this means that the standard errors are low and the confidence intervals (CI) are narrow. 99% CIs were implemented, as opposed to the commonly used 95% CI, to reduce the chance of type I errors.
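As an illustration of why large samples give narrow intervals, a 99% CI for a mean can be sketched as below. This Python sketch assumes a normal approximation, which is reasonable for samples of this size; it is not necessarily the exact procedure the authors used, and the name `mean_ci99` is hypothetical.

```python
import math
from statistics import mean, stdev

# Two-sided 99% critical value of the standard normal distribution.
Z_99 = 2.5758


def mean_ci99(values):
    """Return (lower, upper) bounds of a 99% CI for the mean of `values`."""
    m = mean(values)
    # The standard error shrinks with the square root of the sample size.
    se = stdev(values) / math.sqrt(len(values))
    return m - Z_99 * se, m + Z_99 * se
```

Because the standard error scales with 1/√N, a sample of roughly 1.5 million tweets yields intervals that are very narrow relative to the means being compared.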

Results

The results comprise three components: (1) General statistics–the CLC induced differences across multiple tweet features, (2) token (i.e., unigram) and bigram analyses to test the first hypothesis, and (3) POS analysis to test the second hypothesis.

General statistics

After the CLC, the average tweet length increased. Table 3 contains descriptive information about different tweet features such as character and word count. This table also provides the absolute and relative differences between pre and post-CLC tweets. All tweet features increased in frequency. Furthermore, the standard deviations of all length features increased, indicating an increase in variability. This suggests some users took advantage of the additional character space, whereas others continued to use fewer than 140 characters.

Table 3 Tweet features pre and post-CLC


Figure 1 shows that the average character usage increased immediately after the CLC. In addition, the character usage also increased from week 3 to week 4, suggesting that some users became familiar with the 280-limit in the week after the CLC. Figure 2 provides an overview of all observations and shows an increase in character usage from pre to post-CLC time frames. This figure also shows the day/night cycle in Twitter activity, a small proportion of users who were still limited to 140 characters after the CLC (due to outdated Twitter client versions), an initial increase in the number of tweets near the 280-limit, and a smaller number of tweets near the 280-limit as compared to the 140-limit. Figure 3 displays the character (3a), word (3b), and sentence (3c) usage over time, which show a similar increase in tweet length. Figure 4a displays the number of characters per word (i.e., word length) over time. The average word length remained unaffected by the CLC, except for a temporary increase on the first day after the CLC. Figure 4b, c present an increase in sentence length after the CLC, which suggests a syntactic change in sentence structure.
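The length features discussed here (characters, words, and sentences per tweet, plus word and sentence length) can be sketched as follows. This is an illustrative Python sketch; splitting words on whitespace and sentences on runs of `.`, `!`, or `?` are simplifying assumptions, not the authors' documented rules, and `tweet_features` is a hypothetical name.

```python
import re


def tweet_features(text):
    """Compute simple length features for one tweet text."""
    # Assumption: words are whitespace-separated chunks.
    words = text.split()
    # Assumption: sentences end at runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "chars_per_word": sum(len(w) for w in words) / len(words) if words else 0.0,
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }
```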

Figure 5 shows that a large proportion of pre-CLC tweets (15.48%) fall within the upper range of 121–140 characters. In comparison, a much smaller proportion of post-CLC tweets (1.73%) fall within the upper range of 261–280 characters. Looking only at tweets close to the limit, 4.73% of pre-CLC tweets are near the pre-CLC limit (i.e., 138–140 characters), whereas just 0.48% of post-CLC tweets are near the post-CLC limit (i.e., 278–280 characters). In other words, doubling the character limit appears to have decreased the hindrance by a factor of ten.
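The ‘factor of ten’ can be checked against the reported percentages. The ratios below are simple arithmetic on the figures given above, shown for both the three-character and the twenty-character range:

$$\frac{4.73\%}{0.48\%} \approx 9.9, \qquad \frac{15.48\%}{1.73\%} \approx 8.9$$

Both ranges therefore point to a reduction of roughly one order of magnitude.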

Figure 6 shows the distribution of word usage in tweets pre and post-CLC. Again, it shows that a group of users was constrained by the 140-character limit. This group was forced to use about 15 to 25 words, as indicated by the relative increase of pre-CLC tweets around 20 words. Interestingly, the distribution of the number of words in post-CLC tweets is more right-skewed and decreases gradually. In contrast, the post-CLC character usage in Fig. 5 shows a small increase at the 280-character limit.

Token and bigram analyses

To test our first hypothesis, which states that the CLC reduced the use of textisms or other character-saving strategies in tweets, we performed token and bigram analyses. Firstly, the tweet texts were separated into tokens (i.e., words, symbols, numbers and punctuation marks). For each token the relative frequency pre-CLC was compared to the relative frequency post-CLC, thus revealing any effects of the CLC on the use of any token. This comparison of pre and post-CLC relative frequencies was expressed as a T-score (see Eqs. (1) and (2) in the Method section). Negative T-scores indicate a relatively higher frequency pre-CLC, whereas positive T-scores indicate a relatively higher frequency post-CLC. The total number of tokens in the pre-CLC tweets is 10,596,787, including 321,165 unique tokens. The total number of tokens in the post-CLC tweets is 12,976,118, including 367,896 unique tokens. For each unique token three T-scores were computed, which indicate to what extent the relative frequency was affected by Baseline-split I, Baseline-split II and the CLC, respectively (see Fig. 1).
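Tokenization into words, numbers, symbols, and punctuation marks, followed by bigram extraction, can be sketched as below. The study used the R package ‘quanteda’; this Python sketch uses a simple regular expression instead, and both the pattern and the lowercasing step are assumptions for illustration.

```python
import re

# Hypothetical pattern: a run of word characters, or one non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split a tweet into lowercase word, number, symbol and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def bigrams(tokens):
    """Return all adjacent token pairs, in order."""
    return list(zip(tokens, tokens[1:]))
```

For instance, `tokenize("Ik ga & jij?")` yields five tokens, in which `&` and `?` are kept as separate tokens so that their relative frequencies can be compared pre and post-CLC.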

Figure 7 presents the distribution of the T-scores after removal of low-frequency tokens, which shows that the CLC had an independent effect on language usage as compared to the baseline variance. Particularly, the CLC effect induced more T-scores <−4 and >4, as indicated by the reference lines. In addition, the T-score distribution of the Baseline-split II comparison takes an intermediate position between Baseline-split I and the CLC. That is, it shows more variance in token usage than Baseline-split I, but less variance in token usage than the CLC. Therefore, Baseline-split II (i.e., the comparison between week 3 and week 4) could suggest a subsequent trend of the CLC, in other words, a gradual change in language usage as more users became familiar with the new limit.

To minimize natural-event-related confounds, the T-score range indicated by the reference lines in Fig. 7 was used as a cutoff rule. That is, tokens within the range of −4 to 4 were excluded, because this range of T-scores can be ascribed to baseline variance, as opposed to CLC-dependent variance. Furthermore, we removed tokens that showed greater variance for Baseline-split I as compared to the CLC. A similar procedure was performed with bigrams, resulting in a T-score cutoff rule of −2 to 2 (see Fig. 8). Tables 4–7 present a subset of the tokens and bigrams whose occurrences were most affected by the CLC. Each individual token or bigram in these tables is accompanied by three related T-scores: Baseline-split I, Baseline-split II, and CLC. These T-scores can be used to compare the CLC effect with Baseline-split I and Baseline-split II, for each individual token or bigram.
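The cutoff rule can be sketched as a simple filter. This Python sketch makes two assumptions that the text leaves open: the cutoff range is treated as inclusive, and ‘greater variance for Baseline-split I’ is read as a larger absolute T-score for Baseline-split I than for the CLC. The function name `clc_affected` and the example scores in the usage note are hypothetical.

```python
def clc_affected(scores, cutoff=4.0):
    """Keep tokens whose CLC T-score exceeds the baseline range.

    scores: dict mapping token -> (t_baseline1, t_baseline2, t_clc).
    Use cutoff=2.0 for bigrams.
    """
    kept = {}
    for token, (t_b1, t_b2, t_clc) in scores.items():
        # Scores inside the cutoff range are ascribed to baseline variance.
        if abs(t_clc) <= cutoff:
            continue
        # Assumption: drop tokens that varied more in Baseline-split I than at the CLC.
        if abs(t_b1) > abs(t_clc):
            continue
        kept[token] = (t_b1, t_b2, t_clc)
    return kept
```

With made-up scores such as `{"lol": (-1.0, -2.0, -9.5), "weer": (6.0, 1.0, 5.0)}`, only `lol` would be retained: `weer` passes the cutoff but varied more in Baseline-split I than at the CLC.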

Table 4 Tokens that occurred relatively less frequently post-CLC and related T-scores (Baseline-split I; Baseline-split II; CLC)


Table 5 Tokens that occurred relatively more frequently post-CLC and related T-scores (Baseline-split I; Baseline-split II; CLC)


Table 6 Bigrams that occurred relatively less frequently post-CLC and related T-scores (Baseline-split I; Baseline-split II; CLC)


Table 7 Bigrams that occurred relatively more frequently post-CLC and related T-scores (Baseline-split I; Baseline-split II; CLC)


The tokens that occurred relatively less frequently post-CLC are presented in Table 4. These tokens comprise: symbols (e.g., &, >, /, +, ^, =), numerals (e.g., 1, 2, 3), acronyms, shortenings and contractions (e.g., t, k, ff, ni, mn, nie, jy, gwn, s, lol; which refer to: het, dat, ok/ik, even, niet, mijn, niet, jij, gewoon, is, laugh out loud; translations: it, that, ok/I, for a bit^{Footnote 7}, not, my, not, you, just, is), punctuation marks (e.g., ! ? : ; but not the period and comma), pronouns (e.g., ik, jij, hem, hij, me, je, jou; translations: I, you, him, he, me, you/your), opinion-related adjectives/adverbs (e.g., echt, lekker, mooi, goed, nieuwe, niks, leuk, zeker, mooie, super; translations: really, nice/tasty, nice/beautiful, good, new, nothing, nice/beautiful, nice/nicely, sure, nice, super), and interjection words (e.g., ja, haha, nee, man, hoor, nou, hahaha, he, jaa, wow, jaaa, ok, fuck, shit, wtf; translations: yes, haha, no, man, you know, well, hahaha, hey/huh, yeah, wow, okay). In summary, the words that occurred relatively more frequently pre-CLC represent mainly informal language use, such as contractions, unconventional spellings, symbols and profanity.

Table 5 presents tokens that occurred relatively more frequently post-CLC. These tokens comprise: articles (i.e., de, het, een; translations: feminine/masculine the, neuter the, a(n)), conjunctions (e.g., en, of, omdat, want, zodat; translations: and, or, because, because, so that), prepositions (e.g., door, in, om, met, over, tijdens, aan, tot; translations: through/by, in, for/at, with, about/over, during, to/on, until), and auxiliary and linking verbs (e.g., worden, hebben, zijn, moeten, kunnen, maken, willen; translations: become, have, are, must, can, make, want). Overall, the tokens that occurred relatively more frequently post-CLC represent more formal language usage as compared to the pre-CLC tokens in Table 4.

Table 6 presents bigrams that occurred relatively more frequently pre-CLC. These bigrams mainly comprise personal pronoun + verb combinations (i.e., ik ga, ik heb, ik ben, ik wil, ik dacht, heb je, ik moet, denk ik, ik kan, ik kom, ik had, ik was; translations: I am going, I have, I am, I want, I thought, have you, I must, I think, I can, I come, I had, I was). Again, the results suggest that there was relatively more informal language usage, that is, relatively more frequent occurrences of self-referential language, which implies a more personal and subjective language usage.

The bigrams that occurred relatively more frequently post-CLC, in Table 7, comprise mainly prepositional phrases or preposition + article combinations (e.g., van de, van het, door de, naar het, van een, om de, over de, aan de, over het, in het, met het, met de, om het, bij het, om een, voor het; translations: from the, from the, by the, to the, from a, about the, over the, in the, with the, about/over the, by the, around the, for the), suggesting more detailed descriptions of the situation that is referred to in the tweets. Importantly, the introduction of extra prepositions can also explain the increase in sentence length after the CLC.

POS analysis

The second hypothesis about a potential increase in the use of adjectives, adverbs, articles, conjunctions, and prepositions, was tested using a POS analysis. Table 8 displays the relative frequencies of POS categories. Figure 9 presents the relative differences in POS usage after the CLC, compared with Baseline-split I and II. The CLC had a greater effect on POS usage as compared to baseline differences. Particularly, the CLC induced an increase in the usage of articles, conjunctives, and prepositions as compared to other POS categories. This increase means that the CLC changed the syntactic structures of tweets, which is also supported by the finding that sentence length increased. Unexpectedly, the relative frequency of adverbs and adjectives did not increase after the CLC. In addition, the difference between Baseline-split I and Baseline-split II shows more variation between week 3 and week 4 as compared to week 1 and week 2. This suggests a trend in the language usage initiated by the CLC.
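The relative POS frequencies and their change across periods can be sketched as follows. This Python sketch assumes that the ‘relative difference’ is defined as (post − pre) / pre on relative frequencies, which the text does not state explicitly; the function names and the counts in the usage note are hypothetical and are not taken from Table 8.

```python
def relative_frequencies(tag_counts):
    """Convert raw POS tag counts into proportions of all tagged tokens."""
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.items()}


def relative_change(pre_counts, post_counts):
    """Relative change in POS share from pre to post: (post - pre) / pre."""
    pre = relative_frequencies(pre_counts)
    post = relative_frequencies(post_counts)
    return {
        tag: (post[tag] - pre[tag]) / pre[tag]
        for tag in pre
        if tag in post and pre[tag] > 0
    }
```

With toy counts of `{"article": 10, "noun": 90}` pre and `{"article": 20, "noun": 80}` post, the article share doubles (a relative change of 1.0) while the noun share falls by about 11%. The same function applied to week 1 versus week 2, or week 3 versus week 4, would give the baseline comparisons.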

Table 8 Part-Of-speech (POS) distribution


Discussion and conclusions

We investigated the effect of the character limit change (CLC) on the language usage in tweets. The results indicate that the CLC has, in fact, affected the language usage in tweets. The first hypothesis was supported; the pre-CLC tweets comprise relatively more textisms, such as shortenings, contractions, unconventional spellings, symbols and numerals. The second hypothesis was partially supported. As expected, the grammatical structure was affected by the CLC: post-CLC sentences are longer and comprise more articles, conjunctives, and prepositions than pre-CLC sentences. However, adjectives and adverbs did not increase in relative frequency. This section is structured as follows: first, we discuss an important insight about the results, namely a change in the formality of language usage. After this, each of the investigated POS components is discussed separately. We conclude with possible interpretations of the results with regard to user behavior and with the limitations of our study.

Formality of language

The CLC seems to have brought about a qualitative change in language usage in tweets. Pre-CLC tweets contain relatively more informal language (i.e., textisms, self-referential pronouns, and interjection words), whereas post-CLC tweets show relatively more formal language usage. This change in formality is specifically evident in the relative frequencies of the personal pronoun ik (I) and the article word de (the), which decreased and increased, respectively. Previous n-gram research has shown that the frequencies for ik and de are indicators of informal and formal language usage (Bouma, 2015). Particularly, ik is used very frequently in self-referential and subjective texts such as personal social-media messages. On the other hand, de is used relatively more frequently in neutral and objective texts such as news articles and books. The results suggest that the CLC has led to a general change in the formality of language usage on Twitter.

POS structure

Articles indicate whether a noun refers to a specific entity or to an unspecified entity or class of entities (e.g., ‘the house’ vs. ‘a house’). This information is not always essential; hence, articles can be excluded to save space or reduce the number of words, a strategy that characterizes both telegraphese and textese (Carrington, 2004; Oosterhof and Rawoens, 2017). Articles occurred relatively more frequently after the CLC. With sufficient space, apparently, users prefer to include articles.

Conjunctions are used to link words, phrases, or clauses. The increase in conjunctions after the CLC may have multiple causes. Firstly, the relaxation of the previously restraining character limit means conjunctions are no longer ‘wasting’ character space, so they do not necessarily have to be excluded anymore. Secondly, more available space also means there is more room for summations and subordinate clauses, thus increasing the need for conjunctions. Another explanation for the increase in conjunctions is the pre-CLC usage of conjunctive symbols instead of words (e.g., ‘/’, ‘+’, ‘&’ as compared to ‘or’, ‘and’).

Prepositions indicate ‘where’ or ‘when’ an object or an individual is in relation to something else. Prepositions can describe the spatial arrangement of entities (e.g., ‘The tree is in front of the house.’). However, they are also routinely extended to depict the relations between abstract ideas, such as intentions and contrasts (e.g., ‘I wear overly casual clothing to work despite the criticism from my coworkers.’). As opposed to articles and conjunctions, most prepositions cannot be excluded without changing the conveyed meaning (e.g., ‘The tree is [] the house’). Remarkably, the CLC increased preposition usage, which suggests that prepositional information was being withheld prior to the CLC in order to save character space. This restraint results in a truncated version of the originally intended sentence. Example (I):

Pre-CLC: ‘It was a sunny beach day.’

Post-CLC: ‘It was a sunny day on the beach, despite some rain in the morning’.

In contrast, some prepositions are omissible without changing the conveyed meaning (Rohdenburg, 2002). Example (II):

‘They had difficulty [in] getting there in time.’

Both example (I) and (II) show how the relative frequency of prepositions may have increased post-CLC. However, only example (I) suggests that information was being withheld. Interestingly, the bigram analysis showed that the CLC especially increased the usage of preposition and article combinations (e.g., by the, from the, to a), which appear to add non-omissible prepositional information. This finding supports the notion that information was being withheld and that some sentences were necessarily truncated pre-CLC, much like example (I).

As opposed to prepositions, there was no increased usage of adjectives and adverbs. In fact, the relative usage of adjectives and adverbs decreased somewhat post-CLC. Adjectives and adverbs modify nouns and verbs and describe features of entities, actions, and events. For example: ‘These shoes are too (i.e., adverb) small (i.e., adjective).’ This featural information is, perhaps, too important to be excluded from a message. When a user has to decrease word usage to remain within the character limit, it appears that prepositional information is considered expendable, whereas information conveyed by adjectives and adverbs is regarded as indispensable. Consider the following example:

1. ‘It was so nice to see my old friends and teachers from high school at the reunion.’ (i.e., the original message).
2. ‘Great reunion: nice to see my old high-school friends/teachers again.’ (excluding prepositions, articles, and conjunctions).
3. ‘My friends and teachers from high school were at the reunion.’ (excluding adjectives and adverbs).

Example 2 is clearly a more faithful rendition of the original message than example 3. Adjectives and adverbs are mainly used to describe feelings and/or opinions, which better represent the crux of a message than prepositional information does. This could explain why adjective and adverb usage did not increase after the CLC.

Interjections show the largest decrease in relative frequency (see Fig. 8). The term ‘interjection’ derives from the Latin words ‘inter’ and ‘jacĕre’ (i.e., ‘to throw’). An interjection is ‘thrown’ between sentences and represents a sudden expression of feelings (e.g., ‘Oh my!’, ‘Wow!’, ‘Haha’). Short replies mainly comprise interjections, and importantly, these interjections require very little character space. This means that the previous limit of 140 characters was already sufficient for the use of interjections. Any additional character space would therefore be unlikely to affect interjection usage. This explains the relative decrease in interjection frequency compared to the other POS categories. Furthermore, the relatively low frequency of interjections also explains the higher baseline error variance as compared to the other categories.
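A brief numerical sketch may help to show why a category can decline in relative frequency even when its absolute usage is unchanged. The counts below are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the study's data.

```python
def relative_frequency(count, total):
    """Relative frequency of one category among all tagged tokens."""
    return count / total


# Hypothetical token counts per period: interjection usage stays constant,
# while the other POS categories grow once more character space is available.
pre = {"interjection": 50, "other": 950}
post = {"interjection": 50, "other": 1200}

pre_rel = relative_frequency(pre["interjection"], sum(pre.values()))
post_rel = relative_frequency(post["interjection"], sum(post.values()))

print(f"pre-CLC:  {pre_rel:.3f}")
print(f"post-CLC: {post_rel:.3f}")
```

With these assumed counts, the interjection share drops from 5% to 4% although the number of interjections is identical in both periods.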

In conclusion, the character limit change has affected language use in tweets in our sample. After the limit change, tweets contained more articles, conjunctions, and prepositions, as well as relatively more formal language and relatively less informal language (i.e., textisms and interjections). Before the CLC, a group of users was constrained in the conveyance of their message; post-CLC, these users obtained the character space they needed. As our results show, doubling the character limit reduced the observed hindrance by a factor of ten. Therefore, the 280-character limit appears to be far more adequate than the 140-character limit for conveying messages on Twitter. The new limit might appear to be a gold standard for Twitter. However, it is conceivable that, as users become more familiar with the new limit, the number of characters used will increase over time. As suggested by the Baseline-split II analysis, language usage continues to evolve as a subsequent trend of the CLC. Future research could show whether character and language usage remain consistent.

Future research may also address whether the effects of the CLC in Dutch tweets, that is, a decrease in the usage of textisms and an increase in the usage of articles, conjunctions, and prepositions, are observable in other languages as well. The underlying rationale is that the CLC effects are likely to be related to the function of these words and the type of information they convey, rather than to the language itself. That being said, the character efficiency of a language could potentially moderate the CLC effects. In particular, a more character-efficient language would be less constrained by a length limit than a less character-efficient one.

An inevitable limitation of the current design is the confounding effect of natural events on public language usage. The use of certain words can be event related. To assuage the potential impact of these confounds, we removed tokens and bigrams that showed higher baseline variance as compared to the CLC effect. However, to fully eliminate issues related to natural events, one may devise an experimental study to investigate the effect of a CLC on language usage. A CLC-dependent effect on language usage could then be tested while controlling for natural confounds (i.e., topic and event-related effects) that are bound to occur in observational studies. However, an experimental setting would reduce the ecological validity of the study. Therefore, the current study would be complementary to an experimental study.
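As a rough illustration of the filtering idea described above, the sketch below keeps a token only when its pre/post change is larger than its change within the baseline period. How baseline variance was actually quantified is not restated in this section, so the comparison used here (the absolute difference between two hypothetical baseline halves), the function `keep_token`, and all frequencies are assumptions made for the example.

```python
def keep_token(baseline_a, baseline_b, pre, post):
    """Keep a token only if its pre/post change exceeds its change within the baseline."""
    baseline_change = abs(baseline_a - baseline_b)
    clc_change = abs(post - pre)
    return clc_change > baseline_change


# Hypothetical relative frequencies per 1,000 tokens, in the order:
# baseline half A, baseline half B, pre-CLC, post-CLC.
tokens = {
    "the": (40.0, 40.5, 40.2, 44.0),      # stable baseline, clear pre/post shift
    "election": (0.2, 3.1, 1.6, 2.0),     # event-driven swings within the baseline
}

kept = sorted(name for name, freqs in tokens.items() if keep_token(*freqs))
print(kept)
```

Under these assumed numbers, the event-related token is dropped because it fluctuates more within the baseline than across the limit change, whereas the function word is retained.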

Text-limit constraints in tweets affect language usage, as we found in the current study. The relaxation of the character limit constraint means that writers are less likely to adapt their intended message by using strategies to compress it. Without constraints there is less need for economy of expression. The doubling of the character limit on Twitter has considerably decreased the need to compress messages. With the new limit of 280 characters, more users finally have the character space to express their thoughts. Our findings show that online language production can be affected by the character limit constraints of the medium. If necessary, language producers adapt their texts to overcome these constraints^{Footnote 8}.

Data availability

Tweet-ids and the complete procedure are available at the Open Science Framework. It is important to note that we are not permitted to share tweets. However, we are allowed to share tweet-ids on behalf of an academic institution and for the purpose of non-commercial research (see Developer Policy I.F.2.B. https://developer.twitter.com/en/developer-terms/policy).

Notes

Currently, there is much interest in algorithmic methods to define and recognize online human behavior, such as consumer decisions, browsing activity, social-network structures, and personal interests. Twitter collects information to enhance the user experience, to show more relevant tweets, events, and people to follow, but also to enable targeted advertising (see Twitter’s privacy policy; Twitter Inc, 2018). From the user’s perspective, the specific implementation of personal information is unclear. That is, many of the design decisions in Twitter’s software are opaque to the user. In contrast, the CLC was a transparent design decision, which directly affected the way users could interact with the Twitter environment.
OSF: “TCLC 1 Data Collection Pre-CLC.html” and “TCLC 2 Data Collection Post-CLC.html”.
OSF: “tweet_ids_CLC_post.Rdata” and “tweet_ids_CLC_pre.Rdata”.
OSF: “TCLC 3 Data Pre-Processing.html” and “TCLC 4 Data Pre-Processing 2.html”.
OSF: “TCLC 6 Token Analysis.html” and “TCLC 7 Bigram Analysis.html”.
OSF: “TCLC 8 Part-of-Speech Analysis.html”.
The Dutch word ‘even’, which can be translated to ‘for a bit’ or ‘just for a moment’, is a commonly used filler and is often abbreviated to ‘ff’, which is short for “effe,” a colloquial version of “even”.
The Effect of the Twitter Character Limit Change on Language: https://osf.io/sg35a/?view_only=f360c9f624484062a43108968a4abc2b.

References

Arnholt AT, Evans B (2017) BSDA: Basic statistics and data analysis. R package version 1.2.0. https://CRAN.R-project.org/package=BSDA
Barton EL (1998) The grammar of telegraphic structures: sentential and nonsentential derivation. J Engl Linguist 26:37–67
Benoit K (2018) quanteda: Quantitative analysis of textual data. R package version 0.99.22. https://doi.org/10.5281/zenodo.1004683
Bouma G (2015) N-gram frequencies for Dutch Twitter data. Comput Linguist Neth 5:25–36
Buchholz S, Marsi E (2006) CoNLL-X shared task on multilingual dependency parsing. In: Màrquez L, Klein D (eds) Proceedings of the tenth conference on computational natural language learning. Association for Computational Linguistics, New York City, p 92–122
Carrington V (2004) Texts and literacies of the Shi Jinrui. Br J Sociol Educ 25:215–228
Church K, Gale W, Hanks P, Hindle D (1991) Using statistics in lexical analysis. In: Zernik U (ed) Lexical acquisition: exploiting on-line resources to build up a lexicon. Lawrence Erlbaum Associates, Hillsdale, p 115–164
De Jonge S, Kemp N (2012) Text-message abbreviations and language skills in high school and university students. J Res Read 35:49–68
Drouin M, Driver B (2014) Texting, textese and literacy abilities: a naturalistic study. J Res Read 37:250–267
Feinerer I, Hornik K (2017) tm: Text mining package. R package version 0.7-3. https://CRAN.R-project.org/package=tm
Frehner C (2008) Email, SMS, MMS: the linguistic creativity of asynchronous discourse in the new media age. Peter Lang, Bern
Gligorić K, Anderson A, West R (2018) How constraints affect content: the case of Twitter’s switch from 140 to 280 characters. In: Proceedings of the Twelfth International AAAI Conference on Web and Social Media. AAAI Press, Palo Alto
Grolemund G, Wickham H (2011) Dates and times made easy with lubridate. J Stat Softw 40:1–25. http://www.jstatsoft.org/v40/i03/
Hornik K (2016) openNLP: Apache OpenNLP tools interface. R package version 0.2-6. https://CRAN.R-project.org/package=openNLP
Hornik K (2017) NLP: Natural language processing infrastructure. R package version 0.1-11. https://CRAN.R-project.org/package=NLP
Horsmann T, Erbs N, Zesch T (2015) Fast or accurate? A comparative evaluation of PoS tagging models. In: Fisseni B, Schröder B, Zesch T (eds) Proceedings of the international conference of the German society for computational linguistics and language technology. University of Duisburg-Essen, Duisburg, p 22–30
Isserlin M (1985) On agrammatism. Cogn Neuropsychol 2:308–345
Kearney MW (2017) rtweet: collecting Twitter data. R package version 0.6.0. https://cran.r-project.org/package=rtweet
Koster J (1975) Dutch as an SOV language. Linguist Anal 1:111–136
Ling R, Baron NS (2007) Text messaging and IM: Linguistic comparison of American college data. J Lang Soc Psychol 26:291–298
Lyddy F, Farina F, Hanney J, Farrell L, Kelly O’Neill N (2014) An analysis of language in university students’ text messages. J Comput-Mediat Commun 19:546–561
Oosterhof A, Rawoens G (2017) Register variation and distributional patterns in article omission in Dutch headlines. Linguist Var 17:205–228
Plester B, Wood C, Joshi P (2009) Exploring the relationship between children’s knowledge of text message abbreviations and school literacy outcomes. Br J Dev Psychol 27:145–161
Ratnaparkhi A (1996) A maximum entropy model for part-of-speech tagging. In: Proceedings in empirical methods in natural language processing. Association for Computational Linguistics, New Brunswick, New Jersey
Rayner K, Slattery TJ, Drieghe D, Liversedge SP (2011) Eye movements and word skipping during reading: effects of word length and predictability. J Exp Psychol: Hum Percept Perform 37(2):514–528. https://doi.org/10.1037/a0020990
R Core Team (2018) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Rohdenburg G (2002) Processing complexity and the variable use of prepositions in English. In: Cuyckens H, Radden G (eds) Perspectives on prepositions. Walter de Gruyter, Berlin, p 79–100
Rosen A, Ihara I (2017) Giving you more characters to express yourself. Blog.twitter.com. https://blog.twitter.com/official/en_us/topics/product/2017/Giving-you-more-characters-to-express-yourself.html
Rosen A (2017) Tweeting Made Easier. Blog.twitter.com. https://blog.twitter.com/official/en_us/topics/product/2017/tweetingmadeeasier.html
RStudio Team (2016) RStudio: integrated development for R. R Studio, Inc., Boston. http://www.rstudio.com/
Silge J, Robinson D (2016) tidytext: text mining and analysis using tidy data principles in R. J Open Source Softw 1(3):37
Tagliamonte SA, Denis D (2008) Linguistic ruin? LOL! Instant messaging and teen language. Am Speech 83:3–34
Tesak J, Dittmann J (2009) Telegraphic style in normals and aphasics. Linguistics 29:1111–1138
Thurlow C, Brown A (2003) Generation Txt? The sociolinguistics of young people’s text-messaging. Discourse Anal 1:30
Tjong Kim Sang EF (2011) Het gebruik van Twitter voor taalkundig onderzoek. TABU 39:62–72
Tjong Kim Sang EF, Van den Bosch A (2013) Dealing with big data: the case of Twitter. Comput Linguist Neth 3:121–134
Twitter Inc (2018) Twitter privacy policy [PDF file]. Twitter Inc: San Francisco. https://cdn.cms-twdigitalassets.com/content/dam/legal-twitter/site-assets/privacy-page-gdpr/pdfs/PP_Q22018_April_EN.pdf
Van der Beek L, Bouma G, Malouf R, Van Noord G (2002) The Alpino dependency treebank. Lang Comput 45:8–22
Varnhagen CK, McFall GP, Pugh N, Routledge L, Sumida-MacDonald H, Kwong TE (2010) Lol: new language and spelling in instant messaging. Read Writ 23:719–733
Watson C (2017) Twitter users respond to #280characters rollout: ‘All we wanted was an edit button’. The Guardian. https://www.theguardian.com/technology/2017/nov/08/twitter-users-respond-280characters-tweet-limit
Wickham H (2016) ggplot2: Elegant graphics for data analysis. Springer-Verlag, New York. http://ggplot2.org
Wickham H (2017) stringr: Simple, consistent wrappers for common string operations. R package version 1.2.0. https://CRAN.R-project.org/package=string
Wickham H, Francois R, Henry L, Müller K (2017) dplyr: A grammar of data manipulation. R package version 0.7.4. https://CRAN.R-project.org/package=dplyr
Xie Y (2018) knitr: A general-purpose package for dynamic report generation in R. R package version 1.20
Zhu H (2018) kableExtra: construct complex table with ‘kable’ and pipe syntax. R package version 0.9.0. https://CRAN.R-project.org/package=kableExtra
Zwaan RA, Radvansky GA (1998) Situation models in language comprehension and memory. Psychol Bull 123:162–185

Author information

Authors and Affiliations

Erasmus University Rotterdam, Mandeville building, room T16-03, Burgemeester Oudlaan 50, Rotterdam, NL, 3062 PA, The Netherlands
Arnout B. Boot, Katinka Dijkstra & Rolf A. Zwaan
Netherlands eScience Center, Amsterdam, The Netherlands
Erik Tjong Kim Sang

Corresponding author

Correspondence to Arnout B. Boot.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Boot, A.B., Tjong Kim Sang, E., Dijkstra, K. et al. How character limit affects language usage in tweets. Palgrave Commun 5, 76 (2019). https://doi.org/10.1057/s41599-019-0280-3

Received: 27 February 2019
Accepted: 12 June 2019
Published: 09 July 2019
DOI: https://doi.org/10.1057/s41599-019-0280-3
