Word Frequencies and Emoji

Over the past year and a half of my Ph.D. studies, the topic of word frequency has come up… frequently. We have talked about it in terms of corpus linguistics and understanding how the English language works, and also in terms of teaching English as a foreign language. These two areas are related — as Applied Linguists, we are often thinking about how to use linguistic theory to improve the pedagogy of language teaching.

So it would seem that language teachers would be interested in teaching the most frequent words to language learners.  But what are these words?  Studies have shown that, although it appears that English speakers can differentiate the frequency of very common versus very rare words (Ringeling, 1984), they are unable to differentiate frequencies in the high and middle ranges. Even English teachers do not fare any better on this test than other native English speakers (McCrostie, 2007). There are many reasons why this might be the case, including “explicit retrieval factors such as salience, base rate neglect, framing effects, and heuristic ways of reasoning” (Kahneman and Tversky, 1973, as cited in Ellis, 2002).

But, having said that, many people probably know the answer to the trivia question, “What is the single most common English word?” The answer is “the”. And the Top 10 lists from the Corpus of Contemporary American English (COCA), the British National Corpus (BNC), and the Oxford English Corpus (OEC) are all very similar:

Rank | COCA              | BNC               | OEC
1    | the               | the               | the
2    | be                | of                | be
3    | and               | and               | to
4    | of                | a                 | of
5    | a                 | in                | and
6    | in                | to (inf. marker)  | a
7    | to (inf. marker)  | it                | in
8    | have              | is                | that
9    | to (preposition)  | was               | have
10   | it                | to (preposition)  | I

Although these lists are very similar, there are some differences. First of all, the lists are very dependent on how the words are counted. For example, in COCA, all of the forms of “to be” are grouped into one word (the second most common word, “be”), while the BNC doesn’t have “be” but instead has “is” and “was” in the eighth and ninth positions.
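To see how much the counting scheme matters, here is a minimal Python sketch. The sample sentence and the tiny `BE_FORMS` lemma table are invented for illustration; they are not taken from COCA or the BNC, and a real corpus would rely on a far more complete lemmatizer.

```python
import re
from collections import Counter

# Hypothetical mini lemma table: maps inflected forms of "to be" to "be".
# A real corpus tool would cover every lemma, not just this one verb.
BE_FORMS = {
    "am": "be", "is": "be", "are": "be", "was": "be",
    "were": "be", "been": "be", "being": "be",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_forms(text):
    """Count every surface form separately, so "is" and "was" stay apart."""
    return Counter(tokenize(text))

def count_lemmas(text):
    """Group the forms of "to be" under the single entry "be"."""
    return Counter(BE_FORMS.get(token, token) for token in tokenize(text))

# Invented sample text, used only to show the difference.
sample = "The cat is here. The dog was there. The birds are gone."

print(count_forms(sample).most_common(3))
print(count_lemmas(sample).most_common(3))
```

With separate surface forms, “is”, “was”, and “are” each appear once and stay far down the list; once they are grouped, “be” ties with “the” in this small sample. The same text gives a different ranking depending only on the counting rule.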

And secondly, the list is dependent on the corpus that you choose. American English is slightly different from British English, and the English words used in business are different from the words used in fiction.

The rise of the Internet has also had an effect on how the English language is used to communicate. In 2009, Time magazine did a study on the most frequent words used on Twitter, and their Top 10 words were:

  • the
  • I
  • to
  • a
  • and
  • is
  • in
  • it
  • you
  • of

In this list, in addition to parts of speech not being differentiated (e.g., “to” is listed only once), we can see that the pronouns “I” and “you” have leapt into the Top 10. This makes sense, as communication on Twitter is more personal and conversational than the texts in COCA, the BNC, or the OEC. And by the way, the full list contains “my” at #14 and “me” at #19, up from positions 44, 34, and 71 (“my”) and 61, 50, and 78 (“me”) in the other three corpus lists. In other words, communication on Twitter is also very self-centered! But if you have spent any time using this service, you knew this already.

Not only is Twitter communication more personal than what you would find in one of the big corpora listed above, but other language changes are being seen on the Internet as well, such as the proliferation of the @ symbol and the e- prefix in language (Crystal, 2001). The Internet has also brought the spread of emoticons, such as the colon and right parenthesis smiley face :), and their graphical counterparts, emoji 😄. There has been a lot of research in this area lately, from using emoticons and emoji to do emotion analysis of an online corpus (Yang et al., 2007), to a sociolinguistic analysis of the way that Japanese youth create a culturally safe yet innovative new way to communicate with each other (Miyake, 2007), to the gender differences in the frequency and range of emoticon use (Tossell et al., 2012).

So, if we tie together these two ideas of “word frequency” and “emoticons/emoji as words”, then perhaps you are wondering which emoji are the most frequently used. Take a guess at which ones are the most common. Here are some hints:

  • 4 of the top 20 emoji graphics include red hearts
  • 13 of the top 20 have yellow faces – 8 positive, 4 negative, and 1 neutral
  • Emoji users on Twitter seem happy. If we consider hearts to be positive, then 14 of the top 20 emoji are positive, 4 negative, and 2 neutral (depending on how you define positive, negative, and neutral, of course)

So do you think that you can guess the most common emoji on Twitter? Give it a try and then see if you are correct at emojitracker, a realtime scoreboard tracking Twitter emoji use.
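If you would rather count than guess, an emoji tally works much like a word tally. The Python sketch below is only an illustration of the idea, not how emojitracker itself is built: the three sample tweets are invented, and `EMOJI_RANGES` is a rough assumption that covers many emoji code points but not all of them.

```python
from collections import Counter

# Assumed code point ranges that cover many (not all) emoji; a rough heuristic.
EMOJI_RANGES = [
    (0x1F300, 0x1FAFF),  # pictographs, emoticon faces, and later additions
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats (includes the heart)
]

def is_emoji(ch):
    """Return True if the character falls inside one of the assumed ranges."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in EMOJI_RANGES)

def emoji_counts(tweets):
    """Count every emoji character across a list of tweet strings."""
    return Counter(ch for tweet in tweets for ch in tweet if is_emoji(ch))

# Invented sample tweets, written with escapes so the file encoding is irrelevant.
tweets = [
    "I love this \u2764\ufe0f",                # heart plus a variation selector
    "so funny \U0001F602\U0001F602",           # face with tears of joy, twice
    "good morning \U0001F604 \u2764\ufe0f",    # smiling face and another heart
]

counts = emoji_counts(tweets)
for emoji, n in counts.most_common():
    print(f"U+{ord(emoji):04X} {n}")
```

Note that the variation selector U+FE0F that often follows the heart is skipped, so each heart is counted once. A fuller version would also need to handle multi-character sequences such as flags and skin-tone modifiers.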

References

Crystal, D. (2001). Language and the internet. Cambridge University Press.

Ellis, N. C. (2002). Reflections on frequency effects in language processing. Studies in Second Language Acquisition, 24(2), 297-339.

McCrostie, J. (2007). Investigating the accuracy of teachers’ word frequency intuitions. RELC Journal, 38(1), 53-66. doi:10.1177/003368820607615

Miyake, K. (2007). How young Japanese express their emotions visually in mobile phone messages: A sociolinguistic analysis. Japanese Studies, 27(1), 53-72. doi:10.1080/1037139070126864

Nawar, H. (2012). Multicultural transposition: From alphabets to pictographs, towards semantographic communication. Technoetic Arts, 10(1), 59-68.

Ringeling, J. C. T. (1984). Subjective estimations as a useful alternative to word frequency counts. Interlanguage Studies Bulletin, 8, 59-69.

Tossell, C. C., Kortum, P., Shepard, C., Barg-Walkow, L. H., Rahmati, A., & Zhong, L. (2012). A longitudinal study of emoticon use in text messaging from smartphones. Computers in Human Behavior, 28(2), 659-663. doi:10.1016/j.chb.2011.11.01

Yang, C., Lin, K. H.-Y., & Chen, H.-H. (2007). Building emotion lexicon from weblog corpora. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions (pp. 133-136).