A new blog about language de-fossilization

For years I have been intrigued by why it is so difficult for me to learn to communicate in Thai (as any reader of my blog has heard me say many times over the past 11 years). In fact, my lack of success is part of the reason I started working on a Ph.D. in Applied Linguistics. The complexity of language is fascinating to me, and diving into some of the linguistic theories about how languages actually work has been a lot of fun over the past two years. (But my Thai is still not very good!)

As part of my studies and interests in this field, I am always on the lookout for good blogs about language. One of my very favorites was An A-Z of ELT (English Language Teaching), written by Scott Thornbury. Every Sunday, Scott would post a very thought-provoking essay about some Applied Linguistics topic, such as Autonomy, Krashen, or Post-Modern Methods.

Unfortunately, Scott decided to end the A-Z blog last month, but in its place is an even more interesting blog, at least for me personally. Scott has lived in Spain for 30 years, but claims that his Spanish skills have not improved in a very long time. So in his new blog, he will be looking at “de-fossilizing” his Spanish language learning, from an academic perspective. As he says,

I am going to do this using a number of means, including formal instruction, vocabulary memorization, extensive reading and (if I can find it) informal interaction. At the same time, I plan to inform the process by occasional reference to the literature on second language acquisition (SLA), including such issues as motivation, age effects, aptitude, exposure, fluency, error correction, and identity formation.

Since I have recently restarted my own de-fossilization (if you want to call it that) by focusing on vocabulary memorization of Thai words for now, I am very interested to hear what Scott has to say on his own language learning journey. If you are interested in language learning at all, I also recommend that you follow along at The (de-) fossilization diaries.


Linguistic Perspective on Writing Quality

Lately, I have been doing some research on the linguistic perspective on writing quality: what frameworks and theories others have used in research, what their methodologies are, how they define “quality”, etc. I found quite a lot of articles, but after reading through them, I realized that they are all pure Computational Linguistics, both in their theoretical frameworks and their methodologies. Most of the recent ones are trying to solve the problem of having the computer determine whether its natural language generation (NLG) output is “good” or not. (For example: Are automated summaries coherent?)

Almost all of the articles I found equate quality with coherence/cohesion. The articles will sometimes give a passing nod to Halliday and Hasan (1976), but not much more than that. Instead, they seem to focus on theories from the Computational Linguistics research such as Centering Theory (Grosz et al., 1983), the “theory of attention, intention, and aggregation of utterances” (Grosz and Sidner, 1986), or Rhetorical Structure Theory (Mann and Thompson, 1988). Or they base it on cognitive psychology work, such as “Coherence in text, coherence in mind”, an article by Givón (1993).

The studies I have been reading all rely heavily on formulas and statistical models such as Hidden Markov Models, trying to find a model of language that fits the data and correlates with some human judgement of quality. I am not sure how far I will be going down that path, but out of all of it, Rhetorical Structure Theory looks the most interesting and might be applicable to my research as an analysis tool. It’s definitely the most popular framework in the articles I have seen.

Unfortunately, my research purpose and rationale are not as focused as I would like them to be at this point. I was hoping to narrow them down sooner rather than later. But maybe I should just gather my data, pick a topic (or at least a linguistic level), dive in, and see what happens.


Givón, T. (1993). Coherence in text, coherence in mind. Pragmatics & Cognition, 1(2), 171-227.

Grosz, B. J., & Sidner, C. L. (1986). Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3), 175-204.

Grosz, B. J., Joshi, A. K., & Weinstein, S. (1983). Providing a unified account of definite noun phrases in discourse. In Proceedings of the 21st annual meeting on association for computational linguistics (pp. 44-50).

Halliday, M. A., & Hasan, R. (1976). Cohesion in English.

Mann, W. C., & Thompson, S. A. (1988). Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3), 243-281.

A Sample of Key (Highly-Cited) Computational Linguistics journal articles about Cohesion/Coherence and/or Writing/Text Quality

Barzilay, R., & Lapata, M. (2008). Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1), 1-34.

Carlson, L., Marcu, D., & Okurowski, M. E. (2003). Building a discourse-tagged corpus in the framework of rhetorical structure theory. Springer.

Crossley, S. A., & McNamara, D. S. (2011). Text coherence and judgments of essay quality: Models of quality and coherence. In Proceedings of the 29th annual conference of the cognitive science society (pp. 1236-1241).

Elsner, M., Austerweil, J. L., & Charniak, E. (2007). A unified local and global model for discourse coherence. In HLT-NAACL (pp. 436-443).

Gordon, P. C., Grosz, B. J., & Gilliom, L. A. (1993). Pronouns, names, and the centering of attention in discourse. Cognitive Science, 17(3), 311-347.

Lapata, M., & Barzilay, R. (2005). Automatic evaluation of text coherence: Models and representations. In IJCAI (Vol. 5, pp. 1085-1090).

Lin, Z., Ng, H. T., & Kan, M. -Y. (2011). Automatically evaluating text coherence using discourse relations. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 997-1006).

Louis, A., & Nenkova, A. (2012). A coherence model based on syntactic patterns.

Louis, A., & Nenkova, A. (2013). A corpus of science journalism for analyzing writing quality. Dialogue & Discourse, 4(2), 87-117.

Pitler, E., & Nenkova, A. (2008). Revisiting readability: A unified framework for predicting text quality. In Proceedings of the conference on empirical methods in natural language processing (pp. 186-195).

Pitler, E., Louis, A., & Nenkova, A. (2010). Automatic evaluation of linguistic quality in multi-document summarization. In Proceedings of the 48th annual meeting of the association for computational linguistics (pp. 544-554).

Soricut, R., & Marcu, D. (2003). Sentence level discourse parsing using syntactic and lexical information. In Proceedings of the 2003 conference of the north american chapter of the association for computational linguistics on human language technology-volume 1 (pp. 149-156).

Word Frequencies and Emoji

Over the past year and a half of my Ph.D. studies, the topic of word frequency has come up… frequently. We have talked about it in terms of corpus linguistics and understanding how the English language works, and also in terms of teaching English as a foreign language. These two areas are related — as Applied Linguists, we are often thinking about how to use linguistic theory to improve the pedagogy of language teaching.

So it would seem that language teachers would be interested in teaching the most frequent words to language learners.  But what are these words?  Studies have shown that, although it appears that English speakers can differentiate the frequency of very common versus very rare words (Ringeling, 1984), they are unable to differentiate frequencies in the high and middle ranges. Even English teachers do not fare any better on this test than other native English speakers (McCrostie, 2007). There are many reasons why this might be the case, including “explicit retrieval factors such as salience, base rate neglect, framing effects, and heuristic ways of reasoning” (Kahneman and Tversky, 1973, as cited in Ellis, 2002).

But, having said that, many people probably know the answer to the trivia question, “What is the single most common English word?” The answer is “the”. And the Top 10 lists from the Corpus of Contemporary American English (COCA), the British National Corpus (BNC), and the Oxford English Corpus (OEC) are all very similar:

Rank  COCA               BNC                OEC
1     the                the                the
2     be                 of                 be
3     and                and                to
4     of                 a                  of
5     a                  in                 and
6     in                 to (inf. marker)   a
7     to (inf. marker)   it                 in
8     have               is                 that
9     to (preposition)   was                have
10    it                 to (preposition)   I

Although these lists are very similar, there are some differences. First of all, the lists are very dependent on how the words are counted. For example, in COCA, all of the forms of “to be” are grouped into one word (the second most common word, “be”), while the BNC doesn’t have “be”, but instead has “is” and “was” at the eighth and ninth positions.
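To make the counting issue concrete, here is a minimal Python sketch of how grouping word forms changes a frequency list. The sample sentence and the tiny `BE_FORMS` table are made up for illustration; they are assumptions of this sketch and not how COCA or the BNC actually process their texts.

```python
from collections import Counter

# Hypothetical, hand-written lemma table covering only forms of "to be".
# A real corpus relies on a full lemmatizer and part-of-speech tagging.
BE_FORMS = {
    "am": "be", "is": "be", "are": "be", "was": "be",
    "were": "be", "been": "be", "being": "be",
}

def count_words(text, group_lemmas=False):
    """Count lowercase word tokens, optionally mapping forms of 'to be' to 'be'."""
    tokens = text.lower().split()
    if group_lemmas:
        tokens = [BE_FORMS.get(tok, tok) for tok in tokens]
    return Counter(tokens)

# Made-up sample text, not taken from any corpus.
sample = "the cat is here and the dog was there and the birds are gone"

by_form = count_words(sample)                       # "is", "was", "are" counted separately
by_lemma = count_words(sample, group_lemmas=True)   # all three counted as "be"

print(by_form.most_common(3))
print(by_lemma.most_common(3))
```

With grouping turned on, “be” ties with “the” at the top of this toy list; without it, each form of the verb appears only once and drops out of the top ranks. The same kind of choice sits behind the different positions of “be”, “is”, and “was” in the lists above.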

And secondly, the list is dependent on the corpus that you choose. American English is slightly different from British English, and the English words used in business are different from the words used in fiction.

The rise of the Internet has also had an effect on how the English language is used to communicate. In 2009, Time magazine did a study on the most frequent words used on Twitter, and their Top 10 words were:

  • the
  • I
  • to
  • a
  • and
  • is
  • in
  • it
  • you
  • of

In this list of words, in addition to not differentiating between parts of speech (e.g., “to” is listed only once), we can see that the pronouns “I” and “you” have leapt into the Top 10. This makes sense, as communication on Twitter is more personal and conversational than the texts in COCA, the BNC, or the OEC. And by the way, the full list contains “my” at #14 and “me” at #19, up from positions 44, 34, and 71 for “my” and 61, 50, and 78 for “me” in the other three corpus lists. In other words, communication on Twitter is also very self-centered! But if you have spent any time using this service, you knew this already.

Not only is Twitter communication more personal than what you would find in one of the big corpora listed above, but other language changes are being seen on the Internet as well, such as the proliferation of the @ symbol and the e- prefix in language (Crystal, 2001). The Internet has also brought the spread of emoticons, such as the colon and right parenthesis smiley face :), and their graphical counterparts, emoji such as 😄. There has been a lot of research in this area lately, from using emoticons and emoji to do emotion analysis of an online corpus (Yang et al., 2007), to a socio-linguistic analysis of the way that Japanese youth create a culturally safe, yet innovative new way to communicate with each other (Miyake, 2007), to the gender differences in the frequency and range of emoticon use (Tossell et al., 2012).

So, if we tie together these two ideas of “word frequency” and “emoticons/emoji as words”, then perhaps you are wondering which emoji are the most frequently used. Take a guess at which ones are the most common. Here are some hints:

  • 4 of the top 20 emoji graphics include red hearts
  • 13 of the top 20 have yellow faces: 8 positive, 4 negative, and 1 neutral
  • Emoji users on Twitter seem happy. If we consider hearts to be positive, then 14 of the top 20 emoji are positive, 4 negative, and 2 neutral (depending on how you define positive, negative, and neutral, of course)

So do you think that you can guess the most common emoji on Twitter? Give it a try and then see if you are correct at emojitracker, a real-time scoreboard tracking Twitter emoji use.
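If you would like to try this kind of counting on your own messages, the following Python sketch tallies emoji the same way a word frequency list tallies words. It is only a toy: the messages are made up, the code point ranges in `EMOJI_RANGES` are an assumption that covers many but not all emoji, and it is not a description of how emojitracker itself works.

```python
from collections import Counter

# Assumed code point ranges that cover many (not all) emoji. A fuller version
# would use the Unicode emoji data files and handle multi-character sequences.
EMOJI_RANGES = [(0x1F300, 0x1FAFF), (0x2600, 0x27BF)]

def is_emoji(ch):
    """Return True if the character falls inside one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EMOJI_RANGES)

def count_emoji(messages):
    """Tally every emoji character found in a list of message strings."""
    counts = Counter()
    for message in messages:
        counts.update(ch for ch in message if is_emoji(ch))
    return counts

# Made-up messages, not real tweets.
messages = [
    "great day \U0001F604\U0001F604",   # two smiling faces
    "love this \u2764",                 # red heart
    "so sad \U0001F622",                # crying face
    "ok \U0001F604",                    # one more smiling face
]

emoji_counts = count_emoji(messages)
for emoji, n in emoji_counts.most_common():
    print(emoji, n)
```

Treating each emoji as a single “word” keeps the sketch short, but it also shows the same counting problem as before: whether variants (for example, skin tones or a heart with a presentation selector) count as one item or several changes the ranking.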


Crystal, D. (2001). Language and the internet. Cambridge University Press.

Ellis, N. C. (2002). Reflections on frequency effects in language processing. Studies in Second Language Acquisition, 24(2), 297-339.

McCrostie, J. (2007). Investigating the accuracy of teachers’ word frequency intuitions. RELC Journal, 38(1), 53-66. doi:10.1177/003368820607615

Miyake, K. (2007). How young Japanese express their emotions visually in mobile phone messages: A sociolinguistic analysis. Japanese Studies, 27(1), 53-72. doi:10.1080/1037139070126864

Nawar, H. (2012). Multicultural transposition: From alphabets to pictographs, towards semantographic communication. Technoetic Arts, 10(1), 59-68.

Ringeling, J. C. T. (1984). Subjective estimations as a useful alternative to word frequency counts. Interlanguage Studies Bulletin, 8, 59-69.

Tossell, C. C., Kortum, P., Shepard, C., Barg-Walkow, L. H., Rahmati, A., & Zhong, L. (2012). A longitudinal study of emoticon use in text messaging from smartphones. Computers in Human Behavior, 28(2), 659-663. doi:10.1016/j.chb.2011.11.01

Yang, C., Lin, K. H. -Y., & Chen, H. -H. (2007). Building emotion lexicon from weblog corpora. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions (pp. 133-136).