Evolutionary forces in language change depend on selective pressure, but also on random chance


An interesting new paper in Nature: Detecting evolutionary forces in language change, by Newberry, Ahern, Clark, and Plotkin (2017). Discovered via Science Daily.

The following are excerpts of materials related to the publication (written by Katherine Unger Baillie), from The University of Pennsylvania:

Examining substantial collections of annotated texts dating from the 12th to the 21st centuries, the researchers found that certain linguistic changes were guided by pressures analogous to natural selection — social, cognitive and other factors — while others seem to have occurred purely by happenstance.

“Linguists usually assume that when a change occurs in a language, there must have been a directional force that caused it,” said Joshua Plotkin, professor of biology in Penn’s School of Arts and Sciences and senior author on the paper. “Whereas we propose that languages can also change through random chance alone. An individual happens to hear one variant of a word as opposed to another and then is more likely to use it herself. Chance events like this can accumulate to produce substantial change over generations. Before we debate what psychological or social forces have caused a language to change, we must first ask whether there was any force at all.”

“One of the great early American linguists, Leonard Bloomfield, said that you can never see a language change, that the change is invisible,” said Robin Clark, a coauthor and professor of linguistics in Penn Arts and Sciences. “But now, because of the availability of these large corpora of texts, we can actually see it, in microscopic detail, and begin to understand the details of how change happened.”

One change is the regularization of past-tense verbs. Using the Corpus of Historical American English, comprised of more than 100,000 texts ranging from 1810 to 2009 that have been parsed and digitized — a database that includes more than 400 million words — the team searched for verbs where both regular and irregular past-tense forms were present, for example, “dived” and “dove” or “wed” and “wedded.”

“There is a vast literature and a lot of mythology on verb regularization and irregularization,” Clark said, “and a lot of people have claimed that the tendency is toward regularization. But what we found was quite different.”

Indeed, the analysis pointed to particular instances where it seems selective forces are driving irregularization. For example, while a swimmer 200 years ago might have “dived”, today we would say they “dove.” The shift towards using this irregular form coincided with the invention of cars and concomitant increase in use of the rhyming irregular verb “drive”/“drove.”

Despite finding selection acting on some verbs, “the vast majority of verbs we analyzed show no evidence of selection whatsoever,” Plotkin said.

The team recognized a pattern: random chance affects rare words more than common ones. When rarely-used verbs changed, that replacement was more likely to be due to chance. But when more common verbs switched forms, selection was more likely to be a factor driving the replacement.
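To make the intuition behind this pattern concrete, here is a minimal sketch of neutral copying (a Wright-Fisher-style simulation borrowed from population genetics). It is only my own illustration, not the inference method used in the paper; the function names, token counts, number of generations and number of runs are all assumptions chosen for the example. Two forms of a verb start at equal frequency and nobody prefers either one, yet in a small pool of tokens (a rare verb) one form often replaces the other by chance alone, while in a large pool (a common verb) this essentially never happens over the same time span.

```python
import random

def neutral_copying(n_tokens, start_freq, generations, rng):
    """Simulate neutral drift between two variants of a verb.

    Each generation, every one of n_tokens usages copies a randomly
    chosen usage from the previous generation; neither variant has
    any advantage (no selection).
    """
    freq = start_freq
    for _ in range(generations):
        # Count the usages that happened to copy the irregular form.
        count = sum(1 for _ in range(n_tokens) if rng.random() < freq)
        freq = count / n_tokens
        if freq in (0.0, 1.0):  # one form has fully replaced the other
            break
    return freq

def replacement_rate(n_tokens, runs=100, generations=50, seed=0):
    """Fraction of runs in which one form is lost purely by chance."""
    rng = random.Random(seed)
    replaced = 0
    for _ in range(runs):
        final = neutral_copying(n_tokens, 0.5, generations, rng)
        if final in (0.0, 1.0):
            replaced += 1
    return replaced / runs

# Hypothetical token counts: a rare verb vs. a common verb.
rare = replacement_rate(n_tokens=20)
common = replacement_rate(n_tokens=1000)
print("rare verb replaced by chance:  ", rare)
print("common verb replaced by chance:", common)
```

Under these assumptions, a complete replacement in a frequent verb within the same number of generations would be hard to explain by chance, which is why selection becomes the more plausible explanation there.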

The grammar of negating a sentence has changed from “Ic ne secge” (Beowulf, c. 900) to “Ic ne sege noht” (the Ormulum, c. 1100) to “I seye not” (Chaucer, c. 1400) to “I doe not say” (Shakespeare, c. 1600) before returning to the familiar “I don’t say” (Virginia Woolf, c. 1900). A team from Penn used massive digital libraries along with inference techniques from population genetics to quantify the forces responsible for language evolution, such as in Jespersen’s cycle of negation, depicted here. (c) Cherissa Dukelow, 2017, license information below

The authors also observed a role of random chance in grammatical change. The periphrastic “do,” as used in, “Do they say?” or “They do not say,” did not exist 800 years ago. Back in the 1400s, these sentiments would have been expressed as, “Say they?” or “They say not.”

Using the Penn Parsed Corpora of Historical English, which includes 7 million syntactically parsed words from 1,220 British English texts, the researchers found that the use of the periphrastic “do” emerged in two stages, first in questions (“Don’t they say?”) around the 1500s, and then roughly 200 years later in imperative and declarative statements (“They don’t say.”).

These manuscripts show changes from Old English (Beowulf) through Middle English (Trinity Homilies, Chaucer) to Early Modern English (Shakespeare’s First Folio). Penn researchers used large collections of digitized texts spanning the 12th to the 21st centuries to show that many language changes can be attributed to random chance alone. (c) Mitchell Newberry, 2017, license information below

While most linguists have assumed that such a distinctive grammatical feature must have been driven to dominance by some selective pressure, the Penn team’s analysis questions that assumption. They found that the first stage of the rising periphrastic “do” use is consistent with random chance. Only the second stage appears to have been driven by a selective pressure.

“It seems that, once ‘do’ was introduced in interrogative phrases, it randomly drifted to higher and higher frequency over time,” said Plotkin. “Then, once it became dominant in the question context, it was selected for in other contexts, the imperative and declarative, probably for reasons of grammatical consistency or cognitive ease.”

As the authors see it, it’s only natural that social-science fields like linguistics increasingly exchange knowledge and techniques with fields like statistics and biology.

“To an evolutionary biologist,” said Newberry, “it’s important that language is maintained through a process of copying language; people learn language by copying other people. That copying introduces minute variation, and those variants get propagated. Each change is an opportunity for a different copying rate, which is the basis for evolution as we know it.”

Featured image: copyrighted, modified from the Supplementary information of the article.

Image (c) Cherissa Dukelow, 2017, licensed under CC-BY-NC-SA 4.0 http://creativecommons.org/licenses/by-nc-sa/4.0/
Image (c) Mitchell Newberry, 2017, licensed under CC-BY-NC 4.0 https://creativecommons.org/licenses/by-nc/4.0/ (see materials at University of Pennsylvania for further sources).


Forces driving grammatical change are different to those driving lexical change


A new paper in PNAS, Evolutionary dynamics of language systems, by Greenhill et al. (2017).


Do different aspects of language evolve in different ways? Here, we infer the rates of change in lexical and grammatical data from 81 languages of the Pacific. We show that, in general, grammatical features tend to change faster and have higher amounts of conflicting signal than basic vocabulary. We suggest that subsystems of language show differing patterns of dynamics and propose that modeling this rate variation may allow us to extract more signal, and thus trace language history deeper than has been previously possible.


Understanding how and why language subsystems differ in their evolutionary dynamics is a fundamental question for historical and comparative linguistics. One key dynamic is the rate of language change. While it is commonly thought that the rapid rate of change hampers the reconstruction of deep language relationships beyond 6,000–10,000 y, there are suggestions that grammatical structures might retain more signal over time than other subsystems, such as basic vocabulary. In this study, we use a Dirichlet process mixture model to infer the rates of change in lexical and grammatical data from 81 Austronesian languages. We show that, on average, most grammatical features actually change faster than items of basic vocabulary. The grammatical data show less schismogenesis, higher rates of homoplasy, and more bursts of contact-induced change than the basic vocabulary data. However, there is a core of grammatical and lexical features that are highly stable. These findings suggest that different subsystems of language have differing dynamics and that careful, nuanced models of language change will be needed to extract deeper signal from the noise of parallel evolution, areal readaptation, and contact.

This is in line with the studies by Bentz, like Adaptive Communication: Languages with More Non-Native Speakers Tend to Have Fewer Word Forms, which suggest a simplification of grammar with language contact.

It might also lend further support to my proposal of Uralic as the Corded Ware substrate common to Balto-Slavic and Indo-Iranian, since these are the only Late Indo-European branches that clearly retain grammatical complexity in word forms. That retention, together with their shared phonetic isoglosses (also partially present between Balto-Slavic and Germanic), puts them nearer to a complex, potentially related Uralic (or other Indo-Uralic) branch.

On the other hand, the finding of a greater stability of lexicon gives further support to the concept of a North-West Indo-European group, since one of its foundations (the main one originally) is the shared vocabulary between Italo-Celtic, Germanic, and Balto-Slavic.

Featured image: from the article (copyrighted), “Map showing locations of languages in this study. The phylogenies show the maximum clade credibility tree of the Austronesian languages in our sample. Each phylogeny is colored by the average rate of change, with branches showing more change colored redder, while bluer branches show reductions in rate. Branches with significant shifts are annotated with an asterisk, and the languages showing significantly different rates of change in their grammatical data are located on the map”.


When linguistics does not seem to be a science


An interesting essay by Arika Okrent has appeared in Aeon: Is linguistics a science? It concerns the central position of Chomsky’s Universal Grammar in modern linguistics, and revolves around a story told in Tom Wolfe’s book The Kingdom of Speech (2016): Everett’s discovery of the Pirahã culture’s (and language’s) emphasis on the here and now, which shows in the absence of phrases embedded inside one another, the simple kinship system, the lack of numbers, and the absence of fiction or creation myths. Some excerpts of the essay:

This looks suspiciously like defiance of a central feature of the scientific archetype, one first put forward by the philosopher Karl Popper: theories are not scientific unless they have the potential to be falsified. If you claim that recursion is the essential feature of language, and if the existence of a recursionless language does not debunk your claim, then what could possibly invalidate it?


In an interview with Edge.org in 2007, Everett said he emailed Chomsky: ‘What is a single prediction that universal grammar makes that I could falsify? How could I test it?’ According to Everett, Chomsky replied to say that universal grammar doesn’t make any predictions; it’s a field of study, like biology.


By contrast, good theories or hypotheses are those that allow you to search for contrary evidence. Thus Albert Einstein’s theory of general relativity made a very specific prediction about the effect of gravity on light, which could be subsequently tested during the solar eclipse of 1919. Unlike astrology or Freudianism, relativity could be contradicted. It was possible to conceive of an observation that would conflict with one’s expectations (although the eclipse ultimately vindicated Einstein). The capacity to be disproved is what makes general relativity scientific.


In Chomsky’s formulation, we are not just after a set of abstract rules that account for the things we can see and hear, but one that explains why they are the way they are. In the late 1970s, Chomsky began to refer to this method of enquiry as the ‘Galilean style’


Chomsky’s Galilean vision was that our intuitive judgments about language stem from an innate language faculty, a universal grammar underlying the human capacity for language. His project is to determine the essential nature of that universal grammar – not the nature of language, but the nature of the human capacity for language. The distinction is a subtle one.

The pseudoscience claims about linguistics (in this case, about Chomsky’s Universal Grammar), and the common answer that the authors of linguistic abstractions give to such criticism (asserting that they can be “neither right nor wrong” but only “fecund or sterile”), reminded me of an old XKCD comic which sums up this line of reasoning quite well:

To Dmoz or not to Dmoz, that is the question…

Firstly, I am not an SEO expert. In fact, I am rather bad at understanding how the WWW (not to mention the Internet as a whole) works.

A year ago a (geek) friend of mine told me that being listed on the Open Directory Project (Dmoz) would be a cool way to promote our project of Indo-European Language Revival. Now I know that (obviously) it’s mostly a question of PageRank and Google.

A year ago I submitted what we had, our website dnghu.org. It was scarce in original content, but it was not under construction, and it already offered some material on Proto-Indo-European reconstruction. The submission followed all the rules for site suggestion, including the choice of the appropriate category: Proto-Indo-European.

A year ago I found some websites in the Proto-Indo-European category that were, already in 2006-2007, a poor recommendation for anyone wanting to learn about Indo-European. There were (and still are) some very good ones too, like the Indo-European Etymological Dictionary, the Indo-European Roots index, Piotr Gasiorowski’s interesting site, an article on Kurgan Culture, and indeed Kortlandt’s studies.

There are also some (apparently) simple HTML web pages containing an original article, i.e. one-page research by someone (or several people) who preferred to publish their personal opinions or reflections about PIE (or its dialects, like the page on Illyrian) online.

The rest, i.e. those “summaries” of PIE and “demonstration” websites, were maybe good in 1998, when that kind of introductory material was all we had on the net. But now most of them have little content concerning actual PIE reconstruction, and some are even still under construction (!?).

I have submitted our site again, I think more than one year after the first time. I don’t know why it was rejected then; unfortunately, editors at Dmoz probably face too many requests for inclusion to answer them all. But really, if our resources on Proto-Indo-European aren’t good enough for them to be listed at least among those ‘introductions’, I can only think of these explanations:

1. There are no editors for that section. If that’s the case, I could become an editor myself to delete some outdated stuff and add dnghu.org, and maybe other pages (like TITUS) that are not included in this category but elsewhere under Indo-European languages. It doesn’t sound like ‘fair play’ to me, but I think it could at least save all Google users from this stuff.

2. The editors are the owners of those websites, who listed them and no longer want (or have time) to edit. It could be, but most of those sites don’t show any ads, so there is no benefit. The exception is the American Heritage Dictionary, which shows a link to a rather simple summary of PIE apart from its main Root index, both on the same website. In any case, revealing the actual identity of those involved couldn’t hurt anyone (provided all of them were properly notified), and it could save us some needless speculation.

3. There are editors, and they are not related to those websites, but no one is willing to add a website like an “Indo-European Revival Association” to that linguistics section. In that case, they should re-read what the site suggestion guidelines say about choosing the appropriate category: even if what we proposed were an artificial language, a ‘conlang’ (which it is not), what we offer on our site is still the same as those sites on Proto-Indo-European: free online resources about the reconstructed Proto-Indo-European language.

Anyway, I can’t be annoyed, even if being in Dmoz were worth any price, because I myself work on what I like (i.e. PIE resources) for free, and I do what I can as best I can. And I hate it when people just criticize how bad this or that free resource of ours is, and don’t even try to help us improve it. ODP people have just been doing their best since 1998, and it’s still a good place to look for other content, the kind that is not found with a simple Google search.

My thanks to them for achieving that.

PS: By the way, I thought about writing this post after reading this thread in their forum, where some ODP editors answer complaints like those we’ve all had at some point about the work in a free collaborative project like theirs.