Indo-European and Central Asian admixture in the Indian population, dependent on ethnolinguistic and geodemographic divisions


A preprint at bioRxiv, Dissecting Population Substructure in India via Correlation Optimization of Genetics and Geodemographics, by Bose et al. (2017), a mixed group from Purdue University and the IBM TJ Watson Research Center. It is a rather simple paper, but interesting for its approach to the many known demographic divisions of India, and for its briefly reported methods and results.


India represents an intricate tapestry of population substructure shaped by geography, language, culture and social stratification operating in concert. To date, no study has attempted to model and evaluate how these evolutionary forces have interacted to shape the patterns of genetic diversity within India. Geography has been shown to closely correlate with genetic structure in other parts of the world. However, the strict endogamy imposed by the Indian caste system, and the large number of spoken languages add further levels of complexity. We merged all publicly available data from the Indian subcontinent into a data set of 835 individuals across 48,373 SNPs from 84 well-defined groups. Bringing together geography, sociolinguistics and genetics, we developed COGG (Correlation Optimization of Genetics and Geodemographics) in order to build a model that optimally explains the observed population genetic sub-structure. We find that shared language rather than geography or social structure has been the most powerful force in creating paths of gene flow within India. Further investigating the origins of Indian substructure, we create population genetic networks across Eurasia. We observe two major corridors towards mainland India; one through the Northwestern and another through the Northeastern frontier with the Uygur population acting as a bridge across the two routes. Importantly, network, ADMIXTURE analysis and f3 statistics support a far northern path connecting Europe to Siberia and gene flow from Siberia and Mongolia towards Central Asia and India.

Among the most interesting results (emphasis mine):

Our meta-analysis of the ADMIXTURE output shows that the IE and DR populations across castes shared very high ancestry, indicating the autochthonous origin of the caste system in India (Figure 2). f3 statistics show that most of the castes and tribes in India are admixed, with contributions from other castes and/or tribes, across languages affiliations (Supplementary Table 4 and Supplementary Note). The geographically isolated Tibeto-Burman tribes and the Dravidian speaking tribes appear to be the most isolated in India. Linear Discriminant Analysis on the normalized data set clearly supports genetic stratification by castes and languages in the Indian sub-continent.


Our meta-analysis of the ADMIXTURE plot in Figure 4A quantifies the ADMIXTURE results (darker colors indicate higher pairwise shared ancestry). Indian populations show a greater proportion of shared ancestry with the so-called Indian Northwestern Frontier populations, namely the tribal populations spanning Afghanistan and Pakistan. Central Asian populations share higher degrees of ancestry with IE and DR Forward castes. Uygurs share high degrees of ancestry with Indian populations.


f3 statistics (all negative Z-scores are shown) indicate Chinese and Siberian ancestry contributing to the Tibeto-Burman tribal speakers. On the other hand, the Mongols and the Europeans have contributed significant amounts of ancestry to the Indo-European and Tibeto-Burman forward castes. F3 statistics also show that the Central Asians are an admixed population with signs of admixture from Caucasus and other parts of Europe.
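
For readers unfamiliar with the f3 test quoted above: f3(C; A, B) averages (c - a)(c - b) over SNPs, where a, b and c are the allele frequencies in populations A, B and C, and a significantly negative value is taken as evidence that C is admixed between groups related to A and B. What follows is a minimal Python sketch of that idea, not the pipeline used by Bose et al. The function names and block scheme are my own assumptions, the estimator leaves out the finite-sample bias correction that dedicated software applies, and the Z-score comes from a simple delete-one block jackknife over contiguous SNP blocks.

```python
import math


def f3_terms(target, source_a, source_b):
    """Per-SNP products (c - a) * (c - b) from allele frequencies in [0, 1]."""
    if not target or not (len(target) == len(source_a) == len(source_b)):
        raise ValueError("need three non-empty frequency lists of equal length")
    return [(c - a) * (c - b) for c, a, b in zip(target, source_a, source_b)]


def f3(target, source_a, source_b):
    """Naive f3(target; source_a, source_b): the mean of the per-SNP products.

    Assumption: no correction for finite sample size in the target population.
    """
    terms = f3_terms(target, source_a, source_b)
    return sum(terms) / len(terms)


def f3_z_score(target, source_a, source_b, n_blocks):
    """Z-score of the naive f3 via a delete-one block jackknife.

    SNPs are split into n_blocks contiguous blocks (a stand-in for the
    genomic blocks used in practice); each block is left out in turn.
    """
    terms = f3_terms(target, source_a, source_b)
    n_snps = len(terms)
    if n_blocks < 2 or n_blocks > n_snps:
        raise ValueError("n_blocks must be between 2 and the number of SNPs")

    total = sum(terms)
    estimate = total / n_snps

    # Block boundaries; integer division keeps every block non-empty.
    bounds = [i * n_snps // n_blocks for i in range(n_blocks + 1)]

    # Recompute f3 with each block removed.
    leave_one_out = []
    for start, end in zip(bounds, bounds[1:]):
        block_sum = sum(terms[start:end])
        leave_one_out.append((total - block_sum) / (n_snps - (end - start)))

    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (value - mean_loo) ** 2 for value in leave_one_out
    )
    if variance == 0:
        raise ValueError("jackknife variance is zero; Z-score is undefined")
    return estimate / math.sqrt(variance)
```

When the target's allele frequencies lie between those of the two sources at most SNPs, every product is negative and so are f3 and its Z-score. When the target lies outside both sources, the statistic is positive and gives no evidence of admixture.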

Among the results for proportions of shared ancestry between Indians and Eurasians (Fig. 4), there is an obvious influence of European admixture (Caucasus, and Southern, Central, and Northern EU), potentially from the Yamna-Corded Ware expansion, in IE_ForwardCaste. It is weaker in IE_BackwardCaste and in IE_Tribal. DR_ForwardCaste in turn shows more admixture than IE_Tribal, but the signal diminishes in lower castes and is quite low in DR_Tribal.

Ancestry from Central Asia is strong and follows a similar pattern, which hints at the role of Sintashta, Andronovo, and BMAC in the expansion of the Steppe component, even more than that of a later Turkic component.

On the other hand, the influence from Turkey is difficult to assess, given the complex genetic history of Anatolia, but the map in Fig. 6 doesn’t feel right, from a genetic viewpoint as well as from linguistic and archaeological ones. It is the typical map built from admixture analyses that goes wrong because it does not take anthropological theories into account.

Quite interesting, then, is the pattern of admixture in these different ethnolinguistic groups, Indo-European and Dravidian, which points to an initially greater expansion of Indo-European speakers and a later resurgence of Dravidian languages.

Featured image, from the article: simplified origin and data of the samples studied.


About the European Union’s arcane language: the EU does seem difficult for people to understand

Mark Mardell asks in his post Learn EU-speak:

Does the EU shroud itself in obscure language on purpose or does any work of detail produce its own arcane language? Of course it is not just the lingo: the EU does seem difficult for people to understand. What’s at the heart of the problem?

His answer on the radio (like the comments that can be read on his blog) will probably look for complex reasoning about the nature of the European Union as an elitist institution, distant from real people; about the “obscure language” (intentionally?) used by MEPs; about the need for that language to be obscured by legal terms, etc.

All that is great. You can talk a lot about the possible reasons why people find those Europarliament discussions, where everyone speaks their own national language, too boring; possible reasons why important media (like the BBC) never show debates on important issues unless the MEP uses their national language; possible reasons why that doesn’t happen with national parliaments, where everyone speaks a common language…

But the most probable answer is so obvious it doesn’t really make sense to ask. The interesting question is: do people actually want to pay the price for having a common Europe?

Five lines of ancient script on a shard of pottery could be the longest proto-Canaanite text ever found, archaeologists say

According to the BBC News article ‘Oldest Hebrew script’ is found:

The shard was found by a teenage volunteer during a dig about 20km (12 miles) south-west of Jerusalem. Experts at Hebrew University said dating showed it was written 3,000 years ago – about 1,000 years earlier than the Dead Sea Scrolls. Other scientists cautioned that further study was needed to understand it.

Preliminary investigations since the shard was found in July have deciphered some words, including judge, slave and king. The characters are written in Proto-Canaanite, a precursor of the Hebrew alphabet.

I found it interesting because of the implications such findings might have for classifying dead languages as more natural or more artificial according to how much we know of them. This matters especially for proto-languages like Proto-Canaanite (or Europe’s Indo-European), which can easily move from category 9 (‘hypothetical language’) to category 8 or even 7 (‘dead language’).

As we have said before, this implies that, despite the efforts of some conlangers to make their newly created conlangs look the same as proto-languages like PIE in terms of ‘artificiality’, the two obviously aren’t the same.