Indo-European and Central Asian admixture in Indian populations, dependent on ethnolinguistic and geodemographic divisions


Preprint at bioRxiv: Dissecting Population Substructure in India via Correlation Optimization of Genetics and Geodemographics, by Bose et al. (2017), a mixed group from Purdue University and IBM TJ Watson Research Center. It is a rather simple paper, but an interesting one, both for its approach to the many known demographic divisions of India and for its concisely reported methods and results.


India represents an intricate tapestry of population substructure shaped by geography, language, culture and social stratification operating in concert. To date, no study has attempted to model and evaluate how these evolutionary forces have interacted to shape the patterns of genetic diversity within India. Geography has been shown to closely correlate with genetic structure in other parts of the world. However, the strict endogamy imposed by the Indian caste system, and the large number of spoken languages add further levels of complexity. We merged all publicly available data from the Indian subcontinent into a data set of 835 individuals across 48,373 SNPs from 84 well-defined groups. Bringing together geography, sociolinguistics and genetics, we developed COGG (Correlation Optimization of Genetics and Geodemographics) in order to build a model that optimally explains the observed population genetic sub-structure. We find that shared language rather than geography or social structure has been the most powerful force in creating paths of gene flow within India. Further investigating the origins of Indian substructure, we create population genetic networks across Eurasia. We observe two major corridors towards mainland India; one through the Northwestern and another through the Northeastern frontier with the Uygur population acting as a bridge across the two routes. Importantly, network, ADMIXTURE analysis and f3 statistics support a far northern path connecting Europe to Siberia and gene flow from Siberia and Mongolia towards Central Asia and India.

Among the most interesting results (emphasis mine):

Our meta-analysis of the ADMIXTURE output shows that the IE and DR populations across castes shared very high ancestry, indicating the autochthonous origin of the caste system in India (Figure 2). f3 statistics show that most of the castes and tribes in India are admixed, with contributions from other castes and/or tribes, across languages affiliations (Supplementary Table 4 and Supplementary Note). The geographically isolated Tibeto-Burman tribes and the Dravidian speaking tribes appear to be the most isolated in India. Linear Discriminant Analysis on the normalized data set clearly supports genetic stratification by castes and languages in the Indian sub-continent.


Our meta-analysis of the ADMIXTURE plot in Figure 4A quantifies the ADMIXTURE results (darker colors indicate higher pairwise shared ancestry). Indian populations show a greater proportion of shared ancestry with the so-called Indian Northwestern Frontier populations, namely the tribal populations spanning Afghanistan and Pakistan. Central Asian populations share higher degrees of ancestry with IE and DR Forward castes. Uygurs share high degrees of ancestry with Indian populations.


f3 statistics (all negative Z-scores are shown) indicate Chinese and Siberian ancestry contributing to the Tibeto-Burman tribal speakers. On the other hand, the Mongols and the Europeans have contributed significant amounts of ancestry to the Indo-European and Tibeto-Burman forward castes. F3 statistics also show that the Central Asians are an admixed population with signs of admixture from Caucasus and other parts of Europe.
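For readers unfamiliar with the test behind those negative Z-scores, here is a minimal sketch of an admixture f3 statistic, f3(C; A, B), computed from allele frequencies. It is not the authors' pipeline: the populations are simulated, the function names `f3_terms` and `f3_with_z` are hypothetical, and the estimator omits the finite-sample bias correction that dedicated tools such as ADMIXTOOLS apply. A significantly negative value suggests that C is a mixture of populations related to A and B.

```python
import numpy as np

def f3_terms(target, src1, src2):
    """Per-SNP terms (c - a)(c - b) from allele frequencies in [0, 1]."""
    c, a, b = (np.asarray(v, dtype=float) for v in (target, src1, src2))
    return (c - a) * (c - b)

def f3_with_z(target, src1, src2, n_blocks=10):
    """Return (f3 estimate, Z-score) using a delete-one-block jackknife.

    Assumption: SNPs are ordered along the genome, so contiguous,
    roughly equal-sized blocks stand in for linkage blocks.
    """
    terms = f3_terms(target, src1, src2)
    estimate = terms.mean()
    blocks = np.array_split(terms, n_blocks)
    total, n = terms.sum(), terms.size
    # Re-estimate f3 with each block left out in turn
    leave_one_out = np.array([(total - blk.sum()) / (n - blk.size) for blk in blocks])
    # Standard delete-one jackknife standard error
    se = np.sqrt((n_blocks - 1) / n_blocks
                 * np.sum((leave_one_out - leave_one_out.mean()) ** 2))
    return estimate, estimate / se

# Toy data (simulated, not from the paper): C is a 50/50 mix of A and B
rng = np.random.default_rng(0)
pop_a = rng.uniform(0.05, 0.95, size=5000)
pop_b = rng.uniform(0.05, 0.95, size=5000)
pop_c = 0.5 * pop_a + 0.5 * pop_b

estimate, z = f3_with_z(pop_c, pop_a, pop_b)
print(f"f3(C; A, B) = {estimate:.4f}, Z = {z:.1f}")
```

In this toy case every per-SNP term equals -0.25(a - b)^2, so the estimate is negative by construction. Real data also carry drift in the target population, which pushes f3 upwards and can hide admixture.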

Among the results for proportions of shared ancestry between Indians and Eurasians (Fig. 4), there is an obvious influence of European admixture (Caucasus, and Southern, Central, and Northern EU), potentially from the Yamna-Corded Ware expansion, in IE_ForwardCaste. It is lessened in IE_BackwardCaste and in IE_Tribal. DR_ForwardCaste again shows more admixture than IE_Tribal, but the signal diminishes in lower castes and is quite low in DR_Tribal.

Ancestry from Central Asia is strong and follows a similar pattern, which hints at the role of Sintashta, Andronovo, and BMAC in the expansion of the Steppe component, even more than at a later Turkic component.

On the other hand, the influence from Turkey is difficult to assess, given the complex genetic history of Anatolia. The map in Fig. 6 does not look right, whether from a genetic, linguistic, or archaeological point of view. It is the typical map derived from admixture analyses that goes wrong because it does not take anthropological theories into account.

Quite interesting, then, is the pattern of admixture in these two ethnolinguistic groups, Indo-European and Dravidian, which points to an initially greater expansion of Indo-European speakers and a later resurgence of Dravidian languages.

Featured image, from the article: simplified origin and data of the samples studied.


Two more studies on the genetic history of East Asia: Han Chinese and Thailand


A comprehensive map of genetic variation in the world’s largest ethnic group – Han Chinese, by Chiang et al. (2017).

It is believed – based on uniparental markers from modern and ancient DNA samples and array-based genome-wide data – that Han Chinese originated in the Central Plain region of China during prehistoric times, expanding with agriculture and technology northward and southward, to become the largest Chinese ethnic group.


As are most non-European populations around the globe, the Han Chinese are relatively understudied in population and medical genetics studies. From low-coverage whole-genome sequencing of 11,670 Han Chinese women we present a catalog of 25,057,223 variants, including 548,401 novel variants that are seen at least 10 times in our dataset. Individuals from our study come from 19 out of 22 provinces across China, allowing us to study population structure, genetic ancestry, and local adaptation in Han Chinese. We identify previously unrecognized population structure along the East-West axis of China and report unique signals of admixture across geographical space, such as European influences among the Northwestern provinces of China. Finally, we identified a number of highly differentiated loci, indicative of local adaptation in the Han Chinese. In particular, we detected extreme differentiation among the Han Chinese at MTHFR, ADH7, and FADS loci, suggesting that these loci may not be specifically selected in Tibetan and Inuit populations as previously suggested. On the other hand, we find that Neandertal ancestry does not vary significantly across the provinces, consistent with admixture prior to the dispersal of modern Han Chinese. Furthermore, contrary to a previous report, Neandertal ancestry does not explain a significant amount of heritability in depression. Our findings provide the largest genetic data set so far made available for Han Chinese and provide insights into the history and population structure of the world’s largest ethnic group.

Using Shanghai individuals as representatives, shared drift between Chinese and ancient humans is computed with outgroup f3 statistics of the form f3(Mbuti; X, Y), with ancient individuals separated into approximately Palaeolithic, Mesolithic, Neolithic, and Chalcolithic-Medieval times. Modern Chinese individuals are found to show greater shared drift with pre-Neolithic hunter-gatherers than with Neolithic farmers (featured image from the article).
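To make the outgroup form f3(Mbuti; X, Y) more concrete, the following is a rough sketch with simulated allele frequencies, not the paper's data or code. The helper `outgroup_f3`, the variable names, and the drift model (small Gaussian shifts in frequency) are all assumptions chosen for illustration, and the finite-sample correction used by real tools is left out. The idea is that the larger the value, the more genetic drift X and Y share after their split from the outgroup.

```python
import numpy as np

def outgroup_f3(outgroup, x, y):
    """Mean of (o - x)(o - y) over SNPs: drift shared by X and Y since the outgroup."""
    o, x, y = (np.asarray(v, dtype=float) for v in (outgroup, x, y))
    return float(np.mean((o - x) * (o - y)))

# Toy model (an assumption, not a realistic population simulation):
# frequencies drift by small Gaussian steps away from the outgroup.
rng = np.random.default_rng(1)
n_snps = 5000
mbuti = rng.uniform(0.3, 0.7, size=n_snps)      # stand-in outgroup frequencies
shared_branch = rng.normal(0.0, 0.05, n_snps)   # drift on a branch common to X and "close"

def drift(freqs):
    """Add independent population-specific drift and keep values in [0, 1]."""
    return np.clip(freqs + rng.normal(0.0, 0.05, n_snps), 0.0, 1.0)

modern_x = drift(mbuti + shared_branch)   # plays the role of X (e.g. modern individuals)
close = drift(mbuti + shared_branch)      # ancient group that shares the branch with X
distant = drift(mbuti)                    # ancient group that split off before that branch

scores = {
    "close": outgroup_f3(mbuti, modern_x, close),
    "distant": outgroup_f3(mbuti, modern_x, distant),
}
for name, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"f3(Mbuti; X, {name}) = {value:.5f}")
```

Ranking ancient groups by this value is how a comparison like the one in the featured image is usually read: the group with the highest outgroup f3 shares the most drift with the test population.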

EDIT (17/7/2017): Davidski at Eurogenes shares an interesting view on this kind of result:

These sorts of estimates always look way off. And I doubt that it’s largely the result of the Silk Road, which linked China to the Near East and Mediterranean rather than to Northern Europe. More likely it reflects gene flow from the Pontic-Caspian steppe in Eastern Europe during the Bronze and Iron ages, via the Afanasievo, Andronovo, and other closely related steppe peoples

New insights from Thailand into the maternal genetic history of Mainland Southeast Asia, by Kutanan et al. (2017)


Tai-Kadai (TK) is one of the major language families in Mainland Southeast Asia (MSEA), with a concentration in the area of Thailand and Laos. Our previous study of 1,234 mtDNA genome sequences supported a demic diffusion scenario in the spread of TK languages from southern China to Laos as well as northern and northeastern Thailand. Here we add an additional 560 mtDNA sequences from 22 groups, with a focus on the TK-speaking central Thai people and the Sino-Tibetan speaking Karen. We find extensive diversity, including 62 haplogroups not reported previously from this region. Demic diffusion is still a preferable scenario for central Thais, emphasizing the extension and expansion of TK people through MSEA, although there is also some support for an admixture model. We also tested competing models concerning the genetic relationships of groups from the major MSEA languages, and found support for an ancestral relationship of TK and Austronesian-speaking groups.