The genetic makings of South Asia – IVC as Proto-Dravidian


Review (behind paywall) The genetic makings of South Asia, by Metspalu, Mondal, and Chaubey, Current Opinion in Genetics & Development (2018) 53:128-133.

Interesting excerpts (emphasis mine):

(…) the spread of agriculture in Europe was a result of the demic diffusion of early Anatolian farmers, it was discovered that the spread of agriculture to South Asia was mediated by a genetically completely different farmer population in the Zagros mountains in contemporary Iran (IF). The ANI-ASI cline itself was interpreted as a mixture of three components genetically related to Iranian agriculturalists, Onge and Early and Middle Bronze Age Steppe populations (Steppe_EMBA).

The first ever autosomal aDNA from South Asia comes from Northern Pakistan (Swat Valley, early Iron Age). This study presented altogether 362 aDNA samples from the broad South and Central Asia and contributes substantially to our understanding of the evolutionary past of South and Central Asia. The study redefines the three genetic strata that form the basis of the Indian Cline. The Indus Periphery (IP) component is composed of (varying proportions of): first, IF, second, Ancient Ancestral South Asians (AASI), which represents an ancient branch of human genetic variation in Asia arising from a population split contemporaneous with the splits of East Asian, Onge and Australian Aboriginal ancestors and third, West_Siberian Hunter gatherers (WS_HG).

The authors argue that IP could have formed the genetic base of the Indus Valley Civilization (IVC). Upon the collapse of the IVC IP contributes to the formation of both ASI and ANI. ASI is formed as IP admixes further with AASI. ANI in turn forms when IP admixes with the incoming Middle and Late Bronze Age Steppe (Steppe_MLBA) component, (rather than the Steppe_EMBA groups suggested earlier)

A sketch of the peopling history of South Asia. Depicting the full complexity of available reconstructions is not attempted. Placing of population labels does not indicate precise geographic location or range of the population in question. Rather we aim to highlight the essentials of the recent advancements in the field. We divide the scenario into three time horizons: Panels (a) before 10 000 BCE (pre agriculture era.); (b) 10 000 BCE to 3000 BCE (agriculture era) and (c) 3000 BCE to prehistoric era/modern era. (iron age).

Dating of the arrival of the Austro-Asiatic speakers in South Asia based on Y chromosome haplogroup O2a1-M95 expansion estimates yielded dates between 3000 and 2000 BCE [30]. However, admixture LD decay-based approach on genome-wide data suggests the admixture between South Asian and incoming Austro-Asiatic speakers occurred slightly later between 1800 and 0 BCE (Tätte et al. submitted). It is interesting that while the mtDNA variants of the Mundas are completely South Asian, the Y chromosome variation is dominated at >60% by haplogroup O2a which is phylogeographically nested in East Asian-specific paternal lineages.

In India, the speakers of Tibeto-Burman (TB) languages live in the Seven Sisters States in Northeast India and in the very north of the country. Genetically they show a clear East Asian origin and around 20% of subsequent admixture with South Asians within the last 1000 years. The genetic flavour of East Asia in TB is different from that in Munda speakers as the best surrogates for the East Asian admixing component are contemporary Han Chinese.

I found the simplistic migration maps especially interesting to illustrate ancient population movements. The emergence of EHG is supposed to involve a WHG:ANE cline, though, and this isn’t clear from the map. Also, there is new information on what may be at the origin of WHG and Anatolian hunter-gatherers.

From Reich’s recent session on South Asia at ISBA 8:

– Tale of three clines, with clear indication that “Indus Periphery” samples drawn from an already-cosmopolitan and heterogeneous world of variable ASI & Iranian ancestry. (I know how some people like to pore over these pictures – so note red dots = just dummy data for illustration.)
– Some more certainty about primary window of steppe ancestry injection into S. Asia: 2000-1500 BC
Alexander M. Kim

Featured image: map of South Asian languages from


Munda admixture happened probably during the ANI-ASI mixture


Preprint The genetic legacy of continental scale admixture in Indian Austroasiatic speakers, by Tätte et al. bioRxiv (2018).

Interesting excerpts:

Studies analysing mtDNA and Y chromosome markers have revealed a sex-specific pattern of admixture of Southeast and South Asian ancestry components for Munda speakers. While close to 100% of mtDNA lineages present in Mundas match those in other Indian populations, around 65% of their paternal genetic heritage is more closely related to Southeast Asian than South Asian variation. Such a contrasting distribution of maternal and paternal lineages among the Munda speakers is a classic example of ‘father tongue hypothesis’. However, the temporality of this expansion is contentious. Based on Y-STR data the coalescent time of Indian O2a-M95 haplogroup was estimated to be >10 KYA. Recently, the reconstructed phylogeny of 8.8 Mb region of Y chromosome data showed that Indian O2a-M95 lineages coalesce within a clade nested within East/Southeast Asian within the last ~5-7 KYA. This date estimate sets the upper boundary for the main episode of gene flow of Y chromosomes from Southeast Asia to India.

Supplementary Figure S4. First two components of principal component analysis (PCA). Individuals and population medians (circles) are marked with abbreviations from population names. Different colours represent populations from different geographic areas and/or linguistic groups as shown on the legend on the right. For the full names of populations see Supplementary Table S1. PCA was performed using software EIGENSOFT 6.1.42 on the whole filtered dataset (1072 individuals), previously LD pruned as described in the title of Supplementary Figure S1. The first two principal components describe 5.13% and 2.57% of total variance.

Admixture proportions suggest a novel scenario

Regardless of which West Asian population we used, we found that Munda speakers can be described on average as a mixture of ~19% Southeast Asian, 15% West Asian and 66% Onge (South Asian) components. Alternatively, the West and South Asian components of Munda could be modelled using a single South Asian population (Paniya), accounting on average to 77% of the Munda genome. When rescaling the West and South Asian (Onge) components to 1 to explore the Munda genetic composition prior to the introduction of the Southeast Asian component, we note that the West Asian component is lower (~19%) in Munda compared to Paniya (27%) (Supplementary Table S4: *Average_Lao=0). Consistently with qpGraph analyses in Narasimhan et al. (2018), this may point to an initial admixture of a Southeast Asian substrate with a South Asian substrate free of any West Asian component, followed by the encounter of the resulting admixed population with a Paniya-like population. Such a scenario would imply an inverse relationship between the Southeast and West Asian relative proportions in Munda or, in other words, the increase of Southeast Asian component should cause a greater reduction of the West Asian compared to the reduction in the South Asian component in Munda.
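The rescaling step in the excerpt above can be checked with a few lines of arithmetic. This is only a sketch using the average proportions quoted in the text (19%, 15%, 66%); the function name `rescale_without` is my own, and the paper's ~19% figure comes from its own supplementary table, so small differences from this back-of-the-envelope value are expected.

```python
# Average Munda ancestry proportions as quoted in the excerpt (percent).
munda = {"Southeast Asian": 19.0, "West Asian": 15.0, "Onge (South Asian)": 66.0}


def rescale_without(components, dropped):
    """Drop one component and rescale the remaining ones so they sum to 1."""
    kept = {name: value for name, value in components.items() if name != dropped}
    total = sum(kept.values())
    return {name: value / total for name, value in kept.items()}


# Composition prior to the introduction of the Southeast Asian component.
rescaled = rescale_without(munda, "Southeast Asian")
for name, share in rescaled.items():
    print(f"{name}: {share:.1%}")
```

With these inputs the West Asian share comes out at 15/81, or roughly 18.5%, in line with the ~19% reported for Munda and below the 27% reported for Paniya.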

The distribution of genetic components (K=13) based on the global ADMIXTURE analysis (Supplementary Figure S1, S2, S3) for a subset of populations on a map of South and Southeast Asia. The circular legend in the bottom left corner shows the ancestral components corresponding to the colours on pie charts. The sector sizes correspond to population median.

Dating the admixture event

In this study, we have replicated a result previously reported in Chaubey et al. (2011) [7] that the Mundas lack one ancestral component (k2) that is characteristic to Indian Indo-European and Dravidian speaking populations. If this component came to India through one of the Indo-Aryan migrations then it would be fair to presume that the Munda admixture happened before this component reached India or at least before it spread all over the country. However, the admixture time computed here falls in the exact same timeframe as the ANI-ASI mixture has been estimated to have happened in India through which the k2 component probably spread. Therefore, we propose that if the Munda admixture happened at the same time, it is possible for it to have happened in the eastern part of the country, east of Bangladesh, and later when populations from East Asia moved to the area, the Mundas migrated towards central India. Such a scenario, which may be further clarified by ancient DNA analyses, seems to be further supported by the fact that Mundas harbor a smaller fraction of West Asian ancestry compared to contemporary Paniya (Supplementary Table S4) and cannot therefore be seen as a simple admixture product of Southern Indian populations with incoming Southeast Asian ancestries.

Image from Damgaard et al. (2018). A summary of the four qpAdm models fitted for South Asian populations. For each modern South Asian population, we fit different models with qpAdm to explain their ancestry composition using ancient groups and present the first model that we could not reject in the following priority order: 1. Namazga_CA + Onge, 2. Namazga_CA + Onge + Late Bronze Age Steppe, 3. Namazga_CA + Onge + Xiongnu_IA (East Asian proxy), and 4. Turkmenistan_IA + Xiongnu_IA. Xiongnu_IA were used here to represent East Asian ancestry. We observe that while South Asian Dravidian speakers can be modeled as a mixture of Onge and Namazga_CA, an additional source related to Late Bronze Age steppe groups is required for IE speakers. In Tibeto-Burman and Austro-Asiatic speakers, an East Asian rather than a Steppe_MLBA source is required.
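The selection rule in this caption (report the first model, in a fixed priority order, that cannot be rejected) is easy to sketch. The snippet below is only an illustration of that rule, not the authors' pipeline: the 0.05 threshold, the function name `first_unrejected` and the p-values are my own assumptions, and the labels are written with IA for Iron Age.

```python
from typing import Optional, Sequence, Tuple

# Priority order taken from the caption above.
MODELS = [
    ("Namazga_CA", "Onge"),
    ("Namazga_CA", "Onge", "Late Bronze Age Steppe"),
    ("Namazga_CA", "Onge", "Xiongnu_IA"),
    ("Turkmenistan_IA", "Xiongnu_IA"),
]


def first_unrejected(p_values: Sequence[float],
                     alpha: float = 0.05) -> Optional[Tuple[str, ...]]:
    """Return the first model in priority order whose p-value is >= alpha.

    alpha is an assumed rejection threshold, not one stated in the caption.
    Returns None when every model is rejected.
    """
    for model, p in zip(MODELS, p_values):
        if p >= alpha:
            return model
    return None


# Made-up p-values, purely for illustration (not results from any paper).
example_p = [0.001, 0.30, 0.02, 0.0005]
print(first_unrejected(example_p))
```

With these invented values the two-way model is rejected and the model with the additional Late Bronze Age steppe source is the first one kept, which is the pattern the caption describes for IE speakers.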

Linguistics and genome-wide data

(…) by and large, the linguistic classification justifies itself but Kharia and Juang do not fit in this simplification perfectly.

Once again, with the current level of detail in genetic studies, there is often no clear dialectal division possible for certain groups without fine-scale population studies, and the help from linguistics and archaeology.

Featured image from open access paper by Chaubey et al. (2011).


South-East Asia samples include shared ancestry with Jōmon


New paper (behind paywall) The prehistoric peopling of Southeast Asia, by McColl et al., Science (2018) 361(6397):88-92, based on a recent bioRxiv preprint.

Of interest is the apparently newly reported information, which includes a female sample from the Ikawazu Jōmon of Japan, ca. 570 BC (emphasis mine):

The two oldest samples — Hòabìnhians from Pha Faen, Laos [La368; 7950 to 7795 calendar years before the present (cal B.P.)] and Gua Cha, Malaysia (Ma911; 4415 to 4160 cal B.P.)—henceforth labeled “group 1,” cluster most closely with present-day Önge from the Andaman Islands and away from other East Asian and Southeast-Asian populations (Fig. 2), a pattern that differentiates them from all other ancient samples. We used ADMIXTURE (14) and fastNGSadmix (15) to model ancient genomes as mixtures of latent ancestry components (11). Group 1 individuals differ from the other Southeast Asian ancient samples in containing components shared with the supposed descendants of the Hòabìnhians: the Önge and the Jehai (Peninsular Malaysia), along with groups from India and Papua New Guinea.

We also find a distinctive relationship between the group 1 samples and the Ikawazu Jōmon of Japan (IK002). Outgroup f3 statistics (11, 16) show that group 1 shares the most genetic drift with all ancient mainland samples and Jōmon (fig. S12 and table S4). All other ancient genomes share more drift with present-day East Asian and Southeast Asian populations than with Jōmon (figs. S13 to S19 and tables S4 to S11). This is apparent in the fastNGSadmix analysis when assuming six ancestral components (K = 6) (fig. S11), where the Jōmon sample contains East Asian components and components found in group 1. To detect populations with genetic affinities to Jōmon, relative to present-day Japanese, we computed D statistics of the form D(Japanese, Jōmon; X, Mbuti), setting X to be different present-day and ancient Southeast Asian individuals (table S22). The strongest signal is seen when X=Ma911 and La368 (group 1 individuals), showing a marginally nonsignificant affinity to Jōmon (11). This signal is not observed with X = Papuans or Önge, suggesting that the Jōmon and Hòabìnhians may share group 1 ancestry (11).
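For readers unfamiliar with the D(Japanese, Jōmon; X, Mbuti) notation, here is a minimal sketch of how a D statistic can be computed from per-SNP allele frequencies, using the standard estimator (sum of (p1 - p2)(p3 - p4) over sum of (p1 + p2 - 2·p1·p2)(p3 + p4 - 2·p3·p4)). It is not the code used in the paper: the function name and the toy frequencies are my own, and the block-jackknife step needed to judge significance is left out.

```python
def d_statistic(p1, p2, p3, p4):
    """D(P1, P2; P3, P4) from per-SNP allele frequencies.

    Each argument is a sequence of frequencies (0 to 1) of the same allele,
    one entry per SNP, for populations P1, P2, P3 and P4 respectively.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in zip(p1, p2, p3, p4):
        numerator += (a - b) * (c - d)
        denominator += (a + b - 2 * a * b) * (c + d - 2 * c * d)
    if denominator == 0:
        raise ValueError("no informative SNPs")
    return numerator / denominator


# Toy allele frequencies, invented for illustration only.
japanese = [0.0, 1.0, 0.5, 0.0]
jomon = [1.0, 1.0, 0.5, 1.0]
x = [1.0, 0.0, 0.5, 1.0]
mbuti = [0.0, 0.0, 0.5, 0.0]

print(d_statistic(japanese, jomon, x, mbuti))
```

Under this sign convention a negative value means X shares more alleles with the second population (here Jōmon) than with the first (Japanese), which is the kind of affinity the excerpt reports for the group 1 individuals.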

Model for plausible migration routes into SEA. This schematic is based on ancestry patterns observed in the ancient genomes. Because we do not have ancient samples to accurately resolve how the ancestors of Jōmon and Japanese populations entered the Japanese archipelago, these migrations are represented by dashed arrows. A mainland component in Indonesia is depicted by the dashed red-green line. Gr, group; Kra, Kradai.

(…) Finally, the Jōmon individual is best-modeled as a mix between a population related to group 1/Önge and a population related to East Asians (Amis), whereas present-day Japanese can be modeled as a mixture of Jōmon and an additional East Asian component (Fig. 3 and fig. S29)

This is interesting in relation to the SMBE oral communication O-03-OS02, Whole genome analysis of the Jomon remain reveals deep lineage of East Eurasian populations, by Gakuhari et al.:

Post late-Paleolithic hunter-gatherers lived throughout the Japanese archipelago, Jomonese, are thought to be a key to understanding the peopling history in East Asia. Here, we report a whole genome sequence (x1.85) of 2,500-year old female excavated from the Ikawazu shell-mound, unearthed typical remains of Jomon culture. The whole genome data places the Jomon as a lineage basal to contemporary and ancient populations of the eastern part of Eurasian continent, and supports the closest relationship with the modern Hokkaido Ainu. The results of ADMIXTURE show the Jomon ancestry is prevalent in present-day Nivkh, Ulchi, and people in the main-island Japan. By including the Jomon genome into phylogenetic trees, ancient lineages of the Kusunda and the Sherpa/Tibetan, early splitting from the rest of East Asian populations, is emerged. Thus, the Jomon genome gives a new insight in East Asian expansion. The Ikawazu shell-mound site locates on 34°38′43″ north latitude, and 137°8′52″ east longitude in the central main-island of the Japanese archipelago, corresponding to a warm and humid monsoon region, which has been thought to be almost impossible to maintain sufficient ancient DNA for genome analysis. Our achievement opens up new possibilities for such geographical regions.


Mitogenomes from Thailand offer insights into maternal genetic history of mainland South-East Asia

Open access New insights from Thailand into the maternal genetic history of Mainland Southeast Asia, by Kutanan et al. Eur. J. Hum. Genet. (2018) 26:898–911

Abstract (emphasis mine):

Tai-Kadai (TK) is one of the major language families in Mainland Southeast Asia (MSEA), with a concentration in the area of Thailand and Laos. Our previous study of 1234 mtDNA genome sequences supported a demic diffusion scenario in the spread of TK languages from southern China to Laos as well as northern and northeastern Thailand. Here we add an additional 560 mtDNA genomes from 22 groups, with a focus on the TK-speaking central Thai people and the Sino-Tibetan speaking Karen. We find extensive diversity, including 62 haplogroups not reported previously from this region. Demic diffusion is still a preferable scenario for central Thais, emphasizing the expansion of TK people through MSEA, although there is also some support for gene flow between central Thai and native Austroasiatic speaking Mon and Khmer. We also tested competing models concerning the genetic relationships of groups from the major MSEA languages, and found support for an ancestral relationship of TK and Austronesian-speaking groups.

Map showing sample locations and haplogroup distributions. Blue stars indicate the 22 presently studied populations (Tai-Kadai, Austroasiatic, and Sino-Tibetan groups) while red and green circles represent Tai-Kadai and Austroasiatic populations from the previous study [7]. Population abbreviations are in Supplementary Table S1

Interesting excerpts:

Finally, we used simulations to test hypotheses concerning the genetic relationships of groups belonging to different language families. We found that Starosta’s model [11] provided the best fit to the mtDNA data; however, Sagart’s model [9, 10] was also highly supported. These two models both postulate a close linguistic affinity between TK and AN. Although genetic relatedness between TK and AN groups has been previously studied [7, 46, 47], to our knowledge this is the first study to use demographic simulations to select the best-fitting model. Our results support the genetic relatedness of TK and AN groups, which might reflect a postulated shared ancestry among the proto-Austronesian populations of coastal East Asia [48].

Specifically, the best-fitting model suggests that after separation of the prehistoric TK from AN stocks around 5–6 kya in Southeast China, the TK spread southward throughout MSEA around 1–2 kya by a demic diffusion process, accompanied by population growth but with at most minor admixture with the autochthonous AA groups. Meanwhile, the prehistorical AN ancestors entered Taiwan and dispersed southward throughout ISEA, with these two expansions later meeting in western ISEA. The lack of mtDNA haplogroups associated with the expansion out of Taiwan in our Thai/Lao samples has two possible explanations: either the Out of Taiwan expansion did not reach MSEA (at least, in the area of present-day Thailand and Laos); or, if the prehistoric AN migrated through this area, their mtDNA lineages do not survive in modern Thai/Lao populations. Ancient DNA studies in MSEA would further clarify this issue. Moreover, although mtDNA analyses are informative in elucidating genetic perspectives in geographically and linguistically related populations, they have an obvious limitation in that they only provide insights into the maternal history of populations. Future studies of Y chromosomal and genome-wide data will provide further insights into the genetic history of Thai/Lao populations and the role of factors such as post-marital residence patterns and migration in shaping the genetic structure of the region.

Starosta’s chapter referred to in the paper is Proto-East Asian and the origin and dispersal of the languages of East and Southeast Asia and the Pacific.


Ancient genomes document multiple waves of migration in south-east Asian prehistory


Open access preprint at bioRxiv Ancient genomes document multiple waves of migration in Southeast Asian prehistory, by Lipson, Cheronet, Mallick, et al. (2018).

Abstract (emphasis mine):

Southeast Asia is home to rich human genetic and linguistic diversity, but the details of past population movements in the region are not well known. Here, we report genome-wide ancient DNA data from thirteen Southeast Asian individuals spanning from the Neolithic period through the Iron Age (4100-1700 years ago). Early agriculturalists from Man Bac in Vietnam possessed a mixture of East Asian (southern Chinese farmer) and deeply diverged eastern Eurasian (hunter-gatherer) ancestry characteristic of Austroasiatic speakers, with similar ancestry as far south as Indonesia providing evidence for an expansive initial spread of Austroasiatic languages. In a striking parallel with Europe, later sites from across the region show closer connections to present-day majority groups, reflecting a second major influx of migrants by the time of the Bronze Age.

Schematics of admixture graph results. (A) Wider phylogenetic context. (B) Details of the Austroasiatic clade. Branch lengths are not to scale, and the order of the two events on the Nicobarese lineage in (B) is not well determined (Supplementary Text).

Featured image, from the article: “Overview of samples. (A) Locations and dates of ancient individuals. Overlapping positions are shifted slightly for visibility. (B) PCA with East and Southeast Asians. We projected the ancient samples onto axes computed using the present-day populations (with the exception of Mlabri, who were projected instead due to their large population-specific drift). Present-day colors indicate language family affiliation: green, Austroasiatic; blue, Austronesian; orange, Hmong-Mien; black, Sino-Tibetan; magenta, Tai-Kadai.”

See also:

Genomics reveals four prehistoric migration waves into South-East Asia

Open access preprint article at bioRxiv Ancient Genomics Reveals Four Prehistoric Migration Waves into Southeast Asia, by McColl, Racimo, Vinner, et al. (2018).

Abstract (emphasis mine):

Two distinct population models have been put forward to explain present-day human diversity in Southeast Asia. The first model proposes long-term continuity (Regional Continuity model) while the other suggests two waves of dispersal (Two Layer model). Here, we use whole-genome capture in combination with shotgun sequencing to generate 25 ancient human genome sequences from mainland and island Southeast Asia, and directly test the two competing hypotheses. We find that early genomes from Hoabinhian hunter-gatherer contexts in Laos and Malaysia have genetic affinities with the Onge hunter-gatherers from the Andaman Islands, while Southeast Asian Neolithic farmers have a distinct East Asian genomic ancestry related to present-day Austroasiatic-speaking populations. We also identify two further migratory events, consistent with the expansion of speakers of Austronesian languages into Island Southeast Asia ca. 4 kya, and the expansion by East Asians into northern Vietnam ca. 2 kya. These findings support the Two Layer model for the early peopling of Southeast Asia and highlight the complexities of dispersal patterns from East Asia.

A model for plausible migration routes into Southeast Asia, based on the ancestry patterns observed in the ancient genomes.