The concept of “outlier” in studies of Human Ancestry, and the Corded Ware outlier from Esperstedt


While writing the third version of the Indo-European demic diffusion model, I noticed that one Corded Ware sample (labelled I0104) clusters quite closely with steppe samples (i.e. Yamna, Afanasevo, and Poltavka). The other Corded Ware samples cluster, as expected, closely with east-central European samples, which include related cultures such as the Swedish Battle Axe, and the later Sintashta and Potapovka (cultures that are from the steppe proper, but are derived from Corded Ware).

I also noticed after publishing the draft that I had used the wording “Corded Ware outlier” at least once. I certainly had that term in mind when developing the third version, but I did not intend to write it down formally. Nevertheless, I think it is the right name to use.

PCA of dataset including Minoans and Mycenaeans, and Scythians and Sarmatians. The graphic has been arranged so that ancestries and samples are located on geographically friendly axes, roughly north-south (Y) and east-west (X). Symbols are used, in a simplified manner, in accordance with the symbols for Y-DNA haplogroups used in the maps. Labels have been used to simplify important components. Areas are drawn surrounding Yamna, Poltavka, Afanasevo, Corded Ware (including samples from Estonia, Battle Axe, and the Poltavka outlier), and the succeeding Sintashta and Potapovka cultures, as well as Bell Beaker. Corded Ware sample I0104, from Esperstedt, has also been labelled.

An outlier in Statistics, as you can infer from the name, is a sample (more precisely, an observation) that lies far from the others. It is a slippery concept in Human Evolutionary Biology, because it has no clear definition, and it thus depends on a certain degree of subjective evaluation. It seems to be based mainly on a combination of PCA and ADMIXTURE analyses, but it should obviously also depend on the number of samples available for a certain culture, and on their regional distribution.
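To illustrate how subjective the call can be, here is a minimal sketch of one possible way to flag candidate outliers within a single labelled group, using only its PCA coordinates. It is not the procedure used in any of the studies discussed here: the function name, the robust z-score rule, the threshold of 3.5, and the coordinates are all assumptions chosen for illustration.

```python
import numpy as np

def flag_outliers(pcs, threshold=3.5):
    """Flag samples lying unusually far from the group median in PC space.

    pcs: (n_samples, n_components) PCA coordinates for ONE labelled group.
    Returns a boolean array; True marks a candidate outlier.
    The threshold is an arbitrary, hypothetical choice.
    """
    pcs = np.asarray(pcs, dtype=float)
    center = np.median(pcs, axis=0)               # robust group centre
    dist = np.linalg.norm(pcs - center, axis=1)   # distance of each sample to it
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))           # median absolute deviation
    if mad == 0:
        return np.zeros(len(pcs), dtype=bool)     # no spread: nothing to flag
    robust_z = 0.6745 * (dist - med) / mad
    return robust_z > threshold

# Made-up coordinates (PC1, PC2): six samples close together, one far away.
group = np.array([
    [0.010, 0.020],
    [0.012, 0.018],
    [0.009, 0.022],
    [0.011, 0.021],
    [0.013, 0.019],
    [0.010, 0.017],
    [0.060, -0.030],
])
flags = flag_outliers(group)
```

With so few samples per culture, moving the threshold or adding or dropping a single sample can change which individuals are flagged, which is exactly why the label depends on sample size and regional coverage.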

We have thus certain clear cases, like the Poltavka outlier, of R1a-M417 lineage, clustering close to Corded Ware (and Sintashta, and Potapovka) samples, but far from other R1b-L23 samples from Poltavka or Yamna cultures, from neighbouring regions in the steppe.

We have also less clear observations, like Balkan Chalcolithic samples, which may or may not have been part of different cultural groups (say, related to the Suvorovo-Novodanilovka expansion, or not), which may justify their differences in ancestral components in ADMIXTURE, and in their position in PCA.

And we have a Yamna sample from western Ukraine, which – unlike the other two available samples – clusters “to the south” of east Yamna samples. Taking into account the Yamna sample from Bulgaria, clustering closely with south-eastern European samples, could you really call this an outlier? Two outliers out of four western Yamna samples? Well, maybe. If you take east and west Yamna from the steppe as a whole, and exclude the Yamna sample from Bulgaria, of course you can. Whether that classification is useful, or actually hinders a proper interpretation of western Yamna samples, and of the “Yamna component” seen in them, is a different story…

PCA for European samples of Mathieson et al. (2017)

But what then about the Corded Ware male from Esperstedt, labelled I0104, dated ca. 2430 BC, which clusters among contemporaneous steppe (Poltavka) samples, and has the greatest proportion of ‘Yamna component’ in ADMIXTURE? After all, it is different in both respects from any other Corded Ware individual – including the oldest samples available, from Latvia (ca. 2885 BC) and Tiefbrunn (ca. 2755 BC).

This sample is one of the direct links between the steppe and Corded Ware in late times, and has been the main reason for the confusion a lot of people seem to have about the “Yamna component” in Corded Ware, with some supporting a direct migration from one into the other, and a few even daring to say that “Corded Ware is indistinguishable from Yamna”(!?).

His family members, all males of haplogroup R1a-M417 (like I0104 and most males from the Corded Ware culture), show a decreased Yamna component a few generations later, which clearly indicates that this individual’s admixture came directly from the steppe, and most likely from one or multiple female ancestors. That is compatible with the nomadic nature of the Corded Ware culture (and its known exogamy practices), which connected central Europe with the steppes, up to the North Caspian region.

If labelling other samples as outliers may be interesting to improve the conclusions one can obtain from genetic research, labelling this sample is, in my opinion, essential, to avoid certain strong misconceptions about the origin of the Corded Ware culture.


Indo-European demic diffusion model, 3rd edition


I have just uploaded the working draft of the third version of the Indo-European demic diffusion model. Unlike the previous two versions, which were published as essays (fully developed papers), this new version adds more information on human admixture, and probably needs important corrections before a definitive edition can be published.

The third version is available right now on ResearchGate and I will post the PDF at Academia Prisca, as soon as possible:

Map overlaid by PCA including Yamna, Corded Ware, Bell Beaker, and other samples

Feel free to comment on the paper here, or (preferably) in our forum.

A working version (needing some corrections), divided by sections and illustrated with up-to-date, high-resolution maps, can be found (as always) at the official collaborative Wiki website.

Palaeogenomic and biostatistical analysis of ancient DNA data from Mesolithic and Neolithic skeletal remains


PhD Thesis Palaeogenomic and biostatistical analysis of ancient DNA data from Mesolithic and Neolithic skeletal remains, by Zuzana Hofmanova (2017) at the University of Mainz.

Palaeogenomic data have illuminated several important periods of human past with surprising implications for our understanding of human evolution. One of the major changes in human prehistory was Neolithisation, the introduction of the farming lifestyle to human societies. Farming originated in the Fertile Crescent approximately 10,000 years BC and in Europe it was associated with a major population turnover. Ancient DNA from Anatolia, the presumed source area of the demic spread to Europe, and the Balkans, one of the first known contact zones between local hunter-gatherers and incoming farmers, was obtained from roughly contemporaneous human remains dated to ∼6th millennium BC. This new unprecedented dataset comprised of 86 full mitogenomes, five whole genomes (7.1–3.7x coverage) and 20 high coverage (7.6–93.8x) genomic samples. The Aegean Neolithic population, relatively homogeneous on both sides of the Aegean Sea, was positively proven to be a core zone for demic spread of farmers to Europe. The farmers were shown to migrate through the central Balkans and while the local sedentary hunter-gatherers of Vlasac in the Danube Gorges seemed to be isolated from the farmers coming from the south, the individuals of the Aegean origin infiltrated the nearby hunter-gatherer community of Lepenski Vir. The intensity of infiltration increased over time and even though there was an impact of the Danubian hunter-gatherers on genetic variation of Neolithic central Europe, the Aegean ancestry dominated during the introduction of farming to the continent.

Taking only admixture analyses using Yamna samples:

This increased genetic affinity of Neolithic farmers to Danubians was observed for Neolithic Hungarians, LBK from central Europe and LBK Stuttgart sample. Some post-Neolithic samples also proved to share more drift with Danubians, again samples from Hungary (Bronze Age and Copper Age samples) and also Yamnaya and samples with elevated Yamnaya ancestry (Early Bronze Age samples from Únětice, Bell Beaker samples, Late Neolithic Karlsdorf sample and Corded Ware samples).


The results of our ADMIXTURE analysis for the dataset including also Yamnaya samples are shown in Figure S1c. The cross-validation error was the lowest for K=2. Supervised and unsupervised analyses for K=3 are again highly concordant. Early Neolithic farmers again demonstrate almost no evidence of hunter-gatherer admixture, while it is observable in the Middle Neolithic farmers. However, much of the Late Neolithic hunter-gatherer ancestry from the previous analysis is replaced by Yamnaya ancestry. These results are consistent with the results of Haak et al. who demonstrated a resurgence of hunter-gatherer ancestry followed by the establishment of Eastern hunter-gatherer ancestry.

Again, admixture results show that something in the simplistic Yamna -> Corded Ware model is off. It is still interesting to review admixture results of European Mesolithic and Late Neolithic genomic data in relation to the so-called steppe or Yamna ancestry or component (most likely an eastern steppe / forest zone ancestry, probably also present in the earlier Corded Ware horizons) and its interpretation…

Image composed by me, from two different images of the PhD Thesis. To the left: Supervised run of ADMIXTURE. The clusters to be supervised were chosen to best fit the presumed ancestral populations (Motala for hunter-gatherers, Bar8 and Bar31 for farmers, and Yamnaya for the later eastern migration). To the right: Unsupervised run of ADMIXTURE for the Anatolian genomic dataset with Yamnaya samples for K=8.

Discovered via Généalogie génétique

Two more studies on the genetic history of East Asia: Han Chinese and Thailand


A comprehensive map of genetic variation in the world’s largest ethnic group – Han Chinese, by Chiang et al. (2017).

It is believed – based on uniparental markers from modern and ancient DNA samples and array-based genome-wide data – that Han Chinese originated in the Central Plain region of China during prehistoric times, expanding with agriculture and technology northward and southward, to become the largest Chinese ethnic group.


As are most non-European populations around the globe, the Han Chinese are relatively understudied in population and medical genetics studies. From low-coverage whole-genome sequencing of 11,670 Han Chinese women we present a catalog of 25,057,223 variants, including 548,401 novel variants that are seen at least 10 times in our dataset. Individuals from our study come from 19 out of 22 provinces across China, allowing us to study population structure, genetic ancestry, and local adaptation in Han Chinese. We identify previously unrecognized population structure along the East-West axis of China and report unique signals of admixture across geographical space, such as European influences among the Northwestern provinces of China. Finally, we identified a number of highly differentiated loci, indicative of local adaptation in the Han Chinese. In particular, we detected extreme differentiation among the Han Chinese at MTHFR, ADH7, and FADS loci, suggesting that these loci may not be specifically selected in Tibetan and Inuit populations as previously suggested. On the other hand, we find that Neandertal ancestry does not vary significantly across the provinces, consistent with admixture prior to the dispersal of modern Han Chinese. Furthermore, contrary to a previous report, Neandertal ancestry does not explain a significant amount of heritability in depression. Our findings provide the largest genetic data set so far made available for Han Chinese and provide insights into the history and population structure of the world’s largest ethnic group.

Using Shanghai individuals as representatives, shared drift between Chinese and ancient humans is computed with outgroup f3 statistics of the form f3(Mbuti; X, Y), with ancient individuals grouped approximately into Palaeolithic, Mesolithic, Neolithic, and Chalcolithic-Medieval times. Modern Chinese individuals show greater shared drift with pre-Neolithic hunter-gatherers than with Neolithic farmers (Featured image from the article).
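For readers unfamiliar with the statistic, here is a minimal sketch of what an outgroup f3 of the form f3(O; X, Y) measures: the average, over SNPs, of the product of allele-frequency differences between the outgroup O and each of the two test populations. This is a simplification under my own assumptions, not the pipeline used in the paper. Real estimators also correct for finite sample sizes and provide standard errors (usually by block jackknife), and the frequencies below are made up.

```python
import numpy as np

def outgroup_f3(freq_o, freq_x, freq_y):
    """Naive outgroup f3(O; X, Y) from per-SNP allele frequencies.

    Larger values mean X and Y share more drift since splitting from O.
    No sample-size bias correction is applied (simplifying assumption).
    """
    o, x, y = (np.asarray(a, dtype=float) for a in (freq_o, freq_x, freq_y))
    return float(np.mean((o - x) * (o - y)))

# Hypothetical allele frequencies at four SNPs.
outgroup = [0.0, 1.0, 0.0, 1.0]
pop_x    = [0.5, 0.5, 0.5, 0.5]
pop_near = [0.5, 0.5, 0.5, 0.5]   # drifted from the outgroup together with X
pop_far  = [0.0, 1.0, 0.5, 0.5]   # shares that drift at only half of the SNPs

f3_near = outgroup_f3(outgroup, pop_x, pop_near)
f3_far = outgroup_f3(outgroup, pop_x, pop_far)
```

In this toy case `f3_near` is larger than `f3_far`, which is the sense in which one population is said to share more drift with X than another.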

EDIT (17/7/2017): Davidski at Eurogenes shares an interesting view on this kind of results:

These sorts of estimates always look way off. And I doubt that it’s largely the result of the Silk Road, which linked China to the Near East and Mediterranean rather than to Northern Europe. More likely it reflects gene flow from the Pontic-Caspian steppe in Eastern Europe during the Bronze and Iron ages, via the Afanasievo, Andronovo, and other closely related steppe peoples

New insights from Thailand into the maternal genetic history of Mainland Southeast Asia, by Kutanan et al. (2017)


Tai-Kadai (TK) is one of the major language families in Mainland Southeast Asia (MSEA), with a concentration in the area of Thailand and Laos. Our previous study of 1,234 mtDNA genome sequences supported a demic diffusion scenario in the spread of TK languages from southern China to Laos as well as northern and northeastern Thailand. Here we add an additional 560 mtDNA sequences from 22 groups, with a focus on the TK-speaking central Thai people and the Sino-Tibetan speaking Karen. We find extensive diversity, including 62 haplogroups not reported previously from this region. Demic diffusion is still a preferable scenario for central Thais, emphasizing the extension and expansion of TK people through MSEA, although there is also some support for an admixture model. We also tested competing models concerning the genetic relationships of groups from the major MSEA languages, and found support for an ancestral relationship of TK and Austronesian-speaking groups.

Effective migration in Western Eurasia reveals fine-scale migration surface features


Interesting poster from SMBE 2017, Maps of effective migration as a summary of global human genetic diversity, by Benjamin Peter, Desislava Petkova, Matthew Stephens & John Novembre, of the JNPopGen group of the University of Chicago.

You can read the full poster in the original PDF, or as a compressed image. The following are important excerpts:

Aim: To answer the following questions:

  • Which regions have high/low effective migration?
  • How well is human genetic diversity explained by this pure isolation-by-distance model?
  • How does the explanatory performance of EEMS compare to PCA?

Method: It uses the method proposed by Petkova et al. (2016) to fit a map of time-averaged (effective) migration rates to geographically referenced samples, and merges data from 24 different studies (8740 individuals from 469 populations) to assess human genetic diversity at global and continental scales.

  1. Basic workflow:
    • Merge data, remove duplicated & related individuals.
    • Remove hunter-gatherer and recently admixed populations. Their locations are still indicated with (H) and (X), respectively.
  2. EEMS analysis
    • Calculate genetic distance matrix between all individuals.
    • Fit migration map to data using EEMS MCMC algorithm
  3. Comparison to PCA: Standard PCA using flashpca (Abraham & Inouye 2014) was used. They compare the correlation of the genetic distance induced from the first ten PCs with that of the fitted EEMS distance.
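The PCA side of that comparison can be sketched as follows. This is only my reading of the poster, not the authors' code: it uses plain NumPy instead of flashpca, squared Euclidean distance as a stand-in for their genetic dissimilarity, and a randomly generated genotype matrix. All function names are hypothetical. In the poster, the same kind of correlation is computed for the distances fitted by EEMS, which this sketch does not attempt to reproduce.

```python
import numpy as np

def pairwise_sq_dist(points):
    """Squared Euclidean distance between all pairs of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def pca_induced_distance(genotypes, n_pcs=10):
    """Pairwise distances between individuals using only the first n_pcs PCs."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    k = min(n_pcs, len(s))
    scores = u[:, :k] * s[:k]          # PC coordinates of each individual
    return pairwise_sq_dist(scores)

def upper_triangle_correlation(d_a, d_b):
    """Pearson correlation between two distance matrices (each pair counted once)."""
    iu = np.triu_indices_from(d_a, k=1)
    return float(np.corrcoef(d_a[iu], d_b[iu])[0, 1])

# Made-up data: 20 individuals x 200 SNPs, genotypes coded 0/1/2.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(20, 200)).astype(float)

observed = pairwise_sq_dist(genotypes)
induced_10 = pca_induced_distance(genotypes, n_pcs=10)
induced_all = pca_induced_distance(genotypes, n_pcs=20)

r_10 = upper_triangle_correlation(observed, induced_10)
r_all = upper_triangle_correlation(observed, induced_all)
```

Keeping every component reproduces the observed distances, so `r_all` is essentially 1. The question the poster asks is how much is lost when only the first ten PCs are kept, and whether a fitted migration surface does better.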

Interpretation: A continuous habitat is approximated by a discrete grid (light gray). A Bayesian model is used to infer the most likely migration rates, which are given on a log scale relative to the average (BLUE = 100x higher, BROWN = 100x lower).

Map of effective migrations in Europe

Results (see maps):

  1. Global diversity patterns correlate with topographical features
  2. In Western Eurasia, EEMS reveals fine-scale migration surface features

Discussion: EEMS maps are an intuitive and direct way to visualize geographically referenced genetic data.

Dense sampling (Western Eurasian panel) in particular yields high resolution and accuracy, but the method works well both at a global scale (FST=0.06) and within Western Eurasia alone (FST=0.01).

EEMS-maps are able to reasonably well predict genetic differences, but hunter-gatherer populations and admixed populations were a priori excluded.

Discovered via Eurogenes. Full image via Reddit.