Palaeogenomic and biostatistical analysis of ancient DNA data from Mesolithic and Neolithic skeletal remains


PhD Thesis Palaeogenomic and biostatistical analysis of ancient DNA data from Mesolithic and Neolithic skeletal remains, by Zuzana Hofmanova (2017) at the University of Mainz.

Palaeogenomic data have illuminated several important periods of human past with surprising implications for our understanding of human evolution. One of the major changes in human prehistory was Neolithisation, the introduction of the farming lifestyle to human societies. Farming originated in the Fertile Crescent approximately 10,000 years BC and in Europe it was associated with a major population turnover. Ancient DNA from Anatolia, the presumed source area of the demic spread to Europe, and the Balkans, one of the first known contact zones between local hunter-gatherers and incoming farmers, was obtained from roughly contemporaneous human remains dated to ∼6th millennium BC. This new unprecedented dataset comprised of 86 full mitogenomes, five whole genomes (7.1–3.7x coverage) and 20 high coverage (7.6–93.8x) genomic samples. The Aegean Neolithic population, relatively homogeneous on both sides of the Aegean Sea, was positively proven to be a core zone for demic spread of farmers to Europe. The farmers were shown to migrate through the central Balkans and while the local sedentary hunter-gatherers of Vlasac in the Danube Gorges seemed to be isolated from the farmers coming from the south, the individuals of the Aegean origin infiltrated the nearby hunter-gatherer community of Lepenski Vir. The intensity of infiltration increased over time and even though there was an impact of the Danubian hunter-gatherers on genetic variation of Neolithic central Europe, the Aegean ancestry dominated during the introduction of farming to the continent.

Taking only admixture analyses using Yamna samples:

This increased genetic affinity of Neolithic farmers to Danubians was observed for Neolithic Hungarians, LBK from central Europe and LBK Stuttgart sample. Some post-Neolithic samples also proved to share more drift with Danubians, again samples from Hungary (Bronze Age and Copper Age samples) and also Yamnaya and samples with elevated Yamnaya ancestry (Early Bronze Age samples from Únětice, Bell Beaker samples, Late Neolithic Karlsdorf sample and Corded Ware samples).


The results of our ADMIXTURE analysis for the dataset including also Yamnaya samples are shown in Figure S1c. The cross-validation error was the lowest for K=2. Supervised and unsupervised analyses for K=3 are again highly concordant. Early Neolithic farmers again demonstrate almost no evidence of hunter-gatherer admixture, while it is observable in the Middle Neolithic farmers. However, much of the Late Neolithic hunter-gatherer ancestry from the previous analysis is replaced by Yamnaya ancestry. These results are consistent with the results of Haak et al. who demonstrated a resurgence of hunter-gatherer ancestry followed by the establishment of Eastern hunter-gatherer ancestry.
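As a rough illustration of what "the cross-validation error was the lowest for K=2" means in practice, here is a minimal Python sketch that picks the K with the lowest CV error from ADMIXTURE log output. It assumes the runs were made with cross-validation enabled, so that the logs contain lines of the form `CV error (K=2): ...`. The function names and the numbers in `example_log` are made up for this example and are not taken from the thesis.

```python
import re

# Matches lines such as "CV error (K=2): 0.5120" in ADMIXTURE logs.
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def parse_cv_errors(log_text):
    """Return {K: cv_error} for every 'CV error (K=...)' line found."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


# Made-up values for illustration only; they are NOT from the thesis.
example_log = """\
CV error (K=2): 0.5120
CV error (K=3): 0.5175
CV error (K=4): 0.5231
"""

errors = parse_cv_errors(example_log)
print("lowest CV error at K =", best_k(errors))
```

Note that the K with the lowest CV error is only a statistical guide: the thesis still shows K=3 and K=8 runs, since higher K values can be more informative for interpretation.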

Again, admixture results show that something in the simplistic Yamna -> Corded Ware model is off. It is still interesting to review admixture results of European Mesolithic and Late Neolithic genomic data in relation to the so-called steppe or Yamna ancestry or component (most likely an eastern steppe / forest zone ancestry probably also present in the earlier Corded Ware horizons) and its interpretation…

Image composed by me, from two different images of the PhD Thesis. To the left: Supervised run of ADMIXTURE. The clusters to be supervised were chosen to best fit the presumed ancestral populations (for HG Motala and for farmers Bar8 and Bar31 and for later Eastern migration Yamnaya). To the right: Unsupervised run of ADMIXTURE for the Anatolian genomic dataset with Yamnaya samples for K=8.

Another hint at the role of Corded Ware peoples in spreading Uralic languages into north-eastern Europe, found in mtDNA analysis of the Finnish population


Open article at Scientific Reports (Nature): Identification and analysis of mtDNA genomes attributed to Finns reveal long-stagnant demographic trends obscured in the total diversity, by Översti et al. (2017).

Of special interest is its depiction of Finland’s past as including the expansion of Corded Ware population of mtDNA U5b1b2 (and probably Y-DNA R1a-M417 subclades), most likely Uralic speakers of the Forest Zone, to the north of the Yamna culture (where Late Proto-Indo-European was spoken).

A later expansion of other subclades, particularly Y-DNA N1c, was probably associated with the later western expansion of the Eurasian Seima-Turbino phenomenon, and its current prevalence in Finnish Y-DNA haplogroups might have been the consequence of the population decline ca. 1500 BC, and the later Iron Age population bottleneck (with the population peak ca. 500 AD) described in the article.

That would more naturally explain the ‘cultural diffusion’ of Finnic languages into invading eastern N1c lineages, a diffusion which would have been in fact a long-term, quite gradual replacement of previously prevalent Y-DNA R1a subclades in the region, as supported by the prevalent “steppe” component in genome-wide ancestry of Finns.

Therefore, there were probably no sudden, strong population (and thus cultural) changes associated with the arrival of N1c lineages, like the ones seen with R1a (Corded Ware / Uralic) and R1b (Yamna / Proto-Indo-European) expansions in Europe.

How the Saami fit into this scheme is not yet obvious, though.


In Europe, modern mitochondrial diversity is relatively homogeneous and suggests a ubiquitous rapid population growth since the Neolithic revolution. Similar patterns also have been observed in mitochondrial control region data in Finland, which contrasts with the distinctive autosomal and Y-chromosomal diversity among Finns. A different picture emerges from the 843 whole mitochondrial genomes from modern Finns analyzed here. Up to one third of the subhaplogroups can be considered as Finn-characteristic, i.e. rather common in Finland but virtually absent or rare elsewhere in Europe. Bayesian phylogenetic analyses suggest that most of these attributed Finnish lineages date back to around 3,000–5,000 years, coinciding with the arrival of Corded Ware culture and agriculture into Finland. Bayesian estimation of past effective population sizes reveals two differing demographic histories: 1) the ‘local’ Finnish mtDNA haplotypes yielding small and dwindling size estimates for most of the past; and 2) the ‘immigrant’ haplotypes showing growth typical of most European populations. The results based on the local diversity are more in line with that known about Finns from other studies, e.g., Y-chromosome analyses and archaeology findings. The mitochondrial gene pool thus may contain signals of local population history that cannot be readily deduced from the total diversity.

From its results:

In general, there appears to be two loose and largely overlapping clusters among the Finn-characteristic haplogroups: the first between 1,000–2,000 ybp and the second around 3,300–5,500 ybp. The age of the older cluster coincides temporally with the arrival of the Corded-Ware culture and, notably, the spread of agriculture in Finland. The arrival and spread of agriculture, temporally corresponding with the age estimates for most of the haplogroups characteristic of Finns, might be a sign of population size increase enabled by the new mode of subsistence, resulting in reduced drift and accumulation of genetic diversity in the population.


Another insight in the past population sizes in Finland is based on radiocarbon-dated archaeological findings in different time periods. These analyses suggest two prehistoric population peaks in Finland, the Stone Age peak (c. 5,500 ybp) and the Metal Age peak (~1,500 ybp). Both of these peaks were followed by a population decline, which appears to have reached its ebb around 3,500 ybp. These developments are not distinguishable in the BSPs. However, these ages correspond well to the two haplogroup age clusters described above. The presumably less severe Iron Age population bottleneck seen in the archaeological data, 1,500–1,300 ybp, temporally coincides with the population size reduction visible for the Finn-characteristic subhaplogroups.


Preprint paper: Estimating genetic kin relationships in prehistoric populations, by Monroy Kuhn, Jakobsson, and Günther


A new preprint paper appeared some days ago in BioRxiv, Estimating genetic kin relationships in prehistoric populations, by Uppsala University researchers Jose Manuel Monroy Kuhn, Mattias Jakobsson, and Torsten Günther. You might remember the last two from their work Ancient X chromosomes reveal contrasting sex bias in Neolithic and Bronze Age Eurasian migrations, whose results were said not to be replicable by Lazaridis and Reich (PNAS), something they denied, pointing to the limitations of the current aDNA data (PNAS).

They propose a new, more conservative method to infer close relationships (in contrast with available methods, suitable for modern samples). They have implemented the method as a software program, called READ, which should work better with degraded samples (typical of ancient DNA) by reducing false positives – and having therefore more false negatives. Abstract:

Archaeogenomic research has proven to be a valuable tool to trace migrations of historic and prehistoric individuals and groups, whereas relationships within a group or burial site have not been investigated to a large extent. Knowing the genetic kinship of historic and prehistoric individuals would give important insights into social structures of ancient and historic cultures. Most archaeogenetic research concerning kinship has been restricted to uniparental markers, while studies using genome-wide information were mainly focused on comparisons between populations. Applications which infer the degree of relationship based on modern-day DNA information typically require diploid genotype data. Low concentration of endogenous DNA, fragmentation and other post-mortem damage to ancient DNA (aDNA) makes the application of such tools unfeasible for most archaeological samples. To infer family relationships for degraded samples, we developed the software READ (Relationship Estimation from Ancient DNA). We show that our heuristic approach can successfully infer up to second degree relationships with as little as 0.1x shotgun coverage per genome for pairs of individuals. We uncover previously unknown relationships among prehistoric individuals by applying READ to published aDNA data from several human remains excavated from different cultural contexts. In particular, we find a group of five closely related males from the same Corded Ware culture site in modern-day Germany, suggesting patrilocality, which highlights the possibility to uncover social structures of ancient populations by applying READ to genome-wide aDNA data.

READ was applied to the 230 ancient DNA samples from Mathieson et al. (2015), with some interesting results. For starters, this paper already supports the idea that the five German Corded Ware samples from Esperstedt were all related, thus further supporting to a certain extent the culture’s patrilocality and female exogamy practices:

Of particular interest was a group of five males from Esperstedt in Germany who were associated with the Corded Ware culture – a culture that arose after large scale migrations of males from the east. Around 50 Corded Ware burials, six of them stone cists, were excavated near Esperstedt in the context of road constructions in 2005. Characteristic Corded Ware pottery was found in the graves and all male individuals had been buried on their right hand side. Interestingly, the central individual of the group of related individuals (I1541) was buried in a stone cist approximately 700 meters from the graves of the other four individuals which were all close to each other. The close relationship of this group of only male individuals from the same location suggest patrilocality and female exogamy, a pattern which has also been found from Strontium isotopes at another Corded Ware site just 30 kilometers from Esperstedt and suggested for the Corded Ware culture in general. This represents just one example of how the genetic analysis of relationships can be used to uncover and understand social structures in ancient populations.

It is to be expected that improvement in such methods can help more accurately define certain samples, by inferring their precise subclades. For example, in the case of those relatives from Esperstedt – classified variously as R(xR1b), R1a, or R1a1 – one would be able to classify those related patrilineally to the most precise subclade: in this case, that of the sample I0104 (ca. 2473–2348 BC), of subclade R1a1a1-M417.

However, errors are dependent on the quality of the ancient DNA recovered:

READ does not explicitly model aDNA damage and it only considers one allele at heterozygous sites. This implies that a careful curation of the data is required to avoid errors due to low coverage, short sequence fragments, deamination damage, sequencing errors and potential contamination. We recommend a number of well established filtering steps when working with low coverage aDNA data
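To make the idea more concrete, here is a minimal Python sketch of a READ-style heuristic. It is not the READ program itself. It assumes pseudo-haploid calls (one allele per site and individual, as the quote above describes), counts the proportion of non-matching alleles for a pair in non-overlapping windows, divides the pair's average by the median over all pairs (taken here as the value expected for unrelated individuals), and then bins the result. All function names are hypothetical, and the window size and cut-offs are illustrative assumptions of mine, chosen as midpoints between assumed expected values of 1 (unrelated), 0.875 (second degree), 0.75 (first degree) and 0.5 (identical); they are not taken from the paper or the software.

```python
from statistics import median


def window_mismatch_rates(alleles_a, alleles_b, positions, window_size=1_000_000):
    """Proportion of non-matching alleles per non-overlapping window.

    alleles_a / alleles_b hold ONE allele per site (pseudo-haploid calls),
    with None where a sample has no data. window_size is an assumption.
    """
    counts = {}  # window index -> [mismatches, compared sites]
    for pos, a, b in zip(positions, alleles_a, alleles_b):
        if a is None or b is None:
            continue  # site not covered in both individuals
        window = counts.setdefault(pos // window_size, [0, 0])
        window[0] += int(a != b)
        window[1] += 1
    return [mismatches / sites for mismatches, sites in counts.values()]


def pair_p0(alleles_a, alleles_b, positions, window_size=1_000_000):
    """Average per-window mismatch rate for one pair of individuals."""
    rates = window_mismatch_rates(alleles_a, alleles_b, positions, window_size)
    if not rates:
        raise ValueError("no overlapping sites for this pair")
    return sum(rates) / len(rates)


def classify(normalized_p0, cutoffs=(0.625, 0.8125, 0.9375)):
    """Bin a normalized mismatch rate; the cut-offs are illustrative only."""
    identical_max, first_max, second_max = cutoffs
    if normalized_p0 < identical_max:
        return "identical/twins"
    if normalized_p0 < first_max:
        return "first degree"
    if normalized_p0 < second_max:
        return "second degree"
    return "unrelated"


def classify_pairs(pair_p0s):
    """Normalize every pair by the median over all pairs, then classify."""
    baseline = median(pair_p0s.values())
    return {pair: classify(p0 / baseline) for pair, p0 in pair_p0s.items()}
```

Normalizing by the median only makes sense if most pairs in the sample are unrelated. With a handful of individuals from a single family grave, the baseline itself would be biased, which is one more reason for the careful data curation the authors ask for.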

Heyd, Mallory, and Prescott were right about Bell Beakers


Sometimes it is fun to read certain “old” papers. I have recently re-read some important papers that predicted what we are seeing now in aDNA analysis with surprising accuracy:

Harrison & Heyd (2007): “We predict that future stable isotope and ancient DNA analyses of Beaker skeletal material will support our view that immigration played an important role in the Europe-wide Bell Beaker phenomenon”. – Duh, obvious, right? Wrong. Read the whole paper. It was already becoming a classic in the study of the Bell Beaker culture before the latest research on Bell Beaker aDNA, and it will be still more important from now on. There are different models for the Bell Beaker origin and expansion, and this was only one of them: we had the Dutch model, the radiocarbon date-based attempts to locate Bell Beakers in Iberia or North Africa… I tried to highlight the best sentences from Heyd’s article to include them in my article, and I just couldn’t stop highlighting almost everything. It is surprising that 10 years ago Volker Heyd was predicting so much from such a limited amount of material, and with conflicting reports coming from everywhere, from palaeogenetics to radiocarbon dating. Not that today their chronology of Le Petit-Chasseur is accepted by all, but their general Bell Beaker and Yamna model has been clearly established as the most likely one with support from aDNA.

Mallory in Celtic from the West 2 (2013), as the last of many to propose Bell Beaker as the vector of spread of Late Indo-European languages, but the first to relate it to North-West Indo-European: “The spread of Indo-European languages from Alpine Europe may have begun with the Beaker culture, presuming here a non-Iberian Beaker homeland (Rhineland, Central European) for that part of the Beaker phenomenon that was associated with an Indo-European language. While it is possible that IE language(s) spread with the Beaker phenomenon, it is questionable that this was associated with Proto-Celtic rather than earlier forms of Late Indo-European, at least part of which might be subsumed under the heading NW Indo-European. This is because the time depth of the dispersal of the Beakers is so great and the earliest attested Celtic languages are so similar (…)”. You might think that it is related to the Atlantic Indo-European theory favoured by Cunliffe and Koch in the book… Wrong, he specifically dismisses a Neolithic spread of Indo-European, and a Chalcolithic spread of Celtic languages as too early. You might also think that to publish that in 2013 has no merit, given the data. Wrong again. Just look at the trend among renowned archaeologists – like Anthony (with Haak) and Kristiansen (with Allentoft) – trying to hop on the bandwagon of Corded Ware-driven Indo-European dispersal based on the “steppe admixture” proportion of recent genetic papers, and you realize he is going against the grain here.

Prescott and Walderhaug 1995 (as referred to in Prescott 2012): “The Bell Beaker period is the most, perhaps the only, reasonable candidate for the spread and final entrenchment of a common Indo-European language throughout Scandinavia (and not just Corded Ware core areas of southern and eastern Scandinavia), and particularly Norway”. Duh again? Not so fast. While Bell Beaker had been proposed before as a vector of Indo-European languages in Europe, the association with Germanic was far more controversial. Only the unifying Dagger Period was more clearly established as of Pre-Germanic nature, but it could be interpreted as of Corded Ware, Úněticean, or even early Neolithic origin, or a mix of them. Bell Beaker groups were never good candidates, if only because of the desire by some researchers to offer a romanticized (either more unifying or ancient) picture of a Germanic Northern Scandinavian homeland, explained as a culturally and genetically homogeneous group.

Their papers seem to state the obvious now that the latest aDNA samples are proving them correct, but it was far from clear years ago: remember the native European Basque-R1b – Uralic-N1c harmony disrupted by invasive Eurasian Indo-European-speaking warriors carrying R1a lineages from Yamna to Corded Ware? Well, that is still a thing for some. And even today the most popular interpretation of the spread of Indo-European-speakers in Europe is based on the defined “steppe ancestry” proportion found in Corded Ware individuals, and a supposed Yamna community formed by R1b-R1a lineages, which is obviously reminiscent of the identification of R1a lineages with Proto-Indo-Europeans based on the initial analysis of haplogroups in modern populations.

It is sad to imagine how much we would have improved in our knowledge, had we read their work with interest when it was necessary, and not now that we have most of the aDNA clues. Still sadder is to see people rely on genetic studies alone to derive today what are likely the wrong conclusions. Again.

I will end with a mea culpa. I hadn’t read those works; but even if I had, I would have stayed with the simpler, R1a-Corded Ware model of Indo-European dispersion. That oversimplification will remain in the different editions of our Grammar of Modern Indo-European as a permanent reminder. Simpler seems always better, and Cavalli-Sforza had famously asserted that ancient population movements could be solved with the study of the structure of modern populations. I think he was right, that we can in fact ascertain ancient population movements by studying modern populations if we include anthropological disciplines, but it is such a complex task – and geneticists have not shown a good grasp of (or interest in) Anthropology – that it is nowadays clearly wrong to rely on modern population samples to derive conclusions about ancient populations, and we are better off studying ancient DNA samples in their context.

We were Back-to-the-Future-wrong, overestimating our potential in some aspects (like the results of researching modern DNA), and underestimating it in others (like the potential changes that ancient DNA investigation could bring for anthropological disciplines). Just as we are wrong today in trusting the potential of admixture analysis to be self-explanatory, without a need for wide anthropological investigation (or even able to revolutionize archaeological and linguistic theories).

I hope to keep a more critical view of publications – especially the most popular ones – from now on.

Indo-European demic diffusion model, 2nd edition, revised and updated

It has been three months since I published the first paper on the Indo-European demic diffusion model.

In the meantime, important pre-print papers with samples of Bell Beaker and South-Eastern European cultures compel me to add new data in support of the model. I have taken this opportunity to revise the whole text in a new paper, Indo-European demic diffusion model, 2nd edition, and also some of the maps of Indo-European migrations, which are now hosted in this blog.

I have made changes to some of the old blogs I had, like this one, and I have merged two of them (from and in this domain,, to begin blogging about anthropological questions regarding Proto-Indo-Europeans and their language.

This blog was used years ago as my personal dialectic training site in English, mostly filled with controversial topics, and while I hope to keep some form of discussion, I want to turn it into a more pragmatic blog for news and reports on Indo-European studies. will be used as a collaborative Wiki website for this model to include supplementary information from published papers – such as results of individual and group’s admixture analyses, archaeological information of individual samples, and also mtDNA. To collaborate, users will have to request an account first (it will be a closed community), and those with important contributions will be added as authors of the following editions of the paper.

Indo-European Demic Diffusion – The expansion of Proto-Indo-Europeans potentially explained as the expansion of R1b subclades

I published an essay (or “dissertation”) some weeks ago, about what seems to me one of the most likely models of expansion of Indo-European-speaking peoples, based on Y-DNA haplogroups. Recently J.P. Mallory had proposed* (although he was not the first) that North-West Indo-European (the ancestor of Italo-Celtic and Germanic, and Balto-Slavic**) expanded with the Bell Beaker culture, a hypothesis that is supported by the most recent radiocarbon data (and subsequent proposal of an eastern origin of the pre-Bell Beaker culture, linked to the Yamna expansion, by Volker Heyd). As I outline in the paper, ancient DNA samples and genetic data from modern populations seem to support this new model.

The still most prevalent model followed by archaeologists, based on Gimbutas’ theory, links the Corded Ware culture expansion to an expansion of the Yamna culture. Gimbutas linked the expansion of Bell Beaker to the expansion of certain Indo-European dialects through Vucedol, and Corded Ware was associated with the expansion of Germano-Balto-Slavic. Even though linguistics has changed its mainstream view of the dialectalization of Late Indo-European in the past half century, the archaeological community (those who supported the steppe expansion, at least) has remained strongly linked to Gimbutas, and more recently David Anthony has supported a similar model (with a phylogenetic model of Proto-Indo-European dialects by Don Ringe), by explaining a dual expansion into Corded Ware by Pre-Germanic (through a mixed Old European / IE Usatovo culture) and Pre-Balto-Slavic (through the Middle Dnieper culture), while eastern Bell Beakers expanded with Italo-Celtic dialects. While a strong cultural connection between Yamna and Corded Ware is currently undeniable, and admixture analyses show a connection between steppe and both Bell Beaker and Corded Ware samples, the actual relationship is today far less clear than it was 10 years ago (when we would simply connect Yamna with an R1a-dominated Corded Ware), and far more ancient samples from the steppe, steppe-forest, and forest zone are needed to extract any strong conclusions.

During this time I have received some comments on the paper, and have discovered some interesting sources for more information, like BioRxiv (for the newest pre-print papers on Genetics), and (for papers on Archaeology), both of which I can hardly recommend enough for anyone interested in these topics. From what I have experienced, Linguistics – which seemed to me a quite closed, strongly conservative community, due to my proposal of speaking a reconstructed Proto-Indo-European dialect as a common language today – has been more open with my model than some archaeological/genetic tandems, and linguists have shown a clearer grasp of all anthropological disciplines involved in Indo-European studies than others… My model remains a theory that I expect to develop further with more details and more genetic data, as they are published.

There are some interesting upcoming samples (mainly from Bell Beaker) by the Reich Lab – and today its publication seems nearer. While the interpretation seems to be in line with what has been said in previous similar publications, the most interesting data will most likely be the actual samples, apparently already showing a lack of steppe ancestry in Iberian Bell Beaker, and a clear invasion of Bell Beaker peoples (hence R1b?) in Great Britain. Hopefully some new samples of Yamna and Corded Ware might give us interesting information.

It is always to be remembered that, when talking about Indo-European peoples, what matters is linguistics: after all, the peoples whose place and time we want to find are defined by their language, Indo-European. Archaeology might be able to date some cultural developments potentially linked with Indo-European-speaking peoples, and genetics might give support to the expansion of peoples (and thus maybe languages) accompanying such cultural expansions. Recent genetic developments are quite interesting, in that we might be able to place Late Indo-European and North-West Indo-European speakers in place and time, but it seems to me that some people are trying to answer the Urheimat problem the other way around.

* J.P. Mallory, ‘The Indo-Europeanization of Atlantic Europe’, in Celtic From the West 2: Rethinking the Bronze Age and the Arrival of Indo-European in Atlantic Europe, eds J. T. Koch and B. Cunliffe (Oxford, 2013), pp. 17–40

** It has been proposed that Balto-Slavic derived partially from North-West Indo-European, and partially from a different Late Indo-European language, although there are different models to explain the pidginization of this dialect