On the Ukraine Eneolithic outlier I6561 from Alexandria


Over the past week or so, since the publication of new Corded Ware samples in Narasimhan, Patterson et al. (2019) and after finding out that the R1a-M417 star-like phylogeny may have started ca. 3000 BC, I have been ruminating on the relevance of contradictory data about the Ukraine_Eneolithic_o sample from Alexandria, its potentially wrong radiocarbon date, and its implications for the Indo-European question.

How many other similar ‘controversial’ samples are there which we haven’t even considered? And what mechanisms are in place to ensure that the case of Hajji_Firuz_CA I2327 is not repeated?

Ukraine Eneolithic outlier I6561

It is not the first time that I (or many others) have alternately questioned its subclade or its date, but the contradictory data seem to keep piling up. We can still explain all these discrepancies by assuming that the radiocarbon date is correct – seeing how it is a direct and newly reported lab analysis – because it is an isolated individual from a poorly sampled region, so he may actually be the first one to show features typical of later Corded Ware-related samples.

PCA of ancient Eurasian samples. An interpretation of the evolution of the Pontic-Caspian steppe populations in the Eneolithic. See full PCA.

The individual seems to be especially relevant for the Indo-European and Uralic homeland question. The last one to mention this sample in a publication was Anthony (2019), who considered it together with two other Eneolithic samples from Dereivka to show how Anatolian farmer-related ancestry first appeared in the recently opened CHG mating network of the Pontic-Caspian steppes and forest-steppes during the Middle Eneolithic, after the expansion of Khvalynsk:

The currently oldest sample with Anatolian Farmer ancestry in the steppes in an individual at Aleksandriya, a Sredni Stog cemetery on the Donets in eastern Ukraine. Sredni Stog has often been discussed as a possible Yamnaya ancestor in Ukraine (Anthony 2007: 239- 254). The single published grave is dated about 4000 BC (4045–3974 calBC/ 5215±20 BP/ PSUAMS-2832) and shows 20% Anatolian Farmer ancestry and 80% Khvalynsk-type steppe ancestry (CHG&EHG). His Y-chromosome haplogroup was R1a-Z93, similar to the later Sintashta culture and to South Asian Indo-Aryans, and he is the earliest known sample to show the genetic adaptation to lactase persistence (I3910-T). Another pre-Yamnaya grave with Anatolian Farmer ancestry was analyzed from the Dnieper valley at Dereivka, dated 3600-3400 BC (grave 73, 3634–3377 calBC/ 4725±25 BP/ UCIAMS-186349). She also had 20% Anatolian Farmer ancestry, but she showed less CHG than Aleksandriya and more Dereivka-1 ancestry, not surprising for a Dnieper valley sample, but also showing that the old fifth-millennium-type EHG/WHG Dnieper ancestry survived into the fourth millennium BC in the Dnieper valley (Mathieson et al. 2018).

The main problem is that this sample shows more than one inconsistent, anachronistic piece of data relative to its reported precise radiocarbon date ca. 4045–3974 calBCE (5215±20BP, PSUAMS-2832). I summarized them on Twitter:

  • First known R1a-M417 sample, with subclade R1a-Y26 (Y2-), with formation date and TMRCA ca. 2750 BC (CI 95% ca. 3750–1950 BC), and typical of much later Steppe_MLBA bottlenecks. The closest available sample would be the Poltavka outlier of hg. R1a-Z94 (ca. 2700 BC), from a mixed cemetery that could belong to a later (likely Abashevo) layer; the closest related subclade is probably found in sample I12450 of Butkara_IA (ca. 800 BC).
  • NOTE. The formation date of the upper clade R1a-Z93 is estimated ca. 3000 BC, with a CI 95% ca. 3550–2550 BC, suggesting that the actual TMRCA range for the subclade most likely has a later (younger) upper bound than the one estimated with the available samples under Y3.

  • Ancestry and PCA cluster like Steppe_MLBA (see PCA below), different from neighbouring Sredni Stog samples of the roughly coetaneous Dereivka site (ca. 3600-3400 BC), and from a later Yamnaya sample from Dereivka (ca. 2800 BC), even more shifted toward WHG-related ancestry.
  • Allele for lactase persistence (13910-T), found only much later among Bell Beakers, and still later in Sintashta and Steppe_MLBA samples. This suggests a strong selection in northern Europe and South Asia stemming from steppe-related (and not forest-steppe-related) peoples, postdating the age of massive Indo-European migrations.
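The haplogroup mismatch in the first point can be made explicit with a simple interval comparison. What follows is only a minimal sketch, not a dating method: it asks by how many years the youngest end of the calibrated radiocarbon range predates the oldest end of a clade's 95% CI, using the figures quoted above. The function name and the (older, younger) tuple convention are assumptions made for illustration.

```python
def years_too_early(sample_range_bc, clade_ci_bc):
    """Return how many years a sample predates a clade's oldest plausible date.

    Both arguments are (older, younger) tuples in years BC, e.g. (4045, 3974).
    A result of 0 means the sample's date range reaches the clade's 95% CI
    (or is younger than it, which is not anachronistic).
    """
    _sample_older, sample_younger = sample_range_bc
    clade_older, _clade_younger = clade_ci_bc
    # In years BC, larger numbers are older.
    return max(0, sample_younger - clade_older)


# Figures quoted in the list above.
I6561_CALBC = (4045, 3974)   # reported radiocarbon range (PSUAMS-2832)
R1A_Y26_CI = (3750, 1950)    # formation date and TMRCA, 95% CI
R1A_Z93_CI = (3550, 2550)    # formation date of the upper clade, 95% CI

gap_y26 = years_too_early(I6561_CALBC, R1A_Y26_CI)
gap_z93 = years_too_early(I6561_CALBC, R1A_Z93_CI)
print(f"I6561 predates the R1a-Y26 CI by at least {gap_y26} years")
print(f"I6561 predates the R1a-Z93 CI by at least {gap_z93} years")
```

A non-zero gap is not proof of a wrong date, since formation estimates depend on which lineages have been sampled; it only quantifies how far the reported date sits outside the current estimates.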
Hajji Firuz Chalcolithic outlier

    My impression is that the Hajji_Firuz Chalcolithic outlier, initially dated ca. 5900-5500 BC, had much less reason to be questioned than this sample, since Pre-Yamnaya ancestry was (and apparently is still) believed by members of the Reich Lab to have come from south of the Caucasus, and to have arrived in the North Caspian steppe around that time or earlier, i.e. before the 5th millennium BC.

    The formation date of its initially reported haplogroup, R1b-Z2103, is ca. 4100 BC (CI 95% 4800-3500 BC), which also seems roughly compatible with that date and site (at least as compatible as R1a-Y3(xY2) is for ca. 4000 BC), so it could have been interpreted as a migrant from the South Caspian region, potentially related to Proto-Anatolians, especially before the description of the Caucasus genetic barrier in Wang et al. (2018). For some reason, though, the Hajji_Firuz sample was questioned, but this one didn’t even merit a question mark.

    There was already a similar situation with two samples (RISE568 and RISE569) initially reported as belonging to Czech Corded Ware groups, which turned out to be Early Slavs ca. 3,000 years younger, in turn more closely related to Bell Beaker-derived cultures of Central-East Europe. It seems little has changed since that case.

    All in all, my guess is that genomic data of I6561 would have been a priori more compatible with a later period, during the expansion of East Corded Ware groups: at least Middle Dnieper culture, potentially Multi-Cordoned Ware culture, but most likely a Srubnaya-related one, given the most likely SNP mutation and TMRCA date, and the haplogroup variability found in the few samples available from that culture.

    PCA of ancient Eurasian samples. Marked I6561 sample within the cluster formed by Srubnaya samples. See full PCA.

    Compatibility checks

    I tried to start a thread on the possibility that the radiocarbon date was wrong and, IF it were, how likely it would be that formal stats could actually show this, or how we could automatically prevent ancestry magic fiascos.

    In other words: if this guy were a Srubnaya-related individual actually dated e.g. ca. 1700 BC, and someone tried to ‘prove’ – based on the current open source tools alone – that he was the ancestor of expanding peoples of the 4th and 3rd millennium BC (i.e. Balkan outliers, Yamnaya, Corded Ware, you name it), could these results be formally challenged?

    I was hoping for some original brainstorming where people would propose crazy, essentially impossible to understand statistical models, say plotting dozens of well-studied mutations of different geographically related ancient samples with their reported dates, to visually highlight samples that don’t exactly fit with such a feature-based time series analysis; I mean, the kind of theoretical models I wouldn’t even be able to follow after the first two tweets or so. I didn’t receive an answer like that, but still:

    I have nothing to add to these answers, because I agree that all contradictory data are circumstantial.

    The current absolute lack of this kind of validity check for ancestry models is disappointing, though, and leaves the so-called outliers in a dangerous limbo between “potentially very interesting samples” and “potentially wrongly dated samples”. Radiocarbon date is thus – together with compatibility of population source in terms of archaeological cultures and their potential relationship – a necessary variable to take into account in any statistical design: an error in one of these variables means a catastrophic error in the whole model.

    Formal stats

    For example, in these qpAdm models, I assumed Srubnaya, Ukraine_Eneolithic_outlier, and Bulgaria_MLBA samples were roughly coetaneous and potentially related to the Srubnaya-Sabatinovka-Noua cultural horizon, hence stemming from a source close to:

    1. Abashevo-like individuals (whose best proxy to date should be Poltavka_outlier I0432) potentially admixed with Poltavka-like herders; or
    2. Potapovka-like individuals potentially admixed with Catacomb-like peoples (whose best proxy until recently were probably Yamnaya_Kalmykia*).

    *To avoid adding more potential errors by merging different datasets, I have used only proxy samples available in the Reich Lab’s curated dataset of published ancient DNA.

    Srubnaya and Noua-Sabatinovka cultural horizon during the MLBA. See full maps.

    Apart from the lack of more models for comparison (I’m not going to dedicate more time to this), the results can’t be interpreted without proper sampling and context, either, because (1) Poltavka_o may actually be from a much later group closely related to Srubnaya; (2) Bulgaria_MLBA is only one sample; and (3) there are only two samples from Potapovka; so the models presented here are basically useless, as are many similar models that have been tested looking just for a formal “best fit”.

    So feel free to chime in and contribute ideas on how to detect in the future whether a sample is ancestral to or derived from others. I will post informative answers from Twitter here, too, if there are any. I don’t think a discussion about the potentially wrong date of this specific sample is very useful, because it seems impossible to prove or disprove at this point. The question is rather what tools or data you would use to at least try to assess whether samples are compatible with their reported dates, preferably in some kind of automated sieve that takes dozens or hundreds of samples into account.
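To make the idea of an automated sieve a bit more concrete, here is a rough leave-one-out sketch. For every sample it checks each feature (haplogroup, allele, ancestry cluster) against the earliest date among the other carriers of that feature, and flags samples that are far ahead of everyone else on several features at once. Everything in it is an assumption for illustration: the function name, the thresholds, and the toy records, whose dates are rounded and whose feature assignments are simplified or pure placeholders.

```python
def flag_early_carriers(samples, min_gap=500, min_features=2):
    """Flag samples that carry several features long before anyone else does.

    samples: dict mapping name -> (year, set_of_features); BC years negative.
    Each feature is compared with the earliest date among the OTHER carriers
    (leave-one-out). Returns {name: {feature: gap_in_years}} for samples with
    at least `min_features` features that are each at least `min_gap` years
    ahead of every other carrier.
    """
    flagged = {}
    for name, (year, features) in samples.items():
        early = {}
        for feature in sorted(features):
            other_years = [
                y for other, (y, feats) in samples.items()
                if other != name and feature in feats
            ]
            if not other_years:
                continue  # sole carrier: nothing to compare against
            gap = min(other_years) - year
            if gap >= min_gap:
                early[feature] = gap
        if len(early) >= min_features:
            flagged[name] = early
    return flagged


# Toy records (illustration only): rounded dates, simplified features.
TOY_SAMPLES = {
    "I6561 (as reported)": (-4000, {"R1a-Z93", "lactase persistence", "Steppe_MLBA-like"}),
    "Poltavka outlier": (-2700, {"R1a-Z93", "Steppe_MLBA-like"}),
    "Bell Beaker (placeholder)": (-2400, {"lactase persistence"}),
    "Srubnaya (placeholder)": (-1700, {"R1a-Z93", "lactase persistence", "Steppe_MLBA-like"}),
    "Butkara_IA I12450": (-800, {"R1a-Z93"}),
}

flagged = flag_early_carriers(TOY_SAMPLES)
for name, gaps in flagged.items():
    print(name, gaps)
```

A genuine first carrier would be flagged in exactly the same way, so a sieve like this can only rank candidates for re-dating; it cannot tell an interesting outlier from a wrongly dated one.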

    On the bright side, there is so much more than formal stats to arrive at relevant inferences about prehistoric populations, their movements and languages. That’s why I6561 didn’t matter for the conclusion by Anthony (2019) that the R1b-rich Eneolithic Don-Volga-Caucasus region was the most likely Indo-Anatolian and Late Proto-Indo-European homeland, due to the creation of a wide Eneolithic mating network with extended exogamy practices, where Y-chromosome bottlenecks seem to be among the main genomic data to take into account from the Neolithic to the Middle Bronze Age.

    And that is the same reason why it doesn’t matter that much for the Proto-Indo-European or Uralic question for me, either.


Mitogenomes suggest rapid expansion of domesticated horse before 3500 BC

Open access Origin and spread of Thoroughbred racehorses inferred from complete mitochondrial genome sequences: Phylogenomic and Bayesian coalescent perspectives, by Yoon et al. PLOS One (2018).

Abstract (emphasis mine)

The Thoroughbred horse breed was developed primarily for racing, and has a significant contribution to the qualitative improvement of many other horse breeds. Despite the importance of Thoroughbred racehorses in historical, cultural, and economical viewpoints, there was no temporal and spatial dynamics of them using the mitogenome sequences. To explore this topic, the complete mitochondrial genome sequences of 14 Thoroughbreds and two Przewalski’s horses were determined. These sequences were analyzed together along with 151 previously published horse mitochondrial genomes from a range of breeds across the globe using a Bayesian coalescent approach as well as Bayesian inference and maximum likelihood methods. The racing horses were revealed to have multiple maternal origins and to be closely related to horses from one Asian, two Middle Eastern, and five European breeds. Thoroughbred horse breed was not directly related to the Przewalski’s horse which has been regarded as the closest taxon to the all domestic horses and the only true wild horse species left in the world. Our phylogenomic analyses also supported that there was no apparent correlation between geographic origin or breed and the evolution of global horses. The most recent common ancestor of the Thoroughbreds lived approximately 8,100–111,500 years ago, which was significantly younger than the most recent common ancestor of modern horses (0.7286 My). Bayesian skyline plot revealed that the population expansion of modern horses, including Thoroughbreds, occurred approximately 5,500–11,000 years ago, which coincide with the start of domestication. This is the first phylogenomic study on the Thoroughbred racehorse in association with its spatio-temporal dynamics. 
The database and genetic history information of Thoroughbred mitogenomes obtained from the present study provide useful information for future horse improvement projects, as well as for the study of horse genomics, conservation, and in association with its geographical distribution.

Bayesian skyline plot (BSP) based on mitochondrial genome sequences from 167 modern horses.
The dark line in the BSP represents the estimated effective population size through time. The green area represents the 95% highest posterior density confidence intervals for this estimate.

Interesting excerpts:

We carried out a Bayesian coalescent approach using extended mitochondrial genome sequences from 167 horses in order to further assess the timescale of horse domestication. Here, we first calculated the time of the most recent common ancestor of Thoroughbred horses. Our analysis revealed the age of the most recent common ancestor of the racing horse to be around 8,100–111,500 years old. This estimate is much younger than that of the most recent common ancestor of the global horses, which has been estimated at 0.7286 Mys old.

Bayesian maximum clade credibility phylogenomic tree on the ground of the mitochondrial genome sequences of 167 modern horses.
The data set (16,432 base pairs) was also analyzed phylogenetically using Bayesian inference (BI) and maximum likelihood (ML) methods which showed the same topologies. 95% Highest Posterior Density of node heights are shown by blue bars. Groups are marked by a “G”. Numbers at the nodes represent (left to right): posterior probabilities (≥0.80) for the BI tree and bootstrap values (≥70%) for the ML tree. The racing horses were revealed to have multiple maternal origins and to be closely related to horses from one Asian, two Middle Eastern, and five European breeds. Results of phylogenomic analyses also uncovered no apparent association between geographic origin or breed and heterogeneity of global horses. The most recent common ancestor of the Thoroughbreds lived approximately 8,100–111,500 years ago, which was significantly younger than the most recent common ancestor of modern horses (0.7286 My).

On the domestication time of modern horses, there have been several publications derived from both archaeological [49–51] and molecular [11–12, 23, 48] evidences. D’Andrade [49] reported that the origin of domestic horses was around 4,000 years ago. Ludwig et al. [50] stated the domestication time to be about 5,000 years ago, while Anthony [51] noted that horse rearing by humans may have occurred approximately 6,000 years ago. Subsequently, on the basis of mitochondrial genome sequences, Lippold et al. [11] and Achilli et al. [12] postulated domestication time to be about 6,000–8,000 and 6,000–7,000 years ago, respectively. Warmuth [48] dated domestication time to 5,500 years ago based on autosomal genotype data, while Orlando et al. [23] claimed that Przewalski’s and domestic horse populations diverged 38,000–72,000 years ago based on analysis of genome sequences. In contrast to the previous hypothesized date of horse domestication, the results of our Bayesian skyline plot (BSP) analysis depict a rapid expansion of the horse population approximately 5,500–11,000 years ago, which coincides with the start of domestication.

It seems that we will not have an update on horse aDNA from the ISBA 8, so we will have to make do with this for the moment.


Cystic fibrosis probably spread with expanding Bell Beakers


New paper (behind paywall) Estimating the age of p.(Phe508del) with family studies of geographically distinct European populations and the early spread of cystic fibrosis, by Farrell et al., European Journal of Human Genetics (2018).

Interesting excerpts (emphasis mine):

Our results revealed tMRCA average values ranging from 4725 to 1175 years ago and support the estimates of Serre et al. (3000–6000 years ago) [11], rather than Morral et al. (52,000 years ago) [6], but the latter figure was challenged by Kaplan et al. [26] because of disagreement with assumptions used in their calculations. In addition, the tMRCA values from western European regions reported herein refine the results of Fichou et al. [7] from a study of Breton CF patients in which the Estiage analysis suggested that the most common recent ancestor lived 115 generations ago. That tMRCA value, however, may have underestimated the age of p.(Phe508del) in Brittany due to consideration of all the haplotypes, even those that were reconstructed with ambiguities, as well as a potential bias associated with consanguinity due to including both haplotypes in homozygous families. In the more stringent Estiage analyses reported herein, those potential biases were avoided for all populations, leading to estimates of the oldest tMCRA values corresponding to the Early Bronze Age in western Europe, which is generally agreed to begin around 3000 BCE. This finding extends our results from a direct investigation of aDNA in teeth from Iron Age burials near Vienna around 350 BCE and allow us to conclude that p.(Phe508del) was present in that region long before then. More specifically, in the Austrian families studied, the Estiage data revealed a mean tMCRA value of 3575 years ago, which converts to 1558 BCE (Middle Bronze Age) [22].
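A note on the arithmetic in the excerpt above: converting "years ago" into a calendar year needs a reference year, and the BCE/CE scheme has no year 0. The small sketch below is only an illustration; the function name is hypothetical, and the reference year 2018 is an assumption (the publication year), chosen because it reproduces the 1558 BCE figure quoted for the Austrian families.

```python
def years_ago_to_calendar(years_ago, reference_year=2018):
    """Convert 'years before reference_year' into a BCE/CE label.

    There is no year 0 in the BCE/CE scheme, so astronomical year 0
    corresponds to 1 BCE, -1 to 2 BCE, and so on.
    """
    astronomical_year = reference_year - years_ago
    if astronomical_year > 0:
        return f"{astronomical_year} CE"
    return f"{1 - astronomical_year} BCE"


# Mean tMRCA values quoted in the excerpt (years ago).
for label, years_ago in [
    ("oldest mean tMRCA", 4725),
    ("Austrian families", 3575),
    ("youngest mean tMRCA", 1175),
]:
    print(f"{label}: {years_ago} years ago = {years_ago_to_calendar(years_ago)}")
```

Under the same assumption, the oldest mean value lands close to the "as early as 2700 BCE" mentioned further down in the paper.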

Perhaps most remarkably, the estimated ages of p.(Phe508del) in the three western European regions (France, Ireland, and Denmark) were similar with closely overlapping 95% CI values. This observation is also in line with previously documented spatial autocorrelograms expressing genetic and geographical distance for these populations [24]. Such data provide more insight about the ancient origin of CF in our judgment—both when and where—and lead us to propose that CFTR p.(Phe508del) is derived from ancestors who lived in western Europe during the Bronze Age, as early as 2700 BCE, and that its relatively rapid dissemination occurred because of human migrations around the northwestern Atlantic trading routes [21] and then towards central and eastern Europe [22]. Diffusion from northwestern to central Europe in approximately 1000 years is consistent with the prominent Bronze Age migrations evident in the archeological record [21, 22] and from genomic studies of aDNA [27]. On the other hand, we are assuming a discrete origin of the principal CF-causing variant, but it is possible that p.(Phe508del) arose more than once or earlier, and then reached western Europe subsequently through Neolithic migrations.


[About Bell Beakers] (…) More specifically, their distinctive Bell Beaker pottery appeared and spread across western and central Europe beginning around 3000–2750 BCE and then disappeared between 2200 and 1800 BCE [22, 29]. Their migrations are linked to the advent of western and central European metallurgy, as they manufactured and traded metal goods, especially weapons, while traveling over long distances [30]. Most relevant to our study is the evidence that they migrated in a direction and over a time period that fits well with the pattern of tMRCA data we found for the p.(Phe508del) variant. Olalde et al. [29] have shown that both migration and cultural transmission played a major role in diffusion of the “Beaker Complex” and led to a “profound demographic transformation” of Britain after 2400 BCE. Moreover, the cultural elements that unite the widely distributed Beaker folk are so obvious that some have considered them a distinct ethnicity of Bronze Age people [33].

From our results, we propose the novel concept that large scale, long term west-to-east migrations of the Bell Beaker Europeans [22, 28–30] during the Bronze Age, could explain the dissemination of p.(Phe508del) in Europe and its documented northwest-to-southeast gradient [4]. In fact, our tMRCA data show a temporal gradient also.

As you can see from the references, they consulted with Barry Cunliffe (or people accepting his theory), who is obsessed with Bell Beakers expanding Celtic languages from the British Isles. He is like the British equivalent of the Danish scholar Kristian Kristiansen, with his obsession with Corded Ware = Indo-European (and Germanic = CWC Denmark), immutable no matter what genetic results might show.

The funny thing is, the interpretation of the paper is probably right. From what we can see in the data, it is quite possible that the disease spread with expanding Bell Beakers... only it spread from the East group in Hungary, i.e. from east to west. The regional difference in TMRCA and the apparent west-east cline would point to the different expansions of affected lineages in the corresponding regions, and not to an origin in the British Isles.