Expansion of domesticated goat echoes expansion of early farmers


New paper (behind paywall) Ancient goat genomes reveal mosaic domestication in the Fertile Crescent, by Daly et al. Science (2018) 361(6397):85-88.

Interesting excerpts (emphasis mine):

Thus, our data favor a process of Near Eastern animal domestication that is dispersed in space and time, rather than radiating from a central core (3, 11). This resonates with archaeozoological evidence for disparate early management strategies from early Anatolian, Iranian, and Levantine Neolithic sites (12, 13). Interestingly, our finding of divergent goat genomes within the Neolithic echoes genetic investigation of early farmers. Northwestern Anatolian and Iranian human Neolithic genomes are also divergent (14–16), which suggests the sharing of techniques rather than large-scale migrations of populations across Southwest Asia in the period of early domestication. Several crop plants also show evidence of parallel domestication processes in the region (17).

PCA affinity (Fig. 2), supported by qpGraph and outgroup f3 analyses, suggests that modern European goats derive from a source close to the western Neolithic; Far Eastern goats derive from early eastern Neolithic domesticates; and African goats have a contribution from the Levant, but in this case with considerable admixture from the other sources (figs. S11, S16, and S17 and tables S26 and 27). The latter may be in part a result of admixture that is discernible in the same analyses extended to ancient genomes within the Fertile Crescent after the Neolithic (figs. S18 and S19 and tables S20, S27, and S31) when the spread of metallurgy and other developments likely resulted in an expansion of inter-regional trade networks and livestock movement.
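
For readers unfamiliar with the outgroup f3 statistic mentioned in the excerpt, a minimal sketch follows. It assumes per-site allele frequencies are already available as arrays, and it omits the finite-sample correction and the block-jackknife standard errors that dedicated tools apply. The function name and the toy frequencies are hypothetical and are not taken from Daly et al.

```python
import numpy as np

def outgroup_f3(p_out, p_a, p_b):
    """Naive f3(Outgroup; A, B): mean of (o - a) * (o - b) over sites."""
    p_out, p_a, p_b = (np.asarray(p, dtype=float) for p in (p_out, p_a, p_b))
    return float(np.mean((p_out - p_a) * (p_out - p_b)))

# Hypothetical allele frequencies at four sites (illustration only).
outgroup = [0.0, 0.0, 1.0, 0.0]
pop_a = [0.8, 0.1, 0.3, 0.5]
pop_b = [0.7, 0.2, 0.4, 0.5]
print(outgroup_f3(outgroup, pop_a, pop_b))
```

Larger values indicate more genetic drift shared by A and B since their split from the outgroup, which is why the statistic is used to rank candidate source populations.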

Maximum-likelihood phylogeny and geographical distributions of ancient mtDNA haplogroups. (A) A phylogeny placing ancient whole mtDNA sequences in the context of known haplogroups. Symbols denoting individuals are colored by clade membership; shape indicates archaeological period (see key). Unlabeled nodes are modern bezoar and outgroup sequence (Nubian ibex) added for reference. We define haplogroup T as the sister branch to the West Caucasian tur (9). (B and C) Geographical distributions of haplogroups show early highly structured diversity in the Neolithic period (B) followed by collapse of structure in succeeding periods (C). We delineate the tiled maps at 7250 to 6950 BP, a period bracketing both our earliest Chalcolithic sequence (24, Mianroud) and latest Neolithic (6, Aşağı Pınar). Numbered archaeological sites also include Direkli Cave (8), Abu Ghosh (9), ‘Ain Ghazal (10), and Hovk-1 Cave (11) (table S1) (9).

Our results imply a domestication process carried out by humans in dispersed, divergent, but communicating communities across the Fertile Crescent who selected animals in early millennia, including for pigmentation, the most visible of domestic traits.


Inca and Spanish Empires had a profound impact on Peruvian demography


Open access Evolutionary genomic dynamics of Peruvians before, during, and after the Inca Empire by Harris et al., PNAS (2018) 201720798 (published ahead of print).

Abstract (emphasis mine):

Native Americans from the Amazon, Andes, and coastal geographic regions of South America have a rich cultural heritage but are genetically understudied, therefore leading to gaps in our knowledge of their genomic architecture and demographic history. In this study, we sequence 150 genomes to high coverage combined with an additional 130 genotype array samples from Native American and mestizo populations in Peru. The majority of our samples possess greater than 90% Native American ancestry, which makes this the most extensive Native American sequencing project to date. Demographic modeling reveals that the peopling of Peru began ∼12,000 y ago, consistent with the hypothesis of the rapid peopling of the Americas and Peruvian archeological data. We find that the Native American populations possess distinct ancestral divisions, whereas the mestizo groups were admixtures of multiple Native American communities that occurred before and during the Inca Empire and Spanish rule. In addition, the mestizo communities also show Spanish introgression largely following Peruvian Independence, nearly 300 y after Spain conquered Peru. Further, we estimate migration events between Peruvian populations from all three geographic regions with the majority of between-region migration moving from the high Andes to the low-altitude Amazon and coast. As such, we present a detailed model of the evolutionary dynamics which impacted the genomes of modern-day Peruvians and a Native American ancestry dataset that will serve as a beneficial resource to addressing the underrepresentation of Native American ancestry in sequencing studies.

Admixture among Peruvian populations. (A) Colors represent contributions from donor populations into the genomes of Peruvian mestizo groups, as estimated by CHROMOPAINTER and GLOBETROTTER. The label within parentheses for each Peruvian Native American source population corresponds to their geographic region where Ama, And, and Coa represent Amazon, Andes, and coast, respectively. (B) Admixture time and proportion for the best fit three-way ancestry (AP, Trujillo and Lima) and two-way ancestry (Iquitos, Cusco, and Puno) TRACT models [European, African, and Native American (NatAm) ancestries] for six mestizo populations. (C) Network of individuals from Peruvian Native American and mestizo groups according to their shared IBD length. Each node is an individual and the length of an edge equals to (1/total shared IBD). IBD segments with different lengths are summed according to different thresholds representing different times in the past (52), with 7.8 cM, 9.3 cM, and 21.8 cM roughly representing the start of the Inca Empire, the Spanish conquest and occupation, and Peruvian independence. IBD networks are generated by Cytoscape (98) and only the major clusters in the network are shown for different cutoffs of segment length. AP, Central Am, and Matsig are short for Afroperuvians, Central American, and Matsiguenka, respectively. The header of each IBD network specifies the length of IBD segments used in each network.
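
The edge definition in panel C (edge length equal to 1/total shared IBD) can be sketched as follows. This is only an illustration under stated assumptions: the input format, the function name, and the reading of "thresholds" as a minimum segment length in cM are hypothetical, not the authors' pipeline.

```python
from collections import defaultdict

def ibd_edge_lengths(segments, min_cm):
    """Map each pair of individuals to 1 / (total shared IBD in cM).

    segments: iterable of (individual_1, individual_2, length_cm) tuples.
    min_cm:   only segments at least this long are summed (assumed meaning
              of the per-period cutoffs such as 7.8, 9.3 and 21.8 cM).
    """
    totals = defaultdict(float)
    for ind1, ind2, length_cm in segments:
        # Skip short segments and guard against non-positive lengths.
        if length_cm >= min_cm and length_cm > 0:
            pair = tuple(sorted((ind1, ind2)))
            totals[pair] += length_cm
    # Pairs sharing more IBD get shorter edges, so they sit closer together.
    return {pair: 1.0 / total for pair, total in totals.items()}

# Hypothetical segments (illustration only).
toy_segments = [("a", "b", 8.0), ("b", "a", 10.0), ("b", "c", 5.0)]
print(ibd_edge_lengths(toy_segments, min_cm=7.8))
```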

Interesting excerpts:

The high frequency of Native American mitochondrial haplotypes suggests that European males were the primary source of European admixture with Native Americans, as previously found (23, 24, 41, 42). The only Peruvian populations that have a proportion of the Central American component are in the Amazon (Fig. 2A). This is supported by Homburger et al. (4), who also found Central American admixture in other Amazonian populations and could represent ancient shared ancestry or a recent migration between Central America and the Amazon.

Following the peopling of Peru, we find a complex history of admixture between Native American populations from multiple geographic regions (Figs. 2B and 3 A and C). This likely began before the Inca Empire due to Native American and mestizo groups sharing IBD segments that correspond to the time before the Inca Empire. However, the Inca Empire likely influenced this pattern due to their policy of forced migrations, known as “mitma” (mitmay in Quechua) (28, 31, 37), which moved large numbers of individuals to incorporate them into the Inca Empire. We can clearly see the influence of the Inca through IBD sharing where the center of dominance in Peru is in the Andes during the Inca Empire (Fig. 3C).

ASPCA of combined Peruvian Genome Project with the HGDP genotyped on the Human Origins Array. A.) European ancestry. B.) African ancestry. Samples are filtered by their corresponding ancestral proportion: European ≥ 30% (panel A) and African ≥ 10% (panel B). The two plots in each panel are identical except for the color scheme: reference populations are colored on the left and Peruvian populations are colored on the right. Each point is one haplotype. In the African ASPCA we note three outliers among our samples, two from Trujillo and one from Iquitos, that cluster closer to the Luhya and Luo populations, though not directly. It is likely that these individuals share ancestry with other regions of Africa in addition to western Africa, but we cannot test this hypothesis explicitly as we have too few samples.

A similar policy of large-scale consolidation of multiple Native American populations was continued during Spanish rule through their program of reducciones, or reductions (31, 32), which is consistent with the hypothesis that the Inca and Spanish had a profound impact on Peruvian demography (25). The result of these movements of people created early New World cosmopolitan communities with genetic diversity from the Andes, Amazon, and coast regions as is evidenced by mestizo populations’ ancestry proportions (Fig. 3A). Following Peruvian independence, these cosmopolitan populations were those same ones that predominantly admixed with the Spanish (Fig. 3B). Therefore, this supports our model that the Inca Empire and Spanish colonial rule created these diverse populations as a result of admixture between multiple Native American ancestries, which would then go on to become the modern mestizo populations by admixing with the Spanish after Peruvian independence.

Further, it is interesting that this admixture began before the urbanization of Peru (26) because others suspected the urbanization process would greatly impact the ancestry patterns in these urban centers (25). (…)


Domesticated horse population structure, selection, and mtDNA geographic patterns


Open access Detecting the Population Structure and Scanning for Signatures of Selection in Horses (Equus caballus) From Whole-Genome Sequencing Data, by Zhang et al., Evolutionary Bioinformatics (2018) 14:1–9.

Abstract (emphasis mine):

Animal domestication gives rise to gradual changes at the genomic level through selection in populations. Selective sweeps have been traced in the genomes of many animal species, including humans, cattle, and dogs. However, little is known regarding positional candidate genes and genomic regions that exhibit signatures of selection in domestic horses. In addition, an understanding of the genetic processes underlying horse domestication, especially the origin of Chinese native populations, is still lacking. In our study, we generated whole genome sequences from 4 Chinese native horses and combined them with 48 publicly available full genome sequences, from which 15 341 213 high-quality unique single-nucleotide polymorphism variants were identified. Kazakh and Lichuan horses are 2 typical Asian native breeds that were formed in Kazakh or Northwest China and South China, respectively. We detected 1390 loss-of-function (LoF) variants in protein-coding genes, and gene ontology (GO) enrichment analysis revealed that some LoF-affected genes were overrepresented in GO terms related to the immune response. Bayesian clustering, distance analysis, and principal component analysis demonstrated that the population structure of these breeds largely reflected weak geographic patterns. Kazakh and Lichuan horses were assigned to the same lineage with other Asian native breeds, in agreement with previous studies on the genetic origin of Chinese domestic horses. We applied the composite likelihood ratio method to scan for genomic regions showing signals of recent selection in the horse genome. A total of 1052 genomic windows of 10 kB, corresponding to 933 distinct core regions, significantly exceeded neutral simulations. The GO enrichment analysis revealed that the genes under selective sweeps were overrepresented with GO terms, including “negative regulation of canonical Wnt signaling pathway,” “muscle contraction,” and “axon guidance.” Frequent exercise training in domestic horses may have resulted in changes in the expression of genes related to metabolism, muscle structure, and the nervous system.

Bayesian clustering output for 5 K values from K = 2 to K = 8 in 45 domestic horses. Each individual is represented by a vertical line, which is partitioned into colored segments that represent the proportion of the inferred K clusters.

Interesting excerpts:

Admixture proportions were assessed without user-defined population information to infer the presence of distinct populations among the samples (Figure 2). At K = 3 or K = 4, Franches-Montagnes and Arabian forms one unique cluster; at K = 5, Jeju pony forms one unique cluster. For other breeds, comparatively strong population structure exists among breeds, and they can be assigned to 2 (or 3) alternate clusters from K = 3 to K = 5 including group A (Duelmener, Fjord, Icelandic, Kazakh, Lichuan, and Mongolian) and group B (Hanoverian, Morgan, Quarter, Sorraia, and Standardbred). For group A, geographically this was unexpected, where Nordic breeds (Norwegian Fjord, Icelandic, and Duelmener) clustered with Asian breeds including the Mongolian. Previous results of mitochondrial DNA have revealed links between the Mongolian horse and breeds in Iceland, Scandinavia, Central Europe, and the British Isles. The Mongol horses are believed to have been originally imported from Russia subsequently became the basis for the Norwegian Fjord horse.31 At K = 6, Sorraia forms one unique cluster. The Sorraia horse has no long history as a domestic breed but is considered to be of a nearly ancestral type in the southern part of the Iberian Peninsula.32 However, our result did not support Sorraia as an independent ancestral type based on result from K = 2 to K = 5, and the unique cluster in K = 6 may be explained by the small population size and recently inbreeding programs. Genetic admixture of Morgan reveals that these breeds are currently or traditionally continually crossed with other breeds from K = 2 to K = 8. The Morgan horse has been a largely closed breed for 200 years or more but there has been some unreported crossbreeding in recent times.33

Principal component analysis results of all 48 horses. The x-axis denotes the value of PC1, whereas the y-axis denotes the value of PC2. Each dot in the figure represents one individual.

Bayesian clustering and PCA demonstrated the relationships among the horse breeds with weak geographic patterns. The tight grouping within most native breeds and looser grouping of individuals in admixed breeds have been reported previously in modern horses using data from a 54K SNP chip.33,34 Cluster analysis reveals that Arabian or Franches-Montagnes forms one unique cluster with relatively low K value, which is consistent with former study using 50K SNP chip.33,34 Interestingly, Standardbred forms a unique cluster with relatively high K value in this study, different from previous study.33 To date, no footprints are available to describe how the earliest domestic horses spread into China in ancient times. Our study found that Kazakh and Lichuan were assigned to the same lineage as other native Asian breeds, in agreement with previous studies on the origin of Chinese domestic horses.4,5,35,36 The strong genetic relationship between Asian native breeds and European native breeds have made it more difficult to understand the population history of the horse across Eurasia. Low levels of population differentiation observed between breeds might be explained by historical admixture. Unlike the domestic pig in China,8 we suggest that in China, Northern/Southern distinct groups could not be used to genetically distinct native Chinese horse breeds. We consider that during domestication process of horse, gene flow continued among Chinese-domesticated horses.

Open access Some maternal lineages of domestic horses may have origins in East Asia revealed with further evidence of mitochondrial genomes and HVR-1 sequences, by Ma et al., PeerJ (2018).


There are large populations of indigenous horse (Equus caballus) in China and some other parts of East Asia. However, their matrilineal genetic diversity and origin remained poorly understood. Using a combination of mitochondrial DNA (mtDNA) and hypervariable region (HVR-1) sequences, we aim to investigate the origin of matrilineal inheritance in these domestic horses.

To investigate patterns of matrilineal inheritance in domestic horses, we conducted a phylogenetic study using 31 de novo mtDNA genomes together with 317 others from the GenBank. In terms of the updated phylogeny, a total of 5,180 horse mitochondrial HVR-1 sequences were analyzed.

Eighteen haplogroups (Aw-Rw) were uncovered from the analysis of the whole mitochondrial genomes. Most of which have a divergence time before the earliest domestication of wild horses (about 5,800 years ago) and during the Upper Paleolithic (35–10 KYA). The distribution of some haplogroups shows geographic patterns. The Lw haplogroup contained a significantly higher proportion of European horses than the horses from other regions, while haplogroups Jw, Rw, and some maternal lineages of Cw, have a higher frequency in the horses from East Asia. The 5,180 sequences of horse mitochondrial HVR-1 form nine major haplogroups (A-I). We revealed a corresponding relationship between the haplotypes of HVR-1 and those of whole mitochondrial DNA sequences. The data of the HVR-1 sequences also suggests that Jw, Rw, and some haplotypes of Cw may have originated in East Asia while Lw probably formed in Europe.

Our study supports the hypothesis of the multiple origins of the maternal lineage of domestic horses and some maternal lineages of domestic horses may have originated from East Asia.

Median joining network constructed based on the 247-bp HVR-1 sequences. Circles are proportional to the number of horses represented and a scale indicator (for node sizes) was provided. The length of lines represents the number of variants that separate nodes (some manual adjustment was made for visually good). In the circles, the colors of solid pie slices indicate studied horse populations: Orange, European horses; Blue, horses of West Asia; Light Green, horses from East Asia; Grey, ancient horses; Purple, Przewalskii horses.

Geographic distributions of horse mtDNA haplogroups

The analysis of geographic distribution of the mitochondrial genome haplogroups showed that horse populations in Europe or East Asia included all haplogroups defined from the mtDNA genome sequences. The lineage Fw comprised entirely of Przewalskii horses. The two haplogroups Iw and Lw displayed frequency peaks in Europe (14.08% and 37.32%, respectively) and a decline to the east (9.33% and 8.00% in the West Asia, and 6.45% and 12.90% in East Asia, respectively), especially for Lw, which contained the largest number of European horses (Table 2). However, an opposite distribution pattern was observed for haplogroups Aw, Hw, Jw, and Rw, which were harbored by more horses from East Asia than those from other regions. The proportions of horses from East Asia for the four haplogroups were 38%, 88%, 62%, and 54%, respectively.

Schematic phylogeny of mtDNAs genome from modern horses. This tree includes 348 sequences and was rooted at a donkey (E. asinus) mitochondrial genome (not displayed). The topology was inferred by a BEAST approach, whereas a time divergence scale (based on rate substitutions) is shown on the bottom (age estimates were indicated with thousand years (KY)). The percentages on each branch represent Bayesian posterior credibility and the alphabets on the right represent the names of haplogroups. Additional details concerning ages were given in Tables S3 and S6.


Contrastive principal component analysis (cPCA) to explore patterns specific to a dataset

Interesting open access paper Exploring patterns enriched in a dataset with contrastive principal component analysis, by Abid, Zhang, Bagaria & Zou, Nature Communications (2018) 9:2134.

Abstract (emphasis mine):

Visualization and exploration of high-dimensional data is a ubiquitous challenge across disciplines. Widely used techniques such as principal component analysis (PCA) aim to identify dominant trends in one dataset. However, in many settings we have datasets collected under different conditions, e.g., a treatment and a control experiment, and we are interested in visualizing and exploring patterns that are specific to one dataset. This paper proposes a method, contrastive principal component analysis (cPCA), which identifies low-dimensional structures that are enriched in a dataset relative to comparison data. In a wide variety of experiments, we demonstrate that cPCA with a background dataset enables us to visualize dataset-specific patterns missed by PCA and other standard methods. We further provide a geometric interpretation of cPCA and strong mathematical guarantees. An implementation of cPCA is publicly available, and can be used for exploratory data analysis in many applications where PCA is currently used.

Schematic Overview of cPCA. To perform cPCA, compute the covariance matrices C_X, C_Y of the target and background datasets. The singular vectors of the weighted difference of the covariance matrices, C_X − α · C_Y, are the directions returned by cPCA. As shown in the scatter plot on the right, PCA (on the target data) identifies the direction that has the highest variance in the target data, while cPCA identifies the direction that has a higher variance in the target data as compared to the background data. Projecting the target data onto the latter direction gives patterns unique to the target data and often reveals structure that is missed by PCA. Specifically, in this example, reducing the dimensionality of the target data by cPCA would reveal two distinct clusters.
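
The recipe in the caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released implementation: the function name, the toy data, and the choice of alpha = 2 are assumptions made here, and the sketch takes the eigenvectors with the largest eigenvalues of the symmetric matrix C_X − α · C_Y as the contrastive directions.

```python
import numpy as np

def cpca_directions(target, background, alpha=1.0, n_components=2):
    """Return the top contrastive directions as columns of a (d, k) array."""
    c_x = np.cov(target, rowvar=False)      # covariance of the target data
    c_y = np.cov(background, rowvar=False)  # covariance of the background data
    # The weighted difference is symmetric, so eigh is appropriate.
    eigvals, eigvecs = np.linalg.eigh(c_x - alpha * c_y)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    return eigvecs[:, order]

# Toy data (hypothetical): axis 0 carries large variance shared by both
# datasets; axis 1 carries two clusters that exist only in the target.
rng = np.random.default_rng(0)
n = 400
offsets = np.repeat([-3.0, 3.0], n // 2)
target = np.column_stack([rng.normal(0.0, 10.0, n),
                          rng.normal(0.0, 1.0, n) + offsets])
background = np.column_stack([rng.normal(0.0, 10.0, n),
                              rng.normal(0.0, 1.0, n)])

pca_dir = cpca_directions(target, background, alpha=0.0, n_components=1)
cpca_dir = cpca_directions(target, background, alpha=2.0, n_components=1)
print("PCA direction (alpha = 0):", pca_dir.ravel())
print("cPCA direction (alpha = 2):", cpca_dir.ravel())

# Project the centered target data onto the contrastive direction.
projected = (target - target.mean(axis=0)) @ cpca_dir
```

With alpha = 0 the matrix reduces to C_X, so the sketch falls back to ordinary PCA and picks the high-variance shared axis; a larger alpha penalises variance that is also present in the background, which is what lets the cluster axis surface.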

The Mexican example caught my attention:

Relationship between ancestral groups in Mexico

In previous examples, we have seen that cPCA allows the user to discover subclasses within a target dataset that are not labeled a priori. However, even when subclasses are known ahead of time, dimensionality reduction can be a useful way to visualize the relationship within groups. For example, PCA is often used to visualize the relationship between ethnic populations based on genetic variants, because projecting the genetic variants onto two dimensions often produces maps that offer striking visualizations of geographic and historic trends26,27. But again, PCA is limited to identifying the most dominant structure; when this represents universal or uninteresting variation, cPCA can be more effective at visualizing trends.

The dataset that we use for this example consists of single nucleotide polymorphisms (SNPs) from the genomes of individuals from five states in Mexico, collected in a previous study28. Mexican ancestry is challenging to analyze using PCA since the PCs usually do not reflect geographic origin within Mexico; instead, they reflect the proportion of European/Native American heritage of each Mexican individual, which dominates and obscures differences due to geographic origin within Mexico (see Fig. 4a). To overcome this problem, population geneticists manually prune SNPs, removing those known to derive from Europeans ancestry, before applying PCA. However, this procedure is of limited applicability since it requires knowing the origin of the SNPs and that the source of background variation to be very different from the variation of interest, which are often not the case.

Relationship between Mexican ancestry groups. a PCA applied to genetic data from individuals from 5 Mexican states does not reveal any visually discernible patterns in the embedded data. b cPCA applied to the same dataset reveals patterns in the data: individuals from the same state are clustered closer together in the cPCA embedding. c Furthermore, the distribution of the points reveals relationships between the groups that matches the geographic location of the different states: for example, individuals from geographically adjacent states are adjacent in the embedding. c Adapted from a map of Mexico that is originally the work of User:Allstrak at Wikipedia, published under a CC-BY-SA license, sourced from https://commons.wikimedia.org/wiki/File:Mexico_Map.svg

As an alternative, we use cPCA with a background dataset that consists of individuals from Mexico and from Europe. This background is dominated by Native American/European variation, allowing us to isolate the intra-Mexican variation in the target dataset. The results of applying cPCA are shown in Fig. 4b. We find that individuals from the same state in Mexico are embedded closer together. Furthermore, the two groups that are the most divergent are the Sonorans and the Mayans from Yucatan, which are also the most geographically distant within Mexico, while Mexicans from the other three states are close to each other, both geographically as well as in the embedding captured by cPCA (see Fig. 4c). See also Supplementary Fig. 6 for more details.

So, by using a background dataset, it discovers patterns in a single target dataset via dimensionality reduction, that standard dimensionality reduction techniques do not discover. Maybe useful for some prehistoric populations, too…

They have released a Python implementation of cPCA on GitHub: https://github.com/abidlabs/contrastive, including Python notebooks and datasets.


The concept of “Outlier” in Human Ancestry (III): Late Neolithic samples from the Baltic region and origins of the Corded Ware culture


I have written before about how the Late Neolithic sample from Zvejnieki seemed to be an outlier among Corded Ware samples (read also the Admixture analysis section on the IEDDM), due to its position in PCA, even more than its admixture components or statistical comparison might show.

In the recent update to Northern European samples in Mittnik et al. (2018), an evaluation of events similar to the previous preprint (2017) is given:

Computing D-statistics for each individual of the form D(Baltic LN, Yamnaya; X, Mbuti), we find that the two individuals from the early phase of the LN (Plinkaigalis242 and Gyvakarai1, dating to ca. 3200–2600 calBCE) form a clade with Yamnaya (Supplementary Table 7), consistent with the absence of the farmer-associated component in ADMIXTURE (Fig. 2b). Younger individuals share more alleles with Anatolian and European farmers (Supplementary Table 7) as also observed in contemporaneous Central European CWC individuals2.
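
As a reminder of what a statistic of the form D(A, B; X, Y) measures, here is a minimal sketch computed from per-site allele frequencies. It is an illustration under stated assumptions, not the pipeline used by Mittnik et al.: the function name and toy frequencies are hypothetical, the sign convention follows the common frequency-based definition, and the block-jackknife standard errors needed to judge significance are left out.

```python
import numpy as np

def d_statistic(p_a, p_b, p_x, p_y):
    """Frequency-based D(A, B; X, Y) summed over sites.

    Values near zero are consistent with A and B forming a clade relative
    to X and Y; positive values mean A shares more alleles with X than B does.
    """
    p_a, p_b, p_x, p_y = (np.asarray(p, dtype=float)
                          for p in (p_a, p_b, p_x, p_y))
    numerator = (p_a - p_b) * (p_x - p_y)
    denominator = (p_a + p_b - 2 * p_a * p_b) * (p_x + p_y - 2 * p_x * p_y)
    return float(numerator.sum() / denominator.sum())

# Hypothetical frequencies at three sites (illustration only).
baltic_ln = [0.6, 0.2, 0.9]
yamnaya = [0.5, 0.2, 0.8]
test_pop = [0.7, 0.1, 1.0]
mbuti = [0.1, 0.3, 0.2]
print(d_statistic(baltic_ln, yamnaya, test_pop, mbuti))
```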

Sampling locations and dating of 38 ancient Northern European samples introduced in this study. Chronology based on calibrated radiocarbon dates or relative dating

My interpretation of the Zvejnieki sample ca. 2880 BC (and thus also of the only Baltic LN sample forming a close cluster with it) as ‘outlier’ seems thus reinforced as more samples come in. My explanation based on exogamy is one possibility for the region. After all, great mobility and exogamy practices are universally accepted for the Corded Ware territory, and Yamna migrants had settled up along the Prut precisely around this period (ca. 3100-2900 BC), so this kind of relation between Yamna and Baltic samples is to be expected.

NOTE: Information on the Late Neolithic burial of Zvejnieki is scarce, since it is an isolated find in radiocarbon analysis, among Mesolithic burials. You can read more about it from Ilga Zagorska’s studies, such as The use of ochre in Stone Age burials of the East Baltic (2008), The persistent presence of the dead: recent excavations at the hunter-gatherer cemetery at Zvejnieki (Latvia) (Antiquity 2013), or Dietary freshwater reservoir effects and the radiocarbon ages of prehistoric human bones from Zvejnieki, Latvia (J. Archaeol. Sci. 2016).

Samples of Baltic “Late Neolithic / Corded Ware culture”

The only two samples clustering more closely to Yamna also cluster closely to the three previous samples from Khvalynsk in Samara (labelled ‘Steppe Eneolithic’ in the paper), which makes one wonder how strongly connected cultures from the forest and forest-steppe zones were before the expansion of Corded Ware and Yamna settlers.

NOTE: Apart from the scarcity of samples available, which is common in genetic studies, the description of both additional ‘outlier’ samples of the Baltic Late Neolithic – isolated finds based mainly on radiocarbon analysis – leaves a lot to the imagination, because of the lack of cultural context and potential problems with dating methods:

Plinkaigalis 242, >40 year old female (OxA-5936, 4280 ± 75 BP, 3260–2630 calBCE). The burial site is located in the plains of central Lithuania on the eastern bank of the river Šušvė on the outskirts of the Plinkaigalis village, approximately 400 m southeast of an Iron age hill fort and settlement. The burial site was discovered in 1975 when local residents started digging for gravel in the western part of the hill. The same year site was granted a legal protection with archaeological excavations carried out for eight straight years in a row (1977-1984). During the eight years of fieldwork a total of 373 graves (364 inhumation and 9 cremation graves) with all but two of them dating to 3rd to 8th c. AD were uncovered. The two exceptional graves (no. 241, 242) were uncovered in the northern part of the burial site and C14 dated to the Late Neolithic.

Gyvakarai 1, 35-40 year old male (Poz-61584, 4030 ± 30 BP, 2620–2470 calBCE). The burial site is located in the northern part of Lithuania on the steep gravelly bank (elevation up to 79 m a. s. l.) of the rivulet Žvikė, 500 m to the south from where, in the wet grassland valley, it meets the main stem river Pyvesa. The site was discovered in 2000 when local residents started digging for gravel in the central part of the gravelly bank. The same year rescue excavations were conducted in the surrounding area of the highly disturbed grave resulting in discovery of a single grave C14 dated to the Late Neolithic.

EDIT (16 FEB 2018): A commenter noted that Gyvakarai1 was also studied for Yersinia pestis, the plague pathogen, which appears to have expanded first to the west from the steppe, and then to the east, so it is possible that its position in PCA relative to Plinkaigalis242 shows a connection to late Yamna settlers or East Bell Beaker migrants.

File modified by me from Mittnik et al. (2018) to include the approximate position of the most common ancestral components, and an identification of potential outliers. Zoomed-in version of the European Late Neolithic and Bronze Age samples. “Principal components analysis of 1012 present-day West Eurasians (grey points, modern Baltic populations in dark grey) with 294 projected published ancient and 38 ancient North European samples introduced in this study (marked with a red outline).”

NOTE: I haven’t had the time and patience to work with my virtual computer on the PCA of these new samples (my CPU reaches its limit every day and my fans run half the time), so I don’t know exactly which of them is Plinkaigalis242 and which is Gyvakarai1. I just made a wild guess (based on ADMIXTURE) that the earlier Plinkaigalis242 forms a common ‘outlier’ group with Zvejnieki; if they are reversed or otherwise wrong in the image, please correct me. It will be much appreciated.

We can see from the additional samples in Mittnik et al. (2018) that the common cluster formed by most Baltic LN samples in PCA (most of them with clear cultural context among Late Neolithic or Corded Ware material, unlike the two ‘outliers’ and Gyvakarai1) is among Ukraine Eneolithic samples, European Corded Ware samples, and also Mesolithic-Neolithic samples from the Baltic. This is a logical find in light of the mainstream opinion that the expansion of the third horizon of the Corded Ware culture seems to have begun in the Dnieper-Dniester region (a corridor of steppe, steppe-forest, and forest zones) ca. 3300 BC.

PCA and ADMIXTURE analysis reflecting three time periods in Northern European prehistory. a Principal components analysis of 1012 present-day West Eurasians (grey points, modern Baltic populations in dark grey) with 294 projected published ancient and 38 ancient North European samples introduced in this study (marked with a red outline). Population labels of modern West Eurasians are given in Supplementary Fig. 7 and a zoomed-in version of the European Late Neolithic and Bronze Age samples is provided in Supplementary Fig. 8. b Ancestral components in ancient individuals estimated by ADMIXTURE (k = 11)

Corded Ware culture origins

If we take the most recent reliable radiocarbon analyses of material culture, and interpretations based on them of Corded Ware as a ‘complex’ similar to Bell Beaker (accepted more and more by disparate academics such as Anthony or Klejn), it seems that the controversial ‘massive’ Corded Ware migration must have begun somewhat later than previously thought. This leaves these early Baltic samples still less clearly part of the initial Corded Ware culture, and more as outliers waiting for a more precise cultural context among Late Neolithic changes in the region.

Their position in PCA among Khvalynsk (Samara), Baltic Mesolithic, Eastern Hunter-Gatherer samples, Yamna and Eneolithic Ukraine leaves us without enough information to understand their actual origin.

EDIT (3 FEB 2018): In the first edition of my IEDDM paper I based the potential expansion of the Corded Ware culture mainly on Piezonka’s detailed analyses of the evolution of Mesolithic and Neolithic cultures in the forest-steppe and Forest Zone, and on later phylogeographic finds, since there were no samples from these regions in this interesting period. I revised it in the second edition to accommodate the model to the Indo-Uralic proto-language supported by the Leiden school, and identified it with a close Neolithic-Chalcolithic steppe community based on common language guesstimates and – after the latest revision of Mathieson et al. (2017) – on the appearance of steppe admixture in the steppe.

However, if traditional Uralicists are right in supposing a loose Neolithic community in the Forest Zone, and Kristiansen is right in supposing long-lasting contacts in the Dniester-Dnieper region, we might actually be seeing with these ‘outliers’ the first proof that Neolithic samples from the forest-steppe and Forest Zone of the 4th millennium – unrelated to the Corded Ware culture – clustered closely to Khvalynsk, Sredni Stog, or Yamna samples, which is compatible with Piezonka’s accounts of intercultural contacts.

Martin Furholt’s assessment of the origin of the A-horizon of the Corded Ware culture would make the early dates of the Late Neolithic in the Baltic coincide with, or fall just before, the initial expansion of Corded Ware migrants. For example, here are some excerpts (emphasis mine) from Re-evaluating Corded Ware Variability in Late Neolithic Europe (2014), in Proceedings of the Prehistoric Society (you can read it free at Academia.edu):

Radiocarbon analysis

Acceptance of the results of radiometric dating meant that the concept of the so called ‘A-Horizon’ also had to be reformulated. If we are dealing with such a phase at all, it is not a classic typological period that is defined by a uniform material culture inventory, but rather a set of types which show a wide distribution, but which are always integrated into a locally specific and thus regionally variable context.

The situation resembles that of the Bell Beakers, where a few supra-regional types are associated with local forms of ‘Begleitkeramik’ (i.e. pottery that accompanies Bell Beakers: Strahm 1995; Besse 1996).

The distribution data indicate that this set of forms (namely the A-Beaker, ‘A-Amphora’, and A-Battle Axe, as well as Herringbone-decorated Beakers) was to be found over much of Europe around 2700 BC, and that the currency of these forms was not short: they seem to have been used continuously during the Final Neolithic, perhaps even until 2000 BC (Fig. 3; Furholt 2004). Analysis of the radiometric and dendrochronological determinations also indicates that the A-Horizon is not the earliest Corded Ware phase. Instead, it appears to follow an apparent earlier phase in Poland during which Corded Ware pottery was in use from as early as 2900 BC (Furholt 2003; 2008a; Włodarczak 2006; Ullrich 2008).

Chronological model following from radiocarbon dating. Mark the contrast to the traditional model of the A-horizon as the earliest phase and a successive increase in regional variability later on

Corded Ware and Yamna/Bell Beaker

While widening networks and a change in the mechanism of exchange appears to have contributed to the emergence of the Corded Ware archaeological phenomenon, and also the contemporaneous Yamnaya graves (Harrison & Heyd 2007) and the following Bell Beaker and Early Bronze Age phenomena, it remains to be seen exactly what factors contributed to the development of these systems. It may be that there were changes in subsistence practices, perhaps involving a rising importance of animal herding that subsequently required higher mobility (for a discussion see Dörfler & Müller 2008), but considering the obvious diversity in subsistence patterns present in different Corded Ware groups, such an explanation would seem appropriate for the transformation in some regions, but surely not for the eastern hunter-fisher-gatherer groups of the Baltic (Bläuer & Kantanen 2013). Also, trade with amber and copper might have played its role, but there are so far no indications for a significant rise in quantity or reach of these two materials in connection with Corded Ware graves or settlements (Furholt 2003, 125–7).

The impacts of animal traction and the wagon are also to be taken into account, as they are present since 3400 BC (Mischka 2011) but does at least not play any visible role in Corded Ware burial rituals, very much in contrast to the previous periods (Johannsen & Laursen 2010). There is no evidence for horse riding, but the domesticated horse seems to be present in central Europe since before 3000 BC (Becker 1999) and have also been found in Corded Ware settlements (Becker 2008), but again the evidence of domesticated horses is much more abundant in the period before 3000 BC.

So, concerning amber and copper exchange, or the impact of the wheel and animal traction, there is the recurrent motive of stronger evidence for the period before 3000 BC than during or in connection to Corded Ware finds after 2700 BC.

Summary table for the chronological positions (extent of name plus vertical lines) of the most important traditional archaeological ‘cultures’, ‘Groups’ or pottery styles discussed in this paper. Note that the definitions of those units are far from consistent or comparable, because they derive from different national and regional research traditions. Bold letters indicate a unit connected to the Corded Ware phenomenon


The evidence strongly points towards a long period of coalescence from 3000 to 2700 BC, when several innovations in burial customs, pottery, and tool types sprung forth from different places and subsequently spread via different networks of exchange and interaction. These surely showed a significant rise in scale, reach, and impact on local practices, but the same is true for the contemporary Globular Amphora and Yamnaya ‘Cultures’. This exchange resulted, roughly spoken, in a phenomenon like the A-Horizon.


Thus, it seems reasonable to explain the wide regional reach of those Corded Ware elements as the result of a general increase in mobility and thus an increase in the spatial extension of regional networks, triggered by the long-term effects of technological innovations and connected economic and social transformations in Europe since 3400 BC. It is the increase in mobility and regional networks that is new to the European Neolithic Societies after this time, and it is not only the Corded Ware elements, that are spread through these channels but also Yamnaya, Globular Amphorae, Bell Beaker ‘Cultures’, and copper and bronze artefacts in later periods. Those are archaeological classification units, heuristic tools for the ordering of finds, while brushing over variability and overlapping traits, and so they should not be confused with real social groups.

Network analysis based on the quantitative occurrence of Corded Ware pottery forms, pottery ornamentation styles, tools, weapons and ornaments as stated in Table 1, based on the catalogues given in Table 2, line thickness representing similarity

As a summary, we can say that there is still much work to be done on the origins and expansion of the Corded Ware culture. Speculative interpretations of recent genetic papers (especially since 2015), based solely on scarce genetic finds, do little for sound anthropological models when they connect Yamna directly to Corded Ware (and the latter to Bell Beaker), as the multiple new anthropological ‘steppe’ models (and their unending revisions due to the gradual corrections from ‘Yamnaya’ to ‘steppe’ admixture in genetic papers) are showing.

Featured image, from Furholt’s article: Map of the Corded Ware regions discussed for central Europe. The dark shading indicates those regions where Corded Ware burial rituals are present regularly.


Differences in ADMIXTURE between Khvalynsk/Yamna and Sredni Stog/Corded Ware


Looking for differences among steppe cultures in Genomics is like looking for a needle in a haystack.

It means, after all, looking for differences among closely related cultures, such as between South-Western and North-Western Anatolian Neolithic cultures, or among Old European cultures (such as Vinča or Cucuteni–Trypillia), or between Iberian cultures after the arrival of steppe-related populations.

These differences between closely related regions, in all these cases and especially among steppe cultures, even when they are supported by Archaeology and anthropological models of migration (and compatible with linguistic models), are expected to be minimal.

Fortunately, we have phylogeography, which helps us point in the right direction when assessing potential migrations using genomic data.

User Tomenable recently pointed out a curious finding on Anthrogenica, from data available in Mathieson et al. (2017): in ADMIXTURE results with K=12, a different ancestral component (in light green in the paper, see below) is traceable from the North Caspian steppe since the Neolithic. This is also partially distinguishable at K=10 and K=11, although there it does not differentiate so clearly among later cultures.

NOTE: Read more on the controversy regarding the ideal number of ancestral populations, the absurd use of ADMIXTURE to solve language questions, and the meaning of cross-validation (CV) values

Unsupervised ADMIXTURE plot from k=10 to 12, on a dataset consisting of 1099 present-day individuals and 476 ancient individuals. We show newly reported ancient individuals and some previously published individuals for comparison.

Explanations for this finding might include, as the user points out, a greater contribution of CHG ancestry in the eastern steppe cultures (Khvalynsk/Yamna) compared to the North Pontic steppe (Sredni Stog/Corded Ware), which is probably one of the main genomic differences between the two groups of cultures, as I pointed out in the Indo-European demic diffusion model (see accounts on the origins of Khvalynsk and Sredni Stog populations and on contacts between Yamna and the Caucasus, and see below also my sketch of Eurasian genomic history).

Also interesting is the appearance of similar ancestral components later in Vučedol, which probably received admixture from Yamna settlers (see admixture components in West Yamna samples and in the Yamna settler from Bulgaria), and later still in the Balkans.

On the other hand, previous ancestral components in outliers from the Balkans seem to be more similar to Sredni Stog samples, giving still more strength to the hypothesis that this common (“steppe”) component expanded westward within the Pontic-Caspian steppe with the spread of Suvorovo-Novodanilovka chiefs.

Problems with this interpretation include:

1) The scarce samples available, the different cultures included, and the CV values of the K populations selected in ADMIXTURE.

2) The lack of data for comparison with Bell Beaker peoples (from Olalde et al. 2017).

3) The sample classified as Latvia_LN/CWC has this component. I have already said before that, given the differences with all other Corded Ware samples, this quite early sample might be an outlier, with Khvalynsk/Yamna population connected directly to the ancestors of this individual, possibly through exogamy (as it is clear from my sketch below). Whether or not this is an outlier among CWC populations in the Baltic, only future samples can tell.

4) Three later individuals from Corded Ware in Germany have the component, in a minimal amount. I would bet – judging by their position in the graphic – that this might be explained through the Esperstedt family. These individuals might have in turn got the contribution directly from the oldest member, who shows what seems (in PCA) like a recent admixture from contemporary steppe cultures (such as the Catacomb culture).

NOTE: See my graphics with interesting members of the Esperstedt family marked: ADMIXTURE and PCA (outlier).
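Regarding the CV values mentioned in problem 1, here is a minimal sketch of how one might compare cross-validation errors across K. It assumes ADMIXTURE was run with its cross-validation option and that the logs contain lines of the form `CV error (K=10): 0.41210`; the numbers in `example_log` are invented for illustration and are not taken from Mathieson et al. (2017), and the function names are my own.

```python
import re

# Pattern for the cross-validation summary line assumed to be in each log.
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")

def cv_errors(log_text):
    """Map each K found in the log text to its reported CV error."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}

def best_k(errors):
    """Return the K with the lowest CV error."""
    return min(errors, key=errors.get)

# Hypothetical values, for illustration only.
example_log = """\
CV error (K=10): 0.41210
CV error (K=11): 0.41195
CV error (K=12): 0.41230
"""

errors = cv_errors(example_log)
print(errors, "lowest CV error at K =", best_k(errors))
```

When the differences between neighbouring K values are as small as in this invented example, the lowest CV error is a weak argument for preferring one K over another, which is part of why a component that only separates clearly at K=12 should be read with caution.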

Tentative sketch modelling the genetic history of Europe and West Eurasia from ancient populations up to the Neolithic, according to results in recent genetic papers and archaeological models of known migrations.

Again, needle in a haystack… And confirmation bias by me, indeed.

But interesting nonetheless.

EDIT (4 JAN 2017): A reader points out that the interpretation of unsupervised ADMIXTURE should work backwards (i.e. different contributions into different modern populations), and not be based solely on ancestral populations, which is probably right. So again, confirmation bias (and potentially a wrong direction fallacy) by me…


The concept of “Outlier” in Human Ancestry (II): Early Khvalynsk, Sredni Stog, West Yamna, Iron Age Bulgaria, Potapovka, Andronovo…


I already wrote about the concept of outlier in Human Ancestry, so I am not going to repeat myself. This is just an update of “outliers” in recent studies, and their potential origins (here I will repeat some of the examples):

Early Khvalynsk: the three samples from the Samara region have quite different positions in PCA, from nearest to EHG (of Y-DNA haplogroup R1a) to nearest to ANE ancestry (of Y-DNA haplogroup Q). This could represent the initial consequences of the second wave of ANE ancestry – as found later in Yamna samples from a neighbouring region -, possibly brought then by Eurasian migrants related to haplogroup Q.
With only 3 samples, this is obviously just a tentative explanation of the finds. The samples can only be reasonably said to show an unstable time for the region in terms of admixture (i.e. probably migration), judging by the data on PCA.

Ukraine Eneolithic samples offer a curious example of how the concept of outlier can change radically: from the third version (May 30th) of the preprint paper of Mathieson et al. (2017), when the Ukraine Eneolithic sample with steppe ancestry (and clustering with central European samples) was the ‘outlier’, to the fourth version (September 19th), when two samples with steppe ancestry clustering close to Corded Ware samples were now the ‘normal’ ones (i.e. those representing Ukraine Eneolithic population), and the outlier was the one clustering closely with Ukraine Mesolithic samples…

PCA and Admixture for south-eastern Europe. Image modified from Mathieson et al. (2017) – Third revision (May 30th), used in the 2nd edition of the Indo-European demic diffusion model.

This is one of the funny consequences of the wrong interpretation of the ‘yamnaya component’, that made geneticists believe at first that, out of two samples (!), the ‘outlier’ was the one with ‘yamnaya’ ancestry, because this component would have been brought by an eastern immigrant from early Khvalynsk…

This example offers yet another reason why precise anthropological context is necessary to offer the right interpretation of results. Within the Indo-European demic diffusion model – based mainly on Archaeology and Linguistics – , the sample with steppe ancestry was the most logical find in the region for a potential origin of the Corded Ware culture, and it was interpreted as such, well before the publication of the fourth version of Mathieson et al. (2017).

PCA of South-East European and other European samples. Image modified from Mathieson et al. (2017) – Fourth revision (September 19th), used in the 3rd edition of the Indo-European demic diffusion model.

West Yamna (to insist on the same question, the ‘yamnaya’ component): we have only four western Yamna samples, two of them showing Anatolian Neolithic ancestry (one of them, from Ukraine, with a strong ‘southern’ drift). On the other hand, Corded Ware migrants do not show this. So we could infer that their migrations were not coetaneous: whereas peoples of Corded Ware culture expanded ca. 3300 BC to the north – in the natural corridor to the Baltic that has been proposed for this culture in Archaeology for decades (and that is well represented by Ukraine Eneolithic samples) -, peoples of Yamna culture expanded to the west, replacing the Ukraine Eneolithic population (i.e. probably those of ‘Proto-Corded Ware culture’), and eventually mixing with Balkan populations of Anatolian Neolithic ancestry.

Potapovka, Andronovo, and Srubna: while Potapovka clusters closely to the steppe, and Andronovo (like Sintashta) clusters closely to Corded Ware (i.e. Ukraine Neolithic / Central-East European), both have certain ‘outliers’ in PCA: the former has one individual clustering closely to Corded Ware, and the latter to the steppe. Both ‘outliers’ fit well with the interpretation of the recent mixture of Corded Ware peoples with steppe populations, and they offer a different image for the evolution of populations of Potapovka and Sintashta-Petrovka, potentially influencing their language. The position of Srubna samples, nearer to Sintashta and Andronovo (but occupying the same territory as the previous Potapovka) offers the image of a late westward conquest from Corded Ware-related populations.

Diachronic map of migrations ca. 2250-1750 BC

Iron Age Bulgaria: a sample of haplogroup R1a-Z93, with more ‘yamnaya’ ancestry than any other previous sample from the Balkans. For some, it might mean continuity from an older time. However – as with the Corded Ware outlier from Esperstedt before it – it is more likely a recent migrant from the steppe. The most likely origin of this individual is therefore people from the steppe, i.e. either the Srubna culture or a related group. Its relatively close cluster in PCA to certain recent Slavic populations can be interpreted in light of the multiple back and forth migrations in the region: of steppe populations to the west (Srubna, Cimmerians, Scythians, Sarmatians,…), and of Slavic-speaking populations:

Diachronic map of Bronze Age migrations ca. 1750-1250 BC.

Well-defined outliers are, therefore, essential to understand a recent history of admixture. On the other hand, the very concept of “outlier” can be a dangerous tool when the lack of enough samples makes their classification as such unjustified, leading to wrong interpretations.


Globular Amphora not linked to Pontic steppe migrants – more data against Kristiansen’s Kurgan model of Indo-European expansion


New open access article, Genome diversity in the Neolithic Globular Amphorae culture and the spread of Indo-European languages, by Tassi et al. (2017).


It is unclear whether Indo-European languages in Europe spread from the Pontic steppes in the late Neolithic, or from Anatolia in the Early Neolithic. Under the former hypothesis, people of the Globular Amphorae culture (GAC) would be descended from Eastern ancestors, likely representing the Yamnaya culture. However, nuclear (six individuals typed for 597 573 SNPs) and mitochondrial (11 complete sequences) DNA from the GAC appear closer to those of earlier Neolithic groups than to the DNA of all other populations related to the Pontic steppe migration. Explicit comparisons of alternative demographic models via approximate Bayesian computation confirmed this pattern. These results are not in contrast to Late Neolithic gene flow from the Pontic steppes into Central Europe. However, they add nuance to this model, showing that the eastern affinities of the GAC in the archaeological record reflect cultural influences from other groups from the East, rather than the movement of people.

(a) Principal component analysis on genomic diversity in ancient and modern individuals. (b) K = 3,4 ADMIXTURE analysis based only on ancient variation. (a) Principal component analysis of 777 modern West Eurasian samples with 199 ancient samples. Only transversions considered in the PCA (to avoid confounding effects of post-mortem damage). We represented modern individuals as grey dots, and used coloured and labelled symbols to represent the ancient individuals. (b) Admixture plots at K = 3 and K = 4 of the analysis conducted only considering the ancient individuals. The full plot is shown in electronic supplementary material, figure S7. The ancient populations are sorted by a temporal scale from Pleistocene to Iron Age. The GAC samples of this study are displayed in the box on the right.

Excerpt, from the discussion:

In its classical formulation, the Kurgan hypothesis, i.e. a late Neolithic spread of proto-Indo-European languages from the Pontic steppes, regards the GAC people as largely descended from Late Neolithic ancestors from the East, most likely representing the Yamna culture; these populations then continued their Westward movement, giving rise to the later Corded Ware and Bell Beaker cultures. Gimbutas [23] suggested that the spread of Indo-European languages involved conflict, with eastern populations spreading their languages and customs to previously established European groups, which implies some degree of demographic change in the areas affected by the process. The genomic variation observed in GAC individuals from Kierzkowo, Poland, does not seem to agree with this view. Indeed, at the nuclear level, the GAC people show minor genetic affinities with the other populations related with the Kurgan Hypothesis, including the Yamna. On the contrary, they are similar to Early-Middle Neolithic populations, even geographically distant ones, from Iberia or Sweden. As already found for other Late Neolithic populations [18], in the GAC people’s genome there is a component related to those of much earlier hunting-gathering communities, probably a sign of admixture with them. At the nuclear level, there is a recognizable genealogical continuity from Yamna to Corded Ware. However, the view that the GAC people represented an intermediate phase in this large-scale migration finds no support in bi-dimensional representations of genome diversity (PCA and MDS), ADMIXTURE graphs, or in the set of estimated f3-statistics.

Scheme summarizing the five alternative models compared via ABC random forest. We generated by coalescent simulation mtDNA sequences under five models, differing as to the number of migration events considered. The coloured lines represent the ancient samples included in the analysis, namely Unetice (yellow line), Bell Beaker (purple line), Corded Ware (green line) and Globular Amphorae (red line) from Central Europe, Yamnaya (light blue line) and Srubnaya (brown line) from Eastern Europe. The arrows refer to the three waves of migration tested. Model NOMIG was the simplest one, in which the six populations did not have any genetic exchanges; models MIG1, MIG2 and MIG1, 2 differed from NOMIG in that they included the migration events number 1, 2 (from Eastern to Central Europe, respectively before and after the onset of the GAC), or both. Model MIG2, 3 represents a modification of MIG2 model also including a back migration from Central to Eastern Europe after the development of the Corded Ware culture.

Together with Globular Amphora culture samples from Mathieson et al. (2017), this suggests that Kristiansen’s Indo-European Corded Ware Theory is wrong, even in its latest revised models of 2017.

The background shading indicates the three migratory waves proposed by Marija Gimbutas, and personally checked by her in 1995. The symbols refer to the ancient populations considered in the ABC analysis

On the other hand, the article’s genetic finds have some interesting connections in terms of mtDNA phylogeography, but without a proper archaeological model it is difficult to explain them.

Haplogroup frequencies were obtained for Early Neolithic (EN), Middle Neolithic (MN), Chalcolithic (CA), and Late Neolithic (LN). The color assigned to each haplogroup is represented on the lower right part of each plot. Haplogroup frequencies were plotted geographically using QGIS v2.14.

Text and images from the article under Creative Commons Attribution 4.0 license.

Discovered first via Bernard Sécher’s blog.


Human ancestry: how to work your own PCA, ADMIXTURE analyses for human evolutionary and genealogical studies


I wrote two days ago, in the post announcing the revised version (October 2017) of the Indo-European demic diffusion model, about dumping the information I had on doing PCA and ADMIXTURE analyses as ‘drafts’, without reviewing them, in the new section of this website called Human Ancestry.

I had some time today to review them, and to correct gross mistakes in the texts, so that they might be more usable now.

I began to work with free datasets to see if I could learn something more about results of recent Genetic research by working with the available free software. For the moment, I don’t see it necessary to continue working with samples myself, because there are many professionals in Bioinformatics doing an excellent job with their publications – much better than I could do -, and publishing results early (as pre-prints) and with free licenses, which allow us to reuse and modify their material. To work again with their samples seems most of the time like reinventing the wheel.

After all, my interpretation of Indo-European migrations does not depend on my own analysis of free datasets – or on genetic analysis, or on archaeological fieldwork, for that matter – but on the study of all anthropological questions involved. I am actually more interested in Linguistics, and – only marginally – in Archaeology, as is the field of Indo-European Studies in general.

I did find certain interesting aspects that I have commented on in the model, though: especially by labelling all samples and reading about them carefully (usually in the supplementary notes of the published papers), you can observe certain patterns and derive some information that others might have missed. Such examples include the Corded Ware outlier from Esperstedt (see more on the Corded Ware migration), or the differences in the three samples from early Khvalynsk.

Now that most data published seem to keep supporting what I have suggested – regarding the more complex nature of the steppe component (so-called ‘yamnaya component’), and also regarding the migration from Yamna to Bell Beaker, and a migration of a different population (and probably language) with Corded Ware – I don’t find it worthwhile to spend more of my quite limited time on these tasks.

However, if I need to work again with datasets, I will try to complete the drafts the best I can. Especially regarding F3 Statistics and qpGraph, which I didn’t even try. If you want to help improve the sections, you are welcome of course.

If I find time, I might be of help with your work. And even though modern genealogy does not interest me (for the moment), I guess it can also be relevant to obtain conclusions on more recent migrations, so if I can be of any help to any interesting work, I will do it too.

Plot 3D of datasets Minoans and Mycenaeans + Scythians and Sarmatians, using the same colours as in the Indo-European demic diffusion model.


  • The concept of “outlier” in studies of Human Ancestry, and the Corded Ware outlier from Esperstedt
  • New Ukraine Eneolithic sample from late Sredni Stog, near homeland of the Corded Ware culture
  • The concept of “outlier” in studies of Human Ancestry, and the Corded Ware outlier from Esperstedt


    While writing the third version of the Indo-European demic diffusion model, I noticed that one Corded Ware sample (labelled I0104) clusters quite closely with steppe samples (i.e. Yamna, Afanasevo, and Potapovka). The other Corded Ware samples cluster, as expected, closely with east-central European samples, which include related cultures such as the Swedish Battle Axe, and later Sintashta, or Potapovka (cultures that are from the steppe proper, but are derived from Corded Ware).

    I also noticed after publishing the draft that I had used the wording “Corded Ware outlier” at least once. I certainly had that term in mind when developing the third version, but I did not intend to write it down formally. Nevertheless, I think it is the right name to use.

    PCA of dataset including Minoans and Mycenaeans, and Scythians and Sarmatians. The graphic has been arranged so that ancestries and samples are located in geographically friendly axes similar to north-south (Y), east-west(X). Symbols are used, in a simplified manner, in accordance with symbols for Y-DNA haplogroups used in the maps. Labels have been used for simplification of important components. Areas are drawn surrounding Yamna, Poltavka, Afanasevo, Corded Ware (including samples from Estonia, Battle Axe, and Poltavka outlier), and succeeding Sintashta and Potapovka cultures, as well as Bell Beaker. Corded Ware sample I0104, from Esperstedt, has also been labelled.

    An outlier in Statistics, as you can infer from the name, is a sample (more precisely, an observation) that lies far from the others. It is a slippery concept in Human Evolutionary Biology, because it has no clear definition, and it is thus dependent on a certain degree of subjective evaluation. It seems to be mainly based on a combination of PCA and ADMIXTURE analyses, but should obviously be dependent on the number of samples available for a certain culture, and the regional distribution of the samples available.
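To make the statistical sense of the word concrete, here is a minimal sketch of one common heuristic: measure each sample's distance from the group's median position in PCA space, and flag samples whose distance is extreme relative to the median absolute deviation. This is only an illustration, not the method used in any of the papers discussed; the coordinates, the threshold of 3.5, and the minimum of four samples are all assumptions of mine.

```python
from math import dist
from statistics import median

def flag_outliers(points, threshold=3.5, min_samples=4):
    """Return indices of points unusually far from the group's centre.

    points: list of (PC1, PC2) tuples for samples attributed to one culture.
    With fewer than min_samples points nothing is flagged, because the
    spread of the group cannot be estimated in any meaningful way.
    """
    if len(points) < min_samples:
        return []
    # Robust centre: coordinate-wise median of the group.
    centre = tuple(median(axis) for axis in zip(*points))
    distances = [dist(p, centre) for p in points]
    med = median(distances)
    # Median absolute deviation, scaled to be comparable to a standard deviation.
    mad = median(abs(d - med) for d in distances)
    if mad == 0:
        return []
    return [i for i, d in enumerate(distances)
            if (d - med) / (1.4826 * mad) > threshold]

# Hypothetical PCA coordinates: five tightly clustered samples and one far away.
group = [(0.00, 0.00), (0.01, 0.00), (0.00, 0.01),
         (0.01, 0.01), (0.02, 0.00), (0.10, 0.10)]
print(flag_outliers(group))
```

The early return for small groups reflects the point made above: with two or three samples from a culture, calling any one of them an outlier is a subjective decision, not a statistical one.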

    We have thus certain clear cases, like the Poltavka outlier, of R1a-M417 lineage, clustering close to Corded Ware (and Sintashta, and Potapovka) samples, but far from other R1b-L23 samples from Poltavka or Yamna cultures, from neighbouring regions in the steppe.

    We have also less clear observations, like Balkan Chalcolithic samples, which may or may not have been part of different cultural groups (say, related to the Suvorovo-Novodanilovka expansion, or not), which may justify their differences in ancestral components in ADMIXTURE, and in their position in PCA.

    And we have a Yamna sample from western Ukraine, which – unlike the other two available samples – clusters “to the south” of east Yamna samples. Taking into account the Yamna sample from Bulgaria, clustering closely with south-eastern European samples, could you really call this an outlier? Two outliers out of four western Yamna samples? Well, maybe. If you take east and west Yamna from the steppe as a whole, and exclude the Yamna sample from Bulgaria, of course you can. Whether that classification is useful, or actually hinders a proper interpretation of western Yamna samples, and of the “Yamna component” seen in them, is a different story…

    PCA for European samples of Mathieson et al. (2017)

    But what then about the Corded Ware male from Esperstedt, labelled I0104, dated ca. 2430 BC, which clusters among contemporaneous steppe (Poltavka) samples, and has the greatest proportion of ‘Yamna component’ in ADMIXTURE? After all, it is different in both respects from any other Corded Ware individual – including the oldest samples available, from Latvia (ca. 2885 BC) and Tiefbrunn (ca. 2755 BC).

    This sample is one of the direct links between the steppe and Corded Ware in late times, and has been the main reason for the confusion a lot of people seem to have about the “Yamna component” in Corded Ware, with some supporting a direct migration from one into the other, and a few even daring to say that “Corded Ware is indistinguishable from Yamna”(!?).

    His family members – all males of haplogroup R1a-M417 (like I0104 and most males from the Corded Ware culture) – , a few generations later, show a decreased Yamna component, which clearly indicates that this individual’s admixture came directly from the steppe, and most likely from one or multiple female ancestors. That is compatible with the nomadic nature of the Corded Ware culture (and its known exogamy practices), which connected central Europe with the steppes, up to the North Caspian region.
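The reasoning about a component that decreases over generations can be sketched with simple arithmetic: a child's expected ancestry proportion is the average of its parents', so any excess over the local baseline is expected to halve with each generation of mating within the local population. The function below is a hypothetical illustration, and the proportions 0.9 and 0.7 are invented, not estimates for I0104 or his relatives.

```python
def expected_component(start, baseline, generations):
    """Expected ancestry fraction after mating with baseline partners.

    start: fraction of the component in the admixed ancestor (0 to 1).
    baseline: fraction of the component in the local population (0 to 1).
    Each generation averages the current value with the baseline.
    """
    value = start
    for _ in range(generations):
        value = (value + baseline) / 2
    return value

# Invented proportions: an ancestor at 0.9 in a population averaging 0.7.
for n in range(4):
    print(n, round(expected_component(0.9, 0.7, n), 4))
```

Under these assumptions the excess is nearly gone after three or four generations, which is why a markedly elevated component in one individual, followed by lower values in his descendants, is more easily read as recent admixture than as a feature of the whole population.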

    If labelling other samples as outliers may be interesting to improve the conclusions one can obtain from genetic research, labelling this sample is, in my opinion, essential, to avoid certain strong misconceptions about the origin of the Corded Ware culture.