Contrastive principal component analysis (cPCA) to explore patterns specific to a dataset

Interesting open access paper: Exploring patterns enriched in a dataset with contrastive principal component analysis, by Abid, Zhang, Bagaria & Zou, Nature Communications (2018) 9:2134.

Abstract (emphasis mine):

Visualization and exploration of high-dimensional data is a ubiquitous challenge across disciplines. Widely used techniques such as principal component analysis (PCA) aim to identify dominant trends in one dataset. However, in many settings we have datasets collected under different conditions, e.g., a treatment and a control experiment, and we are interested in visualizing and exploring patterns that are specific to one dataset. This paper proposes a method, contrastive principal component analysis (cPCA), which identifies low-dimensional structures that are enriched in a dataset relative to comparison data. In a wide variety of experiments, we demonstrate that cPCA with a background dataset enables us to visualize dataset-specific patterns missed by PCA and other standard methods. We further provide a geometric interpretation of cPCA and strong mathematical guarantees. An implementation of cPCA is publicly available, and can be used for exploratory data analysis in many applications where PCA is currently used.

Schematic Overview of cPCA. To perform cPCA, compute the covariance matrices C_X, C_Y of the target and background datasets. The singular vectors of the weighted difference of the covariance matrices, C_X − α · C_Y, are the directions returned by cPCA. As shown in the scatter plot on the right, PCA (on the target data) identifies the direction that has the highest variance in the target data, while cPCA identifies the direction that has a higher variance in the target data as compared to the background data. Projecting the target data onto the latter direction gives patterns unique to the target data and often reveals structure that is missed by PCA. Specifically, in this example, reducing the dimensionality of the target data by cPCA would reveal two distinct clusters.
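
To make the caption concrete, here is a minimal NumPy sketch of the procedure it describes. It is not the authors' implementation: the function name `cpca_directions`, the synthetic data and the choice α = 1 are my own assumptions for illustration. I also take the eigenvectors with the largest eigenvalues of C_X − α · C_Y, which is my reading of "singular vectors" for a symmetric matrix.

```python
import numpy as np


def cpca_directions(target, background, alpha=1.0, n_components=2):
    """Return the top eigenvectors (as columns) of C_X - alpha * C_Y.

    Hypothetical helper for illustration, not the API of the authors' package.
    """
    c_x = np.cov(target, rowvar=False)        # covariance of the target data
    c_y = np.cov(background, rowvar=False)    # covariance of the background data
    contrast = c_x - alpha * c_y              # symmetric, possibly indefinite
    eigvals, eigvecs = np.linalg.eigh(contrast)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]


# Synthetic toy data (an assumption, not data from the paper):
# feature 0 has large variance in BOTH datasets (uninteresting, shared variation),
# feature 1 separates two hidden groups in the target only.
rng = np.random.default_rng(0)
n = 2000
scales = np.array([5.0, 1.0, 1.0])
background = rng.normal(size=(n, 3)) * scales
target = rng.normal(size=(n, 3)) * scales
groups = rng.integers(0, 2, size=n)
target[:, 1] += np.where(groups == 1, 3.0, -3.0)

# alpha = 0 reduces to ordinary PCA on the target data.
pca_dir = cpca_directions(target, background, alpha=0.0, n_components=1)[:, 0]
cpca_dir = cpca_directions(target, background, alpha=1.0, n_components=1)[:, 0]

# Project the (centred) target data onto the contrastive direction.
projected = (target - target.mean(axis=0)) @ cpca_dir

print("PCA direction: ", np.round(pca_dir, 2))
print("cPCA direction:", np.round(cpca_dir, 2))
```

In this toy setting the PCA direction should line up with the high-variance shared feature, while the cPCA direction should line up with the feature that separates the two hidden groups, which is the behaviour the schematic illustrates. The value of α controls the trade-off between high target variance and low background variance, so in practice one would try several values.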

The Mexican example caught my attention:

Relationship between ancestral groups in Mexico

In previous examples, we have seen that cPCA allows the user to discover subclasses within a target dataset that are not labeled a priori. However, even when subclasses are known ahead of time, dimensionality reduction can be a useful way to visualize the relationship within groups. For example, PCA is often used to visualize the relationship between ethnic populations based on genetic variants, because projecting the genetic variants onto two dimensions often produces maps that offer striking visualizations of geographic and historic trends [26,27]. But again, PCA is limited to identifying the most dominant structure; when this represents universal or uninteresting variation, cPCA can be more effective at visualizing trends.

The dataset that we use for this example consists of single nucleotide polymorphisms (SNPs) from the genomes of individuals from five states in Mexico, collected in a previous study [28]. Mexican ancestry is challenging to analyze using PCA since the PCs usually do not reflect geographic origin within Mexico; instead, they reflect the proportion of European/Native American heritage of each Mexican individual, which dominates and obscures differences due to geographic origin within Mexico (see Fig. 4a). To overcome this problem, population geneticists manually prune SNPs, removing those known to derive from European ancestry, before applying PCA. However, this procedure is of limited applicability since it requires knowing the origin of the SNPs and that the source of background variation to be very different from the variation of interest, which are often not the case.

Relationship between Mexican ancestry groups. a PCA applied to genetic data from individuals from 5 Mexican states does not reveal any visually discernible patterns in the embedded data. b cPCA applied to the same dataset reveals patterns in the data: individuals from the same state are clustered closer together in the cPCA embedding. c Furthermore, the distribution of the points reveals relationships between the groups that matches the geographic location of the different states: for example, individuals from geographically adjacent states are adjacent in the embedding. c Adapted from a map of Mexico that is originally the work of User:Allstrak at Wikipedia, published under a CC-BY-SA license, sourced from https://commons.wikimedia.org/wiki/File:Mexico_Map.svg

As an alternative, we use cPCA with a background dataset that consists of individuals from Mexico and from Europe. This background is dominated by Native American/European variation, allowing us to isolate the intra-Mexican variation in the target dataset. The results of applying cPCA are shown in Fig. 4b. We find that individuals from the same state in Mexico are embedded closer together. Furthermore, the two groups that are the most divergent are the Sonorans and the Mayans from Yucatan, which are also the most geographically distant within Mexico, while Mexicans from the other three states are close to each other, both geographically as well as in the embedding captured by cPCA (see Fig. 4c). See also Supplementary Fig. 6 for more details.

So, by using a background dataset, cPCA uncovers patterns in a single target dataset that standard dimensionality reduction techniques miss. Maybe useful for some prehistoric populations, too…

They have released a Python implementation of cPCA on GitHub: https://github.com/abidlabs/contrastive, including Python notebooks and datasets.

Science and Archaeology (Humanities): collaboration or confrontation?

Another discussion on the role of Science for Archaeology, in The Two Cultures and a World Apart: Archaeology and Science at a New Crossroads, by Tim Flohr Sørensen, Norwegian Archaeological Review, vol. 50, 2 (2017):

Within the past decade or so, archaeology has increasingly utilised and contributed to major advances in scientific methods when exploring the past. This progress is frequently celebrated as a quantum leap in the possibilities for understanding the archaeological record, opening up hitherto inaccessible dimensions of the past. This article represents a critique of the current consumption of science in archaeology, arguing that the discipline’s grounding in the humanities is at stake, and that the notion of ‘interdisciplinarity’ is becoming distorted with the increasing fetishisation of ‘data’, ‘facts’ and quantitative methods. It is argued that if archaeology is to break free of its self-induced inferiority to and dependence on science, it must revitalise its methodology for asking questions pertinent to the humanities.

Sørensen's reply to the commentators in that discussion appears in Archaeological Paradigms: Pendulum or Wrecking Ball? Excerpts:

Thus, I argue that what we are witnessing with ‘the third science revolution’ (Kristiansen 2014) is precisely the proliferation of an already very authoritative science ideal in archaeology. And I worry that this dominance will limit research possibilities and potentials rather than encouraging plurality and radical experimentation with different forms of knowing.
(…)
I do believe in the coexistence of disparate academic principles and that collaboration is very often necessary, but I am also of the conviction that some degree of epistemological friction keeps both fields of research progressing. Nurturing distinctions, in other words, is no less useful than aiming for assimilation. What I am arguing for is thus a more respectful friction than the one characterising the processual/post-processual collisions, hoping for an academic environment where differences between research ideals are humbly accepted and cultivated precisely for their disparate strengths.
(…)
So, what I am arguing for is a more kaleidoscopic academic landscape, where different positions do not always have to assume a defensive or compromising stance, especially in confrontation with paradigms that are prospering politically. This also implies that science is not simply in the service of archaeology, as Lidén argues, but that we need to consider how archaeology may benefit science more generally by continuing to debate epistemological grounds, methodology and our modes of inquiry. And so, my fellow archaeologists: ask not what science can do for us, but what we can do for science.
(…)
In my original article, I addressed the widespread tendency in archaeology to disseminate research findings with sometimes too much conviction, where ambiguous results (and limited statistical data) are adopted with little concern for the inherent uncertainties. It is precisely this valorisation and authority of scientific observations that I claim to lead to an implicit devaluation of studies based in the humanities. The problem is – as stated numerous times in my original article – not science, but the consumption of scientific observations in archaeology, where the subtleties and not least ambiguities of scientific results are filtered out, leaving space almost exclusively for scientifically ‘proven’ facts and unequivocal results. This mode of consumption stands in direct contrast to the epistemological observation in the sciences, dictating that ‘“proof” and “certainty” are actually in short supply in the world of science’ (Freudenburg et al. 2008, p. 5). Hence, the risk is that archaeology somewhat uncritically adopts scientific observations that are in fact ‘empirically underdetermined – based largely on evidence that is in the category of the “maybe,” being inherently ambiguous rather than being absolutely clear-cut’ (Freudenburg et al. 2008, p. 6).

As I said recently about the article Massive Migrations…, by Martin Furholt, we are living through a historic debate on questions essential to the future of all these disciplines.

And, as always, there is no shortcut to reading the texts. Unlike in Science, you cannot write a table with a summary of findings…

Discovered (again) via a comment on this blog by Joshua Jonathan.

Featured image from Allentoft et al. “They conclude that the Corded Ware culture of central Europe had ancestry from the Yamnaya. Allentoft et al. also show that the Afanasievo culture to the east is related to the Yamnaya, and that the Sintashta and Andronovo cultures had ancestry from the Corded Ware. Arrows indicate migrations — those from the Corded Ware reflect the evidence that people of this archaeological culture (or their relatives) were responsible for the spreading of Indo-European languages. All coloured boundaries are approximate.”
