Friday, October 29, 2010

Exploring the 10 components of the Dodecad project's initial analysis

Project participants have received their admixture proportions from K=10 inferred ancestral components. But what are these components?

Unfortunately, ADMIXTURE does not provide any information about the age of the components. Indeed, ADMIXTURE-like analyses have identified village populations in the past; these probably formed as distinctive entities no earlier than a few hundred years ago. But the same analyses also identify the great continental groups (such as East Eurasians), which almost certainly formed thousands of years ago.

Nor can we be sure about the appearance of people who belong primarily to one of the components. This is due to the fact that many physical traits have evolved relatively recently in Eurasia, the result of natural and social adaptation to local environments.

A common way of exploring the relationship between populations is to represent them as an evolutionary tree. But caution is needed: the tree representation assumes that populations split, and it cannot represent lateral gene flow between branches.

It is better to organize ancestral components (such as the 10 components currently reported to participants, e.g., "Northern European" or "West African"), rather than extant populations (e.g., Russians or Uygur), in a tree. To do otherwise would be equivalent to forcing a tree representation onto populations in which lateral gene flow has been important.

Of course, no tree representation can capture the complexities of human relationships, but it nonetheless helps us visualize the data and generate hypotheses about the deep origins of prehistoric humans.

And while lateral gene flow may have occurred among the ancestral components themselves, we are nonetheless removing one layer of admixture (e.g., between East and West Eurasians in the ancestry of Uygur), and are getting closer to the situation in Eurasia before historical and late prehistorical movements of people began shuffling genes around in force.

Another common way of presenting relationships between populations is with multidimensional scaling (MDS). This takes the distances between populations and places the populations on a "map", the first few dimensions of which are usually displayed in a series of 2-dimensional scatterplots. This is quite useful, as the first 2 dimensions of the MDS representation have been found to correlate well with a map of geography in Europe, and probably elsewhere.
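To make the idea concrete, here is a minimal sketch of classical (Torgerson) MDS in plain Python. It is not the software used for the plots in this post, and the four points are made-up toy data rather than the project's Fst values. The function name `classical_mds` and the power-iteration approach are my own assumptions for illustration; it also assumes the leading eigenvalues are well separated.

```python
import math

def classical_mds(dist, dims=2, iters=500):
    """Classical (Torgerson) MDS on a symmetric distance matrix (list of lists)."""
    n = len(dist)
    # Double-centre the squared distances to get a Gram-like matrix B.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the leading eigenvector of B.
        v = [math.sin(i + 1 + k) for i in range(n)]  # arbitrary start vector
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        if lam <= 1e-12:
            break  # nothing left to display in further dimensions
        for i in range(n):
            coords[i][k] = math.sqrt(lam) * v[i]
        # Deflate B so the next pass finds the next dimension.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords

# Toy data (hypothetical): four points that really do lie on a plane.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
dist = [[math.hypot(px - qx, py - qy) for (qx, qy) in points]
        for (px, py) in points]
coords = classical_mds(dist)

# Largest gap between a map distance and the corresponding input distance.
max_error = max(
    abs(math.hypot(coords[i][0] - coords[j][0],
                   coords[i][1] - coords[j][1]) - dist[i][j])
    for i in range(4) for j in range(4))
print(max_error)
```

Because the toy points are genuinely two-dimensional, the 2D map reproduces their distances up to rounding; real genetic distances generally do not fit so neatly.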

Notice, however, that there is information loss: there are 45 pairwise distances (10 choose 2) between 10 populations, but each population is represented by just two (x, y) co-ordinates on a 2D map. Hence, 45 pairwise distances are mapped to 20 co-ordinate values. What this means is that the distances cannot, in general, all be preserved. That is, if we take our ruler and measure the distance between two populations on a 2D MDS plot, we are not guaranteed that it is proportional to the original distance.

(The problem becomes even more severe if we map 1,000 individuals themselves onto a 2D MDS map: about half a million pairwise distances are mapped to 2,000 co-ordinates. Normally, the first two dimensions capture a lot of the information, but we always have to examine the raw distances themselves to be sure of individuals' relationships with each other. This becomes a formidable task as the number of individuals grows, and it also defeats the purpose of using visualization as an aid to data interpretation.)
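The counting argument is easy to check. The helper `mds_budget` below is a hypothetical name of my own; it simply compares the number of pairwise distances with the number of co-ordinate values available on a map.

```python
import math

def mds_budget(n, dims=2):
    """Return (number of pairwise distances, number of co-ordinate values)."""
    return math.comb(n, 2), n * dims

for n in (10, 1000):
    pairs, values = mds_budget(n)
    print(f"{n} points: {pairs} pairwise distances vs {values} co-ordinates")
```

The number of distances grows roughly with the square of the number of points, while the number of co-ordinates grows only linearly, so the gap widens quickly.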

Also, as I have noted before, individuals of quite different ancestry may fall on the same spot of an MDS or the related Principal Components Analysis (PCA) map. Nonetheless, MDS is also a method that can give us a quick visual perception of the relationships between populations.

Without further ado, here is the table of Fst distances between the 10 ancestral components, as produced by ADMIXTURE. Note that these values depend on the marker set used for analysis, but markers were not selected for showing either big or small differences between populations:


Here is an MDS representation of these distances:


Here is a Neighbor-Joining tree representation:

Finally, here is a hierarchical clustering with complete linkage:
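For readers unfamiliar with complete linkage, here is a minimal sketch of how it works, in plain Python. The labels `P1` to `P4` and their distances are invented for illustration and are not the Fst values from the table; the function name `complete_linkage` is likewise my own. At each step, the two clusters whose most distant members are closest to each other get merged.

```python
def complete_linkage(labels, dist):
    """Agglomerative clustering; cluster distance = largest pairwise distance."""
    index = {name: i for i, name in enumerate(labels)}
    clusters = [(name,) for name in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: judge two clusters by their farthest pair.
                d = max(dist[index[x]][index[y]]
                        for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical distances between four made-up components.
labels = ["P1", "P2", "P3", "P4"]
dist = [
    [0.00, 0.01, 0.10, 0.12],
    [0.01, 0.00, 0.11, 0.13],
    [0.10, 0.11, 0.00, 0.02],
    [0.12, 0.13, 0.02, 0.00],
]
merges = complete_linkage(labels, dist)
for left, right, height in merges:
    print(left, "+", right, "at", height)
```

The order of the merges, and the height at which each occurs, is what a dendrogram draws.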


Some observations:
  • There are three well-defined "poles" of maximal differentiation: West Africans, East Eurasians, and West Eurasians
  • East Africans are related to West Africans but deviate toward West Eurasians
  • South Asians are intermediate between East and West Eurasians
  • West Eurasians consist of a core of North/South Europeans and West Asians, with Southwest Asians being slightly more removed
  • Northwest Africans are close to West Eurasians but also deviate towards other Africans
  • East Eurasians consist of Northeast Asians and East Asians
Note that in the above I am speaking of the 10 components, not of living geographical populations. For example, Ethiopians are "geographical" East Africans, who partake primarily of the East African but also of the Southwest Asian component and thus are even more inclined towards West Eurasians than the East African component is.

5 comments:

  1. Dieneke, on the component list beside the MDS plot the name of the Northwest African component is wrongly written as "Northwest Asian".

  2. You have a good eye. I'll correct it.

  3. You have a good eye.

    Well, I have myopia of -3.5 diopters in both eyes. ;-)

  4. It is interesting to see that the Fst distance between Euro and East Africans is in fact smaller than the distance between Euro and Northeast Asians...

  5. It would be interesting to know which Ethnic Groups you have used to represent the Southern Europeans?
