BOT300 Home Page

Phenetic analysis (clustering & ordination) 1 (27-Mar-03)


How do we visualize taxonomic (or other) relationships in multivariate data?

Simple bivariate plots can often be extremely effective for small datasets. The plot below is based on a famous dataset consisting of four flower measurements for three species of Iris collected in the Gaspé (Anderson 1935; Fisher 1936).

This plot was made using Rweb.

With as few as four measurements it can be difficult to take in all (4 × 3)/2 = 6 pairwise relationships between the characters. There are, however, methods that summarize multivariate data efficiently so that these relationships can be seen very clearly. These methods are based on the matrices of R- and Q-mode resemblances discussed on 25 March 2003.
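
As a small illustration of how quickly the number of bivariate plots grows, the pairs can be enumerated directly. This is only a sketch in Python (not the R instructions used for the plots on this page), and the character names are the labels commonly used for the four Iris measurements rather than names taken from this page.

```python
from itertools import combinations

# Labels commonly used for the four Iris flower measurements (an assumption here).
characters = ["sepal length", "sepal width", "petal length", "petal width"]

# Every unordered pair of characters corresponds to one bivariate plot.
pairs = list(combinations(characters, 2))

n = len(characters)
n_pairs = n * (n - 1) // 2  # closed-form count of unordered pairs

for a, b in pairs:
    print(f"{a} vs. {b}")
print("number of pairwise plots:", n_pairs)
```

With ten characters the same formula already gives 45 plots, which is why a summary method becomes attractive.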


R-mode analyses

Principal components analysis (PCA) allows us to rotate our multidimensional data so as to see the directions (say 2 or 3 at a time) in which they vary most. It is based on eigenanalysis of a symmetric dispersion matrix, such as a matrix of variances and covariances, or a matrix of correlations. In effect, eigenanalysis examines the shape of the cloud of data points in hyperspace (a space of more than 3 dimensions) and finds a succession of orthogonal (at 90° to each other) directions, each of which captures the maximum dispersion not already accounted for by the preceding ones. The eigenvalues measure the dispersion (variance) of the data points along each of these directions. The corresponding eigenvectors define the directions themselves and allow us to rotate our data into a new coordinate system in which the variance along each axis equals the corresponding eigenvalue.
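
In symbols, the eigenanalysis can be summarized as follows. The notation is an assumption made for this sketch and is not taken from the course handouts: X_c is the n × p data matrix with each column centred on its mean, S is the variance-covariance matrix, U has the eigenvectors u_k as its columns, and Z holds the component scores.

```latex
\begin{aligned}
\mathbf{S} &= \frac{1}{n-1}\,\mathbf{X}_c^{\mathsf{T}}\mathbf{X}_c, \\
\mathbf{S}\,\mathbf{u}_k &= \lambda_k\,\mathbf{u}_k,
  \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0, \\
\mathbf{Z} &= \mathbf{X}_c\,\mathbf{U},
  \qquad \operatorname{var}(\mathbf{z}_k) = \lambda_k, \\
\sum_{k=1}^{p} \lambda_k &= \operatorname{tr}(\mathbf{S}).
\end{aligned}
```

The last line says that the rotation redistributes the total variance among the new axes without changing it. If the correlation matrix is used instead, the columns of X_c are first divided by their standard deviations.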

Consider the following 2-dimensional data, created using Rweb.

This plot was made using Rweb.

It's pretty clear in which directions the data vary most, and that these directions are not the same as those of the x- and y-axes. A PCA of these data rotates this cloud of points into such a coordinate system.
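
The rotation can be sketched for the 2-dimensional case, where the eigenanalysis of the 2 × 2 covariance matrix has a closed form. This is a Python sketch, not the Rweb instructions behind the plots; the simulated data (slope 0.8, noise standard deviation 0.4, random seed 1) are arbitrary choices made for illustration.

```python
import math
import random

random.seed(1)

# Simulated 2-D data: y depends on x, so the cloud is elongated along a diagonal.
n = 200
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.8 * x + random.gauss(0, 0.4) for x in xs]

# Centre each variable on its mean.
mx, my = sum(xs) / n, sum(ys) / n
xc = [x - mx for x in xs]
yc = [y - my for y in ys]

def cov(u, v):
    """Sample covariance of two already-centred variables."""
    return sum(a * b for a, b in zip(u, v)) / (len(u) - 1)

sxx, syy, sxy = cov(xc, xc), cov(yc, yc), cov(xc, yc)

# Eigenvalues of the symmetric 2 x 2 matrix [[sxx, sxy], [sxy, syy]].
half_trace = (sxx + syy) / 2
root = math.hypot((sxx - syy) / 2, sxy)
lam1, lam2 = half_trace + root, half_trace - root

# Angle between the first principal axis and the x-axis.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
c, s = math.cos(theta), math.sin(theta)

# Rotate the centred data onto the principal axes (the PCA scores).
pc1 = [c * x + s * y for x, y in zip(xc, yc)]
pc2 = [-s * x + c * y for x, y in zip(xc, yc)]

print("eigenvalues:", lam1, lam2)
print("variances of the scores:", cov(pc1, pc1), cov(pc2, pc2))
print("covariance of the scores:", cov(pc1, pc2))
```

After the rotation the variances of the scores equal the two eigenvalues and the scores are uncorrelated, which is what the rotated plot shows graphically.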

This plot was made using Rweb.

A full PCA of the Iris data looks like this. Note the biplot of PCA scores and component-character correlations, and the screeplot showing the relative magnitudes of the eigenvalues. These features are described HERE.


Q-mode analyses

Eigenanalysis of a Q-mode resemblance matrix can be used in a similar way to obtain a Principal Coordinates Analysis (PCoA) of the data. The results of PCA and PCoA will tend to be identical to the extent that the original datasets meet the assumptions of each method; in particular, a PCoA of Euclidean distances between objects reproduces the object positions obtained from a PCA of the covariance matrix (apart from possible reflection of the axes). In general, PCA is restricted to ratio-scale metric data for which character covariation and correlation are meaningful. PCoA can be applied much more widely, since resemblance functions are available for most data types, and for mixed data (handout, 25 March 2003).
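
A minimal PCoA can be sketched as below, assuming the usual recipe of transforming the distances, double-centring the result, and extracting eigenvectors. The function names `matvec` and `pcoa`, the use of simple power iteration with deflation in place of a library eigen-solver, and the five example points are all choices made for this illustration. Power iteration finds the eigenvalue of largest magnitude, so this sketch is only appropriate for distances (such as Euclidean ones) that yield no large negative eigenvalues.

```python
import math

def matvec(m, v):
    """Product of a square matrix (list of rows) and a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def pcoa(dist, n_axes=2, n_iter=500):
    """Principal coordinates from a symmetric distance matrix."""
    n = len(dist)
    # Transform distances to a_ij = -0.5 * d_ij^2, then double-centre.
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in a]
    grand_mean = sum(row_mean) / n
    b = [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
         for i in range(n)]

    eigenvalues, axes = [], []
    for k in range(n_axes):
        # Arbitrary unit-length starting vector for the power iteration.
        v = [math.sin(i + 1 + k) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, matvec(b, v)))  # Rayleigh quotient
        eigenvalues.append(lam)
        # Scale the eigenvector so its sum of squares equals the eigenvalue.
        axes.append([x * math.sqrt(max(lam, 0.0)) for x in v])
        # Deflate: remove this axis before extracting the next one.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]

    coords = [[axes[k][i] for k in range(n_axes)] for i in range(n)]
    return eigenvalues, coords

# Hypothetical objects described by two characters, converted to distances.
points = [(0.0, 0.0), (3.0, 1.0), (4.0, 4.0), (1.0, 3.0), (6.0, 2.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
eigenvalues, coords = pcoa(dist)
print(eigenvalues)
```

Because the example distances are Euclidean and the objects lie in two dimensions, the distances among the two principal coordinates reproduce the original distance matrix. Any other resemblance function could be substituted when building `dist`, which is the flexibility the text refers to.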

Cluster analysis is the other major type of Q-mode analysis. This involves representing the distances between objects (OTUs) in the form of a tree diagram (dendrogram). Once discontinuities in the data have been detected by means of an ordination method like PCA or PCoA, you can use an algorithmic method to sort objects into groups on the basis of their resemblances to each other. Typically, this is done in a bottom-up, agglomerative manner, but with increased computational power it is now also practical to do this sorting top-down, or divisively, if that is more appropriate.
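
The bottom-up approach can be sketched as follows. Average linkage (UPGMA) is used here only as one common choice of agglomerative method; the text above does not single out a particular linkage rule, and the function name `upgma`, the OTU labels, and the distance values are hypothetical.

```python
def upgma(dist, labels):
    """Average-linkage (UPGMA) agglomerative clustering.

    Returns the merge history as (cluster_a, cluster_b, height) tuples,
    where each cluster is a tuple of the original labels.
    """
    # Start with every object in its own cluster (stored as index tuples).
    clusters = [(i,) for i in range(len(labels))]
    merges = []

    def average_distance(c1, c2):
        # Mean of all pairwise distances between members of the two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((
            tuple(labels[i] for i in clusters[a]),
            tuple(labels[i] for i in clusters[b]),
            d,
        ))
        # Replace the two clusters with their union.
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical distances among four OTUs.
labels = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
merges = upgma(dist, labels)
for left, right, height in merges:
    print(left, "+", right, "at height", height)
```

Each merge corresponds to one node of the dendrogram, drawn at the height of the average distance between the two groups being joined. Here A and B join at 2, C and D at 4, and the two pairs at 7.5.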

This plot, like the biplot and screeplot above, was created using S-Plus, the commercial version of the S language. Clustering of this kind can also be done using R and Rweb by attaching the cluster package ("library(cluster)").


Rweb

R is a free software package, available for multiple platforms (as well as on the web), that implements the S language for data analysis and graphics originally developed at Bell Laboratories in Murray Hill, NJ. There is also a separate program, likewise called the R package, for multivariate data analysis that is distributed as freeware from the University of Montréal.

Further reading

Anderson, E. (1935). The irises of the Gaspé Peninsula. Bulletin of the American Iris Society, 59, 2-5.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179-188.

Judd et al. (2002). Box 6C, ch. 6 [PCA as a tool in studying hybridization].

Legendre, P. & L. Legendre (1998). Numerical Ecology, 2nd ed. New York, Springer-Verlag.

Manly, B. F. J. (1994). Multivariate statistical methods - A primer, 2nd ed. London, Chapman & Hall.

Podani, J. (2000). Introduction to the exploration of multivariate biological data. Leiden, Backhuys.


© 2003 Botany Department, University of Toronto.

Please send your comments to tim.dickinson@utoronto.ca; last updated 27-Mar-2003