In R, one of the built-in datasets is `iris`, a famous data set (Fisher’s or Anderson’s, published in 1936) that gives the measurements in centimeters of the variables sepal length and width and petal length and width for 50 flowers from each of 3 species of iris. The species are Iris setosa, Iris versicolor, and Iris virginica.
[Images of the three species: Iris setosa, Iris versicolor, Iris virginica]
[Diagram of the sepal and petal measurements; the image shows Iris versicolor]
Let’s load the iris dataset using the `data()` command. The dataset includes 150 observations of 5 variables: `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, and `Species`.
The `head()` command will show us the first 6 rows of the dataset so you get an idea of what’s inside.
data(iris)
head(iris)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
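Just to confirm what we said above (150 flowers, 50 from each of 3 species), here is a quick check using base R:

```r
str(iris)            # 150 obs. of 5 variables: 4 numeric, 1 factor
table(iris$Species)  # 50 flowers for each of the 3 species
```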
plot(iris$Sepal.Length, iris$Sepal.Width)
Notice that even without any information about the species (no coloring, plotting symbols, or other graphical cues), we can see that there are probably at least 2 “clusters” or “groupings” of the points in this `iris` dataset.
plot(iris$Petal.Length, iris$Petal.Width)
So, let’s now add color so we can see how the different species of flower cluster together (or not).
plot(iris$Petal.Length, iris$Petal.Width,
col = iris$Species)
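With `col = iris$Species`, the factor’s integer codes (1 to 3) simply index the current `palette()`, so which colors you see depends on your R version’s default palette. Adding a legend makes the mapping explicit; a minimal sketch:

```r
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")
legend("topleft", legend = levels(iris$Species),
       col = 1:3, pch = 1)
```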
Using the `car` package, make a `scatterplotMatrix()` to see all pairwise 2D scatterplots at once.
library(car)
car::scatterplotMatrix(~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width|Species,
data=iris,
main="Three IRIS Species")
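If you don’t have `car` installed, base R’s `pairs()` gives a similar (if plainer) matrix of all pairwise scatterplots:

```r
pairs(iris[, 1:4], col = iris$Species,
      main = "Three IRIS Species (base R)")
```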
Let’s use the `hclust()` function to run a cluster analysis of the `iris` dataset - letting the algorithm create the clusters - this is done WITHOUT any knowledge that there are (or should be) 3 clusters based on the 3 species of flowers.
We’ll run the cluster analysis using the default “complete” linkage algorithm (aka. “agglomeration method”), with `dist()` computing the (Euclidean) distances (dissimilarities) between the samples.
Next we use this result to make a plot called a “dendrogram”.
clusters <- hclust(dist(iris[,1:4]))
plot(clusters)
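To preview where a 3-cluster cut lands on this tree, `rect.hclust()` (also in the `stats` package) draws boxes around the clusters on an existing dendrogram plot:

```r
plot(clusters)
rect.hclust(clusters, k = 3, border = "red")
```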
If we want to get 3 clusters from this data, we will use the `cutree()` function. Then we can see how these resulting clusters line up with the original species.
clusterCut <- cutree(clusters, 3)
table(clusterCut, iris$Species)
clusterCut | setosa | versicolor | virginica |
---|---|---|---|
1 | 50 | 0 | 0 |
2 | 0 | 23 | 49 |
3 | 0 | 27 | 1 |
Notice that “cluster 2” is not “clean” - it has both versicolor and virginica. Clusters 1 and 3 are pretty good.
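One simple way to put a number on this (sometimes called cluster “purity”) is the fraction of flowers that fall in their cluster’s majority species:

```r
tab <- table(clusterCut, iris$Species)
sum(apply(tab, 1, max)) / sum(tab)  # (50 + 49 + 27) / 150 = 0.84
```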
Let’s try another linkage method; set `method = "average"`. Note that this run also clusters on just the two petal variables (columns 3:4).
clusters <- hclust(dist(iris[, 3:4]),
method = "average")
plot(clusters)
Notice that this “average linkage” method results in clusters that more closely align with the original species. Of course, we don’t usually know what the “correct” answer is (the true cluster designations).
clusterCut <- cutree(clusters, 3)
table(clusterCut, iris$Species)
clusterCut | setosa | versicolor | virginica |
---|---|---|---|
1 | 50 | 0 | 0 |
2 | 0 | 45 | 1 |
3 | 0 | 5 | 49 |
The `heatmap()` function is another good visualization tool - you get cluster analyses in 2 dimensions, showing the associations between subjects and between variables. So, you can not only get an idea of what clusters and associations exist between the subjects - useful for classifying the subjects or samples in your data - but you can also see what associations exist between the variables.
irismatrix <- as.matrix(iris[, 1:4])  # heatmap() needs a numeric matrix
heatmap(irismatrix)
# green-to-blue palette with 256 steps; scale = "none" plots raw values
palette <- colorRampPalette(c("green", "blue"))(256)
heatmap(irismatrix, Colv = FALSE, scale = "none", col = palette)
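To see how the row clustering lines up with the known species, `heatmap()` also accepts a `RowSideColors` argument (one color name per row) that adds a color strip beside the rows. The three colors below are an arbitrary choice:

```r
speciesColors <- c("red", "green", "blue")[as.numeric(iris$Species)]
heatmap(irismatrix, RowSideColors = speciesColors)
```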
Built into base R (technically in the `stats` package) is the function `princomp()`, which performs Principal Components Analysis, or PCA.
PCA looks at the matrix of variables and “decomposes” it into “factors” or “components”: linear combinations of the variables chosen so that the 1st component captures as much of the variability in the data as possible, the 2nd captures as much of the remaining variability as possible, and so on, until you end up with as many components as there are variables in the dataset.
Let’s run PCA on the `iris` dataset. Setting `cor=TRUE` bases the analysis on the correlation matrix (i.e. the standardized variables) rather than the covariance matrix.
fit <- princomp(iris[1:4], cor=TRUE)
summary(fit) # print variance accounted for
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.7083611 0.9560494 0.38308860 0.143926497
## Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
## Cumulative Proportion 0.7296245 0.9581321 0.99482129 1.000000000
loadings(fit) # pc loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Sepal.Length 0.521 -0.377 0.720 0.261
## Sepal.Width -0.269 -0.923 -0.244 -0.124
## Petal.Length 0.580 -0.142 -0.801
## Petal.Width 0.565 -0.634 0.524
##
## Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings 1.00 1.00 1.00 1.00
## Proportion Var 0.25 0.25 0.25 0.25
## Cumulative Var 0.25 0.50 0.75 1.00
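Note that the blank cells in the loadings table are not zeros; the print method for loadings hides values below a cutoff of 0.1 by default. To display everything, lower the cutoff:

```r
print(loadings(fit), cutoff = 0)  # show all loadings, including small ones
```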
plot(fit,type="lines") # scree plot
Let’s see how the 4 variables “load” onto the first 2 principal components.
plot(fit[["loadings"]],   # plots Comp.1 vs Comp.2 loadings
     xlim=c(-1,1), ylim=c(-1,1))
abline(h=0, col="red")
abline(v=0, col="red")
text(fit[["loadings"]],
     c("Sepal.Length","Sepal.Width",
       "Petal.Length","Petal.Width"))
We can also plot the “scores” on the first 2 principal components for all 150 samples. These “scores” are the per-sample values of the linear combinations of the original variables (i.e. computed using the weights or loadings listed above).
head(fit$scores) # first 6 rows of the sample scores on all 4 principal components
Comp.1 | Comp.2 | Comp.3 | Comp.4 |
---|---|---|---|
-2.264703 | -0.4800266 | 0.1277060 | 0.0241682 |
-2.080961 | 0.6741336 | 0.2346089 | 0.1030068 |
-2.364229 | 0.3419080 | -0.0442015 | 0.0283771 |
-2.299384 | 0.5973945 | -0.0912901 | -0.0659556 |
-2.389842 | -0.6468354 | -0.0157382 | -0.0359228 |
-2.075631 | -1.4891775 | -0.0269683 | 0.0066082 |
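To convince yourself that a score really is just a weighted sum of the standardized variables, you can rebuild the scores by hand. One detail: with `cor=TRUE`, `princomp()` standardizes using the population standard deviation (divisor n rather than n - 1). The names `sdPop`, `z`, and `manualScores` below are just for illustration:

```r
n <- nrow(iris)
sdPop <- apply(iris[, 1:4], 2, sd) * sqrt((n - 1) / n)  # divisor-n SDs
z <- scale(iris[, 1:4], center = TRUE, scale = sdPop)   # standardize
manualScores <- z %*% fit$loadings                      # linear combinations
head(manualScores)  # should match head(fit$scores) above
```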
biplot(fit) # plot samples on PC1 and PC2
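A related view, like the biplot but without the variable arrows, is to plot the PC1/PC2 scores colored by the known species (the variance percentages come from the `summary(fit)` output above):

```r
plot(fit$scores[, 1:2], col = iris$Species,
     xlab = "PC1 (73% of variance)", ylab = "PC2 (23% of variance)")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 1)
```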
GGobi
There is a really good visualization tool/software called “GGobi”. You can learn more at http://www.ggobi.org/.
If you would like to install “GGobi”, see the instructions at http://www.ggobi.org/downloads/.
Once you get everything installed, then install the `rggobi` R package.
Here is some code to get you started. You can run this code from an R script inside RStudio to see how it works interactively.
library(rggobi)
# View iris dataset - use KNOWN species to color
g <- ggobi(iris)
glyph_colour(g[1]) <- as.numeric(iris$Species)
# View iris dataset - use Cluster Analysis
# unsupervised clusters to color
g <- ggobi(iris)
clustering <- hclust(dist(iris[,1:4]),
method="average")
glyph_colour(g[1]) <- cutree(clustering, 3)