Introduction
Objective
Objective: grouping of observations into clusters, so that
- similar observations appear in the same cluster
- dissimilar observations appear in distinct clusters
\(\longrightarrow\) need for a measure for similarity and dissimilarity?
Example 1
Single cell transcriptomics: \(n \times p\) Matrix for which
every column contains the expression levels of one of \(p\) genes for \(n\) cells
every row contains the expression levels of \(p\) genes for one cell (sample)
Research question: look for groups of cells that have similar gene expression patterns
Or, look for groups of genes that have similar expression levels across the different cells. This can help us in understanding the regulation and functionality of the genes.
\(\longrightarrow\) both observations (rows) and variables (columns) can be clustered
Example 2.
Abundance studies: the abundances of \(n\) plant species are counted on \(p\) plots (habitats)
- look for groups that contain species that live in the same habitats, or, look for groups of habitats that have similar species communities
\(\longrightarrow\) both observations (rows) and variables (columns) can be clustered
Hierarchical Cluster Analysis: Agnes
General Algorithm
In step 0 the intercluster dissimilarity coincides with the dissimilarity between the corresponding observations
\(\rightarrow\) intercluster dissimilarity?
Intercluster Dissimilarities
Represent clusters (e.g. \(C_1\) and \(C_2\)) as sets of points \(\mathbf{x}_i\) which belong to that cluster
\(d(C_1,C_2)\): intercluster dissimilarity between
We consider three intercluster dissimilarities.
Single Linkage = Nearest Neighbour
\[
d(C_1,C_2) = \min_{\mathbf{x}_1 \in C_1; \mathbf{x}_2 \in C_2}
d(\mathbf{x}_1,\mathbf{x}_2) ,
\]
i.e. the dissimilarity between \(C_1\) and \(C_2\) is determined by the smallest dissimilarity between a point of \(C_1\) and a point of \(C_2\).
Complete Linkage = Furthest Neighbour
\[
d(C_1,C_2) = \max_{\mathbf{x}_1 \in C_1; \mathbf{x}_2 \in C_2}
d(\mathbf{x}_1,\mathbf{x}_2) ,
\] i.e. the dissimilarity between \(C_1\) and \(C_2\) is determined by the largest dissimilarity between a point of \(C_1\) and a point of \(C_2\).
Average Linkage = Group Average
\[
d(C_1,C_2) = \frac{1}{\vert C_1 \vert \vert C_2 \vert}
\sum_{\mathbf{x}_1 \in C_1; \mathbf{x}_2 \in C_2}
d(\mathbf{x}_1,\mathbf{x}_2) ,
\] i.e. the dissimilarity between \(C_1\) and \(C_2\) is determined by the average dissimilarity between all points of \(C_1\) and all points of \(C_2\).
Cluster Tree
Hierarchical nature of the algorithm:
Nested sequence of clusters \(\longrightarrow\) visualisation via a tree
height of branches indicate the intercluster dissimilarity at which clusters are merged.
Can used as instrument for deciding the number of clusters in the data
Toy example
1.50 |
2.40 |
1 |
2.00 |
2.50 |
2 |
2.50 |
2.25 |
3 |
2.00 |
3.00 |
4 |
2.25 |
3.20 |
5 |
library(cluster)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.5 ✔ dplyr 1.0.7
## ✔ tidyr 1.1.4 ✔ stringr 1.4.0
## ✔ readr 2.0.1 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
toy %>%
ggplot(aes(X1, X2, label = label)) +
geom_point() +
geom_text(nudge_x = .05)
## 1 2 3 4
## 2 0.5099020
## 3 1.0111874 0.5590170
## 4 0.7810250 0.5000000 0.9013878
## 5 1.0965856 0.7433034 0.9823441 0.3201562
Single linkage
toySingle <- agnes(toy[,1:2], method = "single")
par(mfrow=c(1,2),pty="s")
plot(X2 ~ X1, toy, xlim = c(1.25,2.75),ylim = c(2,3.5))
text(toy$X1*1.05,toy$X2,label=toy$label)
plot(toySingle, which.plot = 2, main = "Single")
## 1 2 3 4
## 2 0.5099020
## 3 1.0111874 0.5590170
## 4 0.7810250 0.5000000 0.9013878
## 5 1.0965856 0.7433034 0.9823441 0.3201562
Complete linkage
toyComplete <- agnes(toy[,1:2], method = "complete")
par(mfrow=c(1,2),pty="s")
plot(X2 ~ X1, toy, xlim = c(1.25,2.75),ylim = c(2,3.5))
text(toy$X1*1.05,toy$X2,label=toy$label)
plot(toyComplete, which.plot = 2, main = "Complete")
## 1 2 3 4
## 2 0.5099020
## 3 1.0111874 0.5590170
## 4 0.7810250 0.5000000 0.9013878
## 5 1.0965856 0.7433034 0.9823441 0.3201562
Average linkage
toyAvg <- agnes(toy[,1:2], method = "average")
par(mfrow=c(1,2),pty="s")
plot(X2 ~ X1, toy, xlim = c(1.25,2.75),ylim = c(2,3.5))
text(toy$X1*1.05,toy$X2,label=toy$label)
plot(toyAvg, which.plot = 2, main = "Average")
## 1 2 3 4
## 2 0.5099020
## 3 1.0111874 0.5590170
## 4 0.7810250 0.5000000 0.9013878
## 5 1.0965856 0.7433034 0.9823441 0.3201562
LS0tCnRpdGxlOiAiSW50cm9kdWN0aW9uIHRvIEhpZXJhcmNoaWNhbCBDbHVzdGVyaW5nIgphdXRob3I6ICJMaWV2ZW4gQ2xlbWVudCIKb3V0cHV0OgogIGh0bWxfZG9jdW1lbnQ6CiAgICBkZl9wcmludDogcGFnZWQKICBwZGZfZG9jdW1lbnQ6CiAgICB0b2M6IHRydWUKICAgIG51bWJlcl9zZWN0aW9uczogdHJ1ZQotLS0KCiMgSW50cm9kdWN0aW9uCiMjIE9iamVjdGl2ZQoKT2JqZWN0aXZlOiBncm91cGluZyBvZiBvYnNlcnZhdGlvbnMgaW50byAqKmNsdXN0ZXJzKiosIHNvIHRoYXQKCi0gc2ltaWxhciBvYnNlcnZhdGlvbnMgYXBwZWFyIGluIHRoZSBzYW1lIGNsdXN0ZXIKLSBkaXNzaW1pbGFyIG9ic2VydmF0aW9ucyBhcHBlYXIgaW4gZGlzdGluY3QgY2x1c3RlcnMKCiRcbG9uZ3JpZ2h0YXJyb3ckIG5lZWQgZm9yIGEgbWVhc3VyZSBmb3IgKipzaW1pbGFyaXR5KiogYW5kICoqZGlzc2ltaWxhcml0eSoqPwoKCiMjIEV4YW1wbGUgMQoKU2luZ2xlIGNlbGwgdHJhbnNjcmlwdG9taWNzOiAgJG4gXHRpbWVzIHAkIE1hdHJpeCBmb3Igd2hpY2gKCiAgLSBldmVyeSBjb2x1bW4gY29udGFpbnMgdGhlIGV4cHJlc3Npb24gbGV2ZWxzIG9mIG9uZSBvZiAkcCQgZ2VuZXMgZm9yICRuJCBjZWxscwogIAogIC0gZXZlcnkgcm93IGNvbnRhaW5zIHRoZSBleHByZXNzaW9uIGxldmVscyBvZiAkcCQgZ2VuZXMgZm9yIG9uZSBjZWxsICgqKnNhbXBsZSoqKQoKICAtIFJlc2VhcmNoIHF1ZXN0aW9uOiBsb29rIGZvciBncm91cHMgb2YgY2VsbHMgdGhhdCBoYXZlIHNpbWlsYXIgZ2VuZSBleHByZXNzaW9uIHBhdHRlcm5zCgogLSBPciwgbG9vayBmb3IgZ3JvdXBzIG9mIGdlbmVzIHRoYXQgaGF2ZSBzaW1pbGFyIGV4cHJlc3Npb24gbGV2ZWxzIGFjcm9zcyB0aGUgZGlmZmVyZW50IGNlbGxzLiBUaGlzIGNhbgogaGVscCB1cyBpbiB1bmRlcnN0YW5kaW5nIHRoZSByZWd1bGF0aW9uIGFuZCBmdW5jdGlvbmFsaXR5IG9mIHRoZSBnZW5lcy4KCiAkXGxvbmdyaWdodGFycm93JCBib3RoICoqb2JzZXJ2YXRpb25zKiogKHJvd3MpIGFuZCAqKnZhcmlhYmxlcyoqIChjb2x1bW5zKSBjYW4gYmUgY2x1c3RlcmVkCgoKIyMgRXhhbXBsZSAyLgoKQWJ1bmRhbmNlIHN0dWRpZXM6IHRoZSBhYnVuZGFuY2VzIG9mICRuJCBwbGFudCBzcGVjaWVzIGFyZSBjb3VudGVkIG9uICRwJCBwbG90cyAoaGFiaXRhdHMpCgogIC0gbG9vayBmb3IgZ3JvdXBzIHRoYXQgY29udGFpbiBzcGVjaWVzIHRoYXQgbGl2ZSBpbiB0aGUgc2FtZSBoYWJpdGF0cywgb3IsIGxvb2sgZm9yIGdyb3VwcyBvZgogaGFiaXRhdHMgdGhhdCBoYXZlIHNpbWlsYXIgc3BlY2llcyBjb21tdW5pdGllcwoKJFxsb25ncmlnaHRhcnJvdyQgYm90aCAqKm9ic2VydmF0aW9ucyoqIChyb3dzKSBhbmQgKip2YXJpYWJsZXMqKiAoY29sdW1ucykgY2FuIGJlIGNsdXN0ZXJlZAoKCiMgSGllcmFyY2hpY2FsIENsdXN0ZXIgQW5hbHlzaXM6IEFnbmVzCgojIyBHZW5lcmFsIEFsZ29yaXRobQoKLSBJbiBzdGVwIDAgZWFjaCBvYnNlcnZhdGlvbnMgaXMgY29uc2lkZXJlZCBhcyBhIGNsdXN0ZXIgKGkuZS4gJG4kIGNsdXN0ZXJzKS4KCi0gRXZlcnkgbmV4dCBzdGVwIGNvbnNpc3RzIG9mOgoKICAgMS4gbWVyZ2UgdGhlIHR3byBjbHVzdGVycyB3aXRoIHRoZSBzbWFsbGVzdCBpbnRlcmNsdXN0ZXIgZGlzc2ltaWxhcml0eQogICAyLiByZWNhbGN1bGF0ZSB0aGUgaW50ZXJjbHVzdGVyIGRpc3NpbWlsYXJpdGllcwoKSW4gc3RlcCAwIHRoZSBpbnRlcmNsdXN0ZXIgZGlzc2ltaWxhcml0eSBjb2luY2lkZXMgd2l0aCB0aGUgZGlzc2ltaWxhcml0eSBiZXR3ZWVuIHRoZSBjb3JyZXNwb25kaW5nIG9ic2VydmF0aW9ucyAKCiRccmlnaHRhcnJvdyQgaW50ZXJjbHVzdGVyIGRpc3NpbWlsYXJpdHk/CgojIyBJbnRlcmNsdXN0ZXIgRGlzc2ltaWxhcml0aWVzCgotIFJlcHJlc2VudCBjbHVzdGVycyAoZS5nLiAkQ18xJCBhbmQgJENfMiQpCiAgIGFzIHNldHMgb2YgcG9pbnRzICRcbWF0aGJme3h9X2kkIHdoaWNoIGJlbG9uZyB0byB0aGF0IGNsdXN0ZXIKCi0gJGQoQ18xLENfMikkOiBpbnRlcmNsdXN0ZXIgZGlzc2ltaWxhcml0eSBiZXR3ZWVuCgpXZSBjb25zaWRlciB0aHJlZSBpbnRlcmNsdXN0ZXIgZGlzc2ltaWxhcml0aWVzLgoKIyMjIFNpbmdsZSBMaW5rYWdlID0gTmVhcmVzdCBOZWlnaGJvdXIKClxbCiAgZChDXzEsQ18yKSA9IFxtaW5fe1xtYXRoYmZ7eH1fMSBcaW4gQ18xOyBcbWF0aGJme3h9XzIgXGluIENfMn0KICBkKFxtYXRoYmZ7eH1fMSxcbWF0aGJme3h9XzIpICwKXF0KCmkuZS4gdGhlIGRpc3NpbWlsYXJpdHkgYmV0d2VlbiAkQ18xJCBhbmQgJENfMiQgaXMgZGV0ZXJtaW5lZCBieSB0aGUgc21hbGxlc3QgZGlzc2ltaWxhcml0eSBiZXR3ZWVuIGEgcG9pbnQgb2YgJENfMSQgYW5kIGEgcG9pbnQgb2YgJENfMiQuCgpgYGB7ciwgZWNobz1GQUxTRSwgb3V0LndpZHRoPSc3MCUnfQprbml0cjo6aW5jbHVkZV9ncmFwaGljcygiLi9maWd1cmVzL2hjbHVzdE5lYXJlc3QucG5nIikKYGBgCgojIyMgQ29tcGxldGUgTGlua2FnZSA9IEZ1cnRoZXN0IE5laWdoYm91cgogICBcWwogICAgZChDXzEsQ18yKSA9IFxtYXhfe1xtYXRoYmZ7eH1fMSBcaW4gQ18xOyBcbWF0aGJme3h9XzIgXGluIENfMn0KICAgIGQoXG1hdGhiZnt4fV8xLFxtYXRoYmZ7eH1fMikgLAogICBcXQogICBpLmUuIHRoZSBkaXNzaW1pbGFyaXR5IGJldHdlZW4gJENfMSQgYW5kICRDXzIkIGlzIGRldGVybWluZWQgYnkgdGhlIGxhcmdlc3QgZGlzc2ltaWxhcml0eSBiZXR3ZWVuIGEgcG9pbnQgb2YgJENfMSQgYW5kIGEKICAgcG9pbnQgb2YgJENfMiQuCgoKYGBge3IsIGVjaG89RkFMU0UsIG91dC53aWR0aD0nNzAlJ30Ka25pdHI6OmluY2x1ZGVfZ3JhcGhpY3MoIi4vZmlndXJlcy9oY2x1c3RGdXJ0aGVzdC5wbmciKQpgYGAKCiMjIyBBdmVyYWdlIExpbmthZ2UgPSBHcm91cCBBdmVyYWdlCgogICBcWwogICAgZChDXzEsQ18yKSA9IFxmcmFjezF9e1x2ZXJ0IENfMSBcdmVydCBcdmVydCBDXzIgXHZlcnR9CiAgICBcc3VtX3tcbWF0aGJme3h9XzEgXGluIENfMTsgXG1hdGhiZnt4fV8yIFxpbiBDXzJ9CiAgICBkKFxtYXRoYmZ7eH1fMSxcbWF0aGJme3h9XzIpICwKICAgXF0KICAgaS5lLiB0aGUgZGlzc2ltaWxhcml0eSBiZXR3ZWVuICRDXzEkIGFuZCAkQ18yJCBpcyBkZXRlcm1pbmVkIGJ5IHRoZSBhdmVyYWdlIGRpc3NpbWlsYXJpdHkgYmV0d2VlbiBhbGwgcG9pbnRzIG9mICRDXzEkIGFuZCBhbGwKICAgcG9pbnRzIG9mICRDXzIkLgoKYGBge3IsIGVjaG89RkFMU0UsIG91dC53aWR0aD0nNzAlJ30Ka25pdHI6OmluY2x1ZGVfZ3JhcGhpY3MoIi4vZmlndXJlcy9oY2x1c3RBdmVyYWdlLnBuZyIpCmBgYAoKCiMjIENsdXN0ZXIgVHJlZQoKSGllcmFyY2hpY2FsIG5hdHVyZSBvZiB0aGUgYWxnb3JpdGhtOgoKLSBOZXN0ZWQgc2VxdWVuY2Ugb2YgY2x1c3RlcnMgJFxsb25ncmlnaHRhcnJvdyQgdmlzdWFsaXNhdGlvbiB2aWEgYSB0cmVlICAKCgotIGhlaWdodCBvZiBicmFuY2hlcyBpbmRpY2F0ZSB0aGUgaW50ZXJjbHVzdGVyIGRpc3NpbWlsYXJpdHkgYXQgd2hpY2ggY2x1c3RlcnMgYXJlIG1lcmdlZC4KCi0gQ2FuIHVzZWQgYXMgaW5zdHJ1bWVudCBmb3IgZGVjaWRpbmcgdGhlIG51bWJlciBvZiBjbHVzdGVycyBpbiB0aGUgZGF0YQoKCiMgVG95IGV4YW1wbGUKCgpgYGB7ciBlY2hvID0gRkFMU0V9CnRveSA8LSBkYXRhLmZyYW1lKAogIFgxID0gYygxLjUwLAogICAgICAgICAyLjAwLAogICAgICAgICAyLjUwLAogICAgICAgICAyLjAwLAogICAgICAgICAyLjI1KSwKICBYMiA9IGMoMi40MCwKICAgICAgICAgMi41MCwKICAgICAgICAgMi4yNSwKICAgICAgICAgMy4wMCwKICAgICAgICAgMy4yMCksCiAgbGFiZWwgPSAxOjUKICApCgprbml0cjo6a2FibGUodG95KQpgYGAKCmBgYHtyfQpsaWJyYXJ5KGNsdXN0ZXIpCmxpYnJhcnkodGlkeXZlcnNlKQoKdG95ICU+JQogIGdncGxvdChhZXMoWDEsIFgyLCBsYWJlbCA9IGxhYmVsKSkgKwogIGdlb21fcG9pbnQoKSArCiAgZ2VvbV90ZXh0KG51ZGdlX3ggPSAuMDUpCgp0b3lbLDE6Ml0gJT4lIGRpc3QKYGBgCgojIyBTaW5nbGUgbGlua2FnZQoKYGBge3J9CnRveVNpbmdsZSA8LSBhZ25lcyh0b3lbLDE6Ml0sIG1ldGhvZCA9ICJzaW5nbGUiKQpwYXIobWZyb3c9YygxLDIpLHB0eT0icyIpCnBsb3QoWDIgfiBYMSwgdG95LCB4bGltID0gYygxLjI1LDIuNzUpLHlsaW0gPSBjKDIsMy41KSkKdGV4dCh0b3kkWDEqMS4wNSx0b3kkWDIsbGFiZWw9dG95JGxhYmVsKQpwbG90KHRveVNpbmdsZSwgd2hpY2gucGxvdCA9IDIsIG1haW4gPSAiU2luZ2xlIikKdG95WywxOjJdICU+JSBkaXN0CmBgYAoKIyMgQ29tcGxldGUgbGlua2FnZQoKYGBge3J9CnRveUNvbXBsZXRlIDwtIGFnbmVzKHRveVssMToyXSwgbWV0aG9kID0gImNvbXBsZXRlIikKcGFyKG1mcm93PWMoMSwyKSxwdHk9InMiKQpwbG90KFgyIH4gWDEsIHRveSwgeGxpbSA9IGMoMS4yNSwyLjc1KSx5bGltID0gYygyLDMuNSkpCnRleHQodG95JFgxKjEuMDUsdG95JFgyLGxhYmVsPXRveSRsYWJlbCkKcGxvdCh0b3lDb21wbGV0ZSwgd2hpY2gucGxvdCA9IDIsIG1haW4gPSAiQ29tcGxldGUiKQp0b3lbLDE6Ml0gJT4lIGRpc3QKYGBgCgojIyBBdmVyYWdlIGxpbmthZ2UKCmBgYHtyfQp0b3lBdmcgPC0gYWduZXModG95WywxOjJdLCBtZXRob2QgPSAiYXZlcmFnZSIpCnBhcihtZnJvdz1jKDEsMikscHR5PSJzIikKcGxvdChYMiB+IFgxLCB0b3ksIHhsaW0gPSBjKDEuMjUsMi43NSkseWxpbSA9IGMoMiwzLjUpKQp0ZXh0KHRveSRYMSoxLjA1LHRveSRYMixsYWJlbD10b3kkbGFiZWwpCnBsb3QodG95QXZnLCB3aGljaC5wbG90ID0gMiwgbWFpbiA9ICJBdmVyYWdlIikKdG95WywxOjJdICU+JSBkaXN0CmBgYAo=