Introduction
 Objective
Objective: grouping of observations into clusters, so that
- similar observations appear in the same cluster
- dissimilar observations appear in distinct clusters
\(\longrightarrow\) need for a measure for similarity and dissimilarity?
 
 Example 1
Single cell transcriptomics: \(n \times p\) Matrix for which
- every column contains the expression levels of one of \(p\) genes for \(n\) cells 
- every row contains the expression levels of \(p\) genes for one cell (sample) 
- Research question: look for groups of cells that have similar gene expression patterns 
- Or, look for groups of genes that have similar expression levels across the different cells. This can help us in understanding the regulation and functionality of the genes. 
\(\longrightarrow\) both observations (rows) and variables (columns) can be clustered
 
 Example 2.
Abundance studies: the abundances of \(n\) plant species are counted on \(p\) plots (habitats)
- look for groups that contain species that live in the same habitats, or, look for groups of habitats that have similar species communities
\(\longrightarrow\) both observations (rows) and variables (columns) can be clustered
 
 
 Hierarchical Cluster Analysis: Agnes
 General Algorithm
In step 0 the intercluster dissimilarity coincides with the dissimilarity between the corresponding observations
\(\rightarrow\) intercluster dissimilarity?
 
 Intercluster Dissimilarities
- Represent clusters (e.g. \(C_1\) and \(C_2\)) as sets of points \(\mathbf{x}_i\) which belong to that cluster 
- \(d(C_1,C_2)\): intercluster dissimilarity between 
We consider three intercluster dissimilarities.
 Single Linkage = Nearest Neighbour
\[
  d(C_1,C_2) = \min_{\mathbf{x}_1 \in C_1; \mathbf{x}_2 \in C_2}
  d(\mathbf{x}_1,\mathbf{x}_2) ,
\]
i.e. the dissimilarity between \(C_1\) and \(C_2\) is determined by the smallest dissimilarity between a point of \(C_1\) and a point of \(C_2\).

 
 Complete Linkage = Furthest Neighbour
\[
    d(C_1,C_2) = \max_{\mathbf{x}_1 \in C_1; \mathbf{x}_2 \in C_2}
    d(\mathbf{x}_1,\mathbf{x}_2) ,
   \] i.e. the dissimilarity between \(C_1\) and \(C_2\) is determined by the largest dissimilarity between a point of \(C_1\) and a point of \(C_2\).

 
 Average Linkage = Group Average
\[
    d(C_1,C_2) = \frac{1}{\vert C_1 \vert \vert C_2 \vert}
    \sum_{\mathbf{x}_1 \in C_1; \mathbf{x}_2 \in C_2}
    d(\mathbf{x}_1,\mathbf{x}_2) ,
   \] i.e. the dissimilarity between \(C_1\) and \(C_2\) is determined by the average dissimilarity between all points of \(C_1\) and all points of \(C_2\).

 
 
 Cluster Tree
Hierarchical nature of the algorithm:
- Nested sequence of clusters \(\longrightarrow\) visualisation via a tree 
- height of branches indicate the intercluster dissimilarity at which clusters are merged. 
- Can used as instrument for deciding the number of clusters in the data 
 
 
 Toy example
| 1.50 | 2.40 | 1 | 
| 2.00 | 2.50 | 2 | 
| 2.50 | 2.25 | 3 | 
| 2.00 | 3.00 | 4 | 
| 2.25 | 3.20 | 5 | 
library(cluster)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.5     ✔ dplyr   1.0.7
## ✔ tidyr   1.1.4     ✔ stringr 1.4.0
## ✔ readr   2.0.1     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
toy %>%
  ggplot(aes(X1, X2, label = label)) +
  geom_point() +
  geom_text(nudge_x = .05)

##           1         2         3         4
## 2 0.5099020                              
## 3 1.0111874 0.5590170                    
## 4 0.7810250 0.5000000 0.9013878          
## 5 1.0965856 0.7433034 0.9823441 0.3201562
 Single linkage
toySingle <- agnes(toy[,1:2], method = "single")
par(mfrow=c(1,2),pty="s")
plot(X2 ~ X1, toy, xlim = c(1.25,2.75),ylim = c(2,3.5))
text(toy$X1*1.05,toy$X2,label=toy$label)
plot(toySingle, which.plot = 2, main = "Single")

##           1         2         3         4
## 2 0.5099020                              
## 3 1.0111874 0.5590170                    
## 4 0.7810250 0.5000000 0.9013878          
## 5 1.0965856 0.7433034 0.9823441 0.3201562
 
 Complete linkage
toyComplete <- agnes(toy[,1:2], method = "complete")
par(mfrow=c(1,2),pty="s")
plot(X2 ~ X1, toy, xlim = c(1.25,2.75),ylim = c(2,3.5))
text(toy$X1*1.05,toy$X2,label=toy$label)
plot(toyComplete, which.plot = 2, main = "Complete")

##           1         2         3         4
## 2 0.5099020                              
## 3 1.0111874 0.5590170                    
## 4 0.7810250 0.5000000 0.9013878          
## 5 1.0965856 0.7433034 0.9823441 0.3201562
 
 Average linkage
toyAvg <- agnes(toy[,1:2], method = "average")
par(mfrow=c(1,2),pty="s")
plot(X2 ~ X1, toy, xlim = c(1.25,2.75),ylim = c(2,3.5))
text(toy$X1*1.05,toy$X2,label=toy$label)
plot(toyAvg, which.plot = 2, main = "Average")

##           1         2         3         4
## 2 0.5099020                              
## 3 1.0111874 0.5590170                    
## 4 0.7810250 0.5000000 0.9013878          
## 5 1.0965856 0.7433034 0.9823441 0.3201562
 
 
LS0tCnRpdGxlOiAiSW50cm9kdWN0aW9uIHRvIEhpZXJhcmNoaWNhbCBDbHVzdGVyaW5nIgphdXRob3I6ICJMaWV2ZW4gQ2xlbWVudCIKb3V0cHV0OgogIGh0bWxfZG9jdW1lbnQ6CiAgICBkZl9wcmludDogcGFnZWQKICBwZGZfZG9jdW1lbnQ6CiAgICB0b2M6IHRydWUKICAgIG51bWJlcl9zZWN0aW9uczogdHJ1ZQotLS0KCiMgSW50cm9kdWN0aW9uCiMjIE9iamVjdGl2ZQoKT2JqZWN0aXZlOiBncm91cGluZyBvZiBvYnNlcnZhdGlvbnMgaW50byAqKmNsdXN0ZXJzKiosIHNvIHRoYXQKCi0gc2ltaWxhciBvYnNlcnZhdGlvbnMgYXBwZWFyIGluIHRoZSBzYW1lIGNsdXN0ZXIKLSBkaXNzaW1pbGFyIG9ic2VydmF0aW9ucyBhcHBlYXIgaW4gZGlzdGluY3QgY2x1c3RlcnMKCiRcbG9uZ3JpZ2h0YXJyb3ckIG5lZWQgZm9yIGEgbWVhc3VyZSBmb3IgKipzaW1pbGFyaXR5KiogYW5kICoqZGlzc2ltaWxhcml0eSoqPwoKCiMjIEV4YW1wbGUgMQoKU2luZ2xlIGNlbGwgdHJhbnNjcmlwdG9taWNzOiAgJG4gXHRpbWVzIHAkIE1hdHJpeCBmb3Igd2hpY2gKCiAgLSBldmVyeSBjb2x1bW4gY29udGFpbnMgdGhlIGV4cHJlc3Npb24gbGV2ZWxzIG9mIG9uZSBvZiAkcCQgZ2VuZXMgZm9yICRuJCBjZWxscwogIAogIC0gZXZlcnkgcm93IGNvbnRhaW5zIHRoZSBleHByZXNzaW9uIGxldmVscyBvZiAkcCQgZ2VuZXMgZm9yIG9uZSBjZWxsICgqKnNhbXBsZSoqKQoKICAtIFJlc2VhcmNoIHF1ZXN0aW9uOiBsb29rIGZvciBncm91cHMgb2YgY2VsbHMgdGhhdCBoYXZlIHNpbWlsYXIgZ2VuZSBleHByZXNzaW9uIHBhdHRlcm5zCgogLSBPciwgbG9vayBmb3IgZ3JvdXBzIG9mIGdlbmVzIHRoYXQgaGF2ZSBzaW1pbGFyIGV4cHJlc3Npb24gbGV2ZWxzIGFjcm9zcyB0aGUgZGlmZmVyZW50IGNlbGxzLiBUaGlzIGNhbgogaGVscCB1cyBpbiB1bmRlcnN0YW5kaW5nIHRoZSByZWd1bGF0aW9uIGFuZCBmdW5jdGlvbmFsaXR5IG9mIHRoZSBnZW5lcy4KCiAkXGxvbmdyaWdodGFycm93JCBib3RoICoqb2JzZXJ2YXRpb25zKiogKHJvd3MpIGFuZCAqKnZhcmlhYmxlcyoqIChjb2x1bW5zKSBjYW4gYmUgY2x1c3RlcmVkCgoKIyMgRXhhbXBsZSAyLgoKQWJ1bmRhbmNlIHN0dWRpZXM6IHRoZSBhYnVuZGFuY2VzIG9mICRuJCBwbGFudCBzcGVjaWVzIGFyZSBjb3VudGVkIG9uICRwJCBwbG90cyAoaGFiaXRhdHMpCgogIC0gbG9vayBmb3IgZ3JvdXBzIHRoYXQgY29udGFpbiBzcGVjaWVzIHRoYXQgbGl2ZSBpbiB0aGUgc2FtZSBoYWJpdGF0cywgb3IsIGxvb2sgZm9yIGdyb3VwcyBvZgogaGFiaXRhdHMgdGhhdCBoYXZlIHNpbWlsYXIgc3BlY2llcyBjb21tdW5pdGllcwoKJFxsb25ncmlnaHRhcnJvdyQgYm90aCAqKm9ic2VydmF0aW9ucyoqIChyb3dzKSBhbmQgKip2YXJpYWJsZXMqKiAoY29sdW1ucykgY2FuIGJlIGNsdXN0ZXJlZAoKCiMgSGllcmFyY2hpY2FsIENsdXN0ZXIgQW5hbHlzaXM6IEFnbmVzCgojIyBHZW5lcmFsIEFsZ29yaXRobQoKLSBJbiBzdGVwIDAgZWFjaCBvYnNlcnZhdGlvbnMgaXMgY29uc2lkZXJlZCBhcyBhIGNsdXN0ZXIgKGkuZS4gJG4kIGNsdXN0ZXJzKS4KCi0gRXZlcnkgbmV4dCBzdGVwIGNvbnNpc3RzIG9mOgoKICAgMS4gbWVyZ2UgdGhlIHR3byBjbHVzdGVycyB3aXRoIHRoZSBzbWFsbGVzdCBpbnRlcmNsdXN0ZXIgZGlzc2ltaWxhcml0eQogICAyLiByZWNhbGN1bGF0ZSB0aGUgaW50ZXJjbHVzdGVyIGRpc3NpbWlsYXJpdGllcwoKSW4gc3RlcCAwIHRoZSBpbnRlcmNsdXN0ZXIgZGlzc2ltaWxhcml0eSBjb2luY2lkZXMgd2l0aCB0aGUgZGlzc2ltaWxhcml0eSBiZXR3ZWVuIHRoZSBjb3JyZXNwb25kaW5nIG9ic2VydmF0aW9ucyAKCiRccmlnaHRhcnJvdyQgaW50ZXJjbHVzdGVyIGRpc3NpbWlsYXJpdHk/CgojIyBJbnRlcmNsdXN0ZXIgRGlzc2ltaWxhcml0aWVzCgotIFJlcHJlc2VudCBjbHVzdGVycyAoZS5nLiAkQ18xJCBhbmQgJENfMiQpCiAgIGFzIHNldHMgb2YgcG9pbnRzICRcbWF0aGJme3h9X2kkIHdoaWNoIGJlbG9uZyB0byB0aGF0IGNsdXN0ZXIKCi0gJGQoQ18xLENfMikkOiBpbnRlcmNsdXN0ZXIgZGlzc2ltaWxhcml0eSBiZXR3ZWVuCgpXZSBjb25zaWRlciB0aHJlZSBpbnRlcmNsdXN0ZXIgZGlzc2ltaWxhcml0aWVzLgoKIyMjIFNpbmdsZSBMaW5rYWdlID0gTmVhcmVzdCBOZWlnaGJvdXIKClxbCiAgZChDXzEsQ18yKSA9IFxtaW5fe1xtYXRoYmZ7eH1fMSBcaW4gQ18xOyBcbWF0aGJme3h9XzIgXGluIENfMn0KICBkKFxtYXRoYmZ7eH1fMSxcbWF0aGJme3h9XzIpICwKXF0KCmkuZS4gdGhlIGRpc3NpbWlsYXJpdHkgYmV0d2VlbiAkQ18xJCBhbmQgJENfMiQgaXMgZGV0ZXJtaW5lZCBieSB0aGUgc21hbGxlc3QgZGlzc2ltaWxhcml0eSBiZXR3ZWVuIGEgcG9pbnQgb2YgJENfMSQgYW5kIGEgcG9pbnQgb2YgJENfMiQuCgpgYGB7ciwgZWNobz1GQUxTRSwgb3V0LndpZHRoPSc3MCUnfQprbml0cjo6aW5jbHVkZV9ncmFwaGljcygiLi9maWd1cmVzL2hjbHVzdE5lYXJlc3QucG5nIikKYGBgCgojIyMgQ29tcGxldGUgTGlua2FnZSA9IEZ1cnRoZXN0IE5laWdoYm91cgogICBcWwogICAgZChDXzEsQ18yKSA9IFxtYXhfe1xtYXRoYmZ7eH1fMSBcaW4gQ18xOyBcbWF0aGJme3h9XzIgXGluIENfMn0KICAgIGQoXG1hdGhiZnt4fV8xLFxtYXRoYmZ7eH1fMikgLAogICBcXQogICBpLmUuIHRoZSBkaXNzaW1pbGFyaXR5IGJldHdlZW4gJENfMSQgYW5kICRDXzIkIGlzIGRldGVybWluZWQgYnkgdGhlIGxhcmdlc3QgZGlzc2ltaWxhcml0eSBiZXR3ZWVuIGEgcG9pbnQgb2YgJENfMSQgYW5kIGEKICAgcG9pbnQgb2YgJENfMiQuCgoKYGBge3IsIGVjaG89RkFMU0UsIG91dC53aWR0aD0nNzAlJ30Ka25pdHI6OmluY2x1ZGVfZ3JhcGhpY3MoIi4vZmlndXJlcy9oY2x1c3RGdXJ0aGVzdC5wbmciKQpgYGAKCiMjIyBBdmVyYWdlIExpbmthZ2UgPSBHcm91cCBBdmVyYWdlCgogICBcWwogICAgZChDXzEsQ18yKSA9IFxmcmFjezF9e1x2ZXJ0IENfMSBcdmVydCBcdmVydCBDXzIgXHZlcnR9CiAgICBcc3VtX3tcbWF0aGJme3h9XzEgXGluIENfMTsgXG1hdGhiZnt4fV8yIFxpbiBDXzJ9CiAgICBkKFxtYXRoYmZ7eH1fMSxcbWF0aGJme3h9XzIpICwKICAgXF0KICAgaS5lLiB0aGUgZGlzc2ltaWxhcml0eSBiZXR3ZWVuICRDXzEkIGFuZCAkQ18yJCBpcyBkZXRlcm1pbmVkIGJ5IHRoZSBhdmVyYWdlIGRpc3NpbWlsYXJpdHkgYmV0d2VlbiBhbGwgcG9pbnRzIG9mICRDXzEkIGFuZCBhbGwKICAgcG9pbnRzIG9mICRDXzIkLgoKYGBge3IsIGVjaG89RkFMU0UsIG91dC53aWR0aD0nNzAlJ30Ka25pdHI6OmluY2x1ZGVfZ3JhcGhpY3MoIi4vZmlndXJlcy9oY2x1c3RBdmVyYWdlLnBuZyIpCmBgYAoKCiMjIENsdXN0ZXIgVHJlZQoKSGllcmFyY2hpY2FsIG5hdHVyZSBvZiB0aGUgYWxnb3JpdGhtOgoKLSBOZXN0ZWQgc2VxdWVuY2Ugb2YgY2x1c3RlcnMgJFxsb25ncmlnaHRhcnJvdyQgdmlzdWFsaXNhdGlvbiB2aWEgYSB0cmVlICAKCgotIGhlaWdodCBvZiBicmFuY2hlcyBpbmRpY2F0ZSB0aGUgaW50ZXJjbHVzdGVyIGRpc3NpbWlsYXJpdHkgYXQgd2hpY2ggY2x1c3RlcnMgYXJlIG1lcmdlZC4KCi0gQ2FuIHVzZWQgYXMgaW5zdHJ1bWVudCBmb3IgZGVjaWRpbmcgdGhlIG51bWJlciBvZiBjbHVzdGVycyBpbiB0aGUgZGF0YQoKCiMgVG95IGV4YW1wbGUKCgpgYGB7ciBlY2hvID0gRkFMU0V9CnRveSA8LSBkYXRhLmZyYW1lKAogIFgxID0gYygxLjUwLAogICAgICAgICAyLjAwLAogICAgICAgICAyLjUwLAogICAgICAgICAyLjAwLAogICAgICAgICAyLjI1KSwKICBYMiA9IGMoMi40MCwKICAgICAgICAgMi41MCwKICAgICAgICAgMi4yNSwKICAgICAgICAgMy4wMCwKICAgICAgICAgMy4yMCksCiAgbGFiZWwgPSAxOjUKICApCgprbml0cjo6a2FibGUodG95KQpgYGAKCmBgYHtyfQpsaWJyYXJ5KGNsdXN0ZXIpCmxpYnJhcnkodGlkeXZlcnNlKQoKdG95ICU+JQogIGdncGxvdChhZXMoWDEsIFgyLCBsYWJlbCA9IGxhYmVsKSkgKwogIGdlb21fcG9pbnQoKSArCiAgZ2VvbV90ZXh0KG51ZGdlX3ggPSAuMDUpCgp0b3lbLDE6Ml0gJT4lIGRpc3QKYGBgCgojIyBTaW5nbGUgbGlua2FnZQoKYGBge3J9CnRveVNpbmdsZSA8LSBhZ25lcyh0b3lbLDE6Ml0sIG1ldGhvZCA9ICJzaW5nbGUiKQpwYXIobWZyb3c9YygxLDIpLHB0eT0icyIpCnBsb3QoWDIgfiBYMSwgdG95LCB4bGltID0gYygxLjI1LDIuNzUpLHlsaW0gPSBjKDIsMy41KSkKdGV4dCh0b3kkWDEqMS4wNSx0b3kkWDIsbGFiZWw9dG95JGxhYmVsKQpwbG90KHRveVNpbmdsZSwgd2hpY2gucGxvdCA9IDIsIG1haW4gPSAiU2luZ2xlIikKdG95WywxOjJdICU+JSBkaXN0CmBgYAoKIyMgQ29tcGxldGUgbGlua2FnZQoKYGBge3J9CnRveUNvbXBsZXRlIDwtIGFnbmVzKHRveVssMToyXSwgbWV0aG9kID0gImNvbXBsZXRlIikKcGFyKG1mcm93PWMoMSwyKSxwdHk9InMiKQpwbG90KFgyIH4gWDEsIHRveSwgeGxpbSA9IGMoMS4yNSwyLjc1KSx5bGltID0gYygyLDMuNSkpCnRleHQodG95JFgxKjEuMDUsdG95JFgyLGxhYmVsPXRveSRsYWJlbCkKcGxvdCh0b3lDb21wbGV0ZSwgd2hpY2gucGxvdCA9IDIsIG1haW4gPSAiQ29tcGxldGUiKQp0b3lbLDE6Ml0gJT4lIGRpc3QKYGBgCgojIyBBdmVyYWdlIGxpbmthZ2UKCmBgYHtyfQp0b3lBdmcgPC0gYWduZXModG95WywxOjJdLCBtZXRob2QgPSAiYXZlcmFnZSIpCnBhcihtZnJvdz1jKDEsMikscHR5PSJzIikKcGxvdChYMiB+IFgxLCB0b3ksIHhsaW0gPSBjKDEuMjUsMi43NSkseWxpbSA9IGMoMiwzLjUpKQp0ZXh0KHRveSRYMSoxLjA1LHRveSRYMixsYWJlbD10b3kkbGFiZWwpCnBsb3QodG95QXZnLCB3aGljaC5wbG90ID0gMiwgbWFpbiA9ICJBdmVyYWdlIikKdG95WywxOjJdICU+JSBkaXN0CmBgYAo=