Canonical Correlation Analysis (CCA) is a multivariate data analysis method that aims at finding correlations between two multivariate data sets, \(X\) and \(Y\). The method looks for the linear combination of the \(X\)-variables and the linear combination of the \(Y\)-variables that show maximal correlation. When the number of variables in \(X\) and/or \(Y\) is very large (high-dimensional), the classical CCA method needs to be adapted to deal with the high dimensionality.
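As a hedged sketch of the criterion described above (the notation is an assumption, not taken from the assignment text): write \(S_{XX}\) and \(S_{YY}\) for the sample covariance matrices of \(X\) and \(Y\), \(S_{XY}\) for their cross-covariance matrix (with \(S_{YX} = S_{XY}^\top\)), and \(a\), \(b\) for the weight vectors of the two linear combinations. The first pair of canonical vectors can then be written as

\[
(a_1, b_1) = \arg\max_{a,\,b} \operatorname{cor}(Xa, Yb)
           = \arg\max_{a,\,b} \frac{a^\top S_{XY}\, b}{\sqrt{a^\top S_{XX}\, a}\,\sqrt{b^\top S_{YY}\, b}} .
\]

Assuming \(S_{XX}\) and \(S_{YY}\) are invertible, \(a_1\) is an eigenvector of \(S_{XX}^{-1} S_{XY} S_{YY}^{-1} S_{YX}\) belonging to its largest eigenvalue, and that eigenvalue equals the squared first canonical correlation \(\rho_1^2\). The need for these inverses is what makes the high-dimensional case problematic.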
The aim of this homework assignment is:
to understand the classical CCA method (based on the literature) and a CCA method for high-dimensional data
to implement the CCA method and its high-dimensional version (not using existing R packages or R functions for CCA)
to apply the method to a dataset
You may consult the literature to find a description of the CCA method. Here I give one possible reference (it is a paper about an R package, but remember that you may not use this R package for the implementation):
González, I., Déjean, S., Martin, P. G., & Baccini, A. (2008). CCA: An R package to extend canonical correlation analysis. Journal of Statistical Software, 23(12), 1-14. http://dx.doi.org/10.18637/jss.v023.i12
The paper also describes a regularised CCA method (section 2.4), which is applicable to high-dimensional data. However, there are other high-dimensional CCA methods described in the literature. You are free to choose the regularised CCA from the paper, or any other appropriate high-dimensional CCA method.
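A ridge-type regularisation can be sketched as follows. The notation is assumed here for illustration: \(S_{XX}\) and \(S_{YY}\) are the sample covariance matrices of \(X\) and \(Y\), \(p\) and \(q\) are the numbers of variables in \(X\) and \(Y\), and \(\lambda_1, \lambda_2 \ge 0\) are tuning parameters. The covariance matrices are replaced by

\[
S_{XX}(\lambda_1) = S_{XX} + \lambda_1 I_p, \qquad
S_{YY}(\lambda_2) = S_{YY} + \lambda_2 I_q .
\]

For \(\lambda_1, \lambda_2 > 0\) both matrices are positive definite, and therefore invertible, even when the number of variables exceeds the number of observations. The classical computation can then be carried out with \(S_{XX}(\lambda_1)\) and \(S_{YY}(\lambda_2)\) in place of \(S_{XX}\) and \(S_{YY}\).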
Note that the paper proposes a cross-validation method for selecting, e.g., the tuning parameters in the regularised CCA; you should not implement this. If tuning parameters are involved, you may set them manually to arbitrary values (or experiment with them when analysing the dataset and set them to values that seem appropriate to you – no need to motivate your choice).
You must apply your implemented method to the nutrimouse data, which is part of the CCA R package. More information about the data can be found in the paper. You must only look at the first two dimensions of the CCA, which will allow you to make two-dimensional graphs.
# install the CCA package with
# install.packages("CCA")
library(CCA)
data("nutrimouse")
X <- nutrimouse$gene # the gene expression matrix
dim(X)
#> [1] 40 120
Y <- nutrimouse$lipid # the lipids matrix
dim(Y)
#> [1] 40 21
The assignment can be done in groups of 2.
You should write a report containing the following:
a short (mathematical) description of the CCA methods (classical and high-dimensional) that you have implemented
the application of your method to the nutrimouse data
Classical CCA on multivariate data with \(p < n\). (Hint: it will not be possible to apply the classical CCA method to the full data matrix \(X\). You should subset the data to reflect the case of \(p < n\).)
High-dimensional CCA on data with \(p > n\)
interpretation and conclusion of the data analysis results
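The hint about subsetting can be illustrated with a small sketch. It uses a simulated stand-in with the same shape as the gene expression matrix (40 rows, 120 columns) rather than the nutrimouse data itself, and the choice of the first 10 columns is arbitrary; the names `X_sim`, `X_sub`, `rank_full` and `rank_sub` are hypothetical.

```r
# Simulated stand-in with the same shape as the gene expression matrix
# (40 samples, 120 variables); the values are random, not the nutrimouse data.
set.seed(1)
n <- 40
p <- 120
X_sim <- matrix(rnorm(n * p), nrow = n, ncol = p)

# With p > n the sample covariance matrix has rank at most n - 1,
# so it is singular and cannot be inverted as classical CCA requires.
S_full <- cov(X_sim)
rank_full <- qr(S_full)$rank

# Keeping only the first 10 columns (an arbitrary, illustrative choice)
# gives p < n, and the covariance matrix of the subset can have full rank.
X_sub <- X_sim[, 1:10]
S_sub <- cov(X_sub)
rank_sub <- qr(S_sub)$rank

# Compare the numerical ranks with the matrix dimensions (120 and 10).
c(full = rank_full, subset = rank_sub)
```

The same check can be run on the real `X` after converting it with `as.matrix()`; which and how many variables to keep is left to you.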
The length of the written report (excl. R code, R output and graphs) should be about 2 pages.
It is recommended (but not mandatory) to prepare your report in RMarkdown. You can render it to either HTML (output: html_document) or to PDF (output: pdf_document). In both cases the original .Rmd file should be included when handing in the assignment. If you don’t use RMarkdown, you should include the .R file(s) containing your implementation and analysis scripts.
When submitting, please use the following format:
HW-Name1-Name2.[pdf|html]
HW-Name1-Name2.Rmd (or HW1-Name1-Name2.R)
Submissions should be done through UFora.
The deadline for submission is November 8th.