Challenge: We have to learn from high dimensional data!
We typically observe multiple variables/features (\(p\)) for each subject/experimental unit \(i=1,\ldots,n\), i.e. \[ \mathbf{x}_i^T=[x_{i1},\ldots,x_{ip}]\]
Multivariate statistics have a long history, but were designed for the \(n>>>p\) case.
Nowadays many high throughput technologies generate multivariate data with many variables (large \(p\)) as compared to the number of independent replicates or samples (sample size \(n\)), resulting in {high-dimensional data}, which is characterised by \[ p >>> n. \]
New statistical methods for dealing with the \(p >>> n\) case have been developed in the last 20 years. Some of them are adaptations of multivariate methods.
Issues with high-dimensional data:
Computational problems: large matrices and the numerical accuracy of computers become an issue
Classical asymptotic theory does not hold for \(p\rightarrow\infty\) as \(n \rightarrow \infty\)
Model (or feature) selection requires specialised methods to deal with the enormous number of possible models. (e.g. in linear regression with p potential predictors: \(2^p\) possible models)
Models that potentially depend on a large number (\(p\)) of predictors are prone to severe overfitting.
In searching for associations between an outcome and a large number (\(p\)) of potential explanatory variables, we risk making many false discoveries.
The Curse of Dimensionality may cause a prediction model to become useless. (\(n\) data points become sparse in large-\(p\) dimensional space)
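The sparsity induced by the curse of dimensionality is easy to illustrate with a small simulation. The sketch below is our own illustration (not part of the case study that follows): it draws \(n = 100\) points uniformly in the \(p\)-dimensional unit hypercube and shows that the distance from the first point to its nearest neighbour grows quickly with \(p\).
set.seed(1)
n <- 100
# Distance from the first point to its nearest neighbour,
# for points drawn uniformly in the p-dimensional unit hypercube
sapply(c(2, 10, 100, 1000), function(p) {
  X <- matrix(runif(n * p), nrow = n, ncol = p)
  min(sqrt(colSums((t(X[-1, ]) - X[1, ])^2)))
})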
library(tidyverse)
## Packages to load and visualize the single-cell data
library(ExperimentHub)
library(scater)
#> class: SingleCellExperiment
#> dim: 35635 29065
#> metadata(0):
#> assays(1): counts
#> rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
#> rowData names(2): ENSEMBL SYMBOL
#> colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
#> TTTGCATGTCTTAC-1
#> colData names(5): ind stim cluster cell multiplets
#> reducedDimNames(1): TSNE
#> mainExpName: NULL
#> altExpNames(0):
#> 11 x 12 sparse Matrix of class "dgCMatrix"
#>
#> RP11-467L20.10 . . . . . . . . . . . .
#> MYRF . . . . . . . . . . . .
#> TMEM258 . . 3 . . . 1 1 . . . .
#> FEN1 . . . . . . . . . . . .
#> FADS2 . . . . . . . . . . . .
#> FADS1 . . . . . . . . . . . .
#> FADS3 . . . . . . . . . . . .
#> RAB3IL1 . . . . . . . . . . . .
#> BEST1 1 . . . . . 1 . . . 1 1
#> FTH1 163 8 271 4 5 174 38 58 31 3 663 612
#> AP003733.1 . . . . . . . . . . . .
In single cell analysis it is key to identify cell types in scRNA-seq data sets before in-depth investigations of their functional and pathological roles.
Use models that were built on reference data sets to predict cell types in new data sets based on their gene expression patterns.
Problem: we have 35635 genes at our disposal to build this prediction model!
Other examples
Prediction of mortality risk is used on a daily basis in intensive care units to prioritise patient care.
Facebook: predict the identity of the faces of people that appear in a newly uploaded image.
Netflix: Suggest movies that you would like
\[ \left \{ \begin{array}{lcl} y_{ig} &\sim& NB(\mu_{ig}, \phi_{g})\\ E[y_{ig}] &=& \mu_{ig}\\ \log(\mu_{ig}) &=& \eta_{ig}\\ \eta_{ig} &=& \beta_{0,g} + \sum\limits_{c=2}^C \beta_{c,g} X_{ic} + \beta_{s,g} X_{is} + \sum\limits_{c=2}^C \beta_{c:s,g} X_{ic} X_{is} + \sum\limits_{e=2}^E \beta_{e,g} X_{ie} + \alpha_{ig} \end{array} \right. \]
With
Cell type \(c=2\ldots C\), and \(X_{ic}\) is a dummy variable with \(X_{ic}=1\) if cell \(i\) is of cell type \(c\) and \(X_{ic}=0\) otherwise. Note that cell type \(c=1\) is the reference cell type.
Indicator \(X_{is}\) indicates whether cell \(i\) was stimulated (\(X_{is}=1\)) or not (\(X_{is}=0\)). So the control treatment is the reference treatment.
Experimental unit \(e=2\ldots E\), and \(X_{ie}\) is a dummy variable with \(X_{ie}=1\) if cell \(i\) originates from experimental unit (subject) \(e\) and \(X_{ie}=0\) otherwise. Note that experimental unit \(e=1\) is the reference experimental unit.
\(\alpha_{ig}\) a cell specific and gene specific normalisation factor
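To make the dummy coding concrete, the sketch below builds the fixed-effects part of such a design matrix with `model.matrix()` for a handful of hypothetical cells; the variable names `cell_type`, `stim` and `patient` are made up for illustration and are not those of the case-study object.
# Hypothetical metadata for 6 cells: cell type, stimulation status and patient
cellData <- data.frame(
  cell_type = factor(c("B", "B", "T", "T", "B", "T")),
  stim      = factor(c("ctrl", "stim", "ctrl", "stim", "stim", "ctrl")),
  patient   = factor(c("p1", "p1", "p1", "p2", "p2", "p2"))
)
# Reference levels: cell type B, control treatment, patient p1.
# The cell_type:stim columns correspond to the interaction parameters beta_{c:s,g}.
model.matrix(~ cell_type * stim + patient, data = cellData)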
Suppose we want to test whether the effect of the stimulus (the average difference in expression between stimulated and non-stimulated cells) differs between cell type \(c\) and the reference cell type 1:
\(H_0: \beta_{c:s,g} = 0\)
\(H_1: \beta_{c:s,g} \neq 0\)
We have to assess this for 35635 genes!
If we assess each test at the 5% level we can expect 0.05 * 35635 ≈ 1782 false positives.
\(\rightarrow\) massive multiple testing problem!
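A quick simulation confirms this order of magnitude. This is our own hedged sketch, assuming for simplicity that all 35635 null hypotheses hold and that the tests are independent, so that the p-values are uniform on [0, 1]:
set.seed(25)
nGenes <- 35635
# one p-value per gene under the null; count how many fall below 0.05
pvals <- runif(nGenes)
sum(pvals < 0.05)
# on average close to 0.05 * 35635 = 1781.75 false positives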
Note that we cannot differentiate between batch and treatment because of the flaw in the experimental design!
Consider the following toy dataset with 3 observations \((x, y)\):
\[ y_i=\beta_0+\beta_1 x_i + \epsilon_i \]
If we write the model for each observation:
\[ \begin{array} {lcl} 1 &=& \beta_0+\beta_1 1 + \epsilon_1 \\ 2 &=& \beta_0 + \beta_1 2 + \epsilon_2 \\ 2 &=& \beta_0+\beta_1 3+ \epsilon_3 \\ \end{array} \]
Note that upon centering of \(\mathbf{X}\) and \(\mathbf{Y}\) the intercept is dropped from each \(\mathbf{x}_i\), and thus the column of ones is dropped in \(\mathbf{X}_c\).
We can also write this in matrix form
\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon} \]
with
\[ \mathbf{Y}=\left[ \begin{array}{c} 1\\ 2\\ 2\\ \end{array}\right], \quad \mathbf{X}= \left[ \begin{array}{cc} 1&1\\ 1&2\\ 1&3\\ \end{array} \right], \quad \boldsymbol{\beta} = \left[ \begin{array}{c} \beta_0\\ \beta_1\\ \end{array} \right] \quad \text{and} \quad \boldsymbol{\epsilon}= \left[ \begin{array}{c} \epsilon_1\\ \epsilon_2\\ \epsilon_3 \end{array} \right] \]
From the linear regression we get \[ E[Y \mid \mathbf{x}] = \mathbf{x}^T\boldsymbol{\beta} . \]
For a model with a single predictor we obtain
\[ E[Y \mid x] = \beta_0 + \beta_1 x \] and \[ \beta_1 = E[Y\mid x+1] - E[Y\mid x] = \beta_0 + \beta_1 (x+1) - \beta_0 - \beta_1 x. \]
In this course we will typically center \(Y\) and \(X\), and then we get the following
\[ E[Y_c \mid x_c] = \beta_1 x_c \] and \[ \beta_1 = E[Y_c\mid x_c+1] - E[Y_c\mid x_c] . \]
Note that the estimator of \(\beta_1\) is exactly the same whether it is estimated from the model with or without centering.
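This is easily verified numerically; a small check, re-creating the toy data (\(x = (1,2,3)\), \(y = (1,2,2)\)) for completeness:
data <- data.frame(x = c(1, 2, 3), y = c(1, 2, 2))
# slope from the model with intercept
coef(lm(y ~ x, data))["x"]
# slope from the model on centered variables (without intercept)
coef(lm(I(y - mean(y)) ~ -1 + I(x - mean(x)), data))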
The parameter \(\beta_1\) can be interpreted as the average difference in the outcome between subjects that differ by one unit in the predictor.
The parameter \(\beta_1\) does not say much about individual outcomes. The residual variance \(\sigma^2\) determines how much individual outcomes vary about the mean outcome.
The \(\beta\) parameters are used to measure association, but a \(\beta \neq 0\) does not necessarily mean that the model will give good predictions.
\[ \hat Y = \mathbf{x}^T \hat{\boldsymbol{\beta}} \] - In Chapter 3 we will discuss the prediction problem for high-dimensional data
Model fit and predictions based on the toy dataset
lm1 <- lm(y~x,data)
data$yhat <- lm1$fitted
data %>%
ggplot(aes(x,y)) +
geom_point() +
ylim(0,4) +
xlim(0,4) +
stat_smooth(method = "lm", color = "red", fullrange = TRUE) +
geom_point(aes(x=x, y =yhat), pch = 2, size = 3, color = "red") +
geom_segment(data = data, aes(x = x, xend = x, y = y, yend = yhat), lty = 2 )
In Chapter 6 we will discuss the problem of large-scale hypothesis testing: testing many hypotheses in a single study (tens to hundreds of thousands of hypotheses).
\[ H_0: \beta_1 = 0 \text{ vs }H_1:\beta_1 \neq 0. \]
However, when many hypotheses are to be tested in a single study, the probability to find false associations is no longer controlled if p-values are compared to the significance level \(\alpha\).
We will later introduce the concept of false discovery rates to overcome the problem.
Minimize the residual sum of squares \[\begin{eqnarray*} RSS(\boldsymbol{\beta})&=&\sum\limits_{i=1}^n e^2_i\\\\ &=&\sum\limits_{i=1}^n \left(y_i-\beta_0-\sum\limits_{j=1}^p x_{ij}\beta_j\right)^2 \end{eqnarray*}\]
or in matrix notation \[\begin{eqnarray*} RSS(\boldsymbol{\beta})&=&\mathbf{e}^T\mathbf{e}\\\\ &=& \left[\begin{array}{ccc} e_1 &\ldots& e_n \end{array}\right]\left[\begin{array}{c}e_1\\\vdots\\e_n\end{array}\right]\\ &=& e_1^2 + e_2^2 + \ldots + e_n^2\\\\ &=&(\mathbf{Y}-\mathbf{X\beta})^T(\mathbf{Y}-\mathbf{X\beta})\\\\ &=&\Vert \mathbf{Y}-\mathbf{X\beta}\Vert^2_2\\ \end{eqnarray*}\] with the \(L_2\)-norm of a \(p\)-dim. vector \(v\) \(\Vert \mathbf{v} \Vert_2=\sqrt{v_1^2+\ldots+v_p^2}\) \(\rightarrow\) \(\hat{\boldsymbol{\beta}}=\text{argmin}_\beta \Vert \mathbf{Y}-\mathbf{X\beta}\Vert^2_2\)
\[ \begin{array}{ccc} \frac{\partial RSS}{\partial \boldsymbol{\beta}}&=&\mathbf{0}\\\\ \frac{\partial(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}&=&\mathbf{0}\\\\ -2\mathbf{X}^T(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})&=&\mathbf{0}\\\\ \mathbf{X}^T\mathbf{X}\boldsymbol{\beta}&=&\mathbf{X}^T\mathbf{Y}\\\\ \hat{\boldsymbol{\beta}}&=&(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \end{array} \]
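For the toy data the normal equations can be solved directly and compared with `lm()`; a small numerical check, assuming the toy data frame `data` from above:
X <- model.matrix(~ x, data)    # design matrix with intercept and slope column
Y <- data$y
# solve the normal equations X^T X beta = X^T Y
solve(t(X) %*% X, t(X) %*% Y)
# the same estimates via lm()
coef(lm(y ~ x, data))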
It can be shown that the estimator is unbiased:
\[ E[\hat{\boldsymbol{\beta}}]=\boldsymbol{\beta} \]
There is also another picture to regression:
Instead of plotting each observation \(i= 1 \ldots n\) as a data point in \(\mathbb{R}^p\), with one dimension for every variable/feature that is recorded for each observation,
we can also plot \(\mathbf{Y}\), \(\hat{\mathbf{Y}}\) and each column \(\mathbf{X}_j\) of \(\mathbf{X}\) (\(j=1 \ldots p\)) as a vector in \(\mathbb{R}^n\), with one dimension for every observation.
In this representation linear regression can be interpreted as a projection of the vector \(\mathbf{Y}\) onto the subspace of \(\mathbb{R}^n\) that is spanned by the vectors for the predictors \(\mathbf{X}_1 \ldots \mathbf{X}_p\).
The space spanned by \(\mathbf{X}_1 \ldots \mathbf{X}_p\) is also referred to as the column space of \(\mathbf{X}\): the space that consists of all linear combinations of the predictor columns \(\mathbf{X}_1 \ldots \mathbf{X}_p\).
\[ \mathbf{e}=\left[\begin{array}{c} e_1\\e_2\end{array}\right], \mathbf{u}_1 = \left[\begin{array}{c} 1\\0\end{array}\right], \mathbf{u}_2 = \left[\begin{array}{c} 0\\1\end{array}\right] \]
\[\begin{eqnarray*} \mathbf{u}_1^T \mathbf{e} &=& \Vert \mathbf{u}_1\Vert_2 \Vert \mathbf{e}\Vert_2 \cos <\mathbf{u}_1,\mathbf{e}>\\ &=&\left[\begin{array}{cc} 1&0\end{array}\right] \left[\begin{array}{c} e_1\\e_2\end{array}\right]\\ &=& 1\times e_1 + 0 \times e_2 \\ &=& e_1\\ \end{eqnarray*}\]
\[\begin{eqnarray*} \mathbf{u}_2^T \mathbf{e} &=& \left[\begin{array}{cc} 0&1\end{array}\right] \left[\begin{array}{c} e_1\\e_2\end{array}\right]\\ &=& 0\times e_1 + 1 \times e_2 \\ &=& e_2 \end{eqnarray*}\]
\[\begin{eqnarray*} \mathbf{e}^T \mathbf{e} &=&\left[\begin{array}{cc} e_1&e_2\end{array}\right] \left[\begin{array}{c} e_1\\e_2\end{array}\right]\\ &=&e_1^2+e_2^2\\ &=&\Vert e \Vert^2_2 \rightarrow \text{ Pythagorean theorem} \end{eqnarray*}\]
Fitted values:
\[ \begin{array}{lcl} \hat{\mathbf{Y}} &=& \mathbf{X}\hat{\boldsymbol{\beta}}\\ &=& \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\\ &=& \mathbf{HY} \end{array} \] with \(\mathbf{H}\) the projection matrix also referred to as the hat matrix.
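The output below can be reproduced along the following lines (a sketch, reusing `X` and `Y` for the toy data as defined above):
X                                 # design matrix
XtX <- t(X) %*% X                 # X^T X
XtX
solve(XtX)                        # (X^T X)^{-1}
H <- X %*% solve(XtX) %*% t(X)    # hat (projection) matrix
H
Yhat <- H %*% Y                   # fitted values
Yhat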
#> (Intercept) x
#> 1 1 1
#> 2 1 2
#> 3 1 3
#> attr(,"assign")
#> [1] 0 1
#> (Intercept) x
#> (Intercept) 3 6
#> x 6 14
#> (Intercept) x
#> (Intercept) 2.333333 -1.0
#> x -1.000000 0.5
#> 1 2 3
#> 1 0.8333333 0.3333333 -0.1666667
#> 2 0.3333333 0.3333333 0.3333333
#> 3 -0.1666667 0.3333333 0.8333333
#> [,1]
#> 1 1.166667
#> 2 1.666667
#> 3 2.166667
We can also interpret the fit as the projection of the \(n\times 1\) vector \(\mathbf{Y}\) on the column space of the matrix \(\mathbf{X}\).
So each column in \(\mathbf{X}\) is also an \(n\times 1\) vector.
For the toy example \(n=3\) and \(p=2\). The other picture of linear regression is to consider \(\mathbf{X}_0\) (the intercept column), \(\mathbf{X}_1\) and \(\mathbf{Y}\) as vectors in the space of the data \(\mathbb{R}^n\), here \(\mathbb{R}^3\) because we have three data points. The column space of \(\mathbf{X}\) is then a plane in this three-dimensional space.
\[ \hat{\mathbf{Y}} = \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} \]
originRn <- data.frame(X1=0,X2=0,X3=0)
data$x0 <- 1
dataRn <- data.frame(t(data))
library(plotly)
p1 <- plot_ly(
originRn,
x = ~ X1,
y = ~ X2,
z= ~ X3, name="origin") %>%
add_markers(type="scatter3d") %>%
layout(
scene = list(
aspectmode="cube",
xaxis = list(range=c(-4,4)), yaxis = list(range=c(-4,4)), zaxis = list(range=c(-4,4))
)
)
p1 <- p1 %>%
add_trace(
x = c(0,1),
y = c(0,0),
z = c(0,0),
mode = "lines",
line = list(width = 5, color = "grey"),
type="scatter3d",
name = "obs1") %>%
add_trace(
x = c(0,0),
y = c(0,1),
z = c(0,0),
mode = "lines",
line = list(width = 5, color = "grey"),
type="scatter3d",
name = "obs2") %>%
add_trace(
x = c(0,0),
y = c(0,0),
z = c(0,1),
mode = "lines",
line = list(width = 5, color = "grey"),
type="scatter3d",
name = "obs3") %>%
add_trace(
x = c(0,1),
y = c(0,1),
z = c(0,1),
mode = "lines",
line = list(width = 5, color = "black"),
type="scatter3d",
name = "X1") %>%
add_trace(
x = c(0,1),
y = c(0,2),
z = c(0,3),
mode = "lines",
line = list(width = 5, color = "black"),
type="scatter3d",
name = "X2")
p1
Actual values of \(\mathbf{Y}\):
#> [1] 1 2 2
\[ \mathbf{Y}=\left[\begin{array}{c} 1 \\ 2 \\ 2 \end{array}\right] \]
p2 <- p1 %>%
add_trace(
x = c(0,Y[1]),
y = c(0,Y[2]),
z = c(0,Y[3]),
mode = "lines",
line = list(width = 5, color = "red"),
type="scatter3d",
name = "Y")
p2
Fitted values \(\hat{\mathbf{Y}}\):
#> [1] 1.166667 1.666667 2.166667
\[ \hat{\mathbf{Y}}=\left[\begin{array}{c} 1.1666667 \\ 1.6666667 \\ 2.1666667 \end{array}\right] \]
p2 <- p2 %>%
add_trace(
x = c(0,Yhat[1]),
y = c(0,Yhat[2]),
z = c(0,Yhat[3]),
mode = "lines",
line = list(width = 5, color = "orange"),
type="scatter3d",
name="Yhat") %>%
add_trace(
x = c(Y[1],Yhat[1]),
y = c(Y[2],Yhat[2]),
z = c(Y[3],Yhat[3]),
mode = "lines",
line = list(width = 5, color = "red", dash="dash"),
type="scatter3d",
name="Y -> Yhat"
)
p2
\(\mathbf{Y}\) is projected onto the column space of \(\mathbf{X}\), the space spanned by its columns.
\[ \begin{array}{lcl} \hat{\mathbf{Y}} &=& \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\\ &=& \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1/2}(\mathbf{X}^T\mathbf{X})^{-1/2}\mathbf{X}^T\mathbf{Y}\\ &=& \mathbf{U}\mathbf{U}^T\mathbf{Y} \end{array} \]
\(\mathbf{U}\) forms a new orthonormal basis of a two-dimensional subspace of \(\mathbb{R}^3\).
The space spanned by the columns of \(\mathbf{U}\) is the column space of \(\mathbf{X}\), i.e. it contains all possible linear combinations of the columns of \(\mathbf{X}\). \(\mathbf{U}^T\mathbf{Y}\) gives the coordinates of the projection of \(\mathbf{Y}\) on this new orthonormal basis.
eigenXtX <- eigen(XtX)
# (X^T X)^{-1/2} via the eigenvalue decomposition of X^T X
XtXinvSqrt <- eigenXtX$vectors %*% diag(1/eigenXtX$values^.5) %*% t(eigenXtX$vectors)
# orthonormal basis for the column space of X
U <- X %*% XtXinvSqrt
#> [,1] [,2]
#> 1 0.9116067 -0.04802616
#> 2 0.3881706 0.42738380
#> 3 -0.1352655 0.90279376
#> [,1] [,2]
#> [1,] 1.000000e+00 2.915205e-16
#> [2,] 2.915205e-16 1.000000e+00
#> 1 2 3
#> 1 0.8333333 0.3333333 -0.1666667
#> 2 0.3333333 0.3333333 0.3333333
#> 3 -0.1666667 0.3333333 0.8333333
#> 1 2 3
#> 1 0.8333333 0.3333333 -0.1666667
#> 2 0.3333333 0.3333333 0.3333333
#> 3 -0.1666667 0.3333333 0.8333333
p3 <- p1 %>%
add_trace(
x = c(0,U[1,1]),
y = c(0,U[2,1]),
z = c(0,U[3,1]),
mode = "lines",
line = list(width = 5, color = "blue"),
type="scatter3d",
name = "U1") %>%
add_trace(
x = c(0,U[1,2]),
y = c(0,U[2,2]),
z = c(0,U[3,2]),
mode = "lines",
line = list(width = 5, color = "blue"),
type="scatter3d",
name = "U2")
p3
p4 <- p3 %>%
add_trace(
x = c(0,Y[1]),
y = c(0,Y[2]),
z = c(0,Y[3]),
mode = "lines",
line = list(width = 5, color = "red"),
type="scatter3d",
name = "Y") %>%
add_trace(
x = c(0,U[1,1]*(U[,1]%*%Y)),
y = c(0,U[2,1]*(U[,1]%*%Y)),
z = c(0,U[3,1]*(U[,1]%*%Y)),
mode = "lines",
line = list(width = 5, color = "red",dash="dash"),
type="scatter3d",
name="Y -> U1") %>% add_trace(
x = c(Y[1],U[1,1]*(U[,1]%*%Y)),
y = c(Y[2],U[2,1]*(U[,1]%*%Y)),
z = c(Y[3],U[3,1]*(U[,1]%*%Y)),
mode = "lines",
line = list(width = 5, color = "red", dash="dash"),
type="scatter3d",
name="Y -> U1")
p4
p5 <- p4 %>%
add_trace(
x = c(0,U[1,2]*(U[,2]%*%Y)),
y = c(0,U[2,2]*(U[,2]%*%Y)),
z = c(0,U[3,2]*(U[,2]%*%Y)),
mode = "lines",
line = list(width = 5, color = "red",dash="dash"),
type="scatter3d",
name="Y -> U2") %>% add_trace(
x = c(Y[1],U[1,2]*(U[,2]%*%Y)),
y = c(Y[2],U[2,2]*(U[,2]%*%Y)),
z = c(Y[3],U[3,2]*(U[,2]%*%Y)),
mode = "lines",
line = list(width = 5, color = "red", dash="dash"),
type="scatter3d",
name="Y -> U2")
p5
p6 <- p5 %>%
add_trace(
x = c(0,Yhat[1]),
y = c(0,Yhat[2]),
z = c(0,Yhat[3]),
mode = "lines",
line = list(width = 5, color = "orange"),
type="scatter3d",
name = "Yhat") %>%
add_trace(
x = c(Y[1],Yhat[1]),
y = c(Y[2],Yhat[2]),
z = c(Y[3],Yhat[3]),
mode = "lines",
line = list(width = 5, color = "maroon2"),
type="scatter3d",
name = "e") %>%
add_trace(
x = c(U[1,1]*(U[,1]%*%Y),Yhat[1]),
y = c(U[2,1]*(U[,1]%*%Y),Yhat[2]),
z = c(U[3,1]*(U[,1]%*%Y),Yhat[3]),
mode = "lines",
line = list(width = 5, color = "orange", dash="dash"),
type="scatter3d",
name = "Y -> U") %>%
add_trace(
x = c(U[1,2]*(U[,2]%*%Y),Yhat[1]),
y = c(U[2,2]*(U[,2]%*%Y),Yhat[2]),
z = c(U[3,2]*(U[,2]%*%Y),Yhat[3]),
mode = "lines",
line = list(width = 5, color = "orange", dash="dash"),
type="scatter3d",
name = "Y -> U")
p6
Note that it is also clear from the estimating equation in the derivation of the least squares solution that the residual vector is orthogonal to the column space:
\[ -2 \mathbf{X}^T(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) = \mathbf{0} \]
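This can be checked numerically for the toy fit (assuming `X`, `Y` and `Yhat` as computed above): the residual vector is, up to rounding error, perpendicular to every column of \(\mathbf{X}\).
e <- Y - Yhat     # residual vector
t(X) %*% e        # essentially zero: e is orthogonal to the column space of X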
Imagine what happens when \(p\) approaches \(n\) (\(p=n\)) or becomes much larger than \(n\) (\(p >> n\))!
Suppose that we add a predictor \(\mathbf{X}_2 = [2,0,1]^T\).
\[ \mathbf{Y}=\left[ \begin{array}{c} 1\\ 2\\ 2\\ \end{array}\right], \quad \mathbf{X}= \left[ \begin{array}{ccc} 1&1&2\\ 1&2&0\\ 1&3&1\\ \end{array} \right], \quad \boldsymbol{\beta} = \left[ \begin{array}{c} \beta_0\\ \beta_1\\ \beta_2 \end{array} \right] \quad \text{and} \quad \boldsymbol{\epsilon}= \left[ \begin{array}{c} \epsilon_1\\ \epsilon_2\\ \epsilon_3 \end{array} \right] \]
data$x2 <- c(2,0,1)
fit <- lm(y~x+x2,data)
# predict values on regular xy grid
x1pred <- seq(-1, 4, length.out = 10)
x2pred <- seq(-1, 4, length.out = 10)
xy <- expand.grid(x = x1pred,
x2 = x2pred)
ypred <- matrix(nrow = 10, ncol = 10,
data = predict(fit, newdata = data.frame(xy)))
library(plot3D)
# fitted points for droplines to surface
th=20
ph=5
scatter3D(data$x,
data$x2,
Y,
pch = 16,
col="darkblue",
cex = 1,
theta = th,
ticktype = "detailed",
xlab = "x1",
ylab = "x2",
zlab = "y",
colvar=FALSE,
bty = "g",
xlim=c(-1,3),
ylim=c(-1,3),
zlim=c(-2,4))
z.pred3D <- outer(
x1pred,
x2pred,
function(x1,x2)
{
fit$coef[1] + fit$coef[2]*x1 + fit$coef[3]*x2
})
x.pred3D <- outer(
x1pred,
x2pred,
function(x,y) x)
y.pred3D <- outer(
x1pred,
x2pred,
function(x,y) y)
scatter3D(data$x,
data$x2,
data$y,
pch = 16,
col="darkblue",
cex = 1,
theta = th,
ticktype = "detailed",
xlab = "x1",
ylab = "x2",
zlab = "y",
colvar=FALSE,
bty = "g",
xlim=c(-1,4),
ylim=c(-1,4),
zlim=c(-2,4))
surf3D(
x.pred3D,
y.pred3D,
z.pred3D,
col="blue",
facets=NA,
add=TRUE)
Note that the regression fit is now a plane.
However, we obtain a perfect fit: all data points fall exactly in the plane! 😱
This becomes obvious when we look at the column space of \(\mathbf{X}\)!
X <- cbind(X,c(2,0,1))
XtX <- t(X)%*%X
eigenXtX <- eigen(XtX)
XtXinvSqrt <- eigenXtX$vectors %*%diag(1/eigenXtX$values^.5)%*%t(eigenXtX$vectors)
U <- X %*% XtXinvSqrt
p7 <- p1 %>%
add_trace(
x = c(0,2),
y = c(0,0),
z = c(0,1),
mode = "lines",
line = list(width = 5, color = "darkgreen"),
type="scatter3d",
name = "X3")
p7
p8 <- p7 %>%
add_trace(
x = c(0,U[1,1]),
y = c(0,U[2,1]),
z = c(0,U[3,1]),
mode = "lines",
line = list(width = 5, color = "blue"),
type="scatter3d",
name = "U1") %>%
add_trace(
x = c(0,U[1,2]),
y = c(0,U[2,2]),
z = c(0,U[3,2]),
mode = "lines",
line = list(width = 5, color = "blue"),
type="scatter3d",
name = "U2") %>%
add_trace(
x = c(0,U[1,3]),
y = c(0,U[2,3]),
z = c(0,U[3,3]),
mode = "lines",
line = list(width = 5, color = "blue"),
type="scatter3d",
name = "U3")
p8
The column space now spans the entire \(\mathbb{R}^3\)!
With the intercept and the two predictors we can thus fit any response vector perfectly in a dataset with only 3 observations.
So the model can no longer be used to generalise the patterns seen in the data towards the population (new observations).
Problem of overfitting!!!
If \(p >> n\) the problem gets even worse! Then there is no longer even a unique solution to the least squares problem…
Indeed, then we have more vectors/dimensions/columns in X than datapoints!
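A quick sketch with made-up random data shows what `lm()` does when the design matrix has more columns than observations: the fit is perfect, but the solution is not unique and the aliased coefficients are returned as `NA`.
set.seed(42)
n <- 3
p <- 5
# more predictors than observations
Xbig <- matrix(rnorm(n * p), nrow = n, ncol = p)
ybig <- rnorm(n)
coef(lm(ybig ~ Xbig))    # some coefficients cannot be estimated (NA)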
\[ \begin{array}{ccl} \hat{\boldsymbol{\Sigma}}_{\hat{\boldsymbol{\beta}}} &=&\text{var}\left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\right]\\\\ &=&(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\text{var}\left[\mathbf{Y}\right]\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \quad \text{(*)}\\\\ &=&(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{I}\sigma^2)\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\\\ &=&(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\\\\ &=&(\mathbf{X}^T\mathbf{X})^{-1}\sigma^2 \end{array} \]
(*) Under the assumption that all observations of \(\mathbf{Y}\) are independent and identically distributed, so that \(\text{var}[\mathbf{Y}] = \mathbf{I}\sigma^2\).
The fact that \(\hat{\boldsymbol{\beta}}\) is unbiased and has a variance of \((\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\) will be important when assessing association!
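For the toy fit `lm1` this can be verified directly (a small check; \(\hat\sigma^2\) is the residual variance estimate reported by `summary()`):
Xs <- model.matrix(~ x, data)                 # design matrix of the simple toy fit
summary(lm1)$sigma^2 * solve(t(Xs) %*% Xs)    # sigma^2 (X^T X)^{-1}
vcov(lm1)                                     # matches the covariance matrix from lm()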
Least squares estimators are unbiased and consistent, but these properties are not very important for prediction models.
A prediction model is considered good if it can predict outcomes well.
The prediction error for a prediction at predictor \(\mathbf{x}\) is given by
\[ \hat{Y}(\mathbf{x}) - Y^*, \]
where
\(\hat{Y}(\mathbf{x})=\mathbf{x}^T\hat{\boldsymbol{\beta}}\) is the prediction at \(\mathbf{x}\)
\(Y^*\) is an outcome at predictor \(\mathbf{x}\)
Since prediction is typically used to predict an outcome before it is observed, the outcome \(Y^*\) is not observed yet. Hence, the prediction error cannot be computed.
The problem of unobservable prediction errors is partly solved by the expected conditional test error (sometimes referred to as the mean squared error, MSE) \[ \text{Err}(\mathbf{x})=E[(\hat{Y}(\mathbf{x}) - Y^*)^2]. \]
With (suppressing the dependence on \(\mathbf{x}\)) \[ \mu = E[\hat{Y}] \text{ and } \mu^*=E[Y^*] \] the error can be expressed as \[\begin{eqnarray*} \text{Err} &=&E\left\{\left[(\hat{Y} -\mu)-(Y^*-\mu^*)-(\mu^*-\mu)\right]^2\right\} \\ &=& E[(\hat{Y} -\mu)^2]+E[(Y^*-\mu^*)^2]+E[(\mu^*-\mu)^2] \\ &=& \text{var}[\hat{Y}] +\text{var}[Y^*] +\text{bias}^2 \end{eqnarray*}\]
The term \(\text{var}[Y^*]\) (irreducible error) does not depend on the model and may therefore be ignored when \(\text{Err}\) is used for comparing prediction models.
In this introductory chapter we only aim to give a rough discussion on prediction errors. Later definitions will be refined and the notation will be more accurate. Also a more detailed discussion on the bias-variance trade-off will follow. For the moment it is sufficient to vaguely know that:
the expected conditional test error is introduced to circumvent the problem that the prediction error cannot be observed. In later chapters we will look at estimators of the expected error.
the expected conditional test error is in some literature also known as the mean squared error (MSE), but we do not adopt this terminology because MSE is also commonly used to refer to SSE divided by the residual degrees of freedom in a linear regression model.
The identity \(\text{Err}=\text{var}[\hat{Y}] +\text{var}[Y^*] +\text{bias}^2\) is known as the bias-variance trade-off. It shows that a good prediction model (i.e. a model resulting in a small Err) can be obtained by a model that shows a small bias, as long as this bias is compensated by a large reduction of the variance of the predictions. A more detailed discussion will follow in later chapters.
For prediction models with a large number of predictors we will therefore introduce penalized regression. This will induce some bias in the estimation, but will allow us to reduce the variance considerably.
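As a foretaste, the hedged simulation sketch below (our own illustration, not the penalized methods developed later in the course) compares ordinary least squares with a simple ridge-type estimator \((\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y}\) when the number of predictors is large relative to the sample size: the shrunken estimator is biased, but in this setting its predictions typically have a smaller expected test error.
set.seed(123)
n <- 30        # training sample size
p <- 20        # number of predictors
beta <- rep(0.2, p)
simOne <- function(lambda = 10) {
  # training data
  X <- matrix(rnorm(n * p), n, p)
  y <- X %*% beta + rnorm(n)
  # OLS and ridge-type estimators
  betaOLS   <- solve(t(X) %*% X, t(X) %*% y)
  betaRidge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
  # prediction error on an independent test set
  Xnew <- matrix(rnorm(1000 * p), 1000, p)
  ynew <- Xnew %*% beta + rnorm(1000)
  c(ols   = mean((ynew - Xnew %*% betaOLS)^2),
    ridge = mean((ynew - Xnew %*% betaRidge)^2))
}
# average test error over repeated simulations
rowMeans(replicate(100, simOne()))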
#> [1] "2024-10-07 12:41:46 CEST"
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.0 RC (2024-04-16 r86468)
#> os macOS Big Sur 11.6
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Brussels
#> date 2024-10-07
#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> abind 1.4-8 2024-09-12 [1] CRAN (R 4.4.1)
#> AnnotationDbi 1.66.0 2024-05-01 [1] Bioconductor 3.19 (R 4.4.0)
#> AnnotationHub * 3.12.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.4.0)
#> beachmat 2.20.0 2024-05-06 [1] Bioconductor 3.19 (R 4.4.0)
#> beeswarm 0.4.0 2021-06-01 [1] CRAN (R 4.4.0)
#> Biobase * 2.64.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> BiocFileCache * 2.12.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> BiocGenerics * 0.50.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> BiocManager 1.30.25 2024-08-28 [1] CRAN (R 4.4.1)
#> BiocNeighbors 1.22.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> BiocParallel 1.38.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> BiocSingular 1.20.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> BiocVersion 3.19.1 2024-04-22 [1] Bioconductor 3.19 (R 4.4.0)
#> Biostrings 2.72.1 2024-06-02 [1] Bioconductor 3.19 (R 4.4.0)
#> bit 4.5.0 2024-09-20 [1] CRAN (R 4.4.1)
#> bit64 4.5.2 2024-09-22 [1] CRAN (R 4.4.1)
#> blob 1.2.4 2023-03-17 [1] CRAN (R 4.4.0)
#> bookdown 0.40 2024-07-02 [1] CRAN (R 4.4.0)
#> bslib 0.8.0 2024-07-29 [1] CRAN (R 4.4.0)
#> cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.0)
#> cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)
#> codetools 0.2-20 2024-03-31 [1] CRAN (R 4.4.0)
#> colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.4.0)
#> cowplot 1.1.3 2024-01-22 [1] CRAN (R 4.4.0)
#> crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.0)
#> crosstalk 1.2.1 2023-11-23 [1] CRAN (R 4.4.0)
#> curl 5.2.3 2024-09-20 [1] CRAN (R 4.4.1)
#> data.table 1.16.0 2024-08-27 [1] CRAN (R 4.4.1)
#> DBI 1.2.3 2024-06-02 [1] CRAN (R 4.4.0)
#> dbplyr * 2.5.0 2024-03-19 [1] CRAN (R 4.4.0)
#> DelayedArray 0.30.1 2024-05-07 [1] Bioconductor 3.19 (R 4.4.0)
#> DelayedMatrixStats 1.26.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.1)
#> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
#> emo 0.0.0.9000 2024-10-02 [1] Github (hadley/emo@3f03b11)
#> evaluate 1.0.0 2024-09-17 [1] CRAN (R 4.4.1)
#> ExperimentHub * 2.12.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
#> farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)
#> fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
#> filelock 1.0.3 2023-12-11 [1] CRAN (R 4.4.0)
#> forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
#> GenomeInfoDb * 1.40.1 2024-06-16 [1] Bioconductor 3.19 (R 4.4.0)
#> GenomeInfoDbData 1.2.12 2024-04-24 [1] Bioconductor
#> GenomicRanges * 1.56.1 2024-06-16 [1] Bioconductor 3.19 (R 4.4.0)
#> ggbeeswarm 0.7.2 2023-04-29 [1] CRAN (R 4.4.0)
#> ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
#> ggrepel 0.9.6 2024-09-07 [1] CRAN (R 4.4.1)
#> glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1)
#> gridExtra 2.3 2017-09-09 [1] CRAN (R 4.4.0)
#> gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)
#> highr 0.11 2024-05-26 [1] CRAN (R 4.4.0)
#> hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)
#> htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#> htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
#> httr 1.4.7 2023-08-15 [1] CRAN (R 4.4.0)
#> IRanges * 2.38.1 2024-07-03 [1] Bioconductor 3.19 (R 4.4.1)
#> irlba 2.3.5.1 2022-10-03 [1] CRAN (R 4.4.0)
#> jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.4.0)
#> jsonlite 1.8.9 2024-09-20 [1] CRAN (R 4.4.1)
#> KEGGREST 1.44.1 2024-06-19 [1] Bioconductor 3.19 (R 4.4.0)
#> knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)
#> labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)
#> lattice 0.22-6 2024-03-20 [1] CRAN (R 4.4.0)
#> lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.4.0)
#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
#> lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
#> Matrix 1.7-0 2024-03-22 [1] CRAN (R 4.4.0)
#> MatrixGenerics * 1.16.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> matrixStats * 1.4.1 2024-09-08 [1] CRAN (R 4.4.1)
#> memoise 2.0.1 2021-11-26 [1] CRAN (R 4.4.0)
#> mgcv 1.9-1 2023-12-21 [1] CRAN (R 4.4.0)
#> mime 0.12 2021-09-28 [1] CRAN (R 4.4.0)
#> misc3d 0.9-1 2021-10-07 [1] CRAN (R 4.4.0)
#> munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)
#> muscData * 1.18.0 2024-05-02 [1] Bioconductor 3.19 (R 4.4.0)
#> nlme 3.1-166 2024-08-14 [1] CRAN (R 4.4.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
#> plot3D * 1.4.1 2024-02-06 [1] CRAN (R 4.4.0)
#> plotly * 4.10.4 2024-01-13 [1] CRAN (R 4.4.0)
#> png 0.1-8 2022-11-29 [1] CRAN (R 4.4.0)
#> purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
#> rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.4.0)
#> Rcpp 1.0.13 2024-07-17 [1] CRAN (R 4.4.0)
#> readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
#> rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
#> rmarkdown 2.28 2024-08-17 [1] CRAN (R 4.4.0)
#> RSQLite 2.3.7 2024-05-27 [1] CRAN (R 4.4.0)
#> rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
#> rsvd 1.0.5 2021-04-16 [1] CRAN (R 4.4.0)
#> S4Arrays 1.4.1 2024-05-30 [1] Bioconductor 3.19 (R 4.4.0)
#> S4Vectors * 0.42.1 2024-07-03 [1] Bioconductor 3.19 (R 4.4.1)
#> sass 0.4.9 2024-03-15 [1] CRAN (R 4.4.0)
#> ScaledMatrix 1.12.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)
#> scater * 1.32.1 2024-07-21 [1] Bioconductor 3.19 (R 4.4.0)
#> scuttle * 1.14.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
#> SingleCellExperiment * 1.26.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> SparseArray 1.4.8 2024-05-30 [1] Bioconductor 3.19 (R 4.4.0)
#> sparseMatrixStats 1.16.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)
#> stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
#> SummarizedExperiment * 1.34.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
#> tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
#> tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
#> tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)
#> timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)
#> tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)
#> UCSC.utils 1.0.0 2024-05-06 [1] Bioconductor 3.19 (R 4.4.0)
#> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
#> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
#> vipor 0.4.7 2023-12-18 [1] CRAN (R 4.4.0)
#> viridis 0.6.5 2024-01-29 [1] CRAN (R 4.4.0)
#> viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.4.0)
#> withr 3.0.1 2024-07-31 [1] CRAN (R 4.4.0)
#> xfun 0.47 2024-08-17 [1] CRAN (R 4.4.0)
#> XVector 0.44.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#> yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)
#> zlibbioc 1.50.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────