\[\left\{\begin{array}{lcr} y_i &\sim& Poisson(\mu_i)\\ E[y_i]&=&\mu_i\\ \log(\mu_i)&=&\eta_i\\ \eta_i&=&\mathbf{x}_i\mathbf{\beta} \end{array}\right.\]
\[y_i \sim \frac{\mu_i^{y_i}e^{-\mu_i}}{y_i!}\]
\[y_i \sim \exp\left\{ \frac{y_i\theta_i- b(\theta_i)}{a(\phi)}+c(y_i,\phi)\right\}\] \[y_i \sim \exp\left\{y_i\log(\mu_i) - \mu_i - \log(y_i!)\right\}\]
For one observation: \[l(\mu_i \vert y_i) = y_i \log \mu_i - \mu_i - \log y_i!\] \[l(\mu_i \vert y_i) = y_i \theta_i - e^{\theta_i} - \log y_i!\]
Log-likelihood for all observations, given that they are independent: \[l(\mathbf{\mu} \vert \mathbf{y}) = \sum\limits_{i=1}^n \left\{ y_i \theta_i - e^{\theta_i} - \log y_i!\right\}\]
Maximum likelihood: choose the parameters \(\mathbf{\beta}\) that maximize the likelihood of observing the sample under the statistical model. \[\text{argmax}_{\mathbf{\beta}} l(\mathbf{\mu} \vert \mathbf{y})\]
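Before deriving the dedicated algorithm, it can help to see this objective concretely. The sketch below is purely illustrative: the names (poisLogLik, Xsim, ysim) and the simulated toy data are assumptions made for this example only, and the generic optimizer optim is used instead of the IRLS algorithm developed in the remainder of these notes.

# Illustrative sketch: evaluate and numerically maximize the Poisson log-likelihood.
# Xsim, ysim and poisLogLik are toy names for this example only.
set.seed(1)
Xsim <- cbind(1, rnorm(50))                   # toy design: intercept + one covariate
ysim <- rpois(50, exp(Xsim %*% c(1, 0.5)))    # toy Poisson response

poisLogLik <- function(beta, X, y) {
  eta <- X %*% beta                           # linear predictor
  sum(y * eta - exp(eta) - lfactorial(y))     # sum_i { y_i theta_i - e^(theta_i) - log y_i! }
}

# fnscale = -1 turns optim's minimizer into a maximizer
fit <- optim(c(0, 0), poisLogLik, X = Xsim, y = ysim,
             method = "BFGS", control = list(fnscale = -1))
fit$par  # numerical maximum likelihood estimates of beta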
Maximization \(\rightarrow\) set the first derivative of the log-likelihood with respect to the betas equal to zero. This first derivative is also called the score function, \(S(\mathbf{\beta})\):
\[S(\mathbf{\beta})=\frac{\partial l(\mathbf{\mu} \vert \mathbf{y})}{\partial \mathbf{\beta}}=\mathbf{0}\]
\[\begin{eqnarray*} S(\mathbf{\beta})&=&\frac{\partial \sum\limits_{i=1}^n \left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)}+c(y_i,\phi)\right\}}{\partial \mathbf{\beta}}\\ &=&\sum\limits_{i=1}^n\frac{\partial \left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)}+c(y_i,\phi)\right\}}{\partial \theta_i}\frac{\partial \theta_i}{\partial\mu_i}\frac{\partial\mu_i}{\partial\eta_i}\frac{\partial\eta_i}{\partial\mathbf{\beta}}\\ &=&\sum\limits_{i=1}^n\frac{y_i-\mu_i}{a(\phi)}\frac{1}{b^{\prime\prime}(\theta_i)}\frac{\partial\mu_i}{\partial\eta_i}\mathbf{x}_i^T\\ &=&\mathbf{X}^T\mathbf{A}\left(\mathbf{y}-\boldsymbol{\mu}\right) \end{eqnarray*}\]
For Poisson data \(a(\phi)=1\) and \(b^{\prime\prime}(\theta)=\mu\), so \(a(\phi)b^{\prime\prime}(\theta)=\mu\), and for the log link \(\frac{\partial \mu}{\partial \eta}=\mu\). Hence \(\mathbf{A}=\mathbf{I}\) and
\[ S(\beta)=\mathbf{X}^T \left\{\mathbf{Y}-\mathbf{\mu}\right\}\]
Parameter estimator \(\hat{\mathbf{\beta}}\): Find \(\hat{\mathbf{\beta}}\) so that \[ S(\beta) = \mathbf{0}\]
Problem: \(\mathbf{\mu}= b^\prime(\mathbf{\theta})\). For Poisson this becomes \(\mathbf{\mu}= \exp(\mathbf{\theta}) = \exp(\mathbf{\eta}) = \exp(\mathbf{X\beta})\)! The score equation is therefore nonlinear in the model parameters \(\rightarrow\) find the roots of the score equation with the Newton-Raphson method.
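As a reminder of the root-finding idea, here is a purely illustrative one-dimensional Newton-Raphson sketch (the toy functions f and fprime are assumptions for this example only; the multivariate version for the score equation is derived below):

# Illustrative sketch: Newton-Raphson root finding in one dimension.
# f and fprime are toy functions chosen only to show the update rule.
f      <- function(x) x^3 - 2          # we look for the root of f(x) = 0
fprime <- function(x) 3 * x^2          # derivative of f
x <- 1                                 # starting value
for (k in 1:10) {
  x <- x - f(x) / fprime(x)            # Newton-Raphson update
}
x  # converges to 2^(1/3)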
We have to implement an iterative algorithm for the optimisation. To make things tractable we will act as if \(\mathbf{A}\) is known and fix it using the current values \(\boldsymbol{\beta}^k\). Note that for Poisson regression \(\mathbf{A}\) is constant and equals the identity matrix.
\[\begin{eqnarray*} \frac{\partial S(\mathbf{\beta})}{\partial \mathbf{\beta}} &=& \frac{\partial\, \mathbf{X}^T\mathbf{A} \left\{\mathbf{Y}-\boldsymbol{\mu}\right\}}{\partial \mathbf{\beta}}\\ &=& - \mathbf{X}^T \mathbf{A}\left[ \begin{array}{cccc} \frac{\partial \mu_1}{\partial \eta_1} &0&\ldots&0\\ 0&\frac{\partial \mu_2}{\partial \eta_2} &\ldots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots& \frac{\partial \mu_n}{\partial \eta_n}\\ \end{array}\right] \frac{\partial \mathbf{\eta}}{\partial \mathbf{\beta}}\\ &=&-\mathbf{X}^T\mathbf{WX} \end{eqnarray*}\]
We know one point of the plane, \((\mathbf{\beta}^k,S(\mathbf{\beta}^k))\), and we are looking for the point \((\mathbf{\beta}^{k+1},0)\) where it crosses zero.
We know the direction of the plane \(S^\prime(\mathbf{\beta})=\frac{\partial S(\mathbf{\beta})}{\partial \mathbf{\beta}}\)
Equation of plane: \[S(\mathbf{\beta})=\mathbf{\alpha}_0+S^\prime\vert_{\mathbf{\beta}^k} \mathbf{\beta}\]
Get \(\mathbf{\beta}^{k+1}\) \[\begin{eqnarray*} \mathbf{0}&=&\mathbf{\alpha}_0+S^\prime\vert_{\mathbf{\beta}^{k}} \mathbf{\beta}^{k+1}\\ \mathbf{\beta}^{k+1}&=&-\left(S^{\prime}\vert_{\mathbf{\beta}^{k}}\right)^{-1}\mathbf{\alpha}_0\\ \end{eqnarray*}\]
Get \(\mathbf{\alpha}_0\) \[\begin{eqnarray*} S(\mathbf{\beta}^k)&=&\mathbf{\alpha}_0+S^\prime\vert_{\mathbf{\beta}^k} \mathbf{\beta}^k\\ \mathbf{\alpha}_0&=&-S^\prime\vert_{\mathbf{\beta}^k} \mathbf{\beta}^k + S(\mathbf{\beta}^k)\\ \end{eqnarray*}\]
Get \(\mathbf{\beta}^{k+1}\)
\[\begin{eqnarray*} \mathbf{\beta}^{k+1}&=&\mathbf{\beta}^k-\left(S^{\prime}\vert_{\mathbf{\beta}^{k}}\right)^{-1}S(\mathbf{\beta}^k)\\ \mathbf{\beta}^{k+1}&=&\mathbf{\beta}^k+ \left(\mathbf{X}^T\mathbf{WX}\right)^{-1} S(\mathbf{\beta}^k) \end{eqnarray*}\]
With \(J(\mathbf{\beta})=I(\mathbf{\beta})=\mathbf{X}^T\mathbf{WX}\) the Fisher information matrix. Because we use the canonical link function, the observed Fisher information matrix equals the expected Fisher information matrix, \(J(\mathbf{\beta})=I(\mathbf{\beta})\). Indeed, the observed Fisher information matrix then does not depend on the observations, but only on the design and on the variance of the data (via the weights). Hence, Newton-Raphson is equivalent to Fisher scoring when the canonical link function is used.
Note that the Fisher information matrix, i.e. minus the second derivative (Hessian) of the log-likelihood with respect to the model parameters, is also the inverse of the variance-covariance matrix of the model parameter estimators. It is thus related to the precision.
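For Poisson regression this update can be coded directly. The sketch below is a minimal illustration assuming a design matrix X, a count vector y and a current estimate beta (these names and the helper fisherScoringStep are assumptions for this example; the notes actually implement the equivalent IRLS reformulation that follows):

# Illustrative sketch of one Newton-Raphson / Fisher scoring step for Poisson regression.
# X (n x p design matrix), y (counts) and beta (current estimate) are assumed to exist.
fisherScoringStep <- function(beta, X, y) {
  eta   <- X %*% beta                      # linear predictor
  mu    <- exp(eta)                        # inverse link
  score <- t(X) %*% (y - mu)               # S(beta) = X^T (y - mu), since A = I for Poisson
  W     <- diag(c(mu))                     # weights: dmu/deta = mu on the diagonal
  beta + solve(t(X) %*% W %*% X, score)    # beta^(k+1) = beta^k + (X^T W X)^(-1) S(beta^k)
}

Applying this step repeatedly until the score is numerically zero gives the same estimates as the IRLS iterations below.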
We can rewrite Newton-Raphson / Fisher scoring as iteratively reweighted least squares (IRLS, also written IRWLS).
\[\begin{eqnarray*} \mathbf{\beta}^{k+1}&=&\mathbf{\beta}^k+ \left(\mathbf{X}^T\mathbf{WX}\right)^{-1} S(\mathbf{\beta}^k)\\ \mathbf{\beta}^{k+1}&=&\mathbf{\beta}^k+ \left(\mathbf{X}^T\mathbf{WX}\right)^{-1} \mathbf{X}^T\mathbf{A} \left(\mathbf{Y}-\mathbf{\mu}\right)\\ \mathbf{\beta}^{k+1}&=& \left(\mathbf{X}^T\mathbf{WX}\right)^{-1}\mathbf{X}^T\mathbf{WX}\mathbf{\beta}^k+ \left(\mathbf{X}^T\mathbf{WX}\right)^{-1} \mathbf{X}^T \mathbf{W}\frac{\partial \eta}{\partial \mu} \left(\mathbf{Y}-\mathbf{\mu}\right)\\ \mathbf{\beta}^{k+1}&=& \left(\mathbf{X}^T\mathbf{WX}\right)^{-1}\mathbf{X}^T\mathbf{W} \left[\mathbf{X}\mathbf{\beta}^k + \frac{\partial \eta}{\partial \mu} \left(\mathbf{Y}-\mathbf{\mu}\right) \right]\\ \mathbf{\beta}^{k+1}&=& \left(\mathbf{X}^T\mathbf{WX}\right)^{-1}\mathbf{X}^T\mathbf{Wz} \end{eqnarray*}\]
with \(\mathbf{z}=\left[\mathbf{X}\mathbf{\beta}^k + \frac{\partial \eta}{\partial \mu} \left(\mathbf{Y}-\mathbf{\mu}\right)\right]\)
So we can fit the model by performing iterative regressions of the pseudo data \(\mathbf{z}\) on \(\mathbf{X}\). In each iteration we will update \(\mathbf{z}\), the weights \(\mathbf{W}\) and the model parameters.
For Poisson data \(\frac{\partial \eta}{\partial \mu}=\frac{\partial\log \mu}{\partial\mu}=\frac{1}{\mu}=\exp(-\eta)\) and, since \(\mathbf{A}=\mathbf{I}\), \(\mathbf{W}=\mathbf{A}\frac{\partial{\boldsymbol\mu}}{{\partial \boldsymbol\eta}}\) is a diagonal matrix with diagonal elements \(w_{ii}=\frac{\partial{\mu_i}}{{\partial \eta_i}}=\mu_i=\exp(\eta_i)\).
In the IRWLS algorithm, the pseudo-data are weighted by the inverse of their variance, \(w_{ii}=\text{var}(z_i)^{-1}\): this corrects for the fact that the data are heteroscedastic. Count data have a mean-variance relation (e.g. in the Poisson case \(\text{E}\left[Y \right]=\text{var}\left[Y \right]=\mu\)). The weights in \(\mathbf{W}\) also account for the scale parameter \(\phi\). (Note that the scale parameter for Poisson is \(\phi=1\).)
With IRWLS, the variance-covariance matrix of the model parameter estimator equals \[\mathbf{\Sigma}_{\hat\beta}=\left(\mathbf{X}^T\mathbf{WX}\right)^{-1}.\]
Note that the Fisher information matrix equals the inverse of the variance-covariance matrix of the parameter estimators. The larger the Fisher information matrix, the more information the experiment provides to estimate the model parameters: FIM \(\uparrow\), precision \(\uparrow\), \(\text{SE}\downarrow\).
set.seed(300)
xhlp<-cbind(1,rnorm(100),rnorm(100))
betasTrue<-c(2,0.8,1.2)
etaTrue<-xhlp%*%betasTrue
y<-rpois(100,exp(etaTrue))
plot(betasTrue,ylab=expression(beta),ylim=c(0,4),pch=19,type="b")
We start from a deliberately poor initial estimate to illustrate the algorithm; otherwise convergence for this simple example is too quick to visualise.
iteration=0
betas<-c(log(mean(y)),0,0)
plot(betasTrue,ylab=expression(beta),ylim=c(0,4),pch=19,type="b")
lines(betas,type="b",lty=2)
\[z_i= \eta_i + \frac{\partial \eta_i}{\partial \mu_i}(y_i -\mu_i)\] \[z_i= \eta_i + e^{-\eta_i} y_i -1\]
\[w_{ii}= \text{var}(y_i)^{-1} \left(\frac{\partial \mu_i}{\partial \eta_i}\right)^2\] \[w_{ii}= e^{\eta_i}\]
First 3 iterations (colors: black = initial estimate, red = iteration 1, green = iteration 2, blue = iteration 3).
plot(betasTrue,ylab=expression(beta),ylim=c(0,4),pch=19,type="b")
lines(betas,type="b",lty=2)

#calculate current eta
eta<-xhlp%*%betas

iteration=0
for (i in 1:3)
{
  #start IRLS UPDATE STEP
  iteration=iteration+1
  #calculate pseudo data based on current betas
  z=eta+exp(-eta)*(y-exp(eta))
  #calculate new weights: diagonal elements
  w<-c(exp(eta))
  #update betas
  lmUpdate<-lm(z~-1+xhlp,weight=w)
  #eta<-xhlp%*%betas
  eta<-lmUpdate$fitted
  betas<-lmUpdate$coef
  lines(betas,type="b",col=iteration+1,pch=iteration,lty=2)
}
z<-log(y+.5)
betas<-lm(z~-1+xhlp)$coef
plot(betasTrue,ylab=expression(beta),ylim=c(0,4),pch=19,type="b")
lines(betas,col=2,type="b",lty=2)
#calculate current eta
eta<-xhlp%*%betas

ylogy<-function(y)
{
  return(ifelse(y==0,rep(0,length(y)),y*log(y)))
}
deviance<-2*sum(ylogy(y)-y*eta-(y-exp(eta)))
devianceOld<-1e30

plot(betasTrue,ylab=expression(beta),ylim=c(0,4),pch=19,type="b")
lines(betas,type="b",lty=2)

tol<-1e-6
iteration=0
while(((devianceOld-deviance)/devianceOld)>tol)
{
  #start IRLS UPDATE STEP
  iteration=iteration+1
  #calculate pseudo data based on current betas
  z=eta+exp(-eta)*(y-exp(eta))
  #calculate new weights: diagonal elements
  w<-c(exp(eta))
  #update betas
  lmUpdate<-lm(z~-1+xhlp,weight=w)
  #eta<-xhlp%*%betas
  eta<-lmUpdate$fitted
  betas<-lmUpdate$coef
  lines(betas,type="b",col=iteration+1,pch=iteration,lty=2)
  #criterion for convergence
  devianceOld<-deviance
  deviance<-2*sum(ylogy(y)-y*eta-(y-exp(eta)))
  cat("iteration",iteration,"Deviance Old",devianceOld,"Deviance",deviance,"\n")
}
## iteration 1 Deviance Old 129.1127 Deviance 114.3748
## iteration 2 Deviance Old 114.3748 Deviance 114.3374
## iteration 3 Deviance Old 114.3374 Deviance 114.3374
\[\Sigma_{\beta}=\left(\mathbf{X}^T\mathbf{W} \mathbf{X}\right)^{-1}\]
varBeta=solve(t(xhlp)%*%diag(w)%*%xhlp)
Use -1 in the formula because the intercept is already included in xhlp.
glmfit=glm(y~-1+xhlp,family=poisson)
comp=data.frame(glmfit=c(glmfit$deviance,glmfit$coef,summary(glmfit)$coef[,2]),ourFit=c(deviance,betas,sqrt(diag(varBeta))))
row.names(comp)=c("deviance",paste("beta",1:3,sep=""),paste("se",1:3,sep=""))
comp
## glmfit ourFit
## deviance 114.33739500 114.33739500
## beta1 1.96805691 1.96805691
## beta2 0.76136641 0.76136641
## beta3 1.23330031 1.23330031
## se1 0.03814382 0.03814378
## se2 0.01891208 0.01891203
## se3 0.02556665 0.02556659
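As an extra cross-check (not part of the original comparison), the same standard errors can also be read off the estimated variance-covariance matrix returned by glm:

# Extra cross-check: standard errors from the variance-covariance matrix of the glm fit
sqrt(diag(vcov(glmfit)))  # should match summary(glmfit)$coef[,2] and sqrt(diag(varBeta))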