Until now we used models for a continuous response as a function of categorical or continuous predictors. Here we focus on a categorical response.
##Saxony study
The data are derived from a binary random variable \(X\)
Note: count problem: the outcome is a count (number of boys)
Formally, we consider a population of unborn children in which each individual is characterized by a 0 (girl) or a 1 (boy).
Binary data can be modelled using a Bernoulli distribution: \[\begin{eqnarray*} X_i &\sim& B(\pi) \text{ with}\\ B(\pi)&=&\pi^{X_i}(1-\pi)^{(1-X_i)}, \end{eqnarray*}\]
It has one model parameter \(\pi\), the success probability: \(\text{E}[X_i]=\pi\).
The variance of Bernoulli data is also related to \(\pi\). \[\text{Var}[X_i]=\pi (1-\pi).\]
Some Bernoulli probability mass functions
[1] 0.5158408
In our example \(\bar x =\) 3175 / 6155 = 51.6%.
Is the observed proportion of 51.6% male children sufficient evidence to conclude that an unborn child is more likely to be a boy than a girl?
Statistical test for \[H_0: \pi=1/2 \text{ versus } H_1: \pi\neq 1/2,\]
We need to know the distribution of the test statistic (the number of boys) under the null hypothesis.
Suppose that \(H_0:\pi=1/2\) is true (boys and girls have an equal frequency in the population)
A random individual from the population has probability \[P(X=1)=\pi=1/2\] of being a boy.
Two children that are drawn independently:
Random variable \(S\) is the sum of the outcomes:
\((x_1,x_2)\) | \(s\) | \(P(S = s)\) |
---|---|---|
(0,0) | 0 | 1/4 |
(0,1), (1,0) | 1 | 1/2 |
(1,1) | 2 | 1/4 |
###General: \(n\) independent samples
Probability of success (\(X=1\)) equals \(\pi\).
The total number of successes \(S\) (the sum of the individual outcomes) can take \(n+1\) possible values \[S=k\text{, with }k=0,\ldots,n\]
Distribution of \(S\)? \[\begin{equation} P(S=k) = \left ( \begin{array}{c} n \\ k \\ \end{array} \right ) \pi^k (1-\pi)^{n-k} \end{equation}\]
\(\pi^k(1-\pi)^{n-k}\): probability of one particular sequence of \(k\) ones and \(n-k\) zeroes, with \(1-\pi\) the probability of a 0 for an individual observation; the
binomial coefficient \[\begin{equation*} \left ( \begin{array}{c} n \\ k \\ \end{array} \right ) = \frac{n \times (n-1) \times ...\times (n-k+1) }{ k!} = \frac{ n!}{ k!(n-k)! } \end{equation*}\] counts the number of such sequences, i.e. the number of ways to choose which \(k\) of the \(n\) draws are successes.
In R you can calculate the probability of each outcome \(S=k\) with the function `dbinom(k, n, p)`.
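For instance, for the two-children example above:

```r
# P(S = k) for n = 2 draws with success probability pi = 0.5, cf. the table above
dbinom(0:2, size = 2, prob = 0.5)
#> [1] 0.25 0.50 0.25
```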
\(S\) is a binomially distributed random variable with the binomial probability mass function given above.

Parameters:

- \(n\): the number of independent draws
- \(\pi\): the success probability for each draw

It gives the probability of \(k\) successes in \(n\) independent draws, each with success probability \(\pi\).
For binary data \(X\):

- e.g. wild type vs mutant of a gene, infected or not infected with HIV, …
- Use: compare proportions or risks of a particular event between groups.
Test for \[H_0:\pi=1/2\text{ vs }H_1:\pi\neq 1/2\] with the total number of boys \(S\) as test statistic.
\[\begin{eqnarray*} \text{P}_0\left[S\geq s_0+ \vert \delta\vert \right] &=& P(S \geq 6155 \times 0.5 + \vert 3175 - 6155 \times 0.5\vert ) \\&=& P(S \geq 3175)\\ &= &P(S= 3175) + P(S=3176) + ... + P(S=6155)\\ & =& 0.0067\\\\ \text{P}_0\left[S \leq s_0 - \vert \delta\vert \right] &=& P(S \leq 6155 \times 0.5 - \vert 3175- 6155 \times 0.5\vert) \\&=& P(S \leq 2980)\\ &= &P(S=0) + ... + P(S=2980) \\ &=&0.0067 \end{eqnarray*}\]
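These tail probabilities can be computed in R with `pbinom`; a minimal sketch that reproduces the three output lines below (\(\vert\delta\vert\), the two bounds, and the one- and two-sided probabilities):

```r
n <- 6155; s0 <- 3175
delta <- abs(s0 - n * 0.5)
delta                                                            # 97.5
c(n * 0.5 - delta, n * 0.5 + delta)                              # 2980 3175
pUp  <- pbinom(s0 - 1, size = n, prob = 0.5, lower.tail = FALSE) # P(S >= 3175)
pLow <- pbinom(n * 0.5 - delta, size = n, prob = 0.5)            # P(S <= 2980)
c(pLow, pUp, pLow + pUp)
```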
[1] 97.5
[1] 2980 3175
[1] 0.006699883 0.006699883 0.013399766
The test can be conducted immediately using the `binom.test` function in R.
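Judging from the output below (`data: boys and n`), the counts are stored in objects `boys` and `n`:

```r
boys <- 3175  # observed number of boys
n <- 6155     # total number of unborn children
binom.test(x = boys, n = n, p = 0.5)
```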
Exact binomial test
data: boys and n
number of successes = 3175, number of trials = 6155, p-value = 0.0134
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5032696 0.5283969
sample estimates:
probability of success
0.5158408
At the 5% significance level we conclude that the probability that an unborn child is male is higher than the probability that it is female.
[1] 0.5033559 0.5283257
[1] 0.5032696 0.5283969
attr(,"conf.level")
[1] 0.95
Note that the test for a proportion is the analogue of a one-sample t-test for binary data; compare the two confidence intervals shown above.
For the Saxony population we conclude at the 5% significance level that an unborn child is more likely to be male than female (\(p=\) 0.013). The probability that an unborn child is male is estimated as 51.6% (95% CI [50.3,52.8]%).
- The gate is opened after 3 minutes.
- Choice: aggressive vs non-aggressive male.
- Each female undergoes the test twice:
| | friendly_aggressive | friendly_non-aggressive | total |
|---|---|---|---|
| hostile_aggressive | 3 (e) | 17 (f) | 20 |
| hostile_non-aggressive | 1 (g) | 13 (h) | 14 |
| total | 4 | 30 | 34 |
The absolute risk difference (ARD) between both conditions is estimated as \(\widehat{\text{ARD}}=(f-g)/n\), with standard error \[\begin{equation*} \text{SE}_{\widehat{\text{ARD}}}=\frac{1}{n}\sqrt{f+g-\frac{(f-g)^2}{n}} \end{equation*}\]
If we have a large number of observations we can use the CLT to establish a \((1-\alpha)100\%\) CI on the ARD \[\left[\widehat{\text{ARD}}-z_{\alpha/2}\text{SE}_{\widehat{\text{ARD}}},\widehat{\text{ARD}}+z_{\alpha/2}\text{SE}_{\widehat{\text{ARD}}}\right]\] or \[\left[\frac{f-g}{n}-\frac{z_{\alpha/2}}{n}\sqrt{f+g-\frac{(f-g)^2}{n}},\frac{f-g}{n}+\frac{z_{\alpha/2}}{n}\sqrt{f+g-\frac{(f-g)^2}{n}}\right] \]
hamster <- matrix(c(3, 17, 1, 13), ncol = 2, byrow = TRUE)
rownames(hamster) <- c("hostile-aggressive", "hostile-non-aggressive")
colnames(hamster) <- c("friendly-aggressive", "friendly-non-aggressive")
f <- hamster[1, 2]; g <- hamster[2, 1]; n <- sum(hamster)
riskdiff <- (f - g)/n
riskdiff
[1] 0.4705882
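The standard error and the 95% confidence interval (the next two outputs) can be obtained with the objects defined above:

```r
se <- sqrt(f + g - (f - g)^2 / n) / n          # SE of the ARD, formula above
se
ci <- riskdiff + c(-1, 1) * qnorm(0.975) * se  # normal-approximation 95% CI
ci
```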
[1] 0.09517144
[1] 0.2840556 0.6571208
\[\begin{equation*} \widehat{\text{ARD}}=\frac{17-1}{34}=0.471 \end{equation*}\]

The probability of choosing the aggressive male is 47.1% higher (in absolute terms) after staying in a hostile environment than after staying in a friendly environment.

- The standard error is \[\begin{equation*} \text{SE}_{\widehat{\text{ARD}}}=\frac{1}{34}\sqrt{17+1-\frac{(17-1)^2}{34}}=0.0952 \end{equation*}\]
- A 95% confidence interval on this absolute risk difference is \[\begin{equation*} \left[0.471-1.96\times 0.0952,0.471+1.96\times 0.0952\right]=[0.284,0.658] \end{equation*}\]
| | friendly_aggressive | friendly_non-aggressive | total |
|---|---|---|---|
| hostile_aggressive | 3 (e) | 17 (f) | 20 |
| hostile_non-aggressive | 1 (g) | 13 (h) | 14 |
| total | 4 | 30 | 34 |
\[\begin{eqnarray*} \text{E}\left[f/(f+g)\right]&\stackrel{H_0}{=}& 0.5\\ f & \stackrel{H_0}{\sim}& \text{Binom}(n=f+g,\pi=0.5)\\ \text{SE}_{f} & \stackrel{H_0}{=}& \sqrt{(f+g)\times 0.5\times 0.5}=\frac{\sqrt{f+g}}{2} \end{eqnarray*}\]
\[\begin{equation*} z=\frac{f-(f+g)/2}{\sqrt{f+g}/2}=\frac{f-g}{\sqrt{f+g}} \end{equation*}\]
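A quick check in R, using the discordant counts `f` and `g` defined in the code above; note that `mcnemar.test` (output further below) applies a continuity correction, \((|f-g|-1)^2/(f+g)\), which explains the reported value of 12.5:

```r
z <- (f - g) / sqrt(f + g)       # (17 - 1)/sqrt(18) = 3.77
z^2                              # 14.2 without continuity correction
(abs(f - g) - 1)^2 / (f + g)     # 12.5 with continuity correction, cf. mcnemar.test
```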
The McNemar test is the analogue of the paired t-test for binary qualitative variables.
In R we can conduct the analysis using the `mcnemar.test` function.
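For example, with the `hamster` matrix defined above:

```r
mcnemar.test(hamster)
```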
McNemar's Chi-squared test with continuity correction
data: hamster
McNemar's chi-squared = 12.5, df = 1, p-value = 0.000407
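An exact version of the test uses the binomial null distribution of \(f\) given the \(f+g=18\) discordant pairs. Judging from the output below (`data: f and f + g`), it was obtained with a call like:

```r
binom.test(f, f + g)
```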
Exact binomial test
data: f and f + g
number of successes = 17, number of trials = 18, p-value = 0.000145
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.7270564 0.9985944
sample estimates:
probability of success
0.9444444
We now consider a case-control study on breast cancer: for 1372 subjects the breast cancer status and the BRCA1 variant (Pro/Pro, Pro/Leu or Leu/Leu) were recorded.

cancer variant variant2
Length:1372 Length:1372 Length:1372
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Genotype | Controls | Cases | Total |
---|---|---|---|
Pro/Pro | 266 (a) | 342 (d) | 608 (a+d) |
Pro/Leu | 250 (b) | 369 (e) | 619 (b+e) |
Leu/Leu | 56 (c) | 89 (f) | 145 (c+f) |
Total | 572 (a+b+c) | 800 (d+e+f) | 1372 (n) |
\[\begin{equation*} Odds=\frac{p}{1-p} \end{equation*}\] with \(p\) the probability of the event.

Transformation of the risk with the following properties:

- the odds takes values between 0 and \(\infty\)
- the odds equals 1 only for probability \(p=1/2\)
- the odds increases as the probability increases
- popular in gambling: how much more likely is it to win than to lose
Genotype | Controls | Cases | Total |
---|---|---|---|
Pro/Pro | 266 (a) | 342 (d) | 608 (a+d) |
Pro/Leu | 250 (b) | 369 (e) | 619 (b+e) |
Leu/Leu | 56 (c) | 89 (f) | 145 (c+f) |
Total | 572 (a+b+c) | 800 (d+e+f) | 1372 (n) |
Odds of the Leu/Leu variant
Cases: \(\mbox{odds}_1=\frac{ f/(d+e+f)}{(d+e)/(d+e+f)}=f/(d+e)=89/711=0.125\). The double Leu/Leu variant is 8 times less likely than other allele combinations in the breast cancer cases
Controls: \(\mbox{odds}_2=c/(a+b)=56/516=0.109\).
Association between exposure and outcome: \[ OR_{Leu/Leu}=\frac{\mbox{odds}_1}{\mbox{odds}_2}= \frac{f/(d+e)}{c/(a+b)}=\frac{f(a+b)}{c(d+e)}=1.15 \]
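As a quick check, a minimal R sketch using the counts from the table above (object names are only illustrative):

```r
# odds of the Leu/Leu variant among cases and among controls, and their ratio
oddsCases    <- 89 / (342 + 369)  # f/(d+e)
oddsControls <- 56 / (266 + 250)  # c/(a+b)
oddsCases / oddsControls          # odds ratio, approx. 1.15
```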
Is the difference large enough to generalize the effect in the sample towards the population?
We will first rewrite the data in a 2x2 table
Genotype | Controls | Cases | Total |
---|---|---|---|
other | 516 (a) | 711 (c) | 1227 (a+c) |
Leu/Leu | 56 (b) | 89 (d) | 145 (b+d) |
Total | 572 (a+b) | 800 (c+d) | 1372 (n) |
Test the association between a categorical exposure (e.g. variant, \(X\)) and a categorical response (e.g. disease, \(Y\)). \[H_0: \text{There is no association between } X \text{ and } Y \text{ vs } H_1: X \text{ and } Y \text{ are associated}\]
Consider row totals \(n_\text{other}=a+c\), \(n_\text{leu,leu}=b+d\) and
column totals \(n_\text{contr}=a+b\) and \(n_\text{case}=c+d\).
They give information on marginal distribution of the exposure (variant, X) and outcome (disease, Y), but not on the association between these variables.
Under \(H_0\), \(X\) and \(Y\) are independent and one expects that a fraction \((b+d)/n\) of the \(a+b\) controls carries the Leu/Leu variant, i.e. that \((a+b)(b+d)/n\) of them carry the Leu/Leu variant.
We can calculate this expected number \(E_{ij}\) under the null hypothesis for each cell of the \(2 \times 2\) table.
\(E_{11}\) = Expected number under \(H_0\) in (1,1)-cell = 1227 \(\times\) 800/1372 = 715.5 ;
\(E_{12}\) = Expected number under \(H_0\) in (1,2)-cell = 1227 \(\times\) 572/1372 = 511.5 ;
\(E_{21}\) = Expected number under \(H_0\) in (2,1)-cell = 145 \(\times\) 800/1372 = 84.55 ;
\(E_{22}\) = Expected number under \(H_0\) in (2,2)-cell = 145 \(\times\) 572/1372 = 60.45 ;
Test-statistic: \[\begin{eqnarray*} X^2 &=& \frac{\left (|O_{11} - E_{11}| - .5 \right)^2 }{ E_{11}} + \frac{ \left ( |O_{12} - E_{12}| - .5 \right)^2 }{E_{12} }+ \\ &&\quad\quad \frac{ \left ( |O_{21} - E_{21}| - .5 \right)^2 }{E_{21}}+ \frac{ \left ( |O_{22} - E_{22}| - .5 \right)^2 }{E_{22} }\\ X^2 &\stackrel{H_0}{\longrightarrow}& \chi^2(df=1) \end{eqnarray*}\]
expected <- matrix(0, nrow = 2, ncol = 2)
for (i in 1:2) {
  for (j in 1:2) {
    expected[i, j] <- sum(brcaTab2[i, ]) * sum(brcaTab2[, j]) / sum(brcaTab2)
  }
}
expected
[,1] [,2]
[1,] 715.4519 511.5481
[2,] 84.5481 60.4519
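The continuity-corrected \(X^2\) statistic and its p-value (the output below) can then be computed by hand; this sketch assumes the observed \(2\times 2\) table is stored as `brcaTab2`, as in the code above:

```r
x2 <- sum((abs(brcaTab2 - expected) - 0.5)^2 / expected)  # Yates-corrected statistic
pchisq(x2, df = 1, lower.tail = FALSE)                    # p-value
```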
[1] 0.481519
In R you can switch the continuity correction on or off by setting the argument `correct` to TRUE or FALSE, respectively:
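For example (output below):

```r
chisq.test(brcaTab2)                   # with Yates' continuity correction (default)
chisq.test(brcaTab2, correct = FALSE)  # without continuity correction
```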
Pearson's Chi-squared test with Yates' continuity correction
data: brcaTab2
X-squared = 0.49542, df = 1, p-value = 0.4815
Pearson's Chi-squared test
data: brcaTab2
X-squared = 0.62871, df = 1, p-value = 0.4278
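The next output shows Fisher's exact test on the same table, which does not rely on the large-sample \(\chi^2\) approximation; it was presumably obtained with:

```r
fisher.test(brcaTab2)
```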
Fisher's Exact Test for Count Data
data: brcaTab2
p-value = 0.4764
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.5974269 1.2501878
sample estimates:
odds ratio
0.8670925
The \(\chi^2\)-test can also be used if one of the categorical variables \(X\) and \(Y\) has more than 2 levels
Again: the null hypothesis \(H_0\): \(X\) and \(Y\) are independent (not associated), against the alternative \(H_A: X\) and \(Y\) are associated.
Let the variable in the rows have \(r\) different possible outcomes and the one in the columns \(c\) possible outcomes, then we obtain an \(r \times c\) table.
Again we will compare the observed values in cell \((i,j)\), referred to as \(O_{ij}\), with the expected number under \(H_0\), \(E_{ij}\)
Again \(E_{ij}\) is the product of the \(i^\text{th}\) row total and the \(j^\text{th}\) column total divided by the overall total.
\[\begin{equation*} X^2 = \sum_{ij} \frac{\left (O_{ij} - E_{ij}\right)^2 }{ E_{ij}} \end{equation*}\]
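Applied to the full \(3\times 2\) genotype table (stored as `brcaTab`, judging from the output below):

```r
chisq.test(brcaTab)
```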
Pearson's Chi-squared test
data: brcaTab
X-squared = 2.0551, df = 2, p-value = 0.3579
Framework to model binary data (e.g. cancer vs no cancer): the logistic regression model.
Model binary data with continuous and/or dummy variables.
The model assumes that observations for subject \(i=1,\ldots,n\) are independent and follow a Bernoulli distribution.
The logarithm of the odds is modelled using a linear model, also referred to as linear predictor: \[\begin{equation} \left\{ \begin{array}{ccl} Y_i&\sim&B(\pi_i)\\\\ \log \frac{\pi_i}{1-\pi_i}&=&\beta_0 + \beta_1X_{i1} + \ldots + \beta_p X_{ip} \end{array}\right. \end{equation}\]
Breast cancer example: is the BRCA1 variant associated with breast cancer?
As in the ANOVA context, a factor in logistic regression is coded using dummy variables.
One dummy variable fewer than the number of groups.
For BRCA 1 we need two dummy variables and obtain the following linear predictor:
\[\begin{eqnarray*} \log \frac{\pi_i}{1-\pi_i} &=& \beta_0+\beta_1 x_{i1} +\beta_2 x_{i2} \end{eqnarray*}\]
with: \[x_{i1} = \left\{ \begin{array}{ll} 1 & \text{if subject $i$ is heterozygous, Pro/Leu variant} \\ 0 & \text{if subject $i$ is homozygous (Pro/Pro or Leu/Leu variant)} \end{array}\right. .\] \[x_{i2} = \left\{ \begin{array}{ll} 1 & \text{if subject $i$ is homozygous in the Leucine mutation: Leu/Leu} \\ 0 & \text{if subject $i$ is not homozygous in the Leu/Leu variant} \end{array}\right. .\]
Homozygosity in the wild type allele Pro/Pro is the reference group.
We fit the model in R.

- We use `as.factor` to convert the cancer variable to a factor variable.
- We further use the function `relevel` to specify the control group as the reference group.

brca$cancer <- brca$cancer %>% as.factor %>% relevel("control")
brca$variant <- brca$variant %>% as.factor %>% relevel("pro/pro")
brcaLogit <- glm(cancer~variant,data=brca,family=binomial)
summary(brcaLogit)
Call:
glm(formula = cancer ~ variant, family = binomial, data = brca)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.379 -1.286 1.017 1.017 1.073
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.25131 0.08175 3.074 0.00211 **
variantleu/leu 0.21197 0.18915 1.121 0.26243
variantpro/leu 0.13802 0.11573 1.193 0.23302
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1863.9 on 1371 degrees of freedom
Residual deviance: 1861.9 on 1369 degrees of freedom
AIC: 1867.9
Number of Fisher Scoring iterations: 4
The intercept is the log-odds on cancer in the reference class (Pro/Pro) and the slope terms are log odds ratios between treatment and reference class: \[\begin{eqnarray*} \log \text{ODDS}_\text{Pro/Pro}&=&\beta_0\\\\ \log \text{ODDS}_\text{Pro/Leu}&=&\beta_0+\beta_1\\\\ \log \text{ODDS}_\text{Leu/Leu}&=&\beta_0+\beta_2\\\\ \log \frac{\text{ODDS}_\text{Pro/Leu}}{\text{ODDS}_\text{Pro/Pro}}&=&\log \text{ODDS}_\text{Pro/Leu}-\log ODDS_{Pro/Pro}\\ &=&\beta_0+\beta_1-\beta_0=\beta_1\\\\ \log \frac{\text{ODDS}_\text{Leu/Leu}}{\text{ODDS}_\text{Pro/Pro}}&=&\beta_2 \end{eqnarray*}\]
The \(\chi^2\)-test on the logistic regression model also indicates that there is no significant association between cancer status and the genetic variant of the BRCA gene (\(p=\) 0.358). The p-value is nearly identical to that of the \(\chi^2\)-test in the previous section.
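One way to obtain such a model-based test is a likelihood-ratio test that compares the model with and without the variant; the exact call used here is not shown, but for example:

```r
anova(brcaLogit, test = "Chisq")  # likelihood-ratio test for the variant effect
```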
Significant association?
suppressPackageStartupMessages({library(multcomp)})
posthoc <- glht(brcaLogit,linfct=mcp(variant = "Tukey"))
posthocTests <- summary(posthoc)
posthocTests
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: glm(formula = cancer ~ variant, family = binomial, data = brca)
Linear Hypotheses:
Estimate Std. Error z value Pr(>|z|)
leu/leu - pro/pro == 0 0.21197 0.18915 1.121 0.493
pro/leu - pro/pro == 0 0.13802 0.11573 1.193 0.449
pro/leu - leu/leu == 0 -0.07395 0.18922 -0.391 0.917
(Adjusted p values reported -- single-step method)
Simultaneous Confidence Intervals
Multiple Comparisons of Means: Tukey Contrasts
Fit: glm(formula = cancer ~ variant, family = binomial, data = brca)
Quantile = 2.3227
95% family-wise confidence level
Linear Hypotheses:
Estimate lwr upr
leu/leu - pro/pro == 0 0.21197 -0.22737 0.65131
pro/leu - pro/pro == 0 0.13802 -0.13079 0.40683
pro/leu - leu/leu == 0 -0.07395 -0.51345 0.36555
With the `confint` function, confidence intervals on the log-odds ratios are obtained that are corrected for multiple testing. Exponentiating the estimates and interval bounds gives the odds ratios with their confidence intervals:

Estimate lwr upr
leu/leu - pro/pro 1.2361111 0.7966292 1.918045
pro/leu - pro/pro 1.1480000 0.8774041 1.502049
pro/leu - leu/leu 0.9287191 0.5984282 1.441308
attr(,"conf.level")
[1] 0.95
attr(,"calpha")
[1] 2.322728
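A possible way to produce the output above is to exponentiate the simultaneous confidence intervals on the log-odds ratios (a sketch, using the `posthoc` object defined earlier):

```r
ORci <- confint(posthoc)$confint %>% exp
ORci
```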
The odds ratios that we obtain are exactly equal to the ones we calculated based on the contingency table:
e.g \(\text{OR}_\text{Leu/Leu-Pro/Pro}=89\times 266/(56\times 342)=\) 1.236.
Note that statistical inference for logistic regression relies on asymptotic theory.
##Design

- 32 independent experiments.
- Each time 1 beetle is exposed to one of 8 CS\(_2\) concentrations (mg/l).
- The outcome: mortality (\(y=1\)) or survival (\(y=0\)).
beetles<-read_csv("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/beetles.csv")
table(beetles)
0 1
169.07 3 1
172.42 3 1
175.52 3 1
178.42 2 2
181.13 1 3
183.69 0 4
186.1 0 4
188.39 0 4
We build a model for the log odds as a function of the dose \(x_i\): \[\log \frac{\pi_i}{1-\pi_i}=\beta_0+\beta_1 \times x_i.\]
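The model is fitted with `glm`; its summary output follows. The fitted object is called `beatleModel` here, as that name is used in the plotting code further below:

```r
beatleModel <- glm(status ~ dose, family = binomial, data = beetles)
summary(beatleModel)
```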
Call:
glm(formula = status ~ dose, family = binomial, data = beetles)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7943 -0.7136 0.2825 0.5177 2.1670
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -53.1928 18.0046 -2.954 0.00313 **
dose 0.3013 0.1014 2.972 0.00296 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 42.340 on 31 degrees of freedom
Residual deviance: 26.796 on 30 degrees of freedom
AIC: 30.796
Number of Fisher Scoring iterations: 5
The intercept can be interpreted as the log odds of mortality when no \(\text{CS}_2\) gas is applied (dose = 0).
This corresponds to very small odds of mortality (\(\pi/(1-\pi)=\exp(-53.2)\)), so the probability of dying is almost 0.
Note that this is a very large extrapolation: minimum dose in dataset is 169.07 mg/l.
Estimated odds ratio for the effect of dose on the mortality probability is \(\exp(0.3013)=1.35\).
So a beetle exposed to a CS\(_2\) concentration that is 1 mg/l higher than that of another beetle will on average have odds of mortality that are a factor \(1.35\) higher.
beetlesTab <- table(beetles) %>% data.frame

data.frame(grid = seq(min(beetles$dose), max(beetles$dose), .1)) %>%
  mutate(piHat = predict(beatleModel,
                         newdata = data.frame(dose = grid),
                         type = "response")) %>%
  ggplot(aes(grid, piHat)) +
  geom_line() +
  xlab("dose") +
  ylab("probability (dead)") +
  # add the observed counts of survivors (status 0) and deaths (status 1) at each dose
  geom_text(aes(x = dose %>% as.character %>% as.double,
                y = status %>% as.character %>% as.double,
                label = Freq),
            data = beetlesTab %>% filter(status == 0)) +
  geom_text(aes(x = dose %>% as.character %>% as.double,
                y = status %>% as.character %>% as.double,
                label = Freq),
            data = beetlesTab %>% filter(status == 1))