Assess genotoxicity of 1,2-dimethylhydrazine dihydrochloride (DMH) (EU directive)
 
Comet assay
dna <- read_delim("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/dna.txt", delim = " ")
dna$dose <- as.factor(dna$dose)
dnadna %>%
  ggplot(aes(x = dose, y = length, fill = dose)) +
  geom_boxplot() +
  geom_point(position = "jitter")
dna %>%
  ggplot(aes(sample = length)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~dose)
plot(lm(length ~ dose, data = dna))



The Kruskal-Wallis Rank Test (KW-test) is a non-parameteric alternative for ANOVA F-test.
Classical \(F\)-teststatistic can be written as \[ F = \frac{\text{SST}/(g-1)}{\text{SSE}/(n-g)} = \frac{\text{SST}/(g-1)}{(\text{SSTot}-\text{SST})/(n-g)} , \]
with \(g\) the number of groups.
SSTot depends only on outcomes \(\mathbf{y}\) and will not vary in permutation test.
SST can be used as statistic \[\text{SST}=\sum_{j=1}^t n_j(\bar{Y}_j-\bar{Y})^2\]
The KW test statistic is based on SST on rank-transformed outcomes1, \[ \text{SST} = \sum_{j=1}^g n_j \left(\bar{R}_j - \bar{R}\right)^2 = \sum_{j=1}^t n_j \left(\bar{R}_j - \frac{n+1}{2}\right)^2 , \]
with \(\bar{R}_j\) the mean of the ranks in group \(j\), and \(\bar{R}\) the mean of all ranks, \[ \bar{R} = \frac{1}{n}(1+2+\cdots + n) = \frac{1}{n}\frac{1}{2}n(n+1) = \frac{n+1}{2}. \]
The KW teststatistic is given by \[ KW = \frac{12}{n(n+1)} \sum_{j=1}^g n_j \left(\bar{R}_j - \frac{n+1}{2}\right)^2. \]
The factor \(\frac{12}{n(n+1)}\) is used so that \(KW\) has a simple asymptotic null distribution. In particular under \(H_0\), given thart \(\min(n_1,\ldots, n_g)\rightarrow \infty\), \[ KW \rightarrow \chi^2_{t-1}. \]
The exact KW-test can be executed by calculating the permutation null distribution (that only depends on \(n_1, \ldots, n_g\)) to test \[H_0: f_1=\ldots=f_g \text{ vs } H_1: \text{ at least two means are different}.\]
In order to allow \(H_1\) to be formulated in terms of means, the assumption of locations shift should be valid.
For DMH example this is not the case.
If location-shift is invalid, we have to formulate \(H_1\) in terms of probabilistic indices: \[H_0: f_1=\ldots=f_g \text{ vs } H_1: \exists\ j,k \in \{1,\ldots,g\} : \text{P}\left[Y_j\geq Y_k\right]\neq 0.5\]
kruskal.test(length ~ dose, data = dna)
    Kruskal-Wallis rank sum test
data:  length by dose
Kruskal-Wallis chi-squared = 14, df = 3, p-value = 0.002905On the \(5\%\) level of significance we can reject the null hypothesis.
R-functie kruskal.test only returns the asymptotic approximation for \(p\)-values.
With only 6 observaties per groep, this is not a good approximation of the \(p\)-value
With the coin R package we can calculate the exacte \(p\)-value
library(coin)
kwPerm <- kruskal_test(length ~ dose,
  data = dna,
  distribution = approximate(B = 100000)
)
kwPerm
    Approximative Kruskal-Wallis Test
data:  length by dose (0, 1.25, 2.5, 5)
chi-squared = 14, p-value = 0.00043We conclude that the difference in the distribution of the DNA damages due to the DMH dose is extremely significantly different.
Posthoc analysis with WMW tests.
pairwise.wilcox.test(dna$length, dna$dose)
    Pairwise comparisons using Wilcoxon rank sum exact test 
data:  dna$length and dna$dose 
     0     1.25  2.5  
1.25 0.013 -     -    
2.5  0.013 0.818 -    
5    0.013 0.721 0.788
P value adjustment method: holm pairwise.wilcox.test output. Point estimate on probability on higher DNA-damage?nGroup <- table(dna$dose)
probInd <- combn(levels(dna$dose), 2, function(x) {
  test <- wilcox.test(length ~ dose, subset(dna, dose %in% x))
  return(test$statistic / prod(nGroup[x]))
})
names(probInd) <- combn(levels(dna$dose), 2, paste, collapse = "vs")
probInd  0vs1.25    0vs2.5      0vs5 1.25vs2.5   1.25vs5    2.5vs5 
0.0000000 0.0000000 0.0000000 0.4444444 0.2777778 0.3333333 Because there are doubts on the location-shift model we draw our conclusions in terms of the probabilistic index.
we assume that no ties are available↩︎