---
aliases:
- probability-and-hypothesis-testing.html
---
# Probability and Hypothesis Testing
```{r}
#| include: false
source("./_common.R")
```
> ... in which we reason about the nature of randomness and
discover various statistical tests.
{{< youtube LBU22HxJm6I >}}
## Motivation
```{r}
library(tidyverse)
```
In the first four lectures we covered the fundamentals of handling data with R.
Now, we will shift our focus away from the **how** and towards the **why** of
data analysis. We will talk about different statistical tests, common mistakes,
how to avoid them and how to spot them in other research. But of course, we will
do so using R, so you will still pick up a useful function or technique here and
there along the way. In most instances it should be clear whether I use R solely
to demonstrate an idea from statistics, with the code included just for the
curious, or whether the code is something you will likely also use for your own
analysis. I am open to questions if things are unclear in either case. For code
that is purely there for aesthetics I might also speed up the typing in the edit.
For the longer text parts it might be helpful to look at the script while watching the video or pause frequently to take your own notes (Rmarkdown is great for your lecture notes as well!).
## Statistically Significant...
> ...you keep using that word. I don't think it means what you think it means.
![statistically significant](images/statistically-significant.jpg)
You will hear the phrases "statistically significant", "significant" or even
"very significant" thrown around quite a bit in academic literature. And while
they are often used carelessly, they have a clearly defined meaning. A meaning
we will uncover today. This meaning is related to the concept of so-called
p-values, which have an equally bad reputation for frequently being misused. The
p in p-value stands for **probability**, so in order to understand p-values, we
need to understand probability and learn how to deal with randomness, chance, or
luck if you will. So...
## Getting our Hands dirty with Probability
> To understand statistics means understanding the nature of randomness first.
```{r chess-board, fig.cap="A ggplot chessboard", echo=FALSE, fig.asp=1, fig.width=3}
n = 8
crossing(x = 1:n,
y = 1:n) %>%
mutate(f = xor((x %% 2 == 0), (y %% 2 == 0))) %>%
ggplot(aes(x, y, fill = f)) +
geom_tile() +
coord_equal() +
scale_fill_manual(values = c("#388e00", "#122a00")) +
theme_void() +
guides(fill = "none")
```
Say you and your friend are playing a game of chess, when your friend proudly
proclaims:\
"I am definitely the better player!"\
"Prove it!", you reply.\
"That's easy", she says: "I won 7 out of the 8 rounds we played today."\
"Pah! That's just luck." is your less witty and slightly stubborn response.
As expected, we shall be using R to resolve this vital conflict.
![R rainbow](images/paste-5E31A2FF.png)
### Definitions: Hypothesis
Both of you involuntarily uttered a hypothesis, a testable assumption. And we
want to test these hypotheses using statistics. The first hypothesis ("I am the
better player.") is what we call the **alternative hypothesis** ($H_1$). The
name can be a bit confusing, because most often this is your actual scientific
hypothesis, the thing you are interested in. So, alternative to what? It is
alternative to the so-called **null hypothesis** ($H_0$), which is the second
statement ("This is just luck"). The null hypothesis provides a sort of baseline
for all our findings. It usually goes along the lines of "What if our
observations are based on chance alone?", where "chance" can be any source of
random variation in our system.

The tricky part is that there is no way to directly test the alternative
hypothesis; all we can test is the null hypothesis. For any null hypothesis we
discard, there are always multiple alternative hypotheses that could explain our
data. In our example, even if we end up discarding the idea that our friend's
chess success is only down to luck, this does not prove the alternative
hypothesis that she is the better player (she could still be cheating, for
example). Do keep this in mind when we transfer this to a more scientific
setting. Just because we show that something is unlikely to have arisen by
chance does not mean that your favorite alternative hypothesis is automatically
true.

So, after these words of warning, let's test some null hypotheses!
### Testing the Null Hypothesis with a Simulation
We will start off by building a little simulation. Before testing any
hypothesis, it is important to have defined $H_0$ and $H_1$ properly, which we did in
the previous section. But we need to be a little more specific. Winning by
chance would entail a completely random process, which we can model with a coin
flip. R has the lovely function `sample` to take any number of things from a
vector, with or without replacement after taking each thing:
```{r}
coin <- c("heads", "tails")
sample(coin)
```
Not giving it a number of things to draw just shuffles the vector, which is
fairly boring in the case of just two things. We can't sample 10 things from a
vector of only two elements:
```{r, error=TRUE}
sample(coin, size = 10)
```
But we can, if we put the thing back every time:
```{r}
sample(coin, 10, replace = TRUE)
```
So, let's make this a little more specific to our question:
```{r}
winner <- c("you", "friend")
random_winners <- sample(winner, size = 8, replace = TRUE)
random_winners
random_winners == "friend"
1 + TRUE
```
```{r}
1 + FALSE
```
```{r}
sum(random_winners == "friend")
```
```{r}
mean(random_winners == "friend")
```
If we were to run this script a million times, the resulting proportion of
random wins for both of you would be very, very close to 50-50 because we used a
fair coin. However, we don't have the time to play this much chess, and we sure
don't have the money to run a million replicates for each experiment in the lab.
But here, in our little simulated world, we have near-infinite resources (our
simulation is not too computationally costly).
> One trick used above: When we calculate e.g. a sum or mean,
R automatically converts TRUE to 1 and FALSE to 0.
Let's create a function that returns a random number of wins your friend would have
gotten by pure chance for a number of rounds `N`.
```{r}
get_n_win <- function(N) {
winner <- c("you", "friend")
random_winners <- sample(winner, size = N, replace = TRUE)
sum(random_winners == "friend")
}
get_n_win(8)
```
This number is different every time we run the function. So how is it
distributed? Let's repeat the simulation 1000 times:
```{r}
result <- map_dbl(rep(8, 1000), get_n_win)
head(result)
```
A histogram is a type of plot that shows how often each value occurs in a
vector. Usually, continuous values are put into bins first, grouping close
values together, but in this case it makes sense to have one value per bin
because we are dealing with discrete values (there are no half-wins). Histograms
can display either the raw counts or the frequency, e.g. as a percentage. In
ggplot, we use `geom_bar` when we don't need any binning, just counting
occurrences, and `geom_histogram` when we need to bin continuous values.
```{r}
tibble(result) %>%
ggplot(aes(x = result)) +
geom_bar() +
labs(x = "N wins for friend",
title = "Throwing a coin 8 times") +
scale_x_continuous(breaks = 0:8)
```
As expected, the most common number of wins out of 8 is 4 (unless I got really
unlucky when compiling this script). Let us see how this **distribution**
changes for different values of `N`. First, we set up a grid of numbers (all
possible combinations) so that we can run a bunch of simulations:
```{r}
simulation <- crossing(
N = 1:15,
rep = 1:1000
) %>%
mutate(
wins = map_dbl(N, get_n_win)
)
```
And then we use our trusty ggplot to visualize all the distributions.
```{r}
simulation %>%
ggplot(aes(wins)) +
geom_bar() +
facet_wrap(~N, labeller = label_both) +
labs(title = "Flipping a coin N times")
```
With a fair coin, the most common number of wins should be half the number of
coin flips. Note how it is still possible to flip a coin 15 times and not win a
single time. It is just very unlikely, and the bars are so small that we can't
see them.
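As a quick check (my own addition, not part of the original script), the exact
probability of zero wins in 15 fair coin flips is:

```{r}
# Probability of losing all 15 coin flips with a fair coin: (1/2)^15
0.5^15
```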
Let us go back to the original debate. The first statement, "I am better.", is
something that can never be definitively proven, because there is always the
possibility, no matter how small, that the same result could have arisen by pure
chance alone. Even if she wins 100 times and we don't take a single game from
her, this sort of outcome is still not *impossible* with nothing but coin flips.
But what we can do is calculate how likely a certain event is under the
assumption of the null hypothesis (only chance). And we can also decide on some
threshold $\alpha$ at which we reject the null hypothesis. This is called the
**significance threshold**. When we make an observation and then calculate that
the probability of an observation this extreme or more extreme is smaller than
the threshold, we deem the result **statistically significant**. The probability
thus calculated is called the **p-value**.

From our simulation, we find that the probability of winning 7 or more out of 8
rounds under the null hypothesis is:
```{r}
simulation %>%
filter(N == 8) %>%
summarise(
mean(wins >= 7)
)
```
Which is smaller than the commonly used significance threshold of $\alpha=0.05$
(i.e. $5\%$). So with 7 out of 8 wins, we would **reject the null hypothesis**.
Do note that this threshold, no matter how commonly and thoughtlessly it is used
throughout academic research, is completely arbitrary.
### Getting precise with the Binomial Distribution
Now, this was just from a simulation with 1000 trials, so the number can't be
arbitrarily precise, but there is a mathematical formula for this probability.
What we created by counting the number of successes in a series of yes-no trials
is a **binomial distribution**. For the most common distributions, R provides a
set of functions. The functions starting with `d` give us the probability
density function. In the case of discrete values like counting wins, this is
equivalent to the actual probability, but for continuous values we obtain a
probability by taking the integral. We get these integrals with the
corresponding functions starting with `p` (for probability).
```{r}
dbinom(x = 7, size = 8, prob = 0.5)
```
This is the probability of winning **exactly 7 out of 8** games. But what we
wanted was the probability of **7 or more out of 8**! So we move to the
integral. This part can get a bit confusing, because the default for `pbinom` is
`lower.tail = TRUE`, which according to the help page means that it returns the
probabilities $P[X \le x]$.
```{r}
pbinom(q = 7, size = 8, prob = 0.5)
```
If we set `lower.tail` to `FALSE`, we get $P[X > x]$, the probability of a
random variable $X$ being bigger than a number $x$. So to get the probability we
are interested in, we also need to replace the 7 with a 6:
```{r}
pbinom(q = 6, size = 8, prob = 0.5, lower.tail = FALSE)
```
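As a sanity check (my addition), we can obtain the same number by summing the
exact probabilities for 7 and 8 wins:

```{r}
# P(X = 7) + P(X = 8) should equal P(X > 6) from above
dbinom(x = 7, size = 8, prob = 0.5) + dbinom(x = 8, size = 8, prob = 0.5)
```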
Our simulation was pretty close! The exact value agrees, and we reject the null
hypothesis of both opponents being equally good.
Here is the full graph for the probability density function of the binomial
distribution.
```{r}
ggplot() +
stat_function(fun = function(x) dbinom(x = x, size = 8, prob = 0.5),
geom = "step",
n = 9) +
scale_x_continuous(n.breaks = 9, limits = c(0, 8))
```
And the integral, the probability $P[X \le x]$.
```{r}
ggplot() +
stat_function(fun = function(q) pbinom(q = q, size = 8, prob = 0.5),
geom = "step",
n = 9) +
scale_x_continuous(n.breaks = 9, limits = c(0, 8))
```
There are two more functions I want to showcase from this family. The third is
the so-called **quantile function**. Quantiles divide a probability distribution
into pieces of equal probability. One example of a quantile is the 50th
percentile, also known as the median, which divides the values such that half of
them are above and half are below. We can keep dividing the two halves as well,
so that we end up with more quantiles, and eventually we arrive at the quantile
function. It is the inverse of the probability function, so you obtain its graph
by swapping the axes.
```{r}
ggplot() +
stat_function(fun = function(p) qbinom(p = p, size = 8, prob = 0.5),
geom = "step",
n = 9) +
scale_x_continuous(n.breaks = 10, limits = c(0, 1))
```
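As a small concrete example (my addition), the 50% quantile of our 8-game
distribution is the familiar 4 wins:

```{r}
# The median number of wins in 8 games with a fair coin
qbinom(p = 0.5, size = 8, prob = 0.5)
```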
Quantiles will also be useful for
deciding if a random sample follows a certain distribution
with quantile-quantile plots.
Lastly, there is always also an `r` variant of the function, which gives us any
number of random numbers from the distribution.
```{r}
rbinom(10, 8, 0.5)
```
### But how much better? Understanding Effect Size and Power, False Positives and False Negatives
We decided to abandon the null hypothesis that both players are equally good,
which equates to a 50% win-chance for each player. But we have not determined
**how much better** she is. And how much better does she need to be for us to
reliably discard the null hypothesis after just 8 games? The generalization of
the **how much better** part, the **true difference**, is called the **effect
size**.
Our ability to decide that something is statistically significant when there is
in fact a true difference is called the statistical **power**. It depends on the
effect size, our significance threshold $\alpha$ and the sample size $n$ (the
number of games). We can explore the concept with another simulation.
```{r}
reps <- 10000
simulation <- crossing(
N = c(8, 100, 1000, 10000),
true_prob = c(0.5, 0.8, 0.9)
) %>%
rowwise() %>%
mutate(
wins = list(rbinom(n = reps, size = N, prob = true_prob)),
) %>%
unnest(wins) %>%
mutate(
p = pbinom(q = wins - 1, size = N, prob = 0.5, lower.tail = FALSE)
)
head(simulation)
```
I also introduced a new piece of advanced dplyr syntax. `rowwise` is similar to
`group_by` and essentially puts each row into its own group. This can be useful
when working with list columns or running a function with varying arguments, and
it allows us to treat the inside of `mutate` a bit as if we were using one of
the `map` functions. For more information, see [the documentation
article](https://dplyr.tidyverse.org/articles/rowwise.html).
This leaves us with `r format(reps)` simulated numbers of wins for each
combination of N games and true probability of her winning (i.e. how much better
our friend is). We then calculate the probability of getting this or a greater
number of wins under the null hypothesis (equal probability of win and loss), in
other words: the p-value.
```{r}
simulation %>%
ggplot(aes(p)) +
geom_histogram() +
facet_wrap(~ true_prob + N,
labeller = label_both,
scales = "free_y",
ncol = 4) +
geom_vline(xintercept = 0.05, color = "red") +
labs(x = "p-value",
y = "frequency") +
scale_y_continuous(breaks = NULL)
```
We notice a couple of things in this plot. As the number of games played grows
large, the p-values for the case where the null hypothesis is in fact true (both
players have the same chance of winning) start following a uniform distribution,
meaning that for a true null hypothesis all p-values are equally likely. This
seems counterintuitive at first, but it is a direct consequence of the
definition of the p-value.

The consequence of this is that if we apply our regular significance threshold
of 5%, we will sometimes say that there is a true difference even though there
is none (i.e. the null hypothesis is true but we falsely reject it in favor of
our alternative hypothesis). This is called a **false positive**. By definition,
a proportion $\alpha$ of all experiments with a true null hypothesis will
produce false positives. Later, we will learn why the real rate of false
positives is even higher.
Another name for false positives is **Type I errors**.
On the other side of the coin, there are also cases where there is a true
difference (we used winning probabilities of 0.8 and 0.9), but we don't reject
the null hypothesis because we get a p-value larger than $\alpha$. These are all
**false negatives** and their rate is sometimes referred to as $\beta$. Another
name for false negatives is **Type II errors**. People don't particularly like
talking about negative things like errors, so instead you will often see the
complement of $\beta$, the **statistical power** $1-\beta$.
The proportion of correctly identified positives out of
the actual positives is also shown on the plot below.
For example, say her true win probability is 90%
and we play 8 games. If this experiment runs in an infinite
number of parallel universes, we will conclude that she is
better than chance in 80% of those.
We could set our significance threshold higher to detect
more of the true positives, but this would
also increase our false positives.
```{r}
simulation %>%
group_by(true_prob, N) %>%
summarise(signif = mean(p <= 0.05)) %>%
ggplot(aes(true_prob, signif, fill = true_prob == 0.5)) +
geom_col(color = "black") +
geom_text(aes(label = signif), vjust = -0.2) +
facet_wrap(~N,
labeller = label_both) +
scale_y_continuous(expand = expansion(c(0, 0.1))) +
scale_fill_viridis_d() +
labs(y = "Proportion of significant results")
```
There are also packages out there that provide a function to compute the power
of the binomial test, but I think the simulation is way more approachable. The
cool thing about simulations is also that they work even when there is no
analytical solution, so you can use them to play around when planning an
experiment.
## P-Value Pitfalls
Let us look into some of the pitfalls of p-values. Remember from the definition
of p-values that we will get a significant result in 5% of cases even if there
is no true difference (assuming we use this as our $\alpha$)? Well, what if we
test a bunch of things? This is called **Multiple Testing** and there is a
problem associated with it: if you test 20 different things and your statistical
test produces a significant result by chance alone in 5% of cases, the expected
number of significant results is 1. So we are not very surprised to find one.
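Here is a minimal sketch (my own addition, not part of the original analysis) of
this situation, reusing the chess setup from above: 20 independent experiments
of 100 games each, all played with a fair coin, so the null hypothesis is true
in every single one of them.

```{r}
set.seed(42)  # arbitrary seed, just for reproducibility
n_tests <- 20
# number of wins in each of the 20 experiments; H0 (fair coin) holds in all of them
wins <- rbinom(n_tests, size = 100, prob = 0.5)
p_values_null <- pbinom(q = wins - 1, size = 100, prob = 0.5, lower.tail = FALSE)
# how many experiments come out "significant" purely by chance?
sum(p_values_null <= 0.05)
```

Averaged over many repetitions, this count is close to $20 \cdot 0.05 = 1$
(slightly less here, because the binomial distribution is discrete).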
Speaking of surprised: In his book, available for free online,
["Statistics done wrong"](https://www.statisticsdonewrong.com/), Alex Reinhart
describes p-values as a "measure of surprise":
> »A p value is not a measure of how right you are,
or how significant the difference is;
it’s a measure of how surprised you should be if there is no actual difference
between the groups, but you got data suggesting there is.
A bigger difference, or one backed up by more data,
suggests more surprise and a smaller p value.«
> --- Alex Reinhart [@reinhartStatisticsDoneWrong2015]
So, we are not very surprised, but if you focus too hard on the one significant
result, trouble ensues. In a "publish or perish" mentality, this can easily
happen, and negative findings are not published nearly enough, so most published
findings are likely exaggerated.
John Bohannon showcased this beautifully by
running a study on chocolate consumption and
getting it published:
[I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How.](https://io9.gizmodo.com/i-fooled-millions-into-thinking-chocolate-helps-weight-1707251800)
What can we do about this?
### Multiple Testing Correction
The simplest approach is to take all p-values calculated when running a large
number of comparisons and multiply each of them by the number of tests performed
(equivalently, divide the significance threshold $\alpha$ by the number of
tests). This is called the **Bonferroni correction**:
```{r}
p_values <- c(0.5, 0.05, 0.3, 0.0001, 0.003)
p.adjust(p_values, method = "bonferroni")
```
Of course, this loses some statistical power
(remember, no free lunch).
A slightly more sophisticated approach for
controlling the false discovery rate (FDR)
is the Benjamini-Hochberg procedure, which retains
a bit more power. Here is what happens:

- Sort all p-values in ascending order.
- Choose an FDR $q$ you are willing to accept
  and call the number of tests done $m$.
- Find the largest p-value $p_{(i)}$ that satisfies
  $p_{(i)} \leq \frac{i}{m}q$, where $i$ is its rank in the sorted list.
- This p-value is your new threshold for significance.
- Equivalently, scale each p-value by $m/i$ and compare it to $q$.
And this is how you do it in R:
```{r}
p.adjust(p_values, method = "fdr")
```
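To make the recipe above concrete, here is a minimal sketch (my own addition;
the function name `bh_adjust` is made up) of the scaling step behind the
Benjamini-Hochberg adjustment. In practice, just use `p.adjust()`.

```{r}
bh_adjust <- function(p) {
  m <- length(p)
  ord <- order(p)                  # indices that sort the p-values ascending
  q <- p[ord] * m / seq_len(m)     # scale the i-th smallest p-value by m / i
  q <- rev(cummin(rev(q)))         # keep the adjusted values monotone
  q <- pmin(q, 1)                  # probabilities cannot exceed 1
  q[order(ord)]                    # return them in the original order
}
bh_adjust(p_values)                # matches p.adjust(p_values, method = "fdr")
```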
### Other forms of p-hacking
This sort of multiple testing is fairly obvious. You will notice it when you end
up with a large number of p-values, for example when doing a genetic screen and
testing thousands of genes.
Other related problems are harder to spot.
For a single research question there are often different
statistical tests that you could run, but trying
them all out and then choosing the one that best
agrees with your hypothesis is not an option!
Likewise, simply looking at your data is a form of
comparison if it influences your choice of statistical test.
Ideally, you first run some exploratory experiments
that are not meant to test your hypothesis, then
decide on the tests you need, the sample size you
want for a particular power and then run the actual
experiments designed to test your hypothesis.
At this point, here is another shout-out to Alex Reinhart's book
[@reinhartStatisticsDoneWrong2015]. It is a very pleasant read and also sheds
more light on some of the other forms of **p-hacking**.
## Bayesian Statistics and the Base Rate Fallacy
There is another, more subtle problem called the base rate fallacy. As an
example, consider a medical test for a certain condition.
In medical testing, different words are used
for the same concepts we defined above^[This is a slightly annoying trend
in statistics; as it enters different fields, people come
up with new names for old things
(perhaps the most notorious field for this is
machine learning).].
Here, we have:
- Sensitivity = Power = true positive rate = $1-\beta$
- Specificity = true negative rate = $1-\alpha$
Let us assume a test with a sensitivity of 90% and a specificity of 92%. When we
visit the doctor to get tested and get a positive result, what is the
probability that we are in fact positive (i.e. a true positive)? Well, the test
has a specificity of 92%, so if we were negative, it would have detected that in
92% of cases. Does this mean that we can be 92% certain that we are actually
positive?
Well, **no**. What we are ignoring here is the **base rate**, which for diseases
is called the **prevalence**. It is the proportion of the general population
that has the disease. So, let us say we are picking 1000 people at random from
the population and testing them. We are dealing with a hypothetical condition
that affects 1% of people, so we expect 10 people in our sample to be positive.
Of those 10 people, 9 will test positive (due to our sensitivity of 90%); those
are our true positives. The remaining 1 will be a false negative. However, we
are of course also testing the negatives (if we knew the result ahead of time,
there would be no point in testing), and because the specificity is 92%, 8% of
them will also test positive, which is 0.08 * 990, so we get about 79 false
positives. Because there are so many negatives in our sample, even a relatively
high specificity produces a lot of false positives. So the actual probability of
being positive given a positive test result is

$$\frac{true~positives}{true~positives + false~positives}=\frac{9}{9+79}\approx 10\%$$
::: {.callout-warning}
TODO: fix the plot!
(the `waffle` package is broken)
:::
```{r, echo = FALSE}
total <- 1000
positives <- 10
negatives <- total - positives
sensitivity <- 0.9
specificity <- 1 - 0.08
true_positives <- sensitivity * positives
false_positives <- (1 - specificity) * negatives
p_positive <- true_positives / (true_positives + false_positives)
colors = c("detected positives" = "darkred",
"not detected positives (false negatives)" = "red",
"false positives" = "palevioletred1",
"true negatives" = "white")
tibble(
parts = c("detected positives",
"not detected positives (false negatives)",
"false positives",
"true negatives"),
vals = c(true_positives,
positives - true_positives,
round(false_positives),
negatives)
) %>%
ggplot(aes(values = vals, fill = parts)) +
# waffle::geom_waffle(color = "black", n_rows = 10) +
theme_void() +
scale_fill_manual(values = colors) +
theme(legend.position = "bottom") +
guides(fill = guide_legend(title = "")) +
coord_equal()
```
Formally, this is described by Bayes's formula:

$$P(A|B)=\frac{P(B|A) \cdot P(A)}{P(B)}$$

Read: the probability of A given B is the probability of B given A, times the
probability of A, divided by the probability of B. In Bayesian statistics, such
prevalences, i.e. what we believe before seeing the data, are known
as **priors**.
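As a minimal sketch (my own addition), we can plug the numbers from the example
directly into Bayes's formula, with A = "actually positive" and B = "positive
test result":

```{r}
prevalence  <- 0.01   # P(A): 1% of the population has the condition
sensitivity <- 0.90   # P(B | A): probability of a positive test if positive
specificity <- 0.92   # probability of a negative test if negative
# P(B): overall probability of getting a positive test result
p_positive_test <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
# P(A | B): probability of actually being positive given a positive test
sensitivity * prevalence / p_positive_test
```

This reproduces the roughly 10% from the counting argument above.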
## Concepts discussed today
After today you should be familiar with the following concepts:
- Null and alternative hypothesis
- P-values and statistical significance
- Binomial distribution
- Probability density, probability and quantile functions
- Effect size and statistical power
- False positives, false negatives
- Multiple testing and p-hacking
- Bayes's Theorem
## Exercises
### A Fair Coin
We have a regular old coin and flip it 100 times.
Given a significance threshold $\alpha$ of 0.05,
with what probability do we (mistakenly) reject the null hypothesis i.e.
conclude that the coin is not fair even though it is?
Can you show this with a simulation?
As a tip I can tell you that due to the vectorized nature
of the functions involved you won't need a `map` or loop.
The shortest version I could think of uses only 3 functions.
### An Unfair Game
We are playing a game where you have to roll the most sixes in order to win.
Someone is trying to fool us and is using a loaded die.
The die is manipulated such that the chance of rolling
a six is 35% instead of the usual 1/6.
At the same significance threshold as above, what is the
chance of us rejecting the null hypothesis (= a fair die)
and thus concluding correctly that we are being tricked
after rolling the die 20 times?
You will have to run a simulation here again.
### Discovering a new Distribution
The binomial distribution was concerned with sampling with replacement (you can
get heads or tails any number of times without using up the coin). In this
exercise you will explore sampling **without** replacement. The common model for
this is an urn containing balls of two different colors. The resulting
distribution is called the **hypergeometric distribution** and the corresponding
R functions are `<r/d/p/q>hyper`.
- Imagine you are a zoo manager.
- We got a gift from another zoo! It consists
of 8 red pandas and 2 giant pandas.
What is the probability that they end up
properly separated, if we randomly take 8 animals,
put them in one enclosure and put the rest in another?
- Our penguin colony hatched some eggs and we have a bunch of newcomers. We have
15 males and 10 females. If we look at a random subset of 12 penguins, what does
the distribution of the number of males look like? Which number is most likely?
How likely is it to get at least 9 males in the sample?
## Resources
- <https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108>
- <https://jimgruman.netlify.app/post/education-r/>
- [P-Value histograms blogpost by David Robinson](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/)
- ["Statistics done wrong"](https://www.statisticsdonewrong.com/)