(ref:inferpart) Statistical Inference with `infer`

```{r sampling-conditional-text, echo=FALSE, results="asis", purl=FALSE}
if (is_latex_output()) {
  cat("# (PART) (ref:inferpart) {-}")
} else {
  cat("# (PART) Statistical Inference with infer {-} ")
}
```

# Sampling {#sampling}

```{r setup_infer, include=FALSE, purl=FALSE}
# Used to define Learning Check numbers:
chap <- 7
lc <- 0

# Set R code chunk defaults:
opts_chunk$set(
  echo = TRUE,
  eval = TRUE,
  warning = FALSE,
  message = TRUE,
  tidy = FALSE,
  purl = TRUE,
  out.width = "\\textwidth",
#  fig.height = 4,
  fig.align = "center"
)

# Set output digit precision
options(scipen = 99, digits = 3)

# Set random number generator seed value for replicable pseudo-randomness
set.seed(76)
```

The third portion of this book introduces statistical inference. This chapter is about *sampling*. Sampling involves drawing repeated random samples from a population. In Section \@ref(sampling-activity), we illustrate sampling by working with samples of white and red balls and the proportion of red balls in these samples. In Section \@ref(sampling-framework), we present a theoretical framework and define the sampling distribution. In Section \@ref(central-limit-theorem), we introduce one of the fundamental theoretical results in statistics: the *Central Limit Theorem*. In Section \@ref(sampling-activity-mean), we present a second sampling activity, this time working with samples of chocolate-covered almonds and the average weight of these samples. In Section \@ref(sampling-other-scenarios), we present the sampling distribution in other scenarios. The concepts behind *sampling* form the basis of inferential methods, in particular confidence intervals and hypothesis tests, which are studied in Chapters \@ref(confidence-intervals) and \@ref(hypothesis-testing).

## Needed packages {-#sampling-packages}

If needed, read Section \@ref(packages) for information on how to install and load R packages.

```{r sampling-load-packages, message=FALSE}
library(tidyverse)
library(moderndive)
library(infer)
```

Recall that loading the `tidyverse` package loads many packages that we have encountered earlier. For details refer to Section \@ref(tidyverse-package). The packages `moderndive` and `infer` contain functions and data frames that will be used in this chapter.


```{r sampling-load-internal, message=FALSE, echo=FALSE, purl=FALSE}
# Packages needed internally, but not in text.
library(kableExtra)
library(patchwork)
library(scales)

# Dynamic coding of summary statistics for bowl i.e. avoid hard-coding any values
# wherever possible
num_balls <- nrow(bowl)
num_red <- bowl |>
  summarize(red = sum(color == "red")) |>
  pull(red)
prop_red <- num_red / num_balls
percent_red_chr <- prop_red |> percent(accuracy = 0.1)
```

## First activity: red balls {#sampling-activity}

Take a look at the bowl in Figure \@ref(fig:sampling-exercise-1). It has red and white balls of equal size. `r if_else(is_latex_output(), '(Note that in this printed version of the book "red" corresponds to the darker-colored balls, and "white" corresponds to the lighter-colored balls. We kept the reference to "red" and "white" throughout this book since those are the actual colors of the balls as seen in the background of the image on our book\'s [First Edition cover](https://moderndive.com/images/logos/book_cover.png).)', '')` The balls have been mixed beforehand and there does not seem to be any particular pattern for the location of red and white balls inside the bowl.

```{r sampling-exercise-1, echo=FALSE, fig.cap="A bowl with red and white balls.", purl=FALSE, out.width = "95%"}
include_graphics("images/sampling/balls/sampling_bowl_1.jpg")
```

### The proportion of red balls in the bowl {#population-proportion}

We are interested in finding the proportion of red balls in the bowl. To find this proportion, we could count the number of red balls and divide this number by the total number of balls. The bowl seen in Figure \@ref(fig:sampling-exercise-1) is represented virtually by the data frame `bowl` included in the `moderndive` package. The first ten rows are shown here for illustration purposes:

```{r sampling-v4}
bowl
```

The `bowl` data frame has `r num_balls` rows representing the `r num_balls` balls in the bowl shown in Figure \@ref(fig:sampling-exercise-1). You can view and scroll through the entire contents of `bowl` in RStudio's data viewer by running `View(bowl)`. The first variable `ball_ID` is used as an *identification variable* as discussed in Subsection \@ref(identification-vs-measurement-variables); none of the balls in the actual bowl are marked with numbers.

The second variable `color` indicates whether a particular virtual ball is red or white. We compute the proportion of red balls in the bowl using the `dplyr` data-wrangling verbs presented in Chapter \@ref(wrangling). A few steps are needed in order to determine this proportion. We present these steps separately to remind you how they work but later introduce all the steps together and simplify some of the code. First, for each of the balls, we identify if it is red or not using a test for equality with the logical operator `==`. We do this by using the `mutate()` function from Section \@ref(mutate) that allows us to create a new Boolean variable called `is_red`.

```{r sampling-mutate}
bowl |>
  mutate(is_red = (color == "red"))
```

The variable `is_red` takes the Boolean (logical) value `TRUE` for each row where `color == "red"` and `FALSE` for every row where `color` is not equal to `"red"`. Since R treats `TRUE` like the number `1` and `FALSE` like the number `0`, working with `TRUE`s and `FALSE`s is equivalent to working with `1`'s and `0`'s. In particular, adding all the `1`'s and `0`'s is equivalent to counting how many red balls are in the bowl.

We compute this using the `sum()` function inside the `summarize()` function. Recall from Section \@ref(summarize) that `summarize()` takes a data frame with many rows and returns a data frame with a single row containing summary statistics such as the `sum()`:

```{r sampling-mutate-colored}
bowl |>
  mutate(is_red = (color == "red")) |>
  summarize(num_red = sum(is_red))
```

The `sum()` function has added all the `1`'s and `0`'s and has effectively counted the number of red balls. There are `r num_red` red balls in the bowl. Since the bowl contains `r num_balls` balls, the proportion of red balls is `r num_red`/`r num_balls` = `r num_red/num_balls`. We could ask R to find the proportion directly by replacing `sum()` with `mean()` inside `summarize()`. The average of `1`'s and `0`'s is precisely the proportion of red balls in the bowl:

```{r sampling-mutate-alt2}
bowl |>
  mutate(is_red = (color == "red")) |>
  summarize(prop_red = mean(is_red))
```

This code works well but can be simplified once more. Instead of creating a new Boolean variable `is_red` before finding the proportion, we could write both steps simultaneously in a single line of code:

```{r sampling-compute-mean}
bowl |>
  summarize(prop_red = mean(color == "red"))
```

This type of calculation will be used often in the next subsections.
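
As a minimal illustration of why `mean()` applied to a logical vector yields a proportion, the following base R sketch uses a hypothetical five-ball vector, `toy_color`, rather than the `bowl` data frame. All object names here are made up for the example.

```r
# Hypothetical five-ball example (not the real `bowl` data)
toy_color <- c("red", "white", "white", "red", "white")

# Test each ball for equality with "red"; the result is a logical vector
toy_is_red <- toy_color == "red"

# TRUE counts as 1 and FALSE as 0, so the sum is the number of red balls
toy_num_red <- sum(toy_is_red)

# The average of the 1's and 0's is the proportion of red balls
toy_prop_red <- mean(toy_is_red)

toy_num_red   # 2
toy_prop_red  # 0.4
```

Two of the five hypothetical balls are red, so the sum is 2 and the mean is 2/5 = 0.4.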

### Manual sampling {#sampling-manual}


In the previous subsection we were able to find the proportion of red balls in the bowl using R only because we had the information of the entire bowl as a data frame. Otherwise, we would have to retrieve this manually. If the bowl contained a large number of balls, this could be a long and tedious process. How long do you think it would take to do this manually if the bowl had tens of thousands of balls? Or millions? Or even more?

In real-life situations, we are often interested in finding the proportion of a very large number of objects, or subjects, and performing an exhaustive count could be tedious, costly, impractical, or even impossible. Because of these limitations, we typically do not perform exhaustive counts. Instead, in our example with balls, we randomly select a sample of balls from the bowl, find the proportion of red balls in this sample, and use this proportion to learn more about the proportion of red balls in the entire bowl.


#### One sample {-}

We start by inserting a shovel into the bowl as seen in Figure \@ref(fig:sampling-exercise-2) and collect $5 \cdot 10 = 50$ balls as shown in Figure \@ref(fig:sampling-exercise-3). The set of balls retrieved is called a _sample_.

```{r sampling-exercise-2, echo=FALSE, fig.cap="Inserting a shovel into the bowl.", purl=FALSE, out.width="100%"}
include_graphics("images/sampling/balls/sampling_bowl_2.jpg")
```


```{r sampling-exercise-3, echo=FALSE, fig.cap="Taking a sample of 50 balls from the bowl.", purl=FALSE, out.width="80%"}
include_graphics("images/sampling/balls/sampling_bowl_3_cropped.jpg")
```

Observe that 17 of the balls are red, and thus the proportion of red balls in the sample is 17/50 = 0.34 or 34%. Compare this to the proportion of red balls in the entire bowl, `r prop_red`, that we found in Subsection \@ref(population-proportion). The proportion from the sample actually seems pretty good, and it did not take much time or energy to get. But was this approximate proportion just a lucky outcome? Could we be this lucky the next time we take a sample from the bowl? Next we take more samples from the bowl and calculate the proportions of red balls.

#### Thirty-three samples {-}

We now take many more random samples as shown in Figure \@ref(fig:sampling-exercise-3b). Each time we do the following:

- Return the 50 balls used earlier back into the bowl and mix the contents of the bowl to ensure that each new sample is not influenced by the previous sample.
- Take a new sample with the shovel and determine a new proportion of red balls.

```{r sampling-exercise-3b, echo=FALSE, fig.show='hold', fig.cap="Repeating sampling activity.", purl=FALSE, out.width="25%"}
include_graphics(c("images/sampling/balls/tactile_2_a.jpg", "images/sampling/balls/tactile_2_b.jpg", "images/sampling/balls/tactile_2_c.jpg"))
```

When we perform this activity many times, we observe that different samples may produce different proportions of red balls. A proportion of red balls from a sample is called a _sample proportion_.  A group of 33 students performed this activity previously and drew a histogram using blocks to represent sample proportions of red balls. Figure \@ref(fig:sampling-exercise-4) shows students working on the histogram with two blocks drawn already representing the first two sample proportions found and the third about to be added.

```{r sampling-exercise-4, echo=FALSE, fig.cap="Students drawing a histogram of sample proportions.", purl=FALSE, out.width="60%"}
include_graphics("images/sampling/balls/tactile_3_a.jpg")
```

Recall from Section \@ref(histograms) that histograms help us visualize the *distribution* \index{distribution} of a numerical variable. In particular, they show where the center of the values falls and how the values vary. A histogram of the first 10 sample proportions can be seen in Figure \@ref(fig:sampling-exercise-5).

```{r sampling-exercise-5, echo=FALSE, fig.cap="Hand-drawn histogram of 10 sample proportions.", purl=FALSE, out.height="30%"}
include_graphics("images/sampling/balls/tactile_3_c-2e.png")
```

By looking at the histogram, we observe that the lowest proportion of red balls was between 0.20 and 0.25 while the highest was between 0.45 and 0.5. More importantly, the most frequently occurring proportions were between 0.30 and 0.35.

This activity performed by 33 students has the results stored in the `tactile_prop_red` data frame included in the `moderndive` package. The first 10 rows are given:

```{r sampling-v9}
tactile_prop_red
```

Observe that for each student `group` the data frame provides their names, the number of `red_balls` observed in the sample, and the calculated proportion of red balls in the sample, `prop_red`. We also have a `replicate` variable enumerating each of the 33 groups. We chose this name because each row can be viewed as one instance of a replicated (in other words "repeated") activity.

Using again the R data visualization techniques introduced in Chapter \@ref(viz), we construct the histogram for all 33 sample proportions as shown in Figure \@ref(fig:samplingdistribution-tactile). Recall that each student has taken a sample of 50 balls using the same procedure and has calculated the proportion of red balls in that sample. The histogram is built using only those sample proportions. We do not need the individual information of each student or the number of red balls found. We constructed the histogram using `ggplot()` with `geom_histogram()`. To align the bins in the computerized histogram so that it matches the hand-drawn histogram shown in Figure \@ref(fig:sampling-exercise-5), the arguments `boundary = 0.4` and `binwidth = 0.05` were used. The former indicates that we want a binning scheme in which one of the bin boundaries is at 0.4; the latter fixes the width of each bin to 0.05 units.

```{r sampling-hist, echo=TRUE, fig.show='hide'}
ggplot(tactile_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Proportion of red balls in each sample",
       title = "Histogram of 33 proportions")
```
```{r samplingdistribution-tactile, echo=FALSE, fig.cap="The distribution of sample proportions based on 33 random samples of size 50.", fig.height=ifelse(knitr::is_latex_output(), 3.3, 4), purl=FALSE}
tactile_histogram <- ggplot(tactile_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white")
tactile_histogram +
  labs(
    x = "Proportion of red balls in each sample",
    title = "Histogram of 33 proportions"
  )
```

When studying the histogram we can see that some proportions are lower than 25% and others are greater than 45%, but most of the sample proportions are between 30% and 45%.
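
To make the roles of `binwidth` and `boundary` concrete, the sketch below reconstructs a set of bin edges by hand in base R. The proportions in `toy_props` are made up for illustration and deliberately avoid falling exactly on a bin edge. Here `cut()` is only a stand-in for the binning that `geom_histogram()` performs internally, not a description of its implementation.

```r
# Hypothetical sample proportions, chosen to avoid landing exactly on a bin edge
toy_props <- c(0.24, 0.32, 0.34, 0.36, 0.38, 0.42, 0.46)

binwidth <- 0.05
boundary <- 0.4

# Every bin edge has the form boundary + k * binwidth for some integer k
k <- seq(
  floor((min(toy_props) - boundary) / binwidth),
  ceiling((max(toy_props) - boundary) / binwidth)
)
bin_edges <- boundary + k * binwidth

# Count how many proportions fall in each bin (intervals closed on the right)
bin_counts <- table(cut(toy_props, breaks = bin_edges))
bin_counts
```

Because each edge is `boundary` plus a whole number of bin widths, 0.4 is always one of the edges, and the remaining edges fall at 0.20, 0.25, 0.30, and so on.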

We can also use this activity to introduce some statistical terminology. The process of taking repeated *samples* of 50 balls and finding the corresponding *sample proportions* is called \index{sampling} *sampling*. Since we returned the observed balls to the bowl before getting another sample, we say that we performed *sampling with replacement* and because we mixed the balls before taking a new sample, the samples were *randomly drawn* and are called *random samples*.

As shown in Figure \@ref(fig:samplingdistribution-tactile), different random samples produce different sample proportions. This phenomenon is called *sampling variation*\index{sampling!variation}. Furthermore, the histogram is a graphical representation of the *distribution* of sample proportions; it describes the sample proportions determined and how often they appear. The distribution of all possible sample proportions that can be found from random samples is called, appropriately, the *sampling distribution* of the sample proportion. The sampling distribution is central to the ideas we develop in this chapter.


```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why is it important to mix the balls in the bowl before we take a new sample?

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why is it that students did not all have the same sample proportion of red balls?

```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```


### Virtual sampling {#sampling-simulation}

In Subsection \@ref(sampling-manual), we performed a *tactile* sampling activity: students used a real shovel to take physical samples by hand from a bowl with white and red balls. We now extend the entire process using simulations on a computer, a sort of *virtual* sampling activity.

The use of simulations permits us to study not only 33 random samples but thousands, tens of thousands, or even more samples. When a large number of random samples is retrieved, we can gain a better understanding of the *sampling distribution* and the *sampling variation* of sample proportions. In addition, we are not limited by samples of 50 balls, as we can simulate sampling with any desired sample size. We are going to do all this in this subsection. We start by mimicking our manual activity.

#### One virtual sample {-}

Recall that the bowl seen in Figure \@ref(fig:sampling-exercise-1) is represented by the data frame `bowl` included in the `moderndive` package. The virtual analog to the 50-ball shovel seen in Figure \@ref(fig:sampling-exercise-2) can be achieved using the `rep_slice_sample()` function included in the `moderndive` package. This function allows us to take `rep`eated (or `rep`licated) random `samples` of size `n`. We start by taking a single sample of 50 balls:


```{r sampling-virtual-sample, echo=-1}
set.seed(76)
virtual_shovel <- bowl |>
  rep_slice_sample(n = 50)
virtual_shovel
```

Observe that `virtual_shovel` has 50 rows corresponding to our virtual sample of size 50. The `ball_ID` variable identifies which of the `r num_balls` balls from `bowl` are included in our sample of 50 balls while `color` denotes whether it is white or red. The `replicate` variable is equal to 1 for all 50 rows because we have decided to take only one sample right now. Later on, we take more samples, and `replicate` will take more values.

We compute the proportion of red balls in our virtual sample. The code we use is similar to the one used for finding the proportion of red balls in the entire bowl in Subsection \@ref(population-proportion):


```{r sampling-compute-mean-colored, echo=-c(1, 2)}
# Neat way to remove from output of particular code pieces with echo=-c(1, 2)!
prop_red_sample1 <- virtual_shovel |>
  summarize(prop_red = mean(color == "red")) |>
  pull(prop_red)
virtual_shovel |>
 summarize(prop_red = mean(color == "red"))
```

Based on this random sample, `r prop_red_sample1*100`% of the `virtual_shovel`'s 50 balls were red! We now find the sample proportions for more random samples.
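
For readers who want to see the idea without any packages, here is a rough base R sketch of a single virtual shovel. It assumes a made-up bowl, `toy_bowl`, with 400 red and 600 white balls (not the contents of `bowl`), and uses `sample()`, which draws without replacement by default.

```r
set.seed(76)

# Hypothetical bowl: 400 red and 600 white balls (not the real `bowl` data)
toy_bowl <- rep(c("red", "white"), times = c(400, 600))

# One virtual shovel: 50 balls drawn at random, without replacement
toy_shovel <- sample(toy_bowl, size = 50)

# Proportion of red balls in this one sample
toy_sample_prop <- mean(toy_shovel == "red")
toy_sample_prop
```

The resulting proportion is always a multiple of 1/50 and will typically differ somewhat from the toy bowl's true proportion of 0.4.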

#### Thirty-three virtual samples {-}

In Section \@ref(sampling-activity), students got 33 samples and sample proportions. They repeated/replicated the sampling process 33 times. We do this virtually by again using the function `rep_slice_sample()` and this time adding the `reps = 33` argument as we want to retrieve 33 random samples. We save these samples in the data frame `virtual_samples`, as shown, and then provide a preview of its first 10 rows. If you want to inspect the entire `virtual_samples` data frame, use RStudio's data viewer by running `View(virtual_samples)`.

```{r sampling-sample-rows, echo=-1}
set.seed(76)
virtual_samples <- bowl |>
  rep_slice_sample(n = 50, reps = 33)
virtual_samples
```

Observe in the data viewer that the first 50 rows of `replicate` are equal to `1`, the next 50 rows of `replicate` are equal to `2`, and so on. The first 50 rows correspond to the first sample of 50 balls while the next 50 rows correspond to the second sample of 50 balls. This pattern continues for all `reps = 33` replicates, and thus `virtual_samples` has $33 \cdot 50 = 1650$ rows.

Using `virtual_samples` we find the proportion of red balls for each replicate. We use the same `dplyr` verbs as before. In particular, we add `group_by()` of the `replicate` variable. Recall from Section \@ref(groupby) that by assigning the grouping variable "meta-data" before `summarize()`, we perform the calculations needed for each replicate separately. The other line of code, as explained in the case of one sample, calculates the sample proportion of red balls. A preview of the first 10 rows is presented:

```{r sampling-grouped-summary}
virtual_prop_red <- virtual_samples |>
  group_by(replicate) |>
  summarize(prop_red = mean(color == "red"))
virtual_prop_red
```

Actually, the function `rep_slice_sample()` already groups the data by replicate, so it is not necessary to include `group_by()` in the code. Moreover, using the pipe operator we can simplify the work and write everything at once:

- using `rep_slice_sample()`, we have 33 replicates (each being a random sample of 50 balls) and
- using `summarize()` with `mean()` on the Boolean values, we determine the proportion of red balls for each sample.

We store these proportions in the data frame `virtual_prop_red` and print the first 10 sample proportions (for the first 10 samples) as an illustration:

```{r sampling-sample-rows2, echo=-1}
set.seed(76)
virtual_prop_red <- bowl |>
  rep_slice_sample(n = 50, reps = 33) |>
  summarize(prop_red = mean(color == "red"))
virtual_prop_red
```
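
The replicate-then-summarize pattern can also be sketched in base R with `replicate()`. This is only an illustrative alternative under the assumption of a made-up bowl, `toy_bowl`, with 400 red and 600 white balls; it is not how `rep_slice_sample()` works internally.

```r
set.seed(76)

# Hypothetical bowl: 400 red and 600 white balls (not the real `bowl` data)
toy_bowl <- rep(c("red", "white"), times = c(400, 600))

# Repeat 33 times: draw 50 balls at random and record the proportion of red
toy_props_33 <- replicate(33, mean(sample(toy_bowl, size = 50) == "red"))

length(toy_props_33)  # one sample proportion per replicate
range(toy_props_33)   # smallest and largest sample proportion observed
```

Each element of `toy_props_33` plays the role of one row of `virtual_prop_red`: one sample proportion for one replicate.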

As was the case in the tactile activity, there is sampling variation in the resulting 33 proportions from the virtual samples. As we did manually in Subsection \@ref(sampling-manual), we construct a histogram with these sample proportions as shown in Figure \@ref(fig:samplingdistribution-virtual). The histogram helps us visualize the sampling distribution of the sample proportion. Observe that the histogram was again constructed using `ggplot()` and `geom_histogram()` with the arguments `binwidth = 0.05` and `boundary = 0.4`.

```{r sampling-hist-white-border, echo=TRUE, fig.show='hide'}
ggplot(virtual_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Sample proportion",
       title = "Histogram of 33 sample proportions")
```
```{r samplingdistribution-virtual, echo=FALSE, fig.cap="The distribution of 33 proportions based on 33 virtual samples of size 50.", fig.height=ifelse(knitr::is_latex_output(), 3.5, 4), purl=FALSE}
virtual_histogram <- ggplot(virtual_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white")
virtual_histogram +
  labs(
    x = "Sample proportion",
    title = "Histogram of 33 sample proportions"
  )
```

When observing the histogram we can see that some proportions are lower than 25% and others are greater than 45%. Also, the sample proportions observed most frequently are between 35% and 40% (for 11 out of 33 samples). We found similar results when sampling was done by hand in Subsection \@ref(sampling-manual), and the histogram was presented in Figure \@ref(fig:samplingdistribution-tactile). We present both histograms side by side in Figure \@ref(fig:tactile-vs-virtual) for an easy comparison. Note that they are somewhat similar in their center and variation, although not identical. The differences are also due to *sampling variation*.

```{r tactile-vs-virtual, echo=FALSE, fig.cap="The sampling distribution of the sample proportion and sampling variation:  showing a histogram for virtual sample proportions (left) and another histogram for tactile sample proportions (right).", fig.height=ifelse(knitr::is_latex_output(), 2.9, 4), purl=FALSE}
facet_compare <- bind_rows(
  virtual_prop_red |>
    mutate(type = "Virtual sampling"),
  tactile_prop_red |>
    select(replicate, red = red_balls, prop_red) |>
    mutate(type = "Physical sampling")
) |>
  mutate(type = factor(type, levels = c("Virtual sampling", "Physical sampling"))) |>
  ggplot(aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  facet_wrap(~type) +
  labs(
    x = "Sample Proportion",
    title = "Histograms for sample proportions"
  )

if (is_latex_output()) {
  facet_compare +
    theme(
      strip.text = element_text(colour = "black"),
      strip.background = element_rect(fill = "grey93")
    )
} else {
  facet_compare
}
```


```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why couldn't we study the effects of sampling variation when we used the virtual shovel only once? Why did we need to take more than one virtual sample (in our case, 33 virtual samples)?

```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```


#### One thousand virtual samples {-}


It was helpful to observe how sampling variation affects sample proportions in 33 samples. It was also interesting to note that while the 33 virtual samples provide different sample proportions than the 33 physical samples, the overall patterns were fairly similar. Because the samples were taken at random in both cases, any other set of 33 samples, virtual or physical, would provide a different set of sample proportions due to sampling variation, but the overall patterns would still be similar. Still, 33 samples are not enough to fully understand these patterns.

This is why we now study the sampling distribution and the effects of sampling variation with 1000 random samples. Trying to do this manually would be impractical, but getting virtual samples can be done quickly and efficiently. Additionally, we have already developed the tools for this. We repeat the steps performed earlier using the `rep_slice_sample()` function with the sample size `n` set to `50`. This time, however, we set the number of replicates (`reps`) to `1000`, and use `summarize()` and `mean()` again on the Boolean values to calculate the sample proportions. We store in `virtual_prop_red` the sample proportion for each of the 1000 random samples. The proportions for the first 10 samples are shown:

```{r sampling-sample-rows2-dup1, echo=-1}
set.seed(76)
virtual_prop_red <- bowl |>
  rep_slice_sample(n = 50, reps = 1000) |>
  summarize(prop_red = mean(color == "red"))
virtual_prop_red
```

As done previously, a histogram for these 1000 sample proportions is given in Figure \@ref(fig:samplingdistribution-virtual-1000).

```{r sampling-hist-white-border-v2, echo=TRUE, fig.show='hide'}
ggplot(virtual_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.04, boundary = 0.4, color = "white") +
  labs(x = "Sample proportion", title = "Histogram of 1000 sample proportions")
```
```{r samplingdistribution-virtual-1000, echo=FALSE, fig.cap="The distribution of 1000 proportions based on 1000 random samples of size 50.", fig.height=ifelse(knitr::is_latex_output(), 2.3, 4), purl=FALSE}
virtual_histogram <- ggplot(virtual_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.04, boundary = 0.4, color = "white")
virtual_histogram +
  labs(
    x = "Sample proportion",
    title = "Histogram of 1000 sample proportions"
  )
```

The sample proportions represented by the histogram could be as low as 15% or as high as 60%, but those extreme proportions are rare. The most frequent proportions determined are those between 35% and 40%. Furthermore, the histogram now shows a symmetric and bell-shaped distribution that can be approximated well by a normal distribution.
```{r sampling-conditional-text-dup1, echo=FALSE, results="asis", purl=FALSE}
if(!is_latex_output())
  cat('Please read the "Normal distribution" section of [Appendix A online](https://moderndive.com/v2/appendixa) for a brief discussion of this distribution and its properties.')
```
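
As a rough numerical companion to the histogram, the following base R sketch repeats the simulation 1000 times on a made-up bowl, `toy_bowl`, with 400 red and 600 white balls (an assumption for illustration, not the `bowl` data). It summarizes where the sample proportions are centered and where the middle 95% of them fall.

```r
set.seed(76)

# Hypothetical bowl: 400 red and 600 white balls, so the true proportion is 0.4
toy_bowl <- rep(c("red", "white"), times = c(400, 600))

# 1000 replicates of a 50-ball sample, keeping only each sample proportion
toy_props_1000 <- replicate(1000, mean(sample(toy_bowl, size = 50) == "red"))

# Center of the simulated sample proportions
toy_center <- mean(toy_props_1000)

# Range covering the middle 95% of the simulated sample proportions
toy_middle_95 <- quantile(toy_props_1000, probs = c(0.025, 0.975))

toy_center
toy_middle_95
```

In this sketch the center should land close to the toy bowl's proportion of 0.4, with extreme sample proportions falling outside the middle range only occasionally.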

<!--
Learning checks need to be updated. AV 8/17/23
-->

```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why did we not take 1000 samples of 50 balls by hand?

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Looking at Figure \@ref(fig:samplingdistribution-virtual-1000), would you say that sampling 50 balls where 30% of them were red is likely or not? What about sampling 50 balls where 10% of them were red?

```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```


#### Different sample sizes {-}

Another advantage of using simulations is that we can also study how the sampling distribution of the sample proportion changes if we find the sample proportions from samples smaller than or larger than 50 balls. We do need to be careful not to mix results, though: we build the sampling distribution using sample proportions from samples of the **same** size, but the size chosen does not have to be 50 balls.

We must first decide the sample size we want to use, and then take samples using that size. As an illustration, we can perform the sampling activity three times, each time using a different sample size; think of having three shovels of sizes 25, 50, and 100, as shown in Figure \@ref(fig:three-shovels). Of course, we do this virtually: with each shovel size we gather many random samples, calculate the corresponding sample proportions, and plot those proportions in a histogram. We therefore create three histograms, describing the sampling distributions of sample proportions from samples of size 25, 50, and 100, respectively. As we show later in this subsection, the size of the sample has a direct effect on the sampling distribution and the magnitude of its sampling variation.

<!--
A shovel with 25 slots          |  A shovel with 50 slots  | A shovel with 100 slots
:-------------------------:|:-------------------------:|:-------------------------:
![](images/sampling/balls/shovel_025.jpg){ width=1.6in }  |  ![](images/sampling/balls/shovel_050.jpg){ width=1.6in } | ![](images/sampling/balls/shovel_100.jpg){ width=1.6in }
-->

```{r three-shovels, echo=FALSE, fig.cap="Three shovels to extract three different sample sizes.", out.width='100%', purl=FALSE}
include_graphics("images/sampling/balls/three_shovels.png")
```

We follow the same process performed previously: we generate 1000 samples, find the sample proportions, and use them to draw a histogram. We follow this process three different times, setting the `size` argument in the code equal to `25`, `50`, and `100`, respectively. We run each of the following code segments individually and then compare the resulting histograms.

```{r sampling-hist-white-border-v2-dup2, eval=FALSE}
# Segment 1: sample size = 25 ------------------------------
# 1.a) Compute sample proportions for 1000 samples, each sample of size 25
virtual_prop_red_25 <- bowl |>
  rep_slice_sample(n = 25, reps = 1000) |>
  summarize(prop_red = mean(color == "red"))

# 1.b) Plot a histogram to represent the distribution of the sample proportions
ggplot(virtual_prop_red_25, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Proportion of 25 balls that were red", title = "25")


# Segment 2: sample size = 50 ------------------------------
# 2.a) Compute sample proportions for 1000 samples, each sample of size 50
virtual_prop_red_50 <- bowl |>
  rep_slice_sample(n = 50, reps = 1000) |>
  summarize(prop_red = mean(color == "red"))

# 2.b) Plot a histogram to represent the distribution of the sample proportions
ggplot(virtual_prop_red_50, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Proportion of 50 balls that were red", title = "50")


# Segment 3: sample size = 100 ------------------------------
# 3.a) Compute sample proportions for 1000 samples, each sample of size 100
virtual_prop_red_100 <- bowl |>
  rep_slice_sample(n = 100, reps = 1000) |>
  summarize(prop_red = mean(color == "red"))

# 3.b) Plot a histogram to represent the distribution of the sample proportions
ggplot(virtual_prop_red_100, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Proportion of 100 balls that were red", title = "100")
```

For easy comparison, we present the three resulting histograms in a single row with matching $x$ and $y$ axes in Figure \@ref(fig:comparing-sampling-distributions).

```{r comparing-sampling-distributions, echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 3, 4), fig.cap="Histograms of sample proportions for different sample sizes.", purl=FALSE}
# n = 25
if (!file.exists("rds/virtual_prop_red_25.rds")) {
  virtual_prop_red_25 <- bowl |>
    rep_slice_sample(n = 25, reps = 1000) |>
    summarize(prop_red = mean(color == "red"), n = n())
  write_rds(virtual_prop_red_25, "rds/virtual_prop_red_25.rds")
} else {
  virtual_prop_red_25 <- read_rds("rds/virtual_prop_red_25.rds")
}

# n = 50
if (!file.exists("rds/virtual_prop_red_50.rds")) {
  virtual_prop_red_50 <- bowl |>
    rep_slice_sample(n = 50, reps = 1000) |>
    summarize(prop_red = mean(color == "red"), n = n())
  write_rds(virtual_prop_red_50, "rds/virtual_prop_red_50.rds")
} else {
  virtual_prop_red_50 <- read_rds("rds/virtual_prop_red_50.rds")
}

# n = 100
if (!file.exists("rds/virtual_prop_red_100.rds")) {
  virtual_prop_red_100 <- bowl |>
    rep_slice_sample(n = 100, reps = 1000) |>
    summarize(prop_red = mean(color == "red"), n = n())
  write_rds(virtual_prop_red_100, "rds/virtual_prop_red_100.rds")
} else {
  virtual_prop_red_100 <- read_rds("rds/virtual_prop_red_100.rds")
}

virtual_prop_red <- bind_rows(
  virtual_prop_red_25,
  virtual_prop_red_50,
  virtual_prop_red_100
)

comparing_sampling_distributions <- ggplot(virtual_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.04, boundary = 0.4, color = "white") +
  labs(
    x = "Sample proportions for red balls",
    title = "Histograms for three different sample sizes"
  ) +
  facet_wrap(~n)

if (is_latex_output()) {
  comparing_sampling_distributions +
    theme(
      strip.text = element_text(colour = "black"),
      strip.background = element_rect(fill = "grey93")
    )
} else {
  comparing_sampling_distributions
}
```


Observe that all three histograms:

- are centered around the same middle value, which appears to be slightly below 0.4,
- are somewhat bell-shaped, and
- exhibit *sampling variation* that is different for each sample size. In particular, as the sample size increases from 25 to 50 to 100, the sample proportions do not vary as much and they seem to get closer to the middle value.

These are important characteristics of the *sampling distribution* of the sample proportion: the first observation relates to the center of the distribution, the second to its shape, and the last one to the *sampling variation* and how it is affected by the sample size. These results are not coincidental or isolated to the example of sample proportions of red balls in a bowl. In the next subsection, a theoretical framework is introduced that helps explain with precise mathematical equations the behavior of sample proportions coming from random samples.
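
The same comparison can be sketched using only base R, without the `bowl` data frame. The sketch assumes a hypothetical stand-in bowl, `sketch_bowl`, with 2400 balls of which 900 are red, so that the proportion of red balls is 0.375; these counts, the seed, and the helper `one_prop_red()` are illustrative assumptions and not objects used elsewhere in this chapter. For each sample size, it draws 1000 samples and summarizes the center and the spread of the resulting sample proportions.

```{r sketch-spread-by-sample-size, eval=FALSE}
# Hypothetical stand-in for the bowl: 2400 balls, 900 of them red (p = 0.375)
sketch_bowl <- rep(c("red", "white"), times = c(900, 1500))

# Sample proportion of red balls in one random sample of a given size
one_prop_red <- function(size) mean(sample(sketch_bowl, size) == "red")

set.seed(76) # arbitrary seed, only to make the sketch reproducible
sizes <- c(25, 50, 100)

# For each sample size, draw 1000 samples and keep the 1000 sample proportions
props_by_size <- lapply(sizes, function(size) replicate(1000, one_prop_red(size)))
names(props_by_size) <- paste0("n = ", sizes)

# Center (mean) and spread (standard deviation) of each set of proportions
summary_by_size <- sapply(
  props_by_size,
  function(props) c(center = mean(props), spread = sd(props))
)
summary_by_size
```

If the pattern seen in the histograms holds, the `center` row should be roughly the same in all three columns, while the `spread` row should shrink as the sample size grows.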


```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** As shown in Figure \@ref(fig:comparing-sampling-distributions) the histograms of sample proportions are somewhat bell-shaped. What can you say about the center of the histograms?

- A. The smaller the sample size the more concentrated the center of the histogram.
- B. The larger the sample size the smaller the center of the histogram.
- C. The center of each histogram seems to be about the same, regardless of the sample size.


**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** As shown in Figure \@ref(fig:comparing-sampling-distributions) as the sample size increases, the histogram gets narrower. What happens with the sample proportions?

- A. They vary less.
- B. They vary by the same amount.
- C. They vary more.

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why do we use random sampling when constructing sampling distributions?

-   A. To always get the same sample
-   B. To minimize bias and make inferences about the population
-   C. To make the process easier
-   D. To reduce the number of samples needed

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why is it important to construct a histogram of sample means or proportions in a simulation study?

-   A. To visualize the distribution and assess normality or other patterns
-   B. To increase the accuracy of the sample means
-   C. To ensure all sample means are exactly the same
-   D. To remove any outliers from the data

```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
```


## Sampling framework {#sampling-framework}

In Section \@ref(sampling-activity), we gained some intuition about sampling and its characteristics. In this section, we introduce some statistical definitions and terminology related to sampling. We conclude by introducing key characteristics that will be formally studied in the rest of the chapter.


### Population, sample, and the sampling distribution {#terminology-and-notation}


A **population** or **study population** is a collection of all individuals or observations of interest. In the bowl activities the **population** is the collection of all the balls in the bowl. A **sample** is a subset of the population. **Sampling** is the act of collecting samples from the population. **Simple random sampling** is *sampling* where each member of the population has the same chance of being selected, for example, by using a shovel to select balls from a bowl. A **random sample** is a sample found using simple random sampling. In the bowl activities, physical and virtual, we use simple random sampling to get random samples from the bowl.

A **population parameter** (or simply a **parameter**) is a numerical summary (a number) that represents some characteristic of the population. A **sample statistic** (or simply a **statistic**) is a numerical summary computed from a sample. In the bowl activities the parameter of interest was the population proportion $p=$ `r prop_red`. Earlier, a sample of 50 balls was taken and 17 of them were red; the corresponding statistic is the *sample proportion*, which in this example was $\widehat{p}= 17/50 = 0.34$. Observe how we use $p$ to represent the population proportion (parameter) and $\widehat{p}$ for the sample proportion (statistic).
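
The distinction between a parameter and a statistic can be sketched in a few lines of base R. As an assumption for illustration only, the sketch uses a hypothetical stand-in bowl, `sketch_bowl`, with 2400 balls of which 900 are red; the names `p_population` and `p_hat` and the seed are also illustrative choices.

```{r sketch-parameter-vs-statistic, eval=FALSE}
# Hypothetical stand-in for the bowl: 2400 balls, 900 of them red
sketch_bowl <- rep(c("red", "white"), times = c(900, 1500))

# Parameter: computed from the entire population, so it is a fixed number
p_population <- mean(sketch_bowl == "red")

# Statistic: computed from one random sample, so it changes from sample to sample
set.seed(2024) # arbitrary seed, only to make the sketch reproducible
one_sample <- sample(sketch_bowl, size = 50)
p_hat <- mean(one_sample == "red")

c(parameter = p_population, statistic = p_hat)
```

Running the last three lines again without resetting the seed would give a different `p_hat`, while `p_population` stays the same.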

The **distribution** of a list of numbers is the set of the possible values in the list and how often they occur. The **sampling distribution of the sample proportion** is the **distribution** of sample proportions from **all possible** random samples of a given size. \index{sampling distributions} To illustrate this concept, recall that in Subsection \@ref(sampling-simulation) we drew the three histograms shown in Figure \@ref(fig:comparing-sampling-distributions). The histogram on the left, for example, was constructed by taking 1000 random samples of size $n=25$, finding the sample proportion for each sample, and using these proportions to draw the histogram. This histogram is a good visual approximation of the **sampling distribution** of the sample proportion.

The *sampling distribution* can be a difficult concept to grasp right away:

- The *sampling distribution of the sample proportion* is the distribution of *sample proportions*; it is constructed using exclusively *sample proportions*.
- Be careful as people learning this terminology sometimes confuse the term *sampling distribution* with a *sample's distribution*. The latter can be understood as the distribution of the values in a given sample.
- A histogram from a simulation of sample proportions is only a visual approximation of the sampling distribution. It is not the exact distribution. Still, when the simulations produce a large number of sample proportions, the resulting histogram provides a good approximation of the sampling distribution. This was the case in Subsection \@ref(sampling-simulation) and the three histograms shown in Figure \@ref(fig:comparing-sampling-distributions).
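
To make the contrast between a *sample's distribution* and the *sampling distribution* concrete, the following base R sketch tabulates both. It again assumes a hypothetical stand-in bowl, `sketch_bowl`, with 2400 balls of which 900 are red; the object names and the seed are illustrative.

```{r sketch-sample-vs-sampling-distribution, eval=FALSE}
# Hypothetical stand-in for the bowl: 2400 balls, 900 of them red
sketch_bowl <- rep(c("red", "white"), times = c(900, 1500))

set.seed(11) # arbitrary seed, only to make the sketch reproducible

# A sample's distribution: the colors of the 25 balls in ONE sample
one_sample <- sample(sketch_bowl, size = 25)
table(one_sample)

# Approximate sampling distribution: ONE proportion from EACH of 1000 samples
sample_props <- replicate(1000, mean(sample(sketch_bowl, size = 25) == "red"))
table(sample_props)
```

The first table counts balls by color within a single sample. The second table counts how often each sample proportion occurred across the 1000 samples, and it is this second kind of summary that the histograms approximate.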

The lessons we learned by performing the activities in Section \@ref(sampling-activity) contribute to gaining insights about key characteristics of the *sampling distribution* of the *sample proportion*, namely:

1. The center of the *sampling distribution*
1. The effect of *sampling variation* on the *sampling distribution* and the effect of the sample size on this *sampling variation*
1. The shape of the *sampling distribution*

The first two points relate to measures of central tendency and dispersion, respectively. The last one provides a connection to one of the most important theorems in statistics: the Central Limit Theorem. In the next section, we formally study these characteristics.


```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** In the case of our bowl activity, what is the *population parameter*? Do we know its value? How can we know its value exactly?

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How did we ensure that the samples collected with the shovel were random?


```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```


## The Central Limit Theorem {#central-limit-theorem}

A fascinating result in statistics is that, when retrieving random samples from any population, the corresponding sample means follow a typical behavior: their histogram is bell-shaped and has distinctive features. This is true regardless of the distribution of the population values and forms the basis of what we know as the Central Limit Theorem. Before fully describing it, we introduce a theoretical framework to construct this and other characteristics related to sampling.


### Random variables

A simple theoretical framework can help us formalize important properties of the sampling distribution of the sample proportion. To do this we modify the bowl activity slightly. Instead of using a shovel to select all 25 balls at once, we randomly select one ball at a time, 25 times. If the ball is red we call it a success and record a 1 (one); if it is not red we call it a failure and record a 0 (zero). Then, we return the ball to the bowl so the proportion of red balls in the bowl doesn't change.
This process is called a trial or a Bernoulli trial in honor of Jacob Bernoulli, a 17th-century mathematician who was among the first to work with these trials.
Getting a sample of 25 balls is running 25 trials and getting 25 numbers, ones or zeros, representing whether or not we observed a red ball on each trial.
The average of these 25 numbers (zeros or ones) represents precisely the proportion of red balls in a sample of 25 balls.

It is useful to represent a trial as a random variable. We use the uppercase letter $X$ and the subscript $1$ as $X_1$ to denote the random variable for the first trial. After the first trial is completed, so the color of the first ball is observed, the value of $X_1$ is realized as 1 if the ball is red or 0 if the ball is white. For example, if the first ball is red, we write $X_1 = 1$. Similarly, we use $X_2$ to represent the second trial. For example, if the second ball is white, $X_2$ is realized as $X_2=0$, and so on. $X_1$, $X_2$, $\dots$ are random variables only before the trials have been performed. After the trials, they are just the *ones* or *zeros* representing red or white balls, respectively.

Moreover, since our experiment is to perform 25 trials and then find the average of them, this average or mean, before the trials are carried out, can also be expressed as a random variable:

$$\overline X = \frac{X_1+X_2+\dots+X_{25}}{25}.$$

Here $\overline X$ is the random variable that represents the average, or mean, of these 25 trials. This is why we call $\overline X$ the **sample mean**. Again, $\overline X$ is a random variable before the 25 trials have been performed. After the trials, $\overline X$ is realized as the average of 25 zeros and ones.
For example, if the results of the trials are

$$\{0,0,0,1,0,1,0,1,0,0,1,0,1,1,0,0,0,1,1,0,1,0,0,0,1 \},$$

the observed value of $\overline X$ will be

$$\overline X = \frac{0+0+0+1+0+1+\dots+1+0+0+0+1}{25} = \frac{10}{25}=0.4.$$

So, for this particular example, the sample mean is $\overline X = 0.4$ which happens to be the sample proportion of red balls in this sample of 25 balls. In the context of Bernoulli trials, because we are finding averages of zeros and ones, these **sample means** are **sample proportions**! Connecting with the notation used earlier, observe that after the trials have been completed, $\overline X = \widehat{p}$.
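
This equivalence can be checked in R with the 25 trial results listed above. The names `trials`, `x_bar`, and `p_hat` are illustrative choices for this sketch.

```{r sketch-mean-of-bernoulli-trials, eval=FALSE}
# The 25 observed trial results: 1 = red ball, 0 = white ball
trials <- c(0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1)

length(trials) # number of trials: 25
sum(trials)    # number of red balls: 10

# Sample mean of the zeros and ones
x_bar <- mean(trials)

# Sample proportion of red balls, computed as a count divided by the sample size
p_hat <- sum(trials == 1) / length(trials)

c(x_bar = x_bar, p_hat = p_hat) # both are 0.4
```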

### The sampling distribution using random variables

Suppose that we want to calculate the sample proportion for another random sample of 25 balls. In terms of the random variable $\overline X$, this is performing 25 trials and finding another 25 values, ones and zeros, for $X_1$, $X_2$, $\dots$, $X_{25}$ and finding their average. For example, we might get:

$$\{1,0,0,1,0,0,0,1,0,0,1,0,1,0,1,1,0,0,0,1,0,1,0,0,0\}$$

Then, the realization of $\overline X$ will be $\overline X = 9/25 = 0.36$. This sample proportion is different from the one found earlier, 0.4. The possible values of $\overline X$ are the possible proportions of red balls for a sample of 25 balls. In other words, the value that $\overline X$ takes after the trials have been completed is the sample proportion for the observed sample of red and white balls.

Moreover, while any given trial can result in choosing a red ball or not (1 or 0), the chances of getting a red ball are influenced by the proportion of red balls in the bowl. For example, if a bowl has more red balls than white, the chances of getting a red ball on any given trial are higher than the chances of getting a white ball. Because 1 is the realization of a trial when a red ball is observed, the sample proportion also would tend to be higher.

Sampling variation produces different sample proportions for different random samples, but they are influenced by the proportion of red and white balls in the bowl. This is why understanding the sampling distribution of the sample proportion is learning which sample proportions are possible and which proportions are more or less likely to be observed. Since the realization of $\overline X$ is the observed sample proportion, the sampling distribution of the sample proportion is precisely the distribution of $\overline X$. In the rest of this section, we use both expressions interchangeably. Recall the key characteristics of the *sampling distribution* of the sample proportion, now given in terms of $\overline X$:

1. The center of the *distribution* of $\overline X$
1. The effect of *sampling variation* on the *distribution* of $\overline X$ and the effect of the sample size on *sampling variation*
1. The shape of the *distribution* of $\overline X$

To address these points, we use simulations. Simulations seldom provide the exact structure of the distribution, because an infinite number of samples may be needed for this. A large number of replications, however, often produces a very good approximation of the distribution and can be used to understand its characteristics well. Let's use the output found in Subsection \@ref(sampling-simulation); namely, the sample proportions for samples of size 25, 50, and 100. If we focus on size 25, think of each sample proportion from samples of size 25 as a possible value of $\overline X$. We now use these sample proportions to illustrate properties of the distribution of $\overline X$, the sampling distribution of the sample proportion.

### The center of the distribution: the expected value

Since the distribution of $\overline X$ is composed of all the sample proportions that can be calculated for a given sample size, the center of this distribution can be understood as the average of all these proportions. This is the value we would *expect* to get, on average, from all these sample proportions. This is why the center value of the sampling distribution is called the **expected value** of the sample proportion, and we write $E(\overline X)$. Based on probability theory, the mean of $\overline X$ happens to be equal to the population proportion of red balls in the bowl. In Subsection \@ref(population-proportion) we determined that the population proportion was `r num_red`/`r num_balls` = `r prop_red`, therefore

$$E(\overline X) = p = `r prop_red`.$$
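
As a brief sketch of why this holds, we take as given one fact from probability theory: the expected value of a sum is the sum of the expected values. Each trial $X_i$ takes the value 1 with probability $p$ and the value 0 with probability $1-p$, so for our 25 trials

$$\begin{aligned}
E(X_i) &= 1\cdot p + 0\cdot(1-p) = p, \\
E(\overline X) &= E\left(\frac{X_1+X_2+\dots+X_{25}}{25}\right) = \frac{E(X_1)+E(X_2)+\dots+E(X_{25})}{25} = \frac{25\,p}{25} = p.
\end{aligned}$$

The same argument applies to any sample size $n$, with 25 replaced by $n$.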

As an illustration, we noted in Subsection \@ref(sampling-simulation) when looking at the histograms in Figure \@ref(fig:comparing-sampling-distributions) that all three histograms were centered at some value between 0.35 and 0.4 (or between 35% and 40%). As we have established now, they are centered exactly at the expected value of $\overline X$, which is the population proportion. Figure \@ref(fig:comparing-sampling-distributions-3) displays these histograms again, but this time adds a vertical red line on each of them at the location of the population proportion value, $p$ = `r prop_red`.

```{r comparing-sampling-distributions-3, echo=FALSE, fig.cap="Three sampling distributions with population proportion $p$ marked by vertical line.", purl=FALSE, fig.height=ifelse(knitr::is_latex_output(), 3, 4)}
p <- bowl |>
  summarize(mean(color == "red")) |>
  pull()
samp_distn_compare <- virtual_prop_red |>
  mutate(
    n = str_c("n = ", n),
    n = factor(n, levels = c("n = 25", "n = 50", "n = 100"))
  ) |>
  ggplot(aes(x = prop_red)) +
  geom_histogram(
    binwidth = 0.04, boundary = 0,
    color = "white"
  ) +
  labs(
    x = expression(paste("Sample proportion ", italic(bar(X)))),
    title = expression(paste(
      "Distributions of ", italic(bar(X)),
      " based on n = 25, 50, 100."
    ))
  ) +
  scale_x_continuous(breaks = c(0.1, 0.3, 0.4, 0.6)) +
  facet_wrap(~n) +
  geom_vline(xintercept = p, col = "red", linewidth = 1)

if (is_latex_output()) {
  samp_distn_compare +
    theme(
      strip.text = element_text(colour = "black"),
      strip.background = element_rect(fill = "grey93")
    )
} else {
  samp_distn_compare
}
```

The results shown seem to agree with the theory. We can further check, using the simulation results, by finding the average of the 1000 sample proportions. We start with the histogram on the left:

```{r sampling-compute-mean-v6}
virtual_prop_red_25
virtual_prop_red_25 |>
  summarize(E_Xbar_25 = mean(prop_red))
```

The variable `prop_red` in data frame `virtual_prop_red_25` contains the sample proportions for each of the 1000 samples taken. The average of these sample proportions is presented as object `E_Xbar_25` which represents the estimated expected value of $\overline X$, by using the average of the 1000 sample proportions. Each of the sample proportions is calculated from random samples of 25 balls from the bowl. This average happens to be precisely the same as the population proportion.

It is worth spending a moment understanding this result. If we take one random sample of a given size, we know that the sample proportion from this sample would be somewhat different from the population proportion due to sampling variation; however, if we take many random samples of the same size, the average of the sample proportions is expected to be about the same as the population proportion.

We present the equivalent results with samples of size 50 and 100:

```{r sampling-compute-mean-v7, echo=TRUE, results='hide'}
virtual_prop_red_50 |>
  summarize(E_Xbar_50 = mean(prop_red))
virtual_prop_red_100 |>
  summarize(E_Xbar_100 = mean(prop_red))
```

```{r sampling-compute-mean-v8, echo=FALSE, purl=FALSE}
e_xbar_50 <- virtual_prop_red_50 |>
  summarize(E_Xbar_50 = mean(prop_red))
e_xbar_100 <- virtual_prop_red_100 |>
  summarize(E_Xbar_100 = mean(prop_red))
e_xbar_50_pull <- e_xbar_50 |> pull(E_Xbar_50)
e_xbar_100_pull <- e_xbar_100 |> pull(E_Xbar_100)
e_xbar_50
e_xbar_100
```

Indeed, the results are about the same as the population proportion. Note that the average of the 1000 sample proportions for samples of size 50 was actually `r e_xbar_50_pull`, which is close to but not necessarily equal to 0.375. This happens because the simulations only approximate the sampling distribution and the expected value. When using simulations we do not expect to reproduce the exact theoretical results, but rather to obtain values that are close enough to support our understanding of them.

```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is the expected value of the sample mean in the context of sampling distributions?

-   A. The observed value of the sample mean
-   B. The population mean
-   C. The median of the sample distribution
-   D. The midpoint of the range

```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```

### Sampling variation: standard deviation and standard error {#sampling-variation}

Another relevant characteristic observed in Figure \@ref(fig:comparing-sampling-distributions-3) is how the amount of dispersion or *sampling variation* changes when the sample size changes. While all the histograms have a similar bell-shaped configuration and are centered at the same value, observe that when...

- the sample size is $n=25$ (left histogram) the observed sample proportions are about as low as 0.1 and as high as 0.65.
- the sample size is $n=50$ (middle histogram) the observed sample proportions are about as low as 0.15 and as high as 0.55.
- the sample size is $n=100$ (right histogram) the observed sample proportions are about as low as 0.20 and as high as 0.5.

As the sample size $n$ increases from 25 to 50 to 100, \index{sampling distributions!relationship to sample size} the variation of the sampling distribution decreases. Thus, the values are clustered more and more tightly around the center of the distribution. In other words, the histogram on the left of Figure \@ref(fig:comparing-sampling-distributions-3) is more spread out than the one in the middle, which in turn is more spread out than the one on the right.

We know that the center of the distribution is the expected value of $\overline X$, which is the population proportion. From this, we can quantify this variation by calculating how far the sample proportions are, on average, from the population proportion. A well-known statistical measurement to quantify dispersion is the *standard deviation*. We discuss how it works before we continue with the sampling variation problem.

#### The standard deviation {-}

We start with an example and introduce some special notation. As an illustration, given four values $y_1=3$, $y_2=-1$, $y_3=5$, and $y_4= 9$, their average is given by

$$\bar y = \frac14\sum_{i=1}^4y_i =\frac14 (y_1 + y_2 + y_3 + y_4)=  \frac{3-1+5+9}{4}= 4.$$

The capital Greek letter $\Sigma$ represents the summation of values, and it is useful when a large number of values need to be added. The letter $i$ underneath $\Sigma$ is the index of summation. It starts at $i=1$, so the first value we are adding is $y_{\bf 1} = 3$. Afterwards $i=2$, so we add $y_{\bf 2}=-1$ to our previous result, and so on, as shown in the equation above. The summation symbol can be very useful when adding many numbers or making more complicated operations, such as defining the standard deviation.

To construct the standard deviation of a list of values, we

- first find the deviations of each value from their average,
- then square those deviations,
- then find the average of the squared deviations, and
- take the square root of this average to finish.

In our example, the standard deviation is given by

$$\begin{aligned}
SD &= \sqrt{\frac14\sum_{i=1}^4(y_i - \bar y)^2} = \sqrt{\frac{(3-4)^2+(-1-4)^2+(5-4)^2+(9-4)^2}{4}} \\
   &= \sqrt{\frac{1+25+1+25}{4}}=\sqrt{13} \approx 3.61
\end{aligned}$$

We present another example, this time using R. We use again our bowl activity with red and white balls in the bowl. We create a Boolean variable `is_red` that corresponds to `TRUE`s or `1`s for red balls and `FALSE`s or `0`s for white balls. Using these numbers, we compute the proportion (average of `1`s and `0`s) using the `mean()` function and the standard deviation using the `sd()` function[^1] inside `summarize()`:

[^1]: The `sd()` function actually calculates the sample standard deviation, which divides the sum of squared deviations by $n-1$ instead of $n$. The difference is noticeable for small numbers of values but almost irrelevant, for practical purposes, when using a large number of values. It is used here for simplicity.

```{r sampling-mean-sd}
bowl |>
  mutate(is_red = color == "red") |>
  summarize(p = mean(is_red), st_dev = sd(is_red))
```

So, the proportion of red balls is 0.375 with a standard deviation of 0.484. The intuition behind the standard deviation can be expressed as follows: if we were to select a ball at random from the bowl, recording a 1 for red and a 0 for white, we would expect the recorded value to be about 0.375, give or take 0.484.

In addition, when dealing with proportions, the formula for the standard deviation can be expressed directly in terms of the population proportion, $p$, using the formula:

$$SD = \sqrt{p(1-p)}.$$

Here is the value of the standard deviation using this alternative formula in R:

```{r sampling-create-p}
p <- 0.375
sqrt(p * (1 - p))
```

The value is the same as using the general formula. Now that we have gained a better understanding of the standard deviation, we can discuss the standard deviation in the context of sampling variation for the sample proportion.
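
The agreement between the general recipe and the formula for proportions can be sketched in base R. As an assumption for illustration, the zeros and ones below come from a hypothetical stand-in bowl with 900 red and 1500 white balls, and `sd_divide_by_n()` is a hypothetical helper that follows the four steps listed earlier; neither is an object used elsewhere in this chapter.

```{r sketch-sd-of-zeros-and-ones, eval=FALSE}
# Hypothetical stand-in for the bowl: 900 ones (red) and 1500 zeros (white)
sketch_is_red <- rep(c(1, 0), times = c(900, 1500))

# Hypothetical helper: deviations, squares, average, square root
sd_divide_by_n <- function(y) {
  deviations <- y - mean(y)
  sqrt(mean(deviations^2))
}

p_sketch <- mean(sketch_is_red)

sd_comparison <- c(
  four_steps = sd_divide_by_n(sketch_is_red),     # divides by n
  formula    = sqrt(p_sketch * (1 - p_sketch)),   # shortcut for proportions
  sd_fun     = sd(sketch_is_red)                  # divides by n - 1
)
sd_comparison
```

The first two entries should agree, since the formula is the four-step recipe specialized to zeros and ones. The third should be slightly larger because `sd()` divides by $n-1$, a difference that is negligible with this many values.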

#### The standard error {-}

Recall that we want to measure the magnitude of the sampling variation for the distribution of $\overline X$ (the sampling distribution of the sample proportion) and want to use the standard deviation for this purpose. We have shown earlier that the center of the distribution of $\overline X$ is the expected value of $\overline X$. In our case, this is the population proportion $p = 0.375$. The standard deviation will then indicate how far, on average, each possible sample proportion roughly is from the population proportion. If we were to consider using a sample proportion as an estimate of the population proportion, this deviation could be considered the error in estimation. Because of this particular relationship, the standard deviation of the sampling distribution receives a special name: the **standard error**. Note that all *standard errors* are standard deviations but not all standard deviations are standard errors.

We work again with simulations and the bowl of red and white balls. We take 10,000 random samples of size $n=100$, find the sample proportion for each sample, and calculate the average and standard deviation for these sample proportions. This simulation produces a histogram similar to the one presented on the right in Figure \@ref(fig:comparing-sampling-distributions-3). To produce these data we again use the `rep_slice_sample()` function and the `mean()` and `sd()` functions inside `summarize()` to produce the desired results:

```{r sampling-mean-and-sd}
bowl |>
  rep_slice_sample(n = 100, replace = TRUE, reps = 10000) |>
  summarize(prop_red = mean(color == "red")) |>
  summarize(p = mean(prop_red), SE_Xbar = sd(prop_red))
```

Observe that `p` is the estimated expected value and `SE_Xbar` is the estimated standard error based on the simulation of taking sample proportions for random samples of size $n=100$. Compare this value with the standard deviation for the entire bowl, discovered earlier. It is one-tenth the size! This is not a coincidence: the standard error of $\overline X$ is equal to the standard deviation of the population (the bowl) divided by the square root of the sample size. In the case of sample proportions, the standard error of $\overline X$ can also be determined using the formula:

$$SE(\overline X) = \sqrt{\frac{p(1-p)}{n}}$$
where $p$ is the population proportion and $n$ is the size of our sample. This formula shows that the standard error is inversely proportional to the square root of the sample size: as the sample size increases, the standard error decreases. In our example, the standard error is

$$SE(\overline X) = \sqrt{\frac{0.375\cdot(1-0.375)}{100}} = 0.0484$$

```{r sampling-assign-p}
p <- 0.375
sqrt(p * (1 - p) / 100)
```

This value is nearly identical to the result found in the simulation above. We repeat this exercise, this time finding the estimated standard error of $\overline X$ from the simulations done earlier. These simulations are stored in data frames `virtual_prop_red_25` and `virtual_prop_red_50`, for which the sample sizes used are $n=25$ and $n=50$, respectively:

```{r sampling-summarize}
virtual_prop_red_25 |>
  summarize(SE_Xbar_25 = sd(prop_red))
virtual_prop_red_50 |>
  summarize(SE_Xbar_50 = sd(prop_red))
```

The theoretical standard errors for these two sample sizes, computed from the proportion of red balls in the bowl, are:

```{r sampling-demo-code-v2-dup1}
sqrt(p * (1 - p) / 25)
sqrt(p * (1 - p) / 50)
```
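To make the inverse-square-root relationship easier to see, the theoretical standard errors for the three sample sizes used in this section can be placed side by side. This is a minimal sketch; the names `sample_sizes` and `se_theory` are introduced here only for illustration.

```{r sampling-se-by-n-sketch}
p <- 0.375
sample_sizes <- c(25, 50, 100)

# Theoretical standard error of the sample proportion for each sample size
se_theory <- sqrt(p * (1 - p) / sample_sizes)
data.frame(n = sample_sizes, SE = se_theory)

# Quadrupling the sample size (25 -> 100) halves the standard error
se_theory[sample_sizes == 25] / se_theory[sample_sizes == 100]
```

Because the standard error depends on $\sqrt{n}$ rather than on $n$, cutting it in half requires four times as many observations, not twice as many.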

The simulations support the standard errors derived using mathematical formulas; in other words, they serve as a check that the simulated results agree with the theory. Observe also that the theoretical results rely on knowing the population proportion, $p$. By contrast, the simulations draw samples from the population of interest but produce results based only on the information contained in those samples and their sample proportions.

The formula for the standard error of the sample proportion given here can actually be derived using facts in probability theory, but its development goes beyond the scope of this book. To learn more about it, please consult more advanced treatments in probability and statistics such as [this one](http://onlinestatbook.com/2/sampling_distributions/samp_dist_p.html).


#### The sampling distribution of the sample proportion {-}

So far we have shown some of the properties of the sampling distribution of the sample proportion; namely, the expected value and standard error of $\overline X$. We now turn our attention to the shape of the sampling distribution.

As mentioned before, histograms (such as those seen earlier) provide a good approximation of the sampling distribution of the sample proportion, the distribution of $\overline X$. Since we are interested in the shape of the distribution, we redraw the histograms using sample proportions from random samples of size $n=25$, $n=50$, and $n=100$, but this time we add a smooth curve that appears to connect the tops of the bars in each histogram. These histograms are presented in Figures \@ref(fig:sample-proportion-25-with-normal-pdf), \@ref(fig:sample-proportion-50-with-normal-pdf), and \@ref(fig:sample-proportion-100-with-normal-pdf). The figures are density histograms: the area of each bar represents the percentage or proportion of observations in the corresponding bin, and the total area of each histogram is 1 (or 100%). The ranges of the $x$- and $y$-axes have been kept constant across all these plots to allow appropriate comparisons among them.

```{r sample-proportion-25-with-normal-pdf, echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 2, 4), fig.cap="Histogram of the distribution of the sample proportion and the normal curve (n=25).", purl=FALSE}
n = 25
p=9/24
sd.p = sqrt(p*(1-p)/n)


if (!file.exists("rds/virtual_prop_red_25.rds")) {
  virtual_prop_red_25 <- bowl |>
    rep_slice_sample(n = 25, replace = TRUE, reps = 1000) |>
    summarize(prop_red = mean(color == "red"), n = n())
  write_rds(virtual_prop_red_25, "rds/virtual_prop_red_25.rds")
} else {
  virtual_prop_red_25 <- read_rds("rds/virtual_prop_red_25.rds")
}


ggplot(virtual_prop_red_25, aes(x = prop_red)) +
  geom_histogram(aes(y=after_stat(density)), binwidth = 0.04, color = "white") +
  stat_function(fun = dnorm,  args = list(mean = p, sd = sd.p), col="red") + xlim(0,0.8) + ylim(0,10) +
  labs(
    x = "Sample proportions with n = 25"
  )


virtual_prop_red <- virtual_prop_red |>
  mutate(p = 0.375, SE = sqrt(p*(1-p)/n))
```

```{r sample-proportion-50-with-normal-pdf, echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 2, 4), fig.cap="Histogram of the distribution of the sample proportion and the normal curve (n=50).", purl=FALSE}
n = 50
p=9/24
sd.p = sqrt(p*(1-p)/n)

if (!file.exists("rds/virtual_prop_red_50.rds")) {
  virtual_prop_red_50 <- bowl |>
    rep_slice_sample(n = 50, replace = TRUE, reps = 1000) |>
    summarize(prop_red = mean(color == "red"), n = n())
  write_rds(virtual_prop_red_50, "rds/virtual_prop_red_50.rds")
} else {
  virtual_prop_red_50 <- read_rds("rds/virtual_prop_red_50.rds")
}


ggplot(virtual_prop_red_50, aes(x = prop_red)) +
  geom_histogram(aes(y=after_stat(density)), binwidth = 0.02, color = "white") +
  stat_function(fun = dnorm,  args = list(mean = p, sd = sd.p), col="red") + xlim(0,0.8) + ylim(0,10) +
  labs(
    x = "Sample proportions with n = 50"
  )
```


```{r sample-proportion-100-with-normal-pdf, echo=FALSE, fig.height=ifelse(knitr::is_latex_output(), 2, 4), fig.cap="Histogram of the sampling distribution of the sample proportion and the normal curve (n=100).", purl=FALSE}
n = 100
p=9/24
sd.p = sqrt(p*(1-p)/n)

if (!file.exists("rds/virtual_prop_red_100.rds")) {
  virtual_prop_red_100 <- bowl |>
    rep_slice_sample(n = 100, replace = TRUE, reps = 1000) |>
    summarize(prop_red = mean(color == "red"), n = n())
  write_rds(virtual_prop_red_100, "rds/virtual_prop_red_100.rds")
} else {
  virtual_prop_red_100 <- read_rds("rds/virtual_prop_red_100.rds")
}


ggplot(virtual_prop_red_100, aes(x = prop_red)) +
  geom_histogram(aes(y=after_stat(density)), binwidth = 0.01, color = "white") +
  stat_function(fun = dnorm,  args = list(mean = p, sd = sd.p), col="red") + xlim(0,0.8) + ylim(0,10) +
  labs(
    x = "Sample proportions with n = 100"
  )
```

The curves in red seem to be a fairly good representation of the tops of the bars of the histograms. However, we have not used the simulated data to draw these curves. These bell-shaped curves were taken from the normal distribution with mean equal to $p=0.375$ and standard deviation equal to $\sqrt{p(1-p)/n}$, where $n$ changes for each histogram. This fascinating result is an application of one of the most important results in statistics: the Central Limit Theorem (CLT).

The CLT states that when the sample size, $n$, tends to infinity, the distribution of $\overline X$ tends to the normal distribution (with the appropriate mean and standard deviation).
Moreover, this result does not depend on the shape of the population distribution; the population can be a bowl with red and white balls or anything else.
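The claim that the population can be "anything else" can be explored with a population that looks nothing like a bell curve. The sketch below is an illustration under stated assumptions, not part of the bowl activity: it draws sample means from a strongly right-skewed exponential population (a choice made here purely for illustration) and tracks how the skewness of those sample means shrinks as the sample size grows. The helper names `sample_skewness` and `simulate_means` are hypothetical.

```{r sampling-clt-skewed-sketch}
set.seed(76)

# Sample skewness: close to 0 for a symmetric distribution such as the normal
sample_skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Sample means from an exponential population (assumed here for illustration)
simulate_means <- function(size, reps = 5000) {
  replicate(reps, mean(rexp(size, rate = 1)))
}

clt_sizes <- c(1, 10, 100)
skew_by_size <- sapply(
  clt_sizes,
  function(size) sample_skewness(simulate_means(size))
)
data.frame(n = clt_sizes, skewness = skew_by_size)
```

The skewness moves toward 0 as the sample size increases, which is what we would expect if the distribution of the sample mean is becoming more symmetric and bell-shaped.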

The observant reader might have noticed that, in practice, we cannot take samples of size equal to infinity.
What makes the CLT even more relevant for practical purposes is that the distribution of $\overline X$ approximates normality even when the sample size used is fairly small. As you can see in Figure \@ref(fig:sample-proportion-25-with-normal-pdf), even when random samples of size $n=25$ are used, the distribution of $\overline X$ already seems to follow a normal distribution.

Observe also that all the curves follow the bell-shaped form of the normal curve, but the spread is greater when a smaller sample size has been used. In each case, the spread is consistent with the standard error of $\overline X$ found earlier.
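One way to go beyond a visual comparison is to count how many simulated sample proportions fall within two standard errors of $p$ and compare that share with what the normal curve predicts. The following is a minimal sketch under an assumption: instead of `rep_slice_sample()` on `bowl`, it uses base R's `rbinom()`, since the number of red balls in $n$ draws with replacement from a bowl with proportion $p$ of red balls follows a binomial distribution. The names `prop_red_sim`, `se_100`, and `share_within_2se` are introduced only for this illustration.

```{r sampling-normal-check-sketch}
set.seed(76)
p <- 0.375
n_draws <- 100
reps <- 10000

# Simulated sample proportions (binomial stand-in for sampling with replacement)
prop_red_sim <- rbinom(reps, size = n_draws, prob = p) / n_draws

# Theoretical standard error for samples of size 100
se_100 <- sqrt(p * (1 - p) / n_draws)

# Share of simulated proportions within two standard errors of p
share_within_2se <- mean(abs(prop_red_sim - p) <= 2 * se_100)
share_within_2se

# Share predicted by the normal distribution, about 0.9545
pnorm(2) - pnorm(-2)
```

The simulated share should be close to, though not exactly equal to, the normal value: sample proportions can only take values that are multiples of $1/n$, so a small discrepancy is expected.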

```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is the role of the Central Limit Theorem (CLT) in statistical inference?

- A. It provides the formula for calculating the standard deviation of any given sample, allowing for an understanding of the sample's spread or variability.
- B. It states that the sampling distribution of the sample mean will approach a normal distribution, regardless of the population's distribution, as the sample size becomes large.
- C. It determines the actual mean of the population directly by calculating it from a randomly selected sample, without needing additional data or assumptions.
- D. It is a principle that applies strictly and exclusively to populations that are normally distributed, ensuring that only in such cases the sample means will follow a normal distribution pattern.

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does the term "sampling variation" refer to?

- A. Variability in the population data.
- B. Differences in sample statistics due to random sampling.
- C. Changes in the population parameter over time.
- D. Variation caused by errors in data collection.

```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```

### Summary

Let's look at what we have learned about the sampling distribution of the sample proportion:

1. The mean of all the sample proportions will be exactly the same as the population proportion.
2. The standard deviation of the sample proportions, also called the standard error, is inversely proportional to the square root of the sample size: the larger the sample size used to calculate sample proportions, the closer those sample proportions will be to the population proportion, on average.
3. As long as the random samples used are large enough, the sampling distribution of the sample proportion, or simply the distribution of $\overline X$, will approximate the normal distribution. This is true for sample proportions regardless of the structure of the underlying population distribution; that is, regardless of how many red and white balls are in the bowl, or whether you are performing any other experiment that deals with sample proportions.

In case you want to reinforce these ideas a little more, Shuyi Chiou, Casey Dunn, and Pathikrit Bhattacharyya created a 3-minute and 38-second video at <https://youtu.be/jvoxEYmQHNM> explaining this crucial statistical theorem using the average weight of wild bunny rabbits and the average wingspan of dragons as examples. Figure \@ref(fig:CLT-video-preview) shows a preview of this video.