--- title: "C14 Random Variables and Expectation" author: "David Housman" format: docx editor: visual fig-width: 6 fig-height: 4 --- ```{r, setup, include=FALSE} knitr::opts_chunk$set(eval = TRUE) ``` ```{r} #| message: false #| warning: false #| echo: false library(tidyverse) ``` ## Introduction Probability has many subtleties, some of which we have discussed: human intuition about the randomness in flipping coins is weak, deterministic computers can mimic randomness well, interpretations are manifold, the almost certainty of miracles, hot streaks may be human imagination rather than reality, and the general confusion between a priori and a posteriori conditional probabilities. The Bltzstein and Hwang *Introduction to Probability* book explores many other interesting subtleties and non-statistical applications of probability (e.g., sections 2.2.5-7 is an excellent discussion of conditional probability subtleties, section 2.7 contains a discussiong of the Monty Hall problem, section 2.8 contains discussions of the prosecutor's fallacy, defense attorney's fallacy, and Simpson's paradox, and chapters 11-12 are devoted to Markov chains and Monte Carlo techniques). Since this course is also about statistics, we will focus on the aspects of probability that serve to explain the inferential statistics techniques developed later. A *random variable* is a function from a sample space to real numbers.  More informally, a random variable is a question that can be asked to outcomes of an experiment or the respondents in an observational study. The *probability distribution function* (*pdf*) or *probability mass function* (*pmf*) records the probability for each possible value of a random variable. The *cumulative distribution function* (*cdf*) records the probability for all values less than or equal to each possible real number. The *expectation* or *mean* of a random variable 𝑋 with pdf $f$ is $\mu_X = E[X] = \sum_x xf(x)$. Observe the consistency of this population mean parameter and the previously defined sample mean statistic. he *variance* of a random variable 𝑋 with pdf $f$ is $\sigma_X^2 = E[(X-\mu_X)^2] = \sum_x (x-\mu_X)^2f(x)$.  The *standard deviation* is $\sigma_X = \sqrt{\sigma_X^2}$  Observe the difference between this population variance and standard deviation parameters and the previously defined sample variance and standard deviation statistics. The *empirical rule* states that a normally distributed random variable contains two-thirds (68.27%), 95% (95.45%), and almost all (99.73%) of its distribution within one, two, and three standard deviations of the mean. ## Exercises 1. Draw three balls without replacement from an urn with 2 red and 8 blue balls.  Let $X$ be the number of red balls drawn. Do the following by hand. Find the probability distribution function of $X$, draw a spike graph of the probability distribution function of $X$, draw a graph of the cumulative distribution function of $X$, find the mean of $X$, find the variance of $X$, find the standard deviation of $X$, and compare the pdf of $X$ with the empirical rule. 
## Exercises

1.  Draw three balls without replacement from an urn with 2 red and 8 blue balls. Let $X$ be the number of red balls drawn. Do the following by hand. Find the probability distribution function of $X$, draw a spike graph of the probability distribution function of $X$, draw a graph of the cumulative distribution function of $X$, find the mean of $X$, find the variance of $X$, find the standard deviation of $X$, and compare the pdf of $X$ with the empirical rule.

```{r}
# pdf of X: choose x of the 2 red and 3-x of the 8 blue among 3 of 10 balls
f = function(x) { choose(2,x)*choose(8,3-x)/choose(10,3) }
xdist = tibble(
  x = 0:2,
  pdf = f(x),
  cdf = cumsum(pdf))
xdist
```

```{r}
# mean, variance, standard deviation, and empirical rule probabilities
xsumm = xdist |>
  summarise(
    mu = sum(x*pdf),
    sigma2 = sum((x-mu)^2*pdf),
    sigma = sqrt(sigma2),
    s1 = sum(f(x[mu - sigma <= x & x <= mu + sigma])),
    s2 = sum(f(x[mu - 2*sigma <= x & x <= mu + 2*sigma])),
    s3 = sum(f(x[mu - 3*sigma <= x & x <= mu + 3*sigma])))
xsumm
```

```{r}
#| fig-height: 2
# spike graph of the pdf with mean and standard deviation marks
xmark = tibble(
  sds = -3:3,
  x = xsumm$mu + sds*xsumm$sigma,
  lbl = c("µ-3σ","µ-2σ","µ-σ","µ","µ+σ","µ+2σ","µ+3σ"))
ggplot(xdist, aes(x = x, y = pdf)) +
  geom_point() +
  geom_col(width = 0.03) +
  geom_segment(aes(x = x, y = 0, xend = x, yend = 0.05), xmark, color = "blue") +
  geom_text(aes(x = x, y = 0.02, label = lbl), xmark, color = "blue")
```

```{r}
#| fig-height: 2
# cdf of X as a step function
ggplot(xdist, aes(x = x, y = cdf)) +
  geom_step() +
  scale_y_continuous(limits = c(0,1))
```

2.  Simulate $X$ as described in the previous exercise 5000 times and compare the empirical and theoretical distributions with a histogram overlaid with a spike graph.

```{r}
set.seed(19570830)
urn = c(rep("r", 2), rep("b", 8))
# each row is one replication of drawing 3 balls without replacement
draws = matrix("", nrow = 5000, ncol = 3)
for (i in 1:5000) {
  draws[i, ] = sample(urn, 3)
}
colnames(draws) = paste0("draw", 1:3)
# count the red balls in each replication
draws = as_tibble(draws) |>
  rowwise() |>
  mutate(X = sum(c_across(draw1:draw3) == "r")) |>
  ungroup()
draws
```

```{r}
# empirical proportions (gray bars) vs. theoretical pdf (red spikes)
ggplot(draws, aes(x = X, y = after_stat(count/sum(count)))) +
  geom_bar() +
  geom_col(aes(x = x, y = pdf), xdist, width = 0.05, fill = "red")
```
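As a cross-check (an aside beyond the exercise as stated): $X$ counts successes when sampling without replacement, so it has a hypergeometric distribution, and R's built-in `dhyper` and `rhyper` reproduce the exact pdf and the simulation in one call each.

```{r}
# X is hypergeometric: m = 2 red (successes), n = 8 blue, k = 3 drawn
dhyper(0:2, m = 2, n = 8, k = 3)                  # matches xdist$pdf
table(rhyper(5000, m = 2, n = 8, k = 3)) / 5000   # one-line simulation
```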
3.  Draw three balls in turn and without replacement from an urn with 2 red and 8 blue balls. Let $Y$ be the draw number of the first blue ball. Find the probability distribution function of $Y$, draw a spike graph of the probability distribution function of $Y$, draw a graph of the cumulative distribution function of $Y$, find the mean of $Y$, find the variance of $Y$, find the standard deviation of $Y$, and compare the pdf of $Y$ with the empirical rule.

| $y$ | $g(y)$ | $y\,g(y)$ | $(y-\mu)^2 g(y)$ |
|:---:|:----------------------------------------------------:|:-------:|:----------------------------------:|
| 1 | $\frac{8\cdot 9\cdot 8}{10\cdot 9\cdot 8} = 36/45$ | $36/45$ | $(1-11/9)^2(36/45) \approx 0.0395$ |
| 2 | $\frac{2\cdot 8\cdot 8}{10\cdot 9\cdot 8} = 8/45$ | $16/45$ | $(2-11/9)^2(8/45) \approx 0.1075$ |
| 3 | $\frac{2\cdot 1\cdot 8}{10\cdot 9\cdot 8} = 1/45$ | $3/45$ | $(3-11/9)^2(1/45) \approx 0.0702$ |
| sum | $1$ | $11/9$ | $0.2172$ |

$\mu = 11/9 \approx 1.22$, $\sigma^2 \approx 0.2172$, $\sigma \approx 0.466$. $P(\mu-\sigma \leq Y \leq \mu+\sigma) = 36/45 = 0.8$ vs. 0.68. $P(\mu-2\sigma \leq Y \leq \mu+2\sigma) = 44/45 \approx 0.978$ vs. 0.95. $P(\mu-3\sigma \leq Y \leq \mu+3\sigma) = 44/45 \approx 0.978$ vs. 0.997.

The following code chunks repeat the above work using R.

```{r}
# pdf of Y: count the equally likely orderings of the first three draws
# (out of 10*9*8), noting Y takes the values 1, 2, 3
g = function(y) {
  case_when(
    y == 1 ~ 8*9*8,
    y == 2 ~ 2*8*8,
    y == 3 ~ 2*1*8) / (10*9*8)
}
ydist = tibble(
  y = 1:3,
  pdf = g(y),
  cdf = cumsum(pdf))
ydist
```

```{r}
# mean, variance, standard deviation, and empirical rule probabilities
ysumm = ydist |>
  summarise(
    mu = sum(y*pdf),
    sigma2 = sum((y-mu)^2*pdf),
    sigma = sqrt(sigma2),
    s1 = sum(g(y[mu - sigma <= y & y <= mu + sigma])),
    s2 = sum(g(y[mu - 2*sigma <= y & y <= mu + 2*sigma])),
    s3 = sum(g(y[mu - 3*sigma <= y & y <= mu + 3*sigma])))
ysumm
```

```{r}
#| fig-height: 2
# spike graph of the pdf with mean and standard deviation marks
ymark = tibble(
  sds = -3:3,
  y = ysumm$mu + sds*ysumm$sigma,
  lbl = c("µ-3σ","µ-2σ","µ-σ","µ","µ+σ","µ+2σ","µ+3σ"))
ggplot(ydist, aes(x = y, y = pdf)) +
  geom_point() +
  geom_col(width = 0.03) +
  geom_segment(aes(x = y, y = 0, xend = y, yend = 0.1), ymark, color = "blue") +
  geom_text(aes(x = y, y = 0.02, label = lbl), ymark, color = "blue")
```

```{r}
#| fig-height: 2
# cdf of Y as a step function
ggplot(ydist, aes(x = y, y = cdf)) +
  geom_step() +
  scale_y_continuous(limits = c(0,1))
```

4.  Prove that $\sigma_X^2 = E[X^2] - E[X]^2$. This identity is less interpretable than the definition of variance, but it often makes the variance easier to calculate.
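One possible argument (an addition to these notes), using linearity of expectation and the fact that $\mu_X = E[X]$ is a constant:

$$\sigma_X^2 = E[(X-\mu_X)^2] = E[X^2 - 2\mu_X X + \mu_X^2] = E[X^2] - 2\mu_X E[X] + \mu_X^2 = E[X^2] - E[X]^2.$$

A numeric check of the identity using the distribution of $X$ from exercise 1:

```{r}
# E[X^2] - E[X]^2 should equal the variance computed from the definition
xdist |>
  summarise(
    shortcut = sum(x^2*pdf) - sum(x*pdf)^2,  # E[X^2] - E[X]^2
    definition = xsumm$sigma2)               # E[(X - mu)^2]
```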