---
title: "16 Discrete Probability Models"
author: "David Housman"
format: docx
editor: visual
fig-width: 4.5
fig-height: 3
---

```{r, setup, include=FALSE}
#knitr::opts_chunk$set(eval = FALSE)
```

```{r}
#| message: false
#| warning: false
#| echo: false
library(tidyverse)
```

## Important Discrete Random Variables.

For each of the following random variables, (a) derive the probability distribution function, (b) use the pdf to answer the applied probability question, (c) derive the mean, (d) derive the variance and standard deviation, and (e) explore the distributions with a spike graph and comparison with the empirical rule.

1.  $\text{Uniform}(n)$ is the number on a ball randomly drawn from an urn containing balls marked $1, 2, \ldots, n$. What is the probability that a person's birthday is in January?

2.  $\text{Binomial}(n, p)$ is the number of heads which occur when a coin is tossed $n$ times, assuming that on any one toss, the probability of heads is $p$. What is the probability that a 0.300 batter gets 3 hits during a game in which she has 5 at bats?

```{r}
c(choose(5,3)*0.3^3*(1-0.3)^(5-3), dbinom(3, 5, 0.3))
```

3.  $\text{Geometric}(p)$ is the number of coin flips until heads appears, assuming that on any one toss, the probability of heads is $p$. (Note that R's definition uses "before" instead of "until". The definition used here is consistent with Ledolter and much of the rest of the world.) What is the probability that a six first appears on the fourth roll of a six-sided die?

4.  $\text{Poisson}(\lambda)$ is the number of events that occur in a fixed interval when $\lambda$ events occur in an interval on average, for sufficiently small intervals no more than one event occurs, and the number of events in two non-overlapping intervals are independent. The changes could be traffic accidents, defects in manufactured items, flaws in cloth, insurance claims, typographical errors, deaths, or German bombs dropped in a section of London during World War II. The intervals could be in time or space. What is the probability that in a sample of 5,000 people that exactly 10 have a disease assuming that there is a 0.2% chance that any single person has the disease?

## Distribution Shape and Comparison with Empirical Rule

This section has code for easily plotting spike graphs of random variables and comparing the distribution with the empirical rule.

```{r discreteplot}
#| echo: false
discrete.plot = function(x, p, mu, sigma, title='') {
  domain = max(x) - min(x)
  domain = c(min(x)-0.01*domain, max(x)+0.01*domain)
  dist = tibble(x = x, pdf = p)
  marks = tibble(
    sds = -3:3,
    x = mu + sds*sigma,
    lbl = c("µ-3σ","µ-2σ","µ-σ","µ","µ+σ","µ+2σ","µ+3σ"))
  lbl = paste(
    "p1 =", format(sum(p[mu -  sigma <= x & x <= mu +  sigma]), digits = 3), 
    "\np2 =", format(sum(p[mu -2*sigma <= x & x <= mu +2*sigma]), digits = 3), 
    "\np3 =", format(sum(p[mu -3*sigma <= x & x <= mu +3*sigma]), digits = 3))
  ggplot(dist, aes(x = x, y = pdf)) + 
    geom_point() +
    geom_col(width = 0.05) +
    geom_segment(aes(x = x, y = 0, xend = x, yend = 0.1*max(p)), marks, color = "blue") +
    geom_text(aes(x = x, y = 0.05*max(p), label = lbl), marks, color = "blue") +
    scale_x_continuous(limits = domain) +
    ggtitle(title) +
    annotate("text", x = domain[1], y = max(p), hjust = "left", vjust = "top", label = lbl, color = "blue")
}
```

```{r quadratic}
#| echo: false
#| warning: false
x = 1:10
p = (1/385) * x^2
mu = sum(x*p)
sigma = sqrt(sum((x-mu)^2*p))
discrete.plot(x, p, mu, sigma, "Quadratic Random Variable")
```

```{r duniform}
#| echo: false
#| warning: false
duniform = function(n) {
   x = 1:n
   p = rep(1/n, n)
   mu = (n+1)/2
   sd = sqrt((n^2-1)/12)
   title = paste("Uniform(",n,")", sep="")
   discrete.plot(x, p, mu, sd, title)
}
duniform(10)
```

```{r dbinomial}
#| echo: false
#| warning: false
dbinomial = function(n, p, xmin=0, xmax=n) {
   x = xmin:xmax
   prob = choose(n,x) * p^x * (1-p)^(n-x)
   mu = n*p
   sd = sqrt(n*p*(1-p))
   title = paste("Binomial(",n,",",p,")", sep="")
   discrete.plot(x, prob, mu, sd, title)
}
dbinomial(10, 0.6)
```

```{r dgeometric}
#| echo: false
#| warning: false
dgeometric = function(p, xmin=1, xmax=(1+4*sqrt(1-p))/p) {
   x = xmin:xmax
   prob = (1-p)^(x-1) * p
   mu = 1/p
   sd = sqrt(1-p)/p
   title = paste("Geometric(",p,")", sep="")
   discrete.plot(x, prob, mu, sd, title)
}
dgeometric(0.1)
```

```{r dpoisson}
#| echo: false
#| warning: false
dpoisson = function(lambda, xmin=0, xmax=lambda+4*sqrt(lambda)) {
   x = xmin:xmax
   prob = (lambda^x / factorial(x)) * exp(-lambda)
   mu = lambda
   sd = sqrt(lambda)
   title = paste("Poisson(",lambda,")", sep="")
   discrete.plot(x, prob, mu, sd, title)
}
dpoisson(0.9)
```

## Random Number Generation

Let $U$ be the random variable that yields a real number in the interval $(0, 1)$ with each number equally probable. We assume that a random number generator exists for $U$.The R function `runif(n)` generates `n` such numbers. We show how to obtain a random number generator for any other random variable $X$ by first illustrating with the random variable with pdf $f(x) = 0.04 + 0.08x^2$ for $x = -2, -1, 0, 1, 2$.

We first create the pdf and cdf for $X$.

```{r}
theory = tibble(
  x = -2:2,
  pdf = 0.04 + 0.08*x^2,
  cdf = cumsum(pdf)
)
theory
```

Observe that the pdf is non-negative and sums to 1.

The random number generation method is best illustrated with a graph of the cdf.

```{r}
ggplot(theory, aes(x = x, y = cdf)) +
  geom_step() +
  scale_x_continuous(breaks = theory$x)
```

An accurate graph of the cdf would have jumps instead of vertical lines as shown; however, these vertical lines show us how to build the desired random number generator. We first generate a number from $U$ and think of it as appearing on the vertical axis and use the cdf graph to identify a number on the horizontal axis. For example, if $U$ generates a number between 0.52 and 0.64, then 1 will be returned. Hence, the probability that 1 will be returned is 0.64 - 0.52 = 0.12, which is the desired probability for $X = 1$. Symbolically, if $F$ is the desired cdf for $X$, then $X = F^{-1}(U)$ where $F^{-1}$ is the "inverse" of $F$ as defined above. The next code chunk defines this inverse function.

```{r}
icdf = function(u) theory$x[min(which(u < theory$cdf))]
icdf = Vectorize(icdf)
```

The inner-most code returns the whether a number generated by $U$ is less than each of the key $X$ cdf jumps.

```{r}
0.60 < theory$cdf
```

The `which` function then returns the indices for which the number generated by $U$ is less than the key $X$ cdf jumps.

```{r}
which(0.60 < theory$cdf)
```

The `min` function then returns the index for which the number generated by $U$ is less than the key $X$ cdf jumps for the first time.

```{r}
min(which(0.60 < theory$cdf))
```

The `theory$x[]` then returns the appropriate value of $X$.

```{r}
theory$x[min(which(0.60 < theory$cdf))]
```

The first definition of `icdf` works for single number inputs. The second definition allows the function to work element by element on a vector input.

```{r}
icdf(c(0.60, 0.33, 0.51))
```

We can now generate $n = 10000$ simulations of $X$.

```{r}
n = 10000
simulation = tibble(
  u = runif(n),
  x = icdf(u) #map_vec(u, icdf)
)
```

Finally, we can compare the 10,000 simulations with the theoretical distribution graphically.

```{r}
ggplot(simulation, aes(x = x, y = after_stat(count/sum(count)))) +
  geom_histogram(binwidth = 1, color = "white") +
  geom_col(aes(x = x, y = pdf), theory, width = 0.05, fill = "red") +
  ylab("Probability")
```

Based upon our quick margin of error formula, we expect that 95% of the time the relative frequency of each value (the histogram bars) should be within $\pm 1/\sqrt{n} = \pm 0.01$ of the true probability (the red spikes).

To do the same thing with a Poisson random variable having a pdf of $f(x) = \dfrac{\lambda^x}{x!} e^{-\lambda}$ for $x = 0, 1, 2, \ldots$, adds the complication that there is an infinite number of possible values. The following code gets around that problem by generating the random numbers from $U$ before completing the pdf and cdf computations only as far as needed.

```{r}
# create part of the theoretical distribution
lambda = 4
theory = tibble(
  x = 0:ceiling(lambda + 2*sqrt(lambda)),
  pdf = lambda^x / factorial(x) * exp(-lambda),
  cdf = cumsum(pdf)
)

# obtain n uniform simulations
n = 10000
simulation = tibble(
  u = runif(n)
)

# complete as much of the theoretical distribution as needed
maxu = max(simulation$u)
repeat {
  last = theory |> slice_tail()
  if (last$cdf >= maxu) break
  nxt = tibble(
    x = last$x + 1,
    pdf = lambda^x / factorial(x) * exp(-lambda),
    cdf = last$cdf + pdf
  )
  theory = rbind(theory, nxt)
}

# create the inverse cdf function
icdf = function(u) theory$x[min(which(u < theory$cdf))]
icdf = Vectorize(icdf)

# obtain n Poisson(lambda) simulations
simulation = simulation |>
  mutate(x = icdf(u)
)

# compare simulation with theory
ggplot(simulation, aes(x = x, y = after_stat(count/sum(count)))) +
  geom_histogram(binwidth = 1, color = "white") +
  geom_col(aes(x = x, y = pdf), theory, width = 0.1, fill = "red") +
  ylab("Probability")
```

## Model Fit Example

In an experiment conducted by Rutherford, Chadwick, and Ellis, a radioactive substance was observed during 2608 time intervals of 7.5 seconds each; the number of particles reaching a counter was obtained for each period. The data are summarized in the tibble below. Can this data be modeled by one of the random variables we have defined?

```{r}
rutherford = tibble(
  particles = 0:14,
  freq = c(57,203,383,525,532,408,273,139,45,27,12,2,1,0,1)
)
```

```{r}
summ = rutherford |>
  summarize(xbar = sum(particles*freq)/sum(freq),
            s2 = sum((particles-xbar)^2*freq/(sum(freq)-1)),
            s = sqrt(s2))
summ
```

```{r}
theory = tibble(
  x = 0:14,
  pdf = dpois(x, 3.77)
)
```

```{r}
ggplot(rutherford, aes(x = particles, y = freq/sum(freq))) +
  geom_col() +
  geom_col(aes(x = x, y = pdf), theory, width = 0.1, fill = "red")
```

## R Black Box Functions

In the following, `binom` and `geom` can replace `pois` for binomial and geometric random variables. Note that R defines the geometric random variable as the flip in which the last tails appears, rather than the flip in which the first heads appears. Base R does not have a discrete uniform random variable.

```{r blackbox, eval=FALSE}
#With a Poisson random variable with lambda = 3 ...
#generate 8 random numbers
rpois(8, 3)
#find P(X=x) for x = 0, 1, ..., 4
dpois(0:4, 3)
#find P(X<=x) for x = 0, 1, ..., 4
ppois(0:4, 3)
#find the 0.1, 0.2, ..., 0.9 quantiles
qpois(seq(0.1, 0.9, 0.1), 3)
```