---
title: "C33 Sampling"
author: "David Housman"
format: docx
editor: visual
fig-width: 6
fig-height: 2.4
---

```{r}
#| message: false
#| warning: false
#| echo: false
library(tidyverse)
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Concepts

Suppose we have a known population attribute described by the random variable $X$ with pdf $f$, mean $\mu$, and standard deviation $\sigma$. A *random sample* of the population attribute is a list $X_1, X_2, \ldots, X_n$ of independent random variables, each having the same distribution as $X$. A *statistic* is a function $\theta$ of a random sample. Two important statistics are the sample mean

$\qquad \bar{X} = \dfrac{1}{n}(X_1 + X_2 + \cdots + X_n)$

and the sample variance

$\qquad S^2 = \dfrac{1}{n-1}\left((X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2\right).$

We are interested in the distribution of $\theta(X_1, X_2, \ldots, X_n)$, called the *sampling distribution* of $\theta$.

![](31probability.png){width="3in"}

An approximation to the sampling distribution can be found via simulation. Before computers became powerful enough to carry out large simulations, many sampling distributions were obtained theoretically using calculus and algebra. Some of these results are listed below.

1.  $E[\bar{X}] = \mu$ and $\text{Var}[\bar{X}] = \sigma^2/n$.

2.  $E[S^2] = \sigma^2$ and $\text{Var}[S^2] = \dfrac{\mu_4}{n} - \dfrac{\sigma^4(n-3)}{n(n-1)}$ where $\mu_4 = E[(X_i-\mu)^4]$ is the fourth central moment of each $X_i$. The table provides formulas for $\mu_4$ for several random variables (mathworld.wolfram.com is a good source for this type of information).

| Random Variable | $\mu$ | $\sigma^2$ | $\mu_4$ |
|--------------------------|----------------------|-------------------------------------|---------------------------------------------------------------------------------------------|
| Binomial$(n,p)$ | $np$ | $np(1-p)$ | $np(1-p)(1+(3n-6)p(1-p))$ |
| Poisson$(\lambda)$ | $\lambda$ | $\lambda$ | $3\lambda^2 + \lambda$ |
| Geometric$(p)$ | $1/p$ | $(1-p)/p^2$ | $(1-p)(p^2-9p+9)/p^4$ |
| NegativeBinomial$(r,p)$ | $r/p$ | $r(1-p)/p^2$ | $r(1-p)(6-6p+p^2+3r-3pr)/p^4$ |
| Uniform$(a,b)$ | $(a+b)/2$ | $(b-a)^2/12$ | $(b-a)^4/80$ |
| Normal$(\mu,\sigma)$ | $\mu$ | $\sigma^2$ | $3\sigma^4$ |
| LogNormal$(\mu, \sigma)$ | $e^{\mu+\sigma^2/2}$ | $e^{2\mu+\sigma^2}(e^{\sigma^2}-1)$ | $e^{4\mu+2\sigma^2}(e^{\sigma^2}-1)^2(e^{4\sigma^2} + 2e^{3\sigma^2} + 3e^{2\sigma^2} - 3)$ |
| Exponential$(\lambda)$ | $1/\lambda$ | $1/\lambda^2$ | $9/\lambda^4$ |
| Gamma$(\alpha,\beta)$ | $\alpha\beta$ | $\alpha\beta^2$ | $3\alpha(\alpha+2)\beta^4$ |

3.  The limit of the distribution of $\dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}$ as $n$ approaches $\infty$ is the standard normal distribution.

4.  The limit of the distribution of $\dfrac{S^2 - \sigma^2}{\sqrt{\mu_4 - \sigma^4(n-3)/(n-1)}/\sqrt{n}}$ as $n$ approaches $\infty$ is the standard normal distribution.

5.  If each $X_i = \text{Normal}(\mu, \sigma)$, then $\bar{X} = \text{Normal}(\mu, \sigma/\sqrt{n})$ and $S^2 = \text{Gamma}\left(\dfrac{n-1}{2}, \dfrac{2\sigma^2}{n-1}\right)$.

6.  If each $X_i = \text{Normal}(\mu, \sigma)$, then $T = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ has a Student $t$ distribution with $n-1$ degrees of freedom.

7.  If each $X_i = \text{Gamma}(\alpha, \beta)$, then $\bar{X} = \text{Gamma}\left(n\alpha, \dfrac{\beta}{n}\right)$.

8.  If each $X_i = \text{Binomial}(m, p)$, then $\bar{X} = \dfrac{1}{n}\text{Binomial}\left(nm, p\right)$.
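These formulas can be spot-checked by simulation. The sketch below, with arbitrarily chosen values $\lambda = 2$ and $n = 5$, compares the theoretical $\text{Var}[S^2]$ from result 2, using the exponential row of the table, with the variance of many simulated sample variances.

```{r}
# Spot-check result 2 for an Exponential(lambda) population
lambda = 2; n = 5; nruns = 100000
sigma2 = 1/lambda^2                       # population variance
mu4 = 9/lambda^4                          # fourth central moment from the table
mu4/n - sigma2^2*(n-3)/(n*(n-1))          # theoretical Var[S^2]
tibble(run = rep(1:nruns, each = n),
       x = rexp(n*nruns, rate = lambda)) |>
  group_by(run) |>
  summarise(s2 = var(x)) |>
  summarise(var.s2 = var(s2))             # simulated Var[S^2]
```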
## Cereal Example

A cereal manufacturer packages cereal in boxes with a 12-ounce label weight. Suppose that the actual distribution of weights is normal with mean 12.2 ounces and standard deviation 0.2 ounce. Use both simulation and theoretical approaches to find the probabilities of the following events.

1.  A single box is under the labeled weight.

```{r}
# Simulation approach
nruns = 10000
tibble(x = rnorm(nruns, 12.2, 0.2)) |> 
  summarise(prob = sum(x < 12) / nruns)
```

```{r}
# Theoretical approach
pnorm(12, 12.2, 0.2)
```

2.  The mean of a sample of four boxes is under the labeled weight.

```{r}
# Simulation approach
nruns = 10000
n = 4
tibble(run = rep(1:nruns, each = n), 
       x = rnorm(n*nruns, 12.2, 0.2)) |> 
  group_by(run) |> 
  summarise(mean = mean(x)) |> 
  summarise(prob = sum(mean < 12) / nruns)
```

```{r}
# Theoretical approach
pnorm(12, 12.2, 0.2/sqrt(4))
```

3.  The standard deviation of a sample of four boxes is under 0.2 ounce.

```{r}
# Simulation approach
nruns = 10000
n = 4
tibble(run = rep(1:nruns, each = n), 
       x = rnorm(n*nruns, 12.2, 0.2)) |> 
  group_by(run) |> 
  summarise(sd = sd(x)) |> 
  summarise(prob = sum(sd < 0.2) / nruns)
```

By result 5, $S^2 = \text{Gamma}\left(\dfrac{n-1}{2}, \dfrac{2\sigma^2}{n-1}\right)$, so $P(S < 0.2) = P(S^2 < 0.2^2)$.

```{r}
# Theoretical approach
pgamma(0.2^2, shape = (4-1)/2, scale = 2*0.2^2/(4-1))
```

Comments: Simulation is an all-purpose tool, but it requires substantial code, and with $n = 10{,}000$ runs the answers are only accurate to within about $\pm 0.01$ with 95% confidence. Theoretical tools provide exact answers (up to the precision of the numerical approximations to the integrals) but require that the relevant theorems have been proved.

## Widget Lifetime Example

The lifetimes of widgets follow an exponential distribution with a mean of 250 hours.

1.  What proportion of the widgets will fail within 120 hours?

```{r}
# Simulation approach
nruns = 10000
tibble(x = rexp(nruns, rate = 1/250)) |> 
  summarise(prob = sum(x < 120) / nruns)
```

```{r}
# Theoretical approach
pexp(120, rate = 1/250)
pgamma(120, shape = 1, scale = 250)
```

2.  What is the probability that the mean lifetime of a sample of nine widgets is under 120 hours?

```{r}
# Simulation approach
nruns = 10000
n = 9
tibble(run = rep(1:nruns, each = n), 
       x = rexp(n*nruns, rate = 1/250)) |> 
  group_by(run) |> 
  summarise(mean = mean(x)) |> 
  summarise(prob = sum(mean < 120) / nruns)
```

If $X$ is the lifetime of a widget, then $X = \text{Exp}(\lambda) = \text{Gamma}(\alpha = 1, \beta = 1/\lambda)$, and by result 7, $\bar{X} = \text{Gamma}(\alpha = n, \beta = 1/(\lambda n))$.

```{r}
# Theoretical approach
pgamma(120, shape = 9, scale = 250/9)
```

The following shows that the sample size must be larger than 9 for the CLT to provide a good approximation.

```{r}
# CLT approach
pnorm(120, 250, 250/sqrt(9))
```

3.  What is the probability that in a sample of nine widgets, at least one will fail within 120 hours?

```{r}
# Simulation approach
nruns = 10000
n = 9
tibble(run = rep(1:nruns, each = n), 
       x = rexp(n*nruns, rate = 1/250)) |> 
  group_by(run) |> 
  summarise(min = min(x)) |> 
  summarise(prob = sum(min < 120) / nruns)
```

At least one widget fails within 120 hours unless all nine survive past 120 hours, so the probability is $1 - (1 - F(120))^9$ where $F$ is the cdf of a single lifetime.

```{r}
# Theoretical approach
1 - (1 - pexp(120, rate = 1/250))^9
```
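Equivalently, the minimum of independent exponential random variables is exponential with rate equal to the sum of the rates, so the minimum of the nine lifetimes is $\text{Exp}(9\lambda)$ and the same probability can be computed directly.

```{r}
# Minimum of nine Exp(1/250) lifetimes is Exp(9/250)
pexp(120, rate = 9/250)
```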
## Coin Flips Example

What is the probability that there are fewer than 35 heads in 100 flips of a fair coin?

```{r}
# Simulation approach 1
nruns = 10000
n = 100
tibble(run = rep(1:nruns, each = n), 
       x = sample(0:1, n*nruns, replace = TRUE)) |> 
  group_by(run) |> 
  summarise(heads = sum(x)) |> 
  summarise(prob = sum(heads < 35) / nruns)
```

```{r}
# Simulation approach 2
nruns = 1000000
tibble(x = rbinom(nruns, 100, 0.5)) |> 
  summarise(prob = sum(x < 35) / nruns)
```

```{r}
# Exact answer
pbinom(34, 100, 0.5)
```

```{r}
# CLT approximation with continuity correction
pnorm(34.5, 100*0.5, sqrt(100*0.5*(1-0.5)))
```

## Gamma Population Sampling Distributions

Suppose the population attribute $X$ has a gamma distribution with shape parameter $\alpha = 1$ and scale parameter $\beta = 1$.

1.  Calculate the mean and standard deviation of $X$. Display the pdf of $X$ and a normal distribution with the same mean and standard deviation as $X$.

```{r}
alpha = 1
beta = 1
mu = alpha*beta; mu
sigma = sqrt(alpha)*beta; sigma
```

```{r}
ggplot() +
  geom_function(fun = dgamma, args = list(shape = alpha, scale = beta), 
                xlim = c(0, 7), color = "blue") +
  geom_function(fun = dnorm, args = list(mean = mu, sd = sigma), 
                xlim = c(0, 7), color = "red")
```

2.  For samples of size $n = 8$, simulate sampling distributions for the sample mean, sample variance, and method of moments estimators.

```{r}
nruns = 10000
n = 8
set.seed(19570830)
samples = tibble(
  run = rep(1:nruns, each = n),
  x = rgamma(n*nruns, shape = alpha, scale = beta)
)
statistics = samples |> 
  group_by(run) |> 
  summarise(
    xbar = mean(x), 
    s2 = var(x), 
    v = (n-1)/n*s2, 
    alpha.hat = xbar^2 / v, 
    beta.hat = v / xbar
  )
```

3.  For the statistic $\bar{X}$, calculate the theoretical mean and variance and the mean, standard error (standard deviation divided by the square root of the number of simulation runs), and variance of the simulated values. Display the sampling distribution as a density plot with what should be a matching gamma pdf and what could be a matching normal pdf if the sample size is sufficiently large. Compare the simulation with the theoretical results.

```{r}
xbar.dist = statistics |> 
  summarize(
    tmean = mu, 
    mean = mean(xbar), 
    se = sd(xbar)/sqrt(nruns), 
    tvar = sigma^2 / n, 
    var = var(xbar)
  )
xbar.dist
ggplot(statistics, aes(x = xbar)) +
  geom_density() +
  geom_function(fun = dgamma, args = list(shape = n*alpha, scale = beta/n), 
                color = "blue") +
  geom_function(fun = dnorm, args = list(mean = mu, sd = sigma/sqrt(n)), 
                color = "red")
```

The sampling distribution of $\hat{\mu} = \bar{X}$ is the gamma distribution specified by result 7 and is different from, but close to, a normal distribution (despite $n$ being only 8 and the population attribute being far from normal). That the statistic $\bar{X}$ is an unbiased estimator for $\mu$ is confirmed here since the approximated value of $E[\bar{X}]$ is within one standard error of $\mu$. The approximated value of $\text{Var}(\bar{X})$ is also close to the theoretical value.
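One quick way to quantify "different from, but close to" is to compare the gamma cdf and its normal approximation at a few arbitrarily chosen points.

```{r}
# Gamma vs normal cdf values for the sampling distribution of xbar (n = 8)
q = c(0.5, 0.75, 1, 1.25, 1.5)
rbind(gamma = pgamma(q, shape = n*alpha, scale = beta/n),
      normal = pnorm(q, mean = mu, sd = sigma/sqrt(n)))
```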
4.  For the statistic $S^2$, calculate the theoretical mean and variance and the mean, standard error, and variance of the simulated values. Display the sampling distribution as a density plot with what could be a matching normal pdf if the sample size is sufficiently large. Compare the simulation with the theoretical results.

```{r}
s2.dist = statistics |> 
  summarize(
    tmean = sigma^2, 
    mean = mean(s2), 
    se = sd(s2) / sqrt(nruns), 
    tvar = 3*alpha*(alpha+2)*beta^4/n - (alpha*beta^2)^2*(n-3)/n/(n-1), 
    var = var(s2)
  )
s2.dist
ggplot(statistics, aes(x = s2)) +
  geom_density() +
  geom_function(fun = dnorm, 
                args = list(mean = s2.dist$tmean, sd = sqrt(s2.dist$tvar)), 
                color = "red")
```

The sampling distribution of $\hat{\sigma^2} = S^2$ was not specified theoretically, and it is definitely different from a normal distribution. That the statistic $S^2$ is an unbiased estimator for $\sigma^2$ is confirmed here since the approximated value of $E[S^2]$ is within one standard error of $\sigma^2$. The approximated value of $\text{Var}(S^2)$ is also close to the theoretical value.

5.  For the statistic $\hat{\beta}$, calculate the mean, standard error, and standard deviation of the simulated values. Display the sampling distribution as a density plot with a normal pdf with the same mean and standard deviation as the simulated data. Compare the simulation with theoretical results.

```{r}
beta.hat.dist = statistics |> 
  summarize(
    tmean = beta, 
    mean = mean(beta.hat), 
    se = sd(beta.hat) / sqrt(nruns), 
    sd = sd(beta.hat)
  )
beta.hat.dist
ggplot(statistics, aes(x = beta.hat)) +
  geom_density() +
  geom_function(fun = dnorm, 
                args = list(mean = beta.hat.dist$mean, sd = beta.hat.dist$sd), 
                color = "red")
```

The statistic $\hat{\beta}$ is definitely biased as an estimator for $\beta$ (its simulated mean differs from $\beta$ by many standard errors), and its distribution, though unimodal, is skewed right and clearly different from a normal distribution.

6.  For the statistic $\hat{\alpha}$, calculate the mean, standard error, and standard deviation of the simulated values. Display the sampling distribution as a density plot with a normal pdf with the same mean and standard deviation as the simulated data. Compare the simulation with theoretical results.

```{r}
alpha.hat.dist = statistics |> 
  summarize(
    tmean = alpha, 
    mean = mean(alpha.hat), 
    se = sd(alpha.hat) / sqrt(nruns), 
    sd = sd(alpha.hat)
  )
alpha.hat.dist
ggplot(statistics, aes(x = alpha.hat)) +
  geom_density() +
  geom_function(fun = dnorm, 
                args = list(mean = alpha.hat.dist$mean, sd = alpha.hat.dist$sd), 
                color = "red")
```

The statistic $\hat{\alpha}$ is definitely biased as an estimator for $\alpha$ (its simulated mean differs from $\alpha$ by many standard errors), and its distribution, though unimodal, is skewed right and clearly different from a normal distribution.

## Proving the Theoretical Results

During our discussion of parameter estimation, we proved the general expressions for $E[\bar{X}]$, $\text{Var}[\bar{X}]$, and $E[S^2]$. Our goal in this section is to examine result 5 in the Concepts section: If each $X_i = \text{Normal}(\mu, \sigma)$, then $\bar{X} = \text{Normal}(\mu, \sigma/\sqrt{n})$ and $S^2 = \text{Gamma}\left(\dfrac{n-1}{2}, \dfrac{2\sigma^2}{n-1}\right)$. We will prove the first half and provide numerical evidence to support the second half.

1.  If $a$ and $b$ are numbers with $a > 0$, and $X = \text{Normal}(0, \sigma)$, then $aX + b = \text{Normal}(b, a\sigma)$.

Proof. The cdf of $W = aX + b$ is

$\begin{array}{lcl} H(w) &=& P(W \leq w) \\ &=& P(aX + b \leq w) \\ &=& P(X \leq (w-b)/a) \\ &=& \displaystyle\int_{-\infty}^{(w-b)/a} \dfrac{1}{\sqrt{2\pi}\sigma}e^{-x^2/(2\sigma^2)} \,dx \end{array}$

and, by the fundamental theorem of calculus and the chain rule,

$\begin{array}{lcl} h(w) &=& H'(w) \\ &=& \dfrac{d}{dw}\displaystyle\int_{-\infty}^{(w-b)/a} \dfrac{1}{\sqrt{2\pi}\sigma}e^{-x^2/(2\sigma^2)} \,dx \\ &=& \dfrac{1}{\sqrt{2\pi}\sigma}e^{-((w-b)/a)^2/(2\sigma^2)} \cdot \dfrac{1}{a}\\ &=& \dfrac{1}{\sqrt{2\pi}(a\sigma)}e^{-(w-b)^2/(2(a\sigma)^2)} \end{array}$

which is the desired pdf.
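Result 1 can also be checked numerically. Here is a minimal sketch with arbitrarily chosen values $a = 2$, $b = 5$, and $\sigma = 3$.

```{r}
# Simulated mean and sd of a*X + b should be near b = 5 and a*sigma = 6
a = 2; b = 5
w = a*rnorm(100000, mean = 0, sd = 3) + b
c(mean = mean(w), sd = sd(w))
```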
2.  If $X$ and $Y$ are independent continuous random variables with pdfs $f$ and $g$, then $W = X + Y$ is a continuous random variable with pdf $h(w) = \int_{-\infty}^{\infty} f(x)g(w-x) \,dx$.

Proof. Suppose $H$ is the cdf for $W$. Then

$\begin{array}{lcl} H(w) &=& P(W \leq w) \\ &=& P(X + Y \leq w) \\ &=& \displaystyle\int_{-\infty}^{\infty} \int_{-\infty}^{w-x} f(x)g(y) \,dy\,dx \\ &=& \displaystyle\int_{-\infty}^{\infty} f(x) \left(\int_{-\infty}^{w-x} g(y) \,dy\right)dx \end{array}$

and

$\begin{array}{lcl} h(w) &=& H'(w) \\ &=& \displaystyle\int_{-\infty}^{\infty} f(x) \left(\dfrac{d}{dw}\int_{-\infty}^{w-x} g(y) \,dy\right)dx \\ &=& \displaystyle\int_{-\infty}^{\infty} f(x)g(w-x) \,dx. \end{array}$

3.  If $X = \text{Normal}(0, \sigma)$ and $Y = \text{Normal}(0, \tau)$ are independent, then $X + Y = \text{Normal}(0, \sqrt{\sigma^2 + \tau^2})$.

Proof. Note first that the mean and standard deviation of $W = X + Y$ are correct based on earlier results. By result 2, the pdf of $W$ is

$\begin{array}{lcl} h(w) &=& \displaystyle\int_{-\infty}^{\infty} \dfrac{1}{\sqrt{2\pi}\sigma}e^{-x^2/(2\sigma^2)} \dfrac{1}{\sqrt{2\pi}\tau}e^{-(w-x)^2/(2\tau^2)} \,dx \\ &=& \dfrac{1}{2\pi\sigma\tau}\displaystyle\int_{-\infty}^{\infty} e^{-(\tau^2 x^2 + \sigma^2(w-x)^2)/(2\sigma^2\tau^2)} \,dx \\ &=& \dfrac{1}{2\pi\sigma\tau}\displaystyle\int_{-\infty}^{\infty} e^{-((\sigma^2 + \tau^2)x^2 - 2\sigma^2wx + \sigma^2 w^2)/(2\sigma^2\tau^2)} \,dx \end{array}$

Completing the square on the quadratic expression, we obtain

$\begin{array}{lcl} (\sigma^2 + \tau^2)x^2 - 2\sigma^2wx + \sigma^2 w^2 &=& (\sigma^2 + \tau^2)\left(x^2 - 2\dfrac{\sigma^2w}{\sigma^2 + \tau^2}x + \left(\dfrac{\sigma^2w}{\sigma^2 + \tau^2}\right)^2 - \left(\dfrac{\sigma^2w}{\sigma^2 + \tau^2}\right)^2 + \dfrac{\sigma^2 w^2}{\sigma^2 + \tau^2}\right) \\ &=& (\sigma^2 + \tau^2)\left(\left(x - \dfrac{\sigma^2w}{\sigma^2 + \tau^2}\right)^2 + \dfrac{\sigma^2 \tau^2 w^2}{(\sigma^2 + \tau^2)^2}\right) \end{array}$

Plugging back into our expression for the pdf, we obtain

$\begin{array}{lcl} h(w) &=& \dfrac{1}{2\pi\sigma\tau}e^{-w^2/(2(\sigma^2 + \tau^2))}\displaystyle\int_{-\infty}^{\infty} e^{-\left(x - \frac{\sigma^2w}{\sigma^2 + \tau^2}\right)^2/\left(\frac{2\sigma^2\tau^2}{\sigma^2 + \tau^2}\right)} \,dx \\ &=& \dfrac{1}{2\pi\sigma\tau}e^{-w^2/(2(\sigma^2 + \tau^2))}\dfrac{\sqrt{2\pi}\sigma\tau}{\sqrt{\sigma^2+\tau^2}} \\ &=& \dfrac{1}{\sqrt{2\pi}\sqrt{\sigma^2+\tau^2}}e^{-w^2/(2(\sqrt{\sigma^2 + \tau^2})^2)} \end{array}$

which is the desired pdf.

4.  If $X_i = \text{Normal}(\mu_i, \sigma_i)$ for $i = 1, 2, \ldots, n$ are independent, then $\displaystyle\sum_{i=1}^{n} X_i = \text{Normal}\left(\sum_{i=1}^{n} \mu_i, \sqrt{\sum_{i=1}^{n}\sigma_i^2}\right)$.

Proof. Let $W = \displaystyle\sum_{i=1}^{n} X_i = \sum_{i=1}^{n} (X_i - \mu_i) + \sum_{i=1}^{n} \mu_i$. By result 1, $X_i - \mu_i = \text{Normal}(0, \sigma_i)$ for $i = 1, 2, \ldots, n$. By repeated use of result 3, $\displaystyle\sum_{i=1}^{n} (X_i - \mu_i) = \text{Normal}\left(0, \sqrt{\sum_{i=1}^{n}\sigma_i^2}\right)$. By result 1 (with $a = 1$ and $b = \sum_{i=1}^{n} \mu_i$), $W$ has the desired distribution.

5.  If $X_i = \text{Normal}(\mu, \sigma)$ for $i = 1, 2, \ldots, n$ are independent, then $\bar{X} = \text{Normal}(\mu, \sigma/\sqrt{n})$.

Proof. By result 4, $\displaystyle\sum_{i=1}^{n} X_i = \text{Normal}\left(n\mu, \sqrt{n\sigma^2}\right)$. By result 1 (scaling by $a = 1/n$), we obtain the desired distribution, since $\frac{1}{n}\sqrt{n\sigma^2} = \sigma/\sqrt{n}$.
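Here is a quick simulation check of this result, with arbitrarily chosen values $n = 4$, $\mu = 10$, and $\sigma = 2$.

```{r}
# Means of normal samples: near mu = 10 and sigma/sqrt(n) = 2/sqrt(4) = 1
xbar = replicate(10000, mean(rnorm(4, mean = 10, sd = 2)))
c(mean = mean(xbar), sd = sd(xbar))
```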
6.  Choose a positive integer $n$, a real number $\mu$, and a positive real number $\sigma$. For samples of size $n$ from a population attribute $X = \text{Normal}(\mu, \sigma)$, simulate the sampling distribution of the sample variance. Compare the mean, standard deviation, pdf, and cdf from the simulation with the theoretical $S^2 = \text{Gamma}\left(\dfrac{n-1}{2}, \dfrac{2\sigma^2}{n-1}\right)$.

```{r}
nruns = 10000
n = 7; mu = 29; sigma = 3
set.seed(19570830)
samples = tibble(
  run = rep(1:nruns, each = n),
  x = rnorm(n*nruns, mean = mu, sd = sigma)
)
statistics = samples |> 
  group_by(run) |> 
  summarise(s2 = var(x))
statistics |> 
  summarise(
    smean = mean(s2), 
    se.smean = sd(s2)/sqrt(nruns), 
    tmean = sigma^2, 
    ssd = sd(s2), 
    tsd = sqrt(3*sigma^4/n - sigma^4*(n-3)/n/(n-1))
  )
```

```{r}
ggplot(statistics, aes(x = s2)) +
  geom_density() +
  geom_function(fun = dgamma, 
                args = list(shape = (n-1)/2, scale = 2*sigma^2/(n-1)), 
                color = "red")
```

```{r}
ggplot(statistics, aes(x = s2)) +
  stat_ecdf() +
  geom_function(fun = pgamma, 
                args = list(shape = (n-1)/2, scale = 2*sigma^2/(n-1)), 
                color = "red")
```
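A Kolmogorov-Smirnov test provides one more piece of numerical evidence: it compares the simulated $S^2$ values with the claimed gamma distribution, and a large p-value is consistent with the theoretical result.

```{r}
# KS test of simulated s2 values against Gamma((n-1)/2, 2*sigma^2/(n-1))
ks.test(statistics$s2, "pgamma", shape = (n-1)/2, scale = 2*sigma^2/(n-1))
```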