--- title: "C31-32 Parameter Estimation" author: "David Housman" format: docx editor: source fig-width: 6 fig-height: 2.4 --- ```{r} #| message: false #| warning: false #| echo: false library(tidyverse) ``` ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Concepts 1. After obtaining data $x_1, x_2, \ldots, x_n$ from a sample of a population, we summarize with a statistic $\hat{\theta}(x_1,x_2,\ldots,x_n)$ such as the mean, median, standard deviation, or interquartile range. This is part of descriptive statistics. 2. We hope that the statistic $\hat{\theta}$ tells us something about a population parameter $\theta$ such as $\lambda$ in the Poisson random variable, $\alpha$ and $\beta$ in the gamma random variable, or $p$ in the geometric random variable. In this context, the statistic $\hat{\theta}$ is called an *estimator* for the population parameter $\theta$. This is part of inferential statistics. 3. Right now, we focus on probability. We start with a known population, modeled by a random variable $X$ governed by a parameter $\theta$; take a sample, modeled by mutually independent random variables $X_1, X_2, \ldots, X_n$ each having the same distribution as $X$; and study the *statistic* $\qquad\hat{\theta}(X_1, X_2, \ldots, X_n)$ and its distribution (usually called a *sampling distribution*). Observe our use of capital letters when referring to random variables and lower case letters with referring to data. 4. We hope that our statistics are unbiased (a statistic's expected value $E\left[\hat{\theta}(X_1,X_2,\ldots,X_n)\right]$ equals the parameter $\theta$ being estimated) and are precise (a statistic's variance $\text{Var}\left[\hat{\theta}(X_1,X_2,\ldots,X_n)\right]$ is as small as possible). 5. Recall that if $x$ and $Y$ are random variables and $a$ and $b$ are real numbers, then $E[aX + bY] = aE[X] + bE[Y]$ and $\text{Var}[aX + bY] = a^2\text{Var}[X] + b^2\text{Var}[Y] + 2ab\text{Cov}(X,Y)$. If $X$ and $Y$ are also independent, then $\text{Var}[aX + bY] = a^2\text{Var}[X] + b^2\text{Var}[Y]$. 6. Suppose `x = c(1,2,3)`. Then - `rep(x, times = 4)` returns `c(1,2,3,1,2,3,1,2,3,1,2,3)`, and - `rep(x, each = 4)` returns `c(1,1,1,1,2,2,2,2,3,3,3,3)`. ## Poisson Population Example In an experiment conducted by Rutherford, Chadwick, and Ellis, a radioactive substance was observed during 2,608 time intervals of 7.5 seconds each; the number of particles reaching a counter was obtained for each period. The sample mean and variance were 3.87 and 3.68 particles, respectively, and we wondered whether this data could have arisen from a Poisson distribution with mean around 3.87. This is an inferential statistics question. Let us answer a related probability question: For a population modeled by a Poisson distribution with $\lambda = 3.87$, what would samples of size 2,608 look like? 1. Generate a sample of size 2,608 from a Poisson random variable with parameter $\lambda = 3.87$. Describe the sample with a histogram, mean, and variance. Repeat a few times to get some intuition about the amount of variation in the sample histograms, means, and variances. ```{r} n = 2608 lambda = 3.87 data = tibble( x = rpois(n, lambda) ) ggplot(data, aes(x = x)) + geom_histogram(center = 0, binwidth = 1, color = "white") data |> summarise(xbar = mean(x), s = sd(x)) ``` 1. Simulate 10,000 samples of size 2,608 from a Poisson distribution with $\lambda = 3.87$. Save in a tibble named `samples` with columns `run` and `x`. ```{r} n = 2608 lambda = 3.87 samples = tibble( run = rep(1:10000, each = n), x = rpois(n * 10000, lambda) ) ``` ## Sample Mean Statistic $\bar{X}$ Suppose $X_1, X_2, \ldots, X_n$ are independent identically distributed random variables each having mean $\mu$ and standard deviation $\sigma$. Let the sample mean statistic $$\bar{X} = \dfrac{1}{n}(X_1 + X_2 + \cdots + X_n)$$ be an estimator for the population mean $\mu$. 3. Calculate the mean for each sample in `samples`. Display a histogram of the sample means. Calculate the mean of the sample means, the standard deviation of the sample means, and the interval within which 95% of the sample means lie. Write a sentence summarizing the results. How might this provide an inference from the original data? ```{r} stats = samples |> group_by(run) |> summarise(xbar = mean(x)) ``` ```{r} ggplot(stats, aes(x = xbar)) + geom_histogram() ``` ```{r} stats |> summarise(muhat = mean(xbar), shat = sd(xbar), lo = quantile(xbar,0.025), hi = quantile(xbar, 0.975)) ``` The mean of the sample means is roughly equal to the population mean indicating that the sample mean as an estimator is unbiased. The standard deviation of the sample means is roughly 0.039, but this will only be useful in a comparison with a different estimator. About 95% of the sample means are between 3.796 and 3.946, or 95% of the sample means are $3.87 \pm 0.077$ With 95% confidence, the population mean is \$3.87 \\pm 0.077\$ Deterrmine the mean, variance, standard deviation, and distribution of the random variable $\bar{X}$ in general. Apply these results when each $X_i$ is $\text{Poisson}(3.87)$. How is this related to the previous exercise? 3. What do the results from the previous exercise tell us about the use of the statistic $\bar{X}$ to estimate the parameter $\mu$ of a population? ## Sample Variance Statistic $S^2$ Suppose $X_1, X_2, \ldots, X_n$ are independent identically distributed random variables each having mean $\mu$ and standard deviation $\sigma$. Let the sample variance statistic $$S^2 = \dfrac{1}{n-1}((X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots (X_n-\bar{X})^2)$$ be an estimate for the population variance $\sigma^2$. 6. Calculate the variance for each sample in `samples`. Display a histogram of the sample variances. Calculate the mean of the sample variances, the standard deviation of the sample variances, and the interval within which 95% of the sample variances lie. Write a sentence summarizing the results. How might this provide an inference from the original data? ```{r} stats = samples |> group_by(run) |> summarise(xbar = mean(x), s2 = var(x)) ``` ```{r} stats |> summarise(muhat = mean(s2), shat = sd(s2), lo = quantile(s2,0.025), hi = quantile(s2, 0.975)) ``` 6. Determine the mean, variance, standard deviation, and distribution of the random variable $S^2$. Apply these results when each $X_i$ is $\text{Poisson}(3.87)$. How is this related to the previous exercise? 7. What do the results from the previous exercise tell us about the use of the statistic $S^2$ to estimate the parameter $\sigma^2$ of a population? 8. What do the results from exercises 4 and 7 tell us about the use of the statistics $\bar{X}$ and $S^2$ to estimate the parameter $\lambda$ of a Poisson population? ## Finding Estimators Two approaches to finding a statistic to estimate a parameter are the method of moments and maximizing likelihood. We illustrate these approaches via some examples. 10. Find the method of moments estimators for the mean $\mu$ and variance $\sigma^2$ of a normal population. 11. Find the method of moments estimator for the mean $\lambda$ of a Poisson population. 12. Find the maximum likelihood estimator for the mean $\lambda$ of a Poisson population. ## Gamma Population Suppose we have a gamma population with unknown parameters $\alpha$ and $\beta$. We obtain a sample $x_1, x_2, \ldots, x_n$ and want to estimate the population parameters. 13. Find the method of moments estimators. 14. Find the maximum likelihood estimators. 15. Suppose the data given below came from a gamma population. Find the method of moments estimators and maximum likelihood estimators. Compare the empirical cdf and the two theoretical cdfs in a single plot. ```{r} actual = tibble(x = c(49.2,53.9,50.0,44.5,42.2,42.3,32.3,31.3,60.9,47.5)) ``` ```{r} #| echo: false #| eval: false moments = actual |> summarize( n = n(), xbar = mean(x), v = (n-1)/n * var(x), alpha = xbar^2 / v, beta = v / xbar ) moments ``` ```{r} #| echo: false #| eval: false likelihood = actual |> summarize( n = n(), xbar = mean(x), v = (n-1)/n * var(x), alpha = uniroot(function(a) -n*digamma(a) - n*log(xbar/a) + sum(log(x)), c(10, 40))$root, beta = xbar / alpha ) likelihood ``` ```{r} #| echo: false #| eval: false ggplot(actual, aes(x = x)) + stat_ecdf() + geom_function(fun = pgamma, args = list(shape = moments$alpha, scale = moments$beta), xlim = c(29, 63), color = "red") + geom_function(fun = pgamma, args = list(shape = likelihood$alpha, scale = likelihood$beta), xlim = c(29, 63), color = "blue") ```