--- title: "C31 Parameter Estimation" author: "David Housman" format: docx editor: visual fig-width: 6 fig-height: 2.4 --- ```{r} #| message: false #| warning: false #| echo: false library(tidyverse) ``` ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Concepts 1. After obtaining data $x_1, x_2, \ldots, x_n$ from a sample of a population, we summarize with a statistic $\hat{\theta}(x_1,x_2,\ldots,x_n)$ such as the mean, median, standard deviation, or interquartile range. This is part of descriptive statistics. 2. We hope that the statistic $\hat{\theta}$ tells us something about a population parameter $\theta$ such as $\lambda$ in the Poisson random variable, $\alpha$ and $\beta$ in the gamma random variable, or $p$ in the geometric random variable. In this context, the statistic $\hat{\theta}$ is called an *estimator* for the population parameter $\theta$. This is part of inferential statistics. 3. Right now, we focus on probability. We start with a known population, modeled by a random variable $X$ governed by a parameter $\theta$; take a sample, modeled by mutually independent random variables $X_1, X_2, \ldots, X_n$ each having the same distribution as $X$; and study the *statistic* $\qquad\hat{\theta}(X_1, X_2, \ldots, X_n)$ and its distribution (usually called a *sampling distribution*). Observe our use of capital letters when referring to random variables and lower case letters with referring to data. 4. We hope that our statistics are unbiased (a statistic's expected value $E\left[\hat{\theta}(X_1,X_2,\ldots,X_n)\right]$ equals the parameter $\theta$ being estimated) and are precise (a statistic's variance $\text{Var}\left[\hat{\theta}(X_1,X_2,\ldots,X_n)\right]$ is as small as possible). 5. Recall that if $x$ and $Y$ are random variables and $a$ and $b$ are real numbers, then $E[aX + bY] = aE[X] + bE[Y]$ and $\text{Var}[aX + bY] = a^2\text{Var}[X] + b^2\text{Var}[Y] + 2ab\text{Cov}(X,Y)$. If $X$ and $Y$ are also independent, then $\text{Var}[aX + bY] = a^2\text{Var}[X] + b^2\text{Var}[Y]$. 6. Suppose `x = c(1,2,3)`. Then `rep(x, times = 4)` returns `c(1,2,3,1,2,3,1,2,3,1,2,3)`, and `rep(x, each = 4)` returns `c(1,1,1,1,2,2,2,2,3,3,3,3)`. ## Poisson Population Example In an experiment conducted by Rutherford, Chadwick, and Ellis, a radioactive substance was observed during 2,608 time intervals of 7.5 seconds each; the number of particles reaching a counter was obtained for each period. The sample mean and variance were 3.87 and 3.68 particles, respectively, and we wondered whether this data could have arisen from a Poisson distribution with mean around 3.87. This is an inferential statistics question. Let us answer a related probability question: For a population modeled by a Poisson distribution with $\lambda = 3.87$, what would samples of size 2,608 look like? 1. Generate a sample of size 2,608 from a Poisson random variable with parameter $\lambda = 3.87$. Describe the sample with a histogram, mean, and variance. Repeat a few times to get some intuition about the amount of variation in the sample histograms, means, and variances. ```{r} n = 2608 lambda = 3.87 sample = tibble(x = rpois(n, lambda)) ggplot(sample, aes(x = x)) + geom_histogram(binwidth = 1, center = 0, color = "white") sample |> summarise(mean(x), var(x)) ``` 2. Simulate 10,000 samples of size 2,608 from a Poisson distribution with $\lambda = 3.87$. Save in a tibble named `samples` with columns `run` and `x`. 
```{r}
nruns = 10000
n = 2608
lambda = 3.87
samples = tibble(
  run = rep(1:nruns, each = n),
  x = rpois(nruns*n, lambda))
```

## Sample Mean Statistic $\bar{X}$

Suppose $X_1, X_2, \ldots, X_n$ are independent identically distributed random variables each having mean $\mu$ and standard deviation $\sigma$. Let the sample mean statistic

$$\bar{X} = \dfrac{1}{n}(X_1 + X_2 + \cdots + X_n)$$

be an estimator for the population mean $\mu$.

3. Calculate the mean for each sample in `samples`. Display a histogram of the sample means. Calculate the mean of the sample means, the standard deviation of the sample means, and the interval within which 95% of the sample means lie. Write a sentence summarizing the results. How might this provide an inference from the original data?

```{r}
means = samples |>
  group_by(run) |>
  summarise(xbar = mean(x))
```

```{r}
ggplot(means, aes(x = xbar)) + geom_histogram()
```

```{r}
means |> summarise(
  mean = mean(xbar),
  sd = sd(xbar),
  lo = quantile(xbar, 0.025),
  hi = quantile(xbar, 0.975)
)
```

The distribution of sample means looks bell-shaped with a mean of 3.87, a standard deviation of 0.04, and a 95% coverage interval of 3.79 to 3.95.

4. Determine the mean, variance, standard deviation, and distribution of the random variable $\bar{X}$ in general. Apply these results when each $X_i$ is $\text{Poisson}(3.87)$. How is this related to the previous exercise?

The mean of $\bar{X}$ is

$$\begin{array}{lcl} \mu_{\bar{X}} &=& E\left[\bar{X}\right] \\ &=& E\left[\dfrac{1}{n}(X_1 + X_2 + \cdots + X_n)\right] \\ &=& \dfrac{1}{n}(E[X_1]+E[X_2]+\cdots + E[X_n]) \\ &=& \dfrac{1}{n}\left(\mu+\mu+\cdots+\mu\right) \\ &=& \mu. \end{array}$$

The variance of $\bar{X}$ (using independence) is

$$\begin{array}{lcl} \text{Var}[\bar{X}] &=& \text{Var}\left[\dfrac{1}{n}(X_1 + X_2 + \cdots + X_n)\right] \\ &=& \dfrac{1}{n^2}(\text{Var}[X_1]+\text{Var}[X_2]+\cdots + \text{Var}[X_n]) \\ &=& \dfrac{1}{n^2}\left(\sigma^2+\sigma^2+\cdots+\sigma^2\right) \\ &=& \dfrac{1}{n}\sigma^2. \end{array}$$

The standard deviation is

$\qquad\sigma_{\bar{X}} = \sqrt{\text{Var}[\bar{X}]} = \sqrt{\dfrac{1}{n}\sigma^2} = \dfrac{\sigma}{\sqrt{n}}.$

The Central Limit Theorem tells us that the distribution of $\bar{X}$ is approximately normal for sufficiently large $n$. For $X_i \sim \text{Poisson}(3.87)$ and $n = 2608$, $\mu_{\bar{X}} = 3.87$ and $\sigma_{\bar{X}} = \sqrt{3.87/2608} \approx 0.0385$, so the central 95% of the sample means should lie within $3.87 \pm 0.08$ particles per period. This matches what we found in the previous exercise because that was a simulation of $\bar{X}$. In the output below, the `s` values come from the simulation and the `t` values from the theory above.

```{r}
means |> summarise(
  smean = mean(xbar),
  tmean = lambda,
  ssd = sd(xbar),
  tsd = sqrt(lambda/n)
)
```

```{r}
ggplot(means, aes(x = xbar)) +
  geom_histogram(aes(y = after_stat(density))) +
  geom_function(fun = dnorm, args = list(mean = lambda, sd = sqrt(lambda/n)),
                color = "red", linewidth = 2)
```

5. What do the results from the previous exercise tell us about the use of the statistic $\bar{X}$ to estimate the parameter $\mu$ of a population?

## Sample Variance Statistic $S^2$

Suppose $X_1, X_2, \ldots, X_n$ are independent identically distributed random variables each having mean $\mu$ and standard deviation $\sigma$. Let the sample variance statistic

$$S^2 = \dfrac{1}{n-1}\left((X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2\right)$$

be an estimator for the population variance $\sigma^2$.

6. Calculate the variance for each sample in `samples`. Display a histogram of the sample variances.
Calculate the mean of the sample variances, the standard deviation of the sample variances, and the interval within which 95% of the sample variances lie. Write a sentence summarizing the results. How might this provide an inference from the original data?

```{r}
vars = samples |>
  group_by(run) |>
  summarise(s2 = var(x))
```

```{r}
ggplot(vars, aes(x = s2)) + geom_histogram()
```

```{r}
vars |> summarise(
  mean = mean(s2),
  sd = sd(s2),
  lo = quantile(s2, 0.025),
  hi = quantile(s2, 0.975)
)
```

7. Determine the mean, variance, standard deviation, and distribution of the random variable $S^2$. Apply these results when each $X_i$ is $\text{Poisson}(3.87)$. How is this related to the previous exercise?

Consider

$$\begin{array}{lcl} E[(n-1)S^2] &=& E\left[\displaystyle\sum_{i=1}^{n} (X_i - \bar{X})^2 \right] \\ &=& E\left[\displaystyle\sum_{i=1}^{n} ((X_i - \mu) - (\bar{X} - \mu))^2 \right] \\ &=& E\left[\displaystyle\sum_{i=1}^{n} ((X_i - \mu)^2 - 2(X_i - \mu)(\bar{X} - \mu) + (\bar{X} - \mu)^2) \right] \\ &=& E\left[\displaystyle\sum_{i=1}^{n} ((X_i - \mu)^2 - 2(X_i - \mu)\dfrac{1}{n}\sum_{j=1}^{n} (X_j - \mu) + (\bar{X} - \mu)^2) \right] \\ &=& \displaystyle\sum_{i=1}^{n} E\left[(X_i - \mu)^2\right] - \dfrac{2}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}E[(X_i - \mu)(X_j - \mu)] + \sum_{i=1}^{n} E\left[(\bar{X} - \mu)^2\right] \\ &=& n\sigma^2 - \dfrac{2}{n}n\sigma^2 + n\dfrac{\sigma^2}{n} \\ &=& (n-1)\sigma^2 \end{array}$$

(The double sum reduces to $n\sigma^2$ because $E[(X_i - \mu)(X_j - \mu)] = 0$ for $i \neq j$ by independence and equals $\sigma^2$ for $i = j$, and $E[(\bar{X} - \mu)^2] = \text{Var}[\bar{X}] = \sigma^2/n$ from exercise 4.) Thus, the mean of $S^2$ is

$\qquad E[S^2] = \sigma^2$.

It can be shown (https://math.stackexchange.com/questions/72975/variance-of-sample-variance) that

$\qquad \text{Var}(S^2) = \dfrac{\mu_4}{n} - \dfrac{\sigma^4(n-3)}{n(n-1)}$

where $\mu_4 = E[(X-\mu)^4]$ is the fourth central moment of $X$. A corollary to the Central Limit Theorem is that the distribution of $S^2$ is approximately normal for large sample sizes (https://math.stackexchange.com/questions/2325500/central-limit-theorem-for-the-variance#mjx-eqn-2).

For $X_i \sim \text{Poisson}(3.87)$ (the `s` values arise from our simulation and the `t` values arise from the above theory):

```{r}
# for a Poisson(lambda) population, mu4 = 3*lambda^2 + lambda and sigma^4 = lambda^2
vars |> summarise(
  smean = mean(s2),
  tmean = lambda,
  ssd = sd(s2),
  tsd = sqrt((3*lambda^2+lambda)/n - lambda^2*(n-3)/n/(n-1))
)
```

The histogram shows the simulated values of $S^2$, and the red curve is the theoretical (approximately normal) pdf for $S^2$:

```{r}
ggplot(vars, aes(x = s2)) +
  geom_histogram(aes(y = after_stat(density))) +
  geom_function(fun = dnorm,
                args = list(mean = lambda,
                            sd = sqrt((3*lambda^2+lambda)/n - lambda^2*(n-3)/n/(n-1))),
                color = "red", linewidth = 2)
```

8. What do the results from the previous exercise tell us about the use of the statistic $S^2$ to estimate the parameter $\sigma^2$ of a population?

9. What do the results from exercises 4 and 7 tell us about the use of the statistics $\bar{X}$ and $S^2$ to estimate the parameter $\lambda$ of a Poisson population?

## Finding Estimators

Two approaches to finding a statistic to estimate a parameter are the method of moments and maximum likelihood. We illustrate these approaches via some examples.

10. Find the method of moments estimators for the mean $\mu$ and variance $\sigma^2$ of a normal population.

The moments of a random variable $X$ are $\mu_1 = E[X]$ and the central moments $\mu_k = E[(X-\mu_1)^k]$ for integers $k \geq 2$. If $X$ randomly selects an element from a list of data $x_1, x_2, \ldots, x_n$, then $\mu_1 = \dfrac{1}{n}\displaystyle\sum_{i=1}^n x_i = \bar{x}$ and $\mu_2 = \dfrac{1}{n}\displaystyle\sum_{i=1}^n (x_i - \bar{x})^2 = \dfrac{n-1}{n}s^2 = v$.
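
As a quick numerical check of the identity $\mu_2 = \dfrac{n-1}{n}s^2$ (an illustration only; the vector `y` below is made-up data), recall that R's `var()` uses the $\dfrac{1}{n-1}$ definition:

```{r}
# the second central moment of a data list equals (n-1)/n times R's var()
y = c(1, 4, 6, 10)
c(mean((y - mean(y))^2), (length(y) - 1) / length(y) * var(y))
```
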
If $X$ is a normal random variable with mean $\mu$ and standard deviation $\sigma$, then $\mu_1 = \mu$ and $\mu_2 = \sigma^2$. Matching the first two moments yields $\hat{\mu} = \dfrac{1}{n}\displaystyle\sum_{i=1}^n x_i$ and $\widehat{\sigma^2} = \dfrac{1}{n}\displaystyle\sum_{i=1}^n (x_i - \bar{x})^2$. Note that $\hat{\mu}$ is the sample mean, but $\widehat{\sigma^2}$ is $\dfrac{n-1}{n}$ times the sample variance and so is a biased estimator of $\sigma^2$. Note also that neither $S$ nor $\sqrt{V}$ (where $V = \dfrac{n-1}{n}S^2$) is an unbiased estimator for $\sigma$.

11. Find the method of moments estimator for the mean $\lambda$ of a Poisson population.

If $X$ is a Poisson random variable with mean $\lambda$, then $\mu_1 = \lambda$. Matching the first moment yields $\hat{\lambda} = \dfrac{1}{n}\displaystyle\sum_{i=1}^n x_i$.

12. Find the maximum likelihood estimator for the mean $\lambda$ of a Poisson population.

Our population $X$ has a pdf of the form $f(x) = e^{-\lambda}\dfrac{\lambda^{x}}{x!}$ for $x = 0, 1, 2, \ldots$ where $\lambda > 0$ is unknown. We obtain data $x_1, x_2, \ldots, x_n$ from a sample of the population and endeavor to estimate the parameter $\lambda$. The probability (called the likelihood in this context) that a sample from the given population would be the same as our actual data is

$\qquad L(\lambda) = \displaystyle\prod_{i=1}^{n} f(x_i) = \prod_{i=1}^{n} e^{-\lambda}\dfrac{\lambda^{x_i}}{x_i!}$.

We want to choose our parameter $\lambda$ so as to maximize the likelihood $L(\lambda)$. It is often easier to maximize the log likelihood:

$$\begin{array}{lcl} l(\lambda) &=& \ln(L(\lambda)) \\ &=& \displaystyle\sum_{i=1}^{n} \left(\ln(e^{-\lambda}) + \ln(\lambda^{x_i}) - \ln(x_i!)\right) \\ &=& -n\lambda + \ln(\lambda)\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \ln(x_i!) \end{array}$$

A necessary condition to maximize $l$ comes from calculus:

$\qquad 0 = l'(\hat{\lambda}) = -n + \dfrac{1}{\hat{\lambda}}\sum_{i=1}^{n} x_i$

The unique solution is

$\qquad \hat{\lambda} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n} x_i$.

It is easy to verify with a first derivative test that this is a global maximum.

## Gamma Population

Suppose we have a gamma population with unknown parameters $\alpha$ and $\beta$. We obtain a sample $x_1, x_2, \ldots, x_n$ and want to estimate the population parameters.

13. Find the method of moments estimators.

Recall from *C25 Important Random Variables* that for $\text{Gamma}(\alpha, \beta)$, the mean is $\mu = \alpha\beta$ and the variance is $\sigma^2 = \alpha\beta^2$. Solving these two equations for the parameters, we obtain $\beta = \sigma^2/\mu$ and $\alpha = \mu^2/\sigma^2$. Thus, the method of moments estimators are $\hat{\alpha} = \bar{x}^2 / v$ and $\hat{\beta} = v / \bar{x}$ where $v = \dfrac{n-1}{n}s^2$.

14. Find the maximum likelihood estimators.

Recall from *C25 Important Random Variables* that for $\text{Gamma}(\alpha, \beta)$, the pdf is $f(x) = \dfrac{1}{\Gamma(\alpha)\beta^{\alpha}}x^{\alpha-1} e^{-x/\beta}$ for $x \geq 0$ where $\Gamma(\alpha) = \int_0^{\infty}t^{\alpha-1}e^{-t}\,dt$ is a generalization of $(\alpha - 1)!$.
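
As an aside, R's `gamma()` function computes $\Gamma(\alpha)$; a quick check of the factorial connection (the value 5 is chosen arbitrarily for illustration):

```{r}
# gamma(a) = (a-1)! for a positive integer a, e.g. gamma(5) = 4! = 24
c(gamma(5), factorial(4))
```
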
From this pdf, the log likelihood function is

$$\begin{array}{lcl} l(\alpha,\beta) &=& \displaystyle\sum_{i=1}^{n} \ln\left( \dfrac{1}{\Gamma(\alpha)\beta^{\alpha}}x_i^{\alpha-1} e^{-x_i/\beta} \right) \\ &=& -n\ln(\Gamma(\alpha)) - n\alpha\ln(\beta) + (\alpha - 1) \left(\displaystyle\sum_{i=1}^{n} \ln(x_i)\right) - \dfrac{1}{\beta}\displaystyle\sum_{i=1}^{n} x_i \end{array}$$

Necessary conditions to maximize $l$ are

$\qquad 0 = \dfrac{\partial l}{\partial \alpha} = -n\dfrac{\Gamma'(\alpha)}{\Gamma(\alpha)} - n\ln(\beta) + \displaystyle\sum_{i=1}^{n} \ln(x_i)$

and

$\qquad 0 = \dfrac{\partial l}{\partial \beta} = -n\dfrac{\alpha}{\beta} + \dfrac{1}{\beta^2}\displaystyle\sum_{i=1}^{n} x_i$.

The second equation yields

$\qquad \hat{\alpha}\hat{\beta} = \bar{x}$,

but the first equation is not very helpful symbolically. Substituting $\hat{\beta} = \bar{x}/\hat{\alpha}$ into the first equation gives a single equation in $\hat{\alpha}$,

$\qquad 0 = -n\dfrac{\Gamma'(\hat{\alpha})}{\Gamma(\hat{\alpha})} - n\ln(\bar{x}/\hat{\alpha}) + \displaystyle\sum_{i=1}^{n} \ln(x_i)$,

which we must solve numerically for any given list of data. (In R, $\Gamma'(\alpha)/\Gamma(\alpha)$ is `digamma(alpha)`.)

15. Suppose the data given below came from a gamma population. Find the method of moments estimators and maximum likelihood estimators. Compare the empirical cdf and the two theoretical cdfs in a single plot.

```{r}
actual = tibble(x = c(49.2,53.9,50.0,44.5,42.2,42.3,32.3,31.3,60.9,47.5))
```

```{r}
moments = actual |> summarize(
  n = n(),
  xbar = mean(x),
  v = (n-1)/n * var(x),
  alpha = xbar^2 / v,
  beta = v / xbar
)
moments
```

```{r}
likelihood = actual |> summarize(
  n = n(),
  xbar = mean(x),
  v = (n-1)/n * var(x),
  # solve the substituted first-order condition for alpha; the interval brackets the root
  alpha = uniroot(function(a) -n*digamma(a) - n*log(xbar/a) + sum(log(x)), c(10, 40))$root,
  beta = xbar / alpha
)
likelihood
```

The plot compares the empirical cdf with the fitted gamma cdfs, the method of moments fit in red and the maximum likelihood fit in blue.

```{r}
ggplot(actual, aes(x = x)) +
  stat_ecdf() +
  geom_function(fun = pgamma,
                args = list(shape = moments$alpha, scale = moments$beta),
                xlim = c(29, 63), color = "red") +
  geom_function(fun = pgamma,
                args = list(shape = likelihood$alpha, scale = likelihood$beta),
                xlim = c(29, 63), color = "blue")
```

It appears that the data could have come from a gamma population and that the two theoretical distributions are quite similar.
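
As a cross-check on the `uniroot()` calculation (a sketch only, not part of the solution above; `negLL` is a helper name introduced here), we could also maximize the two-parameter log likelihood directly by minimizing its negative with `optim()`, starting from the method of moments estimates:

```{r}
# negative gamma log likelihood for the data in `actual`
negLL = function(par) -sum(dgamma(actual$x, shape = par[1], scale = par[2], log = TRUE))
# Nelder-Mead minimization starting at the method of moments estimates;
# the result should agree closely with the alpha and beta found by uniroot above
optim(c(moments$alpha, moments$beta), negLL)$par
```
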