--- title: "35 Applied Inferential Statistics" author: "David Housman" format: docx editor: source fig-width: 6 fig-height: 2.4 --- ```{r} #| message: false #| warning: false #| echo: false library(tidyverse) library(goftest) library(nortest) library(gapminder) ``` ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Level of Significance The *level of significance ($\alpha$)* is the probability at which we say that an improbable event does not occur. More concretely, $\alpha$ is the acceptable probability for erroneously accepting (a) that a new drug is effective when it is actually ineffective, (b) that a person is guilty when they are actually innocent, or (c) that a company is lying about the quality of its product when they are actually telling the truth. In scientific literature, often used levels of significance are 0.05, 0.01, and 0.001. ## Hypothesis Test A *hypothesis test* consists of null and alternative hypotheses. A *test statistic* that measures how far the data are from the null hypothesis is chosen. The *p-value* is the probability of obtaining a sample from the null hypothesized population that has a test statistic at least as supportive of the alternative hypothesis as obtained from the data. The p-value could be approximated with a simulation or calculated with a theoretically derived formula. If the p-value is smaller than the level of significance $\alpha$, then we reject the null hypothesis and accept the alternative hypothesis. If the p-value is at least as large as the level of significance, then we fail to reject the null hypothesis and fail to accept the alternative hypothesis. The hypothesis test process is similar to jurisprudence in most western democracies. The accused is assumed to be innocent (the null hypothesis) and is determined to be guilty (the alternative hypothesis) if and only if the evidence (summarized by the test statistic and p-value) proves guilt beyond a reasonable doubt (the p-value is less than the level of significance). Although assumed to be innocent, the accused is either proclaimed guilty or not guilty (which could be considered different from proclaiming innocence). A hypothesis test process is also similar to a mathematician's proof by contradiction. To prove that $\sqrt{2}$ is not rational (the alternative hypothesis), the mathematician supposes $\sqrt{2}$ is rational (the null hypothesis), finds a contradiction (evidence that shows the p-value is zero), and concludes that $\sqrt{2}$ is not rational. Evidence derived from a sample almost never yields a p-value of zero, and so we are willing to say that will a small enough p-value there is sufficient evidence to accept the alternative hypothesis. We can compare reality and our decision in the following table (with rows corresponding to reality and columns corresponding to our decision). | | reject $H_0$ | don't reject $H_0$ | |----------------|-------------------------|-------------------------| | $H_0$ is true | type I error ($\alpha$) | correct decision | | $H_0$ is false | correct decision | type II error ($\beta$) | We can see that the significance level is also the probability of making a type I error: rejecting a true null hypothesis. The probability of a type II error may be a function of the possible alternative hypotheses. The hypothesis testing process provides a quantitative way to describe the intuition we can obtain by looking at plots and more intuitive statistics. 
Since multiple tests have been defined, users must make choices about which test to use in any particular circumstance. This is one of the many aspects of statistics that is more art than science, and there are regular controversies and different schools of thought in the professional statistics community. Thus, it is important to state what test was used and the actual p-value obtained in any inferential statistical analysis.

## Confidence Interval

A *confidence interval* is an estimate of a population parameter that conveys the precision of the estimate. A level of confidence equal to one minus the level of significance ($1 - \alpha$) is typically reported as a percentage confidence. The classical (parametric) approach is to include all null hypothesized values for the parameter that would not be rejected with a hypothesis test. A resampling (nonparametric) approach is to use the actual sample data as a best estimate for the population and determine a $1 - \alpha$ coverage interval for the estimator. Since people use a variety of different procedures and confidence levels, these should be reported along with the confidence interval.

## Validity

Hypothesis tests and confidence intervals are only considered valid if the significance level and procedure are chosen **before** the data are collected and analyzed. This is how the United States Food and Drug Administration deals with drug testing for safety and effectiveness. In the examples that follow, we will often consider different procedures and significance levels as an aid to understanding the concepts. This is often what researchers do in what is called *exploratory data analysis*. We should be careful, honest, and transparent in reporting any results obtained through exploration.

## Fair Coin Claim

An actual penny was flipped 100 times resulting in 37 heads. Is the penny a fair coin?

Statisticians organize this scenario as a *hypothesis test*. The *population* consists of heads and tails with the probability of heads being the *parameter* $p$. The *sample* is the $n = 100$ actual coin flips observed. The *null hypothesis ($H_0$)* is that $p = 1/2$, and the *alternative hypothesis ($H_a$)* is that $p \ne 1/2$. The *p-value* is the probability of obtaining a random sample at least as supportive of the alternative hypothesis as was actually observed given that the null hypothesis is true. In this case, the p-value is $P(X \leq 37 \text{ or } 63 \leq X \;|\; X \sim \text{Binomial}(100,1/2))$.

```{r}
pbinom(37,100,0.5) + (1 - pbinom(62,100,0.5))
```

With a p-value of 0.012 and a 0.01 level of significance, we would conclude that $p$ could be $1/2$, i.e., the coin could be fair. With a p-value of 0.012 and a 0.05 level of significance, we would conclude that $p \ne 1/2$, i.e., the coin is not fair. This illustrates the danger of choosing the level of significance after, rather than before, the data are collected.

To illustrate the danger in choosing the test design after collecting the data, perhaps we conjecture that the probability of heads is less than 1/2 (based upon the small number of heads obtained in the sample). This would suggest changing the alternative hypothesis to $p < 1/2$, resulting in a change of the p-value to $P(X \leq 37 \;|\; X \sim \text{Binomial}(100,1/2))$.

```{r}
pbinom(37,100,0.5)
```

With a p-value of 0.006 and a 0.01 level of significance, we would conclude that $p < 1/2$, although we earlier could not conclude (at the same level of significance) that $p$ was different from $1/2$.
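As a check, base R's exact `binom.test` reproduces both of these p-values:

```{r}
binom.test(37, 100, p = 0.5)$p.value                        # two-tailed
binom.test(37, 100, p = 0.5, alternative = "less")$p.value  # one-tailed
```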
This example has illustrated *two-tailed* (when $H_a: p \ne 1/2$) and *one-tailed* (when $H_a: p < 1/2$) hypothesis tests and how to translate a p-value and a level of significance into statistical conclusions. Observe that we either accept the alternative hypothesis or fail to reject the null hypothesis, because there is either sufficient or insufficient evidence to reject the null hypothesis. Equally important is designing the test and choosing the significance level *before* gathering the data.

## Normal(50, 3) Concrete Block Strength Claim

Acme Concrete claims that the compressive strength of its concrete blocks is normally distributed with a mean of 50 hundred pounds per square inch and a standard deviation of 3 hundred pounds per square inch. To test the claim, ten concrete blocks are randomly chosen and their compressive strengths (in hundred pounds per square inch) are measured.

```{r}
acme = tibble(x = c(49.2,53.9,50.0,44.5,42.2,42.3,32.3,31.3,60.9,47.5))
```

We compare this data to the claimed distribution (the null hypothesis) in a variety of ways.

First, compare the sample mean and standard deviation with the claimed population mean and standard deviation. Observe that the sample mean is less than the claimed mean and the sample standard deviation is larger than the claimed standard deviation.

```{r}
acme |> summarise(n = n(), smean = mean(x), cmean = 50, ssd = sd(x), csd = 3)
```

Second, compare the sample and claimed population pdfs. Observe that the histogram of the data is shifted left from and wider than the claimed pdf. This visually indicates the sample mean is less than the claimed population mean and that the sample standard deviation is more than the claimed population standard deviation.

```{r}
ggplot(acme, aes(x = x)) + 
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, center = 32.5, color = "white") + 
  geom_density(color = "blue", linewidth = 2) + 
  geom_function(fun = dnorm, args = list(mean = 50, sd = 3), color = "red", linewidth = 2)
```

Third, compare the sample and claimed population distributions with a qq plot. Observe that the midpoint of the qq points is at $(50, 46)$ and their slope is greater than one. This visually indicates the sample mean is less than the claimed population mean and that the sample standard deviation is more than the claimed population standard deviation.

```{r}
ggplot(acme, aes(sample = x)) + 
  stat_qq(distribution = function(p) qnorm(p, 50, 3)) + 
  geom_abline(slope = 1, intercept = 0, color = "red")
```

Fourth, compare the sample and claimed population cdfs. The sample points rise earlier (indicating a smaller center) and more slowly (indicating a larger spread) than the claimed cdf.

```{r}
ggplot(acme, aes(x = x)) + 
  stat_ecdf() + 
  geom_function(fun = pnorm, args = c(50, 3), color = "red")
```

To turn these observations into a hypothesis test, we need a way to measure how far our sample is from samples generated by the claimed population. Such a measure is called a *statistic*. One example would be the sample mean $\bar{X}$. For a one-sided test, the p-value would be $P(\bar{X} \leq 45.41 \;|\; X \sim \text{Normal}(50,3))$.

```{r}
pnorm(45.41, 50, 3/sqrt(10))
```

A second example would be the sample minimum $X_{\min}$. For a one-sided test, the p-value would be $P(X_{\min} \leq 31.3 \;|\; X \sim \text{Normal}(50,3))$.

```{r}
1 - (1-pnorm(31.3, 50, 3))^10
```

Rather than focus on a center or an end point of the data, it would be better in general to consider the overall distribution.
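For example, base R's `ks.test` compares the entire empirical cdf with the claimed cdf; its statistic is the first of the three defined in the next paragraph.

```{r}
# One-sample Kolmogorov-Smirnov test against the claimed Normal(50, 3)
ks.test(acme$x, "pnorm", 50, 3)
```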
The difference between the empirical cdf $F_n$ and the claimed population cdf $F$ can be summarized in a variety of ways. The Kolmogorov-Smirnov statistic is the maximum difference between the two cdfs. The Cramer-von Mises statistic integrates the squared difference between the two cdfs weighted by the claimed population pdf. We will use the Anderson-Darling statistic, which is similar to the Cramer-von Mises statistic but weights the tail regions more heavily than the middle region:

$$\begin{array}{lcl} \text{AD} &=& n\displaystyle\int_{-\infty }^{\infty } \frac{\left(F_{n}(x)-F(x)\right)^2}{F(x)\left(1-F(x)\right)} \, dF(x) \\ &=& -n - \displaystyle\sum_{k=1}^{n}\dfrac{2k-1}{n}\left(\ln\left(F(y_k)\right)+\ln\left(1-F(y_{n+1-k})\right)\right) \end{array}$$

where the data in $x$ is sorted in ascending order in $y$. For Acme's compressive strength data, the Anderson-Darling statistic is calculated as follows.

```{r}
n = length(acme$x)
y = sort(acme$x)
AD = -n - sum((2*(1:n)-1)/n*(log(pnorm(y, 50, 3)) + log(1 - pnorm(rev(y), 50, 3)))); AD
```

Is this a reasonable value for the Anderson-Darling statistic? To find out, we simulate samples from the claimed distribution, calculate their Anderson-Darling statistics, and determine the proportion of them at least as great as the Anderson-Darling statistic for the data.

```{r}
set.seed(17)
runs = 1000
ad = numeric(runs)
for (i in 1:runs) {
  sample = rnorm(n, 50, 3)
  y = sort(sample)
  ad[i] = -n - sum((2*(1:n)-1)/n*(log(pnorm(y, 50, 3)) + log(1 - pnorm(rev(y), 50, 3))))
}
ggplot(tibble(ad = ad), aes(x = ad)) + 
  geom_density() + 
  geom_vline(xintercept = AD, color = "red")
prob = length(which(ad >= AD)) / runs; prob
```

With a p-value of essentially zero, the compressive strengths of Acme concrete blocks are significantly different from a normal random variable with mean 50 hundred pounds per square inch and standard deviation 3 hundred pounds per square inch.

The following R function calculates the Anderson-Darling statistic (once the `goftest` package is installed and loaded) and approximates the p-value theoretically.

```{r}
goftest::ad.test(acme$x, "pnorm", 50, 3)
```

## Normal Concrete Block Strength Claim

Acme Concrete claims that the compressive strength of its concrete blocks is normally distributed, but does not specify the two parameters. We need to compare the data to some unknown normal distribution. To do so, we need to estimate the parameters $\mu$ and $\sigma$ with the data. The method of moments and the maximum likelihood estimators are the same:

$\qquad \hat{\mu} = \bar{x}$ and $\hat{\sigma} = \sqrt{\dfrac{n-1}{n}}s$.

To test the claim, ten concrete blocks are randomly chosen and their compressive strengths (in hundred pounds per square inch) are measured. We compare this data to the claimed distribution (using the estimators for the population parameters) in a variety of ways.

First, use sample statistics to estimate the corresponding population parameters.

```{r}
acme = tibble(x = c(49.2,53.9,50.0,44.5,42.2,42.3,32.3,31.3,60.9,47.5))
n = length(acme$x); n
mu.hat = mean(acme$x); mu.hat
sigma.hat = sqrt((n-1)/n)*sd(acme$x); sigma.hat
```

Second, compare the sample and claimed population pdfs.

```{r}
ggplot(acme, aes(x = x)) + 
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, center = 32.5, color = "white") + 
  geom_density(color = "blue", linewidth = 2) + 
  geom_function(fun = dnorm, args = list(mean = mu.hat, sd = sigma.hat), color = "red", linewidth = 2)
```

Third, compare the sample and claimed population distributions with a qq plot.
```{r}
ggplot(acme, aes(sample = x)) + 
  stat_qq(distribution = function(p) qnorm(p, mu.hat, sigma.hat)) + 
  geom_abline(slope = 1, intercept = 0, color = "red")
```

Fourth, compare the sample and claimed population cdfs.

```{r}
ggplot(acme, aes(x = x)) + 
  stat_ecdf() + 
  geom_function(fun = pnorm, args = c(mu.hat, sigma.hat), color = "red")
```

The difference between the sample and claimed population cdfs can be summarized with the Anderson-Darling statistic with the normal population parameters estimated with our estimators.

```{r}
n = length(acme$x)
mu.hat = mean(acme$x)
sigma.hat = sqrt((n-1)/n)*sd(acme$x)
y = sort(acme$x)
AD = -n - sum((2*(1:n)-1)/n*(log(pnorm(y, mu.hat, sigma.hat)) + log(1 - pnorm(rev(y), mu.hat, sigma.hat)))); AD
```

Is this a reasonable value for the Anderson-Darling statistic? To find out, we simulate samples from the claimed distribution, calculate their Anderson-Darling statistics, and determine the proportion of them at least as great as the Anderson-Darling statistic for the data. Since we used the data to estimate the two population parameters, we must do the same within each simulated sample.

```{r}
set.seed(17)
runs = 1000
ad = numeric(runs)
for (i in 1:runs) {
  sample = rnorm(n, mu.hat, sigma.hat)
  smu.hat = mean(sample)
  ssigma.hat = sqrt((n-1)/n)*sd(sample)
  y = sort(sample)
  ad[i] = -n - sum((2*(1:n)-1)/n*(log(pnorm(y, smu.hat, ssigma.hat)) + log(1 - pnorm(rev(y), smu.hat, ssigma.hat))))
}
ggplot(tibble(ad = ad), aes(x = ad)) + 
  geom_density() + 
  geom_vline(xintercept = AD, color = "red")
prob = length(which(ad >= AD)) / runs; prob
```

Either (1) Acme Concrete's claim is correct and the probability of obtaining a sample that is at least as unsupportive of Acme's claim as the actual sample is `r format(prob, digits=4)`, or (2) Acme Concrete's claim is incorrect. With a 0.05 level of significance, statisticians would choose (1): with a p-value of `r format(prob, digits=3)`, the compressive strengths of Acme concrete blocks are not significantly different from a normal random variable.

Minitab and Mathematica include black box functions that mimic the above tests as well as many other tests. They will also calculate p-values theoretically or via simulation. For its estimate of $\sigma$, Minitab uses the sample standard deviation 9.090 rather than the maximum likelihood estimate 8.623, which is what Mathematica uses. So, Mathematica's result matches the work we have done, and Minitab yields a slightly different result.

The Anderson-Darling test in `goftest` will take into account the fact that the parameters were estimated from the data, but it does so in a very different manner: the method of Braun referenced in the help documentation appears to be a general technique for handling "nuisance" parameters that relies on a Central Limit type theorem. Since we have a small sample, I would not want to use this approach with this data.

```{r}
goftest::ad.test(acme$x, "pnorm", mu.hat, sigma.hat, estimated = TRUE)
```

We can make use of `goftest::ad.test` to replace the computation of the statistic (although there is not much space savings and no time savings here).
```{r}
set.seed(17)
runs = 1000
ad = numeric(runs)
for (i in 1:runs) {
  sample = rnorm(n, mu.hat, sigma.hat)
  smu.hat = mean(sample)
  ssigma.hat = sqrt((n-1)/n)*sd(sample)
  y = sort(sample)
  ad[i] = goftest::ad.test(sample, "pnorm", smu.hat, ssigma.hat)$statistic
}
prob = length(which(ad >= AD)) / runs; prob
```

There is an Anderson-Darling test in the `nortest` package that provides a black box method mimicking our approach, but it only does so for a null hypothesized normal random variable, and it uses the "divide by $n-1$", rather than $n$, standard deviation estimator.

```{r}
nortest::ad.test(acme$x)
```

The Shapiro-Wilk test can only be used for testing normality, but it is often used in the literature and is built into the core of R.

```{r}
shapiro.test(acme$x)
```

## Gamma Concrete Block Strength Claim

Acme Concrete claims that the compressive strength of its concrete blocks is distributed as a gamma random variable. To test the claim, ten concrete blocks are randomly chosen and their compressive strengths (in hundred pounds per square inch) are measured. We compare this data to the claimed distribution in a variety of ways.

First, obtain the method of moments estimates: $\hat{\alpha}\hat{\beta} = \bar{x}$ and $\hat{\alpha}\hat{\beta}^2 = v$ imply $\hat{\beta} = v/\bar{x}$ and $\hat{\alpha} = \bar{x}^2/v$.

```{r}
acme = tibble(x = c(49.2,53.9,50.0,44.5,42.2,42.3,32.3,31.3,60.9,47.5))
n = length(acme$x)
xbar = mean(acme$x)
v = var(acme$x)*(n-1)/n
a.hat = xbar^2/v; a.hat
b.hat = v/xbar; b.hat
```

Compare the sample and claimed population pdfs.

```{r}
ggplot(acme, aes(x = x)) + 
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, center = 32.5, color = "white") + 
  geom_density(color = "blue", linewidth = 2) + 
  geom_function(fun = dgamma, args = list(shape = a.hat, scale = b.hat), color = "red", linewidth = 2)
```

Compare the sample and claimed population distributions with a qq plot.

```{r}
ggplot(acme, aes(sample = x)) + 
  stat_qq(distribution = function(p) qgamma(p, a.hat, 1/b.hat)) + 
  geom_abline(slope = 1, intercept = 0, color = "red")
```

Compare the sample and claimed population cdfs.

```{r}
ggplot(acme, aes(x = x)) + 
  stat_ecdf() + 
  geom_function(fun = pgamma, args = c(a.hat, 1/b.hat), color = "red")
```

The difference between the sample and claimed population cdfs can be summarized with the Anderson-Darling statistic.

```{r}
y = sort(acme$x)
AD = -n - sum((2*(1:n)-1)/n*(log(pgamma(y, a.hat, 1/b.hat)) + log(1 - pgamma(rev(y), a.hat, 1/b.hat)))); AD
```

Is this a reasonable value for the Anderson-Darling statistic? To find out, we simulate samples from the claimed distribution, calculate their Anderson-Darling statistics, and determine the proportion of them at least as great as the Anderson-Darling statistic for the data. It is important to observe that we have used the data to estimate two population parameters; hence, we must do the same in our simulation.

```{r}
runs = 1000
ad = numeric(runs)
for (i in 1:runs) {
  sample = rgamma(n, a.hat, 1/b.hat)
  sxbar = mean(sample)
  sv = var(sample)*(n-1)/n
  sa.hat = sxbar^2/sv
  sb.hat = sv/sxbar
  y = sort(sample)
  ad[i] = -n - sum((2*(1:n)-1)/n*(log(pgamma(y, sa.hat, 1/sb.hat)) + log(1 - pgamma(rev(y), sa.hat, 1/sb.hat))))
}
ggplot(tibble(ad = ad), aes(x = ad)) + 
  geom_density() + 
  geom_vline(xintercept = AD, color = "red")
prob = length(which(ad >= AD)) / runs; prob
```
Either (1) Acme Concrete's claim is correct and the probability of obtaining a sample that is at least as unsupportive of Acme's claim as the actual sample is `r format(prob, digits=3)`, or (2) Acme Concrete's claim is incorrect. With a 0.05 level of significance, statisticians would choose (1): with a p-value of `r format(prob, digits=3)`, the compressive strengths of Acme concrete blocks are not significantly different from a gamma random variable.

Observe that we have chosen to not reject two different claims: the compressive strength of Acme's concrete blocks may be normally distributed and may be gamma distributed. That is why statisticians never say that they accept a null hypothesis. They only fail to reject a null hypothesis.

## No Answer Copying Claim

Two students were given different 20-question multiple choice (4 choices per question) exams. They claim to have randomly chosen their answers. But their answers were the same on 15 of the questions. The instructor suspects the students of copying off each other. Who is right? (Based on an actual academic integrity case at a liberal arts college.)

**Answer.** Formulate as a hypothesis test.

$\qquad H_0$: Answers chosen randomly.

$\qquad H_a$: Answers copied.

Let $X$ be the number of matches given the null hypothesis $H_0$. The p-value is $P(X \geq 15)$. We provide four approaches to calculating the p-value.

**Approach 1:** Simulation with a margin of error of $\pm 1/\sqrt{100000} \approx \pm 0.003$.

```{r}
nruns = 100000
matches = numeric(nruns)
for(i in 1:nruns) {
  answers1 = sample(1:4, 20, replace = TRUE)
  answers2 = sample(1:4, 20, replace = TRUE)
  matches[i] = length(which(answers1 == answers2))
}
ggplot(tibble(matches), aes(x = matches)) + 
  geom_bar() + 
  geom_vline(aes(xintercept = 15))
pvalue = length(which(matches>=15))/nruns; pvalue
```

**Approach 2:** $X$ is binomial with $n = 20$ and $p = 1/4$, since two randomly chosen answers to a question match with probability $1/4$.

```{r}
1 - pbinom(14, 20, 1/4)
```

**Approach 3:** $X = X_1 + X_2 + \ldots + X_{20}$ where the $X_i$ form a sample from a Bernoulli population with $p = 1/4$. The Central Limit Theorem provides an approximation.

```{r}
1 - pnorm(14.5, 20*1/4, sqrt(20*1/4*(1-1/4)))
```

**Approach 4:** A black box hypothesis test. To use this approach, we need to reformulate the hypotheses using a population parameter. Let $p$ be the probability that two answers match. Then the hypotheses can be expressed

$\qquad H_0: p = 1/4$

$\qquad H_a: p > 1/4$

We have a sample of size 20 with 15 matches. From the output, we conjecture that the black box test makes use of our Approach 3.

```{r}
prop.test(15, 20, 1/4, alternative="greater")
```

**Conclusion:** With a p-value less than 0.00001, we reject the claim that the two students chose their answers randomly. (Note the dangers of announcing such a p-value and conclusion to quantitatively illiterate people, as was done by a colleague at a different liberal arts college in a similar case of academic dishonesty.)
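As a footnote to Approach 4, base R's `binom.test` provides an exact black box alternative to the approximate `prop.test`; it reproduces Approach 2:

```{r}
binom.test(15, 20, p = 1/4, alternative = "greater")
```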
## Steel Bar Improvement Amounts

A steel manufacturer has been producing bars with a yield strength that is normally distributed with a mean of 900 MPa and a standard deviation of 50 MPa. The manufacturer hopes that a modified process will increase the mean yield strength of bars to 950 MPa and decrease the standard deviation. The new process is used to produce 15 bars whose yield strengths in MPa are given below.

```{r}
steel = tibble(x = c(961, 962, 942, 934, 939, 920, 980, 972, 905, 985, 897, 970, 948, 888, 930))
```

What is the mean yield strength of bars made by the new process? What is the standard deviation of the yield strengths made with the new process? Have the manufacturer's hopes been fulfilled? (Based on a textbook example but typical of real-world questions.)

**Description:** Before an inferential analysis is carried out, always start with a descriptive analysis.

```{r}
n = length(steel$x)
mu.hat = mean(steel$x)
sigma.hat = sd(steel$x)
tibble(n, mu.hat, sigma.hat)
```

```{r}
ggplot(steel, aes(x = x)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 5) + 
  geom_density(color = "red", linewidth = 2) + 
  ggtitle("Steel Bars") + 
  xlab("Yield Strength (MPa)")
```

The sample mean is close to 950 MPa, and the sample standard deviation is less than 50 MPa. So, it looks like the manufacturer's hopes may have been fulfilled. It also looks like the yield strengths may be normally distributed. To be sure, we need to use inferential techniques.

**Mean Resampling Approach:** Use the data as the best estimate of the population.

```{r}
stats = replicate(10000, mean(sample(steel$x, n, replace = TRUE)))
quantile(stats, c(0.025,0.975))
```

**Mean Classical Approach:** Find the values for the null hypothesized mean yield strength that would not be rejected. For the test of $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$, the p-value if $\bar{x}>\mu_0$ is

$\qquad 2P\left(\bar{X} \geq \bar{x}\right) = 2P\left(\dfrac{\bar{X}-\mu_0}{\sigma/\sqrt{n}} \geq \dfrac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\right) \overset{\text{CLT}}{\approx} 2P\left(Z \geq \dfrac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\right)$

and the p-value if $\bar{x} < \mu_0$ is

$\qquad 2P\left(\bar{X} \leq \bar{x}\right) = 2P\left(\dfrac{\bar{X}-\mu_0}{\sigma/\sqrt{n}} \leq \dfrac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\right) \overset{\text{CLT}}{\approx} 2P\left(Z \leq \dfrac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\right)$

Since we want the p-value to be at least $\alpha = 0.05$, it follows that

$\qquad \text{ICDF}(Z,0.025) \leq \dfrac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} \leq \text{ICDF}(Z,0.975)$

which is equivalent to

$\qquad \bar{x}-\dfrac{\sigma}{\sqrt{n}}\text{ICDF}(Z,0.975) \leq \mu_0 \leq \bar{x}-\dfrac{\sigma}{\sqrt{n}}\text{ICDF}(Z,0.025)$

which is equivalent to

$\qquad \bar{x}+\dfrac{\sigma}{\sqrt{n}}\text{ICDF}(Z,0.025) \leq \mu_0 \leq \bar{x}+\dfrac{\sigma}{\sqrt{n}}\text{ICDF}(Z,0.975)$.

Unfortunately, we do not have the population standard deviation, and so we will use the sample standard deviation as an estimate.

```{r}
mu.hat + qnorm(c(0.025,0.975))*sigma.hat/sqrt(n)
```

The above assumes that $n$ is sufficiently large for the CLT to work and to allow an estimation of $\sigma$ by $\hat{\sigma} = s$. A more acceptable approach when $\sigma$ is unknown is to use the theorem that if the population is normally distributed with mean $\mu_0$, then $\dfrac{\bar{X}-\mu_0}{S/\sqrt{n}}$ has a Student-t distribution with $n-1$ degrees of freedom. To pursue this approach, we must first check whether the population can be assumed to be normally distributed. The following qq plot of the actual sample versus a normal random variable with the sample mean and standard deviation indicates that the data could have come from a normal distribution.

```{r}
ggplot(steel, aes(sample = x)) + 
  stat_qq(distribution = function(p) qnorm(p, mu.hat, sigma.hat)) + 
  geom_abline(slope = 1, intercept = 0, color = "red")
```

The following comparison of the empirical cdf with a normal cdf having the sample mean and standard deviation indicates that the data could have come from a normal distribution.
```{r}
ggplot(steel, aes(x = x)) + 
  stat_ecdf() + 
  geom_function(fun = pnorm, args = c(mu.hat, sigma.hat), color = "red")
```

This code carries out two formal tests of the null hypothesis that the data came from a normal random variable.

```{r}
nortest::ad.test(steel$x)
shapiro.test(steel$x)
```

With a p-value of 0.78 based on the Anderson-Darling normality test, it is reasonable to posit that the data came from a normal population. Some statisticians argue that the qq plot is a better approach than an inferential test because (1) for small sample sizes almost no data sets can be shown to not arise from a normal distribution, and (2) for large sample sizes data that may be statistically different from normality may nevertheless be sufficiently close to normality for other inferential tests and confidence intervals to remain reliable.

Now the previous derivation of a mean confidence interval can be repeated with $S$ replacing $\sigma$ and $T_{n-1}$ replacing $Z$.

```{r}
mu.hat + qt(c(0.025,0.975), df = n-1)*sigma.hat/sqrt(n)
```

**Standard Deviation Resampling Approach:** Use the data as the best estimate of the population.

```{r}
stats = replicate(10000, sd(sample(steel$x, replace = TRUE)))
quantile(stats, c(0.025,0.975))
```

**Standard Deviation Classical Fail to Reject Approach:** We use the theorem that says if the population is normally distributed with variance $\sigma^2$, then $(n-1)S^2/\sigma^2$ has a chi-squared distribution with $n-1$ degrees of freedom (that is, it is distributed as a sum of $n-1$ squared independent standard normal random variables). For the hypothesis test

$\qquad H_0: \sigma^2 = \sigma_0^2\quad$ vs. $\quad H_1: \sigma^2 \neq \sigma_0^2$,

the p-value is

$\qquad 2P(S\geq s)$ if $s>\sigma_0$ and $2P(S\leq s)$ if $s<\sigma_0$.

This is equivalent to

$\qquad 2P\left(\chi_{n-1}^2 \geq \dfrac{(n-1)s^2}{\sigma_0^2}\right)$ if $s > \sigma_0$, and $2P\left(\chi_{n-1}^2 \leq \dfrac{(n-1)s^2}{\sigma_0^2}\right)$ if $s < \sigma_0$.

For the p-value to be at least $\alpha = 0.05$, we must have

$\qquad \dfrac{(n-1)s^2}{\sigma_0^2} \leq \text{ICDF}\left(\chi_{n-1}^2,0.975\right)$ if $s > \sigma_0$ and $\dfrac{(n-1)s^2}{\sigma_0^2} \geq \text{ICDF}\left(\chi_{n-1}^2,0.025\right)$ if $s < \sigma_0$.

Finally, this is equivalent to

```{r}
sigma.hat*sqrt((n-1)/qchisq(c(0.975,0.025), df=n-1))
```

**Standard Deviation Sampling Distribution from a Normal Population Approach:** An alternate argument is to let $\hat{\sigma} = s$ be our best estimate of $\sigma$ for our normally distributed population. By the earlier theorem,

$\qquad 0.95 = P\left(\text{ICDF}\left(\chi_{n-1}^2,0.025\right) \leq \dfrac{(n-1)S^2}{\sigma^2} \leq \text{ICDF}\left(\chi_{n-1}^2,0.975\right)\right)$

which is equivalent to

$\qquad 0.95 = P\left(\dfrac{\sigma^2}{n-1}\text{ICDF}\left(\chi_{n-1}^2,0.025\right) \leq S^2 \leq \dfrac{\sigma^2}{n-1}\text{ICDF}\left(\chi_{n-1}^2,0.975\right)\right)$

```{r}
sigma.hat*sqrt(qchisq(c(0.025,0.975), df=n-1)/(n-1))
```

**Summary:** With 95% confidence from a resampling approach, the mean yield strength of the steel bars is between 927 MPa and 957 MPa, and the standard deviation of the yield strengths is between 20 MPa and 36 MPa. With a significance level of 0.05, we can conclude that the mean yield strength has increased from 900 MPa, the mean yield strength may now be 950 MPa, the distribution of yield strengths may be normal, and the standard deviation of the yield strengths has decreased from 50 MPa.
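As a quick check on the classical approach, base R's one-sample `t.test` reports the same Student-t confidence interval for the mean:

```{r}
t.test(steel$x)$conf.int
```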
Here is a comparison of the results obtained using our different approaches.

$$\begin{array}{lll} \text{Approach} & \text{mean} & \text{st dev} \\ \text{Resampling} & [927, 957] & [20, 36] \\ \text{Sampling distribution from a normal population} & [927, 957] & [19, 41] \\ \text{Student-t rather than CLT} & [925, 959] & \\ \text{Fail to reject} & \text{ditto} & [22, 48] \end{array}$$

## Penny Side Flips

What is the probability of heads when a penny is balanced on its edge and then the surface is jostled? When this was done 100 times, 64 heads appeared.

**Answer:** Formulate as a hypothesis test.

$\qquad H_0$: Probability of heads $= p_0$

$\qquad H_a$: Probability of heads $\neq p_0$

**Exact Approach.** Let $X$ be the number of heads in a sample of 100 flips given $H_0$. Then $X$ is binomial with $n = 100$ and $p = p_0$. The p-value is

$$P\left(\left|X-100p_0\right| \geq \left|64-100p_0\right|\right) = P\left(X \leq 100p_0 - \left|64-100p_0\right|\right) + P\left(100p_0+\left|64-100p_0\right| \leq X\right).$$

For a 95% confidence interval, we need to find values of $p_0$ for which the p-value is at least 0.05.

```{r fig.height=2, fig.width=4}
pvalue = function(p0) 1 - pbinom(100*p0 + abs(64-100*p0), 100, p0) + pbinom(100*p0 - abs(64-100*p0), 100, p0)
ggplot() + geom_function(fun = pvalue, xlim = c(0.53,0.55))
```

```{r fig.height=2, fig.width=4}
ggplot() + geom_function(fun = pvalue, xlim = c(0.72,0.74))
```

With 95% confidence, the probability of heads for this type of randomizing is between 0.545 and 0.730.

**Approximation Approach:** Let $X_1, X_2, \ldots, X_{100}$ be the number of heads during flips $1, 2, \ldots, 100$ given $H_0$. Since the population is discrete with probability $1-p_0$ of obtaining $0$ and probability $p_0$ of obtaining $1$, the population is far from normally distributed. Nonetheless, $n$ may be sufficiently large to use the Central Limit Theorem.

```{r}
n = 100; mu.hat = 64/100; sigma.hat = sqrt(0.64*(1-0.64))
mu.hat + qnorm(c(0.025,0.975))*sigma.hat/sqrt(n)
```

The confidence interval obtained is similar to the exact answer because $n$ is large and the $p_0$ of interest are not too close to $0$ or $1$. This approach assumes we know the population standard deviation, but we have actually used the sample standard deviation as an estimate for the population standard deviation. To avoid this concern, our normal recourse is to use the Student-t distribution; however, the population (1 with probability $p_0$ and 0 with probability $1-p_0$) definitely does not have a normal distribution. On the other hand, $n$ is large enough that the Student-t is approximately normal, and so using Student-t yields a nearly identical result.

```{r}
n = 100; mu.hat = 64/100; sigma.hat = sqrt(0.64*(1-0.64))
mu.hat + qt(c(0.025,0.975), n-1)*sigma.hat/sqrt(n)
```
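Returning to the exact approach, the interval endpoints need not be read off the plots; base R's `uniroot` can locate them numerically (a sketch; the bracketing intervals are taken from the plots):

```{r}
uniroot(function(p0) pvalue(p0) - 0.05, c(0.53, 0.60))$root  # lower endpoint
uniroot(function(p0) pvalue(p0) - 0.05, c(0.68, 0.74))$root  # upper endpoint
```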
## Teacher Differences

Zumbrun surveyed 141 secondary mathematics teachers. She divided the teachers into two groups based on whether the teacher had ever taught a stand-alone high school statistics course. Each teacher was also asked several questions that had quantitative answers (sometimes via a 1-7 Likert-type scale). A summary of some of the results is given in the table below. For which questions were there statistically significant differences between the two groups? (Christina Zumbrun, Attitudes and Beliefs Towards Statistics: What are They and Why Do They Matter to All Teachers? Goshen College Science Speakers Presentation, April 4, 2018)

$$\begin{array}{lllll} \text{Have you taught a stand-alone high school statistics course?} & \text{No} & 112 & \text{Yes} & 27 \\ & \text{Mean} & \text{SD} & \text{Mean} & \text{SD} \\ \text{How good at mathematics are you?} & 6.33 & 0.64 & 6.52 & 0.51 \\ \text{How many years have you taught high school mathematics?} & 3.40 & 1.68 & 4.63 & 1.67 \\ \text{How many undergraduate statistics courses did you complete?} & 2.38 & 0.89 & 2.37 & 0.88 \\ \text{How good at statistics are you?} & 4.73 & 1.10 & 5.67 & 1.33 \\ \text{How confident are you that you can master introductory statistical content?} & 5.68 & 1.25 & 6.41 & 1.22 \\ \end{array}$$

Notice that for this scenario we have not been given the data, only various statistics. This is enough to calculate confidence intervals and, based on those confidence intervals, obtain hypothesis test conclusions.

Suppose $X_1, X_2, \ldots, X_{n_1}$ is a sample from population $X$ having mean $\mu_1$ and variance $\sigma_1^2$, suppose $Y_1, Y_2, \ldots, Y_{n_2}$ is a sample from population $Y$ having mean $\mu_2$ and variance $\sigma_2^2$, and suppose that the two samples are independent of each other. By our basic results, we know that $E\left[\bar{X}-\bar{Y}\right] = \mu_1 - \mu_2$ and $\text{V}\left[\bar{X}-\bar{Y}\right] = \sigma_1^2/n_1 + \sigma_2^2/n_2$. If the populations are normal, the sampling distribution of $\bar{X}-\bar{Y}$ is also normal. If the sample sizes are sufficiently large, the Central Limit Theorem implies that the sampling distribution of $\bar{X}-\bar{Y}$ is approximately normal. Under these conditions,

$\qquad P\left(-z(\alpha/2) \leq \dfrac{\left(\bar{X}-\bar{Y}\right)-\left(\mu_1-\mu_2\right)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \leq z(\alpha/2)\right) \approx 1-\alpha$

where $z(\alpha/2)$ is chosen so that $P(Z > z(\alpha/2)) = \alpha/2$ where $Z$ is standard normal. With some algebra, we obtain the confidence interval

$\qquad P\left(\left(\bar{X} - \bar{Y}\right)-z(\alpha /2)\sqrt{\sigma_1^2/n_1+\sigma_2^2/n_2} \leq \mu_1-\mu_2 \leq \left(\bar{X}-\bar{Y}\right) + z(\alpha/2)\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}\right) \approx 1- \alpha$.

This confidence interval can be used if the population variances are known or the sample sizes are large enough so that estimating the population standard deviations from the sample is okay.

```{r}
mean.diff.ciz = function(n1, m1, s1, n2, m2, s2, alpha) (m1-m2) + c(-1,1)*qnorm(1-alpha/2)*sqrt(s1^2/n1 + s2^2/n2)
mean.diff.ciz(112, 6.33, 0.64, 27, 6.52, 0.51, 0.05)
```

If the sample sizes are not sufficiently large but we can assume that the populations are normal with the same variances ($\sigma_1 = \sigma_2 = \sigma$), then a confidence interval can be calculated via a Student-t distribution.

```{r}
mean.diff.cit = function(n1, m1, s1, n2, m2, s2, alpha) (m1-m2) + c(-1,1)*qt(1-alpha/2, n1+n2-2) * sqrt((n1-1)*s1^2 + (n2-1)*s2^2)/sqrt(n1+n2-2)*sqrt(1/n1+1/n2)
mean.diff.cit(112, 6.33, 0.64, 27, 6.52, 0.51, 0.05)
```
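As a sanity check, this formula agrees with base R's `t.test` with `var.equal = TRUE` whenever raw data are available (a sketch with simulated data, since Zumbrun's raw survey responses are not available):

```{r}
set.seed(17)
x = rnorm(112, 6.33, 0.64)  # hypothetical raw responses for the "No" group
y = rnorm(27, 6.52, 0.51)   # hypothetical raw responses for the "Yes" group
mean.diff.cit(112, mean(x), sd(x), 27, mean(y), sd(y), 0.05)
t.test(x, y, var.equal = TRUE)$conf.int
```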
We can determine whether the variances are significantly different by computing a confidence interval for the ratio of the variances. This approach assumes that the populations are normal.

```{r}
var.ratio.cif = function(n1, s1, n2, s2, alpha) c(1/qf(1-alpha/2, n1-1, n2-1), qf(1-alpha/2, n2-1, n1-1)) * s1^2/s2^2
var.ratio.cif(112, 0.64, 27, 0.51, 0.05)
var.ratio.cif(112, 1.68, 27, 1.67, 0.05)
var.ratio.cif(112, 0.89, 27, 0.88, 0.05)
var.ratio.cif(112, 1.10, 27, 1.33, 0.05)
var.ratio.cif(112, 1.25, 27, 1.22, 0.05)
```

For the Zumbrun data, none of the variances are significantly different at the 0.05 level of significance (note that a two-sided test with a 0.05 level of significance uses a 95% confidence interval).

```{r}
mean.diff.cit(112, 6.33, 0.64, 27, 6.52, 0.51, 0.1)
mean.diff.cit(112, 3.40, 1.68, 27, 4.63, 1.67, 0.1)
mean.diff.cit(112, 2.38, 0.89, 27, 2.37, 0.88, 0.1)
mean.diff.cit(112, 4.73, 1.10, 27, 5.67, 1.33, 0.1)
mean.diff.cit(112, 5.68, 1.25, 27, 6.41, 1.22, 0.1)
```

For the Zumbrun data, the mean answers on three of the five questions are significantly higher for the teachers who had taught a stand-alone statistics course at the 0.05 level of significance (note that a one-sided test with a 0.05 level of significance uses a 90% two-sided confidence interval, equivalently a 95% one-sided confidence interval).

**Remark 1.** All of the above analysis assumes that the Likert scale used by respondents is an interval scale: respondents considered the difference between two successive numerical answers as being equally distant.

**Remark 2.** If we adopt a level of significance of $\alpha = 0.05$ and we ask two samples of respondents from the same population 100 questions, then we would expect to find significant differences between the two groups on about 5 of the 100 questions. As statisticians, our choice of $\alpha$ tells us how often we will incorrectly reject a null hypothesis in our career. Unlike theologians and mathematicians, who can imagine a career in which no errors are made, statisticians and experimental scientists know with near certainty that a certain percentage of their professional conclusions will be wrong! If a lab tests thousands of chemical combinations for some beneficial property, we expect many to show that beneficial property even if none have it. Labs guard against such false positives or Type I errors (rejecting a true null hypothesis) by effectively choosing much smaller levels of significance. Of course, this means that truly beneficial chemicals will not be discerned (often called Type II errors). A test that reduces the probability of a Type II error while keeping the probability of a Type I error fixed is considered to have greater power.
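Base R's `power.t.test` quantifies this trade-off for two-sample t tests. The numbers below are hypothetical, chosen only to illustrate the calculation:

```{r}
# Power to detect a half-point difference on the Likert scale with 30
# respondents per group and a common standard deviation of 1.25
power.t.test(n = 30, delta = 0.5, sd = 1.25, sig.level = 0.05)
```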
## Cricket Chirp Rates and Temperature Relationship

Consider the following data of the air temperature in degrees Fahrenheit (t) and cricket chirp rate in chirps per minute (c). What, if any, linear relationship exists between these two variables? How precise are predictions made by the model?

```{r}
crickets = tibble(
  t = c(46, 51, 54, 57, 59, 61, 63, 66, 68, 72),
  c = c(40, 55, 72, 77, 90, 96, 99, 113, 127, 132))
```

**Answer:** The first decision to make is whether to think of (1) chirp rate as a function of temperature, or (2) temperature as a function of chirp rate. Since it seems more reasonable to think that temperature changes cause chirp rate changes rather than chirp rate changes cause temperature changes, (1) seems more natural. On the other hand, we might be interested in predicting temperature by listening to crickets chirp, in which case (2) would be more natural. So, let us do the analysis both ways.

We have used the following code previously to display a scatter plot with the regression line, which is the line that minimizes the sum of the squares of the residuals (the differences of the dependent variable values of the data and the corresponding predictions given by the line).

```{r}
ggplot(crickets, aes(x = t, y = c)) + 
  geom_point() + 
  geom_smooth(method = "lm")
```

The graph shows us that a linear equation is a good model for the data. The following code finds important statistics.

```{r}
model1 = lm(c ~ t, data = crickets)
model1$coefficients
t0 = unname(-model1$coefficients[1]/model1$coefficients[2]); t0
n = length(crickets$c)
sse1 = sum((model1$residuals)^2)
sst1 = sum((crickets$c - mean(crickets$c))^2)
rsq1 = 1 - sse1/sst1; rsq1
se1 = sqrt(sse1/(n-2)); se1
```

Let $c$ be the chirp rate in chirps per minute and $t$ be the temperature in degrees Fahrenheit. The first regression line can be expressed by the equation $c = -131 + 3.7t$ or $c = 3.7(t - 35)$, has a coefficient of determination (or R-squared) of 0.988, and has a standard error of 3.49. One interpretation is that crickets only begin to chirp at a temperature of 35 degrees Fahrenheit, an increase of 1 degree Fahrenheit is associated with a 3.7 chirps per minute increase in the chirp rate, the model explains 99% of the variation in the chirp rate through its dependence on the temperature, and model predictions deviate from true values by 3.5 chirps per minute on average.

```{r}
model2 = lm(t ~ c, data = crickets)
model2$coefficients
cmean = mean(crickets$c); cmean
tmean = mean(crickets$t); tmean
n = length(crickets$t)
sse2 = sum((model2$residuals)^2)
sst2 = sum((crickets$t - mean(crickets$t))^2)
rsq2 = 1 - sse2/sst2; rsq2
se2 = sqrt(sse2/(n-2)); se2
```

The second regression line can be expressed by the equation $t = 35.7 + 0.27c$ or $t = 59.7 + 0.27(c - 90.1)$, has a coefficient of determination (or R-squared) of 0.988, and has a standard error of 0.94. One interpretation is that on average the crickets were chirping 90 chirps per minute and the temperature was 60 degrees Fahrenheit, an increase of 1 chirp per minute is associated with a 0.27 degree Fahrenheit increase in the temperature, the model explains 99% of the variation in the temperature through its dependence on the chirp rate, and model predictions deviate from true values by about 1 degree Fahrenheit on average.

**Remark.** Note that the product of the slopes equals the coefficient of determination: 3.703003 * 0.2667659 = 0.9878348. This is not a coincidence. Recall that the regression equation is

$\qquad y - \bar{y} = r\dfrac{s_y}{s_x}(x - \bar{x})$

where

$\qquad r = \dfrac{1}{n-1} \displaystyle\sum_{i=1}^n \dfrac{x_i-\bar{x}}{s_x} \dfrac{y_i-\bar{y}}{s_y}$

can be seen to be symmetric in $x$ and $y$. Thus, the product of the two regression line slopes is

$\qquad r\dfrac{s_y}{s_x} \cdot r\dfrac{s_x}{s_y} = r^2.$

This should make it clear that to find the regression line when the variables are interchanged, one cannot simply solve the first regression equation for the other variable.
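The identity is easy to verify numerically with base R's `cor`:

```{r}
# The squared correlation equals both models' coefficients of determination
r = cor(crickets$t, crickets$c)
c(r^2, rsq1, rsq2)
```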
**Inferential Answer.** Inferential information can also be obtained from the model. It is assumed that the chirp rate $Y$ is related to a temperature $x$ via the equation $Y = \beta_0 + \beta_1 x + \sigma Z$, where $\beta_0$, $\beta_1$, and $\sigma$ are population parameters and $Z$ is a standard normal random variable. The `summary` function provides inferential information about the population parameters.

```{r}
summary(model1)
```

Three p-values are reported. With a p-value of $3.90 \times 10^{-7}$, $\beta_0 \ne 0$. With a p-value of $6.02 \times 10^{-9}$, $\beta_1 \ne 0$. With a p-value of $6.02 \times 10^{-9}$, the model is significantly different from a constant function (which in this case is equivalent to $\beta_1 \ne 0$).

Two confidence intervals can be obtained from the coefficient standard errors. With 68% confidence, $\beta_0 = -131.0 \pm 8.7$ and $\beta_1 = 3.70 \pm 0.15$. Of course, we can obtain approximate 95% confidence intervals by multiplying the standard errors by 1.96.

The `predict` function provides inferential information about predictions and mean responses.

```{r}
new.data = tibble(t = 60)
predict(model1, newdata = new.data, interval = 'prediction')
predict(model1, newdata = new.data, interval = 'confidence')
```

With 95% confidence (this is the default value, which can be changed with the `level` parameter), if the temperature is 60 degrees Fahrenheit, the chirp rate is between 83 and 100 chirps per minute and the mean chirp rate is between 89 and 94 chirps per minute.

The following code superimposes the prediction (in red) and mean response (in green) 95% confidence intervals on the scatter (in black) and regression line (in blue) plot. Observe that the gray shading that comes from the `geom_smooth` default `se = TRUE` is the 95% confidence interval for the mean response.

```{r}
inferences = tibble(
  t = seq(45, 74),
  pred.lwr = predict(model1, newdata = tibble(t), interval='prediction')[,'lwr'],
  pred.upr = predict(model1, newdata = tibble(t), interval='prediction')[,'upr'],
  conf.lwr = predict(model1, newdata = tibble(t), interval='confidence')[,'lwr'],
  conf.upr = predict(model1, newdata = tibble(t), interval='confidence')[,'upr']
)
ggplot(crickets, aes(x = t, y = c)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  geom_line(aes(x = t, y = pred.lwr), inferences, color = "red") + 
  geom_line(aes(x = t, y = pred.upr), inferences, color = "red") + 
  geom_line(aes(x = t, y = conf.lwr), inferences, color = "green") + 
  geom_line(aes(x = t, y = conf.upr), inferences, color = "green")
```

Here is how to explore the quadratic model $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \sigma Z$.

```{r}
ggplot(crickets, aes(x = t, y = c)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ x + I(x^2))
model3 = lm(c ~ t + I(t^2), data = crickets)
summary(model3)
```

Now none of the coefficients are significantly different from zero, although the model is significantly different from a constant function. The addition of the quadratic term has not significantly improved the model.
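A partial F test on the nested models says the same thing (a quick check with base R's `anova`):

```{r}
# Does adding the quadratic term significantly reduce the residual sum of squares?
anova(model1, model3)
```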
## GapMinder

In February 2006, a Swedish physician and data advocate named Hans Rosling gave a TED talk titled ["The best stats you've ever seen"](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen) where he presented global economic, health, and development data from the website [gapminder.org](http://www.gapminder.org/tools/#_locale_id=en;&chart-type=bubbles). Describe `lifeExp` (life expectancy in years) as a function of `continent` (note that "Americas" includes countries in both North and South America and that Antarctica is excluded), `pop` (number of people living in the country), and/or `gdpPercap` (gross domestic product per capita in US dollars).

```{r}
gapminder_2007 <- gapminder %>% 
  filter(year == 2007) %>% 
  select(-year)
gapminder_2007 %>% slice_sample(n = 5)
```

The following is a plot showing all of these variables. This is what is shown in the video, but dynamically for each year.

```{r}
ggplot(gapminder_2007, aes(y = lifeExp, x = pop, size = gdpPercap, color = continent)) + 
  geom_point() + 
  scale_x_continuous(trans='log10')
```

**Life expectancy as a function of continent.**

```{r}
ggplot(gapminder_2007, aes(x = continent, y = lifeExp)) + geom_boxplot()
```

```{r}
model1 = lm(lifeExp ~ continent, data = gapminder_2007)
summary(model1)
```

The model is Life expectancy = 54.8 + 18.8I~Americas~ + 15.9I~Asia~ + 22.8I~Europe~ + 25.9I~Oceania~. Already 63.5% of the variation in life expectancy is associated with the continent in which the country is located. Rather than having to recognize that the model says that the mean life expectancy is 54.8 years in Africa, 54.8 + 18.8 = 73.6 years in the Americas, etc., we can have R do the calculations by removing the intercept term.

```{r}
model2 = lm(lifeExp ~ 0 + continent, data = gapminder_2007)
summary(model2)
```

Recall that for models with no intercept, the errors are computed about 0 rather than the mean of the response data. So, the reported coefficient of determination is not meaningful in this context.

**Life expectancy as a function of population and continent.** It is useful to first note that some countries have very large populations in comparison with most of the countries.

```{r}
ggplot(gapminder_2007, aes(x = pop)) + geom_histogram()
```

This suggests that the logarithm of the population be plotted on the horizontal axis, and perhaps even that the explanatory variable itself should be the logarithm of the population.

```{r}
ggplot(gapminder_2007, aes(x = pop, y = lifeExp, color = continent)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ log10(x)) + 
  scale_x_continuous(trans='log10')
```

Notice that in the following model, all of the lines have the same slope.

```{r}
model3 = lm(lifeExp ~ continent + log10(pop), data = gapminder_2007)
summary(model3)
```

The model used by `geom_smooth` follows.

```{r}
model4 = lm(lifeExp ~ continent * log10(pop), data = gapminder_2007)
summary(model4)
```

The addition of population to continent provides no discernible increase in the explanatory power of a model to explain life expectancy.

**Life expectancy as a function of GDP per capita and continent.**

```{r}
ggplot(gapminder_2007, aes(y = lifeExp, x = gdpPercap, color = continent)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ log10(x)) + 
  scale_x_continuous(trans='log10')
```

GDP per capita adds some explanatory power to the model.

```{r}
model5 = lm(lifeExp ~ continent + gdpPercap, data = gapminder_2007)
summary(model5)
```

The interaction terms do not add much explanatory power to the model.

```{r}
model6 = lm(lifeExp ~ continent * gdpPercap, data = gapminder_2007)
summary(model6)
```
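A partial F test on the nested models makes "not much explanatory power" precise:

```{r}
# Do the continent-by-GDP interaction terms significantly improve the fit?
anova(model5, model6)
```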
```{r} swim = read_csv("swim.csv") ggplot(swim, aes(x = year, y = time, color = sex)) + geom_point() + geom_smooth(method = "lm", se = FALSE) ``` The trends of the data are clearly nonlinear, and linear models do not make sense theoretically (e.g., record times could never become negative). A simple model that is decreasing and has a positive lower bound is of the form $\qquad y = \dfrac{a}{x-b} + c$ where $x$ is the year and $y$ is the time. The parameter $c$ corresponds to the limiting lower bound and $b$ would be a vertical asymptote occuring before the start of our data. Observe that the parameters are not linear in the formula, and so the `nls` (nonlinear least squared) function must be used instead of the `lm` (linear model) function. While linear models have explicit formulas for finding the parameter values (think of generalizing what we did at the beginning of the semester), nonlinear models must use numerical approximations which involve educated guesses and checks (think of Newton's method for finding the roots of a function). The `nls` function requires the user to provide an initial guess for the `start` argument. ```{r} f.model = nls(time ~ a/(year-b) + c, data = swim |> filter(sex == "F"), start = list(a = 450, b = 1890, c = 45)); f.model m.model = nls(time ~ a/(year-b) + c, data = swim |> filter(sex == "M"), start = list(a = 200, b = 1890, c = 45)); m.model ``` ```{r} pred = tibble( year = seq(1900, 2020), f.time = predict(f.model, newdata = tibble(year)), m.time = predict(m.model, newdata = tibble(year)) ) ggplot(swim, aes(x = year, y = time, color = sex)) + geom_point() + geom_line(aes(x = year, y = f.time), data = pred, color = "red") + geom_line(aes(x = year, y = m.time), data = pred, color = "blue") ``` The model curves look reasonable where the data are located; however, the limiting record of 24.35 seconds for males seems totally unreasonable while the limiting record of 45.7 seconds for females seems only somewhat incredible. We should look for other more reasonable models. We will not pursue this further in this course but note that confidence intervals can be found for the population parameters: ```{r} summary(f.model) ``` Predictions can be made (but without confidence intervals): ```{r} predict(f.model, newdata = tibble(year = 2024)) ``` The coefficient of determination can be calculated (alse SSE and model standard error which were also provided above): ```{r} f.swim = swim |> filter(sex == "F") n = length(f.swim$year); n sse = sum(residuals(f.model)^2); sse sst = sum((f.swim$time - mean(f.swim$time))^2) rsq = 1 - sse/sst; rsq se = sqrt(sse/(n-3)); se ``` Notice that in computing the model standard error, we have divided by $n - 3$ because three model parameters were estimated from the data. ## Coal Differences Random samples of coal were taken from two mines and the heat-producing capacity (in millions of calories per ton) for each specimen was obtained. (This data is from Richard A. Johnson, *Miller & Freund's Probability and Statistics for Engineers*.) ```{r} mine1 = c(8260, 8130, 8350, 8070, 8340) mine2 = c(7950, 7890, 7900, 8140, 7920, 7840) ``` What is the difference between the mean heat-producing capacities of the two mines? **Data Wrangling.** The tidyverse recommends that the data always be placed in a tibble, and here a case would be a coal chunk. 
## Coal Differences

Random samples of coal were taken from two mines and the heat-producing capacity (in millions of calories per ton) for each specimen was obtained. (This data is from Richard A. Johnson, *Miller & Freund's Probability and Statistics for Engineers*.)

```{r}
mine1 = c(8260, 8130, 8350, 8070, 8340)
mine2 = c(7950, 7890, 7900, 8140, 7920, 7840)
```

What is the difference between the mean heat-producing capacities of the two mines?

**Data Wrangling.** The tidyverse recommends that the data always be placed in a tibble, and here a case would be a coal specimen.

```{r}
coal = tibble(
  mine = c(rep(1, length(mine1)), rep(2, length(mine2))),
  heat = c(mine1, mine2)
)
```

**Description Answer.** Certainly the difference in the sample means would be a best estimate for the difference in the population means, and a good visual for such a small data set is side-by-side boxplots.

```{r}
coal |> group_by(mine) |> summarise(mean = mean(heat)) |> summarise(diff = diff(mean))
ggplot(coal, aes(x = as.factor(mine), y = heat)) + geom_boxplot()
```

**Resampling Inferential Answer.** Use the data as the best estimate of the population.

```{r}
stats = replicate(10000, mean(sample(mine1, length(mine1), replace=TRUE)) - mean(sample(mine2, length(mine2), replace=TRUE)))
quantile(stats, c(0.025,0.975))
```

With 95% confidence, the difference between the mean heat-producing capacities of mines 1 and 2 is between 160 and 410 millions of calories per ton.

**Classical Inferential Answer.** Assume the populations are normal and use a Student-t statistic. We first ask whether it is reasonable to posit that the distributions are normal. The following calculations show that the answer is "yes."

```{r}
shapiro.test(mine1)
shapiro.test(mine2)
```

We next examine whether the standard deviations can be considered equal.

```{r}
var.ratio.fci = function(n1, s1, n2, s2, alpha) c(1/qf(1-alpha/2, n1-1, n2-1), qf(1-alpha/2, n2-1, n1-1)) * s1^2/s2^2
n1 = length(mine1)
s1 = sd(mine1)
n2 = length(mine2)
s2 = sd(mine2)
var.ratio.fci(n1, s1, n2, s2, 0.05)
```

Here is the R black box function.

```{r}
var.test(mine1, mine2)
```

Since the distributions are not significantly different from normal and the variances are not significantly different, we may be justified in using the same confidence interval used in the *Teacher Differences* section.

```{r}
mean.diff.cit = function(n1, m1, s1, n2, m2, s2, alpha) (m1-m2) + c(-1,1)*qt(1-alpha/2, n1+n2-2) * sqrt((n1-1)*s1^2 + (n2-1)*s2^2)/sqrt(n1+n2-2)*sqrt(1/n1+1/n2)
n1 = length(mine1)
m1 = mean(mine1)
s1 = sd(mine1)
n2 = length(mine2)
m2 = mean(mine2)
s2 = sd(mine2)
mean.diff.cit(n1, m1, s1, n2, m2, s2, 0.05)
```

R has a black box test and confidence interval for the difference of two means.

```{r}
t.test(mine1, mine2)
```

Notice that the confidence interval is wider because the default behavior is to not assume the variances are equal and to use a more sophisticated approach. We can obtain the previous confidence interval by adding the parameter `var.equal=TRUE`. With 95% confidence, the difference between the mean heat-producing capacities of mines 1 and 2 is between 130 and 450 millions of calories per ton.
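A permutation test offers a third route to the same question: if the mines were interchangeable, randomly relabeling the specimens should rarely produce a difference in means as large as the one observed (a brief sketch):

```{r}
set.seed(17)
obs.diff = mean(mine1) - mean(mine2)
perm.diffs = replicate(10000, {
  shuffled = sample(coal$heat)                  # shuffle all 11 specimens
  mean(shuffled[1:5]) - mean(shuffled[6:11])    # relabel 5 as mine 1, 6 as mine 2
})
mean(abs(perm.diffs) >= abs(obs.diff))          # two-sided p-value
```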
## Facility Usage Differences

A student who used the college recreational facilities was interested in whether there is a difference between the facilities used by men and women. A randomly selected sample of users is summarized in the table. Note that I seem to have forgotten the labels for the table (men vs. women, and facility A vs. facility B), but this actually does not matter for a test of independence.

$$\begin{array}{ll} 51 & 30 \\ 43 & 48 \\ \end{array}$$

**Answer.** The table with row and column totals is as follows:

$$\begin{array}{lrrr} \text{Observed} & \text{C} & \text{D} & \text{Total} \\ \text{E} & 51 & 30 & 81 \\ \text{F} & 43 & 48 & 91 \\ \text{Total} & 94 & 78 & 172 \end{array}$$

The row and column sums determine the expected number in each cell under the assumption that the two variables are independent. For example, the expected number of EC students would be $(172)(81/172)(94/172) \approx 44.27$.

$$\begin{array}{lrrr} \text{Expected} & \text{C} & \text{D} & \text{Total} \\ \text{E} & 44.27 & 36.73 & 81.00 \\ \text{F} & 49.73 & 41.27 & 91.00 \\ \text{Total} & 94.00 & 78.00 & 172.00 \end{array}$$

The observed and expected frequencies are different in each cell. The inferential question is whether such a difference could have been due to chance alone. We need a statistic that measures the difference between the expectation ($e_i$) and observation ($o_i$) in the $i$th cell. The classical statistic is

$\qquad \chi_{(r-1)(c-1)}^2 = \displaystyle\sum_{i=1}^{rc} \dfrac{(o_i-e_i)^2}{e_i}$

where there are $r$ rows and $c$ columns. If the expected value of each cell is at least 5, then the chi-square statistic has approximately a chi-square distribution with $(r-1)(c-1)$ degrees of freedom. For our example,

```{r}
observed = matrix(c(51, 30, 43, 48), nrow=2, byrow=TRUE); observed
expected = rowSums(observed) %o% colSums(observed) / sum(observed); expected
chisquare = sum((observed-expected)^2/expected); chisquare
1 - pchisq(chisquare, 1)
```

With a p-value of 0.04, sex and college recreational facility used are dependent variables.

**Simulation Approach.**

```{r fig.height=3,fig.width=5}
nruns = 10000
stats = numeric(nruns)
for (i in 1:nruns) {
  # sample row and column labels independently using the observed margins
  scol = sample(3:4, 172, prob=c(94,78), replace=TRUE)
  srow = sample(5:6, 172, prob=c(81,91), replace=TRUE)
  obs = table(data.frame(srow,scol))
  exp = rowSums(obs) %o% colSums(obs) / sum(obs)
  stats[i] = sum((obs-exp)^2/exp)
}
pvalue = sum(stats >= chisquare) / nruns; pvalue
```

**R Black Box.** The first function call mimics our approach above. The second applies the Yates continuity correction for 2 by 2 tables.

```{r}
chisq.test(observed, correct=FALSE)
chisq.test(observed)
```
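For a 2 by 2 table, base R also offers Fisher's exact test, which avoids the chi-square approximation altogether:

```{r}
fisher.test(observed)
```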
## MIT Suicides

The number of days between MIT student suicides from 10/8/64 to 6/15/91: 38, 336, 151, 444, 868, 269, 978, 27, 30, 9, 428, 366, 138, 52, 622, 169, 1772, 295, 204, 696, 139, 16, 347, 1, 19, 169, 7, 61, 0, 852, 243, 0. (Data obtained from Elaine Chew and Philip Greenspun, Is Suicide at MIT a Poisson Process?, document found at http://philip.greenspun.com/research/suicide-at-mit.pdf.) Is the population exponentially distributed?

**Answer.** Estimate the mean number of suicides per day.

```{r}
MIT = tibble(suicides = c(38, 336, 151, 444, 868, 269, 978, 27, 30, 9, 428, 366, 138, 52, 622, 169, 1772, 295, 204, 696, 139, 16, 347, 1, 19, 169, 7, 61, 0, 852, 243, 0))
n = length(MIT$suicides); n
lambda.hat = 1/mean(MIT$suicides); lambda.hat
```

Compare the data with an exponential distribution via a qq plot.

```{r}
ggplot(MIT, aes(sample = suicides)) + 
  stat_qq(distribution = function(p) qexp(p, lambda.hat)) + 
  geom_abline(slope = 1, intercept = 0, color = "red")
```

Compare the sample and claimed population cdfs.

```{r}
ggplot(MIT, aes(x = suicides)) + 
  stat_ecdf() + 
  geom_function(fun = function(x) pexp(x, lambda.hat), color = "red")
```

The difference between the sample and claimed population cdfs can be summarized with the Anderson-Darling statistic with the exponential parameter estimated from the data. Since there are some 0 data values (which result in an infinite Anderson-Darling statistic), I added 1 to each time interval. I am not sure that this is totally valid, but it was an attempt to increase the p-value. So, the fact that the p-value was still small bolsters the inferential conclusion.

```{r}
n = length(MIT$suicides)
lambda.hat = 1/mean(MIT$suicides+1)
y = sort(MIT$suicides+1)
AD = -n - sum((2*(1:n)-1)/n*(log(pexp(y, lambda.hat)) + log(1 - pexp(rev(y), lambda.hat)))); AD
```

Is this a reasonable value for the Anderson-Darling statistic? To find out, we simulate samples from the claimed distribution, calculate their Anderson-Darling statistics, and determine the proportion of them at least as great as the Anderson-Darling statistic for the data.

```{r}
set.seed(17)
runs = 1000
ad = numeric(runs)
for (i in 1:runs) {
  sample = rexp(n, lambda.hat)
  slambda.hat = 1/mean(sample+1)
  y = sort(sample+1)
  ad[i] = -n - sum((2*(1:n)-1)/n*(log(pexp(y, slambda.hat)) + log(1 - pexp(rev(y), slambda.hat))))
}
ggplot(tibble(ad = ad), aes(x = ad)) + 
  geom_density() + 
  geom_vline(xintercept = AD, color = "red")
prob = length(which(ad >= AD)) / runs; prob
```

```{r}
goftest::ad.test(MIT$suicides, "pexp", lambda.hat, estimated = TRUE)
```

```{r}
goftest::ad.test(MIT$suicides+1, "pexp", 1/(1/lambda.hat+1), estimated = TRUE)
```

With a p-value of 0.004 using the Anderson-Darling statistic, the number of days between MIT student suicides is not exponentially distributed. Based on the graphs, short time intervals between suicides occur disproportionately often compared with what an exponential model would predict. Perhaps a recent suicide tends to encourage another suicide (but remember that statistics only suggests association, not causation).