---
title: "A04 (135 points)"
author: "TYPE YOUR NAME HERE"
date: "TYPE DATE HERE"
format: docx
editor: visual
---

```{r}
#| message: false
#| warning: false
#| echo: false
library(tidyverse)
```

## Instructions

Complete each exercise either in this qmd file or on paper. Include your name as the author and the date completed in the YAML code at the top of this file, and replace the first sentence in the Acknowledgements section as directed. Any parts completed on paper should be (a) handed to David Housman, (b) placed in SC 117, or (c) scanned into a pdf file. This qmd file should be rendered to an html, docx, or pdf file. Zip together all relevant files: Rproj, qmd, rendered file, any data or image files, and (optionally) the pdf file containing your answers completed on paper. Upload the zip file in Moodle. Points will be taken off if these instructions are not followed.

## Acknowledgements

Replace this sentence with either (1) an acknowledgment of any person who gave you assistance and/or any resource that was used, or (2) a statement that you did not use any outside assistance. By submitting this assignment, the author attests to abiding by the *Collaboration and Academic Integrity* policy stated in the course syllabus.

## Exercise 1 (30 points)

Consider the data (1, 2), (2, 4), (3, 6), and (4, 10). Calculate “by hand” (showing the calculation steps) the following quantities:

a. (4 points) mean and standard deviation of the x data,
b. (4 points) mean and standard deviation of the y data,
c. (6 points) correlation coefficient and the coefficient of determination calculated from the correlation coefficient,
d. (4 points) regression equation,
e. (8 points) SSM, SSE, SST, and the coefficient of determination calculated from other numbers in this part, and
f. (2 points) the standard error.
g. (2 points) Verify that $\text{SST} = \text{SSM} + \text{SSE}$ for this data.

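Hand calculations of this kind can be cross-checked in R. The chunk below is a minimal sketch on made-up data: the tibble `toy` and its values are illustrative assumptions, not the exercise data, and the standard error uses the $n - 2$ definition, which should be checked against the definition given in class. It computes the summary statistics, the least-squares line, and the sums of squares, then compares them with `lm`.

```{r}
# Illustrative data only; these values are hypothetical, not the exercise data.
toy <- tibble::tibble(x = c(1, 2, 3, 4, 5), y = c(3, 5, 4, 8, 9))

# Sample means and standard deviations (sd() uses the n - 1 denominator).
x_bar <- mean(toy$x)
s_x <- sd(toy$x)
y_bar <- mean(toy$y)
s_y <- sd(toy$y)

# Correlation and the coefficient of determination obtained from it.
r <- cor(toy$x, toy$y)
r_sq_from_r <- r^2

# Least-squares slope and intercept from the summary statistics.
m_hat <- r * s_y / s_x
b_hat <- y_bar - m_hat * x_bar

# Sums of squares based on the fitted values.
y_hat <- b_hat + m_hat * toy$x
SSM <- sum((y_hat - y_bar)^2)
SSE <- sum((toy$y - y_hat)^2)
SST <- sum((toy$y - y_bar)^2)
r_sq_from_ss <- SSM / SST

# Standard error of the regression, assuming the n - 2 definition
# (an assumption; use the definition from class if it differs).
std_err <- sqrt(SSE / (nrow(toy) - 2))

# Cross-check against lm(); each comparison should return TRUE.
fit <- lm(y ~ x, data = toy)
all.equal(unname(coef(fit)), c(b_hat, m_hat))
all.equal(SST, SSM + SSE)
all.equal(r_sq_from_r, r_sq_from_ss)
```

The three `all.equal` calls compare within numerical tolerance, which is the appropriate check for floating-point sums of squares.
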
## Exercise 2 (40 points)

In class, we considered the problem of finding the best fit equation of the form $y = b + mx$ to data $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$ by minimizing $f(b,m) = \sum_{i=1}^{n} (y_i - b - mx_i)^2$. Sometimes we know that our best fit line should go through the origin, and so we consider the problem of finding the best fit equation of the form $y = mx$ to data $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$ by minimizing $g(m) = \sum_{i=1}^{n} (y_i - mx_i)^2$.

a. (8 points) Find $\hat{m}$ that minimizes $g(m) = \sum_{i=1}^{n} (y_i - mx_i)^2$. Be sure to verify that you have found a minimum. If you are having difficulty doing this in the general case, first look at the $n = 3$ case explicitly.
b. (4 points) Use your formula to find the best fit equation $y = \hat{m}x$ to the data (1, 2), (2, 4), (3, 6), and (4, 10).
c. (8 points) For the data, find the SSM, SSE, SST, and coefficient of determination. Note that since we are forcing the model to go through the point $(0, 0)$ rather than $(\bar{x}, \bar{y})$, the definition of SST is modified to use $0$ instead of $\bar{y}$ and the definition of SSM is modified to use $0$ instead of $\bar{\hat{y}}$.
d. (8 points) Using at least half a page of graph paper, create a scatter plot of the data, the best fit line obtained in part (b), and the best fit line obtained in Exercise 1.
e. (4 points) Prove that $\text{SST} = \text{SSM} + \text{SSE}$ in general for this one-parameter model.
f. (8 points) Create a tibble and use the `lm` function to check your results in parts (b) and (c).

## Exercise 3 (10 points)

Use an R code chunk to find the linear model for the bogus data that minimizes the maximum error. Explicitly state the model that was found.

## Exercise 4 (20 points)

*A04Pines.csv* contains measurements of eight Ponderosa pine trees. Column `circum` is the circumference of the trunk at waist height in inches and column `lumber` is the amount of usable lumber obtained in board feet.

a. (2 points) Read this data into a name of your choice.
b. (8 points) Create an appropriately labeled scatter plot of `lumber` as a function of `circum` and overlay best fit models of the form $y = ax^2$ and $y = cx^3$.
c. (6 points) Use the `lm` and associated functions to find the best fit parameter values and coefficients of determination for the two models.
d. (4 points) State which model is the better one of the two. Explain why.

## Exercise 5 (35 points)

a. (5 points) Set the random seed. Obtain a random sample of 5 cases of the `penguins` data set. Interpret the first displayed case. It may be helpful to look at the corresponding documentation in the "Help" tab.
b. (5 points) Save to a different name the `penguins` data with a new column `adelie` that is 1 for Adelie and 0 for other species. Keep only this new column and the `bill_len`, `flipper_len`, and `body_mass` columns. Omit any rows that are missing data in the selected columns. State the number of cases that are included in this new data frame.
c. (5 points) Compute the AIC for all non-interactive and interactive models of the form `adelie ~` one or two of the three quantitative variables.
d. (5 points) Which of the models considered is the best? Is the best model significantly better than all of the other models?
e. (5 points) Obtain and visualize the best logistic regression model for `adelie` using a single quantitative variable.
f. (3 points) Find the value of the explanatory variable that results in a 50% chance that the penguin is an Adelie.
g. (3 points) State what a one-unit increase means for the explanatory variable and how much the odds change for a one-unit increase in the explanatory variable.
h. (4 points) Find McFadden's pseudo-$R^2$ and provide an interpretation.
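
The logistic regression quantities named in this exercise (AIC, the 50% point, the change in odds, and McFadden's pseudo-$R^2$) can be sketched on simulated data. Everything in the chunk below is an illustrative assumption: the data frame `sim`, its columns `x` and `y`, and the simulation parameters are hypothetical and unrelated to the penguin data. The sketch assumes a single quantitative predictor and a 0/1 response.

```{r}
# Simulated data only; names and parameter values are hypothetical.
set.seed(1)
n <- 200
sim <- data.frame(x = rnorm(n, mean = 50, sd = 10))
sim$y <- rbinom(n, size = 1, prob = plogis(-10 + 0.2 * sim$x))

# Logistic regression on one predictor, plus the intercept-only (null) model.
fit <- glm(y ~ x, family = binomial, data = sim)
null <- glm(y ~ 1, family = binomial, data = sim)

# AIC comparison (smaller is better).
AIC(null, fit)

# Value of x where the fitted probability is 0.5: solve b0 + b1 * x = 0.
b <- coef(fit)
x_half <- unname(-b[1] / b[2])

# Multiplicative change in the odds for a one-unit increase in x.
odds_ratio <- unname(exp(b[2]))

# McFadden's pseudo-R^2: 1 - logLik(model) / logLik(null model).
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))

c(x_half = x_half, odds_ratio = odds_ratio, mcfadden = mcfadden)
```

Because the log-odds are linear in the predictor, the 50% point is where the linear predictor equals zero, and exponentiating the slope gives the factor by which the odds are multiplied per unit increase.
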