---
title: "C03-C04 Single Variable Description"
author: "David Housman"
format: html
editor: visual
fig-width: 6
fig-asp: 0.618
out-width: "70%"
fig-align: center
---

## Qualitative Data

Qualitative variables are either nominal (values have no numeric meaning, e.g., gender and religious affiliation) or ordinal (values have a meaningful order, e.g., level of excitement). For a single qualitative variable, a **bar chart** displays the distribution of data by assigning a bar to each value (or set of values) with a height corresponding to the value's frequency or relative frequency. For a nominal variable, the values are usually arranged in order of frequency. For an ordinal variable, the values are arranged in their natural order along the horizontal axis. The **relative frequencies** or **proportions** for each value are the standard statistics. Means and other quantitative variable statistics should not be computed because equal numerical differences may not correspond to equal intensity differences.

#### By Hand Example

With the data from our previous class of primary majors (MA, CS, PH, MA, EN, CS, CS, CS, CS, PH), create a frequency bar chart and find the proportion of mathematics majors.

## Quantitative Data

Quantitative variables have numeric values that have numeric meaning. For an interval scale, equal differences correspond to equal intensity changes (e.g., grades). A ratio scale adds that doubling is meaningful (e.g., heights and weights), and an absolute scale has no units (e.g., number of siblings). A **histogram** displays the distribution of data by assigning a bar to each value (or range of values) with a height corresponding to the value's frequency, relative frequency, or density. A **density plot** smooths a histogram by convolution with a normal probability density.
A **box plot** shows the first and third quartiles as the edges of a box, the median as a line within the box, the most extreme values at or within the inner fences (1.5 times the interquartile range beyond either edge of the box) as the ends of whiskers from the box, and outliers as individual points outside the inner fences. An **empirical cumulative distribution function plot** plots the relative cumulative frequency of data on the vertical axis for each real number value on the horizontal axis. For all visuals, the horizontal axis should be appropriately scaled (usually linear but sometimes logarithmic).

For the following statistics, we assume that there are $n$ data $x_1, x_2, \ldots, x_n$ satisfying $x_1 \leq x_2 \leq \cdots \leq x_n$.

#### The center of the data can be reported by any of the following:

-   **mean**: the sum of the data divided by the number of data\
    $$\hat{\mu} = \bar{x} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n} x_i.$$
-   **median**: the middle value\
    $$\tilde{x} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ (x_{n/2} + x_{n/2+1})/2 & \text{if } n \text{ is even} \end{cases}.$$
-   **midrange**: the average of the extremes\
    $$\text{MR} = \frac12 (x_1 + x_n).$$

#### The spread of the data can be reported by any of the following:

-   **variance**: an adjusted average of squared deviations from the mean\
    $$s^2 = \frac{1}{n-1}\displaystyle\sum_{i=1}^{n} (x_i - \bar{x})^2$$
-   **standard deviation**: the square root of the variance\
    $$s = \sqrt{s^2}$$
-   **average deviation from the median**: the average of the absolute deviations from the median\
    $$\text{AD} = \frac{1}{n}\displaystyle\sum_{i=1}^{n} |x_i-\tilde{x}|$$
-   **range**: the difference between the extremes\
    $$\text{range} = x_n - x_1.$$
-   **interquartile range**: the difference between the third and first quartiles\
    $$\text{IQR} = q_{0.75} - q_{0.25}$$
-   $100(1-\alpha)$% **coverage interval**: the two values containing the middle $100(1-\alpha)$ percent of the data\
    $$[q_{\alpha/2}, q_{1-\alpha/2}]$$

#### The location of a value $x$ within the data can be reported by either of the following:

-   **z-score**: the number of standard deviations away from the mean\
    $$z = \dfrac{x-\bar{x}}{s}$$
-   **cumulative proportion**: the relative frequency of data no greater than the value

#### A pseudo-inverse of the cumulative proportion:

-   **quantile**: the value for which the relative frequency of data no greater than it is as specified.

There is much disagreement over the proper definition of a quantile. Suppose $q$ is a number between 0 and 1. Because the data divide the real line into $n+1$ segments, Minitab and SPSS (by default) calculate the $q$^th^ quantile (or $100q$^th^ percentile) as the $(n+1)q$^th^ datum counting from 1. This approach is problematic for sufficiently small or large values of $q$, for which $(n+1)q$ falls below 1 or above $n$. R (with type 6) chooses the minimum datum value for sufficiently small values of $q$ and the maximum datum value for sufficiently large values of $q$. Motivated by a desire to avoid this problem, R (with its default type 7) calculates the $q$^th^ quantile (or $100q$^th^ percentile) as the $(n-1)q$^th^ datum counting from 0.

#### Relationships

There are a variety of relationships among the quantitative visuals and statistics. The mean is the balance point for the histogram. The standard deviation is a bit more than the balance point of the histogram folded at the mean. For bell-shaped distributions, the **empirical rule** holds: about 68%, 95%, and 99.7% of the data are within one, two, and three standard deviations of the mean, respectively. A box plot mimics a histogram in a compressed format, and a density plot smooths a histogram with a continuous curve.

#### By Hand Example

Consider a small sample of children with ages 2, 3, 3, 5, 8, and 9 years. Create a spike graph. Estimate from the spike graph and then calculate all of the above statistics (for the quantiles choose 0.25 and 0.75). Create a box plot aligned with the spike graph. Obtain a graph of the empirical cumulative distribution and the quantiles.
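
The hand calculations in the example above can be checked in R. The following is a minimal sketch using only base R; the variable names (`ages`, `ad`, `q6`, and so on) are illustrative assumptions, and `type = 6` and `type = 7` select the two quantile conventions described earlier.

```r
# Ages from the by-hand example, already in increasing order
ages <- c(2, 3, 3, 5, 8, 9)

# Measures of center
xbar     <- mean(ages)                    # sum of the data divided by n
xtilde   <- median(ages)                  # average of the two middle values (n is even)
midrange <- (min(ages) + max(ages)) / 2   # average of the extremes

# Measures of spread
s2  <- var(ages)                          # divides by n - 1
s   <- sd(ages)                           # square root of the variance
ad  <- mean(abs(ages - xtilde))           # average deviation from the median
rng <- max(ages) - min(ages)              # range

# Quartiles under the two conventions discussed in the text
q6 <- quantile(ages, c(0.25, 0.75), type = 6, names = FALSE)  # (n + 1)q-th datum, counting from 1
q7 <- quantile(ages, c(0.25, 0.75), type = 7, names = FALSE)  # (n - 1)q-th datum, counting from 0

# Location of the value 8 within the data
z8 <- (8 - xbar) / s                      # z-score
p8 <- mean(ages <= 8)                     # cumulative proportion

print(c(mean = xbar, median = xtilde, midrange = midrange))
print(c(variance = s2, sd = s, AD = ad, range = rng))
print(rbind(type6 = q6, type7 = q7))
print(c(z = z8, cum_prop = p8))
```

For this sample the two conventions disagree: type 6 gives quartiles 2.75 and 8.25, while type 7 gives 3 and 7.25, so the interquartile range depends on the convention chosen.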

#### Measures of Center as Models

The mean minimizes $f(x) = \displaystyle\sum_{i=1}^{n} (x_i - x)^2$. The median minimizes $f(x) = \displaystyle\sum_{i=1}^{n} |x_i - x|$. The midrange minimizes $f(x) = \max\{|x_i - x| : i=1,2,\ldots,n\}$. Each of the above functions $f$ can be thought of as a distance between $(x_1, x_2, \ldots, x_n)$ and $(x, x, \ldots, x)$ in $\mathbb{R}^n$ (the squared Euclidean distance, the taxicab distance, and the maximum distance, respectively).

#### Lies, Damn Lies, and Statistics

A third world country with many poor but a few rich families is looking for foreign aid. Which average income should they report? If they were looking for foreign investment, which average income should they report?

## Introduction Data Set Example

1.  Load required packages.

```{r}
#| message: false
library(tidyverse)
```

2.  Read the key data set. The data set documentation is in the file *03 Introduction.csv*.

```{r}
intro = read_csv("03 Introduction.csv")
intro = intro |>
  mutate(
    # Order the religion categories explicitly
    NewReligion = factor(Religion, levels = c("M","OC","ON","N")),
    # Recode the numeric excitement codes and keep their natural order
    Excitement = case_match(Excitement, 1 ~ "VF", 2 ~ "F", 3 ~ "N", 4 ~ "E", 5 ~ "VE"),
    Excitement = factor(Excitement, levels = c("VF","F","N","E","VE")),
    # Treat implausibly small heights as missing
    Height = if_else(Height < 20, NA, Height))
```

3.  Obtain three repeatably random rows and provide an interpretation.

```{r}
set.seed(123)
intro |> slice_sample(n = 3)
```

The first person is a Math 323 student who is male, is not affiliated with a religion, is neutral about the course, hopes for a B grade, is 75 inches tall, and weighs 173 pounds.

## Qualitative Data

1.  Obtain a relative frequency bar chart of religion. Interpret the left-most bar.

```{r}
ggplot(data = intro, mapping = aes(x = NewReligion, y = 100*after_stat(count/sum(count)))) +
  geom_bar() +
  ylab("Percentage") +
  ggtitle("Math 323 Students")
```

2.  Obtain a frequency bar chart of gender in descending count order. Interpret the left-most bar.

3.  Obtain a bar chart of excitement using all five names rather than the corresponding codes. Include an appropriate title. Interpret the right-most bar.

```{r}
ggplot(intro, aes(x = Excitement)) +
  geom_bar()
```

## Quantitative Data

1.  Obtain a (good) histogram of heights.

```{r}
# Check for implausible heights; this returns no rows once
# heights below 20 have been set to NA in the read step
intro |> filter(Height < 20)
```

```{r}
ggplot(intro, aes(x = Height)) +
  geom_histogram(binwidth = 3, color = "white", fill = "blue") +
  xlab("Height (in)") +
  ylab("Frequency") +
  ggtitle("Math 323 Students")
```

2.  Interpret the left-most bar. Estimate from the histogram the sample size, mean, median, standard deviation, first quartile, and third quartile. Calculate these statistics.

Eight Math 323 students were between 61.5 and 64.5 inches tall.

```{r}
intro |>
  summarise(n = sum(!is.na(Height)),   # count only the recorded heights
            xbar = mean(Height, na.rm = TRUE),
            xtilde = median(Height, na.rm = TRUE),
            s = sd(Height, na.rm = TRUE),
            q1 = quantile(Height, 0.25, na.rm = TRUE),
            q3 = quantile(Height, 0.75, na.rm = TRUE))
```

3.  Obtain density plots of `Height` with different values for `adjust`. Compare with the histogram.

```{r}
ggplot(intro, aes(x = Height)) +
  geom_density(adjust = 1, fill = "blue", alpha = 0.4) +
  xlab("Height (in)") +
  ylab("Density") +
  ggtitle("Math 323 Students")
```

4.  Obtain a box plot of `Height`. Compare with the histogram and density plots.

```{r}
ggplot(intro, aes(x = Height)) +
  geom_boxplot() +
  xlab("Height (in)") +
  ylab(NULL) +   # the vertical axis carries no information for a single box plot
  ggtitle("Math 323 Students")
```

5.  Obtain a violin plot of `Height`. Compare with the histogram, density plots, and box plot.

```{r}
ggplot(intro, aes(x = Height, y = 0)) +
  geom_violin() +
  xlab("Height (in)") +
  ylab("Density") +
  ggtitle("Math 323 Students")
```

6.  Obtain a plot of the empirical cumulative distribution function.

```{r}
intro |>
  ggplot(aes(x = Height)) +
  stat_ecdf(geom = "step")
```

7.  Use the empirical cdf graph to estimate the first and third quartiles. Compare with what was calculated earlier.
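
Reading a quartile off the empirical cdf graph amounts to finding the smallest data value whose cumulative proportion reaches the target. The following is a minimal sketch of that reading using base R's `ecdf()`. To stay self-contained it uses the small ages sample from the by-hand example rather than the class data, and the helper name `ecdf_quantile` is hypothetical.

```r
# Small sample from the by-hand example (not the class data)
ages <- c(2, 3, 3, 5, 8, 9)

# Hypothetical helper: smallest datum whose cumulative proportion is at least q
ecdf_quantile <- function(x, q) {
  x <- sort(x[!is.na(x)])   # drop missing values and sort
  Fx <- ecdf(x)             # empirical cumulative distribution function
  min(x[Fx(x) >= q])
}

q1_ecdf <- ecdf_quantile(ages, 0.25)
q3_ecdf <- ecdf_quantile(ages, 0.75)

# R's type 1 quantile is defined as the inverse of the empirical cdf
q_type1 <- quantile(ages, c(0.25, 0.75), type = 1, names = FALSE)

# R's default (type 7) interpolates between data, so it can differ
q_type7 <- quantile(ages, c(0.25, 0.75), names = FALSE)

print(c(q1 = q1_ecdf, q3 = q3_ecdf))
print(rbind(type1 = q_type1, type7 = q_type7))
```

With the class data, the same helper could be applied to `intro$Height`. The values read from the graph may differ slightly from the `quantile()` results calculated earlier because the default type 7 interpolates between neighboring data.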