---
title: "C05 Qualitative Explanatory Variable"
author: "David Housman"
format: docx
editor: visual
fig-width: 6
fig-asp: 0.618
out-width: "70%"
fig-align: center
execute:
  enabled: false
---

## Introduction

Previously, we considered techniques to describe a single variable. We now examine techniques to describe the relationship between two variables. We usually think of one variable as the *dependent* or *response* variable and the other as the *independent*, *predictor*, or *explanatory* variable. As before, the appropriate description techniques depend on the types of the variables.

Load the required packages. The `janitor` package provides a convenient, tidyverse-compatible way to create tables.

```{r}
#| message: false
#| warning: false
library(tidyverse)
library(janitor)
```

Import the illustrative data set, and treat `Excitement` as an ordinal factor variable.

```{r}
# List the levels in increasing order so tables and charts follow the ordinal scale
intro = read_csv("03 Introduction.csv") |>
  mutate(Excitement = factor(Excitement, levels = 1:5))
```

## Quantitative Response and Qualitative Explanatory Variables

The approach is to apply the visualizations and statistics for a quantitative variable separately for each value of the qualitative variable. Consider `Weight` as a function of `Gender`.

1. Exclude the non-binary students.

```{r}
# filter() also drops rows where Gender is missing
intro2 = intro |>
  filter(Gender != "N")
```

2. Obtain box plots. Overlay them with jitter plots of blue points to verify the correctness of the box plots. Remove the jitter plots for a finalized visualization.

```{r}
ggplot(intro2, aes(x = Gender, y = Weight)) +
  geom_boxplot() +
  # geom_jitter(height = 0, width = 0.25, color = "blue") +
  ylab("Weight (lbs)")
```

3. What words might you use to describe the shapes of the distributions? Estimate the standard five-number summaries. Describe at least one conclusion that can be drawn from this graph.

4. Obtain the standard five-number summaries and compare them with your estimates.

```{r}
intro2 |>
  group_by(Gender) |>
  summarize(
    Q0 = min(Weight),
    Q1 = quantile(Weight, 0.25),
    Q2 = median(Weight),
    Q3 = quantile(Weight, 0.75),
    Q4 = max(Weight))
```

5. Obtain violin plots overlaid with line segments for the means.

```{r}
ggplot(intro2, aes(x = Gender, y = Weight)) +
  geom_violin() +
  stat_summary(fun = mean, geom = "crossbar", width = 0.5, color = "black") +
  ylab("Weight (lbs)")
```

6. Estimate counts, means, and standard deviations.

7. Obtain counts, means, and standard deviations, and compare them with your estimates.

```{r}
intro2 |>
  group_by(Gender) |>
  summarize(
    n = n(),
    mean = mean(Weight),
    sd = sd(Weight))
```

8. Obtain overlaid density plots. Unlike the box and violin plots, the response variable is typically represented on the horizontal axis. What statistics would be easy to estimate from this visualization?

```{r}
# Weight is mapped to x so that the response is on the horizontal axis
ggplot(intro2, aes(x = Weight, color = Gender, fill = Gender)) +
  geom_density(alpha = 0.5) +
  xlab("Weight (lbs)")
```

9. Obtain vertically aligned histograms. What statistics would be easy to estimate from this visualization?

```{r}
ggplot(intro2, aes(x = Weight)) +
  geom_histogram(binwidth = 20, center = 100, color = "white") +
  facet_wrap(~ Gender, ncol = 1) +
  xlab("Weight (lbs)")
```

10. Obtain a plot of the empirical cumulative distribution functions, grouped by gender. What statistics would be easy to estimate from this visualization?

```{r}
ggplot(intro2, aes(x = Weight, color = Gender)) +
  stat_ecdf() +
  xlab("Weight (lbs)")
```

## Qualitative Response and Explanatory Variables

The approach is to apply the visualizations and statistics for a qualitative variable separately for each value of the explanatory qualitative variable. Consider `Excitement` as a function of `Gender`.

1. Obtain a stacked frequency bar chart.

```{r}
ggplot(intro, aes(x = Gender, fill = Excitement)) +
  geom_bar()
```

2. Obtain a dodged frequency bar chart.

```{r}
ggplot(intro, aes(x = Gender, fill = Excitement)) +
  geom_bar(position = position_dodge2(preserve = "single"))
```

3. Obtain a stacked bar chart of relative frequencies within each group.

```{r}
ggplot(intro, aes(x = Gender, fill = Excitement)) +
  geom_bar(position = "fill")
```

4. Obtain a table of frequencies.

```{r}
intro |>
  tabyl(Excitement, Gender)
```

5. Obtain a table of percentages by group.

```{r}
intro |>
  tabyl(Excitement, Gender) |>
  adorn_percentages("col") |>
  adorn_pct_formatting(digits = 2)
```
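
To see what "percentages by group" means in the last table, here is a minimal sketch on a small made-up data set. The tibble `toy` and its values are hypothetical and are not taken from the course data. With `adorn_percentages("col")`, each count is divided by its column total, so the proportions within each `Gender` column should sum to 1.

```{r}
library(tidyverse)
library(janitor)

# Hypothetical toy data (an assumption for illustration, not the course data set)
toy = tibble(
  Gender = c("F", "F", "F", "F", "M", "M"),
  Excitement = factor(c(3, 4, 4, 5, 4, 5), levels = 1:5)
)

# Divide each count by its column total to get proportions within each gender
toy_props = toy |>
  tabyl(Excitement, Gender) |>
  adorn_percentages("col")

toy_props

# Each gender column should sum to 1
toy_props |>
  summarize(across(-Excitement, sum))
```

Because the explanatory variable `Gender` defines the columns, column proportions are the natural choice here: they compare the distribution of `Excitement` across genders even when the groups differ in size.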