---
title: "C05 Qualitative Explanatory Variable"
author: "David Housman"
format: docx
editor: visual
fig-width: 6
fig-asp: 0.618
out-width: "70%"
fig-align: center
execute:
  enabled: false
---

## Introduction

Previously, we considered techniques to describe a single variable. We now examine techniques to describe the relationship between two variables. We usually think of one variable as the *dependent* or *response* variable and the other as the *independent*, *predictor*, or *explanatory* variable. As before, the appropriate descriptive techniques depend on the types of the variables.

Load the required packages. The `janitor` package provides a convenient way to create tables that works well with the tidyverse.

```{r}
#| message: false
#| warning: false
library(tidyverse)
library(janitor)
```

Import the illustrative data set, and treat `Excitement` as an ordinal factor variable.

```{r}
intro = read_csv("03 Introduction.csv") |>
  mutate(Excitement = factor(Excitement, levels = 1:5))
```
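As a minimal sketch of what the `levels` argument does, the following uses a small made-up vector (not the course data) to show that the levels fix the order of the categories and retain categories that never occur. The names `x` and `f` are for illustration only.

```{r}
# Hypothetical responses on a 1-5 scale; not taken from the course data.
x <- c(3, 5, 3, 1)
f <- factor(x, levels = 1:5)

levels(f)  # all five categories, in order, even though 2 and 4 are unobserved
table(f)   # unobserved levels appear with a count of 0
```

Because the levels are stored in this order, tables and plot legends list the categories from 1 to 5 rather than in order of appearance.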

## Quantitative Response and Qualitative Explanatory Variables

The approach is to apply the visualizations and statistics for a single quantitative variable separately to each value of the qualitative variable. Consider `Weight` as a function of `Gender`.

1.  Exclude the non-binary students.

```{r}
intro2 = intro |> filter(Gender != "N")
```
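Note that `filter()` keeps only the rows for which the condition is `TRUE`, so a comparison such as `Gender != "N"` also drops any rows where `Gender` is missing. The sketch below illustrates this on a made-up data frame; the codes `"F"` and `"M"` and the weights are assumptions for illustration, not values from the course data.

```{r}
library(dplyr)

# Hypothetical data; the codes "F" and "M" are assumed for illustration.
toy <- tibble(Gender = c("F", "M", "N", NA), Weight = c(130, 180, 150, 160))

# NA != "N" evaluates to NA, and filter() keeps only TRUE rows,
# so the row with a missing Gender is dropped along with the "N" row.
kept <- toy |> filter(Gender != "N")
nrow(kept)

# To keep rows with a missing Gender, say so explicitly.
kept_na <- toy |> filter(Gender != "N" | is.na(Gender))
nrow(kept_na)
```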

2.  Obtain box plots. Overlay jitter plots of blue points (uncomment the `geom_jitter()` line) to verify the correctness of the box plots. Remove the jitter plots for a finalized visualization.

```{r}
ggplot(intro2, aes(x = Gender, y = Weight)) +
  geom_boxplot() +
  #geom_jitter(height = 0, width = 0.25, color = "blue") +
  ylab("Weight (lbs)")
```

3.  What words might you use to describe the shapes of the distributions? Estimate the standard five-number summaries. Describe at least one conclusion that can be drawn from this graph.

4.  Obtain the standard five-number summaries and compare with your estimates.

```{r}
intro2 |>
  group_by(Gender) |>
  summarize(
    Q0 = min(Weight),
    Q1 = quantile(Weight, 0.25),
    Q2 = median(Weight),
    Q3 = quantile(Weight, 0.75),
    Q4 = max(Weight))
```
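When comparing these summaries with the box plots, note that `Q0` and `Q4` are the extreme observations, whereas the whiskers drawn by `geom_boxplot()` stop, by default, at the most extreme observations within 1.5 times the interquartile range of the box. Points beyond that are drawn individually. The sketch below works through this rule by hand on made-up weights (not the course data); all variable names are for illustration only.

```{r}
# Hypothetical weights in pounds, made up for illustration.
w <- c(110, 125, 130, 140, 150, 155, 160, 250)

q <- unname(quantile(w, c(0.25, 0.75)))  # first and third quartiles
iqr <- q[2] - q[1]                       # interquartile range

# Fences: points beyond these are drawn individually as outliers.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
whisker_low  <- min(w[w >= lower_fence])
whisker_high <- max(w[w <= upper_fence])

c(min = min(w), whisker_low = whisker_low,
  whisker_high = whisker_high, max = max(w))
```

Here the largest weight lies beyond the upper fence, so the upper whisker ends below the maximum. An estimate of `Q4` read from a box plot should therefore use the highest plotted point, not the end of the whisker.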

5.  Obtain violin plots overlaid with line segments for the means.

```{r}
ggplot(intro2, aes(x = Gender, y = Weight)) +
  geom_violin() +
  stat_summary(fun = mean, geom = "crossbar",
               width = 0.5, color = "black") +
  ylab("Weight (lbs)")
```

6.  Estimate counts, means, and standard deviations.

7.  Obtain counts, means, and standard deviations, and compare with your estimates.

```{r}
intro2 |>
  group_by(Gender) |>
  summarize(
    n = n(),
    mean = mean(Weight),
    sd = sd(Weight))
```

8.  Obtain overlaid density plots. Unlike the other graphs, the response variable is typically represented on the horizontal axis. What statistics would be easy to estimate from this visualization?

```{r}
ggplot(intro2, aes(x = Weight, color = Gender, fill = Gender)) +
  geom_density(alpha = 0.5) +
  xlab("Weight (lbs)")
```

9.  Obtain vertically aligned histograms. What statistics would be easy to estimate from this visualization?

```{r}
ggplot(intro2, aes(x = Weight)) +
  geom_histogram(binwidth = 20, center = 100, color = "white") +
  facet_wrap(~ Gender, ncol = 1) +
  xlab("Weight (lbs)")
```

10. Obtain a plot of the empirical cumulative distribution functions, grouped by gender. What statistics would be easy to estimate from this visualization?

```{r}
ggplot(intro2, aes(x = Weight, color = Gender)) +
  stat_ecdf() +
  xlab("Weight (lbs)") 
```
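An empirical cumulative distribution function gives, for each value, the proportion of observations at or below that value. As a minimal sketch, base R's `ecdf()` can be evaluated directly on made-up weights (not the course data); the names `w` and `Fn` are for illustration only.

```{r}
# Hypothetical weights in pounds, made up for illustration.
w <- c(110, 125, 130, 140, 150, 155, 160, 250)

Fn <- ecdf(w)  # Fn(x) is the proportion of observations <= x
Fn(150)        # proportion weighing at most 150 lbs

# Reading across the plot at height 0.5 corresponds to inverting the ECDF,
# which is what quantile() does with type = 1.
quantile(w, 0.5, type = 1)
```

Reading across at heights 0.25, 0.5, and 0.75 in the same way gives estimates of the quartiles for each group.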

## Qualitative Response and Explanatory Variables

The approach is to apply the visualizations and statistics for a single qualitative variable separately to each value of the explanatory qualitative variable. Consider `Excitement` as a function of `Gender`.

1.  Obtain a stacked frequency bar chart.

```{r}
ggplot(intro, aes(x = Gender, fill = Excitement)) +
  geom_bar()
```

2.  Obtain a dodged frequency bar chart.

```{r}
ggplot(intro, aes(x = Gender, fill = Excitement)) +
  geom_bar(position = position_dodge2(preserve = "single"))
```

3.  Obtain a stacked bar chart of relative frequencies within each group.

```{r}
ggplot(intro, aes(x = Gender, fill = Excitement)) +
  geom_bar(position = "fill")
```

4.  Obtain a table of frequencies.

```{r}
intro |>
  tabyl(Excitement, Gender)
```

5.  Obtain a table of percentages by group.

```{r}
intro |> 
  tabyl(Excitement, Gender) |>
  adorn_percentages("col") |>
  adorn_pct_formatting(digits = 2)
```
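Because `Gender` defines the columns of this table, `adorn_percentages("col")` divides each count by its gender total, so the percentages within each gender sum to 100. The sketch below reproduces that calculation in long format with `dplyr` on a made-up data frame; the values and the names `toy` and `pct` are assumptions for illustration, not the course data.

```{r}
library(dplyr)

# Hypothetical data; the values are made up for illustration.
toy <- tibble(
  Gender     = c("F", "F", "F", "M", "M"),
  Excitement = c(4, 5, 5, 3, 5)
)

# Within each Gender, divide each count by the group total.
pct <- toy |>
  count(Gender, Excitement) |>
  group_by(Gender) |>
  mutate(percent = 100 * n / sum(n)) |>
  ungroup()
pct
```

These within-group percentages are the same quantities displayed by the stacked relative frequency bar chart, expressed there as proportions.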