C02 Variable and Data Types

Author

David Housman

Course Overview

A variable is a question asked of entities of interest, and data are the known answers. More formally, a variable is a function from a population or sample (a set of observations or cases) to a set of possible values, and data are the values that function takes on a given sample.

Variable Types

Variables are classified into types according to how numerically meaningful the answers are. Qualitative variables are either nominal (the values have no numeric meaning) or ordinal (the values have a meaningful order but nothing more). Identification variables are qualitative variables that uniquely identify each case. Quantitative variables have numeric values whose meaning goes beyond order. On an interval scale, equal differences correspond to equal changes in intensity. A ratio scale adds a meaningful zero, so that ratios such as doubling are meaningful. An absolute scale has no units, as with a count. In R, qualitative variables are usually stored as character (chr) or factor (fct) variables, and quantitative variables as integer (int), double (dbl), or time variables.
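
The correspondence between measurement scales and R storage types can be sketched in base R. The vectors below are hypothetical answers invented for illustration, and storing an ordinal variable as an ordered factor is one reasonable choice rather than the only one.

```r
# Hypothetical survey answers, invented purely for illustration.
major <- c("Math", "Physics", "Math")            # nominal, stored as character
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)                   # ordinal, stored as an ordered factor
siblings <- c(2L, 0L, 1L)                        # absolute (a count), stored as integer
height_cm <- c(170.2, 181.0, 165.5)              # ratio, stored as double

# Ask R how each vector is stored.
class(major)
class(size)
typeof(siblings)
typeof(height_cm)

# Order comparisons follow the declared levels of an ordered factor.
size[1] < size[2]
```

The `L` suffix is what makes `siblings` an integer vector; without it, R stores the numbers as doubles.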

Software

  1. R is a free software environment for statistical computing and graphics. You should install this on your computer or use a lab computer.
  2. RStudio is an integrated development environment (IDE) for R and Python. You should install this on your computer or use a lab computer.
  3. An RStudio project keeps related files in a single working directory. By double clicking on the Classes.Rproj file, RStudio starts with the project we will often use during class time.
  4. A Quarto document is an efficient way to combine text, code, and results in draft and polished formats. These words are in the 02 Variable and Data Types.qmd file.
  5. The tidyverse package (which includes several dependent packages) extends the capabilities of R. You should install this package (see the Packages tab in the lower right window of RStudio). Installing makes the package(s) available on your computer; they still need to be loaded in each R session with library(tidyverse).

Basic Computation

Create a code chunk by pressing Ctrl-Alt-I, and run the code by pressing Ctrl-Shift-Enter.

  1. Compute \(3+4\cdot 3^2\).
3 + 4*3^2
[1] 39
  1. Create a vector with four single digit integers.
c(3,7,2,1)
[1] 3 7 2 1
  1. Create a vector of four names.
c("David", "Jeanne", "Kate", "Genevieve")
[1] "David"     "Jeanne"    "Kate"      "Genevieve"
  1. Create a vector with the one digit integers in order.
0:9
 [1] 0 1 2 3 4 5 6 7 8 9
  1. Create a vector of the numbers 0.0, 0.5, 1.0, 1.5, …, 9.5, 10.0.
seq(0.0, 10.0, 0.5)
 [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0
[16]  7.5  8.0  8.5  9.0  9.5 10.0
seq(from = 0.0, to = 10.0, by = 0.5)
 [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0
[16]  7.5  8.0  8.5  9.0  9.5 10.0
  1. Guess what each of the following computations does, and then check your guess.
x = 1:5
x + 3
x + x
1:8 * 1:2
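
Computations like these rely on R applying arithmetic element by element and reusing (recycling) the shorter vector. A minimal sketch with hypothetical vectors `a` and `b`, chosen to differ from the exercise so that the guesses above are still worth making:

```r
# Hypothetical vectors, chosen to differ from the exercise above.
a <- c(10, 20, 30, 40)
b <- c(1, 2)

a + 1   # the single value 1 is reused for every element of a
a * b   # b is recycled as 1, 2, 1, 2 to match the length of a

# The same recycling written out explicitly with rep().
a * rep(b, length.out = length(a))
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.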

Data Frames

  1. Load the tidyverse package and any dependent packages. Observe the message.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  1. View the mpg data frame and its documentation. Each row is a case, and each column is a variable.
View(mpg)
  1. Obtain the second, ninth, and fourth rows of mpg using both base R and the tidyverse. Provide an interpretation of the second row. What is the type of each variable according to our classification scheme and according to R?
mpg[c(2,9,4),]
# A tibble: 3 × 11
  manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
  <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
1 audi         a4           1.8  1999     4 manua… f        21    29 p     comp…
2 audi         a4 quattro   1.8  1999     4 auto(… 4        16    25 p     comp…
3 audi         a4           2    2008     4 auto(… f        21    30 p     comp…
mpg |> slice(c(2,9,4))
# A tibble: 3 × 11
  manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
  <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
1 audi         a4           1.8  1999     4 manua… f        21    29 p     comp…
2 audi         a4 quattro   1.8  1999     4 auto(… 4        16    25 p     comp…
3 audi         a4           2    2008     4 auto(… f        21    30 p     comp…

The second row is a particular model of car manufactured by Audi (nominal), model a4 (nominal), manufactured in 1999 (interval). It has an engine displacement of 1.8 litres (ratio), 4 cylinders (absolute), a manual transmission (nominal, or binary if the number of speeds is not considered), a front-wheel drive train (nominal), a premium petrol fuel type (nominal), and is a compact type of car (nominal, not quite ordinal). In EPA testing, it obtained 21 miles per gallon in city driving (ratio) and 29 miles per gallon in highway driving (ratio).
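
For the R side of the question, one option is to ask R directly how each column is stored. The sketch below uses a small made-up data frame rather than mpg, so it runs without the tidyverse; the name `cars_toy` and its values are hypothetical.

```r
# A small made-up data frame; the column names and values are hypothetical.
cars_toy <- data.frame(
  maker = c("audi", "ford", "honda"),
  year = c(1999L, 2008L, 2008L),
  displ = c(1.8, 4.6, 1.6),
  stringsAsFactors = FALSE
)

# Base R row selection: rows 3 and 1, in that order, with all columns.
picked <- cars_toy[c(3, 1), ]
picked

# Storage class of each column as R sees it.
col_classes <- sapply(cars_toy, class)
col_classes
```

The storage class says nothing about the measurement scale: `year` is stored as an integer, but whether it is treated as interval is a judgment about its meaning.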

  1. Read in the C02Data.csv data file and save it to math323.
math323 = read_csv("C02Data.csv")
Rows: 10 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Name, Major, GradProf
dbl  (3): Prog, Hand, Absolute
time (1): Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  1. Observe the presence of math323 in the Environment tab in the upper right window of RStudio. Double click to view the data frame.

  2. Observe the message that states the data type assigned to each variable. Use help to determine better types to assign, and then view the third row.

math323 = read_csv("C02Data.csv", col_types = "cfftdif")
math323 |> slice(3)
# A tibble: 1 × 7
  Name  Major   Prog  Time    Hand Absolute GradProf
  <chr> <fct>   <fct> <time> <dbl>    <int> <fct>   
1 Elise Physics 3     08:00   21.4        2 yes     
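
The compact col_types string assigns one letter per column, in column order: c for character, f for factor, t for time, d for double, and i for integer. A minimal sketch on made-up inline data (the column names and values are hypothetical, not the class data file), assuming readr 2.0 or later so that literal text can be passed with I():

```r
library(readr)

# Made-up CSV text; the columns and values are hypothetical.
csv_text <- "Name,Major,Credits\nAva,Math,15\nBen,Physics,12\nCy,Math,16\n"

# One letter per column: c = character, f = factor, i = integer.
toy <- read_csv(I(csv_text), col_types = "cfi")
toy

# Factor levels are taken in order of first appearance.
levels(toy$Major)
```

Supplying col_types also suppresses the column specification message, since nothing is left for readr to guess.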

Render

  1. Render this file as an HTML document. Observe how the YAML in the header is incorporated.
  2. Add an option to the code chunk containing the View command so that it is not evaluated.