--- title: "C27-28 Random Number Generation" author: "David Housman" format: docx editor: source fig-width: 6 fig-height: 2.4 --- ```{r} #| message: false #| warning: false #| echo: false library(tidyverse) ``` ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) knitr::opts_chunk$set(eval = FALSE) ``` ## Uniform Random Number Generator Let $U$ be a random variable with a uniform continuous distribution on $[0,1)$. Here is R's function to generate five values from $U$. ```{r runif} runif(5) ``` The R code below shows that R only generated pseudorandom numbers: specifying a seed for "random" number generation results in the same "random" numbers! This can actually be useful for code testing. ```{r set.seed} set.seed(17) runif(5) set.seed(17) runif(5) ``` The linear congruential generator was a standard method used in computer systems because of its simplicity and speed. Starting with a seed $x_0$, let $x_i = a x_{i-1} + c \pmod{m}$ where $a$, $c$, and $m$ are fixed integers called the multiplier, increment, and modulus, respectively. The $i$th random real in $[0,1)$ is $x_i/m$. ```{r rlcg} x = 19570830 #the seed value rlcg = function() { #updating the global variable x <<- (1103515245 * x + 12345) %% 2^31 x/2^31 #the random number in [0, 1] } tibble(x = replicate(10000, rlcg())) |> ggplot(aes(x = x)) + geom_histogram(breaks = seq(0, 1, 0.1), color = "white") ``` By default, R uses a more modern and higher quality method called the Mersenne-Twister (although the user can choose among several algorithms). A more in depth examination of pseudo-random number generation would be appropriate in a second probability course or a number theory course. ## An Easy Continuous Random Number Generator Suppose the random variable $X$ has the probability density function $f(x) = 3/x^4$ for $1 \leq x < \infty$. a. Verify that $f$ is a pdf. b. Determine the mean and standard deviation of $X$. c. Define the pdf, cdf, inverse cdf, and random generator as R functions. ```{r} deasy = function(x) ifelse(1 <= x, 3/x^4, 0) peasy = function(x) ifelse(1 <= x, 1 - 1/x^3, 0) qeasy = function(p) ifelse(0 <= p & p < 1, (1-p)^(-1/3), ifelse(p == 1, Inf, NaN)) reasy = function(n) qeasy(runif(n)) ``` d. As we have seen, R was built to handle most computations with vectors as simply as with numbers. The function `if` is an exception, but the function `ifelse` is the vectorized form of the `if` function. R also treats `-Inf` and `Inf` as numbers when used in reasonable ways, e.g., `3/Inf = 0` and `-2/0 = -Inf` but `cos(Inf)` and -Inf + Inf` both return `NaN`, the reserved word for "not a number". If we could be sure that users of the `easy` function would not input "incorrect" values (i.e., asking for the pdf at a number below 1, the cdf for a number below 1, or an inverse cdf for a number below 0 or 1 or above), then the conditional code would not be needed. Check that the first three functions handle extreme cases. ```{r} deasy(c(-Inf, -3, Inf)) == c(0, 0, 0) peasy(c(-Inf, -3, Inf)) == c(0, 0, 1) qeasy(c(0, 1)) == c(1, Inf) qeasy(c(-Inf, -3, 3, Inf)) #all NaN ``` e. Compare a simulation with theory via the means, standard deviations, pdfs, cdfs, and quantiles. 
e. Compare a simulation with theory via the means, standard deviations, pdfs, cdfs, and quantiles.

```{r}
#Simulate
runs = 1000
simulation = tibble(x = reasy(runs))
#Compute the theoretical distribution
theory = tibble(
  x = seq(0, max(simulation$x), length.out = 100),
  pdf = deasy(x),
  cdf = peasy(x)
)
#Compare simulation and theoretical statistics
rbind(
  simulation |> summarize(type = "simulation", mean = mean(x), sd = sd(x)),
  tibble(type = "theory", mean = 1.5, sd = sqrt(0.75))
)
#Compare simulation and theoretical pdfs
ggplot(simulation, aes(x = x, y = after_stat(density))) +
  geom_histogram(bins = 30, color = "white") +
  geom_line(aes(x = x, y = pdf), theory, color = "red")
#Compare simulation and theoretical cdfs
ggplot(simulation, aes(x = x)) +
  stat_ecdf() +
  geom_line(aes(x = x, y = cdf), theory, color = "red")
#Compare simulation and theory with a quantile-quantile plot
ggplot(simulation, aes(sample = x)) +
  stat_qq(distribution = qeasy, alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = "red")
```

## The Precision of Simulations

How close do we expect a random number generator to be in comparison with theory? If theory says that the probability of generating numbers within a particular interval $[a,b)$ is $p$ and we generate $n$ random numbers, then the number of random numbers that land in $[a,b)$ is a random variable $X$ that has a binomial distribution with parameters $n$ and $p$. Hence, the mean and standard deviation of $X$ are $\mu_X = np$ and $\sigma_X = \sqrt{np(1-p)}$. The proportion of random numbers that are in $[a,b)$ is the random variable $Y = X/n$, which therefore has mean and standard deviation $\mu_Y = p$ and $\sigma_Y = \sqrt{p(1-p)/n}$. Thus, we expect that 95% of the time, the proportion of random numbers that are in $[a,b)$ will be within $2\sqrt{p(1-p)/n}$ of the theoretical probability $p$. Since $p(1-p) \leq 1/4$ for all $p \in [0,1]$, it follows that $2\sqrt{p(1-p)/n} \leq 2\sqrt{1/(4n)} = 1/\sqrt{n}$. The latter formula is often called the *margin of error* formula.

Since $1/\sqrt{1600} = 0.025$, pollsters who have surveyed 1600 randomly selected people will state that with 95% confidence the reported percentages could have an error of $\pm 2.5\%$. Most pollsters use a stratified, rather than simple, random sample so that they can report meaningful results for subgroups of the population (e.g., by political party or gender), but this means that the margin of error for the full sample will be larger than indicated by the above formula.

Let's apply this to the 1000 numbers generated by `reasy` (which were stored in the tibble `simulation`). The first line finds the 0.00, 0.20, 0.40, 0.60, 0.80, and 1.00 quantiles for the theoretical distribution. The second line creates a frequency distribution for the randomly generated numbers based upon the theoretical quantiles. We would expect about $1000/5 = 200$ in each bin. The third line finds the relative frequencies for each bin without the names. We would expect each number to be about $0.200$. The fourth line finds the deviations of the relative frequencies from the expected relative frequency. The fifth line finds the margin of error, and the sixth line finds the more precise margin of error. We expect all or most of the deviations to be less than these numbers.

```{r easy comparison}
breaks = qeasy(seq(0, 1, 1/5)); breaks
freqdist = table(cut(simulation$x, breaks, right = FALSE)); freqdist
relfreq = unname(freqdist)/runs; relfreq
deviations = abs(relfreq - 0.200); deviations
moe = 1/sqrt(runs); moe
pmoe = 2*sqrt(0.200*(1-0.200)/runs); pmoe
```
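For a single summary of the same comparison, a chi-squared goodness-of-fit test can be applied; the sketch below uses R's `chisq.test` to compare the observed bin counts with the equal theoretical proportions.

```{r}
#Chi-squared goodness-of-fit test of the observed bin counts against the
#equal theoretical proportions; a large p-value is consistent with the
#simulated values following the theoretical distribution
chisq.test(as.vector(freqdist), p = rep(1/5, 5))
```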
## A Hard Continuous Random Number Generator

Suppose the random variable $X$ has the probability density function $f(x) = \frac{2}{\sqrt{\pi}}e^{-x^2}$ for $0 \leq x < \infty$. Carry out the same steps as were done for the easy random variable.

```{r}
#| echo: false
f = function(x) (2/sqrt(pi))*exp(-x^2)
integrate(f, 0, Inf)
mu = integrate(function(x) x*f(x), 0, Inf)$value; mu
sigma = sqrt(integrate(function(x) (x-mu)^2*f(x), 0, Inf)$value); sigma
```

What makes this hard is that the cdf and inverse cdf cannot be expressed with simple formulas. Instead, numerical approximation is used, but the `integrate` and `uniroot` functions only work with numbers, not vectors. R provides the `Vectorize` function so that code written for numeric input can also be used with vector input. It was found empirically that $F(3.86) = 1$ to the level of machine precision.

```{r hardrng}
#| echo: false
#Define hard random variable functions
dhard = function(x) ifelse(0 <= x, 2/sqrt(pi)*exp(-x^2), 0)
phard = Vectorize(function(x) {
  if (x < 0) 0
  else if (x < 3.86) integrate(dhard, 0, x)$value
  else 1
})
qhard = Vectorize(function(p) {
  if (p < 0 || 1 < p) NaN
  else if (p == 1) Inf
  else uniroot(function(x) phard(x) - p, c(0, 3.86))$root
})
rhard = function(n) qhard(runif(n))
#Check extreme cases
dhard(c(-Inf, -3, Inf)) == c(0, 0, 0)
phard(c(-Inf, -3, Inf)) == c(0, 0, 1)
qhard(c(0, 1)) == c(0, Inf)
qhard(c(-Inf, -3, 3, Inf)) #all NaN
#Simulate
runs = 1000
simulation = tibble(x = rhard(runs))
#Compute the theoretical distribution
theory = tibble(
  x = seq(0, max(simulation$x), length.out = 100),
  pdf = dhard(x),
  cdf = phard(x)
)
#Compare simulation and theoretical statistics
rbind(
  simulation |> summarize(type = "simulation", mean = mean(x), sd = sd(x)),
  tibble(type = "theory", mean = mu, sd = sigma)
)
#Compare simulation and theoretical pdfs
ggplot(simulation, aes(x = x, y = after_stat(density))) +
  geom_histogram(bins = 30, boundary = 0, color = "white") +
  geom_line(aes(x = x, y = pdf), theory, color = "red")
#Compare simulation and theoretical cdfs
ggplot(simulation, aes(x = x)) +
  stat_ecdf() +
  geom_line(aes(x = x, y = cdf), theory, color = "red")
#Compare simulation and theory with a quantile-quantile plot
ggplot(simulation, aes(sample = x)) +
  stat_qq(distribution = qhard, alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = "red")
```
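In this particular case there is an independent check on the numerical approximations: $f$ is the density of $|Z|/\sqrt{2}$ for a standard normal random variable $Z$, so `phard` and `qhard` can be cross-checked against closed forms built from R's `pnorm` and `qnorm`. A brief sketch:

```{r}
#Since X = |Z|/sqrt(2) with Z standard normal, F(x) = 2*pnorm(x*sqrt(2)) - 1
#and its inverse is qnorm((1 + p)/2)/sqrt(2); the numerical versions should
#agree to roughly the default integrate and uniroot tolerances
x = c(0.5, 1, 2)
all.equal(phard(x), 2*pnorm(x*sqrt(2)) - 1, tolerance = 1e-4)
p = c(0.1, 0.5, 0.9)
all.equal(qhard(p), qnorm((1 + p)/2)/sqrt(2), tolerance = 1e-4)
```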