9 Simulating Distributions

Reading: 7 minute(s) at 200 WPM

Videos: 48 minutes

Objectives

This chapter is heavily from Dr. Theobold’s course-page material.

9.1 Simulation

In statistics, we often want to simulate data (or create fake data) for a variety of purposes. For example, in your first statistics course, you may have flipped coins to “simulate” a 50-50 chance. In this section, we will learn how to simulate data from statistical distributions using R.

Required Reading – R Programming for Data Science : Simulation

9.1.1 Setting a Random Number Seed

Functions like rnorm() rely on something called pseudo-randomness. Because computers can never be truly random, complicated processes are implemented to make “random” number generation be so unpredictable as to behave like true randomness.

This means that projects involving simulation are harder to make reproducible. For example, here are two identical lines of code that give different results!

rnorm(1, mean = 0, sd = 1)

[1] 0.1623273

rnorm(1, mean = 0, sd = 1)

[1] -0.3184077

Fortunately, pseudo-randomness depends on a seed, which is an arbitrary number where the randomizing process starts. Normally, R will choose the seed for you, from a pre-generated vector:

head(.Random.seed)

[1]       10403           4 -1176886906 -2143185136   501511091  -535777199

However, you can also choose your own seed using the set.seed() function. This guarantees your results will be consistent across runs (and hopefully computers):

set.seed(1234)
rnorm(1, mean = 0, sd = 1)

[1] -1.207066

set.seed(1234)
rnorm(1, mean = 0, sd = 1)

[1] -1.207066

Of course, it doesn’t mean the results will be the same in every subsequent run if you forget or reset the seed in between each line of code!

set.seed(1234)
rnorm(1, mean = 0, sd = 1)

[1] -1.207066

## Calling rnorm() again without a seed "resets" the seed! 
rnorm(1, mean = 0, sd = 1)

[1] 0.2774292

It is very important to always set a seed at the beginning of a Quarto document that contains any random steps, so that your rendered results are consistent.

Note, though, that this only guarantees your rendered results will be the same if the code has not changed.

Changing up any part of the code will re-randomize everything that comes after it!

When writing up a report which includes results from a random generation process, in order to ensure reproducibility in your document, use `r ` to include your output within your written description with inline code.

Reproducibility: inline code example

my_rand <- rnorm(1, mean = 0, sd = 1)
my_rand

[1] 1.084441

Using r my_rand will display the result within my text:

My random number is 1.0844412.

Alternatively, you could have put the rnorm code directly into the inline text r rnorm(1, mean = 0, sd = 1), but this can get messy if you have a result that requires a larger chunk of code.

9.1.2 Plotting Density Distributions

The code below creates a tibble (read fancy data frame) of 100 heights randomly simulated (read drawn) from a normal distribution with a mean of 67 and standard deviation of 3.

set.seed(93401)
my_samples <- tibble(height = rnorm(n    = 100, 
                                    mean = 67, 
                                    sd   = 3)
                     )
my_samples |> 
  head()

# A tibble: 6 × 1
  height
   <dbl>
1   63.1
2   66.7
3   68.2
4   63.4
5   68.9
6   71.4

To visualize the simulated heights, we can look at the density of the values. We plot the simulated values using geom_histogram() and define the local \(y\) aesthetic to plot calculate and plot the density of these values. We can then overlay the normal distribution curve (theoretical equation) with our specified mean and standard deviation using dnorm within stat_function()

my_samples |> 
  ggplot(aes(x = height)) +
  geom_histogram(aes(y = ..density..), 
                 binwidth = 1.75, 
                 fill = "grey"
                 ) +
  stat_function(fun = ~ dnorm(.x, mean = 67, sd = 3), 
                col = "cornflowerblue", 
                lwd = 2
                ) + 
  xlim(c(55, 80))

9.2 Linear Regression

You now have the skills to import, wrangle, and visualize data. All of these tools help us prepare our data for statistical modeling. While we have sprinkled some formal statistical analyses throughout the course, in this section we will be formally reviewing Linear Regression. First let’s review simple linear regression. Linear regression models the linear relationship between two quantitative variables.

Review of Simple Linear Regression and Conditions

Handy function shown in the reading! skim from the skimr package.

9.2.1 Linear Regression in R

To demonstrate linear regression in R, we will be working with the penguins data set.

library(palmerpenguins)
data(penguins)
head(penguins) |> 
  kable()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007

When conducting linear regression with tools in R, we often want to visualize the relationship between the two quantitative variables of interest with a scatterplot. We can then use either geom_smooth(method = "lm") (or equivalently stat_smooth(method = "lm") to add a line of best fit (“regression line”) based on the ordinary least squares (OLS) equation to our scatter plot. The regression line is shown in a default blue line with the standard error uncertainty displayed in a gray transparent band (use se = FALSE to hide the standard error uncertainty band). These visual aesthetics can be changed just as any other plot aesthetics.

penguins |>
  ggplot(aes(x = bill_depth_mm, 
             y = bill_length_mm
             )
         ) +
  geom_point() +
  geom_smooth(method = "lm") + 
  labs(x = "Bill Depth (mm)",
       subtitle = "Bill Length (mm)",
       title = "Relationship between penguin bill length and depth"
       ) +
  theme(axis.title.y = element_blank())

Be careful of “overplotting” and use geom_jitter() instead of geom_point() if your data set is dense. This is strictly a data visualization tool and will not alter the original values.

In simple linear regression, we can define the linear relationship with a mathematical equation given by:

\[y = a + b\cdot x\]

Remember \(y = m\cdot x+b\) from eighth grade?!

where

\(y\) are the values of the response variable,
\(x\) are the values of the explanatory/predictor variable,
\(a\) is the \(y\)-intercept (average value of \(y\) when \(x = 0\)), and
\(b\) is the slope coefficient (for every 1 unit increase in \(x\), the average of \(y\) increases by b)

Remember “rise over run”!

In statistics, we use slightly different notation to denote this relationship with the estimated linear regression equation:

\[\hat y = b_0 + b_1\cdot x.\]

Note that the “hat” symbol above our response variable indicates this is an “estimated” value (or our best guess).

We can “fit” the linear regression equation with the lm function in R. The formula argument is denoted as y ~ x where the left hand side (LHS) is our response variable and the right hand side (RHS) contains our explanatory/predictor variable(s). We indicate the data set with the data argument and therefore use the variable names (as opposed to vectors) when defining our formula. We name (my_model) and save our fitted model just as we would any other R object.

my_model <- lm(bill_length_mm ~ bill_depth_mm, 
               data = penguins
               )

Now that we have fit our linear regression, we might be wondering how we actually get the information out of our model. What are the y-intercept and slope coefficient estimates? What is my residual? How good was the fit? The code options below help us obtain this information.

This is what is output when you just call the name of the linear model object you created (my_model). Notice, the output doesn’t give you much information and it looks kind of bad.

my_model


Call:
lm(formula = bill_length_mm ~ bill_depth_mm, data = penguins)

Coefficients:
  (Intercept)  bill_depth_mm  
      55.0674        -0.6498

This is what is output when you use the summary() function on a linear model object. Notice, the output gives you a lot of information, some of which is really not that useful. And, the output is quite messy!

summary(my_model)


Call:
lm(formula = bill_length_mm ~ bill_depth_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.8949  -3.9042  -0.3772   3.6800  15.5798 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    55.0674     2.5160  21.887  < 2e-16 ***
bill_depth_mm  -0.6498     0.1457  -4.459 1.12e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.314 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.05525,   Adjusted R-squared:  0.05247 
F-statistic: 19.88 on 1 and 340 DF,  p-value: 1.12e-05

The tidy() function from the {broom} package takes a linear model object and puts the “important” information into a tidy tibble output.

Ah! Just right!

library(broom)
tidy(my_model)

# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     55.1       2.52      21.9  6.91e-67
2 bill_depth_mm   -0.650     0.146     -4.46 1.12e- 5

If you are sad that you no longer have the statistics about the model fit (e.g., R-squared, adjusted R-squared, \(\sigma\)), you can use the glance() function from the broom package to grab those!

broom::glance(my_model)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0552        0.0525  5.31      19.9 0.0000112     1 -1056. 2117. 2129.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>