library(youdrawitR)
library(palmerpenguins) # Load the penguins dataset
Introduction
customDataGen() is a versatile function that processes a data frame to generate data suitable for visualizing with the drawr() function. It allows you to specify the type of regression, the degree of a polynomial or loess regression, whether to apply a log transformation to y for the fitted line, and whether a confidence interval should be generated.
In this example, we will process the penguins dataset with the customDataGen() function, looking specifically at penguins on Biscoe Island.
biscoe_penguins <- subset(penguins, island == "Biscoe")
# Use customDataGen to process the data
custom_data <- customDataGen(
df = biscoe_penguins,
xvar = "body_mass_g",
yvar = "flipper_length_mm",
regression_type = "linear",
log_y = FALSE,
conf_int = TRUE # conf_int can only be true for linear regression
)
The customDataGen() function does not require you to provide an xvar and yvar. If none are provided, it will use the first column as the xvar and the second column as the yvar. For additional information on the parameters of customDataGen(), see the function documentation.
The customDataGen() function returns a list containing the point data and line data processed from the input data frame.
# Print out the custom data
custom_data
#> $line_data
#> # A tibble: 163 × 7
#> data x y coef int lower_bound upper_bound
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 line_data 2850 180. 0.0159 134. 177. 182.
#> 2 line_data 2850 180. 0.0159 134. 177. 182.
#> 3 line_data 2900 181. 0.0159 134. 178. 183.
#> 4 line_data 2925 181. 0.0159 134. 178. 184.
#> 5 line_data 3075 183. 0.0159 134. 181. 186.
#> 6 line_data 3150 185. 0.0159 134. 182. 187.
#> 7 line_data 3150 185. 0.0159 134. 182. 187.
#> 8 line_data 3175 185. 0.0159 134. 183. 187.
#> 9 line_data 3200 185. 0.0159 134. 183. 188.
#> 10 line_data 3200 185. 0.0159 134. 183. 188.
#> # ℹ 153 more rows
#>
#> $point_data
#> # A tibble: 163 × 3
#> data x y
#> <chr> <dbl> <dbl>
#> 1 point_data 2850 181
#> 2 point_data 2850 184
#> 3 point_data 2900 187
#> 4 point_data 2925 193
#> 5 point_data 3075 183
#> 6 point_data 3150 172
#> 7 point_data 3150 185
#> 8 point_data 3175 181
#> 9 point_data 3200 187
#> 10 point_data 3200 193
#> # ℹ 153 more rows
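The columns of line_data can be read against an ordinary least-squares fit. In the output above, y appears to equal int + coef * x (134 + 0.0159 × 2850 is roughly 180, up to rounding in the printed values), and lower_bound and upper_bound bracket it. The base R sketch below builds a table with the same shape from simulated data. It is an illustration under that reading of the output, not the package's actual implementation; the simulated variables and the name line_sketch are hypothetical.

```r
# Simulated data standing in for body mass (x) and flipper length (y).
set.seed(1)
x <- sort(round(runif(50, 2800, 6300)))
y <- 134 + 0.016 * x + rnorm(50, sd = 5)

# Ordinary least-squares fit and a 95% confidence interval for the mean response.
fit <- lm(y ~ x)
ci  <- predict(fit, interval = "confidence", level = 0.95)

# Assemble a table shaped like the line_data shown above. The column names are
# copied from that output; how the package fills them is an assumption here.
line_sketch <- data.frame(
  x           = x,
  y           = ci[, "fit"],
  coef        = unname(coef(fit)["x"]),
  int         = unname(coef(fit)["(Intercept)"]),
  lower_bound = ci[, "lwr"],
  upper_bound = ci[, "upr"]
)

head(line_sketch)
```

Because coef and int are constants of the fit, they repeat on every row, which matches the repeated 0.0159 and 134 in the printed tibble.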
Now let’s pass the data to the drawr() function to visualize it. We can also specify the axis labels and titles of the graph, and display the 95% confidence interval, since it was generated by customDataGen() earlier.
drawr(custom_data,
title = "Flipper Length vs Body Mass",
subtitle = "For Penguins on Biscoe Island",
x_lab = "Body Mass (g)",
y_lab = "Flipper Length (mm)",
conf_int = TRUE)
Try drawing for yourself in the plot above! See if you can replicate the regression line. You can also draw the 95% confidence interval boundaries using the “new line” button. Have fun experimenting with this interactive plot!
Different Regression Options
The customDataGen() function currently offers four regression options: linear, polynomial, logistic, and loess. Since we have already seen linear, let’s take a look at the others.
Logistic Regression
# Convert Species into a binary categorical variable
biscoe_penguins$binary_species <- ifelse(biscoe_penguins$species == "Adelie", "Adelie", "other")
The customDataGen() function can only generate logistic regression data for binary categorical variables. In this case, the species variable takes one of two values: “Adelie” or “other”. You do not need to convert this variable to a factor, as the customDataGen() function will do it for you.
# Generate custom data for logistic regression
custom_data_logistic <- customDataGen(
df = biscoe_penguins,
xvar = "bill_length_mm", # Predictor variable (numeric): Bill length in millimeters
yvar = "binary_species", # Response variable (binary categorical): Adelie (1) or Other (0)
regression_type = "logistic",
success_level = "Adelie"
)
The success_level argument specifies which of the two levels of the binary response variable is considered as the “event” or “success”. In this case, we are interested in the occurrence of the ‘Adelie’ species, and so we set “Adelie” as the success_level.
If success_level is not provided, the function will default to using the first level of the binary response variable, sorted alphabetically. So, if your binary response has levels “yes” and “no”, and you do not specify a success_level, the function will treat “no” as the “success” level, since “no” comes before “yes” alphabetically.
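The alphabetical default described above can be checked in base R, where factor levels are sorted the same way. The sketch below uses simulated data and plain glm() to show one way of coding a chosen success level as 1 before fitting. It is an assumption-laden illustration of the idea, not how customDataGen() is implemented; the names resp, y01, and grid are hypothetical.

```r
# Simulated binary response with the values "yes" and "no" (hypothetical data).
set.seed(2)
x    <- rnorm(200)
resp <- ifelse(runif(200) < plogis(1.5 * x), "yes", "no")

# Alphabetical ordering puts "no" first, so it would be the default "success".
levels(factor(resp))
#> [1] "no"  "yes"

# Choosing the success level explicitly: code it as 1 and everything else as 0.
success_level <- "yes"
y01 <- as.integer(resp == success_level)

# Logistic regression of the 0/1 response on x, then predicted probabilities
# of the success level over an evenly spaced grid of x values.
fit  <- glm(y01 ~ x, family = binomial)
grid <- seq(min(x), max(x), length.out = 100)
prob <- predict(fit, newdata = data.frame(x = grid), type = "response")
```

Swapping success_level to "no" flips the fitted curve, since the predicted probability of one level is one minus that of the other.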
drawr(custom_data_logistic,
title = "Probability of 'Adelie' Species Based on Bill Length",
aspect_ratio = 1.2)
Here’s another chance for you to test your data drawing skills! Try to draw the logistic regression curve showing the probability of a penguin being of ‘Adelie’ species based on its bill length.
Polynomial Regression
Polynomial regression can be used to model relationships between variables that aren’t linear. In this case, we’ll generate a second degree polynomial regression using “bill_length_mm” as the x variable, and “body_mass_g” as the y variable.
# Generate custom data for polynomial regression
custom_data_poly <- customDataGen(
df = biscoe_penguins,
xvar = "bill_length_mm",
yvar = "body_mass_g",
regression_type = "polynomial",
degree = 2 # default is 2 if not specified for poly regression
)
In this example, we used a degree of 2, which means that we are fitting a quadratic polynomial to the data. Higher degrees will fit more complex polynomial curves, but be aware that higher-degree polynomials can lead to overfitting.
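To see why a low training error alone does not justify a higher degree, the following base R sketch fits polynomials of degree 1, 2, and 8 to simulated quadratic data using lm() with poly(). The data and the helper rss() are hypothetical and only illustrate the overfitting caveat; they are not part of youdrawitR.

```r
# Simulated data whose true relationship is quadratic.
set.seed(3)
x <- seq(0, 10, length.out = 60)
y <- 2 + 1.5 * x - 0.2 * x^2 + rnorm(60, sd = 1)

# Residual sum of squares on the training data for a polynomial of degree d.
rss <- function(d) sum(resid(lm(y ~ poly(x, d)))^2)

# The training error never increases as the degree grows, because each
# lower-degree model is a special case of the higher-degree one.
sapply(c(1, 2, 8), rss)
```

The drop from degree 1 to degree 2 reflects the real curvature in the data, while any further drop at degree 8 comes from fitting noise.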
Let’s visualize this data using the drawr() function:
drawr(custom_data_poly,
title = "Bill Length vs Body Mass: Poly Regression")
Go ahead, try to draw the polynomial regression curve showing the relationship between bill length and body mass. This may be more challenging than the previous plots as the relationship is not linear!
Loess Regression
Loess regression is a non-parametric method that uses local data fitting to fit a smooth curve through points in a scatter plot. We’ll generate a loess regression using “bill_depth_mm” as the x variable, and “body_mass_g” as the y variable.
# Generate custom data for loess regression
custom_data_loess <- customDataGen(
df = penguins,
xvar = "bill_depth_mm",
yvar = "body_mass_g",
regression_type = "loess",
degree = 1, # default if not specified is 1 for loess regression (must be 0, 1, or 2)
span = 0.75 # default if not specified is 0.75 (ranges between 0 and 1)
)
In this example, we used a span of 0.75, which determines the amount of data considered for each local fit. Adjusting the span parameter allows you to control the flexibility of the loess fit. Too large a span will over-smooth the regression, resulting in bias and loss of information, while too small a span with insufficient data can result in larger variance and overfitting. For more details on selecting the optimal smoothing parameter, you can refer to this guide.
The degree parameter determines the degree of the polynomials used for the local fitting. A degree of 1 fits straight lines, while a degree of 2 fits parabolas. Using a higher degree can capture more complex patterns, but be careful not to overfit. Using a degree of 0 will turn loess into a weighted moving average.
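The effect of span and degree can be explored directly with stats::loess() in base R. The sketch below uses simulated data and compares three settings. Whether customDataGen() relies on this same function internally is an assumption; the object names are hypothetical.

```r
# Simulated data with a smooth non-linear trend.
set.seed(4)
x <- seq(0, 10, length.out = 120)
y <- sin(x) + rnorm(120, sd = 0.3)

# Same data, three settings of span and degree.
fit_wiggly <- loess(y ~ x, span = 0.2,  degree = 2)  # small span: very flexible
fit_smooth <- loess(y ~ x, span = 0.75, degree = 1)  # span 0.75, local lines
fit_flat   <- loess(y ~ x, span = 0.75, degree = 0)  # local weighted average

# Training residual sum of squares for each fit.
sapply(list(wiggly = fit_wiggly, smooth = fit_smooth, flat = fit_flat),
       function(f) sum(resid(f)^2))
```

Note that stats::loess() itself defaults to degree = 2, so the degree is set explicitly here to mirror the values discussed above.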
Now we can visualize this data:
drawr(custom_data_loess,
title = "Bill Depth vs Body Mass: Loess Regression")
Try to draw the loess regression curve. This might be the most challenging plot yet, as the relationship between bill depth and body mass is complex and non-linear. You can almost imagine it as giving a kid a crayon and telling them to draw a line through the points. Good luck!
Logarithmic Scale
The youdrawitR package offers functionality to plot data on a logarithmic scale. This can be particularly useful when dealing with data that exhibit exponential growth or decay. By transforming such data onto a logarithmic scale, exponential trends can be made linear, which can simplify the task of drawing the trend.
When using the customDataGen() function, you can specify log_y = TRUE to indicate that the y variable should be transformed to the logarithmic scale. This will fit the line as log(y) ~ x. If log_y is not specified or is FALSE, the fitted line is not transformed. When using a log transformation, remember that all y values must be positive. It is also important to note that the youdrawitR package currently only supports the log transformation with the linear regression option.
In the drawr() function, you can set the linear argument to anything other than “true” to put the graph on a logarithmic scale. You should do this whenever log_y = TRUE in customDataGen(). You can also specify the base of the logarithm with the log_base argument. If log_base is not provided or is NULL, the natural logarithm (base e) is used. Be sure to use the same log_base value in both the customDataGen() and drawr() functions. To make the transformed fitted line compatible with the logarithmic scale, the drawr() function exponentiates the y-values of the line data so that the fitted line correctly represents the linear trend on the logarithmic scale.
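The fit-then-exponentiate idea described above can be sketched in base R. The example below fits a straight line to log(y) in a chosen base and then raises that base to the predictions to return to the original scale. It uses noise-free hypothetical data so the arithmetic is easy to check, and it illustrates the transformation only, not the internals of customDataGen() or drawr().

```r
# Noise-free exponential data (hypothetical) keeps the result easy to verify.
x <- 1:20
y <- 3 * exp(0.05 * x)

log_base <- 10  # any base works as long as it is used consistently

# Fit a straight line to log(y) in the chosen base.
fit <- lm(log(y, base = log_base) ~ x)

# Predictions are on the log scale; raising the base to them undoes the log.
pred_log  <- predict(fit)
pred_orig <- log_base^pred_log

# The back-transformed line should reproduce the original exponential curve.
all.equal(unname(pred_orig), y)
```

Using a different base for the fit and for the back-transformation would distort the curve, which is why the same log_base should be used in both steps.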
Let’s see this feature in action:
set.seed(123)
# Generate x-values
x_values <- seq(1, 100, by = 1)
# Generate y-values with exponential growth and some random noise
y_values <- exp(0.05 * x_values) * rnorm(length(x_values), mean = 1, sd = 0.2)
# Make sure all y-values are positive
y_values <- ifelse(y_values <= 0, abs(y_values), y_values)
# Combine into a data frame
exp_data <- data.frame(
x = x_values,
y = y_values
)
# Generate custom data for linear regression with log-transformed y
custom_data_log <- customDataGen(
df = exp_data,
regression_type = "linear", # must be linear for log_y to be TRUE
log_y = TRUE,
log_base = NULL
)
# Plot the data with a logarithmic y-scale
drawr(custom_data_log,
linear = FALSE,
log_base = NULL,
title = "Log Scale")
In the plot above, you can try to draw the trend line. You’ll notice that despite the original data having an exponential relationship between x and y, the line data and point data on the log scale exhibit a linear relationship. Hence, plotting data on a logarithmic scale can simplify the task of identifying and drawing trends in data that have exponential growth patterns.
Let’s compare to how the graph would look on a linear scale:
# Generate custom data
custom_data_linear <- customDataGen(
df = exp_data,
regression_type = "linear",
log_y = FALSE
)
# Plot the data with a linear y-scale
drawr(custom_data_linear,
linear = "true",
title = "Linear Scale")
As you can see, the exponential nature of the data is much more challenging to draw and understand on a linear scale. Although other regression types could potentially fit this trend better on a linear scale, transforming the data to a logarithmic scale is a viable option whenever all y values are positive. It turns exponential growth patterns into more straightforward linear relationships, making the task of drawing and understanding trends significantly easier.