Reading: 20 minute(s) at 200 WPM
Videos: 29 minute(s)
We just learned the rule of “don’t repeat yourself more than two times” and to instead automate our procedures with functions in order to remove duplication of code. We have used tools such as across() to help eliminate this copy-paste procedure even further. This is a form of iteration in programming, as across() “iterates” over variables, applying a function to manipulate each variable and then doing the same for the next variable.
while() and for() loops are a common form of iteration and can be extremely useful when thinking through a problem logically; however, in R they are often more verbose, and sometimes slower, than vectorized or functional alternatives. Therefore, loops will not be the focus of this chapter. If you are interested, you can read about loops in the pre-reading material of this text.
You can read all about iteration in the previous version of R4DS.
In the pre-reading, we introduce the different data structures we have worked with in R. We are going to do a review of some of the important data structures for this chapter.
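A vector is a 1-dimensional data structure that contains items of the same simple (‘atomic’) type (for example character, logical, integer, or double). A factor is not an atomic type of its own; it is stored as an integer vector with labels attached.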
(logical_vec <- c(T, F, T, T))
[1] TRUE FALSE TRUE TRUE
(numeric_vec <- c(3, 1, 4, 5))
[1] 3 1 4 5
(char_vec <- c("A", "AB", "ABC", "ABCD"))
[1] "A" "AB" "ABC" "ABCD"
You index a vector using brackets: to get the \(i\)th element of the vector x, you would use x[i] in R or x[i-1] in Python (remember, Python is 0-indexed, so the first element of the vector is at location 0).
logical_vec[3]
[1] TRUE
numeric_vec[3]
[1] 4
char_vec[3]
[1] "ABC"
You can also index a vector using a logical vector:
numeric_vec[logical_vec]
[1] 3 4 5
char_vec[logical_vec]
[1] "A" "ABC" "ABCD"
logical_vec[logical_vec]
[1] TRUE TRUE TRUE
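Logical indexing is most useful when the logical vector comes from a comparison. A minimal sketch, reusing the values of the vectors above (the names keep, big_values, and big_labels are made up for illustration):

```r
numeric_vec <- c(3, 1, 4, 5)
char_vec <- c("A", "AB", "ABC", "ABCD")

# A comparison returns a logical vector of the same length
keep <- numeric_vec > 2

# Use it to index the same vector, or any other vector of the same length
big_values <- numeric_vec[keep]
big_labels <- char_vec[keep]
```

Here keep plays the same role as logical_vec did above: entries where it is TRUE are kept, and the rest are dropped.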
A list is a 1-dimensional data structure that has no restrictions on what type of content is stored within it. A list is a “vector”, but it is not an atomic vector - that is, it does not necessarily contain things that are all the same type.
(
  mylist <- list(
    logical_vec,
    numeric_vec,
    third_thing = char_vec[1:2]
  )
)
[[1]]
[1] TRUE FALSE TRUE TRUE
[[2]]
[1] 3 1 4 5
$third_thing
[1] "A" "AB"
List components may have names (or not), be homogeneous (or not), have the same length (or not).
Indexing necessarily differs between R and Python, and since the list types are also somewhat different (e.g. lists cannot be named in Python), we will treat list indexing in the two languages separately.
There are 3 ways to index a list:
mylist[1]
[[1]]
[1] TRUE FALSE TRUE TRUE
mylist[2]
[[1]]
[1] 3 1 4 5
mylist[c(T, F, T)]
[[1]]
[1] TRUE FALSE TRUE TRUE
$third_thing
[1] "A" "AB"
mylist[[1]]
[1] TRUE FALSE TRUE TRUE
mylist[["third_thing"]]
[1] "A" "AB"
With x$name. This is equivalent to using x[["name"]]. Note that this does not work on unnamed entries in the list.
mylist$third_thing
[1] "A" "AB"
To access the contents of a list object, we have to use double-indexing:
mylist[["third_thing"]][[1]]
[1] "A"
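The practical difference between single and double brackets is easiest to see by checking what each one returns. A small sketch, rebuilding a list with the same contents as mylist (the names sub_list, element, and total are only for illustration):

```r
mylist <- list(
  c(TRUE, FALSE, TRUE, TRUE),
  c(3, 1, 4, 5),
  third_thing = c("A", "AB")
)

# Single brackets keep the list wrapper: the result is a list of length 1
sub_list <- mylist[2]
class(sub_list)

# Double brackets pull out the element itself, here a numeric vector
element <- mylist[[2]]
class(element)

# Functions that expect a vector need the extracted element, not the sub-list
total <- sum(mylist[[2]])
```

Calling sum(mylist[2]) instead would fail, because sum() would be handed a list rather than the numeric vector inside it.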
You can get a more thorough review of vectors and lists from Jenny Bryan’s purrr tutorial introduction (Bryan n.d.).
Operations in R are (usually) vectorized - that is, by default, they operate on vectors. This is primarily a feature that applies to atomic vectors (and we don’t even think about it):
[1] 4.3605735 2.6726954 3.2503800 1.6846993 3.8238961 4.3775941 2.9211669
[8] 4.0266903 4.3830421 0.4627222
With vectorized functions, we don’t have to use a for loop to add these two vectors with 10 entries each together. In languages which don’t have implicit support for vectorized computations, this might instead look like:
a <- rnorm(10)
b <- rnorm(10, mean = 3)
result <- rep(0, 10)
for (i in 1:10) {
  result[i] <- a[i] + b[i]
}
result
[1] 1.7103006 2.1922733 2.7639597 3.5712390 3.6324717 3.8090444 4.6425329
[8] 5.3263927 0.3569237 2.9238857
That is, we would apply or map the + function to each entry of a and b. For atomic vectors, it’s easy to do this by default; with a list, however, we need to be a bit more explicit (because everything that’s passed into the function may not be the same type).
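To make the contrast concrete, here is a minimal sketch comparing the vectorized form with the explicit loop, followed by an explicit "apply this function to each entry" over a list. The seed and the list called mixed are arbitrary choices for illustration. Base R's lapply() is used only so the sketch has no dependencies; purrr's map() plays the same role.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
a <- rnorm(10)
b <- rnorm(10, mean = 3)

# Vectorized: `+` is applied to each pair of entries automatically
vectorized <- a + b

# Explicit loop, for comparison
looped <- numeric(10)
for (i in seq_along(a)) {
  looped[i] <- a[i] + b[i]
}
all.equal(vectorized, looped)

# With a list, we state explicitly which function to apply to each entry
mixed <- list(1:3, c(2.5, 3.5), c(TRUE, FALSE))
lengths_list <- lapply(mixed, length)
```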
I find the purrr package easier to work with, so we won’t be working with the base functions (the apply family) in this course. You can find a side-by-side comparison in the purrr tutorial.
You can also watch Dr. Theobold’s video to learn more:
The R package purrr (like the similar base functions apply, lapply, sapply, tapply, and mapply) is based on extending “vectorized” functions to a wider variety of vector-like structures.
The concept of functional programming is a bit hard to define rigorously at the level we’re working at, but generally, functional programming is concerned with pure functions: functions that have an input value that determines the output value and create no other side effects.
What this means is that you describe every step of the computation using a function, and chain the functions together. At the end of the computations, you might save the program’s results to an object, but (in general), the goal is to not change things outside of the “pipeline” along the way.
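As a rough illustration of the distinction (both function names and the counter variable are hypothetical), compare a pure function with one that has a side effect:

```r
# Pure: the result depends only on the input, and nothing outside is changed
add_one <- function(x) {
  x + 1
}

# Not pure: the function also modifies a variable outside of itself
counter <- 0
add_one_and_count <- function(x) {
  counter <<- counter + 1  # side effect: changes `counter` in the enclosing environment
  x + 1
}

add_one(1:3)
add_one_and_count(1:3)
add_one_and_count(1:3)
counter
```

Both functions return the same values, but only the second one leaves a trace behind: after two calls, counter has changed even though nothing was assigned to it directly.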
This has some advantages:
The introduction of the pipe in R has made chaining functions together in a functional programming-style pipeline much easier. purrr is just another step in this process: by making it easy to apply functions to lists of things (or to use multiple lists of things in a single function), purrr makes it easier to write clean, understandable, debuggable code.
This example is modified from the motivation section of the Functional Programming chapter in Advanced R (Wickham 2019).
Suppose we want to replace every -99 in the following sample dataset with an NA. (-99 is sometimes used to indicate missingness in datasets).
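The sample dataset itself is not shown here. The sketch below builds a data frame that is consistent with the outputs printed in the rest of this section. The construction is an assumption, since the original data may have been generated differently (for example, by random sampling).

```r
# Reconstructed from the printed results: -99 marks the missing entries
df <- data.frame(
  a = c(7, 5, 6, 4, 6, 9),
  b = c(5, 5, 8, 2, 7, -99),
  c = c(-99, 5, 5, 2, 6, 4),
  d = c(2, 3, 9, 6, -99, 7),
  e = c(5, 6, 9, 6, 10, 5),
  f = c(2, 1, 4, 8, 6, 1)
)

# Locate the -99 entries (row and column positions)
which(df == -99, arr.ind = TRUE)
```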
The “beginner” approach is to just replace each individual -99 with an NA:
df1 <- df
df1[6,2] <- NA
df1[1,3] <- NA
df1[5,4] <- NA
df1
a b c d e f
1 7 5 NA 2 5 2
2 5 5 5 3 6 1
3 6 8 5 9 9 4
4 4 2 2 6 6 8
5 6 7 6 NA 10 6
6 9 NA 4 7 5 1
This is tedious, and painful, and won’t work if we have a slightly different dataset where the -99s are in different places. So instead, we might consider being a bit more general:
df2 <- df
df2$a[df2$a == -99] <- NA
df2$b[df2$b == -99] <- NA
df2$c[df2$c == -99] <- NA
df2$d[df2$d == -99] <- NA
df2$e[df2$e == -99] <- NA
df2$f[df2$f == -99] <- NA
df2
a b c d e f
1 7 5 NA 2 5 2
2 5 5 5 3 6 1
3 6 8 5 9 9 4
4 4 2 2 6 6 8
5 6 7 6 NA 10 6
6 9 NA 4 7 5 1
This requires a few more lines of code, but is able to handle any data frame with 6 columns a - f. It also requires a lot of copy-paste and can leave you vulnerable to making mistakes.
The standard rule is that if you copy-paste the same code 3x, then you should write a function, so let’s try that instead:
fix_missing <- function(x, missing = -99) {
  x[x == missing] <- NA
  x
}
df3 <- df
df3$a <- fix_missing(df$a)
df3$b <- fix_missing(df$b)
df3$c <- fix_missing(df$c)
df3$d <- fix_missing(df$d)
df3$e <- fix_missing(df$e)
df3$f <- fix_missing(df$f)
df3
a b c d e f
1 7 5 NA 2 5 2
2 5 5 5 3 6 1
3 6 8 5 9 9 4
4 4 2 2 6 6 8
5 6 7 6 NA 10 6
6 9 NA 4 7 5 1
This still requires a lot of copy-paste, and doesn’t actually make the code more readable. We can more easily change the missing value, though, which is a bonus.
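To illustrate that bonus, here is a short sketch in which a made-up column uses two different missing-value codes. The vector x and the code 999 are hypothetical.

```r
fix_missing <- function(x, missing = -99) {
  x[x == missing] <- NA
  x
}

# Hypothetical column where both -99 and 999 have been used as missing codes
x <- c(4, -99, 7, 999)

default_code <- fix_missing(x)                 # only -99 becomes NA
other_code   <- fix_missing(x, missing = 999)  # only 999 becomes NA
```

Changing the code only requires changing one argument, rather than editing every line that mentions -99.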
We have a function that we want to apply or map to every column in our data frame. We could use a for() loop (doing this for demonstrative purposes only, I expect you to use more efficient tools in class):
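# Replace every occurrence of `missing` in x with NA
fix_missing <- function(x, missing = -99) {
  x[x == missing] <- NA
  x
}
df4 <- df
# seq_along() is safe even if the data frame has zero columns
for (i in seq_along(df4)) {
  df4[[i]] <- fix_missing(df4[[i]])
}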
df4
a b c d e f
1 7 5 NA 2 5 2
2 5 5 5 3 6 1
3 6 8 5 9 9 4
4 4 2 2 6 6 8
5 6 7 6 NA 10 6
6 9 NA 4 7 5 1
This is more understandable and flexible than the previous function approach as well as the naive approach - we don’t need to know the names of the columns in our data frame, or even how many there are. It is still quite a few lines of code, though.
Iterating through a list (or columns of a data frame) is a very common task, so R has a shorthand function for it. You could use lapply from base R, but we will be learning the map family of functions from the purrr package.
fix_missing <- function(x, missing = -99) {
  x[x == missing] <- NA
  x
}
df5 <- df
df5 <- map_dfc(df5, fix_missing)
df5
# A tibble: 6 × 6
a b c d e f
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7 5 NA 2 5 2
2 5 5 5 3 6 1
3 6 8 5 9 9 4
4 4 2 2 6 6 8
5 6 7 6 NA 10 6
6 9 NA 4 7 5 1
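By default, map returns a list (see below), but we can use map_dfc to return a data frame created by binding the columns together. (Note that purrr 1.0.0 and later mark map_dfc as superseded; it still works, but is no longer the recommended approach.)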
map() returns a list:
df6 <- df
map(df6, fix_missing)
$a
[1] 7 5 6 4 6 9
$b
[1] 5 5 8 2 7 NA
$c
[1] NA 5 5 2 6 4
$d
[1] 2 3 9 6 NA 7
$e
[1] 5 6 9 6 10 5
$f
[1] 2 1 4 8 6 1
We’ve replaced 6 lines of code that only worked for 6 columns named a - f with a single line of code that works for any data frame with any number of rows and columns, so long as -99 indicates missing data. In addition to being shorter, this code is also somewhat easier to read and much less vulnerable to typos.
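If the missing code is something other than -99, the extra argument can be passed through an anonymous function. The sketch below is one possible way to do this, not the only one. It assumes purrr is installed and R 4.1 or later (for the \(col) shorthand), and the data frame df_small and the code 999 are made up for illustration. It also shows modify(), which keeps the data frame type, and map_int(), one of the typed variants in the map family.

```r
library(purrr)

fix_missing <- function(x, missing = -99) {
  x[x == missing] <- NA
  x
}

# Small made-up data frame using 999 as the missing code
df_small <- data.frame(x = c(1, 999, 3), y = c(999, 5, 6))

# map() returns a list with one element per column
as_list <- map(df_small, \(col) fix_missing(col, missing = 999))

# modify() applies the function to each column and returns a data frame
as_df <- modify(df_small, \(col) fix_missing(col, missing = 999))

# Typed variants return an atomic vector, e.g. one NA count per column
n_missing <- map_int(as_df, \(col) sum(is.na(col)))
```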
map()
purrr is a part of the tidyverse, so you should already have the package installed. When you load the tidyverse with library(), this also loads purrr.
install.packages("purrr")
library(purrr)
📖 (REQUIRED) Please read Sections 21.5 through 21.7 of R for Data Science
The Joy of Functional Programming (for Data Science): Hadley Wickham’s talk on purrr and functional programming. ~1h video and slides.
(The Joy of Cooking meets Data Science, with illustrations by Allison Horst)
Pirating Web Content Responsibly with R and purrr (a blog post in honor of international talk like a pirate day) (Rudis 2017)
Setting NAs with purrr (uses the naniar package)
Mappers with purrr - handy ways to make your code simpler if you’re reusing functions a lot.