8 Functional Programming

Reading: 20 minute(s) at 200 WPM

Videos: 29 minute(s)

Objectives

Use functional programming techniques to create code which is well organized and easier to understand and maintain

8.1 Introduction to Iteration

We just learned the rule of “don’t repeat yourself more than two times” and to instead automate our procedures with functions in order to remove duplication of code. We have used tools such as across() to help eliminate this copy-paste procedure even further. This is a form of iteration in programming as across() “iterates” over variables, applying a function to manipulate each variable and then doing the same for the next variable.

while() and for() loops are a common form of iteration and can be extremely useful when thinking through a problem logically; however, they tend to be more verbose and, in R, are often slower than vectorized alternatives. Therefore, loops will not be the focus of this chapter. If you are interested, you can read about loops in the pre-reading material of this text.

You can read all about iteration in the previous version of R4DS.

8.2 Review of Lists and Vectors

In the pre-reading, we introduce the different data structures we have worked with in R. We are going to do a review of some of the important data structures for this chapter.

A vector is a 1-dimensional data structure that contains items of the same simple (‘atomic’) type (e.g. character, logical, integer, double). A factor is stored as an integer vector with labels attached.

(logical_vec <- c(T, F, T, T))

[1]  TRUE FALSE  TRUE  TRUE

(numeric_vec <- c(3, 1, 4, 5))

[1] 3 1 4 5

(char_vec <- c("A", "AB", "ABC", "ABCD"))

[1] "A"    "AB"   "ABC"  "ABCD"

You index a vector using brackets: to get the $i$th element of the vector x, you would use x[i] in R or x[i-1] in Python (remember, Python is 0-indexed, so the first element of the vector is at location 0).

logical_vec[3]

[1] TRUE

numeric_vec[3]

[1] 4

char_vec[3]

[1] "ABC"

You can also index a vector using a logical vector:

numeric_vec[logical_vec]

[1] 3 4 5

char_vec[logical_vec]

[1] "A"    "ABC"  "ABCD"

logical_vec[logical_vec]

[1] TRUE TRUE TRUE
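The logical vector used as an index does not have to be typed by hand; it is often produced by a comparison. Here is a minimal sketch that reuses the values of numeric_vec from above (the name is_large is made up for this illustration):

```r
numeric_vec <- c(3, 1, 4, 5)

# A comparison returns a logical vector of the same length
is_large <- numeric_vec > 3
is_large

# Using that logical vector as an index keeps only the TRUE positions
numeric_vec[is_large]

# The two steps are usually written together
numeric_vec[numeric_vec > 3]
```

The same pattern, x[x == value], appears later in this chapter when we replace missing-value codes.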

A list is a 1-dimensional data structure that has no restrictions on what type of content is stored within it. A list is a “vector”, but it is not an atomic vector - that is, it does not necessarily contain things that are all the same type.

(
  mylist <- list(
    logical_vec, 
    numeric_vec, 
    third_thing = char_vec[1:2]
  )
)

[[1]]
[1]  TRUE FALSE  TRUE  TRUE

[[2]]
[1] 3 1 4 5

$third_thing
[1] "A"  "AB"

List components may have names (or not), be homogeneous (or not), have the same length (or not).
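To see the difference between an atomic vector and a list, it may help to put the same mixed inputs into each. This is only a small sketch; the names mixed_vec and mixed_list are made up for the illustration:

```r
# c() builds an atomic vector, so mixed inputs are coerced to one common type
mixed_vec <- c(1, "a", TRUE)
mixed_vec
class(mixed_vec)

# list() keeps each element as it is, with its own type
mixed_list <- list(1, "a", TRUE)
str(mixed_list)
```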

8.2.1 Indexing

Indexing necessarily differs between R and Python, and since the list types are also somewhat different (e.g. lists cannot be named in Python), we will treat list indexing in the two languages separately.

A pepper shaker containing several individual paper packets of pepper — An unusual pepper shaker which we’ll call `pepper`

A pepper shaker containing a single individual paper packet of pepper. — An unusual pepper shaker which we’ll call `pepper`

There are 3 ways to index a list:

With single square brackets, just like we index atomic vectors. In this case, the return value is always a list.

mylist[1]

[[1]]
[1]  TRUE FALSE  TRUE  TRUE

mylist[2]

[[1]]
[1] 3 1 4 5

mylist[c(T, F, T)]

[[1]]
[1]  TRUE FALSE  TRUE  TRUE

$third_thing
[1] "A"  "AB"

With double square brackets. In this case, the return value is the object stored at the specified position in the list, but you can only extract one entry of the main list at a time. You can also extract entries by name.

mylist[[1]]

[1]  TRUE FALSE  TRUE  TRUE

mylist[["third_thing"]]

[1] "A"  "AB"

Using x$name. This is equivalent to using x[["name"]]. Note that this does not work on unnamed entries in the list.

mylist$third_thing

[1] "A"  "AB"

To access the contents of a list object, we have to use double-indexing:

mylist[["third_thing"]][[1]]

[1] "A"
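One way to check the difference between single and double brackets is to ask R what each one returns. The sketch below builds a small stand-in list (demo_list is a hypothetical name, not an object from this chapter) so that it runs on its own:

```r
demo_list <- list(c(TRUE, FALSE), c(3, 1, 4), third_thing = c("A", "AB"))

class(demo_list[1])     # single brackets: still a list
class(demo_list[[1]])   # double brackets: the logical vector itself

length(demo_list[3])    # a list holding one element
length(demo_list[[3]])  # the character vector stored in that element
```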

Note

You can get a more thorough review of vectors and lists from Jenny Bryan’s purrr tutorial introduction (Bryan n.d.).

8.3 Vectorized Operations

Operations in R are (usually) vectorized - that is, by default, they operate on vectors. This is primarily a feature that applies to atomic vectors (and we don’t even think about it):

(rnorm(10) + rnorm(10, mean = 3))

 [1] 4.3605735 2.6726954 3.2503800 1.6846993 3.8238961 4.3775941 2.9211669
 [8] 4.0266903 4.3830421 0.4627222

With vectorized functions, we don’t have to use a for loop to add these two vectors with 10 entries each together. In languages which don’t have implicit support for vectorized computations, this might instead look like:

a <- rnorm(10)
b <- rnorm(10, mean = 3)

result <- rep(0, 10)
for (i in 1:10) {
  result[i] <- a[i] + b[i]
}

result

 [1] 1.7103006 2.1922733 2.7639597 3.5712390 3.6324717 3.8090444 4.6425329
 [8] 5.3263927 0.3569237 2.9238857

That is, we would apply or map the + function to each entry of a and b. For atomic vectors, it’s easy to do this by default; with a list, however, we need to be a bit more explicit (because everything that’s passed into the function may not be the same type).
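The idea of mapping + over two vectors can also be written out explicitly. The sketch below assumes the purrr package is installed and uses its map2_dbl() function, which walks along two inputs in parallel; the short vectors are made up so the result is easy to check by eye:

```r
library(purrr)

a <- c(1, 2, 3)
b <- c(10, 20, 30)

# Vectorized addition: R applies + element by element for us
a + b

# The same computation written as an explicit map over both vectors
map2_dbl(a, b, `+`)
```

For atomic vectors the vectorized form is the one to use; the explicit form becomes useful once the inputs are lists.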

I find the purrr package easier to work with, so we won’t be working with the base functions (the apply family) in this course. You can find a side-by-side comparison in the purrr tutorial.

You can also watch Dr. Theobold’s video to learn more:

The R package purrr (like the similar base functions apply, lapply, sapply, tapply, and mapply) is based on extending “vectorized” functions to a wider variety of vector-like structures.
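As a rough illustration of how closely the two families mirror each other, the sketch below applies length() to every element of a small made-up list (things is a hypothetical name), once with base R and once with purrr. It assumes purrr is installed:

```r
library(purrr)

things <- list(1:3, letters, c(TRUE, FALSE))

# Base R: lapply() applies length() to each element and returns a list
lapply(things, length)

# purrr: map() does the same and also returns a list
map(things, length)
```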

8.4 Functional Programming

The concept of functional programming is a bit hard to define rigorously at the level we’re working at, but generally, functional programming is concerned with pure functions: functions that have an input value that determines the output value and create no other side effects.

What this means is that you describe every step of the computation using a function, and chain the functions together. At the end of the computations, you might save the program’s results to an object, but (in general), the goal is to not change things outside of the “pipeline” along the way.
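A small sketch may make the idea of a pure function more concrete. Both functions below are made up for illustration: add_and_count() changes a variable that lives outside of it (a side effect), while add() depends only on its arguments.

```r
# Not pure: the result depends on, and changes, a variable outside the function
counter <- 0
add_and_count <- function(x) {
  counter <<- counter + 1  # side effect: modifies `counter` outside the function
  x + counter
}
add_and_count(10)
add_and_count(10)  # same input, but a different output than the first call

# Pure: the output depends only on the arguments
add <- function(x, y) {
  x + y
}
add(10, 1)
add(10, 1)  # same input, same output
```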

This has some advantages:

Easier parallelization
- “Side effects” generally make it hard to parallelize code because e.g. you have to update stored objects in memory, which is hard to do with multiple threads accessing the same memory.
Functional programming tends to be easier to read
- You can see output and input and don’t have to work as hard to keep track of what is stored where.
Easier debugging
- You can examine the input and output at each stage to isolate which function is introducing the problem.

The introduction of the pipe in R has made chaining functions together in a functional programming-style pipeline much easier. purrr is just another step in this process: by making it easy to apply functions to lists of things (or to use multiple lists of things in a single function), purrr makes it easier to write clean, understandable, debuggable code.
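As a minimal sketch of what chaining looks like, the example below computes the same quantity twice: once with nested calls and once as a pipeline. It assumes R 4.1 or later, where the native pipe |> is available; the vector values is made up.

```r
values <- c(1, 4, 9)

# Nested calls are read from the inside out
sum(sqrt(values))

# The same computation as a pipeline, read from left to right
values |> sqrt() |> sum()
```

Each step takes the previous result as its input and changes nothing else, which is what makes the intermediate values easy to inspect.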

Functional Programming Example

This example is modified from the motivation section of the Functional Programming chapter in Advanced R (Wickham 2019).

Suppose we want to replace every -99 in the following sample dataset with an NA. (-99 is sometimes used to indicate missingness in datasets).

# Generate a sample dataset
set.seed(1014)
df <- data.frame(replicate(6, sample(c(1:10, -99), 6, replace = TRUE)))
names(df) <- letters[1:6]
df

  a   b   c   d  e f
1 7   5 -99   2  5 2
2 5   5   5   3  6 1
3 6   8   5   9  9 4
4 4   2   2   6  6 8
5 6   7   6 -99 10 6
6 9 -99   4   7  5 1

The “beginner” approach is to just replace each individual -99 with an NA:

df1 <- df
df1[6,2] <- NA
df1[1,3] <- NA
df1[5,4] <- NA

df1

  a  b  c  d  e f
1 7  5 NA  2  5 2
2 5  5  5  3  6 1
3 6  8  5  9  9 4
4 4  2  2  6  6 8
5 6  7  6 NA 10 6
6 9 NA  4  7  5 1

This is tedious, and painful, and won’t work if we have a slightly different dataset where the -99s are in different places. So instead, we might consider being a bit more general:

df2 <- df
df2$a[df2$a == -99] <- NA
df2$b[df2$b == -99] <- NA
df2$c[df2$c == -99] <- NA
df2$d[df2$d == -99] <- NA
df2$e[df2$e == -99] <- NA
df2$f[df2$f == -99] <- NA
df2

  a  b  c  d  e f
1 7  5 NA  2  5 2
2 5  5  5  3  6 1
3 6  8  5  9  9 4
4 4  2  2  6  6 8
5 6  7  6 NA 10 6
6 9 NA  4  7  5 1

This requires a few more lines of code, but is able to handle any data frame with 6 columns a - f. It also requires a lot of copy-paste and can leave you vulnerable to making mistakes.

The standard rule is that if you copy-paste the same code 3x, then you should write a function, so let’s try that instead:

fix_missing <- function(x, missing = -99){
  x[x == missing] <- NA
  x
}

df3 <- df
df3$a <- fix_missing(df$a)
df3$b <- fix_missing(df$b)
df3$c <- fix_missing(df$c)
df3$d <- fix_missing(df$d)
df3$e <- fix_missing(df$e)
df3$f <- fix_missing(df$f)
df3

  a  b  c  d  e f
1 7  5 NA  2  5 2
2 5  5  5  3  6 1
3 6  8  5  9  9 4
4 4  2  2  6  6 8
5 6  7  6 NA 10 6
6 9 NA  4  7  5 1

This still requires a lot of copy-paste, and doesn’t actually make the code more readable. We can more easily change the missing value, though, which is a bonus.
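To illustrate that bonus, here is a short sketch using a made-up vector (income is hypothetical) in which -999, rather than -99, marks a missing value:

```r
fix_missing <- function(x, missing = -99) {
  x[x == missing] <- NA
  x
}

# Hypothetical column that codes missing values as -999
income <- c(52000, -999, 61000, -999)

fix_missing(income)                  # the default of -99 matches nothing here
fix_missing(income, missing = -999)  # both -999 entries become NA
```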

We have a function that we want to apply or map to every column in our data frame. We could use a for() loop (this is for demonstration purposes only; I expect you to use more efficient tools in class):

fix_missing <- function(x, missing = -99){
  x[x == missing] <- NA
  x
}

df4 <- df
# seq_len() is safer than 1:ncol(), which misbehaves if there are zero columns
for (i in seq_len(ncol(df4))) {
  df4[, i] <- fix_missing(df4[, i])
}
df4

  a  b  c  d  e f
1 7  5 NA  2  5 2
2 5  5  5  3  6 1
3 6  8  5  9  9 4
4 4  2  2  6  6 8
5 6  7  6 NA 10 6
6 9 NA  4  7  5 1

This is more understandable and flexible than the previous function approach as well as the naive approach - we don’t need to know the names of the columns in our data frame, or even how many there are. It is still quite a few lines of code, though.

Iterating through a list (or the columns of a data frame) is a very common task, so R has a shorthand function for it. You could use lapply() from base R, but we will be learning the map family of functions from the purrr package.

fix_missing <- function(x, missing = -99){
  x[x == missing] <- NA
  x
}

df5 <- df
df5 <- map_dfc(df5, fix_missing)
df5

# A tibble: 6 × 6
      a     b     c     d     e     f
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     7     5    NA     2     5     2
2     5     5     5     3     6     1
3     6     8     5     9     9     4
4     4     2     2     6     6     8
5     6     7     6    NA    10     6
6     9    NA     4     7     5     1

By default, map() returns a list (see below), but we can use map_dfc() to return a data frame created by binding the columns together. (As of purrr 1.0.0, map_dfc() is superseded, though it still works.)

map() - returns a list

df6 <- df
map(df6, fix_missing)

$a
[1] 7 5 6 4 6 9

$b
[1]  5  5  8  2  7 NA

$c
[1] NA  5  5  2  6  4

$d
[1]  2  3  9  6 NA  7

$e
[1]  5  6  9  6 10  5

$f
[1] 2 1 4 8 6 1

We’ve replaced 6 lines of code that only worked for 6 columns named a - f with a single line of code that works for any data frame with any number of rows and columns, so long as -99 indicates missing data. In addition to being shorter, this code is also somewhat easier to read and much less vulnerable to typos.
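Two details are worth sketching here, under the assumption that purrr is installed. First, any arguments placed after the function name in map() are passed along to that function, so a different missing-value code can be supplied. Second, purrr's modify() is a relative of map() that returns the same type as its input, which is one way to get a data frame back. The data frame toy is made up for this illustration.

```r
library(purrr)

fix_missing <- function(x, missing = -99) {
  x[x == missing] <- NA
  x
}

# Small made-up data frame that uses -999 as its missing code
toy <- data.frame(a = c(1, -999, 3), b = c(-999, 5, 6))

# Arguments after the function name are passed on to fix_missing()
map(toy, fix_missing, missing = -999)

# modify() works like map() but returns the same type as its input,
# so a data frame goes in and a data frame comes out
toy_fixed <- modify(toy, fix_missing, missing = -999)
toy_fixed
```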

8.5 Introduction to `map()`

purrr is a part of the tidyverse, so you should already have the package installed. When you load the tidyverse with library(), this also loads purrr.

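# Only needed once, and only if purrr is not already installed
install.packages("purrr")
# Load purrr on its own (library(tidyverse) also loads it)
library(purrr)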
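Besides map(), which always returns a list, purrr has typed variants that return an atomic vector of a stated type. The sketch below shows three of them on a small made-up data frame (scores is a hypothetical name); it is meant as a first taste and not a full tour of the package.

```r
library(purrr)

scores <- data.frame(a = c(7, 5, 6), b = c(5, NA, 8))

# map_dbl(): one number per column, returned as a named double vector
map_dbl(scores, mean, na.rm = TRUE)

# map_chr(): one string per column, returned as a named character vector
map_chr(scores, class)

# map_lgl(): one TRUE/FALSE per column (does the column contain an NA?)
map_lgl(scores, anyNA)
```

The typed variants raise an error if the function does not return a single value of the expected type for each element, which can catch mistakes early.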

Download the purrr cheatsheet.

📖 (REQUIRED) Please read Sections 21.5 through 21.7 of R for Data Science (Grolemund and Wickham 2017).

Learn More About Purrr

The Joy of Functional Programming (for Data Science): Hadley Wickham’s talk on purrr and functional programming. ~1h video and slides.
(The Joy of Cooking meets Data Science, with illustrations by Allison Horst)
Pirating Web Content Responsibly with R and purrr (a blog post in honor of international talk like a pirate day) (Rudis 2017)
Happy R Development with purrr
Web mining with purrr
Text Wrangling with purrr
Setting NAs with purrr (uses the naniar package)
Mappers with purrr - handy ways to make your code simpler if you’re reusing functions a lot.
Function factories - code optimization with purrr
Stats and Machine Learning examples with purrr

References

Bryan, Jennifer. n.d. “Lessons and Examples.” Purrr Tutorial. Accessed November 14, 2022. https://jennybc.github.io/purrr-tutorial/index.html.

Grolemund, Garrett, and Hadley Wickham. 2017. R for Data Science. 1st ed. O’Reilly Media. https://r4ds.had.co.nz/.

Rudis, Bob. 2017. “Pirating Web Content Responsibly With R.” Rud.is. https://rud.is/b/2017/09/19/pirating-web-content-responsibly-with-r/.

Wickham, Hadley. 2019. “Functional Programming.” In Advanced R, 2nd ed. The R Series. Chapman; Hall/CRC. http://adv-r.had.co.nz/Functional-programming.html.
