Reading: 16 minute(s) at 200 WPM + r4ds required readings
Videos: 32 minutes
This chapter is heavily outsourced to r4ds as they do a much better job at providing examples and covering the extensive functionality of each of the packages than I myself would ever be able to.
stringr
lubridate
stringr
Nearly always, when multiple variables are stored in a single column, they are stored as character variables. There are many different “levels” of working with strings in programming, from simple find-and-replace of fixed (constant) strings to regular expressions, which are extremely powerful (and extremely complicated).
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. - Jamie Zawinski
(required) Go read about strings in r4ds.
stringr
Download the stringr cheatsheet.
| Task | stringr |
|---|---|
| Replace pattern with replacement | str_replace(x, pattern, replacement) and str_replace_all(x, pattern, replacement) |
| Convert case | str_to_lower(x), str_to_upper(x), str_to_title(x) |
| Strip whitespace from start/end | str_trim(x), str_squish(x) |
| Pad strings to a specific length | str_pad(x, …) |
| Test if the string contains a pattern | str_detect(x, pattern) |
| Count how many times a pattern appears in the string | str_count(x, pattern) |
| Find the first appearance of the pattern within the string | str_locate(x, pattern) |
| Find all appearances of the pattern within the string | str_locate_all(x, pattern) |
| Detect a match at the start/end of the string | str_starts(x, pattern), str_ends(x, pattern) |
| Subset a string from index a to b | str_sub(x, a, b) |
| Convert string encoding | str_conv(x, encoding) |
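As a quick illustration of the table, here is a minimal sketch that runs several of these functions on one string. The sample string and the variable names are made up for this example.

```r
library(stringr)

# Hypothetical sample string with leading and trailing whitespace
x <- "  The quick brown fox  "

trimmed  <- str_trim(x)                            # "The quick brown fox"
shouting <- str_to_upper(trimmed)                  # "THE QUICK BROWN FOX"
swapped  <- str_replace(trimmed, "quick", "slow")  # "The slow brown fox"
has_fox  <- str_detect(x, "fox")                   # TRUE
n_o      <- str_count(x, "o")                      # 2 (one in "brown", one in "fox")
word_two <- str_sub(trimmed, 5, 9)                 # "quick"
padded   <- str_pad("42", width = 5, pad = "0")    # "00042" (pads on the left by default)
```

Every function takes the string as its first argument, so the calls chain naturally with a pipe.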
Matching exact strings is easy - it’s just like using find and replace.
human_talk <- "blah, blah, blah. Do you want to go for a walk?"
dog_hears <- str_extract(human_talk, "walk")
dog_hears
[1] "walk"
But, if you can master even a small amount of regular expression notation, you’ll have exponentially more power to do good (or evil) when working with strings. You can get by without regular expressions if you’re creative, but often they’re much simpler.
You may find it helpful to follow along with this section using this web app built to test R regular expressions. The subset of regular expression syntax we’re going to cover here is fairly limited, but you can find regular expressions to do just about anything string-related. As with any tool, there are situations where it’s useful, and situations where you should not use a regular expression, no matter how much you want to.
Here are the basics of regular expressions:
- [] enclose sets of characters. For example, [abc] will match any single character a, b, or c.
  - A - inside [] specifies a range of characters. [A-Za-z] matches all upper and lower case letters; the shorter [A-z] also matches the punctuation characters that sit between Z and a in ASCII ([, \, ], ^, _ and `).
  - To match - exactly, precede it with a backslash (outside of []) or put the - last (inside []).
- A . matches any character (except a newline).
- Special characters are escaped with \ (in most languages) or \\ (in R). So \. or \\. will match a literal ., and \$ or \\$ will match a literal $.
num_string <- "phone: 123-456-7890, nuid: 12345678, ssn: 123-45-6789"
ssn <- str_extract(num_string, "[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]")
ssn
[1] "123-45-6789"
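To see escaping and character ranges in action, here is a small sketch. The price string is invented for the example. The last two lines show that a range written as [A-z] covers more than letters, because it also spans the ASCII characters between Z and a.

```r
library(stringr)

# Hypothetical string containing both a literal period and a look-alike
price <- "total: $5.00 (was 5x00)"

# Unescaped, . matches any character, so "5x00" also qualifies
loose  <- str_extract_all(price, "5.00")[[1]]    # "5.00" "5x00"

# Escaped, \\. only matches a literal period
strict <- str_extract_all(price, "5\\.00")[[1]]  # "5.00"

# \\$ matches a literal dollar sign
dollar <- str_extract(price, "\\$[0-9]")         # "$5"

# The underscore falls between Z and a, so [A-z] matches it
wide   <- str_detect("_", "[A-z]")               # TRUE
narrow <- str_detect("_", "[A-Za-z]")            # FALSE
```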
Listing out all of those numbers can get repetitive, though. How do we specify repetition?
- * means repeat between 0 and inf times
- + means 1 or more times
- ? means 0 or 1 times; most useful when you’re looking for something optional
- {a,b} means repeat between a and b times, where a and b are integers. b can be blank. So [abc]{3,} will match abc, aaaa, cbbaa, but not ab, bb, or a. For a single number of repeated characters, you can use {a}. So {3,} means “3 or more times” and {3} means “exactly 3 times”
library(stringr)
str_extract("banana", "[a-z]{1,}") # match any sequence of lowercase characters
[1] "banana"
str_extract("banana", "[ab]{1,}") # Match any sequence of a and b characters
[1] "ba"
str_extract_all("banana", "(..)") # Match any two characters
[[1]]
[1] "ba" "na" "na"
str_extract("banana", "(..)\\1") # Match a repeated thing
[1] "anan"
num_string <- "phone: 123-456-7890, nuid: 12345678, ssn: 123-45-6789, bank account balance: $50,000,000.23"
ssn <- str_extract(num_string, "[0-9]{3}-[0-9]{2}-[0-9]{4}")
ssn
[1] "123-45-6789"
phone <- str_extract(num_string, "[0-9]{3}.[0-9]{3}.[0-9]{4}")
phone
[1] "123-456-7890"
nuid <- str_extract(num_string, "[0-9]{8}")
nuid
[1] "12345678"
bank_balance <- str_extract(num_string, "\\$[0-9,]+\\.[0-9]{2}")
bank_balance
[1] "$50,000,000.23"
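The examples above mostly use the {a,b} form. As a complement, here is a minimal sketch comparing ?, *, + and {n} on the same letter. The word list is made up for illustration.

```r
library(stringr)

# Hypothetical spellings that differ only in the number of u's
words <- c("color", "colour", "colouur", "colr")

optional_u <- str_detect(words, "colou?r")    # TRUE  TRUE  FALSE FALSE (0 or 1 u)
any_u      <- str_detect(words, "colou*r")    # TRUE  TRUE  TRUE  FALSE (0 or more u)
some_u     <- str_detect(words, "colou+r")    # FALSE TRUE  TRUE  FALSE (1 or more u)
two_u      <- str_detect(words, "colou{2}r")  # FALSE FALSE TRUE  FALSE (exactly 2 u)
```

The quantifier applies only to the character (or group) immediately before it, which is why "colr" never matches: the second o is not optional.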
There are also ways to “anchor” a pattern to a part of the string (e.g. the beginning or the end).
- ^ has multiple meanings:
  - At the start of a pattern, ^ matches the beginning of a string
  - Immediately after [, e.g. [^abc], ^ means “not” - for instance, “the collection of all characters that aren’t a, b, or c”.
- $ means the end of a string

Combined with pre and post-processing, these let you make sense out of semi-structured string data, such as addresses.
address <- "1600 Pennsylvania Ave NW, Washington D.C., 20500"
house_num <- str_extract(address, "^[0-9]{1,}")
# Match everything alphanumeric up to the comma
street <- str_extract(address, "[A-Za-z0-9 ]{1,}")
street <- str_remove(street, house_num) %>% str_trim() # remove house number
city <- str_extract(address, ",.*,") %>% str_remove_all(",") %>% str_trim()
zip <- str_extract(address, "[0-9-]{5,10}$") # match 5 and 9 digit zip codes
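A smaller sketch may make the anchors easier to see in isolation. The file names below are hypothetical; the same pattern gives different answers depending on whether it is anchored.

```r
library(stringr)

# Hypothetical file names that all contain "csv" somewhere
files <- c("report.csv", "csv_notes.txt", "data.csv.bak")

anywhere  <- str_detect(files, "csv")    # TRUE  TRUE  TRUE
at_start  <- str_detect(files, "^csv")   # FALSE TRUE  FALSE
at_end    <- str_detect(files, "csv$")   # TRUE  FALSE FALSE

# Inside [], a leading ^ negates the set
stem   <- str_extract("report.csv", "[^.]+")   # "report" (everything up to the first period)
digits <- str_remove_all("a1b2c3", "[^0-9]")   # "123" (drop everything that is not a digit)
```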
- () are used to capture information. So ([0-9]{4}) captures any 4-digit number
- a|b will select a or b.

If you’ve captured information using (), you can reference that information using backreferences. In most languages, those look like this: \1 for the first reference, \9 for the ninth. In R, the \ character is special, so you have to escape it: \\1 is the first reference, \\2 is the second, and so on through \\9.
phone_num_variants <- c("(123) 456-7980", "123.456.7890", "+1 123-456-7890")
phone_regex <- "\\+?[0-9]{0,3}? ?\\(?([0-9]{3})?\\)?.?([0-9]{3}).?([0-9]{4})"
# \\+?[0-9]{0,3} matches the country code, if specified,
# but won't take the first 3 digits from the area code
# unless a country code is also specified
# \\( and \\) match literal parentheses if they exist
# ([0-9]{3})? captures the area code, if it exists
# .? matches an optional separator (any single character)
# ([0-9]{3}) captures the exchange code
# ([0-9]{4}) captures the 4-digit individual code
str_extract(phone_num_variants, phone_regex)
[1] "(123) 456-7980" "123.456.7890" "+1 123-456-7890"
str_replace(phone_num_variants, phone_regex, "\\1\\2\\3")
[1] "1234567980" "1234567890" "1234567890"
# The country code was matched but not captured, so it is dropped from the result
human_talk <- "blah, blah, blah. Do you want to go for a walk? I think I'm going to treat myself to some ice cream for working so hard. "
dog_hears <- str_extract_all(human_talk, "walk|treat")
dog_hears
[[1]]
[1] "walk" "treat"
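Capture groups are also useful when you want the pieces themselves and not a rewritten string. The sketch below uses str_match(), which returns the full match in the first column and one column per group. The dates and the date_regex name are invented for the example.

```r
library(stringr)

# Hypothetical ISO-style dates
dates <- c("2024-03-15", "1999-12-31")
date_regex <- "([0-9]{4})-([0-9]{2})-([0-9]{2})"

# Column 1 is the whole match; columns 2 to 4 are the captured groups
parts <- str_match(dates, date_regex)
years <- parts[, 2]                                       # "2024" "1999"

# Backreferences let us reorder the captured pieces
us_style <- str_replace(dates, date_regex, "\\2/\\3/\\1") # "03/15/2024" "12/31/1999"
```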
Putting it all together, we can test our regular expressions to ensure that they are specific enough to pull out what we want, while not pulling out other similar information:
library(tibble) # provides tibble(), used below
strings <- c("abcdefghijklmnopqrstuvwxyzABAB",
"banana orange strawberry apple",
"ana went to montana to eat a banana",
"call me at 432-394-2873. Do you want to go for a walk? I'm going to treat myself to some ice cream for working so hard.",
"phone: (123) 456-7890, nuid: 12345678, bank account balance: $50,000,000.23",
"1600 Pennsylvania Ave NW, Washington D.C., 20500")
phone_regex <- "\\+?[0-9]{0,3}? ?\\(?([0-9]{3})?\\)?.?([0-9]{3}).([0-9]{4})"
dog_regex <- "(walk|treat)"
addr_regex <- "([0-9]*) ([A-Za-z0-9 ]{3,}), ([A-Za-z\\. ]{3,}), ([0-9]{5})"
abab_regex <- "(..)\\1"
tibble(
text = strings,
phone = str_detect(strings, phone_regex),
dog = str_detect(strings, dog_regex),
addr = str_detect(strings, addr_regex),
abab = str_detect(strings, abab_regex))
# A tibble: 6 × 5
text phone dog addr abab
<chr> <lgl> <lgl> <lgl> <lgl>
1 abcdefghijklmnopqrstuvwxyzABAB FALSE FALSE FALSE TRUE
2 banana orange strawberry apple FALSE FALSE FALSE TRUE
3 ana went to montana to eat a banana FALSE FALSE FALSE TRUE
4 call me at 432-394-2873. Do you want to go for a walk… TRUE TRUE FALSE FALSE
5 phone: (123) 456-7890, nuid: 12345678, bank account b… TRUE FALSE FALSE FALSE
6 1600 Pennsylvania Ave NW, Washington D.C., 20500 FALSE FALSE TRUE FALSE
(semi-required) Go read more about regular expressions in r4ds.
Read at least through section 17.4.1.
lubridate
To fill in an important part of our toolbox, we need to learn how to work with date variables. These variables feel like they should be simple and intuitive, given that we all work with schedules and calendars every day. However, dates and times have nuances, and learning them makes working with this kind of data much easier.
(Required) Go read about dates and times in r4ds.
A more in-depth discussion of the POSIXlt and POSIXct data classes.
A tutorial on lubridate
- scroll down for details on intervals if you have trouble with %within% and %--%
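Before working through the readings, a minimal sketch of the lubridate basics may help: parsing dates written in different orders, pulling out components, and building an interval to test membership. The dates and variable names are made up for illustration.

```r
library(lubridate)

# The parser name gives the order of the components in the input
d1 <- ymd("2024-01-15")
d2 <- mdy("03/03/2024")
d3 <- dmy("20/02/2024")

yr  <- year(d1)    # 2024
mon <- month(d2)   # 3

# %--% builds an interval from a start and an end date
span <- d1 %--% d2

# %within% tests whether a date falls inside an interval
inside <- d3 %within% span     # TRUE

# Date arithmetic: add a period, or subtract two dates
next_month <- d1 + months(1)       # "2024-02-15"
n_days     <- as.numeric(d2 - d1)  # 48 (2024 is a leap year)
```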