Quick Guide to Regex in R
The purpose of this guide is to bridge the gap between understanding what regular expressions are and knowing how to use them in R. If you’re brand new to regular expressions, I highly recommend checking out RegexOne.
Hadley Wickham’s stringr package makes using regular expressions in R a breeze. I use it to avoid the complexity of base R’s regex functions (grep, grepl, regexpr, gregexpr, sub and gsub), where even the function names are cryptic.
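For a rough sense of the difference, here is how a few base R calls line up with their stringr counterparts. The sample string `x` is made up for this comparison, and the mapping is only a sketch, not an exhaustive equivalence.

```r
library(stringr)

# hypothetical sample string, used only for this comparison
x <- "cat 12 dog 345"

# detect: grepl() takes the pattern first, str_detect() takes the string first
base_detect    <- grepl("\\d+", x)
stringr_detect <- str_detect(x, "\\d+")

# extract the first match: base R needs regexpr() plus regmatches()
base_first    <- regmatches(x, regexpr("\\d+", x))
stringr_first <- str_extract(x, "\\d+")

# replace every match: gsub() vs str_replace_all()
base_replaced    <- gsub("\\d+", "#", x)
stringr_replaced <- str_replace_all(x, "\\d+", "#")
```

In each pair the two calls return the same value for this input; the stringr versions share a consistent `str_` prefix and always take the string as the first argument.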
Setup
library(stringr)
sentence <- "We bought our Golden Retriever, Snuggles, for $30 on 1/1/2015 at 1017 Main St. where they have many dogs."
Does the string contain a pattern?
# Does the sentence contain the word “the”?
# match "the" anywhere, even inside a longer word such as "they"
str_detect(sentence, "the")
## [1] TRUE
# consider word boundaries on both sides of the word "the"
str_detect(sentence, "\\bthe\\b")
## [1] FALSE
Extracting patterns
# What’s the first number that appears in the sentence?
# find the first digit
str_extract(sentence, "\\d")
## [1] "3"
# find the first sequence of digits
str_extract(sentence, "\\d+")
## [1] "30"
# find the first run of digits together with the single character before it
# note: inside [], \\b is not a word boundary (ICU reads it as a backspace),
# so [^\\b] matches practically any one character, here the "$"
str_extract(sentence, "[^\\b]\\d+(?=\\b)")
## [1] "$30"
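If the goal of that last pattern is to pull out the dollar amount, it can be written more explicitly. The sketch below assumes that intent and compares three spellings; the variable names are only illustrative.

```r
library(stringr)

sentence <- "We bought our Golden Retriever, Snuggles, for $30 on 1/1/2015 at 1017 Main St. where they have many dogs."

# "." matches any single character, so for this sentence it behaves like [^\\b]
any_char <- str_extract(sentence, ".\\d+(?=\\b)")
any_char
## [1] "$30"

# escaping the dollar sign states the intent directly
dollars <- str_extract(sentence, "\\$\\d+")
dollars
## [1] "$30"

# a lookbehind keeps the "$" out of the result
amount <- str_extract(sentence, "(?<=\\$)\\d+")
amount
## [1] "30"
```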
# find all sequences of digits that sit between word boundaries
str_extract_all(sentence, "\\b\\d+\\b")
## [[1]]
## [1] "30" "1" "1" "2015" "1017"
Counting matching patterns
# How many times does the word “dog” appear in the sentence?
# count occurrences of "dog", including inside longer words such as "dogs"
str_count(sentence, "dog")
## [1] 1
# count occurrences of the word "dog" and require word boundaries
# on both sides of the word
str_count(sentence, "\\bdog\\b")
## [1] 0
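To count the word in either its singular or plural form, one option is to make the trailing "s" optional inside the word boundaries. This is a small variation on the example above, not part of the original set:

```r
library(stringr)

sentence <- "We bought our Golden Retriever, Snuggles, for $30 on 1/1/2015 at 1017 Main St. where they have many dogs."

# "s?" makes the final "s" optional, so "dog" and "dogs" both count as whole words
n_dogs <- str_count(sentence, "\\bdogs?\\b")
n_dogs
## [1] 1
```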
Replacing matching patterns
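# Replace the 2nd digit with a 9
# the lookbehind (?<=\\d) requires a digit just before the match,
# so the match can only start after the 1st digit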
str_replace(sentence, "(?<=\\d)[^\\d]*(\\d)", "9")
## [1] "We bought our Golden Retriever, Snuggles, for $39 on 1/1/2015 at 1017 Main St. where they have many dogs."
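That pattern works here because the first two digits are adjacent. When other characters sit between them, `[^\\d]*` swallows those characters too. A possible alternative, sketched with a made-up string `x`, is to locate each digit and overwrite only the second one:

```r
library(stringr)

# hypothetical string with a non-digit between the first two digits
x <- "a1b2c3"

# the pattern above also consumes the "b" between the digits
greedy <- str_replace(x, "(?<=\\d)[^\\d]*(\\d)", "9")
greedy
## [1] "a19c3"

# locate every digit, then overwrite only the second one in a copy of x
digits <- str_locate_all(x, "\\d")[[1]]
y <- x
str_sub(y, digits[2, "start"], digits[2, "end"]) <- "9"
y
## [1] "a1b9c3"
```

`str_locate_all()` returns a matrix of start and end positions for each match, so row 2 points at the second digit regardless of what precedes it.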
# Replace every 0 or 1 with a 6
str_replace_all(sentence, "(0|1)", "6")
## [1] "We bought our Golden Retriever, Snuggles, for $36 on 6/6/2665 at 6667 Main St. where they have many dogs."
# Replace all instances of multiple spaces with a single space
str_replace_all(sentence, "\\s{2,}", " ")
## [1] "We bought our Golden Retriever, Snuggles, for $30 on 1/1/2015 at 1017 Main St. where they have many dogs."
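The sentence as printed above contains no runs of spaces, so that call returns it unchanged. A short sketch with a made-up string shows the effect, along with `str_squish()`, which stringr provides for the same job:

```r
library(stringr)

# hypothetical string containing runs of spaces
spaced <- "many   dogs  on    Main St."

# collapse every run of 2 or more whitespace characters into one space
collapsed <- str_replace_all(spaced, "\\s{2,}", " ")
collapsed
## [1] "many dogs on Main St."

# str_squish() collapses inner runs and also trims both ends
squished <- str_squish("  many   dogs  ")
squished
## [1] "many dogs"
```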