Regular Expressions
A regular expression is a sequence of characters that describes a pattern to match. The concept was formalized by the American mathematician Stephen Cole Kleene. A regular expression pattern can contain a combination of alphanumeric and special characters. Let us take a closer look at how these special characters can be used to craft regular expressions in R.
Metacharacters
Metacharacters are characters that have a special meaning within a regular expression. Unlike other characters, which simply match themselves, metacharacters are reserved and cannot be matched literally unless they are escaped. The following table shows a list of metacharacters used in regular expressions.
Metacharacter | Description |
---|---|
^ | Matches at the start of the string |
$ | Matches at the end of the string |
() | Defines a subexpression (group) to be matched and retrieved later |
\| | Matches the pattern before it or the pattern after it |
[ ] | Matches a single character that is contained within the brackets |
. | Matches any single character |

We shall now see how metacharacters can be used to match different patterns with a few examples.
library(stringr) # Needed for str_detect().
string <- c('hands', 'data', 'on', 'data$cience', 'handsondata$cience', 'handson')
grep(pattern='^data', string, value=TRUE) # Matching the occurrence of the pattern at the beginning of the string.
## [1] "data" "data$cience"
grep(pattern='on$', string, value=TRUE) # Matching the occurrence of the pattern at the end of the string.
## [1] "on" "handson"
str_detect(string, pattern='(nd)+') # Detecting whether the group (nd) occurs at least once.
## [1] TRUE FALSE FALSE FALSE TRUE TRUE
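The examples above exercise ^, $ and (). The remaining metacharacters from the table can be sketched in the same way. The vector words below is a hypothetical one made up for illustration, and only base R is assumed.

```r
# Hypothetical input vector, not part of the original examples.
words <- c('cat', 'cot', 'cut', 'coat', 'dog')

# | matches either the pattern before it or the pattern after it.
alt <- grep(pattern='cat|dog', words, value=TRUE)

# [ ] matches exactly one of the characters listed inside the brackets.
set <- grep(pattern='c[ao]t', words, value=TRUE)

# . matches any single character, so 'c.t' needs one character between c and t.
dot <- grep(pattern='c.t', words, value=TRUE)

print(alt) # "cat" "dog"
print(set) # "cat" "cot"
print(dot) # "cat" "cot" "cut"
```

Note that 'coat' is not matched by 'c.t', because two characters sit between the c and the t.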
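In order to match a metacharacter literally in R, we place \(\backslash\backslash\) (a double backslash) before it. The first backslash escapes the second inside the R string literal, so the regular expression engine receives a single backslash followed by the metacharacter.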
grep(pattern='\\$', string, value=TRUE) # Matching the metacharacter $
## [1] "data$cience" "handsondata$cience"
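As a sketch of two alternatives to the double backslash, a metacharacter can also be placed inside a character class, or the whole pattern can be treated as a literal string through the fixed argument of grep(). The vector prices is hypothetical and exists only for this illustration.

```r
# Hypothetical input vector, made up for this illustration.
prices <- c('cost$5', 'free', 'a.b', 'ab')

# Three ways to match a literal dollar sign.
esc   <- grep(pattern='\\$', prices, value=TRUE)             # escaped with a double backslash
cls   <- grep(pattern='[$]', prices, value=TRUE)             # wrapped in a character class
fixed <- grep(pattern='$', prices, value=TRUE, fixed=TRUE)   # pattern taken literally

# An unescaped dot matches any character; an escaped dot matches only a dot.
any_char <- grep(pattern='.', prices, value=TRUE)
dot_only <- grep(pattern='\\.', prices, value=TRUE)

print(esc)      # "cost$5"
print(cls)      # "cost$5"
print(fixed)    # "cost$5"
print(any_char) # "cost$5" "free" "a.b" "ab"
print(dot_only) # "a.b"
```

With fixed=TRUE no character in the pattern is treated as a metacharacter, which is convenient when the text to find is known exactly.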
Quantifiers
Quantifiers are used to match repetitions of a pattern within a string. The following table shows a list of quantifiers.
Quantifier | Description |
---|---|
* | The preceding item is matched 0 or more times |
+ | The preceding item is matched 1 or more times |
? | The preceding item is matched at most once (0 or 1 times) |
{n} | The preceding item is matched exactly n times |
{n,} | The preceding item is matched at least n times |
Let us see some examples of quantifiers in practice.
strings <- c('aaab', 'abb', 'bc', 'abbcd', 'bbbc', 'abab', 'caa')
grep(pattern='ab*b', strings, value=TRUE) # Matching strings that contain 'a', then zero or more 'b', then a final 'b'.
## [1] "aaab" "abb" "abbcd" "abab"
grep(pattern='abbc?', strings, value=TRUE) # Matching strings that contain 'abb' followed by at most one 'c'.
## [1] "abb" "abbcd"
grep(pattern='b{2,}', strings, value=TRUE) # Matching strings that contain at least 2 consecutive 'b'.
## [1] "abb" "abbcd" "bbbc"
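Because grep() only reports whether a match exists somewhere in the string, the effect of a quantifier is easier to see when the pattern is anchored or when the matched text is extracted. The following is a minimal sketch with a hypothetical vector x; it uses the base R function regmatches() together with regexpr() to pull out the matched substring.

```r
# Hypothetical input vector, made up for this illustration.
x <- c('b', 'bb', 'bbb', 'abbbbc')

# {n} with anchors: the whole string must be exactly two b's.
exact2 <- grep(pattern='^b{2}$', x, value=TRUE)

# {n,} without anchors: at least two consecutive b's anywhere in the string.
atleast2 <- grep(pattern='b{2,}', x, value=TRUE)

# ? makes the preceding item optional.
optional <- grepl(pattern='^ab?c$', c('ac', 'abc', 'abbc'))

# + is greedy: the first match takes as many b's as it can.
longest <- regmatches(x, regexpr(pattern='b+', x))

print(exact2)   # "bb"
print(atleast2) # "bb" "bbb" "abbbbc"
print(optional) # TRUE TRUE FALSE
print(longest)  # "b" "bb" "bbb" "bbbb"
```

Without the ^ and $ anchors, 'b{2}' would also match 'bbb' and 'abbbbc', since each of them contains two consecutive b's.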
Character classes
A character class is a set that describes a category of characters. Character classes are enclosed within [] and match any one of the characters in the set. For example, the character class [0-9] matches any single digit occurring in the string. Below is a set of commonly used character classes.
Character class | Description |
---|---|
[0-9] | Digits |
[a-z] | Lower-case letters |
[A-Z] | Upper-case letters |
[a-zA-Z] | Alphabetic characters |
[^a-zA-Z] | Non-alphabetic characters |
[a-zA-Z0-9] | Alphanumeric characters |
[ ] | Space characters |
[[:punct:]] | Punctuation characters (POSIX class) |
Let us see some simple examples of using character classes in regular expressions.
string <- c('abc12', '@#$', '345', 'ABcd')
grep(pattern='[0-9]+', string, value=TRUE) # Matching strings that have digits.
## [1] "abc12" "345"
grep(pattern='[A-Z]+', string, value=TRUE) # Matching strings that have capital letters.
## [1] "ABcd"
grep(pattern='[^@#$]+', string, value=TRUE) # Matching strings that contain at least one character other than @, # or $.
## [1] "abc12" "345" "ABcd"
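A negated class on its own only requires one character outside the set to appear somewhere in the string. To select strings that contain no special characters at all, the pattern can be anchored at both ends, or a positive match can be inverted. The sketch below uses a hypothetical vector ids that adds a mixed string to make the difference visible.

```r
# Hypothetical input vector, made up for this illustration.
ids <- c('abc12', 'a@b', '@#$', '345')

# At least one character other than @, # or $ occurs somewhere.
some_plain <- grep(pattern='[^@#$]+', ids, value=TRUE)

# Every character from start to end is something other than @, # or $.
all_plain <- grep(pattern='^[^@#$]+$', ids, value=TRUE)

# Equivalent result: drop the strings in which @, # or $ occurs.
no_special <- ids[!grepl(pattern='[@#$]', ids)]

print(some_plain) # "abc12" "a@b" "345"
print(all_plain)  # "abc12" "345"
print(no_special) # "abc12" "345"
```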
Alternatively, R allows the use of POSIX character classes, which are written within [[ ]] (double square brackets).
grep(pattern='[[:alpha:]]', string, value=TRUE) # Matching strings that contain alphabetic characters.
## [1] "abc12" "ABcd"
grep(pattern='[[:upper:]]', string, value=TRUE) # Matching upper case characters.
## [1] "ABcd"
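A few more POSIX classes can be sketched in the same way. The classes [:digit:], [:space:], [:punct:] and [:alnum:] are standard POSIX names supported by R; the vector samples is hypothetical.

```r
# Hypothetical input vector, made up for this illustration.
samples <- c('abc12', 'hello world', 'wait...', 'ABcd')

has_digit <- grepl(pattern='[[:digit:]]', samples)     # contains a digit
has_space <- grepl(pattern='[[:space:]]', samples)     # contains whitespace
has_punct <- grepl(pattern='[[:punct:]]', samples)     # contains punctuation
alnum_only <- grepl(pattern='^[[:alnum:]]+$', samples) # letters and digits only

# Several POSIX classes can share one pair of outer brackets.
stripped <- gsub(pattern='[[:digit:][:punct:]]', replacement='', samples)

print(has_digit)  # TRUE FALSE FALSE FALSE
print(has_space)  # FALSE TRUE FALSE FALSE
print(has_punct)  # FALSE FALSE TRUE FALSE
print(alnum_only) # TRUE FALSE FALSE TRUE
print(stripped)   # "abc" "hello world" "wait" "ABcd"
```

The outer brackets form the character class and the inner [: :] names the POSIX category, which is why a single pair of brackets such as [:digit:] alone does not behave as intended.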
Functions in R to Support Regular Expressions
R has a wide array of functions that support regular expressions. The following is a list of functions from the base and stringr packages that support regular expressions.
Function | Description |
---|---|
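grep() | Returns the indices of the elements that matched (or the elements themselves with value=TRUE) |
grepl() | Returns logical values (TRUE or FALSE) indicating whether the pattern exists in each string |
regexpr() | Returns the position and length of the first match of the pattern in each string |
gregexpr() | Returns the positions and lengths of all matches of the pattern in each string |
regexec() | Returns the position of the first match together with the positions of its parenthesized subexpressions |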
sub() | Replaces the first match of pattern with replacement |
gsub() | Replaces all matches of pattern with replacement |
strsplit() | Splits a string into a vector according to pattern matches |
str_detect() | Detects the presence or absence of a pattern in a string |
str_extract() | Extracts the first occurrence of a pattern in a string |
str_extract_all() | Extracts all occurrences of a pattern in a string |
str_match() | Extracts the first match and its matched groups from a string |
str_match_all() | Extracts all matches and their matched groups from a string |
str_locate() | Locates the position of the first occurrence of a pattern in a string |
str_locate_all() | Locates the positions of all occurrences of a pattern in a string |
str_replace() | Replaces the first match of a pattern with a replacement |
str_replace_all() | Replaces all matches of a pattern with a replacement |
str_split() | Splits a string into a variable number of pieces |
str_split_fixed() | Splits a string into a fixed number of pieces |
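The base R functions from the table can be tried out on a single string. The following is a minimal sketch; the strings s and pair are hypothetical, and regmatches() is a base R helper (not listed in the table) that turns the positions returned by regexec() into the matched text.

```r
# Hypothetical inputs, made up for this illustration.
s <- 'data science and data analysis'
pair <- 'key=value'

idx       <- grep(pattern='data', c(s, 'none'))     # index of the matching element
found     <- grepl(pattern='data', c(s, 'none'))    # one logical value per element
first_pos <- as.integer(regexpr('data', s))         # start of the first match
all_pos   <- as.integer(gregexpr('data', s)[[1]])   # starts of all matches
first_rep <- sub('data', 'DATA', s)                 # replaces the first match only
all_rep   <- gsub('data', 'DATA', s)                # replaces every match
pieces    <- strsplit(s, ' ')[[1]]                  # splits on single spaces

# regexec() also records the parenthesized groups of the first match.
groups <- regmatches(pair, regexec('([a-z]+)=([a-z]+)', pair))[[1]]

print(idx)       # 1
print(found)     # TRUE FALSE
print(first_pos) # 1
print(all_pos)   # 1 18
print(first_rep) # "DATA science and data analysis"
print(all_rep)   # "DATA science and DATA analysis"
print(pieces)    # "data" "science" "and" "data" "analysis"
print(groups)    # "key=value" "key" "value"
```

The stringr functions in the table cover the same tasks with a consistent interface in which the string is the first argument and the pattern is the second.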