Regular Expressions
A regular expression is a sequence of characters that describes a pattern to match. The concept was formalized by the American mathematician Stephen Cole Kleene. A regular expression pattern can contain a combination of alphanumeric and special characters. Let us take a closer look at how these special characters can be used to craft regular expressions in R.
Metacharacters
Metacharacters are characters that have a special meaning within a regular expression. Unlike other characters, which match themselves, metacharacters are reserved and cannot be matched literally unless they are escaped. The following table shows a list of metacharacters used in regular expressions.
| Metacharacter | Description |
|---|---|
| ^ | Matches at the start of the string |
| $ | Matches at the end of the string |
| () | Defines a subexpression to be matched and retrieved later |
| \| | Matches the pattern before it or the pattern after it |
| [ ] | Matches a single character that is contained within the brackets |
| . | Matches any single character |

We shall now see how metacharacters can be used to match different patterns with a few examples.
string <- c('hands', 'data', 'on', 'data$cience', 'handsondata$cience', 'handson')
grep(pattern='^data', string, value=TRUE) # Match the pattern at the beginning of the string.
## [1] "data" "data$cience"
grep(pattern='on$', string, value=TRUE) # Match the pattern at the end of the string.
## [1] "on" "handson"
library(stringr) # str_detect() comes from the stringr package
str_detect(string, pattern='(nd)+') # Detect whether the group (nd) occurs at least once.
## [1] TRUE FALSE FALSE FALSE TRUE TRUE
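The remaining metacharacters in the table (alternation, brackets, the dot, and grouping) can be illustrated in the same way. The following is a minimal sketch in base R; the vector `words` and the result names are hypothetical and chosen only for illustration.

```r
# Hypothetical sample vector, chosen only for illustration
words <- c("cat", "cot", "cut", "coat", "dog")

# Alternation: elements containing either "cat" or "dog"
alt <- grep(pattern = "cat|dog", words, value = TRUE)

# Brackets: "c", then exactly one of "a" or "o", then "t"
set <- grep(pattern = "c[ao]t", words, value = TRUE)

# Dot: "c", then any single character, then "t"
dot <- grep(pattern = "c.t", words, value = TRUE)

# Grouping: the parentheses limit the alternation to the middle letter
grp <- grep(pattern = "^c(a|u)t$", words, value = TRUE)

print(alt)  # "cat" "dog"
print(set)  # "cat" "cot"
print(dot)  # "cat" "cot" "cut"
print(grp)  # "cat" "cut"
```

Note that "coat" matches none of these patterns, because the dot and the bracket each stand for exactly one character.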
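In order to match a metacharacter literally in R, we place \\ (a double backslash) before it. The first backslash escapes the second inside the R string literal, so the regular expression engine receives a single backslash followed by the metacharacter.
grep(pattern='\\$', string, value=TRUE) # Match the metacharacter $ literally.
## [1] "data$cience" "handsondata$cience"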
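Besides the double backslash, base R offers two other common ways to treat a metacharacter literally: placing it inside brackets, or passing `fixed = TRUE` so the whole pattern is read as plain text. The sketch below compares them; the vector `prices` is a hypothetical example, not data from the text above.

```r
# Hypothetical sample vector, chosen only for illustration
prices <- c("cost$5", "cost 5", "5 dollars")

# Unescaped, $ is an end-of-string anchor, so every element matches
unescaped <- grep(pattern = "$", prices, value = TRUE)

# Three ways to match a literal dollar sign
escaped   <- grep(pattern = "\\$", prices, value = TRUE)             # backslash escape
bracketed <- grep(pattern = "[$]", prices, value = TRUE)              # inside brackets
literal   <- grep(pattern = "$", prices, value = TRUE, fixed = TRUE)  # no regex at all

print(unescaped)  # all three elements
print(escaped)    # "cost$5"
print(bracketed)  # "cost$5"
print(literal)    # "cost$5"
```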
Quantifiers
Quantifiers specify how many times the preceding item may repeat within a string. The following table shows a list of quantifiers.
| Quantifier | Description |
|---|---|
| * | The preceding item is matched 0 or more times |
| + | The preceding item is matched 1 or more times |
| ? | The preceding item is matched at most once |
| {n} | The preceding item is matched exactly n times |
| {n,} | The preceding item is matched at least n times |
Let us see some examples of quantifiers in practice.
strings <- c('aaab', 'abb', 'bc', 'abbcd', 'bbbc', 'abab', 'caa')
grep(pattern='ab*b', strings, value=TRUE) # Match 'a', then zero or more 'b', then one more 'b'.
## [1] "aaab" "abb" "abbcd" "abab"
grep(pattern='abbc?', strings, value=TRUE) # Match 'abb' followed by at most one 'c'.
## [1] "abb" "abbcd"
grep(pattern='b{2,}?', strings, value=TRUE) # Match at least 2 consecutive 'b'; the trailing ? makes the quantifier lazy and does not change which strings match.
## [1] "abb" "abbcd" "bbbc"
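The `{n}` and `{n,}` quantifiers, and the difference between a greedy and a lazy quantifier, are easier to see when the matched text itself is extracted. The following is a rough sketch: the vector `samples` is hypothetical, and `perl = TRUE` is an assumption made here only to fix the regular expression engine explicitly.

```r
# Hypothetical sample vector, chosen only for illustration
samples <- c("ab", "abb", "abbb", "abbbb")

# {n}: 'a' followed by exactly two 'b' (anchors force a whole-string match)
exactly_two <- grep(pattern = "^ab{2}$", samples, value = TRUE)

# {n,}: 'a' followed by three or more 'b'
at_least_three <- grep(pattern = "^ab{3,}$", samples, value = TRUE)

# Greedy versus lazy: both patterns match "abbbb", but they consume
# different amounts of text. perl = TRUE is an assumption made for clarity.
x <- "abbbb"
greedy <- regmatches(x, regexpr("b{2,}", x, perl = TRUE))   # as many 'b' as possible
lazy   <- regmatches(x, regexpr("b{2,}?", x, perl = TRUE))  # as few 'b' as possible

print(exactly_two)     # "abb"
print(at_least_three)  # "abbb" "abbbb"
print(greedy)          # "bbbb"
print(lazy)            # "bb"
```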
Character classes
A character class is a set of characters enclosed within [] that matches any single character from the set. For example, the character class [0-9] matches any single digit. Below is a set of commonly used character classes.
| Character class | Description |
|---|---|
| [0-9] | Digits |
| [a-z] | Lower-case letters |
| [A-Z] | Upper-case letters |
| [a-zA-Z] | Alphabetic characters |
| [^a-zA-Z] | Non-alphabetic characters |
| [a-zA-Z0-9] | Alphanumeric characters |
| [ ] | Space characters |
| []!,:;`)}@$*+.?[^{\|(\\#%&~_/<=>'-] | Punctuation characters |
Let us see some simple examples of using character classes in regular expressions.
string <- c('abc12', '@#$', '345', 'ABcd')
grep(pattern='[0-9]+', string, value=TRUE) # Match strings that contain digits.
## [1] "abc12" "345"
grep(pattern='[A-Z]+', string, value=TRUE) # Match strings that contain capital letters.
## [1] "ABcd"
grep(pattern='[^@#$]+', string, value=TRUE) # Match strings that contain at least one character other than @, # or $.
## [1] "abc12" "345" "ABcd"
Alternatively, R allows the use of POSIX character classes, which are written within [[ ]] (double square brackets).
grep(pattern='[[:alpha:]]', string, value=TRUE) # Match strings that contain alphabetic characters.
## [1] "abc12" "ABcd"
grep(pattern='[[:upper:]]', string, value=TRUE) # Match strings that contain upper-case characters.
## [1] "ABcd"
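A few more POSIX classes are sketched below: `[[:digit:]]`, `[[:punct:]]`, `[[:space:]]` and `[[:alnum:]]`. The vector `tokens` is a hypothetical example chosen only for illustration.

```r
# Hypothetical sample vector, chosen only for illustration
tokens <- c("abc12", "@#$", "345", "AB cd")

# Elements that contain at least one digit
has_digit <- grep(pattern = "[[:digit:]]", tokens, value = TRUE)

# Elements that contain at least one punctuation character
has_punct <- grep(pattern = "[[:punct:]]", tokens, value = TRUE)

# Elements that contain at least one whitespace character
has_space <- grep(pattern = "[[:space:]]", tokens, value = TRUE)

# Elements made up entirely of letters and digits
only_alnum <- grep(pattern = "^[[:alnum:]]+$", tokens, value = TRUE)

print(has_digit)   # "abc12" "345"
print(has_punct)   # "@#$"
print(has_space)   # "AB cd"
print(only_alnum)  # "abc12" "345"
```

The anchors ^ and $ in the last pattern are what turn "contains an alphanumeric character" into "contains only alphanumeric characters".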
Functions in R to Support Regular Expressions
R has a wide array of functions that support regular expressions. The following is a list of functions from the base and stringr packages that support regular expressions.
| Function | Description |
|---|---|
| grep() | Returns the indices of the elements that matched (or the elements themselves with value=TRUE) |
| grepl() | Returns a logical vector indicating whether the pattern exists in each string (TRUE or FALSE) |
| regexpr() | Returns the position and length of the first match of the pattern in each string |
| gregexpr() | Returns the positions and lengths of all matches of the pattern in each string |
| regexec() | Returns the position of the first match together with the positions of its parenthesized subexpressions |
| sub() | Replaces the first match of the pattern with the replacement |
| gsub() | Replaces all matches of the pattern with the replacement |
| strsplit() | Splits a string into a vector according to the pattern match |
| str_detect() | Detects the presence or absence of a pattern in a string |
| str_extract() | Extracts the first occurrence of the pattern in a string |
| str_extract_all() | Extracts all occurrences of the pattern in a string |
| str_match() | Extracts the first match and its matched groups from a string |
| str_match_all() | Extracts all matches and their matched groups from a string |
| str_locate() | Locates the position of the first occurrence of a pattern in a string |
| str_locate_all() | Locates the positions of all occurrences of a pattern in a string |
| str_replace() | Replaces the first match of the pattern in a string |
| str_replace_all() | Replaces all matches of the pattern in a string |
| str_split() | Split up a string into a variable number of pieces |
| str_split_fixed() | Split up a string into a fixed number of pieces |
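
To make the table more concrete, here is a minimal sketch that applies several of the base functions to one pattern. The vector `notes` and the pattern `date_pattern` are hypothetical and chosen only for illustration. `regmatches()` is a base R helper, not listed in the table, that turns the positions returned by `regexpr()` and `gregexpr()` into the matched text.

```r
# Hypothetical input vector and pattern, chosen only for illustration
notes <- c("2021-03-15", "no date", "2022-11-01 and 2023-01-20")
date_pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"

# grepl(): one logical value per element
has_date <- grepl(date_pattern, notes)

# grep(): indices of the elements that matched
date_index <- grep(date_pattern, notes)

# regexpr() + regmatches(): first match per element (non-matching elements are dropped)
first_dates <- regmatches(notes, regexpr(date_pattern, notes))

# gregexpr() + regmatches(): all matches, returned as a list with one entry per element
all_dates <- regmatches(notes, gregexpr(date_pattern, notes))

# sub() replaces only the first "-", gsub() replaces every "-"
first_slash <- sub("-", "/", notes[1])
all_slash <- gsub("-", "/", notes[1])

# strsplit(): split on the pattern; the result is a list
parts <- strsplit(notes[1], "-")

print(has_date)     # TRUE FALSE TRUE
print(date_index)   # 1 3
print(first_dates)  # "2021-03-15" "2022-11-01"
print(all_dates)
print(first_slash)  # "2021/03-15"
print(all_slash)    # "2021/03/15"
print(parts)
```

The stringr functions in the table cover the same tasks with a consistent argument order (string first, pattern second), so `str_detect()`, `str_extract()` and `str_replace_all()` can be read as counterparts of `grepl()`, `regexpr()` with `regmatches()`, and `gsub()`.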