Hands on Data Science with R

Regular Expressions

A regular expression is a sequence of characters that describes a pattern to match. The concept was formalized by the American mathematician Stephen Cole Kleene. A regular expression pattern can contain a combination of alphanumeric and special characters. Let us take a closer look at how these special characters can be used to craft regular expressions in R.

Metacharacters

Metacharacters are characters that have a special meaning within a regular expression. Unlike ordinary characters, which simply match themselves, metacharacters are reserved and match literally only when they are escaped. The following table lists the metacharacters used in regular expressions.

Table 1: Metacharacters in Regular Expressions
Metacharacter	Description
^	Matches at the start of the string
$	Matches at the end of the string
()	Defines a subexpression to be matched and retrieved later
|	Matches either the pattern before it or the pattern after it
[ ]	Matches a single character that is contained within the brackets
.	Matches any single character

We shall now see how metacharacters can be used to match different patterns with a few examples.

string <- c('hands', 'data', 'on', 'data$cience', 'handsondata$cience', 'handson')

grep(pattern='^data', string, value=TRUE) # Matching the occurrence of the pattern at the beginning of the string.

## [1] "data"        "data$cience"

grep(pattern='on$', string, value=TRUE) # Matching the occurrence of the pattern at the end of the string.

## [1] "on"      "handson"

library(stringr) # str_detect() comes from the stringr package.
str_detect(string, pattern='(nd)+') # Detecting whether the pattern (nd) occurs at least once.

## [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE
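The remaining metacharacters from the table (|, [ ] and .) can be tried out in the same way. The following is a small sketch in base R; the vector name words and the three patterns are chosen only for illustration.

```r
# Same values as the example vector above, under a separate (illustrative) name.
words <- c('hands', 'data', 'on', 'data$cience', 'handsondata$cience', 'handson')

# | : strings that contain either 'hands' or 'on'.
either <- grep(pattern = 'hands|on', words, value = TRUE)
print(either)      # "hands" "on" "handsondata$cience" "handson"

# [ ] : strings whose first character is one of 'd' or 'o'.
first_char <- grep(pattern = '^[do]', words, value = TRUE)
print(first_char)  # "data" "on" "data$cience"

# . : strings that contain 'a', then any single character, then 'a'.
any_char <- grep(pattern = 'a.a', words, value = TRUE)
print(any_char)    # "data" "data$cience" "handsondata$cience"
```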

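In order to match a metacharacter literally in R, we place \\ (a double backslash) before it. The first backslash escapes the second inside the R string, so the regular expression engine receives a single backslash followed by the metacharacter.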

grep(pattern='\\$', string, value=TRUE) # Matching the metacharacter $

## [1] "data$cience"        "handsondata$cience"
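Escaping with a double backslash is not the only option. As a sketch, the two alternatives below should select the same strings: placing the metacharacter inside a bracket expression, or asking grep() to treat the whole pattern as a literal string through its fixed argument. The variable names are illustrative.

```r
words <- c('hands', 'data', 'on', 'data$cience', 'handsondata$cience', 'handson')

# Escaped metacharacter, as in the example above.
escaped <- grep(pattern = '\\$', words, value = TRUE)

# Inside [ ] the dollar sign loses its special meaning.
bracketed <- grep(pattern = '[$]', words, value = TRUE)

# fixed = TRUE switches regular expression matching off entirely.
literal <- grep(pattern = '$', words, value = TRUE, fixed = TRUE)

print(escaped)  # "data$cience" "handsondata$cience"
print(identical(escaped, bracketed) && identical(escaped, literal))  # TRUE
```

The fixed = TRUE form is convenient when the search text contains no pattern at all, since nothing needs to be escaped.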

Quantifiers

Quantifiers specify how many times the preceding item must repeat for the pattern to match. The following table shows a list of quantifiers.

Table 2: Quantifiers in Regular Expressions
Quantifier	Description
*	The preceding item is matched 0 or more times
+	The preceding item is matched 1 or more times
?	The preceding item is matched at most once
{n}	The preceding item is matched exactly n times
{n,}	The preceding item is matched at least n times

Let us see some examples of quantifiers in practice.

strings <- c('aaab', 'abb', 'bc', 'abbcd', 'bbbc', 'abab', 'caa')
grep(pattern='ab*b', strings, value=TRUE) # Matching strings that contain 'a', then zero or more 'b', then a final 'b'.

## [1] "aaab"  "abb"   "abbcd" "abab"

grep(pattern='abbc?', strings, value=TRUE) # Matching strings that contain 'abb' followed by at most one 'c'.

## [1] "abb"   "abbcd"

grep(pattern='b{2,}', strings, value=TRUE) # Matching strings that contain 'b' at least 2 times in a row.

## [1] "abb"   "abbcd" "bbbc"
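The quantifiers + and {n} from the table can be explored in the same manner. The sketch below also uses the base R functions regexpr() and regmatches() to show the text that was actually matched, which makes it visible that a quantifier takes as many repetitions as it can. The variable names are illustrative.

```r
strings <- c('aaab', 'abb', 'bc', 'abbcd', 'bbbc', 'abab', 'caa')

# {n} : strings that contain 'a' exactly 3 times in a row.
three_a <- grep(pattern = 'a{3}', strings, value = TRUE)
print(three_a)   # "aaab"

# + : strings that contain one or more 'b' immediately followed by 'c'.
b_then_c <- grep(pattern = 'b+c', strings, value = TRUE)
print(b_then_c)  # "bc" "abbcd" "bbbc"

# The first run of 'b' in every string that has one ('caa' has none and is dropped).
first_b_run <- regmatches(strings, regexpr('b+', strings))
print(first_b_run)  # "b" "bb" "b" "bb" "bbb" "b"
```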

Character classes

A character class is a set of characters that represents a category. It is enclosed within [] and matches any one of the characters in the set. For example, the character class [0-9] matches any single digit in the string. Below is a list of commonly used character classes.

Table 3: Character Classes
Character class	Description
[0-9]	Digits
[a-z]	Lower-case letters
[A-Z]	Upper-case letters
[a-zA-Z]	Alphabetic characters
[^a-zA-Z]	Non-alphabetic characters
[a-zA-Z0-9]	Alphanumeric characters
[ ]	Space characters
[][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]	Punctuation characters

Let us see some simple examples of using character classes in regular expressions.

string <- c('abc12', '@#$', '345', 'ABcd')
grep(pattern='[0-9]+', string, value=TRUE) # Matching strings that have digits.

## [1] "abc12" "345"

grep(pattern='[A-Z]+', string, value=TRUE) # Matching strings that have capital letters.

## [1] "ABcd"

grep(pattern='[^@#$]+', string, value=TRUE) # Matching strings that contain at least one character other than @, # or $.

## [1] "abc12" "345"   "ABcd"

Alternatively, R allows the use of POSIX character classes, which are written inside double square brackets, for example [[:alpha:]].

grep(pattern='[[:alpha:]]', string, value=TRUE) # Matching strings that contain alphabetic characters.

## [1] "abc12" "ABcd"

grep(pattern='[[:upper:]]', string, value=TRUE) # Matching strings that contain upper-case characters.

## [1] "ABcd"
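A few more POSIX classes are worth trying on the same data. The following is a short sketch using [[:digit:]], [[:punct:]] and [[:alnum:]]; the vector name samples is illustrative, and the last pattern combines a class with the anchors and the + quantifier introduced earlier.

```r
samples <- c('abc12', '@#$', '345', 'ABcd')

# Strings that contain at least one digit.
with_digits <- grep(pattern = '[[:digit:]]', samples, value = TRUE)
print(with_digits)  # "abc12" "345"

# Strings that contain at least one punctuation character.
with_punct <- grep(pattern = '[[:punct:]]', samples, value = TRUE)
print(with_punct)   # "@#$"

# Strings made up of letters and digits only, from start to end.
only_alnum <- grep(pattern = '^[[:alnum:]]+$', samples, value = TRUE)
print(only_alnum)   # "abc12" "345" "ABcd"
```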

Functions in R to Support Regular Expressions

R has a wide range of functions that support regular expressions. The following is a list of functions from the base and stringr packages that support regular expressions.

Table 4: Functions Supporting Regular Expressions in R
Function	Description
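grep()	Returns the indices of the elements that matched (or the elements themselves with value=TRUE)
grepl()	Returns logical values (TRUE or FALSE) indicating whether the pattern exists in each string
regexpr()	Returns the position of the first match of the pattern in each string
gregexpr()	Returns the positions of all matches of the pattern in each string
regexec()	Returns the position of the first match together with the positions of its parenthesized subexpressions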
sub()	Replaces the first match of pattern with replacement
gsub()	Replaces all matches of pattern with replacement
strsplit()	Splits a string into a vector of substrings at each match of the pattern
str_detect()	Detects the presence or absence of a pattern in a string
str_extract()	Extracts the first occurrence of the pattern in a string
str_extract_all()	Extracts all occurrences of the pattern in a string
str_match()	Extracts the first match and its matched groups from a string
str_match_all()	Extracts all matches and their matched groups from a string
str_locate()	Locates the position of the first occurrence of a pattern in a string
str_locate_all()	Locates the positions of all occurrences of a pattern in a string
str_replace()	Replaces the first match of the pattern with a replacement
str_replace_all()	Replaces all matches of the pattern with a replacement
str_split()	Splits a string into a variable number of pieces
str_split_fixed()	Splits a string into a fixed number of pieces
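
To see how some of the base functions in the table behave, here is a minimal sketch on a small made-up vector. The vector notes and all result names are illustrative only. The example also shows how a subexpression defined with () can be retrieved later, either through regexec() or through the backreferences \\1 and \\2 in a replacement.

```r
notes <- c('order 12 shipped', 'no digits here', 'items 3 and 45')

# grepl(): does each string contain a run of digits?
has_number <- grepl('[0-9]+', notes)
print(has_number)    # TRUE FALSE TRUE

# sub() replaces only the first match, gsub() replaces every match.
first_masked <- sub('[0-9]+', '#', notes)
all_masked <- gsub('[0-9]+', '#', notes)
print(first_masked)  # "order # shipped" "no digits here" "items # and 45"
print(all_masked)    # "order # shipped" "no digits here" "items # and #"

# gregexpr() locates all matches; regmatches() pulls out the matched text.
all_numbers <- regmatches(notes, gregexpr('[0-9]+', notes))
print(unlist(all_numbers))  # "12" "3" "45"

# regexec() also reports the parenthesized subexpressions of the first match.
groups <- regmatches(notes[1], regexec('([a-z]+) ([0-9]+)', notes[1]))
print(groups[[1]])   # "order 12" "order" "12"

# Backreferences reuse the subexpressions in the replacement text.
swapped <- sub('([a-z]+) ([0-9]+)', '\\2 \\1', notes[1])
print(swapped)       # "12 order shipped"

# strsplit() cuts the string at every match of the pattern.
pieces <- strsplit('a1b22c', '[0-9]+')
print(pieces[[1]])   # "a" "b" "c"
```

The stringr functions in the table cover the same tasks; they take the string as their first argument and the pattern as their second.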