A regular expression, often called a regex, is a sequence of characters that defines a search pattern to run against another string of data. Regular expressions are often used for find-and-replace operations or for validating input.
Different languages apply regular expressions differently. The strictest support the full range of special characters and commands inside the pattern itself, while others expose some of the options as properties set on a regex object instead.
While the idea has been around since the 1950s, regular expressions became more popular in the 1980s as they found their way into programming languages and applications. They can be found in word processors for finding substrings, in search engines, and in text and lexical analysis tools.
POSIX defines one of the most common sets of rules, and it is what we’ll look at, since that is what you would most likely find in a real-world environment. There are various extended formats, but we’ll try to keep it simple.
The Basic Rules
There are a few basic rules to regex, and knowing them will let you start building your own regular expressions. They help define what we are searching for, often called a token.
Most often the token needs to be enclosed in delimiter characters. Which ones depends on the language or library being used, but most commonly they will be either forward slashes or a pair of double quotes.
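As one illustration of delimiters, here is a minimal sketch that assumes Python and its standard re module (the article itself does not name a language). In Python the pattern is written as a quoted string, usually a raw string, rather than between forward slashes. The sample text is made up for the example.

```python
import re

# In Python the pattern lives inside quotes; the r prefix makes it a raw
# string so backslashes reach the regex engine unchanged.
pattern = re.compile(r"cat")

# search scans the text and returns the first place the pattern occurs.
found = pattern.search("concatenate")

print(found.group())  # the text that matched
print(found.span())   # start and end positions of the match
```

Other languages would write the same token differently, so treat the quoting here as specific to Python.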
Wildcards
The wildcard character, the period, lets you search for any single character. There is no limit on which character the wildcard can represent.
a.e
This example will find any three-character match that starts with an a and ends with an e. So while ape, ate, and ale will all match, so will a8e, aue, and a$e.
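To see the wildcard in action, here is a small sketch assuming Python's re module. Python's engine is not strictly POSIX, but it should behave the same way for this pattern. The extra words "ae" and "apple" are added here only to show non-matches.

```python
import re

# The dot stands for exactly one character (in Python, any character
# except a newline unless the DOTALL flag is set).
pattern = re.compile(r"a.e")

candidates = ["ape", "ate", "ale", "a8e", "aue", "a$e", "ae", "apple"]

# fullmatch requires the whole string to fit the pattern, so "ae"
# (too short) and "apple" (too long) are rejected.
matches = [word for word in candidates if pattern.fullmatch(word)]

print(matches)  # ['ape', 'ate', 'ale', 'a8e', 'aue', 'a$e']
```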
While the wildcard is powerful, it can cause issues with “catching” too many possible combinations.
Boolean Or
Since the wildcard can be a little too “wild” at times, we might want to explicitly specify which options we want to see. For example, if we only wanted to match ape, ate, or ale, we could define a regex that does that.
The best way is to use a boolean or, written with the pipe symbol (|), to specify value1 or value2. This can of course be expanded to as many alternatives as you need.
value1|value2
ape|ate|ale
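The following sketch, again assuming Python's re module, runs the alternation against the same words the wildcard accepted, to show that only the three listed options get through.

```python
import re

# Each alternative between the pipes is tried in turn.
pattern = re.compile(r"ape|ate|ale")

candidates = ["ape", "ate", "ale", "a8e", "aue", "a$e"]

# Only the explicitly listed words match in full.
matches = [word for word in candidates if pattern.fullmatch(word)]

print(matches)  # ['ape', 'ate', 'ale']
```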
Grouping
Regular expressions can use grouping to make it easier to look for a fixed string with small variations. Consider the previous example, where you have two fixed characters and one that changes. Combining grouping with the boolean or, we can write a shorter regex. Some might say that it’s more complicated, though; opinions vary from person to person.
a(p|t|l)e
In this example, we start with looking for an a, then look for a p or t or l. If that is found, then we search for an e.
You can have as many characters as you want in a group, including whole words.
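A short sketch, assuming Python's re module, that checks the grouped pattern against the longer alternation and shows a group holding whole words. The "(cat|dog) food" pattern is a made-up example, not one from the text above.

```python
import re

grouped = re.compile(r"a(p|t|l)e")
spelled_out = re.compile(r"ape|ate|ale")

words = ["ape", "ate", "ale", "a8e", "aue", "axe"]

# Both patterns should accept and reject exactly the same words.
same_results = all(
    bool(grouped.fullmatch(w)) == bool(spelled_out.fullmatch(w)) for w in words
)
print(same_results)  # True

# In Python, parentheses also capture what the group matched.
middle = grouped.fullmatch("ate").group(1)
print(middle)  # t

# A group can hold whole words (hypothetical pattern for illustration).
pet_food = re.compile(r"(cat|dog) food")
print(pet_food.fullmatch("dog food") is not None)   # True
print(pet_food.fullmatch("bird food") is not None)  # False
```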
Brackets
Square brackets let you match any one character found within the brackets. So if you wanted to find a, b, or c, you would write it as below.
[abc]
This is easier than using the boolean or in many cases, especially when working with ranges, which can be included within brackets.
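Here is a minimal sketch, assuming Python's re module, of a bracket expression picking out single characters. The sample sentence is made up, and the a[ptl]e pattern is shown only as the bracket equivalent of the earlier grouping example.

```python
import re

# [abc] matches exactly one character: an a, a b, or a c.
letters = re.findall(r"[abc]", "a big cab")
print(letters)  # ['a', 'b', 'c', 'a', 'b']

# Brackets can stand in for a single-character boolean or.
bracketed = re.compile(r"a[ptl]e")
matches = [w for w in ["ape", "ate", "ale", "a8e"] if bracketed.fullmatch(w)]
print(matches)  # ['ape', 'ate', 'ale']
```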
Range
Imagine the complexity of trying to specify every number or every letter using the boolean or. Not only would it take a long time, it might be difficult, if not impossible, to do without errors. Even when using brackets, it can be a real challenge.
Therefore, inside brackets we can use a dash to include all characters between the two listed. For example, if we wanted to match all lowercase letters, we could use the following.
[a-z]
To match uppercase letters as well, we’d modify it to be:
[a-zA-Z]
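The sketch below, assuming Python's re module and a made-up sample string, shows what each range picks out. Note that [a-zA-Z] accepts both lowercase and uppercase letters. The [0-9] range is included as the digit counterpart.

```python
import re

text = "Hello World 42"

# findall returns every single-character match; join glues them together.
lowercase = "".join(re.findall(r"[a-z]", text))
any_letter = "".join(re.findall(r"[a-zA-Z]", text))
digits = "".join(re.findall(r"[0-9]", text))

print(lowercase)   # elloorld
print(any_letter)  # HelloWorld
print(digits)      # 42
```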
Quantification
Quantification specifies how many times a character is supposed to appear. Regex describes five basic formats (with a variation for a sixth) that let you build a quantifier for your token, stating how many times it should appear for a match.
It could be as few as zero or one times, or as many as you want, as long as it exists. Here are the rules for quantification.
? | The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both “color” and “colour”. |
* | The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches “ac”, “abc”, “abbc”, “abbbc”, and so on. |
+ | The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches “abc”, “abbc”, “abbbc”, and so on, but not “ac”. |
{n} | The preceding item is matched exactly n times. |
{min,} | The preceding item is matched min or more times. |
{min,max} | The preceding item is matched at least min times, but not more than max times. |
If you want your quantifier to apply to multiple characters, you will need to group them in parentheses.
Let’s say you wanted to match a US Social Security Number. You could do it with:
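To make the quantifier rules concrete, here is a sketch assuming Python's re module. The matches helper is a hypothetical convenience written for this example, and the a{...} patterns are made-up illustrations of the three brace forms.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of text fits pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# ?  zero or one of the preceding element
print(matches(r"colou?r", "color"), matches(r"colou?r", "colour"))  # True True
# *  zero or more
print(matches(r"ab*c", "ac"), matches(r"ab*c", "abbbc"))            # True True
# +  one or more
print(matches(r"ab+c", "abc"), matches(r"ab+c", "ac"))              # True False
# {n}  exactly n times
print(matches(r"a{3}", "aaa"), matches(r"a{3}", "aa"))              # True False
# {min,}  min or more times
print(matches(r"a{2,}", "aaaaa"), matches(r"a{2,}", "a"))           # True False
# {min,max}  between min and max times
print(matches(r"a{2,3}", "aaa"), matches(r"a{2,3}", "aaaa"))        # True False

# Without a group the + applies only to the b; with a group it
# applies to the whole "ab" sequence.
print(matches(r"ab+", "ababab"), matches(r"(ab)+", "ababab"))       # False True
```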
[0-9]{3}-[0-9]{2}-[0-9]{4}
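As a quick check of this pattern, here is a sketch assuming Python's re module. The function name looks_like_ssn and the sample values are made up for illustration. The pattern only checks the shape of the text (three digits, two digits, four digits, separated by dashes), not whether the number was ever issued.

```python
import re

SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

def looks_like_ssn(text: str) -> bool:
    """Return True when text has the 3-2-4 digit layout (hypothetical helper)."""
    return SSN_PATTERN.fullmatch(text) is not None

print(looks_like_ssn("123-45-6789"))  # True
print(looks_like_ssn("123456789"))    # False, the dashes are required
print(looks_like_ssn("123-45-678"))   # False, the last group is too short
print(looks_like_ssn("12a-45-6789"))  # False, a letter is not in [0-9]
```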
A phone number seems simple until you start looking at country codes and at dashes, parentheses, dots, or nothing at all as separators. This is how regex gets complicated: the real rules for an email address, a phone number, and so on are actually very difficult. Even simple things are not always simple. Consider a rule for defining an instance name (variable, object, or function/method) in C++. It might look something like:
[a-zA-Z]{1}[a-zA-Z_0-9]{0,31}
[a-zA-Z]{1}[a-zA-Z_0-9]*
The first pattern meets the requirement of starting with a single letter, uppercase or lowercase, followed by between 0 and 31 letters, numbers, or underscores. This limits the name to 32 characters, which some old compilers did for their own internal memory handling.
The second one gets rid of the 32-character limit, so a name can be anything from a single letter to unlimited length, which may not be the best choice.
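The difference between the two patterns can be sketched as follows, assuming Python's re module. The names LIMITED and UNLIMITED and the sample identifiers are made up for the example.

```python
import re

# First pattern: one letter, then at most 31 more characters.
LIMITED = re.compile(r"[a-zA-Z]{1}[a-zA-Z_0-9]{0,31}")
# Second pattern: one letter, then any number of further characters.
UNLIMITED = re.compile(r"[a-zA-Z]{1}[a-zA-Z_0-9]*")

long_name = "a" * 33  # one character over the 32-character limit

for name in ["x", "counter_1", "9lives", "_private", long_name]:
    print(
        name[:12],
        LIMITED.fullmatch(name) is not None,
        UNLIMITED.fullmatch(name) is not None,
    )
```

Both patterns accept "x" and "counter_1" and reject "9lives" and "_private", since each requires a leading letter. Only the second accepts the 33-character name.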
This is just an introduction to regex. As your patterns grow more complex, they can still produce simple and obvious results, but they also become harder to build and debug.
I always recommend starting simple and adding complexity step by step, checking at each stage that your regular expression returns the correct answer.
Regular Expressions was originally found on Access 2 Learn