Regular Expressions

Text Processing

Text processing matters. Many people do tedious work by hand (the kind of work that computers were invented for) because they don't recognize the power of text processing. If you find yourself doing repetitive work on your computer (for example, updating the same link on more than three webpages), I recommend that you take a step back and ask whether the task can be automated. Even if it takes longer to figure out how to automate something the first time, you'll be ahead of the game the next time. (On a side note, much of an operating system's configuration and data lives in text files, so if you can work well with text files you will be a much more productive computer user overall.)

Text processing revolves around strings (since text is reducible to strings), and regular expressions are a tool for describing and searching strings. We can define a regular expression as:

A string that represents a pattern.

In order to make a string represent a pattern we can reserve certain characters and give them a special meaning. By doing this we can create a powerful language to describe strings.

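Regular expressions reached everyday use through the text processing traditions of Unix, whose conventions were later standardized by POSIX. The programming language Perl took text processing to a new level, which involved extending (and slightly changing) regular expression syntax. PHP has shipped libraries for both POSIX-compatible regular expressions and Perl-compatible regular expressions (PCRE), making it a very powerful tool for text processing, and thus for web programming as well. Note that the POSIX functions used in this section (ereg and eregi) were deprecated in PHP 5.3 and removed in PHP 7.0; on current versions the Perl-compatible functions such as preg_match take their place.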

Our examination of regular expressions will be based on Chapter 4 (pages 95 through 115) of Programming PHP.

The Function ereg

We will begin by studying examples with the function ereg which can be called like this:

$t_or_f = ereg($regex, $string);

In the above example there are three variables:

$regex is a regular expression which describes a string.
$string is a string that might contain the patterns described in $regex.
$t_or_f is a variable that will be true or false depending on whether or not the pattern in $regex was found in $string.
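
Because ereg is not available in PHP 7 and later, the call above cannot be run on a current interpreter. As a minimal sketch, the helper below gives the same true-or-false answer by wrapping preg_match. The name contains_pattern, the "~" delimiter, and the sample values are assumptions made for illustration; PCRE and POSIX syntax agree for the simple patterns in this section, but not for every pattern.

```php
<?php
// Hypothetical helper (not part of PHP): answers true or false the way
// the ereg call above is used, by delegating to preg_match.
function contains_pattern(string $regex, string $string): bool
{
    // preg_match needs delimiters around the pattern.
    // Assumption: $regex itself contains no "~" character.
    $delimited = '~' . $regex . '~';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($delimited, $string) === 1;
}

$regex  = 'GNU';              // assumed sample pattern
$string = "GNU's not Unix!";  // assumed sample subject
$t_or_f = contains_pattern($regex, $string);

var_dump($t_or_f); // bool(true)
```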

When you assign a pattern to $regex, it is important to use single quotes (') rather than double quotes ("). For example, suppose I wanted $regex to hold the two-character string $m. There might seem to be two ways to do this:

$regex = "$m";  
$regex = '$m';

In the above example the first line assigns the value of the variable $m to $regex, because the dollar sign has special meaning inside double quotes in PHP (the dollar sign also has special meaning in regular expressions, as we will see later). The first line therefore does not set $regex to the literal text $m. If $m is undefined, $regex ends up as the empty string ("") and PHP reports a notice or warning. The second line, which uses single quotes ('), does not treat $m as a variable, so $regex is set to the string '$m'.
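
The difference between the two kinds of quotes can be checked directly. In this small sketch the value 'hello' given to $m is an assumed sample, and $double and $single are names chosen only for the illustration.

```php
<?php
$m = 'hello'; // assumed sample value

$double = "$m"; // double quotes interpolate: the value of $m is substituted
$single = '$m'; // single quotes do not: the characters "$" and "m" are kept

var_dump($double); // string(5) "hello"
var_dump($single); // string(2) "$m"
```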

Most of the characters that can be in $regex are literals, meaning that they match only themselves. Let's suppose that we call ereg like this:

ereg('GNU', "GNU's not Unix!");

The above example will return true. Note that the string "GNU" was passed as the regular expression. A regular expression is a string that represents a pattern by using special characters. In this case no character within "GNU" has any special meaning, so the pattern simply asks whether "GNU" occurs somewhere in the string. It does occur in "GNU's not Unix!", so ereg returned true.

It is worth mentioning that we used "GNU" in capitals as opposed to "gnu" in lowercase. Regular expressions are case sensitive by default. Consider the following:

ereg('gnu', "GNU's not Unix!");

The above example will return false. If we wanted to evaluate regular expressions and ignore case we would use the eregi function, which is a case insensitive version of ereg.
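
On PHP 7 and later, where ereg and eregi are not available, the same contrast can be sketched with preg_match. The "i" modifier after the closing delimiter plays the role of eregi here; treating it as the replacement is an assumption of this example, and the variable names are illustrative.

```php
<?php
$string = "GNU's not Unix!";

// Case-sensitive by default: lowercase "gnu" does not occur in the string.
$sensitive = preg_match('/gnu/', $string);

// The "i" modifier makes the match ignore case.
$insensitive = preg_match('/gnu/i', $string);

var_dump($sensitive);   // int(0)
var_dump($insensitive); // int(1)
```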

Special Characters

We will now introduce our first regular expression special character:

Symbol | Name | Meaning
^ | caret | When placed at the beginning of a regular expression, indicates that it must match the beginning of a string. More precisely, it anchors the regular expression to the beginning of a string.

Given the above definition we can say:

ereg('^GNU', "GNU's not Unix!");

The above example will return true because "GNU" is at the beginning of "GNU's not Unix!". If we say:

ereg('^GNU', "I prefer to use GNU");

The above example will return false because "GNU" is not at the beginning of "I prefer to use GNU". However, if we use "$" like this:

ereg('GNU$', "I prefer to use GNU");

The above example will return true because "GNU" is at the end of "I prefer to use GNU". This pattern picks out strings that end with "GNU". We can define this new symbol like this:

Symbol | Name | Meaning
$ | dollar sign | When placed at the end of a regular expression, indicates that it must match the end of a string. More precisely, it anchors the regular expression to the end of a string.

If we were to use this new symbol on our previous sentence like this:

ereg('GNU$', "GNU's not Unix!");

The above example will return false because "GNU's not Unix!" doesn't end with "GNU".
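
The four anchor cases can be run side by side. This is a sketch that assumes PHP 7 or later and therefore uses preg_match with "/" delimiters in place of ereg; the variable names are illustrative.

```php
<?php
// "^" anchors the pattern to the start of the string.
$start_hit  = preg_match('/^GNU/', "GNU's not Unix!");
$start_miss = preg_match('/^GNU/', "I prefer to use GNU");

// "$" anchors the pattern to the end of the string.
$end_hit  = preg_match('/GNU$/', "I prefer to use GNU");
$end_miss = preg_match('/GNU$/', "GNU's not Unix!");

var_dump($start_hit, $start_miss, $end_hit, $end_miss);
// int(1) int(0) int(1) int(0)
```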

Let's study another special character called period (.). We can use period like this:

ereg('G.U', "GNU's not Unix!");

The above example will return true because the period matches any single character. In the above example it matches the "N" inside "GNU". We can define period like this:

Symbol | Name | Meaning
. | period | When placed within a regular expression, represents any single character.

Let's evaluate some expressions with period.

ereg('c.t', "cat");

The above example will return true because "." can represent "a".

ereg('c.t', "cut");

The above example will return true because "." can represent "u".

ereg('c.t', "c t");

The above example will return true because "." can represent " " (space).

ereg('c.t', "bat");

The above example will return false because the regex "c.t" needs a "c", followed by any single character, followed by a "t", and "bat" contains no "c".

ereg('c.t', "ct");

The above example will return false because the regex "c.t" needs exactly one character between the "c" and the "t". In "ct" there is nothing between them, and the empty string ("") is not a single character.
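
The period examples can be checked in one loop. The sketch below assumes PHP 7 or later and uses preg_match in place of ereg. The extra input "scatter" is an assumed addition showing that the pattern is not anchored: it only has to match somewhere inside the string.

```php
<?php
// The first five inputs come from the examples above; "scatter" is an
// assumed extra input that contains "cat" in the middle.
$candidates = ['cat', 'cut', 'c t', 'bat', 'ct', 'scatter'];

$results = [];
foreach ($candidates as $candidate) {
    // "." stands for exactly one character of any kind.
    $results[$candidate] = preg_match('/c.t/', $candidate) === 1;
}

foreach ($results as $candidate => $matched) {
    echo $candidate, ': ', $matched ? 'true' : 'false', "\n";
}
```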

The Backslash

If you want to match any of the above (or future) special characters, also called meta-characters, as ordinary text, you have to precede them with the meta-character "\" (backslash). Here are some examples that use the backslash.

ereg('\$2\.00', "I want my $2.00!");

The above example will return true because the "\" removes the special meaning of "$" and ".". Here is what happens if we leave out the "\":

ereg('$2.00', "I want my $2.00!");

The above example will return false because "$" and "." keep their special meanings. The unescaped "$" anchors the match to the end of the string, so the "2.00" that follows it can never match, and the regex '$2.00' is not satisfied by "I want my $2.00!".
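
The same pair of cases can be sketched with preg_match, assuming PHP 7 or later. The example also uses preg_quote, a standard PHP function that adds the backslashes for you; bringing it in here is an addition to the text above, and the variable names are illustrative.

```php
<?php
$string = 'I want my $2.00!'; // single quotes keep the "$" literal

// Escaped: "\$" and "\." match a literal dollar sign and a literal period.
$escaped = preg_match('/\$2\.00/', $string);

// Unescaped: "$" anchors to the end of the string, so "2" can never follow.
$unescaped = preg_match('/$2.00/', $string);

// preg_quote escapes the meta-characters (and the "/" delimiter if present).
$quoted    = preg_quote('$2.00', '/');
$via_quote = preg_match('/' . $quoted . '/', $string);

var_dump($escaped);   // int(1)
var_dump($unescaped); // int(0)
var_dump($quoted);    // string(7) "\$2\.00"
var_dump($via_quote); // int(1)
```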

Character Classes

To add some flexibility to your regular expressions you can use "[" and "]" to denote a set of alternate characters.

ereg('c[aeiou]t', "I cut my hand");

The above example will return true because 'c[aeiou]t' translates to the following possibilities:

cat
cet
cit
cot
cut

and "cut" is contained in the above list. Here is another example:

ereg('c[aeiou]t', "What cart?");

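The above example will return false. The character class stands for exactly one character, so the regular expression needs a "c", a single vowel, and then a "t" immediately afterwards. In "cart" an "r" sits between the "a" and the "t". Consider this: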

ereg('c[aeiou]t', "14ct gold");

The above example will return false because the regular expression expects a vowel between the "c" and the "t".
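
A short sketch of the three character-class cases, assuming PHP 7 or later with preg_match in place of ereg. The optional third argument of preg_match receives the text that matched, which makes it easy to see which of the five possibilities was found. The variable names are illustrative.

```php
<?php
$pattern = '/c[aeiou]t/';

// Matches: "cut" is one of cat, cet, cit, cot, cut.
$cut = preg_match($pattern, 'I cut my hand', $matches);

// No match: in "cart" the vowel is followed by "r", not "t".
$cart = preg_match($pattern, 'What cart?');

// No match: nothing sits between "c" and "t".
$ct = preg_match($pattern, '14ct gold');

var_dump($cut);        // int(1)
var_dump($matches[0]); // string(3) "cut"
var_dump($cart);       // int(0)
var_dump($ct);         // int(0)
```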

The Dash

Because letters and digits have a standard ordering, you can specify an entire range of characters with less typing by using the "-" operator inside a character class. Consider the following:

[0-9] matches any single digit
[a-z] matches any single lowercase letter
[A-Z] matches any single uppercase letter
[0-9A-Za-z] matches any single digit, capital letter, or lowercase letter

Remember that the above relies on a standard ordering. If you wanted to denote the set of lowercase non-vowels (those letters that are not a, e, i, o, u) you could do the following:

[b-df-hj-np-tv-z]

Here is the above in the context of some PHP:

ereg('[b-df-hj-np-tv-z]', "b");       // returns true

ereg('[b-df-hj-np-tv-z]', "a");       // returns false
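
The ranges can be tried out with preg_match as a stand-in for ereg, assuming PHP 7 or later. The inputs "7" and "_" and the variable names are assumed samples added for illustration.

```php
<?php
// Every lowercase letter except a, e, i, o, u, written as five ranges.
$consonant = '/[b-df-hj-np-tv-z]/';

$b = preg_match($consonant, 'b'); // "b" falls in the range b-d
$a = preg_match($consonant, 'a'); // "a" is in none of the ranges

// Several ranges can share one character class.
$alnum      = '/[0-9A-Za-z]/';
$digit      = preg_match($alnum, '7'); // a digit is in 0-9
$underscore = preg_match($alnum, '_'); // "_" is in none of the ranges

var_dump($b, $a, $digit, $underscore);
// int(1) int(0) int(1) int(0)
```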

Note that spelling out ranges this way to pick out the set of non-vowels is not very concise. It would be better if you could specify the set of vowels and then ask for those characters which are not contained within that set. The next special character that we will study allows us to define such a complement of a set.

The Complement of a Character Class

One thing that gets confusing about using "[" and "]" is that some special characters take on a new meaning inside them. For example, within a set of square brackets "$" and "." lose their special meaning and become literals, but "^" takes on a whole new meaning.

A "^" placed first within a set of square brackets means "not", or "the complement of a set". Observe the following:

ereg('[^aeiou]', "b");       // returns true

ereg('[^aeiou]', "a");       // returns false

We can combine everything that we have learned so far about character classes in the following example:

ereg('c[^aeiou]t', "I cut my hand");       // returns false

ereg('c[^aeiou]t', "Reboot chthton");       // returns true

ereg('c[^aeiou]t', "14ct gold");       // returns false

In the above cases the "^" is being used to negate the character class of vowels. More precisely, the above asks ereg to look for a "c" followed by a non-vowel followed by a "t".

The first of the above returns false since "cut" contains a vowel. The second returns true since "chthton" contains "cht" which conforms to the regular expression. The third returns false since "ct" contains a null-string ("") between "c" and "t" and null-string is not a non-vowel.
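
The three cases can be reproduced with preg_match, assuming PHP 7 or later. The fourth input, "c t", is an assumed addition: it shows that a negated class accepts any character outside the set, including a space, and not only consonants. The variable names are illustrative.

```php
<?php
$pattern = '/c[^aeiou]t/';

$cut = preg_match($pattern, 'I cut my hand');            // "u" is a vowel
$cht = preg_match($pattern, 'Reboot chthton', $matches); // "h" is not a vowel
$ct  = preg_match($pattern, '14ct gold');                // nothing between "c" and "t"

// Assumed extra input: a space is also "not a vowel".
$space = preg_match($pattern, 'c t');

var_dump($cut);        // int(0)
var_dump($cht);        // int(1)
var_dump($matches[0]); // string(3) "cht"
var_dump($ct);         // int(0)
var_dump($space);      // int(1)
```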

Repeating Patterns

Sometimes you might want your regular expression to pick out a pattern that repeats. To this end regular expressions give you the following syntactic structure:

regex{n}

where regex is a single character or a character class (the quantifier applies to the item immediately before it). The above describes regex occurring exactly n times in a row. The following example should make this clear:

ereg('b{3}', "bbb");       // returns true

ereg('b{3}', "bb");       // returns false

One practical application of pattern repetition is to check for valid phone numbers. Here is a regular expression that will check for U.S. phone numbers including the area code:

[0-9]{3}-[0-9]{3}-[0-9]{4}

The above asks for any three digits followed by a dash ("-"), then another three digits followed by a dash ("-"), and finally a sequence of four digits. Here is an example of this expression applied in PHP:

ereg('[0-9]{3}-[0-9]{3}-[0-9]{4}', "445-4357");       // returns false

ereg('[0-9]{3}-[0-9]{3}-[0-9]{4}', "732-445-4357");       // returns true
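
One caveat is worth a sketch: without anchors the pattern only has to match somewhere inside the string, so a longer run of digits that contains a valid number still passes. The example below assumes PHP 7 or later and uses preg_match. The names $loose and $strict and the input "1732-445-43579" are assumptions added for illustration.

```php
<?php
$loose  = '/[0-9]{3}-[0-9]{3}-[0-9]{4}/';   // the pattern from the text
$strict = '/^[0-9]{3}-[0-9]{3}-[0-9]{4}$/'; // same pattern, anchored at both ends

$short = preg_match($loose, '445-4357');     // no area code
$valid = preg_match($loose, '732-445-4357'); // well-formed number

// Extra digits on both sides: "732-445-4357" is still found inside.
$padded_loose  = preg_match($loose, '1732-445-43579');
$padded_strict = preg_match($strict, '1732-445-43579');
$valid_strict  = preg_match($strict, '732-445-4357');

var_dump($short, $valid, $padded_loose, $padded_strict, $valid_strict);
// int(0) int(1) int(1) int(0) int(1)
```

Anchoring with "^" and "$" is the usual choice when the whole string is supposed to be a phone number and nothing else.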

The {n} notation is a quantifier, which specifies how many times the preceding regex is supposed to repeat. It can be defined more formally like this:

Symbol | Name | Meaning
{n} | N/A | When placed after a regular expression, matches exactly n consecutive occurrences of that regular expression.

Regular expressions in PHP support other quantifiers as well. The question mark ("?") can be defined like this:

Symbol | Name | Meaning
? | question mark | When placed after a regular expression, matches 0 or 1 occurrences of that regular expression, making it optional.
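
As a small sketch of the question mark, the pattern below accepts both the American and British spellings of one word. It assumes PHP 7 or later and uses preg_match; the word and the variable names are assumed samples, and the anchors keep the match from succeeding on a longer string.

```php
<?php
// "u?" means the "u" may appear zero times or one time.
$pattern = '/^colou?r$/';

$american = preg_match($pattern, 'color');   // zero "u"
$british  = preg_match($pattern, 'colour');  // one "u"
$doubled  = preg_match($pattern, 'colouur'); // two "u": one too many

var_dump($american, $british, $doubled);
// int(1) int(1) int(0)
```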