More Regular Expressions

Overview

So far we have used regular expressions to verify user input. In this discussion we will increase our regular expression vocabulary and apply it to write more powerful PHP code.

Alternatives

You can use the pipe character (|) to denote alternatives in regular expressions. For example:

ereg('GNU|BSD', "GNU is a Unix like OS.");       // returns true

ereg('GNU|BSD', "BSD is a Unix like OS.");       // returns true

ereg('GNU|BSD', "Linux is a Unix like OS.");       // returns false

You can combine the pipe with other regular expression ideas:

ereg('^([a-z]|[0-9])', "abc");       // returns true

ereg('^([a-z]|[0-9])', "123");       // returns true

ereg('^([a-z]|[0-9])', "abc123");       // returns true

ereg('^([a-z]|[0-9])', "Abc123");       // returns false

ereg('^([a-z]|[0-9])', "aBc123");       // returns true

In the above example '^([a-z]|[0-9])' will only return true if the string starts with a lowercase letter or a digit. Only the first character is tested, which is why "aBc123" still matches.
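
The ereg() family was deprecated in PHP 5.3 and removed in PHP 7.0, so the calls above will not run on a current interpreter. As a rough sketch, the same alternation test can be written with preg_match(), which is introduced later in this discussion. The helper name starts_with_lower_or_digit is made up for illustration and is not part of PHP.

```php
<?php
// Hypothetical helper (not from the original text): true when $s starts
// with a lowercase ASCII letter or a digit.
function starts_with_lower_or_digit(string $s): bool
{
    // preg_match() returns 1 on a match and 0 when nothing matches.
    return preg_match('/^([a-z]|[0-9])/', $s) === 1;
}

var_dump(starts_with_lower_or_digit("abc123"));  // bool(true)
var_dump(starts_with_lower_or_digit("Abc123"));  // bool(false)

// The plain alternation works the same way.
var_dump(preg_match('/GNU|BSD/', "Linux is a Unix like OS.") === 1);  // bool(false)
```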

Sub-patterns

Parentheses "(" and ")" can be used to group bits of regular expressions together to be treated as a single unit, or sub-pattern. For example:

ereg('A (very )+reliable', "A very very reliable OS.");       // returns true

ereg('^(PHP|MySQL)$', "PHP");       // returns true

ereg('^(PHP|MySQL)$', "MySQL");       // returns true
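
Sub-patterns do more than group: each one also remembers what it matched. The sketch below uses preg_match() (covered later in this discussion) rather than ereg(), and the sample sentence is an assumption chosen to combine the two patterns above.

```php
<?php
// preg_match() fills its optional third argument with the text that the
// whole pattern and each parenthesized sub-pattern matched.
$subject = "MySQL is very very reliable";
$found = preg_match('/^(PHP|MySQL) is (very )+reliable$/', $subject, $matches);

print $matches[0] . "\n";  // the whole match: MySQL is very very reliable
print $matches[1] . "\n";  // first sub-pattern: MySQL
print $matches[2] . "\n";  // second sub-pattern keeps only its last repetition: "very "
```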

Regular Expression Replacement Functions

PHP has a function called ereg_replace which finds a regular expression the same way that ereg does, except that it replaces any matches that it finds with another string that it takes as an argument. For example:

$string = "This is a test";
print ereg_replace(" is", " was", $string);

When the above code is run it produces the following:

This was a test

We will be studying a function which is similar to ereg_replace() in our next example.
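
Since ereg_replace() is no longer available from PHP 7.0 onward, here is a minimal sketch of the same replacement written with preg_replace(). The only change is that the pattern is wrapped in delimiters, which are explained below.

```php
<?php
$string = "This is a test";

// Same search and replacement as the ereg_replace() call, with the
// pattern wrapped in "/" delimiters.
$result = preg_replace('/ is/', ' was', $string);

print $result;  // This was a test
```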

Perl Compatible Regular Expressions

PHP's Perl Compatible Regular Expressions use a slightly different syntax because they borrow syntax (and semantics) from Perl. We will start by briefly studying PHP's preg_match() function, which is very similar to the ereg() function. It returns 1 when the pattern matches and 0 when it does not, which PHP treats as true and false in a condition.

preg_match('/Perl/', "Perl is a nice language.");       // returns true

preg_match('/Perl/', "PHP is a nice language.");       // returns false

The main difference that you should notice is how preg_match() declares its regular expression.

Delimiters

Note the use of the forward slashes ("/") at the beginning and end of this regular expression declaration:

$regex = '/Perl/';

In the above example the forward slashes are the delimiters of the regular expression, and every Perl Compatible Regular Expression function expects them around its pattern. If you do not include the delimiters PHP will give a warning message.

Perl's regular expressions require delimiters for reasons of the Perl language itself. Interestingly, Perl (not to be confused with PHP's Perl Compatible Regular Expressions library) does not supply named functions for matching and replacing, such as PHP's preg_match() and preg_replace(). Instead it has its own operators that perform these operations directly by using the delimiters:

To replace a string using a regular expression in Perl you say:
s/regex/new_string/
In the above example "s" stands for substitute.
To match a string using a regular expression in Perl you say:
m/regex/
In the above example "m" stands for match.

PHP supports some of the options that Perl does regarding the use of delimiters but not the ones above. The above example is there to give you an idea of why Perl's regular expressions require delimiters.

PHP does support the modifiers that Perl allows after the closing delimiter. One of them is "i", which stands for case insensitive. Note that the Perl Compatible Regular Expressions do not include a pregi_match() function the way that the POSIX functions include both ereg() and eregi(). If you want preg_match() to ignore case you have to add the modifier after the closing delimiter like this:

preg_match('/Perl/i', "Perl is a nice language.");       // returns true

preg_match('/Perl/i', "I like perl.");       // returns true

Note what happens when you don't use the modifier:

preg_match('/Perl/', "Perl is a nice language.");       // returns true

preg_match('/Perl/', "I like perl.");       // returns false

Another modifier that PHP borrows from Perl is "x", which stands for extended. This modifier allows you to comment and space out your regular expressions, which is a nice option to have if your regular expression is complicated. For example:

preg_match('/([A-Za-z]+)\s+\1/', "Perl Perl is a nice language.");       // returns true

is one way to search for repeated words, but this declaration:

$regex = '/([A-Za-z]+)\s+\1/';

is difficult to read. It can be expressed more cleanly like this:


$regex = '/
           (             # start capture
            [A-Za-z]+    # one word
           )             # stop capture
           \s+           # white space
           \1            # the same word again
         /x';

In the above example using "x" allows us to use "#" to start a comment, and it makes PHP ignore any white space within the definition of the regular expression so that we can space things out. Note also that the above example introduces two new symbols:

"\s" denotes a whitespace character, such as a space, tab, or newline.
"\1" denotes the string captured by the first set of parentheses.

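To see the back-reference at work, the following sketch runs the commented pattern against the sample sentence and reads the captured word from the optional third argument of preg_match(). The variable names $text and $matches are chosen for illustration.

```php
<?php
$regex = '/
    ([A-Za-z]+)   # capture one word
    \s+           # white space between the words
    \1            # the same word again
/x';

$text = "Perl Perl is a nice language.";

// $matches[1] holds whatever the first set of parentheses captured.
if (preg_match($regex, $text, $matches)) {
    print "Repeated word: " . $matches[1];  // Repeated word: Perl
}
```
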
Alternate Delimiter Syntax

The forward slash delimiter can be tedious to use for some examples so the Perl Compatible Regular Expression Library allows you to use alternate syntaxes. The following example denotes a path on a file system and uses the backslash "\" character to escape the special meaning of the forward slash "/":

preg_match('/\/usr\/local\//', "/usr/local/bin/perl/");       // returns true

A simpler way to express the above is any of the following:

preg_match('{/usr/local/}', "/usr/local/bin/perl/");       // returns true

preg_match('#/usr/local/#', "/usr/local/bin/perl/");       // returns true

preg_match('[/usr/local/]', "/usr/local/bin/perl/");       // returns true

preg_match('(/usr/local/)', "/usr/local/bin/perl/");       // returns true

The "<" and ">" characters are also supported.
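
When the text to match lives in a variable, escaping it by hand is error prone. One option, sketched here, is PHP's preg_quote() function, whose second argument names the delimiter that should be escaped along with the usual metacharacters. The variable names are illustrative.

```php
<?php
$path = "/usr/local/";

// preg_quote() backslash-escapes regex metacharacters in $path; passing
// '/' as the second argument escapes the delimiter as well.
$regex = '/' . preg_quote($path, '/') . '/';

print $regex . "\n";  // /\/usr\/local\//
var_dump(preg_match($regex, "/usr/local/bin/perl/"));  // int(1)
```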

Regular Expressions, Strings, and Arrays

Our study of string processing began with a study of basic arrays which could allow us to see strings as arrays of characters. We will now examine some functions from the Perl Compatible Regular Expression Library which allow us to break a string apart by a regular expression and then store each element in an array.

PHP has a function called preg_replace() which can perform a regular expression search and replace on a string or on every element of an array. When it is used on an array, the function has the following syntax:

$array = preg_replace($regex_find, $replacement, $array);

Here is an example of this function in use:

<?php
$animals = array(
                 "Camels",
                 "Wildebeests",
                 "Penguins",
                 "Dogs"
                 );

// Replace "Dogs" with "Cats" in every element of the array.
$animals = preg_replace('/Dogs/', 'Cats', $animals);

foreach($animals as $animal) {
    print "$animal<br>";
}
?>

When the above code is run we get the following:

Camels
Wildebeests
Penguins
Cats

You should notice that the preg_replace() function replaced "Dogs" with "Cats". You should also notice that our regular expression was simply the sequence of literals "Dogs". Notice that we put our regular expression inside of delimiters ("/") since we are using the Perl Compatible Regular Expression library.

The above example should give you an idea of what the preg_replace() function does. It looks very similar to the ereg_replace() function at first, but observe what we can do if we pass it a more complicated regular expression. Page 113 of Programming PHP has the following example, which shows how powerful the Perl Compatible Regular Expression library can be:

<?php
$long_names = array(
           "Fred Flinstone",
           "Barney Rubble",
           "Wilma Flinstone",
           "Betty Rubble"
         );

// Keep the first initial and the whole last name of every element.
$short_names = preg_replace('/(\w)\w* (\w+)/', '\1 \2', $long_names);

foreach($short_names as $name) {
    print "$name<br>";
}
?>

When the above code is run we get the following:

F Flinstone
B Rubble
W Flinstone
B Rubble

The power of the above code lies in the regular expressions that you can pass as arguments to preg_replace(). Here is the first argument:

'/(\w)\w* (\w+)/'

Note that the above uses a symbol that we have not yet talked about.

Symbol   Name          Meaning
\w       backslash w   A word character; the regex equivalent of [0-9A-Za-z_]

If we break the above apart we have the following:

(\w) groups the first word character.
\w* matches zero or more further word characters.
(\w+) groups one or more word characters.

Note that there is a space between the second and third items. That is, it is:

'/(\w)\w* (\w+)/'

and not:

'/(\w)\w*(\w+)/'

What the above does is group the first character, match the rest of the first word up to the space, and then group the second word. Here is the second argument:

'\1 \2'

The key to how this code works is how it is able to access the groupings. The above line asks for the first and second groupings, separated by a space. Since the first grouping holds the first character of the first name, you get the first initial, and since the second grouping holds all of the second word, you get the whole last name. Thus the replacement returns just the first initial and the last name. (Note also that the replacement string is not specified with delimiters.)

The above is one of those cases where one strange looking line of Perl-based PHP:

$short_names = preg_replace('/(\w)\w* (\w+)/', '\1 \2', $long_names);

can do work that might take dozens of lines of C or Java.
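
The replacement string can hold literal text as well as references to the groupings. As a small sketch, the variation below adds a period after the initial and uses the dollar-sign spelling ($1, $2) that preg_replace() accepts as an alternative to \1 and \2. The shortened name list is only for illustration.

```php
<?php
$long_names = array("Fred Flinstone", "Barney Rubble");

// '$1' and '$2' refer to the same groupings as '\1' and '\2'; the period
// and the space are copied into the result literally.
$short_names = preg_replace('/(\w)\w* (\w+)/', '$1. $2', $long_names);

print implode(", ", $short_names);  // F. Flinstone, B. Rubble
```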

Splitting Strings into Arrays

PHP comes with a function called preg_split(), which splits a string by a regular expression and puts each new sub-string in an array. Let's suppose that we had a list like this:

Bread|Fish|Cheese

Note that the above list is separated by pipes (|). If we wanted to take every item on the list and store it in an array (and then print the array) we could do the following:

$array = preg_split('/\|/', 'Bread|Fish|Cheese');
foreach($array as $element)
   print "$element<br>";

If we were to run the above code it would look like this:

Bread
Fish
Cheese

Note that in the code above we used the backslash to escape the special meaning of the pipe.
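
Because the separator is a regular expression, it can describe more than one kind of delimiter. The sketch below assumes a messier list (an invented input, not from the text) that mixes pipes and commas with optional surrounding spaces. Inside square brackets the pipe is a literal character, so it needs no backslash there.

```php
<?php
$list = "Bread | Fish,Cheese";

// Split on a pipe or a comma, swallowing any white space around it.
$items = preg_split('/\s*[|,]\s*/', $list);

foreach ($items as $item) {
    print "$item\n";
}
// Prints Bread, Fish and Cheese on separate lines.
```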

Splitting by HTML

Let's suppose that we wanted to use a regular expression to describe what is inside of an HTML tag so that we could strip the content out of an HTML tag and put that content into an array. We would first need a regular expression that described what is inside of an HTML tag.

We should start by declaring a grouping, since we are going to want to group HTML in distinction to non-HTML:

$regex = '/()/';

Next we will specify how all HTML starts and ends:

$regex = '/(<>)/';

Within the HTML delimiters we will want to capture any character that is not a greater-than (">"), since a greater-than represents the end of the HTML tag. To represent this set of characters we use the following:

[^>]

Remember that square brackets represent a set of characters. Recall also that when the caret ("^") is the first character inside the square brackets, it negates the set, so that it matches any character not listed. Since we want zero or more such characters we add the asterisk:

[^>]*

When we add the above the final result is this:

$regex = '/(<[^>]*>)/';

We can now test the above with a sample HTML string:


$html = "<html><a href=\"http://www.php.net\">PHP</a></html>";
$regex = '/(<[^>]*>)/';
$array = preg_split($regex, $html);
foreach($array as $element)
   print "$element<br>";

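If we were to run the above, the only visible text would be the string "PHP". The array also contains empty strings for the gaps between adjacent tags, which print as blank lines.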
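
The sketch below shows two flags of preg_split() that are useful here: PREG_SPLIT_NO_EMPTY drops the empty strings, and PREG_SPLIT_DELIM_CAPTURE also returns the text matched by the grouping, which is why the pattern was written with parentheses. The variable names $all, $text and $parts are illustrative. For real-world HTML a proper parser is usually more robust than a regular expression.

```php
<?php
$html  = "<html><a href=\"http://www.php.net\">PHP</a></html>";
$regex = '/(<[^>]*>)/';

// Default behaviour: five pieces, four of them empty strings.
$all = preg_split($regex, $html);

// Drop the empty pieces; only the text between the tags remains.
$text = preg_split($regex, $html, -1, PREG_SPLIT_NO_EMPTY);

// Keep the tags too, by also returning what the grouping captured.
$parts = preg_split($regex, $html, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

print count($all) . "\n";   // 5
print $text[0] . "\n";      // PHP
print $parts[0] . "\n";     // <html>
```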

jfulton [at] member.fsf.org
22 Aug 2013