PHP - regex/preg syntax

From Global Programming Syntax

Introduction:
If you have not come across regex (regular expressions) or the related PHP functions before, this introduction will guide you through them. Regex is a small pattern language for matching and identifying strings. A pattern describes which characters may or may not appear, in what order and how many times, with alternatives that act a little like 'if conditions'. In PHP, patterns are run by the preg_ functions, which are built on the PCRE library.

So how does this relate to PHP? PHP has the regex engine built into several functions, including preg_replace(), preg_match(), preg_match_all() and preg_split(), each of which lets you match advanced combinations of strings. Below is an example that replaces every character that is not a number or a letter with an underscore:

preg_replace('/[^0-9a-zA-Z]/is','_',$string);

Basics before going near the advanced

First of all, if you read the introduction you will have noticed the functions that support regex. There were also older functions that did the same job (eg. split() vs preg_split()), and in the author's experience the preg_ functions won on performance every time. Note that split() and the other ereg functions were deprecated in PHP 5.3 and removed in PHP 7, so the preg_ functions are the ones to use. If an equivalent function without regex does the job, it is better to use that function (eg. explode() vs preg_split()), since regex uses a fair bit of CPU compared to the non-regex equivalents. Below is a list of the standard regex functions and their non-regex equivalents:

Regex function      Non-regex equivalent
preg_match_all()    str_word_count()
preg_match()        substr_count()
preg_replace()      str_replace(), str_ireplace()
preg_split()        explode()

The above are the most commonly used functions. Note that preg_match() is mainly used in if statements while preg_match_all() is mainly used to retrieve the results of the match as an array.
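
As a rough sketch of the difference described above, the snippet below uses preg_match() as a condition and preg_match_all() to collect results, then compares explode() with preg_split(). The sample strings and variable names are made up for this illustration.

```php
<?php
// Hypothetical sample data, chosen only for this illustration.
$text = 'Order 12 shipped, order 7 pending, order 305 cancelled.';

// preg_match() returns 1 on a match, 0 on no match and false on error,
// so it fits naturally inside a condition.
$hasNumber = preg_match('/[0-9]+/', $text) === 1;
if ($hasNumber) {
    echo "The text contains a number.\n";
}

// preg_match_all() returns the number of matches and fills $matches.
$count = preg_match_all('/[0-9]+/', $text, $matches);
print_r($matches[0]); // the three numbers as strings

// A fixed separator needs no regex: explode() is enough.
$simple = explode(',', 'a,b,c');

// preg_split() is worth its cost when the separator varies.
$flexible = preg_split('/\s*[,;]\s*/', 'a , b;c');
```

Both $simple and $flexible end up holding the same three letters, but only the second call needed a pattern to cope with the mixed separators and stray spaces.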

Now that you know the alternative functions to use when possible, the rest of this page describes how to use the regex functions. To use a regex function you need to get familiar with modifiers. The modifier is the 'i' or 'is' that follows the closing slash of the pattern. It tells the function whether the match is case sensitive or insensitive, among various other things. Each modifier is one letter long, so you can specify more than one after the last slash; a common combination is '/is'. Below is a brief description of each one:

Modifier    Description
A           Forces the pattern to match only at the start of the input string. The input string is the string to be searched/replaced.
D           Makes the dollar symbol match only at the very end of the string. Without it, the dollar also matches just before a final new line. This modifier is ignored if the m modifier is set.
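e           This modifier only applied to preg_replace(). It evaluated the second parameter as PHP code. It was deprecated in PHP 5.5 and removed in PHP 7.0 because it made code injection easy; preg_replace_callback() does the same job safely, as in the example below.
preg_replace_callback('/<([^>]+)>/', function ($m) { return '<' . strtoupper($m[1]) . '>'; }, $html_string);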
i           Makes the regex match case insensitively, meaning it takes no notice of upper or lower case.
J           Allows duplicate names for named subpatterns. This modifier is rarely needed.
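m           Treats the input string as multiple lines. By default the input is treated as a single line, so ^ matches only at its start and $ only at its end. With this modifier, ^ and $ also match at the start and end of every line.
$input="user\r\nhotmail.com";
//\r\n is matched explicitly because the dot would otherwise capture the \r
echo preg_replace('/^(.*)\r\n(.*)$/m','$1@$2',$input); //Outputs: user@hotmail.com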
S           Asks regex to spend extra time analysing the pattern up front, which could speed things up when the same pattern is applied many times. As of PHP 7.3 this modifier has no effect.
s           Allows the dot character ( . ) to match any character, including new lines. Without this modifier the dot does not match the new line character (\n). Note that the dot only has this meaning outside the [] brackets, and that it represents exactly one character unless a quantifier such as an asterisk ( * ) follows it immediately.
$input='abcdefg';
//below shows the general effects of the dot symbol
echo preg_replace("/(.)(.*)/",'$1--$2',$input); //outputs: a--bcdefg
echo '<br><xmp>';

//below shows that the dot only crosses a new line with the s modifier
$input="asdf\nuser2";
echo preg_replace('/f.u/','-',$input); //no match, outputs the input unchanged
echo preg_replace('/f.u/s','-',$input); //outputs: asd-ser2
echo '</xmp>';
With this modifier a wildcard such as .* can run across several lines, which occasionally helps preg_match_all() capture a block of text that spans lines.
U           Inverts the greediness of the quantifiers, so regex matches many small combinations instead of one big combination. Below is an example that uses a wildcard to remove all characters before each 'z'. Without the U modifier only a single z is output, since the wildcard (.*) tries to contain as much data as possible.
$input='111z000z111z000z111z000z111z000z';
echo preg_replace('/(.*)z/U','z',$input); //Outputs: zzzzzzzz
echo '<br>';
echo preg_replace('/(.*)z/','z',$input); // Outputs: z
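u           Treats both the pattern and the input string as UTF-8, so a multi-byte character counts as one character rather than several bytes.
X           Turns on extra PCRE checks: a backslash followed by a letter that has no special meaning is reported as an error. This is rarely required.
x           Ignores whitespace inside the pattern (except when escaped or inside [] brackets) and treats everything from an unescaped # to the end of the line as a comment, which lets a long pattern be spread over several commented lines.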
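
The x and u modifiers are easier to judge with something concrete in front of you. The snippet below is only a sketch: the date pattern and the sample word are invented for the illustration, and the "\u{...}" escape assumes PHP 7 or later.

```php
<?php
// x: whitespace and # comments inside the pattern are ignored,
// so a longer pattern can be laid out and documented line by line.
$datePattern = '/
    ^([0-9]{4})   # year
    -([0-9]{2})   # month
    -([0-9]{2})$  # day
/x';
$isDate = preg_match($datePattern, '2024-01-31', $parts);
// $parts[1], $parts[2] and $parts[3] now hold year, month and day.

// u: the dot matches one UTF-8 character instead of one byte.
$word  = "\u{00E9}t\u{00E9}";            // three characters, five bytes
$bytes = preg_match_all('/./', $word);   // counts bytes
$chars = preg_match_all('/./u', $word);  // counts characters
```

Without u the two accented letters are each seen as two separate bytes, which is why the first count is higher than the second.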


What are the ^ and $ symbols for?

By default the ^ symbol represents the beginning of the input string and the $ symbol represents its end, so a pattern with ^ at the beginning and $ at the end must match the whole string. With the m modifier they represent the beginning and end of each line instead. Below is an example using preg_match_all() to turn a string into an array of its lines.

//lines are separated by \n only; with \r\n each match would keep a trailing \r
$input="111\n222\n333\n444\n555\n666";
preg_match_all('/^(.*)$/Um',$input,$output);
echo '<xmp>';
print_r($output);
echo '</xmp>';

You may find that explode() would be a more efficient function for this job, but this is just an example to demonstrate the abilities of regex/preg.
The code above outputs the text shown below. Next is how to read it. To display array data, generally the best function is print_r($array) surrounded by echo '<xmp>' and echo '</xmp>'. Those echo statements tell the browser to display the text in a monospaced font with new lines preserved, so the layout is not messed up (<xmp> is obsolete in current HTML, where <pre> does the same job). In the output, $output[0] holds the text matched by the whole pattern and $output[1] holds the text captured by the first set of brackets, so for example $output[0][0] equals 111 and $output[0][1] equals 222. This is known as a multi-dimensional array: the output shows 2 arrays inside an array, and those 2 arrays each contain 6 values.

Array
(
    [0] => Array
        (
            [0] => 111
            [1] => 222
            [2] => 333
            [3] => 444
            [4] => 555
            [5] => 666
        )

    [1] => Array
        (
            [0] => 111
            [1] => 222
            [2] => 333
            [3] => 444
            [4] => 555
            [5] => 666
        )

)
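
To see what ^ and $ do with and without the m modifier, a small comparison may help. This is a sketch with made-up input; it also shows the D modifier, which stops $ from accepting a trailing new line.

```php
<?php
$input = "111\n222\n333";

// Without m, ^ and $ refer to the whole input, so a three-line
// string is not "digits only from start to end".
$whole = preg_match('/^[0-9]+$/', $input);

// With m, every line is tested on its own.
$perLine = preg_match_all('/^[0-9]+$/m', $input, $lines);

// By default $ also matches just before a final new line...
$lenient = preg_match('/^[0-9]+$/', "111\n");

// ...and the D modifier switches that behaviour off.
$strict = preg_match('/^[0-9]+$/D', "111\n");
```

Here $whole and $strict are 0, $lenient is 1 and $perLine is 3, with the three lines stored in $lines[0].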

Regex with preg_replace()

The preg_replace() function is commonly used when dealing with strings, as it can help solve a variety of problems. First we shall look at using brackets in the preg_replace() regex. Below is an example in which any one of three words is matched case insensitively and the text surrounding that word is removed.

$input='this is dword3 test.';
echo preg_replace('/^.*(word1|word2|dword3).*$/i','$1',$input);

As demonstrated in the code above, the brackets together with the | symbol act like an if statement: the pattern matches if any one of the alternatives is found. When brackets are used in preg_replace() or preg_match_all(), they also act as a variable container, since in preg_replace() the data matched inside the brackets can be passed on to the replacement value (eg. $1). In preg_match_all(), the brackets also determine what the array result will contain.
To help you understand this demonstration, the ^ and $ mark the beginning and end of the input line. The .* is a wildcard that takes in the surrounding characters. In the middle are the possible words that can be matched. The second parameter then holds a variable referring to the first set of brackets in the regex.
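
The way brackets shape the result can be sketched as follows. The input string is invented for the example; PREG_SET_ORDER is a standard flag of preg_match_all() that groups the result per match instead of per bracket pair.

```php
<?php
$input = 'width=100; height=250; depth=7';

// Default order: one array per bracket pair.
preg_match_all('/([a-z]+)=([0-9]+)/', $input, $byGroup);
// $byGroup[1] holds the names, $byGroup[2] holds the numbers.

// PREG_SET_ORDER: one array per match instead.
preg_match_all('/([a-z]+)=([0-9]+)/', $input, $byMatch, PREG_SET_ORDER);
// $byMatch[0] holds the full first match, then its name, then its number.

// In preg_replace(), $1 and $2 refer to the same bracket pairs.
$swapped = preg_replace('/([a-z]+)=([0-9]+)/', '$2=$1', $input);
echo $swapped; // 100=width; 250=height; 7=depth
```

Which order is more convenient depends on whether you want to loop over matches or pull out one column of values.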

Next example: what if you wanted to match both 'word' and 'dword' case insensitively, followed by any number? The following script will do just that. It will match combinations such as 'dword123', 'WoRd11', 'DwoRd81038' etc.

$input='this is Dword3235 test.';
echo preg_replace('/^.*(word[0-9]++|dword[0-9]++).*$/iU','$1',$input);

So how does it work? Very much like the previous script, except that this script introduces the [0-9]++ feature and an additional modifier, U, as explained in the table above. With the [] brackets, you place between them the characters you wish regex to search for, and a + after the brackets tells regex to search for a run of one or more of those characters instead of just one character. The doubled ++ is the possessive form, which always takes as many characters as it can, even when the U modifier has made every other quantifier take as few as possible. As you may have seen in the example, you can also specify ranges of characters such as 0-9 A-Z a-z.
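
The same result can also be reached without the U modifier by making only the leading wildcard lazy with a question mark. This is an alternative sketch rather than the script above, and the d? shortcut for the optional 'd' is a simplification introduced here.

```php
<?php
$input = 'this is Dword3235 test.';

// .*? is lazy for this one quantifier only; [0-9]+ stays greedy.
$lazy = preg_replace('/^.*?(d?word[0-9]+).*$/i', '$1', $input);
echo $lazy; // Dword3235

// With a greedy leading wildcard the 'D' is swallowed by .* instead.
$greedy = preg_replace('/^.*(d?word[0-9]+).*$/i', '$1', $input);
echo $greedy; // word3235
```

The greedy version shows why the leading wildcard has to take as little as possible: otherwise it eats the optional 'd' before the brackets get a chance to match it.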

Optional brackets and escaping characters

Below is an example of converting a url to just the domain name, which introduces a few new items. To match both http:// and https://, the pattern has the s in brackets with a question mark after it. That simply makes the s optional, and it can be done with any number of characters inside the brackets. A number of characters are also escaped (have a backslash before them). That is because those particular characters mean something in regex, just as the question mark marks the bracket as optional. To use those characters as part of a string instead of as regex syntax, you simply place the \ symbol before them. Those characters include * / ? [ ] . ( ) + ^ $ { } | and the backslash itself; the slash only needs escaping because it is the delimiter here.

$url_input='http://www.example.com/test/page.php?id=1830';
echo preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/','$4',$url_input); //Outputs: www.example.com ($1 would keep the http:// prefix as well)
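
When the text to be matched comes from a variable, escaping by hand is error prone. PHP's preg_quote() adds the backslashes for you; the sketch below uses made-up sample strings to show it next to an unescaped pattern that matches more than intended.

```php
<?php
$needle = 'page.php?id=1830'; // contains . and ? which are special in regex
$url    = 'http://www.example.com/test/page.php?id=1830';

// The second argument makes preg_quote() escape the delimiter as well.
$quoted = preg_quote($needle, '/');
$found  = preg_match('/' . $quoted . '$/', $url);

// Unescaped, the dot matches any character and the ? makes the
// preceding 'p' optional, so this unrelated string matches too.
$loose = preg_match('/page.php?id/', 'pageXphid');
```

Both calls return 1, but only the first one matches for the right reason.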

Making a validator with preg_replace()

Usually when running database queries on $_GET input, you will want to remove any nasty code which a hacker may have strategically planted into the url or sent through $_POST. That's right: just because you use $_POST doesn't mean that it's secure. There is always the possibility of using curl to send data through $_POST. That is why it is best to filter any user input, whether it comes from $_POST or $_GET, and below are some examples of how. There are usually ready-made tools for this, such as prepared statements in mysqli or PDO for database queries (the older mysql_real_escape_string() was removed along with the mysql extension in PHP 7) and htmlentities() for output, but there are cases where regex can do a better job. Below is an example of how to remove all non-integer/number characters.

$input='a1b2c3d4e5f6g7h8i9j0k!l@m#n$o%p^q&r*s(t)u_v+w|x}y{z';
echo preg_replace('/[^0-9]/','',$input); //Outputs: 1234567890

So you can place any set of characters you wish to keep within the [^] brackets and all others will be removed. A nice and easy way to clean up input before it is used.
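
Removing unwanted characters and rejecting bad input are two different jobs, and it can be useful to see them side by side. The function names and the 10-digit limit below are hypothetical choices made for this sketch, not part of PHP.

```php
<?php
// Hypothetical helper: strips everything that is not a digit.
function sanitize_digits(string $value): string
{
    return preg_replace('/[^0-9]/', '', $value) ?? '';
}

// Hypothetical helper: accepts the value only if it is 1 to 10 digits.
function is_valid_id(string $value): bool
{
    // The D modifier stops $ from accepting a trailing new line.
    return preg_match('/^[0-9]{1,10}$/D', $value) === 1;
}

$clean   = sanitize_digits('id=18a30');
$okay    = is_valid_id('1830');
$newline = is_valid_id("1830\n");
$letters = is_valid_id('18a30');
```

Here $clean becomes '1830', $okay is true and the other two are false. Rejecting input outright is usually the safer choice when a changed value could mean something different from what the user sent.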

Delimiters

The delimiters are the 2 slashes that surround the regex. They can be substituted with other symbols: any character that is not a letter, a number, a backslash or whitespace will do. An example of matching 'text' case insensitively with each type of delimiter is as follows.

$input='This is the Text.';
preg_match('/text/i',$input); //general usage delimiter
preg_match('#text#i',$input);
preg_match('%text%i',$input);
preg_match('&text&i',$input);
preg_match('@text@i',$input);
preg_match('`text`i',$input);
preg_match('~text~i',$input);
preg_match('"text"i',$input);
preg_match("'text'i",$input);
//below example same as example above
//except quotes are escaped to fit into string.
preg_match('\'text\'i',$input);

As you can see, there are many different delimiters that you can use, and you may well ask why. The answer is simple: what if there are a lot of slashes in your regex? You would need to escape each of those slashes, or you could just change the delimiter.
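
The saving is easiest to see on a pattern that is full of slashes. The url below is a made-up example; the three calls are meant to match the same thing.

```php
<?php
$url = 'https://www.example.com/test/page.php';

// With / as the delimiter, every slash in the pattern needs a backslash.
$a = preg_match('/^https?:\/\/[^\/]+\/test\//', $url);

// With # as the delimiter, the same pattern needs no escaping.
$b = preg_match('#^https?://[^/]+/test/#', $url);

// Bracket pairs work as well: the opening and closing symbols differ.
$c = preg_match('{^https?://[^/]+/test/}i', $url);
```

All three return 1 for this url, and the second and third are noticeably easier to read.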

Full Video Tutorial

Below is a regex video tutorial teaching all the basics and some of the great things you can do with regex. But keep in mind that regex is cpu hungry.
[video src=php_regex size=600x474]
