Breaking down the Regex

In my previous post I used some complex Regex with PHP to manipulate some HTML.

The principles in the Regex can be used to manipulate all kinds of HTML if you know how to break it down. Reading Regex can be a pain, and I always wish people would break it down, so, for my own reference as much as anything else, here it is:

The Regex:

/<object\s+[^>]*width\s*=\s*(?:"([^"]+)")\s+[^>]*height\s*=\s*(?:"([^"]+)")>((<param\s+[^>]*>)*)<embed\s+[^>]*width\s*=\s*(?:"([^"]+)")\s+[^>]*height\s*=\s*(?:"([^"]+)")><\/[e]mbed><\/object>/ims

The Breakdown

Wow. That’s some hefty regex, so let’s break it down:

/ start of regex string
<object find string literal <object
\s+ 1+ + whitespace chars \s
[^>]* 0+ chars * not > [^>]
width string literal width
\s* 0+ * whitespace chars \s
= string literal =
\s* 0+ * whitespace chars \s
(?:"([^"]+)")

" string literal "
1+ + characters not " [^"]
" string literal "

back reference 1 () is everything between the double quotes

n.b. ? normally means 0 or 1 of the preceding character. Here, the ?: at the start of the outer brackets makes them a non-capturing group: they group the pattern without creating a back reference, so only the inner brackets capture.
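To see what ?: changes, here is a minimal sketch that runs the width part of the pattern twice, once with capturing outer brackets and once with non-capturing ones. The attribute string is a made-up sample, not taken from the original HTML.

```php
<?php
// Hypothetical attribute string, used only for illustration.
$subject = 'width="425"';

// Outer brackets capture here, so they become back reference 1
// and the inner brackets become back reference 2.
preg_match('/width\s*=\s*("([^"]+)")/', $subject, $capturing);

// With ?: the outer brackets only group, so the inner brackets
// are back reference 1.
preg_match('/width\s*=\s*(?:"([^"]+)")/', $subject, $nonCapturing);

print_r($capturing);    // elements: width="425", "425", 425
print_r($nonCapturing); // elements: width="425", 425
```

In this pattern the outer brackets are not strictly needed at all, but marking them with ?: keeps the back reference numbering tidy.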

\s+ 1+ + whitespace chars \s
[^>]* 0+ chars * not > [^>]
height string literal height
\s* 0+ * whitespace chars \s
= string literal =
\s* 0+ * whitespace chars \s
(?:"([^"]+)")

" string literal "
1+ + characters not " [^"]
" string literal "

back reference 2 () is everything between the double quotes

> string literal >
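((<param\s+[^>]*>)*)

0+ * of the following:
   string literal <param
   1+ + whitespace chars \s
   0+ chars * not > [^>]
   string literal >

The outer brackets are back reference 3 (all of the <param> tags together) and the inner brackets are back reference 4. A repeated group does not create a new back reference for each match, so back reference 4 holds only the last <param> tag matched.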
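A quick way to check how a repeated group behaves is to run the <param> part on its own. This is a sketch with made-up markup; note that because the sub-pattern stands alone here, the two groups are numbered 1 and 2 rather than 3 and 4.

```php
<?php
// Hypothetical markup, used only for illustration.
$html = '<param name="a" value="1"><param name="b" value="2">';

// Same sub-pattern as in the breakdown: an outer group around a repeated inner group.
preg_match('/((<param\s+[^>]*>)*)/i', $html, $m);

echo $m[1], "\n"; // both <param> tags (the outer group)
echo $m[2], "\n"; // only <param name="b" value="2"> (last repetition of the inner group)

// To collect every tag separately, match the single-tag pattern repeatedly instead.
preg_match_all('/<param\s+[^>]*>/i', $html, $all);
print_r($all[0]); // two elements, one per <param> tag
```

If you need each <param> individually, a second pass with preg_match_all over back reference 3 is one option.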

<embed string literal <embed
\s+ 1+ + whitespace chars \s
[^>]* 0+ chars * not > [^>]
width string literal width
\s* 0+ * whitespace chars \s
= string literal =
\s* 0+ * whitespace chars \s
(?:"([^"]+)")

" string literal "
1+ + characters not " [^"]
" string literal "

back reference 5 () is everything between doublequotes

\s+ 1+ + whitespace chars \s
[^>]* 0+ chars * not > [^>]
height string literal height
\s* 0+ * whitespace chars \s
= string literal =
\s* 0+ * whitespace chars \s
(?:"([^"]+)")

" string literal "
1+ + characters not " [^"]
" string literal "

back reference 6 () is everything between doublequotes

([^>]*) 0+ chars * not > [^>]

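back reference 7 () is any remaining attributes before the closing >

n.b. the regex as quoted at the top does not include this group; there the > follows the height value directly.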

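><\/[e]mbed><\/object> string literal ></embed></object>

\e is the escape character, so placing the e in a character class [e] keeps it apart from the backslash before it. Strictly, \/ is already read as an escaped slash, so [e] is a precaution rather than a requirement.
/ is the regex delimiter, so each literal / is escaped with \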
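If escaping every / gets tedious, PHP lets you pick a different delimiter. The sketch below matches the closing tags both ways; the sample string and the variable names are my own, for illustration only.

```php
<?php
// Hypothetical closing markup, used only for illustration.
$html = '</embed></object>';

// With / as the delimiter, every literal / in the pattern needs a backslash.
$slashDelimited = '/<\/embed><\/object>/i';

// With # as the delimiter, / is an ordinary character and needs no escaping.
$hashDelimited = '#</embed></object>#i';

var_dump(preg_match($slashDelimited, $html)); // int(1)
var_dump(preg_match($hashDelimited, $html));  // int(1)
```

The first pattern also matches without the [e] character class, since \/ is consumed as an escaped slash before the e is read.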

/ims

end regex /
ignore case i
multiline m (^ and $ also match at line breaks)
dotall s (. also matches newlines)
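Putting it all together, here is a rough sketch of the whole pattern run through preg_match, with the backslashes written out. The sample markup is hypothetical and shaped so that the pattern matches it (no whitespace between tags, no closing </param> tags); real embed code may well need a looser pattern.

```php
<?php
// Hypothetical sample markup, shaped so that the pattern matches it.
$html = '<object width="425" height="344">'
      . '<param name="movie" value="http://example.com/v.swf">'
      . '<param name="wmode" value="transparent">'
      . '<embed src="http://example.com/v.swf" width="400" height="300">'
      . '</embed></object>';

// The pattern from the top of the post, split across lines for readability.
$pattern = '/<object\s+[^>]*width\s*=\s*(?:"([^"]+)")\s+[^>]*height\s*=\s*(?:"([^"]+)")>'
         . '((<param\s+[^>]*>)*)'
         . '<embed\s+[^>]*width\s*=\s*(?:"([^"]+)")\s+[^>]*height\s*=\s*(?:"([^"]+)")>'
         . '<\/[e]mbed><\/object>/ims';

if (preg_match($pattern, $html, $m)) {
    echo "object: {$m[1]} x {$m[2]}\n"; // object: 425 x 344
    echo "params: {$m[3]}\n";           // both <param> tags
    echo "last param: {$m[4]}\n";       // the wmode <param> tag only
    echo "embed: {$m[5]} x {$m[6]}\n";  // embed: 400 x 300
}
```

Single-quoted PHP strings leave \s and \/ untouched, which is why the pattern can be pasted in as-is.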

Test it out! Grab the sample HTML and the regex above (everything between the start and end slashes) and paste them into http://regex.larsolavtorvik.com/

You can paste it in one section at a time to see how the regex builds up piece by piece.
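The same piece-by-piece approach can be tried from PHP. This is only a sketch: the way I have split the pattern into sections, the $sections and $steps names, and the sample markup are all my own choices for illustration.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<object width="425" height="344"><param name="movie" value="v.swf">'
      . '<embed src="v.swf" width="400" height="300"></embed></object>';

// The pattern split into sections; each step appends one more section.
$sections = [
    '<object\s+[^>]*width\s*=\s*(?:"([^"]+)")',
    '\s+[^>]*height\s*=\s*(?:"([^"]+)")>',
    '((<param\s+[^>]*>)*)',
    '<embed\s+[^>]*width\s*=\s*(?:"([^"]+)")',
    '\s+[^>]*height\s*=\s*(?:"([^"]+)")>',
    '<\/[e]mbed><\/object>',
];

$built = '';
$steps = [];
foreach ($sections as $section) {
    $built .= $section;
    // Wrap the partial pattern in delimiters and flags, then record what it matches.
    if (preg_match('/' . $built . '/ims', $html, $m)) {
        $steps[] = $m[0];
        echo $m[0], "\n"; // the matched text grows as each section is added
    }
}
```

If a step stops matching, the section added at that step is the one to look at.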
