[PHP] Remove HTML tags

This is a first test post. I intend to use this account as a place to post snippets of code and answers to technical problems I’ve had while designing and programming websites.

Let’s see what Posterous does with some code.

There is an activity feed on a website I run. Some of the posts are a bit long, so we want to truncate them, Facebook-style, with a “more” link. This is easily done with the current system, but the truncation itself can be a bit complex.

It’s easy to truncate plain text, but where you have HTML in the post, things get trickier: if you truncate to a set character length, you might break a tag in two or leave a tag unclosed, breaking the site considerably!

So we need to strip the HTML from the post as part of the process of deciding whether to truncate it. This is only step one: if you have a post with several videos but less text than your character limit, you could still end up with a long post. It’s an important first step, though.
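To make the problem concrete, here is a minimal sketch of both halves: a naive cut that lands inside a tag, and a length check done on the tag-stripped text instead. The helper name `needs_truncation`, the sample post and the limits are made up for illustration. `strip_tags`, `substr` and `strlen` are standard PHP; they are byte-based, so this sketch assumes plain ASCII text.

```php
<?php
// Hypothetical sample post: a short sentence wrapped in a link with a long URL.
$post = '<p>Hello <a href="http://example.com/a/very/long/url">world</a>!</p>';

// Naive approach: cut the raw HTML at a fixed length.
// The cut lands inside the <a ...> tag and leaves broken markup behind.
$naive = substr($post, 0, 30);

// Safer first step: measure only the visible text.
// needs_truncation() is a made-up helper name, not part of any library.
function needs_truncation($html, $limit)
{
    $visible = trim(strip_tags($html));
    return strlen($visible) > $limit;
}

var_dump($naive);                       // the half-open <a> tag is visible here
var_dump(needs_truncation($post, 40));  // decision based on visible text only
```

The raw string is well over 40 characters, but the visible text ("Hello world!") is only 12, so a check on the raw length would truncate a post that does not need it.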

To save time, I had a hunt for a pre-written function as a starting point for my new code, and this is what I found:

/** * Remove HTML tags, including invisible text such as style and * script code, and embedded objects. Add line breaks around * block-level tags to prevent word joining after tag removal. */ function strip_html_tags( $text ) { $text = preg_replace( array( // Remove invisible content '@]*?>.*?@siu', '@<!-- ]*?>.*? -->@siu', '@<!-- ]*?.*? // -->@siu', '@]*?.*?@siu', '@]*?.*?@siu', '@]*?.*?@siu', '@]*?.*?@siu', '@<!--]*?.*?-->@siu', '@]*?.*?@siu', // Add line breaks before and after blocks '@<!--?((address)|(blockquote)|(center)|(del))@iu', '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu', '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu', '@</?((table)|(th)|(td)|(caption))@iu', '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu', '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu', '@</?((frameset)|(frame)|(iframe))@iu', ), array( ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', "n$0", "n$0", "n$0", "n$0", "n$0", "n$0", "n$0", "n$0", ), $text ); return strip_tags( $text ); }

Aaaaand Posterous has COMPLETELY mangled the code. BBCode-style {code}{/code} tags aren’t displaying line breaks, which I had working earlier. Hmm.
(Note: curly brackets are used instead of square brackets there because, even if I use the HTML character entities &#91; and &#93; in the post’s HTML source, Posterous’ parser over-zealously interprets them as BBCode.)

They have also stripped out chunks which could represent dangerous code. Fair enough: they need to ensure people don’t inject malicious code into the pages, which would be bad for them and bad for users. But if I NEED to post some code which contains the TEXT of a script or embed tag, as in this case, that’s a bit useless for me. And this code is not dangerous; it’s not even valid markup, just regex. It’s going to take some work for them to get their parser to tell the difference.

The alternative method is Posterous’ markdown tag, where you can specify the syntax with #!{language}.

However, if I use this with #!php, the code is still mangled, syntax is not highlighted (possibly due to my template?), and anything OUTSIDE of the markdown tags disappears.

Ah well, it’s a new service, I’m sure they will iron out the bugs.

You can still get the code, courtesy of:
http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page

The code is covered by the OSI-approved BSD license, so you can use, modify, redistribute, and sell it as you see fit, provided you keep the copyright notice and license text.
http://www.opensource.org/licenses/bsd-license.php
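Since the listing in this post lost its tag names, here is a sketch of how the function reads with them put back. The nine tag names in the first group of patterns (head, style, script, object, embed, applet, noframes, noscript, noembed) are my reconstruction from the doc comment and the shape of the surviving patterns, so treat them as an assumption and check them against the linked article. The backslash in the line-break replacement ("\n$0") was also eaten by the parser and is restored here.

```php
<?php
/**
 * Remove HTML tags, including invisible text such as style and
 * script code, and embedded objects. Add line breaks around
 * block-level tags to prevent word joining after tag removal.
 *
 * NOTE: tag names in the first nine patterns are reconstructed (assumed).
 */
function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
            // Add line breaks before and after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            // One replacement per pattern: a space for invisible content...
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            // ...and a newline placed in front of each block-level tag.
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        ),
        $text );
    return strip_tags( $text );
}

// Usage: the script body disappears and the two paragraphs stay separated.
echo strip_html_tags( '<p>one</p><script>alert(1)</script><p>two</p>' );
```

The order matters: the invisible blocks are removed with their contents first, then newlines are inserted in front of block-level tags, and only then does `strip_tags` remove the remaining markup.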
