markdown via email test

This is a test of that same script sent via RTF email:

And it’s gone a bit wrong again. Well, this isn’t Posterous’ fault, but Outlook’s. When I highlight a block of code and tab it in wholesale, instead of inserting tabs, Outlook decides to fuck around with the HTML. What a surprise! I’d never have expected that.

[p class=MsoNormal style=‘margin-left:72.0pt’]function strip_html_tags( $text )[o:p][/o:p][/p]

n.b. < > replaced with [ ]

OK, I can see the point in using styles, to a point. But a new set of p tags with the same style tag AND a class on EVERY LINE? Oh, COME ON Microsoft! And what on EARTH is [o:p][/o:p] for? BAD Outlook. No biskwits for you.

Can Posterous work around this? It’s easy to replace every instance of style=‘margin-left:72.0pt’ with two tabs, but does every user have Outlook set up to use that size and units? If not, it gets damn complicated. Good luck to them, I wouldn’t want to do it. Dealing with MS’ idea of HTML is a daily nightmare for me and probably anyone who ever reads this (probably no-one!)
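As a rough sketch of that replacement, here is how it might look in PHP. It assumes Outlook always writes the indent as margin-left in points, and that one tab equals 36pt (so 72.0pt becomes two tabs). The function name and the 36pt figure are my own assumptions; other sizes or units would need their own handling.

```php
<?php
// Hypothetical helper: turn Outlook's indented MsoNormal paragraphs back
// into tab-indented lines. POINTS_PER_TAB is an assumption, not a standard.
const POINTS_PER_TAB = 36.0;

function outlook_indent_to_tabs(string $html): string
{
    // Outlook's empty <o:p></o:p> placeholders carry no content, so drop them.
    $html = preg_replace('@<o:p>\s*</o:p>@i', '', $html);

    // Capture the margin in points (group 1) and the paragraph body (group 2).
    $pattern = '@<p\b[^>]*\bstyle=["\']margin-left:\s*([0-9.]+)pt["\'][^>]*>(.*?)</p>@is';

    return preg_replace_callback($pattern, function (array $m): string {
        $tabs = (int) round((float) $m[1] / POINTS_PER_TAB);
        return str_repeat("\t", $tabs) . $m[2] . "\n";
    }, $html);
}

$line = "<p class=MsoNormal style='margin-left:72.0pt'>function strip_html_tags( \$text )<o:p></o:p></p>";
echo outlook_indent_to_tabs($line);
```

Paragraphs without a margin-left in points are left untouched, which is exactly the "damn complicated" part: anything in cm, inches or em slips through.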

Posterous Chris is still working on the problem with MCE chewing up the Object tags in the script, but that’s a separate issue.

/**
 * Remove HTML tags, including invisible text such as style and
 * script code, and embedded objects. Add line breaks around
 * block-level tags to prevent word joining after tag removal.
 */
function strip_html_tags( $text )
{

Props to Posterous

I mailed Posterous earlier regarding the problems I was having with markdown mode, and before the end of the day we’d had a short discussion, they’d found the source of the problem and we should have a fix in the works very soon.

Top marks!

Allowing SFTP access in Plesk 9

This is an easy one.

Go to Plesk > {your domain} > Web Hosting Settings

Set ‘Shell access to server with FTP user’s credentials’ to

This is a potential security risk, so you might not want to enable it for reseller domains.

[update 20110217-13:57]
To clarify, using SFTP obviously *increases* security when transferring files. The risk I refer to is that enabling it gives that user shell access with the FTP details, so you might not want to enable it for user accounts whose details other people have.
[/update]

Breaking down the Regex

In my previous post I used some complex Regex with PHP to manipulate some HTML.

The principles in the Regex can be used to manipulate all kinds of HTML if you know how to break it down. Reading Regex can be a pain, and I always wish people would break it down, so, for my own reference as much as anything else, here it is:

The Regex:

/<object\s+[^>]*width\s*=\s*(?:"([^"]+)")\s+[^>]*height\s*=\s*(?:"([^"]+)")>((<param\s+[^>]*>)*)<embed\s+[^>]*width\s*=\s*(?:"([^"]+)")\s+[^>]*height\s*=\s*(?:"([^"]+)")([^>]*)><\/[e]mbed><\/object>/ims

The Breakdown

Wow. That’s some hefty regex, so let’s break it down:

/          start of regex string
<object    string literal <object
\s+        1+ (+) whitespace chars (\s)
[^>]*      0+ (*) chars that are not > ([^>])
width      string literal width
\s*        0+ (*) whitespace chars (\s)
=          string literal =
\s*        0+ (*) whitespace chars (\s)
(?:"([^"]+)")

   "       string literal "
   [^"]+   1+ (+) chars that are not " ([^"])
   "       string literal "

back reference 1 () is everything between the double quotes

n.b. ? normally means 0 or 1 of the preceding character. In this instance the ?: straight after the opening bracket makes the outer brackets a non-capturing group, so they don't create a back reference

\s+        1+ (+) whitespace chars (\s)
[^>]*      0+ (*) chars that are not > ([^>])
height     string literal height
\s*        0+ (*) whitespace chars (\s)
=          string literal =
\s*        0+ (*) whitespace chars (\s)
(?:"([^"]+)")

   "       string literal "
   [^"]+   1+ (+) chars that are not " ([^"])
   "       string literal "

back reference 2 () is everything between the double quotes

>          string literal >
((<param\s+[^>]*>)*)

0+ (*) of the following:
   <param  string literal <param
   \s+     1+ (+) whitespace chars (\s)
   [^>]*   0+ (*) chars that are not > ([^>])
   >       string literal >

The outer brackets are back reference 3 (all of the param tags matched). The inner brackets are back reference 4 (only the last param tag matched, because a repeated group keeps its final match)

<embed     string literal <embed
\s+        1+ (+) whitespace chars (\s)
[^>]*      0+ (*) chars that are not > ([^>])
width      string literal width
\s*        0+ (*) whitespace chars (\s)
=          string literal =
\s*        0+ (*) whitespace chars (\s)
(?:"([^"]+)")

   "       string literal "
   [^"]+   1+ (+) chars that are not " ([^"])
   "       string literal "

back reference 5 () is everything between the double quotes

\s+        1+ (+) whitespace chars (\s)
[^>]*      0+ (*) chars that are not > ([^>])
height     string literal height
\s*        0+ (*) whitespace chars (\s)
=          string literal =
\s*        0+ (*) whitespace chars (\s)
(?:"([^"]+)")

   "       string literal "
   [^"]+   1+ (+) chars that are not " ([^"])
   "       string literal "

back reference 6 () is everything between the double quotes

([^>]*)    0+ (*) chars that are not > ([^>])

back reference 7 () is anything left in the embed tag after the height attribute

><\/[e]mbed><\/object>    string literal ></embed></object>

\e is the escape character, so placing the e in a character class [e] isolates it and resolves the issue
/ is escaped with \

/ims

/          end of regex
i          ignore case
m          multi-line mode (^ and $ match at line breaks)
s          dot-all mode (. also matches newlines)

Only i affects this pattern, since it contains no ^, $ or . characters.

Test it out! Grab the sample html and the regex above (everything between the start and end slashes) and paste them into http://regex.larsolavtorvik.com/

You can paste it in one section at a time to see how the regex builds up piece by piece.
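To try it outside the online tool, here is a minimal sketch using PHP's preg_match that prints each back reference by number. The sample markup is made up (an assumption, not the sample HTML from the post), and the pattern is written with backslashes, straight double quotes and the ([^>]*) group described in the breakdown.

```php
<?php
// The pattern, split over several lines so each section is easier to spot.
$regex = '/<object\s+[^>]*width\s*=\s*(?:"([^"]+)")\s+[^>]*height\s*=\s*(?:"([^"]+)")>'
       . '((<param\s+[^>]*>)*)'
       . '<embed\s+[^>]*width\s*=\s*(?:"([^"]+)")\s+[^>]*height\s*=\s*(?:"([^"]+)")([^>]*)>'
       . '<\/[e]mbed><\/object>/ims';

// Made-up sample markup, purely for illustration.
$haystack = '<object width="480" height="385">'
          . '<param name="movie" value="http://example.com/v/abc">'
          . '<param name="allowFullScreen" value="true">'
          . '<embed src="http://example.com/v/abc" width="480" height="385">'
          . '</embed></object>';

if (preg_match($regex, $haystack, $m)) {
    // Index 0 is the whole match; 1 to 7 are the back references.
    foreach ($m as $i => $value) {
        echo $i, ': ', $value, "\n";
    }
}
```

Back reference 4 holds only the second param tag here, which is the "last match wins" behaviour of a repeated group.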

PHP PCRE Regex: Halving video size – object and embed

Still with the activity stream.

Say I want to take only the first pair of object tags from a post, and halve the size of the object before displaying it? This way I keep a post short if it has several objects, and keep the size of videos down until the user clicks the ‘view more’ button.

Sample HTML

PHP code

Where $haystack is our html string as above

Result

Create new video tag

Result

embedembedembedembedembed

Why does it say ‘embed’ repeatedly there? I dunno! Posterous, you are preposterous at times…
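The idea described at the top of this post (keep only the first object, halve its dimensions) could be sketched roughly as below. This is not the original code from the post. It swaps the big regex for two simpler ones, and the function name and sample markup are made up for illustration. It assumes PHP 7.1 or later and plain numeric width/height attributes in double quotes.

```php
<?php
// Hypothetical helper: return the first <object>...</object> block with
// every numeric width/height halved, or null if the post has no object.
function first_object_half_size(string $haystack): ?string
{
    // Non-greedy .*? stops at the first closing </object>.
    if (!preg_match('/<object\b.*?<\/object>/is', $haystack, $m)) {
        return null;
    }

    // Halve each width="N" / height="N" inside that one object.
    // Values such as "100%" are skipped because \d+ must reach the quote.
    return preg_replace_callback(
        '/\b(width|height)(\s*=\s*")(\d+)(")/i',
        function (array $p): string {
            return $p[1] . $p[2] . intdiv((int) $p[3], 2) . $p[4];
        },
        $m[0]
    );
}

// Made-up sample markup: two objects, only the first should survive.
$haystack = '<p>Intro</p>'
          . '<object width="480" height="385"><param name="movie" value="a">'
          . '<embed src="a" width="480" height="385"></embed></object>'
          . '<object width="640" height="360"><embed src="b" width="640" height="360"></embed></object>';

echo first_object_half_size($haystack);
```

Odd sizes round down (385 becomes 192), which is fine for a preview but worth knowing if the aspect ratio matters.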

Online AJAX’d PHP Regex tool

As part of manipulating the activity stream, I need to do some precise matching of tags and attributes, so I can remove or manipulate them.

This of course means tasty, tasty regex. The worst part of working with regex is testing your matches (rewrite your expression, upload, refresh, repeat), and I have some tools installed to help with that – however they aren’t all geared towards PHP flavoured Regex.

Which is where this comes in – an online real time, real world PHP regex test utility. It’s great.

http://regex.larsolavtorvik.com/


Another attempt at PHP syntax highlighting

So this is an illustration of Posterous’ markdown tag. Let’s see how we go:

Well, for starters, if I include a link to the markdown help page on the text of the word markdown above, then that’s it. End of post. Bit shit really.

Here’s the link pasted in plain text. Will it break again?

http://posterous.com/help/markdown

No. Was the parser getting confused by the word markdown being in the href attribute of the A tag? Or by the word markdown being in the middle of the A and /A tags? That is possible – an overly greedy Regex could confuse {a}markdown{/a} with {markdown}, or something along those lines.

Anyway, on with the syntax highlighting.

See here for my first attempt: http://disasterman.posterous.com/tech-snippets

And here’s the code with syntax highlighting:

strip_html_tags() Function

Well, this is still being mangled badly.

The trick to getting Syntax highlighting is to precede all code lines with two tab characters. My booboo – failed to spot that.

This causes problems in itself though – if you send a post as a plain text email, line breaks will be introduced, which will break the code, and the new lines will be missing the tabs needed for formatting.
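Adding the two tabs by hand gets old quickly, so here is a minimal sketch of a helper that prefixes every line of a snippet before it goes into a post. The function name is made up, and it does nothing about a mail client re-wrapping long lines afterwards.

```php
<?php
// Hypothetical helper: prefix every line of $code with two tab characters.
function indent_for_markdown(string $code): string
{
    // \R matches any newline sequence, so \r\n from Windows mail clients
    // is treated as a single line break.
    $lines = preg_split('/\R/', $code);

    return implode("\n", array_map(
        function (string $line): string {
            return "\t\t" . $line;
        },
        $lines
    ));
}

echo indent_for_markdown("function f()\n{\n    return 1;\n}");
```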

There should be 36 lines of code above, but due to the Posterous parser introducing html comment tags all over the shop, you can’t see most of it. Well, for now, it would seem some code simply cannot be posted.

Get the original code courtesy of:
http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page

The code is covered by the OSI BSD license so you can use, modify, redistribute, and sell as you see fit.
http://www.opensource.org/licenses/bsd-license.php

[PHP] Remove HTML tags

This is a first test post. I intend to use this account as a place to post snippets of code and answers to technical problems I’ve had while designing and programming websites.

 

Let’s see what posterous does with some code.

There is an activity feed on a website I run – some of the posts are a bit long, so we want to truncate them, facebook style, with a more link. This is easily done with the current system, but truncation can be a bit complex.

It’s easy to truncate text, but where you have html in the post, things get a bit complex – if you truncate to a set character length, you might break a tag in two, or leave an unclosed tag, breaking the site considerably!

So we need to strip html from the post as part of the process of deciding whether to truncate it. This is only step one: if you have a post with several videos but less text than your character limit, you could still end up with a long post. But it’s an important first step.
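A minimal sketch of that decision, using PHP's built-in strip_tags. The function name, the sample post and the limits are all made up for illustration. It also shows what goes wrong when the raw HTML is cut at a fixed length.

```php
<?php
// Hypothetical helper: decide whether a post needs a "more" link by
// measuring only its visible text, not its markup.
function needs_truncating(string $postHtml, int $limit = 200): bool
{
    $text = trim(strip_tags($postHtml));

    // strlen counts bytes; mb_strlen would be the safer choice for
    // multibyte text if the mbstring extension is available.
    return strlen($text) > $limit;
}

// Made-up sample post.
$post = '<p>Short <a href="http://example.com/a/very/long/url">link</a></p>';

// Cutting the raw HTML at a fixed length slices the A tag in half:
echo substr($post, 0, 20), "\n";

// The visible text is only "Short link", so no truncation is needed:
var_dump(needs_truncating($post, 20));
```

The raw string is well over 20 characters, but almost all of that is the URL inside the tag, which is why the check has to run on the stripped text.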

To save time, I had a hunt for a pre-written function as a starting point for my new code, and this is what I found:

/** * Remove HTML tags, including invisible text such as style and * script code, and embedded objects. Add line breaks around * block-level tags to prevent word joining after tag removal. */ function strip_html_tags( $text ) { $text = preg_replace( array( // Remove invisible content '@]*?>.*?@siu', '@<!-- ]*?>.*? -->@siu', '@<!-- ]*?.*? // -->@siu', '@]*?.*?@siu', '@]*?.*?@siu', '@]*?.*?@siu', '@]*?.*?@siu', '@<!--]*?.*?-->@siu', '@]*?.*?@siu', // Add line breaks before and after blocks '@<!--?((address)|(blockquote)|(center)|(del))@iu', '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu', '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu', '@</?((table)|(th)|(td)|(caption))@iu', '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu', '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu', '@</?((frameset)|(frame)|(iframe))@iu', ), array( ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', "n$0", "n$0", "n$0", "n$0", "n$0", "n$0", "n$0", "n$0", ), $text ); return strip_tags( $text ); }

 

Aaaaand Posterous has COMPLETELY mangled the code. BBCode style {code}{/code} tags aren’t displaying line breaks, which I had working earlier. Hmm.
(Note: Curly brackets used instead of square brackets there, as even if I use HTML char entities &#91; and &#93; in the post HTML source, Posterous’ parser over-zealously interprets this as BBCode)

They have also stripped out chunks which could represent dangerous code. Fair enough: they need to ensure people don’t inject malicious code into the pages, which would be bad for them and bad for users. But if I NEED to post some code which contains the TEXT of a script or embed tag, as in this case, it’s a bit useless for me. And this code is not dangerous – it’s not even valid markup, but Regex. It’s going to take some work for them to get their parser to tell the difference.

The alternative method is Posterous’ markdown tag, and you can specify syntax with #!{language}

However, if I use this with #!php, the code is still mangled, syntax is not highlighted (possibly due to my template?), and anything OUTSIDE of the markdown tags disappears.

Ah well, it’s a new service, I’m sure they will iron out the bugs.

You still get the code, courtesy of:
http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page

The code is covered by the OSI BSD license so you can use, modify, redistribute, and sell as you see fit.
http://www.opensource.org/licenses/bsd-license.php
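
Since the listing above came through unreadable, here is a rough, simplified sketch of the same three steps the function's comment describes: drop invisible content, add line breaks around block-level tags, then strip what is left. It is not the code from the linked article. The function name and the shortened tag lists are my own assumptions, so grab the real thing from the link if you need full coverage.

```php
<?php
// Hypothetical, simplified variant of the idea; the tag lists below are
// assumptions and much shorter than a complete implementation would need.
function strip_html_tags_sketch(string $text): string
{
    // 1. Drop elements whose content is never shown as text.
    //    The u flag assumes the input is valid UTF-8.
    $invisible = 'head|style|script|object|embed|applet|noframes|noscript|noembed';
    $text = preg_replace('@<(' . $invisible . ')\b[^>]*>.*?</\1\s*>@siu', ' ', $text);

    // 2. Put a newline before block-level tags so words either side of
    //    them don't get joined together once the tags are gone.
    $block = 'address|blockquote|div|h[1-6]|p|pre|ul|ol|li|dl|dt|dd|table|tr|th|td|form|fieldset';
    $text = preg_replace('@</?(?:' . $block . ')\b@iu', "\n\$0", $text);

    // 3. Remove whatever tags are left.
    return strip_tags($text);
}

echo strip_html_tags_sketch('<p>One</p><script>alert(1)</script><p>Two</p>');
```

Without step 2 the sample would come out as "One Two" only because of the space left by the script tag; with two adjacent paragraphs and no script it would be "OneTwo".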

 

Code samples, links and related stuff. And a recipe for risotto.