Regular Expression Visualiser
JS style only, but nonetheless brilliant…
I get a lot of automated mail that I do need for analysis, but it can be a pain to delete it all.
Admittedly, if I set filters better it would be easier, but even so, if you have thousands of emails in a folder and you need to delete them using webmail remotely, it is time consuming.
So let’s do it with SSH.
WARNING. This shit is dangerous. If any of this doesn’t make sense DON’T DO IT!
Log in to your server via SSH
Change to your mail dir (this is the location on my CentOS 5 box)
Note the mail address format – @ is escaped with a backslash, periods are replaced with underscores:
$ cd /home/{domain username}/mail/.{user\@domain_tld}/{folder}/
If you are cleaning your inbox the folder is
cur
Test the message selection before you delete!
I’m deleting Cron messages from a server. The message source for these all contain ‘Subject: Cron <user@server>’ (I’ve anonymised the server address)
Note the search is regex, so special chars must be escaped
$ grep -l 'Subject: Cron <user@server>' * > grep.txt
Then I open the grep.txt file, copy some filenames and then open those files to double check I’m selecting the right messages.
Once I’m satisfied I’m not selecting anything I shouldn’t:
$ grep -l 'Subject: Cron <user@server>' * | xargs rm -f
Shazzam. From >3000 emails to 150 in two minutes. I should probably check that account more often…
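If you would rather rehearse the selection from a script than from the shell, here is a minimal read-only sketch in PHP (7.1+). The function name find_messages() is made up for this example, and it does a plain substring match rather than a regex, so nothing needs escaping. It only lists files; it never deletes anything.

```php
<?php
// Hypothetical helper: list the files in $dir whose contents contain $needle.
// Read-only on purpose, so it is safe to run as a dry run before any rm.
function find_messages(string $dir, string $needle): array
{
    $hits = [];
    foreach (scandir($dir) as $name) {
        $path = $dir . DIRECTORY_SEPARATOR . $name;
        if (!is_file($path)) {
            continue; // skips ".", ".." and any sub-directories
        }
        $source = file_get_contents($path);
        if ($source !== false && strpos($source, $needle) !== false) {
            $hits[] = $name;
        }
    }
    return $hits;
}
```

Call it with the mail folder and the exact subject line, eyeball the list, and only then reach for grep and xargs.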
YouTube has introduced a new URL shorthand – youtu.be
For example youtu.be/DIArJjU8HjE takes you to
youtube.com/watch?v=DIArJjU8HjE&feature=youtu.be
Annoyingly this means that any users pasting a youtu.be URL into your video parsing scripts will be told it doesn’t work, so we’d best get updating those scripts!
Pop this in before you start processing the url, where $url is, not surprisingly, the url to be parsed.
//dMb 21/4/2011 Hack to handle youtu.be URLs dave@absolutedisaster.co.uk
// Video ids can contain - and _ as well as letters and digits
$needle = '/^(http:\/\/)*(www\.)*(youtu\.be)\/([A-Za-z0-9_-]{11})$/';
// Only touch $url when the pattern actually matched
if(preg_match($needle, $url, $result) && $result[3] == 'youtu.be'){
    $url = 'http://www.youtube.com/watch?v=' . $result[4];
}
//dMb end hack
[update 20110720-1657]
While implementing this and similar for grabbing YouTube thumbnails today I suddenly thought, what if YouTube ever starts using https? Well, the odds may be slim, but the costs of allowing for it are negligible, so why take the risk?
//dMb 21/4/2011 Hack to handle youtu.be URLs dave@absolutedisaster.co.uk
//dMb 20/7/2011 Updated to allow for https
$needle = '/^(http(s?):\/\/)*(www\.)*(youtu\.be)\/([A-Za-z0-9_-]{11})$/';
// $result[2] is the optional s, $result[5] is the video id
if(preg_match($needle, $url, $result) && $result[4] == 'youtu.be'){
    $url = 'http' . $result[2] . '://www.youtube.com/watch?v=' . $result[5];
}
//dMb end hack
[/update]
That particular link is the fantastic new video from my friends Six Toes – currently featured on the front of the Depeche Mode website.
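For anyone who prefers a function to an inline hack, the same idea can be sketched like this (PHP 7.1+). The name expand_youtu_be() is invented for the example, and the id character class assumes ids are 11 characters drawn from letters, digits, - and _. Anything that is not a youtu.be link is returned untouched.

```php
<?php
// Hypothetical wrapper around the youtu.be hack: returns the long-form URL,
// or the original string if it is not a youtu.be short link.
function expand_youtu_be(string $url): string
{
    // Group 1 is the optional "s" of https, group 2 is the 11 character id.
    $needle = '~^(?:http(s?)://)?(?:www\.)?youtu\.be/([A-Za-z0-9_-]{11})$~';
    if (preg_match($needle, $url, $result)) {
        return 'http' . $result[1] . '://www.youtube.com/watch?v=' . $result[2];
    }
    return $url;
}
```

Using ~ as the delimiter means the slashes in the pattern do not need escaping, which makes it a little easier on the eye.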
Because of the obtuse syntax, some would contend that regular expressions are unlike riding a bicycle or tying your shoes in that it’s difficult to retain a working understanding of them over the long term.
Hmm. True dat, if you don’t use them often.
This article by W. Jason Gilmore hopes to make it a bit easier to learn, but of course can also be used as a simple reference.
http://phpbuilder.com/columns/Regular-Expressions/Jason_Gilmre072010.php3
In my previous post I used some complex Regex with PHP to manipulate some HTML.
The principles in the Regex can be used to manipulate all kinds of HTML if you know how to break it down. Reading Regex can be a pain, and I always wish people would break it down, so, for my own reference as much as anything else, here it is:
The Regex:
/<object\s+[^>]*width\s*=\s*(?:"([^"]+)")\s+[^>]*height\s*=\s*(?:"([^"]+)")>((<param\s+[^>]*>)*)<embed\s+[^>]*width\s*=\s*(?:"([^"]+)")\s+[^>]*height\s*=\s*(?:"([^"]+)")><\/[e]mbed><\/object>/ims
The Breakdown
Wow. That’s some hefty regex, so let’s break it down:
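| / | start of regex string |
| <object | string literal <object |
| \s+ | 1+ + blankspace chars \s |
| [^>]* | 0+ chars * not > [^>] |
| width | string literal width |
| \s* | 0+ * blankspace chars \s |
| = | string literal = |
| \s* | 0+ * blankspace chars \s |
| (?:"([^"]+)") | non-capturing group (?:) of: string literal ", 1+ chars + not " [^"] as back reference 1 (), string literal " |
| \s+ | 1+ + blankspace chars \s |
| [^>]* | 0+ chars * not > [^>] |
| height | string literal height |
| \s* | 0+ * blankspace chars \s |
| = | string literal = |
| \s* | 0+ * blankspace chars \s |
| (?:"([^"]+)") | non-capturing group (?:) of: string literal ", 1+ chars + not " [^"] as back reference 2 (), string literal " |
| > | string literal > |
| ((<param\s+[^>]*>)*) | 0+ * of the following, all together as back reference 3 (): string literal <param, 1+ blankspace chars \s+, 0+ chars not > [^>]*, string literal > (the last param matched is back reference 4) |
| <embed | string literal <embed |
| \s+ | 1+ + blankspace chars \s |
| [^>]* | 0+ chars * not > [^>] |
| width | string literal width |
| \s* | 0+ * blankspace chars \s |
| = | string literal = |
| \s* | 0+ * blankspace chars \s |
| (?:"([^"]+)") | non-capturing group (?:) of: string literal ", 1+ chars + not " [^"] as back reference 5 (), string literal " |
| \s+ | 1+ + blankspace chars \s |
| [^>]* | 0+ chars * not > [^>] |
| height | string literal height |
| \s* | 0+ * blankspace chars \s |
| = | string literal = |
| \s* | 0+ * blankspace chars \s |
| (?:"([^"]+)") | non-capturing group (?:) of: string literal ", 1+ chars + not " [^"] as back reference 6 (), string literal " |
| ([^>]*) | 0+ chars * not > [^>] back reference 7 () |
| ><\/[e]mbed><\/object> | string literal ></embed></object>. \e is the escape character, so placing the e in a character class [e] isolates it and resolves the issue. / is escaped with \ |
| /ims | end regex /, followed by the modifiers: i case-insensitive, m multi-line, s dot also matches newlines |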
Test it out! Grab the sample html and the regex above (everything between the start and end slashes) and paste them into http://regex.larsolavtorvik.com/
You can paste it in one section at a time to see how the regex builds up piece by piece.
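If you would rather poke at it from PHP itself, here is a rough sketch (PHP 7.1+) that runs the regex from "The Regex" section, written out with its backslashes as a PHP string, against a scrap of markup. The sample HTML and the function name embed_dimensions() are made up for this example, and the sample deliberately has no closing param tags because the pattern does not allow for them.

```php
<?php
// The regex from "The Regex" above, split over several lines for readability.
$regex = '/<object\s+[^>]*width\s*=\s*(?:"([^"]+)")\s+[^>]*height\s*=\s*(?:"([^"]+)")>'
       . '((<param\s+[^>]*>)*)'
       . '<embed\s+[^>]*width\s*=\s*(?:"([^"]+)")\s+[^>]*height\s*=\s*(?:"([^"]+)")>'
       . '<\/[e]mbed><\/object>/ims';

// Hypothetical sample markup, invented for this example.
$html = '<object width="425" height="344">'
      . '<param name="movie" value="video.swf">'
      . '<embed src="video.swf" width="425" height="344"></embed></object>';

// Hypothetical helper: pull the four dimensions out of the back references.
function embed_dimensions(string $regex, string $html): ?array
{
    if (!preg_match($regex, $html, $m)) {
        return null; // markup did not fit the pattern
    }
    return [
        'object_width'  => $m[1],
        'object_height' => $m[2],
        'embed_width'   => $m[5],
        'embed_height'  => $m[6],
    ];
}
```

Back references 3 and 4 hold the param tags, which is why the embed dimensions turn up in 5 and 6.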
(<tag)((?![^>]*attr=")[^>]*)>
So to find images missing alt attributes:
(<img)((?![^>]*alt=")[^>]*)>
There are groups in there so you can do a find/replace and insert the missing attribute:
$1 alt=""$2>
Now it is trivial to find the alt="" attributes and fill them in…
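The same find/replace can be run from PHP rather than an editor. This is only a sketch (PHP 7.1+): the function name add_missing_alt() is invented here, and the lookahead is limited to the inside of the tag ([^>]*) so that an alt on a later tag cannot hide a missing one.

```php
<?php
// Hypothetical helper: add an empty alt attribute to any img tag without one.
function add_missing_alt(string $html): string
{
    // $1 is "<img", $2 is the rest of the tag up to (not including) ">".
    // The negative lookahead rejects tags that already carry an alt attribute.
    $pattern = '/(<img\b)((?![^>]*\balt\s*=)[^>]*)>/i';
    $result = preg_replace($pattern, '$1 alt=""$2>', $html);
    return $result ?? $html; // preg_replace returns null on error
}
```

Tags that already have an alt are left exactly as they were, so it is safe to run over the same markup more than once.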
I have been working on integrating the Zencoder API with my CMS.
Where you have a configuration file that users can, of course, completely mangle, you need to do a lot of error checking to ensure that things are correctly set and keep the users informed.
Most of the things you will want to check are pretty straightforward, but one of the slightly more complex ones is the protocol for returning the encoded files.
Zencoder supports ftp, sftp and ftps, as well as Cloud File and Amazon S3.
Regex for testing protocols is widely available online, but we are looking at slightly less usual options than http(s) vs ftp.
First we need to allow for all 3 flavours of ftp:
/^(s?(ftp)s?):\/\//
Cloud file uses their cf:// protocol, or alternatively cf+xx:// where xx is the two letter country code of the location – currently Cloud File supports us and uk, defaulting to us if not specified. However, realistically, Cloud File may support more countries in future, so we need to allow for them:
/^((cf)(\+[a-z]{2})?):\/\//
Amazon S3 uses their s3:// protocol:
/^(s3):\/\//
Sticking all three together, with the i modifier so that case is ignored, we get:
/^(s?(ftp)s?|((cf)(\+[a-z]{2})?)|s3):\/\//i
To put it into action:
$host = 'ftps://mydomain.com';
$zencoder_protocol_regex = '/^(s?(ftp)s?|((cf)(\+[a-z]{2})?)|s3):\/\//i';
if(strpos($host, '://') > 0){
    preg_match($zencoder_protocol_regex, $host, $result);
    if(count($result) == 0){
        //Incorrect protocol
    }else{
        // == not =, and lower-cased because the regex ignores case
        if(strtolower($result[0]) == 'ftp://'){
            //Protocol is ftp - give security warning
        }
        $protocol = $result[0];
        $host = str_replace($result[0], '' , $host);
    }
}else{
    // No protocol given, so default to ftps
    $protocol = 'ftps://';
}
The next crucial step is to construct a valid destination url that includes a username and password, while allowing for S3 accounts, which don’t require user/pass. Non alpha-numeric characters need to be percent encoded:
$user = 'username';
$pass = 'password';
$file = 'filename.ext';
if(strlen($user) > 0){
    $user = rawurlencode($user) . ':';
}
if(strlen($pass) > 0){
    $pass = rawurlencode($pass) . '@';
}
$output = $protocol . $user . $pass . $host . '/' . $file;
So if we pass in the following variables:
$host = 'ftps://ftp.mydomain.com';
$user = 'bob@mydomain.com';
$pass = 'foo!';
$file = 'filename.ext';
We will get:
ftps://bob%40mydomain.com:foo%21@ftp.mydomain.com/filename.ext
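Rolled into one function, the protocol check and the url building might look something like the sketch below (PHP 7.1+). build_destination() is a name made up for this example, not anything from the Zencoder API, and it makes two assumptions of its own: an unsupported protocol returns null, and a username without a password is written as user@host.

```php
<?php
// Hypothetical helper combining the protocol check and the url building.
// Returns null when $host carries a protocol Zencoder is not listed as supporting.
function build_destination(string $host, string $user, string $pass, string $file): ?string
{
    $regex = '/^(s?(ftp)s?|((cf)(\+[a-z]{2})?)|s3):\/\//i';
    $protocol = 'ftps://'; // default when no protocol is given
    if (strpos($host, '://') !== false) {
        if (!preg_match($regex, $host, $result)) {
            return null; // incorrect protocol
        }
        $protocol = strtolower($result[0]);
        $host = substr($host, strlen($result[0])); // strip the protocol prefix
    }
    // Percent encode the credentials; S3 style hosts simply pass empty strings.
    $auth = '';
    if ($user !== '') {
        $auth = rawurlencode($user);
        if ($pass !== '') {
            $auth .= ':' . rawurlencode($pass);
        }
        $auth .= '@';
    }
    return $protocol . $auth . $host . '/' . $file;
}
```

Returning null for a bad protocol leaves it to the calling code to decide how loudly to complain to the user.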
FTP: Completely insecure – username, password and data all sent in plain text
FTPS: User/pass sent over TLS/SSL – file data is only encrypted if the data channel is protected as well
SFTP: All data encrypted over SSH
So why not use SFTP all the time? Well, sometimes it’s just not available if you are on a shared server, and if it is, it’s often not available for additional FTP users as SSH requires shell access. As you don’t want to be giving away your master FTP account details, an additional FTP account (which can point directly at the required folder) is preferable, and unless your videos are highly sensitive or valuable, FTPS should do just fine.
http://pastebin.com/iB3X1Dq5
This is the production code on my server, so it has a lot of things that are custom to the CMS – it should be pretty straightforward to follow though. You will also need to trawl it for the variables that need to be passed to the script, but you should be reading the Zencoder docs to familiarise yourself with the requirements anyway. Sorry I don’t have time to expand it all currently.
It is configured to output webm, ogg and mp4 files for HTML5 video players.
You will also need the API class from Zencoder – here’s a copy for convenience:
http://pastebin.com/eVbF879j
I haven’t yet integrated thumbnails, or processing a Zencoder notification returned to a script.
Many moons ago, one of my webmastery gurus told me that he considered www to be antiquated bullshit in a domain name – a waste of time and space. I agreed with his logic, and have supported the cause ever since.
The issue arose today when a client noted an issue with his Joomla sites – if you login on http://domain.tld, then click on a link that takes you to http://www.domain.tld, you will no longer be logged in, as the cookies are set for different domain names. To the casual user with or without www is the same thing, but as far as t’internet is concerned, www.domain.tld is actually a SUBDOMAIN of domain.tld.
It is also an issue for SEO and indexing, because as far as search engines are concerned, the ‘two’ sites are counted differently, meaning that links to one or t’other do not count towards a single total for page rank. It also means that the ‘two’ are considered duplicate content, reducing the perceived value of your data, especially as they both share an IP address so it looks to the search engines as if you are engaging in blackhat SEO techniques. This was certainly true in the past, and while I would have thought that search engines would be quite beyond such blatant foolishness, it’s best to play safe.
Luckily this is very easily cured.
If you ARE using Joomla and are not very technically minded, there is an SEO Canonicalisation plugin that will sort you out.
Wait, canonicalisation? REALLY? What kind of etymological butchery have you people committed there? Can I suggest, I dunno, ‘canonification’ instead? But wait, it’s actually a real, technical term! It even has US and Anglicised versions, like a proper grown-up word and everything. Still, correct or not, it’s damn ugly, and blatantly coined by an American. However, it does not mean ‘to adjust the topography of an object in such a way as to cause it to resemble a big gun’.
Tragic that.
Haaaanyway, back to the point:
I personally would rather have less crap installed in Joomla, and want a solution that is not dependent on it.
.htaccess to the rescue! Feel the power of the rewrite rule!
What fun.
In your .htaccess file, ensure that you have ‘RewriteEngine On’ and add the necessary RewriteCond and RewriteRule. The rule tells the browser (and search engines) that the change is a 301 redirect, a very healthy way to go about things – html redirects, by contrast, being potentially indicative of blackhat behaviour.
RewriteEngine On
# Redirect http://www.domain.tld requests to http://domain.tld
RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC]
RewriteRule ^(.*)$ http://%1/$1 [R=301,L]
Bang. And the dirt is gone…
So, whagwandere? Well, RewriteEngine uses Regular Expressions (RegEx) to define the condition and the rule.
The condition says ONLY apply the following rule IF these conditions are met.
First we pick the variable to test:
%{HTTP_HOST} – A predefined server variable holding the host name the browser asked for (e.g. www.domain.tld), with no protocol or trailing slash
Then we define the conditions to match:
^www\. – An http request (which is all .htaccess will process) whose host name starts (^ = start of a match) with the string www followed by a period (\. – a bare . means any character, so the period is preceded by a backslash to ‘escape’ it, meaning ignore any special meaning of the following character)
(.*)$ – Any number (* = any number) of any characters (. = any character). The parentheses create a group, which captures whatever matched as a backreference (%1 in the rule). $ indicates the end of the match.
[NC] – A flag telling Apache the condition is not case-sensitive
(.*) is an incredibly greedy regular expression, and should normally be avoided, but it is the right thing in this case, as we DO want to match absolutely ANYTHING after the www. and it is safe to use in this situation because we know precisely the nature of the input.
The rule consists of two parts, the match and the replacement. The match is pretty simple:
^(.*)$ – The greediest Regex EVER! It matches anything and everything. As we have already defined our condition, we know we want to replace EVERYTHING. It says start of match(^) followed by any number of any characters ((.*)) before the end of the match ($).
The replacement is slightly more complex:
http:// – string literal.
%1 – Backreference 1 from the condition regex, i.e. the host name with the www. removed.
/ – string literal.
$1 – Backreference 1 defined by the parentheses in the rule’s own match, i.e. the path that was requested.
[R=301,L] – Apache flags – R indicates which http status code to return, in this case 301 (Redirect: Permanently moved). L indicates that Apache should apply no more rules once this rule has been applied.
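To see which backreference ends up where, here is the same logic mimicked in PHP (7.1+). This is only an illustration of the two regexes, not how Apache works internally, and canonical_redirect() is a name invented for the example. It assumes the path arrives without its leading slash, as it does for rules in a .htaccess file.

```php
<?php
// Hypothetical imitation of the RewriteCond / RewriteRule pair above.
// Returns the url to redirect to, or null if the condition does not match.
function canonical_redirect(string $host, string $path): ?string
{
    // RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC]  -> $cond[1] plays the part of %1
    if (!preg_match('/^www\.(.*)$/i', $host, $cond)) {
        return null; // host does not start with www., so the rule is skipped
    }
    // RewriteRule ^(.*)$ ...                     -> $rule[1] plays the part of $1
    preg_match('/^(.*)$/', $path, $rule);
    return 'http://' . $cond[1] . '/' . $rule[1];
}
```

A request for www.domain.tld and index.php comes back as http://domain.tld/index.php, while a request that is already on domain.tld returns null and is left alone.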
What’s that? You want to direct domain to www.domain? Nob off. That’s not helping the cause. GIYF.
As I continue to work on truncating long activity stream posts, there are more and more steps that need to be taken to clean up the code.
I don’t have time to fully explain everything I have done lately, so this is currently more a bunch of useful links and the current function and classes to point in the right direction.
First off, with some activities coming from the forum, they have BBCode in them which is not parsed by the activity stream. Updating the strip_html_tags() function to remove BBCode tags too is easy:
function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
            // Remove BBCode tags
            '@\[[\/\!]*?[^\[\]]*?\]@siu',
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
            // Add line breaks before and after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        ),
        $text );
    // Strip whatever tags are left, then turn newlines into html line breaks
    $text = strip_tags( $text );
    $text = nl2br($text);
    return $text;
}
Argh! Knackernuts! While I get on to Posterous about the continued issue with object and embed in markdown, get the code from http://pastebin.com/kYRPPYrp
The function now also inserts html line breaks before newline characters, using the php nl2br() function, after all tags have been stripped. The function name is perhaps becoming inaccurate, but the function is doing what I demand of it!
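The BBCode part is easy to try on its own. Here it is as a tiny standalone sketch (PHP 7.1+), using the same pattern, with backslashes in place, and the same single-space replacement; the function name strip_bbcode() is made up for the example.

```php
<?php
// Hypothetical standalone version of the BBCode step: every [tag], [/tag]
// or [tag=value] is swapped for a single space.
function strip_bbcode(string $text): string
{
    $result = preg_replace('@\[[\/\!]*?[^\[\]]*?\]@siu', ' ', $text);
    return $result ?? $text; // null means the regex failed, e.g. invalid UTF-8
}
```

Because each tag becomes a space, text that already had a space next to a tag ends up with two, which is harmless once it is rendered as html.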
I also had some issues getting html_count class working in production, where it had been fine in testing.
http://www.phpclasses.org/package/2653-PHP-Count-the-occurrences-of-a-given-H…
The issue here was that html_count is designed to parse the contents of an external file, not a string. Now, I can call an individual activity as an external file, but this requires post data to be sent.
Wez Furlong’s do_post_request function set me on the track to solving that one:
http://wezfurlong.org/blog/2006/nov/http-post-from-php-without-curl
However, it seemed a little (understatement of the year) inefficient to be generating an external file and parsing that for each activity in a stream (which can get very long when you ‘show older posts’ a few times), particularly when all that data being parsed is already in my hands at that point, so html_count needed tweaking to be able to handle strings and files – there are now two classes: string_html_tag_count and file_html_tag_count. These could easily be wrapped into one class with a switch to select which variant you wished to use. I just haven’t spent the extra few minutes doing that, as this is all taking far too long as is!
////////////////////////////////////////////////////////////////////////
// html_count_class.php
//
// This script was writed by Mahesh V. More maheshmore79 at yahoo dot com
//
// This program is freeware software;
//
// for contact me: http://www.maheshmore.tk
////////////////////////////////////////////////////////////////////////
/**
 * @class name html_count
 * @shortdesc Pattern matching and counting occurrences of HTML tag
 * @author Mahesh V. More maheshmore79 at yahoo dot com
 * @version 1.0.0
 * @date 26th October 2005
 * @downloaded from http://www.phpclasses.org/package/2653-PHP-Count-the-occurrences-of-a-given-HTML-tag.html
 * @modified by David Benson dmbenson1978 at gmail dot com
 * @modified string_html_tag_count added to permit counting from strings
 * @modified html_count changed to file_html_tag_count to allow sending of GET/POST params
 * @methods used html_count(constructor), call_html_count()
 **/
class string_html_tag_count
{
    #
    # stores count of number of tags found
    #
    var $count;
    #
    # stores regular expression pattern used for checking tag
    #
    var $pattern;

    //
    // function name: __construct() (was html_count())
    // description: constructor
    // purpose: to read data from variable, count Regex matches
    // arguments: $string, $pattern
    // returns: nothing
    // sets: $this->count
    //
    function __construct($string, $pattern)
    {
        $this->count = 0;
        $this->pattern = $pattern;
        $matches = array();
        preg_match_all($this->pattern, $string, $matches);
        //continue until it reaches the end of subject
        $this->count += count($matches[0]);
    }
}

//file version
class file_html_tag_count
{
    #
    # stores count of number of tags found
    #
    var $count;
    #
    # stores regular expression pattern used for checking tag
    #
    var $pattern;
    # stores POST data pairs
    var $params;

    //
    // function name: __construct() (was html_count())
    // description: constructor
    // purpose: to read data from file, calls up the call_html_count() method
    // arguments: $file, $params, $method, $pattern
    // returns: nothing
    //
    function __construct($file, $params = null, $method = 'GET', $pattern)
    {
        $this->count = 0;
        $this->pattern = $pattern;
        $cparams = array(
            'http' => array(
                'method' => $method,
                'ignore_errors' => true
            )
        );
        if ($params !== null) {
            $params = http_build_query($params);
            if ($method == 'POST') {
                $cparams['http']['content'] = $params;
            } else {
                // GET params go on the end of the file url
                $file .= '?' . $params;
            }
        }
        $context = stream_context_create($cparams);
        $id = fopen($file, "r", false, $context);
        while ($data = fread($id, 4096)) {
            $this->call_html_count($data);
        }
        fclose($id);
    }

    //
    // function name: call_html_count()
    // description: count tags
    // purpose: to perform pattern matching, counts tag and display tag name and path attribute
    // arguments: $contents
    // returns: nothing
    //
    function call_html_count($contents)
    {
        $matches = array();
        preg_match_all($this->pattern, $contents, $matches);
        //continue until it reaches the end of subject
        $this->count += count($matches[0]);
    }
}
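Stripped of the class wrapping, counting tags in a string boils down to one preg_match_all() call. A bare-bones sketch (PHP 7.1+), with a function name and sample pattern that are made up for this example rather than taken from either class:

```php
<?php
// Hypothetical helper: how many times does $pattern match in $html?
function count_html_tags(string $html, string $pattern): int
{
    // preg_match_all() returns the number of full matches, or false on error.
    $count = preg_match_all($pattern, $html);
    return $count === false ? 0 : $count;
}
```

For example, passing '/<img\b[^>]*>/i' as the pattern counts the img tags in an activity, whatever case they were written in.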
As part of manipulating the activity stream, I need to do some precise matching of tags and attributes, so I can remove or manipulate them.
This of course means tasty, tasty regex. The worst part of working with regex is testing your matches (rewrite your expression, upload, refresh, repeat), and I have some tools installed to help with that – however they aren’t all geared towards PHP flavoured Regex.
Which is where this comes in – an online real time, real world PHP regex test utility. It’s great.
http://regex.larsolavtorvik.com/