TCLUG Archive

[OT] Regexp and HTML sanitization




>>>>> "KRB" == Kevin R Bullock <kbullock@ringworld.org> writes:

KRB> Hello all --
KRB> I'm developing a web application with PHP and MySQL for data entry. I need
KRB> to "sanitize" the HTML that my users enter (i.e. remove all HTML tags
KRB> except for BR, P, IMG, etc.). I've been trying to use a regular expression
KRB> to do this, but it's not working yet. Anyone have any suggestions? If I
KRB> can't do it in a single regular expression, it makes the code rather
KRB> complex.

Kevin, I think there is an *in principle* reason why this should not
be possible.

Parsing HTML is a context-free parsing problem (since tags can nest,
you need a stack to track the ones you want to match), not a regular
expression problem (there is no fixed bound on the memory the job
requires).
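
To make the nesting point concrete, here is a small sketch in Python
(chosen only for illustration; the same thing happens with PHP or Perl
patterns). The DIV inputs are made-up examples, not from your
application. Neither the lazy nor the greedy pattern can pair an
opening tag with its own closing tag:

```python
import re

# Nested case: the lazy match stops at the first </div>, which closes
# the INNER element, so the outer element's tail is left behind.
nested = "<div>a<div>b</div>c</div>"
lazy = re.sub(r"<div>.*?</div>", "", nested)

# Sibling case: the greedy match runs to the LAST </div>, so the text
# between two separate elements is swallowed as well.
siblings = "<div>a</div>keep<div>b</div>"
greedy = re.sub(r"<div>.*</div>", "", siblings)

print(repr(lazy))    # 'c</div>'
print(repr(greedy))  # ''
```

Getting both cases right means counting how deep you are, which is
exactly the stack a plain regular expression does not have.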

So what you might want to do is grab a true HTML parser (I think
there are Perl modules for this) and then walk the resulting
tree....
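
A rough sketch of the parser-based approach, in Python rather than PHP
or Perl and using only the standard library's html.parser. Note that
this module is event-driven, so the code filters tags as the parser
reports them instead of building and walking a tree. The names
ALLOWED, safe_src and sanitize are made up for illustration, and the
whitelist (BR, P, IMG with src/alt) is an assumption based on the tags
mentioned in the question. Treat it as a starting point, not a vetted
security filter:

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attributes allowed on that tag.
ALLOWED = {"br": set(), "p": set(), "img": {"src", "alt"}}
VOID = {"br", "img"}                # tags that take no closing tag
DROP_CONTENT = {"script", "style"}  # discard these together with their text


def safe_src(url):
    # Assumption: accept only http(s) URLs or paths starting with "/".
    return url.strip().lower().startswith(("http://", "https://", "/"))


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
            return
        if self.skip or tag not in ALLOWED:
            return  # drop the tag itself; its text content is still kept
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED[tag]:
                continue
            if name == "src" and not safe_src(value):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
        elif not self.skip and tag in ALLOWED and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data))  # re-escape text on the way out


def sanitize(html):
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(sanitize("<p>Hi <b>there</b><script>alert(1)</script><br></p>"))
# <p>Hi there<br></p>
```

The design choice is to build the output from scratch out of what is
explicitly allowed, rather than trying to delete what is forbidden, so
anything the whitelist does not know about is dropped by default.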