TCLUG Archive

Re: [TCLUG:20540] [OT] Regexp and HTML sanitization



Robert P. Goldman said:
> >>>>> "DS" == Dave Sherohman <esper@sherohman.org> writes:
>     DS> At worst, he might need to walk through the (surviving) tags
>     DS> with a set of flags for whether, e.g., <I> is turned on and
>     DS> append a </I> to the document if the submitter forgot to close
>     DS> it.
> 
> But notice that this is enough to make my point!  Detecting balanced
> delimiters is the paradigm case of context-free versus regular
> expression parsing:  to match parentheses, you need to have a stack to 
> push the openers onto and pop off of when you find the match.  That's
> a pushdown automaton, not a finite state machine.

Except you missed my implication that it would probably be two separate
steps: first use a one-shot regex to filter out all 'unacceptable' tags,
then scan for balance.  If done in perl, the scan for balance could be done
using a second regex similar to the first one, but using the continuation
flag rather than the global flag.  So it would still be regex-based; it
would just run the regex more than once.
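
A rough sketch of that first step, written in Python rather than perl
purely for illustration.  The whitelist ALLOWED and the names TAG_RE and
strip_disallowed are made up for this example, and dropping attributes
from the surviving tags is an assumption of the sketch, not something
spelled out above:

```python
import re

# Hypothetical whitelist; the real set of acceptable tags is site policy.
ALLOWED = {"i", "b", "em", "strong"}

# Captures an optional leading slash and the tag name.
TAG_RE = re.compile(r"<(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def strip_disallowed(html):
    """Drop tags not in ALLOWED; rewrite allowed ones without attributes."""
    def keep_or_drop(match):
        slash, name = match.group(1), match.group(2).lower()
        if name in ALLOWED:
            return f"<{slash}{name}>"   # attributes are discarded
        return ""                       # tag removed, enclosed text kept
    return TAG_RE.sub(keep_or_drop, html)

print(strip_disallowed('<script>alert(1)</script><I class="x">hi</I>'))
```

This only removes the tags themselves; the text between disallowed tags
is left in place, and the sketch should not be mistaken for a complete
sanitizer.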

Also, as there would be a small set of acceptable tags, I don't think a stack
would be needed, just a set of variables (or an array or a perl hash or...)
to keep track of either how many levels of each tag are open or just whether
each tag was last seen as an opening or a closing tag.  (Which one would be
appropriate is based on whether <I><I></I> leaves italics on or off.)
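
The counting variant might look something like this (again Python as a
stand-in for perl; close_open_tags and its default tag list are invented
for the example, and it assumes the unacceptable tags are already gone):

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def close_open_tags(html, tags=("i", "b")):
    """Append a closing tag for every level of each tag left open."""
    open_counts = Counter()
    for match in TAG_RE.finditer(html):
        name = match.group(2).lower()
        if name not in tags:
            continue
        if match.group(1):
            # Closing tag: ignore it if nothing of that kind is open.
            if open_counts[name] > 0:
                open_counts[name] -= 1
        else:
            open_counts[name] += 1
    # Closers are emitted in the order of `tags`, not in nesting order.
    closers = "".join(f"</{name}>" * open_counts[name] for name in tags)
    return html + closers
```

With counting, <I><I></I> is treated as leaving one level of italics on,
so one </i> gets appended.  The "last seen" variant would replace the
counter with a boolean per tag and treat the same input as italics off.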

Technically, <I><B></I></B> isn't the Right Way to write your HTML, but it
happens and I've never noticed any browser having problems with it.  A stack
would be good for enforcing that tags must be properly nested, but would not
do very well in this case without some extra logic for popping non-top
values.
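
For comparison, a stack with that extra logic could be sketched as below
(Python again; unclosed_tags is a made-up name, and removing the most
recent matching opener is just one possible policy for overlapping tags):

```python
import re

TAG_RE = re.compile(r"<(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def unclosed_tags(html):
    """Return the names of tags still open, innermost last."""
    stack = []
    for match in TAG_RE.finditer(html):
        name = match.group(2).lower()
        if not match.group(1):
            stack.append(name)
        elif name in stack:
            # Remove the most recent matching opener, even if it is
            # not on top of the stack (the <I><B></I></B> case).
            idx = len(stack) - 1 - stack[::-1].index(name)
            del stack[idx]
        # A closing tag with no matching opener is ignored.
    return stack
```

Here <I><B></I></B> comes out balanced (an empty list), while a strict
stack that only ever pops the top element would reject it.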

-- 
"Two words: Windows survives." - Craig Mundie, Microsoft senior strategist
"So does syphilis. Good thing we have penicillin." - Matthew Alton
Geek Code 3.1:  GCS d- s+: a- C++ UL++$ P+>+++ L+++>++++ E- W--(++) N+ o+
!K w---$ O M- V? PS+ PE Y+ PGP t 5++ X+ R++ tv b+ DI++++ D G e* h+ r++ y+