TCLUG Archive

Re: [TCLUG:20540] [OT] Regexp and HTML sanitization



>>>>> "DS" == Dave Sherohman <esper@sherohman.org> writes:

    DS> Robert P. Goldman said:
    >> Kevin, I think there is an *in principle* reason why this
    >> should not be possible.
    >> 
    >> Parsing HTML is a context-free parsing problem (since the tags
    >> can embed and you have to have a stack to track the things you
    >> want to match), not a regular expression parsing problem
    >> (there's no fixed bound of memory you need to do this job).

    DS> I disagree.  Unless there's more going on here than the
    DS> original question stated, Kevin doesn't sound like he's
    DS> interested in the structure of the HTML tags or whether they
    DS> match up.  He just wants to create a list of 'approved' tags
    DS> and make everything else go away.

I agree about the above, I think.  But see below, where I think you
give me my point...

    DS> At worst, he might need to walk through the (surviving) tags
    DS> with a set of flags for whether, e.g., <I> is turned on and
    DS> append a </I> to the document if the submitter forgot to close
    DS> it.

But notice that this is enough to make my point!  Detecting balanced
delimiters is the paradigm case separating context-free parsing from
regular expression parsing:  to match parentheses, you need a stack to
push the openers onto and pop them off of when you find the match.
That's a pushdown automaton, not a finite state machine.
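
To make the distinction concrete, here is a minimal sketch in Python.
Nothing in it comes from the thread: the APPROVED set, the TAG pattern,
and the sanitize() function are hypothetical names chosen for
illustration.  The regular expression only finds individual tags;
deciding which closers to emit is left to an explicit stack, whose
depth depends on how deeply the input nests.

```python
import re

# Hypothetical whitelist; the thread does not say which tags are approved.
APPROVED = {"b", "i", "em", "strong"}

# Finds one opening or closing tag; captures the slash and the tag name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def sanitize(html):
    out = []      # pieces of the output document
    stack = []    # approved tags opened but not yet closed
    pos = 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])   # text between tags passes through
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if name not in APPROVED:
            continue                      # unapproved tag: drop it
        if not closing:
            stack.append(name)
            out.append("<%s>" % name)     # re-emit without attributes
        elif name in stack:
            # Close everything opened after the matching opener, then it.
            while True:
                top = stack.pop()
                out.append("</%s>" % top)
                if top == name:
                    break
        # A closer with no matching opener falls through and is dropped.
    out.append(html[pos:])
    while stack:                          # close whatever is still open
        out.append("</%s>" % stack.pop())
    return "".join(out)

print(sanitize("<i>a<b>b</i>c"))
```

This is only meant to show where the stack comes in.  It is not a safe
sanitizer as written: the text inside a dropped tag is kept, and stray
"<" characters in the text are not escaped.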

Best,
R1