TCLUG Archive
Re: [TCLUG:20540] [OT] Regexp and HTML sanitization
>>>>> "DS" == Dave Sherohman <esper@sherohman.org> writes:
DS> Robert P. Goldman said:
>> Kevin, I think there is an *in principle* reason why this
>> should not be possible.
>>
>> Parsing HTML is a context-free parsing problem (since the tags
>> can embed and you have to have a stack to track the things you
>> want to match), not a regular expression parsing problem
>> (there's no fixed bound of memory you need to do this job).
DS> I disagree. Unless there's more going on here than the
DS> original question stated, Kevin doesn't sound like he's
DS> interested in the structure of the HTML tags or whether they
DS> match up. He just wants to create a list of 'approved' tags
DS> and make everything else go away.
I agree about the above, I think. But see below, where I think you
give me my point...
DS> At worst, he might need to walk through the (surviving) tags
DS> with a set of flags for whether, e.g., <I> is turned on and
DS> append a </I> to the document if the submitter forgot to close
DS> it.
But notice that this is enough to make my point! Detecting balanced
delimiters is the paradigm case of context-free versus regular
expression parsing: to match parentheses, you need a stack to push
the openers onto and pop them off of when you find the match. That's
a pushdown automaton, not a finite state machine.
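
To make the division of labor concrete, here is a rough Python sketch, not
anything Kevin or Dave actually posted. The regular expression only finds
individual tags, which is a regular-language job; the stack does the
balancing. The APPROVED set, the TAG pattern, and the sanitize() function
are all made up for illustration, and this is not a safe sanitizer as it
stands (it ignores comments, entities, and attribute values containing '>').

```python
import re

# Hypothetical whitelist; the thread never says which tags are approved.
APPROVED = {"i", "b", "em", "strong"}

# One opening or closing tag: group 1 is "/" for a closer, group 2 is the name.
TAG = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def sanitize(html):
    """Drop unapproved tags and close any approved tags left open."""
    out = []    # pieces of the output document
    stack = []  # approved tags currently open, innermost last
    pos = 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])  # text between tags is kept as-is
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if name not in APPROVED:
            continue  # unapproved tag: drop the tag, keep surrounding text
        if not closing:
            stack.append(name)
            out.append("<%s>" % name)  # attributes are discarded
        elif name in stack:
            # Close inner tags that were left open, then the matching one.
            while True:
                top = stack.pop()
                out.append("</%s>" % top)
                if top == name:
                    break
        # A closer with no matching opener is silently dropped.
    out.append(html[pos:])
    # Append closers for anything the submitter forgot to close.
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)
```

For example, sanitize("<script>x</script><I>hi") drops the script tags and
returns "x<i>hi</i>". The stack can grow as deep as the input nests, which
is exactly the unbounded memory a finite state machine does not have.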
Best,
R1