Re: [TCLUG:20540] [OT] Regexp and HTML sanitization

To: Dave Sherohman <esper@sherohman.org>
Subject: Re: [TCLUG:20540] [OT] Regexp and HTML sanitization
From: "Robert P. Goldman" <goldman@htc.honeywell.com>
Date: Mon, 21 Aug 2000 08:25:13 -0500 (CDT)
CC: tclug-list@mn-linux.org
In-Reply-To: <E13PuBE-0000xc-00@pchan>
References: <E13PuBE-0000xc-00@pchan>
Reply-To: goldman@htc.honeywell.com (Robert Goldman)

>>>>> "DS" == Dave Sherohman <esper@sherohman.org> writes:

    DS> Robert P. Goldman said:
    >> Kevin, I think there is an *in principle* reason why this
    >> should not be possible.
    >> 
    >> Parsing HTML is a context-free parsing problem (since the tags
    >> can embed and you have to have a stack to track the things you
    >> want to match), not a regular expression parsing problem
    >> (there's no fixed bound of memory you need to do this job).

    DS> I disagree.  Unless there's more going on here than the
    DS> original question stated, Kevin doesn't sound like he's
    DS> interested in the structure of the HTML tags or whether they
    DS> match up.  He just wants to create a list of 'approved' tags
    DS> and make everything else go away.

I agree with the above, I think.  But see below, where I think you
concede my point...

    DS> At worst, he might need to walk through the (surviving) tags
    DS> with a set of flags for whether, e.g., <I> is turned on and
    DS> append a </I> to the document if the submitter forgot to close
    DS> it.

But notice that this is enough to make my point!  Detecting balanced
delimiters is the paradigm case of a context-free rather than a regular
parsing problem:  to match parentheses, you need a stack to push the
openers onto and pop them off again when you find the match.  That's
a pushdown automaton, not a finite state machine.
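
To make this concrete, here is a minimal sketch in Python of the kind
of pass Dave describes.  It is only an illustration under stated
assumptions, not anyone's actual sanitizer: the whitelist APPROVED, the
pattern TAG, and the function close_open_tags are all hypothetical
names, and attributes, comments, and malformed markup are ignored.
The regular expression only finds individual tags; the bookkeeping
that closes what the submitter left open is done with a stack.

```python
import re

# Hypothetical whitelist, for illustration only.
APPROVED = {"i", "b", "em"}

# Matches a simple opening or closing tag such as <I> or </i>.
# Attributes are matched but discarded; malformed markup is not handled.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def close_open_tags(html):
    """Drop unapproved tags and append closers for tags left open."""
    stack = []  # approved tags that are open and not yet closed
    out = []
    pos = 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])  # text between tags is kept
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if name not in APPROVED:
            continue  # unapproved tag: make it go away
        if not closing:
            stack.append(name)
            out.append("<%s>" % name)
        elif name in stack:
            # Pop and close everything opened since the matching opener.
            while True:
                top = stack.pop()
                out.append("</%s>" % top)
                if top == name:
                    break
        # A closer with no matching opener is simply dropped.
    out.append(html[pos:])
    # Close whatever is still open, innermost first.
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)
```

Calling close_open_tags("<b><i>hi") yields "<b><i>hi</i></b>".  The
stack can grow as deep as the input nests, which is exactly the
unbounded memory a finite state machine does not have.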

Best,
R1
