[Fwd: HTML Parsing question]

To: "tclug-devel@mn-linux.org" <tclug-devel@mn-linux.org>
Subject: [Fwd: HTML Parsing question]
From: Perry Hoekstra <dutchman@uswest.net>
Date: Thu, 16 Mar 2000 13:12:28 -0600
Delivered-To: fixup-tclug-devel@mn-linux.org@fixme
Sender: phoekstra

This is in reference to the HTML parsing thread that occurred a week ago.

-- 
Perry Hoekstra
Talent Software Services
dutchman@mn.uswest.net

"I don't see much sense in that," said Rabbit.
"No," said Pooh humbly, "there isn't. But there was going to be when I
began it. It's just that something happened to it along the way."

To: java-xml-interest@cybercom.net
Subject: Re: HTML Parsing question
From: "Spencer Marks" <smarks@digisolutions.com>
Date: 16 Mar 2000 13:28:25 -0500
Delivered-To: dutchman@mail-mpls.uswest.net
Delivered-To: java-xml-interest-outgoing@cybercom.net
Delivered-To: java-xml-interest@cybercom.net
In-Reply-To: Dennis Sosnoski's message of "Thu, 16 Mar 2000 10:08:47 -0800"
References: <93CB64052F94D211BC5D0010A800133101FDEB2B@wwmess3.bra01.icl.co.uk> <38D1232F.B27564D6@sosnoski.com>
Reply-To: java-xml-interest@cybercom.net
Sender: java-xml-interest-owner@cybercom.net

It looks like there is a Java port of Tidy with source. One route
for me might be something like this:

1) Get the URL from which I want to extract text.
2) Use one of the previously mentioned / suggested tools for extracting
   text from a well-formed HTML document.
3) If there are no errors -> done.
   Else catch the error, but rather than throwing possibly useful
   content away, run Tidy against the problematic page and try to
   extract the content again.
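
The steps above can be sketched roughly as follows. This is only an
illustration under several assumptions, not what any of the suggested tools
actually do: the "strict" extractor here is simply the JDK's XML parser, the
page is assumed to be already fetched into a String (step 1 is left out), and
the `tidy` argument is a hypothetical stand-in for a call into the Java port
of Tidy, whose real API is not shown. The class and method names are made up
for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TextExtractor {

    /** Step 2: strict pass. Throws SAXException if the markup is not well formed. */
    static String extractStrict(String markup) throws SAXException {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            // Concatenate all text nodes and collapse runs of whitespace.
            return doc.getDocumentElement().getTextContent().replaceAll("\\s+", " ").trim();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("parser unavailable or input unreadable", e);
        }
    }

    /**
     * Step 3: try the strict pass; on a parse error, repair the page and retry.
     * 'tidy' is a hypothetical hook where a real Tidy call would go.
     */
    static String extract(String markup, UnaryOperator<String> tidy) {
        try {
            return extractStrict(markup);
        } catch (SAXException malformed) {
            try {
                return extractStrict(tidy.apply(markup));
            } catch (SAXException stillMalformed) {
                throw new IllegalArgumentException("markup could not be repaired", stillMalformed);
            }
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Hello <br>world</p></body></html>";
        // Toy repair that only closes <br>; a real run would hand the page to Tidy.
        System.out.println(extract(page, s -> s.replace("<br>", "<br/>")));
    }
}
```

The repair step only runs when the first pass fails, so clean pages never pay
for it. A real page with a DOCTYPE may make the parser try to load the DTD;
that case is not handled in this sketch.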

Too bad Tidy doesn't have a save-to-text feature.

Thanks to the folk(s) who took the trouble to port it to Java. 

Spencer


 

Dennis Sosnoski <dms@sosnoski.com> writes:

> If all you want to do is convert a set of static pages, Tidy is definitely the
> way to go. It's probably too costly to run a separate program to convert data
> you retrieve on the fly, at least if you're running much volume of this type of
> query - it puts your performance back into the CGI realm.
> 
> Looking into the Tidy program is a great way to find out how to resolve some of
> the more thorny recovery cases, though.
> 
>   - Dennis
> 
> Kay Michael wrote:
> > 
> > > It's got decent recovery from most of the bad HTML I've tried
> > > it with. Some cases are tough, though - attribute values with a leading
> > > quote and no trailing quote, for instance.
> > 
> > Why not use Dave Raggett's HTML Tidy program, available from the W3C site?
> > 
> > Mike Kay
> > 
> > ---------------------------------------------------------------
> > java-xml-interest Commands
> > To: majordomo@cybercom.net
> > Body: subscribe java-xml-interest
> > Body: unsubscribe java-xml-interest
> 

