It looks like there is a Java port of Tidy, with source. One route
for me might be something like this:
1) Get the URL from which I want to extract text.
2) Use one of the previously mentioned / suggested tools for extracting
text from a well-formed HTML document.
3) If no errors -> done
else
catch the error, but rather than throwing possibly useful
content away, run Tidy against the problematic page and try to extract content
again.
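
A rough sketch of the steps above, in case it helps to see the control
flow. Everything here is an assumption for illustration: the class and
method names are made up, the strict step uses the JDK's own XML parser
as a stand-in for whichever extraction tool gets picked, and the repair
step is just a pluggable function where a call into the Java port of
Tidy would go. It works on a string rather than fetching a URL.

```java
import java.io.StringReader;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: try strict extraction first, repair only on failure.
class TextExtractor {

    // Step 2: parse the markup as well-formed XML and return its text.
    // Throws SAXException if the markup is not well formed.
    static String extractStrict(String markup) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler ignores warnings and rethrows fatal errors,
        // so nothing is printed to stderr by the parser itself.
        builder.setErrorHandler(new DefaultHandler());
        Document doc = builder.parse(new InputSource(new StringReader(markup)));
        return doc.getDocumentElement().getTextContent().trim();
    }

    // Step 3: on a parse error, run the repair function (the place where
    // Tidy would be called) and try the extraction once more.
    static String extractWithFallback(String markup, UnaryOperator<String> repair)
            throws Exception {
        try {
            return extractStrict(markup);
        } catch (SAXException e) {
            return extractStrict(repair.apply(markup));
        }
    }

    public static void main(String[] args) throws Exception {
        String bad = "<html><body><p>Hello</body></html>";
        // Toy repair for this one input only; a real run would hand the
        // page to Tidy instead.
        UnaryOperator<String> toyRepair = s -> s.replace("</body>", "</p></body>");
        System.out.println(extractWithFallback(bad, toyRepair));
    }
}
```

The point of the shape is that pages which are already well formed never
pay for the repair step, and a page that still fails after repair
surfaces its error instead of being silently dropped.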
Too bad Tidy doesn't have a save to text feature.
Thanks to the folk(s) who took the trouble to port it to Java.
Spencer
Dennis Sosnoski <dms@sosnoski.com> writes:
> If all you want to do is convert a set of static pages Tidy is definitely the
> way to go. It's probably too costly to run a separate program to convert data
> you retrieve on the fly, at least if you're running much volume of this type of
> query - it puts your performance back into the CGI realm.
>
> Looking into the Tidy program is a great way to find out how to resolve some of
> the more thorny recovery cases, though.
>
> - Dennis
>
> Kay Michael wrote:
> >
> > > It's got decent recovery from most of the bad HTML I've tried
> > > it with. Some cases are tough, though - attribute values with a leading
> > > quote and no trailing quote, for instance.
> >
> > Why not use Dave Raggett's HTML Tidy program, available from the W3C site?
> >
> > Mike Kay
> >
> > ---------------------------------------------------------------
> > java-xml-interest Commands
> > To: majordomo@cybercom.net
> > Body: subscribe java-xml-interest
> > Body: unsubscribe java-xml-interest