TCLUG Archive

An interesting example of standards subversion...



http://www.fourmilab.ch/webtools/demoroniser/

DEMORONISER

                                Correct Moronic Microsoft HTML



This page describes, in Unix manual page style, a Perl program available
for downloading from this site which corrects numerous errors and
incompatibilities in HTML generated by, or edited with, Microsoft
applications. The demoroniser keeps you from looking dumber than a bag
of dirt when your Web page is viewed by a user on a non-Microsoft
platform.

NAME

demoroniser - correct moronic and gratuitously incompatible HTML
generated by Microsoft applications 

SYNOPSIS

demoroniser [ -u ] [ -wcols ] [ infile ] [ outfile ] 

DESCRIPTION

Many slick, high-profile corporate Web sites I visited seemed to exhibit
terrible grammar completely inconsistent with the obvious investment in
graphics and design. Apostrophes and quote marks were frequently
omitted, and every couple of paragraphs words were run together which
should have been separated by a punctuation mark of some kind.

This remained a mystery to me until I wanted to convert a presentation
I'd developed in 1996 using Microsoft PowerPoint into a set of Web
pages. A friend was kind enough to run the presentation through
PowerPoint's "Save as HTML" feature (I have abandoned all use of
Microsoft products, so I did not have a current version of PowerPoint
which includes this feature). When I got the PowerPoint-generated HTML
back and viewed it in my browser, I discovered that it contained
precisely the same grammatical errors I'd noted on so many Web sites,
and which certainly were not present in my original presentation.

A little detective work revealed that, as is usually the case when you
encounter something shoddy in the vicinity of a computer, Microsoft
incompetence and gratuitous incompatibility were to blame. Western
language HTML documents are written in the ISO 8859-1 Latin-1 character
set, with a specified set of escapes for special characters. Blithely
ignoring this prescription, as usual, Microsoft use their own
"extension" to Latin-1, in which a variety of characters which do not
appear in Latin-1 are inserted in the range 0x82 through 0x95--this
having the merit of being incompatible with both Latin-1 and Unicode,
which reserve this region for additional control characters.
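
As an illustration of the conflict described above, the following Python
sketch decodes one byte from this range both ways. It is not part of the
demoroniser (which is a Perl script); the helper name describe_byte is
made up for this example, and "latin-1" and "cp1252" are Python's
standard codec names for ISO 8859-1 and the Microsoft code page.

```python
import unicodedata


def describe_byte(value: int) -> dict:
    """Decode a single byte as ISO 8859-1 and as Windows code page 1252."""
    raw = bytes([value])
    # Latin-1 always succeeds: byte N simply becomes code point U+00NN.
    latin1 = raw.decode("latin-1")
    # A few positions are unassigned in cp1252; errors="replace" turns
    # those into U+FFFD instead of raising UnicodeDecodeError.
    cp1252 = raw.decode("cp1252", errors="replace")
    return {
        # General category "Cc" means "control character".
        "latin1_category": unicodedata.category(latin1),
        "cp1252_char": cp1252,
        "cp1252_name": unicodedata.name(cp1252),
    }


# 0x93 is the byte Microsoft applications emit for an opening double quote.
info = describe_byte(0x93)
```

Here info["latin1_category"] reports a control character, while
info["cp1252_name"] reports a quotation mark: the same byte, two
incompatible readings.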

These characters include open and close single and double quotes, em and
en dashes, an ellipsis and a variety of other things you've been dying
for, such as a capital Y umlaut and a florin symbol. Well, okay, you
say, if Microsoft want to have their own little incompatible character
set, why not? Because it doesn't stop there--in their inimitable
fashion (who would want to?)--they aggressively pollute the Web pages of
unknowing and innocent victims worldwide with these characters, with the
result that the owners of these pages look like semi-literate morons
when their pages are viewed on non-Microsoft platforms (or on Microsoft
platforms, for that matter, if the user has selected as the browser's
font one of the many TrueType fonts which do not include the
incompatible Microsoft characters).

You see, "state of the art" Microsoft Office applications sport a nifty
feature called "smart quotes." (Rule of thumb--every time Microsoft use
the word "smart," be on the lookout for something dumb). This feature
is on by default in both Word and PowerPoint, and can be disabled only
by finding the little box buried among the dozens of bewildering option
panels these products contain. If enabled, and you type the string,

                "Halt," he cried, "this is the police!"

"smart quotes" transforms the ASCII quote characters automatically into
the incompatible Microsoft opening and closing quotes. ASCII single
quotes are similarly transformed (even though ASCII already contains
apostrophe and single open quote characters), and double hyphens are
replaced by the incompatible em dash symbol. What other horrors occur,
I know not. If the user notices this happening at all, their reaction
might be "Thank you Billy-boy--that looks ever so much nicer," not
knowing they've been set up to look like a moron to folks all over the
world.
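
To make the transformation concrete, here is a hedged Python sketch that
undoes it for the example sentence. The byte values are the code page
1252 positions for the quote and ellipsis characters; the choice of
ASCII stand-ins and the names REPLACEMENTS and to_ascii_punctuation are
assumptions for this illustration, not the table the demoroniser itself
uses.

```python
# Illustrative subset of Microsoft-specific bytes and ASCII stand-ins.
# The replacements chosen here are an assumption made for this sketch.
REPLACEMENTS = {
    0x85: b"...",  # ellipsis
    0x91: b"`",    # opening single quote
    0x92: b"'",    # closing single quote / apostrophe
    0x93: b'"',    # opening double quote
    0x94: b'"',    # closing double quote
}


def to_ascii_punctuation(data: bytes) -> bytes:
    """Replace the bytes listed in REPLACEMENTS; pass all others through."""
    out = bytearray()
    for byte in data:
        out += REPLACEMENTS.get(byte, bytes([byte]))
    return bytes(out)


# The example sentence as "smart quotes" would store it.
smart = b"\x93Halt,\x94 he cried, \x93this is the police!\x94"
plain = to_ascii_punctuation(smart)
```

After this runs, plain holds the sentence with ordinary ASCII double
quotes, as it was originally typed.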

You see, when you export a document as text for hand-editing into HTML,
or avail yourself of the "Save as HTML" features in newer versions of
Office applications, these incompatible, Microsoft-specific characters
remain in place. When viewed by a user on a non-Microsoft platform,
they will not be displayed properly--most browsers seem to just drop
them, as opposed to including a symbol indicating an undisplayable
character. Hence, the apparently ungrammatical text, which the author
of the page, editing on a Microsoft platform, will never be aware of.

Having no desire to hand-edit the HTML for a long presentation to
correct a raft of Microsoft-induced incompatibilities, I wrote a Perl
program, the demoroniser, to transform Microsoft's "junk HTML" into at
least a starting point for something I'd consider presentable on my
site. In addition to replacing the incompatible characters with
HTML-compliant equivalents wherever possible (a few rarely-encountered
characters which can't be translated result in warning messages if
encountered), the following sloppy or downright wrong HTML is
corrected.

     The missing semicolon at the end of numeric character escapes
       ("&#61" becomes "&#61;") is supplied.
     Numeric renderings of special characters (&#60; &#62; &#38; for
       < > &) are replaced with readable equivalents (&lt; &gt; &amp;).
     Unquoted attribute values in <table> tags containing
       non-alphanumeric characters are quoted.
     PowerPoint's mis-nesting of <font> and <strong> tags is corrected.
     PowerPoint's boneheaded use of <ul> and </ul> tags to accomplish
       paragraph breaks is corrected and the proper <p> tags inserted.
     Missing <tr> tags in text-only slides are inserted.
     Nugatory </p> tags are removed.
     Unmatched <li> tags in headings are removed.
     Idiot "paragraph-long lines" are broken into something suitable for
       editing with a normal text editor.
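
The first two corrections in the list can be sketched with a pair of
regular expressions. This is a Python illustration only, under the
assumption that escapes are written in decimal without leading zeros; the
function name fix_numeric_escapes and the NAMED table are invented here
and do not come from the demoroniser source.

```python
import re

# Decimal codes of < > & mapped to their named HTML entities.
NAMED = {"60": "&lt;", "62": "&gt;", "38": "&amp;"}


def fix_numeric_escapes(html: str) -> str:
    """Supply missing semicolons, then name the < > & escapes."""
    # "&#61" not followed by another digit or a semicolon gets a ";".
    html = re.sub(r"&#(\d+)(?!\d|;)", r"&#\1;", html)
    # Numeric forms of < > & become &lt; &gt; &amp;. Other escapes are
    # left exactly as they were.
    return re.sub(
        r"&#(\d+);",
        lambda m: NAMED.get(m.group(1), m.group(0)),
        html,
    )
```

For instance, fix_numeric_escapes("a &#61 b") supplies the semicolon and
leaves the escape otherwise alone, while "&#60" is both terminated and
rewritten as "&lt;".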

OPTIONS

-u
     Print how-to-call information and a summary of options.

-wcols
     Wrap output lines at column cols. By default, lines are wrapped at
     column 72. A cols specification of 0 disables line wrapping.
     demoroniser attempts to wrap lines so as to preserve their meaning.
     Lines are broken at white space whenever possible. If this cannot
     be done, a line longer than the cols specification will remain in
     the output HTML.
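
The wrapping rule described for -wcols (break only at white space, let
an unbreakable line run long, treat 0 as "off") can be approximated
with Python's standard textwrap module. This is a sketch of the rule as
stated, not the demoroniser's own code; wrap_line is a hypothetical
name, and the sketch makes no attempt to avoid breaking inside HTML
tags or quoted attribute values.

```python
import textwrap


def wrap_line(line: str, cols: int = 72) -> list:
    """Wrap one line at white space; cols == 0 disables wrapping."""
    if cols == 0:
        return [line]
    pieces = textwrap.wrap(
        line,
        width=cols,
        break_long_words=False,   # never split a word to fit the width
        break_on_hyphens=False,   # break at white space only
    )
    # textwrap.wrap returns [] for an empty line; keep the line instead.
    return pieces or [line]
```

A word longer than cols ends up on a line of its own, exceeding the
limit, which matches the behaviour described above.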

BUGS

demoroniser is a Perl script. In order to use it, you must have Perl
installed on your system. demoroniser was developed using Perl 4.0,
patch level 36.

FILES

If no outfile is specified, output is written to standard output. If no
infile is specified, input is read from standard input. 

SEE ALSO

perl(1) 

       Download demoroniser.zip

AUTHOR

John Walker 
http://www.fourmilab.ch/ 

     This software is in the public domain. Permission to use, copy,
     modify, and distribute this software and its documentation for any
     purpose and without fee is hereby granted, without any conditions
     or restrictions. This software is provided "as is" without express
     or implied warranty.



by John Walker
January 16th, 1998