Remove Formatting from M$ Word Generated HTML

(Version 5)

What is It?

It is a simple program that removes formatting and unnecessary garbage from the HTML which Microsoft Word 2000 generates when it is used to create a web page from a word processor document.

I've found that the HTML produced by M$ Word is typically full of items which exist only to help M$ Word convert the page back to its own format. They are unnecessary for display in a web browser and increase the file size (typically three-fold!). It also uses a lot of explicit visual formatting, which does not adapt to different browsers and different users' preferences as well as well-structured HTML with the formatting separated out into a stylesheet. Furthermore, it is more difficult to maintain, and the non-standard HTML/XML items may cause problems in the future. This program will hopefully remove all that, leaving clean, simple, robust HTML to which you can then add your own stylesheets etc. if you want. It also converts the HTML from the Windows character set to standard UTF-8 Unicode.
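To give a feel for the kind of search-and-replace clean-up described above, here is a minimal sketch in Python rather than Perl. The patterns are assumptions chosen for illustration (Office conditional comments, `<o:p>` tags, `class=Mso...` attributes and inline `style` attributes); they are not the list used by the Perl script, which is documented in its own source comments. The function name `strip_word_markup` is hypothetical.

```python
import re

# Illustrative patterns only (an assumption, not the Perl script's actual list).
WORD_PATTERNS = [
    # Conditional comments holding Office-only XML:
    # <!--[if gte mso 9]> ... <![endif]-->
    re.compile(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE),
    # Office namespace tags such as <o:p> and </o:p>
    re.compile(r"</?o:[^>]*>", re.IGNORECASE),
    # class attributes such as class=MsoNormal (quoted or unquoted)
    re.compile(r"\s+class=(\"Mso[^\"]*\"|'Mso[^']*'|Mso[\w-]*)", re.IGNORECASE),
    # Quoted inline style attributes (explicit visual formatting)
    re.compile(r"\s+style=(\"[^\"]*\"|'[^']*')", re.IGNORECASE),
]


def strip_word_markup(html: str) -> str:
    """Apply each pattern in turn, deleting whatever it matches."""
    for pattern in WORD_PATTERNS:
        html = pattern.sub("", html)
    return html


sample = (
    "<p class=MsoNormal style='mso-margin-top-alt:auto'>Hello<o:p></o:p></p>"
    "<!--[if gte mso 9]><xml><w:WordDocument/></xml><![endif]-->"
)
print(strip_word_markup(sample))
```

Regular expressions are a blunt tool for HTML, so a sketch like this only suits input as predictable as M$ Word's output; check the result by eye, as recommended further down.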

It may need tweaking to work with different versions of M$ Word because M$ keep adding more mess to M$ Word. I wrote it to work with ASCII (technically 8-bit extended ASCII in Windows code page 1252) HTML produced by M$ Word 2k for M$ Windows 2k, but I've been told it even works reasonably well with UTF-8 HTML from M$ Word for Mac OS X.
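The two input encodings mentioned above can be handled with one fallback rule. The following Python sketch is an assumption about how one might do it, not a description of what the Perl script does: it tries strict UTF-8 first and, if that fails, treats the bytes as Windows code page 1252. The function name `to_utf8` is hypothetical.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return the document as UTF-8 bytes, whichever encoding it arrived in."""
    try:
        # Valid UTF-8 (which includes plain 7-bit ASCII) passes through unchanged.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume Windows code page 1252. A few byte values are
        # undefined in that code page, so replace them rather than fail.
        text = raw.decode("cp1252", errors="replace")
    return text.encode("utf-8")


# Curly quotes as M$ Word writes them in code page 1252 (bytes 0x93 and 0x94).
print(to_utf8(b"\x93quoted\x94").decode("utf-8"))
```

Trying UTF-8 first is the safer order because text in code page 1252 with any accented or "smart" characters is very unlikely to be valid UTF-8 by accident, whereas the reverse test cannot fail.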

For a list of the things which it removes, see the comments in the source code of the Perl script.

System Requirements

A Perl interpreter.

How to Use It

Open the word processor document in M$ Word, choose File -> Save As from the menu and select Web Page as the file type. If it generates a subdirectory as well as the web page, delete 'filelist.xml' and any files you have no need for from that directory (it will typically contain a screen-resolution version of every image for the web page, plus a copy of the original image file if the image was not drawn in M$ Word but inserted from a separate image file).
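If you process many documents, the 'filelist.xml' deletion can be scripted. This is a small Python sketch under the assumption that you pass it the path of the subdirectory M$ Word generated; the function name `remove_filelist` is hypothetical, and deciding which image files you have no need for is left to you.

```python
from pathlib import Path


def remove_filelist(support_dir: str) -> bool:
    """Delete 'filelist.xml' from the given directory.

    Returns True if the file was found and deleted, False if it was not there.
    Nothing else in the directory is touched.
    """
    target = Path(support_dir) / "filelist.xml"
    if target.is_file():
        target.unlink()
        return True
    return False
```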

Run this program with one parameter: the name (path with respect to the current working directory) of the HTML file to process.

Items it does not Correct

There are four items I have found for which I have not added automatic corrections: in three cases because the damage done by M$ Word is indistinguishable from possibly intentional formatting, and in the remaining case because it is so serious that I wrote a separate program to sort it out.

You may well find others because I have written this program by adding more garbage-removing search-&-replace commands as I found different types of garbage when using it. I certainly haven't used all the features in feature-bloated M$ Word, so you may be using some which result in HTML garbage that I have not yet come across. Moreover, future versions of M$ Word will probably add even more garbage. Therefore check the HTML source code after running this program on it and, if you find that any garbage remains, add some lines to the program to remove it and tell me what they are so that they can be given out as well.

An Alternative: HTML Tidy

Three years after writing this program, I found that there already existed a far more comprehensive program called HTML Tidy. HTML Tidy can tidy up HTML source code in many ways, one of which, its 'word-2000' option, is to strip out M$ Word garbage. I have not tested HTML Tidy myself but I expect that you will prefer HTML Tidy to my simple program.

However, I'm keeping my alternative available on the WWW because it handles web pages with images whereas, according to the HTML Tidy manual, its M$ Word HTML processing facility does not. My PhD thesis contained 112 pictures and a further ~600 equations rendered as images so an inability to cope with images would have been inconvenient.
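For anyone who wants to try the HTML Tidy route, the 'word-2000' option mentioned above can be switched on from the command line. The Python sketch below only builds and runs that command; it assumes the `tidy` executable is installed and on your PATH, and the function names `tidy_command` and `run_tidy` are hypothetical. Like the rest of this page, it is untested against real M$ Word output.

```python
import shutil
import subprocess


def tidy_command(src: str, dst: str) -> list[str]:
    """Build the argument list: Word 2000 clean-up on, output written to dst."""
    return ["tidy", "--word-2000", "yes", "-o", dst, src]


def run_tidy(src: str, dst: str) -> int:
    """Run HTML Tidy if it is installed and return its exit status."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("HTML Tidy ('tidy') was not found on PATH")
    # Tidy exits with status 1 when it only has warnings, so a non-zero
    # status is returned to the caller rather than raised as an error.
    return subprocess.run(tidy_command(src, dst), check=False).returncode
```

Writing to a separate output file rather than modifying the original in place makes it easy to compare the two results side by side.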


Download (4 Kb).


