Remove Formatting from Microsoft Word Generated HTML

(Version 5)

What is It?

It is a simple program which removes formatting and unnecessary garbage from the HTML generated when Microsoft Word 2000 is used to create a web page from a word processor document.

I've found that the HTML produced by MS Word is typically full of items which exist only to help MS Word convert the page back to its own format. They are unnecessary for display in a web browser and increase the file size (typically three-fold!). MS Word also uses a lot of explicit visual formatting, which does not adapt to different browsers and different users' preferences as well as well-structured HTML with the formatting preferences separated out into a stylesheet. Furthermore, the generated HTML is more difficult to maintain, and its non-standard HTML/XML items may cause problems in the future. This program will hopefully remove all of that, leaving clean, simple, robust HTML to which you can then add your own stylesheets etc. if you want. It also converts the HTML from the Windows character set to standard UTF-8 Unicode.
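To give a flavour of the kind of search-and-replace clean-up described above, here is a minimal sketch. It is written in Python rather than Perl, it is not taken from the actual script, and the function name strip_word_markup and the particular patterns are assumptions chosen for illustration. It covers only a few well-known kinds of MS Word clutter (conditional comments, Office-namespaced tags such as <o:p>, and style/class attributes); the real list of removed items is in the comments of the Perl source.

```python
import re

# Attributes that typically carry Word's visual formatting or Office namespaces.
# (Illustrative choice; the real script's list may differ.)
ATTR_RE = re.compile(
    r"""\s+(?:style|class|xmlns:\w+)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)


def strip_word_markup(html):
    """Remove a small, illustrative subset of MS Word clutter from an HTML string."""
    # 1. Drop whole conditional-comment blocks: <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # 2. Drop bare <![if ...]> and <![endif]> markers but keep what they enclose
    #    (Word wraps ordinary <img> tags in these).
    html = re.sub(r"<!\[(?:if[^\]]*|endif)\]>", "", html, flags=re.IGNORECASE)
    # 3. Drop Office-namespaced tags such as <o:p> and </o:p>, keeping any text inside.
    html = re.sub(r"</?[a-z]+:[^>]*>", "", html, flags=re.IGNORECASE)
    # 4. Inside each remaining tag, remove the formatting attributes.
    html = re.sub(r"<[a-z][^>]*>",
                  lambda m: ATTR_RE.sub("", m.group(0)),
                  html, flags=re.IGNORECASE)
    return html


if __name__ == "__main__":
    sample = "<p class=MsoNormal style='margin:0cm'>Hello<o:p></o:p></p>"
    print(strip_word_markup(sample))
```

Regular expressions like these are fragile on arbitrary HTML (for instance, an attribute value containing '>' would confuse step 4), so treat this as a sketch of the approach rather than a replacement for the program.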

It may need tweaking to work with different versions of MS Word because MS keep adding more mess to MS Word. I wrote it to work with ASCII (technically Windows code page 1252, an 8-bit extended ASCII) HTML produced by MS Word 2k for MS Windows 2k, but I've been told it even works reasonably well with UTF-8 HTML from MS Word for Mac OS X.
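The two input encodings mentioned above can be handled along the following lines. This is a sketch in Python under stated assumptions, not a description of what the Perl script does: the function name to_utf8 is hypothetical, and the rule "try UTF-8 first, otherwise assume code page 1252" is a simple heuristic chosen for illustration.

```python
import re


def to_utf8(raw):
    """Return the bytes of an HTML file re-encoded as UTF-8.

    Assumption: input that is not valid UTF-8 is Windows code page 1252.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # A few byte values are undefined in cp1252; replace them rather than fail.
        text = raw.decode("cp1252", errors="replace")
    # Keep the page's own charset declaration consistent with the new encoding.
    text = re.sub(r"charset=windows-1252", "charset=utf-8", text, flags=re.IGNORECASE)
    return text.encode("utf-8")


if __name__ == "__main__":
    # 0xE9 is e-acute and 0x93/0x94 are curly quotes in code page 1252.
    print(to_utf8(b"caf\xe9 \x93hi\x94").decode("utf-8"))
```

Plain ASCII input passes through unchanged, since ASCII is a subset of both encodings. The heuristic can guess wrongly in the rare case where code page 1252 text happens to form valid UTF-8 byte sequences.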

For a list of the things which it removes, see the comments in the source code of the Perl script.

System Requirements

A Perl interpreter.

How to Use It

Open the word processor document in MS Word, choose File -> Save As from the menu and select Web Page as the file type. If it generates a subdirectory as well as the web page, delete 'filelist.xml' and any files you have no need for from that directory. (The directory will typically contain a screen-resolution version of every image for the web page, plus a copy of the original image file if the image was not drawn in MS Word but inserted from a separate image file.)
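If you have many pages to process, the directory clean-up can be scripted. The following Python sketch is an illustration only: the function name tidy_support_dir is hypothetical and the directory is whatever MS Word created alongside your page. It deletes 'filelist.xml' and returns the names of the files that remain, so that you can decide by hand which of the image copies you still need.

```python
from pathlib import Path


def tidy_support_dir(support_dir):
    """Delete filelist.xml from the given directory and list the files left in it."""
    folder = Path(support_dir)
    filelist = folder / "filelist.xml"
    if filelist.is_file():
        filelist.unlink()
    # Nothing else is deleted automatically: which images are needed is a manual decision.
    return sorted(p.name for p in folder.iterdir() if p.is_file())
```

It deliberately removes nothing but 'filelist.xml', because only you can tell which image versions the finished page still refers to.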

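Run this program with one parameter: the name (path relative to the current working directory) of the HTML file to process. For example, if the Perl interpreter is on your path and the web page is called MyPage.html (a placeholder name), type: perl StripFormattingFromWordGeneratedHtml.pl MyPage.html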

Items it does not Correct

There are four items I have found for which I have not added automatic corrections: in three cases because the damage done by MS Word is indistinguishable from possibly intentional formatting, and in the remaining case because the damage is so serious that I wrote a separate program to sort it out.

You may well find others, because I have written this program by adding more garbage-removing search-&-replace commands as I found different types of garbage while using it. I certainly haven't used all the features in feature-bloated MS Word, so you may be using some which result in HTML garbage that I have not yet come across. Moreover, future versions of MS Word will probably add even more garbage. Therefore check the HTML source code after running this program on it and, if you find that any garbage remains, add some lines to the program to remove it and tell me what they are so that they can be given out as well.
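Checking the output by eye is tedious for long pages, so a small scanner can point you at suspicious lines. The Python sketch below is only an illustration: the function name find_leftovers and the pattern list are assumptions based on commonly seen MS Word markers, and a clean report does not prove that no garbage remains.

```python
import re

# Markers that commonly indicate leftover MS Word markup (an illustrative list only).
LEFTOVER_PATTERNS = {
    "mso- style property": re.compile(r"mso-[a-z-]+", re.IGNORECASE),
    "Office namespaced tag": re.compile(r"</?[a-z]+:[a-z]+", re.IGNORECASE),
    "conditional comment": re.compile(r"<!(?:--)?\[if", re.IGNORECASE),
    "Mso class name": re.compile(r"class\s*=\s*[\"']?Mso\w+", re.IGNORECASE),
}


def find_leftovers(html):
    """Return (line number, description) pairs for lines that still look Word-generated."""
    findings = []
    for lineno, line in enumerate(html.splitlines(), start=1):
        for label, pattern in LEFTOVER_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


if __name__ == "__main__":
    page = "<p>ok</p>\n<p class=MsoNormal>x<o:p></o:p></p>"
    for lineno, label in find_leftovers(page):
        print(f"line {lineno}: {label}")
```

Each reported line is a candidate for a new search-and-replace rule in the program.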

An Alternative: HTML Tidy

Three years after writing this program, I found that a far more comprehensive program called HTML Tidy already existed. HTML Tidy can tidy up HTML source code in many ways, one of which, its 'word-2000' option, is to strip out MS Word garbage. I have not tested HTML Tidy myself but I expect that you will prefer HTML Tidy to my simple program.

However, I'm keeping my alternative available on the WWW because it handles web pages with images whereas, according to the HTML Tidy manual, its MS Word HTML processing facility does not. My PhD thesis contained 112 pictures and a further ~600 equations rendered as images so an inability to cope with images would have been inconvenient.

Download

Download StripFormattingFromWordGeneratedHtml.pl (4 Kb).

Other Perl Scripts, Disclaimers Etc.

See my computer programs index page for more simple useful computer programs.