
[CEUD-ICT] Tools for conversion of MS Word documents to HTML

Barry McMullin barry.mcmullin at dcu.ie
Tue Feb 3 14:43:11 GMT 2009


On Tue, 3 Feb 2009, Donal J. Rice wrote:

> We're doing some research into tools for the conversion of content in MS
> Word documents into accessible (X)HTML.

Well, I have to jump into this one because it is a hobbyhorse of
mine ... unfortunately I don't have any very good answers.

First, I think it is worth teasing this out into several
overlapping, but distinct, problems - that is, different
dimensions of looking at it:

+ "Long" versus "short" content.  There is no hard and fast
   division here, but I would say roughly once you get over four
   or five traditional hardcopy pages it is "long" in the sense of
   needing heavier duty tools to work with.

+ "Simple" versus "complex" content.  Again, no sharp dividing
   line, but if you have (complicated) tables, (complicated)
   images (diagrams, drawings etc.), complicated notations (maths,
   music etc.), multi-lingual content, forms, etc., then specialist
   training, and probably specialist tools, will be needed.

+ Original content creation versus (one-shot) remediation of
   content from some "external" or "uncontrolled" source.

+ Somewhat overlapping with that, "closed" versus "open"
   content creation: basically whether you need to preserve "good"
   markup while your content takes a round trip through editing
   partners who are outside your organisation (and therefore
   applying other tools and other levels of accessibility
   expertise).

+ Single target versus multi-target: do you want a single source
   to be automatically convertible (with maximum accessibility
   support) to different media - HTML, DAISY, PDF, large print
   hardcopy, braille hardcopy, (analogue) audio book etc.?

So, with that all said, do I have any tools at all to suggest?

+ OpenOffice is a good alternative to MS-Word itself. *If* the
   document is edited appropriately (absolutely no direct
   formatting, correct and appropriate use of styles etc.) then both
   the HTML and PDF export options seem to work pretty well in
   terms of accessibility.  I haven't tested this exhaustively, so
   your mileage may vary, but it's certainly worth playing
   with. There is even a "DAISY export" plugin (though I have not
   tried it). There is also the advantage that the "native" file
   formats are open XML standards - so, in more complex contexts,
   you should be able to interwork effectively with XML tool
   chains. (Of course, Microsoft has their own characteristic
   approach to XML interoperability ... I will say no more.)

+ For "academic" work (which would include most kinds of longer
   "reports"), I use plain, old-fashioned, LaTeX; processing it
   with pdflatex for PDF and tex4ht for HTML.  It's not perfect,
   but it's not bad.  It's particularly strong at handling
   document structure (hierarchical sections, subsections etc.,
   "footnotes", citations and references) with automatic
   generation of all appropriate internal and external navigation,
   and flexibility in subdivision.  LaTeX content is typically
   authored with some sort of customised text editor, though there
   are some more graphically-oriented front ends also.  In
   general, LaTeX tools are not suitable for direct use by "general"
   users; but, as I say, where we are talking about longer
   documents, where a public sector body would typically be using
   an external "design" or "typesetting" agency anyway, I would
   much prefer an agency using these kinds of "abstract mark-up"
   oriented tools over one relying on graphical, vision-oriented,
   WYSIWYG, "desktop publishing" tools.
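
To make the LaTeX route a little more concrete, here is a minimal
sketch of the "one source, two targets" step in Python.  It assumes
pdflatex and htlatex (the usual command-line wrapper that ships
with tex4ht) are on the PATH; the function names and the two-pass
choice for pdflatex are my own illustration, not a recommended or
complete build setup.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(tex_file):
    """Return the command lines for the PDF and HTML builds of one source.

    Assumption: pdflatex is run twice so that cross-references and the
    table of contents settle; htlatex is run once, since it already
    makes several LaTeX passes internally.
    """
    name = Path(tex_file).name
    pdf = ["pdflatex", "-interaction=nonstopmode", name]
    html = ["htlatex", name]
    return [pdf, pdf, html]


def convert(tex_file):
    """Run the builds in the directory that holds the source file."""
    workdir = Path(tex_file).resolve().parent
    for cmd in build_commands(tex_file):
        # Skip a tool that is not installed rather than failing outright.
        if shutil.which(cmd[0]) is None:
            print("skipping, not installed:", cmd[0])
            continue
        subprocess.run(cmd, cwd=workdir, check=True)
```

Calling convert("report.tex") (a hypothetical file name) should
leave report.pdf and report.html beside the source.  The point is
only that both outputs come from the same structural mark-up, so
headings, footnotes and cross-references do not have to be redone
by hand for each target.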

As always, just my 2c worth.

Regards - Barry.

--
Barry McMullin, Dublin City University
   phone: +353-1-700-5432
   web: http://www.eeng.dcu.ie/~mcmullin/

