[CEUD-ICT] Tools for conversion of Ms Word documents to HTML
Barry McMullin
barry.mcmullin at dcu.ie
Tue Feb 3 14:43:11 GMT 2009
On Tue, 3 Feb 2009, Donal J. Rice wrote:
> We're doing some research into tools for the conversion of content in MS
> Word documents into accessible (X)HTML.
Well, I have to jump into this one because it is a hobbyhorse of
mine ... unfortunately I don't have any very good answers.
First, I think it is worth teasing this out into several
overlapping, but distinct, problems - that is, different
dimensions of looking at it:
+ "Long" versus "short" content. There is no hard and fast
division here, but I would say roughly once you get over four
or five traditional hardcopy pages it is "long" in the sense of
needing heavier duty tools to work with.
+ "Simple" versus "complex" content. Again, no sharp dividing
line, but if you have (complicated) tables, (complicated)
images (diagrams, drawings etc.), complicated notations (maths,
music etc.), multi-lingual content, forms, etc., then specialist
training, and probably specialist tools, will be needed.
+ Original content creation versus (one-shot) remediation of
content from some "external" or "uncontrolled" source.
+ Somewhat overlapping with that, "closed" versus "open"
content creation: basically whether you need to preserve "good"
markup while your content takes a round trip through editing
partners who are outside your organisation (and therefore
applying other tools and other levels of accessibility
expertise).
+ Single target versus multi-target: do you want a single source
to be automatically convertible (with maximum accessibility
support) to different media - HTML, DAISY, PDF, large print
hardcopy, braille hardcopy, (analogue) audio book etc.?
So, with that all said, do I have any tools at all to suggest?
+ OpenOffice is a good alternative to MS-Word itself. *If* the
document is edited appropriately (absolutely no direct
formatting, correct and appropriate use of styles etc.) then both
the HTML and PDF export options seem to work pretty well in
terms of accessibility. I haven't tested this exhaustively, so
your mileage may vary, but it's certainly worth playing
with. There is even a "DAISY export" plugin (though I have not
tried it). There is also the advantage that the "native" file
formats are open XML standards - so, in more complex contexts,
you should be able to interwork effectively with XML tool
chains. (Of course, Microsoft has their own characteristic
approach to XML interoperability ... I will say no more.)
+ For "academic" work (which would include most kinds of longer
"reports"), I use plain, old-fashioned, LaTeX; processing it
with pdflatex for PDF and tex4ht for HTML. It's not perfect,
but it's not bad. It's particularly strong at handling
document structure (hierarchical sections, subsections etc.,
"footnotes", citations and references) with automatic
generation of all appropriate internal and external navigation,
and flexibility in subdivision. LaTeX content is typically
authored with some sort of customised text editor, though there
are some more graphically-oriented front ends also. In
general, LaTeX tools are not suitable for direct use by "general"
users; but, as I say, where we are talking about longer
documents, where a public sector body would typically be using
an external "design" or "typesetting" agency anyway, I would
much prefer an agency using these kinds of "abstract mark-up"
oriented tools over one relying on graphical, vision-oriented,
WYSIWYG, "desktop publishing" tools.
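To make the single-source LaTeX route above a little more concrete, here is a rough sketch of driving both builds from one file. It assumes that pdflatex and htlatex (the usual wrapper script shipped with tex4ht) are on the PATH; the file name "report.tex" and the helper names build_commands and convert are purely illustrative, and this is not necessarily how any particular tool chain is set up.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(tex_file):
    """Return the command lines for a PDF build and an HTML build of one source."""
    source = Path(tex_file)
    return {
        # pdflatex writes <name>.pdf next to the source
        "pdf": ["pdflatex", "-interaction=nonstopmode", source.name],
        # htlatex is the tex4ht wrapper; it writes <name>.html (assumed to be installed)
        "html": ["htlatex", source.name],
    }


def convert(tex_file):
    """Run each build in the source directory, skipping tools that are not installed."""
    source = Path(tex_file)
    results = {}
    for target, command in build_commands(source).items():
        if shutil.which(command[0]) is None:
            results[target] = "skipped: %s not found" % command[0]
            continue
        completed = subprocess.run(command, cwd=source.parent)
        results[target] = "ok" if completed.returncode == 0 else "failed"
    return results


if __name__ == "__main__":
    # Only print the commands here; call convert("report.tex") to actually run them.
    for target, command in build_commands("report.tex").items():
        print(target, " ".join(command))
```

Documents with cross-references or citations would normally need more than one pass (and a bibtex run), which this sketch leaves out.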
As always, just my 2c worth.
Regards - Barry.
--
Barry McMullin, Dublin City University
phone: +353-1-700-5432
web: http://www.eeng.dcu.ie/~mcmullin/