semantic XML versus HTML, revisited

There is a long tradition in publishing, carried over into methods for representing text in digital form, of marking the components of a document not by their appearance but by their function. That is, a given span of text is labeled not as bold 14-point Helvetica, horizontally centered, but as a chapter heading. By identifying components this way, you can easily change how all chapter headings are displayed without changing the appearance of another component that also happens to be bold 14-point Helvetica. This concept of “markup” from typesetting is the foundation of GML and LaTeX. Such markup, variously called “descriptive”, “structural”, or “semantic”, was the envisioned use for a general-purpose markup language like SGML or XML.
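To make the idea concrete, here is a minimal sketch in Python. It is an illustration only: the role names, the sample text, and the style strings are all made up for this example and do not come from GML, LaTeX, SGML, or XML. It shows two components that start out looking identical, and how restyling one role leaves the other untouched.

```python
# Hypothetical mini-document: each span is tagged by its function, not its appearance.
document = [
    ("chapter-heading", "Chapter 1"),
    ("warning", "Do not skip this chapter."),
    ("paragraph", "Body text."),
]

# Hypothetical stylesheet: appearance is assigned separately, one entry per role.
# The heading and the warning happen to share the same look.
stylesheet = {
    "chapter-heading": "bold 14pt Helvetica, centered",
    "warning": "bold 14pt Helvetica, centered",
    "paragraph": "regular 10pt Times",
}


def render(document, stylesheet):
    """Pair each span of text with the appearance assigned to its role."""
    return [(text, stylesheet[role]) for role, text in document]


before = render(document, stylesheet)

# Restyle every chapter heading by changing a single stylesheet entry.
restyled = {**stylesheet, "chapter-heading": "italic 18pt Garamond, flush left"}
after = render(document, restyled)

for text, style in after:
    print(f"{style}: {text}")
```

Had the text been labeled only as "bold 14pt Helvetica, centered", there would be no way to tell the heading from the warning, and a global change would hit both.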

HTML was first created in the vein of semantic markup, though it included some tags, such as <B> for bold text and <I> for italic text, that describe appearance without reference to document components (“presentational markup”). These elements soon overtook the semantic markup in everyday use and gave HTML a bad reputation among people working with SGML and XML; the W3C has long been working to move HTML back in a semantic direction. But even without taking into account the very good changes coming in HTML5, is HTML really so unsuitable as a general-purpose markup language for publishing and representing text in digital form, as XML experts sometimes claim?