There is a long tradition in publishing, which has carried over into methods for representing text in digital form, of marking components of a document not by their appearance but by their function. That is, a given span of text is not simply labeled as bold 14-point Helvetica, horizontally centered, but is identified as a chapter heading. By identifying the components like this, you can easily change the way you want all chapter headings to be displayed without changing the appearance of another component of the document which might also happen to be bold 14-point Helvetica. This concept of “markup” from typesetting is the foundation of GML and LaTeX. Such markup, variously called “descriptive”, “structural”, or “semantic”, was the envisioned use for a general-purpose markup language like SGML or XML.
HTML was first created in the vein of semantic markup, though it included some tags, such as <B> for bold text and <I> for italicized text, for describing appearance without reference to the document components (“presentational markup”). These elements soon overtook the semantic markup in everyday use and gave HTML a bad reputation among people working with SGML and XML; the W3C has long been working to move HTML back in a semantic direction. But even without taking into account the very good changes coming in HTML5, is HTML really so unsuitable as a general-purpose markup language for publishing and representing text in digital form, as is sometimes claimed by XML experts?
The standard argument for using semantic tagging in SGML/XML instead of using HTML is that you can identify the components of the document once and then transform as needed into the desired format. (See Allen Renear’s exposition on these topics and more in the “Text Encoding” chapter of A Companion to Digital Humanities.) In the case of publishing online, the thought is that HTML as a standard keeps evolving, as does support for various tags in browsers, whereas your semantic markup could be stable.
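To make the "identify once, transform as needed" idea concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The element names (chapter, title, para, term) and the two mapping tables are hypothetical, invented purely for illustration; they are not taken from DocBook, TEI, or any other real vocabulary, and a production workflow would more likely use XSLT or a similar tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic vocabulary, invented for this illustration.
SOURCE = """<chapter>
  <title>On Markup</title>
  <para>Text with an <term>important term</term> inside.</para>
</chapter>"""

# Two hypothetical output mappings: semantic element -> (HTML tag, attributes).
WEB_STYLE = {
    "chapter": ("section", {}),
    "title": ("h1", {}),
    "para": ("p", {}),
    "term": ("em", {}),
}
EBOOK_STYLE = {
    "chapter": ("div", {"class": "chapter"}),
    "title": ("h2", {"class": "chapter-title"}),
    "para": ("p", {}),
    "term": ("b", {}),
}


def transform(node, mapping):
    """Recursively rebuild the tree, renaming each element per the mapping."""
    tag, attrs = mapping[node.tag]
    out = ET.Element(tag, dict(attrs))
    out.text = node.text  # text before the first child
    out.tail = node.tail  # text following this element inside its parent
    for child in node:
        out.append(transform(child, mapping))
    return out


def render(source, mapping):
    """Parse the semantic source once and serialize it with the given mapping."""
    root = ET.fromstring(source)
    return ET.tostring(transform(root, mapping), encoding="unicode")


if __name__ == "__main__":
    print(render(SOURCE, WEB_STYLE))
    print(render(SOURCE, EBOOK_STYLE))
```

The source document is never edited here; only the mapping table changes between outputs. That is the whole appeal of the approach, and also a hint at its cost, since every new output format needs its own mapping to be written and maintained.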
However, there are various arguments for using HTML instead of semantic markup in XML. Production workflows using XML are expensive to build and are most valuable for very large bodies of content being used in various ways, not all of which involve rendering for reading. For most publishing workflows, involving just a handful of output formats, you don’t need a system that is so robust and so complicated. And, perhaps more compellingly, there is a huge number of applications for creating and editing HTML code built into software that we use all of the time, and HTML forms the basis of the most common ebook formats in use today … so why not just go with the flow? (For more detail on all of these arguments, see “XML Production Workflows? Start with the Web” and “The unXMLing of digital books”.) Furthermore, while you might think of HTML as an unsuitable format for storing your content (since the standard keeps evolving), the makers of browsers in fact have no incentive to remove support for old features. So webpages authored using older versions of HTML, even when crafted in an obscure way to work across browsers, often render quite suitably using a contemporary browser. Compare HTML with semantic markup formats like DocBook, Content MathML, JATS, and TEI, which do change, and not always in backwards-compatible ways. Are these really safer for your content than HTML?
This all came to mind again when listening to a presentation by Grigory Belonuchkin at RCDL’2013. When asked why he doesn’t use XML (with some sort of semantic tagging), he gave the argument that browsers don’t remove rendering features, and he also responded by asking why he’d go through the trouble of encoding in XML if his ultimate goal was to render in HTML anyway. His choice of pure HTML (see some outdated documentation in Russian of his proposed Istnet standard for representing digitized books) reminds me of the preference in Project Gutenberg and similar projects for simple, widely used standards rather than any of the various XML vocabularies. Still, he put himself through all sorts of difficulties to find a single encoding that will render page numbers and page breaks the same across browsers. It seems he is bordering on the sort of needs that push people from HTML into some sort of semantic XML.