cultural heritage materials and markup

I always enjoy the annual Balisage pre-conference symposium. This year’s, entitled “Cultural Heritage Markup: Using Markup to Preserve, Understand, and Disseminate Cultural Heritage Materials”, promises to be worthwhile: the call for participation includes a whole list of interesting topics.

I was considering submitting a proposal, but I’m not sure that I have a whole presentation’s worth of material on any one of the topics. I do have a bit to say on most of them, though, so I’ll share those thoughts here.

Is XML really appropriate for representing texts of scholarly interest?

The original meaning of markup—an editor’s marks on a manuscript—is like “coding” as done in the social sciences for qualitative analysis: a simple way of labeling spans of text. In such coding, those spans can overlap. But in XML, they can’t. Therefore, if XML is used to represent a text, it works best when your text fits into a hierarchy. This works very well when you are publishing something new, like a manual or reference work. The problem is when you are trying to represent a source document that already exists: when you look at that document, you will probably find instances of “overlapping hierarchies”, which can’t be represented in XML in a straightforward way.
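To make the overlap problem concrete, here is a minimal Python sketch. It assumes the labels are stored as standoff spans (character offsets plus a label) rather than as inline tags; the `Span` type, the function names, and the sample text are all made up for illustration. Two spans can share one XML hierarchy only if they are disjoint or one sits wholly inside the other.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def crosses(a: Span, b: Span) -> bool:
    """True if the spans intersect but neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


def fits_one_hierarchy(spans: list[Span]) -> bool:
    """True if every pair of spans is either disjoint or properly nested."""
    return not any(
        crosses(a, b) for i, a in enumerate(spans) for b in spans[i + 1:]
    )


text = "one two three four"          # hypothetical sample text
line = Span(0, 13, "line")           # covers "one two three"
sentence = Span(4, 18, "sentence")   # covers "two three four"
word = Span(4, 7, "word")            # covers "two"

print(fits_one_hierarchy([line, word]))            # nested: fine as XML
print(fits_one_hierarchy([line, sentence, word]))  # line and sentence cross
```

The first call succeeds because the word nests inside the line. The second fails because the line ends in the middle of the sentence, which is exactly the situation that forces workarounds in XML.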

We don’t have much of a toolchain around any markup language that handles multiple hierarchies, so most of us just get by with XML.

Is XML really appropriate for representing metadata about non-textual artefacts?

I think so. It had never occurred to me that it wouldn’t be. Is this a trick question?

What does it mean for cultural heritage texts to be interoperable? Is it desirable? Is it possible?

People mean different things by “interoperable”. I rather like the distinction that Syd Bauman made at Balisage a few years ago between “negotiated interchange”, “blind interchange”, and “interoperation”. Loosely speaking, “interchange” means we can read each other’s data, whereas “interoperation” goes further to mean we can use it without needing someone to intervene to interpret it.

According to these distinctions, the TEI works quite well for interchange but not for interoperation. That’s because TEI, in trying to support multiple scholarly views on text, often allows more than one way of marking up a phenomenon, meaning you have to normalize the encoding in order to process a set of documents uniformly. The Best Practices for TEI in Libraries tries to prescribe one option in many situations, and the TEI Simple project is going further, even including a processing model for its markup (which the TEI Guidelines have never included in any form). These will increase interoperability.
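As a rough illustration of that normalization step, here is a small Python sketch using the standard library’s `xml.etree.ElementTree`. It assumes a hypothetical project in which one encoder used `<emph>` and another used `<hi rend="italic">` for the same phenomenon, and in which we have decided to treat the two as equivalent; that decision, the function names, and the sample fragments are assumptions for the example, not anything the TEI Guidelines prescribe.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def normalize_emphasis(root: ET.Element) -> ET.Element:
    """Rewrite every <emph> as <hi rend="italic">, in place.

    Assumption: for this (hypothetical) project the two encodings
    are interchangeable.
    """
    # Collect the matches first so the tree is not modified mid-iteration.
    for el in list(root.iter(f"{{{TEI_NS}}}emph")):
        el.tag = f"{{{TEI_NS}}}hi"
        el.set("rend", "italic")
    return root


def italic_texts(root: ET.Element) -> list[str]:
    """Return the text of every <hi rend="italic"> element."""
    return [
        el.text or ""
        for el in root.iter(f"{{{TEI_NS}}}hi")
        if el.get("rend") == "italic"
    ]


# Two encoders, two conventions, same underlying text.
doc_a = ET.fromstring(
    f'<p xmlns="{TEI_NS}">a <emph>very</emph> old book</p>'
)
doc_b = ET.fromstring(
    f'<p xmlns="{TEI_NS}">a <hi rend="italic">very</hi> old book</p>'
)

for doc in (doc_a, doc_b):
    normalize_emphasis(doc)
    print(italic_texts(doc))
```

After normalization a single query works on both documents. Before it, any downstream code would have to know about both conventions, and about every other variation in the collection.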

So it’s desirable, though perhaps unachievable, to have interoperability because it allows us to use each other’s data without any manual intervention. But it comes at the expense of expressiveness, constraining the sorts of things you can represent. To be cynical, with TEI encoding, you can represent anything, but good luck trying to do anything with it afterwards!

Shared tag sets: do shared markup vocabularies (e.g., TEI, EAD, LIDO, CDWA) do more harm than good?

Despite problems with interchange, if I pick up someone else’s TEI document, I can usually make sense of what’s there. Even if they’ve chosen unusual ways of marking up phenomena, if I understand the language of the text, I can usually figure out what they did. And then I can process the text however I need to get out of it what I require. This is all still more than I’m going to get if a digital document doesn’t use TEI at all, whether it’s plain text, OCR output, a word-processor document, or most forms of HTML.

EAD, I believe, has less variation in practice than TEI, so the interchange problems aren’t so great. LIDO and CDWA are not as document-centric as TEI and EAD; instead, they are really about metadata. I believe the problems here are even less severe, though they are of a different type: people supply the content that goes between the tags inconsistently.

Still, all of these technologies allow us to share data more easily than in the past. If anything, the inconsistent data that causes problems for interoperability is the result more of a greater number of people creating metadata than of any problem with the technology.

Ultra Slavonic

A malapropism for "Old Church Slavonic", or a place to collect thoughts more than 140 characters long
