<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Article Authoring DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/articleauthoring/1.0/JATS-articleauthoring1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>mPach: Integrated Publishing and Archiving of Journals in HathiTrust</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Johnson</surname>
            <given-names>Seth</given-names>
          </name>
          <aff>Michigan Publishing</aff>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Smith</surname>
            <given-names>Bryan</given-names>
          </name>
          <aff>Michigan Publishing</aff>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Hawkins</surname>
            <given-names>Kevin S.</given-names>
          </name>
          <aff>Michigan Publishing</aff>
        </contrib>
      </contrib-group>
      <permissions>
        <copyright-statement>Copyright 2013</copyright-statement>
        <copyright-year>2013</copyright-year>
        <copyright-holder>Seth Johnson</copyright-holder>
        <copyright-holder>Bryan Smith</copyright-holder>
        <copyright-holder>Kevin S. Hawkins</copyright-holder>
        <license>
          <license-p>The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.</license-p>
        </license>
        <license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by-nc/3.0/us/">
          <license-p>This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License</license-p>
        </license>
      </permissions>  
      <abstract>
        <p>mPach is a package of tools being developed to provide a modular platform to enable the publication of born-digital open-access journals in HathiTrust, a digital library created by a partnership of major research institutions. One of the chief technological challenges in creating such a toolkit is enabling the conversion of edited manuscripts to JATS, which was chosen as the preservation-quality format. This paper provides a technical overview of the mPach platform, with special attention paid to the design and functionality of Norm, a tool being developed to convert Microsoft Word documents to JATS.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Motivation and Design Considerations</title>
      <p><ext-link ext-link-type="uri" xlink:href="http://www.hathitrust.org/">HathiTrust</ext-link>, a partnership of major research institutions and libraries, aims "to ensure that the cultural record is preserved and accessible long into the future." <xref ref-type="bibr" rid="E1" /> Its digital library is currently archiving and providing access to digitized library holdings, but since research libraries are increasingly involved not just in making their collections available but also in publishing new scholarly literature <xref ref-type="bibr" rid="E2" />, HathiTrust would be a natural place to archive and provide access to born-digital content as well, ensuring its long-term preservation and discoverability.</p>
      <p><ext-link ext-link-type="uri" xlink:href="http://www.publishing.umich.edu/">Michigan Publishing</ext-link>, the University of Michigan's scholarly publishing operation, is based in the University Library and has long used a system called <ext-link ext-link-type="uri" xlink:href="http://www.dlxs.org/">DLXS</ext-link> as its primary publishing platform. An extensible system with a large amount of code shared between publications, DLXS has been essential in allowing the Library to publish at scale; however, as Michigan Publishing grows and attempts to achieve even greater scalability, additional needs have emerged that cannot be satisfied within the architecture of DLXS. Since there is growing interest in shared infrastructure and operations for scholarly publishing, among both libraries and university presses, the Library, with its close connection to HathiTrust, has seized this opportunity to develop a new platform, called <ext-link ext-link-type="uri" xlink:href="http://www.lib.umich.edu/mpach">mPach</ext-link>, to meet its own needs and to serve as shared infrastructure for use by other institutions.</p>
      <p>A modular platform for publishing born-digital open-access journals in the HathiTrust repository, mPach contains a set of modifications to the existing HathiTrust code base, plus entirely new components, that facilitate ingest, display, and discoverability of journal literature in HathiTrust; all of these components are tightly coupled with the repository. mPach provides all of the tools needed to publish an open-access journal online, and it is designed to allow integration with popular journal publishing tools such as <ext-link ext-link-type="uri" xlink:href="http://pkp.sfu.ca/?q=ojs">Open Journal Systems</ext-link>.</p>
      <p>Of primary importance in the design of a system that both publishes and archives content is the inevitable tension between the needs of publishers, which require the flexibility to innovate, and the needs of the archive, where rapid change must yield to the demands of long-term preservation and access. To guarantee preservation of the final version of all published content, a primary design principle for mPach is that "[t]he version of the content in HathiTrust must always be the single authoritative version," and "any revisions to the content must be made to the authoritative copy in the HathiTrust repository." <xref ref-type="bibr" rid="E3" /> So while mPach strives to create a workspace that allows publishers to easily manage and update their content, it also enforces a requirement that HathiTrust always contains the final version of the article.</p>
    </sec>
    <sec>
      <title>An Overview of mPach</title>
      <p>There are three major parts of mPach, each of which includes components in various stages of development at the time of writing:</p>
      <list list-type="bullet">
        <list-item><p><bold>the peer review and editorial system</bold>: provides tools for authors, reviewers and editors</p></list-item>
        <list-item><p><bold>Prepper</bold>: prepares the article for publication and archiving in HathiTrust</p></list-item>
        <list-item><p><bold>modified HathiTrust components</bold>: provides support for born-digital journal content</p></list-item>
      </list>
      <fig id="f1">
        <caption>
          <title>Major parts of mPach</title>
        </caption>
        <graphic xlink:href="image1.png"/>
      </fig>
      <p>As a modular system, mPach could be used with any peer review and editorial system that is capable of interacting with Prepper; however, the developers have chosen to provide OJS as the default option. Despite the lack of support for digital preservation, OJS is already widely used for library-based journal publishing, and mPach's integration with this software will allow for a smooth transition of journals already published using OJS into the HathiTrust repository. Integration with mPach requires that manuscripts that reach the "layout" stage in OJS be sent to Prepper, which prepares the HathiTrust Submission Information Package (SIP).</p>
      <p>Prepper, a Ruby on Rails application, provides a user interface for the editor of a journal: a dashboard for administering the journal and putting manuscripts through a production process&#x2014;akin to composition and typesetting&#x2014;that prepares all content according to the preservation standard developed for mPach content in HathiTrust. Prepper invokes Norm, a Python application developed to convert manuscripts from Office Open XML ("DOCX") format <xref ref-type="bibr" rid="E4" /> to JATS. DOCX is the default option because, like OJS, it is widely used in the editorial process of journals published by libraries. The Prepper interface also guides the staff member through a review of validation errors detected by Norm's conversion, uploading high-resolution figures, supplying "alt text" for figures, previewing the article as rendered using the default stylesheet (based on the Preview XSLT stylesheets <xref ref-type="bibr" rid="E5" />), uploading supplementary material <xref ref-type="bibr" rid="E6" />, and submitting for ingest into HathiTrust.</p>
      <fig id="f2">
        <caption>
          <title>Early version of the mPach Prepper interface</title>
          <p>In this image, the article metadata is displayed just after Norm completes the conversion of the submitted article from DOCX to JATS.</p>
        </caption>
        <graphic xlink:href="image2.png"/>
      </fig>
      <fig id="f3">
        <caption>
          <title>Using Prepper to upload an article and submit it to HathiTrust</title>
          <p>The images are numbered as follows: 1) Select file to upload. 2) Verify that conversion by Norm is correct, and edit certain bibliographic details. 3) Annotate media and provide archival copies. 4) Review HathiTrust Pageturner preview. 5) Add any supplemental material. 6) Review article submission. 7) Receive article submission confirmation. 8) View submitted content from the journal home page. Note that these are early versions of the interface and are subject to change.</p>
        </caption>
        <graphic xlink:href="image3.png"/>
      </fig>  
      <p>mPach requires a number of significant modifications to HathiTrust components and workflows originally designed to support digitized print materials. The reading interface in HathiTrust, which previously supported only rendering of digitized page images, now renders JATS XML as HTML, lets a user download a dynamically generated PDF and EPUB, displays metadata specific to articles (<xref ref-type="fig" rid="f4">Figure 4</xref>), and links to a special "collection" for the journal in HathiTrust's Collections application <xref ref-type="bibr" rid="E7" /> that allows browsing volumes and issues of the journal (<xref ref-type="fig" rid="f5">Figure 5</xref>).</p>
      <fig id="f4">
        <caption>
          <title>Prototype of an article in HathiTrust's user interface</title>
        </caption>
        <graphic xlink:href="image4.png"/>
      </fig>
      <fig id="f5">
        <caption>
          <title>Prototype of a journal in HathiTrust's user interface</title>
        </caption>
        <graphic xlink:href="image5.png"/>
      </fig>
      <p>In HathiTrust, MARC records provide bibliographic metadata like title and author for every item in the repository, which enables discovery by browsing or searching. For mPach, each article has its own analytic catalog record, tied to a monographic record for the journal as a whole. Finally, the HathiTrust Data API <xref ref-type="bibr" rid="E8"/> allows for the content of each article to be retrieved for use outside of the native HathiTrust interface.</p>
      <p>Note that HathiTrust restricts access to content only for legal reasons, not because a rights holder wishes to restrict access. Therefore, mPach supports the publication of open-access journals only.</p>
    </sec>
    <sec>
      <title>Workflow</title>
      <p>In the typical workflow for publishing a journal using mPach, a journal editor uses OJS to manage submissions, peer review, and the editing process. Once an article reaches the "layout" stage (where a combination of composition and typesetting allows the article to be formatted in a consistent way), the journal editor formats the article using a predefined list of styles provided by mPach in Microsoft Word; these styles are used to flag specific content within the article (e.g., the title, authors, and institutions), which provides the necessary semantics for Norm to transform the document into JATS.</p>
      <p>After the styles are applied to the Word document, the editor submits the article to Prepper, which guides the editor through conversion to JATS XML (and validation of the result), preparation of the submission information package (SIP), and submission for ingest into HathiTrust. Prepper tracks versions of articles so that revised content can be resubmitted. Currently, the ingest process overwrites any previous version of an item with the same identifier; however, HathiTrust will provide support for versioned submissions in the future.</p>
    </sec>
    <sec>
      <title>JATS and the Publishing Tag Set</title>
      <p>University of Michigan Library staff researched various formats for publishing and archiving born-digital article content. JATS was selected because of the increasing coalescence of the publishing industry around this open, non-proprietary standard suitable for representing the structure and semantics of journal articles in order to both preserve them and render the content in various output formats. Although archiving is one of the primary goals for mPach and HathiTrust, the Archiving and Interchange Tag Set ("green") is not an appropriate choice because mPach defines the structure of the content and hence does not need to represent information about the appearance of the source document. Furthermore, the Archiving and Interchange Tag Set provides flexibility in tagging that would complicate the logic in the stylesheets used to render the content in HathiTrust. And while the Authoring ("orange") tag set provides the right tags to represent the content submitted by mPach, it lacks some necessary metadata appearing in the front section of the articles. The Publishing ("blue") tag set was selected because it provides the same benefits as the Authoring tag set as well as the necessary metadata.</p>
    </sec>
    <sec>
      <title>Norm: An Application for converting DOCX files to JATS XML</title>
      <p>Norm is the component of mPach responsible for transforming an author's DOCX article into XML conforming to the Journal Publishing Tag Set. It is a command-line Python application whose input is a DOCX file and output is JATS XML, plus any embedded content such as images.</p>
      <p>Norm parses the XML content of the Word document, mapping various Word styles to the appropriate JATS elements. Norm then generates the JATS document object model (DOM) using rules that specify how elements are nested and that impose cardinality constraints, similar to the validation rules provided by technologies such as Document Type Definitions (DTD) and XML Schema. (These style-element mappings, as well as the rules for generating the JATS DOM, are represented in Norm's configuration files and are hence customizable.)</p>
      <p>A key requirement of Norm is that the Word document must use specific paragraph styles specified in the configuration file. Norm comes with a default configuration file containing correspondences between JATS element names and Word styles for components of the article such as the title, author first name, author last name, abstract, etc. (Users can edit this file to define their own custom styles.) During conversion to JATS, Norm determines the appropriate JATS element for any particular content by its associated styles; hence, accurate styling is essential, and in some cases incorrectly styled documents will be rejected by Norm.</p>
      <p>Conceptually, the idea of using Word's built-in styles to differentiate elements for transformation is not new. <ext-link ext-link-type="uri" xlink:href="http://www.inera.com/extyles-products">Inera’s eXtyles product suite</ext-link> is an example of software that uses a similar strategy for documents and also supports exporting content in JATS. At this point, Norm has fewer features than eXtyles; however, it will be made available as open-source software along with the rest of mPach, with plans to solicit feedback and feature suggestions from the community of users.</p>
      <fig id="f6">
        <caption>
          <title>A Microsoft Word document using paragraph styles designed for use with Norm</title>
        </caption>
        <graphic xlink:href="image6.png"/>
      </fig>
    </sec>
    <sec>
      <title>A Technical Overview of Norm</title>
      <p>The following algorithm demonstrates how Norm performs the transformation:</p>
      <boxed-text>
        <p><bold>Overview of the algorithm used by Norm for transforming a Word document to JATS</bold></p>
        <p>Note that the Word Document Object Model (DOM) tree is relatively flat compared to the JATS DOM tree, and is much less structured; hence, the configuration plays an essential role in determining where to attach nodes in the tree.</p>
        
        <p><bold>Given:</bold></p>
        <preformat>
      Input = Document Object Model (DOM) tree representation of Word document XML
      Sections = [Front, Body, Back]
      Configuration = Rules to map Word style to JATS element, JATS element to parent, and 
                      JATS element to appropriate section</preformat>
        <p><bold>Step 1: Transform data into internal representation</bold></p>
        <preformat>
      Tuple Arrays = Create empty arrays for each section
      For Element in Input (XPath: /document/body):
          Style = Find in Element (XPath: /pPr/pStyle)
          Content = Find contents in Element (depending on element name)
          JATS Element = Find in Configuration using Style
          Section = Find in Configuration using JATS Element
          Append tuple [JATS Element, Content, Style] to section's Tuple Array</preformat>
        <p><bold>Step 2: Render JATS output from internal representation</bold></p>
        <preformat>
      Output = Empty Document Object Model (DOM) tree
      For Section in Sections
          Node = Create node for section
          Attach Node to Output
          For Tuple in Section's Tuple Array
              Node = Create node using JATS element name in Tuple
              Parent = Find in Configuration using JATS element
              Attach Node to Parent node
      Marshal Output DOM tree to XML</preformat>
      </boxed-text>
      
      <p>The first step in the algorithm involves transforming the data in the Word document into an ordered list of styled content. Note that Norm represents this content internally as a list of tuples (one tuple per styled element) with the following format:</p>
      <preformat>
    [ (JATS element, Content, Word style), … ]</preformat>
      <p>Here, <italic>Content</italic> is the textual content of the Word element, along with any inline styles.</p>
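      <p>As an illustration only, the first step might be sketched in Python as follows. This is not Norm's source code: the function name <monospace>paragraphs_to_tuples</monospace>, the <monospace>STYLE_TO_JATS</monospace> dictionary, and the sample markup are assumptions made for the example, and the third field of each inner tuple is simply left as <monospace>None</monospace> because its purpose is not described here. The sketch reads the paragraph style from <monospace>w:pPr/w:pStyle</monospace> and treats a run as italic whenever a <monospace>w:i</monospace> element is present.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by document.xml inside a DOCX package.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical style-to-element mapping; Norm reads its own from a config file.
STYLE_TO_JATS = {"ArticleTitle": "article-title", "AuthorSurname": "surname"}


def paragraphs_to_tuples(document_xml, style_map):
    """Return [(jats_element, content, word_style), ...] for styled paragraphs."""
    root = ET.fromstring(document_xml)
    tuples = []
    for para in root.findall("w:body/w:p", NS):
        style_node = para.find("w:pPr/w:pStyle", NS)
        if style_node is None:
            continue  # unstyled paragraph: skipped in this sketch
        style = style_node.get("{%s}val" % W)
        if style not in style_map:
            continue  # style without a mapping: also skipped here
        content = []
        for run in para.findall("w:r", NS):
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            italic = run.find("w:rPr/w:i", NS) is not None
            content.append((text, ["i"] if italic else None, None))
        tuples.append((style_map[style], content, style))
    return tuples


# A tiny hand-written document.xml fragment used only for this example.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="ArticleTitle"/></w:pPr>'
    '<w:r><w:t>Larvae of </w:t></w:r>'
    '<w:r><w:rPr><w:i/></w:rPr><w:t>Epomis</w:t></w:r>'
    '</w:p></w:body></w:document>' % W
)

print(paragraphs_to_tuples(SAMPLE, STYLE_TO_JATS))
```
]]></preformat>
      <p>A real converter would also need to decide how to report unstyled or unmapped paragraphs rather than silently skipping them, since accurate styling is a requirement of Norm.</p>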
      
      <boxed-text>
        <p><bold>Representation of sample content</bold></p>
        <p>The internal representation of a sample content fragment for an article title, containing the JATS element name, the content and the Word style name, respectively. Note that this tuple may be embedded as content within other tuples.</p>
        <p><bold>Title</bold>: Color variability and body size of larvae of two <italic>Epomis</italic> species (Coleoptera, Carabidae) in Israel, with a key to the larval stages</p>
        <p><bold>Internal representation (tuple):</bold></p>
        <preformat>
      ( 'article-title',
        [ ('Color vari...of two', None, None),
          ('Epomis', ['i'], None),
          ('Coleoptera...stages', None, None) ],
        'ArticleTitle')</preformat>  
      </boxed-text>
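      <p>To make the tuple format concrete, the following sketch shows one way such a tuple could be turned into a JATS element with inline markup. It is an assumption-laden illustration rather than Norm's implementation: the function <monospace>render_tuple</monospace> and the <monospace>INLINE_TO_JATS</monospace> mapping are hypothetical names, and the short title text is invented for the example.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Assumed mapping from inline style codes to JATS inline elements.
INLINE_TO_JATS = {"i": "italic", "b": "bold"}


def render_tuple(entry):
    """Build a JATS element from one (element, content, style) tuple."""
    element_name, content, _word_style = entry
    node = ET.Element(element_name)
    last = None  # most recently appended inline child, if any
    for text, styles, _unused in content:
        if styles:
            # Nest one element per style code; the innermost holds the text.
            parent = node
            for code in styles:
                parent = ET.SubElement(parent, INLINE_TO_JATS[code])
            parent.text = text
            last = node[-1]
        elif last is None:
            # Plain text before any inline child belongs to the element itself.
            node.text = (node.text or "") + text
        else:
            # Plain text after an inline child is stored as that child's tail.
            last.tail = (last.tail or "") + text
    return node


title = (
    "article-title",
    [("Larvae of ", None, None), ("Epomis", ["i"], None), (" species", None, None)],
    "ArticleTitle",
)
print(ET.tostring(render_tuple(title), encoding="unicode"))
```
]]></preformat>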
      <p>Although Prepper calls Norm on behalf of the user, here is a sample command-line usage:</p>
      <preformat>
    $ python norm.py -w article.docx -o /location/of/output/</preformat>
      <p>Norm provides various command-line options, which are described in the following table:</p>
      <table-wrap>
        <table frame="hsides" rules="groups">
          <thead>
            <tr>
              <td colspan="1" rowspan="1">
                <bold>Short Argument</bold>
              </td>
              <td colspan="1" rowspan="1">
                <bold>Long Argument</bold>
              </td>
              <td colspan="1" rowspan="1">
                <bold>Purpose</bold>
              </td>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td colspan="1" rowspan="1">&#8209;h</td>
              <td colspan="1" rowspan="1">&#8209;&#8209;help</td>
              <td colspan="1" rowspan="1">Show help message and exit.</td>
            </tr>
            <tr>
              <td colspan="1" rowspan="1">&#8209;a</td>
              <td colspan="1" rowspan="1">&#8209;&#8209;archive_name</td>
              <td colspan="1" rowspan="1">Sets a custom name for the zip archive created by this script.</td>
            </tr>
            <tr>
              <td colspan="1" rowspan="1">&#8209;c</td>
              <td colspan="1" rowspan="1">&#8209;&#8209;cfg</td>
              <td colspan="1" rowspan="1">Specify the location of the config file that maps Word Styles to JATS Elements.  A default will be used if none is provided.</td>
            </tr>
            <tr>
              <td colspan="1" rowspan="1">&#8209;o</td>
              <td colspan="1" rowspan="1">&#8209;&#8209;out</td>
              <td colspan="1" rowspan="1">Set the directory for Norm's output. An XML file and a zip file containing the document assets will be created, using the name of the document.</td>
            </tr>
            <tr>
              <td colspan="1" rowspan="1">&#8209;v</td>
              <td colspan="1" rowspan="1">&#8209;&#8209;verbose</td>
              <td colspan="1" rowspan="1">Enable verbose output to aid debugging.</td>
            </tr>
            <tr>
              <td colspan="1" rowspan="1">&#8209;w</td>
              <td colspan="1" rowspan="1">&#8209;&#8209;word</td>
              <td colspan="1" rowspan="1">Specify the location of the Word document for Norm to process.</td>
            </tr>
            <tr>
              <td colspan="1" rowspan="1">&#8209;V</td>
              <td colspan="1" rowspan="1">&#8209;&#8209;version</td>
              <td colspan="1" rowspan="1">Get the version number of Norm.</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
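      <p>For readers who want to experiment with a similar interface, the options in the table could be declared with Python's standard <monospace>argparse</monospace> module roughly as follows. This is a hypothetical sketch, not Norm's actual argument handling: the function name <monospace>build_parser</monospace>, the help strings, and the placeholder version string are assumptions, and no option is marked as required because the table does not say which ones are.</p>
      <preformat><![CDATA[
```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options listed in the table."""
    parser = argparse.ArgumentParser(
        prog="norm.py", description="Convert a DOCX file to JATS XML.")
    parser.add_argument("-a", "--archive_name",
                        help="custom name for the zip archive of assets")
    parser.add_argument("-c", "--cfg",
                        help="config file mapping Word styles to JATS elements")
    parser.add_argument("-o", "--out",
                        help="directory for the XML and zip output")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable verbose output")
    parser.add_argument("-w", "--word", help="Word document to process")
    # The version string is a placeholder, not Norm's real version number.
    parser.add_argument("-V", "--version", action="version",
                        version="%(prog)s 0.0 (placeholder)")
    return parser  # -h/--help is added automatically by argparse


args = build_parser().parse_args(
    ["-w", "article.docx", "-o", "/location/of/output/"])
print(args.word, args.out, args.verbose)
```
]]></preformat>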
      <p>As previously discussed, Norm's configuration files drive the transformation logic, making Norm highly extensible and customizable. Norm's configuration files include:</p>
      <list list-type="bullet">
        <list-item><p>Mappings from Word styles to JATS elements</p></list-item>
        <list-item><p>The major section (<monospace>article-meta</monospace>, <monospace>body</monospace>, or <monospace>back</monospace>) in which a particular JATS element appears</p></list-item>
        <list-item><p>Which and how many children a particular element can have</p></list-item>
        <list-item><p>Attributes permitted by a particular element</p></list-item>
      </list>
      <p>The following is an excerpt of lines from a configuration relevant to the surname element, corresponding to the Word style 'AuthorSurname':</p>
      <preformat>
    [ FRONT ]
    AuthorSurname = surname

    [ FRONT-PARENTS ]
    surname = name
    name = contrib
    contrib = contrib-group
    contrib-group = article-meta
    article-meta = front

    [ CHILDRENLIMITS ]
    surname = 1
    name = 1
    contrib = Yes

    [ ATTRIBUTES ]
    AuthorSurname = contrib-type,author</preformat>
      <p>The JATS XML tree is recursively defined by the configuration. In the above example, note that each element in the FRONT-PARENTS section has a defined parent. Norm uses this information about the parents to generate the <monospace>front</monospace>, resulting in the following XML hierarchy:</p>
      <preformat>
        &lt;front&gt;
          &lt;article-meta&gt;
             &lt;contrib-group&gt;
               &lt;contrib contrib-type="author"&gt;
                 &lt;name&gt;
                   &lt;surname&gt;&lt;/surname&gt;</preformat>
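      <p>The recursive parent lookup can be illustrated with a short Python sketch. It assumes the FRONT-PARENTS entries have already been loaded into a dictionary and that they contain no cycles; the names <monospace>PARENTS</monospace>, <monospace>ancestor_path</monospace>, and <monospace>attach</monospace> are hypothetical, and the CHILDRENLIMITS and ATTRIBUTES rules are deliberately ignored, so existing ancestors are always reused.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Parent lookups copied from the FRONT-PARENTS excerpt.
PARENTS = {
    "surname": "name",
    "name": "contrib",
    "contrib": "contrib-group",
    "contrib-group": "article-meta",
    "article-meta": "front",
}


def ancestor_path(element, parents):
    """Return element names from the section root down to `element`."""
    path = [element]
    while path[-1] in parents:  # assumes the mapping has no cycles
        path.append(parents[path[-1]])
    return list(reversed(path))


def attach(root, element, parents, text=None):
    """Create `element` under `root`, creating missing ancestors on the way."""
    path = ancestor_path(element, parents)
    if path[0] != root.tag:
        raise ValueError("%s does not belong under %s" % (element, root.tag))
    node = root
    for name in path[1:-1]:
        child = node.find(name)
        if child is None:
            child = ET.SubElement(node, name)
        node = child
    leaf = ET.SubElement(node, path[-1])
    leaf.text = text
    return leaf


front = ET.Element("front")
attach(front, "surname", PARENTS, "Johnson")
print(ET.tostring(front, encoding="unicode"))
```
]]></preformat>
      <p>Honoring the cardinality rules would require an extra check before reusing an ancestor, for example starting a new <monospace>contrib</monospace> once the current one already holds a <monospace>name</monospace>.</p>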
      <p>While XSLT is a natural choice for many XML transformations, we developed Norm using a scripting language (Python) for three primary reasons. First, we wanted to make Norm extensible (e.g., able to transform content from additional file formats in the future): we predict that it will be easier to provide new functionality for Norm using configuration files than to create new XSLT files for each desired file format. Second, we wanted to allow users with minimal technical knowledge to customize Norm by providing additional styles: while there is a significant conceptual learning curve for these configuration files, we believe it will be less of a hurdle than editing (and maintaining) XSLT stylesheets. Third, a scripting language like Python provides the ability to extract content embedded in Word documents, such as images and video, whereas XSLT alone does not provide this capability.</p>
    </sec>
    <sec>
      <title>Future Plans for Norm and other mPach Components</title>
      <p>While Norm can already transform Word documents to JATS XML, we have begun developing support for the transformation of the OpenDocument ("ODF") format (used primarily by Apache OpenOffice and LibreOffice). Norm is designed to be extensible enough to allow the transformation of other structured document types, such as LaTeX, to JATS. At this time there are no plans to transform content from PDF, though there is promising work in this area (such as <ext-link ext-link-type="uri" xlink:href="http://www.shabash.net/merops/default.html">Merops</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://sourceforge.net/projects/pdf2xml/">pdf2xml</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://pdfx.cs.man.ac.uk/">pdfx</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://code.google.com/p/lapdftext/">LA-PDFText</ext-link>, and <ext-link ext-link-type="uri" xlink:href="http://grobid.no-ip.org/">GROBID</ext-link>) that might provide the foundation for future support.</p>
      <p>Norm, as well as Prepper and other mPach components, will be released as open-source software in the near future. Tentatively, the source will be available as a GitHub repository with an Apache 2.0 license, though details have not been finalized.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="E1">
        <mixed-citation>Welcome to the Shared Digital Future. HathiTrust Digital Library. http://www.hathitrust.org/about.</mixed-citation>
      </ref>
      <ref id="E2">
        <mixed-citation>See resources aggregated by the <ext-link ext-link-type="uri" xlink:href="http://www.librarypublishing.org/">Library Publishing Coalition</ext-link>.</mixed-citation>
      </ref>
      <ref id="E3">
        <mixed-citation>Design Principles and Requirements. mPach. http://www.lib.umich.edu/mpach/design-principles-and-requirements.</mixed-citation>
      </ref>
      <ref id="E4">
        <mixed-citation>Office Open XML. Wikipedia. http://en.wikipedia.org/wiki/Office_Open_XML.</mixed-citation>
      </ref>
      <ref id="E5">
        <mixed-citation>NISO Journal Article Tag Set (JATS) Version 1.0: Preview XSLT Stylesheets. https://github.com/NCBITools/JATSPreviewStylesheets.</mixed-citation>
      </ref>
      <ref id="E6">
        <mixed-citation>Recommended Practices for Online Supplemental Journal Article Materials: A Recommended Practice of the National Information Standards Organization and the National Federation of Advanced Information Services. January 2013. http://www.niso.org/publications/rp/rp-15-2013.</mixed-citation>
      </ref>
      <ref id="E7">
        <mixed-citation>Collections. HathiTrust Digital Library. http://babel.hathitrust.org/cgi/mb.</mixed-citation>
      </ref>
      <ref id="E8">
        <mixed-citation>HathiTrust Data API. HathiTrust Digital Library. http://www.hathitrust.org/data_api.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>
