TEI (XML in a Nutshell, 2nd Edition)

6.3. TEI

The Text Encoding Initiative (TEI, http://www.tei-c.org/) is an SGML application designed for the markup of classic literature, such as Virgil's Aeneid or the collected works of Thomas Jefferson. It's a prime example of a narrative-oriented DTD. Since TEI is designed for scholarly analysis of text rather than more casual reading or publishing, it includes elements not only for common document structures (chapter, scene, stanza, etc.) but also for typographical elements, grammatical structure, the position of illustrations on the page, and so forth. These aren't important to most readers, but they are important to TEI's intended audience of humanities scholars. For many academic purposes, one manuscript of the Aeneid is not necessarily the same as the next. Transcription errors and emendations made by various monks in the Middle Ages can be crucial.

As an SGML application, TEI uses several features of SGML that are not found in XML, including the & connector in content models and tag minimization. However, XML is clearly the wave of the future, so like most evolving SGML applications, TEI is moving toward XML. A light version of the TEI DTD, TEI Lite, is available for authors who prefer to work in pure XML. It's not exactly the same as the SGML version, but it's close enough for many practical uses.
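To make the difference concrete, here is a minimal sketch of how one content model might look in each syntax. The name, first, and last elements are hypothetical and are not taken from the TEI DTD; they only illustrate the two SGML features mentioned above.

```xml
<!-- SGML only: "- O" makes the end tag omissible (tag minimization),
     and "&" allows the two children to appear in either order:
     <!ELEMENT name - O (first & last)>
-->

<!-- A rough XML equivalent: there are no minimization flags, and both
     orders have to be spelled out as a choice between two sequences. -->
<!ELEMENT name ((first, last) | (last, first))>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
```

With only two children the rewrite is short, but the number of sequences grows quickly as more elements are joined by &, which is one reason an XML version of an SGML DTD is rarely an exact copy.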

Example 6-1 shows a fairly simple TEI Lite document that uses the XML version of the TEI DTD. The content comes from the book you're reading now. Although a complete TEI-encoded copy of this manuscript would be much longer, this simple example demonstrates the basic features of most TEI documents that represent books. (Besides prose, TEI can be used for plays, poems, missals, and essentially any written form of literature.)

Example 6-1. A TEI document

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE TEI.2 SYSTEM "xteilite.dtd">
<TEI.2>

  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>XML in a Nutshell</title>
        <author>Harold, Elliotte Rusty</author>
        <author>Means, W. Scott</author>
      </titleStmt>
      <publicationStmt><p></p></publicationStmt>
      <sourceDesc><p>Early manuscript draft</p></sourceDesc>
    </fileDesc>
  </teiHeader>

  <text id="HarXMLi">

    <front>
      <div type='toc'>
        <head>Table Of Contents</head>
        <list>
          <item>Introducing XML</item>
          <item>XML as a Document Format</item>
          <item>XML on the Web</item>
        </list>
      </div>

    </front>

    <body>

      <div1 type="chapter">
        <head>Introducing XML</head>
        <p></p>
      </div1>

      <div1 type="chapter">
        <head>XML as a Document Format</head>
        <p>
          XML is first and foremost a document format. It was always
          intended for web pages, books, scholarly articles, poems,
          short stories, reference manuals, tutorials, texts, legal
          pleadings, contracts, instruction sheets, and other documents
          that human beings would read. Its use as a syntax for computer
          data in applications like syndication, order processing,
          object serialization, database exchange and backup, electronic
          data interchange, and so forth is mostly a happy accident.
        </p>

        <div2 type="section">
          <head>SGML's Legacy</head>
          <p></p>
        </div2>

        <div2 type="section">
          <head>TEI</head>
          <p></p>
        </div2>

        <div2 type="section">
          <head>DocBook</head>
          <p>
            DocBook (<hi>http://www.docbook.org/</hi>) is an
            SGML application designed for new documents, not old ones.
            It's especially common in computer documentation. Several
            O'Reilly books have been written in DocBook including
            <bibl><author>Norm Walsh</author>'s <title>DocBook: The
            Definitive Guide</title></bibl>. Much of the <abbr
            expan='Linux Documentation Project'>LDP</abbr>
            (<hi>http://www.linuxdoc.org/</hi>) corpus is written in
            DocBook.
          </p>
        </div2>

      </div1>

      <div1 type="chapter">
        <head>XML on the Web</head>
        <p></p>
      </div1>

    </body>

    <back>
      <div1 type="index">
        <list>
          <head>INDEX</head>
          <item>SGML, 8, 9, 91, 92, 94</item>
          <item>DocBook, 97-101</item>
          <item>TEI, 94-97, 101</item>
          <item>Text Encoding Initiative, See TEI</item>
        </list>
      </div1>
    </back>

  </text>
</TEI.2>

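The root element of a TEI document is TEI.2. It contains two parts: a header, represented by a teiHeader element, and the main text of the document, represented by a text element. The header holds information about the document, such as its title, authors, and source. The text element is itself divided into three parts: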

Front matter in the front element: The preface, table of contents, dedication page, pictures of the cover, and so forth. Each of these is represented by a div element with a type attribute whose value identifies the division as a table of contents, preface, title page, and so forth. Each of these divisions contains other elements laying out the content of that division.
The body of the work in the body element: The individual chapters, acts, and so forth that make up the document. Each of these is represented by a div1 element with a type attribute that identifies this particular division as a volume, book, part, chapter, poem, act, and so forth. Each div1 element has a head child giving the title of the volume, book, part, chapter, etc.
Back matter in the back element: The index, glossary, etc.

The divisions may be further subdivided: div1s can contain div2s, div2s can contain div3s, div3s can contain div4s, and so on up to div7. However, for any given work, there is a smallest division. This division contains paragraphs represented by p elements for prose or stanzas represented by lg elements for poetry. Stanzas are further broken up into individual lines represented by l elements.
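Example 6-1 contains only prose, so it never uses lg or l. As a minimal sketch of the poetry case, the fragment below marks up one stanza as a division of a body element. The choice of poem (the opening of William Blake's "The Tyger") and the type values are illustrative assumptions, not part of the original example.

```xml
<div1 type="poem">
  <head>The Tyger</head>
  <!-- One lg element per stanza -->
  <lg type="stanza">
    <!-- One l element per line of verse -->
    <l>Tyger Tyger, burning bright,</l>
    <l>In the forests of the night;</l>
    <l>What immortal hand or eye,</l>
    <l>Could frame thy fearful symmetry?</l>
  </lg>
</div1>
```

Here the div1 is the smallest division, so it holds lg elements directly, in the same way that the chapters in Example 6-1 hold p elements.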

Both lines and paragraphs contain mixed content; that is, they contain plain text, parts of which may be marked up further by elements indicating that particular words or characters are people's names (name), corrections (corr), illegible passages (unclear), misspellings (sic), and so on.
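The following fragment sketches what such mixed content might look like. The sentence is an invented transcription written for this illustration, not a quotation from any real manuscript, and the type, corr, sic, and reason attributes are used on the assumption that the TEI Lite DTD declares them for these elements; check the DTD before relying on them.

```xml
<p>
  <!-- name identifies a person's name in running text -->
  <name type="person">Virgil</name> began the
  <!-- sic preserves the source's misspelling and suggests a fix -->
  <sic corr="Aeneid">Aenied</sic> around
  <!-- unclear marks a reading the transcriber is unsure of -->
  <unclear reason="faded ink">29</unclear> BC and left it
  <!-- corr gives the corrected text and records the original -->
  <corr sic="unfinnished">unfinished</corr> at his death.
</p>
```

The sic and corr elements are two ways of recording the same kind of information: sic keeps the source's reading as the element content, while corr keeps the editor's correction as the content. Which one to use depends on whether the transcription should read like the manuscript or like the corrected text.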