Cataloguing beyond the walls: APLA 1997

Metadata choices


Metadata really is nothing more than data about data; a catalog record is metadata; so is a TEI header, or any other form of description. We could call it cataloging, but for some people that term carries excess baggage, like Anglo-American Cataloging Rules and USMARC. (Caplan 1995)

For cataloguers, the path from the dictionary book catalogue of Charles Ammi Cutter at the beginning of the century, to the object-oriented integrated library system of the end of the century, has been a pretty linear one. Certainly, new information formats have been introduced and absorbed, as have demands for additional descriptive and subject access to items entered in the catalogue. The vehicle for communicating bibliographic descriptions has even changed in the last thirty years, with the introduction of data processing equipment and the MARC format, but the amount of time and effort put into a single catalogue record has remained fairly constant up until the last few years. The standards that we have developed over this time have been largely incremental, adding just what was needed to make them cover all the new bases in a logical and easy to apply manner.

A number of recent factors have militated against the continuation of this linear growth pattern in library technical standards and the processing they empower. These include the economics of library support and the resulting priority planning, the development of enhanced search capabilities within the OPAC, changes in user expectations of the catalogue, and of course, the phenomenal growth in resources available to the library selection process, particularly since 1990. Net publishing has thrown open the gates of creativity to the unwashed masses and effectively moved the qualitative control of the world's literary and factual output off the shoulders of publishers and onto those of the end user. Libraries are still trying to establish whether they have a role to play in all of this or not. I say we do, but only time will tell if we are up to the challenge.

The role of the library in the electronic era should ideally be the same as it ever was for print media: to select those works which are of particular interest to our clientele and to organize them for easy retrieval and use. We have already determined that resources of sufficient quality and interest exist on the network. Some new institutions of higher learning have even used the Internet as the justification for not erecting physical buildings at all. Does the Internet constitute a virtual library? This depends on how an institution views its library:

... the Net does not constitute a library in any interesting sense. It may be part of the technological infrastructure which could support a library, but it is not a library -- not in the sense of an institution which oversees and provides access to a collection. Moreover, the materials on the Net do not constitute a library collection -- not in the sense of a selection of items organized to serve a particular clientele. (Levy 1995)

This view of the library lays to rest the arguments that a "library" is made up of just those documents or objects which have been purchased or which are physically held within the confines of a building. Certainly these too may be part of the "infrastructure which could support a library", but the real library is defined in the value librarians (in the broadest sense) bring to the information universe through their selection, organization and interpretation functions.

Given the huge quantity of new electronic resources which have become available since 1993 and the fact that we are probably in a period of transition between information technologies, libraries will need to develop coping strategies to tide them over until the transition concludes. Libraries are still receiving print materials at close to the same rate as before, but are also expected to come up with a means of organizing electronic information with the same or lower staff levels. Surely, this is one of the defining characteristics of our times!

When new opportunities like this crop up, librarians, and particularly cataloguers, are always ready to call a meeting. We did this in Paris in 1961 and came up with the Paris Principles, leading into the first Anglo-American Cataloging Rules. Lately we have been meeting with greater frequency and seeming urgency to talk about metadata from both the inside (embedded within digital documents) and outside (described in a surrogate or catalogue file). A number of proposals have surfaced at these meetings which might offer a solution to our retrieval and use problems, but we will discuss only a few here, namely TEI headers, Dublin Core, and MARC. Each of these attacks the problem from a slightly different perspective: TEI from the position of the academic text encoder, Dublin Core from that of the actual information producer, and MARC from that of the library. Yet all offer some aid to each other through support for each other's vocabularies and what have become known euphemistically as "crosswalks".


TEI headers

The Text Encoding Initiative is "an international project to develop guidelines for the preparation and interchange of electronic texts for scholarly research, and to satisfy a broad range of uses by the language industries more generally." Sponsored by the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC), the Text Encoding Initiative, or TEI, uses Standard Generalized Markup Language (SGML) as the basis for both procedural (layout, typesetting) and descriptive (structure and content) markup of largely public domain electronic texts. SGML is probably best known amongst the masses these days as the genus for which HTML (HyperText Markup Language), the lingua franca of the Web, is the species. SGML has been chosen for most of the text digitizing projects for the simple reason that it is a far richer language than HTML, but also because much of the groundwork in the area of electronic text predates the widespread acceptance of HTML. Probably the most well-known participants in the Text Encoding Initiative are the Electronic Text Center at the University of Virginia, the Humanities Text Initiative at the University of Michigan, the Oxford Text Archive, and Inso Corporation, corporate producers of the American Heritage Dictionary and Columbia Encyclopedia.

All TEI-conformant texts follow fairly strict coding guidelines (we cataloguers like to call these "standards") using the Guidelines for electronic text encoding and interchange, often referred to as P3. While primarily aimed at standardizing encoding for the text itself, P3 also allows for the creation of the TEI header, a DTD or Document Type Definition, which documents "the bibliographic description of the text and its source, the encoding of the text, nonbibliographic information that characterizes the text, and a history of updates and changes made to the electronic text." (Giordano 1994) While the elements themselves are standardized, the contents of the elements can be written as free text or in any other manner that the encoder likes, including standard library collocation schemes like LCSH, the DDC and so on. The elements in a TEI header are detailed enough to enable their use in the creation of a traditional catalogue record, and this is probably due in part to the influence of library participation in this project. Here are some examples from Giordano:

<teiHeader>
  <fileDesc>
     <titleStmt>
        <title>Notebooks of a computer pioneer, Tom Kilburn; a machine-readable
	     transcription</title>
        <author>Tom Kilburn</author>
        <sponsor>National Archive for the History of Computing</sponsor>
        <funder>Simon Engineering Fund</funder>
        <principal>Martin Campbell-Kennedy</principal>
        <respStmt>
           <name>Jon Shapiro</name>
              <resp>data entry, scanning and proof correction</resp>
        </respStmt>
        <respStmt>
           <name>Carole Goble</name>
              <resp>created and maintained pre-SGML full text
              and image database</resp>
        </respStmt>
        <respStmt>
           <name>Anna Garry</name>
              <resp>converted full text database to TEI markup</resp>
        </respStmt>
     </titleStmt>
   </fileDesc>
</teiHeader>

   <publicationStmt>
      <publisher>Oxford University Press</publisher>
      <pubPlace>Oxford</pubPlace>
      <date>1989</date>
      <idno type=ISBN>0192547054</idno>
      <availability>To be distributed for purposes of teaching 
         and research only.</availability>
   </publicationStmt>

   <profileDesc>
      <creation>
         <date value='1989-08'>August 1989</date>
         <place>Brooklyn, New York</place>
      </creation>
      <langUsage>
         <language id=EN wsd=wsd.en>
         <language id=SP wsd=wsd.sp>
         <p>Approximately 95% of the text is in American
            English with quotations from first- and second-
            generation Italian immigrants to Brooklyn; the
            remainder is in transcribed Spanish spoken by
            first- and second-generation Puerto Rican
            immigrants to Brooklyn</p>
      </langUsage>
      <textClass>
         <keywords scheme=LCSH>
            <list><item>Brooklyn (New York, N.Y.)--Biography
               </item><item>Brooklyn (New York, N.Y.)--Social
               life and customs.</item>
            </list>
         </keywords>
         <classCode scheme=LC>F129.B7
         </classCode>
      </textClass>
   </profileDesc>
  

Note the level of detail allowed within the TEI header and the ability to specify elements which could prove useful later in generating an OPAC record with a bit of manipulation. Note too that the collocation function of the catalogue in using this information has been thwarted to some extent by the lack of control over fields such as <author> above, where the name is entered as free text ("Tom Kilburn") rather than in an authority-controlled form. Since TEI headers are created at the time of digitization and/or revision of an etext, the considerable work of adding this detail is paid for out of the digitization project funding itself and is basically archival/special collections level work. Obviously, the encoding agency does not need to specify all elements which are possible, only the ones of most interest to users of the documentation. Still, it is easy to see how this level of description and subject analysis could end up being slower and even more expensive than traditional cataloguing, even using templates to provide the markup elements. Since the Text Encoding Initiative was set up primarily to digitize extant texts already identified by libraries and vendors as having substantial value to their clients, the creation of the metadata to accompany these texts falls to those agencies as well. This being the case, there are unlikely to be any savings here in time and cost during the transition period alluded to earlier. Perhaps those savings can be realized using the Dublin Core?
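As a rough illustration of the "bit of manipulation" involved, here is a minimal Python sketch that pulls a draft description out of a title statement like the one in the first example. It assumes the SGML has first been normalized to well-formed XML (all tags closed), and the function name draft_record and the shape of the resulting dictionary are hypothetical, not part of TEI or of any library system.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the first example header, written as well-formed XML.
# Assumption: the SGML original has been normalized before parsing.
TEI_HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Notebooks of a computer pioneer, Tom Kilburn; a machine-readable
         transcription</title>
      <author>Tom Kilburn</author>
      <respStmt>
        <name>Jon Shapiro</name>
        <resp>data entry, scanning and proof correction</resp>
      </respStmt>
    </titleStmt>
  </fileDesc>
</teiHeader>
"""


def squeeze(text):
    """Collapse line breaks and runs of spaces left over from the markup."""
    return " ".join((text or "").split())


def draft_record(header_xml):
    """Return a draft description (hypothetical structure) from a TEI header."""
    root = ET.fromstring(header_xml)
    stmt = root.find("./fileDesc/titleStmt")
    if stmt is None:
        raise ValueError("header has no fileDesc/titleStmt")
    record = {
        "title": squeeze(stmt.findtext("title")),
        # Free text exactly as the encoder typed it; no authority control.
        "author": squeeze(stmt.findtext("author")),
        "notes": [],
    }
    for resp in stmt.findall("respStmt"):
        name = squeeze(resp.findtext("name"))
        role = squeeze(resp.findtext("resp"))
        record["notes"].append(f"{name}: {role}")
    return record


if __name__ == "__main__":
    print(draft_record(TEI_HEADER))
```

The author value comes through exactly as the encoder entered it, so a cataloguer would still have to reconcile it with an authority file before the record could collocate with other works by the same person.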


Dublin Core

In March 1995, OCLC and the National Center for Supercomputing Applications (NCSA) brought together fifty-two librarians, archivists, scholars, and systems experts for a workshop in Dublin, Ohio. The aim of this meeting was to establish a core set of elements which could be used by the authors of the electronic information resources themselves to describe their own works using standardized metadata elements embedded into the HTML texts being created. Given the quantity of data available on the Internet and the time required for the creation of full catalogue records or TEI headers, the sponsors of the conference saw metadata created by information providers themselves as the solution to the logistics problem of who had the time and motivation to create useful metadata. The outcome of this workshop was the Dublin Core Metadata Element Set, since shortened to just "Dublin Core".

Dublin Core is yet another DTD which originally allowed for thirteen elements to be added to an HTML file describing the document contained therein (Weibel 1995). These were:

Two additions were made to this list in 1996. A complete list of elements and their definitions may be found at the Dublin Core Metadata element set Reference Description site at OCLC. The new elements were:

As an example of how these may be used, here is the Dublin Core data for the electronic version of Maya Angelou's poem "On the pulse of morning":

Another example can be found on the LC server.

As was true for TEI headers, no particular syntax is prescribed for the text to be supplied for these elements. One of the virtues of the Dublin Core model is that authors are not expected to provide the level of detail we have come to associate with full cataloguing, but only to provide data which could then form the basis for additional work by the libraries that eventually select the work. Obviously, translating the supplied data into usable metadata within the library would require one of two things. Either the library would have to archive the object locally so that it could edit the metadata within its container, or it would have to extract the metadata as a surrogate. With this end in mind, the Library of Congress has created the Dublin Core/MARC/GILS Crosswalk, a translation table to enable Dublin Core data to map to two standard DTDs for bibliographic data and back again.
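The second option, extracting the embedded metadata as a surrogate and passing it through a crosswalk, might be sketched in Python as follows. Several things here are assumptions made for illustration: that the elements are carried in HTML META tags with names prefixed "DC.", that the sample page and its values are invented, and that the small DC_TO_MARC table stands in for the real crosswalk, whose actual tag assignments should be taken from the Library of Congress document itself.

```python
from html.parser import HTMLParser

# Illustrative mapping only (assumption). The authoritative assignments are
# those in the published Dublin Core/MARC/GILS Crosswalk.
DC_TO_MARC = {
    "title": ("245", "a"),
    "subject": ("653", "a"),
    "publisher": ("260", "b"),
    "date": ("260", "c"),
}

# Hypothetical page; the META NAME="DC.x" convention is assumed here.
SAMPLE_PAGE = """<html><head>
<title>Sample</title>
<META NAME="DC.title" CONTENT="A sample electronic text">
<META NAME="DC.author" CONTENT="A. N. Author">
<META NAME="DC.subject" CONTENT="Cataloguing">
</head><body></body></html>"""


class DCMetaParser(HTMLParser):
    """Collect (element, value) pairs from META tags whose name starts 'DC.'."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content")
        if name.startswith("dc.") and content:
            self.elements.append((name[3:], content))


def extract_surrogate(html):
    """Return (mapped MARC-style fields, elements left for human review)."""
    parser = DCMetaParser()
    parser.feed(html)
    fields, unmapped = [], []
    for element, value in parser.elements:
        if element in DC_TO_MARC:
            tag, code = DC_TO_MARC[element]
            fields.append((tag, code, value))
        else:
            unmapped.append((element, value))
    return fields, unmapped


if __name__ == "__main__":
    print(extract_surrogate(SAMPLE_PAGE))
```

Elements with no entry in the table are returned separately rather than dropped, on the view that author-supplied data is a starting point for the cataloguer's own work and not a finished record.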

As with any standards which organizations create for those outside of the organization itself, Dublin Core suffers from the problem of finding adherents in the real world. While HTML coders are forced to learn at least some of the standards of the language just to enable the proper functioning of their Web documents, there is nothing to force them to create their own metadata. Authors do embed keywords in their documents now, in the hope that one of the Web robots adding sites to search engines like Excite or Maple Square will encounter these and provide access to potential users. But there is no Web equivalent to typical library metadata schemes like AACR2, MARC, or MeSH to guide them, and no professional body like publishers or professional associations to enforce adherence. So where does this leave us?


MARC

While originally developed by the Library of Congress as a means of communicating the bibliographic information needed for the generation of catalogue cards, MARC has since grown to a fairly complex metadata scheme capable, however clumsily, of describing most of the known information universe. While not a single standard (USMARC seems to be emerging as the de facto standard, but UNIMARC as well as UKMARC and other national schemes all have their adherents), MARC has in its favour a rich descriptive language, substantial installed base, and institutions (national libraries like LC and NLC) and committees (MARBI) committed to its maintenance.

MARC's problems, as viewed by Gaynor, are its "inability to structure analytical, non-bibliographic information that can be used to evaluate electronic documents" and its "inability to cope with hierarchically structured information and to provide access to complex collections through descending levels of analysis." These complaints are less about inherent shortcomings of the format than the desire of those maintaining it to open up the standard to purposes beyond those of creating surrogate records for objects of interest to the maintainers of MARC. It takes a long time to establish a standard and a long time to change it. One case where this was not so was in the creation of field 856 for Electronic location and access. This field enables the creator of MARC records (the cataloguer) to embed location information into a surrogate, thus allowing OPAC vendors to program their systems for retrieval of the information container as well as its surrogate. This, in effect, circumvents Gaynor's second problem noted above. Even if the metadata is not embedded in the same file as the object it describes (as in the TEI and Dublin Core models), this becomes transparent to the user, who can now be taken directly to the information they seek. This is now possible in WWW-enabled OPACs like Sirsi's WebCat, DRA's DRAWeb, and Innopac's WebPAC. These products all read MARC records from their system database and render them to the user's Web browser as HTML. 856|u, which contains an Internet resource's URL, is thus made into a hypertext link.
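The rendering step described above can be sketched in a few lines of Python. This is not how any of the named products is implemented; the in-memory record layout (a list of tag and subfield pairs) and the function name render_record are assumptions chosen to keep the example short.

```python
from html import escape

# Hypothetical in-memory record: a list of (tag, {subfield code: value}).
RECORD = [
    ("245", {"a": "Metadata choices"}),
    ("856", {"u": "http://www.mun.ca/library/cat/catnet/metadata.htm"}),
]


def render_record(fields):
    """Render fields as HTML lines, turning 856 subfield u into a link."""
    lines = []
    for tag, subfields in fields:
        if tag == "856" and "u" in subfields:
            url = subfields["u"]
            # Escape the URL both as an attribute value and as link text.
            lines.append(
                f'<a href="{escape(url, quote=True)}">{escape(url)}</a>'
            )
        else:
            text = " ".join(subfields.values())
            lines.append(f"{escape(tag)} {escape(text)}")
    return "<br>\n".join(lines)


if __name__ == "__main__":
    print(render_record(RECORD))
```

Everything other than 856 is shown as plain text, so the surrogate stays a MARC record in the database and only its display changes.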

Of course the next step in all of this is to discard the MARC format in favour of a more open standard using a more generic markup language such as HTML, GILS, or SGML. Many groups who have had difficulty effecting change within the current MARC structure, such as archivists, museologists, data librarians and the like, are proposing just that. While not denying the usefulness of the MARC format in creating usable finding aids, the Digital Image Access Project at Columbia University suggested that "SGML Catalog Records" (SCR) be created as intermediaries between MARC surrogates and the documents they describe. MARC records would point to the SCR as secondary vessels of descriptive metadata embedded in documents themselves. The Berkeley Finding Aids project, composed primarily of electronic archive and manuscript people, came to similar conclusions in their deliberations. This split in the level of detail expected by various user populations has dogged the MARC format for years with print information, so its carryover to the electronic arena should come as no surprise. That doesn't mean that MARC is insufficient to the task of cataloguing the Internet, just that it may be insufficient at the level of detail required by some users.





URL: http://www.mun.ca/library/cat/catnet/metadata.htm
Last revised: 22-May-1997 18:20 NST
Document author: Charley Pennell