INRIA
This paper reports on the discussion about metadata that arose during the WG6 meeting on 3 December 1996. It also briefly presents some of the worldwide working groups on metadata, their results and recommendations. Finally, it describes two possible scenarios for the creation of metadata for the Aquarelle folders, and proposes in the Annex a DTD for describing Folder Profiles (i.e. folder metadata).
Metadata is a description of objects, documents or services, which may contain data about their format and content. It has been used for many years by librarians in the form of catalog records for print publications, abstracts and keywords, or index databases. It is also widely used in archives, in museums and for document management in companies.
With the explosion of the WWW, the creation and publication of metadata about digital information has been recognized as being of major importance for improving the precision of document retrieval in a distributed environment. Beyond this acknowledgment, there is a need for more precise agreement on which metadata models and description languages could be used. It is very unlikely that a unique set of metadata could ever be agreed upon, but some common understanding and usage can hopefully be reached (see section 3).
Metadata may be part of the resources themselves or kept separately from them. As far as digital documents are concerned, it could be argued that meta-information for several purposes could be extracted directly from the documents (especially from SGML documents). Even for individual documents this approach would not be appropriate, since it would require the author to provide all the required information; even supposing that this information could be foreseen, some of it may be unknown to the author (publisher, storage location of the document, etc.). Moreover, some objects such as collections, images and sounds will still require a metadata description outside the object itself.
In the following we describe the set of metadata that has been identified for describing the folders (hereinafter called the folder profile) and the discussion about how to describe and provide it.
We then give a short presentation of the aims, results and recommendations of two worldwide metadata working groups, known as the Dublin Workshop and the Warwick Workshop, as well as a more semantic approach from Apple.
Finally we propose an SGML syntax for the folder profiles, as well as two possible scenarios for generating this metadata.
The basis of the discussion was the Folder Profile description provided by ICS-FORTH at the meeting. Recall that folders are SGML documents which are managed by so-called Folder servers. Documents can be searched and retrieved using the folder description, or using a more advanced query language involving the document structure (queries using the DTD) and full-text content.
We quickly agreed on identifying three different types of metadata:
As Aquarelle is already strongly SGML-based, it seems natural to describe the metadata using SGML as well, at least as a virtual interface between the Folder Server and the external world. The metadata could easily be displayed as an SGML document, and the extra information that has to be provided could be entered using Grif, the SGML editor which is already part of the system. [1]
More precise scenarios for generating the metadata are proposed and discussed in section 4.
Before addressing the issue of generating metadata, we must be clear about which metadata and which format model to use (SGML being only the selected syntax), and make sure that our choices are compatible with ongoing international initiatives.
Metadata has already attracted a lot of interest and work from the international community. Two workshops have been organized to foster "...a common understanding of the problems and potential solutions ... and promote a consensus on a core set of metadata elements to describe networked resources". The first workshop was organized by OCLC/NCSA in March 1995 in Dublin (Ohio). Its result was a simple resource description record, widely known as the Dublin Core set. [2]
The second workshop [1], organized by UKOLN and OCLC in Warwick in April 1996, was intended to broaden the scope of the first meeting and to identify implementation strategies. It was attended by a mix of computer science, text markup, and library professionals. The focus of the workshop very soon turned to the issue of extensibility, in order to support richer descriptions and linkage to other description models.
Another initiative, from Apple [3], concentrates more on the representation of content using a knowledge representation language in the spirit of CycL or KIF, rather than a markup language. These languages are best suited to the classification of documents. The intention is to extract the description from the document content rather than to use any external description.
All these initiatives have to be carefully considered, especially if we expect Aquarelle to be accessed from the Web or through a Web-based intranet. However, it is obvious that Aquarelle has an immediate need for a more specific metadata description than, say, the Dublin Core set. Yet the Folder Profile should include the Dublin Core set as a minimum, or make it possible to export such a description if required.
In fact, metadata should be regarded as an a posteriori external view upon documents and collections of documents, provided when publishing in a specific context. It should be possible to describe different metadata sets, adapting to the content or format model as required. This approach is developed in scenario 2 in the next section.
We agreed upon exporting the Folder profile as a metadata description using SGML syntax. A DTD for that description is proposed in the Annex. We have added a couple of fields that are part of the Dublin Core and favored a structure that allows compatibility with other descriptions (optional fields), further extensions (using lists rather than fixed ordered elements), and richer descriptions (repeatable fields).
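To illustrate what exporting a Dublin Core view of a Folder profile might look like, here is a minimal Python sketch. The correspondence table is our own assumption, read off the element names and attribute values of the DTD proposed in the Annex; it is not an agreed mapping, and the sample values are invented. Fields are kept as a list of pairs so that repeatable fields survive the export.

```python
# Assumed correspondence between (Folder profile element, attribute value)
# and Dublin Core element names. This table is illustrative only.
DUBLIN_CORE_MAP = {
    ("TITLE", None): "Title",
    ("AUTHOR", "main"): "Author",
    ("AUTHOR", "alpha"): "Author",
    ("AUTHOR", "publish"): "Publisher",
    ("SUBJECT", None): "Subject",
    ("TYPE", "GENRE"): "ObjectType",
    ("TYPE", "LANG"): "Language",
    ("TYPE", "FORM"): "Form",
}


def to_dublin_core(fields):
    """Convert (element, qualifier, value) triples to (dc_element, value) pairs.

    Fields with no assumed Dublin Core counterpart (e.g. SIZE) are skipped.
    Order and repetition of the input fields are preserved.
    """
    record = []
    for element, qualifier, value in fields:
        dc_element = DUBLIN_CORE_MAP.get((element, qualifier))
        if dc_element is not None:
            record.append((dc_element, value))
    return record


# Invented sample profile fields.
profile_fields = [
    ("TITLE", None, "Report on the WG6 meeting"),
    ("AUTHOR", "main", "A. Author"),
    ("AUTHOR", "publish", "INRIA"),
    ("SIZE", None, "12 pages"),
    ("TYPE", "LANG", "English"),
]

dublin_core_record = to_dublin_core(profile_fields)
for dc_element, value in dublin_core_record:
    print(f"{dc_element}: {value}")
```

Keeping the mapping in a table rather than in code means that another metadata set could be exported by supplying a different table.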
This section proposes two scenarios for the generation of the virtual metadata description. In scenario 1, the server outputs the metadata in the appropriate SGML format, from internal and external data for the document. In scenario 2, an extended SGML description defines the way metadata has to be calculated or directly input.
In this approach the Folder server will generate the Folder profile using its internal data structures and generating programs as shown in Figure 1. Data are of two types:
The generator will extract other metadata straight from the Folder SGML source (metadata extractor) or will calculate it from the document (metadata calculator); it will then generate a virtual SGML document, conforming to the Aquarelle Folder_Profile DTD, which includes all the required metadata.
Figure 1 - Generating Folder Profile: Scenario 1
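As an illustration only, the generator of scenario 1 could be sketched in Python as follows. The folder source, its element names other than folder_title, the identifier, the location and the size metric are all hypothetical, since the Folder DTD is not reproduced in this paper; the sample is also assumed to be well-formed enough to be read by an XML parser, which a real SGML folder need not be.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical folder source; the real Folder DTD is not given in this paper.
FOLDER_SOURCE = """<folder>
  <header>
    <folder_title>Romanesque churches</folder_title>
    <author>A. Author</author>
    <author>B. Author</author>
  </header>
  <body>Some folder text.</body>
</folder>"""


def extract_metadata(folder):
    """Metadata extractor: values read straight from the folder source."""
    header = folder.find("header")
    return {
        "title": header.findtext("folder_title"),
        "authors": [author.text for author in header.findall("author")],
    }


def calculate_metadata(source):
    """Metadata calculator: values computed from the document (here, its size)."""
    return {"size": f"{len(source.encode('utf-8'))} bytes"}


def generate_profile(source, fold_id, loc):
    """Combine server-held data (fold_id, loc) with extracted and calculated
    metadata into a virtual document using Folder_Profile element names."""
    folder = ET.fromstring(source)
    meta = {**extract_metadata(folder), **calculate_metadata(source)}
    lines = [
        "<Folder_Profile>",
        f"<FOLD-ID>{escape(fold_id)}</FOLD-ID>",
        f"<LOC>{escape(loc)}</LOC>",
        f"<FIELD><TITLE>{escape(meta['title'])}</TITLE></FIELD>",
    ]
    lines += [f"<FIELD><AUTHOR>{escape(a)}</AUTHOR></FIELD>" for a in meta["authors"]]
    lines += [f"<FIELD><SIZE>{meta['size']}</SIZE></FIELD>", "</Folder_Profile>"]
    return "\n".join(lines)


# Hypothetical identifier and location held by the server.
profile = generate_profile(FOLDER_SOURCE, "F-001", "http://example.org/folders/F-001")
print(profile)
```

In this sketch the mapping from folder elements to profile elements is hard-wired into the generating program, which is the main difference from scenario 2.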
Advantages:
Shortcomings:
The second approach is to start with the SGML document to be produced, in the required DTD. All the parts that have to be entered externally by a human can be edited straight into this document, using an SGML editor. The other metadata, e.g. the items that can be calculated by the server, will be specified within the SGML document using an SGML Processing Instruction (starting with <?), which will specify how to build the specific elements for a folder referred to by the variable &Folder.
More precisely:
Ex.1:
<title>
<?content = &folder.header.folder_title> </title>
This description makes the connection between a "folder_title" in the Folder DTD and a "title" in the metadata description DTD. Interpretation of the <?content ...> instruction will issue a query to the database or will run a generic extraction program.
Ex.2:
<?content = (<author> &folder..author </author>)* >
When there is no ambiguity in the DTD, an incomplete path can be provided using the ".." notation instead of ".".
The "*" character indicates that the content has to be repeated for each element of the list resulting from the path or, more generally, from the processing.
Ex.3:
<in-link> <?content = get_in_link(&Folder) > </in-link>
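The path notation of Ex.1 and the repetition of Ex.2 could be interpreted along the following lines. This is a minimal Python sketch under our own assumptions: "." is read as a step to a direct child and ".." as a step to a descendant at any depth, the function names resolve and repeat are invented for the illustration, and the sample folder (again assumed to be readable by an XML parser) is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical folder source used only for this illustration.
FOLDER_SOURCE = """<folder>
  <header>
    <folder_title>Romanesque churches</folder_title>
    <author>A. Author</author>
    <author>B. Author</author>
  </header>
</folder>"""

# One path step: an optional "." or ".." separator followed by a name.
STEP = re.compile(r"(\.{1,2})?([A-Za-z_][\w-]*)")


def resolve(path, bindings):
    """Return the text of every element reached by a path such as
    "&folder.header.folder_title" or "&folder..author"."""
    steps = STEP.findall(path.lstrip("&"))
    variable = steps[0][1].lower()  # the first name is the variable
    nodes = [bindings[variable]]
    for separator, name in steps[1:]:
        # ".." searches all descendants, "." only the direct children.
        pattern = f".//{name}" if separator == ".." else name
        nodes = [child for node in nodes for child in node.findall(pattern)]
    return [node.text for node in nodes]


def repeat(tag, path, bindings):
    """Interpret (<tag> path </tag>)*: one element per value of the path."""
    return "".join(f"<{tag}> {value} </{tag}>" for value in resolve(path, bindings))


bindings = {"folder": ET.fromstring(FOLDER_SOURCE)}
title_values = resolve("&folder.header.folder_title", bindings)
author_elements = repeat("author", "&folder..author", bindings)
print(title_values)
print(author_elements)
```

A path that matches nothing yields an empty list, so the repeated content simply disappears from the generated profile; whether that is the desired behavior for mandatory fields is left open here.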
This Folder profile specification will be interpreted to produce the profile according to the same DTD, as shown in Figure 2.
Figure 2 - Specifying metadata: Scenario 2.
The Folder profile specification is thus a declarative and constructive prescription of the SGML metadata description. The specification is itself an extended SGML document that can be stored as a document.
A full description has the following format:
<!DOCTYPE Folder Profile "Fold-Prof.dtd">
<Fold-Prof>
<Fold-ID> <?content = get_folder_id(&folder) > </Fold-ID>
<Loc> "http://WWW.inria.fr/Aquarelle-server/" </Loc>
<Size> <?content = get_folder_size(&folder) > </Size>
<Title> <?content = &Title= Query("Aquarelle_server","Folder_profile", "folder.header.Folder_title") > [2] </Title>
<Subject> This document reports on the Aquarelle WG6 meeting on 3 December 1996. </Subject>
<Subject> The document proposes two scenarios for the production of Folder Profile (Document Metadata). </Subject>
etc.
</Fold-Prof>
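To make the interpretation step more concrete, the following Python sketch expands the simplest kind of instruction, a function call on a folder variable, and leaves hand-entered content untouched. It is only an illustration: the bodies of get_folder_id and get_folder_size, the dictionary standing in for a folder and the size metric are assumptions, and the Query form and the repeated form of Ex.2 are deliberately not handled.

```python
import re

# Hypothetical server-side functions; a real Folder server would consult
# its own data structures instead of a plain dictionary.
def get_folder_id(folder):
    return folder["id"]


def get_folder_size(folder):
    return f"{len(folder['source'])} characters"


FUNCTIONS = {"get_folder_id": get_folder_id, "get_folder_size": get_folder_size}

# Matches instructions of the form <?content = name(&variable) >.
PI_CALL = re.compile(r"<\?content\s*=\s*(\w+)\(\s*&(\w+)\s*\)\s*>")


def interpret(specification, bindings):
    """Replace each function-call instruction by the value it computes."""
    def replace(match):
        name, variable = match.group(1), match.group(2).lower()
        if name not in FUNCTIONS:
            return match.group(0)  # unknown function: leave the instruction as is
        return str(FUNCTIONS[name](bindings[variable]))

    return PI_CALL.sub(replace, specification)


SPECIFICATION = """<Fold-Prof>
<Fold-ID> <?content = get_folder_id(&folder) > </Fold-ID>
<Size> <?content = get_folder_size(&folder) > </Size>
<Subject> Entered by hand in an SGML editor. </Subject>
</Fold-Prof>"""

# Invented sample folder.
folder = {"id": "F-001", "source": "<folder>text</folder>"}
profile = interpret(SPECIFICATION, {"folder": folder})
print(profile)
```

Because the specification and the resulting profile share the same element structure, the output can be checked against the same DTD as the specification once all instructions have been expanded.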
Advantages:
Shortcomings:
We think that the second approach emphasizes the idea of multiple metadata descriptions as external views upon the documents that are provided for publishing in various contexts.
Main recommendations:
this DTD must conform to the Warwick recommendations for extensibility and for some syntactic forms (optional and repeatable fields)
Optional:
using an extended SGML description to produce a declarative description together with the way of automatically generating some of the fields.
We have mostly commented on the elements and attributes that were not part of the ICS-FORTH proposal and which have been added for compatibility with the Dublin Core.
<!-- DTD Fold-Prof -->
<!ENTITY % doctype "Folder_Profile" >
<!ELEMENT %doctype; - - (FOLD-ID, LOC, FIELD*, KEYW?, CATG?, RIGHTS?, VERSION?, REVISION?, STATS?) >
<!ELEMENT FOLD-ID - O (#PCDATA) >
<!ELEMENT LOC - O (#PCDATA) -- Location -->
<!ELEMENT FIELD - - (TITLE | AUTHOR | SIZE | EVENT | TYPE | SUBJECT | CONTENT | COMMENT | RELATION) >
<!ELEMENT TITLE - O (#PCDATA) -- repeatable -->
<!ELEMENT AUTHOR - O (#PCDATA)* -- repeatable -->
<!ATTLIST AUTHOR Role (main, alpha, publish) alpha
    -- main: the main author
       alpha: by alphabetic order
       publish: the publisher
       Agent: other contributors -->
<!ELEMENT SIZE - O (#PCDATA) -- various metrics -->
<!ELEMENT TYPE - O (#PCDATA) -- repeatable, can be more general than just the DTD -->
<!ATTLIST TYPE Case (DTD, GENRE, LANG, FORM) DTD
    -- DTD: the name of the DTD must be given
       GENRE: such as home page, novel, poem, report
       LANG: Language of the document
       FORM: such as text/html, ASCII, Postscript -->
<!ELEMENT SUBJECT - O (#PCDATA) -- repeatable -->