A Virtual Document Approach for Reusing SGML/XML Information Objects

Brendan Hills, F. Paradis, A.M. Vercoustre
CSIRO-DMIS, Australia and INRIA-Rocquencourt, France


The importance of reusing information is well understood in electronic publishing, and is one of the motivations for the development and use of SGML. Reuse is actually quite hard to achieve with SGML, as the elements are strongly typed and there can be incompatibilities between DTDs. HTML, an SGML derivative, relaxes those constraints, but unfortunately it does not provide a significant level of structure for identifying and extracting information, since its tags are mostly used for presentation. XML, another SGML derivative, is a promising alternative which could bring the power of SGML to the Web while keeping the simplicity of HTML. Those standards all have particularities which must be addressed in a global solution to the reuse of information.

We present our solution to the reuse of SGML information objects: a system that can dynamically combine information from various sources, including databases and SGML-like documents, to produce a virtual document. This allows an author to reuse information in a document-centric, descriptive way. We maintain support for the particularities of the data sources by having them stored in different formats and accessed in their own native query languages, but also support the integration of these information objects by converting them into a common, tree-like data structure and by providing a language to extract and transform information in those trees.

In this approach, a collection of SGML documents can be stored in an object-oriented database as a tree-like hierarchy of information objects, thus taking advantage of the strict typing of SGML to provide efficient storage and retrieval. By extending the standard query language of the object-oriented database, we can query with incomplete or partial knowledge of the document structure whilst retaining the search efficiency that the database engine provides. Combining the results with other databases or data sources, and including them in the SGML virtual document, is handled by our tree language. HTML and XML documents do not always conform to a DTD and, if they come from the Web, are volatile and fast-changing in nature. In this case we propose to access those documents through the standard file system or the HTTP protocol, to convert them to our tree-like data structures on-line, and to use our tree language for both extraction and transformation, possibly with some specific instructions to handle links.
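
As an illustration only, the following Python sketch shows the general idea of converting an HTML page that follows no DTD into a tree-like structure and then selecting nodes with only partial knowledge of the structure. It is not the tree language or the query extension described above; the names Node, TreeBuilder, to_tree and select, and the sample page, are hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    """One node of the common tree: an element name, its text, its children."""
    name: str
    text: str = ""
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML without requiring a DTD (hypothetical helper)."""

    # Elements that never have content, so they are not pushed on the stack.
    VOID = {"br", "hr", "img", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].text += data.strip()


def to_tree(html):
    """Convert an HTML string into the common tree structure."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def descendants(node, name):
    """Yield every descendant of node with the given element name, in document order."""
    for child in node.children:
        if child.name == name:
            yield child
        yield from descendants(child, name)


def select(root, *names):
    """Partial path: each name may match at any depth below the previous match."""
    current = [root]
    for name in names:
        seen = {}
        for node in current:
            for hit in descendants(node, name):
                seen.setdefault(id(hit), hit)  # avoid duplicates from nested matches
        current = list(seen.values())
    return current


# Sample page invented for this sketch.
page = ("<html><body><div><h2>Projects</h2>"
        "<ul><li>RIO</li><li>TREC</li></ul></div></body></html>")
tree = to_tree(page)
items = [n.text for n in select(tree, "ul", "li")]
```

Here select(tree, "ul", "li") finds the list items without the caller knowing that the list sits inside html, body and div, which is the sense in which the query relies on partial structural knowledge.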

The system is currently being implemented. Our prototype application, a document that generates activity reports, reuses an SGML database, a collection of HTML pages, and an SQL database, and illustrates the flexibility and power of our tool for information reuse.


Brendan Hills is a member of the Electronic Documents and Commerce Portfolio at CSIRO Mathematical and Information Sciences in Melbourne, where he is working on the “Reuse of Information Objects” project (RIO). He has been working with SGML, HTML, and document management systems since 1994 and presented some of CSIRO’s work in this area at last year’s SGML Asia Pacific Conference. His interests are in object-oriented databases, object-oriented programming, and in bringing a cognitive science approach to designing document systems.
Francois Paradis is currently doing a post-doctorate at CSIRO Mathematical and Information Sciences, Melbourne, in the Text-Based Information Management group, where he is involved in the RIO project and in the TREC Information Retrieval experiments. His prior work with SGML includes the definition of an indexing model for structured documents, and an application of this model to TEI, which was part of his Ph.D. completed in 1996 at CLIPS-IMAG, Grenoble.
Anne-Marie Vercoustre is a Research Director at INRIA, France, where she has been involved in Syntax based Programming Environments, Structured Documents, and Hypertext for more than 10 years. She participated in the design of Symposia, the W3C WYSIWYG editor connected to the Web, based on the SGML Grif Editor (Grif.SA). She is currently involved in databases for structured and semi-structured data, with a focus on reusing SGML, XML and HTML documents through virtual documents. In 1996-1997, she spent 16 months at CSIRO-DMIS (Melbourne), as the leader of the Text-based Information Management Research Group, and directed the project on Reuse of Information Objects (RIO).