A Virtual Document Approach for Reusing SGML/XML Information Objects
Brendan Hills, F. Paradis, A.M. Vercoustre
CSIRO-DMIS, Australia and INRIA-Rocquencourt, France
The importance of reusing information is well understood in electronic publishing, and is one of the
motivations for the development and use of SGML. Reuse is actually quite hard to achieve with
SGML, as the elements are strongly typed and there can be incompatibilities between the DTDs.
HTML, an SGML derivative, relaxes those constraints, but unfortunately it does not provide a
significant level of structure for identifying and extracting information, since the tags are mostly used
for presentation. XML, another SGML derivative, is a promising alternative which could bring the
power of SGML to the Web while keeping the simplicity of HTML. Those standards all have
particularities which must be addressed in a global solution to the reuse of information.
We present our solution to the reuse of SGML information objects: a system that can dynamically
combine information from various sources, including databases and SGML-like documents, to produce
a virtual document, which allows an author to reuse information in a document-centric, descriptive way.
We maintain support for the particularities of the data sources, by having them stored in different
formats and accessed in their own native query language, but also support the integration of these
information objects by converting them into a common, tree-like data structure, and by providing a
language to extract and transform information in those trees.
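The idea of a common tree-like structure can be illustrated with a minimal sketch. It is not the system's actual data model or tree language: the `Node` class, the `from_xml` and `from_rows` helpers, and the sample data are all hypothetical, and Python's standard `sqlite3` and `xml.etree.ElementTree` modules merely stand in for the real data sources. Each source is read through its own native interface, converted to the same kind of tree, and then queried uniformly.

```python
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical common tree node: a label, optional text, and children."""
    label: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)

    def descendants(self, label):
        """Yield every node below this one with the given label, at any depth."""
        for child in self.children:
            if child.label == label:
                yield child
            yield from child.descendants(label)


def from_xml(source: str) -> Node:
    """Convert an XML string into the common tree (assumed mapping)."""
    def convert(elem):
        text = (elem.text or "").strip()
        return Node(elem.tag, text, [convert(c) for c in elem])
    return convert(ET.fromstring(source))


def from_rows(table: str, columns, rows) -> Node:
    """Convert relational query results into the common tree (assumed mapping)."""
    return Node(table, children=[
        Node("row", children=[Node(c, str(v)) for c, v in zip(columns, row)])
        for row in rows
    ])


# Source 1: a structured document, parsed from its markup.
xml_tree = from_xml(
    "<report><section><title>Projects</title><para>RIO</para></section></report>"
)

# Source 2: a relational database, queried in its native language (SQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, project TEXT)")
conn.execute("INSERT INTO staff VALUES ('Example Person', 'RIO')")
cursor = conn.execute("SELECT name, project FROM staff")
columns = [d[0] for d in cursor.description]
db_tree = from_rows("staff", columns, cursor.fetchall())

# Both sources now live in one tree and are extracted the same way.
virtual = Node("virtual-document", children=[xml_tree, db_tree])
titles = [n.text for n in virtual.descendants("title")]
names = [n.text for n in virtual.descendants("name")]
```

Because `descendants` matches a label at any depth, the extraction step does not need the full path to an element, only its name; this is one simple way to work with partial knowledge of a document's structure.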
In this approach, a collection of SGML documents can be stored in an object-oriented database as a
tree-like hierarchy of information objects, thus taking advantage of the strict typing of SGML to provide
efficient storage and retrieval. By extending the standard query language of the object-oriented database,
we can query on an incomplete or partial knowledge of the document structure whilst retaining the
search efficiency that the database engine provides us. Combination of the results with other databases
or data sources, and inclusion into the SGML virtual document is handled by our tree language. HTML
and XML documents do not always conform to a DTD and, if they come from the Web, are volatile
and fast-changing in nature. We propose in this case to access those documents through the standard file
systems or http protocol, to convert them to our tree-like data structures on-line, and to use our tree
language for both extraction and transformation, with possibly some specific instructions to handle
The system is currently being implemented. Our prototype application, a document that generates activity
reports, reuses both an SGML database and a collection of HTML pages (as well as an SQL database),
and shows how flexible and powerful our tool for information reuse is.
Brendan Hills is a member of the Electronic Documents and Commerce Portfolio at CSIRO
Mathematical and Information Sciences in Melbourne where he is working on the “Reuse of
Information Objects” project (RIO). He has been working with SGML, HTML, and document
management systems since 1994 and presented some of the CSIRO’s work in this area at last year’s
SGML Asia Pacific Conference. His interests are in object-oriented databases, object-oriented
programming, and in bringing a cognitive science approach to designing document systems.
Francois Paradis is currently doing a post-doctorate at CSIRO Mathematical and Information
Sciences, Melbourne, in the Text-Based Information Management group, where he is involved in the
RIO project and in the TREC Information Retrieval experiments. His prior work with SGML includes
the definition of an indexing model for structured documents, and an application of this model to TEI,
which was part of his Ph.D. completed in 1996 at CLIPS-IMAG, Grenoble.
Anne-Marie Vercoustre is a Research Director at INRIA, France, where she has been involved in
Syntax-based Programming Environments, Structured Documents, and Hypertext for more than 10
years. She participated in the design of Symposia, the W3C WYSIWYG editor connected to the Web,
based on the SGML Grif Editor (Grif.SA). She is currently involved in databases for structured and
semi-structured data, with a focus on reusing SGML, XML and HTML documents through virtual
documents. In 1996-1997, she spent 16 months at CSIRO-DMIS (Melbourne), as the leader of the
Text-based Information Management Research Group, and directed the project on Reuse of Information