Workshop on Reuse of Web Information

Some highlights and conclusions from the workshop

The workshop attracted about 25 people, mostly from Australia but also from the USA, UK, Austria, Germany and France. It was combined with the workshop on Flexible Hypertext.

The workshop included four main sessions, an invited talk and a concluding session. The first three sessions had a theme corresponding to the main steps in reusing information, namely:

Searching and finding information
Extracting information
Combining/Presenting information

while a fourth theme, Modeling Information, was regarded as central to the whole issue of reuse.

The steps involved in the process of information reuse were highlighted in Anne-Marie's introduction. She also pointed out that extracting information may amount to following links among pages to find the relevant fragments or further links. Searching and extracting are therefore tightly related, and the allocation of papers to one or the other session was somewhat arbitrary.

Modeling information appeared to be a condition for extracting and reusing. The two papers in this session considered a priori modeling for supporting future reuse, rather than using a model for discovering and extracting information, as in some other papers (e.g. A Jumping Spider and papers in session 2).

The last session focused on the flexible presentation of information, which is another important aspect in the reuse of existing information.

Unfortunately, for personal and professional reasons, Philipp Merrick from webMethods was unable to attend the workshop and the WWW7 Conference. At the last minute his presentation on using XML for e-business was replaced by an introduction to XML and its potential benefit for reuse, by Anne-Marie Vercoustre. Her presentation focused on the research perspective rather than the business perspective on XML.

After drawing some conclusions we had an unplanned wrap-up session with the Workshop on Hypertext Functionality and the WWW, co-chaired by Carolyn Watters and Fabio Vitali. This also resulted in a BOF session on Virtual Documents, and the formation of a special interest group on this topic.

Finding/Searching Information

In order to reuse information, we first require access to that information, either by: finding out what is relevant (by searching); accessing well-known places where information is regularly updated; being notified that new information has come; or a combination of the above.

Matthieu Montebello underlined the inefficiency of search engines and browsing with regard to precision and recall, and raised the issue of how such evaluation criteria apply in a world-wide environment. He focused on meeting users' needs and interests and proposed a system that prompts the user with new URLs selected by peers, according to a dynamically evolving user profile.

Curtis Dyreson made the distinction between static and dynamic approaches for search tools. A static search is done at periodic intervals, off-line, for example by a robot, and results are stored locally for further querying and reuse. A dynamic search occurs when a tool crawls part of the Web on-line for extracting information each time the information is requested. Curtis described a static system to index concepts that span more than one page.
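The distinction between static and dynamic search can be illustrated with a minimal sketch. It is not taken from any of the systems presented: the in-memory `PAGES` dictionary stands in for the Web, and the function names are hypothetical.

```python
# Hypothetical in-memory "web": URL -> page text.
# A real tool would fetch these pages over HTTP.
PAGES = {
    "http://example.org/a": "reuse of web information",
    "http://example.org/b": "flexible hypertext and reuse",
    "http://example.org/c": "constraint-based layout",
}


def build_index(pages):
    """Static approach: visit every page once, off-line, and store an index locally."""
    index = {}
    for url, text in pages.items():
        for word in set(text.split()):
            index.setdefault(word, set()).add(url)
    return index


def static_search(index, term):
    """Answer a query from the stored index; the answer may be out of date."""
    return sorted(index.get(term, set()))


def dynamic_search(pages, term):
    """Dynamic approach: visit the pages again each time a query is asked."""
    return sorted(url for url, text in pages.items() if term in text.split())


index = build_index(PAGES)
# A page published after the index was built.
PAGES["http://example.org/d"] = "reuse at work"

print(static_search(index, "reuse"))   # does not know about page d
print(dynamic_search(PAGES, "reuse"))  # sees page d
```

The static search answers quickly from local storage but misses the page added after indexing, while the dynamic search pays the cost of visiting the pages on every request.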

Anand Rajaraman presented the architecture of a commercial system developed by Junglee and used for integrating book offers from various electronic bookshops, job opportunities, etc. It was really "reuse at work". The system includes a tool box for building wrappers on sites, linguistic feature extraction, as well as a data verifier that checks the conformance of data types to expectations, in case the server's structure has changed without notification.
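The idea of a data verifier can be sketched as follows. This is only an illustration under stated assumptions, not Junglee's implementation: the field names, the patterns in `EXPECTED` and the `verify` function are all hypothetical.

```python
import re

# Hypothetical expectations for one book-offer record produced by a wrapper.
EXPECTED = {
    "title": lambda v: isinstance(v, str) and v.strip() != "",
    "price": lambda v: isinstance(v, str) and re.fullmatch(r"\d+\.\d{2}", v) is not None,
    "isbn": lambda v: isinstance(v, str) and re.fullmatch(r"\d{9}[\dX]", v) is not None,
}


def verify(record, expected=EXPECTED):
    """Return the names of fields that are missing or fail their check."""
    return [
        name
        for name, check in expected.items()
        if name not in record or not check(record[name])
    ]


# Record extracted while the site layout matches the wrapper.
good = {"title": "Some Book", "price": "19.95", "isbn": "0123456789"}
# Record extracted after the site silently swapped two columns.
shifted = {"title": "19.95", "price": "Some Book", "isbn": "0123456789"}

print(verify(good))     # no failing fields
print(verify(shifted))  # the price field no longer looks like a price
```

A non-empty result signals that the wrapper's assumptions about the site may no longer hold. Note that the loose check on `title` lets the swapped value through, which shows why such checks only catch some structural changes.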

General discussion

A first question was: is searching for reuse different from usual searching?
One partial answer is that searching for reuse is in the context of a specific reuse task, rather than the user's task.
We had a long discussion on evaluation of search and reuse in the context of the Web, compared to TREC evaluation.
The problems include:
- the validation of data: TREC evaluation works on a well-established and fixed corpus. For web reuse there is a need for testing the validity (stability) of data for each wrapper (agent), regarding both the structure and the content.
- the nature of the web, where there may be a tension between "relevance" for a customer and "attractiveness" for the vendor. So the question "what is relevance" or "meaning" (e.g. for images) is even more complex than for classical IR.
Evaluation should:
- be user-based
- set the task involved in reuse (relevance for reuse)
- evaluate the efficiency (how much reuse is possible?), given the instability of data.

Extracting Information

Jim Stinger presented a tool for automatically extracting electronic component data from web datasheets. His system does not rely on the data format (tags) for extracting information. Instead it uses domain knowledge about products, and vendor-specific catalogues which model the knowledge embedded in particular vendors' datasheets. The navigator visits vendors' pages, follows promising links or fills in forms as required, and returns a list of pages to the extractor. The extractor iteratively processes the pages for each characteristic to be mined from the datasheet and sends the results to the database. Automatic mining ran into many difficulties, due to poorly written HTML pages and inconsistencies between datasheet versions.
The second paper, presented by Joseph Davis, also uses a domain-centered model to support more precise access to information. The approach here is quite different though, since it uses complex metadata based on ontologies. The success of the approach depends on a certain number of sites relevant to the domain adopting the UniGuide Scheme metadata and attaching these metadata to their pages. This approach is in line with current initiatives for metadata development, like the Dublin Core initiative, or the W3C working group on the RDF schema registry.
The two presentations highlighted the need for making the underlying model of the application explicit, since it is often difficult to infer it from the structure of the site(s).

Modeling Information

The two papers in this session are concerned with modeling information for supporting better reuse, the first one for reusing digital videos, the second for reusing Web styles and instances.
Craig Lindley introduced FRAMES, a system for reusing digital videos through synthesised new videos. FRAMES uses a rich model for describing videos which reflects the complexity of video semantics (multiple possible interpretations) as well as physical features. Although the query engine and content models can be used directly for search and retrieval of video content, the important point, with regard to reuse of the content, is that it can also be used for a wide variety of other purposes, for example, virtual video generation. A video synthesis can be specified by a virtual document prescription which contains queries to the video database using a combination of features available in the model.
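A virtual document prescription of this kind can be pictured with a small sketch. It does not reproduce the FRAMES content model or query engine: the clip annotations, the `query` function and the prescription format are hypothetical simplifications.

```python
# Hypothetical video database: each clip carries a few descriptive features.
CLIPS = [
    {"id": "c1", "subject": "reef", "shot": "wide", "seconds": 12},
    {"id": "c2", "subject": "reef", "shot": "close-up", "seconds": 5},
    {"id": "c3", "subject": "forest", "shot": "wide", "seconds": 8},
]


def query(clips, **features):
    """Return the clips whose annotations match every requested feature."""
    return [c for c in clips if all(c.get(k) == v for k, v in features.items())]


# A prescription is an ordered list of queries; the video is assembled
# when the prescription is evaluated, not stored in advance.
PRESCRIPTION = [
    {"subject": "reef", "shot": "wide"},
    {"subject": "reef", "shot": "close-up"},
]


def synthesise(prescription, clips):
    """Evaluate each query in order and concatenate the matching clip ids."""
    sequence = []
    for step in prescription:
        sequence.extend(c["id"] for c in query(clips, **step))
    return sequence


print(synthesise(PRESCRIPTION, CLIPS))
```

The same `query` function serves plain retrieval and synthesis, which mirrors the point that a content model built for search can be reused for generating new videos.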
Max Mühlhäuser focussed on hyperwebs, i.e. logically correlated sets of nodes and links. He showed how typing concepts for hyperwebs can improve finding information as well as reusing such hyperwebs in augmented hypermedia. He introduced graph grammars as a way of typing the Web in the programming language sense. With this formalism, models and instances of hyperwebs can be precisely described and reused. This work is strongly rooted in hypertext/hypermedia research and deserves a more prominent place in the Web community, especially for intranet design.
Both presentations demonstrated the importance of a rich description of the primary sources (content, structure) in order to support effective reuse.

Presenting Information

We did not have specific papers on reassembling/reconstructing information, although several of the participants are working on Virtual Documents, which were discussed in a separate BOF session (see below). Virtual documents were also discussed by Craig Lindley in his presentation on building video synthesis.

Petros Demetriades presented an architecture for building various views on Web Information, as well as on browsing history. Views perform the actual interpretation of the data into some visual form. This architecture can accommodate both static and dynamic search, and facilitate experimentation and side-by-side comparison of different collection, storage and visualisation methods and techniques.

Kim Marriott showed the advantages of using constraint-based techniques for the automatic generation of good document layout. Documents and Web pages are made of content, structure and layout. Both structure and layout can change according to the content and context of reuse. Layout should adapt itself to a wide variety of environments, media and viewers. Constraint-based layout provides this flexibility and supports a real negotiation between the author and the viewer for controlling the final appearance of the document.

Some Conclusions

Conclusions drawn by Craig Lindley were:

clarifying the notion of reuse: making a distinction between "use" of information, "reuse", and multi-purpose information. Reuse often amounts to restructuring or adding value to primary sources (second level information), or using them for purposes other than the one initially intended. Multi-purpose information refers to flexible information delivery to accommodate various contexts. Reuse here is anticipated.
One particularly interesting issue was that the context of "use" is not necessarily the "user" context, and these should be treated separately. There is a need for evaluation of search and reuse relevance and effectiveness, according to a context or a task. Relevance regarding meaning is difficult, especially for images which may have several meanings. Relevance can be quite different for a vendor, or a customer, and for entertainment.
The conclusions also highlighted the need for research on meta models for structure and semantics and a language for virtual inclusion/transclusion.

Wrap-up session

Another workshop, which was running concurrently, dealt with the hypertext functionality of the Web and with ways in which we might like the Web to be enhanced. For example, in transclusion, instead of including a link to a piece of information, that information might be included in the document on-the-fly in order to achieve some purpose. This is like the traditional "stretch-text" idea, where pieces of additional text are woven into an otherwise static document for some reason, such as the user not understanding a particular term. We invited the members of this workshop to join us for the final wrap-up session, and a member from each workshop gave a summary of the issues which had arisen over the course of the day.
Requirements for the Web (from this point of view):

The Web needs to enable typed links. Examples of types of links are: annotations, transclusion (inclusion by reference), personalized links, trails and guided tours, backtracking and history-based navigation etc. Currently, the Web only offers one type of link. For the hypertext community, this is insufficient.
Web applications need to be evaluated more rigorously. Evaluation is likely to become a big theme, as no one does it (or does it well) and as the need is clearly there.
The Web needs to provide facilities that support collaboration.
More user-centered approaches to web design are needed.

This allowed us to see the links (pardon the pun) between the two groups more clearly and led to the formation of a "Birds of a Feather" session later in the conference.

This BoF session focussed on "virtual documents" in order to discuss at some length some of the issues we consider important in building systems which can produce documents on-the-fly from either large pieces of text or smaller fragments using a grammar etc. We spent some time discussing the meaning of the term "virtual document" but didn't reach a conclusion in one hour.

Further actions

Participants in the BoF session agreed on further collaboration about Virtual Documents, starting with setting up a mailing list and a web page for those people who are working in this area in order to pool ideas and constraints, and to facilitate collaboration. For more information please contact Brendan Hills: Brendan.Hills@cmis.csiro.au