Some highlights and conclusions from the workshop
The workshop attracted about 25 people, mostly from Australia but also
from the USA, UK, Austria, Germany and France. It was combined with the
workshop on Flexible Hypertext.
The workshop included four main sessions, an invited talk and a concluding
session. The first three sessions had a theme corresponding to the main
steps in reusing information, namely:
- Searching and finding information
- Extracting information
- Combining/Presenting information
while a fourth theme, modeling information, was regarded as central
to the whole issue of reuse.
The steps involved in the process of information reuse were highlighted in
Anne-Marie's introduction. She also pointed out that
extracting information may amount to following links among pages to find
the relevant fragments or further links. Searching and extracting are
therefore tightly related, and the allocation of papers to one or the
other session was somewhat arbitrary.
Modeling information appeared to be a condition for extracting and
reusing. The two papers in this session considered a priori modeling for
supporting future reuse, rather than using a model for discovering and
extracting information, as in some other papers (e.g. A Jumping Spider
and papers in session 2).
The last session focused on the flexible presentation of information,
which is another important aspect in the reuse of existing information.
Unfortunately, for personal and professional reasons, Philipp Merrick from
webMethods was unable to attend the workshop and the WWW7 Conference. At
the last minute his presentation on using XML for e-business was replaced
by an introduction to XML and its potential benefit for reuse, by
Anne-Marie Vercoustre. Her presentation focused on the research
perspective rather than the business perspective on XML.
After drawing some conclusions we had an unplanned wrap-up session with
the Workshop on Hypertext Functionality and the WWW, co-chaired by Carolyn
Watters and Fabio Vitali. This also resulted in a BOF session on Virtual
Documents, and the formation of a special interest group on this topic.
Finding/Searching Information
In order to reuse information, we first require access to that
information, either by: finding out what is relevant (by searching);
accessing well-known places where information is regularly updated; being
notified that new information has come; or a combination of the above.
Matthieu Montebello underlined the inefficiency of search engines and
browsing with regard to precision and recall, and raised the issue of how
such evaluation criteria apply in a world-wide environment. He focused on
meeting users' needs and interests and proposed a system that prompts the
user with new URLs selected by peers, according to a dynamically evolving
user profile.
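For readers unfamiliar with the two measures, here is a minimal sketch of how precision and recall are computed for a single query. The URL names are made up for illustration, and the sketch assumes the full set of relevant pages is known, which is exactly what is hard to guarantee on the open Web.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved items that are relevant
    recall    = fraction of relevant items that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
retrieved = ["u1", "u2", "u3", "u4"]
relevant = ["u2", "u4", "u5"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Recall needs the size of the complete relevant set in its denominator, so it can only be estimated when the collection is open-ended.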
Curtis Dyreson made the distinction between static and
dynamic approaches for search tools. A static search is done at
periodic intervals, off-line, for example by a robot, and results are
stored locally for further querying and reuse. A dynamic search occurs
when a tool crawls part of the Web on-line to extract information each
time the information is requested. Curtis described a static system to
index concepts that span more than one page.
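The static/dynamic distinction can be illustrated with a toy sketch. It is not the system Curtis described: the in-memory `TOY_WEB` dictionary and the `fetch` function are assumptions standing in for real pages and HTTP requests.

```python
# A toy "web": page URL -> text. Stands in for real pages (assumption).
TOY_WEB = {
    "a.html": "reuse of web information",
    "b.html": "video synthesis and reuse",
    "c.html": "constraint based layout",
}


def fetch(url):
    """Hypothetical fetcher; a real tool would issue an HTTP request here."""
    return TOY_WEB[url]


def build_static_index(urls):
    """Static approach: crawl once, off-line, and store an inverted index."""
    index = {}
    for url in urls:
        for word in fetch(url).split():
            index.setdefault(word, set()).add(url)
    return index


def static_search(index, word):
    """Answer from the stored index; fast, but results may be stale."""
    return sorted(index.get(word, set()))


def dynamic_search(urls, word):
    """Dynamic approach: fetch the pages again each time a query is made."""
    return sorted(url for url in urls if word in fetch(url).split())


index = build_static_index(list(TOY_WEB))
# One page changes after the index was built.
TOY_WEB["c.html"] = "constraint based layout for reuse"

print(static_search(index, "reuse"))
print(dynamic_search(list(TOY_WEB), "reuse"))
```

The static search answers from the stored index and misses the updated page, while the dynamic search pays the cost of fetching again and sees it.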
Anand Rajaraman presented the architecture of a commercial system developed
by Junglee and used for integrating book offers from various electronic
bookshops, job opportunities, etc. It was really "reuse at work". The
system includes a tool box for building wrappers on sites, linguistic
feature extraction, and a data verifier that checks that extracted data
conforms to the expected types, in case a server's structure has changed
without notification.
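The idea of a data verifier can be sketched as follows. This is not Junglee's implementation: the field names and the format checks are hypothetical, chosen only to show how a wrapper's output might be tested against expectations.

```python
import re

# Hypothetical expectations for a book-offer wrapper: field -> check function.
EXPECTED = {
    "title": lambda v: isinstance(v, str) and v.strip() != "",
    "price": lambda v: isinstance(v, str)
    and re.fullmatch(r"\d+\.\d{2}", v) is not None,
    "isbn": lambda v: isinstance(v, str)
    and re.fullmatch(r"\d{9}[\dX]", v) is not None,
}


def verify(record):
    """Return the fields that are missing or fail their check."""
    return [
        field
        for field, check in EXPECTED.items()
        if field not in record or not check(record[field])
    ]


good = {"title": "Some Book", "price": "19.95", "isbn": "0123456789"}
# What a wrapper might return after a site redesign shifted its columns.
bad = {"title": "19.95", "price": "Some Book", "isbn": ""}

print(verify(good))
print(verify(bad))
```

A non-empty result signals that the wrapper probably needs attention. Note that the swapped title passes its check, which shows that type checks alone cannot catch every structural change.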
General discussion
A first question was: is searching for reuse different from usual
searching? One partial answer is that searching for reuse takes place in
the context of a specific reuse task, rather than the user's task.
We had a long discussion on evaluation of search and reuse in the context
of the Web, compared to TREC evaluation.
The problems include:
- the validation of data: TREC evaluation works on a well-established and
fixed corpus. For web reuse there is a need for testing the validity
(stability) of data for each wrapper (agent), regarding both the structure
and the content.
- the nature of the web, where there may be a tension between "relevance"
for a customer and "attractiveness" for the vendor. So the question "what
is relevance" or "meaning" (e.g. for images) is even more complex than for
classical IR.
Evaluation should:
- be user-based
- set the task involved in reuse (relevance for reuse)
- evaluate the efficiency (how much reuse is possible?), given the
instability of data.
Extracting Information
Jim Stinger presented a tool for automatically extracting electronic
component data from web datasheets. His system does not rely on the data
format (tags) for extracting information. Instead it uses domain knowledge
about products, and vendor specific catalogues which model the knowledge
embedded in particular vendors' datasheets. The navigator visits vendors'
pages, follows promising links or fills in forms as required, and
returns a list of pages to the extractor. The extractor iteratively
processes the pages for each characteristic to be mined from the datasheet
and sends the results to the database. Automatic mining ran into many
difficulties, caused by poorly written HTML pages and inconsistencies
between datasheet versions.
The second paper, presented by Joseph Davis, also uses a domain-centered
model to support more precise access to information. The approach here is
quite different though, since it uses complex metadata based on
ontologies. The success of the approach depends on a certain number of
sites relevant to the domain adopting the UniGuide Scheme metadata and
attaching these metadata to their pages. This approach is in line with
current initiatives for metadata development, like the Dublin Core
initiative, or the W3C working group on RDF schema registry.
The two presentations highlighted the need for making the underlying model
of the application explicit, since it is often difficult to infer it from
the structure of the site(s).
Modeling Information
The two papers in this session were concerned with modeling information to
support better reuse, the first one for reusing digital videos, the
second for reusing Web styles and instances.
Craig Lindley introduced FRAMES, a system for reusing digital videos
through synthesised new videos. FRAMES uses a rich model for describing
videos which reflects the complexity of video semantics (multiple possible
interpretations) as well as physical features. Although the query engine
and content models can be used directly for search and retrieval of video
content, the important point, with regard to reuse of the content, is that
it can also be used for a wide variety of other purposes, for example,
virtual video generation. A video synthesis can be specified by a virtual
document prescription which contains queries to the video database using a
combination of features available in the model.
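As a rough illustration of a prescription that assembles a new video from database queries, consider the sketch below. It does not reproduce the FRAMES model: the `Segment` record, the feature labels and the query semantics (all required features must be present) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A described stretch of video (hypothetical, much simplified model)."""
    clip: str
    start: float
    end: float
    features: frozenset


# Made-up content descriptions standing in for a video database.
DATABASE = [
    Segment("tape1", 0.0, 12.5, frozenset({"beach", "daylight"})),
    Segment("tape1", 12.5, 30.0, frozenset({"interview", "indoor"})),
    Segment("tape2", 5.0, 9.0, frozenset({"beach", "sunset"})),
]


def query(required):
    """Return the segments whose description contains every required feature."""
    return [s for s in DATABASE if required <= s.features]


def synthesise(prescription):
    """Run each query of the prescription in order and concatenate the results."""
    playlist = []
    for required in prescription:
        playlist.extend(query(frozenset(required)))
    return playlist


# The prescription stores queries, not segments: the video is built on demand.
prescription = [{"interview"}, {"beach"}]
playlist = synthesise(prescription)
for s in playlist:
    print(s.clip, s.start, s.end)
```

Because the prescription holds queries rather than fixed segments, adding new footage to the database changes the synthesised video without editing the prescription.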
Max Mühlhäuser focussed on hyperwebs, i.e. logically correlated sets of
nodes and links. He showed how typing concepts for hyperwebs can
improve finding information as well as reusing such hyperwebs in augmented
hypermedia. He introduced graph grammars as a way of typing the Web in the
programming language sense. With this formalism, models and instances of
hyperwebs can be precisely described and reused. This work is strongly
rooted in hypertext/hypermedia research and deserves to find a better
place in the Web community, especially for intranet design.
Both presentations demonstrated the importance of a rich description of
the primary sources (content, structure) in order to support effective
reuse.
Presenting Information
We did not have specific papers on reassembling/reconstructing
information, although several of the participants are working on Virtual
Documents, which were discussed in a separate BOF session (see below).
Virtual documents were also discussed by Craig Lindley in his presentation
on building video synthesis.
Petros Demetriades presented an architecture for building various views on
Web Information, as well as on browsing history. Views perform the actual
interpretation of the data into some visual form. This architecture can
accommodate both static and dynamic search, and facilitate
experimentation and side-by-side comparison of different collection,
storage and visualisation methods and techniques.
Kim Marriott showed the advantages of using constraint-based techniques for
the automatic generation of good document layout. Documents and Web pages
are made of content, structure and layout. Both structure and layout can
change according to the content and context of reuse. Layout should adapt
itself to a wide variety of environments, media and viewers.
Constraint-based layout provides this flexibility and supports a real
negotiation between the author and the viewer for controlling the final
appearance of the document.
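The negotiation between author and viewer can be hinted at with a deliberately tiny sketch for a single quantity, the column width. It is not the technique presented in the talk: a real constraint solver handles many interacting constraints, and the function name and pixel values here are assumptions.

```python
def negotiate_width(author, viewer):
    """Pick a column width acceptable to both parties.

    author: (minimum, preferred, maximum) widths set by the author
    viewer: (minimum, maximum) widths the viewer's window allows
    Returns the feasible width closest to the author's preference,
    or None if the two ranges do not overlap.
    """
    low = max(author[0], viewer[0])
    high = min(author[2], viewer[1])
    if low > high:
        return None  # constraints cannot all be satisfied
    return min(max(author[1], low), high)


# Hypothetical values in pixels.
print(negotiate_width((300, 600, 800), (200, 480)))
print(negotiate_width((300, 600, 800), (200, 1024)))
print(negotiate_width((300, 600, 800), (900, 1200)))
```

When the ranges overlap, the author's preference is honoured as far as the viewer's window permits. When they do not, a fuller system would have to relax one of the constraints rather than give up.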
Some Conclusions
Conclusions drawn by Craig Lindley were:
- The notion of reuse needs clarifying, by distinguishing between "use" of
information, "reuse", and multi-purpose information. Reuse often amounts
to restructuring or adding value to primary sources (second-level
information), or to using them for purposes other than the one initially
intended. Multi-purpose information refers to flexible information
delivery to accommodate various contexts; here reuse is anticipated.
- One particularly interesting issue was that the context of "use" is
not necessarily the "user" context, and these should be treated
separately. There is a need for evaluation of search and reuse relevance
and effectiveness, according to a context or a task. Relevance regarding
meaning is difficult to assess, especially for images, which may have
several meanings. Relevance can also be quite different for a vendor, for
a customer, and for entertainment.
- There is a need for research on meta models for structure and semantics
and on a language for virtual inclusion/transclusion.
Wrap-up session
Another workshop which was running concurrently dealt with the hypertext
functionality of the Web and how we might like the Web to be enhanced.
For example, in transclusion, instead of including a link to a piece of
information, that information might be included in the document on-the-fly
in order to achieve some purpose. This is like the traditional
"stretch-text" idea where pieces of additional text are woven into an
otherwise static document for some reason, such as the user not
understanding a particular term. We invited the members of this
workshop to join us for the final wrap-up session, and a member from each
workshop gave a summary of the issues which had arisen over the course of
the day.
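Transclusion as described above can be sketched in a few lines. The `{{include:...}}` marker syntax and the fragment store are invented for this example; no existing Web standard or tool is implied.

```python
import re

# Hypothetical fragment store: identifier -> text held elsewhere.
FRAGMENTS = {
    "def:transclusion": "inclusion by reference",
    "def:bof": "an informal discussion session",
}

# Made-up marker syntax, e.g. {{include:def:bof}}.
INCLUDE = re.compile(r"\{\{include:([^}]+)\}\}")


def render(document):
    """Replace each include marker with the referenced fragment at render time."""

    def resolve(match):
        key = match.group(1)
        # Unknown references are left untouched rather than dropped.
        return FRAGMENTS.get(key, match.group(0))

    return INCLUDE.sub(resolve, document)


source = "Transclusion ({{include:def:transclusion}}) differs from linking."
print(render(source))
```

The source document keeps only the reference, so updating the fragment in one place changes every document that includes it the next time it is rendered.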
Requirements for the Web (from this point of view):
- The Web needs to enable typed links. Examples of types of links are:
annotations, transclusion (inclusion by reference), personalized links,
trails and guided tours, backtracking and history-based navigation etc.
Currently, the Web only offers one type of link. For the hypertext
community, this is insufficient.
- Web applications need to be evaluated more rigorously. Evaluation is
likely to become a big theme, as no one does it (or does it well)
and as the need is clearly there.
- The Web needs to provide facilities that support collaboration.
- More user-centered approaches to web design are needed.
This allowed us to see the links (pardon the pun) between
the two groups more clearly and led to the formation of a "Birds of a
Feather" session later in the conference.
This BoF session focussed on "virtual documents" in order to discuss at
some length some of the issues we consider important in building systems
which can produce documents on-the-fly from either large pieces of text or
smaller fragments using a grammar etc. We spent some time discussing the
meaning of the term "virtual document" but didn't reach a conclusion in
one hour.
Further actions
Participants in the BoF session agreed on further collaboration about
Virtual Documents, starting by setting up a mailing list and a web
page for those people who are working in this area in order to pool
ideas and constraints, and to facilitate collaboration. For more
information please contact Brendan Hills: Brendan.Hills@cmis.csiro.au