Database Querying on the World Wide Web: UniGuide - An Object-Relational Search Engine for Australian Universities

Carlos F. Enguix, Joseph G. Davis, and Aditya K. Ghose

Decision Systems Lab

Department of Business Systems

The University of Wollongong

Northfields Ave.

Wollongong NSW 2522

joseph_davis@uow.edu.au , cfe01@wumpus.uow.edu.au , aditya@uow.edu.au

Please address all correspondence to Joseph G. Davis

(Word Count : 2981)

Abstract

The World Wide Web can be regarded as a huge semi-structured database that offers a vast amount of information. Existing web search techniques have significant deficiencies with respect to robustness, flexibility and precision. The purpose of this research is to develop a domain-centred alternative to keyword and subject directory search engines. The specific domain considered for the prototype implementation is that of Australian universities, including all the internal entities of each university (faculties, departments, research centres, etc.) that are available on the web. By modelling the ontology of this domain using an object-relational data model and restructuring the web data in an object-relational database, structured queries can be issued against this database in a fashion that current search engines do not provide.

1. Introduction

The explosion in the volume of data available on the WWW in recent years has made it difficult to discover the necessary information resources in reasonable time. The quality and usefulness of the global, distributed, hypermedia structure of the web is critically dependent on the availability of effective means by which users can obtain the required information efficiently. The existing search engines, based on keywords and subject directories, have come under severe strain.

An alternative approach that enables database querying of web data is proposed in our research. We address some of the conceptual and practical questions involved in developing and structuring ontologies within well-defined domains such as health care, universities, etc. The ontology model, structured as an object-relational database schema, is used to develop an object-relational database query search engine entitled "UniGuide".

The paper is organised as follows. In section 2 we present an overview of the relevant literature. A model that captures the core constructs of the university domain and their inter-relationships (its ontology), together with the architecture and implementation of the UniGuide prototype, is presented in section 3. The usage of the prototype from an end-user perspective is outlined in section 4. Sections 5 and 6 are devoted to future research directions and conclusions respectively.

2. Overview of the Literature

2.1 Structuring the Web Data

The WWW can be considered a huge semi-structured database, presenting all the problems implicit in semi-structured data [Abi97]. Extracting the structure of every HTML document is challenging given the absence of a predefined standard or schema. Often the schema can be derived only after the data exists, in contrast to conventional databases, where the schema is defined before the database is populated, and the derived schema can be very large and constantly evolving.

One possibility for "putting things in order" in this relative chaos is to create a structured layer on top of the semi-structured layer, along the lines proposed in [HZF95]. A feasible approach is to attach metadata that describes the contents of individual web pages. This would permit us to view relevant information about web pages as a series of structured tuples of data. The approach is partly based on the assumption that metadata will (or should) be treated as first-class objects [W3C97] and will serve as the interface from the WWW to a structured database. Because of the exclusive focus on metadata, strict typing is needed only for the required metadata and not for the contents of the HTML documents. The metadata to be attached to web pages consists of basic, standard meta tags: name/value pairs that describe properties of the document. The most widespread examples are the keywords and description meta tags used by some of the most popular search engines. Another important standard is the set of meta tags proposed by the Dublin Core, a 15-element metadata set intended to facilitate discovery of electronic resources [COR97].

2.2 The Query Problem

A majority of the existing search engines provide a very simple interface to querying, a simple text box. We list below some of the most common deficiencies of current implementations:

Most of the search engines are keyword-based, constrained to very limited structured querying, therefore providing more syntactic and less semantic precision

The lack of control on querying data: the boundaries of the query are unknown, the output of a query is hard to predict

The ability to establish relations between data elements is scarce or non-existent

2.3 Related Work

Structured database querying on the WWW was proposed by Han, Zaïane, and Fu [HZF95]. A critical problem with their approach is that it was too generic: it tried to model a schema that could represent the whole semantics of the WWW. A more realistic approach is to follow a strategy of "divide and conquer", identifying and isolating a number of domains from which a model can be derived. The ideal domains to "attack" are those that present large hierarchies of webpages. From this starting point we can identify and isolate a set of entities that are present in almost every domain instance (e.g. the webpages of a particular university) of a given "characteristic" domain (e.g. university webpages). One should be aware that it is almost impossible to model all the variations of a given entity, or all the possible entities in a "characteristic" domain. Probably we can represent up to 80% of the possible entities of a given domain.

Atzeni et al. [AMM+97] have proposed a data model and a view language to represent, query and restructure the information stored in structured web servers. Generally these servers are characterised by having their webpages stored in databases and by having normalised not only the content of their webpages but also the hypertextual structure (i.e. HTML tags). This allows attribute values to be extracted automatically using a text restructuring language.

Our focus is on domains where semantic models can be extracted. Although these domains are considered to be logically structured in general terms, they do not share a normalised hypertextual structure. This makes it almost impossible to extract attributes automatically from every given "target" webpage using general text restructuring programs. This impossibility justifies our approach of attaching structured meta-tags to webpages in order to extract the attributes of entity-instances represented in webpages.

3. UniGuide: Architecture

3.1 Introduction

The framework proposed in this paper is predicated on two significant assumptions:

Ontologies or models of concepts and their relationships [MH97] represent powerful means to structure the global information base on the web

The range and diversity of data on the web is so extensive that ontologies may have to be constructed separately for each relatively well-defined domain such as universities, health care, government departments, schools, etc.

Ontology is a term with a long pedigree in philosophy. It refers to things that exist (in the domain). For instance, it is reasonable to expect that the university domain will always have information regarding research entities, academic departments, courses, research outputs, and so on. Furthermore, these are likely to be inter-related in similar and predictable ways [MH97].

Our proposed method involves isolating a distinct domain, modelling its ontology using an object-relational data model, and storing the structured data provided by UniGuide Scheme meta tags attached to the domain webpages in database tables corresponding to objects in the model. The UniGuide Scheme meta tags are generated using forms-based input and then attached to the relevant webpage, so that an indexing robot can populate the database automatically. This database becomes a resource that end-users can query in a fashion that current search engines cannot match, allowing the execution of typical and non-typical SQL queries. Querying the database can be regarded as finding information rather than searching for it, because the boundaries of a query can be delimited, a feature not available in keyword-based search engines.

UniGuide is a demonstration prototype implementation of the above approach. The ontology for all Australian University webpages is modelled as an object-relational data model. This model is then mapped to ILLUSTRA database tables and a set of queries that can be issued against this database is presented in subsequent sections.

3.2 The model: Object-Relational

The object-relational model of UniGuide is shown in Figure 1. It shows the entities that have been modelled so far, but the model is extensible. The objects in the model represent the webpages of the corresponding entities in the universities. There is therefore a 1:M relation (R[1:M]) between URLs and entity-instances: a webpage may contain many entity-instances, but an entity-instance has one and only one URL.

Figure 1: Synthesised graphical representation of the Object Relational Model of UniGuide

A university may contain many university units.
University unit is the superclass of all the following subclasses: club/association, administrative entity (division, office, etc.), library, residential college, campus and academic entities (faculties, departments, schools, etc.).
A library may have a set of catalogues, and a set of staff members.
An administrative entity can be one of the following types: Centre, Department, Division, Group, Institute, Office, etc. and may contain other administrative entities (e.g. the Division of the Registrar contains various offices). An administrative entity may contain a set of staff members.
An academic entity comprises the following types: faculty, department, school, unit, program, etc. An academic entity may contain other delegated academic entities (e.g. a Faculty contains various departments).
A research entity can be of type: research institute, research group, research centre, etc. Research entity is not considered a subclass of university unit because a research entity can be part of many universities or can be a totally independent organisation.
Academic entities and research entities may have a set of publications, projects, courses, course units (subjects), staff members and students.

Finally we can distinguish relations between entities:

Academic entity-research entity: an academic entity can-have/collaborate-with many research entities.
Research entity-university: a research entity may belong-to/collaborate-with many universities
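
The class hierarchy and relations listed above can be pictured with a small sketch. It is only an illustration in Python, not the ILLUSTRA schema used by UniGuide: the class and attribute names are assumptions chosen for readability, only a few of the entities are shown, and all instance names and URLs are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class University:
    name: str
    # A university may contain many university units.
    units: List["UniversityUnit"] = field(default_factory=list)


@dataclass
class UniversityUnit:
    """Superclass of library, campus, administrative and academic entities."""
    name: str
    url: str  # an entity-instance has one and only one URL


@dataclass
class AcademicEntity(UniversityUnit):
    entity_type: str = "Department"  # faculty, department, school, ...
    # An academic entity may contain other academic entities.
    sub_entities: List["AcademicEntity"] = field(default_factory=list)
    # Academic entity-research entity relation (one to many).
    research_entities: List["ResearchEntity"] = field(default_factory=list)


@dataclass
class ResearchEntity:
    """Not a UniversityUnit: it may belong to many universities or to none."""
    name: str
    url: str
    entity_type: str = "research centre"
    universities: List[University] = field(default_factory=list)


# Placeholder instances, not real data.
uni = University("Example University")
faculty = AcademicEntity("Faculty A", "http://www.example.edu.au/fa/", "Faculty")
dept = AcademicEntity("Department A", "http://www.example.edu.au/fa/da/")
faculty.sub_entities.append(dept)
uni.units.append(faculty)

lab = ResearchEntity("Research Centre A", "http://www.example.edu.au/rca/",
                     universities=[uni])
dept.research_entities.append(lab)
```

Keeping ResearchEntity outside the UniversityUnit hierarchy mirrors the reasoning given in the list: a research entity can be shared between universities, so it refers to universities instead of being contained in one.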

All entities contain a timestamp attribute that stores the date and time of last modification. This attribute gives a customised indexing robot an effective mechanism for deciding whether a previously inserted entity-instance has changed and therefore needs to be updated. The same applies to forms-based manual input.
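
A minimal sketch of such a timestamp check is given below. It is an assumption about how the decision could be made, not a transcription of the UniGuide algorithm; the function name and the three outcomes are hypothetical.

```python
from datetime import datetime


def decide_action(stored_timestamp, page_timestamp):
    """Return 'insert', 'update' or 'skip' for one entity-instance.

    stored_timestamp: last-modification time held in the database,
                      or None if the instance has never been inserted.
    page_timestamp:   last-modification time reported for the webpage.
    """
    if stored_timestamp is None:
        return "insert"   # unknown entity-instance
    if page_timestamp > stored_timestamp:
        return "update"   # the page changed after it was last indexed
    return "skip"         # nothing new, leave the stored row alone


# Hypothetical values for illustration only.
indexed = datetime(1997, 10, 2, 9, 0)
action = decide_action(indexed, datetime(1997, 11, 15, 14, 30))
```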

Figure 2. Simplified chart of the core algorithm for automatic database population (indexing robot)

3.3 Components: Overview

A WWW search engine is defined as a retrieval service consisting basically of a database, search software and a user interface [Pou97]. UniGuide has similar components, with some subtle differences. The database (ILLUSTRA ORDBMS™) is an object-relational hybrid with the capability to handle sets, arrays, abstract data types, object identifiers, references, relations, user-defined functions, inheritance, rules, etc. [SM96].

The search software is based on SQL3 queries, SQL queries that can call external C functions (ILLUSTRA API™), which can in turn run queries themselves (the "callback" feature), rules, ILLUSTRA Web Datablade® applications and JavaScript®. For security reasons, queries are actually generated on the server side; only the generation of the interface and the input/output of data are done on the client side. The system comprises more than 100 rules, used intensively to enforce referential integrity, constraints and uniqueness of sets, and to update object references and hypertext links automatically.

3.4 Meta Tags

Our proposal is to include, on a given range of webpages, meta tags that simulate tuples (rows) of data or metadata. Figure 3 contains two meta tags representing entity-instances of academic entity publication and academic entity course/degree:

<!-- mandatory columns marked with * -->
<!-- Please Enter Values inside ' ' -->
<meta name="academic_entity_publication"
content="
(~ uni_id [*university]= 'Macquarie University' ~),
(~ academic_entity_type [*academic entity type]= 'Department' ~),
(~ academic_entity_name [*academic entity name]= 'Computing' ~),
(~ pub_name [*publication name]= 'MINNI: Micromouse Incorporating Neural 
Network Intelligence' ~),
(~ pub_type [publication type]= 'paper' ~),
(~ pub_date [publication date]= '1997' ~),
(~ pub_topics [topics covered]= 'Neural networks, Robotics' ~),
(~ pub_authors [authors]= 'Jondarr Gibb, Len Hamey' ~),
(~ pub_desc [short description]= 'MINNI is a system whereby a back 
propagation neural network is used to control the steering of a micromouse 
(small robot) in following a straight path.' ~)">


<meta name="academic_entity_course/degree"
content="
(~ uni_id [*university]= 'University of Technology Sydney' ~),
(~ academic_entity_type [*academic entity type]= 'Department' ~),
(~ academic_entity_name [*academic entity name]= 'Computer Science' ~),
(~ course_name [*course name]= 'Bachelor of Science' ~), 
(~ course_spec [course speciality]= 'Computing Science' ~), 
(~ course_type [course type]= 'Undergraduate' ~), 
(~ course_degree_type [course degree type]= 'Single' ~),
(~ course_semesters [course semesters]= '6' ~), 
(~ course_credits [course credits]= '144' ~), 
(~ course_desc [course description]= 'This course aims to provide a sound 
education in all aspects of computing for students who intend to make a 
career in the profession' ~)">

Figure 3. Examples of UniGuide Scheme meta-tags

These meta tags can be generated automatically by the UniGuide Meta Tag Generator. This makes it possible for a customised indexing robot to index only specific meta tags, namely UniGuide Scheme meta tags.
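
To illustrate how the content of such a meta tag might be turned into a tuple, the following Python sketch parses the field syntax shown in Figure 3. The grammar is inferred from the two examples only, and the function names are hypothetical; it is not the parser used by UniGuide.

```python
import re

# Assumed grammar, inferred from Figure 3 only:
#   (~ column_name [label]= 'value' ~)
# A leading * in the label marks a mandatory column.
FIELD_RE = re.compile(
    r"\(~\s*(?P<column>\w+)\s*\[(?P<label>[^\]]*)\]\s*=\s*'(?P<value>.*?)'\s*~\)",
    re.DOTALL,
)


def parse_uniguide_content(content):
    """Return (row, mandatory) for the content attribute of one meta tag."""
    row = {}
    mandatory = set()
    for match in FIELD_RE.finditer(content):
        column = match.group("column")
        # Collapse line breaks inside long values into single spaces.
        row[column] = " ".join(match.group("value").split())
        if match.group("label").startswith("*"):
            mandatory.add(column)
    return row, mandatory


def missing_mandatory(row, mandatory):
    """List the mandatory columns whose value is empty."""
    return sorted(c for c in mandatory if not row.get(c))


# Part of the first meta tag of Figure 3.
sample = """
(~ uni_id [*university]= 'Macquarie University' ~),
(~ academic_entity_type [*academic entity type]= 'Department' ~),
(~ academic_entity_name [*academic entity name]= 'Computing' ~),
(~ pub_name [*publication name]= 'MINNI: Micromouse Incorporating Neural
Network Intelligence' ~),
(~ pub_type [publication type]= 'paper' ~),
(~ pub_date [publication date]= '1997' ~)
"""

row, mandatory = parse_uniguide_content(sample)
```

The resulting dictionary maps column names to values, so it could be handed to an INSERT statement for the table named by the meta tag, after checking that no mandatory column is empty.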

4. UniGuide from the End-User Perspective

4.1 Introduction

The UniGuide interface has three well-defined sub-sections: Submit URL, which allows the user to populate the database manually; Queries; and the Meta Tag Generator, which generates UniGuide Scheme meta tags. We shall describe only the query section.

4.2 Queries

4.2.1 Simple Queries

Entities are grouped hierarchically by domain and sub-domain. When the user clicks on a given entity (left frame), an HTML form is generated dynamically on the right-hand side (right frame). The user can specify the values or ranges of values to search for in the text boxes. The search options are contextual, depending on the data type of the column (e.g. the LIKE option is activated only for columns of type text). The output of a query is displayed in tabular form, together with the SQL query generated and the number of rows affected.

A simple query example follows: "Give me all the information available about universities that are located in Sydney and are public"

Figure 4: UniGuide Simple Query Form Interface. Note: (*) accepts all values.

Figure 5: UniGuide Simple Query Form Interface: Query Results
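
The way form values could be turned into an SQL query can be sketched as follows. This is a hedged illustration only: SQLite stands in for ILLUSTRA, the table and column names (university, city, sector) are assumptions, the rows are placeholders, and build_simple_query is a hypothetical helper, not UniGuide code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE university (name TEXT, city TEXT, sector TEXT, url TEXT)"
)
conn.executemany(
    "INSERT INTO university VALUES (?, ?, ?, ?)",
    [
        ("Example University A", "Sydney", "public", "http://a.example.edu/"),
        ("Example University B", "Sydney", "private", "http://b.example.edu/"),
        ("Example University C", "Melbourne", "public", "http://c.example.edu/"),
    ],
)


def build_simple_query(table, criteria):
    """Build a parameterised SELECT from form values; '*' accepts all values.

    Table and column names are interpolated into the SQL text, so they must
    come from the known schema and never from free user input.
    """
    clauses, params = [], []
    for column, value in criteria.items():
        if value != "*":
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


sql, params = build_simple_query(
    "university", {"name": "*", "city": "Sydney", "sector": "public"}
)
rows = conn.execute(sql, params).fetchall()
```

Because every condition refers to a named column, the boundaries of the query are known in advance, which is the property that keyword search lacks.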

Other examples are:

"Give me all the homepages of academic staff members of a particular university who have PhDs and are interested in Artificial Intelligence"
"Give me all the Masters in Commerce courses offered by Australian Universities"

4.2.2 Predefined Queries

From our viewpoint, predefined queries are complex queries constructed in a similar way to parameterisable views. Queries can include summarised data, simulation of transitive closure, relations between entities, etc. Some examples of more elaborate queries are:

"Give me all the hierarchical structure of the Faculty of Engineering of a particular university (schools within the faculty, departments within schools, etc.)"
"Give me all the research projects that involve collaboration between two or more academic entities and are funded by a given company"

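The hierarchical-structure query above amounts to a transitive closure over the containment relation between academic entities. The sketch below shows one way to express it, using a recursive common table expression in SQLite. This is a swapped-in technique for illustration and not how UniGuide simulates transitive closure on ILLUSTRA; the table layout (academic_entity with a parent_id column) and the rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE academic_entity "
    "(id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER)"
)
conn.executemany(
    "INSERT INTO academic_entity VALUES (?, ?, ?)",
    [
        (1, "Faculty of Engineering", None),
        (2, "School A", 1),
        (3, "School B", 1),
        (4, "Department A1", 2),
        (5, "Faculty of Arts", None),
    ],
)

# Start from the named entity and repeatedly add its children.
HIERARCHY_SQL = """
WITH RECURSIVE subtree(id, name, depth) AS (
    SELECT id, name, 0 FROM academic_entity WHERE name = ?
    UNION ALL
    SELECT child.id, child.name, subtree.depth + 1
    FROM academic_entity AS child
    JOIN subtree ON child.parent_id = subtree.id
)
SELECT name, depth FROM subtree ORDER BY depth, name
"""

rows = conn.execute(HIERARCHY_SQL, ("Faculty of Engineering",)).fetchall()
for name, depth in rows:
    print("  " * depth + name)
```

The faculty name is the only parameter, which is what makes the query behave like a parameterisable view.
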
4.2.3 Configurable Queries: WebQBE and FreeSQL

Here end-users configure and customise the required query themselves. Our goal is to provide an interface similar to a typical Query By Example interface. Another option, currently implemented, is a more complex interface that allows advanced end-users to compose free SQL queries with the aid of predefined queries, functions, operators, and a list of the tables and columns available in the schema.

Figure 6: UniGuide FreeSQL interface

5. Future Research Directions

Developing an interface for UniGuide that fits the needs of both inexperienced and advanced users is a challenge, especially in hiding the complexity of the schema. We are also currently developing, in Java™, a breadth-first multi-threaded indexing robot that captures UniGuide meta tags from university domains. Another important issue is whether we should continue to enforce strict integrity rules (i.e. reject tuples that violate referential integrity or constraints) or allow a natural evolution of the system towards "weaker" integrity rules, fuzzy referential integrity, etc.
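
The breadth-first traversal on which such a robot relies can be sketched in a few lines. The example below is single-threaded Python, not the Java robot under development, and it walks a hypothetical in-memory link table instead of fetching pages; the function name, the get_links callback and all URLs are assumptions.

```python
from collections import deque
from urllib.parse import urlparse


def breadth_first_urls(start_url, get_links, allowed_host):
    """Yield URLs in breadth-first order, staying inside allowed_host."""
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        yield url
        for link in get_links(url):
            # Follow a link only once, and only within the university domain.
            if urlparse(link).hostname == allowed_host and link not in seen:
                seen.add(link)
                queue.append(link)


# Hypothetical link structure standing in for real HTTP fetches.
LINKS = {
    "http://www.example.edu.au/": [
        "http://www.example.edu.au/faculties/",
        "http://www.example.edu.au/research/",
        "http://other.example.com/",
    ],
    "http://www.example.edu.au/faculties/": [
        "http://www.example.edu.au/faculties/engineering/",
        "http://www.example.edu.au/",
    ],
}

order = list(
    breadth_first_urls(
        "http://www.example.edu.au/",
        lambda url: LINKS.get(url, []),
        "www.example.edu.au",
    )
)
```

In a real robot, each yielded URL would be fetched and scanned for UniGuide Scheme meta tags, and several worker threads could share the queue.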

6. Conclusions

A new kind of search engine has been proposed as an alternative to current implementations, with the ability to support more structured and complex queries. This work is part of an ongoing research program exploring object-relational database approaches to searching the web. The success of this project depends partly on a sufficient number of Australian universities agreeing to adopt UniGuide Scheme meta tags so that the database can be populated automatically. Finally, we conclude that although the proposed solution is domain-specific, wherever a model can be "extracted" and a standard can be established with respect to metadata, our approach can be customised to the requirements of that specialised domain (e.g. it is well suited to large intranets: government departments, large companies, etc.).

7. References

[Abi97]	Serge Abiteboul. Querying Semi-Structured Data. ICDT '97, 6th International Conference on Database Theory, Delphi, Greece, January 8-10, 1997. http://www-db.stanford.edu/~abitebou/pub/icdt97.semistructured.ps
[AMM+97]	Paolo Atzeni, Giansalvatore Mecca, Paolo Merialdo, Elena Tabet. Structures in the Web. Technical Report RT-INF-19-1997, Department of Computer Science and Automation, January 1997. http://www.inf.uniroma3.it/tech-rep/inf-19-97.ps
[COR97]	Dublin Core Metadata Element Set: Reference Description. October 2, 1997. http://purl.org/metadata/dublin_core_elements
[MH97]	Kuhanandha Mahalingam and Michael Huhns. A Tool for Organising Web Information. IEEE Computer, June 1997, pages 80-83.
[HZF95]	J. Han, O. R. Zaïane, and Y. Fu. Resource and Knowledge Discovery in Global Information Systems: A Scalable Multiple Layered Database Approach. Proc. of a Forum on Research and Technology Advances in Digital Libraries (ADL'95), McLean, Virginia, May 1995.
[Pou97]	Alan Poulter. The design of World Wide Web search engines: a critical review. Program, vol. 31, no. 2, April 1997, pages 131-145.
[SM96]	Michael Stonebraker, Dorothy Moore. Object-Relational DBMSs: The Next Great Wave. Morgan Kaufmann Publishers, Inc., 1996.
[W3C97]	W3C. Hypertext Links in HTML. W3C Working Draft, 28-Mar-97. http://www.w3.org/TR/WD-htmllink - meta