Transforming the Internet into a Database

Anand Rajaraman

Junglee Corporation

anand@junglee.com

1. Introduction

Virtual database (VDB) technology makes the World Wide Web and other external data sources behave as an extension of an enterprise's relational database (RDBMS) system. According to some estimates, as much as 90% of the world's data is outside of relational database systems. Vital data is scattered across web sites, file systems, database systems, and legacy applications. These data sources differ in the way they organize the data, in the vocabulary they use, and in their data-access mechanisms. Many of them do not even support native query operations. Writing applications that combine data from these sources is a complex, often impossible, task because of the heterogeneity involved.

VDB technology lets applications ask powerful SQL queries of data that is scattered over a variety of data sources. The VDB gathers, structures and integrates the data from these disparate data sources and provides the application programmer with the appearance of a single, unified relational database system. VDB technology enables the development of an exciting new breed of applications that use all the data.

As an illustration of the applications enabled by VDB technology, consider job hunting on the Web. In order to make a meaningful career choice, a job seeker needs information on available opportunities as well as related data - such as information on housing, school districts, and crime statistics in the job area. Information on job openings is scattered across thousands of different web sites - company home pages and several aggregate sites, such as newspaper classifieds sites. Keyword search capabilities on words appearing in the job listing are the only available search choice.

VDB technology converts all these data sources into a single virtual relational database. Using an application based on VDB technology, the job seeker can now obtain answers to the following query posed to the Web, "find marketing manager positions in a company that is within 15 miles of San Francisco and whose stock price has been growing at a rate of at least 25% per year over the last three years". This single query would span the Web employment listings of many corporations, in addition to web sites that have geographical mapping information and websites that contain historical records of corporate equity prices. The query would also return, for each position, related information including statistics on housing prices, school districts, and crime statistics. Section 3 provides details on this and other VDB applications that are deployed on several high-traffic web sites, including those of Yahoo!, The Wall Street Journal, The Washington Post, and The San Jose Mercury News.

VDB technology is particularly effective in enterprise settings when combined with data warehousing. Using VDB technology, corporations can include data from nontraditional sources (e.g., files in directory systems) and external sources (e.g., the World Wide Web) in the data warehouse, enabling key decision support applications such as data mining and online analytical processing (OLAP).

2. Technology Architecture

2.1 The Virtual Database

Figure 1 is a run-time view of a simple Virtual Database (VDB), which we'll call the Books VDB for future reference. This VDB integrates the contents of two bookstores (Amazon.com and Powell's Books) and the New York Times Book Reviews and presents a unified schema with two tables, books and reviews. The database application operates on this unified schema, issuing SQL queries through the JDBC or ODBC API; the application itself is built using standard RAD tools such as Delphi, PowerBuilder, Visual Basic, or similar Java toolkits. In Figure 1, the application issues the standard SQL query:

SELECT * 

FROM books, reviews 

WHERE books.author = "gardner" AND books.isbn = reviews.isbn 

 

Figure 1. The Books Virtual Database.

The VDB is accessed through the VDB Server, and is administered through the browser-based VDB Console. The VDB also contains, for each external data source, a wrapper that interfaces the data source to the VDB server. A wrapper makes an arbitrary external data source, such as a web site, behave like an RDBMS, while the VDB Server integrates these separate relational databases into a unified Virtual Database (VDB).

Figure 2 shows an individual wrapper in action. The wrapper interfaces with the web site, typically using HTTP and HTML. It handles HTTP protocol-related issues such as forms, cookies, and authentication. The wrapper is accessed via the JDBC API, through which clients can issue SQL queries. A SQL query issued to the wrapper in this case would result in the wrapper filling out a HTML form on the Amazon.com web site, navigating and parsing the resulting HTML pages, and transforming the data into rows in a relational table. The wrapper uses extraction rules that apply sophisticated linguistic processing to extract attributes from the web pages, data transformation rules to map and format the data to fit the schema, and data validation rules to ensure data integrity.

Figure 2. A standalone wrapper.

Lightweight Java applications that interact with one (or a few) data sources can interface directly with wrappers. The application sees each data source as a separate JDBC source with its own schema, and must connect to each source separately and combine the data as needed.

Sophisticated applications that use more than a few data sources use the full functionality of the VDBMS, as shown in Figure 3. The VDBMS exposes tables in multiple data sources as virtual tables in a single Virtual Database (VDB), and supports full RDBMS functionality over virtual tables including view definitions and query processing across sources. In the example of Figure 3, the VDB defines the view books as the union of the amazon and powell's virtual tables. When the VDBMS receives the query shown in the figure, the query processor component decomposes the query, determines the fragments to be sent down to the individual data sources, and combines their results. The query result cache caches results from data sources for performance.



Figure 3.
From standalone wrappers to an integrated Virtual Database

In addition, the publishing system can be set up to periodically create physical snapshots of virtual tables in a local relational data store. The VDBMS can also perform data validation tests that are more sophisticated than those at individual wrappers; an example is stability testing, which compares data against historical statistical trends and raises an alert if there is a large deviation.

2.2 The Virtual Database Management System (VDBMS)

Junglee's Virtual Database Management System (VDBMS) enables the creation and management of Virtual Databases such as the Books VDB. Figure 4 shows the components of the VDBMS.

Wrapper Development Kit (WDK).

The Wrapper Development Kit enables the rapid creation of wrappers for web sites, file systems, and other network data sources. The WDK is built around the notion of wrapper frameworks. A wrapper framework is a collection of classes and a programming idiom that together ease the creation of wrappers for a family of data sources. For example, the Navigator framework makes it simple to create wrappers for web sites: a few simple lines of Java code can capture the structure of a complicated web site, including sites whose pages are dynamically generated in response to filled out forms. This framework also captures relationships between hyperlinked pages in a web site and reflects the relationships in the contents of the virtual database tables corresponding to the site.

Figure 4. VDBMS components

Extractor Development Kit (EDK).

Data integration often involves extracting structure from unstructured textual data. For example, consider a newspaper web site that lists apartments for rent. The application needs a table with columns for features such as number of bedrooms, number of bathrooms, location, and rent. However, each apartment classified listing is typically a block of undifferentiated text. Extraction rules describe how to extract the required features from the text.

The Extractor Engine is an interpreter for Jel (Junglee Extraction Language), a language designed to express sophisticated text processing rules. Extraction rules rely on dictionaries of words and phrases with attributes. For example, a location extraction rule may use a dictionary that lists the names of cities and states in the US, together with common abbreviations for the names. The EDK includes Dictionary Management Utilities to create and manage such dictionaries.

Data Quality Kit (DQK).

The DQK deals with data transformation (also called mapping) and data validation.

Wrappers can make arbitrary data sources behave like relational data tables, but these tables are likely to have inconsistent schemas and vocabularies. As a simple example, consider a scenario where some wrappers export salary in $/month, others export it in $/week, and the application wants it in $/year. The required attribute name and unit conversions are handled by a field mapping. Row mappings create new table columns based on the values of other columns.

Virtual databases often deal with highly irregular data from sources outside the control of the VDB administrator, and subject to large-scale changes without notice. Therefore, ensuring data integrity is a key issue for VDBs. The DQK provides several varieties of data validation checks to ensure data integrity.

VDB Server.

The VDB Server exposes tables in multiple data sources as virtual tables in a single Virtual Database (VDB). The VDB Server supports full RDBMS functionality over virtual tables, including view definitions and query processing across sources. The VDB Server supports the JDBC and ODBC APIs and is administered via the browser-based VDB Console. It includes a relational data store, which is a commercial-strength RDBMS. The data store can be used to warehouse snapshots of virtual tables for rapid access, and also to store other physical data tables that are used by the database application. The VDB Server implements a security model that hides the details of authentication with individual data sources.

Data Publishing System (DPS)..

The DPS allows administrators to use virtual tables in two modes: dynamic and warehoused. In dynamic mode, a query to a virtual table results in one or more queries to the underlying data sources. In warehoused mode, the DPS is set up to periodically refresh a snapshot of the virtual table created in the local relational data store, and queries to the table go to the local store. A VDB can contain both kinds of tables, and a single query can span dynamic and warehoused data.

Warehousing is appropriate for slowly-varying data that changes on a daily or weekly basis, while dynamic querying is appropriate for rapidly-varying data, such as stock quotes, or when the data cannot be warehoused because of copyright restrictions. For example, the Books VDB might set up the Amazon and Powell's tables to be dynamically queried (since bookstores tend to change prices at unpredictable intervals) and the NY Times Reviews data to be warehoused weekly in the local data store.

3. The VDB At Work: Real-World Applications

Junglee has applied VDB technology in several key domains: Employment Classifieds, Consumer Shopping, Real Estate, and Apartment listings. The Canopy family of products uses VDB technology to integrate data from a potentially unlimited number of World Wide Web sites, enabling consumers to harness the power of the Web and make more informed decisions.

3.1 Online Recruitment

The JobCanopy VDB application integrates job listings from over 700 data sources, including employer web sites, flat files, and legacy data feeds. The schema for this VDB includes 31 attributes of interest to employers and jobseekers, including job title, job category, job location, and contact information. The data sources are scoured each week to ensure that the information is always fresh. Listings from different employers are normalized to have the same set of fields and the same vocabulary. The JobCanopy product is accessible from the web sites of several major newspapers and online media companies, including The Wall Street Journal Interactive Edition, The Washington Post, The San Jose Mercury News, Classifieds2000, and Westech Virtual Job Fair.

Figure 5. JobCanopy Application

3.2 Web Commerce

The ShopCanopy VDB application allows comparison shopping over 40 merchants in 8 categories, including Books, Music, Computer Hardware, and Consumer Electronics. ShopCanopy is deployed on the Yahoo! Visa Shopping Guide web site.

Figure 6. ShopCanopy Application

The ShopCanopy application brings together buyers and sellers online to create marketplaces on the Web. ShopCanopy allows consumers to easily access and compare product and pricing information from merchants simultaneously, and then link to a specific merchant's site to make a purchase. For example, individuals looking for a specific book can go to the book category on the VISA ShoppingGuide by Yahoo!, type a book title, such as Runaway Jury, in the title field, and in seconds see a list of more than 25 versions of the popular John Grisham novel, offered by several different Internet book stores within the guide, in paperback, hard cover, or audio, with prices from $4.79 to $26.95. Users can also see at a glance which merchants have the book in stock. After making a selection, buying the book is just a click away.

4. Conclusion

VDB technology enables rapid deployment of applications with at least one of the following characteristics:

The World Wide Web, and most Intranets, have all these characteristics. VDB technology converts the Internet into a database and transforms the World Wide Web.