Abstract

Much of the data on the Web is hidden behind Web query interfaces. In most cases the only means to "surface" the content of a Web database is by formulating complex queries on such interfaces. Applications such as Deep Web crawling and Web database integration require automated use of these interfaces. Therefore, an important problem to be addressed is the automatic extraction of query interfaces into an appropriate model. We hypothesize the existence of a set of domain-independent "commonsense design rules" that guides the creation of Web query interfaces. These rules transform query interfaces into schema trees. In this paper we describe a Web query interface extraction algorithm, which combines HTML tokens and the geometric layout of these tokens within a Web page. Tokens are classified into several classes, of which the most significant are text tokens and field tokens. A tree structure is derived for the text tokens using their geometric layout. Another tree structure is derived for the field tokens. The hierarchical representation of a query interface is obtained by iteratively merging these two trees. Thus, we convert the extraction problem into an integration problem. Our experiments show the promise of our algorithm: it outperforms previous approaches to extracting query interfaces by about 6.5% in accuracy, as evaluated over three corpora with more than 500 Deep Web interfaces from 15 different domains.
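To make the idea of combining token classes with geometric layout concrete, the following is a minimal sketch, not the paper's algorithm. It assumes each token has already been classified as "text" or "field" and carries pixel coordinates, and it uses a simple hypothetical heuristic (attach each field to the nearest text label on the same row, otherwise to the lowest label above it) in place of the tree-merging procedure described in the abstract. All names (`Token`, `Node`, `build_schema_tree`, `row_tolerance`) are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    kind: str  # "text" or "field" (assumed to be classified already)
    name: str  # label text or form-field name
    x: int     # left edge, in pixels
    y: int     # top edge, in pixels


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def build_schema_tree(tokens: List[Token], row_tolerance: int = 5) -> Node:
    """Group field tokens under text labels using layout only.

    Hypothetical heuristic: prefer the rightmost text token on the same
    row and to the left of the field; otherwise fall back to the lowest
    text token above the field.
    """
    texts = [t for t in tokens if t.kind == "text"]
    fields = [t for t in tokens if t.kind == "field"]
    root = Node("root")
    groups: Dict[str, Node] = {}

    # Visit fields in reading order: top to bottom, then left to right.
    for f in sorted(fields, key=lambda t: (t.y, t.x)):
        same_row = [t for t in texts
                    if abs(t.y - f.y) <= row_tolerance and t.x <= f.x]
        above = [t for t in texts if t.y < f.y]
        if same_row:
            label = max(same_row, key=lambda t: t.x).name
        elif above:
            label = max(above, key=lambda t: (t.y, -abs(t.x - f.x))).name
        else:
            label = "(unlabeled)"
        if label not in groups:
            groups[label] = Node(label)
            root.children.append(groups[label])
        groups[label].children.append(Node(f.name))
    return root


# Usage: a two-row form with one single-field row and one two-field row.
tokens = [
    Token("text", "Author", 10, 10),
    Token("field", "author_input", 100, 10),
    Token("text", "Price range", 10, 40),
    Token("field", "min_price", 100, 40),
    Token("field", "max_price", 200, 40),
]
tree = build_schema_tree(tokens)
```

In this sketch the two price fields end up as siblings under one label, which is the kind of grouping a schema tree is meant to capture. A real extractor would need the richer design rules and the iterative merging of the text and field trees that the paper describes.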

Keywords

design, experimentation, languages, measurement, performance, query formulation, query languages, query processing, world wide web

Date of this Version

2009

Comments

Proceedings of the VLDB Endowment, Volume 2, Issue 1, August 2009

Included in

Engineering Commons, Life Sciences Commons, Medicine and Health Sciences Commons, Physical Sciences and Mathematics Commons

Cyber Center Publications

A Hierarchical Approach to Model Web Query Interfaces for Web Source Integration
