Stop Word and Related Problems in Web Interface Integration


The goal of recent research projects on integrating Web databases has been to enable uniform access to the large amount of data behind query interfaces. Among the tasks addressed are: source discovery, query interface extraction, schema matching, etc. There are also a number of tasks that are commonly ignored or assumed to be apriori solved either manually or by some oracle. These tasks include (1) finding the set of stop words and (2) handling occurrences of "semantic enrichment words" within labels. These two subproblems have a direct impact on determining the synonymy and hyponymy relationships between labels. In (1), a word like "from" is a stop word in general but it is a content word in domains such as Airline and Real Estate. We formulate the stop word problem, prove its complexity and provide an approximation algorithm. In (2), we study the impact of words like AND and OR on establishing semantic relationships between labels (e.g. "departure date and time" is a hypernym of "departure date"). In addition, we develop a theoretical framework to differentiate synonymy relationship from hyponymy relationship among labels involving multiple words. We scrutinize its strength and limitations both analytically and experimentally. We use real data from the Web in our experiments. We analyze over 2300 labels of 220 user interfaces in 9 distinct domains.


database applications, design, experimentation, measurement, oracle, performance, world wide web

Date of this Version



Proceedings of the VLDB Endowment VLDB Endowment Hompage archive Volume 2 Issue 1, August 2009