A database server for next-generation scientific data management
The growth of scientific information and the increasing automation of data collection have made databases integral to many scientific disciplines including life sciences, physics, meteorology, earth and atmospheric sciences, and chemistry. These sciences pose new data management challenges to current database system technologies. This dissertation addresses the following three challenges:
(1) Annotation Management: Annotations and provenance information are important metadata that go hand-in-hand with scientific data. Annotating scientific data represents a vital mechanism for scientists to share knowledge and build an interactive and collaborative environment. A major challenge is: How to manage large volumes of annotations, especially at various granularities, e.g., cell, column, and row level annotations, along with their corresponding data items.
(2) Complex Dependencies Involving Real-world Activities: The processing of scientific data is a complex cycle that may involve sequences of activities external to the database system, e.g., wet-lab experiments, instrument readings, and manual measurements. These external activities may incur inherently long delays to prepare for and to conduct. Updating a database value may render parts of the database inconsistent until some external activity is executed and its output is reflected back and updated into the database. The challenge is: How to integrate these external activities within the database engine and accommodate the long delays between the updates while making the intermediate results instantly available for querying.
(3) Fast Access to Scientific Data with Complex Data Types: Scientific experiments produce large volumes of data of complex types, e.g., arrays, images, long sequences, and multi-dimensional data. A major challenge is: How to provide fast access to these large pools of scientific data with non-traditional data types.
In this dissertation, I present extensions to current database engines to address the above challenges. The proposed extensions enable scientific data to be stored and processed within their natural habitat: the database system. Experimental studies and performance analysis for all the proposed algorithms are carried out using both real-world and synthetic datasets. Our results show the applicability of the proposed extensions and their performance gains over other existing techniques and algorithms.
Walid G. Aref, Purdue University; Ahmed K. Elmagarmid, Purdue University.