Next: Semantic Problems of Up: EXPLORING HETEROGENEOUS MOLECULAR Previous: EXPLORING HETEROGENEOUS MOLECULAR

Introduction

Data of interest to molecular biologists are distributed over numerous heterogeneous molecular biology databases (MBDs). These MBDs display heterogeneity at various levels: they are implemented using different systems, such as structured files or database management systems (DBMSs), are based on different views of the molecular biology domain, and contain different and possibly conflicting data. Furthermore, each MBD represents some part of the molecular biology domain and is often designed to address certain queries or applications. The data in an MBD are structured according to a schema specified in a data definition language (DDL) and are manipulated using operations specified in a data manipulation language (DML), where these languages are based on a data model that defines the semantics of their constructs and operations. Exploring multiple MBDs entails coping with the distribution of data among MBDs, the heterogeneity of the systems underlying these MBDs, and the semantic (schema representation) heterogeneity of these MBDs.

Strategies for managing heterogeneous MBDs can be grouped into two main categories (see [2,14,20] for related classifications of heterogeneous database systems):

  1. Consolidation strategies that entail replacing heterogeneous MBDs with a single homogeneous MBD formed by physically integrating the component MBDs, or by requiring MBDs to be reorganized using a common DDL or DBMS.

  2. Federation strategies that allow access to multiple heterogeneous MBDs, while the component MBDs preserve their autonomy, that is, their local definitions, applications, and policy of exchanging data with other MBDs. Federation strategies include:

    1. incorporating in MBDs references (links) to elements in other MBDs, or constructing MBDs consisting of such links;

    2. organizing MBDs into loosely-coupled multidatabase systems; and

    3. constructing data warehouses.

Heterogeneous MBDs can be connected via hypertext links on the Web at the level of individual data items. Data retrieval in such systems is limited to selecting a starting data item within one MBD and then following hyperlinks between data items within or across MBDs. Note that data item links (e.g., hypertext links) between MBDs neither require nor comply with schema correlations across MBDs. Numerous MBDs currently provide such links. However, missing and inconsistent links between MBDs have prompted some archival MBDs to propose coordinating the management of links between their MBDs [1]. Systems such as SRS [10,11] and LinkDB [13] extract existing link information from (usually flat file) MBDs and construct indexes for both direct and reverse links, allowing fast access to these MBDs. These systems resolve heterogeneity issues such as duplicate or incompatible identifiers, but they provide only simple index and key match retrieval and do not support full query facilities.
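The idea of indexing both direct and reverse links can be illustrated with a minimal sketch. This is not how SRS or LinkDB is implemented; the function name, the identifiers, and the "database:accession" naming scheme are assumptions made only for illustration.

```python
from collections import defaultdict


def build_link_indexes(links):
    """Build direct and reverse indexes from (source, target) link pairs."""
    direct = defaultdict(set)   # source item -> items it links to
    reverse = defaultdict(set)  # target item -> items that link to it
    for source, target in links:
        direct[source].add(target)
        reverse[target].add(source)
    return direct, reverse


# Hypothetical data item identifiers of the form "database:accession".
links = [
    ("DB_A:0001", "DB_B:X100"),
    ("DB_A:0002", "DB_B:X100"),
    ("DB_A:0001", "DB_C:P7"),
]
direct, reverse = build_link_indexes(links)

# Follow links forward from one item, or backward to the items citing it.
forward_hits = sorted(direct["DB_A:0001"])
backward_hits = sorted(reverse["DB_B:X100"])
```

Lookups of this kind are fast key matches, but they cannot express conditions over the contents of the linked items, which is the limitation noted above.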

Multidatabase systems are collections of loosely coupled MBDs which are not integrated using a global schema. Querying multidatabase systems involves constructing queries over component MBDs, where a query explicitly refers to the elements of each MBD involved. Component MBDs of multidatabase systems can be queried using a common query language, as is done in Kleisli [3], or can be both described and queried using a common data model, as entailed by the Object-Protocol Model tool-based strategy described in this paper. The common query language approach does not require the component MBDs to be represented using a common DDL or data model; however, the users are required to have some knowledge regarding the structure of the MBDs they query. On the other hand, the common data model approach requires all participating MBDs to have a view defined in a common DDL so that users can examine and query component MBDs in the context of the same data model. Unlike link-based MBD systems, multidatabase systems support query languages that allow specifying complex query conditions across MBDs. A query translator is needed for translating queries expressed in the multidatabase query language to subqueries targeting component MBDs, and for optimizing these queries.
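The role of a query translator can be sketched as follows, under strong simplifying assumptions. The component databases, field names, and the condition and join formats are hypothetical; this is not the Kleisli or OPM query language, and it performs no optimization.

```python
from collections import defaultdict

# Two hypothetical component databases, each with its own local structure.
COMPONENTS = {
    "map_db": [
        {"gene": "G1", "chromosome": "7"},
        {"gene": "G2", "chromosome": "X"},
    ],
    "seq_db": [
        {"gene_name": "G1", "length": 1200},
        {"gene_name": "G2", "length": 800},
        {"gene_name": "G3", "length": 950},
    ],
}


def split_conditions(conditions):
    """Group (db, field, value) conditions into one subquery per component."""
    subqueries = defaultdict(list)
    for db, field, value in conditions:
        subqueries[db].append((field, value))
    return subqueries


def run_subquery(db, predicates):
    """Evaluate equality predicates locally against one component."""
    return [
        row for row in COMPONENTS[db]
        if all(row.get(field) == value for field, value in predicates)
    ]


def multidatabase_query(conditions, join):
    """Run per-component subqueries, then join their results.

    join is ((left_db, left_field), (right_db, right_field)).
    """
    subqueries = split_conditions(conditions)
    (left_db, left_field), (right_db, right_field) = join
    left = run_subquery(left_db, subqueries.get(left_db, []))
    right = run_subquery(right_db, subqueries.get(right_db, []))
    return [(l, r) for l in left for r in right if l[left_field] == r[right_field]]


# The query names the elements of each component explicitly.
result = multidatabase_query(
    conditions=[("map_db", "chromosome", "7")],
    join=(("map_db", "gene"), ("seq_db", "gene_name")),
)
```

Note that the caller must know that one component calls the field `gene` and the other `gene_name`, which reflects the requirement that users have some knowledge of the structure of the MBDs they query.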

Data warehouses entail developing a global schema (view) of the component MBDs, where definitions of these MBDs are expressed in a common DDL and discrepancies between these definitions are resolved before they are integrated into the global schema. Data from component MBDs are transformed in order to comply with this global schema, and loaded into a central data repository. The Integrated Genomic Database (IGD) [18], Genome Topographer (GT) [9], and Entrez [21] are examples of data warehouses, where IGD is developed with the ACeDB database system, GT is developed with the Gemstone commercial object-oriented DBMS, and Entrez is based on ASN.1 structured files. The query facilities of data warehouses are provided by the underlying system (e.g., ACeDB), and query processing is local to the warehouse. However, constructing data warehouses requires costly initial integration of component MBDs, followed by frequent synchronization with these MBDs in order to capture the evolution of their schemas. Moreover, data warehouses need to be updated on a regular basis in order to reflect updates of component MBDs.
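The transform-and-load step of a warehouse can be sketched in the same spirit. The field mappings, the global schema, and the merge rule are assumptions for illustration only and do not describe IGD, GT, or Entrez; in particular, conflicting values are simply overwritten by the last source loaded, whereas a real warehouse would resolve such discrepancies explicitly.

```python
# Hypothetical mappings from each component's local field names
# to the field names of an assumed global schema.
FIELD_MAPPINGS = {
    "map_db": {"gene": "gene_id", "chromosome": "chromosome"},
    "seq_db": {"gene_name": "gene_id", "length": "sequence_length"},
}


def to_global(source, record):
    """Rename a component record's fields to comply with the global schema."""
    mapping = FIELD_MAPPINGS[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def load_warehouse(sources):
    """Transform all component records and merge them by gene_id."""
    warehouse = {}
    for source, records in sources.items():
        for record in records:
            row = to_global(source, record)
            # Simplification: later sources overwrite earlier values.
            warehouse.setdefault(row["gene_id"], {}).update(row)
    return warehouse


warehouse = load_warehouse({
    "map_db": [{"gene": "G1", "chromosome": "7"}],
    "seq_db": [{"gene_name": "G1", "length": 1200}],
})
```

Once loaded, queries run against `warehouse` alone, but the load must be repeated whenever a component's data or schema changes.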

In this paper, we propose a tool-based strategy for exploring heterogeneous MBDs in the context of the Object-Protocol Model (OPM). We will argue that possibly the most difficult tasks in exploring heterogeneous MBDs are understanding the semantics of component MBDs and their connections, and specifying and interpreting queries expressed over multiple MBDs. Our strategy involves developing tools that provide facilities for (i) constructing and maintaining OPM views for MBDs implemented using a variety of DBMSs; (ii) assembling MBDs into an OPM-based multidatabase system, while documenting MBD schemas and known schema links between MBDs; (iii) examining the semantics of MBDs; (iv) supporting multidatabase queries via uniform OPM interfaces; and (v) assisting scientists in specifying and interpreting queries. Each of these tools can be used independently and therefore represents a valuable resource in its own right.

The rest of this document is organized as follows. The main semantic problems of exploring heterogeneous MBDs are discussed in section 2. Our tool-based strategy is described in section 3. An example of applying our strategy for querying GDB 6.0 and GSDB 2.0 is described in section 4. In section 5, we review the status of implementing our strategy and our plans for continuing this work.






& Markowitz
Thu Mar 14 15:45:38 PST 1996