
Introduction

Molecular biology data are scattered among multiple data repositories, including molecular biology databases (MBDs). Although they contain related data, these repositories are often isolated and are characterized by various degrees of heterogeneity: they usually represent different views (schemas) of the molecular biology domain and are implemented using different database management systems (DBMSs). Comprehensive studies of biological data often involve examining data across heterogeneous databases.

Solutions currently promoted for querying data across heterogeneous MBDs involve constructing MBD federations or data warehouses, such as the Genome Topographer (Cold Spring Harbor Laboratory) [7] and the Integrated Genomic Database (German Cancer Research Institute) [13]. These solutions entail constructing a global view of a collection of MBDs, where definitions of the component MBDs are expressed in a common language and discrepancies between these definitions are resolved before they are integrated into a global view. For data warehouses, data from MBDs must also be loaded into a central data repository. The main problem of MBD federations and data warehouses is the complexity of constructing global views. Data warehouses also have the additional problems of not being synchronized with evolving component MBDs and of potentially very large physical sizes.

Heterogeneous MBDs can also be connected via WWW hypertext links at the level of individual data items. Data retrieval in such systems is limited to selecting a starting data item within one MBD and then following hyperlinks between data items within or across MBDs. Numerous MBDs currently provide such links (see [12]). In addition, systems such as SRS [8] extract explicit links from existing flat file MBDs and construct indexes for both direct and reverse links, allowing fast access to these MBDs.
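
The idea of indexing both direct and reverse links can be illustrated with a minimal Python sketch. It is not based on the actual SRS index format: the entry identifiers, the `ENTRIES` table, and the function `build_link_indexes` are hypothetical names chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical flat-file entries: (database, entry id) -> explicit links
# to entries in the same or in other databases.
ENTRIES = {
    ("DB_A", "a1"): [("DB_B", "b1"), ("DB_B", "b2")],
    ("DB_A", "a2"): [("DB_B", "b2")],
    ("DB_B", "b1"): [],
    ("DB_B", "b2"): [],
}


def build_link_indexes(entries):
    """Return (direct, reverse) indexes over the explicit links."""
    direct = defaultdict(set)   # entry -> entries it points to
    reverse = defaultdict(set)  # entry -> entries that point to it
    for source, targets in entries.items():
        for target in targets:
            direct[source].add(target)
            reverse[target].add(source)
    return direct, reverse


direct, reverse = build_link_indexes(ENTRIES)
print(sorted(direct[("DB_A", "a1")]))   # entries reachable from a1
print(sorted(reverse[("DB_B", "b2")]))  # entries that link to b2
```

With the reverse index in place, a link can be followed from its target back to its source without scanning every entry, which is what makes navigation in both directions fast.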

Querying heterogeneous MBDs can be achieved without constructing MBD federations or data warehouses, by organizing MBDs in a loose multidatabase system. We have developed a multidatabase query strategy for MBDs implemented with relational DBMSs, in the context of the Object-Protocol Model (OPM) data management tools [4]. For MBDs that have not been developed using the OPM tools, OPM views of the MBDs are first constructed using an OPM retrofitting tool. Then, the OPM query translator [6] provides facilities for browsing and querying MBDs associated with such OPM views.

Our multidatabase query strategy is based on an MBD dictionary that contains information on MBDs, including their OPM views, DBMS implementation, and links to other MBDs. Multidatabase queries are expressed in the OPM multidatabase query language (OPM*QL). A multidatabase query translator processes OPM*QL queries expressed over MBDs associated with OPM views, by

  1. decomposing these queries into subqueries for individual MBDs;

  2. using the OPM query translator for processing the subqueries; and

  3. assembling subquery results into multidatabase query results.
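
The three steps above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the dictionary layout, the names `MBD_DICTIONARY`, `decompose`, `run_subquery`, `assemble`, and `evaluate`, the toy in-memory data, and the `accession` link attribute are assumptions made for illustration. The sketch does not reflect the actual OPM*QS implementation or OPM*QL syntax.

```python
# Hypothetical MBD dictionary: for each database, its DBMS kind and the
# attributes that link it to other databases (all values are assumptions).
MBD_DICTIONARY = {
    "GDB": {"dbms": "relational",
            "links": {"GSDB": ("accession", "accession")}},
    "GSDB": {"dbms": "relational", "links": {}},
}

# Toy stand-in for the contents of the two databases.
TOY_DATA = {
    "GDB": [{"gene": "G1", "accession": "A100"},
            {"gene": "G2", "accession": "A200"}],
    "GSDB": [{"accession": "A100", "length": 1200},
             {"accession": "A300", "length": 800}],
}


def decompose(query):
    """Step 1: split a multidatabase query into one subquery per MBD."""
    return [{"mbd": mbd, "select": attrs}
            for mbd, attrs in query["select"].items()]


def run_subquery(subquery):
    """Step 2: stand-in for a single-database query translator."""
    rows = TOY_DATA[subquery["mbd"]]
    return [{a: row[a] for a in subquery["select"]} for row in rows]


def assemble(query, results):
    """Step 3: join subquery results along the link in the dictionary."""
    left, right = query["from"]
    left_key, right_key = MBD_DICTIONARY[left]["links"][right]
    index = {}
    for row in results[right]:
        index.setdefault(row[right_key], []).append(row)
    return [{**rrow, **lrow}
            for lrow in results[left]
            for rrow in index.get(lrow[left_key], [])]


def evaluate(query):
    """Run the three steps in order for one multidatabase query."""
    results = {sq["mbd"]: run_subquery(sq) for sq in decompose(query)}
    return assemble(query, results)


query = {
    "from": ("GDB", "GSDB"),
    "select": {"GDB": ["gene", "accession"],
               "GSDB": ["accession", "length"]},
}
print(evaluate(query))
```

In this sketch the dictionary is consulted only during assembly, to find the attributes that link the two databases. A real translator would presumably also use it during decomposition, for example to decide which parts of a query each database can answer.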

Our query strategy assumes that users understand the structure and semantics of the MBDs they query. In a related project, we plan to develop an MBD Schema Library that contains comprehensive documentation on MBDs and provides facilities to assist users in expressing multidatabase queries.

The multidatabase query strategy outlined above is implemented as part of an OPM Multidatabase Query System (OPM*QS). OPM*QS is currently used for querying Genome DataBase (GDB) 6.0 and Genome Sequence Data Base (GSDB) 2.0.

OPM*QS follows a strategy similar to the Kleisli system developed at the University of Pennsylvania [3]. Kleisli provides query access to multiple heterogeneous data sources using the query language CPL. Each data source requires a driver that maps CPL queries into queries against that data source, and then maps the resulting data into the nested relational data model. The nested relational model is entirely value-based, with no direct support for object identities or constraints, but supports arbitrary nesting of record and variant type constructors with collection types (sets, bags and lists). Kleisli takes a programming language approach to querying multiple databases: it does not have a concept of database schemas, and when it accesses databases via functions it does not check whether these functions are compatible with the schemas of the target databases. Furthermore, unlike OPM*QS, Kleisli requires users to understand the native database schemas (e.g., the Sybase definition of GSDB) and to find the relevant links between databases at this low level.

The rest of this document is organized as follows. Section 2 contains a brief overview of OPM. The OPM multidatabase query language is presented in Section 3. The OPM multidatabase query strategy is discussed in Section 4. Section 5 examines the existing links and overlaps between the OPM schemas of GDB 6.0 and GSDB 2.0. Section 6 discusses the processing of three typical multidatabase queries expressed over GDB 6.0 and GSDB 2.0.






Victor M. Markowitz
Wed Jan 17 16:39:09 PST 1996