Molecular biology data are distributed among multiple data repositories characterized by various degrees of heterogeneity: they usually provide different views of the molecular biology domain and are implemented using different systems, such as file systems and database management systems (DBMSs).
A database is a data repository that provides a centralized and homogeneous view of data for multiple applications. The data in a database are structured according to a database definition, called a schema, which is specified in a data definition language, and are manipulated using operations specified in a data manipulation language. Data definition and manipulation languages are based on a data model that defines the semantics of the constructs and operations supported by these languages.
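The distinction between data definition and data manipulation can be illustrated with a minimal sketch. It assumes the relational data model and uses SQL through Python's standard sqlite3 module; the gene table, its columns, and the sample rows are hypothetical and serve only as an illustration.

```python
import sqlite3

# In-memory database; the table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")

# Data definition language: the schema declares how the data are structured.
conn.execute(
    "CREATE TABLE gene ("
    "gene_id TEXT PRIMARY KEY, symbol TEXT NOT NULL, chromosome TEXT)"
)

# Data manipulation language: operations that insert and retrieve data.
conn.executemany(
    "INSERT INTO gene (gene_id, symbol, chromosome) VALUES (?, ?, ?)",
    [("G1", "BRCA1", "17"), ("G2", "TP53", "17"), ("G3", "CFTR", "7")],
)
symbols_on_17 = [
    row[0]
    for row in conn.execute(
        "SELECT symbol FROM gene WHERE chromosome = ? ORDER BY symbol", ("17",)
    )
]
```

Here the CREATE TABLE statement plays the role of the schema, while INSERT and SELECT are manipulation operations whose meaning is fixed by the relational model.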
Comprehensive studies of molecular biology data often involve exploring multiple molecular biology databases, which entails coping with the distribution of data among these databases, the heterogeneity of the systems underlying these databases, and the semantic (schema representation) heterogeneity of these databases. Early attempts to manage heterogeneous databases were based on resolving heterogeneity by consolidating these databases either physically, through integration into a single homogeneous database, or virtually, by imposing a common data definition language, data model, or even DBMS upon heterogeneous databases. These attempts failed because they required a degree of cooperation that is very difficult to attain, as well as the costly replacement of applications already built on the existing databases.
The most effective way of coping with heterogeneous
databases is to allow them to preserve their autonomy,
that is, their local definitions, applications, and
policy of exchanging data with other databases.
Approaches to managing heterogeneous databases include
connecting them using the World Wide Web (WWW),
organizing them into database federations or multidatabase systems,
and constructing data warehouses.
Heterogeneous databases can be connected via WWW hypertext links at the level of individual data items. Data retrieval in such systems is limited to selecting a starting data item within one database and then following hyperlinks between data items within or across databases. Numerous molecular biology databases currently provide such links. In addition, systems such as SRS and DBGET extract explicit links from existing flat-file molecular biology databases and construct indexes for both direct and reverse links, allowing fast access to these databases.
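The idea of indexing both direct and reverse links can be sketched as follows. This is not how SRS or DBGET are implemented; it is a minimal illustration under the assumption that cross-references have already been extracted from the flat files. The database names, entry identifiers, and the function build_link_indexes are all hypothetical.

```python
from collections import defaultdict

# Hypothetical extracted cross-references:
# (database, entry id) -> entries that this entry links to.
entries = {
    ("SEQDB", "S001"): [("PROTDB", "P100")],
    ("SEQDB", "S002"): [("PROTDB", "P100"), ("MAPDB", "M7")],
    ("PROTDB", "P100"): [("MAPDB", "M7")],
}

def build_link_indexes(entries):
    """Build a direct index (source -> targets) and a reverse index
    (target -> sources) from the extracted cross-references."""
    forward = defaultdict(set)
    reverse = defaultdict(set)
    for source, targets in entries.items():
        for target in targets:
            forward[source].add(target)
            reverse[target].add(source)
    return forward, reverse

forward, reverse = build_link_indexes(entries)

# Direct link lookup: which entries does SEQDB:S002 point to?
linked_from_s002 = sorted(forward[("SEQDB", "S002")])
# Reverse link lookup: which entries point to PROTDB:P100?
linking_to_p100 = sorted(reverse[("PROTDB", "P100")])
```

The reverse index is what allows navigation against the direction of the links stored in the flat files, without scanning every entry at query time.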
Database federations and data warehouses entail developing a global schema (view) of the component heterogeneous databases, where definitions of these databases are expressed in a common data definition language and discrepancies between these definitions are resolved before they are integrated into the global schema. For data warehouses, data from component databases are also loaded (integrated) into a central database. The Integrated Genomic Database (IGD) is an example of a molecular biology data warehouse. Database federations and data warehouses usually provide full query facilities and insulate users from the component databases via their global schemas. In a database federation, query translators convert queries expressed over the global schema to queries for component databases. In a data warehouse, query processing is local to the warehouse and therefore query translators are not required.
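The difference in query processing between the two approaches can be sketched in a few lines. The sketch assumes a toy global schema that simply renames local attributes; the component databases db_a and db_b, their field names, and all function names are hypothetical, and real systems must resolve far richer discrepancies than attribute names.

```python
# Hypothetical component databases with different local attribute names.
COMPONENTS = {
    "db_a": [{"locus": "BRCA1", "chrom": "17"}, {"locus": "CFTR", "chrom": "7"}],
    "db_b": [{"gene_name": "TP53", "chromosome": "17"}],
}

# Global schema: global attribute -> local attribute, per component database.
MAPPINGS = {
    "db_a": {"symbol": "locus", "chromosome": "chrom"},
    "db_b": {"symbol": "gene_name", "chromosome": "chromosome"},
}

def translate(global_filter, db):
    """Query translator: rewrite a filter over global attributes
    into a filter over the local attributes of one component."""
    return {MAPPINGS[db][attr]: value for attr, value in global_filter.items()}

def to_global(record, db):
    """Express a local record in terms of the global schema."""
    inverse = {local: glob for glob, local in MAPPINGS[db].items()}
    return {inverse[key]: value for key, value in record.items()}

def matches(record, condition):
    return all(record.get(key) == value for key, value in condition.items())

def federated_query(global_filter):
    """Federation: translate the query and evaluate it at each component."""
    results = []
    for db, records in COMPONENTS.items():
        local_filter = translate(global_filter, db)
        results.extend(to_global(r, db) for r in records if matches(r, local_filter))
    return results

def build_warehouse():
    """Warehouse: load all component data, already converted, into one store."""
    return [to_global(r, db) for db, records in COMPONENTS.items() for r in records]

def warehouse_query(warehouse, global_filter):
    """Warehouse queries run locally; no translation is needed."""
    return [r for r in warehouse if matches(r, global_filter)]

warehouse = build_warehouse()
fed = federated_query({"chromosome": "17"})
wh = warehouse_query(warehouse, {"chromosome": "17"})
```

In this sketch both paths return the same answer; the difference is where the conversion work happens. The federation converts at query time against live component data, whereas the warehouse converts at load time and must then be kept synchronized with its components.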
Multidatabase systems are collections of loosely coupled databases that are not integrated using a global schema. Querying multidatabase systems involves constructing queries over component databases, where a query explicitly refers to the elements of each database it involves. Component databases of multidatabase systems can be queried using a common query language, as is done in Kleisli, or can be both described and queried using a common data model, as is done in OPM/MDB, the Object-Protocol Model Multidatabase Query System.
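A query that explicitly refers to the elements of each database it involves might look like the following sketch. It does not use Kleisli or OPM/MDB; as a stand-in, it assumes SQL as the common query language and uses SQLite's ATTACH DATABASE statement, via Python's standard sqlite3 module, to make two separate databases addressable by name. The databases seqdb and mapdb, their tables, and the sample accessions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two independent component databases, each addressable by its own name.
conn.execute("ATTACH DATABASE ':memory:' AS seqdb")
conn.execute("ATTACH DATABASE ':memory:' AS mapdb")

# Each component keeps its own definitions; there is no global schema.
conn.execute("CREATE TABLE seqdb.sequence (accession TEXT, gene TEXT)")
conn.execute("CREATE TABLE mapdb.locus (gene TEXT, chromosome TEXT)")

conn.executemany(
    "INSERT INTO seqdb.sequence VALUES (?, ?)",
    [("X0001", "BRCA1"), ("X0002", "CFTR")],  # made-up accessions
)
conn.executemany(
    "INSERT INTO mapdb.locus VALUES (?, ?)",
    [("BRCA1", "17"), ("CFTR", "7")],
)

# The query names each database and its elements explicitly.
rows = conn.execute(
    "SELECT s.accession, m.chromosome "
    "FROM seqdb.sequence AS s "
    "JOIN mapdb.locus AS m ON s.gene = m.gene "
    "ORDER BY s.accession"
).fetchall()
```

The author of the query must know which database holds which element and how the elements correspond (here, through the shared gene column). A global schema would otherwise hide this knowledge from the user.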
Numerous systems have been developed in the past several years for
exploring data across heterogeneous molecular biology databases.
Unfortunately, most of these systems are sparsely documented,
seldom mention their assumptions and limitations, and therefore
are hard to evaluate.
Four papers in the Journal of Computational Biology, Vol. 2, No. 4, 1995,
address these problems by examining the main trends followed in
developing heterogeneous molecular biology database systems
and by proposing criteria for evaluating such systems
in terms of their scalability, architecture,
and facilities.
The paper by Stanley Letovsky, Beyond the Information Maze, proposes scalability criteria for evaluating technologies employed for managing molecular biology data, and applies these criteria to examine the World Wide Web, the Wide Area Information Server (WAIS) system, traditional database systems, and federated database systems.
The paper by Victor M. Markowitz and Otto Ritter, Characterizing Heterogeneous Molecular Biology Database Systems, proposes criteria for characterizing heterogeneous database systems in terms of their heterogeneity and cooperation assumptions, schema and data converters, extent of detecting and resolving semantic conflicts, database correlations, query interfaces, query processing mechanisms, and synchronization with component databases.
The paper by Susan Davidson, Christian Overton, and Peter Buneman, Challenges in Integrating Biological Data Sources, examines available methodologies for developing heterogeneous database systems, presents classification criteria for such systems, and identifies the main technical challenges for developing enhanced heterogeneous database systems.
The paper by Peter Karp, A Strategy for Database Interoperation, identifies assumptions and requirements for supporting querying across heterogeneous databases, shows that these requirements are only partially satisfied by existing heterogeneous molecular biology database systems, and proposes a strategy for addressing these requirements.
The database concepts and criteria presented in these papers can be used to characterize and document heterogeneous molecular biology database systems effectively. Precise characterization of these systems is indispensable for sharing methods, technology, and experience in the quest to develop enhanced facilities for exploring heterogeneous molecular biology databases.