Molecular biology data are distributed among multiple
molecular biology databases (MBDs).
Although containing related data, these databases are often isolated
and are characterized by various degrees of heterogeneity: they usually
represent different views (schemas) of the scientific domain and are
implemented using different data management systems ranging from
file management systems to database management systems (DBMSs).
For example, flat ASCII files were used to maintain and distribute early MBDs,
such as the Protein Data Bank (PDB), Swiss-Prot, and OMIM.
Some of these MBDs have recently switched to DBMSs
(e.g., PDB to Sybase),
but they still distribute data as ASCII files.
Large archival MBDs, such as
Genome DataBase (GDB),
FlyBase,
Genome Sequence Data Base (GSDB),
and the Protein Data Bank (PDB),
are maintained using commercial relational DBMSs, such as Sybase.
Some MBDs, such as the genome databases for nematode, yeast, and
various plants, are implemented
with ACeDB (Durbin and Thierry-Mieg 1992).
A few MBDs, such as LabBase and MapBase of the
Whitehead Institute for Biomedical Research in Cambridge, Massachusetts,
have been developed using object-oriented DBMSs,
such as ObjectStore (Goodman et al. 1994).
There are also many legacy systems containing valuable data
stored in files, sometimes with missing, obsolete, or incomplete
structural information (metadata).
Early attempts at resolving the heterogeneity problem for MBDs included proposals for standardizing on a specific DBMS, data definition language (DDL), or even database definition (schema). This approach seemed to be supported by the trend of cloning existing MBDs to develop new ones. For example, several ACeDB-based plant genome databases started as clones of AAtDB, the Arabidopsis thaliana database, originally developed at Harvard. However, standardization efforts failed, mainly because they required a hard-to-attain degree of cooperation between database groups with different resources and goals, as well as a complex and expensive replacement of the applications based on the existing databases. Furthermore, no single DDL or DBMS satisfies the requirements of all MBDs. For example, Sybase is used for archival MBDs such as GDB, GSDB, and PDB because relational DBMSs in general, and Sybase in particular, are robust and reliable, and provide facilities, such as concurrent and secure data access, that are essential for operating these MBDs. For MBDs that do not need all the facilities provided by commercial DBMSs (e.g., powerful query processing, concurrency control) and cannot afford the time required for developing and maintaining interfaces to large relational databases, systems like ACeDB, with built-in MBD-specific interfaces, offer an adequate and cheaper alternative. As for standardizing schemas, even cloned MBDs do not have identical schemas, since they tend to represent different views of the scientific domain.
With the realization that the heterogeneity of existing MBDs will persist, numerous systems have been developed in the past several years for handling data in heterogeneous MBDs. Unfortunately, these systems are usually developed in isolation, are sparsely documented, and seldom state their assumptions and limitations. It is therefore very hard to determine how these systems compare to one another in terms of shared goals, architecture, and facilities.
Consider, for example, the Genome Topographer (GT) (Cozza et al. 1994) and the Integrated Genomic Database (IGD) (Ritter 1994). GT and IGD are implemented with different DBMSs (GemStone and ACeDB, respectively), but share the goal of providing uniform data retrieval and querying facilities via an integrated database containing data from multiple MBDs. It is not clear how GT and IGD differ beyond being implemented with different DBMSs, having different sizes, and having different user interfaces, or how both GT and IGD differ from SRS (Etzold and Argos 1993), Entrez (Schuler et al. 1995), Docking-D/RELIWE (Aberer 1995), and LinkDB (Goto et al. 1995), which all share the goals of IGD and GT.
Comparing systems like GT, IGD, SRS, LinkDB, Docking-D, and Entrez based on criteria such as the number and size of their component databases or the underlying data management system they use is not satisfactory. A system that at a given time consists of fewer components than another system is not necessarily inferior, since it can, for example, have a mechanism that allows it to incorporate new components quickly and without side effects.
The rest of the paper is organized as follows. In section 2, we briefly review traditional criteria for classifying heterogeneous database systems and argue that these criteria are not sufficient for a comprehensive evaluation and comparison of such systems. In section 3, we propose additional criteria for characterizing heterogeneous database systems in terms of their heterogeneity and cooperation assumptions, schema and data converters, extent of detecting and resolving semantic conflicts, type of database correlations, query interfaces and processing, and synchronization with component databases. In section 4, we illustrate how these criteria can be used for characterizing heterogeneous molecular biology database systems.
2