
Semantics of Global Schemas and Views

Examining data within and across MBDs is currently hampered by the lack of information on MBDs and their semantics. The need for comprehensive documentation of MBD schemas was discussed extensively at the last two Meetings on Interconnection of Molecular Biology Databases [15,19]. It was observed at these meetings that MBD schemas capture domain knowledge about biology, and that the goal of schema design is therefore not only to achieve an efficient implementation but also to support biological exploration.

Existing systems for exploring heterogeneous MBDs do not address the problem of understanding the semantics of component MBDs. For example, systems that support links between MBDs do not provide any information regarding the structure or semantics of the linked MBDs. Some multidatabase systems require users to know the structure (schemas) of component MBDs, without providing any support for this purpose. The highest expectations are raised by data warehouses, which present a single unified view while insulating users from the component MBDs. Unfortunately, systems such as GT, IGD, and Entrez are documented no better than their component MBDs, if they are documented at all. IGD, GT, and Entrez are based on global schemas (views) of their component MBDs, expressed in ACeDB DDL for IGD, Gemstone DDL for GT, and ASN.1 for Entrez. As of March 1996, no documentation was available on the structure of the GT data warehouse; IGD's global schema, specified in ACeDB DDL, is scarcely discussed (see [18]), and questions such as the relationship between the relatively simple IGD schema and the often more complex schemas of its components are left unanswered; the structure of the ASN.1 files underlying Entrez is also scarcely documented (see [21]).

The global schemas of systems such as GT, IGD, and Entrez are not based on schema integration techniques, but are the result of independent schema design processes based on the domain knowledge underlying the component MBDs. The global schemas for GT, IGD, and Entrez were developed locally by small groups, and therefore do not represent `consensus' schemas. In order to reduce the complexity of schema design and to make global schemas general enough, developers usually design these schemas using `generic' classes and/or attributes, which may not be applicable to individual component MBDs and may not fully capture the semantics of the data. Little or no information is provided regarding the relationships between global schemas and participating MBD schemas. Furthermore, the global schemas of such integrated databases are expressed in system-dependent DDLs (e.g., ACeDB's DDL). Their design may therefore be affected by system considerations, so that they contain features which do not reflect domain modeling requirements, and they may not capture all the information in each component MBD. For example, certain features of the IGD schema are governed by ACeDB's limitations for modeling large (approaching one gigabyte) databases, rather than by any semantic considerations.

Constructing global schemas or local views for exploring heterogeneous MBDs usually requires detecting semantic conflicts between schemas of component MBDs, ranging from naming conflicts and inconsistencies to identical entities of interest that are represented differently. The same concept can be represented in different schemas by using synonyms, alternative terminology, or different data structures. For example, one database could use the term primer to represent a class of primer sequences, while another could use the term oligo, and yet another could simply represent primers directly using their sequence data. Conversely, homonyms (the same term used for different concepts) can cause naming conflicts in a heterogeneous MBD environment. Domain conflicts can be caused by storing similar values using different units or formats in different MBDs, or by conflicting data arising from different experiments or experimental techniques. Entities of interest can be represented using various data structures in different MBDs, where the diversity of representations stems from different views of the data (e.g., an Author can be represented only as an attribute of a Citation or as an independent object) and from the underlying DDL (e.g., a Citation can be represented as an object of a class within an object data model, but needs to be represented with one or several tables using a relational DDL). Other causes of conflicts include different ways of representing incomplete information (e.g., the meaning of nulls), and different ways of identifying objects in MBDs.
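The naming and domain conflicts described above can be illustrated with a minimal sketch. It is not taken from any of the systems discussed; the synonym table, the unit table (base pairs, kilobases, megabases), and the function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical synonym table: local class name -> shared concept name.
# The primer/oligo pair follows the example in the text; the table itself
# is an assumption, not part of any real MBD.
SYNONYMS = {
    "primer": "primer",
    "oligo": "primer",
}

# Hypothetical unit table for sequence lengths, expressed in base pairs.
TO_BASE_PAIRS = {
    "bp": 1,
    "kb": 1_000,
    "Mb": 1_000_000,
}


def canonical_class(local_name):
    """Map a local class name to a shared concept name.

    Names without an entry in SYNONYMS are returned lowercased and
    otherwise unchanged, so unresolved names remain visible.
    """
    key = local_name.lower()
    return SYNONYMS.get(key, key)


def length_in_bp(value, unit):
    """Convert a length to base pairs; raises KeyError for unknown units."""
    return value * TO_BASE_PAIRS[unit]
```

A lookup table of this kind handles only synonyms and unit differences. Homonyms and structural differences (such as Author stored as an attribute versus an independent object) need information that a flat table cannot express.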

Resolving schema conflicts is a complex task whose methods range from simple renamings, which resolve naming conflicts, to schema restructurings, which resolve structural dissimilarities. Systems such as IGD and SRS detect and resolve only simple schema and data conflicts, such as some name and object identification conflicts. Alternatively, a heterogeneous MBD system could leave conflict resolution to users. If users are responsible for conflict resolution, such a system could provide a mechanism for recording their resolutions and making them available to other users.
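One possible shape for a mechanism that records user resolutions is sketched below. It covers only renamings, the simplest case mentioned above. The classes `Resolution` and `ResolutionRegistry`, their fields, and the database names in the usage lines are hypothetical, and the sketch does not describe how IGD, SRS, or any other system works.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resolution:
    """One user-supplied renaming (all field names are hypothetical)."""
    source_db: str    # component database in which the local name occurs
    source_name: str  # local class or attribute name
    target_name: str  # name the user chose in the shared view
    author: str       # who recorded the resolution
    note: str = ""    # optional free-text justification


@dataclass
class ResolutionRegistry:
    """In-memory store of resolutions, shared among users."""
    _entries: list = field(default_factory=list)

    def record(self, resolution):
        """Append a resolution; earlier entries are kept as history."""
        self._entries.append(resolution)

    def lookup(self, source_db, source_name):
        """Return all resolutions recorded for a name, oldest first."""
        return [
            r for r in self._entries
            if r.source_db == source_db and r.source_name == source_name
        ]

    def rename(self, source_db, source_name):
        """Apply the most recent resolution, or return the name unchanged."""
        matches = self.lookup(source_db, source_name)
        return matches[-1].target_name if matches else source_name


# Usage with made-up database and user names.
registry = ResolutionRegistry()
registry.record(Resolution("DB_A", "oligo", "primer", "user1", "same concept"))
resolved = registry.rename("DB_A", "oligo")      # resolved by the registry
unresolved = registry.rename("DB_B", "clone")    # no resolution recorded
```

Keeping every entry rather than overwriting it lets other users see who proposed a resolution and why, which matters when two users disagree.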






& Markowitz
Thu Mar 14 15:45:38 PST 1996