Koch, C.: Data Integration against Multiple Evolving Autonomous Schemata, Ph.D. Thesis


DISSERTATION

    Data Integration against

    Multiple Evolving Autonomous Schemata

carried out for the purpose of obtaining the academic degree of Doktor der technischen Wissenschaften (Doctor of Technical Sciences), under the supervision

of

o. Univ.-Prof. Dr. Robert Trappl, Institut für medizinische Kybernetik und Artificial Intelligence,

Universität Wien

and

Universitätslektor Dipl.-Ing. Dr. Paolo Petta, Institut für medizinische Kybernetik und Artificial Intelligence,

Universität Wien

submitted to the Technische Universität Wien,

Fakultät für Technische Naturwissenschaften und Informatik,

by

Christoph Koch (E9425227)

    A-1030 Wien, Beatrixgasse 26/70

    Wien, am


    Abstract

Research in the area of data integration has resulted in approaches such as federated and multidatabases, mediation, data warehousing, global information systems, and the model management/schema matching approach. Architecturally, approaches can be categorized into those that integrate against a single global schema and those that do not, while on the level of inter-schema constraints, most work can be classified either as so-called global-as-view or as local-as-view integration. These approaches differ widely in their strengths and weaknesses.

Federated databases have been found applicable in environments in which several autonomous information systems coexist, each with its own schema, and need to share data. However, this approach does not provide sufficient support for dealing with change of schemata and requirements. Other approaches to data integration, which are centered around a single global integration schema, on the other hand cannot handle design autonomy of information systems. Under evolution, this type of autonomy eventually leads to schemata between which neither the global-as-view nor the local-as-view approaches to source integration can be used to express the inter-schema semantics.

In this thesis, this issue is addressed with a novel approach to data integration which combines techniques from model management, mediation, and local-as-view integration. It allows for the design of inter-schema mappings that are more robust when change occurs. The work has been motivated by the requirements of large scientific collaborations in high-energy physics, as encountered by the author during his stay at CERN.

The approach presented here is based on two foundations. The first is query rewriting with very expressive symmetric inter-schema constraints, called conjunctive inclusion dependencies (cinds). These are containment relationships between conjunctive queries. We address a very general form of the source integration problem, in which several schemata may coexist, each of them containing a number of purely logical as well as a number of source entities. For the source entities, the information system that belongs to the schema holds data, while the logical entities are meant to allow schema entities from other information systems to be integrated against. The query rewriting problem now aims at rewriting a query over (possibly) both source and logical schema entities of one schema into source entities only, which may be part of any of the schemata known. Under the classical logical semantics, and given a conjunctive input query, we address the problem of finding maximally contained positive rewritings under a set of cinds. Such rewritten queries can then be optimized and efficiently answered using classical distributed database techniques. For the purpose of data integration and the sake of computability, we require the dependency graph of a set of cinds to be acyclic with respect to inclusion direction.

Regarding the query rewriting problem, we first present semantics and main theoretical properties. Subsequently, algorithms and optimizations based on techniques from database theory are presented, which have been implemented in a research prototype. Finally, experimental results based on this prototype are presented, which demonstrate the practical feasibility of our approach.

Reasoning is done exclusively over schemata and queries, and is independent of data volumes, which renders it highly scalable. Apart from that, this flavor of query rewriting has another important strength. The expressiveness of the constraints allows for much freedom and flexibility in modeling the peculiarities of a mapping problem. For instance, both global-as-view and local-as-view integration are special cases of the query rewriting problem addressed in this thesis. As will be shown, this flexibility allows one to design mappings that are robust with respect to change, as principles such as the decoupling of inter-schema dependencies can be implemented. It is furthermore clear that query rewriting with cinds also permits dealing with concept mismatch in a very wide sense, as each pair of corresponding concepts in two schemata can be modeled as conjunctive queries.

The second foundation is model management based on cinds as inter-schema constraints. Under the model management approach to data integration, schemata and mappings are treated as first-class citizens in a repository, on which model management operations can be applied. This thesis proposes definitions of schemata and mappings, as well as an array of powerful operations, which are well suited for designing and maintaining mappings between information systems when change is an issue. To complete this work, we propose a methodology for dealing with evolving schemata as well as changing integration requirements.

The combination of the contributions of this thesis brings a practical improvement of openness and flexibility to the federated database and model management approaches to data integration, and a first practical integration architecture to large, complex, and evolving computing environments such as those encountered in large scientific collaborations.


Inhaltsangabe (Summary)

Research in the field of data integration has brought forth directions such as federated and multidatabases, mediation, data warehousing, global information systems, and model management or schema matching. From an architectural point of view, one can distinguish between approaches in which integration is performed against a single global schema and those where this is not the case. On the level of inter-schema semantics, most of the previous research can be classified into the so-called global-as-view and local-as-view approaches. These approaches differ, in part considerably, in their individual properties.

Federated databases have proven useful in environments in which several information systems need to exchange data with one another, while each of these information systems has its own schema and is autonomous as far as the design of that schema is concerned. In practice, however, this approach does not support the maintenance of changing schemata. Other known approaches, which integrate against a global schema, in turn do not support the design autonomy of information systems. When schema changes become necessary, this kind of autonomy often leads to schemata against which the desired inter-schema semantics can be expressed neither by global-as-view nor by local-as-view approaches.

This problem is the topic of this dissertation, in which a new approach to data integration is proposed that unites ideas from model management, mediation, and local-as-view integration. Our approach enables the modeling of (partial) mappings between schemata that exhibit an advantageous robustness against change. The motivation for the presented results stems from an extended stay of the author at CERN, during which the goals and needs of large scientific collaborations concerning their information infrastructure were studied.

Our approach rests on two central foundations. The first is query rewriting under very expressive symmetric inter-schema dependencies, namely inclusion dependencies between so-called conjunctive queries, which we call conjunctive inclusion dependencies (cinds). We address a very general form of the source integration problem, in which several schemata may coexist, and each of them may contain both real database entities, for which data are available, and purely logical or virtual entities, against which dependencies on other schemata can be defined with the help of cinds. The query rewriting problem now aims at rewriting a query, which may be posed over both logical and real entities of one schema, into another query that uses only real database entities, which may, if necessary, come from any of the schemata known to the integration system. More precisely, under the classical logical semantics and given a set of cinds and a conjunctive input query, we address the problem of finding maximally contained positive rewritings.


    Acknowledgments

Most of the work on this thesis was carried out during a 30-month stay at CERN, which was sponsored by the Austrian Federal Ministry of Education, Science and Culture under the CERN Austrian Doctoral Student Program.

I would like to thank the two supervisors of my thesis, Robert Trappl of the Department of Medical Cybernetics and Artificial Intelligence of the University of Vienna and Jean-Marie Le Goff of CERN / ETT Division and the University of the West of England, for their continuous support. This thesis would not have been possible without their help.

Paolo Petta of the Austrian Research Institute for Artificial Intelligence took over much of the day-to-day supervision, and I am indebted to him for countless hours of discussions, proofreading of draft papers, and feedback of any kind.

I would like to thank Enrico Franconi of the University of Manchester for provoking my interest in local-as-view integration during his short visit at CERN in early 2000, which has influenced this thesis. I am also indebted to Richard McClatchey and Norbert Toth of the University of the West of England and CERN for valuable comments on parts of an earlier version of this thesis. Any remaining mistakes are, of course, entirely mine.


    Contents

1 Introduction
  1.1 A Brief History of Data Integration
  1.2 The Problem
  1.3 Use Case: Large Scientific Collaborations
  1.4 Contributions of this Thesis
  1.5 Relevance
  1.6 Overview

2 Preliminaries
  2.1 Query Languages
  2.2 Query Containment
  2.3 Dependencies
  2.4 Global Query Optimization
  2.5 Complex Values and Object Identities

3 Data Integration
  3.1 Definitions and Overview
  3.2 Federated and Multidatabases
  3.3 Data Warehousing
  3.4 Information Integration in AI
    3.4.1 Integration against Ontologies
    3.4.2 Capability Descriptions and Planning
    3.4.3 Multi-agent Systems
  3.5 Global-as-view Integration
    3.5.1 Mediation
    3.5.2 Integration by Database Views
    3.5.3 Systems
  3.6 Local-as-view Integration
    3.6.1 Answering Queries using Views
    3.6.2 Algorithms
    3.6.3 Bibliographic Notes
  3.7 Description Logics-based Information Integration
    3.7.1 Description Logics
    3.7.2 Description Logics as a Database Paradigm
    3.7.3 Hybrid Reasoning Systems
  3.8 The Model Management Approach
  3.9 Discussion of Approaches

4 Reference Architecture
  4.1 Architecture
  4.2 Mediating a Query
  4.3 Research Issues

5 Query Rewriting
  5.1 Outline
  5.2 Preliminaries
  5.3 Semantics
    5.3.1 The Classical Semantics
    5.3.2 The Rewrite Systems Semantics
    5.3.3 Equivalence of the two Semantics
    5.3.4 Computability
    5.3.5 Complexity of the Acyclic Case
  5.4 Implementation
  5.5 Experiments
    5.5.1 Chain Queries
    5.5.2 Random Queries
  5.6 Discussion

6 Model Management
  6.1 Model Management Repositories
  6.2 Managing Change
    6.2.1 Decoupling Mappings
    6.2.2 Merging Schemata
  6.3 Managing the Acyclicity of Constraints

7 Outlook
  7.1 Physical Data Independence
    7.1.1 The Classical Problem
    7.1.2 Versions of Logical Schemata
  7.2 Rewriting Recursive Queries

8 Conclusions


    List of Figures

1.1 Mappings in LAV (left) and GAV (right).
1.2 The space of objects that can be shared using symmetric mappings given true concept mismatch between entities of source and integration schemata.
1.3 Data flow between information systems that manage the steps of an experiment's lifecycle.
1.4 ER diagrams for Example 1.3.1: Electronics database (left) and product-data management system (right).
1.5 Concept mismatch between PCs of the electronics database and parts of the product-data management system of Project1.
1.6 Architecture of the information infrastructure.
3.1 Artist's impression of source integration.
3.2 Federated 5-layer schema architecture.
3.3 Data warehousing architecture and process.
3.4 MAS architectures for the intelligent integration of information. Arrows between agents depict exemplary communication flows. Numbers denote logical time stamps of communication flows.
3.5 A mediator architecture.
3.6 MiniCon descriptions of the query and views of Example 3.6.1.
3.7 Comparison of global-as-view and local-as-view integration.
3.8 Comparison of Data Integration Architectures.
4.1 Reference Architecture.
5.1 Hypertile of size i + 2 (left) and the nine possible overlapping hypertiles of size i + 1 (right).
5.2 Experiments with chain queries and nonlayered chain cinds.
5.3 Experiments with chain queries and two layers of chain cinds.
5.4 Experiments with chain queries and five layers of chain cinds.
5.5 Experiment with random queries.
6.1 Operations on schemata.
6.2 Operations on mappings.
6.3 Complex model management operations.
6.4 Data integration infrastructure of Example 6.2.1. Schemata are visualized as circles and elementary mappings as arrows.
6.5 The lifecycle of the mappings of a legacy integration schema.
6.6 Merging auxiliary integration schemata to improve maintenance.
6.7 A clustered auxiliary schema. Schemata are displayed as circles and mappings as arrows.
7.1 A cind as an inter-schema constraint (A) compared to a data transformation procedure (B). Horizontal lines depict schemata and small circles depict schema entities. Mappings are shown as thin arrows.
7.2 EER diagram of the university domain (initial version).
7.3 EER diagram of the university domain (second version).
7.4 Fixpoint of the bottom-up derivation of Example 7.2.1.


    Chapter 1

    Introduction

The integration of heterogeneous databases and information systems is an area of high practical importance. The very success of information systems and data management technology in a short period of time has caused the virtual omnipresence of stand-alone systems that manage data, islands of information that by now have grown too valuable not to be shared. However, this sharing, and with it the resolution of heterogeneity between systems, entails interesting and nontrivial problems, which have received much research interest in recent years. Ongoing research activity, however, is evidence of the fact that many questions remain unanswered.

    1.1 A Brief History of Data Integration

Given a number of heterogeneous information systems, in practice it is not always desirable or even possible to completely reengineer and reimplement them to create one homogeneous information system with a single schema (schema integration [BLN86, JLVV00]). Instead, it is often necessary to perform data integration [JLVV00], where schemata of heterogeneous information systems are left unchanged and integration is carried out by transforming queries or data. To realize such transformations, some flavor of mappings (either procedural code or declarative inter-schema constraints) between information systems is required. If the data integration reasoning is entirely effected on the level of queries and schema-level descriptions, this is usually called query rewriting, while the term data transformation refers to heterogeneous data themselves being classified, transformed, and fused to appear homogeneous under some integration schema.

Most previous work on data integration can be classified into two major directions by the method by which the inter-schema mappings used for integration are expressed (see e.g. [FLM98, Ull97]). These are called local-as-view (LAV) [LMSS95, YL87, LRO96, GKD97, AK92, TSI94, CKPS95] and global-as-view (GAV) [GMPQ+97, ACPS96, CHS+95, FRV95] integration.


The more traditional paradigm is global-as-view integration, where mappings, often called mediators after [Wie92], are defined as follows. Mediators implement virtual entities (concepts, relations or classes, depending on nomenclature and data model used) exported by their interfaces as views over the heterogeneous sources, specifying how to combine their data to resolve some (or all) of the experienced heterogeneity. Such mediators can be (generalizations of) simple database views (e.g. CREATE VIEW constructs in SQL) or can be implemented by some procedural code. Global-as-view integration has been used in multidatabases [SL90], data warehousing [JLVV00], and recently for the integration of multimedia sources [ACPS96, CHS+95] and as a fertile testbed for semistructured data models and technologies [GMPQ+97].

In the local-as-view paradigm, inter-schema constraints are defined in strictly the opposite way1. Queries over a purely logical global mediated schema are answered by treating sources as if they were materialized views over the mediated schema, where only these materialized views may be used to answer the query; after all, the mediated schema does not directly represent any data. Query answering then reduces to the so-called problem of answering queries using views, which has been intensively studied by the database community [LMSS95, DGL00, AD98, BLR97, RSU95] and is related to the query containment problem [CM77, CV92, Shm87, CDL98a]. Local-as-view integration has not only been applied to and shown to be well-suited for data integration in global information systems [LRO96, GKD97, AK92], but also in related applications beyond data integration, such as query optimization [CKPS95] and the maintenance of physical data independence [TSI94].
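To make the contrast concrete, consider a minimal sketch in the rule notation used later in this thesis; the relations involved (source relations s.emp and s.dept_location and a mediated relation works_in) are hypothetical and serve only as an illustration. In GAV, a mediated relation is defined as a view over the sources; in LAV, a source relation is described as a view over the mediated schema:

    GAV:  works_in(N, D, L) ← s.emp(N, D), s.dept_location(D, L)
    LAV:  s.emp(N, D) ← works_in(N, D, L)

Under GAV, a query over works_in is answered by unfolding the view definition; under LAV, it must be rewritten in terms of the source descriptions, i.e., by answering queries using views.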

An important distinction is to be made between data integration architectures that are centered around a single global integration schema against which all sources are integrated (this is the case, for instance, for data warehouses and global information systems, and is intrinsic to the local-as-view approach) and others that are not, such as federated and multidatabases. The lack of a single global integration schema in the data integration architecture has a problematic consequence. Each source may need to be mapped against each of the integration schemata, leading to a large number of mappings that need to be created and managed. In architectures such as those of federated database systems, where each component database may be a source and a consumer of integrated data at once, a quadratic number of mappings may be required.
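As a rough illustration with hypothetical numbers: in a federation of 20 component databases, each acting as both a source and a consumer of integrated data, up to 20 · 19 = 380 directed pairwise mappings may have to be created and maintained.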

The globality of integration schemata is usually judged by their role in an integration architecture. Global schemata are singletons that occupy a very central role in the architecture, and are unique, consistent, and homogeneous world views against which all other schemata in the system (usually considered the sources) are to be integrated.

1At first sight, this may appear unintuitive, but is not. For instance, the local-as-view approach can be motivated by AI planning for information gathering using content descriptions of sources in terms of a global world model (as planning operators) [AK92, KW96].


Figure 1.1: Mappings in LAV (left) and GAV (right). [Figure: Venn diagrams relating the space of tuples expressible as queries over the global schema, the space of tuples expressible as queries over the sources (sources 1-3), and the tuples accessible through mediators.]

There is globality in integration schemata on a different level as well. We want to consider integration schemata as designed at will while taking a global perspective if

• they are artifacts specifically created for the resolution of some heterogeneity, and

• the entirety of sources in the system that have any relevance to those heterogeneity problems addressed have been taken into account in the design process.

Thus, in such global schemata, a global perspective has been taken when designing them. However, they do not have to be monolithic homogeneous world views. This qualifies the collection of logical entities exported by mediators in a global-as-view integration system as a specifically designed global integration schema, although such a schema is not necessarily homogeneous.

An important characteristic of data integration approaches is how well concept mismatch occurring between source and integration schemata can be bridged. We have pointed out that both GAV and LAV use a flavor of views for the mapping between sources and integration schemata. In Figure 1.1, we compare the local-as-view and global-as-view paradigms by visualizing (by Venn diagrams) the spaces of tuples (in relational queries) or objects that can be expressed by queries over source and integration schemata.

Views as inter-schema constraints are strongly asymmetric. One single atomic schema entity appearing in a schema on one side of the invisible conceptual border line between integration and source schemata is always defined by a query or (as the general idea of mediation permits) by some procedural code which computes


the entity's extent over the schemata on the other side of that border line. As a consequence, both LAV and GAV are restricted in how well they can deal with concept mismatch2.

This restriction is theoretical, because in both LAV and GAV it is always implicitly assumed that sources are integrated against integration schemata that have been freely designed, with no other constraints imposed than the current integration requirements3. However, when data need to be integrated against schemata of information systems that have design autonomy, or when integration schemata have a legacy4 burden that an integration approach has to be able to deal with, both LAV and GAV fail.

Note that views are not the only imaginable way of mapping schemata in data integration architectures. For mappings that are not expressible as views, it may be possible to relate the spaces of objects expressible by complex logical expressions, say queries, over the concepts of the schemata (see Figure 1.2).

Legacy integration schemata are faced when

• there is no central design authority providing global schemata,

• future integration requirements or changes to schemata of information systems cannot be appropriately predicted,

• existing integration schemata cannot be amended when integration requirements or the nature of sources to be made available change in an unforeseen way, or

• the creation of global schemata is infeasible because of the size and complexity of the problem domain and modeling task5 [MKW00].

Recent work in the area has resulted in two new approaches that do not center around a single global integration schema and where inter-schema constraints do not necessarily have that strictly asymmetric syntax encountered in LAV and GAV. The first uses expressive description logics systems with symmetric constraints for data integration [CDL98a, CDL+98b, Bor95]. Constraints can be

2See Example 1.3.1 and [Ull97].

3This makes the option of the change of requirements or the nature of sources after the design of the integration schemata has been finished hover over such architectures like Damocles' sword.

4We do not refer to the legacy systems issue here, though. In principle, legacy systems are operational systems that in some aspect of their design differ from what they ideally should be like; they use at least one technology that is no longer part of the current overall strategy in some enterprise or collaborative environment [AS99]. In practice, information systems are usually referred to as legacy in the context of data integration if they are not even based on a modern data management technology, usually making it necessary to treat them monolithically and wrap them [GMPQ+97, RS97] by software that makes them appear to respond to data requests under a state-of-the-art data management paradigm.

5This may make the Semantic Web effort of the World Wide Web Consortium [Wor01] seem to be threatened by another very sharp blade hanging by an amazingly fragile thread.


Figure 1.2: The space of objects that can be shared using symmetric mappings given true concept mismatch between entities of source and integration schemata. [Figure: Venn diagram relating the space of tuples expressible as queries over the sources, the space of tuples expressible as queries over the global schema, and the tuples that can be made available to queries over the integrated schema by mappings from sources.]

defined as containment relationships between complex concepts that represent (path) queries. The main drawback is that integration has to be carried out as ABox reasoning [CDL99], i.e. the classification of data in a (hybrid) description logics system [Neb89]. This does not scale well to large data volumes. Furthermore, such an approach is not applicable when sources have restricted interfaces (as is often the case on the Web) and it is not possible to import all data of a source into the reasoning system.

The second approach, model management [BLP00, MHH+01], treats schemata and mappings between schemata as first-class objects that can be stored in a repository and manipulated with cleanly defined model management operations. This direction is still at an early stage, and no convergence toward clean, widely usable semantics has occurred yet. Mappings are often defined as lines between concepts (e.g. relations or classes in schemata) using an array of semantics that are often not very expressive. While such approaches allow for neat graphical visualization and editing of mappings, they do not provide the mechanisms and expressive semantics to support design and modeling actions that make evolving schemata manageable.

    1.2 The Problem

The problem addressed in this thesis is the following. We aim at an approach to data integration that satisfies three requirements.

• Individual information systems may have design autonomy for their schemata. In general, no global schemata can be built. Each individual schema may have been defined before integration requirements were completely known, and be ill-suited for a particular integration task.


• Individual schemata may evolve independently. Even the best-designed integration schemata may end up with concept mismatch that cannot be dealt with through view-based mappings.

• The third requirement concerns the scalability of the approach. The data integration problem has to be solved entirely on the level of queries and descriptions of information systems (i.e., query rewriting) rather than on the level of reasoning over the data, to ensure the independence of the approach from the amount of data managed.

Given that the number of mappings in data integration architectures with autonomous component systems may be quadratic in the number of schemata and thus very large, and that schemata and integration requirements may change, a way of managing schemata and mappings is needed that is simple and for which many tasks can be automated. This requires support for managing mappings and their change, and for reusing mappings both actively, in the actions performed for managing schemata and mappings, and passively, through the transitivity of their semantics6.

The work presented in this thesis has been carried out in the context of a very large international scientific collaboration in the area of high-energy physics. We will have a closer look at the problem of providing interoperability of information systems in that domain in Section 1.3.

    1.3 Use Case: Large Scientific Collaborations

Large scientific collaborations are becoming more and more common, since nowadays cutting-edge scientific research in areas such as high-energy physics, the human genome, or aerospace has become extremely expensive. Data integration is an issue since many of the individual information systems being operated in such an environment require integrated data to be provided from other information systems in order to work. As we will point out in this section, the main sources of difficulty related to source integration in the information infrastructures of such collaborations are the design autonomy of information systems, change of requirements and evolution of schemata, and large data sets.

A number of issues stand in the way of building a single unified global logical schema (as exists for data warehouses or global information systems) for a large science project. We will summarize them next.

Heterogeneity. Heterogeneity is pervasive in large scientific research collaborations, as there are existing legacy systems as well as largely autonomous groups that build more such legacy systems.

6That is, given that we have defined a mapping from schema A to schema B and a mapping from schema B to schema C, we assume that we automatically arrive at a mapping from schema A to schema C.


Scientific collaborations consist of a number7 of largely autonomous institutes that independently develop and maintain their individual information systems8. This lack of central control fosters creativity and is necessary for political and organizational reasons. However, it leads to problems when it comes to making information systems interoperate. In such a setting, heterogeneity arises for many reasons. Firstly, no two designers would conceptualize a given problem situation in the same way. Furthermore, distinct groups of researchers have fundamentally different ways of dealing with bodies of knowledge, due to different (human) languages, professional background, community or project jargon9, teacher and curriculum, or school of thought. Several subcommunities independently develop and use similar but distinct software for the same tasks. As a consequence, one can assume similar but slightly different schemata10. In an environment such as the Large Hadron Collider (LHC) project at CERN [LHC] and huge experiments such as CMS [CMS95] currently under preparation, potentially hundreds of individual information systems will be involved with the project during its lifetime, some of them commercial products, others homegrown efforts of possibly several hundred person-years. This is the case because, even for the same task, sub-collaborations or individual institutes working on different subprojects independently build systems.

When it comes to the types of heterogeneity that may be encountered in such an environment, it has to be remarked that beyond heterogeneity due to discrepancies in the conceptualizations of human designers (including polysemy, terminological overlap, and misalignment), there is also heterogeneity that is intrinsic to the domain. For example, in the environment of high-energy physics experiments (say, a particle detector), detector parts will necessarily be conceptualized differently depending on the kind of information system in which they are represented. For instance, in a CAD system that is used for designing the particle detector, parts will be spatial structures; in a construction management system, they will have to be represented as tree-like structures modeling compositions of parts and their sub-parts; and in simulation and experimental data taking, parts have to be aggregated by associated sensors (readout channels), with respect to which an experiment becomes a topological structure largely distinct from the one of the design drawing. We believe that such differences also lead to different views on the knowledge level, and certainly lead to different database schemata.

Hardness of Modeling. Apart from the notion of intrinsic heterogeneity introduced in the previous paragraph, there are a number of other issues that contribute to the hardness of modeling in a scientific domain. Firstly,

7In large collaborations, they may amount to hundreds.

8The requirements presented here closely relate to classifications of component autonomy in federated databases [HM85].

9Such jargon may have developed over time in previous projects in which a group of people may have worked together.

10Unfortunately, it is often trickier to deal with subtle than with great mismatch.


Figure 1.4: ER diagrams for Example 1.3.1: Electronics database (left) and product-data management system (right). [Figure: the left diagram shows entities pc, cpu, and location (with attributes id and name) related by pc_cpu and pc_location; the right diagram shows entities part and location (with attributes id and name) related by part_of and part_location.]

built until other information systems have already been in existence for years. In such an experimental setting, full understanding of the requirements for subsequent information systems can often only be achieved once the information systems for the current work have been implemented. Nevertheless, since some information systems are already in need of data integration, one either has to build a global logical schema today, which might become invalid later, leading to serious maintenance problems of the information infrastructure (that is, of the logical views that map sources), or an approach has to be followed that goes without such a schema. Since it is impossible to preview all the requirements of a complex system far into the future, one cannot avoid the need for change through proper a priori design.

Concept Mismatch. It is clear from the above observations that concept mismatch between schemata relevant to data integration may occur in the domain of high-energy physics research.

Example 1.3.1 Assume there are two information systems, the first of which is a database holding data on electronics components13 of an experiment under construction, with the relational schema

R1 = {pc_cpu(Pc, Cpu), pc_location(Pc, LocId), location(LocId, LocName)}

The database represents information about PCs and their CPUs as well as the location where these parts currently are to be found. Locations have a name and an identifier.

13To make the example more easily accessible, we speak of personal computers as the sole electronics parts represented. Of course, personal computers are not representative building blocks of high-energy physics experiments.


Figure 1.5: Concept mismatch between PCs of the electronics database and parts of the product-data management system of Project1. [Figure: Venn diagram of the set "PCs" (source schema) and the set "Parts of Project1" (destination schema), overlapping in "PCs of Project1".]

The second system is a product data management system for a subproject Project1 with the schema

R2 = {part_of(Part1, Part2), part_location(Part, LocId), location(LocId, LocName)}

(see also Figure 1.4). The second database schema represents an assembly tree of Project1 by the relation part_of and, again, the locations of parts.

Let us now assume that the first information system (the electronics database) holds data that should be shared with the second. We assume that while the names of the locations are the same in the second as in the first information system, the domains of the location ids in the two information systems must be assumed to be distinct, and cannot be shared.

We thus experience two kinds of complications with this integration problem. The distinct key domains for locations in the two information systems in fact entail that a correspondence must be made between (derived) concepts in the two schemata that are both defined by queries14. Furthermore, we observe concept mismatch. The first schema contains only electronics parts, but may do so for other projects besides Project1 as well, while in the second schema only parts of Project1 are to be represented, but those parts are not restricted to electronics parts (Figure 1.5).

As a third complication in this example, we assume some granularity mismatch. Assume that the second information system is to hold a more detailed model of Project1 than the first and shall represent CPUs as parts of mainboards of PCs and those in turn as parts of PCs, rather than just CPUs as parts of PCs. Of course, we have no information on mainboards in the electronics database, but this information could be obtained from another source.

    14Thus, this correspondence could neither be expressed in GAV nor in LAV.


We could encode this by the following semantic constraint, expressing a mapping between schemata by a containment relationship between two queries:

{Pc, Cpu, LocName | ∃Mb, LocId : R2.part_of(Mb, Pc) ∧ R2.part_of(Cpu, Mb)
                                 ∧ R2.location(LocId, LocName) ∧ R2.part_location(Pc, LocId)}
⊇
{Pc, Cpu, LocName | ∃LocId : R1.pc_cpu(Pc, Cpu) ∧ R1.belongs_to(Pc, Project1)
                             ∧ R1.location(LocId, LocName) ∧ R1.pc_location(Pc, LocId)}

    Informally, one may read this constraint as

    PCs together with their CPUs and locations which are marked as belonging to Project1 in the first information system should be part of the answers to queries over parts and their locations in the second information system, where CPUs should be known as parts two levels below PCs in the assembly hierarchy represented by the part_of relation.

We do not provide any formal semantics of such constraints for data integration at this point, but rely on the intuition that such a containment constraint between two queries expresses the desired inter-schema dependency and allows, given appropriate reasoning algorithms (if they exist), to perform data integration in the presence of concept mismatch in a wide sense.
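To see how such a constraint acts on data, consider a small hypothetical instance of the first schema (the tuples are purely illustrative): suppose R1 contains pc_cpu(pc1, cpuA), belongs_to(pc1, Project1), pc_location(pc1, l1), and location(l1, Building40). The query on the right-hand side of the constraint then yields the tuple ⟨pc1, cpuA, Building40⟩, and the constraint requires this tuple to also be an answer to the left-hand query; that is, the second information system must relate pc1 to cpuA through some mainboard Mb with part_of(Mb, pc1) and part_of(cpuA, Mb), and must place pc1, under its own location identifiers, at a location named Building40.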

Large Data Sets. Scientific computing has always been known for manipulating very large amounts of data. Data volumes in information systems related to the construction of LHC experiments are expected to be in the Terabyte range, and experimental data collected during the lifetime of LHC will amount to dozens of Petabytes. For scalability reasons, information integration has to be carried out on the level of queries (query rewriting) rather than data (data transformation).

    1.4 Contributions of this Thesis

This thesis is, to the best of our knowledge, the first to actually address the problem of data integration with multiple unsophisticated evolving autonomous integration schemata. Each such schema may consist of both source relations that hold data and logical relations that do not. Schemata may be designed without taking other schemata or data integration considerations into account. Each query over a schema is rewritten into a query exclusively over source relations of information systems in the environment, using a number of schema mappings.

We propose an approach to data integration (see Figure 1.6) based on model management and query rewriting with expressive constraints within a federated architecture. Our flavor of query rewriting is based on constraints with clean,


constraints and realistic applications. We conclude with a discussion of how our query rewriting approach fits into state-of-the-art data integration and model management systems.

Regarding model management, we present definitions of data models, schemata, mappings, and a set of expressive model management operations for the management of schemata in a data integration setting. We argue that our approach can overcome the problems related to unsophisticated legacy integration schemata, and provide a sketch of a methodology for managing evolving mappings.

    1.5 Relevance

As we discuss a framework for data integration that is based on very weak assumptions, this thesis is relevant to a large number of applications in which other approaches eventually fail. These include networks of autonomous virtual enterprises having different deployment lifecycles or standards for their information systems, the information infrastructure of large international collaborations (e.g., in science), and large enterprises that face the integration of several existing heterogeneous data warehouses after mergers or acquisitions or a major change of business model. More generally, our work is applicable in simply any environment in which anything less than full commitment exists towards far-ranging reengineering of information systems to bring all information systems that roam its environment under a single common enterprise model. Obviously, our work may also allow federated databases [HM85, SL90] to deal more successfully with schema evolution.

Let us reconsider the point of design autonomy for schemata of information systems in the case of companies and e-commerce. For many good reasons, companies nowadays want to have their information systems interoperate; however, there is no sufficiently strong trend towards agreeing on schemata. While there is clearly much work done towards standardization, large players in IT have an incentive to propose competing standards and bodies of meta-data. Asking for common schemata beyond enterprise boundaries today is hardly realistic. Instead, even the integration of the information systems inside a single large enterprise is a problem almost too hard to solve17, and motivates some independence of the information infrastructure of horizontal or vertical business units, again leading to the legacy integration schema problem that we want to address here. That said, the work in this thesis is highly relevant to business-to-business e-commerce and the management of the extended supply chain and virtual enterprises.

17This of course excludes the issue of data warehouses, which, although they have a global scope w.r.t. the enterprise, address only a small part of the company data (in terms of schema complexity, not volume), such as sales information, that are usually well understood and where requirements are not expected to change much in the future.


Data warehouses that have been the result of large and very expensive design and reengineering efforts customized to a specific enterprise really are legacy systems from the day their design phase ends. Similarly, when companies merge, the schemata of those data warehouses that the former entities created are again bound to feature a substantial degree of heterogeneity. This can be approached in two ways, either by considering these schemata legacy or by creating a new, truly global information system (almost) from scratch.

    1.6 Overview

The remainder of this thesis is structured as follows. In Chapter 2, some preliminary notions from database theory, computability theory, and complexity theory are presented. Chapter 3 discusses previous work on data integration. We start with definitions in Section 3.1 and consecutively discuss federated and multidatabases, data warehousing, mediator systems, information integration in AI, global-as-view and local-as-view integration (the latter is presented at some length, since its theory will be highly relevant to our work of Chapter 5), the description logics-based and model management approaches to data integration, and finally, in Section 3.9, we discuss the various approaches by maintainability and other aspects. In Chapter 4, we present our reference architecture for data integration and discuss its building blocks, which will be treated in more detail in consecutive chapters. Chapter 5 presents our approach to query rewriting with expressive symmetric constraints. Chapter 6 first discusses our flavor of schemata, mappings and model management operations, and then provides some thoughts on how to guide the modeling process for mappings such that the integration infrastructure can be managed as easily as possible. We discuss some advanced issues of query rewriting, notably extensions of query languages such as recursion and sources with binding patterns, in Chapter 7. We also discuss another application of our work on query rewriting with symmetric constraints, the maintenance of physical data independence under schema evolution. Chapter 8 concludes with a final discussion of the practical implications of this thesis.


Chapter 2

Preliminaries

This chapter discusses some preliminaries which mainly stem from database theory and which will be needed in later chapters. It is beyond the scope of this thesis to give a detailed account of computability theory and complexity theory. We refer to [HU79, Sip97, GJ79, Joh90, Pap94, DEGV] for introductory texts in these areas. We also assume a basic understanding of databases, schemata, and query languages, notably SQL (for an introductory work on this, see [Ull88]). Finally, we presume a basic understanding of mathematical logic and automated theorem proving, including concepts such as resolution and refutation, and notions such as predicates, atoms, terms, Skolem functions, Horn clauses, and unit clauses, which are used in the standard way (see e.g. [RN95, Pap94]).

We define the following access functions for later use: Given a Horn clause c, Head(c) returns c's head atom and Body(c) returns the ordered list of its body atoms. Body_i(c) returns the i-th body atom. Pred(a) returns the predicate name of atom a, while Preds(Body(c)) returns the predicate names of the atoms in the body of clause c. Vars(a) returns the set of variables appearing in atom a and Vars(Body(c)) returns the variables in the body of the clause c.
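As an illustration, these access functions can be realized over a simple Horn-clause representation. The following Python sketch is only illustrative; the Atom and Clause classes and the convention that variable names start with an uppercase letter are assumptions made here, not the representation used in the thesis prototype.

    from dataclasses import dataclass
    from typing import List, Set

    Term = str  # by convention here, terms starting with an uppercase letter are variables

    @dataclass
    class Atom:
        pred: str          # predicate name
        args: List[Term]   # argument terms

    @dataclass
    class Clause:          # Horn clause: head <- body[0], ..., body[n-1]
        head: Atom
        body: List[Atom]

    def Head(c: Clause) -> Atom:
        return c.head

    def Body(c: Clause) -> List[Atom]:
        return c.body

    def Body_i(c: Clause, i: int) -> Atom:
        return c.body[i - 1]               # 1-based, as in the text

    def Pred(a: Atom) -> str:
        return a.pred

    def Preds(atoms: List[Atom]) -> List[str]:
        return [a.pred for a in atoms]

    def is_var(t: Term) -> bool:
        return t[:1].isupper()

    def Vars(a: Atom) -> Set[Term]:
        return {t for t in a.args if is_var(t)}

    def VarsOfBody(c: Clause) -> Set[Term]:  # Vars(Body(c)) in the text
        return {t for a in c.body for t in Vars(a)}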

We will mainly focus on the relational data model and relational queries [Cod70, Ull88, Ull89, Kan90] under a set-based rather than bag-based semantics (that is, answers to queries are sets, while they are bags in the original relational model [Cod70] and SQL).

    2.1 Query Languages

Let dom be a countably infinite domain of atomic values. A relation schema R is a relation name together with a sort, which is a tuple1 of attribute names, and an arity, i.e.

1Relation schemata are usually defined as sets of attributes. However, we choose the tuple, as we will use the unnamed calculus perspective widely throughout this work.


sort(R) = ⟨A1, . . . , An⟩,   arity(R) = n

A (relational) schema R is a set of relation schemata. A relation I is a finite set of tuples, I ⊆ dom^n. A database instance I is a set of relations.
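For instance, for the relation schema location of Example 1.3.1 we have sort(location) = ⟨LocId, LocName⟩ and arity(location) = 2, and the set R1 of Example 1.3.1 is a relational schema in this sense.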

A relational query Q is a function that maps each instance I over a schema R and dom to another instance J over a different schema R′. Relational queries can be seen from at least two perspectives, an algebraic and a calculus viewpoint. Relational algebra ALG is based on the following basic algebraic operations (see [Cod70] or [Ull88, AHV95]):

• Set-based operations (intersection ∩, union ∪, and difference \) over relations of the same sort (that is, arity, as we assume a single domain dom of atomic values).

• Tuple-based operations (projection π, which eliminates or renames columns of relations, and selection σ, which filters the tuples of a relation according to a predicate built by conjunction of equality atoms, which are statements of the form A = B, where A, B are relational attributes).

• The cartesian product × as a constructive operation that, given two relations R1 and R2 of arities n and m, respectively, produces a new relation of arity n + m which contains a tuple ⟨t1, t2⟩ for each distinct pair of tuples t1, t2 with t1 ∈ R1 and t2 ∈ R2.

Other operations (e.g., various kinds of joins) can be defined from these. There are various subtleties, such as the named and unnamed perspectives of ALG, for which we refer to [AHV95].
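For instance, under the unnamed perspective, the natural join of a binary relation R with a binary relation S on R's second and S's first column can be written using only the basic operations as π_{1,2,4}(σ_{2=3}(R × S)), where the column numbers refer to positions in the four-column product R × S.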

    Queries in the first-order relational domain calculus CALC are of the form

{X | ϕ(X)}

where X is a tuple of variables (called unbound or distinguished) and ϕ is a first-order formula (using ∧, ∨, ¬, ∃, and ∀) over relational predicates pi.

An important desirable property of well-behaved database queries is domain independence. Let the set of all atomic values appearing in a database I be called the active domain (adom). A CALC query Q over a schema R is domain independent iff, for any possible database I over R, Q_dom(I) = Q_adom(I).

Example 2.1.1 The CALC query {⟨x, y⟩ | p(x)} is not domain independent, as the variable y is free to bind to any member of the domain. Clearly, such a query does not satisfy the intuitions of well-behaved database queries.


Queries with inequality constraints (i.e., ≠,


binary relations can be expressed using the first-order queries3. Much has been said on categories and hierarchies of relational query languages, and examples of languages strictly more expressive than relational algebra and calculus are, for instance, datalog with negation (under various semantics) or the while queries. We refer to [CH82, Cha88, Kan90, AHV95] for more on these issues.

    Treatments of the complexity and expressiveness of relational query languages can be found in [Var82, CH82, Cha88, AHV95]. We leave these issues to the related literature and remark only that the positive relational calculus queries are complete in PSPACE with respect to expression complexity [Var82]. The decision problem whether an unfolding of a conjunctive query with a nonrecursive datalog program (with constants) exists that uses only certain relational predicates (a problem related to the approach to data integration developed later on in this thesis) is equally PSPACE-complete and thus presumably a computationally hard problem.

    2.2 Query Containment

    The problem of deciding whether a query Q1 is contained in a query Q2 (denoted Q1 ⊆ Q2), possibly under a number of constraints describing a schema, is the one of deciding whether, for any possible database satisfying the constraints, each tuple in the result of Q1 is contained in the result of Q2. Two queries are called equivalent, denoted Q1 ≡ Q2, iff Q1 ⊆ Q2 and Q1 ⊇ Q2.

    The containment problem quickly becomes undecidable for expressive query languages. Already for relational algebra and calculus, the problem is undecidable [SY80, Kan90]. In fact, the problem is co-r.e. but not recursive (under the assumption that databases are finite but the domain is not). Checking the containment of two queries would require a noncontainment check for every finite database over dom.

    For conjunctive queries, the containment problem is decidable and NP-complete [CM77]. Since queries tend to be small, query containment can be practically used, for instance in query optimization or data integration [CKPS95, YL87]. It is usually formalized using the notion of containment mappings (homomorphisms) [CM77].

    Definition 2.2.1 Let Q1 and Q2 be two conjunctive queries. A containment mapping is a function θ from the variables and constants of Q1 into the variables and constants of Q2 that is

    the identity on the constants of Q1,

    such that θ(Head_i(Q1)) = Head_i(Q2) for each variable Head_i(Q1),

    3However, transitive closure can of course be expressed in datalog


    and for which, for every atom p(x1, . . . , xn) ∈ Body(Q1),

    p(θ(x1), . . . , θ(xn)) ∈ Body(Q2)

    It can be shown that for two conjunctive queries Q1 and Q2, the containment

    Q1 ⊆ Q2 holds iff there is a containment mapping from Q2 into Q1 [CM77].

    Example 2.2.2 [AHV95] The two conjunctive queries

    q1(x, y, z) ← p(x2, y1, z), p(x, y1, z1), p(x1, y, z1), p(x, y2, z2), p(x2, y2, z).

    and

    q2(x, y, z) ← p(x2, y1, z), p(x, y1, z1), p(x1, y, z1).

    are equivalent. For q1 ⊆ q2, the containment mapping is the identity. Clearly, since Body(q2) ⊆ Body(q1), and the heads of the two queries match, q1 ⊆ q2 must hold. For the other direction, we have θ(x) = x, θ(y) = y, θ(z) = z, θ(x1) = x1, θ(y1) = y1, θ(z1) = z1, θ(x2) = x2, θ(y2) = y1, and θ(z2) = z1.

    An alternative way [Ull97] of deciding whether a conjunctive query Q1 is contained in a second, Q2, is to freeze the variables of Q1 into new constants (i.e., which do not appear in the two queries) and to evaluate Q2 on the canonical database created from the frozen body atoms of Q1. Q1 is then contained in Q2 if and only if the frozen head of Q1 appears in the result of Q2 over the canonical database.

    Example 2.2.3 Consider again the two queries of Example 2.2.2. The canonical database for q2 is I = {p(a_x2, a_y1, a_z), p(a_x, a_y1, a_z1), p(a_x1, a_y, a_z1)}, where a_x, a_y, a_z, a_x1, a_y1, a_z1, a_x2 are constants. We have

    q1(I) = {⟨a_x2, a_y1, a_z⟩, ⟨a_x2, a_y1, a_z1⟩, ⟨a_x, a_y1, a_z⟩, ⟨a_x, a_y1, a_z1⟩, ⟨a_x, a_y, a_z⟩, ⟨a_x, a_y, a_z1⟩, ⟨a_x1, a_y1, a_z1⟩, ⟨a_x1, a_y, a_z1⟩}

    Since the frozen head of q2 is ⟨a_x, a_y, a_z⟩ and ⟨a_x, a_y, a_z⟩ ∈ q1(I), q2 is contained in q1.
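    The freezing technique lends itself to a direct implementation. The following minimal sketch (not from the thesis; the query representation and all names are illustrative, and queries are assumed to contain no constants) decides conjunctive query containment by freezing the body of the candidate contained query and searching for a containment mapping of the other query into the resulting canonical database:

        # A query is a pair (head, body): the head is a tuple of variables, the
        # body a list of atoms of the form (predicate, argument tuple).

        def freeze(query):
            """Replace each variable v of the query by a fresh constant 'c_v',
            yielding the frozen head and the canonical database of ground atoms."""
            head, body = query
            canon_db = [(p, tuple("c_" + v for v in args)) for p, args in body]
            return tuple("c_" + v for v in head), canon_db

        def find_mapping(atoms, db, theta):
            """Backtracking search for an extension of the assignment theta that
            maps every atom onto some ground atom of the canonical database."""
            if not atoms:
                return True
            (pred, args), rest = atoms[0], atoms[1:]
            for db_pred, db_args in db:
                if db_pred != pred:
                    continue
                ext = dict(theta)
                if all(ext.setdefault(v, c) == c for v, c in zip(args, db_args)):
                    if find_mapping(rest, db, ext):
                        return True
            return False

        def contained_in(q1, q2):
            """q1 is contained in q2 iff the frozen head of q1 appears in the
            result of q2 over the canonical database of q1, i.e. iff there is a
            containment mapping from q2 into q1 that respects the heads."""
            frozen_head, canon_db = freeze(q1)
            head2, body2 = q2
            theta = dict(zip(head2, frozen_head))   # the heads must correspond
            return find_mapping(body2, canon_db, theta)

        # The queries q1 and q2 of Example 2.2.2:
        q1 = (("x", "y", "z"),
              [("p", ("x2", "y1", "z")), ("p", ("x", "y1", "z1")),
               ("p", ("x1", "y", "z1")), ("p", ("x", "y2", "z2")),
               ("p", ("x2", "y2", "z"))])
        q2 = (("x", "y", "z"),
              [("p", ("x2", "y1", "z")), ("p", ("x", "y1", "z1")),
               ("p", ("x1", "y", "z1"))])
        print(contained_in(q1, q2), contained_in(q2, q1))  # True True: q1 and q2 are equivalent

    Since conjunctive query containment is NP-complete, the backtracking search may need exponential time on adversarial inputs, but it directly mirrors the containment-mapping characterization used above.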

    The containment of positive queries Q1, Q2 can be checked by transforming them into sets of conjunctive queries Q′1, Q′2. Q′1 is of course contained in Q′2 iff each member query of Q′1 is individually contained in a member query of Q′2.


    Bibliographic Notes

    The containment problem for conjunctive queries is NP-complete, as mentioned. The problem can be efficiently solved for two queries if neither query contains more than two atoms of the same relational predicate [Sar91]. In that case, a very efficient algorithm exists that runs in time linear in the size of the queries. Another polynomial-complexity case is encountered when the so-called hypergraph of the query to be tested for subsumption is acyclic [YO79, FMU82, AHV95]. For that class of queries, the technique of Example 2.2.3 can be combined with the polynomial expression complexity of the candidate subsumer query.

    If arithmetic comparison predicates4 are permitted in conjunctive queries [Klu88], the complexity of checking query containment is harder and jumps to the second level of the polynomial hierarchy [vdM92]. The containment of datalog queries is undecidable [Shm87]. This remains true even for some very restricted classes of single-rule programs (sirups) [Kan90]. Containment of a conjunctive query in a datalog query is EXPTIME-complete; this problem can be solved with the method of Example 2.2.3, but then consumes the full expression complexity of datalog [Var82] (i.e., EXPTIME). The opposite direction, i.e. containment of a datalog program in a conjunctive query, is still decidable but highly intractable (it is 2-EXPTIME-complete [CV92, CV94, CV97]).

    Other interesting recent work addresses the containment of so-called regular path queries, which have found much research interest in the field of semistructured databases, under constraints [CDL98a], and the containment of a class of queries over databases with complex objects [LS97] (see also Section 2.5).

    2.3 Dependencies

    Dependencies are used in database design to add semantics and integrity constraints to a schema, with which database instances have to comply. Two particularly important classes of dependencies are functional dependencies (abbreviated fds) and inclusion dependencies (inds).

    A functional dependency R : X → Y over a relational predicate R (where X and Y are sets of attribute names of R5) has the following semantics. It enforces that for each relation instance over R, for each pair t1, t2 of tuples in the instance, if for each attribute name in X the values in t1 and t2 are pairwise equal, then the values for the attributes in Y must be equal as well.

    Primary keys are special cases of functional dependencies where X ∪ Y contains all attributes of R.

    4Such queries satisfy the real-world need of asking queries where an attribute is to be, for instance, of value greater than a certain constant.

    5Under the unnamed perspective sufficient for conjunctive queries in datalog notation, we will refer to the i-th attribute position in R by $i, instead of an attribute name.


    Example 2.3.1 Let R ⊆ dom³ be a ternary relation with two functional dependencies R : $1 → $2 $3 (i.e., the first attribute is a primary key for R) and R : $3 → $2. Consider an instance I = {⟨1, 2, 3⟩}. The attempt to insert a new tuple ⟨1, 2, 4⟩ into R would violate the first fd, while the attempt to do the same for ⟨5, 6, 3⟩ would violate the second.

    Informally, inclusion dependencies are containment relationships between queries of the form π(R), i.e., attributes of a single relation R may be reordered or projected out. Foreign key constraints, which require that a foreign key stored in one tuple must also exist in the key attribute position of some tuple of a (usually different) relation, are inclusion dependencies.

    Notably, dependencies as database semantics are valuable in query optimization and make it possible to enforce the integrity of database updates.
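    As a small illustration, the following hypothetical sketch (not part of the thesis; names and representation are illustrative) checks functional and inclusion dependencies over relations given as sets of tuples, addressing attributes by their 1-based position in the spirit of the $i notation of footnote 5:

        def satisfies_fd(relation, lhs, rhs):
            """R : lhs -> rhs holds iff any two tuples agreeing on the positions
            in lhs also agree on the positions in rhs."""
            seen = {}
            for t in relation:
                key = tuple(t[i - 1] for i in lhs)
                val = tuple(t[i - 1] for i in rhs)
                if seen.setdefault(key, val) != val:
                    return False
            return True

        def satisfies_ind(r1, pos1, r2, pos2):
            """Inclusion dependency R1[pos1] contained in R2[pos2], the shape of
            a foreign-key constraint."""
            image = {tuple(t[i - 1] for i in pos2) for t in r2}
            return all(tuple(t[i - 1] for i in pos1) in image for t in r1)

        # The two attempted insertions of Example 2.3.1:
        I = {(1, 2, 3)}
        print(satisfies_fd(I | {(1, 2, 4)}, [1], [2, 3]))  # False: violates $1 -> $2 $3
        print(satisfies_fd(I | {(5, 6, 3)}, [3], [2]))     # False: violates $3 -> $2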

    2.4 Global Query Optimization

    Modern database systems rely on the idea of a separation of physical and logical schemata in order to simplify their use [TK78, AHV95]. This, together with the declarative flavor of many query languages, leads to the need to optimize queries such that they execute quickly.

    In the general case of the relational queries (i.e., ALG or the relational calculus), global optimization is not computable. For conjunctive queries, and on the logical level, where physical cost-based metrics can be left out of consideration, though, global optimality (that is, minimality) can be achieved. A conjunctive query Q is minimal if there is no equivalent conjunctive query Q′ s.t. Q′ has fewer atoms (subgoals) in its body than Q.

    This notion of optimality is justified because joins of relations are usually among the most expensive relational (algebra) operations carried out by a relational database system during query execution. Minimality is of interest in data integration as well.

    Computing a minimal equivalent conjunctive query is strongly related to the query containment problem (see Section 2.2). The associated decision problem is again NP-complete. Minimal queries can be computed using the following fact [CM77]. Given a conjunctive query Q, there is a minimal query Q′ (with Q′ ≡ Q) s.t. Head(Q′) = Head(Q) and Body(Q′) ⊆ Body(Q), i.e. the heads are equal and the body of Q′ contains a subset of the subgoals of Q, without any changes to variables or constants. Conjunctive queries can thus be optimized by checking all queries created by dropping body atoms from Q while preserving equivalence and searching for the smallest such query.

    Example 2.4.1 Take the queries q1 and q2 from Example 2.2.2. By checking all subsets of Body(q2), it can be seen that q2 is already minimal. In fact, q2 is also


    a minimal query for q1, as Body(q2) is the smallest subset of Body(q1) such that q2 and q1 remain equivalent.
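    A hypothetical sketch of this optimization procedure, reusing the contained_in function and the queries q1 and q2 from the sketch after Example 2.2.3 (again not part of the thesis), could look as follows:

        from itertools import combinations

        def equivalent(qa, qb):
            return contained_in(qa, qb) and contained_in(qb, qa)

        def minimize(query):
            """Return an equivalent conjunctive query whose body is a smallest
            subset of the original subgoals, as guaranteed by the fact from [CM77]."""
            head, body = query
            for k in range(1, len(body) + 1):        # try small bodies first
                for subset in combinations(body, k):
                    candidate = (head, list(subset))
                    if equivalent(candidate, query):
                        return candidate
            return query

        print(minimize(q1))   # keeps exactly the three subgoals of q2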

    Global optimization of conjunctive queries under a number of dependencies (e.g., fds) can be carried out using a folklore technique called the chase [ABU79, MMS79], for which we refer to the literature (see also [AHV95]).

    2.5 Complex Values and Object Identities

    Among the principal additional features of the object-oriented data model [BM93, Kim95, CBB+97], compared to the relational model, we have object identifiers, objects that have complex (nested) values, IS-A hierarchies, and behavior attributed to classes of objects, usually via (mostly) imperative methods. For the purpose of querying and data integration under the object-oriented data model, the notions of object identifiers and complex objects deserve some consideration.

    Research on complex values in database theory started by giving up the requirement that values in relations may only contain atomic values of the domain (non-first normal form databases). The complex value model, theoretically very elegant, is strictly a generalization of the relational data model. Values are created inductively from set and tuple constructors. The relational data model is thus the special case of the complex value model where each relation is a set of tuples over the domain. For instance,

    {⟨A : dom, B : dom, C : {⟨A : dom, B : {dom}⟩}⟩}

    is a valid sort in the complex value model and

    {⟨a, b, {⟨c, {}⟩, ⟨d, {e, g}⟩}⟩, ⟨e, f, {}⟩}

    is a value of this sort, where a, b, c, d, e, f, g are constants of dom. As for the relational data model, algebra and calculus-based query languages can be specified, and equivalences be established. Informally, in the algebraic perspective, the set-based operations (union, intersection, and difference), which are required to operate over sets of the same sorts, and the simple tuple-based operations (such as projection) known from the relational model are extended by a more expressive selection operation, which may have conditions such as set membership and equality of complex values, by the powerset operation, and furthermore by tuple- and set-creation and destruction operations (see [AHV95]). Other operations such as renaming, join, and nesting and unnesting can be defined from these. The complex-value algebra (ALGcv) has hyperexponential complexity. When the powerset operation is replaced by nesting and unnesting operations, we arrive at the so-called nested relation algebra ALGcv−. All queries in ALGcv− can be


    executed efficiently (relative to the size of the data), which has motivated commercial object-oriented database systems such as O2 [LRV88] and standards such as ODMG's OQL [CBB+97] to closely adopt it.

    Interestingly, it can be shown that all ALGcv− queries over relational databases have equivalent relational queries [AB88, AHV95]. This is due to the fact that unnested values in a tuple always represent keys for the nested tuples; nestings are thus purely cosmetic.

    Furthermore, every complex value database can be transformed (in polynomial time relative to the size of the complex value database) into a relational one [AHV95]. (This, however, requires keys that identify nested tuples as objects, i.e., object identifiers.) The nested relation model, and with it a large class of object-oriented queries, is thus just syntactic sugar over the relational data model, with keys as supplements for object identifiers. From the query-only standpoint of data integration, where structural integration can take care of inventing object identifiers in the canonical transformation between data models, we can thus develop techniques in terms of relational queries, which can then be straightforwardly applied to object-oriented databases as well6.

    We also make a comment on the calculus perspective. Differently from the relational model, in the complex value calculus CALCcv variables may represent and be quantified over complex values. We are thus operating in a higher-order predicate calculus with a finite model semantics. The generalization of range restriction (called safe-range calculus) from the relational calculus to the complex value calculus is straightforward but verbose (see [AHV95]). It can be shown that ALGcv and the safe-range calculus CALCcv (which represents exactly the domain independent complex value calculus queries) are equivalent. Furthermore, if set inclusion is disallowed but set membership as the analog of nesting remains permitted, the so-called strongly safe-range calculus CALCcv− is attained, which is equivalent to ALGcv−. Conjunctive nested relation algebra, in which set union and difference have been removed from ALGcv−, is thus equivalent to the conjunctive relational queries.

    Example 2.5.1 Consider an instance Parts, which is a set of complex values of the following sort. A part (in a product-data management system) is a tuple of a barcode B, a name N, and a set of characteristics C. A characteristic is a tuple of a name N and a set of data elements D. A data element is a tuple of a name N, a unit of measurement U, and a value V7. The sort can thus be written as

    ⟨B : dom, N : dom, C : {⟨N : dom, D : {⟨N : dom, U : dom, V : dom⟩}⟩}⟩

    6Some support for object-oriented databases is a requirement in the use case of Section 1.3.

    7For simplicity, we assume that all atomic values are of the same domain dom. This is not an actual restriction unless arithmetic comparison operators (such as <) are used.


    Suppose now that we ask the following query in the nested relation algebra ALGcv−:

    π_N,B,D(unnest_C(π_B,C(Parts)))

    which asks for transformed complex values of sort

    ⟨N : dom, B : dom, D : {⟨N : dom, U : dom, V : dom⟩}⟩

    and can be formulated in the strongly safe-range calculus CALCcv− as

    {x : ⟨N, B, D : {⟨N, U, V⟩}⟩ | ∃y, z, z′, w, w′, u, u′ :
        y : ⟨B, N, C : {⟨N, D : {⟨N, U, V⟩}⟩}⟩ ∧ z : {⟨N, D : {⟨N, U, V⟩}⟩} ∧ z′ : ⟨N, D : {⟨N, U, V⟩}⟩ ∧
        w : {⟨N, U, V⟩} ∧ w′ : ⟨N, U, V⟩ ∧ u : {⟨N, U, V⟩} ∧ u′ : ⟨N, U, V⟩ ∧
        x.B = y.B ∧ y.C = z ∧ z′ ∈ z ∧ z′.N = x.N ∧ z′.D = w ∧ w′ ∈ w ∧ x.D = u ∧ u′ ∈ u ∧ u = w}

    Let us map the collection Parts to a flat relational database with schema

    R = {Part(Poid,B,N), Char(Coid,N,Poid), DataElement(N,U,V,Coid)}

    where the attributes Poid and Coid stand for object identifiers which must be invented when flattening the data. The above query can now be equivalently asked in relational algebra as

    π_N,B,Dn,U,V((π_Poid,B(Part) ⋈ Char) ⋈ ρ_N→Dn,U,V,Coid(DataElement))

    The greatest challenge here is the elimination or renaming of the three name attributes N. The same query has the following equivalent in the (conjunctive) relational calculus

    {⟨x, y, z, u, v⟩ | ∃i1, i2, d : Part(i1, x, d) ∧ Char(i2, y, i1) ∧ DataElement(z, u, v, i2)}

    After executing the query, the results can be nested to get the correct result for the nested relational algebra or calculus query.
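    The following hypothetical sketch (not from the thesis; the toy instance and all names are illustrative) flattens such a nested Parts collection into the flat schema R above by inventing object identifiers, and then answers the conjunctive calculus query as a join over the flat relations:

        from itertools import count

        def flatten(parts):
            """Map nested part values (B, N, {(N, {(N, U, V)})}) to the relations
            Part(Poid, B, N), Char(Coid, N, Poid), DataElement(N, U, V, Coid)."""
            oid = count(1)                           # invented object identifiers
            part_rel, char_rel, de_rel = set(), set(), set()
            for barcode, pname, characteristics in parts:
                poid = next(oid)
                part_rel.add((poid, barcode, pname))
                for cname, data_elements in characteristics:
                    coid = next(oid)
                    char_rel.add((coid, cname, poid))
                    for dname, unit, value in data_elements:
                        de_rel.add((dname, unit, value, coid))
            return part_rel, char_rel, de_rel

        def query(part_rel, char_rel, de_rel):
            """{(x,y,z,u,v) | exists i1,i2,d : Part(i1,x,d) and Char(i2,y,i1)
                                               and DataElement(z,u,v,i2)}"""
            return {(x, y, z, u, v)
                    for (i1, x, d) in part_rel
                    for (i2, y, p) in char_rel if p == i1
                    for (z, u, v, c) in de_rel if c == i2}

        # A tiny Parts instance; frozensets model the nested sets:
        parts = [("4711", "wheel",
                  frozenset({("weight", frozenset({("gross", "kg", "12")}))}))]
        print(query(*flatten(parts)))
        # {('4711', 'weight', 'gross', 'kg', '12')}

    Grouping the resulting tuples by characteristic name and barcode then rebuilds the nested D sets, mirroring the final nesting step mentioned above.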


    Chapter 3

    Data Integration

    This chapter briefly surveys several research areas related to data integration. We proceed by first presenting two established architectures, federated and multidatabases in Section 3.2 and data warehouses in Section 3.3. Next, in Section 3.4, we discuss information integration in AI. Several research areas of AI that are relevant to this thesis are surveyed, including ontology-based global information systems, capability description and planning, and multi-agent systems as a further integration architecture. Then we discuss global-as-view integration (together with an integration architecture, mediator systems) in Section 3.5 and local-as-view integration in Section 3.6. In Sections 3.7 and 3.8 we arrive at recent data integration approaches. Section 3.9 discusses management and maintainability issues in large and evolving data integration systems and compares the different approaches presented according to various qualitative aspects. First, however, we start with some definitions.

    3.1 Definitions and Overview

    Source integration [JLVV00] refers to the process of integrating a number of sources (e.g. databases) into one greater common entity. The term is usually used as part of a greater, more encompassing process, as perceived in the data warehousing setting, where source integration is usually followed by aggregation and online analytical processing (OLAP). There are two forms of source integration, schema integration and data integration. Schema integration [BLN86] refers to a software engineering or knowledge engineering approach, the process of reverse-engineering information systems and reengineering schemata in order to obtain a single common integrated schema (which we will not address in more detail in this thesis). While the terms data and information are of course not to be confused, data integration and information integration are normally used synonymously (e.g., [Wie96, Wie92]).

    Data integration is the area of research that addresses problems related to


    the provision of interoperability to information systems by the resolution of heterogeneity between systems on the level of data. This distinguishes the problem from the wider aim of cooperative information systems [Coo], where also more advanced concepts such as workflows, business processes, and supply chains come into play, and where problems related to coordination and collaboration of subsystems are studied which go beyond the techniques required and justified for the integration of data alone.

    Figure 3.1: Artist's impression of source integration (source integration splits into schema integration and data integration; the latter comprises structural integration, semantic integration, and data reconciliation).

    The data integration problem can be decomposed into several subproblems. Structural integration (e.g., wrapping [GK94, RS97]) is concerned with the resolution of structural heterogeneity, i.e. the heterogeneity of data models, query and data access languages, and protocols1. This problem is particularly interesting when it comes to legacy systems, which are systems that in general have some aspect that would be changed in an ideal world but in practice cannot be [AS99]. In practice, this often refers to out-of-date systems in which parts of the code base or subsystems cannot be adapted to new requirements and technologies because they are no longer understood by the current maintainers or because the source code has been lost.

    Semantic integration refers to the resolution of semantic mismatch between schemata. Mismatch of concepts appearing in such schemata may be due to a number of reasons (see e.g. [GMPQ+97]), and may be a consequence of differences in conceptualizations in the minds of different knowledge engineers.

    1We experience structural heterogeneity if we need to make a number of databases interoperable of which, for example, some are relational and others object-oriented, or if, among the relational databases, some are only queryable using SQL while others are only queryable using QUEL [SHWK76]. Other kinds of structural heterogeneity are encountered when two database systems use different models for managing transactions, or when they lack middleware compatible with both that allows queries and results to be communicated.


    Mismatch may not only occur on the level of schema entities (relations in a relational database or classes in an object-oriented system), but also on the level of data. The associated problem, called data reconciliation [JLVV00], includes object identification (i.e., the problem of determining correspondences of objects represented by different heterogeneous data sources) and the handling of mistakes that happened during the acquisition of data (e.g. typos), which is usually referred to as data cleaning. An overview of this classification of source integration is given in Figure 3.1.

    Since, for this thesis, the main problem among those discussed in this section is the resolution of semantic mismatch, we will also put an emphasis on this problem in the following discussion and comparison of research related to data integration.

    3.2 Federated and Multidatabases

    The data integration problem has been addressed early on by work on multidatabase systems. Multidatabase systems are collections of several (distributed) databases that may be heterogeneous and need to share and exchange data. According to the classification2 of [SL90], federated database systems [HM85] are a subclass of multidatabase systems. Federated databases are collections of collaborating but autonomous component database systems. Nonfederated multidatabase systems, on the other hand, may have several heterogeneous schemata but lack any other kind of autonomy. Nonfederated multidatabase systems have one level of management only, and all data management operations are performed uniformly for all component databases. Federated database systems can be categorized as loosely or tightly coupled systems. Tightly coupled systems are administered as one common entity, while in loosely coupled systems this is not the case and component databases are administered independently [SL90].

    Component databases of a federated system may be autonomous in several senses. Design autonomy permits the creators of component databases to make their own design choices with respect to representation, i.e. data models and query languages, the data managed and the schemata used for managing them, and the conceptualizations and semantic interpretations of the data applied. Other kinds of component autonomy that are of less interest to this thesis but still deserve to be mentioned are communication autonomy, execution autonomy, and association autonomy [SL90, HM85]. Autonomy is often in conflict with the need for sharing data within a federated database system. Thus, one or several kinds of autonomy may have to be relaxed in practice to be able to provide interoperability.

    2There is some heterogeneity in the nomenclature of this area. A cautionary note is due at this point: many of the terms in this chapter have been used heterogeneously by the research community. Certain choices had to be made in this thesis to allow a uniform presentation, which are hopefully well documented.


    Figure 3.2: Federated five-layer schema architecture (local, component, export, federated, and external schemata).

    Modern database systems successfully use a three-tier architecture [TK78] which separates the physical (also called internal) from the logical representation, and the logical schema in turn from possibly multiple user or application perspectives (provided by views). In federated database systems, these three layers are considered insufficient, and a five-layer schema architecture has been proposed (e.g. [SL90] and Figure 3.2). Under this architecture, there are five types of schemata between which queries are translated. These five types of schemata are

    Local schemata. The local schema of a component database corresponds to the logical schema in the classical three-layered architecture of centralized database systems.

    Component schemata. The component schema of a database is a version of its local schema translated into the data model and representation formalism shared across the federated database system.

    Export schemata. An export schema contains only the part of the schema relevant to one integrated federated schema.

    Federated schemata3. This schema is an integrated homogeneous view of the federation, against which a number of export schemata are mapped (using data integration technology). There may be several such federated schemata inside a federation, providing different integrated views of the available data.

    3These are also known as import schemata or global schemata [SL90].


    External schemata. These provide application- or user-specific views of the federated schemata, as in the classical three-layer architecture.

    This five-layer architecture is believed to provide better support for the integration and management of heterogeneous autonomous databases than the classical three-layer architecture [HM85, SL90].

    Figure 3.3: Data warehousing architecture and process (sources accessed through wrappers, a mediator performing data reconciliation and integration, extraction and aggregation into the data warehouse, data marts and a "data cube" (MDDBS), and data analysis on top).

    3.3 Data Warehousing

    Data Warehousing (Figure 3.3) is a somewhat interdisciplinary area of research whose scope goes beyond pure data integration. The goal is usually, in an enterprise environment, to collect data from a number of distributed sites4 (e.g., grocery stores), to clean and integrate them, and to put them into one large central store, the corporate data warehouse. Data warehousing is also about performing aggregation of relevant data (e.g. sales data). Data may then be extracted and transformed according to schemata customized for particular users or analysis tools (Online Analytical Processing, OLAP) [JLVV00].

    Since the data manipulated are in practice often highly mission-critical to enterprises and may be very large, special technologies have been developed for dealing with the aggregation of data (e.g. the summarization of sales data according to criteria such as categories of products sold, regions, and time spans), such as multidimensional databases (MDDBMS) or data cubes.

    4The point of this is not just the resolution of heterogeneity but also to have distinct systems for Online Transaction Processing (OLTP) and data analysis for decision support, which usually access data in very different ways and also need differently optimized schemata. (In OLTP, transactions are usually short and occur at a high density, while in OLAP, transactions are few but long and put emphasis on querying.)

    As data integrated against a warehouse are usually materialized there, the data warehousing literature often makes a distinction between mediation, which is confined to data integration on demand, i.e. when a query against the warehouse occurs (also called virtual integration or the lazy approach [Wid96] by data warehouse researchers), and materialized data integration (the eager approach [Wid96]). The materialized approach to data integration in fact adds problems related to dynamic aspects (e.g., the view update and view maintenance problems). These problems are not yet well understood, and known theoretical results are often quite negative [AHV95].

    Data Warehousing has received considerable interest in industry, and there are several commercial implementations, such as those by Informix and MicroStrategy [JLVV00]. Two well-known research systems are WHIPS [GMLY98] and SQUIRREL [ZHKF95b, ZHKF95a].

    3.4 Information Integration in AI

    There has traditionally been much cross-fertilization between the artificial intelligence and information systems areas, and the intel