Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical...

Overview

Overview

Übersetzung &Sichtauflösung

Standardisierung &Vereinfachung

Optimierung Plan-parametrisierung

Code-Erzeugung

Ausführung

Algebraterm

Algebraterm

Zugriffsplan

Zugriffsplan

Code

ErgebnisSQL-Anfrage

Übersetzungszeit Laufzeit

LogischeOptimierung

PhysischeOptimierung

KostenbasierteAuswahl

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–1

Overview

Phases of Query Processing1 Translation and View Expansion

I Simplify arithmetic expressions in the query planI Resolve subqueriesI Insert the view definition

2 Logical or algebraic optimizationI Transform query plan irrespective of the specific storage form; and

pulling in of selections in other operations3 Physical or Internal optimization

I Take into account concrete storage techniques (indexes, clusters)I Select algorithmsI Several alternative internal plans


Overview

Phases of Query Processing (2)4 Cost-Based Selection

I Use statistic information (size of tables, selectivity of attributes) forthe selection of a specific internal plan

5 Plan ParametrizationI For Pre-compiled queries: (e.g., Embedded-SQL): Replace

placeholders with values6 Code Generation

I Convert the access plan into executable code


Overview

Phases of Query Processing (3)

Representation of requests during the processingI Algebra expressions −→ Operator Tree

F Operators as NodesF Edges represent data flow

I Later phases −→ Access or query plan (query execution plan –QEP)

F Concrete algorithms as operators


Overview

Logical vs. physical Operators

σF πA ⋈ γG;aggr(A)

R R R RS

IndexScanTableScan

NestedLoopsJoinSortMergeJoinHashJoin

SortGroupByHashGroupBy

Selektion Projektion Verbund Gruppierung

Algebra-operatoren

Plan-operatoren

TableScanProjSortProj

R, S – Relations

A – Attribute Set

F – Condition

G – Grouping Elements


Star-Join

Optimization of Star-Joins

Star-Joins are a typical pattern for Data Warehouse queriesTypical properties by the Star-Schema:

I Very large fact tableI Clearly smaller, independent dimension tables

⇒ Heuristics of classical relational optimizers often fail in this regard!


Star-Join

Optimization of Star-Joins (2)

Example: Join over fact table Sales and the three dimensiontables Product, Time und Geography:

I 4-Wey JoinI In RDBMS usually only pairwise Join: Sequence of pairwise joins

requiredI 4! possible Join ordersI Heuristic to reduce the number of combinations to check:

Joins between relations that are not linked by a Join condition in thequery will not be considered


Star-Join


Heuristic gives for example the following query plan (Plan A):

Verkauf Ort

Zeit

σProduktkategorie='Bier"⋈

⋈

⋈

Produkt

σBundesland='Thüringen'

σDatum=201101


Star-Join


The following query plan (Plan B) is usually not considered (withcross product of the dimension tables):

Verkauf

OrtZeit

σProduktkategorie='Bier'

×

×

⋈

Produkt

σBundesland='Thüringen'σDatum=201101


Star-Join

Star-Join: Calculation example

Annahmen:I Table Sales: 10.000.000 TuplesI 10 Shops in Thüringen (out of 100 in Germany)I 20 days of sale in January 2010 (out of 1000 stored days)I 50 products in the product category "Beer" (out of 1000)I Uniform distribution / same selectivity of the individual attribute

values

Plan Operation Number of Resulting TuplesA 1. Join 1.000.000

2. Join 20.0003. Join 1.000

B 1. Cross Product 2002. Cross Product 10.0003. Join 1.000


Star-Join

Semi-Join of Dimension Tables

Calculation of the Cross product for the dimension tables only forsufficiently restictive selection conditions for dimensions usefulAvoidance of the complete calculation of the cross product Use of the Semi-Join


Star-Join

Semi-Join of Dimension Tables (2)

1 On the fact table a simple B+-Tree is used for each dimension asan index

2 Through Semi-Joins with the dimension tables the sets of tupleidentifiers (TID) of the potentially relevant tuples is determined

3 The intersection of those TID sets is computed (e.g., by usingefficient main memory methods):

I Contains all TIDs of the tuples that fulfill all restrictions for alldimensions

4 After that a "normal" pairwise Join is performed

Not the whole fact table goes into the join, but instead only the relevanttuples! (in the example: 1.000 instead of 10.000.000 tuples)


Star-Join

Semi-Join of Dimension Tables (3)

IndexScan(Verkauf_Ort_IDX,Ort_ID)

IndexScan(Verkauf_Produkt_IDX,Produkt_ID)

SemiJoin

TID-Intersect

IndexScan(Verkauf_Zeit_IDX,Zeit_ID)

TableScan(Ort) TableScan(Produkt) TableScan(Zeit)

SemiJoin SemiJoin

TIDs der Faktentabelle

TIDs TIDs

TIDs


Optimization of the GROUP BY


Special treatment of grouping and aggregation operations duringthe optimizationLogical/Algebraic Optimization: "Push-down" of groupings reduction of intermediate result cardinality

I Invariant GroupingI Early Pre-Grouping

Physical/Internal OptimizationI Special implementation for GROUP BY, CUBE and other

OLAP-Functions



Invariant Grouping

Idea: Shifting a grouping operation "down" (Invariance w.r.t.position)Usable if

I Join partner does not directly contribute to the result (implicitselection)

I Grouping attributes have the role of a foreign key in the join

Example:

SELECT S_Time_ID, S_Location_ID, SUM(Revenue)FROM Sales, Time, LocationWHERE S_Time_ID=T_ID AND

S_Location_ID=L_IDAND Year < 2010 AND County <> "THÜR"

GROUP BY S_Time_ID, S_Location_ID



Invariant Grouping (2)

R S

σP(Y)

⋈R.A=S.B

γA;F(Z)

R S

⋈R.A=S.B

γA;F(Z)

σP(Y)

πB

A,Z ∈ RB, Y ∈ S



Early Pre-Grouping

Invariant Grouping: restriktive precondition seldom usedIdea: Insertion of an additional grouping operator before the join(similar to a projection)Usable if

I Grouping condition contains the join attributesI Aggregated attributes do not depend on the attributes of the join

partner



Early Pre-Grouping (2)

Example:

SELECT Year, L_City, SUM(Revenue), COUNT(Revenue)FROM Sales, Time, LocationWHERE S_Time_ID = T_ID AND

S_L_ID = L_IDGROUP BY Year, L_City




R S

⋈R.A=S.B

γC,D;F(Z)

R S

⋈R.A=S.B

γA,C;F(Z)

A,C,Z ∈ RB, D ∈ S

γC,D;F(Z)




Also required: Adjustment of the aggregation function

Verkauf Zeit

⋈

γJahr,O_Stadt;

SUM(Umsatz),COUNT(Umsatz)

⋈

Ort

Verkauf Zeit

⋈

γJahr,O_ID;

$1=SUM(Umsatz),

$2=COUNT(Umsatz)

⋈

Ort

γJahr,O_Stadt;

SUM($1),SUM($2)



Implementation of the grouping operator

Implementation of GROUP BY

Implementation of OLAP-FunctionsComputation of the CUBE

Iceberg-Cubes



GROUP BY-Implementations

Sort-basedI Pre-sorting the relation or sorting read (Index-Scan)I Process

1 Sortinge2 Iteration over tuples3 Aggregation of the values and output of the aggregated value in case

of a group change

Hash-basedI Hashfunction over grouping attributes h(G1, . . . ,Gn)I Process

1 Insertion of tuples in hash tables using h(G1, . . . ,Gn)2 Iteration through hash table3 Application of aggregation functions



Implementation of OLAP-Functions

Sequential evaluationFor each OLAP-Function:

I Input data sorted according to OVER()-clauseI Apply aggregation functions

Sorting byI Attributes of the global groupingI Attributes of the OVER()-clause (PARTITION BY and ORDER BY)

In case of multiple OLAP-Functions in a queryI Sequential evaluation, i.e., possibly repeated sorting for usage of

shared sorting prefixes



Implementation of OLAP-Functions (2)global grouping attributes G1 . . .Gn,locale sorting attributes of OVER(): O1 . . .Op

aggregation function AGG(),locale partitioning attributed (opt.) P1 . . .Pm,lower and upper window border (opt.) Wu . . .Wo

sort(G1, . . . ,Gn,P1, . . . ,Pm,O1, . . . ,Op);while (t = next_tuple()) {

if (t has equal values w.r.t. G1 . . .Gn,P1 . . .Pm like last tuple)aggrlist := concat(aggrlist, t);

else {// Partition switchfor i := 1 to length(aggrlist) {

// Compute absolute window borders low, highaggrval := AGG({aggrlist[low]...aggrlist[high]});output(G1, . . . ,Gn, P1, . . . ,Pm, O1, . . . ,Op, aggrval);

}aggrlist := ();

}}


Calculation of the CUBE

Calculation of the CUBE: naive Approach

Separate calculation of all grouping combinationsFinal unification

VERKAUF ��ZEIT ��PRODUKT � ORT

!(Produktgruppe, Bundesland)

!(Bundesland) !(Produktgruppe)

!()

!(Produktgruppe, Jahr) !(Jahr, Bundesland)

!(Jahr)

UNION ALL

!(Produktgruppe, Jahr, Bundesland)



CUBE and Aggregation functions

Algebraic functions enableI Calculation of less detailed aggegates from more detailed

aggregates (more dimensions)I Partial order ("grid") of the GROUP BY operations of the CUBE

F Data Cube LatticeI GROUP BY is a child of another GROUP BY if the parent operation

can be used to calculate the child operation→ Derivability



Derivability

Derivability of grouping combinations Gi

Direct Derivability:I G2 is derivable from G1 ifI G2 has exactly one attribute less than G1: G2 ⊂ G1 and|G2| = |G1| − 1

I or in G1 exactly one attribute Ai is replaced by Bi where thefollowing holds true: Ai → Bi

⇒ Data Cube LatticeDerivability:

I Grouping combinations: G2 is within a data cube lattice derivablefrom G1 when there is a path from G1 to G2



Data Cube Lattice

all

Produktgruppe (P) Bundesland (B) Jahr (J)

PB PJ BJ

PBJ

0

1

2

3

Ebene



Computation of the CUBE

Idea:I Using of the grid view (Derivabiity)I GROUP-BYs with common grouping attributes can share partitions,

sorted parts etc.Appriach

I Based on sorting: PipeSortI Based on hashing: PipeHash



Optimized Computation: Principle

γ(A,B,C)

γ(A,B)γ(A,C) γ(B,C)

γ(A) γ(B) γ(C)

γ()

UNION



Optimization Potential

Smallest-parentI Computation of a GROUP-BY based on the minimal previously

computed Parent-GROUP-BYCache-results

I Temporary storage of the results (in the main memory) of aGROUP-BY for subsequent GROUP-BYs (Example.: ABC→ AB)

Amortize-scansI Joint calcukation of multiple GROUP-BYs in a scan (Example: ABC,

ACD, ABD, BCD aus ABCD)Share-sorts

I For sort-based approaches: Temporary storage and shared use ofsorted parts

Share-partitionsI For hash-based approaches: Temporary storage and joint use of

partitions



Meaning of the Sorting Order

Assumption: Grouping based on sortingI potentially requires re-sorting

Example:Product Year Region SalesRotwein 2009 SANH 230Rotwein 2009 THÜR 210Rotwein 2010 SANH 200

... ... ... ...Bier 2009 SANH 568

I Re-sorting for the grouping (Product, Region)

Consideration of the sort costs in a cost model



Cost Model

Cost TypesI S-Costs (Still to sort): Calculation of GROUP-BY j from GROUP-BY

i, if i not sorted yetI A-Costs (Already sorted): Calculation of GROUP-BY j from

GROUP-BY i, if ialready sorted

Estimation based on data distribution, system parameters, etc.



PipeSort

Input: Search gridI Graph with nodes to represent GROUP-BY (Aggregationsgitter)I Directed edge connects GROUP-BY i with GROUP-BY j

F i is parent node of jF j can be generated from iF j has exactly one attribute less than i

I Level k refers to all GROUP-BYs with k attributes

Annotations of edges eij with A- and S-CostsOutput: Subgraph of the search grid

I Each node is connected to a single parent nodeI Determines sorting order while preserving pipelining

F Special expanded tree

Goal: Subgraph with minimal summe of edge costs



PipeSort: Example

ACABAB AC

A B C

BC BC

ACABAB AC

A B C

BC BC

2 10 5 12 13 20

2 10 5 12 13 20

Pipeline Sortierung



PipeSort: Algorithm

Can be traced back to a known Graph-AlgorithmLevel-wise approach: k = 0..N − 1 (N: Number of attributes)Transformation of level k + 1 by k copies of each nodeEach copied node has connections with the same node as theoriginalEdge costs for original node: A(eij), otherwise: S(eij)

Search for graph with minimal costsI Forming of pairs and minimization of the total costs ("hungarian"

method – weighted bipartite matching problem)



PipeSort: Sorting Order

Each node h in level k is connected to a node g in level k + 1

h→ g over A()-edge: h determines attribute order for sorting g

h→ g over S()-edge: g is re-sorted for the calculation of h



PipeSort: Sort Plan

ABC

Pipeline Sortierung

Relation

CBAD

CBA BAD ACD DBC

CB BA AC DB AD CD

D

all



Iceberg Cubes

Idea: Exploitation of the monotony of aggregations

If an aggregation group does not fulfill the COUNT-condition, then thiscondition is also not fulfilled by groups with additional attributes.

Approach: Bottom-Up-ConstruCtion of a cube andMinimal-Support-Pruning (similar to Apriori)



Iceberg Cube bottom up

ABCD

ABC ABD ACD BCD

ADACAB BC BD CD

A B C D

all



Iceberg-Cube: ComputationBottomUpCube(input, dim):

aggregate(input);write(outputRec);for (d:=dim; d<numDims; d++) {

C := cardinality[d]; /* Cardinality of the dimension */Partition(input, d, C, dataCount[d]);k := 0;for (i:=0; i < C; i++) {

c := dataCount[d][i]; /* Size of Partition */if (c >= minsup) {

outputRec.dim[d] := input[k].dim[d];BottomUpCube(input[k...k+c], d+1);

}k += c;

}outputRec.dim[d] = ALL;

}



Summary

Special charakterics of DW queries require specific optimizationmethodsRewriting techniques:

I Join order for Star-JoinI Push-down of groupings

Operator implementationI CUBE and Iceberg-CUBEI OLAP-Functions


Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical...

Documents

Transcript of Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical...