Download - Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Transcript

Page 1: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Overview

Übersetzung &Sichtauflösung

Standardisierung &Vereinfachung

Optimierung Plan-parametrisierung

Code-Erzeugung

Ausführung

Algebraterm

Zugriffsplan

Code

ErgebnisSQL-Anfrage

Übersetzungszeit Laufzeit

LogischeOptimierung

PhysischeOptimierung

KostenbasierteAuswahl

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–1

Page 2: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Overview

Phases of Query Processing1 Translation and View Expansion

I Simplify arithmetic expressions in the query planI Resolve subqueriesI Insert the view definition

2 Logical or algebraic optimizationI Transform query plan irrespective of the specific storage form; and

pulling in of selections in other operations3 Physical or Internal optimization

I Take into account concrete storage techniques (indexes, clusters)I Select algorithmsI Several alternative internal plans

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–2

Page 3: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Overview

Phases of Query Processing (2)4 Cost-Based Selection

I Use statistic information (size of tables, selectivity of attributes) forthe selection of a specific internal plan

5 Plan ParametrizationI For Pre-compiled queries: (e.g., Embedded-SQL): Replace

placeholders with values6 Code Generation

I Convert the access plan into executable code

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–3

Page 4: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Overview

Phases of Query Processing (3)

Representation of requests during the processingI Algebra expressions −→ Operator Tree

F Operators as NodesF Edges represent data flow

I Later phases −→ Access or query plan (query execution plan –QEP)

F Concrete algorithms as operators

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–4

Page 5: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Overview

Logical vs. physical Operators

σF πA ⋈ γG;aggr(A)

R R R RS

IndexScanTableScan

NestedLoopsJoinSortMergeJoinHashJoin

SortGroupByHashGroupBy

Selektion Projektion Verbund Gruppierung

Algebra-operatoren

Plan-operatoren

TableScanProjSortProj

R, S – Relations

A – Attribute Set

F – Condition

G – Grouping Elements

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–5

Star-Join

Optimization of Star-Joins

Star-Joins are a typical pattern for Data Warehouse queriesTypical properties by the Star-Schema:

I Very large fact tableI Clearly smaller, independent dimension tables

⇒ Heuristics of classical relational optimizers often fail in this regard!

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–6

Star-Join

Optimization of Star-Joins (2)

Example: Join over fact table Sales and the three dimensiontables Product, Time und Geography:

I 4-Wey JoinI In RDBMS usually only pairwise Join: Sequence of pairwise joins

requiredI 4! possible Join ordersI Heuristic to reduce the number of combinations to check:

Joins between relations that are not linked by a Join condition in thequery will not be considered

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–7

Star-Join

Optimization of Star-Joins (3)

Heuristic gives for example the following query plan (Plan A):

Verkauf Ort

Zeit

σProduktkategorie='Bier"⋈

⋈

Produkt

σBundesland='Thüringen'

σDatum=201101

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–8

Star-Join

Optimization of Star-Joins (4)

The following query plan (Plan B) is usually not considered (withcross product of the dimension tables):

Verkauf

OrtZeit

σProduktkategorie='Bier'

⋈

Produkt

σBundesland='Thüringen'σDatum=201101

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–9

Star-Join

Star-Join: Calculation example

Annahmen:I Table Sales: 10.000.000 TuplesI 10 Shops in Thüringen (out of 100 in Germany)I 20 days of sale in January 2010 (out of 1000 stored days)I 50 products in the product category "Beer" (out of 1000)I Uniform distribution / same selectivity of the individual attribute

values

Plan Operation Number of Resulting TuplesA 1. Join 1.000.000

2. Join 20.0003. Join 1.000

B 1. Cross Product 2002. Cross Product 10.0003. Join 1.000

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–10

Star-Join

Semi-Join of Dimension Tables

Calculation of the Cross product for the dimension tables only forsufficiently restictive selection conditions for dimensions usefulAvoidance of the complete calculation of the cross product Use of the Semi-Join

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–11

Star-Join

Semi-Join of Dimension Tables (2)

1 On the fact table a simple B+-Tree is used for each dimension asan index

2 Through Semi-Joins with the dimension tables the sets of tupleidentifiers (TID) of the potentially relevant tuples is determined

3 The intersection of those TID sets is computed (e.g., by usingefficient main memory methods):

I Contains all TIDs of the tuples that fulfill all restrictions for alldimensions

4 After that a "normal" pairwise Join is performed

Not the whole fact table goes into the join, but instead only the relevanttuples! (in the example: 1.000 instead of 10.000.000 tuples)

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–12

Star-Join

Semi-Join of Dimension Tables (3)

IndexScan(Verkauf_Ort_IDX,Ort_ID)

IndexScan(Verkauf_Produkt_IDX,Produkt_ID)

SemiJoin

TID-Intersect

IndexScan(Verkauf_Zeit_IDX,Zeit_ID)

TableScan(Ort) TableScan(Produkt) TableScan(Zeit)

SemiJoin SemiJoin

TIDs der Faktentabelle

TIDs TIDs

TIDs

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–13

Page 14: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Optimization of the GROUP BY

Special treatment of grouping and aggregation operations duringthe optimizationLogical/Algebraic Optimization: "Push-down" of groupings reduction of intermediate result cardinality

I Invariant GroupingI Early Pre-Grouping

Physical/Internal OptimizationI Special implementation for GROUP BY, CUBE and other

OLAP-Functions

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–14

Page 15: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Optimization of the GROUP BY

Invariant Grouping

Idea: Shifting a grouping operation "down" (Invariance w.r.t.position)Usable if

I Join partner does not directly contribute to the result (implicitselection)

I Grouping attributes have the role of a foreign key in the join

Example:

SELECT S_Time_ID, S_Location_ID, SUM(Revenue)FROM Sales, Time, LocationWHERE S_Time_ID=T_ID AND

S_Location_ID=L_IDAND Year < 2010 AND County <> "THÜR"

GROUP BY S_Time_ID, S_Location_ID

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–15

Page 16: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Optimization of the GROUP BY

Invariant Grouping (2)

R S

σP(Y)

⋈R.A=S.B

γA;F(Z)

R S

⋈R.A=S.B

γA;F(Z)

σP(Y)

πB

A,Z ∈ RB, Y ∈ S

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–16

Page 17: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Optimization of the GROUP BY

Early Pre-Grouping

Invariant Grouping: restriktive precondition seldom usedIdea: Insertion of an additional grouping operator before the join(similar to a projection)Usable if

I Grouping condition contains the join attributesI Aggregated attributes do not depend on the attributes of the join

partner

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–17

Page 18: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Optimization of the GROUP BY

Early Pre-Grouping (2)

Example:

SELECT Year, L_City, SUM(Revenue), COUNT(Revenue)FROM Sales, Time, LocationWHERE S_Time_ID = T_ID AND

S_L_ID = L_IDGROUP BY Year, L_City

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–18

Page 19: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Optimization of the GROUP BY

Early Pre-Grouping (3)

R S

⋈R.A=S.B

γC,D;F(Z)

R S

⋈R.A=S.B

γA,C;F(Z)

A,C,Z ∈ RB, D ∈ S

γC,D;F(Z)

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–19

Page 20: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Optimization of the GROUP BY

Early Pre-Grouping (4)

Also required: Adjustment of the aggregation function

Verkauf Zeit

⋈

γJahr,O_Stadt;

SUM(Umsatz),COUNT(Umsatz)

⋈

Ort

Verkauf Zeit

⋈

γJahr,O_ID;

$1=SUM(Umsatz),

$2=COUNT(Umsatz)

⋈

Ort

γJahr,O_Stadt;

SUM($1),SUM($2)

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–20

Page 21: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Optimization of the GROUP BY

Implementation of the grouping operator

Implementation of GROUP BY

Implementation of OLAP-FunctionsComputation of the CUBE

Iceberg-Cubes

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–21

Page 22: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Optimization of the GROUP BY

GROUP BY-Implementations

Sort-basedI Pre-sorting the relation or sorting read (Index-Scan)I Process

1 Sortinge2 Iteration over tuples3 Aggregation of the values and output of the aggregated value in case

of a group change

Hash-basedI Hashfunction over grouping attributes h(G1, . . . ,Gn)I Process

1 Insertion of tuples in hash tables using h(G1, . . . ,Gn)2 Iteration through hash table3 Application of aggregation functions

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–22

Page 23: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Optimization of the GROUP BY

Implementation of OLAP-Functions

Sequential evaluationFor each OLAP-Function:

I Input data sorted according to OVER()-clauseI Apply aggregation functions

Sorting byI Attributes of the global groupingI Attributes of the OVER()-clause (PARTITION BY and ORDER BY)

In case of multiple OLAP-Functions in a queryI Sequential evaluation, i.e., possibly repeated sorting for usage of

shared sorting prefixes

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–23

Page 24: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Optimization of the GROUP BY

Implementation of OLAP-Functions (2)global grouping attributes G1 . . .Gn,locale sorting attributes of OVER(): O1 . . .Op

aggregation function AGG(),locale partitioning attributed (opt.) P1 . . .Pm,lower and upper window border (opt.) Wu . . .Wo

sort(G1, . . . ,Gn,P1, . . . ,Pm,O1, . . . ,Op);while (t = next_tuple()) {

if (t has equal values w.r.t. G1 . . .Gn,P1 . . .Pm like last tuple)aggrlist := concat(aggrlist, t);

else {// Partition switchfor i := 1 to length(aggrlist) {

// Compute absolute window borders low, highaggrval := AGG({aggrlist[low]...aggrlist[high]});output(G1, . . . ,Gn, P1, . . . ,Pm, O1, . . . ,Op, aggrval);

}aggrlist := ();

}}

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–24

Page 25: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Calculation of the CUBE

Calculation of the CUBE: naive Approach

Separate calculation of all grouping combinationsFinal unification

VERKAUF ��ZEIT ��PRODUKT � ORT

!(Produktgruppe, Bundesland)

!(Bundesland) !(Produktgruppe)

!()

!(Produktgruppe, Jahr) !(Jahr, Bundesland)

!(Jahr)

UNION ALL

!(Produktgruppe, Jahr, Bundesland)

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–25

Page 26: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Calculation of the CUBE

CUBE and Aggregation functions

Algebraic functions enableI Calculation of less detailed aggegates from more detailed

aggregates (more dimensions)I Partial order ("grid") of the GROUP BY operations of the CUBE

F Data Cube LatticeI GROUP BY is a child of another GROUP BY if the parent operation

can be used to calculate the child operation→ Derivability

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–26

Page 27: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Calculation of the CUBE

Derivability

Derivability of grouping combinations Gi

Direct Derivability:I G2 is derivable from G1 ifI G2 has exactly one attribute less than G1: G2 ⊂ G1 and|G2| = |G1| − 1

I or in G1 exactly one attribute Ai is replaced by Bi where thefollowing holds true: Ai → Bi

⇒ Data Cube LatticeDerivability:

I Grouping combinations: G2 is within a data cube lattice derivablefrom G1 when there is a path from G1 to G2

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–27

Page 28: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Calculation of the CUBE

Data Cube Lattice

all

Produktgruppe (P) Bundesland (B) Jahr (J)

PB PJ BJ

PBJ

Ebene

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–28

Page 29: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Calculation of the CUBE

Computation of the CUBE

Idea:I Using of the grid view (Derivabiity)I GROUP-BYs with common grouping attributes can share partitions,

sorted parts etc.Appriach

I Based on sorting: PipeSortI Based on hashing: PipeHash

c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–29

Page 30: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Calculation of the CUBE

Optimized Computation: Principle

γ(A,B,C)

γ(A,B)γ(A,C) γ(B,C)

γ(A) γ(B) γ(C)

γ()

UNION

Page 31: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Calculation of the CUBE

Optimization Potential

Smallest-parentI Computation of a GROUP-BY based on the minimal previously

computed Parent-GROUP-BYCache-results

I Temporary storage of the results (in the main memory) of aGROUP-BY for subsequent GROUP-BYs (Example.: ABC→ AB)

Amortize-scansI Joint calcukation of multiple GROUP-BYs in a scan (Example: ABC,

ACD, ABD, BCD aus ABCD)Share-sorts

I For sort-based approaches: Temporary storage and shared use ofsorted parts

Share-partitionsI For hash-based approaches: Temporary storage and joint use of

partitions

Page 32: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Calculation of the CUBE

Meaning of the Sorting Order

Assumption: Grouping based on sortingI potentially requires re-sorting

Example:Product Year Region SalesRotwein 2009 SANH 230Rotwein 2009 THÜR 210Rotwein 2010 SANH 200

... ... ... ...Bier 2009 SANH 568

I Re-sorting for the grouping (Product, Region)

Consideration of the sort costs in a cost model

Page 33: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Calculation of the CUBE

Cost Model

Cost TypesI S-Costs (Still to sort): Calculation of GROUP-BY j from GROUP-BY

i, if i not sorted yetI A-Costs (Already sorted): Calculation of GROUP-BY j from

GROUP-BY i, if ialready sorted

Estimation based on data distribution, system parameters, etc.

Page 34: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Calculation of the CUBE

PipeSort

Input: Search gridI Graph with nodes to represent GROUP-BY (Aggregationsgitter)I Directed edge connects GROUP-BY i with GROUP-BY j

F i is parent node of jF j can be generated from iF j has exactly one attribute less than i

I Level k refers to all GROUP-BYs with k attributes

Annotations of edges eij with A- and S-CostsOutput: Subgraph of the search grid

I Each node is connected to a single parent nodeI Determines sorting order while preserving pipelining

F Special expanded tree

Goal: Subgraph with minimal summe of edge costs

Page 35: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Calculation of the CUBE

PipeSort: Example

ACABAB AC

A B C

BC BC

ACABAB AC

A B C

BC BC

2 10 5 12 13 20

Pipeline Sortierung

Page 36: Data Warehouse Technologies · Star-Joins are a typical pattern for Data Warehouse queries Typical properties by the Star-Schema: I Very large fact table I Clearly smaller, independent

Calculation of the CUBE

PipeSort: Algorithm

Can be traced back to a known Graph-AlgorithmLevel-wise approach: k = 0..N − 1 (N: Number of attributes)Transformation of level k + 1 by k copies of each nodeEach copied node has connections with the same node as theoriginalEdge costs for original node: A(eij), otherwise: S(eij)

Search for graph with minimal costsI Forming of pairs and minimization of the total costs ("hungarian"

method – weighted bipartite matching problem)