Overview
Overview
Übersetzung &Sichtauflösung
Standardisierung &Vereinfachung
Optimierung Plan-parametrisierung
Code-Erzeugung
Ausführung
Algebraterm
Algebraterm
Zugriffsplan
Zugriffsplan
Code
ErgebnisSQL-Anfrage
Übersetzungszeit Laufzeit
LogischeOptimierung
PhysischeOptimierung
KostenbasierteAuswahl
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–1
Overview
Phases of Query Processing1 Translation and View Expansion
I Simplify arithmetic expressions in the query planI Resolve subqueriesI Insert the view definition
2 Logical or algebraic optimizationI Transform query plan irrespective of the specific storage form; and
pulling in of selections in other operations3 Physical or Internal optimization
I Take into account concrete storage techniques (indexes, clusters)I Select algorithmsI Several alternative internal plans
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–2
Overview
Phases of Query Processing (2)4 Cost-Based Selection
I Use statistic information (size of tables, selectivity of attributes) forthe selection of a specific internal plan
5 Plan ParametrizationI For Pre-compiled queries: (e.g., Embedded-SQL): Replace
placeholders with values6 Code Generation
I Convert the access plan into executable code
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–3
Overview
Phases of Query Processing (3)
Representation of requests during the processingI Algebra expressions −→ Operator Tree
F Operators as NodesF Edges represent data flow
I Later phases −→ Access or query plan (query execution plan –QEP)
F Concrete algorithms as operators
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–4
Overview
Logical vs. physical Operators
σF πA ⋈ γG;aggr(A)
R R R RS
IndexScanTableScan
NestedLoopsJoinSortMergeJoinHashJoin
SortGroupByHashGroupBy
Selektion Projektion Verbund Gruppierung
Algebra-operatoren
Plan-operatoren
TableScanProjSortProj
R, S – Relations
A – Attribute Set
F – Condition
G – Grouping Elements
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–5
Star-Join
Optimization of Star-Joins
Star-Joins are a typical pattern for Data Warehouse queriesTypical properties by the Star-Schema:
I Very large fact tableI Clearly smaller, independent dimension tables
⇒ Heuristics of classical relational optimizers often fail in this regard!
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–6
Star-Join
Optimization of Star-Joins (2)
Example: Join over fact table Sales and the three dimensiontables Product, Time und Geography:
I 4-Wey JoinI In RDBMS usually only pairwise Join: Sequence of pairwise joins
requiredI 4! possible Join ordersI Heuristic to reduce the number of combinations to check:
Joins between relations that are not linked by a Join condition in thequery will not be considered
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–7
Star-Join
Optimization of Star-Joins (3)
Heuristic gives for example the following query plan (Plan A):
Verkauf Ort
Zeit
σProduktkategorie='Bier"⋈
⋈
⋈
Produkt
σBundesland='Thüringen'
σDatum=201101
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–8
Star-Join
Optimization of Star-Joins (4)
The following query plan (Plan B) is usually not considered (withcross product of the dimension tables):
Verkauf
OrtZeit
σProduktkategorie='Bier'
×
×
⋈
Produkt
σBundesland='Thüringen'σDatum=201101
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–9
Star-Join
Star-Join: Calculation example
Annahmen:I Table Sales: 10.000.000 TuplesI 10 Shops in Thüringen (out of 100 in Germany)I 20 days of sale in January 2010 (out of 1000 stored days)I 50 products in the product category "Beer" (out of 1000)I Uniform distribution / same selectivity of the individual attribute
values
Plan Operation Number of Resulting TuplesA 1. Join 1.000.000
2. Join 20.0003. Join 1.000
B 1. Cross Product 2002. Cross Product 10.0003. Join 1.000
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–10
Star-Join
Semi-Join of Dimension Tables
Calculation of the Cross product for the dimension tables only forsufficiently restictive selection conditions for dimensions usefulAvoidance of the complete calculation of the cross product Use of the Semi-Join
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–11
Star-Join
Semi-Join of Dimension Tables (2)
1 On the fact table a simple B+-Tree is used for each dimension asan index
2 Through Semi-Joins with the dimension tables the sets of tupleidentifiers (TID) of the potentially relevant tuples is determined
3 The intersection of those TID sets is computed (e.g., by usingefficient main memory methods):
I Contains all TIDs of the tuples that fulfill all restrictions for alldimensions
4 After that a "normal" pairwise Join is performed
Not the whole fact table goes into the join, but instead only the relevanttuples! (in the example: 1.000 instead of 10.000.000 tuples)
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–12
Star-Join
Semi-Join of Dimension Tables (3)
IndexScan(Verkauf_Ort_IDX,Ort_ID)
IndexScan(Verkauf_Produkt_IDX,Produkt_ID)
SemiJoin
TID-Intersect
IndexScan(Verkauf_Zeit_IDX,Zeit_ID)
TableScan(Ort) TableScan(Produkt) TableScan(Zeit)
SemiJoin SemiJoin
TIDs der Faktentabelle
TIDs TIDs
TIDs
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–13
Optimization of the GROUP BY
Optimization of the GROUP BY
Special treatment of grouping and aggregation operations duringthe optimizationLogical/Algebraic Optimization: "Push-down" of groupings reduction of intermediate result cardinality
I Invariant GroupingI Early Pre-Grouping
Physical/Internal OptimizationI Special implementation for GROUP BY, CUBE and other
OLAP-Functions
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–14
Optimization of the GROUP BY
Invariant Grouping
Idea: Shifting a grouping operation "down" (Invariance w.r.t.position)Usable if
I Join partner does not directly contribute to the result (implicitselection)
I Grouping attributes have the role of a foreign key in the join
Example:
SELECT S_Time_ID, S_Location_ID, SUM(Revenue)FROM Sales, Time, LocationWHERE S_Time_ID=T_ID AND
S_Location_ID=L_IDAND Year < 2010 AND County <> "THÜR"
GROUP BY S_Time_ID, S_Location_ID
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–15
Optimization of the GROUP BY
Invariant Grouping (2)
R S
σP(Y)
⋈R.A=S.B
γA;F(Z)
R S
⋈R.A=S.B
γA;F(Z)
σP(Y)
πB
A,Z ∈ RB, Y ∈ S
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–16
Optimization of the GROUP BY
Early Pre-Grouping
Invariant Grouping: restriktive precondition seldom usedIdea: Insertion of an additional grouping operator before the join(similar to a projection)Usable if
I Grouping condition contains the join attributesI Aggregated attributes do not depend on the attributes of the join
partner
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–17
Optimization of the GROUP BY
Early Pre-Grouping (2)
Example:
SELECT Year, L_City, SUM(Revenue), COUNT(Revenue)FROM Sales, Time, LocationWHERE S_Time_ID = T_ID AND
S_L_ID = L_IDGROUP BY Year, L_City
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–18
Optimization of the GROUP BY
Early Pre-Grouping (3)
R S
⋈R.A=S.B
γC,D;F(Z)
R S
⋈R.A=S.B
γA,C;F(Z)
A,C,Z ∈ RB, D ∈ S
γC,D;F(Z)
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–19
Optimization of the GROUP BY
Early Pre-Grouping (4)
Also required: Adjustment of the aggregation function
Verkauf Zeit
⋈
γJahr,O_Stadt;
SUM(Umsatz),COUNT(Umsatz)
⋈
Ort
Verkauf Zeit
⋈
γJahr,O_ID;
$1=SUM(Umsatz),
$2=COUNT(Umsatz)
⋈
Ort
γJahr,O_Stadt;
SUM($1),SUM($2)
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–20
Optimization of the GROUP BY
Implementation of the grouping operator
Implementation of GROUP BY
Implementation of OLAP-FunctionsComputation of the CUBE
Iceberg-Cubes
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–21
Optimization of the GROUP BY
GROUP BY-Implementations
Sort-basedI Pre-sorting the relation or sorting read (Index-Scan)I Process
1 Sortinge2 Iteration over tuples3 Aggregation of the values and output of the aggregated value in case
of a group change
Hash-basedI Hashfunction over grouping attributes h(G1, . . . ,Gn)I Process
1 Insertion of tuples in hash tables using h(G1, . . . ,Gn)2 Iteration through hash table3 Application of aggregation functions
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–22
Optimization of the GROUP BY
Implementation of OLAP-Functions
Sequential evaluationFor each OLAP-Function:
I Input data sorted according to OVER()-clauseI Apply aggregation functions
Sorting byI Attributes of the global groupingI Attributes of the OVER()-clause (PARTITION BY and ORDER BY)
In case of multiple OLAP-Functions in a queryI Sequential evaluation, i.e., possibly repeated sorting for usage of
shared sorting prefixes
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–23
Optimization of the GROUP BY
Implementation of OLAP-Functions (2)global grouping attributes G1 . . .Gn,locale sorting attributes of OVER(): O1 . . .Op
aggregation function AGG(),locale partitioning attributed (opt.) P1 . . .Pm,lower and upper window border (opt.) Wu . . .Wo
sort(G1, . . . ,Gn,P1, . . . ,Pm,O1, . . . ,Op);while (t = next_tuple()) {
if (t has equal values w.r.t. G1 . . .Gn,P1 . . .Pm like last tuple)aggrlist := concat(aggrlist, t);
else {// Partition switchfor i := 1 to length(aggrlist) {
// Compute absolute window borders low, highaggrval := AGG({aggrlist[low]...aggrlist[high]});output(G1, . . . ,Gn, P1, . . . ,Pm, O1, . . . ,Op, aggrval);
}aggrlist := ();
}}
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–24
Calculation of the CUBE
Calculation of the CUBE: naive Approach
Separate calculation of all grouping combinationsFinal unification
VERKAUF ��ZEIT ��PRODUKT � ORT
!(Produktgruppe, Bundesland)
!(Bundesland) !(Produktgruppe)
!()
!(Produktgruppe, Jahr) !(Jahr, Bundesland)
!(Jahr)
UNION ALL
!(Produktgruppe, Jahr, Bundesland)
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–25
Calculation of the CUBE
CUBE and Aggregation functions
Algebraic functions enableI Calculation of less detailed aggegates from more detailed
aggregates (more dimensions)I Partial order ("grid") of the GROUP BY operations of the CUBE
F Data Cube LatticeI GROUP BY is a child of another GROUP BY if the parent operation
can be used to calculate the child operation→ Derivability
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–26
Calculation of the CUBE
Derivability
Derivability of grouping combinations Gi
Direct Derivability:I G2 is derivable from G1 ifI G2 has exactly one attribute less than G1: G2 ⊂ G1 and|G2| = |G1| − 1
I or in G1 exactly one attribute Ai is replaced by Bi where thefollowing holds true: Ai → Bi
⇒ Data Cube LatticeDerivability:
I Grouping combinations: G2 is within a data cube lattice derivablefrom G1 when there is a path from G1 to G2
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–27
Calculation of the CUBE
Data Cube Lattice
all
Produktgruppe (P) Bundesland (B) Jahr (J)
PB PJ BJ
PBJ
0
1
2
3
Ebene
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–28
Calculation of the CUBE
Computation of the CUBE
Idea:I Using of the grid view (Derivabiity)I GROUP-BYs with common grouping attributes can share partitions,
sorted parts etc.Appriach
I Based on sorting: PipeSortI Based on hashing: PipeHash
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–29
Calculation of the CUBE
Optimized Computation: Principle
γ(A,B,C)
γ(A,B)γ(A,C) γ(B,C)
γ(A) γ(B) γ(C)
γ()
UNION
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–30
Calculation of the CUBE
Optimization Potential
Smallest-parentI Computation of a GROUP-BY based on the minimal previously
computed Parent-GROUP-BYCache-results
I Temporary storage of the results (in the main memory) of aGROUP-BY for subsequent GROUP-BYs (Example.: ABC→ AB)
Amortize-scansI Joint calcukation of multiple GROUP-BYs in a scan (Example: ABC,
ACD, ABD, BCD aus ABCD)Share-sorts
I For sort-based approaches: Temporary storage and shared use ofsorted parts
Share-partitionsI For hash-based approaches: Temporary storage and joint use of
partitions
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–31
Calculation of the CUBE
Meaning of the Sorting Order
Assumption: Grouping based on sortingI potentially requires re-sorting
Example:Product Year Region SalesRotwein 2009 SANH 230Rotwein 2009 THÜR 210Rotwein 2010 SANH 200
... ... ... ...Bier 2009 SANH 568
I Re-sorting for the grouping (Product, Region)
Consideration of the sort costs in a cost model
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–32
Calculation of the CUBE
Cost Model
Cost TypesI S-Costs (Still to sort): Calculation of GROUP-BY j from GROUP-BY
i, if i not sorted yetI A-Costs (Already sorted): Calculation of GROUP-BY j from
GROUP-BY i, if ialready sorted
Estimation based on data distribution, system parameters, etc.
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–33
Calculation of the CUBE
PipeSort
Input: Search gridI Graph with nodes to represent GROUP-BY (Aggregationsgitter)I Directed edge connects GROUP-BY i with GROUP-BY j
F i is parent node of jF j can be generated from iF j has exactly one attribute less than i
I Level k refers to all GROUP-BYs with k attributes
Annotations of edges eij with A- and S-CostsOutput: Subgraph of the search grid
I Each node is connected to a single parent nodeI Determines sorting order while preserving pipelining
F Special expanded tree
Goal: Subgraph with minimal summe of edge costs
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–34
Calculation of the CUBE
PipeSort: Example
ACABAB AC
A B C
BC BC
ACABAB AC
A B C
BC BC
2 10 5 12 13 20
2 10 5 12 13 20
Pipeline Sortierung
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–35
Calculation of the CUBE
PipeSort: Algorithm
Can be traced back to a known Graph-AlgorithmLevel-wise approach: k = 0..N − 1 (N: Number of attributes)Transformation of level k + 1 by k copies of each nodeEach copied node has connections with the same node as theoriginalEdge costs for original node: A(eij), otherwise: S(eij)
Search for graph with minimal costsI Forming of pairs and minimization of the total costs ("hungarian"
method – weighted bipartite matching problem)
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–36
Calculation of the CUBE
PipeSort: Sorting Order
Each node h in level k is connected to a node g in level k + 1
h→ g over A()-edge: h determines attribute order for sorting g
h→ g over S()-edge: g is re-sorted for the calculation of h
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–37
Calculation of the CUBE
PipeSort: Sort Plan
ABC
Pipeline Sortierung
Relation
CBAD
CBA BAD ACD DBC
CB BA AC DB AD CD
D
all
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–38
Calculation of the CUBE
Iceberg Cubes
Idea: Exploitation of the monotony of aggregations
If an aggregation group does not fulfill the COUNT-condition, then thiscondition is also not fulfilled by groups with additional attributes.
Approach: Bottom-Up-ConstruCtion of a cube andMinimal-Support-Pruning (similar to Apriori)
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–39
Calculation of the CUBE
Iceberg Cube bottom up
ABCD
ABC ABD ACD BCD
ADACAB BC BD CD
A B C D
all
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–40
Calculation of the CUBE
Iceberg-Cube: ComputationBottomUpCube(input, dim):
aggregate(input);write(outputRec);for (d:=dim; d<numDims; d++) {
C := cardinality[d]; /* Cardinality of the dimension */Partition(input, d, C, dataCount[d]);k := 0;for (i:=0; i < C; i++) {
c := dataCount[d][i]; /* Size of Partition */if (c >= minsup) {
outputRec.dim[d] := input[k].dim[d];BottomUpCube(input[k...k+c], d+1);
}k += c;
}outputRec.dim[d] = ALL;
}
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–41
Calculation of the CUBE
Summary
Special charakterics of DW queries require specific optimizationmethodsRewriting techniques:
I Join order for Star-JoinI Push-down of groupings
Operator implementationI CUBE and Iceberg-CUBEI OLAP-Functions
c© Sattler / Saake / K öppen Data Warehouse Technologies Last Change: 06.02.2020 0–42
Top Related