Post on 06-Aug-2020
Ein Unternehmen der Daimler AG
Lecture @DHBW: Data Warehouse
05 Data Engineering
Andreas Buckenhofer
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99
tss@daimler.com / Internet: www.daimler-tss.com
Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle
© Daimler TSS I Template Revision
Andreas BuckenhoferSenior DB Professional
Since 2009 at Daimler TSS
Department: Machine Learning Solutions
Business Unit: AnalyticsDHBWDOAG
Contact/Connect
vcard
• Oracle ACE Associate
• DOAG responsible for InMemory DB
• Lecturer at DHBW
• Certified Data Vault Practitioner 2.0
• Certified Oracle Professional
• Certified IBM Big Data Architect
• Over 20 years experience with
database technologies
• Over 20 years experience with Data
Warehousing
• International project experience
Daimler TSS Data Warehouse / DHBW 3
Change Log
Date Changes
30.10.2019 Initial version
Daimler TSS Data Warehouse / DHBW 4
What you will learn today
• Understand concepts behind
• ETL & ELT
• Data integration
• Data Engineering
• Data Pipelines
Data Warehouse /
DHBWDaimler TSS 5
Data Engineering
Over 75%• of time is spent for
• say they least enjoy
DATA PREPARATION
Data Consumers
The machine learning pipeline
Daimler TSS Data Warehouse / DHBW 7
Data scientist vs Data Engineering vs software engineering
Source: 2018 Enterprise Almanah
Daimler TSS Data Warehouse / DHBW 8
Data Scientist vs Data Engineer
Source: https://www.oreilly.com/ideas/data-engineers-vs-data-scientistsDaimler TSS Data Warehouse / DHBW 9
Standard Data Warehouse architecture
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging
Layer
(Input
Layer)
OLTP
OLTP
Core
Warehouse
Layer
(Storage
Layer)
Mart Layer
(Output
Layer)
(Reporting
Layer)
Integration
Layer
(Cleansing
Layer)
Aggregation
Layer
Metadata Management
Security
DWH Manager incl. Monitor
Daimler TSS Data Warehouse / DHBW 10
? ? ? ?
Daimler TSS Data Warehouse / DHBW 10
?
ETL Process
Extract – Transform - Load
Other term: Data integration (better, more neutral)
Similar: Data Engineering, Data Pipelines
Daimler TSS Data Warehouse / DHBW 11
Tasks of the ETL Process - Extract
• Capture and copy data from source systems (e.g. operational systems) or from source table within DWH
• Many different types of sources
• Relational, hierarchical DBMSs
• Flat files
• NoSQL DBs
• Other internal/external sources … CSV, XML, JSON
Daimler TSS Data Warehouse / DHBW 12
Tasks of the ETL Process - Transform
Most complex part of data integration
• Filter data
• Integrate data
• Check and cleanse data
Daimler TSS Data Warehouse / DHBW 13
Tasks of the ETL Process - Load
Insert data into the target table
Daimler TSS Data Warehouse / DHBW 14
ETL vs ELT
ETL often used for data integration in general (for ETL and ELT)
But: if ELT is mentioned, it is differentiated from ETL
Source
DB
Target
DB
ETL Server
Source
DB
Target
DB
ELT Server
Data flow
Daimler TSS Data Warehouse / DHBW 15
ETL vs ELT
ETL ELT
Data is transferred to ETL server and transferred
back to DB. High network bandwidth required
Data remains in the DB except for cross
Database loads (e.g. source to target)
Transformations are performed in the ETL Server Transformations are performed (in the source
or) in the target
Proprietary code is executed in the ETL server Generated code, e.g. SQL, PL/SQL, SQLT
Typically used for
• source to target transfer
• Compute intensive transformations
• Small amount of data
Typically used for
• High amounts of data
Daimler TSS Data Warehouse / DHBW 16
ETL/ELT tool vs manual ETL/ELT
ETL Tool Manual ETL
Informatica, Talend, Oracle ODI, etc. SQL, PL/SQL, SQLT, etc.
Separate license No additional license
Workflow, error handling, and restart/recovery
functionality included
Workflow, error handling, and restart/recovery
functionality must be implemented manually
Impact analysis and where-used (lineage)
functionality available
Impact analysis and where-used (lineage)
functionality difficult
Faster development, easier maintenance Slower development, more difficult maintenance
Additional (Tool-) Know How required Know How often available
Daimler TSS Data Warehouse / DHBW 17
ETL/ELT tool vs manual ETL/ELT
Extract servicesLoad
services
Operations management services
Scheduler Control Repository Management
Connectors
Sorter
Connector
Sorter
Bulk Loader
Data Profiling servicesSource analysis
Data Quality servicesData cleansing
Data Transformation and Integration services
Data mapping Business rules
Slowly Changing Dimensions
Datatype conversion
Lookups
Job Monitoring Auditing Error Handling
Security
Daimler TSS Data Warehouse / DHBW 18
Mapping - Informatica
Source Target
Filter
Lookup
Daimler TSS Data Warehouse / DHBW 19
Mapping with Transformations - Informatica
Sorter
Aggregator Transformation
Union Transformation
Daimler TSS Data Warehouse / DHBW 20
Data Mapping
• Specification between source and target columns
• Source tables + columns
• Target table + columns
• Join rules
• Filter criteria
• Transformation rules
Daimler TSS Data Warehouse / DHBW 21
Workflow - Informatica
Decision & coordination step
Session containing Mapping
Daimler TSS Data Warehouse / DHBW 22
Job monitoring - Informatica
Daimler TSS Data Warehouse / DHBW 23
Extract
Data change detection
• Extracts from source systems
• Initial extract for setting up the data warehouse
• Initial Load
• Periodical extracts for adding new/changed information to the data warehouse
• Incremental Load
• Question: How to determine what is new or what has changed in the source systems?
Task of data change detection („monitoring“)
Daimler TSS Data Warehouse / DHBW 25
OLTP OLAP
OLTP OLAP
Data change detection: net effect of changes
• Discovery of all changes vs. determining the net effect at extract/load time only
• Example: an attribute value can get changed in two ways:
• by one update operation
• by one delete and one insert operation
• The net effect of both is the same
• However, history information is lost if the net effect is recorded only
Daimler TSS Data Warehouse / DHBW 26
5 UPDATE 6
5 DELETE INSERT 6
Exercise Data change detection
• Which techniques can be used to identify changes in a source system (RDBMS)?
• E.g. in OLTP system
• new products are inserted
• customer address changes
• Product is deleted because it is out of stock
• How would you identify such changes? List advantages / disadvantages of possible solutions
• Think about making changes in the source system. Think also about other solutions without any change in the source system.
Daimler TSS Data Warehouse / DHBW 27
Data change detection techniques
• Depend on characteristics of the data sources
• The following techniques are based on modern relational DBMS
• Types of techniques
• Based on DBMS
• Trigger-based
• Log-based discovery
• Controlled by application
• Timestamp-based discovery
• Snapshot-based discovery
Daimler TSS Data Warehouse / DHBW 28
Trigger-based
• Active monitoring mechanisms
• Based on (database) triggers
• Example:
• If new record is inserted in sales transaction table then insert transaction id and timestamp in change table
• Advantage:
• Triggers do not change operational applications
• Disadvantage:
• Performance impact on operation systems if triggers are used extensively
• Triggers have to be implemented for every table in the source systems
Daimler TSS Data Warehouse / DHBW 29
Trigger-based
• Sample Trigger Code, Oracle
CREATE [OR REPLACE] TRIGGER <trigger_name>
{BEFORE|AFTER} {INSERT|DELETE|UPDATE}
ON <table_name>
[REFERENCING [NEW AS <new_row_name>] [OLD AS <old_row_name>]]
[FOR EACH ROW [WHEN (<trigger_condition>)]]
<trigger_body>
• Trigger is created for each source table in OLTP DB and stores insert/update/delete changes in a “log/journal table”
• trigger body contains insert statements into log/journal table
Daimler TSS Data Warehouse / DHBW 30
Log-based
• Log-based discovery
• Also often referenced as CDC (Change Data Capture)
• Usage of database transaction logs to determine changes
• DBMSs write transaction logs in order to be able to undo partially executed transactions
• This information can be used to determine all changes
• Log reader identifies insert, update, delete, truncates and writes the changes as inserts into staging layer
• Transaction Log files can be transferred to other systems to avoid additional load on source systems
Daimler TSS Data Warehouse / DHBW 31
Daimler TSS Data Warehouse / DHBW 32
Transcation LogsRDBMS & NoSQL use transaction logs
Insert
Update
Commit
Delete
Create table
Create index
And others
OLTP
DB
Database
Files
Transaction
LogsLog Buffer
Data Buffer
commit
Checkpoint
Triggered
e.g. by time
or size
Essential for backup &
recovery; used for
replication & failover
Sequential
writes (and
reads for
most DBs)
Random
writes (and
reads for
most DBs)
Log-based (sample product architecture IIDR)
Fro
nte
nd
Standard
Reports
AdHoc
Reports
IIDR
ReplEngine
Source
Datastore
Source
OLTP
DB
IIDR ReplEngine
DWH
Datastore
DWH
DWH DB
Staging Layer
Core DWH Layer
Mart Layer
Transaction
Logs
Database
Files
initialincremental
Timestamp-based
• Timestamp-based discovery
• Every data item in a table is associated with timestamp information about its validity period
• Changed data can be determined from this timestamp information
Daimler TSS Data Warehouse / DHBW 34
Timestamp-based
• Sample customer table in OLTP
• Each table gets Change timestamp
• Delta process reads latest data only (e.g. ChangeTimestamp >= <yesterday>)
• Problem: it is not possible to identify deleted rows
CustomerID Name Department Change
Timestamp
1 Miller DWH 15.01.2015
17:00:01
2 Powell DB 22.03.2016
08:30:22
Daimler TSS Data Warehouse / DHBW 35
Snapshot-based
• Data comparison
• Comparison of snapshots of the operational data at different points in time
• Compute difference between two latest snapshots
• E.g. unload all data from a table into a file and diff newest file content with latest file content
• Can be very complex
• Sometimes the only possibility, for instance for legacy applications
• High performance impact on source
Daimler TSS Data Warehouse / DHBW 36
Data change detection techniques comparison
Trigger-based Log-based
discovery
Timestamp-
based
discovery
Snapshot-
based
discovery
Performance
impact on source
system
High Low Medium High
Performance
impact on target
system
Low Low Low High
Load on network Low Low Low High
Data loss if
nologging
operations
No Yes No No
Daimler TSS Data Warehouse / DHBW 37
Data change detection techniques comparison
Trigger-
based
Log-based
discovery
Timestamp-
based
discovery
Snapshot-
based
discovery
Identify
DELETE
operations
Yes Yes No Yes
Identify ALL
changes
(changes
between
extractions)
Yes Yes No No
Daimler TSS Data Warehouse / DHBW 38
Data Transport – direct access
Direct Access
• Source writes data into target or
• Target reads data from source
• Security concerns
• High coupling / dependencies
Source Target
Daimler TSS Data Warehouse / DHBW 39
Data Transport – file transfer
File transfer (or other transport medium)
• csv, json, xml, binary, etc
• Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise service bus), SOA (service oriented architecture), etc
• Often high amounts of data, therefore bulk transfer of compressed data most widely used
• Better decoupling of source and target
Source Targetfiles
Daimler TSS Data Warehouse / DHBW 40
Transform
Data quality concerns all
Source: https://twitter.com/markmadsen/status/1059579065164738560?s=21
Daimler TSS Data Warehouse / DHBW 42
Marriott – Starwood integration of loyalty programs: feedback on social media
Sources: https://www.flyertalk.com/forum/starwood-starwood-preferred-guest/ + https://www.facebook.com/marriottrewards/ + https://twitter.com/MarriottRewards/with_repliesDaimler TSS Data Warehouse / DHBW 43
Data quality issues
Source: https://flowingdata.com/2019/04/09/bad-data-from-a-faulty-sensor-on-the-boeing-737-max/
Daimler TSS Data Warehouse / DHBW 44
Prerequisite of transformation: Understanding The Data
• Profile Existing Data Sources, Extracted Data
• Analyze data structure, content, and quality
• Find data relationships across systems
• Often badly documented or missing foreign keys
• Uncover data issues that can affect subsequent transformation steps
• Missing values
• Duplicates
• Inconsistencies
Daimler TSS Data Warehouse / DHBW 45
After 3 years and 9 months half the customer records in a database are incorrect
Half live of data
Source: NoCOUG 1/2019, Journal Bud Walker: The half-live of data, p. 20
Daimler TSS Data Warehouse / DHBW 46
Data Quality issues
CustomerNo Name Birthdate Age Gender Zip code
1 Miller, Tom 33.01.2001 15 M NULL
1 John Mayor 15.01.2001 15 M 98144
2 Mrs. Bush 31.10.1988 22 Q 00000
3 Martin 31.10.1988 22 M 75890
PK / Unique Key violated Data not uniform Not valid
Inconsistent Wrong value
Unknown / missing
FK violated
Daimler TSS Data Warehouse / DHBW 47
Data Quality issues and possible solutions in the source RDBMS
Issue Solution
Wrong data e.g. 31.02.2016 Proper data type definition
Wrong values, e.g. number out of range Check constraint
Missing values NOT NULL constraint
Violated references FOREIGN KEY constraint
Duplicates PRIMARY or UNIQUE KEY constraint
Inconsistent data ACID transactions, business logic, additional checks
Daimler TSS Data Warehouse / DHBW 48
Data Quality issues and possible solutions in the source RDBMS
Issue Solution
Wrong data e.g. 31.02.2016 Proper data type definition
Wrong values, e.g. number out of range Check constraint
Missing values NOT NULL constraint
Violated references FOREIGN KEY constraint
Duplicates PRIMARY or UNIQUE KEY constraint
Inconsistent data ACID transactions, business logic, additional checks
Daimler TSS Data Warehouse / DHBW 49
Data Quality issues: correct data in the source
Correcting the data
• In the source systems
• Common master data management across all operational applications
• Dedicated systems are “master” of e.g. customer data
• Correcting the data at the source is best approach but slow and often not feasible
Daimler TSS Data Warehouse / DHBW 50
Data Quality issues: workarounds in DWH
Correcting the data
• Automatically during ETL
• E.g., address of a customer if a correct reference table exists
• Manually after ETL is finished
• ETL stored bad data in error log tables or files
• ETL flags bad data (e.g. invalid)
Daimler TSS Data Warehouse / DHBW 51
Data Quality issues: missing data
• Column is null
• Reject data
• Use default values
• Missing values can represent
• an unknown value Iike date of birth of a customer
• a missing value like engine_id for a car (logical not null constraint)
• Dimension tables can include some dummy values:
DimensionTable_X Description
-1 Unknown
-2 MissingDaimler TSS Data Warehouse / DHBW 52
Data Quality issues: missing data
• Data is inaccuratee.g. wrong date 32.12.2015 or wrong number 55U
• Reject data
• Replace with value that represents „Invalid“
• Dimension tables can include some dummy values:
DimensionTable_X Description
-1 Unknown
-2 Missing
-3 InvalidDaimler TSS Data Warehouse / DHBW 53
Data Quality issues: conflicting data
• Data has conflicts, e.g. wrong postal code 80995 Stuttgart
• Reject data
• Replace one of the values with a value that represents „Invalid“ or with corrected valueWhich value to replace? Rules necessary
Daimler TSS Data Warehouse / DHBW 54
Data Quality issues: inconsistent data
• Data is inconsistent, e.g. unlikely high price for a product
• Can be discovered by statistical and data mining methods
Daimler TSS Data Warehouse / DHBW 55
Data Quality issues: duplicates
• Data is duplicated, e.g. „Martin Miller” vs “Miller, Martin” vs “M.Miller”
• Multiple representations for one entity
• Different keys
• Different encodings
• Duplicate detection can be very difficult / tricky
• Products are available for e.g. address duplicate detection address validation (Kingstreet = does this address actually exist?)address harmonization (Kingstr, Kingstreet, King Street, etc)
• Standardize / Harmonize data during ETL flow: “unification”
Daimler TSS Data Warehouse / DHBW 56
Transform - Unification of data
• Unification of data types
• Character string date „20.01.2006“ 20.01.2006
• Character string number „12345“ 12345
• Unification of encodings
• For instance for gender F and M
• Lookup-tables contain the mapping from old to new encodings
• Combination of different attributes to one attribute
• day, month, year date
Daimler TSS Data Warehouse / DHBW 57
Transform - Unification of data
• Split of one attribute into two or more
• Name first name, last name (“Herr Prof. Dr. Hans M. vom und zum Stein”)
• Unification of names can become very challenging “Herr Prof. Dr. Hans M. vomund zum Stein” or “Werner Martin” or “Mariae Gloria … Wilhelmine HubertaGräfin von Schönburg-Glauchau“
• Product name - „Cola, 0.33 l“ Product short name - „Cola“, size in liters - 0.33
Daimler TSS Data Warehouse / DHBW 58
Transform - Unification of data
• Unification of dates and timestamps
• Rules for representing incomplete date information If only month and year are known
• Dates and timestamps with regard to one specific timezoneImportant for multi-national organizationsUTC Coordinated Universal Time without daylight saving zone
• What can happen if clock is changed to wintertime if no UTC is used?- Update arrives at 02:15 in staging layer (CDC / log-based monitor)- Clock is changed to wintertime: -1h- Update of the same row arrives at 02:10 in staging layer (CDC / log-based)- How can batch load running the next night discover which update is the most recent one?
Daimler TSS Data Warehouse / DHBW 59
Transform - Unification of data
• Computation of derived values
• Profit = sales price – purchase price Without clear definition, different interpretations possible
• Net or gross sales price?
• Net or gross purchase price?
• Aggregations
• Revenue of the year computed from revenues of the dayWithout clear definition, different interpretations possible
• Calendar year?
• Fiscal year?
Daimler TSS Data Warehouse / DHBW 60
Load
Load
• Efficient load operations are important
• bulk load: Single row processing vs set based processing
• Online load
• Data warehouse (especially Data Mart) is still accessible
• Offline load
• Data warehouse (especially Data Mart) is offline
• For updates that require the recomputation of a cube
• Offline load is often a Tool limit because the Tool locks data structures. But
offline load could be faster.
Daimler TSS Data Warehouse / DHBW 62
Bulk processing
• Specific Bulk load operations provided by RDBMS, e.g. External tables in Oracle or LOAD command in DB2
• Single row vs set based processing
Single row processing Set based processing
Cursor curs = SELECT * FROM <source>
WHILE NOT EOF(curs)
FETCH NEXT ROW INTO myRoW;
INSERT INTO <target> VALUES(myRow);
LOOP
INSERT INTO <target>
SELECT * from <source>
Error handling easy All or nothing if there are errors
Slow for high amounts of data Performs well for small and high amounts of data
More coding Less code = less errors
Daimler TSS Data Warehouse / DHBW 63
Data Engineering/ETLSummary
Example for Data integration in Data Vault 2.0 architecture
Source: Hans Hultgren: Modeling the agile Data Warehouse with Data Vault, New Hamilton 2012, p. 224
Hard
Rules
only
Soft
Rules
Raw
Data Vault
Business
Data Vault
ETL (E)T(L) ETL
ETL, „M
on
ito
ring“
Daimler TSS Data Warehouse / DHBW 65
ETL-Job parallelism for loading data into Core Warehouse Layer
HU
B lo
ad
ed
LIN
K u
nd
HU
B-
SA
Tlo
ad
ed
LIN
K-S
AT
load
ed
Da
ta V
au
lt
Lo
ad
Cla
ssic
al
Lo
ad
?
? ?
Integration of new JobsTime Windows for Loads, e.g 00:00-06:00
• Complex
• Many dependencies
• Many sequential jobs
• Systematic / Methodic
• Few, well defined dependencies
• Massive parallel
Daimler TSS Data Warehouse / DHBW 66
Standard Data Warehouse architecture
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging
Layer
(Input
Layer)
OLTP
OLTP
Core
Warehouse
Layer
(Storage
Layer)
Mart Layer
(Output
Layer)
(Reporting
Layer)
Integration
Layer
(Cleansing
Layer)
Aggregation
Layer
Metadata Management
Security
DWH Manager incl. Monitor
Daimler TSS Data Warehouse / DHBW 67
? ? ? ??
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99
tss@daimler.com / Internet: www.daimler-tss.com
Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle
© Daimler TSS I Template Revision
Learn how to replace code
• Machine learning will no doubt change software development in significant ways
• Software developers will put much more effort into data collection and preparation
• Developers will have to do more than just collect data; they’ll have to build data pipelines and the infrastructure to manage those pipelines. We’ve called this “Data Engineering”
• Data engineers will be responsible for maintaining the data pipeline: ingesting data, cleaning data, feature engineering, and model discovery
Source: https://www.oreilly.com/ideas/what-machine-learning-means-for-software-development
Daimler TSS Data Warehouse / DHBW 69
Replication-based
• Replication techniques
• Data replication
• Target tables not necessarily on local system
• Uses typically Transaction Logs
• Log reader identifies insert, update, delete, truncates and writes the changes into replicated tables (insert remains insert, update remains update, etc)
• Useful for 1:1 copies (e.g. ODS, Operational Data Store) but still challenge to detect changes for loading the data mart
Daimler TSS Data Warehouse / DHBW 70
Replication-based (sample product architecture IIDR)
Fro
nte
nd
Standard
Reports
AdHoc
Reports
IIDR
ReplEngine
Source
Datastore
Source
OLTP
DBIIDR ReplEngine
DWH
Datastore
DWH
DWH DB
Staging Layer
Core Layer
Mart Layer
Transaction
Logs
Daimler TSS Data Warehouse / DHBW 71
Extraction intervals
Extraction intervals
• Periodically – in regular intervals
• Every day, week, etc.
• Instantly / Continuous
• Every change is directly propagated into the data warehouse
• „real time data warehouse“
• Depends on the requirements on timeliness of the data warehouse data
Daimler TSS Data Warehouse / DHBW 72
Extraction intervals
Triggered by a specific request
• Addition of a new product
• Query which involves more recent data
Triggered by specific events
• Number of changes in operational data exceeds threshold
Daimler TSS Data Warehouse / DHBW 73
Data scientist sexiest job of the 21st century?
Source: https://www.simplilearn.com/what-skills-do-i-need-to-become-a-data-scientist-articleDaimler TSS Data Warehouse / DHBW 74
Exercise: load Data Vault table
Draw a flow diagram how to load a HUB, LINK and SAT table and describe the SQL statements
Daimler TSS Data Warehouse / DHBW 75
Exercise: load HUB table
Source
data exist
Load distinct
business keys
Does
business
Key exist in
HUB?
Insert row into
HUB
Conflict if PK
HashKey
collision!
no
Reject
data
Data loaded into
HUB
yes
Daimler TSS Data Warehouse / DHBW 76
Exercise: load HUB table
INSERT INTO core.fahrzeug (vehicle_hk, fin, loaddate, recordsource)
SELECT DISTINCT f.fahrzeug_hashkey
, f.fin_bk
, f.loaddate
, f.recordsource
FROM staging.fahrzeugdaten f
WHERE f.fin_bk NOT in (SELECT fin FROM core.hub_fahrzeug)
AND f.loaddate = <date to load>;
Daimler TSS Data Warehouse / DHBW 77
Exercise: load LINK table
Source
data exist
Load distinct
business keys
Does Hash
Key
relationship
exist in
HUB?
Insert row into
LINK
Conflict if PK
HashKey
collision!
no
Reject
data
Data loaded into
LINK
yes
Daimler TSS Data Warehouse / DHBW 78
Exercise: load LINK table
INSERT INTO core.link_verbaut (verbaut_hk, motor_hk, vehicle_hk, loaddate, recordsource)
SELECT DISTINCT h.verbaut_hk
, f.motor_hashkey
, f.fahrzeug_hashkey
, f.loaddate
, f.recordsource
FROM staging.fahrzeugdaten f
WHERE (f.motor_hashkey, f.fahrzeug_hashkey) NOT in (SELECT motor_hk, vehicle_hk FROM core.link_verbaut v)
AND loaddate = <date to load>;
Daimler TSS Data Warehouse / DHBW 79
Exercise: load SAT table
Source
data exist
Load
distinct
source
data
MD5-
HASH
Diff
identical?
Insert row into
SAT
no
Reject
data
Data loaded into
SAT
yes
Load
current/
latest row
from SAT
table
Daimler TSS Data Warehouse / DHBW 80
Exercise: load SAT table
INSERT INTO core.sat_fahrzeug_text (vehicle_hk, loaddate, recordsource, md5_hash, codeleiste, kommentar)
SELECT DISTINCT f.fahrzeug_hashkey
, f.loaddate
, f.recordsource
, f.md5hash
, f.codeleiste
, f.kommentar
FROM staging.fahrzeugdaten f
LEFT OUTER JOIN (select s.vehicle_hk, s.md5_hash from s_fahrzeug s JOIN (select i.VEHICLE_HK, max(i.loaddate) as loaddate from s_fahrzeug i GROUP BY i.VEHICLE_HK) m
ON s.vehicle_hk = m.vehicle_hk AND s.loaddate = m.loaddate) k ON f.fahrzeug_hashkey = k.vehicle_hk
WHERE (k.md5_hash is null OR f.md5hash <> k.md5_hash)
AND f.loaddate = <date to load>;
Daimler TSS Data Warehouse / DHBW 81
Bounded vs unbounded: Streaming ETL
Finite
Complete
consistent
Infinite
inomplete
inconsistent
Daimler TSS Data Warehouse / DHBW 82
Different Times in streaming ETL
• Event-time: the time the event was created in the origin
• Load-time (ingestion-time): the time the event was stored in a database or storage engine in general
• Processing-time: the time the event was consumed by e.g. a stream processor
t00:00 00:05 00:10 00:15 00:20 00:25
Daimler TSS Data Warehouse / DHBW 83
Window slices
t00:00 00:05 00:10 00:15 00:20 00:25
Daimler TSS Data Warehouse / DHBW 84
Window slices .. And late arrivals
t00:00 00:05 00:10 00:15 00:20 00:25
t = 00:01
Daimler TSS Data Warehouse / DHBW 85
Window slices
• Tumbling: Fixed size, gap-less
• Hopping: Fixed size, overlapping
• Session: Variable size, key-bound time
Daimler TSS Data Warehouse / DHBW 86
Window slicesTumbling: fixed-size, gap-less
t00:00 00:05 00:10 00:15 00:20 00:25
Daimler TSS Data Warehouse / DHBW 87
Advance
by
Window slicesHopping: fixed-size, overlapping
t00:00 00:05 00:10 00:15 00:20 00:25
Daimler TSS Data Warehouse / DHBW 88
Window slicesSession: variable size, key-bound timeout
t00:00 00:05 00:10 00:15 00:20 00:25
gap
of
Inactivity (timeout)
gap
of
Inactivity (timeout)Daimler TSS Data Warehouse / DHBW 89
Streaming tools
Source: https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
Daimler TSS Data Warehouse / DHBW 90
Streaming ETLKafka Streams and Kafka KSQL
Source: https://docs.confluent.io/current/KSQL/docs/concepts/KSQL-and-kafka-streams.html
Daimler TSS Data Warehouse / DHBW 91
Streaming ETLKafka Streams and Kafka KSQL
Source: https://docs.confluent.io/current/KSQL/docs/concepts/KSQL-and-kafka-streams.html
Daimler TSS Data Warehouse / DHBW 92
KSQL stream
The following example creates a stream that has three columns from the pageviews topic: viewtime, userid, and pageid
CREATE STREAM pageviews \
(viewtime BIGINT, \
userid VARCHAR, \
pageid VARCHAR) \
WITH (KAFKA_TOPIC='pageviews');
Daimler TSS Data Warehouse / DHBW 93
KSQL table
This query computes the pageview count per region per minute:
CREATE TABLE pageviews_per_region_per_minute AS \
SELECT regionid, \
COUNT(*) \
FROM pageviews \
WINDOW TUMBLING (SIZE 1 MINUTE) \
GROUP BY regionid;
Daimler TSS Data Warehouse / DHBW 94
KSQL joins
For example, to find orders that have shipped within the last hour from an orders stream and a shipments stream, you might run a query like:
SELECT o.order_id, o.total_amount, o.customer_name, s.shipment_id,
s.warehouse \
FROM new_orders o \
INNER JOIN shipments s \
WITHIN 1 HOURS \
ON o.order_id = s.order_id;
Daimler TSS Data Warehouse / DHBW 95
ETL original vs general meaning
• ETL - Original meaning: load data from source systems into the Core Warehouse Layer
• General meaning: Loading data from source system into staging layer and/or between any layer
Daimler TSS Data Warehouse / DHBW 96
Data Warehouse /
DHBWDaimler TSS 97
Data pyramid and data quality
Source: By Matthew.viel - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=49310779 LinkedIn 11/2017: https://www.linkedin.com/feed/update/urn:li:activity:6334062387355746304
Daimler TSS Data Warehouse / DHBW 97