Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Bucken... · 2019-10-17 · Big Data...
Transcript of Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/Bucken... · 2019-10-17 · Big Data...
Ein Unternehmen der Daimler AG
Lecture @DHBW: Data Warehouse
03 Architecture
Andreas Buckenhofer
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com
Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle
© Daimler TSS I Template Revision
Andreas BuckenhoferSenior DB Professional
Since 2009 at Daimler TSS
Department: Machine Learning Solutions
Business Unit: AnalyticsDHBWDOAG
Contact/Connect
vcard
• Oracle ACE Associate
• DOAG responsible for InMemory DB
• Lecturer at DHBW
• Certified Data Vault Practitioner 2.0
• Certified Oracle Professional
• Certified IBM Big Data Architect
• Over 20 years experience with
database technologies
• Over 20 years experience with Data
Warehousing
• International project experience
Daimler TSS Data Warehouse / DHBW 3
Change Log
Date Changes
17.10.2019 Initial version
Daimler TSS Data Warehouse / DHBW 4
What you will learn today
• Explain and draw different DWH architectures
• Explain and draw Big Data architectures
• Understand Data Lakes and other buzzwords
Purpose: why are DWH/Big Data architectures useful?
• Specific implementation can follow an architecture
• Architecture describes an ideal type. Therefore an implementation may not use all components or can combine components
• Better understanding, overview and complexity reduction by decomposing a DWH into its components
• Can be used in many projects: repeatable, standardizable
• Map DWH tools into the different components and compare functionality
• Functional oriented as it describes data and control flow
5Daimler TSS Data Warehouse / DHBW 5
1. DWH architectures
• Standard DWH architecture
• Kimball Bus architecture
• Data Vault 2.0 architecture
2. Big Data architectures
• Data Lake
• Lambda architecture
• Kappa architecture
1. DWH architectures
• Standard DWH architecture
• Kimball Bus architecture
• Data Vault 2.0 architecture
2. Big Data architectures
• Data Lake
• Lambda architecture
• Kappa architecture
Standard Data Warehouse architecture
8
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging
Layer
(Input
Layer)
OLTP
OLTP
Core
Warehouse
Layer
(Storage
Layer)
Mart Layer
(Output
Layer)
(Reporting
Layer)
Integration
Layer
(Cleansing
Layer)
Aggregation
Layer
Metadata Management
Security
DWH Manager incl. Monitor
Daimler TSS Data Warehouse / DHBW 8
Data sources
• Providing internal and external data out of the source systems
• Enabling data through Push (source is generating extracts) or Pull (BI Data Backend is requesting or directly accessing data)
• Example for Push practice (deliver csv or text data through file interface; Change Data Capture (CDC))
• Example for Pull practice (direct access to the source system via ODBC, JDBC, API and so on)
9Daimler TSS Data Warehouse / DHBW 9
Staging layer
• “Landing Zone” for data coming into a DWH
• Purpose is to increase speed into DWH and decouple source and target system (repeating extraction run, additional delivery)
• Granular data (no pre-aggregation or filtering in the Data Source Layer, i.e. the source system)
• Usually not persistent, therefore regular housekeeping is necessary (for instance delete data in this layer that is few days/weeks old or – more common - if a correct upload to Core Warehouse Layer is ensured)
• Tables have no referential integrity constraints, columns often varchar
10Daimler TSS Data Warehouse / DHBW 10
Integration layer
• Business Rules, harmonization and standardization of data
• Classical Layer for transformations: ETL = Extract – TRANSFORM – Load
• Fixing data quality issues
• Usually not persistent, therefore regular housekeeping is necessary (for instance after a few days or weeks or at the latest once a correct upload to Core Warehouse Layer is ensured)
• The component is often not required or often not a physical part of a DB
Daimler TSS Data Warehouse / DHBW 11
Core Warehouse layer
• Data storage in an integrated, consolidated, consistent and non-redundant (normalized) data model
• Contains enterprise-wide data organized around multiple subject-areas
• Application / Reporting neutral data storage on the most detailed level of granularity (incl. historic data)
• Size of database can be several TB and can grow rapidly due to data historization
• Write-optimised layer
Daimler TSS Data Warehouse / DHBW 12
Aggregation layer
• Preparing data for the Data Mart Layer to the required granularity
• E.g. Aggregating daily data to monthly summaries
• E.g. Filtering data (just last 2 years or just data for a specific region)
• Harmonize computation of key performance indicators (measures) and additional Business Rules
• The component is often not required or often not a physical part of a DB
Daimler TSS Data Warehouse / DHBW 13
Data Mart layer
• Read-optimised layer: Data is stored in a denormalized data model for performance reasons and better end user usability/understanding
• The Data Mart Layer is providing typically aggregated data or data with less history (e.g. latest years only) in a denormalized data model
• Created through filtering or aggregating the Core Warehouse Layer
• One Mart ideally represents one subject area
• Technically the Data Mart Layer can also be a part of an Analytical Frontend product (such as Qlik, Tableau, or IBM Cognos TM1) and need not to be stored in a relational database
Daimler TSS Data Warehouse / DHBW 14
Metadata management, Security, DWH Manager
• Metadata Management
• Not just “Data about Data”, separate lecture
• Security
• Not all users are allowed to see all data
• Data security classification (e.g. restricted, confidential, secret)
• DWH Manager incl. Monitor
• DWH Manager initiates, controls, and checks job execution
• Monitor identifies changes/new data from source systems, separate lecture
Daimler TSS Data Warehouse / DHBW 15
1. DWH architectures
• Standard DWH architecture
• Kimball Bus architecture
• Data Vault 2.0 architecture
2. Big Data architectures
• Data Lake
• Lambda architecture
• Kappa architecture
Kimball Bus Architecture (Central Data Warehouse based on data marts)
Source: http://www.kimballgroup.com/2004/03/differences-of-opinion/Daimler TSS Data Warehouse / DHBW 17
Kimball bus architecture (Central Data Warehouse based on data marts)
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging Layer
(Input Layer)
OLTP
OLTP
Core Warehouse Layer
= Mart Layer
Data Mart 1
Data Mart 2Data Mart 3
Metadata Management
Security
DWH Manager incl. Monitor
More Business-
process oriented
than subject-
oriented,
integrated,
time-variant,
non-volatile
Daimler TSS Data Warehouse / DHBW 18
Kimball Bus Architecture (Central Data Warehouse based on data marts)
• Bottom-up approach
• Dimensional model with denormalized data
• Sum of the data marts constitute the Enterprise DWH
• Enterprise Service Bus / conformed dimensions for integration purposes
• (don’t confuse with ESB as middleware/communication system between applications)
• Kimball describes that agreeing on conformed dimensions is a hard job and it’s expected that the team will get stuck from time to time trying to align the incompatible original vocabularies of different groups
• Data marts need to be redesigned if incompatibilities exist
Daimler TSS Data Warehouse / DHBW 19
Co
re W
are
house
Laye
r
Data Integration with and without Core Warehouse layer
Daimler TSS Data Warehouse / DHBW 20
1. DWH architectures
• Standard DWH architecture
• Kimball Bus architecture
• Data Vault 2.0 architecture
2. Big Data architectures
• Data Lake
• Lambda architecture
• Kappa architecture
Data Vault 2.0 architecture – today’s world (Dan Linstedt)
Data Warehouse /
DHBW
Data Vault 2.0 architecture (Dan Linstedt)
Daimler TSS
Michael Olschimke, Dan Linstedt: Building a Scalable Data Warehouse with Data Vault 2.0, Morgan Kaufmann, 2015, Chapter 2.2
Daimler TSS Data Warehouse / DHBW 23
Data Vault 2.0 architecture (Dan Linstedt)
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging Layer
(Input Layer)
OLTP
OLTP
Raw
Data Vault
Mart Layer
(Output
Layer)
(Reporting
Layer)
Business
Data Vault
Metadata Management
Security
DWH Manager incl. Monitor
Hard
Rules
onlySoft Rules
Daimler TSS Data Warehouse / DHBW 24
Data Vault 2.0 architecture (Dan Linstedt)
• Core Warehouse Layer is modeled with Data Vault and integrates data by BK (business key) “only” (Data Vault modeling is a separate lecture)
• Business rules (Soft Rules) are applied from Raw Data Vault Layer to Mart Layer and not earlier
• Alternatively from Raw Data Vault to additional layer called Business Data Vault
• Hard Rules don’t change data
• Data is fully auditable
• Real-time capable architecture
• Architecture got very popular recently; also applicable to BigData, NoSQL
Daimler TSS Data Warehouse / DHBW 25
Data Vault 2.0 architecture (Dan Linstedt)
• In the classical DWHs, the Core Warehouse Layer is regarded as “single version of the truth”
• Integrates + cleanses data from different sources and eliminates contradiction
• Produces consistent results/reports across Data Marts
• But: cleansing is (still) objective, Enterprises change regularly, paradigm does not scale as more and more systems exist
• Data in Raw Data Vault Layer is regarded as “Single version of the facts”
• 100% of data is loaded 100% of time
• Data is not cleansed and bad data is not removed in the Core Layer (Raw Vault)
Daimler TSS Data Warehouse / DHBW 26
Data Vault 2.0 architecture (Dan Linstedt)
• Data Vault is optimized for the following requirements:
• Flexibility
• Agility
• Data historization
• Data integration
• Auditability
• Bill Inmon wrote in 2008: “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.” (DW2.0)
Daimler TSS Data Warehouse / DHBW 27
Daimler TSS Data Warehouse / DHBW 28
Data Vault – architecture, methodology, model
Lecture
DWH Architectures
Lecture
DWH Data Modeling Implementation: Automation,
Pattern based, High speed
Architecture
• Multi-Tier
• Scalable
• Supports NoSQL
Methodology
• Repeatable
• Measureable
• Agile
Model
• Flexible
• Hash based
• Hub & Spoke
Example DWH for state of construction documentation
Architecture from a project in the automotive industry
ETL Engine
Fro
nte
nd
Standard
Reports
AdHoc
ReportsLogs
TSM
IIDR
ReplEngine
Source
Datastore
Source
Mirror DB
(Operational
Data Store)
OLTP
DB
IIDR ReplEngine
Mirror
Datastore
Mirror
IIDR ReplEngine
DWH
Datastore
DWH
Backend
DWH DB
Staging Layer
Raw + Business Data Vault
Mart Layer
Daimler TSS Data Warehouse / DHBW 30
End user sample questions
Which vehicles or aggregates are documented incompletely? (Data quality)
Which vehicles / which control units require SW updates?
Which interiors are most common by region?
Supply data for external simulations, customs clearance, spare part planning, etc.
Daimler TSS Data Warehouse / DHBW 31
Exercise: recommend an architecture
Review the presented Data Warehouse architectures.
Which architecture would you recommend for
• An online store with real/right-time data integration needs
• Marketing department of a bank
List advantages and drawbacks of your proposal.
Daimler TSS Data Warehouse / DHBW 32
Exercise: recommend an architecture
An online store with real-time/right-time data integration needs
• Architecture: Data Vault 2.0
• + Integration of many internal and external source systems (e.g. integrate social media data about the online store)
• + Fast data delivery in Raw Vault Layer (Real-time/Right-time data integration). Complex data cleansing / transformation / soft rules are delayed downstream towards Mart Layer
• - Transformation overhead (Source system data model to Data Vault data model to Dimensional data model)
Daimler TSS Data Warehouse / DHBW 33
Exercise: recommend an architecture
Marketing department of a bank
• Architecture: Kimball Bus architecture
• + Start small for a department. If other departments are interested, new data and new Marts can be added on demand
• - High risk to loose the Enterprise view and several DWHs are built
That’s still quite a common scenario nowadays. A single Enterprise DWH is often not achieved (e.g. Mergers & Acquisitions, inflexibility due to a single centralized DWH, rapidly changing conditions, etc.) and therefore very often several DWHs with different architectures exist in parallel within a company.
Daimler TSS Data Warehouse / DHBW 34
Summary DWH architectures
Which layers does the standard DWH architecturehave?
• Staging (Input), Integration (Cleansing), Core Warehouse (Storage), Aggregation, Mart (Reporting, Output) and additionally Metadata, Security, DWH Manager + Monitor
Which other architectures exist?
• Kimball Bus Architecture (Central Data Warehouse based on data marts)
• Data Vault 2.0 Architecture (Dan Linstedt)
There are even more
• DW 2.0: The Architecture for the Next Generation of Data Warehousing
• Operational Data Store (ODS)
Daimler TSS Data Warehouse / DHBW 35
DWH architectures - advantages
Daimler TSS Data Warehouse / DHBW 36
Standard DWH
Single version of the truth
Kimball Bus
Single version of the truth
Data Vault 2.0
Single version of the
facts
• Separation of concerns: long-
term enterprise data storage
separated from data
presentation
• Changes in requirements and
scope are easier to manage
• Lower subsequent
development costs (but
higher startup costs)
• Two layers only mean faster
development and less work
• Rather simple approach to
make data fast and easily
accessible
• Lower startup costs (but
higher subsequent
development costs)
• Separation of concerns:
long-term enterprise data
storage separated from
data presentation
• Store raw data and
postpone time-consuming
data cleansing
• Supports flexibility and
agility
DWH architectures - disadvantages
Daimler TSS Data Warehouse / DHBW 37
Standard DWH Kimball Bus Data Vault 2.0
• Data model transformations
from 3NF to Dimensional
model required
• More complex as two
different data models are
required
• Larger team(s) of specialists
required
• High effort to start
• Can become inflexible and
expensive in the long-run
• If table structures change (instable
source systems), high effort to
implement the changes and reload
data, especially conformed
dimensions (“Dimensionitis”
desease)
• Non-metric data not optimal for
dimensional model
• Dimensional model (esp. Star
Schema) contains data redundancy
• Difficult to optimize: read or write?
• Data model transformations
from Data Vault model to
Dimensional model required
• More complex as two different
data models are required
1. DWH architectures
• Standard DWH architecture
• Kimball Bus architecture
• Data Vault 2.0 architecture
2. Big Data architectures
• Data Lake
• Lambda architecture
• Kappa architecture
1. DWH architectures
• Standard DWH architecture
• Kimball Bus architecture
• Data Vault 2.0 architecture
2. Big Data architectures
• Data Lake
• Lambda architecture
• Kappa architecture
Standard Data Warehouse architecture
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging
Layer
(Input
Layer)
OLTP
OLTP
Core
Warehouse
Layer
(Storage
Layer)
Mart Layer
(Output
Layer)
(Reporting
Layer)
Integration
Layer
(Cleansing
Layer)
Aggregation
Layer
Metadata Management
Security
DWH Manager incl. Monitor
Daimler TSS Data Warehouse / DHBW 40
What is a Data Lake?
Hadoop
Dump anything in and wait?
Hoard 100ths of
Petabyte in HDFS?
Daimler TSS Data Warehouse / DHBW 41
What are differences between Hadoop and DWH?
Dump question, but actually there are many comparisons like that at the moment
• Hadoop is a tool / technology (or even many tools) like a RDBMS
• DWH is an architecture and concept
• Architecture is abstraction and defines a goal
• Architecture vs tools / technology
Daimler TSS Data Warehouse / DHBW 42
Data Lake on Hadoop
Data Swamp
Data Reservoir
Landing Zone
Data Library
Data Repository
Data Archive
Data Lake on Spark
Data Lake 3.0
Daimler TSS Data Warehouse / DHBW 44
Landing Zone – Definition by Dan Linstedt
Source: https://datavaultalliance.com/news/defining-a-data-lake/
Daimler TSS Data Warehouse / DHBW 45
Data Lake – Definition by Dan Linstedt
Source: https://datavaultalliance.com/news/defining-a-data-lake/
Daimler TSS Data Warehouse / DHBW 46
Data LakeData Vault 2.0 systems architecture by Dan Linstedt
Source: https://datavaultalliance.com/news/defining-a-data-lake/
Data LakeZones instead of layers according to Mark Madsen
New data of unknown value,
simple requests for new data
can land here first, with little
work by IT. Typically
schema-on-read.
More effort applied to data
management: cleansed,
curated. Typically schema-
on-write.
Optimized for specific uses /
workloads. Personal folders.
Schema-on-write.
Daimler TSS Data Warehouse / DHBW 47
Source: https://de.slideshare.net/mrm0/architecting-a-data-platform-for-enterprise-use-strata-ny-2018
It‘s very hard to get speed and quality – choose one
Stability & Quality
•Predictable
•Improving
•Save user time
Flexibility & Speed
•Explanatory
•Experimenting
•Save developer time
Source: https://www.gartner.com/it-glossary/bimodal/
Daimler TSS Data Warehouse / DHBW 48
?
Assumptions + requirements have changed
• The data is not always known in advance, so it can’t be modeled in advance. [data can be anywhere collect everything approach]
• The data architecture must be read-write from both the back and front, not a one-way data flow. The data written back may be repeatedly used, persistent data, or it may be temporary.
• The data may arrive with any frequency, and the rate may not be under your control.
Daimler TSS Data Warehouse / DHBW 49
Use case: Analysis battery aging
• JSON data ingested into HDFS, Hive tables on JSON files
• Identify breaks (“> 8h”) and compute current drain
Max capacity
Current capacity
Daimler TSS Data Warehouse / DHBW 50
Structuring the Data Lakenew data sources – sensor data
• Sensor data format change without notice
• Sensors get regularly updated with new versions
• Names of metrics may change
• Sensors with various versions in the field
• Sensors from different suppliers
• Often many fields >>100 and increasing with new sensor versions
• Easy storing of data in HDFS and applying schema later
• Data from Robots, vehicles, …
Daimler TSS Data Warehouse / DHBW 51
Use case: Analysis battery agingData Lake architecture
Structuring the Data Lake “schema-on-read”
• Sensor data format change without notice
• Time consuming and error-prone data integration into the Data Lake
• Therefore preparation of data for usage in the Consumption area required: “Data Engineer”
Raw dataD
ata
Gove
rnan
ce
Consumption
Enhanced data
Meta
data
Managem
ent
Data
Arc
hiva
l
Data
Security
json
Samp-
ling /
filter
Hive tables
Hive tables
Struc-
ture
R Python
Daimler TSS Data Warehouse / DHBW 52
Characteristics of a Data Lake architecture
• Landing Zone, Data Dump, Data Swamp are not Data Lakes
• Characteristics of a Data Lake architecture:
• Is managed, governed, organized
• Deals with data and schema change easily
• Does not always require up front modeling
• Does not limit the format or structure of data
• Assumes a full range of data latencies, from streaming to one-time bulk loads
• Leverages lineage and metadata
Daimler TSS Data Warehouse / DHBW 53
1. DWH architectures
• Standard DWH architecture
• Kimball Bus architecture
• Data Vault 2.0 architecture
2. Big Data architectures
• Data Lake
• Lambda architecture
• Kappa architecture
Lambda architecture
• Architecture by Nathan Marz
• Combine realtime and batch processing
• Batch layer stores and historizes raw data
• Serving layer contains batch views
• Query unions serving and realtime layer
• Rather complex
• Author recommends graph data model and advises against schema-on-read
Daimler TSS Data Warehouse / DHBW 55
What is the main inevitable problem in data systems?
• Lack of human fault tolerance
• Bugs will be deployed
• Operational errors will happen, e.g. accidentally delete data
• Data loss / corruption is worst case scenario
• Without data loss / corruption, mistakes can be fixed if original data is still available
• Must design for human error like you’d design for any other fault
Daimler TSS Data Warehouse / DHBW 56
Analytics vs Re-Analytics
“My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.”–Jules J. Berman.
Source: http://www.odbms.org/blog/2014/07/big-data-science-interview-jules-j-berman/
Daimler TSS Data Warehouse / DHBW 57
Immutable data - Master data set
• Atomic, immutable data
• Ensure that data can not be deleted (accidentally)
• Fundamentally simpler
• Easy to implement on top of a distributed filesystem, eg Hadoop
• CR instead of CRUD
• No updates
• No deletes
• Create (insert) and read (select) only
Immutability restricts the range of errors that can cause data loss/corruption
Daimler TSS Data Warehouse / DHBW 58
What is the Lambda architecture?
Batch Layer
All Data (Master data set)
Speed Layer
RealTime Views
Serving Layer
Batch Views
Query(merge)
Data StreamDaimler TSS Data Warehouse / DHBW 59
Lambda architecture – data examplecompute follower list
Batch
Layer
1.1. insert Jim
1.1. insert Anne
2.1 remove Jim
3.1. insert George
5.1. insert John (now)
Speed Layer
insert John (now)
Serving
LayerAnne
George
Query
Data Stream
Anne
George
John
Follower: 3
Daimler TSS Data Warehouse / DHBW 60
Batch views and realtime views
Daimler TSS Data Warehouse / DHBW 61
Data absorbtion
https://dzone.com/articles/Lambda-architecture-with-apache-spark
Daimler TSS Data Warehouse / DHBW 62
Eventual accuracy
Computing best answer in real time may not always be possible
• Can compute exact answer in batch layer and approximate answer in realtime layer
• Best of both worlds of performance and accuracyFor example, a machine learning application where generation of the batch model requires so much time and resources that the best result achievable in real-time is computing and approximated updates of that model. In such cases, the batch and real-time layers cannot be merged, and the Lambda architecture must be used.
Daimler TSS Data Warehouse / DHBW 63
Nathan Marz view on Schema
• RawnessStore the data as it is. No transformations.
• ImmutabilityDon’t update or delete data, just add more.
• Graph-like schema recommended
„Many developers go down the path of
writing their raw data in a schemaless
format like JSON. This is appealing because
of how easy it is to get started, but this
approach quickly leads to problems.
Whether due to bugs or misunderstandings
between different developers, data
corruption inevitably occurs“
(see page 103, Nathan Marz, „Big Data:
Principles and best practices of scalable
realtime data systems", Manning
Publications)
Source image: Nathan Marz, James Warren: Big Data: Principles and best practices of scalable realtime data systems, Manning Publications 2015
Daimler TSS Data Warehouse / DHBW 64
Which tools could be used in the Lambda architecture?
Batch Layer
All Data
Speed Layer
RealTime Views
Serving Layer
Batch Views
Query(merge)
Data StreamDaimler TSS Data Warehouse / DHBW 65
Cloud vendors offerings, e.g. Microsoft Azure
https://social.technet.microsoft.com/wiki/contents/articles/33626.Lambda-architecture-implementation-using-microsoft-azure.aspx
Daimler TSS Data Warehouse / DHBW 66
Cloud vendors offerings, e.g. Amazon AWS
https://aws.amazon.com/de/blogs/big-data/unite-real-time-and-batch-analytics-using-the-big-data-Lambda-architecture-without-servers/
https://d0.awsstatic.com/whitepapers/lambda-architecure-on-for-batch-aws.pdfDaimler TSS Data Warehouse / DHBW 67
sourc
e: M
ark
us
Sch
mid
berg
er
-B
ig D
ata
ist
tot
–E
sle
be
Busi
ness
In
tellig
enz?
, TD
WI 2
01
6
Daimler TSS Data Warehouse / DHBW 68
sourc
e: M
ark
us
Sch
mid
berg
er
-B
ig D
ata
ist
tot
–E
sle
be
Busi
ness
In
tellig
enz?
, TD
WI 2
01
6
Lambda @glomex
• Enrich batch-
driven data
processing with
real-time
requirements
• Adapt Lambda
architecture to
own requirements
Remark:
AWS Lambda # Lambda architecture
Lambda architecture – pros and cons
Pro Con
Architecture emphasizes to keep data
immutable. Mistakes can be corrected via
recomputation
Maintaining code that needs to produce
the same result in two complex distributed
systems
Reprocessing is one of the key challenges
of stream processing but is very often
ignored
Operational burden of running and
debugging two systems
Daimler TSS Data Warehouse / DHBW 70
Summary Lambda architecture
• Incoming data is sent to batch and speed layer
• Batch layer constantly (re-) computes batch views
• Master data is stored in the batch layer in raw format: immutable & append-only
• Contains data except most up-to-date data due to high latency
• Replaces data in speed layer as soon as data in newer compared to speed layer
• Speed layer uses incremental algorithms to refresh real-time Views
• Receives data stream for real-time processing
• Contains most up-to-date data only
• Serving layer contains views on batch layer data
• Merge is done in the query or in the serving layerDaimler TSS Data Warehouse / DHBW 71
1. DWH architectures
• Standard DWH architecture
• Kimball Bus architecture
• Data Vault 2.0 architecture
2. Big Data architectures
• Data Lake
• Lambda architecture
• Kappa architecture
Questioning the Lambda architecture
Jay Kreps wrote an article about „Questioning the Lambda architecture“
• He wrote about his experience with the Lambda architecture
• It works, but not very pleasant or productive
• Keeping code in sync is really hard
• Need to build complex, low-latency processing systems
• Scalable high-latency batch system
• Low-latency stream stream-processing system
• Instead of duct taping batch & speed:
Kappa architecture / Log-centric architecture / Stream data platform
Source: https://www.oreilly.com/ideas/questioning-the-Lambda-architecture
Daimler TSS Data Warehouse / DHBW 73
Kappa architecture
• Architecture by Jay Kreps
• Logcentric, write-ahead logging
• Each event is an immutable log entry and is added to the end of the log
• Read and write operations are separated
• Materialized views can be recomputed consistently from data in the log
Daimler TSS Data Warehouse / DHBW 74
75
What is a message? A log!
Source: Jay Kreps: I heart logs, O’Reilly 2014
Daimler TSS Data Warehouse / DHBW
76
What is a log?
• A bank account’s current balance can be built from a complete list of its debits and credits, but the inverse is not true.
• In this way, the log of transactions is the more “fundamental” data structure than the database records storing the results of those transactions.
A software application’s database is better thought of as a series of time-ordered immutable facts collected since that system was born, instead of as a current snapshot of all data records as of right now.Source: https://blog.parse.ly/post/1550/kreps-logs/
Daimler TSS Data Warehouse / DHBW
Kappa architecture
Source: https://www.oreilly.com/ideas/questioning-the-Lambda-architecture
Daimler TSS Data Warehouse / DHBW 77
Reprocessing
Lambda architecture
• Complex
• It works, but not very pleasant or productive: keeping code in sync is really hard
• Addresses the challenge of reprocessing
• Batch views to contain updated data (a Lambda architecture rcommendation)
• Bugs (true requirement for reprocessing)
• New user requirements (true „requirement“ for reprocessing)
Logs unify
• Batch processing + Stream processing
• In Kappa, reprocessing required only when processing logic has been modified
Daimler TSS Data Warehouse / DHBW 78
Kappa architecture – pros and cons
Pro Con
Flexible architecture that can be used
• stand-alone
• combined with a Data Lake (or DWH)
• in a microservice architecture
All use cases must be mapped to a
streaming problem which can become
artificial for well-known cases, e.g. sales
prognosis.
Not suited for analysis of history data:
sequential reading and writing is fast,
random is not.
Easy to understand Weak separation of streaming + compute
and store
Daimler TSS Data Warehouse / DHBW 79
Summary: Enterprise-wide architecture
Source: Jay Kreps: I heart logs, O’Reilly 2014
Daimler TSS Data Warehouse / DHBW 80
1. DWH architectures
• Standard DWH architecture
• Kimball Bus architecture
• Data Vault 2.0 architecture
2. Big Data architectures
• Data Lake
• Lambda architecture
• Kappa architecture
Daimler TSS Data Warehouse / DHBW 82
Architecture - summary
One size does not fit all
• Understand why an architecture is needed
• Use + work on an architecture
• Don’t reinvent the wheel
• But adapt to your needs and constraints (time, budget, skills, etc)
Data management according to BARCFlexibility causes heterogenity
Source: https://www.datafestival.de/events/proof-of-concepts-als-strategietool-erfahrungen-aus-der-medienbranche-de/; Jacqueline Bloemen, Datenarchitektur für Business Analytics – was Sie
berücksichtigen sollten, DataFestival 2019
Daimler TSS Data Warehouse / DHBW 83
Exercise - Architecture
• Now imagine that you prepare an exam.
• Identify 1 question about DWH / Big Data architecture that you would ask in an exam.
• Write down the question on a stick-it card.
Daimler TSS Data Warehouse / DHBW 84
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com
Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Vorsitzender), Steffen Bäuerle
© Daimler TSS I Template Revision
Other DWH architectures
• Inmon Corporate Information Factory
• DW 2.0: The Architecture for the Next Generation of Data Warehousing
• Operational Data Store (ODS)
• Virtual architecture
Daimler TSS Data Warehouse / DHBW 86
Exercise: classical DWH architectures
The article
http://www.kimballgroup.com/2004/03/differences-of-opinion/
compares THE two classic DWH architectures.
Read the paper and complete the table / questions on the next slide.
(Caution: The paper is biased / favors one approach; you may want to read other/more papers for a neutral view.)
Daimler TSS Data Warehouse / DHBW 87
Exercise: classical DWH architectures
How are the approaches
called?
Who “invented” the approach?
How many layers are used and how are
the layers called?
Which data modeling approaches are
used in which layer?
In which layer are atomic detail data
stored?
In which layer are aggregated / summary
data stored?
List at least 2 advantages
List at least 2 disadvantages
Exercise: classical DWH architectures
How are the
approaches called?
Kimball Bus Architecture Corporate Information Factory
Who “invented” the approach? • Ralph Kimball • Bill Inmon
How many layers are used and
how are the layers called?
• Data Staging
• Dimensional Data Warehouse
• Data Acquisition
• Normalized Data Warehouse
• Data Delivery / Dimensional Mart
Which data modeling approaches
are used in which layer?
• Data Staging: variable, corresponds to
source system
• Dimensional Data Warehouse:
Dimensional Model
• Data Acquisition: variable, corresponds to
source system
• Normalized Data Warehouse: 3NF
• Data Delivery: Dimensional Model
In which layer are atomic detail
data stored?
• Dimensional Data Warehouse • Normalized Data Warehouse
In which layer are aggregated /
summary data stored?
• Dimensional Data Warehouse • Data Delivery / Dimensional Mart
Daimler TSS Data Warehouse / DHBW 89
Exercise: classical DWH architectures
Kimball Bus Architecture Corporate Information Factory
Advantages • Two layers only mean faster development and
less work
• Rather simple approach to make data fast and
easily accessible
• Lower startup costs (but higher subsequent
development costs)
• Separation of concerns: long-term enterprise data
storage separated from data presentation
• Changes in requirements and scope are easier to
manage
• Lower subsequent development costs (but higher
startup costs)
Disadvantages • If table structures change (instable source
systems), high effort to implement the changes
and reload data, especially conformed
dimensions (“Dimensionitis” desease)
• Non-metric data not optimal for dimensional
model
• Dimensional model (esp. Star Schema) contains
data redundancy
• Data model transformations from 3NF to
Dimensional model required
• More complex as two different data models are
required
• Larger team(s) of specialists required
Daimler TSS Data Warehouse / DHBW 90
Inmon Corporate Information Factory
Source: http://www.kimballgroup.com/2004/03/differences-of-opinion/Daimler TSS Data Warehouse / DHBW 91
Inmon Corporate Information Factory
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging Layer
(Input Layer)
OLTP
OLTP
Core
Warehouse
Layer
(Storage
Layer)
Mart Layer
(Output Layer)
(Reporting Layer)
Metadata Management
Security
DWH Manager incl. Monitor
subject-
oriented,
integrated,
time-
variant,
non-volatile
Daimler TSS Data Warehouse / DHBW 92
Inmon Corporate Information Factory
• Top-down approach
• (Normalized) Core Warehouse is essential for subject-oriented, integrated, time-variant and nonvolatile data storage
• Create (departmental) Data Marts as subsets of Core Enterprise DWH as needed
• Data Marts can be designed with Dimensional model
• The standard DWH architectureis more general compared to CIF, but was mainly influenced by CIF
Daimler TSS Data Warehouse / DHBW 93
DW 2.0: The Architecture for the Next Generation of Data Warehousing
Source: W.H. Inmon, Dan Linstedt: Data Architecture: A Primer for the Data Scientist, Morgan Kaufmann, 2014, chapter 3.1
Operational application
data model
Integrated corporate
data model
Integrated corporate
data model
Archival
data model
Data
Lifecyc
le
Daimler TSS Data Warehouse / DHBW 94
DW 2.0: The Architecture for the Next Generation of Data Warehousing
Main characteristics:
• Structured and “unstructured” data, not just metrics
• Life Cycle of data with different storage areas
• Hot data: High speed, expensive storage (RAM, SSDs) for most recent data
• …
• Cold data: Low speed, inexpensive storage (e.g. hard disks) for old data; archival data model with high compression
• Metadata is an integral part of the DWH and not an afterthought
Daimler TSS Data Warehouse / DHBW 95
Operational data store (ODS)
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging Layer
(Input Layer)
OLTP
OLTP
Core
Warehouse
Layer
(Storage
Layer)
Mart Layer
(Output Layer)
(Reporting Layer)
Metadata Management
Security
DWH Manager incl. Monitor
subject-
oriented,
integrated,
time-
variant,
non-volatile
Operational Data Store
Daimler TSS Data Warehouse / DHBW 96
Operational data store (ODS)
• ODS: Real-time/Right-time layer
• Replication techniques used to transport data from source database to ODS layer with minimal impact on source system
• Data in the ODS has no history and is stored without any cleansing and without any integration (1:1 copy from single source)
• DWH performance not optimal as data model is suited for OLTP and not for reporting requirements
• ODS normally additionally to Staging / Core DWH / Mart Layer but can exist alone without other layers
Daimler TSS Data Warehouse / DHBW 97
Logical architecture (Gartner)
• The LDW is a multi-server / multi-engine architecture
• The LDW can have multiple engines, the main ones being: The Data Warehouse(DW), data marts and the Data Lake
• We should not be trying to choose a single engine for all our requirements. Instead, we should be sensibly distributing all our requirements across the various components we have to choose from. Reasons:
• Regulatory constraints
• Organizational constraints
https://blogs.gartner.com/henry-cook/2018/01/18/logical-data-warehouse-project-plans/Daimler TSS Data Warehouse / DHBW 98
Virtual architecture
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Data
Lake
OLTPQuery Management
Weakly+partly subject-oriented,
Weakly+partly integrated,
Not time-variant,
Not non-volatile
Views,
Alias names,
Metadata
DWH
Daimler TSS Data Warehouse / DHBW 99
Virtual architecture
• Data not extracted from operational systems and stored separately
• Standardized interface for all operational data sources
• One "GUI" for all existing data
• Generates combined queries
• Query Processor joins query result data from different sources
• Can also access data in Hadoop (Polybase, Big SQL, BigData SQL, etc)
Daimler TSS Data Warehouse / DHBW 100
Virtual architecture
• Query Management manages metadata about all operational systems
• (physical) location of data and algorithms for extracting data from OLTP system
• Implementation easier
• Low cost: can use existing hardware infrastructure
• Queries cause significant performance problems in operational systems
• Known problems when analyzing operational data directly
• Same query is processed multiple times (if queried multiple times)
• Same query delivers different results when processed at different times
Daimler TSS Data Warehouse / DHBW 101
Specific Big Data architectures
• There exist well-known reference architectures for Data Warehouses
• Many tools and schema-on-read came with the Hadoop ecosystem
• Was a “black box” at the beginning
• Gets more and more structure with different layers instead of a “black box”
• Structure, modeling, organization, governance instead of tool-only focus
Daimler TSS Data Warehouse / DHBW 102
How does a Data Lake differ from a data swamp?
• Naive idea: dump everything in (“landing zone”)
• Data hoarding is not a data management strategy
• A Data Lake brings in structure
• e.g. create directories in HDFS if Hadoop is used
• /transient
• /raw
• /standardized
• /use-case specific
Daimler TSS Data Warehouse / DHBW 103
What is a Data Lake according to Mark Madsen?separate collect / manage / delivery
Daimler TSS Data Warehouse / DHBW 104
Schema-on-write vs schema-on-read revisited
Old approach New approach
Model Collect
Collect Model
Analyze Analyze
Promotemove data into next
zone if required (data
has been analyzed and
is needed more often)
Daimler TSS Data Warehouse / DHBW 105
Data Vault 2.0 architecture – today’s world (Dan Linstedt)
Defining a Data Lake … by Dan Linstedt
https://www.youtube.com/watch?v=tDNjI1Yvqxw
Interview with “Father of data warehousing” Bill Inmon
https://www.youtube.com/watch?time_continue=48&v=b9YfsjEjVS8
Data Lake turned into Data swamp
Source: Ungerer: Cleaning Up the Data Lake with an Operational Data Hub, O’Reilly Media 2018, p.12
Daimler TSS Data Warehouse / DHBW 109
Data Lake
Source: Ungerer: Cleaning Up the Data Lake with an Operational Data Hub, O’Reilly Media 2018, p.11
• No agreed definition
• Characteristics [Madsen]:
• Deals with data and schema
change easily
• Does not always require up front
modeling
• Does not limit the format or
structure of data
• Assumes a full range of data
latencies, from streaming to
one-time bulk loads, both in and
out including write-backDaimler TSS Data Warehouse / DHBW 110
Bimodal data governance
Two different styles of work: different zones with different degrees ofgovernance
• Stability & Quality
• Predictable, improving and renovating
• Flexibility & Speed
• Explanatory
Source: https://www.gartner.com/it-glossary/bimodal/
Daimler TSS Data Warehouse / DHBW 111
Exercise: recommend an architecture
A holding of 3 telecommunication companies
• Architecture: Virtual Data Warehouse
• + Companies may not want to provide their data to a new storage
• + Can easily be extended if new companies join the holding or reduced if a company leaves the holding
• - Bad performance
• - Not really data integration achieved, low Data Quality
• - Firewalls have to be opened
Daimler TSS Data Warehouse / DHBW 112
Where to use Hadoop and spark?
Source: Rick F. van der Lans: New Data Storage Technologies, TDWI Munich 2018
Where to use nosql?
Source: Rick F. van der Lans: New Data Storage Technologies, TDWI Munich 2018
Examples of Data Warehouses in the industry
Apple: multiple Petabytes
• Customer insights: who’s who and what are the customers up to
Walmart: 300TB (2003), several PB today
• It tells suppliers, “You have three feet of shelf space. Optimize it.”
eBay: >10PB, 100s of production DBs fed in
• Get better understanding of customers
Most DWHs are much smaller though. For huge and small DWHs: High challenges to architect + develop + maintain + run such complex systemshttps://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/ and http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
Daimler TSS Data Warehouse / DHBW 115
•Architecture, conceptData Lake
•Tools (that can be used to implement a Lake)
Hadoop, Spark, Elastic Stack
Data Lake vs Hadoop
Daimler TSS Data Warehouse / DHBW 116
DWH and Data Lake
DWH on RDBMS
Slowly Changing Dimension
ELT vs ETL
3-Layer vs 2-Layer
Kimball Approach
Inmon Definition
Star Schema
Data Vault
Anchor Modeling
etc
Data Lake on Hadoop
Schema-on-Read
Agility
Parquet
Hive
HBase
SQL-on-Hadoop
Impala
Oozie
Zoekeeper
Methods,
Concepts,
Techniques
Tools,
Tools,
Tools
Daimler TSS Data Warehouse / DHBW 117
The (ab)use of the words „Data Lake“technical
• A physical implementation of data solutions in the Hadoop ecosystem.
• A reservoir of curated, and often unconnected, datasets for data science, data exploration and reference data management use.
• The new Operational Data Store, a pub flap to hide a proper data management architecture.
• A name to indicate the staging area or the persistent staging area of a Data Warehouse solution.
• A technical solution to process huge volumes of data, for which an RDBMS is often too expensive to deploy.
Daimler TSSSource: https://www.linkedin.com/pulse/abuse-words-data-lake-martijn-ten-napel/Daimler TSS Data Warehouse / DHBW 118
The (ab)use of the words „Data Lake“cynical reality
• A concept that will spray fairy dust upon data management issues that arise from organisational issues, politics or a lack of understanding what working with data requires of an organisation.
• A term to use when you are really out of your comfort zone as an architect and to shelter behind.
• A clever marketing and sales trick, used to sell you a “must have” solution next to what you already have. You are told that without a Data Lake you will be unable to do data science, digital transformation or business analytics.
Source: https://www.linkedin.com/pulse/abuse-words-data-lake-martijn-ten-napel/Daimler TSS Data Warehouse / DHBW 119
Data Lake (Martin fowler)
Source: https://martinfowler.com/bliki/DataLake.htmlDaimler TSS Data Warehouse / DHBW 120
Data Lake (Martin fowler)
Source: https://martinfowler.com/bliki/DataLake.htmlDaimler TSS Data Warehouse / DHBW 121
What is a Data Lake?
A Data Lake acquires data from multiple sources in an enterprise in its native form and may also have internal, modeled forms of this same data for various purposes. The information thus handled could be any type of information, ranging from structured or semi-structured data to completely unstructured data. A Data Lake is expected to be able to derive enterprise-relevant meanings and insights from this information using various analysis and machine learning algorithms.
Source: Pankaj Misra, Tomcy John: Data Lake for Enterprises Packt 2017
Daimler TSS Data Warehouse / DHBW 122
Data Lake life cycle
Source: Pankaj Misra, Tomcy John: Data Lake for Enterprises Packt 2017Daimler TSS Data Warehouse / DHBW 123
Data Lake (eckerson group)
Source: https://www.eckerson.com/articles/ten-characteristics-of-a-modern-data-architecture
Daimler TSS Data Warehouse / DHBW 124
New approach to build data systems
Immutable data
(Source of Truth /
Single version of
facts)
(Materialized) Views
on data
(Materialized) Views
on data
(Materialized) Views
on data(Materialized) Views
on data
(Materialized) Views
on data
Query = Application
HDFS / NoSQL
RDBMS
NewSQL(Materialized) Views
on data
(Materialized) Views
on data
(Materialized) Views
on data(Materialized) Views
on data
(Materialized) Views
on data
Query = Application
Daimler TSS Data Warehouse / DHBW 125
Lambda architecture
Source:
Batch Layer
Batch
Engine Serving Layer
Serving
Backend Queries
Raw history
data
Result
data
Real-Time Layer
Real-Time
Engine
Daimler TSS Data Warehouse / DHBW 126
Overview Lambda architecture
https://de.slideshare.net/gschmutz/big-data-and-fast-data-Lambda-architecture-in-actionDaimler TSS Data Warehouse / DHBW 127
Which tools could be used in the Lambda architecture? db requirements
Easier to implement for
database vendors
compared to random
access
Daimler TSS Data Warehouse / DHBW 128
Batch layer
Write sequential once
Bulk sequential read many
times
Speed layer
Random write
Random read
Servinglayer
Batch write
Random read
Which tools could be used in the Lambda architecture? db requirements
More challenging
More challenging More challenging
Daimler TSS Data Warehouse / DHBW 129
What are the three paradigms of programming?
• Request/Response
• Batch
• Stream processing
Source: https://de.slideshare.net/JayKreps1/distributed-stream-processing-with-apache-kafka-71737619Daimler TSS Data Warehouse / DHBW 130
Enterprise-wide architectureusing log-centric approach
Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/Daimler TSS Data Warehouse / DHBW 131
Enterprise-wide architectureusing log-centric approach and CDC
Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/Daimler TSS Data Warehouse / DHBW 132
Enterprise-wide architecture todaydata integration challenge
Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/Daimler TSS Data Warehouse / DHBW 133
Enterprise-wide architecture todaydata integration challenge
Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/Daimler TSS Data Warehouse / DHBW 134
Enterprise-wide architecture todaydata integration challenge
Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/Daimler TSS Data Warehouse / DHBW 135
Log centric architecture / Kappa architecture / data streaming platform
Source: Jay Kreps: I heart logs, O’Reilly 2014Daimler TSS Data Warehouse / DHBW 136
What is stream processing?
Most common understanding (doing the „T“ in ETL):
• Stream processing is the parallel processing of data in motion = computing on data directly as it is produced or received.
• Not necessarily transient, approximate, lossy (assumptions from Lambda architecture and other event processing systems)
Daimler TSS Data Warehouse / DHBW 137
Messaging systems @Linkedin
1st attempt
• Active MQ, RabbitMQ
• Problems:
• Not distributed
• Throughput
• Persistence
• Ordering
2nd attempt
• Kafka
• Key abstraction: Logs
• build from scratch
• Distributed system by design
• Partitioning with local ordering
• Elastic scaling
• Fault tolerance
Daimler TSS Data Warehouse / DHBW 138
Kafka: a modern distributed system for streams
• Scalability
• Hundreds of MB/sec/server throughput
• Many TB per node
• Guarantees of a database• All messages strictly ordered (within a partition)
• All data persistent
• Distributed by default• Replication
• Partitioning
• Producers + consumers all fault tolerant and horizontally scalable
Daimler TSS Data Warehouse / DHBW 139
ETL revisited with kafka connect and kafka streams
Source: https://de.slideshare.net/JayKreps1/distributed-stream-processing-with-apache-kafka-71737619Daimler TSS Data Warehouse / DHBW 140
What are stream processing frameworks?
• Apache Spark streaming
• Apache Storm (twitter, Nathan Marz)
• Apache Samza (linkedin)
• Apache Flink
• Apache Kafka Streams
• Simple library
• Reprocessing
• No microbatch = everything is a stream
• Local state
• Key operations: filter, aggregate, join
Daimler TSS Data Warehouse / DHBW 141
Kafka @Linkedin
Source: https://de.slideshare.net/JayKreps1/i-32858698Daimler TSS Data Warehouse / DHBW 142
Summary: Stream data platform using Kafkaunifying batch and stream processing
Source: https://de.slideshare.net/JayKreps1/i-32858698Daimler TSS Data Warehouse / DHBW 143
Kafka and microservices
• A message-oriented implementation requires an efficient messaging backbone that facilitates the exchange of data in a reliable and secure way with the lowest latency possible.
• Creating small, self-contained, data-driven applications that meld streaming data and microservices together is a good practice to break down large problems and projects into approachable chunks, reduce risk, and deliver value faster.
• Think of combinations of data-processing applications with microservices to deliver specific features and insights from a data stream.
Daimler TSS Data Warehouse / DHBW 144
Kappa architecture
Source:
Real-Time Layer
Real-Time
Engine
Serving Layer
Serving
BackendData Queries
Raw history
data
Result
data
Daimler TSS Data Warehouse / DHBW 145
PS-3C architecture
Data Library
Storage
Engine
3C Layer
Preparation
EngineData Queries
Serving Layer
Delivery
Engine
Raw history
data
Integrated
subjects
Result
data
Daimler TSS Data Warehouse / DHBW 146
PS-3C architecture
• Architecture by Rogier Werschkull
• Store incoming data in Data Library Layer (Persistent staging = PS)
• Prepare data in a 3C layer for “Concept – Context – Connector”-model
• Concept + Connector can be virtualized on data in Data Library Layer
Daimler TSS Data Warehouse / DHBW 147
Polyglot warehouse
• Architecture by Joe Caserta
• Big Data Warehouse may live in one or more platforms on premises or in the cloud
• Hadoop only
• Hadoop + MPP or RDBMS
• Additionally NoSQL or Search
Daimler TSS Data Warehouse / DHBW 148
Polyglot warehouse
Source: https://www.slideshare.net/CasertaConcepts/Hadoop-and-your-data-warehouse
Daimler TSS Data Warehouse / DHBW 149
The extended Data Warehouse architecture (xdw)The Enterprise Analytics Architecture
• Architecture by Claudia Imhoff
• combine the stability and reliability of the BI architectures while embracing new and innovative technologies and techniques
• 3 components that extend the EDW environment
• Investigative computing platform
• Data refinery
• Real-time (RT) analysis platform
Data Warehouse /
DHBWDaimler TSS150Daimler TSS Data Warehouse / DHBW 150
The extended Data Warehouse architecture (xdw)The Enterprise Analytics Architecture
Source: https://upside.tdwi.org/articles/2016/03/15/extending-traditional-data-warehouse.aspx
Daimler TSS Data Warehouse / DHBW 151
Gartner Data Lake architecture styles
• Inflow Lake: accommodates a collection of data ingested from many different sources that are disconnected outside the lake but can be used together by being colocated within a single place
• Outflow Lake: a landing area for freshly arrived data available for immediate access or via streaming. It employs schema-on-read for the downstream data interpretation and refinement.
• Data Science Lab: most suitable for data discovery and for developing new advanced analytics models
Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/
Daimler TSS Data Warehouse / DHBW 152
Gartner Data Lake architecture styles
Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/
Daimler TSS Data Warehouse / DHBW 153
Gartner – the logical DWH
https://blogs.gartner.com/henry-cook/2018/01/28/the-logical-data-warehouse-and-its-jobs-to-be-done/
https://blogs.gartner.com/henry-cook/2018/01/28/the-logical-data-warehouse-and-its-jobs-to-be-done/
Daimler TSS Data Warehouse / DHBW 154
Summary
Landing Area
Storage
Engine
Data Lake
Integration
EngineData Queries
Data Presentation
Delivery
Engine
Raw history
dataLightly
integrated data
Result
data
Daimler TSS Data Warehouse / DHBW 155
Data core – Dave Wells
• Architecture by Eckerson Group
• DWHs exist together with MDM, ODS, and portions of the Data Lake as a collection of data that is curated, profiled, and trusted for enterprise reporting and analysis
https://www.eckerson.com/articles/the-future-of-the-data-warehouse
Daimler TSS Data Warehouse / DHBW 156
Data core – Dave Wells
https://www.eckerson.com/articles/the-future-of-the-data-warehouse
Daimler TSS Data Warehouse / DHBW 157
Data lifecycle – Dave Wells
https://www.eckerson.com/articles/the-future-of-the-data-warehouse
Daimler TSS Data Warehouse / DHBW 158
DWH and Data Lake – Dave WellsIn parallel vs inside
https://www.eckerson.com/articles/the-future-of-the-data-warehouse
Daimler TSS Data Warehouse / DHBW 159
Dimensional modeling in the age of Big Data
• Dimensional modeling is not dead.
• The benefits are still valid in the age of Big Data, Hadoop, Spark, etc:
• Eliminate joins
• Data model is understandable for end users
• Well-suited for columnar storage + processing (e.g. SIMD)
• Nesting technique
• E.g. tables with lower granularity can be nested into large fact table
• Usage in SQL: Flatten(kvgen(<json>))
Daimler TSS Data Warehouse / DHBW 160
Self-service data
Functional
area
Why important Self-service approach
Data
acceleration
With shadow analytics, users create
redundant data copies.
The system must be capable of autonomously
identifying the best optimizations and adapting to
emerging query patterns over time.
Data
catalog
Data consumers struggle to find data that
is important to their work. Users keep
private notes about data sources and data
quality, meaning there is no governance.
In the self-service approach, the catalog is
automatic—as new data sources are brought online.
Data
virtualizatio
n
It is virtually impossible for an organization
to centralize all data in a single system.
Data consumers need to be able to access all data
sets equally well, regardless of the underlying
technology or location of the system.
https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics
Daimler TSS Data Warehouse / DHBW 161
Self-service data
Functional
area
Why important Self-service approach
Data
curation
There is no single “shape” of data that
works for everyone.
Data consumers need the ability to interact with
data sets from the context of the data itself, not
exclusively from simple metadata that fails to tell the
whole story. Data consumers should be capable of
reshaping data to their own needs without writing
any code or learning new languages.
Data lineage As data is accessed by data consumers
and in different processes, it is important
to track the provenance of the data, who
accessed the data, how the data was
accessed, what tools were used, and what
results were obtained.
As users reshape and share data sets with one
another through a virtual context, a self-service data
platform can seamlessly track these actions and all
states of data along the way, providing full audit
capabilities as well.
https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics
Daimler TSS Data Warehouse / DHBW 162
Self-service data
Functional
area
Why important Self-service approach
Open
source
Because data is essential to every area of
every business, the underlying data
formats and technologies used to access
and process the data should be open
source
Self-service data platforms build on open source
standards like Apache Parquet, Apache Arrow, and
Apache Calcite to store, query, and analyze data
from any source.
Security
controls
Organizations safeguard their data assets
with security controls that govern
authentication (you are who you say you
are), authorization (you can perform
specific actions), auditing (a record of the
actions you take), and encryption (you can
only read the data if you have the right
key).
Self-service data platforms integrate with existing
security controls of the organization, such as LDAP
and Kerberos.
https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics
Daimler TSS Data Warehouse / DHBW 163
Is the DWH dead?
Hadoop (and for similar reasons Spark) has its strengths but no as a DWH replacement, e.g.
• Fast query reads only possible in HBase with an inflexible (use case specific) data model
• No sophisticated query optimizer
• Hadoop is very complex with many tools/versions/vendors and no standard
• Security is still at the beginning
Daimler TSS Data Warehouse / DHBW 164
Digitization is the process of converting information into a digital (i.e. computer-readable) format
Digitization – the digital data explosion
Source: https://slideplayer.com/slide/10254426/
Daimler TSS Data Warehouse / DHBW 165
Future mobility
https://blog.daimler.com/en/2018/07/19/future-mobility-metropolitan-cities-congress/Daimler TSS Data Warehouse / DHBW 166
Display video
Is china the next silicon valley?
Daimler TSS Data Warehouse / DHBW 167
Some statisticsbat – baidu + alibaba + tencent
Sources:
https://venitism.wordpress.com/2017/12/15/beware-of-the-bats-baidu-alibaba-and-tencent/
https://www.afr.com/brand/business-summit/baidu-alibaba-tencent-to-disrupt-facebook-amazon-netflix-google-in-asia-20180228-h0wrdlDaimler TSS Data Warehouse / DHBW 168
Volume
Velocity
Variety
Veracity
Value
Digitization – challenges & opportunities
Ethics
AI
Impact
IoT, Industry 4.0, …
Data integration
Real-time
Data quality
Data-driven
Digital Services
Daimler TSS Data Warehouse / DHBW 169
Digitization public authorities
Daimler TSS Data Warehouse / DHBW 170