Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege ›...

40
Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass Camp 2018 Stefan Kirner 8.2.2018

Transcript of Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege ›...

Page 1: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

Data flows in Azure Data Factory –

endlich auch hier Transformationen!

Pass Camp 2018

Stefan Kirner 8.2.2018

Page 2: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

2

Stefan KirnerTeamlead Business Intelligence Solutions @inovex GmbH

› Mehr als 15 Jahre Erfahrung mit dem Microsoft Business Intelligence Toolset

› MCSE for Data Management & Analytics & Cloud Infrastructure

› Microsoft P-TSP Data Platform

› RGV SQL PASS e.V. Community Karlsruhe

› Speaker at conferences and user groups about BI- und cloud-themes

› Mail: [email protected]: @KirnerKa

Page 3: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

1. Intro Data Factory v2

2. Control Flow & Triggers

3. Data Flow

4. Roadmap & Q+A

Agenda

Page 4: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass
Page 5: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

New Pipeline ModelRich pipeline orchestration

Triggers – on-demand, schedule, event

Data Movement as a ServiceCloud, Hybrid

30 connectors provided

Data flow as NEW Data Transformation Layer

SSIS Package Execution In a managed cloud environment

Use familiar tools, SSMS & SSDT

Author & MonitorProgrammability (Python, .NET, Powershell, etc)

Visual Tools for Control Flow and NEW: Data Flow

Page 6: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

6

Data Factory EssentialsArtefacts in Data Factory

Data flow

NEW Activity!

Page 7: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

SSIS Package

Pipeline

New capabilities for data integration in the cloud, Mike Flasko at Ignite 2017,https://myignite.microsoft.com

Page 8: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

ADFv2 Pipelines

Activity 1 Data Flow

Activity 3

“On Error” Activity 1

params

params

My Pipeline 1

My Pipeline 2

For Each…

Activity 4

params

Trigger

EventWall ClockOn Demand

params

New capabilities for data integration in the cloud, Mike Flasko at Ignite 2017,https://myignite.microsoft.com

Page 9: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

Scalableper job elasticity (cloud data movement units)

Up to 1 GB/s

Set parallelism of threads used in source and target

SimpleVisually author or via code (Python, .Net, etc)

Serverless, no infrastructure to manage

Staged copy (compress/decompress in hybrid scenarios, SQL DW load using polybase, bypass firewall restrictions)

Access all your data 30+ connectors provided and growing (cloud, on premises, SaaS)

Data Movement as a Service: 17 points of presence world wide

Self-hostable Integration Runtime for hybrid movement

Data Movementaka

“Copy Activity”

New capabilities for data integration in the cloud, Mike Flasko at Ignite 2017,https://myignite.microsoft.com

Page 10: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass
Page 11: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

11

ActivitiesKnown from v1 - Data Transformation ActivitiesData transformation activity Compute environment

Hive HDInsight [Hadoop]

Pig HDInsight [Hadoop]

MapReduce HDInsight [Hadoop]

Hadoop Streaming HDInsight [Hadoop]

Spark HDInsight [Hadoop]

Machine Learning activities: Batch Execution and Update Resource

Azure VM

Stored Procedure Azure SQL, Azure SQL Data Warehouse, or SQL Server

U-SQL Azure Data Lake Analytics

Page 12: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

12https://docs.microsoft.com/en-us/azure/data-factory/control-flow-if-condition-activity and others in How-To > Control Flow

ActivitiesNew! Control Flow Activities…nicht vollständigControl activity Description

Execute Pipeline Activity

allows a Data Factory pipeline to invoke another pipeline.

ForEachActivity used to iterate over a collection and executes specified activities in a loop.

WebActivity call a custom REST endpoint and pass datasets and linked services

Lookup Activity look up a record/ table name/ value from any external source to be referenced by succeeding activities. Could be used for incremental loads!

Get MetadataActivity

retrieve metadata of any data in Azure Data Factory e.g. did another pipeline finish

Do Until Activity similar to Do-Until looping structure in programming languages.

If ConditionActivity

do something based on condition that evaluates to true or false.

Execute SSIS Execute SSIS Package

Page 13: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

13https://www.purplefrogsystems.com/paul/2017/09/whats-new-in-azure-data-factory-version-2-adfv2/

ActivitiesConcepts

Page 14: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

exec

exec

Pipeline check metadata

14

Demo Control FlowGet all data of a system by metadata

Activity: For eachtable in source

Activity: CopyData to Data Lake

/ Blob Store

exec

Page 15: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

1. on-demand

2. Wall-clock Schedule

3. Tumbling Window (aka time-slices in v1)

4. Event on Blob Store

15

TriggersHow do pipelines get started

Page 16: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass
Page 17: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

Visual Data Flows

17

Page 18: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

1. Does not require understanding of Spark, Big Data Execution Engines, Clusters, Scala, Python …

2. Focus on building business logic and data transformation› Data cleansing

› Aggregation

› Data conversions

› Data prep

› Data exploration

Code-free Data Transformation At Scale

Page 19: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

ADF Data Flow Overview

Sort, Merge, Join, Lookup …

DestinationStagingData Sources

Transformations

Page 20: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

ADF Data Flow Workstream

Data Sources Staging Transformations Destination

Sort, Merge, Join, Lookup …

• Explicit user action• User places data

source(s) on design surface, from toolbox

• Select explicit sources

• Implicit/Explicit• Data Lake staging area as

default• User does not need to

configure this manually• Advanced feature to set

staging area options• File Formats / Types

(Parquet, JSON, txt, CSV …)

• Explicit user action• User places

transformations on design surface, from toolbox

• User must set properties for transformation steps and step connectors

• Explicit user action• User chooses destination

connector(s)• User sets connector

property options

Page 21: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

Data flow started - what happens at databricks?

21

Page 22: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

Modern DWH Layers

Business / custom apps

(Structured)

Logs, files and media

(unstructured)

Azure Data

Lake Store

Structured data / master data

SQL Database

(Data-Warehouse)

Data factory

Data factory

Azure Databricks

(Spark)

Power BI

(Analytical dashboards)

Store ResultsProcessingStoreIngest Presentation

Analysis Services

(Sem. Modell)

(optional, depends on data)

Data Model

Data Science

(Workloads and use

cases)

Page 23: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

1. Visual “Data Flow Builder” / “Data Mapping”

2. Extensible through scripting and expressions

3. Data Flow can be embedded into ISV / SaaS apps

› Embed UI

› Embed Parameterize Data Flows

4. A graphical UI for building data transformation routines on Spark

5. Built for resiliency and operationalized environments

Visual Data Flow Key Tenets

Page 24: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

25https://www.youtube.com/watch?v=vSTn_aGq3C8 Video zum Feature von Mark Kromer

Handing Schema Changes

Data flow will accept both columns (here in derived columns)

Use Auto Mappingin Sink

Page 25: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

26

Overview current Transformations 1/2

Transformation Description

Source

Source transform lets you configure the sources we want to bring into the data flow. A data flow can have one or more sources. It should be able to connect to any type of source supported by ADF.Note: Private preview only supports blob

SelectSelect transform allows to select list of columns from the input stream which you can pass to other transformation. It allows providing alias names to columns.

Derived ColumnDerived column allows you to create new columns or modify existing column. Its support a wide range of data manipulation functions.

Filter Filter transform allows you to restrict the rows based on filter expression.

Conditional SplitConditional split allows you to split the input stream into n number of output stream based on expressions conditions. Rows not matching the condition will be routed to default output.

SortSort allows you to sort the based-on order rules. The output rows will follow the same order in the subsequent transformation. It has other options like toggle “case sensitive”, computed columns etc.

AggregateAggregate transform allows you to define aggregation functions by group by columns. You can also build custom expressions in this transform.

Page 26: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

27

Overview current Transformations 2/2

Transformation Description

ExistsExist transform allows you to filter rows in one input stream based on another source. You have an option of applying either “Exists” or “Not Exists” condition.

LookupLookup transform allows you to lookup values from another stream based on the lookup condition. Its works like inner join.

JoinerJoiner transform allows you to join two streams based on a condition. You have an option to perform inner, left, right, outer and cross joins.

Union Union transform allows you merge two streams into a single stream.

New BranchNew branch transform allows you to replicate current stream. You can create a new stream or copy of a stream.

OutputOutput sink will output the data into all kinds of storage supported by ADF.Note: private preview only supports blob and SQL DW

Page 27: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

• Azure SLAs are NA for preview services (private or public preview) until GA of the service.

• Limited Preview Support

• Handled directly with the Azure Engineering team via [email protected]. Turn-around time on fixing issues during private preview will depend upon access to customer data sources and customer Databricks clusters for RCA and debugging.

• Public Preview Support

• Normal Azure customer service channels

Data Flow Limited Preview Support & SLAs

Page 28: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

Demo: Data FlowVisual Design of Transformations in Spark

Page 29: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

30

Conclusion: Data Flow in ADFIs cool because…

• visual design for fast learning & understanding

• ETL using spark technology in the background whichcould scale (but does not have to in any case)

• Azure Databricks as elastic processing engine fordifferent workloads, tools and user groups

• Integration in Control Flow enables modellingdependencies and cost-efficient orchestration of Azureresources

Page 30: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass
Page 31: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

32

Klar, bin VeteranData, what?

HOL 1 Copy Task HOL 3 Control Flow

HOL 2 Trigger

HOL 4 Data Flow

Yes!

Schon mal was gemacht mit der Data Factory?

Subscription im Preview Prog-ramm für DF?

Start: Einrichten Azure Resourcen

Page 32: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

Step 1: Azure Basics – set up resources on Azure

33

Page 33: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

Pipeline

34https://docs.microsoft.com/en-us/azure/data-factory/control-flow-if-condition-activity

HOL 1/2: Data Movement & TriggerCopy Activity

Activity: Copy datafrom input file to

SQL table

On- demandTrigger

run

Linked service: Blob Store

Linked service:Azure SQL DB

Dataset: Container + Flat file

Dataset: Table

Page 34: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

exec

exec

Pipeline check metadata

35

HOL 3: Control FlowGet all data of a system by metadata

Activity: For eachtable in source

Activity: CopyData to Data Lake

/ Blob Store

exec

Page 35: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

HOL 4: Data FlowVisual Design of Transformations in Spark

Page 36: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

Attention: Use Databricks 5.0 includes A. Spark 2.4.0, Scala 2.11

37

Page 37: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass
Page 38: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

1. Microsoft documentation: https://docs.microsoft.com/en-us/azure/data-factory/2. Azure Data Factory – data flows preview documentationhttps://github.com/kromerm/adfdataflowdocs3. Cool screencasts about data flowshttps://github.com/kromerm/adfdataflowdocs/tree/master/videos4. Another good blogpost about ADF Data Flowshttps://visualbi.com/blogs/microsoft/azure/azure-data-factory-data-flow-activity/5. Comparison ADF Data Flows vs. SSIS vs. T-SQL https://sqlplayer.net/2018/12/azure-data-factory-v2-and-its-available-components-in-data-flows/

39

Links and further informationsNoch mal überarbeiten mit aktuellen Links

Page 39: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

40

inovex ist ein IT-Projekthaus mit dem Schwerpunkt „Digitale Transformation“:

Product Ownership · DatenprodukteWeb · Apps · Smart Devices · BI Big Data · Data Science · SearchReplatforming · Cloud · DevOpsData Center Automation & HostingTrainings · Coachings

Wir nutzen Technologien, um unsere Kunden glücklich zu machen.Und uns selbst.

inovex gibt es in Karlsruhe · Pforzheim · Stuttgart · München · Köln · Hamburg

Und natürlich unter www.inovex.de

Page 40: Data flows in Azure Data Factory endlich auch hier ... › fileadmin › files › Vortraege › ... · Data flows in Azure Data Factory – endlich auch hier Transformationen! Pass

Vielen Dank

Stefan Kirner

Head of BI Solutions

inovex GmbH

Ludwig-Erhard-Allee 6

76131 Karlsruhe

[email protected]

0173 3181 012