What were the learnings and challenges in providing a modern, Azure-based data analytics platform as a service for a large corporation and integrating it into the enterprise? The project touched on many interesting aspects: Azure BI services such as HDInsight, integration into an enterprise in an "as a Service" model, managing the costs and chargeback of the services, and much more. This session offers insights from one of our projects that will help you in your next project.
Azure Days 2019: How Do You Bring a Data Analytics Platform into the Cloud? (Florian van Keulen)
1. BASEL | BERN | BRUGG | BUKAREST | DÜSSELDORF | FRANKFURT A.M. | FREIBURG I.BR. | GENF
HAMBURG | KOPENHAGEN | LAUSANNE | MANNHEIM | MÜNCHEN | STUTTGART | WIEN | ZÜRICH
Blog.Trivadis.com | @Trivadis
Provisioning of Data Platforms
How do you bring a data analytics platform into the cloud?
Florian van Keulen
2. Florian van Keulen
● Function at Trivadis:
Head of Product Design – Cloud & Security
Cloud Solution Architect
● CV:
Studied "Security in Distributed Systems"
Fought malware worldwide
IT Security Officer & Cloud Architect.
Identify opportunities in the cloud and use them securely!
● Hobbies:
Diving, BBQ, woodwork…
7. Project Details…
§ Centralized, strategic analytics platform
§ Sensor and measurement data from many sources
§ Also for ad hoc analytics
§ Automated data models
§ e.g. for predictive maintenance
§ Complex file / data structures
§ Strict and complex access security
And everything "as a Service"…
NDA
18. Accessing Data
• M6 – Accessing the Data Lake
M6.1 - Accessing Data through SQL
• M7 – Pushing Data to External Systems
M7.1 - Exporting Data into a Relational Database
19. Example of an End-to-End Process
[Figure: process flow in four stages; only the labels survive]
§ Data Ingestion: FTP Server, collect data every 10 mins, save data to raw storage
§ Data Processing: format translation, enrich with metadata, save to refined data storage
§ Data Analytics: merge data & perform analytics, save data to usage-optimized data storage
§ Data Access: map to table & access control, access through SQL, visualize & reporting
20. Example of an End-to-End Process
[Figure: the same process flow, mapped to platform modules; only the labels survive]
§ Batch Data Ingestion (M1.1): FTP Server, collect data every 10 mins, save data to raw storage
§ Big Data Batch Processing (M3.1): format translation, enrich with master data, save to refined data storage
§ Analytics & Machine Learning (5.2): merge data & perform analytics, save data to usage-optimized data storage
§ Accessing the Data Lake (M6.1): map to table & access control, access through SQL, visualize & reporting
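The four stages of this example flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the CSV record layout, the `master_data` mapping, the `site_avg` table, and all function names are hypothetical, and an in-memory SQLite database stands in for the usage-optimized storage and its SQL access.

```python
import sqlite3


def ingest(ftp_lines):
    """Batch ingestion: keep the collected lines unchanged (raw storage)."""
    return list(ftp_lines)


def process(raw_zone, master_data):
    """Batch processing: translate the CSV format and enrich with master data."""
    refined = []
    for line in raw_zone:
        sensor_id, value = line.split(",")
        refined.append({
            "sensor": sensor_id,
            "site": master_data.get(sensor_id, "unknown"),  # enrichment step
            "value": float(value),
        })
    return refined


def analyze(refined_zone, conn):
    """Analytics: merge readings per site and save them to a SQL table."""
    per_site = {}
    for row in refined_zone:
        per_site.setdefault(row["site"], []).append(row["value"])
    conn.execute("CREATE TABLE site_avg (site TEXT PRIMARY KEY, avg_value REAL)")
    conn.executemany(
        "INSERT INTO site_avg VALUES (?, ?)",
        [(site, sum(vals) / len(vals)) for site, vals in per_site.items()],
    )


def run_pipeline(ftp_lines, master_data):
    """Run all stages and read the result back through SQL (data access)."""
    raw = ingest(ftp_lines)
    refined = process(raw, master_data)
    conn = sqlite3.connect(":memory:")
    analyze(refined, conn)
    return conn.execute(
        "SELECT site, avg_value FROM site_avg ORDER BY site"
    ).fetchall()
```

Keeping each stage as a separate function mirrors the separation of zones on the slide: every stage reads from one zone and writes to the next, so a stage can be rerun without touching the others.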
21. Architecture after Phase I
[Figure: platform architecture diagram; only the labels survive]
§ Data sources: DB extract, file, DB CDC, file CDC, weather, mobile apps, connected car, robot, wind park, air traffic, social media, smart city, sensor, market feed
§ Integration: bulk data flow (bulk import, disk service), event stream (event hub, event handler), service bus, scheduler, API / service
§ Big Data Storage zones: Landing, Raw, Trusted/Refined, Usage-Optimized, Sandbox, Archived
§ Processing and analytics: Big Data Processing (transform, cleansing / validating, enrichment, aggregation), Real-Time Big Data Processing (stream analytics), Big Data Analytics (machine learning, image/video recognition, time series analysis, graph/link analytics, location analytics), Big Data Federation (query engine), Data Science Lab
§ Information Governance & Security: data catalog, event catalog, data lineage, master data, access management, encryption & protection
§ Analytical Platform Automation: metadata, generator, template (deployment)
§ Information consumers: batch data visualization, streaming data visualization, self-service analytics, EDWH, RDBMS, multi-dimensional, enterprise apps, containerized apps / microservices, app marketplace
§ Azure services and tools shown: Azure Storage Blob, HDInsight Spark, HDInsight Kafka, HDInsight Interactive Query, HDInsight (Ranger), Azure Databricks, Azure Data Catalog, Trivadis biGENiUS, Azure Functions, StreamSets Data Collector, Azure Cosmos DB, Azure SQL Database, Master Data Services (MDS), Excel with MDS Plugin, Azure Logic Apps, Azure Scheduler, Azure Stream Analytics, Spark Streaming, Azure Event Hub, Azure Event Grid, Azure Time Series Insights, Azure Kubernetes Service (AKS), Azure Data Box, Azure Import, Azure Storage Explorer, Power BI, Tableau / SAP BO, MATLAB, Control-M, FTP Server
24. Challenge: HDInsight
[Figure: architecture diagram repeated from slide 21]
25. Challenge: HDInsight - Authentication
Required setup for Azure HDInsight
[Figure: authentication flow; only the labels survive]
§ Customer's OnPrem: authentication
§ Customer's Azure: Customer Azure AD, sync/federate with OnPrem
§ Analytics Platform Azure: Azure AD, federated with the Customer Azure AD via Azure AD B2B
§ HDInsight: gateway, head node, worker node(s), Ranger, Zeppelin, web services, SQL interface, other; BigData Storage
§ Authentication/authorization (SAML, OAuth)
26. Challenge: HDInsight - Authentication
Deployment recommended by Microsoft
[Figure: authentication flow; only the labels survive]
§ Customer's OnPrem: authentication, sync incl. passwords
§ Customer's Azure: Customer Azure AD
§ Analytics Platform Azure: Azure AD, Azure Active Directory Domain Services
§ HDInsight: gateway, head node, worker node(s), Ranger, Zeppelin, web services, SQL interface, other; BigData Storage
§ Domain join & authentication against Azure Active Directory Domain Services; auth via Kerberos/LDAP
27. Challenge: HDInsight - Authentication
Possible workaround 1
[Figure: authentication flow; only the labels survive]
§ Customer's OnPrem: authentication, sync incl. passwords
§ Customer's Azure: Customer Azure AD
§ Analytics Platform Azure: Azure AD, Azure Active Directory Domain Services
§ HDInsight: gateway, head node, worker node(s), Ranger, Zeppelin, web services, SQL interface, other; BigData Storage
§ Domain join; auth via Kerberos/LDAP
§ Sync marked with an X: synchronizing the same identity into 2 Azure ADs is not possible
28. Challenge: HDInsight - Authentication
Possible workaround 2
[Figure: authentication flow; only the labels survive]
§ Ørsted OnPrem: authentication
§ Ørsted Azure: Ørsted Azure AD, sync/federate with OnPrem
§ SMM Platform Azure: Azure AD, federated with the Ørsted Azure AD via Azure AD B2B
§ HDInsight: Apache KNOX Gateway, head node, worker node(s), Ranger, Zeppelin, web services, SQL interface, other; BigData Storage
§ Authentication/authorization (SAML, OAuth)
§ Replaces a standard HDInsight component
29. Challenge: HDInsight
Possible workaround 3
[Figure: authentication flow; only the labels survive]
§ Customer's OnPrem: customer's Identity Manager, script execution
§ Customer's Azure: Customer Azure AD
§ Analytics Platform Azure: Azure AD, Azure Active Directory Domain Services
§ HDInsight: gateway, head node, worker node(s), Ranger, Zeppelin, web services, SQL interface, other; BigData Storage
§ Domain join; auth via Kerberos/LDAP
§ Provision/deprovision identities; provide initial credentials
§ Azure AD managing identities
§ All Azure IAM features available
§ Self-service IAM additionally possible (e.g. password reset)
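The provision/deprovision step of workaround 3 amounts to reconciling two user lists. The following is a minimal sketch of that idea, assuming the identity manager can export a list of user names; the function names and the data shapes are hypothetical, and no real Azure AD API is called here.

```python
import secrets


def plan_identity_changes(source_users, platform_users):
    """Compare the identity manager's users with the platform directory.

    Returns which identities to create and which to remove so that the
    platform directory matches the source of truth.
    """
    source = set(source_users)
    platform = set(platform_users)
    return {
        "provision": sorted(source - platform),
        "deprovision": sorted(platform - source),
    }


def initial_credentials(users):
    """Generate a random initial password per newly provisioned user."""
    return {user: secrets.token_urlsafe(16) for user in users}
```

A script run by the identity manager could call `plan_identity_changes` on each execution and hand the generated initial credentials to the users, who would then change them through self-service password reset.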
30. Challenge: HDInsight - Costs
HDInsight Spark
Head Node Worker
Min. 2 Nodes
HDInsight Kafka
Head Node Worker
Min. 2 Nodes
HDInsight Interactive Q
Head Node Worker
Min. 2 Nodes
3 clusters, each with at least 3 VMs
Billed per hour
Clusters cannot be stopped, only deprovisioned
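A back-of-the-envelope calculation shows why hourly billing without a stop option matters. The sketch below uses the slide's figures (3 clusters, at least 3 VMs each); the price of 0.50 per VM hour and the working-hours schedule are purely hypothetical placeholders, not real Azure prices.

```python
def monthly_cluster_cost(clusters, vms_per_cluster, price_per_vm_hour, hours_per_month):
    """Total monthly cost when every VM is billed for every provisioned hour."""
    return clusters * vms_per_cluster * price_per_vm_hour * hours_per_month


# Hypothetical price per VM hour; replace with the actual rate of the VM size.
PRICE_PER_VM_HOUR = 0.50

# Always provisioned: 24 hours on 30 days.
always_on = monthly_cluster_cost(3, 3, PRICE_PER_VM_HOUR, 24 * 30)

# Deprovisioned outside working hours: 10 hours on 22 working days.
working_hours_only = monthly_cluster_cost(3, 3, PRICE_PER_VM_HOUR, 10 * 22)

savings = always_on - working_hours_only
```

With these assumed numbers the always-on setup costs 3240.0 per month and the working-hours setup 990.0, which is the motivation for automating deprovisioning and provisioning.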
36. Takeaways
§ HDInsight is powerful, but not really cloud-aware…
§ Identity management and access management for HDInsight are rather traditional
§ Do not underestimate the cost of HDInsight…
§ Where possible, deprovision and provision automatically
§ Or use Databricks & Data Factory
§ Tagging is ideal for cost allocation
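Cost allocation by tag can be illustrated with a short sketch. It assumes that cost records have already been exported as a list of dictionaries with a `tags` mapping and a `cost` value; this record shape, the `cost-center` tag key, and the function name are hypothetical and not tied to any specific Azure export format.

```python
from collections import defaultdict


def allocate_costs_by_tag(resources, tag_key="cost-center"):
    """Sum resource costs per tag value; untagged resources are grouped separately."""
    totals = defaultdict(float)
    for resource in resources:
        owner = resource.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += resource["cost"]
    return dict(totals)


# Example records (made-up values for illustration only).
example_resources = [
    {"name": "spark-cluster", "tags": {"cost-center": "team-a"}, "cost": 100.0},
    {"name": "kafka-cluster", "tags": {"cost-center": "team-b"}, "cost": 50.0},
    {"name": "blob-storage", "tags": {}, "cost": 5.0},
    {"name": "query-cluster", "tags": {"cost-center": "team-a"}, "cost": 25.0},
]
example_allocation = allocate_costs_by_tag(example_resources)
```

Reporting the "untagged" bucket explicitly makes missing tags visible, so they can be fixed before costs are charged back.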