Architecting a Datalake
Laurent Léturgez – Sep 2019
Big Data Meetup - Lille
Whoami
• Database and BigData Architect (Hadoop, Data Science and other cool topics)
• Former Developer and Consultant
• Owner@Premiseo: Data Management on Premises and in the Cloud
• Blogger since 2004
• http://laurent-leturgez.com
• Twitter : @lleturgez
What’s on the menu ?
• What is a Datalake ?
• Keys to architect a Datalake
• Design, Security
• Data movement, Data Processing
• Discovery
• Solutions available
• Example
• Datalake Implementation driven by IoT
What is a Datalake ?
• Repository of data stored in natural format
• Single Store of Enterprise data
• Raw Data
• Transformed Data : Reports, DataViz, Results (AI, ML …)
• Data Structure:
• Structured Data : Row, Columns, Relational Data
• Semi Structured Data: CSV, XML, JSON, log files
• Unstructured Data: Mails, Documents, Binaries (Images, Videos)
What is a Datalake ?
• Features
• Data are usually integrated unprocessed
• Processed data can be kept in the Datalake
• Data are kept … ready to be transformed
• Data are saved as long as possible
• A Datalake is
• Organized
• Managed
What is a Datalake ?
• A Datalake is not a data warehouse
Source: martinfowler.com
Keys to architect a Datalake
• A well-thought-out design
• Vital for
• Success
• Discovery efficiency
• ETL development effort
• Coupled with Security and business process
Keys to architect a Datalake
• A well-thought-out design … example
• Operational Areas
• Raw Area
• Data landing zone in native raw format
• Data are kept indefinitely in this area
• Data Tagging
• Folder Structure organized by Source, Dataset, Date etc.
• Staging Area
• Data Preparation Area : Decompression, cleansing, aggregation
• Data Quality Management is usually done here
• Hub Area
• Trusted layer of data
• Data is ready for analytics, organized functionally
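The folder structure described above (by area, source, dataset and date) could be sketched as follows. This is a minimal illustration only: the zone names, the `landing_path` helper and the `erp`/`parts` values are assumptions made for the example, not part of the talk.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical zone names, mirroring the operational areas on the slide.
ZONES = ("raw", "staging", "hub")


def landing_path(zone: str, source: str, dataset: str, day: date, filename: str) -> str:
    """Build a path organized by zone, source, dataset and ingestion date."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return str(
        PurePosixPath(zone, source, dataset,
                      f"{day:%Y}", f"{day:%m}", f"{day:%d}", filename)
    )


# Example: a file landing in the raw area on 12 Sep 2019.
print(landing_path("raw", "erp", "parts", date(2019, 9, 12), "parts.csv"))
# prints raw/erp/parts/2019/09/12/parts.csv
```

Keeping the date as separate year/month/day folders is one possible choice; it makes it easy to list or purge a whole day or month without parsing file names.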
Keys to architect a Datalake
• A well-thought-out design … example
• (Extra) Supporting Areas
• Master Data Area
• Customer, Products, Financial Data
• Used by Analytics
• Exploratory Area
• Playground for Data Scientists and Analysts
• Temporary Area
• Testing Data decompression
• Single point of data storage before moving data across the network
Keys to architect a Datalake
• Security
• Data Access Control
• By User
• By Application
• ETL software
• Analytics
• …
• By Operational zone
• By Source
Key Point: IAM Integration
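As a rough sketch of access control by user or application, by operational zone and by source: in practice the grants would come from the IAM system, and the `Grant` class, the `is_allowed` function and the principal names below are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    principal: str           # user or application identity (assumed to come from IAM)
    zone: str                # operational zone, "*" means any zone
    source: str              # data source, "*" means any source
    can_write: bool = False  # read-only unless stated otherwise


def is_allowed(grants, principal, zone, source, write=False):
    """Return True if any grant covers this principal, zone and source."""
    for g in grants:
        if g.principal != principal:
            continue
        if g.zone not in ("*", zone) or g.source not in ("*", source):
            continue
        if write and not g.can_write:
            continue
        return True
    return False


grants = [
    Grant("etl-tool", "raw", "*", can_write=True),  # ETL writes to the raw area
    Grant("analyst", "hub", "erp"),                 # analyst reads ERP data in the hub
]
print(is_allowed(grants, "analyst", "hub", "erp"))              # True
print(is_allowed(grants, "analyst", "raw", "erp"))              # False
print(is_allowed(grants, "etl-tool", "raw", "iot", write=True)) # True
```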
Keys to architect a Datalake
• Security
• Data Security
• Data Lake Management (Role Control)
• Data Resilience
• Disaster recovery
• Backup / Restore
• SLA: Availability, RTO, RPO
• Data Encryption
• At rest
• In transit
Keys to architect a Datalake
• Data Movement, Data Processing
• Consider the Data Lake as central point for
• Data Ingestion
• Data Processing
Keys to architect a Datalake
• Data Movement, Data Processing
• Consider the Data Lake as central point
• Data Ingestion
• Tools / ETL
• Metadata strategy should be in place (Data Catalog for tagging)
• Data Format
• Naming convention for files/directories: ingestion date, format, source etc.
• Batch or real time
• Many small files or few big files
• Data Partitioning → Maximum query and processing performance
• Cloud or OnPrem ?
• Network issues, hybrid Cloud considerations
• Data Processing
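On the "many small files or few big files" question, one common approach is to compact small files into larger ones after ingestion. The sketch below only plans the batches, using a simple greedy grouping; the `plan_compaction` name, the file names and the target size are assumptions for illustration.

```python
def plan_compaction(file_sizes, target_bytes):
    """Group files (name -> size in bytes) into batches of roughly target_bytes.

    Greedy plan: files are taken in name order and a new batch starts
    whenever adding the next file would exceed the target size.
    """
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


sizes = {"a.json": 40, "b.json": 40, "c.json": 40, "d.json": 100}
print(plan_compaction(sizes, target_bytes=100))
# prints [['a.json', 'b.json'], ['c.json'], ['d.json']]
```

A file larger than the target simply ends up in a batch of its own; each batch would then be rewritten as a single larger file by the processing engine.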
Keys to architect a Datalake
• Data Movement, Data Processing
• Consider the Data Lake as central point
• Data Ingestion
• Data Processing
• Tools
• Hadoop (on Prem / Cloud)
• Legacy database systems (SQL Server PolyBase, Oracle Connector for Hadoop, AWS Redshift Spectrum/Athena etc.)
• Analytics, DataViz and ML
• Databricks, Power BI, SAS, Qlik etc.
• Data Colocation
• Data Format
• Compressed / Uncompressed
• Column oriented
Keys to architect a Datalake
• Orchestration
• Cloud Automation or Job Automation ?
• Batch or real time
• Batch automation
• Monitoring
• Data volume
• Real Time (Usually used for IoT)
• How is the pipeline built ?
• Event based or not ?
• Monitoring
Keys to architect a Datalake
• Discovery
• Tagging and Metadata management : Similar … but different
• MetaData management :
• Data about data : creation and modification date, source, format etc.
• Traditional metadata: source, connection string, data type, length, versions etc.
• Modern metadata: included in files (Avro, for example) or a database
• Advanced metadata: automated processing of metadata
• Tagging
• Set of tags to understand/describe datasets in the datalake
• Usually stored in a Catalog or KV database or through Naming conventions
• Key points: When was the data tagged ? Who owns the tagging system ?
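To make the difference between metadata and tags concrete, here is a minimal in-memory catalog sketch. The `CatalogEntry` and `Catalog` classes, their fields and the sample path are hypothetical; a real deployment would rely on a data catalog product or a KV database, as noted above.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    # Metadata: data about the data.
    path: str
    source: str
    data_format: str
    # Tags: free-form labels that describe the dataset for discovery.
    tags: set = field(default_factory=set)


class Catalog:
    """Toy catalog keyed by dataset path (stand-in for a KV store)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.path] = entry

    def tag(self, path: str, *tags: str) -> None:
        self._entries[path].tags.update(tags)

    def find_by_tag(self, tag: str) -> list:
        return sorted(p for p, e in self._entries.items() if tag in e.tags)


catalog = Catalog()
catalog.register(CatalogEntry("raw/erp/parts", source="erp", data_format="csv"))
catalog.tag("raw/erp/parts", "parts", "purchases")
print(catalog.find_by_tag("parts"))  # ['raw/erp/parts']
```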
Solutions available
• Solutions available
• On Prem :
• Hadoop / HDFS
• Cloud
• AWS : S3 Buckets
• Azure : Azure Datalake Store Gen1/Gen2, Storage Accounts
• GCP: Google Cloud Storage
• Oracle Cloud Infrastructure: Object Storage
Implementation
• Example : Solution
• Customer : Industry, truck manufacturer
• Project : Parts failure prediction
• Sensors are embedded in trucks
• Data collection for parts health
• Data are integrated real time in the Datalake
• Legacy data are integrated into the datalake (batch mode)
• Parts related data (mostly coming from ERPs) : Serial number, provider, purchases etc.
• Predictive algorithms are designed to replace parts before they break
Implementation
• Example: Solution
• Azure Datalake Store / Storage Accounts closely integrated with MS SQL Databases
• Why not on Prem ?
• Infrastructure costs
• Fuzzy Data volume prediction
• Hadoop management
Implementation
• Example: Solution
• Why Azure ?
• Long-time Microsoft customer
• Many services already used (Legacy databases: MS SQL DWH, Power BI etc.)
• Active Directory Integration: Security, ACL and
• Batch Integration by Talend
• Real Time Integration by Azure Products (IoT Hub + Azure Functions)
• Close integration with Databricks for Analytics and Data Processing
Conclusion
• Datalakes are now central components for enterprises
• Without …
• Organized Data
• Managed Data (Security, design etc.)
• High volume of Data
• No powerful AI or ML algorithms
• No powerful Analytic processes
Questions ?

Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 

Architecting a datalake

  • 1. Architecting a Datalake Laurent Léturgez – Sep 2019 Big Data Meetup - Lille
  • 2. Whoami • Database and BigData Architect (Hadoop, Data Science and other cool topics) • Former Developer and Consultant • Owner@Premiseo: Data Management on Premises and in the Cloud • Blogger since 2004 • http://laurent-leturgez.com • Twitter : @lleturgez
  • 3. What’s on the menu ? • What is a Datalake ? • Keys to architect a Datalake • Design, Security • Data movement, Data Processing • Discovery • Solutions available • Example • Datalake Implementation driven by IoT
  • 4. What is a Datalake ? • Repository of data stored in natural format • Single Store of Enterprise data • Raw Data • Transformed Data : Reports, DataViz, Results (AI, ML …) • Data Structure: • Structured Data : Row, Columns, Relational Data • Semi Structured Data: CSV, XML, JSON, log files • Unstructured Data: Mails, Documents, Binaries (Images, Videos)
  • 5. What is a Datalake ? • Features • Data are usually integrated unprocessed • Processed data can be kept in the Datalake • Data are kept … ready to be transformed • Data are saved as long as possible • A Datalake is • Organized • Managed
  • 6. What is a Datalake ? • A Datalake is not a datawarehouse Source: martinfowler.com
  • 7. Keys to architect a Datalake • A well-thought-out design • Vital for • Success • Discovery efficiency • ETL development effort • Coupled with Security and business process
  • 8. Keys to architect a Datalake • A well-thought-out design … example • Operational Areas • Raw Area • Data landing zone in native raw format • Data are kept indefinitely in this area • Data Tagging • Folder Structure organized by Source, Dataset, Date etc. • Staging Area • Data Preparation Area : Decompression, cleansing, aggregation • Data Quality Management is usually done here • Hub Area • Trusted layer of data • Data is ready for analytics, organized functionally
  • 9. Keys to architect a Datalake • A well-thought-out design … example • (Extra) Supported Area • Master Data Area • Customer, Products, Financial Data • Used by Analytics • Exploratory Area • Playground for Data Scientists and Analysts • Temporary Area • Testing Data decompression • Single point of data storage before moving across the network
  • 10. Keys to architect a Datalake • Security • Data Access Control • By User • By Application • ETL Software • Analytics • … • By Operational zone • By Source • Key Point: IAM Integration
  • 11. Keys to architect a Datalake • Security • Data Security • Data Lake Management (Role Control) • Data Resilience • Disaster recovery • Backup / Restore • SLA: Availability, RTO, RPO • Data Encryption • At rest • In transit
  • 12. Keys to architect a Datalake • Data Movement, Data Processing • Consider the Data Lake as central point for • Data Ingestion • Data Processing
  • 13. Keys to architect a Datalake • Data Movement, Data Processing • Consider the Data Lake as central point • Data Ingestion • Tools / ETL • Metadata strategy should be in place (Data Catalog for tagging) • Data Format • Naming convention for files/directories: ingestion date, format, source etc. • Batch or real time • Many small files or few big files • Data Partitioning → Maximum query and processing performance • Cloud or OnPrem ? • Network issues, hybrid Cloud considerations • Data Processing
  • 14. Keys to architect a Datalake • Data Movement, Data Processing • Consider the Data Lake as central point • Data Ingestion • Data Processing • Tools • Hadoop (on Prem / Cloud) • Legacy database systems (SQL Server PolyBase, Oracle Connector for Hadoop, AWS Spectrum/Athena etc.) • Analytics, DataViz and ML • Databricks, Power BI, SAS, Qlik etc. • Data Colocation • Data Format • Compressed / Uncompressed • Column oriented
  • 15. Keys to architect a Datalake • Orchestration • Cloud Automation or Job Automation ? • Batch or real time • Batch automation • Monitoring • Data volume • Real Time (Usually used for IoT) • How is the pipeline built ? • Event based or not ? • Monitoring
  • 16. Keys to architect a Datalake • Discovery • Tagging and Metadata management : Similar … but different • Metadata management : • Data about data : creation and modification date, source, format etc. • Traditional metadata: source, connection string, data type, length, versions etc. • Modern metadata: included in files (Avro for example) or a database • Advanced metadata: automated processing of metadata • Tagging • Set of tags to understand/describe datasets in the datalake • Usually stored in a Catalog or KV database or through Naming conventions • Key points: When was the data tagged ? Who owns the tagging system ?
  • 17. Solutions available • Solutions available • On Prem : • Hadoop / HDFS • Cloud • AWS : S3 Buckets • Azure : Azure Datalake Store Gen1/Gen2, Storage Accounts • GCP: Google Cloud Storage • Oracle Cloud Infrastructure: Object Storage
  • 18. Implementation • Example : Solution • Customer : Industry, Truck maker • Project : Parts failure prediction • Sensors are embedded in trucks • Data collection for parts health • Data are integrated in real time into the Datalake • Legacy data are integrated into the datalake (batch mode) • Parts related data (mostly coming from ERPs) : Serial number, provider, purchases etc. • Predictive algorithms are designed to replace parts before they break
  • 19. Implementation • Example: Solution • Azure Datalake Store / Storage Accounts closely integrated with MS SQL Databases • Why not on Prem ? • Infrastructure costs • Fuzzy Data volume prediction • Hadoop management
  • 20. Implementation • Example: Solution • Why Azure ? • Long-time Microsoft customer • Many services already used (Legacy databases: MS SQL DWH, Power BI etc.) • Active Directory Integration: Security, ACL and • Batch Integration by Talend • Real Time Integration by Azure Products (IoT Hub + Azure Functions) • Close integration with Databricks for Analytics and Data Processing
  • 21. Conclusion • Data lakes are now central components for enterprises • Without … • Organized Data • Managed Data (Security, design etc.) • High volume of Data • No powerful AI or ML algorithms • No powerful Analytic processes
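
The raw-area folder structure (slide 8: organized by Source, Dataset, Date) and the naming convention for files and directories (slide 13) can be sketched in a few lines of Python. This is only an illustration, not the layout used in the talk: the `/datalake/raw` root, the `year=/month=/day=` partition directories, the `raw_area_path` helper and the example file name are all assumptions made for the example.

```python
from datetime import date
from pathlib import PurePosixPath


def raw_area_path(source: str, dataset: str, ingestion_date: date,
                  filename: str, root: str = "/datalake/raw") -> PurePosixPath:
    """Build a raw-area path organized by source, dataset and ingestion date.

    The root and the year=/month=/day= directory names are hypothetical;
    they only illustrate a date-partitioned folder convention.
    """
    return (
        PurePosixPath(root)
        / source
        / dataset
        / f"year={ingestion_date:%Y}"
        / f"month={ingestion_date:%m}"
        / f"day={ingestion_date:%d}"
        / filename
    )


# Hypothetical example: a parts extract coming from an ERP system.
path = raw_area_path("erp", "parts", date(2019, 9, 12), "parts_20190912.csv")
print(path)
# /datalake/raw/erp/parts/year=2019/month=09/day=12/parts_20190912.csv
```

Encoding the ingestion date in the directory names is one common way to let query engines skip partitions they do not need, which is the point behind "Data Partitioning" on slide 13.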