Demystifying Data
Engineering
AI in Production Meetup
About me
• Bob Bui (linkedin.com/in/thangbn/)
• Senior Data Engineer @ EquitySim: an AI-EdTech company building a financial
simulation platform.
• Previously
• Senior Software Engineer @ SAP Innovation Center Singapore: SAP
Leonardo Machine Learning.
• Also building a variety of software products.
Agenda
• Data engineering
• Revisit Data Science
• The need for data engineering & data engineers
• Common concepts
• Typical big data analytics architecture
Data science
Extract
Insight & Knowledge
Unstructured
Structured
Data scientist
Maslow's Hierarchy of Needs
Source: https://www.simplypsychology.org/maslow.html
The AI Hierarchy of Needs
Source: Monica
AI and ML need a Strong
Data Foundation
But what if …
there is nothing OR a big mess
A Data Engineer is
the one who prepares the big data infrastructure so that the data can be
analyzed by Data Scientists.
Data Engineer’s Skill set
• BI
• Big Data
• Software engineering
• Data warehousing
Skills: Data Scientist vs Data Engineer
Data Engineer Data Scientist
Programming
Data Wrangling
Software Engineering
Software Design & Architecture
Software Ops
Data Intuition
Statistics & Mathematics
AI/Machine Learning
Data Visualization
Other related roles
• Data Analyst: queries and processes data, provides reports,
summarizes and visualizes data.
• BI Developer/Report Developer: builds BI and reporting solutions.
• ML Developer: has ML and statistics knowledge; focuses on implementing
ML algorithms.
Common Concepts
Business Analytics vs. Business Intelligence
• BI: analysis of historical data → problem identification & resolution → improve
business
• BA: exploration of historical data → identify trends, patterns & understand the
information → drive business change
                                   BI   BA
Collect, analyze, visualize data   ✅   ✅
Identify problems                  ✅   ✅
Descriptive analytics              ✅   ❌
Diagnostic analytics               ✅   ❌
Predictive analytics               ❌   ✅
Prescriptive analytics             ❌   ✅
Data lake vs Data warehouse
• Data warehouse: current and historical data used for reporting and data
analysis
• Data lake: repo to store raw, structured, unstructured data; anything,
everything.
• Data swamp: poorly managed data lake → inaccessible, little value
Data lake vs Data warehouse
             Data Warehouse          Data Lake
DATA         Processed, structured   Processed/unprocessed; structured, unstructured, raw
PROCESSING   Schema-on-write         Schema-on-read
             ETL                     ELT
             More expensive          Less expensive
AGILITY      Fixed, less agile       Flexible, highly agile
READINESS    Ready to be analyzed    Needs more processing before becoming useful
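To make the schema-on-write vs schema-on-read row concrete, here is a minimal Python sketch. The `SCHEMA` dictionary, the field names and both function names are hypothetical, chosen only for illustration; real warehouses and lakes enforce or apply schemas in their own ways.

```python
import json

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"user": str, "amount": float}

def write_with_schema(table, record):
    """Schema-on-write: validate before storing; reject anything that does not fit."""
    for field, field_type in SCHEMA.items():
        if not isinstance(record.get(field), field_type):
            raise ValueError(f"bad field: {field}")
    table.append(record)

def read_with_schema(raw_lines):
    """Schema-on-read: anything may be stored; structure is applied while reading."""
    for line in raw_lines:
        doc = json.loads(line)
        try:
            yield {"user": str(doc["user"]), "amount": float(doc["amount"])}
        except (KeyError, TypeError, ValueError):
            continue  # skip records that cannot be interpreted under this schema

warehouse = []
write_with_schema(warehouse, {"user": "a", "amount": 1.0})

lake = ['{"user": "a", "amount": "1.5"}', '{"note": "free text"}']
print(list(read_with_schema(lake)))  # [{'user': 'a', 'amount': 1.5}]
```

The write path pays the validation cost up front, which is why the stored data is ready to be analyzed; the read path accepts everything and defers that work to query time.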
ETL vs. ELT
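A rough sketch of the difference, using Python's built-in `sqlite3` as a stand-in for the target store. The `RAW_EVENTS` records and the table names are made up for illustration. In ETL the data is transformed in application code before it is loaded; in ELT the raw data is loaded first and transformed inside the store with SQL.

```python
import sqlite3

# Hypothetical raw events; field names and values are illustrative only.
RAW_EVENTS = [
    {"user": " Alice ", "amount": "10.5"},
    {"user": "bob", "amount": "4.5"},
    {"user": "ALICE", "amount": "2.0"},
]

def etl(events):
    """ETL: transform in application code, then load only the clean rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (user TEXT, amount REAL)")
    clean = [(e["user"].strip().lower(), float(e["amount"])) for e in events]
    db.executemany("INSERT INTO sales VALUES (?, ?)", clean)
    return db.execute(
        "SELECT user, SUM(amount) FROM sales GROUP BY user ORDER BY user"
    ).fetchall()

def elt(events):
    """ELT: load the raw rows as-is, then transform inside the store with SQL."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_sales (user TEXT, amount TEXT)")
    db.executemany(
        "INSERT INTO raw_sales VALUES (?, ?)",
        [(e["user"], e["amount"]) for e in events],
    )
    return db.execute(
        "SELECT LOWER(TRIM(user)) AS u, SUM(CAST(amount AS REAL)) "
        "FROM raw_sales GROUP BY u ORDER BY u"
    ).fetchall()

print(etl(RAW_EVENTS))  # [('alice', 12.5), ('bob', 4.5)]
print(elt(RAW_EVENTS))  # same result, but the raw rows stay available in the store
```

Both paths give the same report here. The practical difference is where the transformation runs and whether the untransformed data is kept for later, different questions.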
Typical Big Data Analytics
System Architecture
Typical architecture
Break down architecture
Data ingestion
• Role: Streaming data from source into pipeline.
• Characteristics:
• High Performance, Low latency
• Superbly Scalable
• Durable
• Integration with existing DB systems
• Common options:
• Kafka
• AWS Kinesis
• GCP PubSub
Big Data Processing Techs
• Uses:
• ETL: Clean, flatten, transform, aggregate data into more-analyzable format.
• Analytics
• Training data for Machine Learning
• Characteristics:
• Able to handle big data
• Scale out
• Low latency
Processing model: Batch vs Stream
             Batch Processing                               Stream Processing
Data scope   Processing over all or most of the data set    Processing over data in a rolling window or the most recent data record
Data size    Large batches of data                          Individual records or micro-batches of a few records
Latency      Minutes to hours                               On the order of seconds or milliseconds
Analytics    Complex analytics                              Simple instant-response functions, aggregates, and rolling metrics
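The two models can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how any particular engine implements it; the function names and the `window` parameter are assumptions made for the example.

```python
from collections import deque

def batch_average(values):
    """Batch: look at the whole data set, produce one result at the end."""
    values = list(values)
    return sum(values) / len(values)

def rolling_average(stream, window=3):
    """Stream: emit an updated result per record, over the last `window` records only."""
    recent = deque(maxlen=window)  # older records fall out automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)

readings = [10, 20, 30, 40]
print(batch_average(readings))          # 25.0
print(list(rolling_average(readings)))  # [10.0, 15.0, 20.0, 30.0]
```

The batch function cannot answer until all data has arrived, while the rolling version answers after every record but only ever sees a small window.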
Processing model: Data parallelism vs Task parallelism
Data parallelism vs Task parallelism
                            Data parallelism                       Task parallelism
Fashion                     Same operations are performed on       Different operations are performed on
                            different subsets of the same data.    the same or different data.
Computation                 Synchronous                            Asynchronous
Amount of parallelization   Proportional to the input data size.   Proportional to the number of independent
                                                                   tasks to be performed.
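A small sketch of both styles using the standard library's `concurrent.futures`. The function names and the choice of operations are illustrative assumptions. Threads are used only to show the structure; in CPython they do not speed up CPU-bound work, so a real system would distribute across processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

def square_sum(chunk):
    """The single operation that every worker applies to its own chunk."""
    return sum(x * x for x in chunk)

def data_parallel(data, workers=4):
    """Data parallelism: the same operation on different subsets of the data."""
    size = max(1, -(-len(data) // workers))  # ceiling division, at least 1
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(square_sum, chunks))

def task_parallel(data):
    """Task parallelism: different, independent operations on the same data."""
    tasks = {"min": min, "max": max, "total": sum}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}

numbers = list(range(1, 9))
print(data_parallel(numbers))  # 204
print(task_parallel(numbers))  # {'min': 1, 'max': 8, 'total': 36}
```

In the first function the number of useful workers grows with the input size; in the second it is capped by the number of independent tasks, here three.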
Unified Model
• Combines
• Batch and Stream
• Data parallelism and task parallelism
Popular processing techs
• Hadoop ecosystem: on-disk batch processing
• Spark: in-memory batch/“pseudo-streaming” processing
• Flink, Storm: native stream processing
• Beam: unified model framework
• Hosted:
• GCP Dataflow: programming framework
• AWS Data Pipeline: web service centered on S3 and AWS EMR
• Azure Data Factory: Drag & drop data pipeline builder GUI
Why do we need another “database”?
• Collect data from multiple sources
• Different data model needs
• Workload: Transactional vs Analytical (OLTP vs. OLAP)
• Some NoSQL databases are not suitable for data analytics
• Storage structure optimized for slice & dice queries
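One common example of a storage structure tuned for analytical queries is a column-oriented layout. The sketch below contrasts it with a row-oriented layout in plain Python; the `ROWS` records and the function names are hypothetical and only meant to show the two access patterns.

```python
# Hypothetical order records, stored row by row as an application DB would.
ROWS = [
    {"id": 1, "country": "SG", "amount": 30.0},
    {"id": 2, "country": "VN", "amount": 12.0},
    {"id": 3, "country": "SG", "amount": 8.0},
]

def to_columns(rows):
    """Pivot row-oriented records into a column-oriented layout."""
    return {key: [row[key] for row in rows] for key in rows[0]}

def lookup_order(rows, order_id):
    """OLTP-style access: fetch one whole record by its key."""
    return next(row for row in rows if row["id"] == order_id)

def total_by_country(columns):
    """OLAP-style access: scan only the two columns the query needs."""
    totals = {}
    for country, amount in zip(columns["country"], columns["amount"]):
        totals[country] = totals.get(country, 0.0) + amount
    return totals

print(lookup_order(ROWS, 2))               # {'id': 2, 'country': 'VN', 'amount': 12.0}
print(total_by_country(to_columns(ROWS)))  # {'SG': 38.0, 'VN': 12.0}
```

The transactional lookup touches every field of one record, while the aggregate touches two fields of every record, which is why the two workloads tend to favor different storage layouts.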
Storage Unstructured
• Unstructured: Text, CSV, Image, Video, etc.
• Usually a highly scalable key-value object store.
• Options:
• Managed: Google Cloud Storage, AWS S3
• Open source: OpenStack Swift, Minio.
Storage Structured
• Database: SQL, NoSQL
• Characteristics:
• Analytics query language: ideally SQL-like
• Massively scalable to billions of rows
• Low latency data ingestion
• Read-focused over large portions of data
• There are MANY options
Option 1: using same DB as application DB
• App database
• A read-replica of app DB
• A separate data warehouse running your “app database”
[Diagram: App DB; App DB with Read Replica; App DB synced to DW App DB]
Option 2: SQL-based analytics DB
• SQL-like databases optimized for analytical workloads
• Options:
• Open source: Postgres-based: Citus, Greenplum
• Hosted: Athena, Redshift, Azure, BigQuery
• Proprietary: Teradata, Oracle
Option 3: SQL-on-Hadoop
• Leverage data from Hadoop-based processing framework
• Techs: Spark SQL, Drill, Hive, Impala, Presto
• Pros:
• Can scale to massive data sets
• Use common SQL dialects
• Decent tool support
• Joins across different types of data sources: SQL, NoSQL, structured files.
• Cons:
• Languages are very low level
• Requires running a Hadoop cluster
Option 4: ElasticSearch
• Leverage its query language to power search-oriented analytics
• Pros:
• FAST
• Strong ability to search your data
• Cons:
• Slow ingestion
• Difficult query language that is optimized for search, not analytics
Option 5: In-memory databases
• If you want super low latency
• Techs: Druid, Pinot, SAP HANA
• Pros:
• FAST, FAST, FAST
• Cons:
• A LOT OF RAM
• Query language is less flexible and powerful
• Joins are limited
• Challenging to deploy and manage
Take away
• There is overlap between the Data Scientist and Data Engineer roles, but
the distinction is becoming clearer
• No role is better than another; know what your organization needs
