Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer, and the common concepts of Data Engineering, including the commonly misunderstood ones. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
2. About me
• Bob Bui (linkedin.com/in/thangbn/)
• Senior Data Engineer @ EquitySim: an AI-EdTech company building a financial
simulation platform.
• Previously
• Senior Software Engineer @ SAP Innovation Center Singapore: SAP
Leonardo Machine Learning.
• Also building a variety of software products.
3. Agenda
• Data engineering
• Revisit Data Science
• The need for data engineering & data engineers
• Common concepts
• Typical big data analytics architecture
12. Skills: Data Scientist vs Data Engineer
Skills compared for Data Engineer and Data Scientist:
• Programming
• Data Wrangling
• Software Engineering
• Software Design & Architecture
• Software Ops
• Data Intuition
• Statistics & Mathematics
• AI/Machine Learning
• Data Visualization
13. Other related roles
• Data Analyst: queries and processes data, provides reports,
summarizes and visualizes data.
• BI Developer/Report Developer: builds BI and reporting solutions.
• ML Developer: has ML and statistics knowledge; focuses on implementing
ML algorithms.
15. Business Analytics vs. Business Intelligence
• BI: analysis of historical data → problem identification & resolution → improve
business
• BA: exploration of historical data → identify trends, patterns & understand the
information → drive business change
                                   BI    BA
Collect, analyze, visualize data   ✅    ✅
Identify problems                  ✅    ✅
Descriptive analytics              ✅    ❌
Diagnostic analytics               ✅    ❌
Predictive analytics               ❌    ✅
Prescriptive analytics             ❌    ✅
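The difference between descriptive and predictive analytics can be sketched in a few lines. This is only an illustration: the sales figures and the function names `describe` and `forecast_next` are hypothetical, and a plain least-squares line stands in for whatever model a real BA team would use.

```python
# Hypothetical monthly sales figures (historical data), not from the talk.
MONTHLY_SALES = [100.0, 110.0, 120.0, 130.0]


def describe(history):
    """Descriptive analytics: summarize what already happened."""
    return {
        "total": sum(history),
        "average": sum(history) / len(history),
        "best_month": history.index(max(history)),
    }


def forecast_next(history):
    """Predictive analytics: fit a least-squares line and extrapolate one step."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept + slope * n  # value of the fitted line at the next month


summary = describe(MONTHLY_SALES)
next_month = forecast_next(MONTHLY_SALES)
```

`summary` only looks backward at the recorded months, while `next_month` is an estimate for a month that has not happened yet.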
16. Data lake vs Data warehouse
• Data warehouse: current and historical data used for reporting and data
analysis
• Data lake: repository that stores raw, structured, and unstructured data; anything and
everything.
• Data swamp: poorly managed data lake → inaccessible, little value
17. Data lake vs Data warehouse
             Data Warehouse                Data Lake
DATA         Processed, structured         Processed/unprocessed; structured, unstructured, raw
PROCESSING   Schema-on-write, ETL,         Schema-on-read, ELT,
             more expensive                less expensive
AGILITY      Fixed, less agile             Flexible, highly agile
READINESS    Ready to be analyzed          Needs more processing before becoming useful
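The schema-on-write vs schema-on-read distinction can be sketched as follows. This is a minimal illustration, assuming hypothetical JSON events and a made-up two-column schema; the function names are not from any particular warehouse or lake product.

```python
import json

# Hypothetical raw events; the third record is missing "amount".
RAW_EVENTS = [
    '{"user": "a", "amount": "10.5"}',
    '{"user": "b", "amount": "3"}',
    '{"user": "c"}',
]

# Assumed warehouse schema: column name -> conversion function.
SCHEMA = {"user": str, "amount": float}


def load_schema_on_write(lines):
    """Warehouse style: validate and convert before storing; reject bad rows."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            row = {column: cast(record[column]) for column, cast in SCHEMA.items()}
        except (KeyError, ValueError):
            rejected.append(line)
        else:
            table.append(row)
    return table, rejected


def load_schema_on_read(lines):
    """Lake style: store the raw text untouched; nothing is validated on write."""
    return list(lines)


def total_amount_from_lake(lake):
    """The schema is applied only now, by the query that needs it."""
    total = 0.0
    for line in lake:
        record = json.loads(line)
        total += float(record.get("amount", 0))
    return total


warehouse, rejected = load_schema_on_write(RAW_EVENTS)
lake = load_schema_on_read(RAW_EVENTS)
lake_total = total_amount_from_lake(lake)
```

The warehouse path pays the processing cost up front and ends with data that is ready to analyze; the lake path accepts everything cheaply and defers that work to each reader.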
22. Data ingestion
• Role: Streaming data from source into pipeline.
• Characteristics:
• High performance, low latency
• Highly scalable
• Durable
• Integration with existing DB systems
• Common options:
• Kafka
• AWS Kinesis
• GCP PubSub
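The ingestion role (a buffer between a data source and the pipeline) can be sketched with the standard library alone. This is a toy under stated assumptions: an in-memory `queue.Queue` stands in for a system such as Kafka, so it shows the producer/consumer shape and backpressure only, not durability or scalability. The names `source`, `pipeline`, and `ingest` are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the hypothetical event stream


def source(buffer, events):
    """Producer: push events into the buffer as they occur."""
    for event in events:
        buffer.put(event)  # blocks while the buffer is full (backpressure)
    buffer.put(SENTINEL)


def pipeline(buffer):
    """Consumer: pull events from the buffer in arrival order."""
    received = []
    while True:
        event = buffer.get()
        if event is SENTINEL:
            break
        received.append(event)
    return received


def ingest(events, capacity=2):
    """Run the producer in a thread and consume in the caller's thread."""
    buffer = queue.Queue(maxsize=capacity)
    producer = threading.Thread(target=source, args=(buffer, events))
    producer.start()
    received = pipeline(buffer)
    producer.join()
    return received


ingested = ingest([{"click": i} for i in range(5)])
```

The small `capacity` forces the producer to wait for the consumer, which is the decoupling an ingestion layer provides between source speed and pipeline speed.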
23. Big Data Processing Techs
• Uses:
• ETL: Clean, flatten, transform, and aggregate data into a more analyzable format.
• Analytics
• Training data for Machine Learning
• Characteristics:
• Able to handle big data
• Scale out
• Low latency
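The four ETL steps named above (clean, flatten, transform, aggregate) can be sketched on a tiny in-memory data set. The order records and function names below are hypothetical, and a real big data job would run the same steps on a distributed engine rather than in plain Python.

```python
from collections import defaultdict

# Hypothetical nested order events as they might arrive from an application.
RAW_ORDERS = [
    {"id": 1, "customer": {"country": "SG"}, "amount": "20.00"},
    {"id": 2, "customer": {"country": "sg "}, "amount": "5.50"},
    {"id": 3, "customer": {"country": "VN"}, "amount": None},  # dirty record
    {"id": 4, "customer": {"country": "VN"}, "amount": "7.25"},
]


def clean(records):
    """Clean: drop records that lack a usable amount."""
    return [record for record in records if record.get("amount") is not None]


def flatten(record):
    """Flatten: pull nested fields up into a single-level row."""
    return {
        "id": record["id"],
        "country": record["customer"]["country"],
        "amount": record["amount"],
    }


def transform(row):
    """Transform: normalize text and convert strings to numbers."""
    return {
        **row,
        "country": row["country"].strip().upper(),
        "amount": float(row["amount"]),
    }


def aggregate(rows):
    """Aggregate: total amount per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


revenue_by_country = aggregate(transform(flatten(r)) for r in clean(RAW_ORDERS))
```

Each step is a separate function so that the same shape carries over to a framework where every step becomes a stage in a pipeline.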
24. Processing model: Batch vs Stream
             Batch Processing                    Stream Processing
Data scope   Over all or most of the data set    Over a rolling window or the most recent record
Data size    Large batches of data               Individual records or micro-batches of a few records
Latency      Minutes to hours                    Seconds or milliseconds
Analytics    Complex analytics                   Simple instant-response functions, aggregates,
                                                 and rolling metrics
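The two processing models can be contrasted on the same metric. This is a minimal sketch with hypothetical sensor readings: the batch function needs the whole data set before it answers, while the stream function emits an updated rolling average as each record arrives.

```python
from collections import deque

READINGS = [3, 5, 4, 8, 10]  # hypothetical sensor values, in arrival order


def batch_average(dataset):
    """Batch: one pass over the whole stored data set, run after the fact."""
    return sum(dataset) / len(dataset)


def rolling_averages(stream, window_size):
    """Stream: emit an updated metric per record over a rolling window."""
    window = deque(maxlen=window_size)  # old records fall out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


batch_result = batch_average(READINGS)
stream_results = list(rolling_averages(iter(READINGS), window_size=3))
```

`batch_result` is a single number for all five readings, whereas `stream_results` holds one value per incoming record, each computed from at most the three most recent readings.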
26. Data parallelism vs Task parallelism
                  Data parallelism                      Task parallelism
Fashion           Same operations are performed on      Different operations are performed on
                  different subsets of the same data    the same or different data
Computation       Synchronous                           Asynchronous
Amount of         Proportional to the input data size   Proportional to the number of
parallelization                                         independent tasks to be performed
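Both styles can be sketched with Python's `concurrent.futures`. This is an illustration of the structure only: the data and function names are hypothetical, and threads are used for simplicity, so CPU-bound work like this would not actually run faster under CPython's global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

DATA = list(range(1, 101))  # hypothetical input data set


def chunk(items, n_chunks):
    """Split items into at most n_chunks contiguous subsets."""
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(part):
    return sum(x * x for x in part)


def data_parallel_sum_of_squares(items, workers=4):
    """Data parallelism: the same operation runs on different subsets."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum_of_squares, chunk(items, workers))
        return sum(partials)  # combine the per-subset results


def task_parallel_summary(items):
    """Task parallelism: different operations run on the same data."""
    tasks = {"min": min, "max": max, "total": sum}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, items) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


squares_total = data_parallel_sum_of_squares(DATA)
summary = task_parallel_summary(DATA)
```

In the first function more data means more chunks to hand out, so the useful degree of parallelism grows with input size. In the second it is capped at three, the number of independent tasks.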
28. Popular processing techs
• Hadoop ecosystem: on-disk batch processing
• Spark: in-memory batch/“pseudo-streaming” processing
• Flink, Storm: native stream processing
• Beam: unified model framework
• Hosted:
• GCP Dataflow: programming framework
• AWS Data Pipeline: web service centered on S3 and AWS EMR
• Azure Data Factory: Drag & drop data pipeline builder GUI
29. Why do we need another “database”?
• Collect data from multiple sources
• Different data model needs
• Workload: Transactional vs Analytical (OLTP vs. OLAP)
• Some NoSQL databases are not suitable for data analytics
• Storage structure optimized for slice & dice queries
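The transactional vs analytical workload difference shows up in the shape of the queries. The sketch below uses an in-memory SQLite database purely for illustration; the `orders` table and its rows are hypothetical, and SQLite itself is a row-oriented engine, not an analytical store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (id, country, amount) VALUES (?, ?, ?)",
    [(1, "SG", 20.0), (2, "SG", 5.5), (3, "VN", 7.25)],
)

# OLTP-style workload: read or write a single row by key, inside a transaction.
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (6.0, 2))
single_order = conn.execute(
    "SELECT country, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# OLAP-style workload: scan many rows and slice the result by a dimension.
report = conn.execute(
    "SELECT country, SUM(amount), COUNT(*) "
    "FROM orders GROUP BY country ORDER BY country"
).fetchall()
```

The first pattern touches one row and must be fast and safe under many concurrent writers; the second reads a large share of the table but only a few columns, which is the access pattern analytical storage is built around.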
30. Storage Unstructured
• Unstructured: Text, CSV, Image, Video, etc.
• Usually a highly scalable key-value object store.
• Options:
• Managed: Google Cloud Storage, AWS S3
• Open source: OpenStack Swift, Minio.
31. Storage Structured
• Database: SQL, NoSQL
• Characteristics:
• Analytics query language: ideally SQL-like
• Massively scalable to billions of rows
• Low latency data ingestion
• Read-focused over large portions of data
• There are MANY options
32. Option 1: using same DB as application DB
• App database
• A read-replica of app DB
• A separate data warehouse
running your “app database”
[Diagram: App DB alone; App DB with a Read Replica; App DB synced to a DW App DB]
34. Option 3: SQL-on-Hadoop
• Leverage data from Hadoop-based processing framework
• Techs: Spark SQL, Drill, Hive, Impala, Presto
• Pros:
• Can scale to massive data sets
• Use common SQL dialects
• Decent tool support
• Joins between different types of data sources: SQL, NoSQL, structured files.
• Cons:
• Languages are very low level
• Requires running a Hadoop cluster
35. Option 4: ElasticSearch
• Leverage its query language to power search-oriented analytics
• Pros:
• FAST
• Strong ability to search your data
• Cons:
• Slow ingestion
• Difficult query language that is optimized for search, not analytics
36. Option 5: In-memory databases
• If you want super low latency
• Techs: Druid, Pinot, SAP HANA
• Pros:
• FAST, FAST, FAST
• Cons:
• A LOT OF RAM
• Query language is less flexible and powerful
• Joins are limited
• Challenging to deploy and manage
37. Take away
• The Data Scientist and Data Engineer roles overlap, but
the distinction is becoming clearer
• No role is better than another; know what your organization needs