Data Engineering
Challenges
DSE Days - 10 Sept 2015
Structure
1. Data Engineering
2. Data Pipeline
3. Data Engineering Challenges
4. Closing
1. Data Engineering
All those buzzwords...
- Data explosion, big data
- Data scientist
- IoT
- Data-driven company
Who is a Data Engineer?
“The role of data engineer is now used throughout industry
to describe the highly specialized software
engineers who create and maintain
these robust big data pipelines.” -
Insight Data Engineering
Basically we are software engineers.
2. Data Pipeline
Data Pipeline
INGESTION
Take it
DATA MANAGEMENT
Manage them
PROCESSING
Process it
STORAGE
Store it
RETRIEVAL
Use it
Lambda Architecture
INGESTION
Take it
DATA MANAGEMENT
Manage them
BATCH
PROCESSING
Process it
STORAGE
Store it
RETRIEVAL
Use it
STREAM
PROCESSING
Process it NOW
Big Data Pipeline
3. Data Engineering
Challenges
Challenges - Ingestion
Throughput, availability, scalability
Challenges - Ingestion
Sample Problem:
Facebook page views: ~1 trillion/month
≈ 385,802 log writes or inserts per second (assuming a 30-day month)
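The per-second figure follows from dividing the monthly total by the number of seconds in a month. A minimal sketch of that arithmetic, assuming a 30-day month (the constant and function names are illustrative only):

```python
# Back-of-the-envelope write rate for the sample problem.
# Assumption: a 30-day month; the monthly total is the figure quoted on the slide.
PAGE_VIEWS_PER_MONTH = 1_000_000_000_000   # ~1 trillion page views
SECONDS_PER_MONTH = 30 * 24 * 60 * 60      # 2,592,000 seconds


def events_per_second(events_per_month: int, seconds: int = SECONDS_PER_MONTH) -> float:
    """Average sustained write rate implied by a monthly event count."""
    return events_per_month / seconds


rate = events_per_second(PAGE_VIEWS_PER_MONTH)
print(f"{rate:,.0f} events/s")  # 385,802 events/s
```

This is an average; real traffic peaks higher, so the ingestion layer needs headroom above this number.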
Sample Solution:
Kafka: 2 million writes/s (on 3 cheap machines)
- Simple (Log) → Throughput, O(1)
- Partitioning → Scalability
- Replication → Availability
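The three properties listed above can be illustrated with a toy in-memory model. This is only a sketch of the ideas (append-only log, key-based partitioning, copies on several replicas); it is not Kafka's API, and the class and method names are hypothetical.

```python
import zlib
from typing import List, Tuple


class PartitionedLog:
    """Toy model of a partitioned, replicated append-only log (not Kafka's API)."""

    def __init__(self, num_partitions: int, replication_factor: int):
        self.num_partitions = num_partitions
        # replicas[r][p] is the copy of partition p held by replica r.
        self.replicas: List[List[list]] = [
            [[] for _ in range(num_partitions)]
            for _ in range(replication_factor)
        ]

    def partition_for(self, key: str) -> int:
        # Stable hash, so the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def append(self, key: str, value: str) -> Tuple[int, int]:
        """Append at the end of one partition on every replica: O(1) per copy."""
        partition = self.partition_for(key)
        for replica in self.replicas:
            replica[partition].append((key, value))
        offset = len(self.replicas[0][partition]) - 1
        return partition, offset

    def read(self, partition: int, offset: int, replica: int = 0):
        """Any replica can serve a read, which is what gives availability."""
        return self.replicas[replica][partition][offset]


log = PartitionedLog(num_partitions=4, replication_factor=3)
partition, offset = log.append("user-42", "page_view:/home")
print(log.read(partition, offset, replica=2))
```

Adding partitions spreads keys over more machines (scalability), while the extra copies mean a read can still be served if one replica is lost (availability).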
Challenges - Ingestion
Challenge 1 - Wiring to Main App
● May require changes to the main application
Challenge 2 - Failure isolation
● Minimize application failures caused by logging
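One common way to isolate failures is to hand events to a background thread through a bounded queue, so the application never blocks or raises when the ingestion layer is slow or down. A minimal sketch under that assumption; `IsolatedEventLogger` and the `send` callable are hypothetical names, and `send` stands in for whatever transport is actually used.

```python
import queue
import threading


class IsolatedEventLogger:
    """Hands events to a background thread; never blocks or raises in the caller."""

    def __init__(self, send, max_pending: int = 1000):
        self._send = send                    # hypothetical transport callable
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0                     # events discarded because the buffer was full
        self.failed = 0                      # events whose send raised an exception
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def log(self, event) -> bool:
        """Called from the application's request path; returns False if dropped."""
        try:
            self._pending.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1                # shed load instead of slowing the app
            return False

    def _drain(self):
        while True:
            event = self._pending.get()
            try:
                self._send(event)
            except Exception:
                self.failed += 1             # the failure stays inside the logger
            finally:
                self._pending.task_done()

    def flush(self):
        """Wait until every queued event has been attempted."""
        self._pending.join()


delivered = []


def demo_send(event):
    if event == "bad":
        raise ConnectionError("broker unavailable")
    delivered.append(event)


logger = IsolatedEventLogger(demo_send)
for event in ["a", "bad", "b"]:
    logger.log(event)
logger.flush()
print(delivered, logger.failed)
```

The trade-off in this design is that events can be lost when the buffer fills or a send fails; whether that is acceptable depends on how critical each log record is.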
Challenges - Processing
Integrity, Dependency, Performance
Challenges - Processing
Sample Problem:
How many page views are from Indonesia in Aug 2015?
~10 PB of data for the month, assuming ~1 trillion page views at 10 KB/datum
Sample Solution:
● Spark/Hadoop for computation
● HDFS for storage, with Avro as the file format
● Oozie for workflow management
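The sample problem maps naturally onto the map, shuffle, and reduce phases that Hadoop and Spark distribute across a cluster. The sketch below runs the same three phases in plain Python on a handful of made-up records; the record fields and function names are assumptions for illustration, not a Spark or Hadoop API.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical page-view records; the field names are assumptions.
PAGE_VIEWS = [
    {"country": "ID", "date": "2015-08-01", "url": "/home"},
    {"country": "ID", "date": "2015-08-17", "url": "/search"},
    {"country": "SG", "date": "2015-08-17", "url": "/home"},
    {"country": "ID", "date": "2015-09-01", "url": "/home"},
]


def map_phase(record):
    """Emit (country, 1) for each record that falls in August 2015."""
    if record["date"].startswith("2015-08"):
        yield record["country"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def count_views_by_country(records):
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


print(count_views_by_country(PAGE_VIEWS)["ID"])  # 2
```

In a real cluster each phase runs in parallel over partitions of the data, which is why the per-record map function must not depend on other records.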
Challenges - Processing
Challenge 1 - Learning Curve
● A new way of thinking about data processing: MapReduce
● New technology and operational concerns
Challenge 2 - Putting it All Together
● Incompatible release versions
● Minimal documentation
Challenges - Storage
Efficiency, Performance
Challenges - Storage
Sample Problems:
1. We want the number of daily page views from
Indonesia for the last 7 days
2. We want to retrieve a user’s latest transaction to better
personalize search results
Sample Solution:
1. You might need a columnar store for OLAP queries
2. You might need a key-value store, since the data is retrieved by user ID
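The two access patterns can be contrasted with a small in-memory sketch: a column-oriented table scanned for an aggregate, and a key-value map read with a single lookup. The table contents, field names, and function names below are made up for illustration and do not represent any particular database.

```python
from datetime import date, timedelta

# Columnar layout: one list per column, aligned by row index.
# Hypothetical table of daily page views per country.
page_views = {
    "day": [date(2015, 9, 8), date(2015, 9, 9), date(2015, 9, 9), date(2015, 9, 10)],
    "country": ["ID", "ID", "SG", "ID"],
    "views": [120, 95, 40, 130],
}


def daily_views(table, country, end_day, days=7):
    """OLAP-style scan that touches only the three columns it needs."""
    start_day = end_day - timedelta(days=days - 1)
    totals = {}
    for day, ctry, views in zip(table["day"], table["country"], table["views"]):
        if ctry == country and start_day <= day <= end_day:
            totals[day] = totals.get(day, 0) + views
    return totals


# Key-value layout: the latest transaction is stored under the user id,
# so personalization needs one lookup rather than a scan.
latest_transaction = {}


def record_transaction(user_id, transaction):
    latest_transaction[user_id] = transaction  # overwriting keeps only the newest


def get_latest_transaction(user_id):
    return latest_transaction.get(user_id)


print(daily_views(page_views, "ID", end_day=date(2015, 9, 10)))
record_transaction("user-42", {"item": "flight", "day": date(2015, 9, 9)})
record_transaction("user-42", {"item": "hotel", "day": date(2015, 9, 10)})
print(get_latest_transaction("user-42"))
```

Neither layout serves the other query well: the columnar table would need a scan to find one user's latest record, and the key-value map cannot aggregate across users without reading every key.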
Challenges - Storage
Challenge 1 - Choosing the right storage
● There are many kinds of databases nowadays. Pick
wisely to best support your use cases.
Challenge 2 - Develop the right model
● Each database has a different way of modeling data.
A relational model might not be appropriate. We need to
understand how the database works.
Challenges - Retrieval
Ease of Use, Reusability, Adaptiveness
Challenges - Retrieval
Sample Problem:
● We want to visualize the number of daily page views from
Indonesia for the last 7 days
● and other needs such as ad hoc queries and reporting
Sample Solution:
● Create a backend service to run queries and an application to
visualize the query results
Challenges - Retrieval
Challenge 1 - Ease of Use, Reusability
● Ease of use is very important because retrieval is a
user-facing product. Data products have to be
reusable and discoverable across data users.
Challenge 2 - Adaptiveness
● Since there are many kinds of databases now, the query
service needs to be extensible and adaptive so that it
can use data from various sources.
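One way to keep a query service extensible is to hide each database behind a small adapter interface and let the service dispatch by source name. The sketch below assumes such a design; `DataSource`, `InMemorySource`, and `QueryService` are hypothetical names, and the in-memory adapter stands in for a real database client.

```python
from typing import Dict, List, Protocol


class DataSource(Protocol):
    """Interface every backend adapter is assumed to implement."""

    def query(self, metric: str, filters: Dict[str, str]) -> List[dict]: ...


class InMemorySource:
    """Stand-in backend; a real adapter would wrap a database client."""

    def __init__(self, rows: List[dict]):
        self._rows = rows

    def query(self, metric: str, filters: Dict[str, str]) -> List[dict]:
        return [
            {"day": row["day"], metric: row[metric]}
            for row in self._rows
            if all(row.get(k) == v for k, v in filters.items())
        ]


class QueryService:
    """Single entry point for data users; new sources plug in via register()."""

    def __init__(self):
        self._sources: Dict[str, DataSource] = {}

    def register(self, name: str, source: DataSource) -> None:
        self._sources[name] = source

    def sources(self) -> List[str]:
        # Listing registered sources helps make data products discoverable.
        return sorted(self._sources)

    def run(self, source_name: str, metric: str, **filters: str) -> List[dict]:
        if source_name not in self._sources:
            raise KeyError(f"unknown data source: {source_name}")
        return self._sources[source_name].query(metric, filters)


service = QueryService()
service.register("olap", InMemorySource([
    {"day": "2015-09-09", "country": "ID", "views": 95},
    {"day": "2015-09-09", "country": "SG", "views": 40},
]))
print(service.run("olap", "views", country="ID"))
```

Supporting another database then means writing one more adapter class, without changing the visualization application that calls `run`.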
Challenges - Data Management
Challenge 1 - Centralized Metadata
● Manage data in various places, with various schemas
(sometimes schemaless).
Challenge 2 - Security, Access Control
● Most of these technologies are newly developed, and security
is usually the last thing we consider.
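Both challenges can be pictured as one central registry that records where each dataset lives, what its schema is (if any), and which roles may read it. This is a minimal sketch under that assumption; the class names, role names, and storage locations are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class DatasetEntry:
    name: str
    location: str                        # where the data lives
    schema: Optional[Dict[str, str]]     # None marks a schemaless dataset
    allowed_roles: Set[str] = field(default_factory=set)


class MetadataCatalog:
    """One place to look up datasets, with a role check on every lookup."""

    def __init__(self):
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def describe(self, name: str, role: str) -> DatasetEntry:
        entry = self._entries[name]
        if role not in entry.allowed_roles:
            raise PermissionError(f"role {role!r} may not access {name!r}")
        return entry

    def visible_to(self, role: str) -> List[str]:
        return sorted(
            name for name, entry in self._entries.items()
            if role in entry.allowed_roles
        )


catalog = MetadataCatalog()
catalog.register(DatasetEntry(
    name="page_views",
    location="hdfs:///logs/page_views",          # hypothetical path
    schema={"country": "string", "day": "date", "views": "long"},
    allowed_roles={"analyst", "engineer"},
))
catalog.register(DatasetEntry(
    name="raw_events",
    location="kv://events",                      # hypothetical location
    schema=None,                                 # schemaless
    allowed_roles={"engineer"},
))
print(catalog.visible_to("analyst"))
```

Putting the access check in the catalog means it is applied in one place rather than separately in each storage system.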
4. Closing
Takeaway Points
● Think critically
○ Be wise and don’t get carried away. Do not use
something just because it is cool; make sure you are
using what you need.
● Stay curious
○ New technologies arrive every day, and one of them
might save your day
What is it like to be a Data Engineer?
● Exhilarating
○ You are in a critical position: you handle big volumes of data, act as the
nerve of the company, and have to make sure the pipeline is robust.
● Challenging
○ You have to be a DBA, data architect, big data programmer, software
engineer, and data analyst at the same time!
● Fun
○ You always need to learn new technologies and new ways to solve problems
● High Demand
○ Data engineering is one of the most in-demand job roles at today’s
leading companies.
Q&A
References
● http://insightdataengineering.com/blog/The-Data-Engineering-Ecosystem-An-Interactive-Map.html
● http://insightdataengineering.com/Insight_Data_Engineering_White_Paper.pdf

Data Engineering Challenges - DSE Day at Bandung Institute of Technology

  • 2. Structure 1. Data Engineering 2. Data Pipeline 3. Data Engineering Challenges 4. Closing
  • 4. All those buzzwords... - Data explosion, big data - Data scientist - IoT - Data-driven company
  • 5. Who is a Data Engineer? “The role of data engineer is now used throughout industry to describe the highly specialized software engineers who create and maintain these robust big data pipelines.” - Insight Data Engineering. Basically, we are software engineers.
  • 7. Data Pipeline: INGESTION (Take it) | DATA MANAGEMENT (Manage them) | PROCESSING (Process it) | STORAGE (Store it) | RETRIEVAL (Use it)
  • 8. Lambda Architecture: INGESTION (Take it) | DATA MANAGEMENT (Manage them) | BATCH PROCESSING (Process it) | STREAM PROCESSING (Process it NOW) | STORAGE (Store it) | RETRIEVAL (Use it)
  • 11. Challenges - Ingestion: Throughput, availability, scalability
  • 12. Challenges - Ingestion. Sample Problem: Facebook page views ~ 1 trillion/month, i.e. 385,802 logs or inserts per second. Sample Solution: Kafka, 2 million writes/s (on 3 cheap machines) - Simple (Log) → Throughput, O(1) - Partitioning → Scalability - Replication → Availability
  • 13. Challenges - Ingestion. Challenge 1 - Wiring to Main App ● May require some changes in the application. Challenge 2 - Failure isolation ● Minimize failures in the application caused by logging
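One way to approach the failure-isolation challenge on slide 13 is to keep logging off the request path entirely. The sketch below is an illustration under assumptions, not something from the talk: `NonBlockingEventLogger` and its `send` callable are hypothetical names, and `send` stands in for whatever actually ships an event (for example, a wrapper around a Kafka producer). Events go into a bounded in-memory queue; if the queue is full or delivery fails, the event is dropped and counted rather than raising into the application.

```python
import queue
import threading


class NonBlockingEventLogger:
    """Hypothetical helper: buffers events and ships them on a background thread."""

    def __init__(self, send, max_buffered=1000):
        self._send = send  # assumed callable that delivers one event
        self._queue = queue.Queue(maxsize=max_buffered)
        self._lock = threading.Lock()
        self.dropped = 0
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def _count_drop(self):
        with self._lock:
            self.dropped += 1

    def log(self, event):
        """Enqueue an event without blocking; never raises into the caller."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self._count_drop()  # shed load instead of slowing the application
            return False

    def _drain(self):
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                self._count_drop()  # a delivery failure stays inside the logger
            finally:
                self._queue.task_done()

    def flush(self):
        """Wait until every buffered event has been attempted (useful in tests)."""
        self._queue.join()
```

The trade-off is deliberate: under pressure the application keeps serving and some events are lost, so the `dropped` counter should be monitored.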
  • 14. Challenges - Processing: Integrity, Dependency, Performance
  • 15. Challenges - Processing. Sample Problem: How many page views came from Indonesia in Aug 2015? ~100PB data if 10kb/datum. Sample Solution: ● Spark/Hadoop for computing ● HDFS for storage, with Avro as the file format ● Oozie for workflow management
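The figures on slides 12 and 15 can be sanity-checked with a few lines of arithmetic. The sketch assumes a 30-day month and reads "10kb" as 10,000 bytes per record; both are assumptions, not statements from the slides. Under them the per-second rate matches slide 12, while one month of records comes to roughly 10 PB, so the ~100 PB on slide 15 would correspond to about ten months of data or to records closer to 100 kB each.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month
PAGE_VIEWS_PER_MONTH = 10**12          # ~1 trillion page views (slide 12)
BYTES_PER_RECORD = 10_000              # assumption: "10kb" read as 10,000 bytes

writes_per_second = PAGE_VIEWS_PER_MONTH / SECONDS_PER_MONTH
monthly_volume_pb = PAGE_VIEWS_PER_MONTH * BYTES_PER_RECORD / 10**15

print(int(writes_per_second))  # 385802
print(monthly_volume_pb)       # 10.0
```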
  • 16. Challenges - Processing. Challenge 1 - Learning Curve ● New way of thinking about processing data: MapReduce ● New technology and operational concerns. Challenge 2 - Putting it All Together ● Incompatible release versions ● Minimal documentation
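To make the "new way of thinking" on slide 16 concrete, here is a minimal single-machine sketch of the MapReduce pattern applied to the page-view question from slide 15. It uses plain Python rather than Hadoop or Spark, and the record fields (`country`, `month`) and function names are hypothetical; a real job would run the same three phases distributed across a cluster.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a ((country, month), 1) pair for every page-view record."""
    for record in records:
        yield (record["country"], record["month"]), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def count_page_views(records):
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    sample = [
        {"country": "ID", "month": "2015-08"},
        {"country": "ID", "month": "2015-08"},
        {"country": "SG", "month": "2015-08"},
    ]
    print(count_page_views(sample)[("ID", "2015-08")])  # 2
```

The shift in thinking is that the map and reduce steps are written as independent functions over keys, which is what lets a framework run them in parallel on different machines.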
  • 17. Challenges - Storage: Efficiency, Performance
  • 18. Challenges - Storage. Sample Problems: 1. We want to get the number of daily page views from Indonesia for the last 7 days 2. We want to retrieve a user’s latest transaction to better personalize search results. Sample Solution: 1. You might need a columnar store for OLAP queries 2. You might need a key-value store, since the data will be retrieved per user id
  • 19. Challenges - Storage. Challenge 1 - Choosing the right storage ● There are many kinds of databases nowadays. Pick wisely to best support your use cases. Challenge 2 - Developing the right model ● Each database has a different way to model data. A relational model might not be appropriate. We need to understand how the database works.
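The two storage problems on slide 18 lead to different data models, which the toy sketch below tries to illustrate. It is an in-memory illustration only, with hypothetical class names, and does not represent any particular database: the columnar layout keeps each field in its own list so an aggregate scans only the columns it needs, while the key-value layout answers a single lookup by user id.

```python
from collections import defaultdict


class ColumnarPageViews:
    """Toy columnar layout: one list per column, suited to scans and aggregates."""

    def __init__(self):
        self.day = []
        self.country = []
        self.views = []

    def append(self, day, country, views):
        self.day.append(day)
        self.country.append(country)
        self.views.append(views)

    def daily_views(self, country, days):
        """Sum views per day for one country, restricted to the given days."""
        wanted = set(days)
        totals = defaultdict(int)
        for day, row_country, views in zip(self.day, self.country, self.views):
            if row_country == country and day in wanted:
                totals[day] += views
        return dict(totals)


class LatestTransactionStore:
    """Toy key-value layout: one entry per user id, suited to point lookups."""

    def __init__(self):
        self._by_user = {}

    def put(self, user_id, transaction):
        self._by_user[user_id] = transaction  # overwrite keeps only the latest

    def get(self, user_id):
        return self._by_user.get(user_id)
```

Neither model serves the other's query well: the columnar layout would need a full scan to find one user, and the key-value layout has no efficient way to aggregate by day and country.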
  • 20. Challenges - Retrieval: Ease of Use, Reusability, Adaptiveness
  • 21. Challenges - Retrieval. Sample Problem: ● We want to visualize the number of daily page views from Indonesia for the last 7 days ● and other problems such as ad hoc queries and reporting. Sample Solution: ● Create a backend service to run queries and an application to visualize the query results
  • 22. Challenges - Retrieval. Challenge 1 - Ease of Use, Reusability ● Ease of use is very important, since retrieval is a user-facing product. Data products have to be reusable and discoverable across data users. Challenge 2 - Adaptiveness ● As there are many kinds of databases now, the query service needs to be extensible and adaptive to enable use of data from various sources.
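The adaptiveness point on slide 22 can be sketched as a query service that depends on a small interface rather than on any one database. Everything named here (`DataSource`, `InMemorySource`, `QueryService`) is hypothetical and only illustrates the shape of the idea: supporting a new store means writing one adapter class, with no change to the service or to its callers.

```python
from typing import Dict, Iterable, Protocol, Tuple


class DataSource(Protocol):
    """Interface each storage adapter is assumed to implement."""

    def query(self, country: str, days: Iterable[str]) -> Dict[str, int]:
        ...


class InMemorySource:
    """Stand-in adapter; a real one would wrap a columnar store, a SQL database, etc."""

    def __init__(self, rows: Iterable[Tuple[str, str, int]]):
        self._rows = list(rows)  # (day, country, views)

    def query(self, country: str, days: Iterable[str]) -> Dict[str, int]:
        wanted = set(days)
        result: Dict[str, int] = {}
        for day, row_country, views in self._rows:
            if row_country == country and day in wanted:
                result[day] = result.get(day, 0) + views
        return result


class QueryService:
    """Routes a query to whichever registered source the caller names."""

    def __init__(self) -> None:
        self._sources: Dict[str, DataSource] = {}

    def register(self, name: str, source: DataSource) -> None:
        self._sources[name] = source

    def daily_page_views(self, source_name: str, country: str, days: Iterable[str]) -> Dict[str, int]:
        if source_name not in self._sources:
            raise KeyError(f"unknown data source: {source_name}")
        return self._sources[source_name].query(country, days)
```

A visualization front end would then call `daily_page_views` without knowing which database answered.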
  • 23. Challenges - Data Management
  • 24. Challenges - Data Management. Challenge 1 - Centralized Metadata ● Manage data in various places, with various schemas (sometimes schemaless). Challenge 2 - Security, Access Control ● Most of these tools are newly developed, and security is usually the last thing we consider.
  • 26. Takeaway Points ● Think critically ○ Be wise, don’t get carried away, do not use something just because it is cool, and make sure you are using what you need. ● Stay curious ○ New technologies arrive every day; one of them might save your day
  • 27. What is it like to be a Data Engineer? ● Exhilarating ○ You are in a critical position, handle big volumes of data, act as the nerve of the company, and have to make sure the pipeline is robust. ● Challenging ○ You have to be a DBA, data architect, big data programmer, software engineer, and data analyst at the same time! ● Fun ○ You always need to learn new technologies and new ways to solve things ● High Demand ○ Data engineering is one of the most in-demand job roles at today’s leading companies.
  • 28. Q&A