Presented by Kriangkrai Chaonithi @spicydog
14/11/2019 | KMUTT | Applied Computer Science
Introduction to Data Engineer and Data Pipeline at Credit OK
Hello! My name is Gap
Education
● BS Applied Computer Science (KMUTT)
● MS Computer Engineering (KMUTT)
Work Experience
● Former Android, iOS & PHP Developer at Longdo.COM
● Former R&D Manager at Insightera
● CTO & co-founder at Credit OK
Fields of Interest
● Software Engineering
● Cloud Architecture & Distributed Computing
● Computer Security
● Machine Learning & NLP
https://spicydog.me
Agenda
● What is Big Data?
○ Why is data big?
○ Structured vs Unstructured Data
● Data Engineering
○ Data technology careers
○ What do data engineers do?
○ Skills for data engineers
○ Knowledge & technologies for data engineers
● What is Data Pipeline?
○ ETL - Extract, Transform, Load
○ Batch vs streaming
● Data Pipeline at Credit OK
○ Introduction to GCP technologies
○ Problems and solutions in the data pipeline
○ Data pipeline architecture in detail
● Summary
https://medium.com/@smartrac/the-deep-web-the-dark-web-and-simple-things-2e601ec980ac
What is Big Data?
https://unsplash.com/photos/LqKhnDzSF-8
Why is data big?
● Faster internet and better infrastructure
● Business digitization
● Social network
● IoT & embedded systems
● Automated software
● Etc.
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/KiH2-tdGQRY
Structured vs. Unstructured Data
https://unsplash.com/photos/QBpZGqEMsKg
https://towardsdatascience.com/data-engineering-101-for-dummies-like-me-cf6b9e89c2b4
Data Engineering
https://towardsdatascience.com/data-engineering-101-for-dummies-like-me-cf6b9e89c2b4
Data Technology Careers
https://unsplash.com/photos/QBpZGqEMsKg
https://www.springboard.com/blog/data-science-career-paths-different-roles-industry/
What do Data Engineers do?
https://medium.com/@info_46914/data-engineer-บุคคลที่องคกรไมควรมองขาม-e863b37af79
Skills for Data Engineers
● Data Architecture
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/Z9AU36chmQI
Skills for Data Engineers
● Data Architecture
○ File Storage Architecture
■ Local Storage
■ Network Attached Storage
■ Object Storage
○ Databases Architecture
○ Data Warehouse
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/Z9AU36chmQI
Skills for Data Engineers
● Data Architecture
○ File Storage Architecture
○ Databases Architecture
■ SQL (RDBMS)
■ NoSQL
○ Data Warehouse
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/Z9AU36chmQI
Skills for Data Engineers
● Data Architecture
○ File Storage Architecture
○ Databases Architecture
■ SQL (RDBMS)
■ NoSQL
● Document-oriented Database
● Columnar Database
● Graph Database
● Key-value Database
○ Data Warehouse
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
○ Data Ingestion
○ Data Cleaning
○ Data Manipulation & Data Pipeline
○ Crontab (Task Scheduler)
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/Z9AU36chmQI
Skills for Data Engineers
● Data Architecture
● Cloud Computing and Infrastructure
● Programming on Data Manipulation
○ Data Ingestion
○ Data Cleaning
○ Data Manipulation & Data Pipeline
○ Task Scheduler (Crontab)
https://unsplash.com/photos/QBpZGqEMsKg
https://unsplash.com/photos/Z9AU36chmQI
What is Data Pipeline?
https://unsplash.com/photos/9AxFJaNySB8
ETL - Extract, Transform, Load
https://unsplash.com/photos/QBpZGqEMsKg
https://www.astera.com/type/blog/etl-process-and-steps/
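To make the three ETL stages concrete, here is a minimal sketch in Python. It is only an illustration: the column names (customer, amount), the sample data, and the list that stands in for a destination table are all assumptions, not part of the Credit OK pipeline.

```python
import csv
import io


def extract(raw_csv: str) -> list[dict]:
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows: list[dict]) -> list[dict]:
    """Transform: trim whitespace, drop rows without an amount, cast types."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append({"customer": row["customer"].strip(), "amount": float(amount)})
    return cleaned


def load(rows: list[dict], table: list[dict]) -> int:
    """Load: append rows to the destination (a list stands in for a table)."""
    table.extend(rows)
    return len(rows)


# Hypothetical sample input, for illustration only.
RAW = "customer,amount\n alice ,10.5\nbob,\ncarol, 3\n"
warehouse: list[dict] = []
loaded = load(transform(extract(RAW)), warehouse)
print(loaded, warehouse)
```

Keeping each stage as its own function makes it easier to test, schedule, and log the stages separately.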
Batch vs Streaming Processing
https://unsplash.com/photos/QBpZGqEMsKg
Batch                         | Streaming
Multiple-record processing    | Per-record processing
Scheduled / manual            | Real-time
Longer processing time        | Shorter processing time
Large-window data processing  | Small-window data processing
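The contrast above can be sketched with the same calculation done both ways. This is a toy example under assumed data (three numeric readings): the batch version waits for the whole window and computes once, while the streaming version updates its answer after every record.

```python
from typing import Iterable, Iterator


def process_batch(records: list[float]) -> float:
    """Batch: wait for the whole window, then compute the average once."""
    return sum(records) / len(records)


def process_stream(records: Iterable[float]) -> Iterator[float]:
    """Streaming: update the running average as each record arrives."""
    total, count = 0.0, 0
    for value in records:
        total += value
        count += 1
        yield total / count


# Hypothetical readings, for illustration only.
readings = [10.0, 20.0, 30.0]
batch_result = process_batch(readings)
stream_results = list(process_stream(readings))
print(batch_result, stream_results)
```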
Credit Scoring Platform on Big Data Analytics
creditok.co
GCP Storage & Databases
Non-serverless
Serverless
GCP Data Analytics
Pipeline Analytics Visualization
Why do we use serverless for big data?
● No server maintenance
● Scalable & high performance
● Easier to optimize
● Pay only for what you use
Requirements
● Have a HUGE data warehouse for batch processing
● Our customers have on-premise data at >400 sites
● A data ingestor app must be installed at every site
● Data ingestor app must be able to run on
● Data ingestor app must be super robust and easy to install
● Must run automatically every day via a task scheduler
When >400 sites upload large files to your server at the same time...
This is kind of a DDoS!
We use Cloud Functions
● Auto scale
● Almost zero maintenance!
● But they only accept request bodies under 10 MB
For larger files, we use
Google Cloud Run
Google Kubernetes Engine
Google Compute Engine
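One way an ingestor app could route uploads by size is sketched below. This is an assumption-heavy illustration: both endpoint URLs are hypothetical placeholders, the exact byte threshold is one reading of the "<10 MB" limit, and the slides do not say which of the services above actually receives the larger files.

```python
# Assumed reading of the "<10 MB" request body limit, in bytes.
MAX_FUNCTION_BODY_BYTES = 10 * 1024 * 1024

# Hypothetical endpoints, for illustration only.
FUNCTION_URL = "https://example.com/upload-small"
LARGE_FILE_URL = "https://example.com/upload-large"


def choose_upload_url(size_bytes: int) -> str:
    """Send small files to the function endpoint and the rest to the large-file service."""
    if size_bytes < 0:
        raise ValueError("size_bytes must not be negative")
    if size_bytes < MAX_FUNCTION_BODY_BYTES:
        return FUNCTION_URL
    return LARGE_FILE_URL


print(choose_upload_url(2 * 1024 * 1024))
print(choose_upload_url(50 * 1024 * 1024))
```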
Raw Data
Source
Raw Data
Source
Data Pipeline Architecture
GCF - Receive zipped file data over HTTPS
GCF - Save zipped file data to the GCS INPUT bucket
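The receive-and-save step could look roughly like this. It is a sketch, not the actual function: a plain dict stands in for the GCS INPUT bucket so the example runs without cloud credentials, and the site ID and the per-site, per-day object naming scheme are assumptions.

```python
import io
import zipfile
from datetime import datetime, timezone


def handle_upload(site_id: str, body: bytes, bucket: dict, now: datetime) -> str:
    """Validate a zipped upload and store it under a per-site, per-day object name."""
    if not zipfile.is_zipfile(io.BytesIO(body)):
        raise ValueError("upload is not a valid zip archive")
    # Hypothetical naming scheme: <site>/<date>/<time>.zip
    object_name = f"{site_id}/{now:%Y-%m-%d}/{now:%H%M%S}.zip"
    bucket[object_name] = body  # a dict stands in for the GCS INPUT bucket
    return object_name


# Build a small zip archive in memory to play the role of an uploaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sales.csv", "id,amount\n1,100\n")

input_bucket: dict = {}
name = handle_upload(
    "site-001",
    buffer.getvalue(),
    input_bucket,
    datetime(2019, 11, 14, 8, 30, 0, tzinfo=timezone.utc),
)
print(name)
```

Rejecting non-zip bodies at the door keeps obviously broken uploads out of the INPUT bucket, so the cleansing step only sees archives it can open.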
GCS - Automatically trigger GCF when a zipped file is put into the INPUT bucket
GCF - (data cleansing) Process text encoding (TIS-620, UTF-8)
GCF - (data cleansing) Check and clean the CSV format, producing the best possible result
GCF - Save the output CSV to the GCS OUTPUT bucket
GCF - Log all the results for file ingestion reports
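The two cleansing steps above can be sketched with the Python standard library. This is one possible approach, not the production code: the "try UTF-8, then fall back to TIS-620" order and the specific cleaning rules (trim cells, drop blank rows, pad short rows) are assumptions, and the Thai sample data is made up.

```python
import csv
import io


def decode_text(raw: bytes) -> str:
    """Try UTF-8 first; fall back to TIS-620 for legacy Thai files."""
    try:
        return raw.decode("utf-8-sig")  # also strips a UTF-8 byte-order mark
    except UnicodeDecodeError:
        return raw.decode("tis_620")


def clean_csv(text: str) -> str:
    """Trim cells, drop blank rows, and pad short rows to the header width."""
    rows = [[cell.strip() for cell in row] for row in csv.reader(io.StringIO(text))]
    rows = [row for row in rows if any(row)]
    if not rows:
        return ""
    width = len(rows[0])
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow(row + [""] * (width - len(row)))
    return out.getvalue()


# Hypothetical legacy file: TIS-620 bytes, stray spaces, a blank line, a short row.
legacy = "ชื่อ,ยอด\nสมชาย , 100\n\nสมหญิง\n".encode("tis_620")
cleaned = clean_csv(decode_text(legacy))
print(cleaned)
```

Trying UTF-8 first works because TIS-620 Thai text is almost never valid UTF-8, so the strict decode fails quickly and the fallback takes over.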
Cron - Run periodically to load CSV data from the OUTPUT bucket
GBQ - Load data from the OUTPUT bucket into the RAW STAGING table as strings
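Loading everything as strings means the load never fails on a bad value; typing is deferred to the cleansing SQL. A small sketch of generating such an all-STRING schema from a CSV header follows. The name/type/mode layout mirrors the JSON schema format BigQuery accepts, but the column names here are hypothetical and the helper itself is not from the talk.

```python
import csv
import io
import json


def string_schema(csv_text: str) -> list[dict]:
    """Build a BigQuery-style schema where every CSV column is a nullable STRING."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [{"name": name.strip(), "type": "STRING", "mode": "NULLABLE"} for name in header]


# Hypothetical CSV from the OUTPUT bucket, for illustration only.
schema = string_schema("customer_id,amount,paid_at\n1,100,2019-11-14\n")
print(json.dumps(schema, indent=2))
```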
GBQ - Cron runs data cleansing SQL from the RAW STAGING table into the CLEANED STAGING table
GBQ - Cron runs SQL to append data from the CLEANED STAGING table to the MAIN table
GBQ - Cron runs data processing SQL from the MAIN table through intermediate tables until the FINAL tables are ready
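The staged SQL flow can be tried locally with SQLite as a stand-in for BigQuery. Everything in this sketch is an assumption made for illustration: the table and column names, the sample rows, and the cleansing rule (keep rows whose amount starts with a digit). BigQuery SQL differs in syntax and functions, so treat this only as the shape of the three steps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_staging (customer_id TEXT, amount TEXT);
    CREATE TABLE cleaned_staging (customer_id INTEGER, amount REAL);
    CREATE TABLE main_table (customer_id INTEGER, amount REAL);
    INSERT INTO raw_staging VALUES (' 1', '100.5'), ('2', 'n/a'), ('3 ', ' 20');
    """
)

# Step 1: cleanse RAW STAGING into CLEANED STAGING (trim, cast, drop bad rows).
conn.execute(
    """
    INSERT INTO cleaned_staging
    SELECT CAST(TRIM(customer_id) AS INTEGER), CAST(TRIM(amount) AS REAL)
    FROM raw_staging
    WHERE TRIM(amount) GLOB '[0-9]*'
    """
)

# Step 2: append CLEANED STAGING to MAIN.
conn.execute("INSERT INTO main_table SELECT * FROM cleaned_staging")

# Step 3: derive a FINAL table from MAIN.
conn.execute(
    "CREATE TABLE final_totals AS "
    "SELECT COUNT(*) AS n, SUM(amount) AS total FROM main_table"
)
n, total = conn.execute("SELECT n, total FROM final_totals").fetchone()
print(n, total)
```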
Frequently Used Data
Lumen - Cron job dumps frequently used data from the FINAL tables into a real-time database
Laravel - Load data from Lumen's real-time database via an internal REST API
Vue - Use data processed by Laravel
Rarely Used Data
Lumen - Load data directly from GBQ
Laravel - Load and process data from Lumen
Vue - Use data processed by Laravel
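The split between frequently and rarely used data can be sketched as a small routing class. The actual services are written with Lumen and Laravel; this Python version only illustrates the idea, with dicts standing in for the FINAL tables and the real-time database, and with made-up keys.

```python
class DataService:
    """Serve frequently used keys from a fast store; query the warehouse for the rest."""

    def __init__(self, warehouse: dict, frequent_keys: set):
        self.warehouse = warehouse  # stands in for the FINAL tables in the warehouse
        self.frequent_keys = frequent_keys
        self.realtime_db: dict = {}  # stands in for the real-time database

    def refresh(self) -> None:
        """Cron-style job: copy frequently used data into the real-time database."""
        self.realtime_db = {
            key: self.warehouse[key]
            for key in self.frequent_keys
            if key in self.warehouse
        }

    def get(self, key: str):
        """Return (value, source) so callers can see which path served the request."""
        if key in self.realtime_db:
            return self.realtime_db[key], "realtime"
        return self.warehouse[key], "warehouse"


# Hypothetical keys and values, for illustration only.
service = DataService({"score:1": 720, "history:1": [1, 2]}, {"score:1"})
service.refresh()
print(service.get("score:1"))
print(service.get("history:1"))
```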
Summary
● Big data is possible because of technological advancement
● Storing and processing big data requires special technology and knowledge
● Data engineers are the geeks who work on processing data for the team
● A data pipeline is all about automating data processing
● Understanding the data you are going to process is crucial
● Don’t forget to log the data pipeline to a monitoring system
● Data engineers are in high demand in Thailand: we have dirty data, we have data scientists, but we have no one to process the data => data scientists do everything! THAT’S WRONG!
Data engineers are needed
Question & Answer
Time is short, let’s utilize the networks.
Feel free to connect with me via spicydog.me

From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Introduction to Data Engineer and Data Pipeline at Credit OK

  • 1. Presented by Kriangkrai Chaonithi @spicydog 14/11/2019 | KMUTT | Applied Computer Science Introduction to Data Engineer and Data Pipeline at
  • 2. Hello! My name is Gap Education ● BS Applied Computer Science (KMUTT) ● MS Computer Engineering (KMUTT) Work Experience ● Former Android, iOS & PHP Developer at Longdo.COM ● Former R&D Manager at Insightera ● CTO & co-founder at Credit OK Fields of Interest ● Software Engineering ● Cloud Architecture & Distributed Computing ● Computer Security ● Machine Learning & NLP https://spicydog.me
  • 3. Agenda ● What is Big Data? ○ Why is data big? ○ Structured vs Unstructured Data ● Data Engineering ○ Data technology careers ○ What do data engineers do? ○ Skills for data engineers ○ Knowledge & technologies for data engineers ● What is Data Pipeline? ○ ETL - Extract, Transform, Load ○ Batch vs streaming ● Data Pipeline at Credit OK ○ Introduction to GCP technologies ○ Problems and solutions in the data pipeline ○ Data pipeline architecture in detail ● Summary
  • 5. What is Big Data? https://unsplash.com/photos/LqKhnDzSF-8
  • 6. Why is data big? ● Faster internet, better infrastructure ● Business digitization ● Social networks ● IoT & embedded systems ● Automated software ● Etc. https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/KiH2-tdGQRY
  • 7. Structured vs. Unstructured Data https://unsplash.com/photos/QBpZGqEMsKg https://towardsdatascience.com/data-engineering-101-for-dummies-like-me-cf6b9e89c2b4
  • 10. What do Data Engineers do? https://medium.com/@info_46914/data-engineer-บุคคลที่องคกรไมควรมองขาม-e863b37af79
  • 11. Skills for Data Engineers ● Data Architecture ● Cloud Computing and Infrastructure ● Programming on Data Manipulation https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/Z9AU36chmQI
  • 12. Skills for Data Engineers ● Data Architecture ○ File Storage Architecture ■ Local Storage ■ Network Attached Storage ■ Object Storage ○ Databases Architecture ○ Data Warehouse ● Cloud Computing and Infrastructure ● Programming on Data Manipulation https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/Z9AU36chmQI
  • 13. Skills for Data Engineers ● Data Architecture ○ File Storage Architecture ○ Databases Architecture ■ SQL (RDBMS) ■ NoSQL ○ Data Warehouse ● Cloud Computing and Infrastructure ● Programming on Data Manipulation https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/Z9AU36chmQI
  • 14. Skills for Data Engineers ● Data Architecture ○ File Storage Architecture ○ Databases Architecture ■ SQL (RDBMS) ■ NoSQL ● Document-oriented Database ● Columnar Database ● Graph Database ● Key-value Database ○ Data Warehouse ● Cloud Computing and Infrastructure ● Programming on Data Manipulation ○ Data Ingestion ○ Data Cleaning ○ Data Manipulation & Data Pipeline ○ Crontab (Task Scheduler) https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/Z9AU36chmQI
  • 16. Skills for Data Engineers ● Data Architecture ● Cloud Computing and Infrastructure ● Programming on Data Manipulation
  • 17. Skills for Data Engineers ● Data Architecture ● Cloud Computing and Infrastructure ● Programming on Data Manipulation ○ Data Ingestion ○ Data Cleaning ○ Data Manipulation & Data Pipeline ○ Task Scheduler (Crontab) https://unsplash.com/photos/QBpZGqEMsKg https://unsplash.com/photos/Z9AU36chmQI
  • 18. What is Data Pipeline? https://unsplash.com/photos/9AxFJaNySB8
  • 19. ETL - Extract, Transform, Load https://unsplash.com/photos/QBpZGqEMsKg https://www.astera.com/type/blog/etl-process-and-steps/
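
The three ETL stages named on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the talk: the column names, the sample rows, and the in-memory list standing in for the target store are all assumptions.

```python
import csv
import io

# Hypothetical raw input; the column names are assumptions for illustration.
RAW_CSV = "customer_id,amount\n 1 , 100.50\n2,not-a-number\n3,42\n"


def extract(text):
    """Extract: parse raw CSV text into dictionaries, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, cast types, and drop rows that fail to parse."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({
                "customer_id": int(row["customer_id"].strip()),
                "amount": float(row["amount"].strip()),
            })
        except ValueError:
            continue  # a real pipeline would log the rejected row
    return cleaned


def load(rows, target):
    """Load: append the cleaned rows to the target store (a plain list here)."""
    target.extend(rows)
    return len(rows)


warehouse = []
loaded = load(transform(extract(RAW_CSV)), warehouse)
```

Keeping each stage as its own function makes it possible to rerun or test one stage without the others, which is the main practical reason the three steps are usually kept separate.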
  • 20. Batch vs Streaming Processing ● Batch: multiple-record processing, scheduled / manual, longer processing time, large-window data processing ● Streaming: per-record processing, real-time, shorter processing time, small-window data processing https://unsplash.com/photos/QBpZGqEMsKg
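
To make the contrast concrete, here is a small sketch that computes the same statistic both ways. The function names and sample readings are made up for illustration; the batch version sees the whole window at once, while the streaming version updates its result after every record.

```python
from typing import Iterable, Iterator, List


def process_batch(records: List[float]) -> float:
    """Batch: the whole window is available, so compute the mean in one pass."""
    return sum(records) / len(records)


def process_stream(records: Iterable[float]) -> Iterator[float]:
    """Streaming: emit an updated running mean as each record arrives."""
    count = 0
    total = 0.0
    for value in records:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0]
batch_mean = process_batch(readings)            # one result for the window
running_means = list(process_stream(readings))  # one result per record
```

The streaming version only keeps a count and a total in memory, which is why per-record processing suits small windows and low latency, whereas the batch version needs the full data set before it can answer.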
  • 21. Credit Scoring Platform on Big Data Analytics creditok.co
  • 22.
  • 23. GCP Storage & Databases ● Non-serverless ● Serverless
  • 24. GCP Data Analytics Pipeline Analytics Visualization
  • 25.
  • 26. Why do we use serverless on big data? ● No server maintenance ● Scalable & high performance ● Easier to optimize ● Only pay per use
  • 27. Requirements ● Have a HUGE data warehouse for batch processing ● Our customers have on-premise data at >400 sites ● A data ingestor app must be installed at every site ● Data ingestor app must be able to run on ● Data ingestor app must be super robust and easy to install ● Must run automatically every day via a task scheduler
  • 28. When >400 sites upload large files to your server at the same time... this is kind of a DDoS!
  • 29. We use Cloud Functions ● Auto scale ● Almost zero maintenance! ● But they only accept a <10 MB body size ● For larger files, we use Google Cloud Run, Google Kubernetes Engine, Google Compute Engine
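
One way an ingestor could act on that size limit is to pick the upload target from the file size before sending anything. The sketch below is an assumption about how such routing might look, not the approach described in the talk: the target labels are hypothetical, and treating "10 MB" as 10 × 1024 × 1024 bytes is also an assumption.

```python
# "<10 MB" comes from the slide; interpreting it as MiB is an assumption.
MAX_FUNCTION_BODY_BYTES = 10 * 1024 * 1024


def choose_upload_target(size_bytes: int) -> str:
    """Return a hypothetical target label for a file of the given size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < MAX_FUNCTION_BODY_BYTES:
        return "cloud-function"      # small bodies fit within the limit
    return "container-service"       # larger files go to a container or VM backend


small = choose_upload_target(2 * 1024 * 1024)
large = choose_upload_target(50 * 1024 * 1024)
```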
  • 30.
  • 31. Data Pipeline Architecture ● Raw Data Source
  • 32. ● GCF - Load zipped file data via the HTTPS protocol ● GCF - Save zipped file data to the GCS INPUT bucket
  • 33. ● GCS - Automatically triggers GCF when a zipped file is put into the INPUT bucket ● GCF - (data cleansing) Process text encoding (TIS-620, UTF-8) ● GCF - (data cleansing) Check and clean the CSV format, making it as well-formed as possible ● GCF - Save the output CSV to the GCS OUTPUT bucket ● GCF - Log all results for file ingestion reports
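
The two cleansing steps on this slide (text encoding, then CSV format) might look roughly like the sketch below. It is an illustration under assumptions rather than the code used at Credit OK: trying UTF-8 first and falling back to TIS-620 is a heuristic, and the specific cleaning rules (trim cells, skip blank lines, pad or truncate to a fixed column count) are made up for the example.

```python
import csv
import io


def decode_text(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to the Thai TIS-620 encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("tis_620")


def clean_csv(text: str, expected_columns: int) -> str:
    """Trim cells, skip blank lines, and pad or truncate rows to a fixed width."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        cells = [cell.strip() for cell in row][:expected_columns]
        cells += [""] * (expected_columns - len(cells))
        writer.writerow(cells)
    return out.getvalue()


# Hypothetical TIS-620 input with stray spaces, a blank line, and a short row.
raw = "ชื่อ,ยอด\nสมชาย, 100 \n\nสมหญิง\n".encode("tis_620")
cleaned = clean_csv(decode_text(raw), expected_columns=2)
```

Normalizing every file to UTF-8 with a fixed column count at this stage means the later load step only has to handle one predictable format.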
  • 34. ● Cron - Runs periodically to load CSV data from the OUTPUT bucket ● GBQ - Load data from the OUTPUT bucket into the RAW STAGING table in string format
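
Loading everything as strings presumably keeps a malformed value from failing the whole load, leaving type parsing to the later SQL cleansing step. As a small sketch of that idea, the helper below derives an all-STRING schema from a CSV header, using the name/type/mode field layout of BigQuery JSON schema definitions. The helper name and the sample header are hypothetical, and the actual load call is left out.

```python
import csv
import io
import json


def string_schema(csv_text: str) -> list:
    """Build an all-STRING, NULLABLE schema from the first row of a CSV."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [
        {"name": column.strip(), "type": "STRING", "mode": "NULLABLE"}
        for column in header
    ]


# Hypothetical CSV; only the header row matters for the schema.
sample = "customer_id,amount,created_at\n1,100.5,2019-11-14\n"
schema = string_schema(sample)
schema_json = json.dumps(schema, indent=2)  # could be saved as a schema file
```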
  • 35. ● GBQ - Cron runs data cleansing SQL from the RAW STAGING table to the CLEANED STAGING table ● GBQ - Cron runs SQL to append data from the CLEANED STAGING table to the MAIN table ● GBQ - Cron runs data processing SQL tasks from the MAIN table to other tables until the FINAL tables are ready
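
The staging pattern on this slide can be tried locally with SQLite as a stand-in for BigQuery. This is only a sketch under assumptions: the table and column names are invented, the SQL is SQLite dialect (BigQuery would differ, for example by using `SAFE_CAST`), and the cron scheduling is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_staging (customer_id TEXT, amount TEXT);       -- all strings
CREATE TABLE cleaned_staging (customer_id INTEGER, amount REAL);
CREATE TABLE main (customer_id INTEGER, amount REAL);
INSERT INTO raw_staging VALUES (' 1 ', '100.5'), ('2', 'bad'), ('3', '42');
""")

# Step 1: RAW STAGING -> CLEANED STAGING. Trim, cast, and keep only rows
# whose amount consists of digits and dots.
conn.execute("""
INSERT INTO cleaned_staging
SELECT CAST(TRIM(customer_id) AS INTEGER), CAST(TRIM(amount) AS REAL)
FROM raw_staging
WHERE TRIM(amount) GLOB '[0-9]*' AND TRIM(amount) NOT GLOB '*[^0-9.]*'
""")

# Step 2: CLEANED STAGING -> MAIN. Append, then empty both staging tables
# so the next scheduled run starts from a clean slate.
conn.execute("INSERT INTO main SELECT customer_id, amount FROM cleaned_staging")
conn.execute("DELETE FROM cleaned_staging")
conn.execute("DELETE FROM raw_staging")
conn.commit()

rows = conn.execute(
    "SELECT customer_id, amount FROM main ORDER BY customer_id"
).fetchall()
```

Because the raw table stays untyped, a bad value such as `'bad'` is filtered out in step 1 instead of aborting the load, and the MAIN table only ever receives typed rows.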
  • 36. Frequently Used Data: ● Lumen - Cron dumps frequently used data from the FINAL tables to a real-time database ● Laravel - Load data from Lumen's real-time database via an internal REST API ● Vue - Use data processed by Laravel Rarely Used Data: ● Lumen - Load data from BQ directly ● Laravel - Load and process data from Lumen ● Vue - Use data processed by Laravel
  • 37. Summary ● Big data is possible because of technological advancement ● Storing and processing big data requires special technology and knowledge ● Data engineers are the geeks who process data for the team ● A data pipeline is all about automating data processing ● Understanding the data you are going to process is crucial ● Don’t forget to log the data pipeline to a monitoring system ● Data engineers are in high demand in Thailand: we have dirty data and we have data scientists, but we have no one to process the data, so data scientists do everything. THAT’S WRONG! Data engineers are needed
  • 39. Time is short, let’s utilize the networks. Feel free to connect with me via spicydog.me