2. Agenda
1. Bring the data together and then transform
2. Data modeling for pragmatism
3. Let the datasets talk to each other
4. Beef up sandbox and backfill
5. Performance tuning
6. Examples, questions and comments
3. ELT and RIT
● ELT =
○ Extract from Source
○ Load to HDFS/MPP
○ Transform and Integrate inside Hadoop/MPP
● RIT =
○ Replicate message/log to Queue (JMS/Kafka/CDC)
○ Stream from Queue and Ingest to HDFS/MPP
○ Transform and Integrate inside Hadoop/MPP
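A minimal sketch of the ELT ordering defined above, using Python's sqlite3 module as a stand-in for the Hadoop/MPP target; the table and column names (raw_orders, orders_clean) are illustrative, not from any real system:

```python
# Minimal ELT sketch: extract from the source, land the data raw in the
# target, then transform/cleanse inside the target with SQL.
import sqlite3

# -- Extract: rows pulled from the source system (hard-coded here) --
source_rows = [
    ("1001", "2014-06-01", " 25.00"),
    ("1002", "2014-06-01", "105.50"),
    ("1003", None,         " 17.25"),   # dirty record: missing date
]

target = sqlite3.connect(":memory:")

# -- Load: land the data raw, untyped, into a queryable staging area --
target.execute("CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount TEXT)")
target.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# -- Transform: cleanse and type the data inside the target, not the source --
target.execute("""
    CREATE TABLE orders_clean AS
    SELECT CAST(order_id AS INTEGER)   AS order_id,
           order_date,
           CAST(TRIM(amount) AS REAL)  AS amount
    FROM raw_orders
    WHERE order_date IS NOT NULL
""")

print(target.execute("SELECT * FROM orders_clean").fetchall())
```

The point is the order of operations: the data lands raw and queryable first, and all cleansing and typing happens inside the target system.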
4. Why ELT and RIT (instead of ETL)
● Store related raw data together so it can be leveraged more effectively downstream
● A big, queryable staging area is quite useful
● Reduce workload impact on source systems (see the RIT sketch after this list)
● Write data cleansing and business logic in the same languages/scripting used by BI
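The RIT half of the pattern is what helps most with the workload bullet: changes are replicated once to a queue instead of being repeatedly re-extracted from the source. Below is a minimal sketch with Python's queue.Queue standing in for JMS/Kafka/CDC and a plain list standing in for the HDFS/MPP staging area; the event fields are illustrative:

```python
# Minimal RIT sketch: replicate change events to a queue, stream/ingest them
# into a raw staging area, then transform/integrate downstream.
import json
import queue

events = queue.Queue()          # stand-in for the message queue
staging = []                    # stand-in for the raw staging area

# -- Replicate: the source (or a CDC agent) publishes changes as messages --
for change in [{"order_id": 1001, "status": "NEW"},
               {"order_id": 1001, "status": "SHIPPED"}]:
    events.put(json.dumps(change))

# -- Stream & Ingest: consume messages and land them untouched --
while not events.empty():
    staging.append(json.loads(events.get()))

# -- Transform & Integrate: derive the latest status per order downstream --
latest = {}
for row in staging:
    latest[row["order_id"]] = row["status"]
print(latest)   # {1001: 'SHIPPED'}
```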
5. Is Data Modeling still Important
● Do we still need to model the data in the era
of NoSQL and Big Data?
● Should we de-normalize/pre-join everything?
● Should we use hierarchical JSON/XML and/or key-value pairs for everything?
● It is a balance and trade-off: analytics, reusability, metadata-driven design, and data size vs. ease of querying (see the sketch after this list)
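A tiny illustration of the denormalize-vs-model trade-off raised on this slide, in plain Python; the customer and order fields are made up for the example:

```python
# The same orders held as a pre-joined, denormalized document vs. two
# normalized tables.

# Denormalized / document style: easy to read in one shot, no JOIN needed,
# but the customer attributes are duplicated on every order.
orders_doc = [
    {"order_id": 1, "amount": 25.0,
     "customer": {"id": 7, "name": "Acme", "segment": "Enterprise"}},
    {"order_id": 2, "amount": 99.0,
     "customer": {"id": 7, "name": "Acme", "segment": "Enterprise"}},
]

# Normalized / relational style: no duplication, one place to correct the
# segment, but a join (here a dict lookup) is required at query time.
customers = {7: {"name": "Acme", "segment": "Enterprise"}}
orders = [{"order_id": 1, "customer_id": 7, "amount": 25.0},
          {"order_id": 2, "customer_id": 7, "amount": 99.0}]

# "Join" at read time to rebuild the same view the document had.
joined = [{**o, **customers[o["customer_id"]]} for o in orders]
print(joined)
```

The pre-joined document is easy to query but repeats the customer on every order; the normalized form avoids the duplication at the cost of a join.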
6. Is Data Modeling still Important
● Cluster all attributes and child objects into a tree structure (thinking the NoSQL way)
● Can we live without the JOIN operator?
● Are mutable/updatable datasets still useful?
● Why not snapshot everything? Why SCD2? (see the SCD Type 2 sketch after this list)
● Is the relational model outdated?
● Model it at the source or fix it in the report?
● All or none: index, hash, or full scan?
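To make the SCD2-vs-snapshot question concrete, here is a minimal Slowly Changing Dimension Type 2 sketch in plain Python; the column names (valid_from, is_current, etc.) are illustrative:

```python
# Apply a change to a dimension as an SCD Type 2 update: close the current
# row and add a new one, instead of re-snapshotting the whole table.
from datetime import date

dim_customer = [
    # one current row per customer; history kept via effective dates
    {"customer_id": 7, "segment": "SMB",
     "valid_from": date(2013, 1, 1), "valid_to": None, "is_current": True},
]

def scd2_update(dim, customer_id, new_segment, as_of):
    """Close the current row and append a new current row (SCD Type 2)."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = as_of
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "segment": new_segment,
                "valid_from": as_of, "valid_to": None, "is_current": True})

scd2_update(dim_customer, 7, "Enterprise", date(2014, 6, 1))
for row in dim_customer:
    print(row)
```

History is kept as extra rows with effective dates rather than by snapshotting the entire table every day.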
7. Integration Brings the True Value
● Like the idea of SOA, but be careful with DQ
● Data producers are loosely coupled for the sake of scalability
● Integration and cross-referencing are deferred to the DW/BI layer, yet someone has to do it
● How can a unique identifier help here?
● Replicate dim/ref/lkp data and federate tx/event data
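A small sketch of the last bullet in plain Python: the lookup data is replicated to every system, the event data stays with its producer, and the federation joins on a shared unique identifier; all names are illustrative:

```python
# Replicate dim/ref/lkp, federate tx/event, cross-reference by a shared key.

# Replicated reference data: an identical copy lives in each system.
product_lkp = {"P1": "Widget", "P2": "Gadget"}

# Events kept in two loosely coupled producer systems.
web_events = [{"order_id": "A-1", "product_id": "P1", "amount": 25.0}]
erp_events = [{"order_id": "A-1", "product_id": "P1", "shipped": True}]

# Federate: join across systems on the shared unique identifier (order_id),
# enriching with the replicated lookup instead of calling back to the source.
by_id = {e["order_id"]: dict(e) for e in web_events}
for e in erp_events:
    by_id.setdefault(e["order_id"], {}).update(e)
for order in by_id.values():
    order["product_name"] = product_lkp.get(order.get("product_id"), "UNKNOWN")
print(by_id)
```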
8. Profile, Prototype, Deploy, Backfill
● Profile the data to understand it (see the profiling sketch after this list)
● Prototype with real data
● Enrich and harden the derived data in a sandbox before deploying to production
● Be ready to backfill the data because it will happen (easier to produce or to consume?)
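A minimal profiling sketch in plain Python, computing row count, null rate, and distinct-value count per column; the sample rows are made up for the example:

```python
# Profile a small dataset: per-column null rate and distinct count.
from collections import defaultdict

rows = [
    {"order_id": 1, "country": "US", "amount": 25.0},
    {"order_id": 2, "country": None, "amount": 99.0},
    {"order_id": 3, "country": "US", "amount": None},
]

def profile(rows):
    stats = defaultdict(lambda: {"nulls": 0, "values": set()})
    for row in rows:
        for col, val in row.items():
            if val is None:
                stats[col]["nulls"] += 1
            else:
                stats[col]["values"].add(val)
    for col, s in stats.items():
        print(f"{col}: rows={len(rows)} "
              f"null_rate={s['nulls'] / len(rows):.0%} "
              f"distinct={len(s['values'])}")

profile(rows)
```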
9. Self-service is Cool, but
● Strong automation tools must be built first
● Software can monitor and throttle workloads (see the admission-check sketch after this list)
● If a user’s job gets killed, there needs to be
enough info/clue to explain why and how to
(try to) fix it
● Education and knowledge sharing are essential; are wiki pages/runbooks good enough?
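One possible shape for the monitor-and-throttle idea, sketched in plain Python: an admission check that rejects a job with a message explaining why and what to try next. The limits and job fields are made up for the example:

```python
# Tiny admission check for self-service jobs: accept, or reject with enough
# context for the user to understand the reason and a next step to try.

LIMITS = {"max_runtime_min": 60, "max_scan_gb": 500}   # illustrative limits

def admit(job):
    """Return (accepted, message) for a submitted job description."""
    if job["estimated_runtime_min"] > LIMITS["max_runtime_min"]:
        return False, (f"Job '{job['name']}' rejected: estimated runtime "
                       f"{job['estimated_runtime_min']} min exceeds the "
                       f"{LIMITS['max_runtime_min']} min limit. "
                       "Try narrowing the date range or adding a partition filter.")
    if job["estimated_scan_gb"] > LIMITS["max_scan_gb"]:
        return False, (f"Job '{job['name']}' rejected: would scan "
                       f"{job['estimated_scan_gb']} GB (limit "
                       f"{LIMITS['max_scan_gb']} GB). "
                       "Try selecting fewer columns or a pre-aggregated table.")
    return True, f"Job '{job['name']}' accepted."

ok, msg = admit({"name": "adhoc_clicks", "estimated_runtime_min": 90,
                 "estimated_scan_gb": 120})
print(msg)
```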
10. Performance Matters
● Good instrumentation/logging will pay off big time in performance tuning (see the sketch after this list)
● Can we run complex OLAP reports on top of operational metadata?
● Where is tuning needed the most?
● Swapping (spill to disk) can be a big issue
● Detect and kill the bad jobs early
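A minimal instrumentation sketch in plain Python: a decorator records a timing row per pipeline step into an operational-metadata list, which tuning reports or an early bad-job detector could later query. The 1-second threshold and step name are illustrative:

```python
# Record per-step timings as operational metadata and warn on slow steps.
import time
from functools import wraps

op_metadata = []            # stand-in for an operational-metadata table

def instrumented(step_name):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            elapsed = time.time() - start
            op_metadata.append({"step": step_name, "seconds": elapsed})
            if elapsed > 1.0:                      # detect slow steps early
                print(f"WARNING: {step_name} took {elapsed:.2f}s")
            return result
        return wrapper
    return decorator

@instrumented("load_raw_orders")
def load_raw_orders():
    time.sleep(0.1)                                # pretend to do work

load_raw_orders()
print(op_metadata)
```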
11. Examples
1. Data exploration in MPP/Hadoop instead of in source systems (don't let the brain wait)
2. Web click stream and backend transactions
3. Replicate/synchronize a rollup hierarchy (mapping/lookup) to multiple data systems; then produce near-real-time aggregations in each system; finally federate the aggregates
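Example 3 sketched in plain Python: the rollup hierarchy is replicated, each system aggregates its own events to the category level, and only the small category-level results are federated; all names and numbers are made up:

```python
# Replicated rollup hierarchy, per-system aggregation, federated aggregates.
from collections import Counter

rollup = {"P1": "Hardware", "P2": "Hardware", "P3": "Software"}  # replicated

web_sales = [("P1", 25.0), ("P3", 10.0)]      # events owned by system A
store_sales = [("P2", 40.0), ("P1", 5.0)]     # events owned by system B

def aggregate(sales):
    """Roll product-level sales up to category level inside one system."""
    totals = Counter()
    for product_id, amount in sales:
        totals[rollup[product_id]] += amount
    return totals

# Each system produces its own near-real-time aggregate...
web_agg, store_agg = aggregate(web_sales), aggregate(store_sales)

# ...and the federation layer only has to combine small category-level results.
print(web_agg + store_agg)    # Counter({'Hardware': 70.0, 'Software': 10.0})
```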