From Raw Data to Analytics with No ETL
Marcel Kornacker // Cloudera, Inc.
Lenni Kuff // Cloudera, Inc.
Outline
• Evolution of ETL in the context of analytics
• traditional systems
• Hadoop today
• Cloudera’s vision for ETL: no ETL
• with qualifications
Traditional ETL
• Extract: physical extraction from the source data store
• could be an RDBMS acting as an operational data store
• or log data materialized as JSON
• Transform:
• data cleansing and standardization
• conversion of naturally complex/nested data into a flat relational schema
• Load: the target analytic DBMS converts the transformed data into its binary format (typically columnar)
Traditional ETL
• Three aspects of the traditional ETL process:
1. semantic transformation, such as data standardization/cleansing
-> makes data more queryable, adds value
2. representational transformation: from source to target schema (from complex/nested to flat relational)
-> a “lateral” transformation that doesn’t change semantics but adds operational overhead
3. data movement: from source to staging area to target system
-> adds yet more operational overhead
Traditional ETL
• The goals of “analytics with no ETL”:
• simplify aspect 1
• eliminate aspects 2 and 3
ETL with Hadoop Today
• A typical ETL workflow with Hadoop looks like this:
• raw source data initially lands in HDFS (examples: text/XML/JSON log files)
• that data is mapped into a table to make it queryable:
CREATE TABLE RawLogData (…) ROW FORMAT DELIMITED LOCATION '/raw-log-data/';
• the target table is mapped to a different location:
CREATE TABLE LogData (…) STORED AS PARQUET LOCATION '/log-data/';
• the raw source data is converted to the target format:
INSERT INTO LogData SELECT * FROM RawLogData;
• the data is then available for batch reporting/analytics (via Impala, Hive, Pig, Spark) or interactive analytics (via Impala, Search)
(a consolidated, runnable version of these statements is sketched below)
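• To make the fragments above concrete, a hedged, self-contained sketch; the column list (ts, level, msg) and the tab delimiter are hypothetical stand-ins for the elided (…):
  -- Hypothetical log schema; the slide elides the actual columns.
  CREATE TABLE RawLogData (ts TIMESTAMP, level STRING, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/raw-log-data/';
  -- Same columns, stored columnar for fast analytics.
  CREATE TABLE LogData (ts TIMESTAMP, level STRING, msg STRING)
    STORED AS PARQUET
    LOCATION '/log-data/';
  -- One-shot conversion from raw text to Parquet.
  INSERT INTO LogData SELECT * FROM RawLogData;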
ETL with Hadoop Today
• Compared to traditional ETL, this has several advantages:
• Hadoop acts as a centralized location for all data: raw source data lives side by side with the transformed data
• data does not need to be moved between multiple platforms/clusters
• data in the raw source format is queryable as soon as it lands, although at reduced performance compared to an optimized columnar data format
• all data transformations are expressed through the same platform and can reference any of the Hadoop-resident data sources (and more)
ETL with Hadoop Today
• However, even this still has drawbacks:
• new data needs to be loaded periodically into the target table, and doing that reliably and within SLAs can be a challenge
• you now have two tables: one with current but slow data, another with lagging but fast data (a view that unifies the two is sketched below)
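• A common workaround, not from the slides: assuming both tables share a schema with some watermark column (here a hypothetical ts), a view can union the converted data with only the not-yet-converted raw rows:
  -- Hedged sketch: a single logical view over both tables.
  CREATE VIEW AllLogData AS
  SELECT * FROM LogData      -- fast, columnar, but lagging
  UNION ALL
  SELECT * FROM RawLogData   -- current, but slow to scan
  WHERE ts > (SELECT MAX(ts) FROM LogData);  -- skip rows already converted
• this keeps queries pointed at one name, but the operational burden of the periodic INSERT remains, which is exactly what the vision below targets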
A Vision for Analytics with No ETL
• Goals:
• no explicit loading/conversion step to move raw data into a target table
• a single view of the data that is
• up-to-date
• (mostly) in an efficient columnar format
A Vision for Analytics with No ETL
• Elements of an ETL-light analytic stack:
• support for complex/nested schemas
-> avoid remapping of raw data into a flat relational schema
• background and incremental data conversion
-> retain an in-place single view of the entire data set, with most data in an efficient format
• bonus: schema inference and schema evolution
-> start analyzing data as soon as it arrives, regardless of its complexity
Support for Complex Schemas in Impala
• Standard relational: all columns have scalar values:
CHAR(n), DECIMAL(p, s), INT, DOUBLE, TIMESTAMP, etc.
• Complex types: structs, arrays, maps
in essence, a nested relational schema
• Supported file formats:
Parquet, JSON, XML, Avro
• Design principle for SQL extensions: maintain SQL’s way of dealing with multi-valued data
Support for Complex Schemas in Impala
• Example:
CREATE TABLE Customers (
  cid BIGINT,
  address STRUCT {
    street STRING,
    zip INT
  },
  orders ARRAY<STRUCT {
    oid BIGINT,
    total DECIMAL(9, 2),
    items ARRAY<STRUCT {
      iid BIGINT, qty INT, price DECIMAL(9, 2) }>
  }>
)
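• For contrast, a hedged sketch of the flat relational schema that traditional ETL would force this data into; the table and key names are hypothetical. This is precisely the representational transformation (aspect 2) that the nested schema avoids:
  -- Hypothetical flattened equivalent: one table per nesting level,
  -- stitched together with surrogate keys.
  CREATE TABLE FlatCustomers (
    cid BIGINT,
    address_street STRING,
    address_zip INT
  );
  CREATE TABLE FlatOrders (
    oid BIGINT,
    cid BIGINT,  -- references FlatCustomers.cid
    total DECIMAL(9, 2)
  );
  CREATE TABLE FlatOrderItems (
    iid BIGINT,
    oid BIGINT,  -- references FlatOrders.oid
    qty INT,
    price DECIMAL(9, 2)
  );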
Support for Complex Schemas in Impala
• Total revenue from items that cost more than $10:
SELECT SUM(i.price * i.qty)
FROM Customers.orders.items i
WHERE i.price > 10
• Customers and order totals in zip 94611:
SELECT c.cid, o.total
FROM Customers c, c.orders o
WHERE c.address.zip = 94611
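• Against the hypothetical flattened schema sketched earlier, the same two queries need explicit key joins; a hedged comparison:
  -- Flat equivalents of the two nested queries above.
  SELECT SUM(i.price * i.qty)
  FROM FlatOrderItems i
  WHERE i.price > 10;

  SELECT c.cid, o.total
  FROM FlatCustomers c JOIN FlatOrders o ON o.cid = c.cid
  WHERE c.address_zip = 94611;
• the nested form expresses the parent-child relationship through the path (c.orders) rather than a join predicate that someone must maintain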
Support for Complex Schemas in Impala
• Customers that have placed more than 10 orders:
SELECT c.cid
FROM Customers c
WHERE COUNT(c.orders) > 10
(shorthand for:
WHERE (SELECT COUNT(*) FROM c.orders) > 10)
• Number of orders and average item price per customer:
SELECT c.cid, COUNT(c.orders),
AVG(c.orders.items.price)
FROM Customers c
Background Format Conversion
• Sample workflow:
• create the table:
CREATE TABLE LogData (…) WITH CONVERSION TO PARQUET;
• load data into the table:
LOAD DATA INPATH '/raw-log-data/file1' INTO LogData
SOURCE FORMAT SEQUENCEFILE;
• Prerequisite for incremental conversion: multi-format tables and partitions
• currently: each table partition has a single file format
• instead: allow a mix of file formats, separated into format-specific subdirectories (per-partition formats approximate this today; see the sketch below)
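• A hedged approximation with DDL that exists today: Impala and Hive already let the file format vary per partition, so newly landed partitions can stay in their source format until a conversion job rewrites them (the partition column and date are hypothetical):
  -- Table defaults to Parquet; a fresh partition is flipped to text
  -- until its files have been rewritten.
  CREATE TABLE LogData (…) PARTITIONED BY (day STRING) STORED AS PARQUET;
  ALTER TABLE LogData ADD PARTITION (day = '2014-09-30');
  ALTER TABLE LogData PARTITION (day = '2014-09-30') SET FILEFORMAT TEXTFILE;
  -- ...a conversion job rewrites the partition's files as Parquet, then:
  ALTER TABLE LogData PARTITION (day = '2014-09-30') SET FILEFORMAT PARQUET;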
Background Format Conversion
• Conversion process:
• atomic: the switch from the source to the target data files is atomic from the perspective of a running query (and any running query sees the full data set)
• redundant: with the option to retain the original data
• incremental: Impala’s catalog service automatically detects new data files that are not in the target format
Schema Inference and Schema Evolution
• Schema inference from data files is useful to reduce the barrier to analyzing complex source data
• as an example, log data often has hundreds of fields
• the time required to create the DDL manually is substantial
• Example: schema inference from structured data files
• available today:
CREATE TABLE LogData LIKE PARQUET '/log-data.pq'
• future formats: XML, JSON, Avro
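• A slightly fuller form of the statement that ships in Impala today (the file path is from the slide; the trailing clause and the DESCRIBE are illustrative):
  -- Derive column names and types from the Parquet file's embedded schema.
  CREATE TABLE LogData LIKE PARQUET '/log-data.pq' STORED AS PARQUET;
  -- Inspect what was inferred.
  DESCRIBE LogData;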
Schema Inference and Schema Evolution
• Schema evolution:
• a necessary follow-on to schema inference: every schema evolves over time, and explicit maintenance is as time-consuming as the initial creation
• algorithmic schema evolution requires sticking to generally safe schema modifications: adding new fields
• adding new top-level columns
• adding fields within structs
• Example workflow:
LOAD DATA INPATH '/path' INTO LogData SOURCE FORMAT JSON WITH SCHEMA EXPANSION;
• scans the data to determine new columns/fields to add
• synchronous: if there is an error, the LOAD is aborted and the user is notified
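• The first class of safe modification maps onto DDL that already exists; a hedged sketch (the column name is hypothetical):
  -- Adding a top-level column is metadata-only and backward compatible:
  -- rows in existing files simply read as NULL for the new column.
  ALTER TABLE LogData ADD COLUMNS (referrer STRING);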
Conclusion
• Hadoop offers a number of advantages over traditional multi-platform ETL solutions:
• availability of all data sets on a single platform
• data becomes accessible through SQL as soon as it lands
• However, this can be improved further:
• a richer analytic SQL that is extended to handle nested data
• an automated background conversion process that preserves an up-to-date view of all data while providing BI-typical performance
• simple automation of initial schema creation and subsequent maintenance that makes dealing with large, complex schemas less labor-intensive
Thank you