ETL BIG DATA WITH APACHE HADOOP
By:
Harshal Solao
Maulik Thaker
Ankit Dudhat
INTRODUCTION
 Over the last few years, organizations across the public and private sectors have made strategic decisions to turn big data into competitive advantage.
 At the heart of this challenge is the process used to extract data from multiple sources, transform it to fit analytical needs, and load it into a data warehouse for subsequent analysis, known as “Extract, Transform & Load” (ETL).
 This presentation examines some of the platform hardware and software considerations in using Hadoop for ETL.
ETL BOTTLENECK IN BIG DATA
 Big Data refers to large amounts, at least terabytes, of structured and unstructured data that flow continuously through and around organizations, including video, text, sensor logs, and transactional records.
 Increasingly, the data we need is embedded in economic reports, discussion forums, news sites, social networks, weather reports, tweets, and blogs, as well as transactions.
 By analyzing all the data available, decision-makers can better
assess competitive threats, anticipate changes in customer
behavior, strengthen supply chains, improve the effectiveness
of marketing campaigns, and enhance business continuity.
TRADITIONAL ETL PROCESS
 The traditional ETL process extracts data from multiple sources, then cleanses, formats, and loads it into a data warehouse for analysis.
 When the source data sets are large, fast, and unstructured, traditional ETL can become the bottleneck, because it is too complex to develop, too expensive to operate, and takes too long to execute.
 By most accounts, 80 percent of the development effort in a big data project goes into data integration and only 20 percent goes toward data analysis.
USEFUL COMPONENTS OF ETL
 Apache Flume
 Apache Sqoop
 Apache Hive & Apache Pig
 ODBC/JDBC Connectors
 Apache Flume: It is a distributed system for collecting, aggregating, and moving large amounts of data from multiple sources into HDFS or another central data store.

 Apache Sqoop: It is a tool for transferring data between Hadoop and relational databases. You can use Sqoop to import data from a MySQL or Oracle database into HDFS, run MapReduce on the data, and then export the data back into an RDBMS.
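A hedged sketch of that round trip: Sqoop is normally driven from its command line, so the Java snippet below simply launches the sqoop import and export commands with ProcessBuilder to show the whole flow in one place. The database URL, credentials file, table names, and HDFS paths are hypothetical, and the flags should be checked against the Sqoop release in use.

import java.util.Arrays;
import java.util.List;

public class SqoopRoundTrip {

    // Launch a command, stream its output to the console, and fail on a non-zero exit code.
    private static void run(List<String> command) throws Exception {
        Process p = new ProcessBuilder(command).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new RuntimeException("Command failed: " + command);
        }
    }

    public static void main(String[] args) throws Exception {
        // 1. Import a relational table into HDFS (hypothetical MySQL host, table, and path).
        run(Arrays.asList("sqoop", "import",
                "--connect", "jdbc:mysql://db.example.com/sales",
                "--username", "etl_user", "--password-file", "/user/etl/.dbpass",
                "--table", "orders",
                "--target-dir", "/staging/orders",
                "--num-mappers", "4"));

        // 2. ...run MapReduce, Hive, or Pig jobs over /staging/orders here...

        // 3. Export the transformed results back into the RDBMS.
        run(Arrays.asList("sqoop", "export",
                "--connect", "jdbc:mysql://db.example.com/sales",
                "--username", "etl_user", "--password-file", "/user/etl/.dbpass",
                "--table", "orders_summary",
                "--export-dir", "/warehouse/orders_summary"));
    }
}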
 Apache Hive & Apache Pig: These are programming

languages that simplify development of applications
employing the MapReduce framework. HiveQL is a
dialect of SQL and supports a subset of the syntax. Pig
Latin is a procedural programming language that provides
high-level abstractions for MapReduce.
 ODBC/JDBC Connectors: Connectors for HBase and Hive are often proprietary components included in distributions of Apache Hadoop software.
 They provide connectivity with SQL applications by translating standard SQL queries into HiveQL commands that can be executed on the data in HDFS or HBase.
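To illustrate how a SQL application reaches data in HDFS through such a connector, the sketch below uses the open-source Hive JDBC driver to run a query against a HiveServer2 endpoint; a vendor's proprietary ODBC/JDBC connector is wired up in much the same way. The host, port, table, and column names are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver; requires hive-jdbc and its dependencies on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint and database.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "etl_user", "");
             Statement stmt = conn.createStatement()) {

            // Standard SQL submitted by the application; HiveServer2 executes it
            // as HiveQL over the data stored in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT region, COUNT(*) AS txn_count "
              + "FROM transactions GROUP BY region");

            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getLong("txn_count"));
            }
        }
    }
}

The same aggregation could be written procedurally in Pig Latin; either way, the work is compiled into MapReduce jobs that run on the cluster.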
PHYSICAL INFRASTRUCTURE FOR ETL WITH HADOOP
 The rule of thumb for Hadoop infrastructure planning has long been to “throw more nodes at the problem.” This is a reasonable approach when the size of your cluster matches a web-scale challenge, such as returning search results or personalizing web pages for hundreds of millions of online shoppers.
 But a typical Hadoop cluster in an enterprise has around 100 nodes and is supported by IT resources that are significantly more constrained than those of Yahoo! and Facebook.
 TeraSort: (Map: CPU Bound; Reduce: I/O Bound) TeraSort transforms data from one representation to another. Since the size of the data remains the same from input through shuffle to output, TeraSort tends to be I/O bound.
 WordCount: (CPU Bound) WordCount extracts a small amount of interesting information from a large data set, which means that the map output and the reduce output are much smaller than the job input (a minimal mapper/reducer sketch follows this workload list).
 Nutch Indexing: (Map: CPU Bound; Reduce: I/O Bound)
Nutch Indexing decompresses the crawl data in the map
stage, which is CPU bound, and converts intermediate
results to inverted index files in the reduce stage, which is
I/O bound.
 PageRank Search: (CPU Bound) The PageRank workload is representative of the original inspiration for Google search, based on the PageRank algorithm for link analysis.
 Bayesian Classification: (I/O Bound) This workload
implements the trainer part of the Naïve Bayesian
classifier, a popular algorithm for knowledge discovery
and data mining.
 K-means Clustering: (CPU Bound in iteration; I/O Bound in clustering) This workload first computes the centroid of each cluster by running a Hadoop job iteratively until the iterations converge or the maximum number of iterations is reached; a final clustering step then assigns each sample to a cluster.
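To make the WordCount characterization concrete, here is a minimal Hadoop MapReduce sketch of that workload (referenced from the WordCount bullet above). It uses the standard org.apache.hadoop.mapreduce API; the input and output paths are hypothetical. Because the reducer is also registered as a combiner, per-word counts are aggregated early, which is why the map output and the reduce output end up much smaller than the job input.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input split (the CPU-bound part).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce (also used as the combiner): sum the counts for each word, so the
    // intermediate and final outputs are far smaller than the raw text input.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/raw_text
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /data/wordcounts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}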
CONCLUSION
 The newest wave of big data is generating new opportunities and new challenges for businesses across every industry.
 Apache Hadoop provides a cost-effective and massively
scalable platform for ingesting big data and preparing it
for analysis.
 Using Hadoop to offload the traditional ETL processes
can reduce time to analysis by hours or even days.
Running the Hadoop cluster efficiently means selecting an
optimal infrastructure of servers, storage, networking, and
software.