Introduction To Hadoop Ecosystem
InSemble Inc.
http://www.insemble.com
Agenda
1. What is Big Data?
2. Relevance to your Enterprise
3. Hadoop Ecosystem
4. Use Cases & Java Developer fit
5. Demo
Big Data Definitions
• Wikipedia defines it as “data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage and process data
within a tolerable elapsed time”
• Gartner defines it as data with the following characteristics
– High Velocity
– High Variety
– High Volume
• Another definition is “Big Data is large-volume, unstructured data which
cannot be handled by traditional database management systems”
Why a game changer
• Schema on Read
– Interpreting data at processing time
– Keys and values are not intrinsic properties of the data but are chosen by
the person analyzing it
• Move code to data
– With traditional systems, we bring data to the code and I/O becomes a
bottleneck
– With hand-built distributed systems, we have to deal with our own
checkpointing/recovery
• More data beats better algorithms
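The schema-on-read idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the field names and the helper `read_with_schema` are hypothetical and not part of the original deck. The same raw bytes are interpreted twice, with a different key chosen each time.

```python
import csv
import io

# Raw records are stored as-is; no schema is enforced when they are written.
# (Hypothetical sample data.)
RAW_LOG = """2015-03-01,lib-12,checkout,3
2015-03-01,lib-07,return,1
2015-03-02,lib-12,checkout,5
"""

def read_with_schema(raw, fields, key_field, value_field, cast=int):
    """Interpret raw lines at processing time and emit (key, value) pairs.

    The field names, the key and the value are chosen by the analyst;
    they are not stored with the data.
    """
    reader = csv.DictReader(io.StringIO(raw), fieldnames=fields)
    for row in reader:
        yield row[key_field], cast(row[value_field])

FIELDS = ["date", "branch", "action", "items"]

# Analysis A keys the records by branch.
by_branch = list(read_with_schema(RAW_LOG, FIELDS, "branch", "items"))

# Analysis B reads the same bytes but keys them by date.
by_date = list(read_with_schema(RAW_LOG, FIELDS, "date", "items"))
```

Changing the question only changes the arguments passed at read time; the stored data is never migrated.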
Enterprise Relevance
• Missed Opportunities
– Channels
– Data that is analyzed
• Constraint was high cost
– Storage
– Processing
• Future-proof your business
– Schema on Read
– Access pattern not as relevant
– Not just future-proofing your architecture
Motivation and History
• Disk access speeds have not caught up with storage capacities
• Need a high-speed parallel processing platform to process large
datasets on a distributed file system
• Google published MapReduce architecture in 2004
• MapReduce framework
– Split the query, distribute it and process it in parallel (Map step)
– Gather the results and deliver them (Reduce step)
• The Apache open source project Hadoop implemented the
MapReduce framework
– “Software library that gives users the ability to process large datasets across clusters of
commodity hardware in a reliable, fault-tolerant manner using a simple programming model”
Hadoop Ecosystem
Source: Apache Hadoop Documentation
HDFS Architecture
Source: Hadoop Definitive Guide by Tom White
MapReduce framework
Hadoop 2 with YARN
Source: Hadoop In Practice by Alex Holmes
MapReduce
• Restrictive programming model
– Keys and values
– Map and reduce functions whose only coordination is
passing keys and values
• But still considered a general data-processing tool
– Google used it to build production search indexes
– Image Analysis
– Machine learning algorithms
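The key/value programming model above can be sketched as a single-process word count in Python. This is a toy illustration, not Hadoop's API: the function names `map_step`, `shuffle`, `reduce_step` and `run_job` are hypothetical, and a real cluster would run the map and reduce calls on many machines in parallel.

```python
from collections import defaultdict

def map_step(line):
    """Map: turn one input record into zero or more (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Wire the three phases together; only keys and values are passed along."""
    mapped = (pair for line in lines for pair in map_step(line))
    grouped = shuffle(mapped)
    return dict(reduce_step(key, values) for key, values in grouped.items())
```

Calling `run_job(["big data", "Big elephant"])` counts each word across both lines. The map and reduce functions never talk to each other directly, which is the restriction the slide refers to.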
Pig
• High-level scripting language
• Data flow language
– Good for describing data analysis problems as data flows
– Can plug in UDFs written in other languages such as Java, Scala,
JRuby
– Pig scripts can be embedded in and run from other languages
– Predominant use cases are
• Production ETL jobs
• Data exploration by analysts
• Higher-level abstraction over execution engines
– MapReduce
– Tez
Hive
• Framework for data warehousing on top of Hadoop
– SQL access to data in HDFS
– Queries for analysis
• Batch oriented by default; lower-latency options
– Impala (a separate SQL engine)
– Tez (an execution engine for Hive)
HBase
• NoSQL database on Hadoop
– Based on Google’s BigTable
– Column-family-oriented database on HDFS
• Regular Interactive/Update use cases
– Real time read/write random access
– Row updates are atomic
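A rough picture of the data model described above, as a toy in-memory table in Python. Everything here is an assumption made for illustration: `ToyTable` is hypothetical, cells are addressed by a plain "family:qualifier" string, and a per-row lock stands in for HBase's row-level atomicity. It does not use or imitate the HBase client API.

```python
import threading
from collections import defaultdict

class ToyTable:
    """Toy model: row key -> {"family:qualifier": value}, atomic per row."""

    def __init__(self):
        self._rows = defaultdict(dict)
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects creation of per-row locks

    def _row_lock(self, row_key):
        with self._guard:
            return self._locks[row_key]

    def put(self, row_key, cells):
        """Apply all cells of one row together, so a reader never sees
        a half-applied update to that row."""
        with self._row_lock(row_key):
            self._rows[row_key].update(cells)

    def get(self, row_key):
        """Random read by row key; returns a copy of the row's cells."""
        with self._row_lock(row_key):
            return dict(self._rows[row_key])

table = ToyTable()
table.put("school#1", {"info:name": "Example School", "geo:lat": "41.88"})
table.put("school#1", {"geo:lat": "41.90"})  # update a single cell in place
```

Note that atomicity in this sketch, as on the slide, is scoped to a single row; nothing here coordinates updates across rows.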
Sqoop
• Import/export data between an RDBMS and Hadoop
– HDFS, Hive, HBase
– Couchbase
– Uses the JDBC driver to get the data types of the columns
– Serialization/deserialization
• Actual load is done internally by MapReduce jobs
Apache Flume
Source: Apache Flume Documentation
Real time streaming with Kafka & Storm
• Kafka
– Pub/Sub messaging using topics
– Kafka producers publish to topics
• Storm
– Real time computational engine
– Consumes data from spouts and passes data to bolts
– Can run on top of YARN
– Uses ZooKeeper for coordination; Storm itself is implemented largely in Clojure
– You define workflows as directed acyclic graphs (topologies)
– True stream processing engine, so used for low-latency ingestion
– Can support at-most-once, at-least-once and (via the Trident API) exactly-once semantics
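The spout-and-bolt model above can be mimicked with Python generators. This is a single-process sketch under stated assumptions, not Storm's API: `sentence_spout`, `split_bolt`, `count_bolt` and `run_topology` are hypothetical names, and the input list stands in for messages that would normally arrive from a Kafka topic.

```python
from collections import Counter

def sentence_spout(messages):
    """Spout: the source of the stream; emits one tuple at a time."""
    for message in messages:
        yield message

def split_bolt(stream):
    """Bolt: consumes sentences and emits one lowercase word per tuple."""
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()

def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) as each word arrives."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]

def run_topology(messages):
    """Edges of the directed acyclic graph: spout -> split -> count."""
    return list(count_bolt(split_bolt(sentence_spout(messages))))
```

Each tuple flows through the whole graph as soon as it is emitted, rather than waiting for a batch to fill, which is what makes the model suit low-latency work.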
Apache Spark
• High-speed, general-purpose engine for large-scale data
processing
• Does not need Hadoop; it just needs a shared file system such as
S3, NFS or HDFS
• Spark can run on YARN
• Spark is implemented in Scala
• Has a Streaming API, but is fundamentally a batch processing engine that
micro-batches the stream
• Streaming targets exactly-once semantics, but under some failure
conditions degrades to at-least-once
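Micro-batching, as described above, means cutting a continuous stream into small batches and running an ordinary batch computation on each one. A minimal Python sketch follows. It is an assumption-laden toy: `micro_batches` and `process` are hypothetical names, and batches are cut by element count here, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Cut a (possibly endless) stream into consecutive small batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def process(stream, batch_size):
    """Run the same batch computation (a sum) on every micro-batch."""
    return [sum(batch) for batch in micro_batches(stream, batch_size)]
```

For example, `process(range(1, 8), 3)` handles the seven values as batches of three, three and one. Latency is bounded below by the batch size, which is the trade-off against a tuple-at-a-time engine.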
Common Use Cases
• Queries from Detail Record Data
• Queries from longer duration data
• Diagnostic/Metrics/Web Logs Data Analysis
• 360 degree view incorporating clickstream data
• Reports that cannot be generated within the needed timeframe
• Capture and analyze sensor data
• Analyze large volume of image data
• Build User profiles from large volumes of data
• Sentiment Analysis
• Recommendation Engines
• Risk Analysis
Securing Hadoop Data
Source: http://www.voltage.com
Closing
• Technology in hyper growth phase
• Complex
• Tools/Productivity/Monitoring products evolving
• Pilot Project
• Incremental Journey
Demo - Start HDP cluster in AWS
• 6 EC2 machines in total, type t2.medium
• RHEL 6.5, 3.75 GB memory, 10 GB hard drive
• 1 Ambari server + 5-node cluster
• 1 NameNode + 1 Secondary NameNode + 3 DataNodes
• Public data set from https://data.cityofchicago.org
Managing Hadoop Cluster using Ambari
• In several Indian languages, “ambari” means a seat mounted
on top of an elephant
• Ambari is an Apache open source project that
is used to
– Provision a Hadoop cluster
– Manage a Hadoop cluster
– Monitor a Hadoop cluster
• Agent-based deployment model
Ambari Architecture
Source: http://docs.hortonworks.com/
Demo — Hue
• Hue provides a web interface for
analyzing data in Hadoop
• Use HCatalog to create table
• Demo Hive Script
• Demo Pig Script
Demo — Advanced Hive
• Use a built-in UDF to extract latitude and longitude info
• Use a custom UDF (Scala) to calculate the distance
between two locations
• Join the library and school tables to find
libraries within 1 mile of each school
• Use Tableau to connect to Hive through an ODBC driver
to plot socioeconomic data
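The distance calculation and the "libraries within 1 mile of each school" join can be sketched in Python. The deck's custom UDF was written in Scala and is not shown, so this is not that code: the haversine formula is one common choice assumed here, the function names are hypothetical, and the coordinates are made up rather than taken from the Chicago data set.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def libraries_near_schools(schools, libraries, max_miles=1.0):
    """Cross join schools with libraries and keep the pairs within max_miles."""
    return {
        school: [name for name, (llat, llon) in libraries.items()
                 if haversine_miles(slat, slon, llat, llon) <= max_miles]
        for school, (slat, slon) in schools.items()
    }

# Made-up coordinates for illustration only.
schools = {"School A": (41.880, -87.630)}
libraries = {"Library X": (41.885, -87.630), "Library Y": (41.980, -87.630)}
nearby = libraries_near_schools(schools, libraries)
```

In Hive the same shape would be a join between the two tables with the distance function in the filter condition; the cross join is acceptable for small reference tables such as these.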
