"Introduction to the Hadoop Ecosystem" was presented to the Lansing Java User Group on 2/17/2015 by Vijay Mandava and Lan Jiang. The demo was built on top of HDP 2.2 and the AWS cloud.
2. Agenda
What is Big Data?
Relevance to your Enterprise
Hadoop Ecosystem
Use Cases & Java Developer fit
Demo
3. Big Data Definitions
• Wikipedia defines it as "data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage and process data
within a tolerable elapsed time"
• Gartner defines it as data with the following characteristics:
– High Velocity
– High Variety
– High Volume
• Another definition is "a large volume of unstructured data which
cannot be handled by traditional database management systems"
4. Why a game changer
• Schema on Read
– Interpret the data at processing time (see the sketch after this list)
– Keys and values are not intrinsic properties of the data but are
chosen by the person analyzing it
• Move code to data
– With traditional systems, we bring the data to the code and I/O
becomes a bottleneck
– With hand-rolled distributed systems, we would have to deal with our
own checkpointing/recovery; the framework handles this for us
• More data beats better algorithms
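To make "schema on read" concrete, here is a minimal Java sketch: the file is stored as raw text, and the field layout (a tab-separated timestamp, user id and URL, all hypothetical) is imposed only at processing time by whoever analyzes the data.

// Schema on read: the store holds raw lines; the "schema" is applied here,
// at read time, and a different analysis is free to interpret the same
// bytes differently. File name and field layout are hypothetical.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SchemaOnRead {
    public static void main(String[] args) throws IOException {
        Files.lines(Paths.get("clicks.log")).forEach(line -> {
            String[] fields = line.split("\t");   // interpretation happens here
            String userId = fields[1];            // "key" chosen by the analyst
            String url = fields[2];               // "value" chosen by the analyst
            System.out.println(userId + " -> " + url);
        });
    }
}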
5. Enterprise Relevance
• Missed opportunities
– Channels
– Data that is analyzed
• The constraint was high cost
– Storage
– Processing
• Future-proof your business
– Schema on Read
– Access patterns are not as relevant up front
– Not just future-proofing your architecture
6. Motivation and History
• Disk access speeds have not caught up with storage capacities
• Need a high-speed parallel processing platform to process large
datasets on a distributed file system
• Google published its MapReduce architecture in 2004
• MapReduce framework
– Split the query, distribute it and process it in parallel (Map step)
– Gather the results and deliver them (Reduce step)
• An Apache open source project called Hadoop implemented the
MapReduce framework
– "Software library that gives users the ability to process large datasets across clusters of
commodity hardware in a reliable, fault-tolerant manner using a simple programming model"
10. Hadoop 2 with YARN
Architecture diagram (source: Hadoop in Practice by Alex Holmes)
11. MapReduce
• Restrictive programming model
– Keys and values
– Map and reduce functions, with the only coordination between them
being the passing of keys and values (see the word-count sketch below)
• But still considered a general data-processing tool
– Google used it for its production search indexes
– Image analysis
– Machine learning algorithms
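The canonical word count illustrates the model; this is a minimal sketch against the Hadoop Java MapReduce API, not code from the talk. The only coordination between the two functions is the (word, count) pairs passed from map to reduce.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map step: split the input line and emit (word, 1) for each token
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                ctx.write(new Text(word), ONE);
            }
        }
    }
    // Reduce step: gather all counts for a word and sum them
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }
}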
12. Pig
• High-level scripting language
• Data flow language
– Good for describing data analysis problems as data flows
– Can plug in UDFs written in other languages such as Java, Scala and
JRuby (see the sketch after this list)
– Pig scripts can also be embedded in and executed from other languages
– Predominant use cases are
• Production ETL jobs
• Data exploration by analysts
• Higher-level abstraction over
– MapReduce
– Tez
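As an example of plugging in a UDF, here is the classic upper-case function from the Pig documentation, a minimal sketch in Java; the class and jar names are illustrative. A Pig script would REGISTER the jar and then call UPPER(name) directly.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Pig passes the function's arguments as a Tuple; null means no data
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}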
13. Hive
• Framework for a data warehouse on top of Hadoop
– SQL access to data on HDFS (see the JDBC sketch after this list)
– Queries for analysis
• Batch oriented; lower-latency engines include
– Impala
– Tez
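Hive's SQL access is exposed to Java programs through the HiveServer2 JDBC driver. This is a minimal sketch; the host name, default port 10000, credentials and table name are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // The query compiles down to a batch job on the cluster
             ResultSet rs = stmt.executeQuery(
                 "SELECT name, COUNT(*) FROM schools GROUP BY name")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}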
14. HBase
• NoSQL database on Hadoop
– Based on Google's BigTable
– Column-oriented database on HDFS
• Regular interactive/update use cases
– Real-time read/write random access (see the sketch after this list)
– Row updates are atomic
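A minimal sketch of real-time random reads and writes through the HBase Java client (the 1.x API; older 0.9x releases used HTable directly). The table name "clicks" and column family "d" are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("clicks"))) {
            // A Put touches a single row, and row updates are atomic
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("url"),
                          Bytes.toBytes("http://example.com"));
            table.put(put);

            // Random read of the same row
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("url"))));
        }
    }
}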
15. Sqoop
• Imports/exports data between RDBMSs and Hadoop
– HDFS, Hive, HBase
– Couchbase
– Uses the JDBC driver to determine the data types of the columns
– Handles serialization/deserialization
• The actual load is done internally by MapReduce jobs (see the sketch below)
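A minimal sketch of a Sqoop import driven from Java via Sqoop.runTool; the same arguments work on the sqoop command line. The JDBC URL, credentials, table and target directory are assumptions. Sqoop infers the column types through the JDBC driver and performs the copy as parallel MapReduce map tasks.

import org.apache.sqoop.Sqoop;

public class SqoopImport {
    public static void main(String[] args) {
        int exitCode = Sqoop.runTool(new String[] {
            "import",
            "--connect", "jdbc:mysql://db.example.com/sales",
            "--username", "etl",
            "--password-file", "/user/etl/.password",
            "--table", "orders",                 // source RDBMS table
            "--target-dir", "/data/orders",      // destination in HDFS
            "--num-mappers", "4"                 // parallel map tasks doing the load
        });
        System.exit(exitCode);
    }
}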
17. Real-time Streaming with Kafka & Storm
• Kafka
– Pub/sub messaging using topics
– Kafka producers publish to topics (see the producer sketch after this list)
• Storm
– Real-time computation engine
– Consumes data from spouts and passes it to bolts
– Can run on top of YARN
– Uses ZooKeeper; implemented in Clojure
– You define workflows (topologies) as directed acyclic graphs
– A true stream-processing engine, so used for low-latency ingestion
– Can support at-most-once, at-least-once and exactly-once semantics
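A minimal sketch of a Kafka producer publishing to a topic, using the Java producer API introduced in Kafka 0.8.2. The broker address and the "clicks" topic are assumptions; on the consuming side, a Storm topology would typically read the topic through a Kafka spout.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to the "clicks" topic, keyed by user id
            producer.send(new ProducerRecord<>("clicks", "user42", "/home"));
        }
    }
}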
18. Apache Spark
• High-speed, general-purpose engine for large-scale data
processing (see the sketch after this list)
• Does not need Hadoop; just needs a shared file system such as
S3, NFS or HDFS
• Spark can run on YARN
• Spark is implemented in Scala
• Has a streaming API, but it is really a batch-processing engine that
micro-batches
• Can support exactly-once semantics, but under some failure
conditions degrades to at-least-once
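A minimal sketch of Spark's general-purpose API from Java: load a file from a shared file system, filter it in parallel across the cluster, and count the matches. The HDFS path is an assumption; a local path or an s3:// URI works the same way.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ErrorCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ErrorCount");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // count() is an action, so it triggers the distributed job
        long errors = sc.textFile("hdfs:///logs/app.log")
                        .filter(line -> line.contains("ERROR"))
                        .count();
        System.out.println("errors: " + errors);
        sc.stop();
    }
}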
19. Common Use Cases
• Queries over detail-record data
• Queries over longer time spans of data
• Diagnostic/metrics/web log data analysis
• 360-degree view incorporating clickstream data
• Reports that cannot be generated within the needed timeframe
• Capture and analyze sensor data
• Analyze large volumes of image data
• Build user profiles from large volumes of data
• Sentiment analysis
• Recommendation engines
• Risk analysis
21. Closing
• The technology is in a hyper-growth phase
• Complex
• Tools/productivity/monitoring products are still evolving
• Start with a pilot project
• Treat adoption as an incremental journey
22. Demo - Start HDP Cluster in AWS
• 6 EC2 instances in total, type t2.medium
• RHEL 6.5, 3.75 GB memory, 10 GB disk each
• 1 Ambari server + a 5-node cluster
• 1 NameNode + 1 Secondary NameNode + 3 DataNodes
• Public dataset from
https://data.cityofchicago.org
23. Managing a Hadoop Cluster Using Ambari
• "Ambari" comes from an Indian word for a seat
carried on the back of an elephant
• Ambari is an Apache open source project that
is used to
• Provision Hadoop clusters
• Manage Hadoop clusters
• Monitor Hadoop clusters
• Agent-based deployment model
25. Demo - Hue
• Hue provides a web interface for
analyzing data in Hadoop
• Use HCatalog to create tables
• Demo Hive script
• Demo Pig script
26. Demo - Advanced Hive
• Use built-in UDFs to extract latitude and longitude info
• Use a custom UDF (Scala) to calculate the distance
between two locations (see the sketch after this list)
• Join the library and school tables and find
the libraries within 1 mile of each school
• Use Tableau to connect to Hive through the ODBC driver
to plot socioeconomic data
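The demo's distance UDF was written in Scala; here is an equivalent minimal sketch in Java using the haversine formula, with hypothetical table and column names. After ADD JAR and CREATE TEMPORARY FUNCTION haversine AS 'HaversineDistance', the join could filter on haversine(s.lat, s.lon, l.lat, l.lon) <= 1.0.

import org.apache.hadoop.hive.ql.exec.UDF;

public class HaversineDistance extends UDF {
    private static final double EARTH_RADIUS_MILES = 3958.8;

    // Great-circle distance in miles between two (lat, lon) points
    public Double evaluate(Double lat1, Double lon1, Double lat2, Double lon2) {
        if (lat1 == null || lon1 == null || lat2 == null || lon2 == null) {
            return null;
        }
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }
}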