Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It allows for the storage and analysis of datasets that are too large for single servers. The document discusses several key Hadoop components including HDFS for storage, MapReduce for processing, HBase for column-oriented storage, Hive for SQL-like queries, Pig for data flows, and Sqoop for data transfer between Hadoop and relational databases. It provides examples of how each component can be used and notes that Hadoop is well-suited for large-scale batch processing of data.
2. LP Facts
• 8,500+ customers
• 450M unique visitors per month
• 1.3B visits per month
• 60TB of new data every month
3. LP Facts
Databases in Liveperson
• Oracle – B2B
• SQL Server – B2C
• Hadoop – Raw Data
• Vertica – Forecasting / BI
• Cassandra – Application HA
• MySQL – Segmentation
• MySQL NDB – ETL
• MongoDB – Predictive Targeting
4. Hadoop in Liveperson
• 2TB Of data streamed into Hadoop each day
• 100+ Data Nodes serving our data needs
• DataNodes have 36GB RAM, 12 x 1TB SATA disks, and 2 quad-core CPUs
• Dozens (and growing) of daily MR jobs
• 5 different projects (and growing) based on the Hadoop eco-system
5. Hadoop
What is Hadoop?
• Open Source project from Apache
• Able to store and process large amounts of data, including not only
structured data but also complex, unstructured data
• Hadoop is not actually a single product but instead
a collection of several components
• Commodity hardware – low software and hardware costs
• Shared-nothing machines – scalability
6. Example Comparison: RDBMS vs. Hadoop
                     Traditional RDBMS (typical)   Hadoop
Data Size            Gigabytes                     Petabytes
Access               Interactive and batch         Batch – NOT interactive
Updates              Read/write many times         Write once, read many times
Structure            Static schema                 Dynamic schema
Scaling              Nonlinear                     Linear
Query Response Time  Can be near immediate         Has latency (due to batch processing)
7. Hadoop Distributed File System - HDFS
• HDFS has three types of nodes
• NameNode (master node)
  • Distributes files across the cluster
  • Responsible for replication between the DataNodes and for
    tracking the location of file blocks
• DataNodes
  • Responsible for the actual file storage
  • Serve file data to clients
• BackupNode (version 0.23 and up)
  • A backup of the NameNode
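The division of labor above can be sketched with a toy model in plain Python (this is not the real HDFS API; block size, node names, and the round-robin placement are illustrative): a NameNode splits a file into blocks and records, per block, which DataNodes hold a replica.

```python
import itertools

BLOCK_SIZE = 4    # bytes per block (toy value; real HDFS uses 64/128 MB)
REPLICATION = 3   # copies of each block, matching the HDFS default

class DataNode:
    """Stores the raw block data, as DataNodes do."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}   # block_id -> bytes

class NameNode:
    """Holds only metadata: which DataNodes store each block of each file."""
    def __init__(self, datanodes):
        self.block_map = {}                    # filename -> [(block_id, replicas)]
        self._placement = itertools.cycle(datanodes)

    def put(self, filename, data):
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        entries = []
        for i, block in enumerate(blocks):
            block_id = f"{filename}#{i}"
            replicas = [next(self._placement) for _ in range(REPLICATION)]
            for dn in replicas:                # replicate the block across DataNodes
                dn.blocks[block_id] = block
            entries.append((block_id, replicas))
        self.block_map[filename] = entries

    def get(self, filename):
        # Read each block from its first replica and reassemble the file.
        return b"".join(reps[0].blocks[bid]
                        for bid, reps in self.block_map[filename])

nodes = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(nodes)
nn.put("demo.txt", b"hello hdfs!")
print(nn.get("demo.txt"))   # b'hello hdfs!'
```

Note that the NameNode never touches file contents, only the block map; that is why losing the NameNode (hence the BackupNode) is so much worse than losing a DataNode.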
8. Hadoop Ecosystem
• MapReduce - framework for writing/executing distributed algorithms
• HBase - distributed, versioned, column-oriented database
• Hive - SQL-like interface for large datasets stored in HDFS
• Pig - a high-level language (Pig Latin) for expressing data analysis
  programs
• Sqoop ("SQL-to-Hadoop") - a straightforward tool that imports/exports
  tables or databases between relational databases and HDFS/Hive/HBase
9. MapReduce
What is MapReduce?
• Runs programs (jobs) across many computers
• Protects against single-server failure by re-running failed steps
• MR jobs can be written in Java, C, Python, Ruby, etc.
• Users only write Map and Reduce functions
• MAP - takes a large problem and divides it into sub-problems;
  performs the same function on all sub-problems
• REDUCE - combines the output from all sub-problems
Example: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /bin/wc
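In the streaming example above, /bin/cat and /bin/wc serve as the mapper and reducer. The same map/shuffle/reduce flow can be sketched in plain Python with a word count (no Hadoop required; this only illustrates the programming model):

```python
from collections import defaultdict

def map_phase(text):
    # MAP: divide the input into per-word sub-problems, emitting (key, 1) pairs.
    for word in text.split():
        yield word, 1

def reduce_phase(pairs):
    # SHUFFLE: group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # REDUCE: combine the output from all sub-problems.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase("to be or not to be"))
print(counts)   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real job, many mappers run in parallel on different blocks of input, and the framework (not the user) performs the shuffle step.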
10. HBase
What is HBase and why should you use it?
• Huge volumes of randomly accessed data.
• There are no restrictions on the number of columns per row – the schema is dynamic.
• Consider HBase when you're loading data by key, searching data by key (or
  range), serving data by key, querying data by key, or when storing data by
  row that doesn't conform well to a schema.
HBase don'ts
• It doesn't talk SQL, have an optimizer, or support transactions or joins. If
  you don't use any of these in your database application, then HBase could
  very well be the perfect fit.
Example: create 'blogposts', 'post', 'image'                ---create table
put 'blogposts', 'id1', 'post:title', 'Hello World'         ---insert value
put 'blogposts', 'id1', 'post:body', 'This is a blog post'  ---insert value
put 'blogposts', 'id1', 'image:header', 'image1.jpg'        ---insert value
get 'blogposts', 'id1'                                      ---select record
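The shell session above maps onto a simple data model: each row key holds "family:qualifier" cells, and rows need not share columns. A minimal Python sketch of that model (a plain dict, not the HBase API; 'put'/'get' here are toy helpers):

```python
# Toy model of the 'blogposts' table: row key -> {"family:qualifier": value}.
table = {}

def put(row, column, value):
    # Columns are created on the fly: no fixed schema, any row may have any columns.
    table.setdefault(row, {})[column] = value

def get(row):
    return table.get(row, {})

put("id1", "post:title", "Hello World")
put("id1", "post:body", "This is a blog post")
put("id1", "image:header", "image1.jpg")
print(get("id1"))
```

Every access goes through the row key, which mirrors the point above: HBase fits workloads that load, search, and serve data by key.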
11. Hive
What is Hive?
• Built on the MapReduce framework, so it generates M/R jobs behind the scenes
• Hive is a data warehouse that enables easy data summarization and
  ad-hoc queries via an SQL-like interface for large datasets stored in
  HDFS/HBase
• Supports partitioning and partition swapping
• Good for random sampling
Example (table definition):
CREATE EXTERNAL TABLE vs_hdfs (
  site_id string,
  session_id string,
  time_stamp bigint,
  visitor_id bigint,
  row_unit string,
  evts string,
  biz string,
  plne string,
  dims string)
PARTITIONED BY (site string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '001'
STORED AS SEQUENCEFILE
LOCATION '/home/data/';

Example (query):
SELECT session_id,
       get_json_object(concat(tttt, "}"), '$.BY'),
       get_json_object(concat(tttt, "}"), '$.TEXT')
FROM (
  SELECT session_id, concat("{", regexp_replace(event, "[{|}]", ""), "}") tttt
  FROM (
    SELECT session_id, get_json_object(plne, '$.PLine.evts[*]') pln
    FROM vs_hdfs_v1
    WHERE site = '6964264' AND day = '20120201' AND plne != '{}' LIMIT 10) t
  LATERAL VIEW explode(split(pln, "},{")) adTable AS event) t2
12. Pig
What is Pig?
• Data flow processing
• Uses Pig Latin query language
• Highly parallel in order to distribute data processing across many
servers
• Combining multiple data sources (Files, Hbase, Hive)
Example:
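The slide's example was not captured as text; the following hypothetical Pig Latin sketch (file and field names are illustrative) shows the data-flow style Pig encourages: load, filter, group, aggregate, store.

```pig
-- Hypothetical sketch: count long visits per site (names are illustrative)
visits  = LOAD 'visits.tsv' AS (site_id:chararray, visitor_id:long, dur:int);
long_v  = FILTER visits BY dur > 60;            -- keep visits over a minute
by_site = GROUP long_v BY site_id;              -- one group per site
counts  = FOREACH by_site GENERATE group AS site_id, COUNT(long_v) AS n;
STORE counts INTO 'visit_counts';
```

Each statement names an intermediate relation, so the script reads as a pipeline; Pig compiles the whole flow into MapReduce jobs and runs them in parallel across the cluster.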
13. Sqoop
What is Sqoop?
• It's a command-line tool for moving data between HDFS and
  relational database systems.
• You can download a SQL Server driver for Sqoop from Microsoft and:
  • Import data/query results from SQL Server to Hadoop.
  • Export data from Hadoop to SQL Server.
• It's like BCP
Example:
$bin/sqoop import --connect
'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch'
--table lineitem --hive-import
$bin/sqoop export --connect
'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table
lineitem --export-dir /data/lineitemData
14. Other projects in Hadoop
There are many other projects we didn't talk about:
• Chukwa
• Mahout
• Avro
• Zookeeper
• Fuse
• Flume
• Oozie
• Hue
• Hiho
• and more
15. Hadoop and Microsoft
• Bring Hive data directly to Excel through the Microsoft Hive Add-in for Excel
• Build a PowerPivot/PowerView on top of Hive
• Instead of manually refreshing a PowerPivot workbook based on Hive on
their desktop, users can use PowerPivot for SharePoint to schedule a data
refresh feature to refresh a central copy shared with others, without worrying
about the time or resources it takes.
• BI professionals can build a BI Semantic Model or Reporting Services
  reports on Hive in SQL Server Data Tools
16. HOW DO I GET STARTED
• Microsoft
• https://www.hadooponazure.com/
• http://www.microsoft.com/download/en/details.aspx?id=27584 (sqoop driver)
• Open Source: http://hadoop.apache.org
• Vendors
• http://www.cloudera.com
• http://hortonworks.com
• http://mapr.com