hadoop 101 aug 21 2012 tohug

August 21 2012 – Toronto Hadoop User Group
a.k.a. THUGs
Introduction to Hadoop:
Pretty Picture Version
{Due credit to Todd’s Magic}

Why are we here?

• Become exposed to the core concepts of
Hadoop
• Understand the projects within Hadoop
and how they fit together
• Review Common Use Cases for Hadoop
• Share beginner experiences with Hadoop
• Ask a @$%$#-load of questions about
Hadoop

©2011 Cloudera, Inc. All Rights Reserved.

What I won’t be able to give you…

• A complete introduction to the technology
(takes too long)
• Enough information to begin development or
implementation of Hadoop (too complicated)
• Enough information to install and configure
Hadoop (I recommend you start with the
Cloudera VMware image individually or
Cloudera Manager for a real cluster)
• A hands-on Pig-fest or Hive-fest (that’s
a THUG meetup to come…)

Users of Cloudera
[Logo slide: customers grouped under Financial, Web, Telecom, Media, and Retail & Consumer]

Hadoop Use Cases
Use case: ADVANCED ANALYTICS    Industry         Use case: DATA PROCESSING
(application)                                    (application)

Social Network Analysis         Web              Clickstream Sessionization
Content Optimization            Media            Clickstream Sessionization
Network Analytics               Telco            Mediation
Loyalty & Promotions Analysis   Retail           Data Factory
Fraud Analysis                  Financial        Trade Reconciliation
Entity Analysis                 Federal          SIGINT
Sequencing Analysis             Bioinformatics   Genome Mapping

CDH
File System Mount: FUSE-DFS
UI Framework: HUE
SDK: HUE SDK
Workflow: Apache Oozie
Scheduling: Apache Oozie
Metadata: Apache Hive
Query / Analytics: Apache Pig, Apache Hive, Apache Mahout
Data Integration: Apache Flume, Apache Sqoop
Fast Read/Write Access: Apache HBase
HDFS, MapReduce (center of the diagram)
Coordination: Apache ZooKeeper

Typical Data Pipeline

[Diagram labels: Data Sources; (Temporary) Storage; Processing Layer; Data Warehouse; Marts; Archive]

Typical Data Pipeline with Hadoop

[Diagram labels: Data Sources; Flume and Sqoop; Hadoop (HDFS, MapReduce, Pig, Hive, Oozie); Original Source Data; Result or Calculated Data; Sqoop; Data Warehouse; Marts]

Several advantages

• Store more data, cheaply
• Use commodity hardware
• Scale linearly, predictably
• Tolerate hardware failure
• Turn data into strategic asset
– Ad hoc analytics
– Predictive analytics


Several more advantages

• Get long term view of data
• Add unstructured, semi-structured data

• Change schema on the fly (late binding)
• Integrate with existing infrastructure


HDFS

Self-healing, high bandwidth

[Diagram: an incoming file is split into blocks 1 to 5, and copies of each block are spread across the nodes of the cluster]

HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
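
To make the block-and-replica idea concrete, here is a minimal Python sketch. It is not how HDFS is implemented: the function names, the tiny block size, and the round-robin placement are assumptions chosen for illustration (real HDFS placement is rack-aware). The replication factor of 3 matches the usual HDFS default.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int = 3) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is a simplification used only for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_blocks(placement: dict, failed_node: str) -> set:
    """Blocks that still have at least one replica after one node fails."""
    return {
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


# Toy sizes: a 1000-byte "file" with 256-byte blocks on a 5-node cluster.
blocks = split_into_blocks(b"x" * 1000, block_size=256)
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(len(blocks), nodes)
still_readable = surviving_blocks(placement, failed_node="node1")
```

Because every block lives on three different nodes, losing any single node leaves all blocks readable, which is the property the "self-healing" slides illustrate.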


HDFS

Self-healing, high bandwidth

[Diagram: the same cluster with one node missing; each of blocks 1 to 5 is still present on the remaining nodes]

HDFS breaks incoming files into blocks and stores them redundantly across the cluster.


MapReduce: Map

• Records from the data source (lines out of files, rows of a
database, etc.) are fed into the map function as key-value
pairs: e.g., (filename, line).

• map() produces one or more intermediate values along
with an output key from the input.
[Diagram: (key, values) inputs → Map Task → (key, intermediate values) → Shuffle Phase → Reduce Task → final (key, values)]

MapReduce: Reduce

• After the map phase is over, all the intermediate values for
a given output key are combined together into a list

• reduce() combines those intermediate values into one or
more final values for that same output key

[Diagram: (key, values) inputs → Map Task → (key, intermediate values) → Shuffle Phase → Reduce Task → final (key, values)]
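
The grouping step and the reduce step can be sketched in a few lines of Python. This is a single-process illustration, not the Hadoop API: the function names, the sample keys, and the choice of "maximum" as the reduce operation are all assumptions made for the example.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group intermediate (key, value) pairs into key -> list of values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_max(key, values):
    """Example reduce(): collapse the list of values for one key into its maximum."""
    return key, max(values)


# Hypothetical intermediate output collected from several map tasks.
intermediate = [("k1", 2), ("k2", 5), ("k1", 3), ("k3", 1), ("k2", 4)]

grouped = shuffle(intermediate)
final = dict(reduce_max(key, values) for key, values in sorted(grouped.items()))
```

Each call to the reduce function sees one key and the complete list of values for that key, which is why the shuffle has to finish before reducing starts.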


MapReduce: Execution


MapReduce: WordCount
Input text: The cat sat on the mat. The aardvark sat on the sofa.

Mapping:
The, 1   cat, 1   sat, 1   on, 1   the, 1   mat, 1
The, 1   aardvark, 1   sat, 1   on, 1   the, 1   sofa, 1

Shuffling           Reducing       Final Result
aardvark, 1         aardvark, 1    aardvark, 1
cat, 1              cat, 1         cat, 1
mat, 1              mat, 1         mat, 1
on [1, 1]           on, 2          on, 2
sat [1, 1]          sat, 2         sat, 2
sofa, 1             sofa, 1        sofa, 1
the [1, 1, 1, 1]    the, 4         the, 4
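
The WordCount walk-through can be reproduced in plain Python on the slide's input text. This is a single-process sketch rather than a Hadoop job. Lower-casing words inside the map step is an assumption, made so that "The" and "the" land under the same key as they do in the shuffled column.

```python
import re
from collections import defaultdict

TEXT = "The cat sat on the mat. The aardvark sat on the sofa."


def map_words(line):
    """map(): emit (word, 1) for every word in the line."""
    for word in re.findall(r"[A-Za-z]+", line):
        yield word.lower(), 1


def word_count(text):
    # Shuffle: collect the 1s emitted for each word into a list.
    grouped = defaultdict(list)
    for word, one in map_words(text):
        grouped[word].append(one)
    # Reduce: sum the list for each word, in key order.
    return {word: sum(ones) for word, ones in sorted(grouped.items())}


result = word_count(TEXT)
for word, count in result.items():
    print(f"{word}, {count}")
```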


Sqoop: RDBMS to HDFS


Sqoop: HDFS to RDBMS


FlumeNG: High-level Architecture

[Diagram: several Clients send events to Agents. Inside an Agent, a Source feeds Channel 1 and Channel 2, which are drained by Sink 1 and Sink 2.]

Examples
Sources: Avro, netcat, exec
Channels: memory, JDBC
Sinks: HDFS, Avro
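
The source, channel and sink roles can be illustrated with a toy in-memory agent in Python. Nothing here is Flume's API: the `Channel` and `Agent` classes are hypothetical, the sinks are plain Python lists, and copying every event to both channels is an assumption about how the two channels in the diagram are fed.

```python
from collections import deque


class Channel:
    """In-memory buffer that sits between a source and a sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        """Return the oldest buffered event, or None when empty."""
        return self._events.popleft() if self._events else None


class Agent:
    """One source fanned out to several channels, each drained by its own sink."""

    def __init__(self, sinks):
        # sinks: callables that accept a single event.
        self.pipes = [(Channel(), sink) for sink in sinks]

    def receive(self, event):
        """Source side: copy the incoming event into every channel."""
        for channel, _ in self.pipes:
            channel.put(event)

    def drain(self):
        """Sink side: deliver everything currently buffered, in arrival order."""
        for channel, sink in self.pipes:
            while (event := channel.take()) is not None:
                sink(event)


# Two lists stand in for an HDFS sink and an Avro sink.
hdfs_like, avro_like = [], []
agent = Agent([hdfs_like.append, avro_like.append])
for line in ["GET /a", "GET /b"]:
    agent.receive(line)
agent.drain()
```

The channel decouples the two sides: the source can keep accepting events while a slow sink catches up later.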


HBase: Table Structure
                    Column family “contents”                Column family “anchor_text”
Row Key             Column Key  Timestamp      Cell         Column Key  Timestamp      Cell
Com.cloudera.info               1273716197868  <html>…      Bar.com     1273871824184  Cloudera!...
Com.cloudera.www                1273746289103  <html>…      Baz.org     1273871962874  Hadoop!...
Com.foo.www                     1273698729045  <html>…
Com.foo.www                     1273699734191  <html>…      Bar.gov     1273879456211  Edu.foo…
…
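
One way to read the table is as nested maps: row key, then column (family plus column key), then timestamp, then cell. The Python sketch below models only that shape. `ToyTable` is a hypothetical class, not the HBase client API, and the cell values "<html>v1" and "<html>v2" are stand-ins because the slide truncates the real contents.

```python
from collections import defaultdict


class ToyTable:
    """row key -> "family:column" -> {timestamp: cell value}"""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self.rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Newest version at or before `timestamp` (newest overall if None)."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None


table = ToyTable()
# Two versions of the same cell, distinguished only by timestamp.
table.put("Com.foo.www", "contents:", 1273698729045, "<html>v1")
table.put("Com.foo.www", "contents:", 1273699734191, "<html>v2")
table.put("Com.cloudera.info", "anchor_text:Bar.com", 1273871824184, "Cloudera!")

latest = table.get("Com.foo.www", "contents:")
older = table.get("Com.foo.www", "contents:", timestamp=1273698729045)
```

This is why Com.foo.www appears twice in the table: both lines are versions of one row's cell, not two separate rows.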


HBase: Architecture


Hive
SQL-based data warehousing application
• Language is SQL-like
• Features for analyzing very large data sets
• Partition columns, Sampling, Buckets

SELECT
s.word, s.freq, k.freq
FROM shakespeare s
JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;


Pig

Data-flow oriented language – “Pig Latin”
• Datatypes include bags (sets), maps (associative arrays) and tuples
• High-level language for routing data, allows easy
integration of Java for complex tasks

emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';
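
For readers who do not know Pig, the same data flow looks roughly like this in Python. The three employee rows are made up for the example and stand in for the contents of people.txt. The point is the step-by-step shape (load, filter, order), not the scale Pig runs at.

```python
# Hypothetical rows standing in for people.txt: (id, name, salary).
emps = [(1, "Ann", 250000), (2, "Bob", 90000), (3, "Cy", 300000)]

# FILTER emps BY salary > 200000
rich = [emp for emp in emps if emp[2] > 200000]

# ORDER rich BY salary DESC
sorted_rich = sorted(rich, key=lambda emp: emp[2], reverse=True)

for emp in sorted_rich:
    print(emp)
```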


Oozie
Workflow/coordination service to manage data processing
jobs for Hadoop


Hadoop Security

• Authentication is secured by Kerberos v5 and integrated with LDAP
• Hadoop server can ensure that users and groups are who they say they are
• Job Control includes Access Control Lists, which means Jobs can specify who
can view logs, counters, configurations and who can modify a job
• Tasks now run as the user who launched the job

Typical Use Cases

Common Challenges

1 Network Analysis and Sessionization
2 Content Optimization and Engagement Modeling
3 Usage Analysis and Mediation
4 Entity Surveillance and Signal Monitoring
5 Recommendations and Modeling
6 Loyalty, Promotion Analysis and Targeting
7 Fraud Analysis, Reconciliation and Risk
8 Time series Analysis, Mapping and Modeling


What Can Hadoop Do For You?
Two Core Use Cases, Applied Across Verticals

1 ADVANCED ANALYTICS            VERTICAL         2 DATA PROCESSING
(industry term)                                  (industry term)

Social Network Analysis         Web              Clickstream Sessionization
Content Optimization            Media            Engagement
Network Analytics               Telco            Mediation
Loyalty & Promotions Analysis   Retail           Data Factory
Fraud Analysis                  Financial        Trade Reconciliation
Entity Analysis                 Federal          SIGINT
Sequencing Analysis             Bioinformatics   Genome Mapping

Financial Services

1 Customer Risk Analysis

2 Surveillance and Fraud Detection
3 Central Data Repository
4 Personalization and Asset Management
5 Market Risk Modeling
6 Trade Performance Analytics


Customer Risk Analysis

• Build comprehensive data picture of customer side risk
– Publish a consolidated set of attributes for analysis
– Map ratings across products
• Parse and aggregate data from different sources
– Credit and debit cards, product payments, deposits and savings
– Banking activity, browsing behavior, call logs, e-mails and chats
• Merge data into a single view
– A “fuzzy join” among data sources
– Structure and normalize attributes
– Sentiment analysis, pattern recognition
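
A “fuzzy join” matches records whose keys are similar rather than identical. The sketch below shows one simple way to do that in Python with the standard library's `difflib`. The record layouts, the sample names, the `normalize` helper and the 0.8 similarity threshold are all assumptions for illustration; the slide does not say which matching technique was actually used.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case, drop dots and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def fuzzy_join(left, right, threshold=0.8):
    """Pair each left record with its most similar right record, if similar enough."""
    matches = []
    for left_name, left_value in left:
        best, best_score = None, 0.0
        for right_name, right_value in right:
            score = SequenceMatcher(
                None, normalize(left_name), normalize(right_name)
            ).ratio()
            if score > best_score:
                best, best_score = (right_name, right_value), score
        if best is not None and best_score >= threshold:
            matches.append((left_name, left_value, best[0], best[1]))
    return matches


# Hypothetical records from two systems that spell customer names differently.
cards = [("John A. Smith", "card-001"), ("Maria Lopez", "card-002")]
calls = [("john a smith", "call-17"), ("Wei Chen", "call-18")]

joined = fuzzy_join(cards, calls)
```

Comparing every pair is quadratic, so at Hadoop scale the comparison would typically be restricted to candidate groups first; that part is left out here.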


Surveillance and Fraud Detection

• Trade surveillance records activity in a central repository
– Centralized logging across all execution platforms
– Structured and raw log data from multiple applications
• Pattern recognition detects anomalies/harmful behavior
– Feature set and timeline vector are very dynamic
– Schema on read provides flexibility for analysis
• Data is primarily served and processed in HDFS with MR
– Data filtering and projection in Pig and Hive
– Statistical modeling of data sets in R or SAS

Central Data Repository

• Financial data is messy due to many interacting systems
– Personal data is obfuscated for security and records get out of sync
– Trades need to be “sessionized” into accounts and products
– Discrepancies are difficult to reconcile, need to track corrections
• Hadoop is a centralized platform for data collection
– Single source for data, processing happens on the platform
– Metadata used to track information lifecycle
– Workflows run and monitor data transformation pipelines
• Data served via APIs or in batch
– Single version of the truth, data processed and cleansed centrally
– Clear audit trail of data dependencies and usage

Personalization and Asset Mgmt

• Institutional and personal investing services
– Arms investor with sophisticated models for their positions
– Success measured by upsell and conversion (as well as profit)
• Data analysis across distinct data sources
– Market data and individual assets by investor
– Investor strategy, goals and interactive behavior
• Data sources combined in HDFS
– Models written in Pig with UDFs and generated regularly
– Reports for sales and fed into online recommendation system

Market Risk Modeling

• Evaluating asset risk is very data intensive
– Trade volumes have increased dramatically
– Classic indicators at the daily level don’t provide a clear picture
• Trends across complex instruments can be hard to spot
– Models require massive brute force calculation
– Multiple models built in batch and in parallel
• Data is primarily structured and sourced from RDBMS
– Transactional data sqooped to combine with market feeds
– Resulting predictions sqooped and served via RDBMS

Trade Performance Analytics

• Increased Demands on Trade Analytics
– Regulatory requirements for best price trading across exchanges
– Increased competition and scrutiny adds a focus on optimization
• Trade Analytics becomes a Clickstream problem
– Trade execution systems include order trails and execution logs
– Sessionized across order systems and combined with system logs
• Processing, Analysis and Audit Trail all in Hadoop
– KPIs summarized as regular reports written in Hive
– Data available for historical analysis and discovery
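
“Sessionized” usually means grouping each key's events into runs separated by periods of inactivity. The Python sketch below shows that idea on a handful of events. The event tuples, the order identifiers and the 30-minute gap are assumptions for illustration, not details from the trading systems described here.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap_seconds=1800):
    """Split each key's timestamps into sessions separated by long gaps.

    events: iterable of (key, timestamp_in_seconds) tuples.
    Returns {key: [[timestamps of session 1], [timestamps of session 2], ...]}.
    """
    sessions = {}
    # Sorting orders events by key, then by timestamp within each key.
    for key, group in groupby(sorted(events), key=itemgetter(0)):
        current, finished = [], []
        for _, ts in group:
            if current and ts - current[-1] > gap_seconds:
                finished.append(current)  # gap too long: close the session
                current = []
            current.append(ts)
        finished.append(current)
        sessions[key] = finished
    return sessions


# Hypothetical log events: (order id, seconds since start of day).
events = [("order-1", 0), ("order-1", 600), ("order-1", 4000), ("order-2", 100)]
sessions = sessionize(events)
```

In a MapReduce job the same logic fits naturally into the reduce step, since the shuffle already delivers all events for one key together.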


Science and Energy

1 Genomics
2 Utilities and Power Grid
3 Smart Meters
4 Biodiversity Indexing
5 Network Failures
6 Seismic Data


Genomics

• Cost of DNA Sequencing Falling Very Fast
– Raw data needs to be aligned and matched
– Scientists want to collect and analyze these sequences
• Hadoop Can Read Native Format
– hadoop-bam Java library for manipulation of Binary Alignment/Map
– Alignment, SNP discovery, genotyping
• Genomic Tools Based On Hadoop
– SEAL – distributed short read alignment
– BlastReduce – parallel read mapping
– Crossbow – whole genome re-sequencing analysis
– CloudBurst – sensitive MapReduce alignment

Utilities and the Power Grid

• Power grid is aging and maintained incrementally
– Failures hard to predict and can have cascading effects
– Looking at vibration of transformers over time to find patterns
• Predicting failure of grid equipment
– Supervised learning to scan time series data for fuzzy patterns
– Identify likely faulting equipment for targeted replacement
• Hadoop based tools to model equipment behavior
– openPDC project: http://openpdc.codeplex.com
– Lumberyard – indexing time series data for low latency fuzzy queries

Smart Meter Example Workflow

• Looking at usage patterns in home smart meter data
– How to educate consumers to save energy
– Capacity planning for the grid
• Individual analysis is critical
– Personalized reporting to consumers
– Predictive modeling of peak usage and potential cost savings
• Hadoop for collection, reporting and analysis
– Collect time series samples in Hadoop
– Partition at various granularities and roll up reports and models
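
Rolling time series samples up at different granularities can be sketched as a group-and-sum keyed by meter and time bucket. The sample readings, the meter name and the `roll_up` helper below are hypothetical, and a real deployment would run the aggregation as a Hive query or MapReduce job rather than in one Python process.

```python
from collections import defaultdict
from datetime import datetime

# strftime patterns that truncate a timestamp to the chosen bucket.
BUCKET_FORMATS = {"hour": "%Y-%m-%d %H:00", "day": "%Y-%m-%d", "month": "%Y-%m"}


def roll_up(samples, granularity):
    """Sum (timestamp, meter_id, kwh) samples per meter and time bucket."""
    fmt = BUCKET_FORMATS[granularity]
    totals = defaultdict(float)
    for ts, meter_id, kwh in samples:
        totals[(meter_id, ts.strftime(fmt))] += kwh
    return dict(totals)


# Hypothetical readings from one meter.
samples = [
    (datetime(2012, 8, 21, 9, 0), "meter-1", 0.5),
    (datetime(2012, 8, 21, 9, 30), "meter-1", 0.25),
    (datetime(2012, 8, 21, 18, 0), "meter-1", 1.0),
    (datetime(2012, 8, 22, 9, 0), "meter-1", 0.75),
]

hourly = roll_up(samples, "hour")
daily = roll_up(samples, "day")
```

The same raw samples feed every granularity, so finer reports (per consumer, per hour) and coarser ones (grid capacity per month) come from one stored data set.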


Biodiversity Indexing

• Consolidation and serving of biological data
– Provide free and open access to biodiversity data
– Collection, search, discovery and access to a variety of data
• Data matching and cleansing
– Geography, water/land mapping
– Dictionaries and taxonomic services
• Data is harvested into multiple RDBMS
– Sqoop to Hadoop for processing workflows and index generation
– Sqoop back to MySQL for Web app serving
– Future development is to crawl into and serve from HBase

Preventing Network Failure

• Need to model and understand network behavior
– Better understanding how the network reacts to fluctuations
– Discrete anomalies may, in fact, be interconnected
• Collection and forensic analysis of emerging patterns
– Record the data exhaust – all metrics, logs, traffic metadata
– Identify leading indicators of component failure
• New techniques when all data is available
– Expand the range of indexing techniques
– Starting with simple scans to more complex data mining

Processing Seismic Data

• Optimize the IO-intensive phases of seismic processing
– Incorporate additional parallelism where it makes sense
– Simplify gather/transpose operations with MapReduce
• Seismic Unix for Core Algorithms
– Well-known, used at many grad programs in geophysics
– SU file format can be easily transformed for processing on HDFS
• Hadoop Streaming
– Seismic Unix, SEPlib, Javaseis – non-Java code in MR
– Framework is aware of parameter files needed by SU commands


Retail and Manufacturing

1 Customer Churn
2 Brand and Sentiment Analysis
3 Point of Sales
4 Pricing Models
5 Customer Loyalty
6 Targeted Offers


Customer Churn Analysis

• Understanding Customer Behavior and Preferences
– Rapidly test and build behavioral model of customer
– Combine disparate data sources (transactional, social, etc.)
• Structure and analyze with Hadoop
– Traversing usage and social graphs
– Pattern identification and recognition to find indicators
• Feature Extraction to find Root Causes
– Defining attributes and modeling statistical significance
– Combinations and sequence of attributes and actions factor in

Brands and Sentiment Analysis

• Internet generates a lot of chatter about brands
– Understanding what’s being said is crucial to protecting brand value
– Facebook, Twitter generate a lot of data for a global top brand
• Capturing and Processing direct feedback
– Better engagement and alerting via Sentiment Analysis
– Not yet ready for fully automated customer service
• Hadoop handles the diverse data types and processing
– Sources of data changing and semantics continuously evolving
– Sophistication of algorithms is improving daily

Point of Sale Transaction Analysis

• Lots of machine generated data available
– Line items, stock, coupons, ads
– Stored in various formats
• Pattern recognition enables constant reassessment
– Optimizing across multiple data sources
– Demand prediction based on
• Joining multiple data sets for more insight
– Retail Supply Chain
– Weather and Financial data

Pricing Models

• Retailers have increased flexibility in pricing
– Comparison shopping is dynamic
– Customer weighs combined value and time to delivery
• Understand how prices affect purchasing
– New techniques apply such as A/B testing and spot discounts
– Motivations can be difficult to discern, need to look for correlations
• Combinations multiply, Hadoop provides scale to analyze
– Bundles can have incentive discounts
– Clustering and supervised learning to group attributes

Customer Loyalty

• Comparison shopping is making Retail hyper-competitive
– Discount programs, e-mail correspondence entice shoppers
– Brand loyalty means attention to detail and service
• Customer lifecycle is more than purchases
– Browsing and online data used to capture customer attention
– Loyalty programs bridge the gap between purchases
• Reach into online channels
– Online engagement is personalized just as in store
– Connecting online and in store shows customer awareness

Targeted Offers

• The checkout lane is everywhere
– Cookies track users through ad impressions
– Purchasing behavior is time sensitive
• Logs collected from on-site and off-site browsing
– Data is ingested incrementally
– Process happens at a variety of time scales
• Data logged to HBase as primary store
– Some events naturally associate, others require deeper analysis
– Random access useful for debugging algorithms

Web and e-Commerce

1 Online Media

2 Mobile
3 Online Gaming
4 Search Quality
5 Recommendations
6 Influence


Online Media

• Centralized platform for consolidated log processing
– Many online properties each with separate sys, ad, ops logs
– Different standards and techniques for processing
• Data feeds are varied
– Advertising logs, website traffic feeds from 3rd party providers, system logs, application logs and other operational metrics
• Data pipeline can be normalized
– Cleansing, standard analytics and reporting
– Soon an exploratory platform as well as storage across all properties

Mobile

• Mobile advertisement platform
– Measuring metrics: impressions, clicks, actions and conversions
– Most metrics are arbitrary text strings (data is dirty)
• Stringent SLAs for delivering results
– SLA of several minutes between event and report to advertisers
– SLA also covers data accuracy
• Hadoop for ETL, Analytics, reporting
– HBase for serving results to advertisers
– Mimics the popular online analytics services

Online Gaming

• Consolidating data silos for a holistic view of users
– Various silos of data – user reg, financial, game play, web
– Popular games simulate real world sports
• First goal is accessibility
– Multiple businesses can access all data
– Game play metrics are extremely detailed (think sensor data)
• Second is exploratory
– Distributions, event triggers, distinct counts and association rates
– Compute online statistics such as leaderboards

Search Quality

• Understand user search behavior
– Improve service, assess quality of results
– Understand load, identify trends, generate predictive search
• Search query logs stored in HDFS
– Hive based aggregation
– Sqoop to RDBMS for end user analytics
• Now focused on internal monitoring
– Analytics have become a critical part of the service
– Where are analytic needs growing?
– What data about searches do people want to see?

Recommendations and Forecasting

• Collect and serve personalization information
– Wide variety of constantly changing data sources
– Data guaranteed to be messy
• Data ingestion includes collection of raw data
– Filtering and fixing of poorly formatted data
– Normalization and matching across data sources
• Analysis looks for reliable attributes and groupings
– Interpretation (e.g. gender by name)
– Aggregation across likely matching identifiers
– Identify possible predicted attributes or preferences

Influence

• Collect a fire hose of data about social commentary
– Personal opinions, references to opinions, links
– Look for tracking and referencing (like very messy page rank)
• Hadoop to bucket and prepare for analysis
– Metadata and distinct topics
– Social graph scoring, bot and spam detection
• Hadoop stack used throughout
– Pig and Java, coordinated with Oozie
– Batch serve data in CSV and load to HBase for API servers

August 2012

Cloudera University
Sarah Sproehnle

Why invest in training?

• Maximize your investment in a new
technology
• Make fewer mistakes by learning the best
practices
• Cheaper and easier to cross-train than
hire
– Existing DBAs, Analysts and System
Administrators can become Hadoop users

Cloudera University
• Experience
– We’ve trained over 12,000 people
– Our courses incorporate the best practices that Cloudera has learned
from supporting our customers
• Depth of courseware
– A comprehensive, role-based curriculum
– We can train your entire staff in all aspects of CDH
• Geographical coverage
– We offer public and private classes in over 20 countries including
US, Canada, Brazil, Germany, UK, Poland, Spain, Israel, France, The
Netherlands, South Africa, China, India, Australia and Singapore
• Certification
– Available worldwide at Pearson VUE (vouchers included in our courses)
– Certifications for Developers (CCDH), Admins (CCAH), and HBase
(CCSHB)


Value proposition of private training
• 12k/day for up to 20 students
– NEW: 8k/day for up to 10 students
– Price includes courseware, lab materials, cert
vouchers (for Dev, Admin, HBase), and T&E
• We can tailor a class
– We have ~ 3 weeks of content that we can mix
and match into a customized class
– Saves the customer’s time by covering the most
relevant topics, cutting out non essential material
• Customer chooses location and date
• We’re under NDA
