This document provides an overview of Hadoop and its uses. It defines Hadoop as a distributed processing framework for large datasets across clusters of commodity hardware. It describes HDFS for distributed storage and MapReduce as a programming model for distributed computations. Several examples of Hadoop applications are given, such as log analysis, web indexing, and machine learning. In summary, Hadoop is a scalable platform for distributed storage and processing of big data across clusters of servers.
2. © 2012 – Pythian
Who is Pythian?
15 years of data infrastructure management consulting
170+ top brands
6,000+ databases under management
Over 200 DBAs in 26 countries
Top 5% of the DBA work force; 9 Oracle ACEs, 2 Microsoft MVPs
Oracle, Microsoft, MySQL, Netezza, Hadoop, MongoDB, Oracle Apps, Enterprise Infrastructure
3. © 2012 – Pythian
When to Engage Pythian?
[Chart: value of data vs. profit/loss impact of an incident (data loss, security, human error, etc.). Tier 3 Data: local retailer, strategic upside value from data. Tier 2 Data: eCommerce. Tier 1 Data: health care.]
LOVE YOUR DATA
4. © 2012 – Pythian
Alex Gorbachev
• Chief Technology Officer at Pythian
• Blogger
• Cloudera Champion of Big Data
• OakTable Network member
• Oracle ACE Director
• Founder of BattleAgainstAnyGuess.com
• Founder of Sydney Oracle Meetup
• IOUG Director of Communities
• EVP, Ottawa Oracle User Group
5. © 2012 – Pythian
Given enough skill and money, relational databases can do anything. Sometimes it's just unfeasibly expensive.
7. © 2012 – Pythian
Hadoop Principles
Bring Code to Data
Share Nothing
8. © 2012 – Pythian
Hadoop in a Nutshell
Replicated distributed big-data file system
Map-Reduce – framework for writing massively parallel jobs
9. © 2012 – Pythian
HDFS architecture – simplified view
• Files are split into large blocks
• Each block is replicated on write
• A file can only be created and deleted by one client
  • Uploading new data? => new file
  • Append supported in recent versions
  • Update data? => recreate file
  • No concurrent writes to a file
• Clients transfer blocks directly to & from data nodes
• Data nodes use cheap local disks
• Local reads are the most efficient
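The block-splitting and replication behavior above can be sketched in a few lines. This is a simplified capacity model, not the HDFS implementation; the 64 MB block size (mentioned in the editor's notes) and replication factor 3 are classic Hadoop 1.x defaults.

```python
# Simplified model of how HDFS plans storage for a file:
# split into fixed-size blocks, replicate each block on write.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic default
REPLICATION = 3                # default replication factor

def plan_blocks(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, total bytes stored including replicas)."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    return n_blocks, file_size * replication

blocks, stored = plan_blocks(1 * 1024**3)  # a 1 GB file
print(blocks)                   # 16 blocks of 64 MB each
print(stored == 3 * 1024**3)    # True: triple the raw bytes on disk
```

This also illustrates why the "400 TB usable from 1.2 PB raw" figure later in the deck follows directly from triple mirroring.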
11. © 2012 – Pythian
Map-Reduce example: histogram calculation
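The histogram calculation can be sketched as a map → shuffle → reduce pipeline, simulated here in-process. The bucket width of 10 and the sample data are arbitrary illustrative choices; a real job would run across the cluster (e.g. via Hadoop Streaming).

```python
# Histogram via map/shuffle/reduce, simulated in a single process.
from collections import defaultdict

def map_phase(records):
    # Mapper: emit (bucket, 1) for each value
    for value in records:
        yield (value // 10) * 10, 1  # bucket width 10 (illustrative)

def shuffle(pairs):
    # Framework shuffle: group emitted values by key
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts per bucket
    return {key: sum(vals) for key, vals in groups.items()}

data = [3, 12, 15, 27, 8, 41]
histogram = reduce_phase(shuffle(map_phase(data)))
print(histogram)  # {0: 2, 10: 2, 20: 1, 40: 1}
```

Note how all the intermediate (bucket, 1) pairs must be grouped before reducing — on a real cluster that grouping is the network-heavy shuffle criticized on the next slide.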
12. © 2012 – Pythian
Map-Reduce pros & cons
Advantages
• Simple programming paradigm
• Flexible
• Highly scalable
• Good fit for HDFS – mappers read locally
• Fault tolerant
  • Task failure or node failure doesn't affect the whole job – tasks are restartable
Pitfalls
• Low efficiency
  • Lots of intermediate data
  • Lots of network traffic on shuffle
• Complex manipulation requires a pipeline of multiple jobs
• No high-level language
• Only mappers leverage local reads on HDFS
13. © 2012 – Pythian
Some components of the Hadoop ecosystem
• Hive – HiveQL is a SQL-like query language
  • Generates MapReduce jobs
• Pig – data set manipulation language (like creating your own query execution plan)
  • Generates MapReduce jobs
• Mahout – machine learning libraries
  • Generates MapReduce jobs
• Oozie – workflow scheduler service
• Sqoop – transfers data between Hadoop and relational databases
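To make the "generates MapReduce jobs" point concrete: a HiveQL aggregation such as `SELECT status, COUNT(*) FROM logs GROUP BY status` compiles to a job whose shape is roughly the sketch below. The table, column names, and records are made up for illustration; this is the shape of the generated job, not Hive's actual planner output.

```python
# Rough shape of the MapReduce job Hive generates for
#   SELECT status, COUNT(*) FROM logs GROUP BY status
from collections import defaultdict

logs = [{"status": 200}, {"status": 404}, {"status": 200}]  # hypothetical rows

# Map: emit (group-by key, 1) per row
mapped = [(row["status"], 1) for row in logs]

# Shuffle + Reduce: sum counts per key
counts = defaultdict(int)
for status, one in mapped:
    counts[status] += one

print(dict(counts))  # {200: 2, 404: 1}
```

The appeal of Hive is exactly that users write the one-line query and never see this plumbing.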
14. © 2012 – Pythian
MapReduce is SLOW!
Speed through massive
parallelization
16. © 2012 – Pythian
Hadoop Benefits
• Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to leverage parallelism
• Designed to scale
• Flexible development platform
• Solution ecosystem
17. © 2012 – Pythian
Hadoop Limitations
• Hadoop is scalable but not fast
• Some assembly required
• Batteries not included
  • Instrumentation not included either
• DIY mindset (remember MySQL?)
• Commercial distributions are not free
• Simplistic security models
18. © 2012 – Pythian
How much does it cost?
$300K DIY on SuperMicro:
• 100 data nodes
• 2 name nodes
• 3 racks
• 800 Sandy Bridge CPU cores
• 6.4 TB RAM
• 600 x 2 TB disks
  • 1.2 PB of raw disk capacity
  • 400 TB usable (triple mirror)
• Open-source s/w, maybe a commercial distribution
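The capacity figures on this slide check out arithmetically:

```python
# Capacity math for the $300K SuperMicro configuration above
disks, disk_tb, replication = 600, 2, 3

raw_tb = disks * disk_tb            # 600 x 2 TB = 1200 TB = 1.2 PB raw
usable_tb = raw_tb // replication   # triple mirror => 400 TB usable

print(raw_tb, usable_tb)  # 1200 400
```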
19. © 2012 – Pythian
Hadoop vs. Relational Database
Cool (Hadoop):
• Load first, structure later
• "Cheap" hardware
• DIY
• Flexible data store
• Effectiveness via scale
• Petabytes
• 100s–1000s of cluster nodes
Old (RDBMS):
• Structure first, load later
• Enterprise-grade hardware
• Repeatable solutions
• Efficient data store
• Effectiveness via efficiency
• Terabytes
• Dozens of nodes (maybe)
21. © 2012 – Pythian
Use Cases for Big Data
• Top-line contributors
  • Analyze customer behavior
    • Optimize ad placements
    • Customized promotions, etc.
  • Recommendation systems
    • Netflix, Pandora, Amazon
  • New products and services
    • Prismatic, smart home
24. © 2012 – Pythian
Use Cases for Big Data
• Bottom-line contributors
  • Cheap archive storage
  • ETL layer – transformation engine, data cleansing
25. © 2012 – Pythian
Typical Initial Use-Cases for Hadoop in modern Enterprise IT
• Transformation engine (part of ETL)
  • Scales easily
  • Inexpensive processing capacity
  • Any data source and destination
• Data Landfill
  • Stop throwing away any data
  • Don't know how to use data today? Maybe tomorrow you will
• Hadoop is very inexpensive but very reliable
26. © 2012 – Pythian
Advanced: Data Science Platform
• A data warehouse is good when the questions are known and the data domain and structure are defined
• Hadoop is great for seeking new meaning in data, new types of insights
  • Unique information parsing and interpretation
  • Huge variety of data sources and domains
• When new insights are found and new structure is defined, Hadoop often takes the place of the ETL engine
• Newly structured information is then loaded into more traditional data warehouses (still, today)
27. © 2012 – Pythian
Pythian Internal Hadoop Use
• OCR of screen video capture from the Pythian privileged access surveillance system
  • Input: raw frames from video capture
  • A Map-Reduce job runs OCR on the frames and produces text
  • A Map-Reduce job identifies text changes from frame to frame and produces a text stream with a timestamp of when each text was on the screen
  • Other Map-Reduce jobs mine the text (and keystrokes) for insights
    • Credit card patterns
    • Sensitive commands (like DROP TABLE)
    • Root access
    • Unusual activity patterns
  • Merge with monitoring and documentation systems
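The frame-to-frame text-change step can be sketched as collapsing consecutive identical OCR outputs into timestamped events. This is a hypothetical illustration of the idea; the function name, record layout, and sample frames are invented, not Pythian's actual code.

```python
# Hypothetical sketch: turn per-frame OCR output into a stream of
# (timestamp, text) events, keeping only frames where the text changed.
def text_change_events(frames):
    """frames: iterable of (timestamp, ocr_text) pairs in time order."""
    events, last_text = [], None
    for ts, text in frames:
        if text != last_text:        # screen content changed
            events.append((ts, text))
            last_text = text
    return events

frames = [(0, "SQL> "), (1, "SQL> "), (2, "SQL> DROP TABLE t;")]
print(text_change_events(frames))
# [(0, 'SQL> '), (2, 'SQL> DROP TABLE t;')]
```

Downstream jobs mining for sensitive commands (DROP TABLE, root access, etc.) would then scan this much smaller event stream instead of every frame.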
28. © 2012 – Pythian
Thank you & Q&A
http://www.pythian.com/blog/
http://www.facebook.com/pages/The-Pythian-Group/
http://twitter.com/pythian http://twitter.com/alexgorbachev
http://www.linkedin.com/company/pythian
1-866-PYTHIAN
sales@pythian.com gorbachev@pythian.com
To contact us…
To follow us…
Editor's Notes
Big Data is not new. What's new is that it's not cost-prohibitive anymore for broad commercial adoption.
The ideas are simple:
1. Data is big, code is small. It is far more efficient to move the small code to the big data than vice versa. If you have a pillow and a sofa in a room, you typically move the pillow to the sofa, not vice versa. But many developers are too comfortable with the select-then-process anti-pattern. This principle is in place to help with the throughput challenges.
2. Sharing is nice, but safe sharing of data typically means locking, queueing, bottlenecks, and race conditions. It is notoriously difficult to get concurrency right, and even if you do, it is slower than the alternative. Hadoop works around the whole thing. This principle is in place to deal with the parallel processing challenges.
The default block size is 64 MB. You can place any data file in HDFS; later processing can find the meaning in the data.
Many projects fail because people imagine a very rosy image of Hadoop. They think they can just throw all the data there and it will magically and quickly become value. Such misguided expectations also happen with other platforms and doom other projects too. To be successful with Hadoop, we need to be realistic about it.
Patterns – the pregnant 16-year-old example:
Women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc.