This document provides an overview of Hadoop and its uses. It defines Hadoop as a distributed processing framework for large datasets across clusters of commodity hardware. It describes HDFS for distributed storage and MapReduce as a programming model for distributed computations. Several examples of Hadoop applications are given, such as log analysis, web indexing, and machine learning. In summary, Hadoop is a scalable platform for distributed storage and processing of big data across clusters of servers.
2. © 2012 – Pythian
Who is Pythian?
15 years of data infrastructure management consulting
170+ top brands
6,000+ databases under management
Over 200 DBAs in 26 countries
Top 5% of the DBA work force; 9 Oracle ACEs, 2 Microsoft MVPs
Oracle, Microsoft, MySQL, Netezza, Hadoop, MongoDB, Oracle Apps, Enterprise Infrastructure
3. © 2012 – Pythian
When to Engage Pythian?
[Chart: value of data vs. profit/loss impact of an incident (data loss, security, human error, etc.). Tier 3 Data: local retailer, strategic upside value from data. Tier 2 Data: eCommerce. Tier 1 Data: health care.]
LOVE YOUR DATA
4. © 2012 – Pythian
Alex Gorbachev
• Chief Technology Officer at Pythian
• Blogger
• Cloudera Champion of Big Data
• OakTable Network member
• Oracle ACE Director
• Founder of BattleAgainstAnyGuess.com
• Founder of Sydney Oracle Meetup
• IOUG Director of Communities
• EVP, Ottawa Oracle User Group
5. © 2012 – Pythian
Given enough skill and money, relational databases can do anything. Sometimes it's just unfeasibly expensive.
7. © 2012 – Pythian
Hadoop Principles
Bring Code to Data
Share Nothing
8. © 2012 – Pythian
Hadoop in a Nutshell
Replicated distributed big-data file system
Map-Reduce – framework for writing massively parallel jobs
9. © 2012 – Pythian
HDFS architecture – simplified view
• Files are split into large blocks
• Each block is replicated on write
• A file can only be created and deleted by one client
  • Uploading new data? => new file
  • Append supported in recent versions
  • Update data? => recreate file
  • No concurrent writes to a file
• Clients transfer blocks directly to & from data nodes
• Data nodes use cheap local disks
• Local reads are the most efficient
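The block-splitting and replication behavior above can be sketched in a few lines. This is a simplified capacity model, not the HDFS implementation; the 64 MB block size (mentioned in the editor's notes) and replication factor 3 are classic Hadoop 1.x defaults.

```python
# Simplified model of how HDFS plans storage for a file:
# split into fixed-size blocks, replicate each block on write.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic default
REPLICATION = 3                # default replication factor

def plan_blocks(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, total bytes stored including replicas)."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    return n_blocks, file_size * replication

blocks, stored = plan_blocks(1 * 1024**3)  # a 1 GB file
print(blocks)                   # 16 blocks of 64 MB each
print(stored == 3 * 1024**3)    # True: triple the raw bytes on disk
```

This also illustrates why the "400 TB usable from 1.2 PB raw" figure later in the deck follows directly from triple mirroring.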
11. © 2012 – Pythian
Map-Reduce example: histogram calculation
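The histogram calculation can be sketched as a map → shuffle → reduce pipeline, simulated here in-process. The bucket width of 10 and the sample data are arbitrary illustrative choices; a real job would run across the cluster (e.g. via Hadoop Streaming).

```python
# Histogram via map/shuffle/reduce, simulated in a single process.
from collections import defaultdict

def map_phase(records):
    # Mapper: emit (bucket, 1) for each value
    for value in records:
        yield (value // 10) * 10, 1  # bucket width 10 (illustrative)

def shuffle(pairs):
    # Framework shuffle: group emitted values by key
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts per bucket
    return {key: sum(vals) for key, vals in groups.items()}

data = [3, 12, 15, 27, 8, 41]
histogram = reduce_phase(shuffle(map_phase(data)))
print(histogram)  # {0: 2, 10: 2, 20: 1, 40: 1}
```

Note how all the intermediate (bucket, 1) pairs must be grouped before reducing — on a real cluster that grouping is the network-heavy shuffle criticized on the next slide.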
12. © 2012 – Pythian
Map-Reduce pros & cons
Advantages
• Simple programming paradigm
• Flexible
• Highly scalable
• Good fit for HDFS – mappers read locally
• Fault tolerant
  • Task failure or node failure doesn't affect the whole job – tasks are restartable
Pitfalls
• Low efficiency
  • Lots of intermediate data
  • Lots of network traffic on shuffle
• Complex manipulation requires a pipeline of multiple jobs
• No high-level language
• Only mappers leverage local reads on HDFS
13. © 2012 – Pythian
Some components of the Hadoop ecosystem
• Hive – HiveQL is a SQL-like query language
  • Generates MapReduce jobs
• Pig – data set manipulation language (like creating your own query execution plan)
  • Generates MapReduce jobs
• Mahout – machine learning libraries
  • Generates MapReduce jobs
• Oozie – workflow scheduler service
• Sqoop – transfers data between Hadoop and relational databases
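To make the "generates MapReduce jobs" point concrete: a HiveQL aggregation such as `SELECT status, COUNT(*) FROM logs GROUP BY status` compiles to a job whose shape is roughly the sketch below. The table, column names, and records are made up for illustration; this is the shape of the generated job, not Hive's actual planner output.

```python
# Rough shape of the MapReduce job Hive generates for
#   SELECT status, COUNT(*) FROM logs GROUP BY status
from collections import defaultdict

logs = [{"status": 200}, {"status": 404}, {"status": 200}]  # hypothetical rows

# Map: emit (group-by key, 1) per row
mapped = [(row["status"], 1) for row in logs]

# Shuffle + Reduce: sum counts per key
counts = defaultdict(int)
for status, one in mapped:
    counts[status] += one

print(dict(counts))  # {200: 2, 404: 1}
```

The appeal of Hive is exactly that users write the one-line query and never see this plumbing.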
14. © 2012 – Pythian
MapReduce is SLOW!
Speed through massive
parallelization
16. © 2012 – Pythian
Hadoop Benefits
• Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to leverage parallelism
• Designed to scale
• Flexible development platform
• Solution ecosystem
17. © 2012 – Pythian
Hadoop Limitations
• Hadoop is scalable but not fast
• Some assembly required
• Batteries not included
  • Instrumentation not included either
• DIY mindset (remember MySQL?)
• Commercial distributions are not free
• Simplistic security models
18. © 2012 – Pythian
How much does it cost?
$300K DIY on SuperMicro:
• 100 data nodes
• 2 name nodes
• 3 racks
• 800 Sandy Bridge CPU cores
• 6.4 TB RAM
• 600 x 2 TB disks
  • 1.2 PB of raw disk capacity
  • 400 TB usable (triple mirror)
• Open-source s/w, maybe a commercial distribution
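The capacity figures on this slide check out arithmetically:

```python
# Capacity math for the $300K SuperMicro configuration above
disks, disk_tb, replication = 600, 2, 3

raw_tb = disks * disk_tb            # 600 x 2 TB = 1200 TB = 1.2 PB raw
usable_tb = raw_tb // replication   # triple mirror => 400 TB usable

print(raw_tb, usable_tb)  # 1200 400
```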
19. © 2012 – Pythian
Hadoop vs. Relational Database
Cool (Hadoop):
• Load first, structure later
• "Cheap" hardware
• DIY
• Flexible data store
• Effectiveness via scale
• Petabytes
• 100s–1000s of cluster nodes
Old (RDBMS):
• Structure first, load later
• Enterprise-grade hardware
• Repeatable solutions
• Efficient data store
• Effectiveness via efficiency
• Terabytes
• Dozens of nodes (maybe)
21. © 2012 – Pythian
Use Cases for Big Data
• Top-line contributors
  • Analyze customer behavior
    • Optimize ad placements
    • Customized promotions, etc.
  • Recommendation systems
    • Netflix, Pandora, Amazon
  • New products and services
    • Prismatic, smart home
24. © 2012 – Pythian
Use Cases for Big Data
• Bottom-line contributors
  • Cheap archive storage
  • ETL layer – transformation engine, data cleansing
25. © 2012 – Pythian
Typical Initial Use-Cases for Hadoop in modern Enterprise IT
• Transformation engine (part of ETL)
  • Scales easily
  • Inexpensive processing capacity
  • Any data source and destination
• Data Landfill
  • Stop throwing away any data
  • Don't know how to use data today? Maybe tomorrow you will
• Hadoop is very inexpensive but very reliable
26. © 2012 – Pythian
Advanced: Data Science Platform
• A data warehouse is good when the questions are known and the data domain and structure are defined
• Hadoop is great for seeking new meaning in data, new types of insights
  • Unique information parsing and interpretation
  • Huge variety of data sources and domains
• When new insights are found and new structure is defined, Hadoop often takes the place of the ETL engine
• Newly structured information is then loaded into more traditional data warehouses (still, today)
27. © 2012 – Pythian
Pythian Internal Hadoop Use
• OCR of screen video capture from the Pythian privileged access surveillance system
  • Input: raw frames from video capture
  • A Map-Reduce job runs OCR on the frames and produces text
  • A Map-Reduce job identifies text changes from frame to frame and produces a text stream with a timestamp of when each text was on the screen
  • Other Map-Reduce jobs mine the text (and keystrokes) for insights
    • Credit card patterns
    • Sensitive commands (like DROP TABLE)
    • Root access
    • Unusual activity patterns
  • Merge with monitoring and documentation systems
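The frame-to-frame text-change step can be sketched as collapsing consecutive identical OCR outputs into timestamped events. This is a hypothetical illustration of the idea; the function name, record layout, and sample frames are invented, not Pythian's actual code.

```python
# Hypothetical sketch: turn per-frame OCR output into a stream of
# (timestamp, text) events, keeping only frames where the text changed.
def text_change_events(frames):
    """frames: iterable of (timestamp, ocr_text) pairs in time order."""
    events, last_text = [], None
    for ts, text in frames:
        if text != last_text:        # screen content changed
            events.append((ts, text))
            last_text = text
    return events

frames = [(0, "SQL> "), (1, "SQL> "), (2, "SQL> DROP TABLE t;")]
print(text_change_events(frames))
# [(0, 'SQL> '), (2, 'SQL> DROP TABLE t;')]
```

Downstream jobs mining for sensitive commands (DROP TABLE, root access, etc.) would then scan this much smaller event stream instead of every frame.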
28. © 2012 – Pythian
Thank you & Q&A
http://www.pythian.com/blog/
http://www.facebook.com/pages/The-Pythian-Group/
http://twitter.com/pythian http://twitter.com/alexgorbachev
http://www.linkedin.com/company/pythian
1-866-PYTHIAN
sales@pythian.com gorbachev@pythian.com
To contact us…
To follow us…
Editor's Notes
Big Data is not new. What's new is that it's not cost-prohibitive anymore for broad commercial adoption.
The ideas are simple:
1. Data is big, code is small. It is far more efficient to move the small code to the big data than vice versa. If you have a pillow and a sofa in a room, you typically move the pillow to the sofa, not vice versa. But many developers are too comfortable with the select-then-process anti-pattern. This principle is in place to help with the throughput challenges.
2. Sharing is nice, but safe sharing of data typically means locking, queueing, bottlenecks, and race conditions. It is notoriously difficult to get concurrency right, and even if you do, it is slower than the alternative. Hadoop works around the whole thing. This principle is in place to deal with the parallel processing challenges.
The default block size is 64 MB. You can place any data file in HDFS; later processing can find the meaning in the data.
Many projects fail because people imagine a very rosy image of Hadoop. They think they can just throw all the data there and it will magically and quickly become value. Such misguided expectations also happen with other platforms and doom other projects too. To be successful with Hadoop, we need to be realistic about it.
Patterns – the pregnant 16-year-old example:
Women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc.