1. © Hortonworks Inc. 2013
Hortonworks
Data Science with Hadoop – A Primer
Hadoop Summit, June 2013
Ofer Mendelevitch
ofer@hortonworks.com
@ofermend
2. © Hortonworks Inc. 2013 Page 2
Who am I?
currently <- c(
  role = "director of data sciences",
  company = "Hortonworks")
• Previously: Nor1, Yahoo!, Risk Insight, Quiver, etc…
• Blog: www.achessdad.com
3. © Hortonworks Inc. 2013 Page 3
What will I be talking about?
•What is Data Science?
•Hadoop and Data Science
•Use-cases: data science with Hadoop
•How to get started?
4. © Hortonworks Inc. 2013 Page 4
What is Data Science?
What is a data scientist?
A person who does this
Data Product: software product whose core
functionality relies on applying statistical (or
machine learning) methods to data.
What is Data Science?
The art of building data products
6. © Hortonworks Inc. 2013 Page 6
With Hadoop…
Time and cost of building large scale
data products is dramatically reduced
7. © Hortonworks Inc. 2013
An Apache Hadoop Platform: Hortonworks Data Platform (HDP)
• Deployment: OS / VM, Cloud, Appliance
• Hadoop Core (distributed storage & processing): HDFS, MapReduce
• Platform Services (enterprise readiness): HA, DR, snapshots, security, …
• Data Services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
• Operational Services (manage & operate at scale): Oozie, Ambari
8. © Hortonworks Inc. 2013
A typical Big Data Architecture
• Applications: business analytics, custom applications, packaged applications
• Data systems: traditional repos (RDBMS, EDW, MPP) alongside the Hortonworks Data Platform
• Data sources:
– Traditional sources (RDBMS, OLTP, OLAP)
– New sources (web logs, email, sensor data, social media)
– Mobile data; OLTP, POS systems
• Operational tools: manage & monitor
• Dev & data tools: build & test
9. © Hortonworks Inc. 2013 Page 9
Keys to Hadoop’s power
• Computation co-located with data
– Data and computation system co-designed to work
together
• Affordable at scale
– Use “commodity” hardware nodes
– Self-healing; failure handled by software
– Very good at batch processing of large datasets
10. © Hortonworks Inc. 2013 Page 10
Hadoop improves productivity of data
scientists
•All data in one place
–Ability to store all the data in raw format
–Data silo convergence
–Data scientists will find innovative uses of combined data
assets
•Data/compute capabilities available as shared asset
–Data scientists can quickly prototype a new idea without an
up-front request for funding
11. © Hortonworks Inc. 2013 Page 11
Data-driven innovation is accelerated since
Hadoop is “schema on read”
• “Schema change” project:
– Start: “I need new data”
– 6 months: “Finally, we start collecting”
– 9 months: “Let me see… is it any good?”
• With Hadoop (3 months):
– “Let’s just put it in a folder on HDFS”
– “Let me see… is it any good?”
– “My model is awesome!”
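To make "schema on read" concrete, here is a minimal Python sketch. Raw records are kept exactly as they arrived (as they would be in a folder on HDFS), and a schema is applied only at read time. The field names, the sample records, and the `read_with_schema` helper are hypothetical illustrations, not something from the talk.

```python
import json

# Hypothetical raw records, stored untouched (no up-front schema).
RAW_LINES = [
    '{"user": "U101", "temp": "72", "note": "ok"}',
    '{"user": "U102", "temp": "68"}',
    '{"user": "U103", "humidity": "40"}',  # a new field appears; nothing breaks
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields and cast them; missing -> None."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }

# Two different schemas over the same raw data, chosen by the reader.
temps = list(read_with_schema(RAW_LINES, {"user": str, "temp": int}))
humidity = list(read_with_schema(RAW_LINES, {"user": str, "humidity": int}))
```

The point of the design is that collecting a new field requires no change to how data is stored; only the reader that wants the field needs to know about it.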
12. © Hortonworks Inc. 2013 Page 12
Hadoop is ideal for pre-processing of large
raw datasets
Raw Data → Processed Data, via:
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Document vector generation
• Sampling, filtering
• Joins
• Term normalization
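Three of the steps above (stripping HTML, term normalization, document vector generation) can be sketched on a single document in plain Python. This is only an illustration of the per-document logic, assuming a simple lowercase/alphanumeric normalization and a term-frequency vector; on Hadoop the same logic would run across many documents in parallel. The sample document and function names are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw):
    """Strip away markup and keep the text."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)

def normalize_terms(text):
    """Term normalization: lowercase, split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def document_vector(raw_html):
    """Document vector generation: a term-frequency vector for one document."""
    return Counter(normalize_terms(strip_html(raw_html)))

# Hypothetical raw document.
doc = "<html><body><h1>Pump Failure</h1><p>The pump failed; replace PUMP seal.</p></body></html>"
vector = document_vector(doc)
```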
13. © Hortonworks Inc. 2013 Page 13
In machine learning, very often:
more data -> better outcomes
Banko & Brill, 2001
•More examples to learn from
•More possible feature types
–We’re looking for the most useful
for our task
15. © Hortonworks Inc. 2013 Page 15
A (partial) map of data science “tasks”
• Discovery
– Clustering: detect natural groupings
– Outlier detection: detect anomalies
– Affinity analysis: co-occurrence patterns
• Prediction
– Classification: predict a category
– Regression: predict a value
– Recommendation: predict a preference
Big Data Science: High energy physics, Genomics, etc
16. © Hortonworks Inc. 2013 Page 16
Use-case: product recommendation
•Inputs:
–Explicit product ratings (when provided)
–Implicit information: purchase transactions, page views,
comments
Ratings matrix (products: Epic, X-Men, Hobbit, Argo, Pirates; users: U101, U102, U103, U104, U105, …; “?” = unknown):
5 2 4 ? ?
? ? 5 2 ?
1 2 ? ? 3
? 2 3 1 5
Data sources: ratings, page views, forum comments
17. © Hortonworks Inc. 2013 Page 17
Goal: predict a preference
Observed ratings (products: Epic, X-Men, Hobbit, Argo, Pirates; users: U101, U102, U103, U104, U105, …):
5 2 4 ? ?
? ? 5 2 ?
1 2 ? ? 3
? 2 3 1 5
Predicted ratings (same products and users):
5 2 4 1 3
4 1 5 2 3
1 2 4 1 3
3 2 3 1 5
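One very simple way to fill in the unknown ratings is a bias baseline: predict each missing cell as the global mean plus a per-row and a per-column offset. The sketch below applies it to the observed matrix from the slide. This is a swapped-in illustration, not the method used in the talk, and its predictions will not match the filled-in matrix shown on the slide; `fill_missing` and `_bias` are hypothetical names.

```python
from statistics import mean

# Observed ratings copied from the slide; None marks an unknown ("?") rating.
observed = [
    [5, 2, 4, None, None],
    [None, None, 5, 2, None],
    [1, 2, None, None, 3],
    [None, 2, 3, 1, 5],
]

def _bias(values, mu):
    """Average deviation from the global mean (0.0 if nothing is known)."""
    known = [v - mu for v in values if v is not None]
    return mean(known) if known else 0.0

def fill_missing(matrix, lo=1.0, hi=5.0):
    """Predict each missing cell as global mean + row bias + column bias."""
    mu = mean(v for row in matrix for v in row if v is not None)
    row_bias = [_bias(row, mu) for row in matrix]
    col_bias = [_bias(col, mu) for col in zip(*matrix)]
    return [
        [
            # Keep known ratings; clip predictions to the valid rating range.
            v if v is not None else min(hi, max(lo, mu + row_bias[i] + col_bias[j]))
            for j, v in enumerate(row)
        ]
        for i, row in enumerate(matrix)
    ]

predicted = fill_missing(observed)
```

Production recommenders typically use richer models (for example, matrix factorization or neighborhood methods), but a baseline like this is a common starting point to compare against.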
18. © Hortonworks Inc. 2013 Page 18
Using Hadoop for recommendation
Data sources (transactions, page views, content) → HDFS → Map Reduce (pre-process, recommend, custom logic) → SQL → online serving
With Hadoop, we can process
very large preference datasets
19. © Hortonworks Inc. 2013 Page 19
Use-case: failure prediction
•Inputs:
–Equipment history: install date, model, past issues
–Equipment sensor data
–Product catalog: product families, expected lifetime
SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…       …             …                  …
Data sources: equipment history, sensor data, product catalog
20. © Hortonworks Inc. 2013 Page 20
Building a prediction model
Labeled Data:
SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…       …             …                  …
Labeled data → Model; Unseen data → Model → TTF
Unseen data:
SKU     Install date  Service Person ID  Zip code  Avg temp
332456  3/3/2013      1345               94005     71
442343  6/6/2013      1112               77485     67
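The labeled-data → model → prediction flow can be sketched with the smallest possible model: a one-variable least-squares line that predicts TTF from average temperature only. This is a toy under loud assumptions: three rows are far too few to learn anything real, the other columns are ignored, and the talk does not say which algorithm was used. `fit_line` is a hypothetical helper.

```python
# Labeled rows from the slide: (average temperature, time to failure in days).
labeled = [(72, 180), (68, 450), (82, 332)]

# Unseen rows from the slide: average temperature only.
unseen = [71, 67]

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# "Train" on labeled data, then predict TTF for the unseen rows.
intercept, slope = fit_line(labeled)
predicted_ttf = [intercept + slope * x for x in unseen]
```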
21. © Hortonworks Inc. 2013 Page 21
Using Hadoop for failure prediction
• HDFS: central repository for all data
– Service records (Word, PDF, etc.)
– Equipment purchase transaction data
– Product catalog: SKUs, model numbers, etc.
• Pre-process
– Convert service records to item features: remove PDF
formatting, detect entities in records
– Normalize data using service records, product catalog
– Create feature matrix; ready for modeling algorithm
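The pre-processing steps above can be sketched as a small join: service records are turned into per-SKU features and then enriched from the product catalog. Everything in this sketch is hypothetical (the record texts, the catalog fields, the keyword-based "entity detection", and the function name); it only illustrates the shape of the resulting feature matrix.

```python
# Hypothetical service records, assumed already stripped of PDF/Word formatting.
service_records = [
    {"sku": "113454", "text": "Replaced seal on compressor"},
    {"sku": "113454", "text": "Compressor noise reported"},
    {"sku": "998323", "text": "Routine inspection"},
]

# Hypothetical product catalog keyed by SKU.
catalog = {
    "113454": {"family": "A", "expected_lifetime_days": 3650},
    "998323": {"family": "B", "expected_lifetime_days": 1825},
}

# Hypothetical entity list; real entity detection would be far richer.
ENTITIES = ("compressor", "seal")

def build_feature_matrix(records, catalog):
    """Return one feature row per SKU, joined with its catalog entry."""
    rows = {}
    for record in records:
        sku = record["sku"]
        row = rows.setdefault(
            sku,
            {"sku": sku, "service_calls": 0, **{f"mentions_{e}": 0 for e in ENTITIES}},
        )
        row["service_calls"] += 1
        text = record["text"].lower()
        for entity in ENTITIES:
            row[f"mentions_{entity}"] += text.count(entity)
    # Normalize against the catalog: attach product family and expected lifetime.
    for sku, row in rows.items():
        row.update(catalog.get(sku, {}))
    return [rows[sku] for sku in sorted(rows)]

feature_matrix = build_feature_matrix(service_records, catalog)
```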
22. © Hortonworks Inc. 2013 Page 22
Use-case: SaaS application security
•Inputs:
–Click-stream: user interaction with application
User ID  User since  Logins/month  Avg DL KB/day  …
123456   1/3/2004    6             30
998323   5/3/2009    1             5
345375   8/2/2005    22            120
…        …           …             …
Data sources: user data, clicks
23. © Hortonworks Inc. 2013 Page 23
Detecting anomalous behavior records
• User access profile modeled as vector of features
• Detect anomalies in application access patterns
– Rules based
– Machine learning based (determine “outlier factor”: 0…1)
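As a minimal illustration of an "outlier factor" between 0 and 1, the sketch below treats each user's access profile as a feature vector (logins/month, average download KB/day, using the three rows from the previous slide), measures its standardized distance from the average profile, and scales by the largest distance. This scaled-distance score is an assumption chosen for simplicity; it is not a specific algorithm named in the talk.

```python
from math import sqrt
from statistics import mean, pstdev

# Access profiles from the slide: user ID -> (logins/month, avg DL KB/day).
profiles = {
    "123456": (6, 30),
    "998323": (1, 5),
    "345375": (22, 120),
}

def outlier_factors(profiles):
    """Scale each user's distance from the average profile into the range 0..1."""
    columns = list(zip(*profiles.values()))
    centers = [mean(col) for col in columns]
    spreads = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    distances = {
        user: sqrt(sum(((v - c) / s) ** 2 for v, c, s in zip(vec, centers, spreads)))
        for user, vec in profiles.items()
    }
    largest = max(distances.values()) or 1.0
    return {user: d / largest for user, d in distances.items()}

factors = outlier_factors(profiles)
most_anomalous = max(factors, key=factors.get)
```

With only three users the scores mean little; the sketch is meant to show the shape of the computation, where a threshold on the factor would flag users for review.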
24. © Hortonworks Inc. 2013 Page 24
Using Hadoop for anomaly detection
• HDFS: central repository for all raw data
– Raw user-access logs
– User information (organization, demographics)
• Pre-process
– Build access-profile (behavioral) for each user
• Detect anomalies
– In Hadoop
– Using existing tools: R, SAS, rules engine, etc
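The "build access-profile for each user" step amounts to a group-by over raw log lines. Here is a minimal single-machine sketch, assuming a hypothetical comma-separated log format (`user_id,date,action,kilobytes`); on a cluster the same aggregation would typically be expressed as a MapReduce, Pig, or Hive job keyed by user ID.

```python
from collections import defaultdict

# Hypothetical raw access-log lines: user_id,date,action,kilobytes
RAW_LOG = """\
123456,2013-06-01,login,0
123456,2013-06-01,download,30
345375,2013-06-01,login,0
345375,2013-06-01,download,100
345375,2013-06-02,login,0
345375,2013-06-02,download,140
"""

def build_profiles(raw_log):
    """Aggregate raw log lines into one behavioral profile per user."""
    totals = defaultdict(lambda: {"logins": 0, "download_kb": 0, "days": set()})
    for line in raw_log.splitlines():
        user, day, action, kb = line.split(",")
        entry = totals[user]
        entry["days"].add(day)
        if action == "login":
            entry["logins"] += 1
        elif action == "download":
            entry["download_kb"] += int(kb)
    return {
        user: {
            "logins": t["logins"],
            "avg_dl_kb_per_day": t["download_kb"] / len(t["days"]),
        }
        for user, t in totals.items()
    }

access_profiles = build_profiles(RAW_LOG)
```

The resulting per-user profiles are small enough to hand to existing tools (R, SAS, a rules engine) for the anomaly-detection step.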
26. © Hortonworks Inc. 2013 Page 26
Getting started with Data science on Hadoop
1. Pick a good use-case that delivers immediate
business value
2. Implement a proof-of-value (POV)
3. Build a team (hire/train)
27. © Hortonworks Inc. 2013 Page 27
Implement a proof-of-value
• Put together a Hadoop cluster
• Define the POV business use-case
• Pull raw data you need into the cluster
• Build it
• Show the business value of your data assets
Contact us. We can help!
28. © Hortonworks Inc. 2013 Page 28
Build a team:
The data scientist skillset continuum
Software
engineer
Research
Scientist
Data
Engineer
Data
Scientist
Applied
Scientist
Role: Data Engineer
• Function: builds production-grade data products
• Good at:
– Data and systems architecture
– Hadoop, Pig/Hive, MapReduce, Mahout
– Java, Python, Perl, SQL, C++, etc.
– NoSQL (HBase, Cassandra, Mongo)
Role: Applied Scientist
• Function: finds signal/meaning in the data; applies statistical/ML models and tunes the algorithm
• Good at:
– Statistics, machine learning
– Text processing, NLP
– R, Matlab, SAS, SQL
– Scripting, prototyping
– Visualization / telling the story
29. © Hortonworks Inc. 2013 Page 29
Thank you!
Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
ofer@hortonworks.com
@ofermend
We’re hiring!
Data Science training: www.hortonworks.com/training
Editor's Notes
Data science is not new. But now we need to do it with much larger datasets.
As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets).
Instead, we increasingly see Hadoop, and HDP in particular, being introduced as a complement to the traditional approaches. It is not replacing the database but rather is a complement, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with:
• Existing applications, such as Tableau, SAS, Business Objects, etc.
• Existing databases and data warehouses, for loading data to / from the data warehouse
• Development tools used for building custom applications
• Operational tools for managing and monitoring