ApacheBigData - Budapest, 2015
Data Science from the trenches
What are the issues?
How to select the best algorithm?
How to tune it?
What are the problems with visualization?
How does Zeppelin help?
1. Page 1
Data Science: A view from the trenches
Ram Sriharsha
Twitter: @halfbrane
Vinay Shukla
Twitter: @neomythos
2. Page 2
Agenda
• Problems we work on
• Common Challenges
• Reductions
• Handling label sparsity
– Co Training
– Adaptive Learning
• When you have to be fast and accurate
– Online Clustering
– Sketches
– Online Learning
• Visualization
3. Page 3
Some Problems
• Search Advertising
– Click Prediction: Given a query, ad and user context, how likely is the user to click on ad?
– Feature Engineering: Query/ ad categorization, query -> feature vector
• Entity Resolution and Disambiguation
• Over / Under Payment of claims detection
• Document Matching
• Login Risk Detection
4. Page 4
Common Challenges
• Labeling is expensive and not clean
– Selectively ask for labels (active learning)
– Co-Training to expand label set
• Not enough high quality implementations of algorithms
– Modular extensions of base implementations (Reductions)
– Boosting
• Speed of training/scoring is important
– Online learning
– Online clustering
– Sketches
• Freshness of models
– Online and adaptive learning
• Visualizing performance and feature importance
– Zeppelin
5. Page 5
Reductions
• Let A = algorithm for optimizing 0/1 loss
• OVR (one-vs-rest): reduce multiclass to binary
– Train one copy of A per class; randomize over the classifiers that output yes
• Importance Weighting: reduce cost-sensitive classification to 0/1 classification (R → A → R^-1)
– Let R = rejection sampling algorithm
– For each example h, sample according to the cost of h and feed to the 0/1 classifier A
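As a concrete illustration, here is a minimal sketch of the OVR reduction, using the argmax variant for determinism. `CentroidBinary` is a toy stand-in (invented here) for any base 0/1 learner with `fit`/`score`:

```python
# One-vs-rest (OVR) reduction sketch: a k-class problem becomes k binary
# problems, one per class; prediction takes the binary classifier with
# the highest score. CentroidBinary is a toy base learner for illustration.
class CentroidBinary:
    def fit(self, X, y):
        pos = [x for x, lbl in zip(X, y) if lbl == 1]
        neg = [x for x, lbl in zip(X, y) if lbl == 0]
        self.pos_c = [sum(v) / len(pos) for v in zip(*pos)]
        self.neg_c = [sum(v) / len(neg) for v in zip(*neg)]
        return self

    def score(self, x):
        # higher score = more confidently positive
        d = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
        return d(self.neg_c) - d(self.pos_c)

class OneVsRest:
    def __init__(self, base):
        self.base = base

    def fit(self, X, y):
        self.classes = sorted(set(y))
        # train one binary learner per class: that class vs. the rest
        self.models = {c: self.base().fit(X, [1 if lbl == c else 0 for lbl in y])
                       for c in self.classes}
        return self

    def predict(self, x):
        return max(self.classes, key=lambda c: self.models[c].score(x))
```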
6. Page 6
Active Learning
• Given a pool of examples, determine which ones the classifier is least confident about
• Ask those examples to be labeled, and feed to training
• Choose query points that shrink the space of classifiers rapidly
• Exploit natural structure in data
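A minimal sketch of pool-based uncertainty sampling along these lines, assuming the caller supplies a `predict_proba` scoring function (a hypothetical interface returning P(label = 1)):

```python
# Uncertainty sampling sketch: request labels for the pool examples whose
# predicted probability is closest to 0.5, i.e. where the classifier is
# least confident.
def select_queries(pool, predict_proba, budget):
    # rank pool points by |p - 0.5|, ascending (most uncertain first)
    ranked = sorted(pool, key=lambda x: abs(predict_proba(x) - 0.5))
    return ranked[:budget]
```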
7. Page 7
Co Training
• Suppose you have two “views” of the data
– e.g., web pages have content, and hyperlinks pointing to and from them
– Suppose the problem is to label a web page as about literature or not (binary classification)
• One approach:
– Label web pages manually. Train a classifier that uses both content text and hyperlinks as features
– This requires a large number of labeled pages
• Other approach:
– Since we have two views, try to learn two classifiers
– Each classifier learns on a subset of labeled examples.
– The scores of each classifier are used to label a subset of unlabeled web pages and extend the labels for the other classifier.
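The second approach might be sketched as a single co-training round. The `fit`/`predict_proba` methods are assumed interfaces on the two view classifiers, and the confidence threshold is an illustrative choice:

```python
# Co-training round sketch: each view's classifier pseudo-labels the
# unlabeled examples it is confident about, extending the training set
# used by the other view's classifier.
def cotrain_round(clf_a, clf_b, labeled_a, labeled_b, ys,
                  unlabeled_a, unlabeled_b, thresh=0.9):
    clf_a.fit(labeled_a, ys)
    clf_b.fit(labeled_b, ys)
    new_a, new_b, new_y, rest_a, rest_b = [], [], [], [], []
    for xa, xb in zip(unlabeled_a, unlabeled_b):
        pa, pb = clf_a.predict_proba(xa), clf_b.predict_proba(xb)
        if max(pa, 1 - pa) >= thresh:            # view A is confident:
            new_a.append(xa); new_b.append(xb)   # adopt its pseudo-label
            new_y.append(1 if pa >= 0.5 else 0)
        elif max(pb, 1 - pb) >= thresh:          # else defer to view B
            new_a.append(xa); new_b.append(xb)
            new_y.append(1 if pb >= 0.5 else 0)
        else:                                    # neither view is confident
            rest_a.append(xa); rest_b.append(xb)
    return labeled_a + new_a, labeled_b + new_b, ys + new_y, rest_a, rest_b
```

In practice the round is repeated until the unlabeled pool stops shrinking.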
8. Page 8
Sketches
• Store a “summary” of the dataset
• Querying the sketch is “almost” as good as querying the dataset
• Example: frequent items in a stream (the Misra-Gries summary)
– Initialize an associative array A with at most k - 1 counters
– Process each item j:
- if j is in keys(A), A[j] += 1
- else if |keys(A)| < k - 1, A[j] = 1
- else, for each l in keys(A): A[l] -= 1; if A[l] == 0, remove l
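The frequent-items sketch above can be written directly in Python:

```python
# Frequent-items (Misra-Gries) sketch: an associative array with at most
# k - 1 counters approximates heavy-hitter counts in one pass over the
# stream, using O(k) space regardless of stream length.
def frequent_items(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # decrement every counter; evict counters that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Counts are underestimates, but any item occurring more than n/k times in a stream of length n is guaranteed to survive in the summary.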
9. Page 9
Clustering is not fast enough
• Sample and then cluster
• Do clusters need to dynamically adapt?
– Online clustering
– Streaming K Means
10. Page 10
K Means
• Initialize cluster centers somehow
– random
– K-means++
• Alternate
– Assign each point to closest cluster center
– Move cluster center to average of points assigned to center
• Stop when convergence criteria reached
– Points don’t move “much”
– Number of iterations reached.
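The alternation above is Lloyd's algorithm; here is a minimal sketch on 1-D points, with illustrative tolerance and iteration-cap defaults:

```python
# Lloyd's algorithm sketch (1-D for brevity): alternate assignment and
# re-centering until the centers stop moving "much" or an iteration
# budget is reached.
def kmeans(points, centers, tol=1e-6, max_iter=100):
    for _ in range(max_iter):
        # assignment step: attach each point to its closest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # update step: move each center to the mean of its assigned points
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if max(abs(a - b) for a, b in zip(new, centers)) < tol:
            return new
        centers = new
    return centers
```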
12. Page 12
Assign each point
[Figure: assignment step — each point is assigned to the closest cluster center (k1, k2, k3)]
13. Page 13
Recompute Cluster Centers
[Figure: update step — each cluster center (k1, k2, k3) moves to the mean of its cluster]
14. Page 14
Streaming K Means
• For each new point
– Assign to closest cluster center
– Update cluster center to incrementally move in direction of new point
• Online version of Lloyd’s algorithm
• Good enough in practice
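A minimal sketch of the incremental update on 1-D points, using the running-mean step so each center stays at the mean of the points it has absorbed:

```python
# Streaming k-means update sketch: each arriving point moves its closest
# center toward itself, weighted by how many points that center has
# already absorbed (an online version of Lloyd's update step).
def streaming_update(centers, counts, p):
    i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
    counts[i] += 1
    # incremental mean: c += (p - c) / n keeps c equal to the running mean
    centers[i] += (p - centers[i]) / counts[i]
    return i
```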
16. Page 16
Online Clustering (Liberty, Sriharsha, Sviridenko)
• Initialization Phase:
– First point is its own cluster
– Pick some Normalization factor f
• Update Phase for point p:
– Let d = distance from p to the closest center so far
– With probability min(d/f, 1), form a new cluster center at p
– Otherwise, attach p to the closest center
• Merge Phase:
– Once sufficient clusters have opened up, or sufficient cost accumulated, merge clusters
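A hedged sketch of the initialization and update phases on 1-D points; the probability form `min(d/f, 1)` and the omission of the merge phase are simplifications of the published algorithm:

```python
import random

# Online clustering sketch: the first point opens a cluster; thereafter,
# a point opens a new cluster with probability min(d/f, 1), where d is
# its distance to the closest center, and otherwise attaches to that
# center (no center movement in this sketch).
def online_cluster(points, f, rng=random.random):
    centers = []
    for p in points:
        if not centers:
            centers.append(p)  # first point is its own cluster
            continue
        d = min(abs(p - c) for c in centers)
        if rng() < min(d / f, 1.0):
            centers.append(p)  # far-away point: open a new cluster
    return centers
```

Far points are thus likely to open clusters and nearby points are likely to be absorbed, which is what keeps the number of open clusters small.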
17. Page 17
Properties
• Provably close to optimal in the online setting
• Does not open more than O(log(OPT)) clusters and pays O(OPT) cost
• Very efficient to implement
• Adaptive algorithm
• Forgetfulness can be introduced in the merge process
• Leaving out the merge process still produces a clustering that might be indicative of structure, i.e., useful as a machine learning feature
18. Page 18
My classifier is not fast enough
• Even for batch problems online learning might be good enough!
• For real time problems, online learning or incremental learning is needed.
19. Page 19
What is online learning?
• Batch Learning:
– Classifier sees a set of labeled examples, and trains a model
– Predicts on trained model for unseen examples
• Online Learning:
– Classifier sees an example at a time.
– Limited look back window (often 0)
– Predicts on example and is revealed the cost
– Learns from mistake
– Yields a one-pass batch learning algorithm: simply run the online algorithm on each example in the batch.
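The predict/reveal/learn protocol above can be sketched with a perceptron as the learner (one of the simplest online algorithms; the choice is illustrative):

```python
# Online learning protocol sketch with a perceptron: for each example,
# predict, observe the true label, and update the weights only when the
# prediction was a mistake. One pass over a batch gives a one-pass
# batch learner.
def perceptron_online(examples, dim):
    w, mistakes = [0.0] * dim, 0
    for x, y in examples:               # labels y in {-1, +1}
        margin = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if margin >= 0 else -1
        if pred != y:                   # learn from the mistake
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes
```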
20. Page 20
Challenges of online learning
• Normalization
– In batch set up, can normalize data by making a pass over the full dataset
– In online setting, cannot make a second pass
– Solution: Adaptive normalization
• Late arriving features
– In Batch setting, all features are recorded in the dataset
– In online setting different features may arrive at different times
– Solution: Adagrad (Adaptive gradient technique)
• Stochastic Gradient Descent convergence can be slow
– More data helps
– Adaptive normalization improves convergence
– Adagrad improves convergence and reduces sensitivity to step size
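A sketch of the per-coordinate Adagrad update mentioned above; `eta` and `eps` are illustrative defaults:

```python
import math

# Adagrad update sketch: each coordinate keeps its own accumulated
# squared gradient, so rarely-seen (late arriving) features get larger
# effective step sizes, and the result is less sensitive to the global
# learning rate eta.
def adagrad_step(w, g, accum, eta=0.1, eps=1e-8):
    for i, gi in enumerate(g):
        accum[i] += gi * gi                        # per-coordinate history
        w[i] -= eta * gi / (math.sqrt(accum[i]) + eps)
    return w
```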
22. Page 22
The Data Science Workflow…
• Plan (start here): What is the question I'm answering? What data will I need?
• Acquire the data
• Clean data (script): analyze data quality, reformat, impute, etc.
• Analyze data (script + visualize): visualize, create features, create model, evaluate results
• Publish & share (end here): create report, deploy in production
23. Page 23
Introducing Apache Zeppelin: a Web-based Notebook for interactive analytics
Use Case
Data exploration and discovery
Visualization
Interactive snippet-at-a-time experience
“Modern Data Science Studio”
24. Page 24
Zeppelin today in Data Science Workflow…
[Figure: the data science workflow from Page 22, repeated to show where Zeppelin fits today]
25. Page 25
Zeppelin – Road Ahead
Operations
- Deploy to the cluster with Ambari
Security
- Authentication against LDAP
- SSL
- Run in Kerberized Cluster
- Authorization of notebooks
Sharing/ Collaboration
- Share selected notebooks with selected users/groups
- Ability to read/publish notebooks to GitHub
Data Import
- Visual data import/download
- Clean data as it comes
Usability
- Summary Data – See column summary
- Keyboard shortcuts, auto-complete, syntax highlighting, line numbers
Visualization
- Pluggable visualization & more charts, maps & tables
R support
- Harden SparkR interpreter
Enterprise Ready / Ease of Use
26. Page 26
Upcoming Work
• Entity Resolution package GA
– Supports Entity Graph based resolution
– Includes Random Walk algorithm for computing similarity score
• Online learning and clustering Spark Packages
• Contribute more Reduction algorithms to Spark ML
– Cost Sensitive Classification
– Filter tree based Multiclass Reduction
• Zeppelin GA