SlideShare uma empresa Scribd logo
1 de 144
Baixar para ler offline
Data Science Training
Spark
“I want to die on Mars but not on impact”
— Elon Musk, interview with Chris Anderson
“The shrewd guess, the fertile hypothesis, the courageous leap to a
tentative conclusion – these are the most valuable coin of the thinker at
work” -- Jerome Seymour Bruner
"There are no facts, only interpretations." - Friedrich Nietzsche
"If you torture the data long enough, it will confess to anything." – Hal Varian,
Computer Mediated Transactions!
------ We are not going to hang data by it’s legs !!
http://training.databricks.com/workshop/datasci.pdf
ADVANCED: DATA SCIENCE WITH APACHE SPARK
Data Science applications with Apache Spark combine the scalability of Spark and the
distributed machine learning algorithms.
This material expands on the “Intro to Apache Spark” workshop. Lessons focus on
industry use cases for machine learning at scale, coding examples based on public
data sets, and leveraging cloud-based notebooks within a team context. Includes
limited free accounts on Databricks Cloud.
Topics covered include:
Data transformation techniques based on both Spark SQL and functional
programming in Scala and Python.
Predictive analytics based on MLlib, clustering with KMeans, building classifiers
with a variety of algorithms and text analytics – all with emphasis on an
iterative cycle of feature engineering, modeling, evaluation.
Visualization techniques (matplotlib, ggplot2, D3, etc.) to surface insights.
Understand how the primitives like Matrix Factorization are implemented in a
distributed parallel framework from the designers of MLlib
Several hands-on exercises using datasets such as Movielens, Titanic, State Of
the Union speeches, and RecSys Challenge 2015.
Prerequisites:
Intro to Apache Spark workshop or
equivalent (e.g., Spark Developer Certificate)
Experience coding in Scala, Python, SQL
Have some familiarity with Data Science
topics (e.g., business use cases)
Agenda
•  Detailed)agenda)In)Google)doc)
•  https://docs.google.com/document/d/
1T9AkXUmL6gDYTpAEEgqsy9hfJtqGlavjnGzMqzUwDf
E/edit)
Goals
•  Patterns):)Data)wrangling)(Transform,)Model)&)Reason))with)Spark)
o  Use)RDDs,)Transformations)and)Actions)in)the)context)of)a)Data)Science)Problem,)an)Algorithms))&)a)
Dataset)
•  Spend)time)working)through)MLlib))
•  Balance)between)internals)&)handsWon)
o  Internals)from)Reza,)the)MLlib)lead)
•  ~65%)of)time)on)Databricks)Cloud)&)Notebooks)
o  Take)the)time)to)get)familiar)with)the)Interface)&)the)Data)Science)Cloud)
o  Make*mistakes,*experiment,…*
•  Good)Time)for)this)course,)this)version)
o  Will)miss)many)of)the)gory)details)as)the)framework)evolves)
•  Summarized)materials)for)a)3)day)course)
o  Even)if)we)don’t)finish)the)exercises)today,)that)is)fine)
o  Complete)the)work)at)home)W)There*are*also**homework*notebooks*
o  Ask)us)questions)@ksankar,@pacoid,*@reza_zadeh,*@mhfalaki,*@andykonwinski,*@xmeng,*
@michaelarmbrust,*@tathadas*
Tutorial)Outline:)
morning' a)ernoon'
o  Welcome)+)Ge3ng)Started)(Krishna))
o  Databricks)Cloud)mechanics)(Andy)))
o  Ex)3):)Clustering)B)In)which)we)explore)SegmenFng)
Frequent)InterGallacFcHoppers)(Krishna))
o  Ex)0:)PreBFlight)Check)(Krishna)) o  Ex)4):)RecommendaFon)(Krishna))
o  DataScience)DevOps)B)IntroducFon)to)Spark)(Krishna)) o  Theory):)Matrix)FactorizaFon,)SVD,…)(Reza))
o  Ex)1:)MLlib):)StaFsFcs,)Linear)Regression)(Krishna)) o  OnBline)kBmeans,)spark)streaming)(Reza))
o  MLlib)Deep)Dive)–)Lecture)(Reza))
o  Design)Philosophy,)APIs)
o  Ex)2:)In)which)we)explore)Disasters,)Trees,)
ClassificaFon)&)the)Kaggle)CompeFFon)(Krishna))
o  Random)Forest,)Bagging,)Data)DeBcorrelaFon)
o  Ex)5):)Mood)of)the)UnionBText)AnalyFcs(Krishna))
o  In)which)we)analyze)the)Mood)of)the)naFon)from)
inferences)on)SOTU)by)the)POTUS)(State)of)the)Union)
Addresses)by)The)President)Of)the)US))
o  Deepdive)B)Leverage)parallelism)of)RDDs,)sparse)
vectors,)etc)(Reza))
o  Ex)99):)RecSys)2015)Challenge)(Krishna))
o  Ask)Us)Anything)B)Panel)
Introducing:)
Reza Zadeh
@Reza_Zadeh
Hossein Falaki
@mhfalaki
Andy Konwinski
@andykonwinski
Krishna Sankar
@ksankar
Paco Nathan
@pacoid
Xiangrui Meng
@xmeng
Michael Armbrust
@michaelarmbrust
Tathagata Das
@tathadas
About Me
o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, Pydata,
Strata et al
o Reviewer “Machine Learning with Spark”
o Picked up co-authorship Second Edition of “Fast
Data Processing with Spark”
o Have done lots of things:
• 
• 
• 
• 
• 
• 
o  @ksankar, doubleclix.wordpress.com ksankar42@gmail.com
The)Nuthead)band)!)
Pre-requisites
①  Register)&)Download)data)from)Kaggle.)
We)cannot)distribute)Kaggle)data.)
Moreover)you)need)an)account)to)submit)entries)
a)  Setup)an)account)in)Kaggle)(www.kaggle.com))
b)  We)will)be)using)the)data)from)the)competition)“Titanic:)
Machine)Learning)from)Disaster”)
c)  Download)data)from)
http://www.kaggle.com/c/titanicWgettingStarted)
②  Register)for)RecSys)2015)Competition)
a)  http://2015.recsyschallenge.com/)
Welcome +
Getting Started
9:00
Everyone will receive a username/password for one !
of the Databricks Cloud shards. Use your laptop and browser to login there.
We find that cloud-based notebooks are a simple way to get started using
Apache Spark – as the motto “Making Big Data Simple” states.
Please create and run a variety of notebooks on your account throughout the
tutorial. These accounts will remain open long enough for you to export your
work.
See the product page or FAQ for more details, or contact Databricks to register
for a trial account.
10
Getting Started: Step 1
11
Getting Started: Step 1 – Credentials
url: https://class01.cloud.databricks.com/
user: student-777
pass: 93ac11xq23z5150
cluster: student-777
12
Open in a browser window, then click on the navigation menu in the top/left
corner:
Getting Started: Step 2
13
The next columns to the right show folders,!
and scroll down to click on databricks_guide
Getting Started: Step 3
14
Scroll to open the 01 Quick Start notebook, then follow the discussion
about using key features:
Getting Started: Step 4
15
See /databricks-guide/01 Quick Start !
Key Features:
•  Workspace / Folder / Notebook
•  Code Cells, run/edit/move/comment
• Markdown
•  Results
•  Import/Export
Getting Started: Step 5
16
Click on the Workspace menu and create your !
own folder (pick a name):
Getting Started: Step 6
17
Navigate to /_DataScience

Hover on its drop-down menu, on the right side:
Getting Started: Step 7
18
Navigate to /_DataScience

Hover on its drop-down menu, on the right side:
Click Clone:
Getting Started: Step 8
19
Then create a clone of this folder in the folder that you just created:
Getting Started: Step 9
20
Now let’s get started with the coding exercise!
We’ll define an initial Spark app in three lines of code:
Click on _00.pre-flight-check
Getting Started: Coding Exercise
21
Attach your cluster – same as your username:
Getting Started: Step 10
Introduction To Spark
Data Science :
The art of building a model with known knowns, which when let loose,
works with unknown unknowns!
http://smartorg.com/2013/07/valuepoint19/
The
World

Knowns&
&
&
&
&
&
Unknowns&
You
UnKnown& Known&
o  Others)know,)you)don’t) o  What)we)do)
o  Facts,)outcomes)or)
scenarios)we)have)not)
encountered,)nor)
considered)
o  “Black)swans”,)outliers,)
long)tails)of)probability)
distribuFons)
o  Lack)of)experience,)
imaginaFon)
o  PotenFal)facts,)
outcomes)we)
are)aware,)but)
not))with)
certainty)
o  StochasFc)
processes,)
ProbabiliFes)
o 
o 
o 
o 
o 
o 
The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual
o Data Scientist should be building Data
Products
o Data Scientist should tell a story
http://doubleclix.wordpress.com/2014/01/25/theWcuriousWcaseWofWtheWdataWscientistWprofession/
Data Science - Context
o  Scalable)Model)
Deployment)
o  Big)Data)
automation)&)
purpose)built)
appliances)(soft/
hard))
o  Manage)SLAs)&)
response)times)
o  Volume)
o  Velocity)
o  Streaming)Data)
o  Canonical)form)
o  Data)catalog)
o  Data)Fabric)across)the)
organization)
o  Access)to)multiple)
sources)of)data))
o  Think)Hybrid)–)Big)
Data)Apps,)Appliances)
&)Infrastructure)
Collect Store Transform
o  Metadata)
o  Monitor)counters)&)
Metrics)
o  Structured)vs.)MultiW
structured)
o  Flexible)&)Selectable)
!  Data)Subsets))
!  Attribute)sets)
o  Refine)model)with)
!  Extended)Data)
subsets)
!  Engineered)
Attribute)sets)
o  Validation)run)across)a)
larger)data)set)
Reason Model Deploy
Data Management
Data Science
o  Dynamic)Data)Sets)
o  2)way)keyWvalue)tagging)of)
datasets)
o  Extended)attribute)sets)
o  Advanced)Analytics)
ExploreVisualize Recommend Predict
o  Performance)
o  Scalability)
o  Refresh)Latency)
o  InWmemory)Analytics)
o  Advanced)Visualization)
o  Interactive)Dashboards)
o  Map)Overlay)
o  Infographics)
"  Bytes to Business
a.k.a. Build the full
stack
"  Find Relevant Data
For Business
"  Connect the Dots
Volume
Velocity
Variety
Data Science - Context
Context
Connect
edness
Intelligence
Interface
Inference
o  Three Amigos
o  Interface = Cognition
o  Intelligence = Compute(CPU) & Computational(GPU)
o  Infer Significance & Causality
Day in the life of a (super) Model
Intelligence
Inference
Data Representation
Interface
Algorithms)
Parameters)Agributes)
Data)(Scoring))
Model)
SelecFon)
Reason)&)
Learn)
Models)
Visualize,)
Recommend,)
Explore)
Model)Assessment)
Feature)SelecFon)Dimensionality)ReducFon)
The'Sense'&'Sensibility'of'a'
DataScien7st'DevOps'
Factory)=)OperaFonal)
Lab)=)InvesFgaFve)
http://doubleclix.wordpress.com/2014/05/11/theWsenseW
sensibilityWofWaWdataWscientistWdevops/)
Spark-The Stack
RDD – The workhorse of Spark
o Resilient Distributed Datasets
• 
o Transformations – create RDDs
• 
o Actions – Get values
• 
o We will apply these operations during this tutorial
MLlib Hands-on
Stats, Linear Regression
9:20
Algorithm spectrum
o 
o 
o 
o 
o 
o 
o 
o 
)
o 
o 
o 
o 
o 
o 
o 
)
Machine0
Learning0
Cute0Math0
Ar6ficial0
Intelligence0
Session – 1 : MLlib - Statistics &
Linear Regression
1.  Notebook):)01_StatsLRW1)
①  Read)car)data)
②  Stats)(Guided))
③  Correlation)(Guided))
④  Coding)ExerciseW21WTemplate)
(Correlation))
2.  Notebook):)02_StatsLRW2)
①  CEW21WSolution)
3.  Linear)Regression)
①  LR)(Guided))
②  CEW22WTemplate((LR)on)Car)
Data))
4.  Notebook):)03_StatsLRW3)
①  CEW22WSolution)
②  Explain)
Linear Regression - API
LabeledPoint The)features)and)labels)of)a)data)point
LinearModel weights,)intercept
LinearRegressionModelBase predict()
LinearRegressionModel
LinearRegressionWithSGD train(cls,)data,)iteraFons=100,)step=1.0,)miniBatchFracFon=1.0,)
iniFalWeights=None,)regParam=1.0,)regType=None,)intercept=False)
LassoModel LeastBsquares)fit)with)an)L1)penalty)term.
LassoWithSGD train(cls,)data,)iteraFons=100,)step=1.0,)regParam=1.0,)
miniBatchFracFon=1.0,iniFalWeights=None)
RidgeRegressionModel LeastBsquares)fit)with)an)L2)penalty)term.
RidgeRegressionWithSGD train(cls,)data,)iteraFons=100,)step=1.0,)regParam=1.0,)miniBatchFracFon=1.0,)
iniFalWeights=None)
MLlib - Deep Dive
•  Design Philosophy & APIs
•  Algorithms - Regression, SGD et al
•  Interfaces
9:50
Reza Zadeh
Distributed Machine Learning"
on Spark
@Reza_Zadeh | http://reza-zadeh.com
Outline
Data flow vs. traditional network programming
Spark computing engine
Optimization Examples
Matrix Computations
MLlib + {Streaming, GraphX, SQL}
Future of MLlib
Data Flow Models
Restrict the programming interface so that the
system can do more automatically
Express jobs as graphs of high-level operators
» System picks how to split each operator into tasks
and where to run each task
» Run parts twice fault recovery
Biggest example: MapReduce
Map
Map
Map
Reduce
Reduce
Spark Computing Engine
Extends a programming language with a
distributed collection data-structure
» “Resilient distributed datasets” (RDD)
Open source at Apache
» Most active community in big data, with 50+
companies contributing
Clean APIs in Java, Scala, Python
Community: SparkR, soon to be merged
Key Idea
Resilient Distributed Datasets (RDDs)
» Collections of objects across a cluster with user
controlled partitioning & storage (memory, disk, ...)
» Built via parallel transformations (map, filter, …)
» The world only lets you make make RDDs such that
they can be:
Automatically rebuilt on failure
MLlib History
MLlib is a Spark subproject providing machine
learning primitives

Initial contribution from AMPLab, UC Berkeley

Shipped with Spark since Sept 2013
MLlib: Available algorithms
classification: logistic regression, linear SVM,"
naïve Bayes, least squares, classification tree
regression: generalized linear models (GLMs),
regression tree
collaborative filtering: alternating least squares (ALS),
non-negative matrix factorization (NMF)
clustering: k-means||
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS
Optimization
At least two large classes of optimization
problems humans can solve:"

»  Convex
»  Spectral
Optimization Example: Convex Optimization
Logistic Regression
data$=$spark.textFile(...).map(readPoint).cache()$
$
w$=$numpy.random.rand(D)$
$
for$i$in$range(iterations):$
$$$$gradient$=$data.map(lambda$p:$
$$$$$$$$(1$/$(1$+$exp(Bp.y$*$w.dot(p.x))))$*$p.y$*$p.x$
$$$$).reduce(lambda$a,$b:$a$+$b)$
$$$$w$B=$gradient$
$
print$“Final$w:$%s”$%$w$
Separable Updates
Can be generalized for
»  Unconstrained optimization
»  Smooth or non-smooth
»  LBFGS, Conjugate Gradient, Accelerated
Gradient methods, …
Logistic Regression Results
0
500
1000
1500
2000
2500
3000
3500
4000
1
 5
 10
 20
 30
RunningTime(s)
Number of Iterations
Hadoop
Spark
110 s / iteration
first iteration 80 s
further iterations 1 s
100 GB of data on 50 m1.xlarge EC2 machines
!
Behavior with Less RAM
68.8
58.1
40.7
29.7
11.5
0
20
40
60
80
100
0%
 25%
 50%
 75%
 100%
Iterationtime(s)
% of working set in memory
Optimization Example: Spectral Program
Spark PageRank
Given directed graph, compute node
importance. Two RDDs: 
»  Neighbors (a sparse graph/matrix)
»  Current guess (a vector)
"
Using cache(), keep neighbor list in RAM
Spark PageRank
Using cache(), keep neighbor lists in RAM
Using partitioning, avoid repeated hashing
Neighbors
(id, edges)
Ranks
(id, rank)
join
partitionBy
join
 join
…
PageRank Results
171
72
23
0
50
100
150
200
Timeperiteration(s)
Hadoop
Basic Spark
Spark + Controlled
Partitioning
Spark PageRank
Generalizes!to!Matrix!Multiplication,!opening!many!algorithms!
from!Numerical!Linear!Algebra!
Distributing Matrix Computations
Distributing Matrices
How to distribute a matrix across machines?
»  By Entries (CoordinateMatrix)
»  By Rows (RowMatrix)
»  By Blocks (BlockMatrix)
All of Linear Algebra to be rebuilt using these
partitioning schemes
As!of!version!1.3!
Distributing Matrices
Even the simplest operations require thinking
about communication e.g. multiplication

How many different matrix multiplies needed?
»  At least one per pair of {Coordinate, Row,
Block, LocalDense, LocalSparse} = 10
»  More because multiplies not commutative
Coffee-Break
Back at 10:45
MLlib - Hands On #2 – Kaggle
Competition Predicting Titanic
Survivors:
•  Feature Engineering
•  Classification Algorithms(Random forest)
•  Submission & Leaderboard scores
10:45
Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian,
Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s Generalization that counts
• 
o Data alone is not enough
• 
o Machine Learning is not magic – one cannot get something from nothing
• 
• 
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
Classification - Spark API
•  Logistic)Regression)
•  SVMWithSGD)
•  DecisionTrees)
•  Data)as)LabelledPoint)(we)will)see)in)a)moment))
•  DecisionTree.trainClassifier(data,)numClasses,)categoricalFeaturesInfo,)impurity="gini",)
maxDepth=4,)maxBins=100))
•  Impurity)–)“entropy”)or)“gini”)
•  maxBins)=)control)to)throttle)communication)at)the)expense)of)accuracy)
o  Larger)=)Higher)Accuracy)
o  Smaller)=)less)communication)(as)#)of)bins)=)number)of)instances))
•  Data)Adaptive)–)i.e.)decision)tree)samples)on)the)driver)and)figures)out)the)bin)spacing)i.e.)the)
places)you)slice)for)binning)
•  Spark*=*Intelligent*Framework*D*need*this*for*scale*
Lookout for these interesting Spark
features
•  Concept)of)Labeled)Point)&)how)to)create)an)
RDD)of)LPs)
•  Print)the)tree)
•  Calculate)Accuracy)&)MSE)from)RDDs)
Anatomy Of a Kaggle
Competition
Kaggle Data Science Competitions
o Hosts Data Science Competitions
o Competition Attributes:
• 
• 
• 
• 
• 
• 
• 
• 
• 
• 
Titanic)Passenger)Metadata)
•  Small)
•  3)Predictors)
•  Class)
•  Sex)
•  Age)
•  Survived?)
http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic
http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html
City Bike Sharing Prediction
(Washington DC)
Walmart Store Forecasting
Train.csv)
Taken)from)Titanic)
Passenger)Manifest)
Variable' Descrip7on'
Survived) 0BNo,)1=yes)
Pclass) Passenger)Class)()1st,2nd,3rd)))
Sibsp) Number)of)Siblings/Spouses)Aboard)
Parch) Number)of)Parents/Children)Aboard)
Embarked) Port)of)EmbarkaFon)
o  C)=)Cherbourg)
Titanic)Passenger)Metadata)
•  Small)
•  3)Predictors)
•  Class)
•  Sex)
•  Age)
•  Survived?)
Test.csv)
Submission
o 418 lines; 1st column should have 0 or 1 in each line
o Evaluation:
• 
Data Science “folk knowledge” (Wisdom of Kaggle)
Jeremy’s Axioms
o Iteratively explore data
o Tools
• 
o Get your head around data
• 
o Don’t over-complicate
o If people give you data, don’t assume that you
need to use all of it
o Look at pictures !
o History of your submissions – keep a tab
o Don’t be afraid to submit simple solutions
• 
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-
jeremy-howard/
Session-2 : Kaggle, Classification &
Trees
1.  Notebook):)04_TitanicW01)
①  Read)Training)Data)
②  Henry)the)Sixth)Model)
③  Submit)to)Kaggle)
2.  Notebook):)05_TitanicW02)
①  Decision)Tree)Model)
②  CEW31)Template)
①  Create)Randomforest)Model)
②  Predict)Testset)
③  Submit)to)Kaggle)
3.  Notebook):)06_TitanicW03)
①  CEW32)Solution)
①  RandomForest)Model)
②  Predict)Testset)
③  Submit)Solution)2)
②  Discussion)about)Models)
Trees, Forests & Classification
•  Discuss)Random)Forest)
o Boosting,)Bagging)
o Data)deWcorrelation))
•  Why)it)didn’t)do)better)in)Titanic)dataset)
•  Data)Science)Folk)Wisdom)
o http://www.slideshare.net/ksankar/dataWscienceWfolkW
knowledge)
Results
Algorithm' Kaggle'Score'
Gender)based)model) ~0.7655)Rank):)1276)
Decision)Tree) ~0.775124,)Rank):)1147)
Random)Forest)) ~0.77512)Gender)Based)Model)
Why didn’t RF do better ? Bias/Variance
See next slide
Henry0the0Sixth,0Part02,0Act04,0Scene020
Why didn’t RF do better ? Bias/Variance
o High Bias
• 
• 
• 
• 
• 
o High Variance
• 
• 
• 
• 
Prediction Error!
Training !
Error!
Ref: Strata 2013 Tutorial by Olivier Grisel
Learning
Curve
Need*more*features*or*more*
complex*model*to*improve*
Need*more*data*
to*improve*
'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
hgp://www.slideshare.net/ksankar/dataBscienceBfolkBknowledge)
Decision Tree – Best Practices
maxDepth) Tune)with)Data/Model)SelecFon)
maxBins) Set)low,)monitor)communicaFons,)increase)if)needed)
#)RDD)parFFons) Set)to)#)of)cores)
•  Usually the recommendation is that the RDD partitions should be over
partitioned ie “more partitions than cores”, because tasks take different
times, we need to utilize the compute power and in the end they average
out
•  But for Machine Learning especially trees, all tasks are approx equal
computationally intensive, so over partitioning doesn’t help
•  Joe Bradley talk (reference below) has interesting insights
hgps://speakerdeck.com/jkbradley/mllibBdecisionBtreesBatBsfBscalaBbamlBmeetup)
DecisionTree.trainClassifier(data,)numClasses,)categoricalFeaturesInfo,)
impurity="gini",)maxDepth=4,)maxBins=100))
◦  “Output)of)weak)classifiers)into)a)powerful)
commigee”)
◦  Final)PredicFon)=)weighted)majority)vote))
◦  Later)classifiers)get)misclassified)points))
!  With)higher)weight,))
!  So)they)are)forced))
!  To)concentrate)on)them)
◦  AdaBoost)(AdapFveBoosting))
◦  BoosFng)vs)Bagging)
!  Bagging)–)independent)trees)<B)Spark)shines)here)
!  BoosFng)–)successively)weighted)
Boosting
"  Goal!
◦  Model Complexity (-)!
◦  Variance (-)!
◦  Prediction Accuracy (+)!
◦  Builds)large)collecFon)of)deBcorrelated)trees)&)averages)them)
◦  Improves)Bagging)by)selecFng)i.i.d*)random)variables)for)
spli3ng)
◦  Simpler)to)train)&)tune)
◦  “Do0remarkably0well,0with0very0liIle0tuning0required”)–)ESLII)
◦  Less)suscepFble)to)over)fi3ng)(than)boosFng))
◦  Many)RF)implementaFons)
!  Original)version)B)FortranB77)!)By)Breiman/Cutler)
!  Python,)R,)Mahout,)Weka,)Milk)(ML)toolkit)for)py),)matlab))
!  And)of)course,)Spark)!)
* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Random Forests+
"  Goal!
◦  Model Complexity (-)!
◦  Variance (-)!
◦  Prediction Accuracy (+)!
Random Forests
•  While)Boosting)splits)based)on)best)among)all)variables,)RF)splits)based)
on)best)among)randomly*chosen)variables)
•  )Simpler)because)it)requires)two)variables)–)no.)of)Predictors)(typically)√k))
&)no.)of)trees)(500)for)large)dataset,)150)for)smaller))
•  Error)prediction)
o  For)each)iteration,)predict)for)dataset)that)is)not)in)the)sample)(OOB)data))
o  Aggregate)OOB)predictions)
o  Calculate)Prediction)Error)for)the)aggregate,)which)is)basically)the)OOB)estimate)of)error)rate)
•  Can)use)this)to)search)for)optimal)#)of)predictors)
o  We)will)see)how)close)this)is)to)the)actual)error)in)the)Heritage)Health)Prize)
•  Assumes)equal)cost)for)misWprediction.)Can)add)a)cost)function)
•  Proximity)matrix)&)applications)like)adding)missing)data,)dropping)
outliers))
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
◦  Two)Step)
!  Develop)a)set)of)learners)
!  Combine)the)results)to)develop)a)composite)predictor)
◦  Ensemble)methods)can)take)the)form)of:)
!  Using)different)algorithms,))
!  Using)the)same)algorithm)with)different)se3ngs)
!  Assigning)different)parts)of)the)dataset)to)different)
classifiers)
◦  Bagging)&)Random)Forests)are)examples)of)
ensemble)method))
Ref: Machine Learning In Action
Ensemble Methods
"  Goal!
◦  Model Complexity (-)!
◦  Variance (-)!
◦  Prediction Accuracy (+)!
Deepdive : Leverage
parallelism of RDDs, sparse
vectors, etc.
11:30
Lunch
Back at 1:30
Clustering - Hands On :
•  Normalization & Centering
•  Clustering
•  Optimizing k based on cohesively of the
clusters (WSSE)
1:30
Data Science “folk knowledge” (3 of A)
o More Data Beats a Cleverer Algorithm
• 
• 
o Learn many models, not Just One
• 
• 
• 
o Simplicity Does not necessarily imply Accuracy
o Representable Does not imply Learnable
• 
o Correlation Does not imply Causation o  http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o  A few useful things to know about machine learning - by Pedro Domingos
!  http://dl.acm.org/citation.cfm?id=2347755
Session-3 : Clustering
1.  Notebook):)07_ClusterW1)
①  Read)Data)
②  Cluster)
③  Modeling)ExerciseW41WTemplate)
2.  Notebook):)08_ClusterW2)
①  MEW41WSolution)
②  Center)and)Scale)
③  Cluster)
④  Inspect)centroid)
⑤  CEW42WTemplate):)Cluster)
Semantics)
3.  Notebook):)09_ClusterW3)
①  CEW42)Solution)
②  Cluster)Semantics)W)Discussion)
Clustering - Theory
•  Clustering)is)unsupervised)learning)
•  While)the)computers)can)dissect)a)dataset)into)“similar”)clusters,)it)
still)needs)human)direction)&)domain)knowledge)to)interpret)&)
guide)
•  Two)types:)
o  Centroid)based)clustering)–)kWmeans)clustering)
o  Tree)based)Clustering)–)hierarchical)clustering)
•  Spark)implements)the)Scalable)Kmeans++))
o  Paper):)http://theory.stanford.edu/~sergei/papers/vldb12Wkmpar.pdf)
Lookout for these interesting Spark
features
•  Application)of)Statistics)toolbox)
•  Center)&)Scale)RDD)
•  Filter)RDDs)
Clustering - API
•  from)pyspark.mllib.clustering)import)KMeans)
•  Kmeans.train)
•  train(cls,)data,)k,)maxIterations=100,)runs=1,)initializationMode="kW
means||"))
•  K)=)number)of)clusters)to)create,)default=2)
•  initializationMode)=)The)initialization)algorithm.)This)can)be)either)
"random")to)choose)random)points)as)initial)cluster)centers,)or)"kW
means||")to)use)a)parallel)variant)of)kWmeans++)(Bahmani)et)al.,)Scalable)
KWMeans++,)VLDB)2012).)Default:)kWmeans||)
•  KMeansModel.predict)
•  Maps)a)point)to)a)cluster)
Interpretation
C#' AVG' Interpreta7on'
1)
2) )
)
Not)that)acFve)–)Give)them)flight)coupons)to)encourage)them)to)fly)more.)Ask)them)why)they)are)not)
flying.)May)be)they)are)flying)to)desFnaFons)(say)Jupiter))where)InterGallacFc)has)less)gates)
3)
4)
5)
Best0Customers,0S6ll0lots0
of0nonLflying0miles0
Very0ac6ve0onLline.0Why0are0they0coming0to0us0instead0of0Amazon0?0
Note0:00
•  This0is0just0a0sample0interpreta6on.0
•  In0real0life0we0would0“noodle”0over0the0clusters0&0tweak0
them0to0be0useful,0interpretable0and0dis6nguishable.0
•  May0be030is0more0suited0to0create0targeted0promo6ons0
Epilogue
•  KMeans)in)Spark)has)enough)controls)
•  It)does)a)decent)job)
•  We)were)able)to)control)the)clusters)based)on)our)experience)(2)
cluster)is)too)low,)10)is)too)high,)5)seems)to)be)right))
•  We)can)see)that)the)Scalable)KMeans)has)control)over)runs,)
parallelism)et)al.)(Home)work):)Explore)the)scalability))
•  We)were)able)to)interpret)the)results)with)domain)knowledge)and)
arrive)at)a)scheme)to)solve)the)business)opportunity)
•  Naturally)we)would)tweak)the)clusters)to)fit)the)business)viability.)
20)clusters)with)corresponding)promotion)schemes)are)unwieldy,)
even)if)the)WSSE)is)the)minimum.)
Recommendation - Hands On :
•  Movie Lens - Medium Data
•  Movie Lens - Large Data (Homework)
2:00
hgps://doubleclix.wordpress.com/2014/03/07/aBglimpseBofBgoogleBnasaBpeterBnorvig/)
Session-4 : Recommendation at
Scale
1.  NotebookW10_RecoW1)
①  Read)Movielens)medium)
data)
②  CEW51)Template)–)Partition)
Data)
2.  NotebookW11_RecoW2)
①  CEW51)Solution)
②  ALS)Slide)
③  Train)ALS)&)Predict)
④  Calculate)Model)
Performance)
Recommendation & Personalization - Spark
Automated*AnalyticsW)Let)Data)tell)story)
Feature)Learning,)AI,)Deep)Learning)
Learning*Models*W)fit)parameters)
as)it)gets)more)data))
Dynamic*Models*–)model)
selection)based)on)context)
o  Knowledge)Based)
o  Demographic)Based)
o  Content)Based)
o  Collaborative)Filtering)
o  Item)Based)
o  User)Based)
o  Latent)Factor)based)
o  User)Rating)
o  Purchased)
o  Looked/Not)purchased)
Spark)(in)1.1.0))implements)the)user)based)ALS)collaboraFve)filtering)
Ref:))
ALS)B)CollaboraFve)Filtering)for)Implicit)Feedback)Datasets,)Yifan)Hu);)AT&T)Labs.,)Florham)Park,)NJ);)Koren,)Y.);)Volinsky,)C.)
ALSBWR)B)LargeBScale)Parallel)CollaboraFve)Filtering)for)the)Nezlix)Prize,)Yunhong)Zhou,)Dennis)Wilkinson,)Robert)Schreiber,)Rong)P
Spark Collaborative Filtering API
•  ALS.train(cls,)ratings,)rank,)iterations=5,)
lambda_=0.01,)blocks=W1))
•  ALS.trainImplicit(cls,)ratings,)rank,)iterations=5,)
lambda_=0.01,)blocks=W1,)alpha=0.01))
•  MatrixFactorizationModel.predict(self,)user,)
product))
•  MatrixFactorizationModel.predictAll(self,)
usersProducts))
Theory : Matrix Factorization, SVD,…
On-line k-means, spark streaming
2:30
Singular Value Decomposition on Spark
Singular Value
Decomposition
Singular Value Decomposition
Two cases
»  Tall and Skinny
»  Short and Fat (not really)
»  Roughly Square
SVD method on RowMatrix takes care of
which one to call.
Tall and Skinny SVD
Tall and Skinny SVD
Gets%us%%%V%and%the%
singular%values%
Gets%us%%%U%by%one%
matrix%multiplication%
Square SVD
ARPACK: Very mature Fortran77 package for
computing eigenvalue decompositions"

JNI interface available via netlib-java"

Distributed using Spark – how?
Square SVD via ARPACK
Only interfaces with distributed matrix via
matrix-vector multiplies

The result of matrix-vector multiply is small.
The multiplication can be distributed.
Square SVD
With 68 executors and 8GB memory in each,
looking for the top 5 singular vectors
Communication-Efficient 
All pairs similarity on Spark (DIMSUM)
All pairs Similarity
All pairs of cosine scores between n vectors
»  Don’t want to brute force (n choose 2) m
»  Essentially computes 
"
Compute via DIMSUM
»  Dimension Independent Similarity
Computation using MapReduce
Intuition
Sample columns that have many non-zeros
with lower probability. "

On the flip side, columns that have fewer non-
zeros are sampled with higher probability."

Results provably correct and independent of
larger dimension, m.
Spark implementation
MLlib + {Streaming, GraphX, SQL}
A General Platform
Spark Core
Spark
Streaming"
real-time
Spark SQL
structured
GraphX
graph
MLlib
machine
learning
…
Standard libraries included with Spark
Benefit for Users
Same engine performs data extraction, model
training and interactive queries

…
DFS
read
DFS
write
parse
DFS
read
DFS
write
train
DFS
read
DFS
write
query
DFS
DFS
read
parse
train
query
Separate engines
Spark
MLlib + Streaming
As of Spark 1.1, you can train linear models in
a streaming fashion, k-means as of 1.2

Model weights are updated via SGD, thus
amenable to streaming

More work needed for decision trees
MLlib + SQL
df = context.sql(“select latitude, longitude from tweets”)!
model = pipeline.fit(df)!
DataFrames in Spark 1.3! (March 2015)
Powerful coupled with new pipeline API
MLlib + GraphX
Future of MLlib
Goals for next version
Tighter integration with DataFrame and spark.ml API

Accelerated gradient methods & Optimization interface

Model export: PMML (current export exists in Spark 1.3, but
not PMML, which lacks distributed models)

Scaling: Model scaling (e.g. via Parameter Servers)
Research Goal: General
Distributed Optimization
Distribute%CVX%by%
backing%CVXPY%with%
PySpark%
%
EasyBtoBexpress%
distributable%convex%
programs%
%
Need%to%know%less%
math%to%optimize%
complicated%
objectives%
Most active open source community in big data
200+ developers, 50+ companies contributing
Spark Community
0
50
100
150
Contributors in past year
Continuing Growth
source: ohloh.net
Contributors per month to Spark
Spark and ML
Spark has all its roots in research, so we hope
to keep incorporating new ideas!
Coffee-Break
Back at 3:15
Hands On :
•  Mood Of the Union
•  RecSys 2015 Challenge
3:15
The Art of ELO Ranking
& Super Bowl XLIX
o The real formula is
o Not what is written on the glass !
o But then that is Hollywood !
Ref : Who is #1, Princeton University Presshttps://doubleclix.wordpress.com/2015/01/20/theWartWofWnflWrankingWtheWeloWalgorithmWandWfivethirtyeight/)
Session-5 : Mood Of the Union –
Data Science on SOTU by POTUS
1.  DataScience/12_SOTUW1)
①  Read)BO)
②  CEW61)Template):)Has)BO)
changed)since)2014)?)
2.  DataScience/13_SOTUW2)
①  CEW61)Solution)
②  Read)GW)
③  Preprocess)
④  CEW62)Template):)What)
mood)the)country)was)in)
1790W1796)vs.)2009W2015)
3.  Notebook):)14_SOTUW3)
①  CEW62)Solution)
②  Homework)
①  GWB)vs)Clinton)
②  WJC)vs)AL)
③  Discussions ) ))
WJC)
Mood Of The Nation
JFK)
FDR)
GW)
AL)
JFK)
Epilogue
•  Interesting)Exercise)
•  Highlights)
o  Map-reduce in a couple of lines !
•  But it is not exactly the same as Hadoop Mapreduce (see the excellent blog by Sean Owen1)
o  Set differences using substractByKey
o  Ability to sort a map by values (or any arbitrary function, for that matter)
•  To)Explore)as)homework:)
o  TF-IDF in
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
hgp://blog.cloudera.com/blog/2014/09/howBtoBtranslateBfromBmapreduceBtoBapacheBspark/)
Session-7 : Predict Buying Pattern
Recsys 2015 Challenge
1.  Notebook):)
99_RecsysW2015W1)
1.  Read)Data)
2.  Explore)Options)
3.  CEW71)
After Hours
Homework
Notebooks to Explore
•  Run)thru)all)the)
homework)in)you)
Databricks)cloud)
GraphX examples
82
GraphX:
spark.apache.org/docs/latest/graphx-programming-guide.html
Key Points:
•  graph-parallel systems
•  importance of workflows
•  optimizations
83
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs!
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin!
graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf
Pregel: Large-scale graph computing at Google!
Grzegorz Czajkowski, et al.!
googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
GraphX: Unified Graph Analytics on Spark!
Ankur Dave, Databricks!
databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf
Advanced Exercises: GraphX!
databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html
GraphX: Further Reading…
// http://spark.apache.org/docs/latest/graphx-programming-guide.html
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
case class Peep(name: String, age: Int)
val nodeArray = Array(
(1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
(3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
(5L, Peep("Leslie", 45))
)
val edgeArray = Array( 84
GraphX: Example – simple traversals
85
GraphX: Example – routing problems
What is the cost to reach node 0 from any other node in the graph? This is a
common use case for graph algorithms, e.g., Djikstra
86
Run /<your folder>/_DataSCience/08.graphx!
in your folder:
GraphX: Coding Exercise
Case Studies
Case Studies: Apache Spark, DBC, etc.
Additional details about production deployments for Apache Spark can be
found at:
https://cwiki.apache.org/confluence/display/SPARK/Powered+By
+Spark
https://databricks.com/blog/category/company/partners
http://go.databricks.com/customer-case-studies
88
Case Studies: Automatic Labs
89
Spark Plugs Into Your Car!
Rob Ferguson!
spark-summit.org/east/2015/talk/spark-plugs-into-
your-car
finance.yahoo.com/news/automatic-labs-turns-
databricks-cloud-140000785.html
Automatic creates personalized driving habit dashboards
•  wanted to use Spark while minimizing investment in DevOps
•  provides data access to non-technical analysts via SQL
•  replaced Redshift and disparate ML tools with single platform
•  leveraged built-in visualization capabilities in notebooks to generate
dashboards easily and quickly
Spark at Twitter: Evaluation & Lessons Learnt!
Sriram Krishnan!
slideshare.net/krishflix/seattle-spark-meetup-spark-at-twitter
•  Spark can be more interactive, efficient than MR
•  support for iterative algorithms and caching
•  more generic than traditional MapReduce
•  Why is Spark faster than Hadoop MapReduce?
•  fewer I/O synchronization barriers
•  less expensive shuffle
•  the more complex the DAG, the greater the !
performance improvement
90
Case Studies: Twitter
Pearson uses Spark Streaming for next generation
adaptive learning platform
Dibyendu Bhattacharya
databricks.com/blog/2014/12/08/pearson-uses-
spark-streaming-for-next-generation-adaptive-
learning-platform.html
91
• Kafka + Spark + Cassandra + Blur, on AWS on a YARN cluster
• single platform/common API was a key reason to replace Storm
with Spark Streaming
• custom Kafka Consumer for Spark Streaming, using Low Level
Kafka Consumer APIs
• handles: Kafka node failures, receiver failures, leader changes,
committed offset in ZK, tunable data rate throughput
Case Studies: Pearson
Unlocking Your Hadoop Data with Apache Spark and CDH5
Denny Lee
slideshare.net/Concur/unlocking-your-hadoop-data-with-apache-
spark-and-cdh5
92
• leading provider of spend management solutions and services
• delivers recommendations based on business users’ travel and
expenses – “to help deliver the perfect trip”
• use of traditional BI tools with Spark SQL allowed analysts to
make sense of the data without becoming programmers
• needed the ability to transition quickly between Machine Learning
(MLLib), Graph (GraphX), and SQL usage
• needed to deliver recommendations in real-time
Case Studies: Concur
Stratio Streaming: a new approach to Spark Streaming
David Morales, Oscar Mendez
spark-summit.org/2014/talk/stratio-streaming-a-new-
approach-to-spark-streaming
93
• Stratio Streaming is the union of a real-time messaging bus with a
complex event processing engine atop Spark Streaming
• allows the creation of streams and queries on the fly
• paired with Siddhi CEP engine and Apache Kafka
• added global features to the engine such as auditing and statistics
Case Studies: Stratio
Collaborative Filtering with Spark!
Chris Johnson!
slideshare.net/MrChrisJohnson/collaborative-filtering-with-spark
•  collab filter (ALS) for music recommendation
•  Hadoop suffers from I/O overhead
•  show a progression of code rewrites, converting a !
Hadoop-based app into efficient use of Spark
94
Case Studies: Spotify
Guavus Embeds Apache Spark !
into its Operational Intelligence Platform !
Deployed at the World’s Largest Telcos
Eric Carr
databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-
platform-deployed-at-the-worlds-largest-telcos.html
95
•  4 of 5 top mobile network operators, 3 of 5 top Internet
backbone providers, 80% MSOs in NorAm
•  analyzing 50% of US mobile data traffic, +2.5 PB/day
•  latency is critical for resolving operational issues before they
cascade: 2.5 MM transactions per second
•  “analyze first” not “store first ask questions later”
Case Studies: Guavus
Case Studies: Radius Intelligence
96
From Hadoop to Spark in 4 months, Lessons Learned!
Alexis Roos!
http://youtu.be/o3-lokUFqvA
•  building a full SMB index took 12+ hours using !
Hadoop and Cascading
•  pipeline was difficult to modify/enhance
•  Spark increased pipeline performance 10x
•  interactive shell and notebooks enabled data scientists !
to experiment and develop code faster
•  PMs and business development staff can use SQL to !
query large data sets
Further Resources +
Q&A
community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
Data Science with Spark - Training at SparkSummit (East)

Mais conteúdo relacionado

Mais procurados

Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future TensePaco Nathan
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Andy Petrella
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkKrishna Sankar
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engineLars Marius Garshol
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learningjoshwills
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.Andy Petrella
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data ScienceDonald Miner
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...Sri Ambati
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Andy Petrella
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 

Mais procurados (20)

Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future Tense
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache Spark
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
Data Wrangling
Data WranglingData Wrangling
Data Wrangling
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learning
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 

Destaque

Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
Spark tutorial py con 2016 part 2
Spark tutorial py con 2016   part 2Spark tutorial py con 2016   part 2
Spark tutorial py con 2016 part 2David Taieb
 
Spark tutorial pycon 2016 part 1
Spark tutorial pycon 2016   part 1Spark tutorial pycon 2016   part 1
Spark tutorial pycon 2016 part 1David Taieb
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to sparkJavier Arrieta
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Jongwook Woo
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetupjlacefie
 
Scala presentation by Aleksandar Prokopec
Scala presentation by Aleksandar ProkopecScala presentation by Aleksandar Prokopec
Scala presentation by Aleksandar ProkopecLoïc Descotte
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesAlfredo Abate
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAiougVizagChapter
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesNAYATech
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 

Destaque (20)

Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
Spark tutorial py con 2016 part 2
Spark tutorial py con 2016   part 2Spark tutorial py con 2016   part 2
Spark tutorial py con 2016 part 2
 
Spark tutorial pycon 2016 part 1
Spark tutorial pycon 2016   part 1Spark tutorial pycon 2016   part 1
Spark tutorial pycon 2016 part 1
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Democratizing Data Science by Bill Howe
Democratizing Data Science by Bill HoweDemocratizing Data Science by Bill Howe
Democratizing Data Science by Bill Howe
 
Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015
 
NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
Scala in practice
Scala in practiceScala in practice
Scala in practice
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
 
Scala+RDD
Scala+RDDScala+RDD
Scala+RDD
 
Scala presentation by Aleksandar Prokopec
Scala presentation by Aleksandar ProkopecScala presentation by Aleksandar Prokopec
Scala presentation by Aleksandar Prokopec
 
ScalaTrainings
ScalaTrainingsScalaTrainings
ScalaTrainings
 
Scala+spark 2nd
Scala+spark 2ndScala+spark 2nd
Scala+spark 2nd
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA Technologies
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 

Semelhante a Data Science with Spark - Training at SparkSummit (East)

Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachMihai Criveti
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesDatabricks
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascienceAdam Muise
 
Just the basics_strata_2013
Just the basics_strata_2013Just the basics_strata_2013
Just the basics_strata_2013Ken Mwai
 
Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Fabricio Quintanilla
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014The Hive
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
Get connected with python
Get connected with pythonGet connected with python
Get connected with pythonJan Kroon
 
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Databricks
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsRussell Jurney
 
Big Data & Artificial Intelligence
Big Data & Artificial IntelligenceBig Data & Artificial Intelligence
Big Data & Artificial IntelligenceZavain Dar
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsRussell Jurney
 
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataAdaryl "Bob" Wakefield, MBA
 
Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013Aron Ahmadia
 
Software Carpentry for the Geophysical Sciences
Software Carpentry for the Geophysical SciencesSoftware Carpentry for the Geophysical Sciences
Software Carpentry for the Geophysical SciencesAron Ahmadia
 

Semelhante a Data Science with Spark - Training at SparkSummit (East) (20)

Data science presentation
Data science presentationData science presentation
Data science presentation
 
BigData primer
BigData primerBigData primer
BigData primer
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
 
Spark
SparkSpark
Spark
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 
Just the basics_strata_2013
Just the basics_strata_2013Just the basics_strata_2013
Just the basics_strata_2013
 
Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Get connected with python
Get connected with pythonGet connected with python
Get connected with python
 
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
Big Data & Artificial Intelligence
Big Data & Artificial IntelligenceBig Data & Artificial Intelligence
Big Data & Artificial Intelligence
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
 
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT Data
 
Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013
 
Software Carpentry for the Geophysical Sciences
Software Carpentry for the Geophysical SciencesSoftware Carpentry for the Geophysical Sciences
Software Carpentry for the Geophysical Sciences
 

Mais de Krishna Sankar

An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsKrishna Sankar
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesKrishna Sankar
 
AWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOpsAWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOpsKrishna Sankar
 
Big Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 PragmaticsBig Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 PragmaticsKrishna Sankar
 
Scrum debrief to team
Scrum debrief to team Scrum debrief to team
Scrum debrief to team Krishna Sankar
 
Precision Time Synchronization
Precision Time SynchronizationPrecision Time Synchronization
Precision Time SynchronizationKrishna Sankar
 
The Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to KaggleThe Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to KaggleKrishna Sankar
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04Krishna Sankar
 
Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29Krishna Sankar
 
A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0Krishna Sankar
 

Mais de Krishna Sankar (12)

An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive Bayes
 
AWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOpsAWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOps
 
Big Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 PragmaticsBig Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 Pragmatics
 
Scrum debrief to team
Scrum debrief to team Scrum debrief to team
Scrum debrief to team
 
The Art of Big Data
The Art of Big DataThe Art of Big Data
The Art of Big Data
 
Precision Time Synchronization
Precision Time SynchronizationPrecision Time Synchronization
Precision Time Synchronization
 
The Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to KaggleThe Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to Kaggle
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04
 
Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29
 
A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0
 

Último

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 

Último (20)

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 

Data Science with Spark - Training at SparkSummit (East)

  • 1. Data Science Training Spark “I want to die on Mars but not on impact” — Elon Musk, interview with Chris Anderson “The shrewd guess, the fertile hypothesis, the courageous leap to a tentative conclusion – these are the most valuable coin of the thinker at work” -- Jerome Seymour Bruner "There are no facts, only interpretations." - Friedrich Nietzsche "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions! ------ We are not going to hang data by it’s legs !! http://training.databricks.com/workshop/datasci.pdf
  • 2. ADVANCED: DATA SCIENCE WITH APACHE SPARK Data Science applications with Apache Spark combine the scalability of Spark and the distributed machine learning algorithms. This material expands on the “Intro to Apache Spark” workshop. Lessons focus on industry use cases for machine learning at scale, coding examples based on public data sets, and leveraging cloud-based notebooks within a team context. Includes limited free accounts on Databricks Cloud. Topics covered include: Data transformation techniques based on both Spark SQL and functional programming in Scala and Python. Predictive analytics based on MLlib, clustering with KMeans, building classifiers with a variety of algorithms and text analytics – all with emphasis on an iterative cycle of feature engineering, modeling, evaluation. Visualization techniques (matplotlib, ggplot2, D3, etc.) to surface insights. Understand how the primitives like Matrix Factorization are implemented in a distributed parallel framework from the designers of MLlib Several hands-on exercises using datasets such as Movielens, Titanic, State Of the Union speeches, and RecSys Challenge 2015. Prerequisites: Intro to Apache Spark workshop or equivalent (e.g., Spark Developer Certificate) Experience coding in Scala, Python, SQL Have some familiarity with Data Science topics (e.g., business use cases)
  • 4. Goals •  Patterns):)Data)wrangling)(Transform,)Model)&)Reason))with)Spark) o  Use)RDDs,)Transformations)and)Actions)in)the)context)of)a)Data)Science)Problem,)an)Algorithms))&)a) Dataset) •  Spend)time)working)through)MLlib)) •  Balance)between)internals)&)handsWon) o  Internals)from)Reza,)the)MLlib)lead) •  ~65%)of)time)on)Databricks)Cloud)&)Notebooks) o  Take)the)time)to)get)familiar)with)the)Interface)&)the)Data)Science)Cloud) o  Make*mistakes,*experiment,…* •  Good)Time)for)this)course,)this)version) o  Will)miss)many)of)the)gory)details)as)the)framework)evolves) •  Summarized)materials)for)a)3)day)course) o  Even)if)we)don’t)finish)the)exercises)today,)that)is)fine) o  Complete)the)work)at)home)W)There*are*also**homework*notebooks* o  Ask)us)questions)@ksankar,@pacoid,*@reza_zadeh,*@mhfalaki,*@andykonwinski,*@xmeng,* @michaelarmbrust,*@tathadas*
  • 5. Tutorial)Outline:) morning' a)ernoon' o  Welcome)+)Ge3ng)Started)(Krishna)) o  Databricks)Cloud)mechanics)(Andy))) o  Ex)3):)Clustering)B)In)which)we)explore)SegmenFng) Frequent)InterGallacFcHoppers)(Krishna)) o  Ex)0:)PreBFlight)Check)(Krishna)) o  Ex)4):)RecommendaFon)(Krishna)) o  DataScience)DevOps)B)IntroducFon)to)Spark)(Krishna)) o  Theory):)Matrix)FactorizaFon,)SVD,…)(Reza)) o  Ex)1:)MLlib):)StaFsFcs,)Linear)Regression)(Krishna)) o  OnBline)kBmeans,)spark)streaming)(Reza)) o  MLlib)Deep)Dive)–)Lecture)(Reza)) o  Design)Philosophy,)APIs) o  Ex)2:)In)which)we)explore)Disasters,)Trees,) ClassificaFon)&)the)Kaggle)CompeFFon)(Krishna)) o  Random)Forest,)Bagging,)Data)DeBcorrelaFon) o  Ex)5):)Mood)of)the)UnionBText)AnalyFcs(Krishna)) o  In)which)we)analyze)the)Mood)of)the)naFon)from) inferences)on)SOTU)by)the)POTUS)(State)of)the)Union) Addresses)by)The)President)Of)the)US)) o  Deepdive)B)Leverage)parallelism)of)RDDs,)sparse) vectors,)etc)(Reza)) o  Ex)99):)RecSys)2015)Challenge)(Krishna)) o  Ask)Us)Anything)B)Panel)
  • 6. Introducing:) Reza Zadeh @Reza_Zadeh Hossein Falaki @mhfalaki Andy Konwinski @andykonwinski Krishna Sankar @ksankar Paco Nathan @pacoid Xiangrui Meng @xmeng Michael Armbrust @michaelarmbrust Tathagata Das @tathadas
  • 7. About Me o Chief Data Scientist at BlackArrow.tv o Have been speaking at OSCON, PyCon, Pydata, Strata et al o Reviewer “Machine Learning with Spark” o Picked up co-authorship Second Edition of “Fast Data Processing with Spark” o Have done lots of things: •  •  •  •  •  •  o  @ksankar, doubleclix.wordpress.com ksankar42@gmail.com The)Nuthead)band)!)
  • 8. Pre-requisites ①  Register)&)Download)data)from)Kaggle.) We)cannot)distribute)Kaggle)data.) Moreover)you)need)an)account)to)submit)entries) a)  Setup)an)account)in)Kaggle)(www.kaggle.com)) b)  We)will)be)using)the)data)from)the)competition)“Titanic:) Machine)Learning)from)Disaster”) c)  Download)data)from) http://www.kaggle.com/c/titanicWgettingStarted) ②  Register)for)RecSys)2015)Competition) a)  http://2015.recsyschallenge.com/)
  • 10. Everyone will receive a username/password for one ! of the Databricks Cloud shards. Use your laptop and browser to login there. We find that cloud-based notebooks are a simple way to get started using Apache Spark – as the motto “Making Big Data Simple” states. Please create and run a variety of notebooks on your account throughout the tutorial. These accounts will remain open long enough for you to export your work. See the product page or FAQ for more details, or contact Databricks to register for a trial account. 10 Getting Started: Step 1
  • 11. 11 Getting Started: Step 1 – Credentials url: https://class01.cloud.databricks.com/ user: student-777 pass: 93ac11xq23z5150 cluster: student-777
  • 12. 12 Open in a browser window, then click on the navigation menu in the top/left corner: Getting Started: Step 2
  • 13. 13 The next columns to the right show folders,! and scroll down to click on databricks_guide Getting Started: Step 3
  • 14. 14 Scroll to open the 01 Quick Start notebook, then follow the discussion about using key features: Getting Started: Step 4
  • 15. 15 See /databricks-guide/01 Quick Start ! Key Features: •  Workspace / Folder / Notebook •  Code Cells, run/edit/move/comment • Markdown •  Results •  Import/Export Getting Started: Step 5
  • 16. 16 Click on the Workspace menu and create your ! own folder (pick a name): Getting Started: Step 6
  • 17. 17 Navigate to /_DataScience
 Hover on its drop-down menu, on the right side: Getting Started: Step 7
  • 18. 18 Navigate to /_DataScience
 Hover on its drop-down menu, on the right side: Click Clone: Getting Started: Step 8
  • 19. 19 Then create a clone of this folder in the folder that you just created: Getting Started: Step 9
  • 20. 20 Now let’s get started with the coding exercise! We’ll define an initial Spark app in three lines of code: Click on _00.pre-flight-check Getting Started: Coding Exercise
  • 21. 21 Attach your cluster – same as your username: Getting Started: Step 10
  • 23. Data Science : The art of building a model with known knowns, which when let loose, works with unknown unknowns! http://smartorg.com/2013/07/valuepoint19/ The World Knowns& & & & & & Unknowns& You UnKnown& Known& o  Others)know,)you)don’t) o  What)we)do) o  Facts,)outcomes)or) scenarios)we)have)not) encountered,)nor) considered) o  “Black)swans”,)outliers,) long)tails)of)probability) distribuFons) o  Lack)of)experience,) imaginaFon) o  PotenFal)facts,) outcomes)we) are)aware,)but) not))with) certainty) o  StochasFc) processes,) ProbabiliFes) o  o  o  o  o  o 
  • 24. The curious case of the Data Scientist o Data Scientist is multi-faceted & Contextual o Data Scientist should be building Data Products o Data Scientist should tell a story http://doubleclix.wordpress.com/2014/01/25/theWcuriousWcaseWofWtheWdataWscientistWprofession/
  • 25. Data Science - Context o  Scalable)Model) Deployment) o  Big)Data) automation)&) purpose)built) appliances)(soft/ hard)) o  Manage)SLAs)&) response)times) o  Volume) o  Velocity) o  Streaming)Data) o  Canonical)form) o  Data)catalog) o  Data)Fabric)across)the) organization) o  Access)to)multiple) sources)of)data)) o  Think)Hybrid)–)Big) Data)Apps,)Appliances) &)Infrastructure) Collect Store Transform o  Metadata) o  Monitor)counters)&) Metrics) o  Structured)vs.)MultiW structured) o  Flexible)&)Selectable) !  Data)Subsets)) !  Attribute)sets) o  Refine)model)with) !  Extended)Data) subsets) !  Engineered) Attribute)sets) o  Validation)run)across)a) larger)data)set) Reason Model Deploy Data Management Data Science o  Dynamic)Data)Sets) o  2)way)keyWvalue)tagging)of) datasets) o  Extended)attribute)sets) o  Advanced)Analytics) ExploreVisualize Recommend Predict o  Performance) o  Scalability) o  Refresh)Latency) o  InWmemory)Analytics) o  Advanced)Visualization) o  Interactive)Dashboards) o  Map)Overlay) o  Infographics) "  Bytes to Business a.k.a. Build the full stack "  Find Relevant Data For Business "  Connect the Dots
  • 26. Volume Velocity Variety Data Science - Context Context Connect edness Intelligence Interface Inference o  Three Amigos o  Interface = Cognition o  Intelligence = Compute(CPU) & Computational(GPU) o  Infer Significance & Causality
  • 27. Day in the life of a (super) Model Intelligence Inference Data Representation Interface Algorithms) Parameters)Agributes) Data)(Scoring)) Model) SelecFon) Reason)&) Learn) Models) Visualize,) Recommend,) Explore) Model)Assessment) Feature)SelecFon)Dimensionality)ReducFon)
  • 30. RDD – The workhorse of Spark o Resilient Distributed Datasets •  o Transformations – create RDDs •  o Actions – Get values •  o We will apply these operations during this tutorial
  • 31. MLlib Hands-on Stats, Linear Regression 9:20
  • 33. Session – 1 : MLlib - Statistics & Linear Regression 1.  Notebook):)01_StatsLRW1) ①  Read)car)data) ②  Stats)(Guided)) ③  Correlation)(Guided)) ④  Coding)ExerciseW21WTemplate) (Correlation)) 2.  Notebook):)02_StatsLRW2) ①  CEW21WSolution) 3.  Linear)Regression) ①  LR)(Guided)) ②  CEW22WTemplate((LR)on)Car) Data)) 4.  Notebook):)03_StatsLRW3) ①  CEW22WSolution) ②  Explain)
  • 34. Linear Regression - API LabeledPoint The)features)and)labels)of)a)data)point LinearModel weights,)intercept LinearRegressionModelBase predict() LinearRegressionModel LinearRegressionWithSGD train(cls,)data,)iteraFons=100,)step=1.0,)miniBatchFracFon=1.0,) iniFalWeights=None,)regParam=1.0,)regType=None,)intercept=False) LassoModel LeastBsquares)fit)with)an)L1)penalty)term. LassoWithSGD train(cls,)data,)iteraFons=100,)step=1.0,)regParam=1.0,) miniBatchFracFon=1.0,iniFalWeights=None) RidgeRegressionModel LeastBsquares)fit)with)an)L2)penalty)term. RidgeRegressionWithSGD train(cls,)data,)iteraFons=100,)step=1.0,)regParam=1.0,)miniBatchFracFon=1.0,) iniFalWeights=None)
  • 35. MLlib - Deep Dive •  Design Philosophy & APIs •  Algorithms - Regression, SGD et al •  Interfaces 9:50
  • 36. Reza Zadeh Distributed Machine Learning" on Spark @Reza_Zadeh | http://reza-zadeh.com
  • 37. Outline Data flow vs. traditional network programming Spark computing engine Optimization Examples Matrix Computations MLlib + {Streaming, GraphX, SQL} Future of MLlib
  • 38. Data Flow Models Restrict the programming interface so that the system can do more automatically Express jobs as graphs of high-level operators » System picks how to split each operator into tasks and where to run each task » Run parts twice fault recovery Biggest example: MapReduce Map Map Map Reduce Reduce
  • 39. Spark Computing Engine Extends a programming language with a distributed collection data-structure » “Resilient distributed datasets” (RDD) Open source at Apache » Most active community in big data, with 50+ companies contributing Clean APIs in Java, Scala, Python Community: SparkR, soon to be merged
  • 40. Key Idea Resilient Distributed Datasets (RDDs) » Collections of objects across a cluster with user controlled partitioning & storage (memory, disk, ...) » Built via parallel transformations (map, filter, …) » The world only lets you make make RDDs such that they can be: Automatically rebuilt on failure
  • 41. MLlib History MLlib is a Spark subproject providing machine learning primitives Initial contribution from AMPLab, UC Berkeley Shipped with Spark since Sept 2013
  • 42. MLlib: Available algorithms classification: logistic regression, linear SVM," naïve Bayes, least squares, classification tree regression: generalized linear models (GLMs), regression tree collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF) clustering: k-means|| decomposition: SVD, PCA optimization: stochastic gradient descent, L-BFGS
  • 43. Optimization At least two large classes of optimization problems humans can solve:" »  Convex »  Spectral
  • 46. Separable Updates Can be generalized for »  Unconstrained optimization »  Smooth or non-smooth »  LBFGS, Conjugate Gradient, Accelerated Gradient methods, …
  • 47. Logistic Regression Results 0 500 1000 1500 2000 2500 3000 3500 4000 1 5 10 20 30 RunningTime(s) Number of Iterations Hadoop Spark 110 s / iteration first iteration 80 s further iterations 1 s 100 GB of data on 50 m1.xlarge EC2 machines !
  • 48. Behavior with Less RAM 68.8 58.1 40.7 29.7 11.5 0 20 40 60 80 100 0% 25% 50% 75% 100% Iterationtime(s) % of working set in memory
  • 50. Spark PageRank Given directed graph, compute node importance. Two RDDs: »  Neighbors (a sparse graph/matrix) »  Current guess (a vector) " Using cache(), keep neighbor list in RAM
  • 51. Spark PageRank Using cache(), keep neighbor lists in RAM Using partitioning, avoid repeated hashing Neighbors (id, edges) Ranks (id, rank) join partitionBy join join …
  • 55. Distributing Matrices How to distribute a matrix across machines? »  By Entries (CoordinateMatrix) »  By Rows (RowMatrix) »  By Blocks (BlockMatrix) All of Linear Algebra to be rebuilt using these partitioning schemes As!of!version!1.3!
  • 56. Distributing Matrices Even the simplest operations require thinking about communication e.g. multiplication How many different matrix multiplies needed? »  At least one per pair of {Coordinate, Row, Block, LocalDense, LocalSparse} = 10 »  More because multiplies not commutative
  • 58. MLlib - Hands On #2 – Kaggle Competition Predicting Titanic Survivors: •  Feature Engineering •  Classification Algorithms(Random forest) •  Submission & Leaderboard scores 10:45
  • 59. Data Science “folk knowledge” (1 of A) o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions o Learning = Representation + Evaluation + Optimization o It’s Generalization that counts •  o Data alone is not enough •  o Machine Learning is not magic – one cannot get something from nothing •  •  A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
  • 60. Classification - Spark API •  Logistic)Regression) •  SVMWithSGD) •  DecisionTrees) •  Data)as)LabelledPoint)(we)will)see)in)a)moment)) •  DecisionTree.trainClassifier(data,)numClasses,)categoricalFeaturesInfo,)impurity="gini",) maxDepth=4,)maxBins=100)) •  Impurity)–)“entropy”)or)“gini”) •  maxBins)=)control)to)throttle)communication)at)the)expense)of)accuracy) o  Larger)=)Higher)Accuracy) o  Smaller)=)less)communication)(as)#)of)bins)=)number)of)instances)) •  Data)Adaptive)–)i.e.)decision)tree)samples)on)the)driver)and)figures)out)the)bin)spacing)i.e.)the) places)you)slice)for)binning) •  Spark*=*Intelligent*Framework*D*need*this*for*scale*
  • 61. Lookout for these interesting Spark features •  Concept)of)Labeled)Point)&)how)to)create)an) RDD)of)LPs) •  Print)the)tree) •  Calculate)Accuracy)&)MSE)from)RDDs)
  • 62. Anatomy Of a Kaggle Competition
  • 63. Kaggle Data Science Competitions o Hosts Data Science Competitions o Competition Attributes: •  •  •  •  •  •  •  •  •  • 
  • 64. Titanic)Passenger)Metadata) •  Small) •  3)Predictors) •  Class) •  Sex) •  Age) •  Survived?) http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html City Bike Sharing Prediction (Washington DC) Walmart Store Forecasting
  • 65. Train.csv) Taken)from)Titanic) Passenger)Manifest) Variable' Descrip7on' Survived) 0BNo,)1=yes) Pclass) Passenger)Class)()1st,2nd,3rd))) Sibsp) Number)of)Siblings/Spouses)Aboard) Parch) Number)of)Parents/Children)Aboard) Embarked) Port)of)EmbarkaFon) o  C)=)Cherbourg) Titanic)Passenger)Metadata) •  Small) •  3)Predictors) •  Class) •  Sex) •  Age) •  Survived?)
  • 66. Test.csv) Submission o 418 lines; 1st column should have 0 or 1 in each line o Evaluation: • 
  • 67. Data Science “folk knowledge” (Wisdom of Kaggle) Jeremy’s Axioms o Iteratively explore data o Tools •  o Get your head around data •  o Don’t over-complicate o If people give you data, don’t assume that you need to use all of it o Look at pictures ! o History of your submissions – keep a tab o Don’t be afraid to submit simple solutions •  Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by- jeremy-howard/
  • 68. Session-2 : Kaggle, Classification & Trees 1.  Notebook):)04_TitanicW01) ①  Read)Training)Data) ②  Henry)the)Sixth)Model) ③  Submit)to)Kaggle) 2.  Notebook):)05_TitanicW02) ①  Decision)Tree)Model) ②  CEW31)Template) ①  Create)Randomforest)Model) ②  Predict)Testset) ③  Submit)to)Kaggle) 3.  Notebook):)06_TitanicW03) ①  CEW32)Solution) ①  RandomForest)Model) ②  Predict)Testset) ③  Submit)Solution)2) ②  Discussion)about)Models)
  • 69. Trees, Forests & Classification •  Discuss)Random)Forest) o Boosting,)Bagging) o Data)deWcorrelation)) •  Why)it)didn’t)do)better)in)Titanic)dataset) •  Data)Science)Folk)Wisdom) o http://www.slideshare.net/ksankar/dataWscienceWfolkW knowledge)
  • 70. Results Algorithm' Kaggle'Score' Gender)based)model) ~0.7655)Rank):)1276) Decision)Tree) ~0.775124,)Rank):)1147) Random)Forest)) ~0.77512)Gender)Based)Model) Why didn’t RF do better ? Bias/Variance See next slide Henry0the0Sixth,0Part02,0Act04,0Scene020
  • 71. Why didn’t RF do better ? Bias/Variance o High Bias •  •  •  •  •  o High Variance •  •  •  •  Prediction Error! Training ! Error! Ref: Strata 2013 Tutorial by Olivier Grisel Learning Curve Need*more*features*or*more* complex*model*to*improve* Need*more*data* to*improve* 'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos hgp://www.slideshare.net/ksankar/dataBscienceBfolkBknowledge)
  • 72. Decision Tree – Best Practices maxDepth) Tune)with)Data/Model)SelecFon) maxBins) Set)low,)monitor)communicaFons,)increase)if)needed) #)RDD)parFFons) Set)to)#)of)cores) •  Usually the recommendation is that the RDD partitions should be over partitioned ie “more partitions than cores”, because tasks take different times, we need to utilize the compute power and in the end they average out •  But for Machine Learning especially trees, all tasks are approx equal computationally intensive, so over partitioning doesn’t help •  Joe Bradley talk (reference below) has interesting insights hgps://speakerdeck.com/jkbradley/mllibBdecisionBtreesBatBsfBscalaBbamlBmeetup) DecisionTree.trainClassifier(data,)numClasses,)categoricalFeaturesInfo,) impurity="gini",)maxDepth=4,)maxBins=100))
  • 73. ◦  “Output)of)weak)classifiers)into)a)powerful) commigee”) ◦  Final)PredicFon)=)weighted)majority)vote)) ◦  Later)classifiers)get)misclassified)points)) !  With)higher)weight,)) !  So)they)are)forced)) !  To)concentrate)on)them) ◦  AdaBoost)(AdapFveBoosting)) ◦  BoosFng)vs)Bagging) !  Bagging)–)independent)trees)<B)Spark)shines)here) !  BoosFng)–)successively)weighted) Boosting "  Goal! ◦  Model Complexity (-)! ◦  Variance (-)! ◦  Prediction Accuracy (+)!
  • 74. ◦  Builds)large)collecFon)of)deBcorrelated)trees)&)averages)them) ◦  Improves)Bagging)by)selecFng)i.i.d*)random)variables)for) spli3ng) ◦  Simpler)to)train)&)tune) ◦  “Do0remarkably0well,0with0very0liIle0tuning0required”)–)ESLII) ◦  Less)suscepFble)to)over)fi3ng)(than)boosFng)) ◦  Many)RF)implementaFons) !  Original)version)B)FortranB77)!)By)Breiman/Cutler) !  Python,)R,)Mahout,)Weka,)Milk)(ML)toolkit)for)py),)matlab)) !  And)of)course,)Spark)!) * i.i.d – independent identically distributed + http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm Random Forests+ "  Goal! ◦  Model Complexity (-)! ◦  Variance (-)! ◦  Prediction Accuracy (+)!
  • 75. Random Forests •  While)Boosting)splits)based)on)best)among)all)variables,)RF)splits)based) on)best)among)randomly*chosen)variables) •  )Simpler)because)it)requires)two)variables)–)no.)of)Predictors)(typically)√k)) &)no.)of)trees)(500)for)large)dataset,)150)for)smaller)) •  Error)prediction) o  For)each)iteration,)predict)for)dataset)that)is)not)in)the)sample)(OOB)data)) o  Aggregate)OOB)predictions) o  Calculate)Prediction)Error)for)the)aggregate,)which)is)basically)the)OOB)estimate)of)error)rate) •  Can)use)this)to)search)for)optimal)#)of)predictors) o  We)will)see)how)close)this)is)to)the)actual)error)in)the)Heritage)Health)Prize) •  Assumes)equal)cost)for)misWprediction.)Can)add)a)cost)function) •  Proximity)matrix)&)applications)like)adding)missing)data,)dropping) outliers)) Ref: R News Vol 2/3, Dec 2002 Statistical Learning from a Regression Perspective : Berk A Brief Overview of RF by Dan Steinberg
  • 76. ◦  Two)Step) !  Develop)a)set)of)learners) !  Combine)the)results)to)develop)a)composite)predictor) ◦  Ensemble)methods)can)take)the)form)of:) !  Using)different)algorithms,)) !  Using)the)same)algorithm)with)different)se3ngs) !  Assigning)different)parts)of)the)dataset)to)different) classifiers) ◦  Bagging)&)Random)Forests)are)examples)of) ensemble)method)) Ref: Machine Learning In Action Ensemble Methods "  Goal! ◦  Model Complexity (-)! ◦  Variance (-)! ◦  Prediction Accuracy (+)!
  • 77. Deepdive : Leverage parallelism of RDDs, sparse vectors, etc. 11:30
  • 79. Clustering - Hands On : •  Normalization & Centering •  Clustering •  Optimizing k based on cohesively of the clusters (WSSE) 1:30
  • 80. Data Science “folk knowledge” (3 of A) o More Data Beats a Cleverer Algorithm •  •  o Learn many models, not Just One •  •  •  o Simplicity Does not necessarily imply Accuracy o Representable Does not imply Learnable •  o Correlation Does not imply Causation o  http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ o  A few useful things to know about machine learning - by Pedro Domingos !  http://dl.acm.org/citation.cfm?id=2347755
  • 81. Session-3 : Clustering 1.  Notebook):)07_ClusterW1) ①  Read)Data) ②  Cluster) ③  Modeling)ExerciseW41WTemplate) 2.  Notebook):)08_ClusterW2) ①  MEW41WSolution) ②  Center)and)Scale) ③  Cluster) ④  Inspect)centroid) ⑤  CEW42WTemplate):)Cluster) Semantics) 3.  Notebook):)09_ClusterW3) ①  CEW42)Solution) ②  Cluster)Semantics)W)Discussion)
  • 82. Clustering - Theory •  Clustering)is)unsupervised)learning) •  While)the)computers)can)dissect)a)dataset)into)“similar”)clusters,)it) still)needs)human)direction)&)domain)knowledge)to)interpret)&) guide) •  Two)types:) o  Centroid)based)clustering)–)kWmeans)clustering) o  Tree)based)Clustering)–)hierarchical)clustering) •  Spark)implements)the)Scalable)Kmeans++)) o  Paper):)http://theory.stanford.edu/~sergei/papers/vldb12Wkmpar.pdf)
  • 83. Lookout for these interesting Spark features •  Application)of)Statistics)toolbox) •  Center)&)Scale)RDD) •  Filter)RDDs)
  • 84. Clustering - API •  from)pyspark.mllib.clustering)import)KMeans) •  Kmeans.train) •  train(cls,)data,)k,)maxIterations=100,)runs=1,)initializationMode="kW means||")) •  K)=)number)of)clusters)to)create,)default=2) •  initializationMode)=)The)initialization)algorithm.)This)can)be)either) "random")to)choose)random)points)as)initial)cluster)centers,)or)"kW means||")to)use)a)parallel)variant)of)kWmeans++)(Bahmani)et)al.,)Scalable) KWMeans++,)VLDB)2012).)Default:)kWmeans||) •  KMeansModel.predict) •  Maps)a)point)to)a)cluster)
  • 85. Interpretation C#' AVG' Interpreta7on' 1) 2) ) ) Not)that)acFve)–)Give)them)flight)coupons)to)encourage)them)to)fly)more.)Ask)them)why)they)are)not) flying.)May)be)they)are)flying)to)desFnaFons)(say)Jupiter))where)InterGallacFc)has)less)gates) 3) 4) 5) Best0Customers,0S6ll0lots0 of0nonLflying0miles0 Very0ac6ve0onLline.0Why0are0they0coming0to0us0instead0of0Amazon0?0 Note0:00 •  This0is0just0a0sample0interpreta6on.0 •  In0real0life0we0would0“noodle”0over0the0clusters0&0tweak0 them0to0be0useful,0interpretable0and0dis6nguishable.0 •  May0be030is0more0suited0to0create0targeted0promo6ons0
  • 86. Epilogue •  KMeans)in)Spark)has)enough)controls) •  It)does)a)decent)job) •  We)were)able)to)control)the)clusters)based)on)our)experience)(2) cluster)is)too)low,)10)is)too)high,)5)seems)to)be)right)) •  We)can)see)that)the)Scalable)KMeans)has)control)over)runs,) parallelism)et)al.)(Home)work):)Explore)the)scalability)) •  We)were)able)to)interpret)the)results)with)domain)knowledge)and) arrive)at)a)scheme)to)solve)the)business)opportunity) •  Naturally)we)would)tweak)the)clusters)to)fit)the)business)viability.) 20)clusters)with)corresponding)promotion)schemes)are)unwieldy,) even)if)the)WSSE)is)the)minimum.)
  • 87. Recommendation - Hands On : •  Movie Lens - Medium Data •  Movie Lens - Large Data (Homework) 2:00
  • 89. Session-4 : Recommendation at Scale 1.  NotebookW10_RecoW1) ①  Read)Movielens)medium) data) ②  CEW51)Template)–)Partition) Data) 2.  NotebookW11_RecoW2) ①  CEW51)Solution) ②  ALS)Slide) ③  Train)ALS)&)Predict) ④  Calculate)Model) Performance)
  • 90. Recommendation & Personalization - Spark Automated*AnalyticsW)Let)Data)tell)story) Feature)Learning,)AI,)Deep)Learning) Learning*Models*W)fit)parameters) as)it)gets)more)data)) Dynamic*Models*–)model) selection)based)on)context) o  Knowledge)Based) o  Demographic)Based) o  Content)Based) o  Collaborative)Filtering) o  Item)Based) o  User)Based) o  Latent)Factor)based) o  User)Rating) o  Purchased) o  Looked/Not)purchased) Spark)(in)1.1.0))implements)the)user)based)ALS)collaboraFve)filtering) Ref:)) ALS)B)CollaboraFve)Filtering)for)Implicit)Feedback)Datasets,)Yifan)Hu);)AT&T)Labs.,)Florham)Park,)NJ);)Koren,)Y.);)Volinsky,)C.) ALSBWR)B)LargeBScale)Parallel)CollaboraFve)Filtering)for)the)Nezlix)Prize,)Yunhong)Zhou,)Dennis)Wilkinson,)Robert)Schreiber,)Rong)P
  • 91. Spark Collaborative Filtering API •  ALS.train(cls,)ratings,)rank,)iterations=5,) lambda_=0.01,)blocks=W1)) •  ALS.trainImplicit(cls,)ratings,)rank,)iterations=5,) lambda_=0.01,)blocks=W1,)alpha=0.01)) •  MatrixFactorizationModel.predict(self,)user,) product)) •  MatrixFactorizationModel.predictAll(self,) usersProducts))
  • 92. Theory : Matrix Factorization, SVD,… On-line k-means, spark streaming 2:30
  • 95. Singular Value Decomposition Two cases »  Tall and Skinny »  Short and Fat (not really) »  Roughly Square SVD method on RowMatrix takes care of which one to call.
  • 97. Tall and Skinny SVD Gets%us%%%V%and%the% singular%values% Gets%us%%%U%by%one% matrix%multiplication%
  • 98. Square SVD ARPACK: Very mature Fortran77 package for computing eigenvalue decompositions" JNI interface available via netlib-java" Distributed using Spark – how?
  • 99. Square SVD via ARPACK Only interfaces with distributed matrix via matrix-vector multiplies The result of matrix-vector multiply is small. The multiplication can be distributed.
  • 100. Square SVD With 68 executors and 8GB memory in each, looking for the top 5 singular vectors
  • 101. Communication-Efficient All pairs similarity on Spark (DIMSUM)
  • 102. All pairs Similarity All pairs of cosine scores between n vectors »  Don’t want to brute force (n choose 2) m »  Essentially computes " Compute via DIMSUM »  Dimension Independent Similarity Computation using MapReduce
  • 103. Intuition Sample columns that have many non-zeros with lower probability. " On the flip side, columns that have fewer non- zeros are sampled with higher probability." Results provably correct and independent of larger dimension, m.
  • 105. MLlib + {Streaming, GraphX, SQL}
  • 106. A General Platform Spark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning … Standard libraries included with Spark
  • 107. Benefit for Users Same engine performs data extraction, model training and interactive queries … DFS read DFS write parse DFS read DFS write train DFS read DFS write query DFS DFS read parse train query Separate engines Spark
  • 108. MLlib + Streaming As of Spark 1.1, you can train linear models in a streaming fashion, k-means as of 1.2 Model weights are updated via SGD, thus amenable to streaming More work needed for decision trees
  • 109. MLlib + SQL df = context.sql(“select latitude, longitude from tweets”)! model = pipeline.fit(df)! DataFrames in Spark 1.3! (March 2015) Powerful coupled with new pipeline API
  • 112. Goals for next version Tighter integration with DataFrame and spark.ml API Accelerated gradient methods & Optimization interface Model export: PMML (current export exists in Spark 1.3, but not PMML, which lacks distributed models) Scaling: Model scaling (e.g. via Parameter Servers)
  • 113. Research Goal: General Distributed Optimization Distribute%CVX%by% backing%CVXPY%with% PySpark% % EasyBtoBexpress% distributable%convex% programs% % Need%to%know%less% math%to%optimize% complicated% objectives%
  • 114. Most active open source community in big data 200+ developers, 50+ companies contributing Spark Community 0 50 100 150 Contributors in past year
  • 116. Spark and ML Spark has all its roots in research, so we hope to keep incorporating new ideas!
  • 118. Hands On : •  Mood Of the Union •  RecSys 2015 Challenge 3:15
  • 119. The Art of ELO Ranking & Super Bowl XLIX o The real formula is o Not what is written on the glass ! o But then that is Hollywood ! Ref : Who is #1, Princeton University Presshttps://doubleclix.wordpress.com/2015/01/20/theWartWofWnflWrankingWtheWeloWalgorithmWandWfivethirtyeight/)
  • 120. Session-5 : Mood Of the Union – Data Science on SOTU by POTUS 1.  DataScience/12_SOTUW1) ①  Read)BO) ②  CEW61)Template):)Has)BO) changed)since)2014)?) 2.  DataScience/13_SOTUW2) ①  CEW61)Solution) ②  Read)GW) ③  Preprocess) ④  CEW62)Template):)What) mood)the)country)was)in) 1790W1796)vs.)2009W2015) 3.  Notebook):)14_SOTUW3) ①  CEW62)Solution) ②  Homework) ①  GWB)vs)Clinton) ②  WJC)vs)AL) ③  Discussions ) ))
  • 121. WJC) Mood Of The Nation JFK) FDR) GW) AL) JFK)
  • 122. Epilogue •  Interesting)Exercise) •  Highlights) o  Map-reduce in a couple of lines ! •  But it is not exactly the same as Hadoop Mapreduce (see the excellent blog by Sean Owen1) o  Set differences using substractByKey o  Ability to sort a map by values (or any arbitrary function, for that matter) •  To)Explore)as)homework:) o  TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf hgp://blog.cloudera.com/blog/2014/09/howBtoBtranslateBfromBmapreduceBtoBapacheBspark/)
  • 123. Session-7 : Predict Buying Pattern Recsys 2015 Challenge 1.  Notebook):) 99_RecsysW2015W1) 1.  Read)Data) 2.  Explore)Options) 3.  CEW71)
  • 125. Notebooks to Explore •  Run)thru)all)the) homework)in)you) Databricks)cloud)
  • 128. 83 PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs! J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin! graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf Pregel: Large-scale graph computing at Google! Grzegorz Czajkowski, et al.! googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html GraphX: Unified Graph Analytics on Spark! Ankur Dave, Databricks! databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf Advanced Exercises: GraphX! databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html GraphX: Further Reading…
  • 129. // http://spark.apache.org/docs/latest/graphx-programming-guide.html import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD case class Peep(name: String, age: Int) val nodeArray = Array( (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)), (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)), (5L, Peep("Leslie", 45)) ) val edgeArray = Array( 84 GraphX: Example – simple traversals
  • 130. 85 GraphX: Example – routing problems What is the cost to reach node 0 from any other node in the graph? This is a common use case for graph algorithms, e.g., Djikstra
  • 131. 86 Run /<your folder>/_DataSCience/08.graphx! in your folder: GraphX: Coding Exercise
  • 133. Case Studies: Apache Spark, DBC, etc. Additional details about production deployments for Apache Spark can be found at: https://cwiki.apache.org/confluence/display/SPARK/Powered+By +Spark https://databricks.com/blog/category/company/partners http://go.databricks.com/customer-case-studies 88
  • 134. Case Studies: Automatic Labs 89 Spark Plugs Into Your Car! Rob Ferguson! spark-summit.org/east/2015/talk/spark-plugs-into- your-car finance.yahoo.com/news/automatic-labs-turns- databricks-cloud-140000785.html Automatic creates personalized driving habit dashboards •  wanted to use Spark while minimizing investment in DevOps •  provides data access to non-technical analysts via SQL •  replaced Redshift and disparate ML tools with single platform •  leveraged built-in visualization capabilities in notebooks to generate dashboards easily and quickly
  • 135. Spark at Twitter: Evaluation & Lessons Learnt! Sriram Krishnan! slideshare.net/krishflix/seattle-spark-meetup-spark-at-twitter •  Spark can be more interactive, efficient than MR •  support for iterative algorithms and caching •  more generic than traditional MapReduce •  Why is Spark faster than Hadoop MapReduce? •  fewer I/O synchronization barriers •  less expensive shuffle •  the more complex the DAG, the greater the ! performance improvement 90 Case Studies: Twitter
  • 136. Pearson uses Spark Streaming for next generation adaptive learning platform Dibyendu Bhattacharya databricks.com/blog/2014/12/08/pearson-uses- spark-streaming-for-next-generation-adaptive- learning-platform.html 91 • Kafka + Spark + Cassandra + Blur, on AWS on a YARN cluster • single platform/common API was a key reason to replace Storm with Spark Streaming • custom Kafka Consumer for Spark Streaming, using Low Level Kafka Consumer APIs • handles: Kafka node failures, receiver failures, leader changes, committed offset in ZK, tunable data rate throughput Case Studies: Pearson
  • 137. Unlocking Your Hadoop Data with Apache Spark and CDH5 Denny Lee slideshare.net/Concur/unlocking-your-hadoop-data-with-apache- spark-and-cdh5 92 • leading provider of spend management solutions and services • delivers recommendations based on business users’ travel and expenses – “to help deliver the perfect trip” • use of traditional BI tools with Spark SQL allowed analysts to make sense of the data without becoming programmers • needed the ability to transition quickly between Machine Learning (MLLib), Graph (GraphX), and SQL usage • needed to deliver recommendations in real-time Case Studies: Concur
  • 138. Stratio Streaming: a new approach to Spark Streaming David Morales, Oscar Mendez spark-summit.org/2014/talk/stratio-streaming-a-new- approach-to-spark-streaming 93 • Stratio Streaming is the union of a real-time messaging bus with a complex event processing engine atop Spark Streaming • allows the creation of streams and queries on the fly • paired with Siddhi CEP engine and Apache Kafka • added global features to the engine such as auditing and statistics Case Studies: Stratio
  • 139. Collaborative Filtering with Spark! Chris Johnson! slideshare.net/MrChrisJohnson/collaborative-filtering-with-spark •  collab filter (ALS) for music recommendation •  Hadoop suffers from I/O overhead •  show a progression of code rewrites, converting a ! Hadoop-based app into efficient use of Spark 94 Case Studies: Spotify
  • 140. Guavus Embeds Apache Spark ! into its Operational Intelligence Platform ! Deployed at the World’s Largest Telcos Eric Carr databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence- platform-deployed-at-the-worlds-largest-telcos.html 95 •  4 of 5 top mobile network operators, 3 of 5 top Internet backbone providers, 80% MSOs in NorAm •  analyzing 50% of US mobile data traffic, +2.5 PB/day •  latency is critical for resolving operational issues before they cascade: 2.5 MM transactions per second •  “analyze first” not “store first ask questions later” Case Studies: Guavus
  • 141. Case Studies: Radius Intelligence 96 From Hadoop to Spark in 4 months, Lessons Learned! Alexis Roos! http://youtu.be/o3-lokUFqvA •  building a full SMB index took 12+ hours using ! Hadoop and Cascading •  pipeline was difficult to modify/enhance •  Spark increased pipeline performance 10x •  interactive shell and notebooks enabled data scientists ! to experiment and develop code faster •  PMs and business development staff can use SQL to ! query large data sets
  • 143. community: spark.apache.org/community.html events worldwide: goo.gl/2YqJZK video+preso archives: spark-summit.org resources: databricks.com/spark-training-resources workshops: databricks.com/spark-training