Something about
Data
Sanjeev Mishra Chris Bedford
Acknowledgement
● Bing for free images
● Machine Learning in Action (Peter Harrington)
● Wikipedia
Did you know that?
What about these?
What about these?
I guess you have heard of
● Siri or Google Now
● IBM Watson
● IBM Deep Blue
● Google Translate
● WolframAlpha
The Big Picture
What is Learning
Definition:
The acquisition of knowledge or skills through experience,
study, or by being taught.
[Figure: knowledge, reasoning, deduction]
What is Machine Learning
Field of study that gives computers the ability to learn without being explicitly programmed (attributed to Arthur Samuel)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E (Tom Mitchell)
Data Mining
● Computational process of discovering patterns in large data sets
○ Structured or unstructured data
○ Patterns must be: valid, novel, potentially useful, understandable
■ 80% of customers who buy cheese and milk also buy bread, and 5% of customers buy all of them together
■ Correlation among variables: positive or negative
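The cheese, milk and bread rule above is usually described by two numbers: confidence (the 80%) and support (the 5%). A minimal sketch of how both can be computed, assuming a tiny invented list of transactions; the function names `support` and `confidence` are illustrative and not from the slides:

```python
# Toy market-basket data, invented for illustration only.
transactions = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk", "bread"},
    {"cheese", "milk", "bread"},
    {"cheese", "milk", "bread"},
    {"cheese", "milk"},
    {"milk", "bread"},
    {"bread"},
    {"eggs"},
    {"cheese"},
    {"milk", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent.

    Assumes the antecedent occurs at least once (otherwise this divides by zero).
    """
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"cheese", "milk", "bread"}, transactions)
rule_confidence = confidence({"cheese", "milk"}, {"bread"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data 4 of 10 baskets contain all three items and 5 contain cheese and milk, so the rule has 40% support and 80% confidence.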
Types of Machine Learning
Unsupervised
Learn the patterns in data
● no labeled training data
● face detection in a set of images
● group objects based on some similarity
● clustering (nominal data)
● density estimation (numeric data)
Supervised
Predict or forecast something
● training on labeled examples
● recognize a face in a set of images
● given an object, predict the type
● classification (nominal data)
● regression or curve fitting (numeric data)
Clustering
Clustering using k-Means
● Input
○ M (set of points)
○ k (number of clusters)
● Output
○ k cluster centroids $c_1, \dots, c_k$ ($c_i$ is the centroid of all $x_j \in S_i$)
● Approach
○ Minimizing the squared error function
$$J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2$$
where $\lVert x_j - c_i \rVert^2$ is a chosen distance measure between a data point $x_j$ and the cluster centre $c_i$, and $J$ is an indicator of the distance of the n data points from their respective cluster centres.
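To make the squared error function concrete, here is a small sketch that evaluates it for a fixed grouping. It assumes squared Euclidean distance as the chosen distance measure; the points and centroids are made up for the example:

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def squared_error(clusters, centroids):
    """Sum over all clusters of the squared distances from each point to its centroid."""
    return sum(
        squared_distance(point, centroid)
        for points, centroid in zip(clusters, centroids)
        for point in points
    )


# Two hypothetical clusters with their centroids (the mean of each cluster).
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 4.0)]]
centroids = [(0.0, 1.0), (10.0, 2.0)]
print(squared_error(clusters, centroids))  # 1 + 1 + 4 + 4 = 10.0
```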
k-Means
create k points for starting centroids (random)
while any point has changed cluster assignment
    for every point in our dataset:
        for every centroid
            calculate the distance between the centroid and point
        assign the point to the cluster with the lowest distance
    for every cluster calculate the mean of the points in that cluster
        assign the centroid to the mean
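The pseudocode above could be sketched in Python roughly as follows. This is one possible reading, not the presenters' implementation: it assumes squared Euclidean distance, picks the random starting centroids from the data points themselves, and adds a hypothetical `initial_centroids` parameter so that a run can be made repeatable.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, initial_centroids=None, seed=0, max_iter=100):
    if initial_centroids is None:
        # Assumption: start from k distinct data points chosen at random.
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(initial_centroids)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        for idx, point in enumerate(points):
            nearest = min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            if nearest != assignment[idx]:
                assignment[idx] = nearest
                changed = True
        if not changed:
            break  # no point changed cluster, so the centroids are final
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, assignment


demo_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
demo_centroids, demo_assignment = k_means(
    demo_points, 2, initial_centroids=[(0, 0), (11, 11)]
)
print(demo_centroids, demo_assignment)
```

The result depends on the starting centroids, which is why the demo fixes them; with a random start the same data can settle into a different grouping.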
Clustering Demo
k-Means
Pros
● Easy to implement
● Fast on small datasets
Cons
● Requires a priori knowledge of k
● Slow on very large datasets
● Sensitive to outliers
● Can converge to local minima
k-Means (wrong k)
K = 4
K = 3
Improving K-means
● Bisecting K-means
○ Choose the cluster with the largest SSE
○ Split it in two, and repeat until there are k clusters
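A rough sketch of the bisecting idea, following the two steps listed above. To keep the example repeatable it makes an assumption that is not in the slides: each split is seeded deterministically with two far-apart points instead of random centroids. All function names are illustrative.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean_point(points)
    return sum(squared_distance(p, center) for p in points)


def two_means(points, max_iter=100):
    """Split one cluster in two with plain k-means (k = 2)."""
    center = mean_point(points)
    # Assumption: seed with the point farthest from the mean and the point farthest from it.
    first = max(points, key=lambda p: squared_distance(p, center))
    second = max(points, key=lambda p: squared_distance(p, first))
    centroids = [first, second]
    for _ in range(max_iter):
        halves = ([], [])
        for p in points:
            to_first = squared_distance(p, centroids[0])
            to_second = squared_distance(p, centroids[1])
            halves[0 if to_first <= to_second else 1].append(p)
        if not halves[0] or not halves[1]:
            break
        updated = [mean_point(halves[0]), mean_point(halves[1])]
        if updated == centroids:
            break
        centroids = updated
    return halves


def bisecting_k_means(points, k):
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)  # cluster with the largest SSE
        left, right = two_means(worst)
        if not left or not right:
            break  # cannot split further, e.g. all points are identical
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters


line_points = [(0, 0), (1, 0), (10, 0), (11, 0), (30, 0), (31, 0)]
result = bisecting_k_means(line_points, 3)
print(result)
```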
Supervised Learning: Linear Regression
Attempts to find a mathematical (linear) function that can approximate the relationship between a set of one or more input variables and what is called a response variable.
Example: A web site for amusement park X
* Interested in offering ride coupons
* Rides have height requirements
* Avoid issuing a coupon for a ride if the user is too short
* Most users sign up from Facebook, so we have their ages.
* So: we use age to predict height.
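As an illustration of the age-to-height idea, the sketch below fits a line by ordinary least squares and uses it for the coupon decision. The ages, heights and the 120 cm ride minimum are all invented for the example, and a straight line is only a rough approximation of growth during childhood.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training data: ages in years, heights in centimetres.
ages = [4, 6, 8, 10, 12]
heights = [102, 115, 128, 139, 150]
intercept, slope = fit_line(ages, heights)


def predicted_height(age):
    return intercept + slope * age


def offer_coupon(age, min_height_cm=120):
    """Offer the coupon only if the predicted height meets the ride minimum."""
    return predicted_height(age) >= min_height_cm


print(offer_coupon(5), offer_coupon(7))
```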
A more complex data set: two input variables.
sqFt,bathrooms,priceInThousands
1200,1,750
1250,2,900
2000,2.5,1500
1800,2,1200
1000,1.5,700
1800,3,1400
1100,1.5,800
2200,3,1700
1250,1.5,850
1300,2,1100
Our previous example had a one-dimensional set of input variables; now we have a two-dimensional set: for each two-tuple consisting of numBathrooms and squareFeet we have the selling price of a corresponding home. From this training data, we create a model that fits a “plane of best fit”. Given a new two-tuple [ numBathrooms (x), squareFeet (y) ], our model will predict the point on the plane which denotes the most likely selling price for a house with those attributes.
For a one-dimensional set of input variables we had a line of best fit; for a two-dimensional set, we have a plane of best fit. Here’s what our plane looks like.
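For readers who want to see how such a plane can be computed, here is a minimal pure-Python sketch that solves the least-squares normal equations for the ten rows of the data set shown earlier. The helper names (`solve`, `fit_plane`, `predict`) are illustrative; a statistics package would normally do this work.

```python
# (sqFt, bathrooms, priceInThousands) rows from the data set in this deck.
ROWS = [
    (1200, 1, 750), (1250, 2, 900), (2000, 2.5, 1500), (1800, 2, 1200),
    (1000, 1.5, 700), (1800, 3, 1400), (1100, 1.5, 800), (2200, 3, 1700),
    (1250, 1.5, 850), (1300, 2, 1100),
]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_plane(rows):
    """Return (intercept, sqFt coefficient, bathrooms coefficient)."""
    design = [(1.0, sqft, baths) for sqft, baths, _ in rows]
    prices = [price for _, _, price in rows]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * p for r, p in zip(design, prices)) for i in range(3)]
    return solve(xtx, xty)


def predict(coeffs, sqft, baths):
    b0, b1, b2 = coeffs
    return b0 + b1 * sqft + b2 * baths


coeffs = fit_plane(ROWS)
print(predict(coeffs, 1600, 3))  # predicted price, in thousands
```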
Why Use R ?
Many data scientists use R, due to
- extensive, well tested libraries of statistical, mathematical functions
- math friendly syntax
- excellent support for charting and plotting functions
- active user community to provide support
R skills are valuable for big data engineers, since:
- data scientists we work with will often develop their models using R
- significant effort is required to translate such models to Java, C++, etc.
So: it is useful not only to understand R, but also to be able to invoke R from your native language.
R code for 2 dimensional model
values <- read.csv(filePath)
model <- lm(priceInThousands ~ sqFt + bathrooms, data=values)
# predict new value
#
# set up 'data frame'
newdata <- data.frame(sqFt=1600, bathrooms=3)
#
# invoke prediction function
predict(model, newdata)
Notes on the code:
- read.csv: the csv file is in the same format we saw in the intro slide on linear regression.
- lm: R’s linear model creation function. In the formula, priceInThousands is the response variable; sqFt and bathrooms are the input (independent) variables.
- predict: predicts the most likely selling price using the model ‘model’ and the data frame that wraps the variables sqFt (1600) and bathrooms (3).
Calling R from Java
import org.rosuda.JRI.REXP;
import org.rosuda.JRI.Rengine;

class RegressionModelExecutor {
    // Current R session (only one per JVM,
    // since rJava is not multi-threaded).
    private final Rengine rengine;

    RegressionModelExecutor(String inputDataPath) {
        String[] engineArgs = { "--vanilla" };
        rengine = new Rengine(engineArgs, false, null);
        // One complete R statement per line, because evaluateScript
        // passes the script to the engine line by line.
        String script =
            "values = read.csv('" + inputDataPath + "')\n" +
            "newModel.lm = lm(priceInThousands ~ sqFt + bathrooms, data=values)";
        evaluateScript(script); // initialize model
    }

    public void shutdown() {
        rengine.end();
    }

    // Apply model 'newModel.lm' to predict price of a house
    // with given values for squareFeet and numBathrooms.
    public double predictInstance(int sqft, float baths) {
        rengine.eval(
            "newdata = data.frame(sqFt=" + sqft + ", bathrooms=" + baths + ")");
        REXP result = rengine.eval("predict(newModel.lm, newdata)");
        return result.asDouble();
    }

    // Evaluate block of R expressions, taking into account
    // the fact that Rengine only executes one statement at
    // a time. Unconditionally dumps out lines before executing
    // the script so that if anything goes wrong we can copy
    // paste the constructed output (scriptLines) directly
    // into an R session.
    public void evaluateScript(String scriptLines) {
        System.out.println("evaluating:\n" + scriptLines);
        for (String line : scriptLines.split("\n")) {
            rengine.eval(line);
        }
    }
}
Calling R from Java
More detailed article on R/Java:
http://buildlackey.com/integrating-r-and-java-with-jrirjava-a-jni-based-bridge/
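How is linear regression machine learning?
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E
For linear regression: the task T is predicting the response variable (height, selling price), the experience E is the training data, and the performance measure P is how close the predictions are to the observed values.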
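A small experiment can make the E, T, P wording tangible for a fitted line. In the sketch below, everything is an assumption chosen for illustration: the "true" relationship is y = 2x + 1, the noise alternates between +1 and -1 so the run is repeatable, and performance P is the mean squared error against the noise-free values.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def true_value(x):
    # Hypothetical underlying relationship that the model should recover.
    return 2.0 * x + 1.0


# Experience E: noisy observations (noise alternates +1, -1).
xs = list(range(20))
ys = [true_value(x) + (1.0 if x % 2 == 0 else -1.0) for x in xs]


def performance(n_examples):
    """P: mean squared error of a line fitted on the first n_examples points."""
    intercept, slope = fit_line(xs[:n_examples], ys[:n_examples])
    errors = [(intercept + slope * x - true_value(x)) ** 2 for x in xs]
    return sum(errors) / len(errors)


# Task T: predict y from x. More experience should give a lower error.
for n in (2, 4, 20):
    print(n, performance(n))
```

With this data the error shrinks as the number of training examples grows from 2 to 4 to 20, which is the sense in which the fitted model "learns".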
Supervised Learning: Linear Regression
LEARN MORE:
KHAN ACADEMY
https://www.khanacademy.org/
COURSERA:
Coding the Matrix Course (Linear Algebra)
http://www.youtube.com/watch?v=IWugXcWpfoM
MIT Open Courseware
Linear Algebra Course
http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/index.htm
Software and Tools
● Apache Mahout (http://mahout.apache.org/): Java, Apache
● http://prediction.io/ (Machine learning server)
● Weka (http://www.cs.waikato.ac.nz/ml/weka/): Java, GPL
● OpenNLP (http://opennlp.apache.org/): Java, Apache
● Stanford NLP (http://nlp.stanford.edu/software/): Java, GPL
● Scikit-learn (http://scikit-learn.org/stable/): Python, BSD
● mlpy (http://mlpy.sourceforge.net/): Python, GPL
● NLTK (http://nltk.org/): Python, Apache
● http://www.alchemyapi.com/
Tools
R, Matlab, Octave
http://mloss.org/software/
http://sourceforge.net/directory/science-engineering/ai/machinelearning/os:linux/freshness:recently-updated/
Courses and other materials
● Coursera (http://www.coursera.org/):
○ machine learning
○ natural language processing
○ neural networks
● Udacity (https://www.udacity.com/courses)
○ artificial intelligence
● http://cs229.stanford.edu/materials.html
● http://www.ai.mit.edu/courses/6.867-f03/lectures.html
● wikipedia.org
Something about
Data
Sanjeev Mishra Chris Bedford
sanjeev.mishra@gmail.com chris@buildlackey.com

Silicon valleycodecamp2013
