Silicon Valley Code Camp 2013

Something about
Data
Sanjeev Mishra Chris Bedford

Acknowledgement
● Bing for free images
● Machine Learning in Action (Peter Harrington)
● Wikipedia

I guess you have heard of
● Siri or Google Now
● IBM Watson
● IBM Deep Blue
● Google Translate
● WolframAlpha

What is Learning
Definition:
The acquisition of knowledge or skills through experience,
study, or by being taught.
(Diagram labels: knowledge, reasoning, deduction)

What is Machine Learning
● "Field of study that gives computers the ability to learn
without being explicitly programmed" (Arthur Samuel)
● "A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by
P, improves with experience E" (Tom Mitchell)

Data Mining
● Computational process of
discovering patterns in large data
sets
○ Structured or unstructured data
○ Patterns must be: valid, novel, potentially
useful, understandable
■ 80% of customers who buy cheese and milk also
buy bread (confidence), and 5% of customers buy all of them
together (support)
■ Correlation among variables: positive or negative
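
The cheese, milk, and bread pattern is an association rule, usually summarized by two numbers: how often the rule holds when its left side is present (the 80%) and how often all items occur together (the 5%). The Python sketch below computes both. The basket data and the function names `support` and `confidence` are made up for illustration and do not come from the talk.

```python
from typing import FrozenSet, List


def support(transactions: List[FrozenSet[str]], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(transactions, antecedent | consequent) / base


# Hypothetical shopping baskets, invented for this example.
baskets = [
    frozenset({"cheese", "milk", "bread"}),
    frozenset({"cheese", "milk", "bread"}),
    frozenset({"cheese", "milk"}),
    frozenset({"milk", "bread"}),
    frozenset({"eggs"}),
]

rule_support = support(baskets, frozenset({"cheese", "milk", "bread"}))
rule_confidence = confidence(baskets, frozenset({"cheese", "milk"}), frozenset({"bread"}))
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data, 2 of 5 baskets hold all three items and 2 of the 3 baskets with cheese and milk also hold bread.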

Types of Machine Learning
Unsupervised
Learn the patterns in data
● no training
● face detection in a set of images
● group objects based on some similarity
● clustering (nominal data)
● density estimation (numeric data)
Supervised
Predict or forecast something
● training
● recognize a face in a set of images
● given an object, predict the type
● classification (nominal data)
● regression or curve fitting (numeric data)

Clustering using k-Means
● Input
○ M (set of points)
○ k (number of clusters)
● Output
○ k cluster centroids c_1, ..., c_k (c_i is the centroid of all x_j ∈ S_i)
● Approach
○ Minimizing the squared error function
$$J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2$$
where $\lVert x_j - c_i \rVert^2$ is a chosen distance measure between a data point x_j and its cluster centre c_i; J is an indicator of the distance of the n data points from their respective cluster centres.
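
As a small worked example of the squared error function, the sketch below sums the squared Euclidean distance from each point to the centroid of its cluster. Euclidean distance is an assumption here (the slide only says "a chosen distance measure"), and the two tiny clusters are invented for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def centroid(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of the points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def squared_error(clusters: List[List[Point]]) -> float:
    """Sum over clusters of squared distances from each member to its centroid."""
    total = 0.0
    for members in clusters:
        center = centroid(members)
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total


# Invented data: the first cluster has centroid (1, 0), the second (5, 5).
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
j = squared_error(clusters)
print(j)  # 1 + 1 + 0
```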

k-Means
create k points for starting centroids (random)
while any point has changed cluster assignment
    for every point in our dataset:
        for every centroid
            calculate the distance between the centroid and point
        assign the point to the cluster with the lowest distance
    for every cluster calculate the mean of the points in that cluster
        assign the centroid to the mean
Clustering Demo
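
The k-means pseudocode can be sketched in plain Python roughly as follows. This is a minimal illustration, not the code used in the demo: the `init` and `seed` parameters, the `max_iter` cap, and the sample points are assumptions added here so that runs can be reproduced.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int,
           init: Optional[Sequence[Point]] = None,
           seed: int = 0, max_iter: int = 100):
    """Return (centroids, assignment) for Euclidean k-means."""
    # Starting centroids: k random data points unless the caller supplies them.
    if init is None:
        centroids = list(random.Random(seed).sample(list(points), k))
    else:
        centroids = [tuple(c) for c in init]
    assignment: List[Optional[int]] = [None] * len(points)

    for _ in range(max_iter):
        changed = False
        # Assign every point to the cluster with the nearest centroid.
        for idx, p in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            if assignment[idx] != nearest:
                assignment[idx] = nearest
                changed = True
        if not changed:
            break
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Invented sample: two groups of four points each.
sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centers, labels = kmeans(sample, k=2, init=[(0.0, 0.0), (1.0, 1.0)])
print(centers, labels)
```

With random starting points the result depends on the draw. On symmetric data like this sample, some starts settle in a poor local minimum, which is why the example passes explicit starting centroids.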

k-Means
Pros
● Easy to implement
● Fast on small datasets
Cons
● Requires a priori knowledge of k
● Slow on very large datasets
● Sensitive to outliers
● Can converge to a local minimum

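Improving K-means
● Bisecting K-means
○ Start with all points in one cluster
○ Choose the cluster with the largest SSE (sum of squared errors)
○ Split it in two with k-means (k = 2); repeat until there are k clusters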
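
A rough Python sketch of bisecting k-means as outlined above: start with one cluster, repeatedly pick the cluster with the largest SSE, and split it with 2-means until there are k clusters. To keep the example reproducible, the 2-means step starts from the first point and the point farthest from it. That starting rule, the function names, and the sample data are assumptions made for this sketch.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def centroid(points: Sequence[Point]) -> Point:
    return tuple(sum(col) / len(points) for col in zip(*points))


def sse(points: Sequence[Point]) -> float:
    """Sum of squared distances from each point to the cluster centroid."""
    center = centroid(points)
    return sum(math.dist(p, center) ** 2 for p in points)


def two_means(points: List[Point], max_iter: int = 100) -> List[List[Point]]:
    """Split points into two clusters with k-means (k = 2)."""
    # Assumed deterministic start: first point and the point farthest from it.
    first = points[0]
    centers = [first, max(points, key=lambda p: math.dist(p, first))]
    halves: List[List[Point]] = [list(points), []]
    for _ in range(max_iter):
        new_halves: List[List[Point]] = [[], []]
        for p in points:
            closer_to_first = math.dist(p, centers[0]) <= math.dist(p, centers[1])
            new_halves[0 if closer_to_first else 1].append(p)
        if new_halves == halves or not new_halves[0] or not new_halves[1]:
            break
        halves = new_halves
        centers = [centroid(h) for h in halves]
    return halves


def bisecting_kmeans(points: Sequence[Point], k: int) -> List[List[Point]]:
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        worst = max(splittable, key=sse)  # cluster with the largest SSE
        left, right = two_means(worst)
        if not right:
            break  # all points identical, nothing left to split
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters


# Invented sample: three small groups along the x axis.
group_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
group_b = [(10.0, 0.0), (10.0, 1.0), (11.0, 0.0)]
group_c = [(30.0, 0.0), (30.0, 1.0), (31.0, 0.0)]
result = bisecting_kmeans(group_a + group_b + group_c, k=3)
print(result)
```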

Supervised Learning: Linear Regression
Attempts to find a mathematical (linear) function that can approximate the relationship between a set of
one or more input variables and what is called a response variable.
Example: A web site for amusement park X
* Interested in offering ride coupons
* Rides have height requirements
* Avoid issuing coupons for ride if user is too short
* Most users sign up from Facebook, so we have their ages.
* So: we use age to predict height.
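
To make the age-to-height example concrete, here is a minimal least-squares sketch in Python. The ages, heights, and the 120 cm ride limit are invented for illustration and are not data from the talk. The closed-form slope and intercept formulas are the standard ones for a single input variable.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: age in years, height in centimetres.
ages = [4, 6, 8, 10, 12]
heights_cm = [102, 114, 128, 138, 152]

slope, intercept = fit_line(ages, heights_cm)


def predict_height(age: float) -> float:
    return intercept + slope * age


MIN_HEIGHT_CM = 120  # hypothetical ride requirement
offer_coupon = predict_height(9) >= MIN_HEIGHT_CM
print(f"height ~ {intercept:.1f} + {slope:.1f} * age; offer coupon to a 9-year-old: {offer_coupon}")
```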

A more complex data set: two input variables.
sqFt,bathrooms,priceInThousands
1200,1,750
1250,2,900
2000,2.5,1500
1800,2,1200
1000,1.5,700
1800,3,1400
1100,1.5,800
2200,3,1700
1250,1.5,850
1300,2,1100
Our previous example had a one-dimensional set of input variables; now we have a 2-dimensional set:
for each two-tuple [ squareFeet, numBathrooms ] (the sqFt and bathrooms columns) we
have the selling price of the corresponding home. From this training data, we
create a model that fits a “plane of best fit”. Given a new two-tuple
[ squareFeet, numBathrooms ], our model will predict the point on the plane that
denotes the most likely selling price for a house with those attributes.
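
One way to see what fitting the plane involves is to solve the least-squares normal equations directly. The Python sketch below does that for the ten rows listed above, using a small hand-written Gaussian elimination so it needs no extra libraries. The helper names `solve` and `fit_plane` are made up for this example; the talk itself builds the model in R.

```python
import csv
import io
from typing import List, Sequence, Tuple

CSV_TEXT = """sqFt,bathrooms,priceInThousands
1200,1,750
1250,2,900
2000,2.5,1500
1800,2,1200
1000,1.5,700
1800,3,1400
1100,1.5,800
2200,3,1700
1250,1.5,850
1300,2,1100
"""


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_plane(rows: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Least-squares fit of price = b0 + b1 * sqFt + b2 * bathrooms."""
    xs = [(1.0, sqft, baths) for sqft, baths, _ in rows]
    ys = [price for _, _, price in rows]
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(3)] for i in range(3)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(3)]
    return solve(xtx, xty)


reader = csv.reader(io.StringIO(CSV_TEXT))
next(reader)  # skip the header row
rows = [tuple(float(v) for v in rec) for rec in reader]

intercept, per_sqft, per_bath = fit_plane(rows)
predicted = intercept + per_sqft * 1600 + per_bath * 3
print(f"predicted price for 1600 sq ft, 3 baths: {predicted:.0f} thousand")
```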

For a one-dimensional set of input variables we had a line of best fit; for a two-dimensional
set, we have a plane of best fit. Here’s what our plane looks like.
(Figure: plot of the plane of best fit.)

Why Use R?
Many data scientists use R, due to
- extensive, well-tested libraries of statistical and mathematical functions
- math-friendly syntax
- excellent support for charting and plotting
- an active user community that provides support
R skills are valuable for big data engineers, since:
- data scientists we work with will often develop their models using R
- significant effort is required to translate such models to Java, C++, etc.
So: it is useful not only to understand R,
but also to be able to invoke R from your native language

R code for 2-dimensional model
# The CSV file is in the same format we saw in the intro
# slide on linear regression.
values <- read.csv(filePath)
# lm() is R's linear model creation function. In the formula,
# priceInThousands is the response variable; sqFt and bathrooms
# are the input (independent) variables.
model <- lm(priceInThousands ~ sqFt + bathrooms, data=values)
# Predict a new value: set up a data frame that wraps the
# variables sqFt (1600) and bathrooms (3).
newdata <- data.frame(sqFt=1600, bathrooms=3)
# Invoke the prediction function: predict the most likely selling
# price using model 'model' and the data frame.
predict(model, newdata)

Calling R from Java
import org.rosuda.JRI.REXP
import org.rosuda.JRI.Rengine

// Note: this listing is Groovy (triple-quoted strings, $variable
// interpolation, optional semicolons) calling the Java JRI API.
class RegressionModelExecutor {
    // Current R session (only one per JVM,
    // since rJava is not multi-threaded).
    Rengine rengine = null

    RegressionModelExecutor(String inputDataPath) {
        String[] engineArgs = new String[1]
        engineArgs[0] = "--vanilla"
        rengine = new Rengine(engineArgs, false, null)
        // Keep each R statement on a single line, because
        // evaluateScript evaluates the script one line at a time.
        String script = """
            values = read.csv('$inputDataPath')
            newModel.lm = lm(priceInThousands ~ sqFt + bathrooms, data=values)
        """
        evaluateScript(script) // initialize model
    }

    void shutdown() {
        rengine.end()
    }

    // Apply model 'newModel.lm' to predict price of a house
    // with given values for squareFeet and numBathrooms.
    double predictInstance(int sqft, float baths) {
        rengine.eval("newdata = data.frame(sqFt=$sqft, bathrooms=$baths)")
        REXP result = rengine.eval("predict(newModel.lm, newdata)")
        return result.asDouble()
    }

    // Evaluate block of R expressions, taking into account
    // the fact that Rengine only executes one statement at
    // a time. Unconditionally dumps out lines before executing
    // the script so that if anything goes wrong we can copy
    // paste the constructed output (scriptLines) directly
    // into an R session.
    void evaluateScript(String scriptLines) {
        println("evaluating:\n$scriptLines")
        for (String line : scriptLines.split("\n")) {
            if (line.trim()) { // skip blank lines
                rengine.eval(line.trim())
            }
        }
    }
}

Calling R from Java
More detailed article on R/Java:
http://buildlackey.com/integrating-r-and-java-with-jrirjava-a-jni-based-bridge/

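How is linear regression machine learning?
A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by
P, improves with experience E
For the house-price model: T is predicting a selling price, E is the training data, and P is how close the predictions come to the actual prices.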

LEARN MORE:
KHAN ACADEMY
https://www.khanacademy.org/
COURSERA:
Coding the Matrix Course (Linear Algebra)
http://www.youtube.com/watch?v=IWugXcWpfoM
MIT Open Courseware
Linear Algebra Course
http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/index.htm

Software and Tools
● Apache Mahout (http://mahout.apache.org/): Java, Apache
● http://prediction.io/ (Machine learning server)
● Weka (http://www.cs.waikato.ac.nz/ml/weka/): Java, GPL
● OpenNLP (http://opennlp.apache.org/): Java, Apache
● Stanford NLP (http://nlp.stanford.edu/software/): Java, GPL
● Scikit-learn (http://scikit-learn.org/stable/): Python, BSD
● mlpy (http://mlpy.sourceforge.net/): Python, GPL
● NLTK (http://nltk.org/): Python, Apache
● http://www.alchemyapi.com/
Tools
R, Matlab, Octave
http://mloss.org/software/
http://sourceforge.net/directory/science-engineering/ai/machinelearning/os:linux/freshness:recently-updated/

Courses and other materials
● Coursera (http://www.coursera.org/):
○ machine learning
○ natural language processing
○ neural networks
● Udacity (https://www.udacity.com/courses)
○ artificial intelligence
● http://cs229.stanford.edu/materials.html
● http://www.ai.mit.edu/courses/6.867-f03/lectures.html
● wikipedia.org

Something about
Data
Sanjeev Mishra Chris Bedford
sanjeev.mishra@gmail.com chris@buildlackey.com
