2. WHY AM I GIVING THIS TALK?
I am in the final stages of writing Clojure for Data Science.
It will be published by Packt Publishing (http://packtpub.com) later this year.
3. AM I QUALIFIED?
I co-founded and was CTO of a data analytics company.
I am a software engineer, not a statistician.
4. WHY IS DATA SCIENCE IMPORTANT?
The robots are coming!
The rise of the computational developer.
These trends influence the kinds of systems we are all
expected to build.
5. WHY CLOJURE?
Clojure lends itself to interactive exploration and learning.
It has fantastic data manipulation abstractions.
The JVM hosts many of the workhorse data storage and
processing frameworks.
6. WHAT I WILL COVER
Distributions
Statistics
Visualisation with Quil
Correlation
Simple linear regression
Multivariable linear regression with Incanter
Break
Categorical data
Bayes classification
Logistic regression with Apache Commons Math
Clustering with Parkour and Apache Mahout
7. FOLLOW ALONG
The book's GitHub repositories are available at
http://github.com/clojuredatascience
ch1-introduction
ch2-statistical-inference
ch3-linear-regression
ch5-classification
ch6-clustering
10. IF YOU'RE FOLLOWING ALONG
git clone git@github.com:clojuredatascience/ch1-introduction.git
cd ch1-introduction
script/download-data.sh
lein run -e 1.1
14. …EXPLAINED
∑ is `(reduce + …)`.
∑ from i=1 to n is "for all xs".
(xi − μx)² is a function of x and the mean of x.
(defn variance [xs]
  (let [m (mean xs)
        n (count xs)
        square-error (fn [x]
                       (Math/pow (- x m) 2))]
    (/ (reduce + (map square-error xs)) n)))
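A tiny worked check of the definition (repeating the slide's `mean` and `variance` so the snippet stands alone):

```clojure
;; mean and variance as on the slide, repeated here so the
;; example is self-contained
(defn mean [xs]
  (/ (reduce + xs) (count xs)))

(defn variance [xs]
  (let [m (mean xs)
        n (count xs)
        square-error (fn [x]
                       (Math/pow (- x m) 2))]
    (/ (reduce + (map square-error xs)) n)))

(variance [1 2 3 4 5])
;; => 2.0
;; the mean is 3; the squared errors are 4, 1, 0, 1, 4,
;; summing to 10, divided by n = 5
```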
17. POINCARÉ'S BREAD
Poincaré weighed his bread every day for a year.
He discovered that the weights of the bread followed a
normal distribution, but that the peak was at 950g, whereas
loaves of bread were supposed to be regulated at 1kg. He
reported his baker to the authorities.
The next year Poincaré continued to weigh his bread from
the same baker, who was now wary of giving him the lighter
loaves. After a year the mean loaf weight was 1kg, but this
time the distribution had a positive skew. This is consistent
with the baker giving Poincaré only the heaviest of his loaves.
The baker was reported to the authorities again.
34. CREDIT
A paper in the Proceedings of the National Academy of Sciences titled
"Statistical Detection of Election Irregularities", by a team led by
Santa Fe Institute External Professor Stefan Thurner.
38. SAMPLING SIZE
The values converge as the sample size increases.
We can often only infer the population parameters.
Sample    Population
n         N
X̄         μX
SX        σX
49. SMALL SAMPLES
The standard error is calculated from the population
standard deviation, but we don't know it!
In practice the sample and population standard deviations are
assumed to be the same above around 30 samples, but there is
another distribution (Student's t) that models the loss of
precision with small samples.
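As code, a minimal sketch of the standard error of the mean, substituting the sample standard deviation for the unknown population value (these `mean` and `standard-deviation` helpers are plausible definitions, not necessarily the book's):

```clojure
(defn mean [xs]
  (/ (reduce + xs) (count xs)))

(defn standard-deviation [xs]
  (let [m (mean xs)
        n (count xs)]
    (Math/sqrt (/ (reduce + (map #(Math/pow (- % m) 2) xs))
                  n))))

;; SE = s / sqrt(n), using the sample standard deviation s
;; in place of the unknown population standard deviation
(defn standard-error [xs]
  (/ (standard-deviation xs)
     (Math/sqrt (count xs))))
```

Quadrupling the sample size halves the standard error, which is why the estimates converge as samples grow.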
53. WHY THIS INTEREST IN MEANS?
Because often when we want to know if a difference in
populations is statistically significant, we'll compare the
means.
54. HYPOTHESIS TESTING
By convention the data is assumed not to support what the
researcher is looking for.
This conservative assumption is called the null hypothesis and
denoted h0.
The alternate hypothesis, h1, can then only be supported with
a given confidence interval.
55. SIGNIFICANCE
The greater the significance of a result, the more certainty we
have that the null hypothesis can be rejected.
Let's use our range controller to adjust the significance
threshold.
59. POPULATION OF OLYMPIC SWIMMERS
The Guardian has helpfully provided data on the vital
statistics of Olympians:
http://www.theguardian.com/sport/datablog/2012/aug/07/olympics-2012-athletes-age-weight-height#data
62. LOG-NORMAL DISTRIBUTION
"A variable might be modeled as log-normal if it can be
thought of as the multiplicative product of many independent
random variables, each of which is positive. This is justified by
considering the central limit theorem in the log-domain."
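To illustrate the quoted intuition, the sketch below (not from the book) multiplies many independent positive factors; taking logs turns the product into a sum, which the central limit theorem makes approximately normal:

```clojure
;; A product of many independent positive factors...
(defn random-product []
  (->> (repeatedly 100 #(+ 0.5 (rand)))   ; positive factors in [0.5, 1.5)
       (reduce *)))

;; ...is a sum of their logs in the log domain, so its
;; distribution is approximately log-normal.
(def log-samples
  (->> (repeatedly 1000 random-product)
       (map #(Math/log %))))
```

A histogram of `log-samples` will look like the familiar bell curve, while the raw products are skewed.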
66. CORRELATION
A few ways of measuring it, depending on whether your data
is continuous or discrete
http://xkcd.com/552/
67. PEARSON'S CORRELATION
Covariance divided by the product of standard deviations. It
measures linear correlation.
ρX,Y = COV(X,Y) / (σX σY)
(defn pearsons-correlation [x y]
  (/ (covariance x y)
     (* (standard-deviation x)
        (standard-deviation y))))
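The snippet above assumes `covariance` and `standard-deviation` from earlier in the talk; for a self-contained check, here are plausible definitions (not necessarily the book's) together with a sanity test on perfectly linear data:

```clojure
(defn mean [xs]
  (/ (reduce + xs) (count xs)))

;; covariance: average product of paired deviations from the means
(defn covariance [xs ys]
  (let [mx (mean xs)
        my (mean ys)]
    (/ (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
       (count xs))))

(defn standard-deviation [xs]
  (let [m (mean xs)]
    (Math/sqrt (/ (reduce + (map #(Math/pow (- % m) 2) xs))
                  (count xs)))))

(defn pearsons-correlation [x y]
  (/ (covariance x y)
     (* (standard-deviation x)
        (standard-deviation y))))

;; a perfect positive linear relationship gives r ≈ 1.0
(pearsons-correlation [1 2 3 4] [2 4 6 8])
```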
68. PEARSON'S CORRELATION
If r is 0, it doesn't necessarily mean that the variables are not
correlated. Pearson's correlation only measures linear
relationships.
69. THIS IS A STATISTIC
The unknown population parameter for correlation is the
Greek letter ρ. We are only able to calculate the sample
statistic r.
How far we can trust r as an estimate of ρ will depend on two
factors:
the size of the coefficient
the size of the sample
rX,Y = COV(X,Y) / (sX sY)
71. SIMPLE LINEAR REGRESSION
(defn slope [x y]
  (/ (covariance x y)
     (variance x)))

(defn intercept [x y]
  (- (mean y)
     (* (mean x)
        (slope x y))))

(defn predict [a b x]
  (+ a (* b x)))
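A quick sanity check on exactly linear data, repeating plausible definitions of the `mean`, `variance`, and `covariance` helpers the slide assumes:

```clojure
(defn mean [xs]
  (/ (reduce + xs) (count xs)))

(defn variance [xs]
  (let [m (mean xs)]
    (/ (reduce + (map #(Math/pow (- % m) 2) xs))
       (count xs))))

(defn covariance [xs ys]
  (let [mx (mean xs)
        my (mean ys)]
    (/ (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
       (count xs))))

(defn slope [x y] (/ (covariance x y) (variance x)))
(defn intercept [x y] (- (mean y) (* (mean x) (slope x y))))
(defn predict [a b x] (+ a (* b x)))

;; ys follow y = 2x + 1 exactly, so the fitted line recovers
;; slope 2 and intercept 1, and predicts 11.0 at x = 5
(let [xs [1 2 3 4]
      ys [3 5 7 9]
      b  (slope xs ys)
      a  (intercept xs ys)]
  (predict a b 5))
;; => 11.0
```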
72. TRAINING A MODEL
(defn swimmer-data []
  (->> (athlete-data)
       ($where {"Height, cm" {:$ne nil}
                "Weight" {:$ne nil}
                "Sport" {:$eq "Swimming"}})))

(defn ex-3-12 []
  (let [data (swimmer-data)
        heights ($ "Height, cm" data)
        weights (log ($ "Weight" data))
        a (intercept heights weights)
        b (slope heights weights)]
    (println "Intercept: " a)
    (println "Slope: " b)))
73. MAKING A PREDICTION
(predict 1.691 0.0143 185)
;; => 4.3365
(i/exp (predict 1.691 0.0143 185))
;; => 76.44
Corresponding to a predicted weight of 76.4kg
In 1979, Mark Spitz was 79kg.
http://www.topendsports.com/sport/swimming/profiles/spitz-mark.htm
74. MORE DATA!
(defn features [dataset col-names]
  (->> (i/$ col-names dataset)
       (i/to-matrix)))

(defn gender-dummy [gender]
  (if (= gender "F")
    0.0 1.0))

(defn ex-3-26 []
  (let [data (->> (swimmer-data)
                  (i/add-derived-column "Gender Dummy"
                                        ["Sex"] gender-dummy))
        x (features data ["Height, cm" "Age" "Gender Dummy"])
        y (i/log ($ "Weight" data))
        model (s/linear-model y x)]
    (:coefs model)))
;; => [2.2307529431422637 0.010714697827121089 0.002372188749408574 0.0975412532492026]
86. STANDARD ERROR FOR A PROPORTION
SE = √(p(1 − p) / n)
(defn standard-error-proportion [p n]
  (-> (- 1 p)
      (* p)
      (/ n)
      (Math/sqrt)))
p = (161 + 339) / (682 + 127) = 500 / 809 = 0.61
SE = 0.013
87. HOW SIGNIFICANT?
z = (p1 − p2) / SE
P1: the proportion of women who survived = 339 / 446 = 0.76
P2: the proportion of men who survived = 161 / 843 = 0.19
SE: 0.013
z = 20.36
This is essentially impossible to have occurred by chance.
94. BAYES CLASSIFICATION
P(survive|third, male) = P(survive) P(third|survive) P(male|survive) / P(third, male)
P(perish|third, male) = P(perish) P(third|perish) P(male|perish) / P(third, male)
Because the evidence P(third, male) is the same for all classes,
we can cancel it out.
95. PARSE THE DATA
(titanic-samples)
;; => ({:survived true, :gender :female, :class :first, :embarked "S", :age "20-30"}
;;     {:survived true, :gender :male, :class :first, :embarked "S", :age "30-40"} ...)
96. IMPLEMENTING A NAIVE BAYES MODEL
(defn safe-inc [v]
  (inc (or v 0)))

(defn inc-class-total [model class]
  (update-in model [class :total] safe-inc))

(defn inc-predictors-count-fn [row class]
  (fn [model attr]
    (let [val (get row attr)]
      (update-in model [class attr val] safe-inc))))
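The `naive-bayes` constructor used by the later examples isn't shown on these slides. One plausible sketch, an assumption consistent with how `conditional-probability` reads the model, folds the two increment functions over the training data:

```clojure
(defn safe-inc [v]
  (inc (or v 0)))

(defn inc-class-total [model class]
  (update-in model [class :total] safe-inc))

(defn inc-predictors-count-fn [row class]
  (fn [model attr]
    (let [val (get row attr)]
      (update-in model [class attr val] safe-inc))))

;; A sketch, not necessarily the book's definition: for each row,
;; bump the row's class total, then bump the count of each
;; predictor's observed value under that class.
(defn naive-bayes [data class-attr predictors]
  (reduce (fn [model row]
            (let [class (get row class-attr)]
              (reduce (inc-predictors-count-fn row class)
                      (inc-class-total model class)
                      predictors)))
          {}
          data))
```

The resulting model is a nested map, e.g. `{true {:total 2, :gender {:female 1, :male 1}}, false {…}}`, which is exactly the shape the prediction functions traverse.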
99. MAKING PREDICTIONS
(defn n [model]
  (->> (vals model)
       (map :total)
       (apply +)))

(defn conditional-probability [model test class]
  (let [evidence (get model class)
        prior (/ (:total evidence)
                 (n model))]
    (apply * prior
           (for [kv test]
             (/ (get-in evidence kv)
                (:total evidence))))))

(defn bayes-classify [model test]
  (let [probs (map (fn [class]
                     [class (conditional-probability model test class)])
                   (keys model))]
    (-> (sort-by second > probs)
        (ffirst))))
100. DOES IT WORK?
(defn ex-5-7 []
  (let [data (titanic-samples)
        model (naive-bayes data :survived [:gender :class])]
    (bayes-classify model {:gender :male :class :third})))
;; => false

(defn ex-5-8 []
  (let [data (titanic-samples)
        model (naive-bayes data :survived [:gender :class])]
    (bayes-classify model {:gender :female :class :first})))
;; => true
101. WHY NAIVE?
Because it assumes all variables are independent. We know
they are not (e.g. being male and travelling third class are
correlated), but naive Bayes weights all attributes equally.
In practice it works surprisingly well, particularly where there
are large numbers of features.
103. LOGISTIC REGRESSION
Logistic regression uses similar techniques to linear
regression but guarantees an output only between 0 and 1.
Linear regression: hθ(x) = θᵀx
Logistic regression: hθ(x) = g(θᵀx)
where the sigmoid function is
g(z) = 1 / (1 + e^−z)
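A minimal Clojure sketch of the sigmoid (an assumption; the talk's own implementation isn't shown on this slide):

```clojure
;; The sigmoid (logistic) function squashes any real input
;; into the interval (0, 1).
(defn sigmoid [z]
  (/ 1 (+ 1 (Math/exp (- z)))))

(sigmoid 0)
;; => 0.5
```

Large positive inputs approach 1 and large negative inputs approach 0, which is what lets the output be read as a probability.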
109. CALCULATING THE GRADIENT
(defn gradient-fn [h-theta xs ys]
  (let [g (fn [x y]
            (matrix/mmul (- (h-theta x) y) x))]
    (->> (map g xs ys)
         (matrix/transpose)
         (map avg))))
We transpose to calculate the average for each feature
across all xs, rather than the average for each x across all
features.
115. PRODUCING A MODEL
(defn ex-5-11 []
  (let [data (titanic-features)
        initial-guess (-> data first count (take (repeatedly rand)))]
    (run-logistic-regression data initial-guess)))
119. CLUSTERING
Find a grouping of a set of objects such that objects in the
same group are more similar to each other than those in
other groups.
120. SIMILARITY MEASURES
Many to choose from: Jaccard, Euclidean.
For text documents the cosine measure is often chosen.
Good for high-dimensional spaces.
In positive spaces the similarity is between 0 and 1.
121. COSINE SIMILARITY
cos(θ) = A ⋅ B / (∥A∥ ∥B∥)
(defn cosine [a b]
  (let [dot-product (->> (map * a b)
                         (apply +))
        magnitude (fn [d]
                    (->> (map #(Math/pow % 2) d)
                         (apply +)
                         Math/sqrt))]
    (/ dot-product
       (* (magnitude a) (magnitude b)))))
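Two quick checks of the behaviour (repeating the slide's `cosine` so the snippet is self-contained): orthogonal vectors score 0, and parallel vectors score approximately 1.

```clojure
(defn cosine [a b]
  (let [dot-product (->> (map * a b)
                         (apply +))
        magnitude (fn [d]
                    (->> (map #(Math/pow % 2) d)
                         (apply +)
                         Math/sqrt))]
    (/ dot-product
       (* (magnitude a) (magnitude b)))))

(cosine [1 0] [0 1])
;; => 0.0  (orthogonal: no shared direction)

(cosine [1 2] [2 4])
;; ≈ 1.0   (parallel: identical direction, regardless of length)
```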
125. WHY?
(cosine-sparse
 (->> "music is the food of love"
      stemmer/stems
      (document-vector dictionary))
 (->> "war is the locomotive of history"
      stemmer/stems
      (document-vector dictionary)))
;; => 0.0

(cosine-sparse
 (->> "music is the food of love"
      stemmer/stems
      (document-vector dictionary))
 (->> "it's lovely that you're musical"
      stemmer/stems
      (document-vector dictionary)))
;; => 0.8164965809277259
128. GET THE DATA
We're going to be clustering the Reuters dataset.
Follow the readme instructions:
brew install mahout
script/download-reuters.sh
lein run -e 6.7
mahout seqdirectory -i data/reuters-txt -o data/reuters-sequencefile
129. VECTOR REPRESENTATION
Each document is converted into a vector representation.
All vectors share a dictionary providing a unique index for
each word.
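A minimal sketch of how a shared `dictionary` and `document-vector` might work (an illustration, not the book's implementation): every distinct term across the corpus gets a unique index, and each document becomes a sparse map from term index to count.

```clojure
(require '[clojure.string :as str])

;; Assign each distinct term across all documents a unique index.
(defn build-dictionary [documents]
  (->> documents
       (mapcat #(str/split % #"\W+"))
       (distinct)
       (map-indexed (fn [i term] [term i]))
       (into {})))

;; A document becomes a sparse map of term index -> count;
;; `keep` drops words that aren't in the dictionary.
(defn document-vector [dictionary document]
  (->> (str/split document #"\W+")
       (keep dictionary)
       (frequencies)))

(def dictionary
  (build-dictionary ["music is the food of love"
                     "war is the locomotive of history"]))

(document-vector dictionary "music music war")
;; a sparse map such as {0 2, 6 1}
```

Because every document is indexed against the same dictionary, any two document vectors are directly comparable, which is what the cosine measure needs.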
133. WE NEED A UNIQUE ID
And we need to compute it in parallel.
134. PARKOUR MAPPING
(require '[clojure.core.reducers :as r]
         '[parkour.mapreduce :as mr])

(defn document->terms [doc]
  (clojure.string/split doc #"\W+"))

(defn document-count-m
  "Emits the unique words from each document"
  {::mr/source-as :vals}
  [documents]
  (->> documents
       (r/mapcat (comp distinct document->terms))
       (r/map #(vector % 1))))
135. SHAPE METADATA
:keyvals ;; Re-shape as vectors of key-value pairs.
:keys    ;; Just the keys from each key-value pair.
:vals    ;; Just the values from each key-value pair.
136. PLAIN OLD FUNCTIONS
(->> (document-count-m ["it's lovely that you're musical"
                        "music is the food of love"
                        "war is the locomotive of history"])
     (into []))
;; => [["love" 1] ["music" 1] ["music" 1] ["food" 1] ["love" 1] ["war" 1]
;;     ["locomot" 1] ["histori" 1]]
148. WHAT DID I LEAVE OUT?
Cluster quality measures
Spectral and LDA clustering
Collaborative filtering with Mahout
Random forests
Spark for movie recommendations with Sparkling
Graph data with Loom and Datomic
MapReduce with Cascalog and PigPen
Adapting algorithms for massive scale
Time series and forecasting
Dimensionality reduction, feature selection
More visualisation techniques
Lots more…
149. BOOK
Clojure for Data Science will be available in the second half
of the year from Packt Publishing (http://packtpub.com).
http://cljds.com