Brief overview of some basic algorithms used online and across data-mining, and a word on where to learn them. Prepared specially for UCC Boole Prize 2012.
Mathematics online: some common algorithms
1. MATHEMATICS ONLINE
Data-Mining, Predictive Analytics, Clustering, A.I.,
Machine Learning… and where to learn all this.
Boole Prize 2012
Mark Moriarty
University College Cork
2. 3 SECTIONS:
• 1 - Overview to some applications of Maths online.
• 2 - Sample algorithms.
• 3 - Recommended online Maths courses.
3. SECTION 1 (MOTIVATION):
MATHEMATICS IN ACTION
• User Clustering.
• Recommender Systems. Movie recommendations.
• Shopper analytics – send relevant coupons.
• Voice recognition. Machine Learning.
• Spam detection.
• Fraud detection.
• Facebook Feed.
• Google’s PageRank.
• DNA sequencing.
• Health analytics.
• Intelligent ad displays.
• etc.
4. AWKS…
“My daughter got this in the mail!
She’s still in high school, and
you’re sending her coupons for
baby clothes and cribs? Are you
trying to encourage her to get
pregnant?! ”
5. HOW TARGET FIGURED OUT A TEEN GIRL WAS
PREGNANT BEFORE HER FATHER DID
As Pole’s computers crawled through the data, he was
able to identify about 25 products that, when analyzed
together, allowed him to assign each shopper a
“pregnancy prediction” score. More important, he
could also estimate her due date to within a small
window, so Target could send coupons timed to very
specific stages of her pregnancy.
Take a fictional Target shopper who is 23, and in March bought cocoa-
butter lotion, a purse large enough to double as a diaper bag, zinc and
magnesium supplements and a bright blue rug. There’s, say, an 87%
chance that she’s pregnant and that her delivery date is sometime in late
August.
6. HOW KHAN ACADEMY IS USING MACHINE
LEARNING TO ASSESS STUDENT MASTERY
Old method: To determine when a student has finished a certain
exercise, they awarded proficiency to a user who has answered at
least 10 problems in a row correctly — known as a streak.
New metric for accuracy…
What do I mean by accuracy? Now define it as

accuracy = P(next problem correct | just gained proficiency)

which is just notation desperately trying to say “Given that we just
gained proficiency, what’s the probability of getting the next
problem correct?”
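As a concrete illustration, the new metric can be estimated from a log of user responses. This is a hypothetical sketch (the function names, the data layout, and the use of the old 10-in-a-row rule as the proficiency trigger are assumptions, not Khan Academy's actual code):

```python
# Hypothetical sketch: estimate accuracy = P(next problem correct | just
# gained proficiency) from response logs. Each user's history is a list
# of booleans (True = answered correctly). Proficiency is "gained" under
# the old rule after 10 correct in a row.

STREAK_NEEDED = 10

def first_proficiency_index(history):
    """Return the index at which a 10-streak first completes, or None."""
    streak = 0
    for i, correct in enumerate(history):
        streak = streak + 1 if correct else 0
        if streak == STREAK_NEEDED:
            return i
    return None

def accuracy(histories):
    """Fraction of newly proficient users whose NEXT answer is correct."""
    hits = total = 0
    for history in histories:
        i = first_proficiency_index(history)
        if i is not None and i + 1 < len(history):  # a "next" problem exists
            total += 1
            hits += history[i + 1]
    return hits / total if total else float("nan")
```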
7. NETFLIX PRIZE
$1 million top prize for their verified
submission on July 26, 2009,
achieving the winning RMSE of
0.8567 on the test subset. This
represents a 10.06% improvement
over Cinematch’s score on the test
subset at the start of the contest.
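The contest metric above, root-mean-square error between predicted and actual star ratings, is simple to state in code. A minimal sketch:

```python
# RMSE between predicted and actual ratings: the Netflix Prize metric.
# The winning submission scored 0.8567 on the test subset.
import math

def rmse(predicted, actual):
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
```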
8. PANDORA & THE MUSIC GENOME PROJECT®
• On January 6, 2000 a group of musicians and music-loving
technologists came together with the idea of creating the most
comprehensive analysis of music ever.
• Together we set out to capture the essence of music at the most
fundamental level. We ended up assembling literally hundreds
of musical attributes or "genes" into a very large Music Genome.
10. FACEBOOK NEWS FEED
The default wall setting is "Top News".
EdgeRank is there to do the customizing for you, based on
how each item scores in the algorithm.
The three main criteria for an item's algorithm score are:
1. Affinity: How often you and your friends interact
on the platform
2. Weight: Each type of content is weighted
differently, based on the past interactions of that
type of content
3. Time: How old the published item is
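The three criteria above can be sketched as a score: sum, over each interaction ("edge") on an item, the product of affinity, content-type weight, and a time decay. The exact Facebook formula and weights are not public; everything numeric below is an assumption for illustration:

```python
# Hedged sketch of an EdgeRank-style score. The TYPE_WEIGHT values and
# the exponential half-life decay are made-up illustrations, not
# Facebook's actual parameters.
import math

TYPE_WEIGHT = {"comment": 4.0, "like": 1.0, "share": 6.0}  # assumed weights

def edge_rank(edges, now, half_life_hours=24.0):
    """edges: list of (affinity, edge_type, created_at_unix_seconds)."""
    score = 0.0
    for affinity, edge_type, created_at in edges:
        age_hours = (now - created_at) / 3600.0
        decay = 0.5 ** (age_hours / half_life_hours)  # halves every day
        score += affinity * TYPE_WEIGHT[edge_type] * decay
    return score
```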
15. LOGISTIC REGRESSION
• At the most basic level, for one input variable, linear
regression is simply “fitting a line to some data”.
• Let’s look at the sample case of the Khan
Academy:
16. LOGISTIC REGRESSION ALGORITHM
• vector x = the values of input features
(e.g. % correct).
• vector w = how much each feature
makes it more likely that the user is
proficient.
• We can write compactly as a linear
algebra dot product: z = w · x.
Already, you can see that the higher z is, the more
likely the user is to be proficient. To obtain our
probability estimate, all we have to do is
“shrink” z into the interval (0, 1). We can do this
by plugging it into a sigmoid function:
P(proficient) = 1 / (1 + e^(−z))
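In code, the two steps on this slide (dot product, then sigmoid) are only a few lines. A minimal sketch, with illustrative weights rather than Khan Academy's actual values:

```python
# z = w . x, then squash with the sigmoid to get a probability in (0, 1).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def proficiency_probability(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))  # dot product w . x
    return sigmoid(z)
```

Note that a higher z pushes the output towards 1, matching the slide: the higher z is, the more likely the user is to be proficient.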
17. LOGISTIC REGRESSION RESULTS
From http://david-hu.com/2011/11/02/how-khan-academy-is-using-machine-learning-to-assess-student-mastery.html
21. K-MEANS: INTRODUCTION
• Partitioning Clustering Approach
• a typical clustering analysis approach via partitioning data set iteratively
• construct a partition of a data set to produce several non-empty clusters
(usually, the number of clusters given in advance)
• in principle, partitions achieved via minimising the sum of squared distance in
each cluster:
E = Σ_{i=1}^{K} Σ_{x ∈ C_i} ||x − m_i||²
• Given a K, find a partition of K clusters to optimise the chosen
partitioning criterion
• K-means algorithm: each cluster is represented by the centroid of the cluster
and the algorithm converges to stable centres of clusters.
22. K-MEANS ALGORITHM
• Given the cluster number K, the K-means algorithm iterates three steps
after initialisation:
Initialisation: set seed points.
1) Assign each object to the
cluster with the nearest seed
point.
2) Compute seed points as the
centroids of the clusters of the
current partition (the centroid
is the centre, i.e., mean point,
of the cluster).
3) Go back to Step 1); stop when
there are no more new assignments.
23–28. K-MEANS DEMO
1. User sets up the number of
clusters they’d like. (e.g.
K=5)
2. Randomly guess K cluster
centre locations.
3. Each data point finds out
which centre it’s closest to.
(Thus each centre “owns” a
set of data points)
4. Each centre finds the
centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated!
Credit to Ke Chen for the example graphics used on these slides.
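The six demo steps above can be sketched directly in code. A minimal plain-Python K-means on 2-D points; as a simplifying assumption, the first K points stand in for the "random guess" of initial centres:

```python
# Minimal K-means following the demo steps, on lists of 2-D points.
import math

def kmeans(points, k, max_iters=100):
    centres = [list(p) for p in points[:k]]           # step 2: initial guess
    clusters = []
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 3: nearest centre
            j = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[j].append(p)
        new_centres = []
        for j in range(k):                            # step 4: find centroids
            if clusters[j]:
                n = len(clusters[j])
                new_centres.append([sum(c) / n for c in zip(*clusters[j])])
            else:
                new_centres.append(centres[j])        # keep an empty cluster's centre
        if new_centres == centres:                    # step 6: stop on no change
            break
        centres = new_centres                         # step 5: jump there
    return centres, clusters
```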
29. RELEVANT ISSUES
• Efficient in computation
• O(tKn), where n is number of objects, K is number of clusters, and t is
number of iterations. Normally, K, t << n.
• Local optimum
• sensitive to initial seed points
• converge to a local optimum that may be unwanted solution
• Other problems
• Need to specify K, the number of clusters, in advance
• Unable to handle noisy data and outliers (addressed by the K-Medoids algorithm)
• Not suitable for discovering clusters with non-convex shapes
• Applicable only when the mean is defined – then what about categorical data?
(addressed by the K-Modes algorithm)
30. RELEVANT ISSUES
• Cluster Validity
• With different initial conditions, the K-means algorithm may result in different partitions
for a given data set.
• Which partition is the “best” one for the given data set?
• In theory, there is no answer to this question, as no ground truth is available in
unsupervised learning
• Nevertheless, there are several cluster validity criteria to assess the quality of
clustering analysis from different perspectives
• A common cluster validity criterion is the ratio of the total between-cluster to the total
within-cluster distances
• Between-cluster distance (BCD): the distance between means of two clusters
• Within-cluster distance (WCD): sum of all distance between data points and the
mean in a specific cluster
• A large ratio of BCD:WCD suggests good compactness inside clusters and good
separability among different clusters!
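The BCD:WCD criterion above is easy to compute once clusters are formed. A small sketch using Euclidean distances (the function names are illustrative):

```python
# Ratio of total between-cluster distance (between cluster means) to total
# within-cluster distance (points to their own mean). Larger is better.
import math
from itertools import combinations

def mean_point(cluster):
    n = len(cluster)
    return [sum(c) / n for c in zip(*cluster)]

def bcd_wcd_ratio(clusters):
    means = [mean_point(c) for c in clusters]
    bcd = sum(math.dist(a, b) for a, b in combinations(means, 2))
    wcd = sum(math.dist(p, m) for c, m in zip(clusters, means) for p in c)
    return bcd / wcd
```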
31. CONCLUSION
• K-means algorithm is a simple yet popular method for clustering
analysis
• Its performance is determined by initialisation and appropriate
distance measure
• There are several variants of K-means to overcome its weaknesses
• K-Medoids: resistance to noise and/or outliers
• K-Modes: extension to categorical data clustering analysis
42. REFERENCES
• “One Learning Hypothesis” image from http://www.ml-class.org
• Khan Academy discussion from http://david-hu.com/2011/11/02/how-khan-academy-is-
using-machine-learning-to-assess-student-mastery.html
• K-Means images from
http://www.cs.manchester.ac.uk/ugt/COMP24111/materials/slides/K-means.ppt
• Word equation for Naïve Bayes: http://www.wikipedia.org
• K nearest neighbours image from http://mlpy.sourceforge.net/docs/3.0/_images/knn.png
• Recommender Systems image from
http://holehouse.org/mlclass/16_Recommender_Systems.html
QUESTIONS?
2012-02-22 UCC Boole Prize M@rkMoriarty.com