How to do Predictive Analytics with Limited Data
Ulrich Rueckert

© 2013 Datameer, Inc. All rights reserved.

Thursday, October 31, 13
Agenda

Introduction and Motivation
Semi-Supervised Learning
• Generative Models
• Large Margin Approaches
• Similarity-Based Methods
Conclusion

Predictive Modeling

Traditional Supervised Learning
• Promotion on a bookseller's web page
• Customers can rate books.
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only a few customers rate a book?


[Figure: a training data table with attributes Age and Income and target label LikesBook (e.g. Age 22, Income 67K, LikesBook +; Age 39, Income 41K, LikesBook -) is used to fit a model, which then predicts the missing LikesBook labels of the test data.]
Supervised Learning

Classification
• For now: each data instance is a point in a 2D coordinate system
• Color denotes the target label
• The model is given as a decision boundary

What's the correct model?
• In theory: no way to tell
• All learning systems have underlying assumptions
• Smoothness assumption: similar instances have similar labels

Semi-Supervised Learning

Can we make use of the unlabeled data?
• In theory: no
• ... but we can make assumptions

Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
The Clustering Assumption

Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, DBSCAN, etc.
• Available e.g. in Mahout

Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform a majority vote

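The cluster-then-majority-vote recipe can be sketched in a few lines. This is a minimal illustration on invented synthetic 2-D data (the blob positions, the plain 2-means clustering, and the choice of which six points are labeled are all assumptions for the sketch, not part of the talk's pipeline):

```python
import numpy as np

rng = np.random.RandomState(0)
# Two synthetic blobs standing in for the two classes
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 5])
y_true = np.array([1] * 100 + [-1] * 100)
labeled_idx = np.array([0, 1, 2, 100, 101, 102])  # only six instances are labeled

# Step 1: cluster all instances, labeled and unlabeled alike (Lloyd's 2-means)
centers = X[[0, 100]].copy()                      # one seed from each region
for _ in range(20):
    d = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
    clusters = d.argmin(axis=1)
    centers = np.array([X[clusters == c].mean(axis=0) for c in (0, 1)])

# Step 2: each cluster takes the majority label of its labeled members
y_pred = np.empty(len(X), dtype=int)
for c in (0, 1):
    members = labeled_idx[clusters[labeled_idx] == c]
    y_pred[clusters == c] = 1 if (y_true[members] == 1).sum() * 2 >= len(members) else -1

accuracy = (y_pred == y_true).mean()
```

With well-separated clusters a handful of labels is enough to name every cluster, which is the whole point of the clustering assumption.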
Generative Models

Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of the clusters given the data

Expectation-Maximization
• Two-step optimization procedure
• Each step is one MapReduce job
• Keeps estimates of the cluster assignment probabilities for each instance
• Might converge to a local optimum

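The two EM steps can be shown on a toy one-dimensional mixture of two Gaussians (data, fixed unit variances, and equal mixture weights are assumptions for this sketch). On Hadoop, each of the two steps would run as one MapReduce job over the full data set:

```python
import numpy as np

rng = np.random.RandomState(1)
# Two 1-D clusters, generated by normal distributions around 0 and 6
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])

mu = np.array([-1.0, 1.0])            # initial guesses for the cluster means
for _ in range(50):
    # E-step: cluster assignment probabilities (responsibilities) per instance
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: most probable cluster locations given the responsibilities
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
```

Depending on the initialization the procedure can get stuck in a local optimum, which is why it is typically restarted several times.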
Beyond Mixtures of Gaussians

Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g. use Naive Bayes as the mixture model for text classification

Self-Training
• Learn a model on the labeled instances only
• Apply the model to the unlabeled instances
• Learn a new model on all instances
• Repeat until convergence

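The self-training loop above can be sketched with a simple nearest-centroid classifier standing in for the base learner (any classifier would do; the toy data and the four labeled indices are invented for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 4])
y_true = np.array([0] * 100 + [1] * 100)
labels = np.full(len(X), -1)                      # -1 marks "unlabeled"
labels[[0, 1, 100, 101]] = y_true[[0, 1, 100, 101]]
unlabeled = labels == -1

for _ in range(10):
    # Learn a model on all currently labeled instances (here: class centroids)
    centroids = np.array([X[labels == c].mean(axis=0) for c in (0, 1)])
    # Apply it to the unlabeled instances and adopt the predictions as labels
    d = np.linalg.norm(X[unlabeled, None] - centroids[None, :], axis=2)
    labels[unlabeled] = d.argmin(axis=1)

accuracy = (labels == y_true).mean()
```

The first round uses only the four labeled points; every later round re-fits on all instances and re-labels the originally unlabeled ones until the labels stop changing.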
The Low Density Assumption

Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine
• The decision boundary is linear
• Maximizes the margin to the closest instances
• Can be learned in one MapReduce step (stochastic gradient descent)

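The stochastic gradient descent step for a linear SVM is a one-liner per instance. A minimal Pegasos-style sketch on invented, centered toy data (no bias term, regularization strength chosen arbitrarily), with the step size shrinking as 1/t:

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) - 2, rng.randn(100, 2) + 2])
y = np.array([-1.0] * 100 + [1.0] * 100)

w = np.zeros(2)
lam = 0.01                               # regularization strength
t = 0
for epoch in range(5):
    for i in rng.permutation(len(X)):    # one run over the data in random order
        t += 1
        eta = 1.0 / (lam * t)            # steps get smaller over time
        if y[i] * (X[i] @ w) < 1.0:      # margin violated: move the classifier a bit
            w = (1.0 - 1.0 / t) * w + eta * y[i] * X[i]
        else:
            w = (1.0 - 1.0 / t) * w      # only the regularizer shrinks w

accuracy = (np.sign(X @ w) == y).mean()
```

Because each update touches one instance, the whole pass fits into a single MapReduce step that streams the data past the classifier.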
The Low Density Assumption

Semi-Supervised Support Vector Machine
• Minimize the distance to labeled and unlabeled instances
• A parameter fine-tunes the influence of the unlabeled instances
• Additional constraint: keep the class balance correct

Implementation
• A simple extension of the SVM
• But a non-convex optimization problem

Semi-Supervised SVM Objective

  min_w ||w||^2 + C Σ_{i=1}^{n} ℓ(y_i (w^T x_i + b)) + C* Σ_{i=n+1}^{n+m} ℓ(|w^T x_i + b|)

• the first term is the regularizer
• the second term sums the loss ℓ over the margins of the n labeled instances
• the third term sums the loss ℓ over the margins |w^T x_i + b| of the m unlabeled instances
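This objective can be written down directly and used, for example, to score candidate solutions from many random runs and keep the best one. A small sketch, with invented toy data, ℓ taken to be the hinge loss, and the unlabeled weight written as `C_star` (both names are choices for the sketch):

```python
import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unl, C=1.0, C_star=0.5):
    hinge = lambda m: np.maximum(0.0, 1.0 - m)       # loss ℓ
    reg = w @ w                                      # ||w||^2
    lab = hinge(y_lab * (X_lab @ w + b)).sum()       # margins of labeled instances
    unl = hinge(np.abs(X_unl @ w + b)).sum()         # margins of unlabeled instances
    return reg + C * lab + C_star * unl

rng = np.random.RandomState(0)
X_lab = np.array([[-2.0, -2.0], [2.0, 2.0]])
y_lab = np.array([-1.0, 1.0])
X_unl = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])

# "Many random runs": score candidate (w, b) solutions and keep the best one
candidates = [(rng.randn(2), rng.randn()) for _ in range(20)]
best_w, best_b = min(candidates,
                     key=lambda wb: s3vm_objective(wb[0], wb[1], X_lab, y_lab, X_unl))
```

Because the unlabeled term makes the problem non-convex, comparing several restarts by their objective value is the standard way to pick a solution.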
Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send the data to the reducer in random order
• Reducer: update the linear classifier on unlabeled or misclassified instances
• Many random runs to find the best one

The Manifold Assumption

The Assumption
• The training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected

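Building such a similarity graph boils down to a nearest-neighbor search. A brute-force sketch on invented data (the value of k and the similarity measure, plain Euclidean distance, are arbitrary choices):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(30, 2)
k = 3

# Pairwise Euclidean distances; infinity on the diagonal forbids self-edges
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)

# Connect every instance to its k nearest neighbors
A = np.zeros((len(X), len(X)), dtype=bool)
for i in range(len(X)):
    A[i, np.argsort(dist[i])[:k]] = True
A = A | A.T     # make the graph undirected
```

The brute-force distance matrix is quadratic in the number of instances, which is exactly why the nearest neighbor join discussed later matters at scale.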
Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion

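On a tiny hand-made chain graph (an invented example), the propagation step reads: every unlabeled node takes the average label of its neighbors, while the known labels stay clamped:

```python
import numpy as np

# Chain graph 0-1-2-3-4; node 0 is labeled +1, node 4 is labeled -1
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
f = np.array([1.0, 0.0, 0.0, 0.0, -1.0])
labeled = np.array([True, False, False, False, True])

for _ in range(200):
    # Propagate: every node averages the labels of its neighbors
    f_next = A @ f / A.sum(axis=1)
    f_next[labeled] = f[labeled]        # clamp the known labels
    f = f_next
```

The iteration settles on labels that interpolate along the chain (1, 0.5, 0, -0.5, -1), the same fixed point the equivalent matrix inversion would return.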
Nearest Neighbor Join

Block Nested Loop Join
• 1st MR job: partition the data into blocks, compute nearest neighbors between blocks
• 2nd MR job: filter out the overall nearest neighbors

Smarter Approaches
• Use spatial information to avoid unnecessary comparisons
• R-trees, space-filling curves, locality-sensitive hashing
• Example implementations are available online

[Diagram: the data is split into blocks 1-4; each reducer receives one pair of blocks, compares them, and the per-pair nearest neighbors are then merged into the final result.]
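The two jobs can be mimicked locally. In this sketch (block layout and data invented for illustration) each "reducer" handles one pair of blocks, and a second pass keeps the overall nearest neighbor per point:

```python
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
n_blocks = 4
block = np.arange(len(X)) % n_blocks          # partition the data into blocks

# 1st job: per block pair, emit each point's nearest neighbor within the pair
candidates = {i: [] for i in range(len(X))}
for a, b in combinations_with_replacement(range(n_blocks), 2):
    idx = np.where((block == a) | (block == b))[0]
    for i in idx:
        d = np.linalg.norm(X[idx] - X[i], axis=1)
        d[idx == i] = np.inf                  # ignore the point itself
        candidates[i].append((d.min(), idx[d.argmin()]))

# 2nd job: filter out the overall nearest neighbor per point
nn = {i: min(c)[1] for i, c in candidates.items()}
```

Every pair of points meets in exactly one block pair, so the filtered result agrees with a full brute-force search while no single reducer ever sees the whole data set.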
Label Propagation on Hadoop

Iterative Procedure
• Mapper: emit neighbor-label pairs
• Reducer: collect the incoming labels per node and combine them into a new label
• Repeat until convergence

Improvement
• Cluster the data and perform within-cluster propagation locally
• Sometimes it's more efficient to perform matrix inversion instead

Conclusion

Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide a valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

Code available: https://github.com/Datameer-Inc/lesel

@Datameer

Webinar - Introducing Datameer 4.0: Visual, End-to-EndDatameer
 
Webinar - Big Data: Power to the User
Webinar - Big Data: Power to the User Webinar - Big Data: Power to the User
Webinar - Big Data: Power to the User Datameer
 
Why Use Hadoop for Big Data Analytics?
Why Use Hadoop for Big Data Analytics?Why Use Hadoop for Big Data Analytics?
Why Use Hadoop for Big Data Analytics?Datameer
 
Why Use Hadoop?
Why Use Hadoop?Why Use Hadoop?
Why Use Hadoop?Datameer
 
Online Fraud Detection Using Big Data Analytics Webinar
Online Fraud Detection Using Big Data Analytics WebinarOnline Fraud Detection Using Big Data Analytics Webinar
Online Fraud Detection Using Big Data Analytics WebinarDatameer
 
Instant Visualizations in Every Step of Analysis
Instant Visualizations in Every Step of AnalysisInstant Visualizations in Every Step of Analysis
Instant Visualizations in Every Step of AnalysisDatameer
 
Customer Case Studies of Self-Service Big Data Analytics
Customer Case Studies of Self-Service Big Data AnalyticsCustomer Case Studies of Self-Service Big Data Analytics
Customer Case Studies of Self-Service Big Data AnalyticsDatameer
 
BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics? BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics? Datameer
 
Is Your Hadoop Environment Secure?
Is Your Hadoop Environment Secure?Is Your Hadoop Environment Secure?
Is Your Hadoop Environment Secure?Datameer
 
Fight Fraud with Big Data Analytics
Fight Fraud with Big Data AnalyticsFight Fraud with Big Data Analytics
Fight Fraud with Big Data AnalyticsDatameer
 
Complement Your Existing Data Warehouse with Big Data & Hadoop
Complement Your Existing Data Warehouse with Big Data & HadoopComplement Your Existing Data Warehouse with Big Data & Hadoop
Complement Your Existing Data Warehouse with Big Data & HadoopDatameer
 
Lean Production Meets Big Data: A Next Generation Use Case
Lean Production Meets Big Data: A Next Generation Use CaseLean Production Meets Big Data: A Next Generation Use Case
Lean Production Meets Big Data: A Next Generation Use CaseDatameer
 
The Economics of SQL on Hadoop
The Economics of SQL on HadoopThe Economics of SQL on Hadoop
The Economics of SQL on HadoopDatameer
 

Mais de Datameer (20)

Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2
 
Extending BI with Big Data Analytics
Extending BI with Big Data AnalyticsExtending BI with Big Data Analytics
Extending BI with Big Data Analytics
 
Getting Started with Big Data for Business Managers
Getting Started with Big Data for Business ManagersGetting Started with Big Data for Business Managers
Getting Started with Big Data for Business Managers
 
The State of Big Data Adoption: A Glance at Top Industries Adopting Big Data ...
The State of Big Data Adoption: A Glance at Top Industries Adopting Big Data ...The State of Big Data Adoption: A Glance at Top Industries Adopting Big Data ...
The State of Big Data Adoption: A Glance at Top Industries Adopting Big Data ...
 
Understand Your Customer Buying Journey with Big Data
Understand Your Customer Buying Journey with Big Data Understand Your Customer Buying Journey with Big Data
Understand Your Customer Buying Journey with Big Data
 
Analyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop WebinarAnalyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop Webinar
 
How to Avoid Pitfalls in Big Data Analytics Webinar
How to Avoid Pitfalls in Big Data Analytics WebinarHow to Avoid Pitfalls in Big Data Analytics Webinar
How to Avoid Pitfalls in Big Data Analytics Webinar
 
Webinar - Introducing Datameer 4.0: Visual, End-to-End
Webinar - Introducing Datameer 4.0: Visual, End-to-EndWebinar - Introducing Datameer 4.0: Visual, End-to-End
Webinar - Introducing Datameer 4.0: Visual, End-to-End
 
Webinar - Big Data: Power to the User
Webinar - Big Data: Power to the User Webinar - Big Data: Power to the User
Webinar - Big Data: Power to the User
 
Why Use Hadoop for Big Data Analytics?
Why Use Hadoop for Big Data Analytics?Why Use Hadoop for Big Data Analytics?
Why Use Hadoop for Big Data Analytics?
 
Why Use Hadoop?
Why Use Hadoop?Why Use Hadoop?
Why Use Hadoop?
 
Online Fraud Detection Using Big Data Analytics Webinar
Online Fraud Detection Using Big Data Analytics WebinarOnline Fraud Detection Using Big Data Analytics Webinar
Online Fraud Detection Using Big Data Analytics Webinar
 
Instant Visualizations in Every Step of Analysis
Instant Visualizations in Every Step of AnalysisInstant Visualizations in Every Step of Analysis
Instant Visualizations in Every Step of Analysis
 
Customer Case Studies of Self-Service Big Data Analytics
Customer Case Studies of Self-Service Big Data AnalyticsCustomer Case Studies of Self-Service Big Data Analytics
Customer Case Studies of Self-Service Big Data Analytics
 
BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics? BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics?
 
Is Your Hadoop Environment Secure?
Is Your Hadoop Environment Secure?Is Your Hadoop Environment Secure?
Is Your Hadoop Environment Secure?
 
Fight Fraud with Big Data Analytics
Fight Fraud with Big Data AnalyticsFight Fraud with Big Data Analytics
Fight Fraud with Big Data Analytics
 
Complement Your Existing Data Warehouse with Big Data & Hadoop
Complement Your Existing Data Warehouse with Big Data & HadoopComplement Your Existing Data Warehouse with Big Data & Hadoop
Complement Your Existing Data Warehouse with Big Data & Hadoop
 
Lean Production Meets Big Data: A Next Generation Use Case
Lean Production Meets Big Data: A Next Generation Use CaseLean Production Meets Big Data: A Next Generation Use Case
Lean Production Meets Big Data: A Next Generation Use Case
 
The Economics of SQL on Hadoop
The Economics of SQL on HadoopThe Economics of SQL on Hadoop
The Economics of SQL on Hadoop
 

Último

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 

Último (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 

How to do Predictive Analytics with Limited Data

  • 1. How to do Predictive Analytics with Limited Data Ulrich Rueckert © 2013 Datameer, Inc. All rights reserved. Thursday, October 31, 13
  • 2. Agenda: Introduction and Motivation; Semi-Supervised Learning (Generative Models, Large Margin Approaches, Similarity-Based Methods); Conclusion
  • 3. Predictive Modeling. Traditional supervised learning: a promotion on a bookseller's web page; customers can rate books; will a new customer like this book? Training set: observations on previous customers. Test set: new customers. What happens if only a few customers rate a book? [Diagram: a training table with attributes Age and Income and target label LikesBook feeds a model, which then predicts LikesBook for the test data.]
  • 6. Supervised Learning. Classification: for now, each data instance is a point in a 2D coordinate system; color denotes the target label; the model is given as a decision boundary. What is the correct model? In theory there is no way to tell: all learning systems have underlying assumptions. Smoothness assumption: similar instances have similar labels.
  • 11. Semi-Supervised Learning. Can we make use of the unlabeled data? In theory: no ... but we can make assumptions. Popular assumptions: the clustering assumption, the low density assumption, and the manifold assumption.
  • 13. The Clustering Assumption. Clustering: partition instances into groups (clusters) of similar instances; many different algorithms exist (k-Means, EM, DBSCAN, etc.), available e.g. in Mahout. Clustering assumption: the two classification targets are distinct clusters. Simple semi-supervised learning: cluster, then perform a majority vote within each cluster.
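The cluster-then-vote recipe from slide 13 can be sketched in a few lines of numpy. The helper names (`kmeans2`, `cluster_then_vote`) are illustrative, not from the talk, and a real deployment would use a distributed implementation such as Mahout's; this sketch fixes two clusters and a deterministic initialization to keep the example small.

```python
import numpy as np

def kmeans2(X, iters=20):
    """Plain Lloyd-style k-means with k=2; returns a cluster index per row.
    Deterministic init: the first point plus the point farthest from it."""
    far = ((X - X[0]) ** 2).sum(1).argmax()
    centers = np.stack([X[0], X[far]])
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)                      # nearest center per point
        for j in range(2):                        # move centers to cluster means
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return assign

def cluster_then_vote(X, y):
    """y holds +1/-1 for labeled rows and 0 for unlabeled ones. Cluster
    everything, then give each cluster the majority label of its labeled
    members (defaulting to +1 if a cluster has no labeled member)."""
    assign = kmeans2(X)
    pred = np.zeros(len(X))
    for j in range(2):
        labeled = y[(assign == j) & (y != 0)]
        pred[assign == j] = np.sign(labeled.sum()) if len(labeled) else 1.0
    return pred

# two well-separated blobs, a single labeled point in each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = np.zeros(40)
y[0], y[20] = 1, -1
pred = cluster_then_vote(X, y)
```

With only two labels, every point still gets a prediction, which is exactly the leverage the clustering assumption buys.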
  • 17. Generative Models. Mixture of Gaussians: assume the data in each cluster is generated by a normal distribution, and find the most probable location and shape of the clusters given the data. Expectation-Maximization: a two-step optimization procedure in which each step is one MapReduce job; it keeps estimates of the cluster assignment probabilities for each instance, and might converge to a local optimum.
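The E and M steps the slide describes can be written out for a two-component one-dimensional Gaussian mixture. The talk runs each step as a MapReduce job; this is a single-machine numpy sketch with an illustrative function name, initialized crudely at the data extremes.

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture.
    Returns (means, stds, mixing weights, responsibilities)."""
    mu = np.array([x.min(), x.max()])   # crude init at the extremes
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        # (the shared 1/sqrt(2*pi) constant cancels in the normalization)
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / sd
        r = dens / dens.sum(1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        nk = r.sum(0)
        mu = (r * x[:, None]).sum(0) / nk
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(0) / nk)
        pi = nk / len(x)
    return mu, sd, pi, r

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])
mu, sd, pi, r = em_gmm_1d(x)
```

The responsibilities `r` are exactly the per-instance cluster assignment probabilities the slide mentions; the local-optimum caveat is why initialization matters.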
  • 23. Beyond Mixtures of Gaussians. Expectation-Maximization can be adjusted to all kinds of mixture models, e.g. Naive Bayes as the mixture model for text classification. Self-Training: learn a model on the labeled instances only; apply the model to the unlabeled instances; learn a new model on all instances; repeat until convergence.
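The self-training loop on slide 23 works with any base learner. As a sketch, here it wraps a tiny nearest-centroid classifier; both class names are made up for the example, and in practice the base learner would be whatever supervised model you already run.

```python
import numpy as np

class NearestCentroid:
    """Tiny stand-in base learner: predicts the class of the nearest class mean."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.stack([X[y == c].mean(0) for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.means_[None]) ** 2).sum(-1)
        return self.classes_[d.argmin(1)]

def self_train(Xl, yl, Xu, rounds=5):
    """Slide 23's loop: train on the labeled data, label the unlabeled data
    with the model's own predictions, retrain on everything, repeat."""
    clf = NearestCentroid().fit(Xl, yl)
    for _ in range(rounds):
        yu = clf.predict(Xu)
        clf = NearestCentroid().fit(np.vstack([Xl, Xu]),
                                    np.concatenate([yl, yu]))
    return clf

rng = np.random.default_rng(2)
Xl = np.array([[-2.0, 0.0], [2.0, 0.0]])
yl = np.array([0, 1])
Xu = np.vstack([rng.normal([-2, 0], 0.4, (30, 2)),
                rng.normal([2, 0], 0.4, (30, 2))])
clf = self_train(Xl, yl, Xu)
pred = clf.predict(Xu)
```

A fixed round count stands in for the convergence check; stopping when the pseudo-labels stop changing matches the slide more closely.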
  • 24. The Low Density Assumption. Assumption: the area between the two classes has low density; this does not assume any specific form of cluster. Support Vector Machine: the decision boundary is linear; it maximizes the margin to the closest instances; it can be learned in one MapReduce step (stochastic gradient descent).
  • 27. The Low Density Assumption. Semi-Supervised Support Vector Machine: maximize the margin to labeled and unlabeled instances; a parameter fine-tunes the influence of the unlabeled instances; additional constraint: keep the class balance correct. Implementation: a simple extension of the SVM, but a non-convex optimization problem.
  • 30. The Low Density Assumption. Semi-Supervised Support Vector Machine: maximize the margin to labeled and unlabeled instances; a parameter fine-tunes the influence of the unlabeled instances; additional constraint: keep the class balance correct. The objective combines a regularizer, the margins of the labeled instances, and the margins of the unlabeled instances: min_w ||w||^2 + C \sum_{i=1}^{n} \ell(y_i (w^T x_i + b)) + C^* \sum_{i=n+1}^{n+m} \ell(|w^T x_i + b|). Implementation: a simple extension of the SVM, but a non-convex optimization problem.
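The objective on slide 30 translates directly into code. This sketch assumes the hinge loss ℓ(m) = max(0, 1 − m) for both terms, which is the usual choice; the absolute value in the unlabeled term is what makes the problem non-convex, since it rewards pushing unlabeled points away from the boundary on either side.

```python
import numpy as np

def hinge(m):
    """Hinge loss l(m) = max(0, 1 - m), applied elementwise."""
    return np.maximum(0.0, 1.0 - m)

def s3vm_objective(w, b, Xl, yl, Xu, C=1.0, Cstar=0.1):
    """Slide 30's objective: ||w||^2 plus C-weighted hinge losses on the
    labeled margins plus C*-weighted hinge losses on the absolute scores
    of the unlabeled points (their distance to the nearer side)."""
    reg = w @ w                                   # ||w||^2 regularizer
    labeled = hinge(yl * (Xl @ w + b)).sum()      # labeled margin violations
    unlabeled = hinge(np.abs(Xu @ w + b)).sum()   # unlabeled "hat" losses
    return reg + C * labeled + Cstar * unlabeled

# a separator with margin 2 on the labeled points and one unlabeled point
# sitting inside the margin: objective = 1 + 0 + 0.1 * 0.5 = 1.05
w = np.array([1.0, 0.0]); b = 0.0
Xl = np.array([[2.0, 0.0], [-2.0, 0.0]]); yl = np.array([1.0, -1.0])
Xu = np.array([[0.5, 0.0]])
obj = s3vm_objective(w, b, Xl, yl, Xu)
```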
  • 31. Semi-Supervised SVM. Stochastic Gradient Descent: one run over the data in random order; each misclassified or unlabeled instance moves the classifier a bit; the steps get smaller over time. Implementation on Hadoop: the mapper sends the data to the reducer in random order; the reducer updates the linear classifier for unlabeled or misclassified instances; many random runs are made to find the best one.
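The per-instance update behind slides 31 to 35 can be sketched as a Pegasos-style SGD step plus a loop over the shuffled data. Everything here is an illustrative single-machine stand-in for the mapper/reducer split: the warm-up over labeled points only is a simplification of the talk's "many random runs" strategy, and unlabeled points are given the sign the current model assigns them.

```python
import numpy as np

def step(w, b, x, yi, eta, lam):
    """One SGD update on lam/2*||w||^2 + hinge(yi * (w@x + b))."""
    if yi * (x @ w + b) < 1.0:                   # margin violated: pull in x
        w = (1.0 - eta * lam) * w + eta * yi * x
        b = b + eta * yi
    else:                                        # only the regularizer acts
        w = (1.0 - eta * lam) * w
    return w, b

def sgd_s3vm(X, y, epochs=20, warm=5, lam=0.01, seed=0):
    """y holds +1/-1 for labeled rows and 0 for unlabeled ones. Warm-up
    epochs use labeled points only; afterwards each unlabeled point is
    treated as belonging to the side the current model puts it on."""
    rng = np.random.default_rng(seed)
    w, b, t = np.zeros(X.shape[1]), 0.0, 0
    for e in range(epochs):
        for i in rng.permutation(len(X)):        # one pass in random order
            if y[i] == 0 and e < warm:
                continue
            t += 1
            eta = 1.0 / (lam * t)                # decaying step size
            yi = y[i] if y[i] != 0 else (1.0 if X[i] @ w + b >= 0 else -1.0)
            w, b = step(w, b, X[i], yi, eta, lam)
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 0.3, (25, 2)),
               rng.normal([2, 0], 0.3, (25, 2))])
y = np.zeros(50); y[0] = -1; y[25] = 1
w, b = sgd_s3vm(X, y)
```

Because the unlabeled term is non-convex, different shuffles can land in different local optima, which is exactly why the slide runs many random passes and keeps the best.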
  • 36. The Manifold Assumption. The assumption: the training data is (roughly) contained in a low-dimensional manifold, so one can perform learning in a more meaningful low-dimensional space; this avoids the curse of dimensionality. Similarity graphs: compute similarity scores between instances, then create a network in which the nearest neighbors are connected.
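A similarity graph of the kind described here can be built by connecting each instance to its k nearest neighbors, weighted by a Gaussian similarity. A small sketch (function name and the choice of Gaussian weights are illustrative, not from the talk):

```python
import numpy as np

def knn_graph(X, k=2, sigma=1.0):
    """Build a symmetric k-nearest-neighbor similarity graph (sketch).
    Returns an n-by-n weight matrix W with Gaussian similarities on edges."""
    n = len(X)
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        # the k nearest neighbors, excluding the point itself
        nn = np.argsort(d2[i])[1:k + 1]
        W[i, nn] = np.exp(-d2[i, nn] / (2 * sigma ** 2))
    return np.maximum(W, W.T)   # symmetrize: connect if either is a neighbor
```

The resulting matrix W is exactly the "network where the nearest neighbors are connected" that label propagation operates on.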
• 40. Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
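The propagate-and-repeat idea can be sketched directly on a weight matrix. This is a minimal illustration (assuming every node has at least one neighbor, so the rows of W can be normalized); the fixed point it converges to is the one the slide says can also be obtained by matrix inversion:

```python
import numpy as np

def label_propagation(W, y, n_iter=100):
    """Iterative label propagation on a similarity graph (sketch).
    W: symmetric weight matrix; y: +1/-1 for labeled nodes, 0 for unlabeled."""
    # row-normalize so each node averages its neighbors' labels (PageRank-like)
    P = W / W.sum(axis=1, keepdims=True)
    f = y.astype(float).copy()
    labeled = y != 0
    for _ in range(n_iter):
        f = P @ f                 # propagate label mass to neighbors
        f[labeled] = y[labeled]   # clamp the known labels
    return np.sign(f)
```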
• 44. Nearest Neighbor Join
Block Nested Loop Join
• 1st MR job: partition the data into blocks, compute nearest neighbors between blocks
• 2nd MR job: filter out the overall nearest neighbors
Smarter Approaches
• Use spatial information to avoid unnecessary comparisons
• R-trees, space-filling curves, locality-sensitive hashing
• Example implementations available online
[figure: data split into blocks 1-4; block pairs are compared on reducers and the per-pair nearest neighbors are merged into the final result]
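The two MapReduce jobs of the block nested loop join can be simulated locally. The sketch below (function name and the simple range partitioning are illustrative assumptions) finds each point's 1-nearest neighbor: the first phase emits a candidate neighbor per block pair, the second keeps the overall nearest.

```python
import itertools
import numpy as np

def nn_join_blocked(X, n_blocks=2):
    """Block nested loop 1-NN self-join, run locally (sketch of the
    two-job MapReduce pattern). Requires n_blocks >= 2."""
    n = len(X)
    blocks = np.array_split(np.arange(n), n_blocks)
    # 1st job: per block pair, emit each point's nearest neighbor in that pair
    local = []   # candidate records: (point, distance, neighbor)
    for a, b in itertools.combinations(range(n_blocks), 2):
        pair = np.concatenate([blocks[a], blocks[b]])
        for i in pair:
            js = [j for j in pair if j != i]
            d = [np.linalg.norm(X[i] - X[j]) for j in js]
            k = int(np.argmin(d))
            local.append((i, d[k], js[k]))
    # 2nd job: group candidates by point, keep the overall nearest neighbor
    best = {}
    for i, d, j in local:
        if i not in best or d < best[i][0]:
            best[i] = (d, j)
    return {i: j for i, (d, j) in best.items()}
```

Within-block comparisons are covered because every block appears in at least one pair; the smarter spatial approaches on the slide replace the exhaustive per-pair loop with an index such as an R-tree.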
• 52. Label Propagation on Hadoop
Iterative Procedure
• Mapper: emit neighbor-label pairs
• Reducer: collect the incoming labels per node and combine them into a new label
• Repeat until convergence
Improvement
• Cluster the data and perform within-cluster propagation locally
• Sometimes it is more efficient to perform matrix inversion instead
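One round of the mapper/reducer procedure above can be simulated in plain Python. This is a local sketch of the pattern, not the talk's Hadoop code; the graph encoding (a dict of weighted adjacency lists) and the weighted-average combiner are illustrative assumptions.

```python
from collections import defaultdict

def lp_round(edges, labels, clamped):
    """One label-propagation round in mapper/reducer form (local sketch).
    edges: node -> [(neighbor, weight)]; labels: node -> current score;
    clamped: set of nodes whose labels are known and stay fixed."""
    # mapper: emit a (neighbor, weighted label) pair for every edge
    emitted = []
    for node, nbrs in edges.items():
        for nbr, w in nbrs:
            emitted.append((nbr, (w, w * labels[node])))
    # shuffle: group the emitted pairs by receiving node
    incoming = defaultdict(list)
    for nbr, msg in emitted:
        incoming[nbr].append(msg)
    # reducer: combine the incoming labels per node into its new label
    new_labels = {}
    for node in labels:
        if node in clamped:                     # known labels stay fixed
            new_labels[node] = labels[node]
        else:
            msgs = incoming.get(node, [])
            total_w = sum(w for w, _ in msgs)
            new_labels[node] = (sum(v for _, v in msgs) / total_w
                                if total_w > 0 else 0.0)
    return new_labels
```

Repeating `lp_round` until the labels stop changing gives the iterative procedure on the slide; each round corresponds to one MapReduce job.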
• 57. Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide a valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low-density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Code available: https://github.com/Datameer-Inc/lesel