A Short Discourse on Local Machine Learning

Tak Wong
University of Massachusetts
6/10/2008

Introduction

Traditional learning algorithms model biological human behaviors in perceiving and
organizing data, and subsequently extracting information through heuristic
experimentation. In particular, the engineering process of imitating human analysis
and inference of data often focuses on a specific problem and provides a customized
solution for the issues at hand. This approach allows a great deal of freedom in the
design and implementation of very diverse strategies, but often makes it difficult to
formalize the algorithms analytically for generalization and optimization. At the
opposite end of the spectrum, modern statistical approaches to Machine Learning seek
to generalize the adaptive process with algorithms that are not necessarily related to
biological or social behavior, but are more tractable for quantitative analysis and
optimization. To formalize the Machine Learning system in terms of the data and
inference, and to quantify the training and decision processes analytically, the
statistical approach is often more viable, although it is more restrictive in the
choice of models and the actual design of the algorithm.

In recent years, with the advent of fast computers and sophisticated software, statistical
modeling of Machine Learning has taken on a new life, and analytical algorithms are
being applied successfully in many areas of industry and research: Data Mining,
Classification, Bio-Informatics, Pattern Recognition, Intrusion Detection, etc.
Fundamentally, the advantage of statistical modeling is its power of generalization and
optimization. The formalization of the structure of data sets and the algorithms for
training and decision allow the learning process to be understood and quantified
analytically. The statistical heuristics provide the direction and strategy for optimization.
Moreover, the underlying mathematical structure makes it relatively efficient to extend
the statistical model by combining and expanding mathematical properties of the data set
and training process, or transform decision algorithms by manipulating and adjusting
parameters and weight factors. As a result, one of the notable recent developments in
statistical Machine Learning is the Local Learning paradigm.

By Local Learning, we mean a methodology that does not use the complete, or
Global, data set for training or testing. In essence, the Global Learning
method processes and derives information from the complete data set, and
conclusively arrives at a universal approximation function for predicting all future
input patterns within the data space. In Global Learning, the training algorithm is
run on the complete data set until the optimal decision function is obtained.
Subsequently, unknown data patterns can be tested and classified by this decision
function. In most cases, the Global Learning method implies a density distribution
for the entire data space and a confidence level on recognizing new data points.

By contrast, the Local Learning model works on a subset of the available training data at
a time. The purpose of Local Learning is not so much to find a universal decision
function for all data patterns. Rather, it focuses on the specific task at hand and seeks a
solution for a specific data point. Different areas in the data set might require
different properties in the local algorithm. Thus, a local training algorithm based on
the chosen subsets can render multiple and diverse decision functions, which can be
optimized and combined to cover the complete data set. Moreover, in the case when a local
algorithm is run for each new data pattern, the Local Learning process can be very
computation-intensive.

The Motivation for Local Learning

In Machine Learning, the specific training process and method of generalization are
often determined by the structure of the data set: the dimension and characteristics
of the attributes, the size of the data, and its relationship with the decision space.
In addition, there might be some a priori knowledge of the distribution of the training data.
The traditional Global Learning approach intends to capture the characteristics of all the
data attributes within the feature space, based not only on currently available data, but
also on some assumption about all future data patterns. In other words, Global Learning
ordinarily does not accommodate generalization over changing hypotheses on account
of unstable values in data attributes. One issue is that training data might not be evenly
distributed, causing the global training process to be skewed. Even with a strong
learning algorithm, the unevenness in the data set can lead to good decisions near areas of
well-represented training data and miscalculations in poorly sampled areas (high capacity).
On the other hand, if the algorithm is tweaked to compensate for poorly sampled data to
mitigate misclassification, it may lead to rejection of well-represented data (low
capacity).

In general, the Global Learning approach implies a density distribution, associated with a
decision function that is complete (all test data are accepted) and consistent (accurate
decisions based on a pre-defined criterion, e.g., an acceptable loss rate). Intuitively, these
cannot be realized unless the training data set is well represented and evenly distributed,
which is not the case most of the time. Moreover, requiring this type of “well-behaved”
training data defeats the fundamental purpose of Machine Learning, which is to infer the
characteristics of unknown data from whatever is available at hand.

As such, Global Learning models the trend of data distribution in the data space and
makes predictions accordingly. This is a very intuitive notion and is designed to work
well if the density distribution and associated parameters are chosen optimally. To this
end, this approach requires certain a priori or assumed knowledge about the training data
set. As a simple example, the tables below use a Naïve Bayes algorithm to correlate
temperature with traffic to the Cape:
Training Data Set:

                              No Traffic Jam         Traffic Jam
       Temperature            83                     85
                              70                     80
                              68                     65
                              64                     72
                              69                     71
                              75
                              75
                              72
                              81

       Total                  9                      5

       Mean                   73                     74.6
       Std dev                6.2                    7.9

Input Pattern:

       Temperature = 66

Assuming the distribution of the training data is normal with the above mean and
standard deviation, the likelihoods of the input pattern are:

       P(temperature = 66 | no traffic jam) = 0.0340
       P(temperature = 66 | traffic jam) = 0.0291

Using the Bayesian formula (assuming equal priors for the two outcomes):

       P(traffic jam | temperature = 66) = 0.0291 / (0.0340 + 0.0291) = 0.461

                              Table 1

This simple machine learns from the statistics of the existing data and uses them to
predict the new unknown pattern. The critical element here is the assumption of a
normal distribution for the data.
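
To make the calculation concrete, here is a minimal Python sketch of the same Gaussian
Naïve Bayes step, assuming equal priors for the two outcomes as in Table 1 (the computed
values may differ slightly from Table 1's rounded figures):

    import math

    # Training data: observed temperatures under each outcome (Table 1).
    no_jam = [83, 70, 68, 64, 69, 75, 75, 72, 81]
    jam = [85, 80, 65, 72, 71]

    def mean(xs):
        return sum(xs) / len(xs)

    def std(xs):
        m = mean(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

    def gaussian_pdf(x, m, s):
        # Likelihood of x under a normal distribution with mean m, std dev s.
        return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

    x = 66
    p_no_jam = gaussian_pdf(x, mean(no_jam), std(no_jam))  # ~0.034
    p_jam = gaussian_pdf(x, mean(jam), std(jam))           # ~0.028
    print(p_jam / (p_jam + p_no_jam))                      # ~0.45, cf. Table 1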

Ultimately, indeed, it is the data set and data space that determine the learned
distribution. The quality of the learning algorithm, then, will depend on how effective
the arbitrary choice of prior is. A frequentist approach to the same problem might offer
a more “reasonable” alternative for the estimate of the prior:
Training Data Set:

               Temperature             No Traffic Jam          Traffic Jam
                    Hot                      2                        2
                    Mild                     4                        2
                    Cool                     3                        1

               Total                           9                        5

Input Pattern:

       Temperature = cool

       P(temperature = cool | no traffic jam) = 3/9 = 0.33
       P(temperature = cool | traffic jam) = 1/5 = 0.2

Using the Bayesian formula (again with equal priors):

       P(traffic jam | temperature = cool) = 0.2 / (0.33 + 0.2) = 0.377


                               Table 2


Although the second model above applies the same Bayesian formula, the distribution of
the training data is derived from the actual existing data, which might be considered a
more valid prior.
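
The frequentist variant reduces to counting, as the short sketch below shows (equal
priors are again assumed; Table 2's 0.377 comes from using the rounded value 0.33):

    # Counts of each temperature category under each outcome (Table 2).
    no_jam = {"hot": 2, "mild": 4, "cool": 3}   # 9 observations
    jam = {"hot": 2, "mild": 2, "cool": 1}      # 5 observations

    def likelihood(counts, category):
        return counts[category] / sum(counts.values())

    p_no_jam = likelihood(no_jam, "cool")       # 3/9 ~ 0.33
    p_jam = likelihood(jam, "cool")             # 1/5 = 0.2
    print(p_jam / (p_jam + p_no_jam))           # ~0.375, cf. Table 2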

The above exercises demonstrate that changing the way data attributes are represented
produces different models and results.

There are other issues with the structure of global training data: the distribution can be
non-linear or derived from several distinct distributions; the training data can be
high-dimensional and small in size; and the data can contain noise (unrecognizable
attributes). While the quality of a learning algorithm is generally measured by its error
rate, the capacity (completeness and consistency) is a practical concern.

Local Learning methods, rather than modeling the universal data to estimate the
distribution, focus on a specific data pattern, given a subset of training data directly
related to the task at hand. The simplest example of Local Learning applies the following
procedure:

   1) Given a test data point x0, find the closest points among the training data set
      through some distance or similarity measure
   2) Define a vicinity or neighborhood in n dimensions, with x0 as the center of the
      n-dimensional ball
   3) Run the learning algorithm using only the training points included inside the
      neighborhood of x0
   4) After the machine has been trained, run the decision algorithm on the input
      pattern x0


                                   Figure 1: Training data points scattered in the
                                   data space, with the test point x0 at the center
                                   of its neighborhood.
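
A minimal Python sketch of this procedure follows; the Euclidean distance, the
neighborhood size k, and the majority-vote local learner (in the spirit of k-Nearest
Neighbors) are illustrative choices rather than requirements:

    import math
    from collections import Counter

    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def classify_locally(x0, points, labels, k=3):
        # Steps 1-2: rank training points by distance to x0 and keep the k
        # nearest, defining the neighborhood around x0.
        ranked = sorted(range(len(points)), key=lambda i: euclidean(points[i], x0))
        neighborhood = ranked[:k]
        # Steps 3-4: "train" on the neighborhood only and classify x0; here
        # the local learner is a simple majority vote over the neighbors.
        votes = Counter(labels[i] for i in neighborhood)
        return votes.most_common(1)[0][0]

    # Nothing is trained until a query x0 arrives (lazy learning).
    X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (1.0, 0.9), (0.9, 1.1)]
    y = ["O", "O", "O", "X", "X"]
    print(classify_locally((0.95, 1.0), X, y))   # -> "X"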


Firstly, this is a Lazy Learning algorithm, as the machine is not trained until an input data
pattern needs to be classified. Moreover, for each unclassified data pattern xn, a new
neighborhood has to be defined accordingly, choosing a different set of training data, and
the training algorithm has to be run again for each xn. This is labor- and
computation-intensive, but it does solve several issues: by localizing the training data
relative to the input data pattern, the attributes contributing to the geometric
location of the data points can effectively be eliminated. Since the algorithm is not seeking a
solution for all the unknown data points, the formulation of the hypothesis can be more
relaxed and simplified. For example, the prior information of the local training data can
be made arbitrary, without consideration of the universal data set. The idea is to focus on
the distribution of a group of data similar to the single input pattern at hand. A local
assumption may not be universally correct, but still serves the purpose of predicting the
input pattern. The capacity of the machine can be managed by manipulating the locality
of the training data. Since local algorithms always focus on a specific point, they should
intuitively be more efficient and accurate than their global counterparts. Here is a toy example:

    We collect training data on dress code from around the world and use it to train a
    machine to decide whether the subject is a man or a woman. A globally trained
    algorithm will probably always designate the test subject a woman if the subject is
    observed to wear a skirt. This decision is inaccurate if the subject happens to be
    Scottish. Given the small percentage of Scottish subjects, it incurs a small overall
    error rate globally, but produces very inaccurate results when the algorithm is
    applied to the population of Scots.

    Alternatively, if Local Learning is applied with the same learning algorithm, a
    Scottish subject will be predicted by a machine trained on the data set most similar
    to the subject, thus achieving a high rate of accuracy. Moreover, the attributes that
    determine the nationality, race, and locality of the subject can be eliminated to
    reduce the dimension, because all the chosen training points (Scotsmen) around the
    subject share these same attributes.

Indeed, the “local” attributes in the training set can determine the local decision
function. The best-known Local Learning method, the k-Nearest Neighbors
algorithm, relies entirely on the geometric vicinity of the training data to the subject
to decide the output.

Many other popular algorithms employ local, divide-and-conquer methods to
reduce an intractable problem to several less complex sub-problems. Bagging,
Boosting, and Cross-Validation all run algorithms on subsets of the training
data to produce local decision functions, which are ultimately used to predict
universal data.

In some instances, local methods can be applied to problems that do not have a natural
linear solution. The so-called “XOR” configuration of data points cannot be classified by a
simple linear model such as the Perceptron, but it can easily be solved by a
Local Learning model because the training data are locally linearly separable.
                            Figure 2: The XOR configuration on the unit square,
                            with X at (0,1) and (1,0) and O at (0,0) and (1,1).
                            There is no linearly separating hyperplane for the
                            global algorithm, but a local algorithm can determine
                            where the input pattern x0 belongs with a linear
                            hyperplane.
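
The idea can be sketched with an ordinary Perceptron as the local learner (an
illustrative choice): trained on all four XOR points, the Perceptron cannot converge,
but trained only on the points nearest to x0, it separates them trivially.

    # XOR corners and labels (+1 for X, -1 for O), as in Figure 2.
    X = [(0, 1), (1, 0), (0, 0), (1, 1)]
    y = [+1, +1, -1, -1]

    def perceptron(points, labels, epochs=100):
        # Plain Perceptron with a bias term; returns weights (w1, w2, b).
        w1 = w2 = b = 0.0
        for _ in range(epochs):
            for (x1, x2), t in zip(points, labels):
                if t * (w1 * x1 + w2 * x2 + b) <= 0:   # misclassified
                    w1 += t * x1; w2 += t * x2; b += t
        return w1, w2, b

    def classify_locally(x0, k=2):
        # Train a linear model only on the k training points nearest to x0.
        near = sorted(zip(X, y),
                      key=lambda p: (p[0][0] - x0[0]) ** 2 + (p[0][1] - x0[1]) ** 2)[:k]
        w1, w2, b = perceptron([p for p, _ in near], [t for _, t in near])
        return 1 if w1 * x0[0] + w2 * x0[1] + b > 0 else -1

    # Near the (1, 0) corner the two nearest points, (1, 0) and (1, 1),
    # are linearly separable, so the local Perceptron labels x0 as X.
    print(classify_locally((0.9, 0.2)))   # -> 1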


Some local methods, instead of defining the locality strictly around the input pattern,
seek out selected local training points to generalize the global decision. The Radial Basis
Function network assumes that density distributions can be found in clusters in the data set,
with certain “attractor” points at the center of each cluster. Each cluster is then trained with a
local algorithm suitable for the assumed distribution. The resulting decision functions are
then combined and averaged out for predicting unseen global patterns.
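
A minimal sketch of an RBF network in this spirit; the fixed centers, the Gaussian
width, and the least-squares readout are illustrative assumptions (in practice the
centers would come from a clustering step):

    import numpy as np

    def rbf_features(X, centers, width=1.0):
        # Gaussian response of each point to each cluster center ("attractor").
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * width ** 2))

    centers = np.array([[0.0, 0.0], [3.0, 3.0]])            # assumed centers
    X = np.array([[0.1, -0.2], [0.3, 0.1], [2.8, 3.1], [3.2, 2.9]])
    y = np.array([-1.0, -1.0, 1.0, 1.0])

    # Combine the local (per-cluster) responses into one global decision
    # function with a linear least-squares readout.
    Phi = rbf_features(X, centers)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(np.sign(rbf_features(np.array([[2.9, 3.0]]), centers) @ w))  # [1.]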

In recent years, the Support Vector Machine has taken center stage in the Local Learning
arena. The SVM applies the training algorithm on selected local data points in order to
find the optimal hyperplane in terms of the widest margin between the hyperplane and the
closest local points to the hyperplane. The SVM is a linear model: if separating
hyperplanes exist for the training data, there is one hyperplane that is optimal, in the
sense that it has the greatest orthogonal margin from the training data. The training
points closest to this optimal hyperplane, which define its margin, are called Support
Vectors, because under the training algorithm these points alone determine the
hyperplane, and thus the decision function; no other training points contribute to it.
The strength of the SVM lies in the transformation of linear models into non-linear ones
using the “kernel trick” to map the training data to a higher-dimensional feature space.

As mentioned above, the SVM uses only a few local training points for the decision
function. Therefore, it does not retain global information or the general structure of the
data set. The SVM is very stable, not affected by variation in unknown data, but it is not
well suited for predicting future data trends. In this respect, the SVM is suitable for
problems with data distributions that have a well-structured linear boundary, or that can
be transformed to have a clear linear boundary in a higher feature space.
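
As an illustrative sketch, scikit-learn's SVC (one common implementation, not one named
in the text) makes this behavior visible: the fitted decision function depends only on
the support vectors.

    import numpy as np
    from sklearn.svm import SVC

    # A linearly separable toy problem.
    X = np.array([[0, 0], [1, 1], [1, 0], [3, 3], [4, 3], [3, 4]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6)   # large C approximates a hard margin
    clf.fit(X, y)

    # Only the points closest to the maximum-margin hyperplane survive as
    # support vectors; all other training points are irrelevant to the
    # decision function.
    print(clf.support_vectors_)
    print(clf.decision_function([[2, 2]]))
    # The "kernel trick": replacing kernel="linear" with e.g. kernel="rbf"
    # turns the same machinery into a non-linear classifier.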



Bibliography

Bontempi, Gianluca. Local Learning Techniques for Modeling, Prediction and Control.

Bottou, Léon; Vapnik, Vladimir. Local Learning Algorithms.

Burges, Christopher J. C. A Tutorial on Support Vector Machines for Pattern Recognition.

Frank, Eibe; Witten, Ian. Data Mining.

Huang, Kaizhu. Learning from Data Locally.

Schölkopf, Bernhard; Smola, Alexander. Learning with Kernels.

Vapnik, Vladimir. Introduction to Support Vector Machine.
