
Understanding Mahout classification documentation


1. Mahout Classification

Brief introduction: "Scalable machine learning library." Mahout is a solid Java framework in the data mining / artificial intelligence area. It is a machine learning project of the Apache Software Foundation that builds intelligent algorithms that learn from input data. What is special about Mahout is that it is a scalable library, prepared to deal with huge datasets: its algorithms are built on top of the Apache Hadoop project and therefore work with distributed computing. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. Finally, it is a Java library. It does not provide a user interface, a prepackaged server, or an installer; it is a framework of tools intended to be used and adapted by developers.

Although Mahout is, in theory, a project open to implementations of all kinds of machine learning techniques, in practice it currently focuses on three key areas of machine learning:
1. Recommender engines
2. Clustering
3. Classification

Some examples of where these are used:
1. Recommender engines: e.g., social networking sites like Facebook use variants of recommender techniques to identify the people most likely to be as-yet-unconnected friends.
2. Clustering: e.g., Google News groups news articles by topic using clustering techniques, in order to present news grouped by logical story rather than as a raw listing of all articles.
3. Classification: e.g., Yahoo! Mail decides whether or not incoming messages are spam based on prior emails and spam reports from users, as well as on characteristics of the email itself.

Each of these techniques works best when provided with a large amount of good input data. In some cases these techniques must not only work on large amounts of input but must also produce results quickly, and these factors make scalability a major issue. As mentioned before, one of Mahout's key reasons for being is to produce implementations of these techniques that do scale up to huge input.
2. We will focus on the classification technique, so from here on we move forward with classification using Mahout.

Classification: classification is a simplified form of decision making that gives discrete answers to an individual question. Machine-based classification is an automation of this decision-making process that learns from examples of correct decision making and emulates those decisions automatically; this is a core concept in predictive analytics. Mahout can be used on a wide range of classification projects, but the advantage of Mahout over other approaches becomes striking as the number of training examples gets extremely large. What "large" means can vary enormously. Up to about 100,000 examples, other classification systems can be efficient and accurate. But generally, as the input exceeds 1 to 10 million training examples, something scalable like Mahout is needed.

The reason Mahout has an advantage with larger data sets is that, as input data increases, the time or memory requirements for training may not increase linearly in a non-scalable system. A system that slows by a factor of 2 with twice the data may be acceptable, but if 5 times as much input data results in the system taking 100 times as long to run, another solution must be found. This is the sort of situation in which Mahout shines.

The following table shows where Mahout is the best choice:

System size (number of training examples) : Choice of classification approach
• < 100,000 : Traditional, non-Mahout approaches should work very well. Mahout may even be slower for training.
• 100,000 to 1 million : Mahout begins to be a good choice. The flexible API may make Mahout a preferred choice, even though there is no performance advantage.
• 1 million to 10 million : Mahout is an excellent choice in this range.
• > 10 million : Mahout excels where others fail.
3. Classification algorithms are at the heart of what is called predictive analytics. The goal of predictive analytics is to build automated systems that can make decisions to replicate human judgment, and classification algorithms are a fundamental tool for meeting that goal. One example of predictive analytics is spam detection: a computer uses the details of user history and features of email messages to determine whether new messages are spam or relatively welcome email. Another example is credit card fraud detection: a computer uses the recent history of an account and the details of the current transaction to determine whether the transaction is fraudulent.

There are two main phases involved in building a classification system:
1. the creation of a model produced by a learning algorithm, and
2. the use of that model to assign new data to categories.

The first phase involves a number of decisions, such as the selection of the training data, the output categories (the targets), the algorithm through which the system will learn, and the variables used as input.

We should know the following terms before we go deeper into classification:

Term : Meaning
• Model : A computer program that makes decisions; in classification, the output of the training algorithm is a model.
• Training data : A subset of training examples labelled with the value of the target variable and used as input to the learning algorithm to produce the model.
• Test data : A withheld portion of the training data with the value of the target variable hidden so that it can be used to evaluate the model.
• Training : The learning process that uses training data to produce a model. That model can then compute estimates of the target variable given the predictor variables as inputs.
• Training example : An entity with features that will be used as input for the learning algorithm.
• Feature : A known characteristic of a training or a new example; a feature is equivalent to a characteristic.
• Variable : In this context, the value of a feature or a function of several features. This usage is somewhat different from the use of "variable" in a computer program.

4. (Terms, continued)
• Record : A container where an example is stored; such a record is composed of fields.
• Field : Part of a record that contains the value of a feature (a variable).
• Predictor variable : A feature selected for use as input to a classification model. Not all features need be used, and some features may be algorithmic combinations of other features.
• Target variable : The feature that the classification model is attempting to estimate: the target variable is categorical, and its determination is the aim of the classification system.

Workflow of a typical classification project, in brief:

Stage : Steps
• 1. Training the model : Define the target variable. Collect historical data. Define predictor variables. Select a learning algorithm. Use the learning algorithm to train the model.
• 2. Evaluating the model : Run test data. Adjust the input (use different predictor variables, different algorithms, or both).
• 3. Using the model in production : Input new examples to estimate unknown target values. Retrain the model as needed.
5. Brief study of the workflow

Workflow for Stage 1:
1. Define categories for the target variable: The target variable cannot have an open-ended set of possible values. Your choice of categories, in turn, affects your choices of possible learning algorithms, because some algorithms are limited to binary target variables. Although you can have any number of categories, if you can limit the categories to just two you will have more options for learning algorithms.
2. Collect historical data: The source of historical data you choose will be directed in part by the need to collect historical data with known values for the target variable.
3. Define predictor variables: These variables are the concrete encoding of the features extracted from the training and test examples. The predictor variables appear in records for the training and test data and for the production data.
4. Select a learning algorithm for training the model: This is one of the most important parts. There are a number of algorithms to choose from, such as:
a) Logistic regression (SGD)
b) Bayesian
c) Support vector machines (SVM)
d) Perceptron and Winnow
e) Neural network
f) Random forests
g) Restricted Boltzmann machines
h) Online passive aggressive
i) Boosting
j) Hidden Markov models (HMM)
Training is done in MapReduce. (A minimal SGD training sketch follows this slide.)

Workflow for Stage 2: evaluating the classification model. An essential step before using the classification system in production is to find out how well it is likely to work. To do this, you must evaluate the accuracy of the model and make large or small adjustments as needed before you begin classification.
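As a concrete illustration of the training step in Stage 1, here is a minimal sketch (not code from the slides) that trains a binary classifier with Mahout's SGD-based logistic regression learner. The toy data, feature count, and learning parameters are assumptions chosen only to make the example self-contained.

```java
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;

public class TrainSgdSketch {
    public static void main(String[] args) {
        // Toy training set: each row is a feature vector, labels[] holds the target (0 or 1).
        double[][] rows = {
            {1, 0.2, 0.1}, {1, 0.9, 0.8}, {1, 0.1, 0.3}, {1, 0.7, 0.9}
        };
        int[] labels = {0, 1, 0, 1};

        // 2 target categories, 3 predictor variables, L1 prior for regularization.
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, 3, new L1())
                .learningRate(0.5)
                .lambda(1.0e-4);

        // Several passes over the tiny data set; real training data would be far larger.
        for (int pass = 0; pass < 20; pass++) {
            for (int i = 0; i < rows.length; i++) {
                learner.train(labels[i], new DenseVector(rows[i]));
            }
        }

        // classifyScalar returns the estimated probability of category 1 for a new example.
        double p = learner.classifyScalar(new DenseVector(new double[]{1, 0.8, 0.85}));
        System.out.println("P(category = 1) = " + p);
    }
}
```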
6. Workflow for Stage 3: using the model in production. Once the model's output has reached an acceptable level of accuracy, classification of new data can begin. The performance of the classification system in production will depend on several factors, one of the most important being the quality of the input data. If the new data to be analyzed has inaccuracies in the values of the predictor variables, if the new data is not an appropriate match for the training data, or if external conditions change over time, the quality of the classification model's output will degrade. To guard against this problem, periodic retesting of the model is useful, and retraining may be necessary.

Points about the individual steps that you should know in detail before starting:

1. Training the classifier: In training, the most important part is feature extraction, from which the predictor variables are derived. Note: your classifier can only be as good as the training data lets it be. If you do not do good data preparation, everything will perform poorly, and data collection and pre-processing take the bulk of the time.

Preparing data for the training algorithm consists of two main steps:
1. Preprocessing raw data: raw data is rearranged into records with identical fields. These fields can be of four types (continuous, categorical, word-like, or text-like) in order to be classifiable.
2. Converting data to vectors: classifiable data is parsed and vectorized using custom code or tools such as Lucene analyzers and Mahout vector encoders. Some Mahout classifiers also include vectorization code. (A small encoding sketch follows this slide.)

The features should be chosen very carefully, as they are the basis for the performance of any classification model. For example, sometimes age is better for classification, and sometimes birth date is better. In the case of insurance data on car accidents, age will be the better variable to use, because having car accidents is more related to life-stage than to the generation a person belongs to. On the other hand, in the case of music purchases, birth date might be more interesting, because people often retain early music preferences as they get older, and their tastes often reflect those of their generation.
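As an illustration of step 2 above (converting data to vectors), here is a minimal sketch of one way to use Mahout's hashed feature encoders; it is an assumption for illustration, not code from the slides, and the field names and vector size are made up. Class locations are as in the Mahout 0.x codebase.

```java
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class EncodeRecordSketch {
    public static void main(String[] args) {
        // One sparse vector per record; 100 is an arbitrary hashed-feature-space size.
        Vector v = new RandomAccessSparseVector(100);

        // Each encoder hashes a (field name, value) pair into one or more vector positions.
        new ConstantValueEncoder("intercept").addToVector("1", v);      // bias term
        new StaticWordValueEncoder("country").addToVector("india", v);  // categorical field
        new ContinuousValueEncoder("age").addToVector("34", v);         // continuous field

        System.out.println(v);   // the encoded record, ready for a learner such as SGD
    }
}
```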
7. How to convert data into a Vector:

Approach: represent vectors implicitly as bags of words.
Used in: the naive Bayes classifier method.
Benefit: it involves one pass and no collisions, and it avoids the need for a dictionary; but it means that it is difficult to make use of Mahout's linear algebra capabilities, which require known and consistent lengths for the Vector objects involved.

There are other techniques, such as feature hashing, which is used by the SGD (stochastic gradient descent) family, for example logistic regression; this is what the encoding sketch above relies on.

Choosing an algorithm to train the classifier: the choice of algorithm depends on the size of the training data. The algorithms differ somewhat in the overhead or cost of training, the size of the data set for which they are most efficient, and the complexity of the analyses they can deliver. We will learn about the algorithms in a later section.

2. Evaluating the classifier: To evaluate classifiers, Mahout offers a variety of performance metrics. The main approaches are percent correct, confusion matrix, AUC, and log likelihood. The naive Bayes and complementary naive Bayes classifier algorithms are best evaluated using percent correct and the confusion matrix. Any of these methods will work with the SGD algorithm; AUC or log likelihood may be particularly useful, because they provide insight into the model's confidence level. Mahout already provides classes for all of these metrics, so no extra effort is required on our part; we can use the Mahout classes directly (see the table on the next slide and the sketch below).
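To make the evaluation step concrete, here is a minimal sketch (an assumption, not the slides' own code) that scores held-out test examples with an already trained SGD model, like the one from the earlier training sketch, and measures AUC with Mahout's Auc class.

```java
import org.apache.mahout.classifier.evaluation.Auc;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;

public class EvaluateSketch {
    // 'learner' is assumed to be a trained model, as produced in the earlier SGD sketch.
    static double evaluate(OnlineLogisticRegression learner,
                           double[][] testRows, int[] testLabels) {
        Auc auc = new Auc();
        for (int i = 0; i < testRows.length; i++) {
            // classifyScalar gives the model's estimated probability of category 1.
            double score = learner.classifyScalar(new DenseVector(testRows[i]));
            auc.add(testLabels[i], score);   // actual label (0 or 1) plus the score
        }
        return auc.auc();   // area under the ROC curve; 0.5 = random, 1.0 = perfect
    }
}
```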
8. Metric : Supported by Mahout class
• Percent correct : CrossFoldLearner
• Confusion matrix : ConfusionMatrix, Auc
• Entropy matrix : Auc
• AUC : Auc, OnlineAuc, CrossFoldLearner, AdaptiveLogisticRegression
• Log likelihood : CrossFoldLearner

(A short confusion-matrix example follows this slide.)

3. Deploying the classifier: The deployment process can be broken down into these steps:
1. Scope out the problem.
2. Optimize feature extraction as needed.
3. Optimize vector extraction as needed.
4. Deploy the scalable classifier service.
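As an illustration of the confusion-matrix metric from the table above, here is a minimal sketch (an assumption, not the slides' own code) that uses Mahout's ConfusionMatrix class to tally correct and incorrect decisions for a two-class spam problem; the labels and counts are invented for the example.

```java
import java.util.Arrays;
import org.apache.mahout.classifier.ConfusionMatrix;

public class ConfusionMatrixSketch {
    public static void main(String[] args) {
        // The matrix is built over the known set of target categories.
        ConfusionMatrix cm =
            new ConfusionMatrix(Arrays.asList("spam", "ham"), "unknown");

        // addInstance(correctLabel, classifiedLabel) is called once per test example.
        cm.addInstance("spam", "spam");   // correctly flagged
        cm.addInstance("ham", "ham");     // correctly passed through
        cm.addInstance("ham", "spam");    // false positive
        cm.addInstance("spam", "ham");    // false negative

        System.out.println(cm);           // prints the label-by-label count matrix
    }
}
```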
9. Naive Bayes
• Called naive Bayes because it is based on Bayes' rule and "naively" assumes independence given the label.
– It is only valid to multiply probabilities when the events are independent.
– This is a simplistic assumption in real life.
– Despite the name, naive Bayes works well on actual datasets.
• It is a simple probabilistic classifier based on:
– applying Bayes' theorem (from Bayesian statistics), and
– strong (naive) independence assumptions.
– A more descriptive term for the underlying probability model would be "independent feature model".

The naive Bayes algorithm is a probabilistic classification algorithm. It makes its decisions about which class to assign to an input document using probabilities derived from training data. The training process analyzes the relationships between the words in the training documents and the categories, and then between the categories and the entire training set. The available facts are collected using calculations based on Bayes' theorem to produce the probability that a collection of words (a document) belongs in a certain class.

Bayes' theorem states that the probability of a category given a document is equal to the probability of the document given the category, multiplied by the probability of the category, divided by the probability of the document. This can be expressed as:

P(Category | Document) = P(Document | Category) x P(Category) / P(Document)
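Written as a formula, with the "naive" independence assumption over the document's words w_1, ..., w_n made explicit (this factorization is implied by the slide's description of the independence assumption rather than stated there as an equation):

```latex
\[
P(\mathrm{Category} \mid \mathrm{Document})
  = \frac{P(\mathrm{Document} \mid \mathrm{Category})\, P(\mathrm{Category})}{P(\mathrm{Document})},
\qquad
P(\mathrm{Document} \mid \mathrm{Category})
  \approx \prod_{i=1}^{n} P(w_i \mid \mathrm{Category})
\]
```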
