1. Perfect Data Mining & Predictive
Analytics Model Methodology
Machine learning is a sub-field of computer science that developed from computational learning and pattern recognition
theory in artificial intelligence. It is the method of building analytical models that automatically discover
previously unknown patterns in data. These patterns point to associations, anomalies (outliers), sequences,
classifications, clusters and segments, and they offer insight into why an event happened.
Businesses and organizations can benefit from several uses of
machine learning:
• Segmentation: groups of customers who have the same or similar purchase
patterns, for targeted marketing
• Classification: prediction based on a set of attributes
• Forecasting: purchase projections based on time series
• Pattern detection: associating one product with another to
reveal cross-sell sequences and opportunities
• Anomaly detection: fraud detection, for example
Predictive analytics model methodology
The Cross Industry Standard Process for Data Mining (CRISP-DM) is the most widely used methodology for developing
predictive analytical models. It includes six phases:
1. business understanding
2. data understanding
3. data preparation
4. model development using supervised or unsupervised learning
5. model evaluation
6. model deployment
2. Business understanding
The business understanding phase involves understanding and defining the use case or business problem, the business
objective and the business question that needs to be answered. It also includes defining success criteria. The
necessary project-related tasks then need to be carried out. These tasks involve defining resource needs such as
technology, people and money; identifying any constraints and requirements; creating a project plan; assessing risks;
and creating a contingency plan.
Data understanding
The data understanding phase identifies data needs such as internal and external data sources, data origin and data
characteristics (features and quality), including the three Vs (volume, variety and velocity), formats and so on. It
also establishes whether the data is in a relational database, flat files or a Hadoop Distributed File System (HDFS),
or whether it is live, streaming data. This phase also includes data exploration and investigation using statistical
analysis to examine large volumes of data. In addition, a data quality assessment establishes the degree to which data
is missing, has errors, is duplicated or is inconsistent.
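The data quality checks described above can be sketched in a few lines of Python. This is a minimal illustration, not a full profiling tool: the records, the field names (`id`, `temperature`, `status`) and the helper `quality_report` are all hypothetical, and "inconsistent" is narrowed here to mean text values that differ only in letter case.

```python
from collections import Counter

# Hypothetical equipment records; None marks a missing value.
records = [
    {"id": 1, "temperature": 71.5, "status": "ok"},
    {"id": 2, "temperature": None, "status": "ok"},
    {"id": 3, "temperature": 88.0, "status": "FAILED"},
    {"id": 3, "temperature": 88.0, "status": "FAILED"},  # exact duplicate
    {"id": 4, "temperature": 69.9, "status": "failed"},  # inconsistent casing
]


def quality_report(rows, text_field="status"):
    """Count missing values, duplicate rows and inconsistent spellings."""
    # Missing values per field.
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                missing[field] += 1

    # A row counts as a duplicate if an identical row appeared earlier.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key in seen:
            duplicates += 1
        seen.add(key)

    # Spellings of the text field that differ only by letter case.
    spellings = {}
    for row in rows:
        value = row.get(text_field)
        if value is not None:
            spellings.setdefault(value.lower(), set()).add(value)
    inconsistent = {k: sorted(v) for k, v in spellings.items() if len(v) > 1}

    return {
        "missing": dict(missing),
        "duplicates": duplicates,
        "inconsistent": inconsistent,
    }


report = quality_report(records)
print(report)
```

On real data the same counts would normally be computed with a data-frame library or directly in the database; the point here is only what each check measures.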
Data preparation
The objective of the data preparation phase is to produce a data set that can be fed into machine-learning
algorithms. This phase requires a number of tasks, including filtering and cleaning; data conversion; data
transformation; data enrichment; and variable identification, which is also known as dimensionality reduction or
feature selection. The objective of variable identification is to create a data set of the most relevant variables to
be used as model input to get optimal results. The intention is also to remove variables that are not useful as model
input without compromising the model’s accuracy, that is, the accuracy of the predictions it makes.
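One simple way to approach variable identification is a filter: drop variables that never change, then keep only those that correlate with the outcome. The sketch below assumes this filter approach, which is only one of many feature selection techniques and is not prescribed by the methodology. The column names, the sample values, the thresholds and the helper `select_features` are hypothetical.

```python
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def select_features(columns, target, min_variance=1e-9, min_abs_corr=0.5):
    """Keep variables that vary and that correlate with the target."""
    selected = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # a constant column carries no information
        if abs(pearson(values, target)) >= min_abs_corr:
            selected.append(name)
    return selected


# Hypothetical candidate variables for six pieces of equipment.
columns = {
    "vibration": [0.1, 0.2, 0.8, 0.9, 0.15, 0.85],
    "site_code": [7, 7, 7, 7, 7, 7],       # constant, so it is dropped
    "serial_parity": [0, 1, 0, 1, 1, 0],   # weakly related to the target
}
failed = [0, 0, 1, 1, 0, 1]  # 1 = the equipment failed

print(select_features(columns, failed))
```

A correlation filter only detects linear, one-variable-at-a-time relationships, so in practice it is usually combined with domain knowledge or model-based selection.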
Model development
The model development phase is about building a machine-learning model. Models can be built to
predict, forecast or analyze information to find patterns such as sets, groups and associations.
Two types of machine learning can be used in model development:
1. supervised learning
2. unsupervised learning
Typically, predictive models are built using supervised learning. For example, if we need to develop a model
for equipment failure prediction, we can use data that describes equipment that has actually failed. We can use that
data to train the new model to recognize the profile of a piece of equipment that is likely to fail. To achieve
this profile recognition, we divide the data, which includes records of failed equipment, into a training data set
and a test data set. Then we train the model by feeding the training data set into an algorithm; several algorithms
can be used for prediction. Finally, we test the model using the test data set.
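The split, train and test steps above can be sketched as follows. The text does not name a specific algorithm, so this example assumes a nearest-centroid classifier purely because it is short enough to show in full. The sensor readings, the labels and the function names are hypothetical.

```python
import random

# Hypothetical labeled records: ((temperature, vibration), label).
records = [
    ((62.0, 0.12), "healthy"), ((95.0, 0.81), "failed"),
    ((65.5, 0.20), "healthy"), ((91.0, 0.74), "failed"),
    ((68.0, 0.28), "healthy"), ((98.5, 0.88), "failed"),
    ((60.5, 0.15), "healthy"), ((93.0, 0.79), "failed"),
    ((66.0, 0.22), "healthy"), ((97.0, 0.85), "failed"),
    ((69.5, 0.18), "healthy"), ((90.5, 0.72), "failed"),
]


def split_records(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and divide them into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train_centroids(training_set):
    """Training step: compute the mean feature vector of each label."""
    sums, counts = {}, {}
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to the features."""
    def squared_distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


train_set, test_set = split_records(records)
model = train_centroids(train_set)

# Test step: compare predictions with the known labels of held-out records.
correct = sum(predict(model, features) == label for features, label in test_set)
accuracy = correct / len(test_set)
print(f"test accuracy: {accuracy:.2f}")
```

Holding the test records out of training is what makes the accuracy figure meaningful: the model is scored on equipment it has not seen before.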
Unsupervised learning is a method of analyzing data to find hidden patterns that indicate product associations and
groupings, such as customer segments. Grouping is based on maximizing similarity within a group and minimizing
similarity between groups. The k-means clustering algorithm is one of the most widely used algorithms for this approach.
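A minimal sketch of k-means clustering follows, to show how the grouping works. It assumes the simplest possible initialization, using the first k points as starting centroids; production implementations choose starting points more carefully. The customer data and the function name `kmeans` are hypothetical.

```python
def kmeans(points, k, max_iter=100):
    """Group points into k clusters by repeatedly assigning and averaging."""
    # Assumption: use the first k points as the starting centroids.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clusters are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical customers: (monthly spend in hundreds, visits per month).
customers = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]

labels, centroids = kmeans(customers, k=2)
print(labels)     # cluster index assigned to each customer
print(centroids)  # center of each segment
```

Each customer ends up in the segment whose center is nearest, so low-spend, low-visit customers and high-spend, high-visit customers fall into separate groups without any labels being supplied.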
Predictive and descriptive analytical models can be built using advanced data mining tools, analytics
clouds, interactive data science workbooks with procedural or declarative programming languages, and automated
model development tools.
3. Model evaluation
After a model is developed, the next phase is to evaluate the accuracy and quality of its predictions. For predictions,
this assessment means understanding how many were correct and how many were incorrect. Various methods can
be used for this evaluation. Key measures in model evaluation are the numbers of true positives, true negatives, false
positives and false negatives. The bottom line is that we need to make sure that the model is accurate; otherwise, it
could generate large numbers of false positives that may result in incorrect actions and decisions.
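The four key measures, and the ratios commonly derived from them, can be computed as in the sketch below. The label lists are invented for illustration, with 1 meaning "failure" and 0 meaning "no failure"; the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true positives, true negatives, false positives, false negatives."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p != positive and a != positive:
            tn += 1
        elif p == positive and a != positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def evaluate(actual, predicted, positive=1):
    """Derive common evaluation ratios from the four counts."""
    tp, tn, fp, fn = confusion_counts(actual, predicted, positive)
    return {
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # Share of predicted positives that were really positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of real positives that the model found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of real negatives wrongly flagged as positive.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_counts(actual, predicted))  # → (3, 3, 1, 1)
print(evaluate(actual, predicted))
```

Accuracy alone can mislead when one class is rare, which is why the false positive rate and precision are worth tracking alongside it.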
Model deployment
Once we are satisfied with the model we have developed, the final phase involves deploying it to run in any of several
environments. These environments include spreadsheets, analytics servers, database management systems
(DBMSs), applications, analytical relational database management systems, Apache Hadoop, Apache Spark and
streaming analytics platforms.