A Seminar Report On
ARTIFICIAL NEURAL NETWORKS BASED DATA MINING
TECHNIQUES
Submitted in partial fulfilment of the requirements
For the award of degree
Of
INTEGRATED DUAL DEGREE
In
COMPUTER SCIENCE AND ENGINEERING
(With Specialization in Information Technology)
Submitted by
Vaibhav Dhattarwal
CSE-IDD
Enrolment No: 08211018
Under the guidance of
DR. DURGA TOSHINWAL
Professor
ELECTRONICS AND COMPUTER ENGINEERING DEPARTMENT
INDIAN INSTITUTE OF TECHNOLOGY ROORKEE
ROORKEE-247667
AUGUST 2012
Abstract
This report presents an overview of Data Mining Techniques and some of the applications of
these techniques in various utility networks. Companies have been collecting data for
decades, building massive data warehouses in which to store it. Even though this data is
available, very few companies have been able to realize the actual value stored in it. The
question these companies are asking is how to extract this value. The answer is Data mining.
There are many technologies available to data mining practitioners, including Artificial
Neural Networks, Regression, and Decision Trees. Many practitioners are wary of Neural
Networks due to their black box nature, even though they have proven themselves in many
situations. This report also provides a brief overview of artificial neural networks and
questions their position as an applicable tool in data mining.
Table of Contents
Abstract
Table of Contents
List of Figures
Chapter 1 Introduction
1.1 Objective of Seminar
Chapter 2 Data Mining
2.1 Data Mining Process
2.2 CRISP-DM Model
Chapter 3 Data Mining Techniques
3.1 Classification
3.2 Clustering
3.3 Regression
3.4 Association Rule
3.5 Neural Networks
Chapter 4 Neural Networks in Data Mining
4.1 Feed Forward Neural Network
4.2 Back Propagation Algorithm
Chapter 5 Applications of Data Mining Techniques
5.1 Specific Application Areas
5.2 Spatial Data Mining
5.3 Multimedia Data Mining
5.4 Web Mining
Chapter 6 Conclusion
References
List of Figures
Figure Title
2.1 Knowledge Discovery in Databases Process
2.2 Cross Industry Standard Process for Data Mining
3.1 Formation of Clusters
3.2 Linear Regression
4.1 An Artificial Neural Network
4.2 A Feed Forward Neural Network
5.1 Spatial Data Mining
5.2 Process Chart for conducting Text Mining
1 Introduction
The development of information technology has generated a large number of databases and
huge volumes of data in various areas. Nowadays corporations and organizations are accumulating
data at an enormous rate and from a very broad variety of sources, ranging from customer
transactions, credit card transactions and bank cash withdrawals to hourly weather data. Many
relational database servers have been built to store such massive quantities of data. As a matter of
fact, the data itself is critical to a company’s growth. It contains knowledge that could lead to
important business decisions that bring the business to the next level. Yet much of this data has
only ever been examined in a superficial manner. Organizations are becoming data rich but
knowledge poor. In other words, “We are drowning in data, but starving for knowledge!”
We need information, but what we have is a huge amount of data flooding companies,
organizations and even individuals. Because the amount of data is so enormous that humans
cannot process it fast enough to extract information at the right time, machine
learning technology has been developed as a potential solution to this problem.
Research in databases and information technology has given rise to an approach for storing
and manipulating this precious data to support further decision making.
Data mining is the term used to describe the process of extracting value from a database. A
Data-warehouse is a location where information is stored. The type of data stored depends
largely on the type of industry and the company. Data mining (the analysis step of the
"Knowledge Discovery in Databases" process, or KDD), a relatively young and
interdisciplinary field of computer science, is the process that attempts to discover patterns in
large data sets. It utilizes methods at the intersection of artificial intelligence, machine
learning, statistics, and database systems. The overall goal of the data mining process is to
extract information from a data set and transform it into an understandable structure for
further use. Aside from the raw analysis step, it involves database and data
management aspects, data pre-processing, model and inference considerations,
interestingness metrics, complexity considerations, post-processing of discovered
structures, visualization, and online updating.
Data mining is the business of answering questions that you’ve not asked yet. Data mining
reaches deep into databases. Data mining tasks can be classified into two categories:
Descriptive and predictive data mining.
Descriptive data mining provides information to understand what is happening inside the data
without a predetermined idea. Predictive data mining allows the user to submit records with
unknown field values; the system then estimates those values from patterns previously
discovered in the database. Data mining models can be categorized according to the tasks they
perform: Classification and Prediction, Clustering, and Association Rules. Classification and
prediction are predictive models, whereas clustering and association rules are descriptive models.
The most common action in data mining is classification. It recognizes patterns that describe
the group to which an item belongs. It does this by examining existing items that already
have been classified and inferring a set of rules. Clustering is similar to classification; the
major difference is that no groups have been predefined. Prediction is the construction
and use of a model to assess the class of an unlabeled object, or to assess the value or value
ranges that a given object is likely to have. The next application is forecasting. This differs
from prediction because it estimates the future value of continuous variables based on
patterns within the data.
Four things are required to data-mine effectively: high-quality data, the “right” data, an
adequate sample size and the right tool. There are many tools available to a data mining
practitioner. These include decision trees, various types of regression and neural networks.
1.1 Objective of the Seminar
This seminar report presents an introduction to Data Mining and a description of the Data
Mining Process. The objective of this seminar is to present an overview of Data Mining
techniques that are in use and are applicable in various scenarios. The application of these
techniques is also discussed, following an explanation of how each technique works.
2 Data Mining
Data mining is the process of extracting useful information and patterns from huge volumes of
data. It is also called the knowledge discovery process, knowledge mining from data, knowledge
extraction or data/pattern analysis. In other words, it can be referred to as Knowledge
Discovery in Databases (KDD). It involves searching large volumes of data for patterns.
Figure 2.1 Knowledge Discovery in Databases Process
The Knowledge Discovery in Databases (KDD) process is commonly defined with the
stages:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation.
2.1 Data Mining Process
Data Mining is performed on the following types of data:
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
  o Object-oriented and object-relational databases
  o Spatial databases
  o Time-series data and temporal data
  o Text databases and multimedia databases
  o Heterogeneous and legacy databases
Some of the steps involved in the Data Mining process are:
• Data cleaning: The task of this step is to remove noise and inconsistent data.
• Data integration: In this step, multiple data sources like the ones mentioned in the
  section above can be combined into an integrated collection of data.
• Data selection: All the data relevant to the analysis task is retrieved from the database
  in this step.
• Data transformation: The data is transformed or consolidated into forms appropriate
  for mining by performing summary or aggregation operations.
• Data mining: The critical step where intelligent methods are applied in order to
  extract data patterns.
• Pattern evaluation: This step is deployed to identify the truly interesting patterns
  representing knowledge based on certain measures.
• Knowledge presentation: In the final step, various visualization and knowledge
  representation techniques are used to present the mined knowledge to the user.
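The first steps of this process can be illustrated with a small sketch. The records, field names and the spending threshold below are hypothetical, and the "mining" step is a trivial stand-in for a real algorithm; the point is only to show how cleaning, integration, selection and transformation feed the mining step.

```python
from collections import Counter

# Two hypothetical data sources (illustrative records only).
store_sales = [
    {"customer": "c1", "item": "bread", "amount": 2.5},
    {"customer": "c2", "item": "milk", "amount": None},  # noisy record
    {"customer": "c1", "item": "milk", "amount": 1.2},
]
web_sales = [
    {"customer": "c3", "item": "bread", "amount": 2.4},
    {"customer": "c1", "item": "bread", "amount": 2.6},
]

def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: merge several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, item):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r["item"] == item]

def transform(records):
    """Data transformation: aggregate the amount spent per customer."""
    totals = Counter()
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)

def mine(totals, threshold):
    """Data mining (trivial stand-in): customers spending above a threshold."""
    return sorted(c for c, total in totals.items() if total > threshold)

records = integrate(clean(store_sales), clean(web_sales))
bread_totals = transform(select(records, "bread"))
frequent_buyers = mine(bread_totals, threshold=3.0)
print(frequent_buyers)
```

Pattern evaluation and knowledge presentation would follow on the result of `mine`; they are omitted here because they depend on the interestingness measures and visualization tools chosen for a given project.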
Data mining has five main functions:
• Classification: It infers the defining characteristics of a certain group.
• Clustering: It identifies groups of items that share a particular characteristic.
  (Clustering differs from classification in that no predefined characteristic is given in
  clustering.)
• Association: It identifies relationships between events that occur at one time.
• Sequencing: It is similar to association, except that the relationship exists over a
  period of time.
• Forecasting: It estimates future values based on patterns within large sets of data.
2.2 Cross-Industry Standard Process for Data Mining (CRISP-DM) Model
Figure 2.2 Cross-Industry Standard Process for Data Mining (CRISP-DM)
1. Business understanding - In the business understanding phase, we must understand the
business objectives clearly and find out what the client really wants to achieve. Next,
we have to assess the situation by finding out about the resources, assumptions, constraints
and other important factors. Then, from the business objectives and the current situation, we
need to create data mining goals that achieve the business objectives within that situation.
Finally, a good data mining plan has to be established to achieve both the business and data
mining goals.
2. Data understanding - This phase starts with initial data collection from available
sources to get familiar with the data. Data loading and data integration must be carried out to
ensure successful data collection. Next, the “surface” properties of the acquired data need to
be examined carefully and reported. Then, the data need to be explored by tackling the
data mining questions, which can be addressed using querying, reporting and
visualization. Finally, we must check whether the acquired data is complete and identify
any missing values in it.
3. Data preparation - Data preparation normally consumes about 90% of the project time. The
outcome of the data preparation phase is the final data set. Once the available data
sources are identified, they need to be selected, cleaned, constructed and formatted into
the desired form.
4. Modelling - Several modelling techniques are selected for use on the prepared
dataset. A test scenario must be generated to validate the model’s quality. One or more
models are created by running the modelling tool on the prepared dataset. The created
models need to be assessed carefully to ensure that they meet the business objectives.
5. Evaluation - In the evaluation phase, the model results must be evaluated in the context
of the business objectives set in the first phase. In this phase, new business requirements may
be raised because new patterns have been discovered in the model results or because of other
factors. Gaining business understanding is an iterative process in data mining. The final decision
on whether to move to the deployment phase must be made in this step.
6. Deployment - The knowledge or information gained through the data mining process needs
to be presented in such a way that it can be used whenever it is desired. In this phase, the
deployment, maintenance and monitoring plans have to be created for deployment and
future support. From the project point of view, the final report needs to
summarize the project experiences and review the project to see what needs to be
improved.
3 Data Mining Techniques
3.1. Classification
Classification is the most commonly applied data mining technique. It employs a set of
pre-classified examples to develop a model that can classify the population of records at
large. Classification is a classic data mining technique based on machine learning: it
assigns each item in a set of data to one of a predefined set of classes
or groups. Fraud detection and credit risk applications are particularly well suited to this type
of analysis. This approach frequently employs decision tree or neural network-based
classification algorithms.
The data classification process involves two steps: learning and classification. In learning, the
training data are analyzed by a classification algorithm. In classification, test data are used to
estimate the accuracy of the classification rules. If the accuracy is acceptable, the rules can be
applied to new data tuples. Classification methods make use of mathematical techniques such as
decision trees, linear programming, neural networks and statistics. In classification, we build
software that can learn how to classify data items into groups.
The classifier-training algorithm uses these pre-classified examples to determine the set of
parameters required for proper discrimination. The algorithm then encodes these parameters
into a model called a classifier.
Types of classification models:
Classification by decision tree induction
Bayesian Classification
Support Vector Machines (SVM)
Classification Based on Associations
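The two-step process described above (learn from pre-classified examples, then estimate accuracy on test data) can be sketched as follows. This is a minimal illustration that uses a nearest-centroid classifier, chosen here only for brevity; it is not one of the model types listed above, and the feature values and class labels are hypothetical.

```python
def train(examples):
    """Learning step: compute one centroid per class from labelled examples."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(model, features):
    """Classification step: assign the class whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

def accuracy(model, examples):
    """Fraction of test examples whose predicted class matches the true class."""
    correct = sum(1 for features, label in examples if classify(model, features) == label)
    return correct / len(examples)

# Hypothetical pre-classified examples: (features, class label).
training_data = [
    ([1.0, 1.2], "low_risk"), ([0.8, 1.0], "low_risk"),
    ([4.0, 3.8], "high_risk"), ([4.2, 4.1], "high_risk"),
]
test_data = [([1.1, 0.9], "low_risk"), ([3.9, 4.0], "high_risk")]

model = train(training_data)              # the "classifier" holding the learned parameters
test_accuracy = accuracy(model, test_data)
print(test_accuracy)
```

If the accuracy on the held-out test data is acceptable, `classify` can then be applied to new, unlabelled tuples.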
3.2. Clustering
Clustering can be defined as the identification of similar classes of objects. It is a data
mining technique that automatically forms meaningful or useful clusters of objects that have
similar characteristics. By using clustering techniques we can further
identify dense and sparse regions in object space and can discover overall distribution patterns
and correlations among data attributes. Because the classification approach can
become costly, clustering can be used as a pre-processing approach for attribute subset
selection and classification. In clustering, the classes are not known in advance: the technique
defines the classes and places objects in them, whereas in classification objects are assigned
to predefined classes.
Figure 3.1 Formation of clusters
Types of clustering methods:
Partitioning Methods
Hierarchical methods
Density based methods
Grid-based methods
Model-based methods
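As an illustration of a partitioning method, the following is a minimal k-means sketch. The report does not name a specific algorithm, so k-means is an assumption made here for illustration; the points and the initial centroids are hypothetical and fixed so that the run is deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Partition points into len(centroids) clusters by repeated reassignment."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid unchanged
        if new_centroids == centroids:     # converged: assignments no longer change
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (9.0, 8.5)])
print(clusters)
```

No class labels are supplied: the two groups emerge from the distances between the points alone, which is the difference from classification noted above.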
3.3. Regression
Regression analysis helps in understanding how the typical value of the dependent variable
changes when any one of the independent variables is varied, while the other independent
variables are held fixed. Regression analysis estimates the conditional expectation of the
dependent variable given the independent variables. In other words, it estimates the average
value of the dependent variable when the independent variables are fixed.
In all cases, the estimation target is a function of the independent variables called the
regression function. In regression analysis, it is also of interest to characterize the variation of
the dependent variable around the regression function, which can be described by a
probability distribution.
Regression analysis is widely used for prediction and forecasting, where its use has
substantial overlap with the field of machine learning. Regression analysis is also used to
understand which independent variables are related to the dependent variable, and to explore
the forms of these relationships.
In data mining, independent variables are attributes already known and response variables
are what we want to predict. Real-world problems are very difficult to predict because they
may depend on complex interactions of multiple predictor variables. Therefore, more
complex techniques (e.g., logistic regression, decision trees, or neural nets) may be necessary
to forecast future values. The same model types can often be used for both regression and
classification. For example, the CART (Classification and Regression Trees) decision tree
algorithm can be used to build both classification trees (to classify categorical response
variables) and regression trees (to forecast continuous response variables). Neural networks
too can create both classification and regression models.
Figure 3.2 Linear Regression
Types of regression methods
Linear Regression
Multivariate Linear Regression
Nonlinear Regression
Multivariate Nonlinear Regression
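For the simplest case, linear regression with one independent variable, the regression function is a straight line whose slope and intercept can be estimated by ordinary least squares. The sketch below assumes that estimator and uses made-up data that lie exactly on a line, so that the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: independent variable xs, dependent variable ys.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # generated from y = 2x + 1

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # estimated value of y when x = 5
print(slope, intercept, prediction)
```

The fitted line gives the estimated average value of the dependent variable for any fixed value of the independent variable, which is the conditional expectation described above.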
3.4. Association Rule
Association is one of the best known data mining techniques. In association, a pattern is
discovered based on a relationship between a particular item and other items in the same
transaction. Association and correlation analysis is usually used to find frequent item sets
among large data sets. This type of finding helps businesses to make certain decisions, such as
catalogue design, cross marketing and customer shopping behaviour analysis. Association rules
are usually required to satisfy a user-specified minimum support and a user-specified minimum
confidence at the same time. Association rule generation is usually split up into two separate
steps: First, minimum support is applied to find all frequent item sets in a database. Second,
these frequent item sets and the minimum confidence constraint are used to form rules.
Association Rule algorithms need to be able to generate rules with confidence values less
than one. However, the number of possible Association Rules for a given dataset is generally
very large and a high proportion of the rules are usually of little (if any) value.
Types of association rule
Multilevel association rule
Multidimensional association rule
Quantitative association rule
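The two-step rule generation described in this section can be sketched as below. The transactions and thresholds are hypothetical, and the search is a brute-force enumeration of all item sets, which is only practical for tiny examples; real systems use algorithms that prune the search space.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(min_support):
    """Step 1: find all item sets whose support reaches the minimum."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo))
            if s >= min_support:
                result[frozenset(combo)] = s
    return result

def rules(frequent, min_confidence):
    """Step 2: form rules lhs -> rhs from frequent item sets."""
    found = []
    for itemset, s in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                confidence = s / support(lhs)  # support(lhs and rhs) / support(lhs)
                if confidence >= min_confidence:
                    found.append((set(lhs), set(itemset - lhs), s, confidence))
    return found

frequent = frequent_itemsets(min_support=0.5)
found = rules(frequent, min_confidence=0.8)
for lhs, rhs, s, c in found:
    print(lhs, "->", rhs, "support", s, "confidence", c)
```

With these thresholds the only rule that survives is butter -> bread: every transaction containing butter also contains bread, whereas the reverse rule has a confidence of only about 0.67.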
3.5. Neural networks
An Artificial Neural Network (ANN), usually called neural network (NN), is a mathematical
model or computational model that is inspired by the structure and functional aspects of
biological neural networks. A neural network consists of an interconnected group of artificial
neurons, and it processes information using a connection based approach to computation. In
most cases an ANN is an adaptive system that changes its structure based on external or
internal information that flows through the network during the learning phase. Modern neural
networks are non-linear statistical data modelling tools. They are usually used to model
complex relationships between inputs and outputs or to find patterns in data.
A neural network is a set of connected input/output units in which each connection has a weight
associated with it. During the learning phase, the network learns by adjusting the weights so as
to be able to predict the correct class labels of the input tuples. Neural networks have the
remarkable ability to derive meaning from complicated or imprecise data and can be used to
extract patterns and detect trends that are too complex to be noticed by either humans or other
computer techniques. They are well suited for continuous-valued inputs and outputs. Neural
networks are best at identifying patterns or trends in data and are well suited for prediction or
forecasting needs.
4 Neural Networks in Data Mining
Neural networks are non-linear statistical data modelling tools. They can be used to model
complex relationships between inputs and outputs; or to find patterns in data and to infer rules
from them. Neural networks are useful in providing information on associations,
classifications, clusters, and forecasting. Using neural networks as a tool, data warehousing
firms can harvest information from datasets in the data mining process. Neural networks are
programmed to store, recognize, and associatively retrieve patterns or database entries; to
solve combinatorial optimization problems; to filter noise from measurement data; to control
ill-defined problems; in summary, to estimate sampled functions when we do not know the
form of the functions. These two abilities, pattern recognition and function estimation, make
neural networks a widely used tool in data mining. With their model-free estimators and
their dual nature, neural networks serve data mining in a variety of ways.
Figure 4.1 An Artificial Neural Network
Neural networks, depending on the architecture, provide associations, classifications, clusters,
prediction and forecasting to the data mining industry. Neural networks essentially comprise
three pieces: the architecture or model; the learning algorithm; and the activation functions.
With neural networks, we can mine valuable information from a mass of historical information
so that it can be used efficiently in financial areas. Hence, the applications of neural networks
in financial forecasting have become very popular.
4.1. Feed Forward Neural Network
Figure 4.2 A Feed Forward Neural Network
One of the simplest feed forward neural networks (FFNN), shown in Figure 4.2, consists of three
layers: an input layer, a hidden layer and an output layer. In each layer there are one or more
processing elements (PEs). PEs are meant to simulate the neurons in the brain, which is why
they are often referred to as neurons or nodes. A PE receives inputs from either the outside
world or the previous layer. The connections between the PEs of adjacent layers each have a
weight (parameter) associated with them. This weight is adjusted during training. Information
only travels in the forward direction through the network; there are no feedback loops. The
simplified process for training an FFNN is as follows:
1. Input data is presented to the network and propagated through the network until it
reaches the output layer. This forward process produces a predicted output.
2. The predicted output is subtracted from the actual output and an error value for the
network is calculated.
3. The neural network then uses supervised learning, which in most cases is back
propagation, to train the network. Back propagation is a learning algorithm for
adjusting the weights. It starts with the weights between the output layer PEs and
the last hidden layer PEs and works backwards through the network.
4. Once back propagation has finished, the forward process starts again, and this cycle
is continued until the error between predicted and actual outputs is minimized.
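Steps 1 and 2 of this process, the forward pass and the error calculation, can be sketched for a three-layer network as follows. The sigmoid activation function, the bias terms and all numeric values are assumptions made for illustration; the report does not specify them.

```python
import math

def sigmoid(x):
    """Assumed activation function; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, biases):
    """Each PE sums its weighted inputs, adds a bias, and applies the activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Propagate the inputs through the hidden layer to the output layer."""
    hidden = layer_output(inputs, hidden_w, hidden_b)
    return layer_output(hidden, output_w, output_b)

# Hypothetical network: 2 inputs, 2 hidden PEs, 1 output PE.
hidden_w = [[0.5, -0.5], [0.3, 0.8]]   # one row of input weights per hidden PE
hidden_b = [0.0, 0.0]
output_w = [[1.0, -1.0]]               # one row of hidden weights per output PE
output_b = [0.0]

predicted = forward([1.0, 1.0], hidden_w, hidden_b, output_w, output_b)[0]
actual = 1.0
error = actual - predicted             # step 2: actual output minus predicted output
print(predicted, error)
```

Because information flows only from inputs to outputs, the forward pass is just two applications of `layer_output`; the weights stay fixed until the learning algorithm in step 3 adjusts them.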
4.2. The Back Propagation Algorithm:
Back propagation, or propagation of error, is a common method of teaching artificial neural
networks how to perform a given task. Back propagation is the method of training artificial
neural networks so as to minimize the objective function. The back propagation algorithm
performs learning on a feed-forward neural network. The back propagation algorithm is used
in layered feed forward ANNs. This means that the artificial neurons are organized in layers,
and send their signals “forward”, and then the errors are propagated backwards. The back
propagation algorithm uses supervised learning, which means that we provide the algorithm
with examples of the inputs and outputs we want the network to compute, and then the error
(difference between actual and expected results) is calculated. The idea of the back
propagation algorithm is to reduce this error, until the ANN learns the training data.
Algorithm for a 3-layer network:
Initialize the weights in the network
Do
    For each example e in the training set
        O = neural-net-output(network, e) ; forward pass
        T = teacher output for e
        Calculate error (T - O) at the output units
        Compute delta_wh for all weights from hidden layer to output layer ; backward pass
        Compute delta_wi for all weights from input layer to hidden layer ; backward pass continued
        Update the weights in the network
Until all examples classified correctly or stopping criterion satisfied
Return the network
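A runnable sketch of this loop is given below for a network with two inputs, two hidden units and one output unit. Several details are assumptions not stated in the report: sigmoid activations, bias terms, a squared-error objective, a learning rate of 0.5, a fixed number of epochs as the stopping criterion, and the logical OR function as toy training data.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(net, x):
    """Forward pass: return hidden activations and the network output O."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_ih"], net["b_h"])]
    output = sigmoid(sum(w * h for w, h in zip(net["w_ho"], hidden)) + net["b_o"])
    return hidden, output

def train_step(net, x, target, rate):
    hidden, output = forward(net, x)
    error = target - output                          # T - O at the output unit
    delta_o = error * output * (1.0 - output)        # delta for the output unit
    # Backward pass: hidden deltas use the hidden-to-output weights before updating.
    delta_h = [delta_o * w * h * (1.0 - h) for w, h in zip(net["w_ho"], hidden)]
    # Update the weights (and biases) in the direction that reduces the error.
    net["w_ho"] = [w + rate * delta_o * h for w, h in zip(net["w_ho"], hidden)]
    net["b_o"] += rate * delta_o
    net["w_ih"] = [[w + rate * d * xi for w, xi in zip(ws, x)]
                   for ws, d in zip(net["w_ih"], delta_h)]
    net["b_h"] = [b + rate * d for b, d in zip(net["b_h"], delta_h)]

def total_error(net, data):
    """Sum of squared errors over the training set."""
    return sum((t - forward(net, x)[1]) ** 2 for x, t in data)

random.seed(0)                                       # reproducible initial weights
net = {
    "w_ih": [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)],
    "b_h": [0.0, 0.0],
    "w_ho": [random.uniform(-1, 1) for _ in range(2)],
    "b_o": 0.0,
}
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # logical OR

initial_error = total_error(net, data)
for _ in range(2000):                                # stopping criterion: fixed epochs
    for x, t in data:
        train_step(net, x, t, rate=0.5)
final_error = total_error(net, data)
print(initial_error, final_error)
```

The hidden-layer deltas are computed before the hidden-to-output weights are changed, so that the backward pass uses the same weights as the forward pass that produced the error.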
The Back Propagation learning algorithm can be divided into two phases:
Phase 1: Propagation
Every propagation involves the following steps:
1. Forward propagation of a training pattern's input through the neural network to
generate the output activations.
2. Backward propagation of the output activations through the neural
network, using the training pattern's target, to generate the deltas of the output and
hidden neurons.
Phase 2: Weight update
For each weight-synapse the following steps are used:
1. Multiply its output delta and input activation to get the gradient of the weight.
2. Move the weight in the opposite direction of the gradient by subtracting a fraction of
the gradient (set by the learning rate) from the weight.
Repeat phases 1 and 2 until the performance of the network is satisfactory.
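The weight update in phase 2 can be written compactly. The notation below is introduced here for illustration and is not the report's: w_ij is the weight on the connection from unit i to unit j, a_i is the input activation on that connection, delta_j is the output delta of unit j, E is the error being minimized, and eta is the learning rate (the "ratio" subtracted from the weight).

```latex
\frac{\partial E}{\partial w_{ij}} = \delta_j \, a_i ,
\qquad
w_{ij} \leftarrow w_{ij} - \eta \, \delta_j \, a_i
```

The sign follows the convention in which delta_j is derived from (output minus target); if the delta is instead derived from (target minus output), the same step is written as an addition.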
5 Applications of Data Mining
Data mining is a relatively new technology that has not fully matured. Despite this, there are
a number of industries that are already using it on a regular basis. Some of these
organizations include retail stores, hospitals, banks, and insurance companies.
Many of these organizations are combining data mining with such things as statistics, pattern
recognition, and other important tools. Data mining can be used to find patterns and
connections that would otherwise be difficult to find. This technology is popular with many
businesses because it allows them to learn more about their customers and make smart
marketing decisions.
Data mining has a number of applications. The first is market
segmentation. With market segmentation, we can find behaviours that are common among
customers. We can look for patterns among customers that seem to purchase the same
products at the same time. Another application of data mining is customer churn analysis,
which allows us to estimate which customers are the most likely to stop purchasing
our products or services and go to one of our competitors. In addition to this, a company can
use data mining to find out which purchases are the most likely to be fraudulent.
For example, by using data mining in retail stores, we may be able to determine which
products are stolen the most. By finding out which products are stolen the most, steps can be
taken to protect those products and detect those who are stealing them. We can also use data
mining to determine the effectiveness of interactive marketing. Some of the customers will be
more likely to purchase the products online than offline, and we must identify them.
While many businesses use data mining to help increase their profits, it can also be used to
create new businesses and industries. One industry that can be created by data mining is the
automatic prediction of both behaviours and trends. Using this automated prediction we can
gain an advantage over the competition. Instead of simply guessing what the next big trend
will be, we can determine it based on statistics, patterns, and logic. Another application of
automatic prediction is to use data mining to look at past marketing strategies to
determine which one has worked best so far and why. We can then avoid repeating
mistakes that occurred in previous marketing campaigns.
Data mining is also a powerful tool for those who deal with finances. A financial institution
such as a bank can predict the number of defaults that will occur among its customers
within a given period of time, as well as the amount of fraud that is likely to occur.
Another application of data mining is the automatic recognition of patterns that were not
previously known. While data mining is a very valuable tool, it is important to realize that it
is not a complete solution. Even if an automated technology should be invented, it will not
guarantee the success of the company. However, it will tip the odds in our favour.
5.1. Specific Application Areas of Data Mining
Data Mining for Financial Data Analysis
A few typical cases:
• Design and construction of data warehouses for multidimensional data analysis.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
Data Mining for the Retail Industry
A few examples of data mining in the retail industry:
• Design and construction of data warehouses based on the benefits of data mining
• Multidimensional analysis of sales, customers, products, time, and region
• Analysis of the effectiveness of sales campaigns
• Customer retention: analysis of customer loyalty
• Product recommendation and cross-referencing of items
Data Mining for the Telecommunication Industry
• Multidimensional analysis of telecommunication data.
• Fraudulent pattern analysis and the identification of unusual patterns.
• Multidimensional association and sequential pattern analysis.
• Mobile telecommunication services.
• Use of visualization tools in telecommunication data analysis.
Data Mining for Biological Data Analysis
• Semantic integration of heterogeneous, distributed genomic and proteomic databases.
• Alignment, indexing, similarity search, and comparative analysis of multiple
  nucleotide/protein sequences.
• Discovery of structural patterns and analysis of genetic networks and protein
  pathways.
• Association and path analysis: identifying co-occurring gene sequences and linking
  genes to different stages of disease development.
• Visualization tools in genetic data analysis.
Data Mining in Other Scientific Applications
Data collection and storage technologies have recently improved, so that today, scientific data
can be amassed at much higher speeds and lower costs. This has resulted in the accumulation
of huge volumes of high-dimensional data, stream data, and heterogeneous data, containing
rich spatial and temporal information. Consequently, scientific applications are shifting from
the “hypothesize-and-test” paradigm toward a “collect and store data, mine for new
hypotheses, confirm with data or experimentation” process. This shift brings about new
challenges for data mining.
5.2. Spatial Data Mining
A spatial database stores a large amount of space-related data, such as maps, pre-processed
remote sensing or medical imaging data, and VLSI chip layout data. Spatial data mining
refers to the extraction of knowledge, spatial relationships, or other interesting patterns not
explicitly stored in spatial databases.
Spatial data mining is the application of data mining methods to spatial data. The end
objective of spatial data mining is to find patterns in data with respect to geography. So far,
data mining and Geographic Information Systems (GIS) have existed as two separate
technologies, but now Data mining offers great potential benefits for GIS-based applied
decision-making.
Figure 5.1 Spatial Data Mining
Spatial Data Cube Construction and Spatial OLAP
As with relational data, we can integrate spatial data to construct a data warehouse that
facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated,
time-variant and non-volatile collection of both spatial and nonspatial data in support of spatial
data mining and spatial-data-related decision-making processes.
There are three types of dimensions in a spatial data cube:
• A nonspatial dimension
• A spatial-to-nonspatial dimension
• A spatial-to-spatial dimension
We can distinguish two types of measures in a spatial data cube:
• A numerical measure contains only numerical data
• A spatial measure contains a collection of pointers to spatial objects
Spatial database systems usually handle vector data that consist of points, lines, polygons
(regions), and their compositions, such as networks or partitions. Typical examples of such
data include maps, design graphs, and 3-D representations of the arrangement of the chains of
protein molecules.
5.3. Multimedia Data Mining
A multimedia database system stores and manages a large collection of multimedia data, such
as audio, video, image, graphics, speech, text, document, and hypertext data, which contain
text, text mark-ups, and linkages.
Similarity Search in Multimedia Data
When searching for similarities in multimedia data, we can search on either the data
description or the data content. Approaches to content-based similarity search include:
 Colour histogram-based signature
 Multifeature composed signature
 Wavelet-based signature
 Wavelet-based signature with region-based granularity
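As an illustration of the first approach in the list above, the following is a minimal sketch of a colour histogram-based signature. It assumes images are given as non-empty lists of (r, g, b) pixel tuples with channel values from 0 to 255, and it compares signatures by histogram intersection, which is one common choice and not the only possible one. The function names and the toy images are hypothetical.

```python
def colour_histogram(pixels, bins_per_channel=2):
    """Return a normalised histogram over quantised (r, g, b) colours."""
    counts = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin number 0..bins_per_channel-1.
        rq = r * bins_per_channel // 256
        gq = g * bins_per_channel // 256
        bq = b * bins_per_channel // 256
        counts[(rq * bins_per_channel + gq) * bins_per_channel + bq] += 1
    total = len(pixels)  # assumed to be greater than zero
    return [c / total for c in counts]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical colour distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Toy "images" given as lists of RGB pixels (hypothetical data).
red_image = [(250, 10, 10)] * 4
mostly_red = [(240, 20, 5)] * 3 + [(10, 10, 250)]
blue_image = [(5, 5, 240)] * 4

sim_red = histogram_intersection(colour_histogram(red_image),
                                 colour_histogram(mostly_red))
sim_blue = histogram_intersection(colour_histogram(red_image),
                                  colour_histogram(blue_image))
print(sim_red, sim_blue)
```

Because the signature ignores where colours occur in the image, two images with the same colour composition but different layouts receive the same signature. The multifeature and region-based signatures in the list address that limitation.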
Multidimensional Analysis of Multimedia Data
To facilitate the multidimensional analysis of large multimedia databases, multimedia data
cubes can be designed and constructed in a manner similar to that for traditional data cubes
from relational data. A multimedia data cube can contain additional dimensions and measures
for multimedia information, such as colour, texture, and shape.
Classification and Prediction Analysis of Multimedia Data
Classification and predictive modelling can be used for mining multimedia data, especially in
scientific research, such as astronomy, seismology, and geo-scientific research.
Mining Associations in Multimedia Data:
 Associations between image content and non-image content features
 Associations among image contents that are not related to spatial relationships
 Associations among image contents related to spatial relationships
Audio and Video Data Mining
An extraordinary amount of audiovisual information is becoming available in digital form, in
digital archives, on the World Wide Web, in broadcast data streams, and in personal and
professional databases, and hence there is a need to mine it.
Visual data mining discovers implicit and useful knowledge from large data sets using data
and/or knowledge visualization techniques.
In general, data visualization and data mining can be integrated in the following ways:
 Data visualization
 Data mining result visualization
 Data mining process visualization
 Interactive visual data mining
5.4. Web Mining
Figure 5.2 Process Chart for conducting Text Mining
Text Data Analysis and Information Retrieval
Information retrieval (IR) is a field that has been developing in parallel with database
systems for many years.
Basic Measures for Text Retrieval: Precision and Recall
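Precision: This is the percentage of retrieved documents that are in fact relevant to the query
(i.e., “correct” responses). It is formally defined as
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: This is the percentage of the documents relevant to the query that were in fact
retrieved. It is formally defined as
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
Text Retrieval Methods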
1) Document selection methods
2) Document ranking methods
Text Indexing Techniques
1) Inverted indices
2) Signature files.
Query Processing Techniques: Once an inverted index is created for a document collection, a
retrieval system can answer a keyword query quickly by looking up which documents contain
the query keywords.
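The keyword lookup described above, together with the precision and recall measures, can be sketched as follows. This is a minimal illustration under stated assumptions: the four-document collection, the whitespace tokenisation, the AND semantics for multi-keyword queries, and the relevance judgements are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
documents = {
    1: "neural networks for data mining",
    2: "spatial data mining with GIS",
    3: "cooking with seasonal vegetables",
    4: "mining the world wide web",
}

def build_inverted_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def keyword_query(index, keywords):
    """Return ids of documents containing every query keyword."""
    postings = [index.get(word.lower(), set()) for word in keywords]
    return set.intersection(*postings) if postings else set()

def precision_recall(retrieved, relevant):
    """Precision and recall as fractions in [0, 1]."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

index = build_inverted_index(documents)
retrieved = keyword_query(index, ["mining"])

# Assumed relevance judgements for a query about "data mining".
relevant = {1, 2}
precision, recall = precision_recall(retrieved, relevant)
print(sorted(retrieved), precision, recall)
```

The single keyword "mining" retrieves every relevant document but also one that is not relevant, so recall is high while precision drops. Adding the keyword "data" to the query narrows the result to the two relevant documents.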
Ways of Dimensionality Reduction for Text:
 Latent Semantic Indexing
 Locality Preserving Indexing
 Probabilistic Latent Semantic Indexing
Text Mining Approaches:
 Keyword-Based Association Analysis
 Document Classification Analysis
 Document Clustering Analysis
Mining the World Wide Web
The World Wide Web serves as a huge, widely distributed, global information service centre
for news, advertisements, consumer information, financial management, education,
government, e-commerce, and many other information services. The Web also contains a rich
and dynamic collection of hyperlink information and Web page access and usage
information, providing rich sources for data mining.
Challenges:
 The Web seems to be too huge for effective data warehousing and data mining
 The complexity of Web pages is far greater than that of any traditional text document
collection
 The Web is a highly dynamic information source
 The Web serves a broad diversity of user communities
 Only a small portion of the information on the Web is truly relevant or useful
Besides mining Web contents and Web linkage structures, another important task for Web
mining is Web usage mining, which mines Web log records to discover user access patterns.
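A minimal sketch of Web usage mining is given below. It assumes a simplified access log in which each line holds a user id, a timestamp, and a requested path, already ordered by time. The log format and all names are hypothetical; real server logs need parsing and session identification first.

```python
from collections import Counter

# Hypothetical, simplified access log: "user_id timestamp path", ordered by time.
log_lines = [
    "u1 100 /home",
    "u2 101 /home",
    "u1 105 /products",
    "u2 107 /products",
    "u1 110 /cart",
    "u2 115 /home",
]

def page_transitions(lines):
    """Count how often users move from one page directly to the next."""
    last_page = {}
    transitions = Counter()
    for line in lines:
        user, _timestamp, path = line.split()
        if user in last_page:
            transitions[(last_page[user], path)] += 1
        last_page[user] = path
    return transitions

transitions = page_transitions(log_lines)
most_common = transitions.most_common(1)[0]
print(most_common)
```

Frequent transitions of this kind are one simple form of access pattern. They could inform, for example, which links to make more prominent on a page.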
Data Mining for Intrusion Detection
The security of our computer systems and data is at continual risk. The extensive growth of
the Internet and the increasing availability of tools and tricks for intruding into and
attacking networks have made intrusion detection a critical component of network
administration. The following are some areas in which data mining technology may be applied
or further developed for intrusion detection:
1. Development of data mining algorithms for intrusion detection
2. Association and correlation analysis, and aggregation to help select and build
discriminating attributes
3. Analysis of stream data
4. Distributed data mining
5. Visualization and querying tools
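To make the third item above more concrete, the following is a minimal sketch of anomaly detection on stream data. It assumes the stream is a sequence of numeric readings (here, connection attempts per second) and flags a reading that lies more than a chosen number of standard deviations from the running mean, which is maintained with Welford's online update. The class name, threshold, warm-up length, and data are all assumptions made for illustration; this is not a complete intrusion detection method.

```python
import math

class StreamingAnomalyDetector:
    """Flag values far from the running mean, using Welford's online update."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # number of standard deviations (assumed)
        self.warmup = warmup        # observations required before flagging
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def observe(self, value):
        """Return True if value looks anomalous, then fold it into the statistics."""
        anomalous = False
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            anomalous = std > 0 and abs(value - self.mean) > self.threshold * std
        # Welford's update: one pass, constant memory per stream.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return anomalous

# Hypothetical stream: connection attempts per second seen by a monitor.
stream = [10, 12, 11, 9, 10, 11, 10, 250, 12]
detector = StreamingAnomalyDetector()
flags = [detector.observe(v) for v in stream]
print(flags)
```

The detector keeps only three numbers per stream, so it never needs to store the readings themselves. Note that this sketch folds flagged readings into the running statistics, which inflates the variance afterwards; a practical system might exclude them or use a sliding window instead.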
Data Mining System Products and Research Prototypes
Data mining systems should be assessed based on the following features: data types, system
issues, data sources, data mining functions and methodologies, coupling of data mining with
database and/or data warehouse systems, scalability, visualization tools, and data mining
query language and graphical user interface.
6 Conclusions
This seminar report provided an overview of the data mining process, its techniques, and its
applications. The following conclusions can be drawn:
I. Data mining is a crucial step in the Knowledge Discovery in Databases process, but it
can only be performed after pre-processing and transformation.
II. Although the basic steps in data mining include data cleaning, selection, and
transformation, the functions and techniques are applied only in the vital step where
intelligent methods are used to detect patterns.
III. A model for data mining is useful for a company or a data mining practitioner, as it
helps in adopting a result-oriented approach.
IV. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is an effective model
that considers business requirements at every step.
V. Classification and clustering techniques are popular and easily applicable in data
mining; however, for classification we require prior characteristic information.
VI. Artificial Neural Networks can be deployed to detect patterns and make predictions,
which makes them capable tools in data mining. A feed forward neural network uses a
back propagation algorithm to train itself.
VII. The application of data mining techniques along with GIS techniques offers an
opportunity to explore various aspects of spatial data mining.
VIII. The growth of data available for processing, including multimedia elements and the
World Wide Web, leads to greater opportunities for data mining techniques. However,
the pre-processing, selection, and transformation need to be handled first.
7 References
[1] M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database
perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in
Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[3] W. J. Frawley, G. Piatetsky-Shapiro and C. J. Matheus, Knowledge Discovery in
Databases: An Overview. In G. Piatetsky-Shapiro et al. (eds.), Knowledge Discovery
in Databases. AAAI/MIT Press, 1991.
[4] J. Han and M. Kamber. Data Mining: Concepts and Techniques.
[5] Morgan T. Imielinski and H. Mannila. A database perspective on knowledge.
[6] G. Piatetsky-Shapiro, U. M. Fayyad, and P. Smyth. From data mining to
knowledge discovery: An overview.
[7] In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-
35. AAAI/MIT Press, 1996.
[8] G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in
Databases.AAAI/MIT Press, 1991.
[9] Portia A. Cerny, Data mining and Neural Networks from a Commercial Perspective
[10] Bharati M. Ramageri, Data Mining Techniques and applications.
[11] Dr. Yashpal Singh, Alok Singh Chauhan, Neural Networks in Data Mining.

  • 1. ii A Seminar Report On ARTIFICIAL NEURAL NETWORKS BASED DATA MINING TECHNIQUES Submitted in partial fulfilment of the requirements For the award of degree Of INTEGRATED DUAL DEGREE In COMPUTER SCIENCE AND ENGINEERING (With Specialization in Information Technology) Submitted by Vaibhav Dhattarwal CSE-IDD Enrolment No: 08211018 Under the guidance of DR. DURGA TOSHINWAL Professor ELECTRONICS AND COMPUTER ENGINEERING DEPARTMENT INDIAN INSTITUTE OF TECHNOLOGY ROORKEE ROORKEE-247667 AUGUST 2012
  • 2. iii Abstract This report presents an overview of Data Mining Techniques and some of the applications of these techniques in various utility networks. Companies have been collecting data for decades, building massive data warehouses in which to store it. Even though this data is available, very few companies have been able to realize the actual value stored in it. The question these companies are asking is how to extract this value. The answer is Data mining. There are many technologies available to data mining practitioners, including Artificial Neural Networks, Regression, and Decision Trees. Many practitioners are wary of Neural Networks due to their black box nature, even though they have proven themselves in many situations. This report also provides a brief overview of artificial neural networks and questions their position as an applicable tool in data mining.
  • 3. iv Table of Contents Page Abstract i Table of Contents ii List of Figures iii Chapter 1 Introduction 1 1.1 Objective of Seminar 2 Chapter 2 Data Mining 3 2.1 Data Mining Process 4 2.2 CRISP-DM Model 5 Chapter 3 Data Mining Techniques 7 3.1 Classification 7 3.2 Clustering 15 3.3 Regression 18 3.4 Association Rule 18 3.5 Neural Networks 18 Chapter 4 Neural Networks in Data Mining 20 4.1 Feed Forward Neural Network 21 4.2 Back Propagation Algorithm 21 Chapter 5 Applications of Data Mining Techniques 22 5.1 Specific Application Areas 22 5.2 Spatial Data Mining 24 5.3 Multimedia Data Mining 24 5.4 Web Mining 24 Chapter 6 Conclusion 26 References 27
  • 4. v List of Figures Figure Title Page 2.1 Knowledge Discovery in Databases Process 3 2.2 Cross Industry Standard Process for Data Mining 6 3.1 Formation of Clusters 8 3.2 Linear Regression 9 4.1 An Artificial Neural Network 10 4.2 A Feed Forward Neural Network 10 5.1 Spatial Data Mining 11 5.2 Process Chart for conducting Text Mining 12
  • 5. vi 1 Introduction The development of Information Technology has generated large amount of databases and huge data in various areas. Nowadays corporate and organizations are accumulating data at an enormous rate and from a very broad variety of sources such as customer transactions, credit card transactions, bank cash withdrawal to hourly weather data. A lot of relational database servers have been built to store such massive quantities of data. As the matter of fact, the data itself is critical to a company’s growth. It contains knowledge that could lead to important business decisions that bring business to the next level. These data has never been examined in a superficial manner. It is becoming data rich but knowledge poor. In other words “We are drowning in data, but starving for knowledge!” We need information but what we have is a huge amount of data flooding around companies, organizations even individuals. Because of the amount of data is so enormous that humans cannot process it fast enough to get the information out of it at the right time, the machine learning technology has been established to solve this problem potentially. The research in databases and information technology has given rise to an approach to store and manipulate this precious data for further decision making. Data mining is the term used to describe the process of extracting value from a database. A Data-warehouse is a location where information is stored. The type of data stored depends largely on the type of industry and the company. Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), a relatively young and interdisciplinary field of computer science, is the process that attempts to discover patterns in large data sets. It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. 
The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
  • 6. vii Data mining is the business of answering questions that you’ve not asked yet. Data mining reaches deep into databases. Data mining tasks can be classified into two categories: Descriptive and predictive data mining. Descriptive data mining provides information to understand what is happening inside the data without a predetermined idea. Predictive data mining allows the user to submit records with unknown field values, and the previous patterns discovered form the database. Data mining models can be categorized according to the tasks they perform: Classification and Prediction, Clustering, Association Rules. Classification and prediction is a predictive model, but clustering and association rules are descriptive models. The most common action in data mining is classification. It recognizes patterns that describe the group to which an item belongs. It does this by examining existing items that already have been classified and inferring a set of rules. Similar to classification is clustering. The major difference being that no groups have been predefined. Prediction is the construction and use of a model to assess the class of an unlabeled object or to assess the value or value ranges of a given object is likely to have. The next application is forecasting. This is different from predictions because it estimates the future value of continuous variables based on patterns within the data. Four things are required to data-mine effectively: high-quality data, the “right” data, an adequate sample size and the right tool. There are many tools available to a data mining practitioner. These include decision trees, various types of regression and neural networks. 1.1 Objective of the Seminar The introduction of Data Mining and a description of the Data Mining Process are presented in this seminar report. The objective of this seminar is to present an overview of Data Mining techniques that are in use and are applicable in various scenarios. 
The application of these techniques has also been discussed after an explanation of the implementation of the technique.
  • 7. viii 2 Data Mining Data mining is a process of extraction of useful information and patterns from huge data. It is also called as knowledge discovery process, knowledge mining from data, knowledge extraction or data /pattern analysis. In other words, it can be referred to as Knowledge- Discovery in Databases (KDD). It involves searching large volumes of data for patterns. Figure 2.1 Knowledge Discovery in Databases Process The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages: (1) Selection (2) Pre-processing (3) Transformation (4) Data Mining (5) Interpretation/Evaluation.
  • 8. ix 2.1 Data Mining Process Data Mining is performed on the following types of data  Relational databases  Data warehouses  Transactional databases  Advanced DB and information repositories o Object-oriented and object-relational databases o Spatial databases o Time-series data and temporal data o Text databases and multimedia databases o Heterogeneous and legacy databases Some of the steps involved in the Data Mining process are:  Data cleaning The task of this step is to remove noise and inconsistent data.  Data integration In this step, multiple data sources like the ones mentioned in the section above can be combined to an integrated collection of data.  Data selection All the data relevant to the analysis task is retrieved from the database in this step.  Data transformation The data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.  Data mining The critical step where intelligent methods are applied in order to extract data patterns.  Pattern evaluation This step is deployed to identify the truly interesting patterns representing knowledge based on certain measures.  Knowledge presentation In the final step, various visualization and knowledge representation techniques are used to present the mined knowledge to the user. Data mining has five main functions:  Classification: It infers the defining characteristics of a certain group.  Clustering: It identifies groups of items that share a particular characteristic. (Clus- tering differs from classification in that no predefining characteristic is given in clas- sification.)
  • 9. x  Association: It identifies relationships between events that occur at one time.  Sequencing: It is similar to association, except that the relationship exists over a period of time.  Forecasting: It estimates future values based on patterns within large sets of data. 2.2 Cross-Industry Standard Process for Data Mining (CRISP-DM) Model Figure 2.2 Cross-Industry Standard Process for Data Mining (CRISP-DM) 1. Business understanding - In the business understanding phase, it is a must to understand business objectives clearly and finding out what the client really want to achieve. Next, we have to assess the situation by finding about the resources, assumptions, constraints
  • 10. xi and other important factors. Then from the business objectives and current situations, we need to create goals to achieve the business objective within the current situation. Finally a good data mining plan has to be established to achieve both business and data mining goals. 2. Data understanding - This phase starts with initial data collection from available sources to get familiar with data. Data load and Data integration must be carried out to ensure successful data collection. Next, the “surface” properties of acquired data need to be examined carefully and reported. Then, the data need to be explored by tackling the data mining questions, which can be addressed using querying, reporting and visualization. Finally, we must check whether the acquired data is complete, and ensure that there are no missing values in the acquired data. 3. Data preparation - The data preparation normally consumes about 90% of the time. The outcome of the data preparation phase is the final data set. When the available data sources are identified, they need to be selected, cleaned, constructed and formatted into the desired form. 4. Modelling - Several modelling techniques are selected to be used for the prepared dataset. A test scenario must be generated to validate the model’s quality. One or more models are created by running the modelling tool on the prepared dataset. The created models need to be assessed carefully so that they meet business initiatives. 5. Evaluation - In the evaluation phase, the model results must be evaluated in the context of business objectives in the first phase. In this phase, new business requirements may be raised due to new patterns has been discovered in the model results or from other factors. Gaining business understanding is an iterative process in data mining. The final decision must be made in this step to move to the deployment phase. 6. 
Deployment - The knowledge or information gained through data mining process needs to be presented in such a way that it can be used, whenever it is desired. In this phase, the deployment, maintained and monitoring plans have to be created for deployment and future supports. From project point of view, the final evaluation of the project needs to summarize the project experiences and review the project to see what needs to be improved.
  • 11. xii 3 Data Mining Techniques 3.1. Classification Classification is the most commonly applied data mining technique, which employs a set of pre-classified examples to develop a model that can classify the population of records at large. . Classification is a classic data mining technique based on machine learning. Basically classification is used to classify each item in a set of data into one of predefined set of classes or groups. Fraud detection and credit risk applications are particularly well suited to this type of analysis. This approach frequently employs decision tree or neural network-based classification algorithms. The data classification process involves learning and classification. In Learning, the training data are analyzed by classification algorithm. In classification, test data are used to estimate the accuracy of the classification rules. If the accuracy is acceptable, the rules can be applied to the new data tuples Classification method makes use of mathematical techniques such as decision trees, linear programming, neural network and statistics. In classification, we make the software that can learn how to classify the data items into groups. The classifier-training algorithm uses these pre-classified examples to determine the set of parameters required for proper discrimination. The algorithm then encodes these parameters into a model called a classifier. Types of classification models: Classification by decision tree induction Bayesian Classification Support Vector Machines (SVM) Classification Based on Associations 3.2. Clustering Clustering can be defined as identification of similar classes of objects. Clustering is a data mining technique that makes meaningful or useful cluster of objects that have similar characteristic using automatic technique. By using clustering techniques we can further identify dense and sparse regions in object space and can discover overall distribution pattern and correlations among data attributes. 
Due to the fact that classification approach can become costly, Clustering can be used as pre-processing approach for attribute subset
  • 12. xiii selection and classification. In clustering technique, the classes are defined and accordingly objects are put in them, whereas in classification objects are assigned into predefined classes. Figure 3.1 Formation of clusters Types of clustering methods: Partitioning Methods Hierarchical methods Density based methods Grid-based methods Model-based methods 3.3. Regression Regression analysis helps in understanding how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Regression analysis estimates the conditional expectation of the dependent variable given the independent variables. In other words, it estimates the average value of the dependent variable when the independent variables are fixed. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.
  • 13. xiv Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which independent variables are related to the dependent variable, and to explore the forms of these relationships. In data mining, independent variables are attributes already known and response variables are what we want to predict. Real-world problems are very difficult to predict because they may depend on complex interactions of multiple predictor variables. Therefore, more complex techniques (e.g., logistic regression, decision trees, or neural nets) may be necessary to forecast future values. The same model types can often be used for both regression and classification. For example, the CART (Classification and Regression Trees) decision tree algorithm can be used to build both classification trees (to classify categorical response variables) and regression trees (to forecast continuous response variables). Neural networks too can create both classification and regression models. Figure 3.2 Linear Regression Types of regression methods Linear Regression Multivariate Linear Regression Nonlinear Regression Multivariate Nonlinear Regression 3.4. Association Rule Association is one of the best known data mining technique. In association, a pattern is discovered based on a relationship of a particular item on other items in the same transaction.
  • 14. xv Association and correlation analysis is usually used to find frequent item sets in large data sets. This type of finding helps businesses to make certain decisions, such as catalogue design, cross marketing and customer shopping behaviour analysis. Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split up into two separate steps: first, minimum support is applied to find all frequent item sets in a database; second, these frequent item sets and the minimum confidence constraint are used to form rules. Association rule algorithms need to be able to generate rules with confidence values less than one. However, the number of possible association rules for a given dataset is generally very large, and a high proportion of the rules are usually of little (if any) value.
Types of association rule:
- Multilevel association rule
- Multidimensional association rule
- Quantitative association rule
3.5. Neural networks
An Artificial Neural Network (ANN), usually called a neural network (NN), is a mathematical or computational model that is inspired by the structure and functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are non-linear statistical data modelling tools. They are usually used to model complex relationships between inputs and outputs or to find patterns in data. A neural network is a set of connected input/output units in which each connection has an associated weight.
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class labels of the input tuples. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. They are well suited for continuous-valued inputs and outputs. Neural networks are best at identifying patterns or trends in data and are well suited for prediction or forecasting needs.
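The idea of learning by adjusting weights can be illustrated with the simplest possible case: a single artificial neuron trained with the classic perceptron learning rule. This is a deliberately simplified stand-in, not the multi-layer training method discussed later in the report; the function names and the toy data set (logical AND) are assumptions for illustration.

```python
def predict(weights, bias, x):
    """Fire (output 1) when the weighted sum of the inputs exceeds zero."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0

def train(data, epochs=20, lr=1):
    """Perceptron rule: nudge each weight by lr * error * input."""
    weights = [0] * len(data[0][0])
    bias = 0
    for _ in range(epochs):
        for x, label in data:
            error = label - predict(weights, bias, x)  # 0 when the prediction is correct
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Toy input tuples with class labels (logical AND), chosen only for illustration.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(data)
predictions = [predict(weights, bias, x) for x, _ in data]
```

Because the weights change only when a tuple is misclassified, training settles once every tuple receives its correct class label.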
  • 15. xvi
4 Neural Networks in Data Mining
Neural networks are non-linear statistical data modelling tools. They can be used to model complex relationships between inputs and outputs, or to find patterns in data and to infer rules from them. Neural networks are useful in providing information on associations, classifications, clusters, and forecasting. Using neural networks as a tool, data warehousing firms can harvest information from datasets in the data mining process. Neural networks are programmed to store, recognize, and associatively retrieve patterns or database entries; to solve combinatorial optimization problems; to filter noise from measurement data; to control ill-defined problems; in summary, to estimate sampled functions when we do not know the form of the functions. These two abilities, pattern recognition and function estimation, make neural networks a very prevalent utility in data mining. With their model-free estimators and their dual nature, neural networks serve data mining in a variety of ways.
Figure 4.1 An Artificial Neural Network
Neural networks, depending on the architecture, provide associations, classifications, clusters, prediction and forecasting to the data mining industry. Neural networks essentially comprise
  • 16. xvii three pieces: the architecture or model; the learning algorithm; and the activation functions. With neural networks, we can mine valuable information from a mass of historical information so that it can be used efficiently in financial areas. Hence, the applications of neural networks in financial forecasting have become very popular.
4.1. Feed forward Neural Network:
Figure 4.2 A Feed Forward Neural Network
One of the simplest feed forward neural networks (FFNN), shown in Figure 4.2, consists of three layers: an input layer, a hidden layer and an output layer. In each layer there are one or more processing elements (PEs). PEs are meant to simulate the neurons in the brain, which is why they are often referred to as neurons or nodes. A PE receives inputs from either the outside world or the previous layer. The connections between PEs in adjacent layers each have a weight (parameter) associated with them. This weight is adjusted during training. Information only travels in the forward direction through the network; there are no feedback loops. The simplified process for training a FFNN is as follows:
  • 17. xviii
1. Input data is presented to the network and propagated through the network until it reaches the output layer. This forward process produces a predicted output.
2. The predicted output is subtracted from the actual (target) output and an error value for the network is calculated.
3. The neural network then uses supervised learning, which in most cases is back propagation, to train the network. Back propagation is a learning algorithm for adjusting the weights. It starts with the weights between the output layer PEs and the last hidden layer PEs and works backwards through the network.
4. Once back propagation has finished, the forward process starts again, and this cycle is continued until the error between predicted and actual outputs is minimized.
4.2. The Back Propagation Algorithm:
Back propagation, or backward propagation of errors, is a common method of training artificial neural networks so as to minimize an objective function. It performs learning on layered feed-forward ANNs: the artificial neurons are organized in layers and send their signals “forward”, and then the errors are propagated backwards. The back propagation algorithm uses supervised learning, which means that we provide the algorithm with examples of the inputs and outputs we want the network to compute, and then the error (the difference between actual and expected results) is calculated. The idea of the back propagation algorithm is to reduce this error until the ANN learns the training data.
Algorithm for a 3-layer network:
    Initialize the weights in the network
    Do
        For each example e in the training set
            O = neural-net-output(network, e)   ; forward pass
            T = teacher output for e
            Calculate error (T - O) at the output units
  • 18. xix
            Compute delta_wh for all weights from hidden layer to output layer   ; backward pass
            Compute delta_wi for all weights from input layer to hidden layer   ; backward pass continued
            Update the weights in the network
    Until all examples classified correctly or stopping criterion satisfied
    Return the network
The Back Propagation learning algorithm can be divided into two phases:
Phase 1: Propagation
Every propagation involves the following steps:
1. Forward propagation of a training pattern's input through the neural network to generate the output activations.
2. Backward propagation of those output activations through the neural network, using the training pattern's target, to compute the deltas (error terms) of the output and hidden neurons.
Phase 2: Weight update
For each weight (synapse) the following steps are used:
1. Multiply its output delta and input activation to get the gradient of the weight.
2. Move the weight in the opposite direction of the gradient by subtracting a fraction of the gradient (the learning rate) from the weight.
Repeat phases 1 and 2 until the performance of the network is satisfactory.
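The two phases above can be sketched in plain Python for a small 3-layer network (2 inputs, 2 hidden units, 1 output). This is an illustrative sketch rather than a reference implementation: the sigmoid activation, the squared-error objective, the learning rate of 0.5, the fixed starting weights and the toy training set (logical OR) are all assumptions not taken from the report. Because the delta here is built from (T - O), it already points downhill, so adding `lr * delta * activation` is the same as subtracting a fraction of the error gradient.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(net, x):
    """Phase 1, step 1: forward pass. Returns hidden activations and the output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    return hidden, out

def train_step(net, x, target, lr):
    """One forward pass, one backward pass and one weight update for one example."""
    hidden, out = forward(net, x)
    # Phase 1, step 2: output delta = error (T - O) times sigmoid derivative O(1 - O).
    delta_out = (target - out) * out * (1.0 - out)
    # Hidden deltas: output delta sent back through the (not yet updated) output weights.
    delta_hidden = [delta_out * w * h * (1.0 - h)
                    for w, h in zip(net["w_out"], hidden)]
    # Phase 2: each weight moves by lr * (its output delta * its input activation).
    for j, h in enumerate(hidden):
        net["w_out"][j] += lr * delta_out * h
    net["b_out"] += lr * delta_out
    for j, d in enumerate(delta_hidden):
        for i, xi in enumerate(x):
            net["w_hidden"][j][i] += lr * d * xi
        net["b_hidden"][j] += lr * d

def mean_squared_error(net, data):
    return sum((t - forward(net, x)[1]) ** 2 for x, t in data) / len(data)

# Toy training set (logical OR), chosen only for illustration.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

# Fixed, arbitrary starting weights so that the run is repeatable.
net = {
    "w_hidden": [[0.1, -0.2], [0.3, 0.4]],
    "b_hidden": [0.0, 0.0],
    "w_out": [0.2, -0.1],
    "b_out": 0.0,
}

initial_error = mean_squared_error(net, data)
for epoch in range(5000):          # stopping criterion: a fixed number of epochs
    for x, target in data:
        train_step(net, x, target, lr=0.5)
final_error = mean_squared_error(net, data)
predictions = [round(forward(net, x)[1]) for x, _ in data]
```

A fixed epoch count is used as the stopping criterion to keep the sketch short; in practice the loop would more likely stop once the error falls below a threshold or stops improving on held-out data.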
  • 19. xx
5 Applications of Data Mining
Data mining is a relatively new technology that has not fully matured. Despite this, a number of industries are already using it on a regular basis, including retail stores, hospitals, banks, and insurance companies. Many of these organizations are combining data mining with statistics, pattern recognition, and other important tools. Data mining can be used to find patterns and connections that would otherwise be difficult to find. This technology is popular with many businesses because it allows them to learn more about their customers and make smart marketing decisions. Data mining has a number of applications. The first is market segmentation, with which we can find behaviours that are common among customers, for example customers who tend to purchase the same products at the same time. Another application is customer churn analysis, which allows us to estimate which customers are the most likely to stop purchasing our products or services and go to one of our competitors. In addition, a company can use data mining to find out which purchases are the most likely to be fraudulent. For example, by using data mining in retail stores, we may be able to determine which products are stolen the most, so that steps can be taken to protect those products and detect those who are stealing them. We can also use data mining to determine the effectiveness of interactive marketing. Some customers will be more likely to purchase products online than offline, and we must identify them. While many businesses use data mining to help increase their profits, it can also be used to create new businesses and industries. One industry that can be created by data mining is the automatic prediction of both behaviours and trends.
Using this automated prediction we can have an advantage over the competition. Instead of simply guessing what the next big trend will be, we can determine it based on statistics, patterns, and logic. Another application of automatic prediction is to use data mining to look at the past marketing strategies to
  • 20. xxi determine the best one so far and the reason for it being the best. We can avoid repeating mistakes that occurred in previous marketing campaigns. Data mining is also a powerful tool for those who deal with finances. A financial institution such as a bank can predict the number of defaults that will occur among its customers within a given period of time, and it can also predict the amount of fraud that will occur. Another application of data mining is the automatic recognition of patterns that were not previously known. While data mining is a very valuable tool, it is important to realize that it is not a complete solution. Even a fully automated technology will not guarantee the success of the company. However, it will tip the odds in our favour.
5.1. Specific Application Areas of Data Mining
Data Mining for Financial Data Analysis
A few typical cases:
- Design and construction of data warehouses for multidimensional data analysis.
- Loan payment prediction and customer credit policy analysis.
- Classification and clustering of customers for targeted marketing.
- Detection of money laundering and other financial crimes.
Data Mining for the Retail Industry
A few examples of data mining in the retail industry:
- Design and construction of data warehouses based on the benefits of data mining
- Multidimensional analysis of sales, customers, products, time, and region
- Analysis of the effectiveness of sales campaigns
- Customer retention: analysis of customer loyalty
- Product recommendation and cross-referencing of items
Data Mining for the Telecommunication Industry
- Multidimensional analysis of telecommunication data.
- Fraudulent pattern analysis and the identification of unusual patterns.
- Multidimensional association and sequential pattern analysis.
  • 21. xxii
- Mobile telecommunication services.
- Use of visualization tools in telecommunication data analysis.
Data Mining for Biological Data Analysis
- Semantic integration of heterogeneous, distributed genomic and proteomic databases.
- Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide/protein sequences.
- Discovery of structural patterns and analysis of genetic networks and protein pathways.
- Association and path analysis: identifying co-occurring gene sequences and linking genes to different stages of disease development.
- Visualization tools in genetic data analysis.
Data Mining in Other Scientific Applications
Data collection and storage technologies have recently improved, so that today scientific data can be amassed at much higher speeds and lower costs. This has resulted in the accumulation of huge volumes of high-dimensional data, stream data, and heterogeneous data, containing rich spatial and temporal information. Consequently, scientific applications are shifting from the “hypothesize-and-test” paradigm toward a “collect and store data, mine for new hypotheses, confirm with data or experimentation” process. This shift brings about new challenges for data mining.
5.2. Spatial Data Mining
A spatial database stores a large amount of space-related data, such as maps, pre-processed remote sensing or medical imaging data, and VLSI chip layout data. Spatial data mining is the application of data mining methods to spatial data: the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. Its end objective is to find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, but data mining now offers great potential benefits for GIS-based applied decision-making.
  • 22. xxiii
Figure 5.1 Spatial Data Mining
Spatial Data Cube Construction and Spatial OLAP
As with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision-making processes.
There are three types of dimensions in a spatial data cube:
- A nonspatial dimension
- A spatial-to-nonspatial dimension
- A spatial-to-spatial dimension
We can distinguish two types of measures in a spatial data cube:
- A numerical measure contains only numerical data
- A spatial measure contains a collection of pointers to spatial objects
  • 23. xxiv Spatial database systems usually handle vector data that consist of points, lines, polygons (regions), and their compositions, such as networks or partitions. Typical examples of such data include maps, design graphs, and 3-D representations of the arrangement of the chains of protein molecules.
5.3. Multimedia Data Mining
A multimedia database system stores and manages a large collection of multimedia data, such as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text mark-ups, and linkages.
Similarity Search in Multimedia Data
When searching for similarities in multimedia data, we can search on either the data description or the data content. Signature-based approaches to content search include:
- Colour histogram-based signature
- Multi-feature composed signature
- Wavelet-based signature
- Wavelet-based signature with region-based granularity
Multidimensional Analysis of Multimedia Data
To facilitate the multidimensional analysis of large multimedia databases, multimedia data cubes can be designed and constructed in a manner similar to that for traditional data cubes from relational data. A multimedia data cube can contain additional dimensions and measures for multimedia information, such as colour, texture, and shape.
Classification and Prediction Analysis of Multimedia Data
Classification and predictive modelling can be used for mining multimedia data, especially in scientific research, such as astronomy, seismology, and geo-scientific research.
Mining Associations in Multimedia Data:
- Associations between image content and non-image content features
- Associations among image contents that are not related to spatial relationships
- Associations among image contents related to spatial relationships
  • 24. xxv
Audio and Video Data Mining
An extraordinary amount of audiovisual information is becoming available in digital form, in digital archives, on the World Wide Web, in broadcast data streams, and in personal and professional databases, and hence there is a need to mine it.
Visual data mining discovers implicit and useful knowledge from large data sets using data and/or knowledge visualization techniques. In general, data visualization and data mining can be integrated in the following ways:
- Data visualization
- Data mining result visualization
- Data mining process visualization
- Interactive visual data mining
5.4. Web Mining
Figure 5.2 Process Chart for conducting Text Mining
Text Data Analysis and Information Retrieval
Information retrieval (IR) is a field that has been developing in parallel with database systems for many years.
Basic Measures for Text Retrieval: Precision and Recall
  • 25. xxvi Precision: This is the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses). It is formally defined as
\[ \text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|} \]
Recall: This is the percentage of documents that are relevant to the query and were in fact retrieved. It is formally defined as
\[ \text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|} \]
where {Relevant} is the set of documents relevant to the query and {Retrieved} is the set of documents retrieved.
Text Retrieval Methods:
1) Document selection methods
2) Document ranking methods
Text Indexing Techniques:
1) Inverted indices
2) Signature files
Query Processing Techniques: Once an inverted index is created for a document collection, a retrieval system can answer a keyword query quickly by looking up which documents contain the query keywords.
Ways of Dimensionality Reduction for Text:
- Latent Semantic Indexing
- Locality Preserving Indexing
- Probabilistic Latent Semantic Indexing
Probabilistic Latent Semantic Indexing schemas:
- Keyword-Based Association Analysis
- Document Classification Analysis
- Document Clustering Analysis
Mining the World Wide Web
The World Wide Web serves as a huge, widely distributed, global information service centre for news, advertisements, consumer information, financial management, education, government, e-commerce, and many other information services. The Web also contains a rich
  • 26. xxvii and dynamic collection of hyperlink information and Web page access and usage information, providing rich sources for data mining.
Challenges:
- The Web seems to be too huge for effective data warehousing and data mining
- The complexity of Web pages is far greater than that of any traditional text document collection
- The Web is a highly dynamic information source
- The Web serves a broad diversity of user communities
- Only a small portion of the information on the Web is truly relevant or useful
Besides mining Web contents and Web linkage structures, another important task for Web mining is Web usage mining.
Data Mining for Intrusion Detection
The security of our computer systems and data is at continual risk. The extensive growth of the Internet and the increasing availability of tools and tricks for intruding and attacking networks have prompted intrusion detection to become a critical component of network administration. Some areas in which data mining technology may be applied or further developed for intrusion detection:
1. Development of data mining algorithms for intrusion detection
2. Association and correlation analysis, and aggregation to help select and build discriminating attributes
3. Analysis of stream data
4. Distributed data mining
5. Visualization and querying tools
Data Mining System Products and Research Prototypes
Data mining systems should be assessed based on the following features: data types; system issues; data sources; data mining functions and methodologies; coupling of data mining with database and/or data warehouse systems; scalability; visualization tools; and data mining query language and graphical user interface.
  • 27. xxviii
6 Conclusions
This seminar report provided an overview of the data mining process, its techniques and applications. The following conclusions can be drawn:
I. Data mining is a crucial step in the Knowledge Discovery in Databases process but can only be performed after pre-processing and transformation.
II. Although the basic steps in data mining include data cleaning, selection and transformation, the functions and techniques are only applied in the vital step where intelligent methods are used to detect patterns.
III. A model for data mining is useful for a company or a data mining practitioner as it helps in adopting a result-oriented approach.
IV. The Cross Industry Standard Process for Data Mining (CRISP-DM) model is an effective approach, as it considers business requirements at every step.
V. Classification and clustering techniques are popular and easily applicable in data mining; however, classification requires prior information about the classes.
VI. Artificial neural networks can be deployed to detect patterns and make predictions, which makes them capable tools in data mining. A feed forward neural network can be trained with the back propagation algorithm.
VII. The application of data mining techniques along with GIS techniques offers an opportunity to explore various aspects of spatial data mining.
VIII. The growth of data available for processing, including multimedia elements and the World Wide Web, leads to greater opportunities for data mining techniques. However, pre-processing, selection and transformation need to be handled first.
  • 28. xxix
7 References
[1] M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[3] W. J. Frawley, G. Piatetsky-Shapiro and C. J. Matheus. Knowledge Discovery in Databases: An Overview. In G. Piatetsky-Shapiro et al. (eds.), Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
[4] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann.
[5] T. Imielinski and H. Mannila. A database perspective on knowledge discovery.
[6] G. Piatetsky-Shapiro, U. M. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview.
[7] In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
[8] G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
[9] Portia A. Cerny. Data mining and Neural Networks from a Commercial Perspective.
[10] Bharati M. Ramageri. Data Mining Techniques and applications.
[11] Dr. Yashpal Singh, Alok Singh Chauhan. Neural Networks in Data Mining.