SUBMITTED BY:
ANKITA AGARWAL
DEEPIKA RAIPURIA
C-11
MODY INSTITUTE OF TECH. AND SCIENCE

PROJECT GUIDE:
MR. JARNAIL SINGH
FACULTY, FET
MODY INSTITUTE OF TECH. AND SCIENCE
ACKNOWLEDGEMENT
We wish to express our deepest gratitude to Mr. Jarnail Singh for his utmost
concern, guidance and encouragement during our major project. We would like
to thank him for his pleasant and insightful interaction with us. We are extremely
grateful for his constant guidance and encouragement and for his help in
gaining access to the resources essential for the successful completion of this
assignment. We would also like to thank him for sharing his valuable knowledge
and resources with us and for showing the utmost co-operation and understanding.
ANKITA AGARWAL
DEEPIKA RAIPURIA
MITS, LAXMANGARH
EXECUTIVE SUMMARY
Data mining is the process of posing various queries and extracting useful information,
patterns and trends, often previously unknown, from large quantities of data, possibly
stored in databases. The goals of data mining include detecting abnormal patterns and
predicting the future based on past experiences and current trends. Data mining
techniques include those based on rough sets, inductive logic programming,
machine learning and neural networks.
Data mining problems include:
Classification: finding rules to partition data into groups.
Association: finding rules to make associations between data.
Sequencing: finding rules to order data.
Data mining is also referred to as knowledge mining from databases, knowledge
extraction, data/pattern analysis, data archaeology and data dredging.
Another popular term for it is knowledge discovery in databases (KDD). It consists of
an iterative sequence of the following steps:
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation
Data can be stored in many types of databases. One database architecture is the data
warehouse, a repository of multiple, heterogeneous data sources organized under a
unified schema. Building a warehouse involves data cleaning, data integration, and
on-line analytical processing (OLAP). Operational relational database systems, in
addition, provide user interfaces, optimized query processing, and transaction
management; the efficient mechanism for such day-to-day operations is on-line
transaction processing (OLTP).
BRIEF CONTENTS
S. No. DESCRIPTION
1. DATA MINING - Knowledge discovery in databases
2. DATA WAREHOUSE - OLAP technology for data mining
3. MINING ASSOCIATION RULES - In large databases
4. CLASSIFICATION AND PREDICTION
5. REFERENCES
1.0 DATA MINING - Knowledge discovery in databases
1.1 Motivation: Why data mining?
“Necessity Is the Mother of Invention”
1.1.1 Data explosion problem:
Data mining has attracted a great deal of attention in the information industry in
recent years due to the wide availability of huge amounts of data and the imminent
need for turning such data into useful information and knowledge. The information
and knowledge gained can be used for applications ranging from business management,
production control, and market analysis to engineering design and science
exploration. Automated data collection tools and mature database technology lead to
tremendous amounts of data accumulated and/or to be analyzed in databases, data
warehouses, and other information repositories. We are drowning in data, but starving
for knowledge! Data mining can be viewed as a result of the natural evolution of
information technology.
1.2 Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web databases
 2000s:
 Stream data management and mining
 Data mining with a variety of applications
 Web technology and global information systems
1.2.1 Solution: Data warehousing and data mining
 Data warehousing and on-line analytical processing (OLAP):
A data warehouse is a repository of multiple heterogeneous data sources, organized
under a unified schema at a single site in order to facilitate management decision
making. It is built through data cleaning and data integration and supports OLAP,
a set of analysis techniques with functionalities such as summarization,
consolidation, and aggregation, as well as the ability to view information from
different angles.
 Mining interesting knowledge (rules, regularities, patterns, constraints)
from data in large databases.
1.3 What is data mining?
Data mining refers to extracting or “mining” knowledge from large amounts of data. The
term is actually a misnomer. Mining of gold from rocks or sand is referred to as gold
mining rather than rock or sand mining. Thus data mining should have been appropriately
named “knowledge mining from data,” which is unfortunately somewhat long.
Nevertheless, mining is a vivid term characterizing the process that finds a small set of
precious nuggets from a great deal of raw material. Thus such a misnomer, carrying
both “data” and “mining,” became a popular choice. There are many other terms with
a similar or slightly different meaning to data mining, such as knowledge mining from
databases, knowledge extraction, data or pattern analysis, data archaeology and data
dredging. Many people treat data mining as a synonym for another popularly used term,
“knowledge discovery in databases,” or KDD. It consists of an iterative sequence of
the following steps:
Fig. 1.1: Data mining as the core of the knowledge discovery process. Data from
databases and a data warehouse pass through data cleaning and integration, selection
of task-relevant data, data mining, and pattern evaluation.
Main steps are:
1. Data cleaning: to remove noise and inconsistent data.
2. Data integration: multiple data sources are combined.
3. Data selection: data relevant to the analysis task are retrieved from the
database.
4. Data reduction and transformation: data are transformed or consolidated
into forms appropriate for mining by performing summary or aggregation
operations; useful features are found through dimensionality or variable
reduction and invariant representation.
5. Data mining: an essential process where intelligent methods and mining
algorithms are applied in order to extract data patterns.
6. Pattern evaluation: to identify the truly interesting patterns representing
knowledge, based on some interestingness measures.
7. Knowledge presentation: visualization and knowledge representation
techniques are used to present the mined knowledge to the user.
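To make these steps concrete, here is a toy walk-through in Python; all data, thresholds and attribute names below are invented for illustration and are not part of the original report:

# Toy walk-through of the KDD steps on invented in-memory data.
# Each numbered comment mirrors one step in the list above.

# 1-2. Data cleaning and integration: combine two sources, drop noisy records.
source_a = [{"cust": 1, "age": 25, "spend": 200}, {"cust": 2, "age": None, "spend": 50}]
source_b = [{"cust": 3, "age": 40, "spend": 900}]
integrated = [r for r in source_a + source_b if r["age"] is not None]  # cleaning

# 3. Data selection: keep only the attributes relevant to the task.
selected = [{"age": r["age"], "spend": r["spend"]} for r in integrated]

# 4. Data transformation: consolidate by discretizing age into groups.
transformed = [{"age_group": "young" if r["age"] < 30 else "adult",
                "spend": r["spend"]} for r in selected]

# 5. Data mining: apply a (deliberately simple) method to extract a pattern.
by_group = {}
for r in transformed:
    by_group.setdefault(r["age_group"], []).append(r["spend"])
pattern = {g: sum(v) / len(v) for g, v in by_group.items()}  # mean spend per group

# 6-7. Pattern evaluation (interestingness threshold) and presentation.
interesting = {g: m for g, m in pattern.items() if m > 100}
print("Average spend per age group (interesting only):", interesting)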
1.4 Architecture: Typical Data Mining System
The architecture of a typical data mining system has the following major
components, as shown in Fig. 1.2.
Fig. 1.2: Architecture of a typical data mining system: a graphical user interface
on top of a pattern evaluation module and a data mining engine, supported by a
knowledge base, with a database or data warehouse server drawing on databases and
a data warehouse through data cleaning, data integration and filtering.
1.4.1 Database, Data warehouse, or other information repository:
This is one or a set of databases, data warehouses, spreadsheets or other kinds of
information repositories. Data cleaning and data integration are performed on the data.
1.4.2 Database or data warehouse server:
The database or data warehouse server is responsible for fetching the relevant data,
based on the user's data mining request.
1.4.3 Knowledge base:
This is the domain knowledge that is used to guide the search, or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
1.4.4 Data mining engine:
This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as characterization, association, classification,
cluster analysis, and evolution and deviation analysis.
1.4.5 Pattern evaluation module:
This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search towards interesting patterns.
1.4.6 Graphical User Interface:
This module communicates between users and the data mining system, allowing the user
to interact with the system by specifying a data mining query or task, providing
information to help focus the search and performing exploratory data mining based on the
intermediate data mining results.
1.5 Data mining-on what kind of data?
1.5.1 Relational databases:
A relational database is a collection of tables, each of which is assigned a unique
name. Each table consists of a set of attributes and usually stores a large set of
tuples. Each tuple in a relational table represents an object identified by a unique
key and described by a set of attribute values.
1.5.2 Data warehouses:
A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and which usually resides at a single site. Data
warehouses are constructed via a process of data cleaning, data transformation,
data integration, data loading and periodic data refreshing.
1.5.3 Transactional databases:
A transactional database consists of a file where each record represents a
transaction. A transaction typically includes a unique transaction identity
number (trans_ID) and a list of the items making up the transaction.
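For illustration, such a transactional database might be represented in Python as a mapping from each trans_ID to its item list; the IDs and items below are invented, and the association-mining sketches later in this report assume this kind of record:

# A transactional database: each record is a transaction identifier (trans_ID)
# plus the list of items making up the transaction. Items are hypothetical.
transactions = {
    "T100": ["bread", "milk"],
    "T200": ["bread", "diapers", "beer"],
    "T300": ["milk", "diapers", "beer"],
    "T400": ["bread", "milk", "diapers", "beer"],
}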
1.5.4 Advanced database systems and advanced database applications:
With the advances of database technology, various advanced database systems
have emerged and are undergoing development to address the requirements of
new database applications. The new database applications include:
 Object-oriented databases
 Object-relational databases
 Spatial databases
 Temporal databases and time-series databases
 Text databases and multimedia databases
 Heterogeneous databases and legacy databases
 The World Wide Web
1.6 Data Mining Functionalities:
Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks. They can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make
predictions.
1.6.1 Concept /Class Description: Characterization and Discrimination
Data can be associated with classes or concepts. It can be useful to describe individual
classes or concepts in summarized, concise, and yet precise terms. Such descriptions of a
class or a concept are called class/concept descriptions. These descriptions can be
derived via:
Data Characterization: It is a summarization of the general characteristics or features of a
target class of data. The data corresponding to the user-specified class are typically
collected by a database query.
Data Discrimination: It is a comparison of the general features of target class data
objects with the general features of objects from one or a set of contrasting classes. The
target and contrasting classes can be specified by the user, and the corresponding data
objects retrieved through database queries.
1.6.2 Association analysis:
Association analysis is the discovery of association rules showing attribute value
conditions that occur frequently together in a given set of data. Association analysis is
widely used for market basket or transaction data analysis.
More formally, association rules are of the form X ⇒ Y, that is,
“A1 ∧ … ∧ Am ⇒ B1 ∧ … ∧ Bn”, where Ai (for i in {1, …, m}) and Bj (for j in {1, …, n})
are attribute-value pairs. The association rule X ⇒ Y is interpreted as “database tuples
that satisfy the conditions in X are also likely to satisfy the conditions in Y.”
For example, buys(X, “computer”) ⇒ buys(X, “software”) states that customers who buy
a computer also tend to buy software.
1.6.3 Classification and Prediction:
Classification is the process of finding a set of models that describe and distinguish data
classes or concepts, for the purpose of using the model to predict the class of objects
whose class label is unknown. The derived model is based on the analysis of a set of
training data (data objects whose class label is known). Classification can be used for
predicting the class label of data objects. Prediction refers to both data value prediction
and class label prediction. It also encompasses the identification of distribution trends
based on the available data.
1.6.4 Cluster Analysis:
Unlike classification and prediction, clustering analyses data objects without consulting a
known class label. It can be used to generate such labels. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and minimizing
the interclass similarity. That is, clusters of objects are formed so that objects within a
cluster have high similarity in comparison to one another, but are very dissimilar to
objects in other clusters. Each cluster that is formed can be viewed as a class of objects,
from which rules can be derived.
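As a minimal illustration of this principle, the following k-means-style sketch (one clustering method among many; the one-dimensional data and the choice k = 2 are invented) groups points by similarity without consulting any class labels:

# One-dimensional k-means sketch: assign each point to its nearest center,
# then recompute centers, so intra-cluster similarity grows while
# inter-cluster similarity shrinks. No class labels are used.
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]      # invented data
centers = [points[0], points[3]]              # naive initialization, k = 2

for _ in range(10):                           # a few refinement passes
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda c: abs(p - centers[c]))
        clusters[nearest].append(p)
    # Recompute each center as its cluster mean (keep old center if empty).
    centers = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]

print(clusters)    # e.g. [[1.0, 1.2, 0.8], [8.0, 8.3, 7.9]]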
1.6.5 Outlier Analysis:
A database may contain data objects that do not comply with the general behavior or the
model of the data. These data objects are outliers. Most data mining methods discard
outliers as noise or exceptions. However, in some applications such as fraud detection,
the rare events can be more interesting than the more regularly occurring ones. The
analysis of outlier data is referred to as outlier mining.
1.7 Are all of the patterns interesting?
A data mining system has the potential to generate thousands or even millions of patterns,
or rules.
A pattern is interesting if it is:
 Easily understood by humans.
 Valid on new or test data with some degree of certainty.
 Potentially useful.
 Novel.
Several objective measures of pattern interestingness exist. These are based on the
structure of discovered patterns and the statistics underlying them. An objective
measure for association rules of the form X ⇒ Y is rule support, representing the
percentage of transactions from a transaction database that the given rule satisfies.
This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction
contains both X and Y, that is, the union of itemsets X and Y.
Another objective measure for association rules is confidence, which assesses the
degree of certainty of the detected association. This is taken to be the conditional
probability P(Y|X), that is, the probability that a transaction containing X also
contains Y.
Support and confidence are defined as:
Support(X ⇒ Y) = P(X ∪ Y)
Confidence(X ⇒ Y) = P(Y|X)
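A small Python sketch of these two measures over a toy transaction database (the items are invented) may make the definitions concrete:

# Compute rule support and confidence over a toy transaction database.
# support(X => Y)    = P(X U Y): fraction of transactions containing X and Y.
# confidence(X => Y) = P(Y | X): fraction of X-transactions also containing Y.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]

def support(X, Y):
    both = X | Y                          # the union of itemsets X and Y
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    n_x = sum(X <= t for t in transactions)
    n_xy = sum((X | Y) <= t for t in transactions)
    return n_xy / n_x if n_x else 0.0

print(support({"diapers"}, {"beer"}))      # 3/4 = 0.75, i.e. 75%
print(confidence({"diapers"}, {"beer"}))   # 3/3 = 1.0, i.e. 100%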
2.0 DATA WAREHOUSE - OLAP technology for data mining
Data warehousing provides architectures and tools for business executives to
systematically organize, understand, and use their data to make strategic decisions.
A large number of organizations have found that data warehouse systems are valuable
tools in today's competitive, fast-evolving world. In the last several years, many
firms have spent millions of dollars in building enterprise-wide data warehouses.
A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile
collection of data in support of management's decision making process.
These four keywords (subject-oriented, integrated, time-variant and nonvolatile)
distinguish data warehouses from other data repository systems, such as relational
database systems, transaction processing systems, and file systems.
 Subject-Oriented: A data warehouse is organized around major subjects, such as
customer, supplier, product and sales. Rather than concentrating on the day-to-day
operations and transaction processing of an organization, a data warehouse
focuses on the modeling and analysis of data for decision makers.
 Integrated: A data warehouse is usually constructed by integrating multiple
heterogeneous sources, such as relational databases, flat files, and on-line
transaction records. Data cleaning and data integration techniques are applied to
ensure consistency in naming conventions, encoding structures, attribute
measures, and so on.
 Time Variant: Data are stored to provide information from a historical
perspective. Every key structure in the data warehouse contains, either implicitly
or explicitly, an element of time.
 Nonvolatile: A data warehouse is always a physically separate store of data
transformed from the application data found in the operational environment. Due
to this separation, a data warehouse does not require transaction processing,
recovery, and concurrency control mechanisms. It usually requires only two
operations in data accessing: initial loading of data and access of data.
2.1 Differences between Operational Database Systems and Data
Warehouses:
The major task of on-line operational database systems is to perform on-line transaction
and query processing. These systems are called on-line transaction processing (OLTP)
systems. They cover most of the day-to-day operations of an organization, such as
purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
Data warehouse systems, on the other hand, serve users or knowledge workers in the role
of data analysis and decision making. Such systems can organize and present data in
various formats in order to accommodate the diverse needs of the different users. These
systems are known as on-line analytical processing (OLAP) systems. The major
distinguishing features between OLTP and OLAP are summarized as follows:
Feature          OLTP                                OLAP
Characteristic   operational processing              informational processing
Orientation      transaction                         analysis
User             clerk, DBA, database professional   knowledge worker
Function         day-to-day operations               long-term informational requirements
DB design        ER-based, application-oriented      star/snowflake, subject-oriented
Data             current; guaranteed up to date      historical; accuracy maintained over time
Summarization    primitive, highly detailed          summarized, consolidated
View             detailed, flat relational           summarized, multidimensional
Unit of work     short, simple transaction           complex query
Access           read/write                          mostly read
Focus            data in                             information out
No. of users     thousands                           hundreds
2.2 OLAP Systems versus Statistical databases
Many of the characteristics of OLAP systems, such as the use of a multidimensional data
model and concept hierarchies, the association of measures with dimensions, and the
notions of roll up and drill down, also exist in earlier work on statistical databases(SDBs).
A statistical database is a database system that is designed to support statistical
applications. Similarities between the two types of systems are rarely discussed,
mainly due to differences in terminology and application domains.
OLAP and SDB systems, however, have distinguishing differences. While SDBs tend to
focus on socio-economic applications, OLAP has been targeted for business applications.
Privacy issues regarding concept hierarchies are a major concern for SDBs. For example,
given summarized socio-economic data, it is controversial to allow users to view the
corresponding low-level data. Finally, unlike SDBs, OLAP systems are designed for
handling huge amounts of data efficiently.
2.3 Three Tier data warehouse architecture
Data warehouses often adopt a three-tier architecture:
1. The bottom tier is a warehouse database server that is almost always a relational
database system. Data from operational databases and external sources are
extracted using application program interfaces known as gateways. A gateway is
supported by the underlying DBMS and allows client programs to generate SQL code
to be executed at a server.
2. The middle tier is an OLAP server that is typically implemented using either (1) a
relational OLAP(ROLAP) model, that is, an extended relational DBMS that
maps operations on multidimensional data to standard relational operations; or (2)
a multidimensional OLAP (MOLAP) model, that is, a special purpose server
that directly implements multidimensional data and operations.
3. The top tier is a client, which contains query and reporting tools, analysis tools,
and /or data mining tools.
Fig. 2.1: Multi-tiered data warehouse architecture. Operational databases and other
sources are extracted, transformed, loaded and refreshed by a monitor and integrator
into the data storage tier (data warehouse, data marts and metadata), which is served
by an OLAP server to front-end tools for analysis, queries, reports and data mining.
From the architectural point of view, there are three data warehouse models:
 Enterprise warehouse: an enterprise warehouse collects all of the information
about subjects spanning the entire organization. It provides corporate-wide data
integration, usually from one or more operational systems or external information
providers, and is cross functional in scope.
 Data mart: A data mart contains a subset of corporate-wide data that is of value
to a specific group of users. The scope is confined to specific selected subjects.
Data marts are usually implemented on low-cost departmental servers that are
UNIX or Windows/NT based.
 Virtual warehouse: A virtual warehouse is a set of views over operational
databases. For efficient query processing, only some of the possible summary
views may be materialized. A virtual warehouse is easy to build but requires
excess capacity on operational database servers.
Fig. 2.2: Data warehouse development. A recommended approach is to define a
high-level corporate data model, then develop data marts, distributed data marts
and a multi-tier enterprise data warehouse in parallel, with iterative model refinement.
2.4 Metadata Repository
Metadata are data about data. When used in a data warehouse, metadata are the data that
define warehouse objects. They are created for the data names and definitions of the
given warehouse. Additional metadata are created and captured for time-stamping any
extracted data, the source of the extracted data, and missing fields that have been added
by data cleaning or integration processes. A metadata repository should contain the following:
 A description of the structure of the data warehouse, which includes the
warehouse schema, view, dimensions, hierarchies, and derived data definitions,
as well as data mart locations and contents.
 Operational metadata, which include data lineage, currency of data, and
monitoring information.
 The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.
 The mapping from the operational environment to the data warehouse, which
includes source databases and their contents, gateway descriptions, data
partitions, data extraction, cleaning and transformation rules and defaults,
data refresh and purging rules, and security.
 Data related to system performance, which include indices and profiles that
improve data access and retrieval performance, in addition to rules for the
timing and scheduling of refresh, update, and replication cycles.
 Business metadata, which include business terms and definitions, data ownership
information, and charging policies.
2.5 From data warehousing to data mining
Data warehouse usage:
There are three kinds of data warehouse applications:
 Information processing: It supports querying, basic statistical analysis, and
reporting using crosstabs, tables, charts, or graphs. A current trend in data
warehouse information processing is to construct low-cost Web-based accessing
tools that are integrated with Web browsers.
 Analytical processing: It supports basic OLAP operations, including slice and
dice, drill-down, roll-up, and pivoting. It generally operates on historical data in
both summarized and detailed forms. The major strength of on-line analytical
processing over information processing is the multidimensional analysis of data
warehouse data. (A small sketch of these operations follows this list.)
 Data mining: It supports knowledge discovery by finding hidden patterns and
associations, constructing analytical models, performing classification and
prediction, and presenting the mining results using visualization tools.
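As promised above, here is a minimal sketch of these OLAP operations using the pandas library on an invented sales table; all dimension and measure names are hypothetical:

# Sketch of basic OLAP operations on a tiny cube, using pandas.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "amount":  [100, 80, 150, 90, 170, 110],
})

# Roll-up: climb the time hierarchy from quarter to year.
print(sales.groupby("year")["amount"].sum())

# Drill-down: descend again to (year, quarter) detail.
print(sales.groupby(["year", "quarter"])["amount"].sum())

# Slice: fix one dimension (region = "East").
print(sales[sales["region"] == "East"])

# Pivot: rotate the view to regions against years.
print(sales.pivot_table(values="amount", index="region",
                        columns="year", aggfunc="sum"))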
2.6 From On-line Analytical Processing to On-line Analytical Mining
Among many different paradigms and architectures of data mining systems, on-line
analytical mining (OLAM), which integrates on-line analytical processing (OLAP) with
data mining and mines knowledge in multidimensional databases, is particularly
important for the following reasons:
 High quality of data in data warehouses
 Available information processing infrastructure surrounding data warehouses.
 OLAP based exploratory data analysis
 On-line selection of data mining functions.
2.6.1 Architecture for On-line Analytical Mining
An OLAM server performs analytical mining in data cubes in a similar manner as an
OLAP server performs on-line analytical processing. An integrated OLAM and OLAP
architecture is shown in Fig. 2.3, where the OLAM and OLAP servers both accept user
on-line queries via a graphical user interface API and work with the data cube in the
data analysis via a cube API. A metadata directory is used to guide the access of the
data cube.
The data cube can be constructed by accessing and/or integrating multiple databases via
an MDDB API and/or by filtering a data warehouse via a database API that may support
OLE DB or ODBC connections.
Fig. 2.3: An OLAM architecture. Layer 1 is the data repository (databases and a data
warehouse, fed through data cleaning, integration and filtering); Layer 2 is the
multidimensional database (MDDB) with its metadata; Layer 3 holds the OLAM and OLAP
engines, which work with the MDDB via a data cube API; Layer 4 is the user interface,
where mining queries are issued and mining results returned through a GUI API.
3.0 Mining Association Rules in Large Databases
Association rule mining searches for interesting relationships among items in a given data
set. The basic concepts of mining associations are:
Let J = {i1, i2, …, im} be a set of items. Let D, the task-relevant data, be a set of
database transactions where each transaction T is a set of items such that T is a subset of
J. Each transaction is associated with an identifier, called a TID. Let A be a set of items.
A transaction T is said to contain A if and only if A is a subset of T. An association rule
is an implication of the form A ⇒ B, where A is a subset of J, B is a subset of J, and
A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support S, where S is the
percentage of transactions in D that contain A ∪ B. This is taken to be the probability
P(A ∪ B). The rule A ⇒ B has confidence C in the transaction set D if C is the
percentage of transactions in D containing A that also contain B. This is taken to be
the conditional probability P(B|A). That is,
Support(A ⇒ B) = P(A ∪ B)
Confidence(A ⇒ B) = P(B|A)
Rules that satisfy both a minimum support threshold (min_sup) and a minimum
confidence threshold (min_conf) are called strong. By convention, we write support and
confidence values as percentages between 0% and 100% rather than numbers between 0 and 1.
A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset.
The occurrence frequency of an itemset is the number of transactions that contain the itemset.
This is also known, simply, as the frequency, support count, or count of the itemset. An
itemset satisfies minimum support if its occurrence frequency is greater than or
equal to the product of min_sup and the total number of transactions in D. The number of
transactions required for the itemset to satisfy minimum support is therefore referred to as
the minimum support count. If an itemset satisfies minimum support, then it is a frequent
itemset. The set of frequent k-itemsets is commonly denoted by Lk.
Association rule mining is a two-step process:
 Find all frequent itemsets.
 Generate strong association rules from the frequent itemsets.
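As a baseline for the first step, the following brute-force sketch checks the support count of every candidate itemset over a toy database (the items and minimum support count are invented); the Apriori algorithm in the next section does the same job far more efficiently by pruning the search space:

# Brute-force frequent-itemset search: check the support count of every
# nonempty subset of the item universe. Exponential in the number of items,
# which is why Apriori's pruning matters.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]
min_sup_count = 2                        # invented minimum support count

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        count = sum(set(cand) <= t for t in transactions)
        if count >= min_sup_count:
            frequent[cand] = count

print(frequent)    # every frequent k-itemset with its support count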
3.1 The Apriori Algorithm: Finding Frequent Itemsets Using
Candidate Generation
Apriori is an influential algorithm for mining frequent itemsets for Boolean
association rules. The name of the algorithm is based on the fact that the algorithm
uses prior knowledge of frequent itemset properties. Apriori employs an iterative
approach known as level-wise search, where k-itemsets are used to explore (k+1)-
itemsets. First, the set of frequent 1-itemsets is found. This set is denoted L1. L1 is
used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so
on, until no more frequent k-itemsets can be found. The finding of each Lk
requires one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an
important property called the Apriori property is used to reduce the search space.
The Apriori property states that all nonempty subsets of a frequent itemset
must also be frequent. This property is based on the following observation. By
definition, if an itemset I does not satisfy the minimum support threshold,
min_sup, then I is not frequent, that is, P(I) < min_sup. If an item A is added to the
itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I.
Therefore, I ∪ A is not frequent either, that is, P(I ∪ A) < min_sup.
This property belongs to a special category of properties called anti-monotone, in the
sense that if a set cannot pass a test, all of its supersets will fail the same test as well.
It is called anti-monotone because the property is monotonic in the context of failing a test.
Let us look at how Lk−1 is used to find Lk. A two-step process is followed, consisting of
join and prune actions.
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk−1
with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk−1. The
notation li[j] refers to the jth item in li. By convention, Apriori assumes that items
within a transaction or itemset are sorted in lexicographic order. The join, Lk−1 ⋈ Lk−1,
is performed, where members of Lk−1 are joinable if their first (k−2) items are in
common. That is, members l1 and l2 of Lk−1 are joined if (l1[1]=l2[1]) and
(l1[2]=l2[2]) and … and (l1[k−2]=l2[k−2]) and (l1[k−1]<l2[k−1]). The condition
l1[k−1]<l2[k−1] simply ensures that no duplicates are generated. The resulting itemset
formed by joining l1 and l2 is l1[1]l1[2]…l1[k−1]l2[k−1].
2. The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent,
but all of the frequent k-itemsets are included in Ck. A scan of the database to determine
the count of each candidate in Ck would result in the determination of Lk. Ck, however,
can be huge, and so this could involve heavy computation. To reduce the size of Ck, the
Apriori property is used as follows. Any (k−1)-itemset that is not frequent cannot be a
subset of a frequent k-itemset. Hence, if any (k−1)-subset of a candidate k-itemset is not in
Lk−1, then the candidate cannot be frequent either and so can be removed from Ck. This
subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
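Putting the level-wise search, join and prune steps together, here is a compact Apriori sketch in Python; the toy transactions and minimum support count are invented, and itemsets are kept as lexicographically sorted tuples, matching the convention above:

# Apriori sketch: frequent (k-1)-itemsets (Lk-1) generate candidate
# k-itemsets (Ck) via the join step, pruned by the Apriori property.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]
min_sup_count = 2

def count_and_filter(candidates):
    # One full database scan: keep candidates meeting minimum support.
    return {c for c in candidates
            if sum(set(c) <= t for t in transactions) >= min_sup_count}

# L1: frequent 1-itemsets, stored as sorted tuples.
items = sorted(set().union(*transactions))
L = count_and_filter((i,) for i in items)
k = 2
while L:
    print(f"L{k-1}:", sorted(L))
    # Join step: merge pairs agreeing on their first k-2 items; the
    # condition a[-1] < b[-1] prevents duplicate candidates.
    joined = {a[:-1] + (a[-1], b[-1]) for a in L for b in L
              if a[:-1] == b[:-1] and a[-1] < b[-1]}
    # Prune step: drop any candidate with an infrequent (k-1)-subset.
    Ck = {c for c in joined
          if all(s in L for s in combinations(c, k - 1))}
    L = count_and_filter(Ck)
    k += 1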
4.0 Classification and Prediction
4.1 What Is Classification?
Data classification is a two-step process. In the first step, a model is built describing a
predetermined set of data classes or concepts. The model is constructed by analyzing
database tuples described by attributes. Each tuple is assumed to belong to a predefined
class, as determined by one of the attributes, called the class label attribute. In the context
of classification, data tuples are also referred to as samples, examples, or objects. The
data tuples analyzed to build the model collectively form the training data set. The
individual tuples making up the training set are referred to as training samples and are
randomly selected from the sample population. Since the class label of each training
sample is provided, this step is also known as supervised learning.
In the second step, the model is used for classification. First, the predictive accuracy of
the model is estimated. The holdout method is a simple technique that uses a test set of
class-labeled samples. These samples are randomly selected and are independent of the
training samples. The accuracy of a model on a given test set is the percentage of test set
samples that are correctly classified by the model. For each test sample, the known class
label is compared with the learned model's class prediction for that sample. If the
accuracy of the model were estimated based on the training data set, this estimate could
be optimistic, since the learned model tends to overfit the data. Therefore, a test set is
used.
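A minimal sketch of the holdout estimate described above; the learner here is a deliberately trivial majority-class placeholder, and the samples are invented:

# Holdout accuracy estimate: compare the model's prediction with the known
# class label of each independent test sample. Data and model are invented.
import random

def train_model(training_samples):
    # Placeholder learner: predict the majority class of the training set.
    labels = [label for _, label in training_samples]
    return max(set(labels), key=labels.count)

samples = [({"age": a}, "buys" if a < 40 else "no") for a in range(20, 60)]
random.shuffle(samples)                       # split varies run to run
train, test = samples[:30], samples[30:]      # random, independent split

majority = train_model(train)
correct = sum(1 for _, label in test if majority == label)
accuracy = correct / len(test)                # percentage correctly classified
print(f"holdout accuracy: {accuracy:.0%}")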
4.2 What is prediction?
Prediction can be viewed as the construction and use of a model to assess the class of an
unlabeled sample, or to assess the value or value range of an attribute that a given sample
is likely to have.
4.3 Classification by decision tree induction
A decision tree is a flowchart-like tree structure, where each internal node denotes a test
on an attribute, each branch represents an outcome of the test, and leaf nodes represent
classes or class distributions. The topmost node in the tree is the root node.
In order to classify an unknown sample, the attribute-values of the sample are tested
against the decision tree. A path is traced from the root to a leaf node that holds the class
prediction for that sample. Decision trees can easily be converted to classification rules.
We describe a basic algorithm for learning decision trees.
Algorithm: Generate_decision_tree. Generate a decision tree from the given training
data.
Input: the training samples, samples, represented by discrete-valued attributes; the set of
candidate attributes, attribute_list.
Output: a decision tree.
Method:
(1) create a node N;
(2) if samples are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the most common class in samples;
// majority voting
(6) select test_attribute, the attribute among attribute_list with the highest
information gain;
(7) label node N with test_attribute;
(8) for each known value ai of test_attribute // partition the samples
(9) grow a branch from node N for the condition test_attribute = ai;
(10) let si be the set of samples in samples for which test_attribute = ai;
(11) if si is empty then
(12) attach a leaf labeled with the most common class in samples;
(13) else attach the node returned by Generate_decision_tree(si, attribute_list −
test_attribute);
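A runnable Python sketch of this algorithm follows, using entropy-based information gain for step (6). The sample format (a dict of discrete attribute values with a "class" key) and the toy weather data are assumptions for illustration; because the sketch iterates only over attribute values observed in samples, the empty-partition case of steps (11)-(12) never arises here:

# Sketch of Generate_decision_tree with entropy-based information gain.
# Samples are dicts of discrete attribute values; "class" holds the label.
from collections import Counter
from math import log2

def entropy(samples):
    counts = Counter(s["class"] for s in samples)
    total = len(samples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(samples, attr):
    # Expected reduction in entropy from partitioning samples on attr.
    total = len(samples)
    remainder = 0.0
    for v in {s[attr] for s in samples}:
        part = [s for s in samples if s[attr] == v]
        remainder += len(part) / total * entropy(part)
    return entropy(samples) - remainder

def generate_decision_tree(samples, attribute_list):
    classes = [s["class"] for s in samples]
    if len(set(classes)) == 1:                      # steps (2)-(3)
        return classes[0]
    if not attribute_list:                          # steps (4)-(5): majority vote
        return Counter(classes).most_common(1)[0][0]
    test_attr = max(attribute_list,
                    key=lambda a: info_gain(samples, a))   # step (6)
    node = {"test": test_attr, "branches": {}}      # step (7)
    for v in {s[test_attr] for s in samples}:       # steps (8)-(10)
        subset = [s for s in samples if s[test_attr] == v]
        rest = [a for a in attribute_list if a != test_attr]
        node["branches"][v] = generate_decision_tree(subset, rest)  # step (13)
    return node

# Invented toy training data.
data = [
    {"outlook": "sunny", "windy": "no",  "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "stay"},
    {"outlook": "rain",  "windy": "no",  "class": "play"},
    {"outlook": "rain",  "windy": "yes", "class": "stay"},
]
print(generate_decision_tree(data, ["outlook", "windy"]))
# -> {'test': 'windy', 'branches': {'no': 'play', 'yes': 'stay'}}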
REFERENCES
Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers.

More Related Content

What's hot

Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1Mahmoud Alfarra
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)Pratik Tambekar
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an IntroductionAli Abbasi
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesSaif Ullah
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining ConceptsDung Nguyen
 
Data Warehouse and Data Mining
Data Warehouse and Data MiningData Warehouse and Data Mining
Data Warehouse and Data MiningRanak Ghosh
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.pptneelamoberoi1030
 
Data Mining Concepts and Techniques
Data Mining Concepts and TechniquesData Mining Concepts and Techniques
Data Mining Concepts and TechniquesPratik Tambekar
 
Data mining-2
Data mining-2Data mining-2
Data mining-2Nit Hik
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining Sushil Kulkarni
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
introduction to data warehousing and mining
 introduction to data warehousing and mining introduction to data warehousing and mining
introduction to data warehousing and miningRajesh Chandra
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationDr. Abdul Ahad Abro
 

What's hot (20)

03 data mining : data warehouse
03 data mining : data warehouse03 data mining : data warehouse
03 data mining : data warehouse
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
Data mining notes
Data mining notesData mining notes
Data mining notes
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Data mining
Data miningData mining
Data mining
 
Data Warehouse and Data Mining
Data Warehouse and Data MiningData Warehouse and Data Mining
Data Warehouse and Data Mining
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.ppt
 
Introduction to DataMining
Introduction to DataMiningIntroduction to DataMining
Introduction to DataMining
 
Data Mining Concepts and Techniques
Data Mining Concepts and TechniquesData Mining Concepts and Techniques
Data Mining Concepts and Techniques
 
Data mining-2
Data mining-2Data mining-2
Data mining-2
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
introduction to data warehousing and mining
 introduction to data warehousing and mining introduction to data warehousing and mining
introduction to data warehousing and mining
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
 

Similar to Data mining

Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesasnaparveen414
 
Advances And Research Directions In Data-Warehousing Technology
Advances And Research Directions In Data-Warehousing TechnologyAdvances And Research Directions In Data-Warehousing Technology
Advances And Research Directions In Data-Warehousing TechnologyKate Campbell
 
Data Mining @ BSU Malolos 2019
Data Mining @ BSU Malolos 2019Data Mining @ BSU Malolos 2019
Data Mining @ BSU Malolos 2019Edwin S. Garcia
 
Introduction to dm and dw
Introduction to dm and dwIntroduction to dm and dw
Introduction to dm and dwANUSUYA T K
 
Unit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptxUnit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptxHarsha Patel
 
BVRM 402 IMS Database Concept.pptx
BVRM 402 IMS Database Concept.pptxBVRM 402 IMS Database Concept.pptx
BVRM 402 IMS Database Concept.pptxDrNilimaThakur
 
notes_dmdw_chap1.docx
notes_dmdw_chap1.docxnotes_dmdw_chap1.docx
notes_dmdw_chap1.docxAbshar Fatima
 
Business Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxBusiness Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxRupaRani28
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousingEr. Nawaraj Bhandari
 
A Survey on Data Mining
A Survey on Data MiningA Survey on Data Mining
A Survey on Data MiningIOSR Journals
 

Similar to Data mining (20)

Dm unit i r16
Dm unit i   r16Dm unit i   r16
Dm unit i r16
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
9. Data Warehousing & Mining.pptx
9. Data Warehousing & Mining.pptx9. Data Warehousing & Mining.pptx
9. Data Warehousing & Mining.pptx
 
Data mining
Data miningData mining
Data mining
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notes
 
Advances And Research Directions In Data-Warehousing Technology
Advances And Research Directions In Data-Warehousing TechnologyAdvances And Research Directions In Data-Warehousing Technology
Advances And Research Directions In Data-Warehousing Technology
 
Data Mining @ BSU Malolos 2019
Data Mining @ BSU Malolos 2019Data Mining @ BSU Malolos 2019
Data Mining @ BSU Malolos 2019
 
Introduction to dm and dw
Introduction to dm and dwIntroduction to dm and dw
Introduction to dm and dw
 
Unit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptxUnit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptx
 
BVRM 402 IMS UNIT V
BVRM 402 IMS UNIT VBVRM 402 IMS UNIT V
BVRM 402 IMS UNIT V
 
BVRM 402 IMS Database Concept.pptx
BVRM 402 IMS Database Concept.pptxBVRM 402 IMS Database Concept.pptx
BVRM 402 IMS Database Concept.pptx
 
notes_dmdw_chap1.docx
notes_dmdw_chap1.docxnotes_dmdw_chap1.docx
notes_dmdw_chap1.docx
 
Abstract
AbstractAbstract
Abstract
 
dwdm unit 1.ppt
dwdm unit 1.pptdwdm unit 1.ppt
dwdm unit 1.ppt
 
2 Data-mining process
2   Data-mining process2   Data-mining process
2 Data-mining process
 
Business Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxBusiness Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptx
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
A Survey on Data Mining
A Survey on Data MiningA Survey on Data Mining
A Survey on Data Mining
 
Chapter 1. Introduction.ppt
Chapter 1. Introduction.pptChapter 1. Introduction.ppt
Chapter 1. Introduction.ppt
 

More from Sachin Sisodia (Entrepreneur, Architect) (8)

Beat recession
Beat recessionBeat recession
Beat recession
 
Beat recession
Beat recessionBeat recession
Beat recession
 
Saving account offerings of hdfc bank
Saving account offerings of hdfc bankSaving account offerings of hdfc bank
Saving account offerings of hdfc bank
 
Sachin resume
Sachin resumeSachin resume
Sachin resume
 
Sachin resume
Sachin resumeSachin resume
Sachin resume
 
Sachin resume
Sachin resumeSachin resume
Sachin resume
 
Sachin resume
Sachin resumeSachin resume
Sachin resume
 
Sachin resume
Sachin resumeSachin resume
Sachin resume
 

Recently uploaded

Summary IGF 2013 Bali - English (tata kelola internet / internet governance)
Summary  IGF 2013 Bali - English (tata kelola internet / internet governance)Summary  IGF 2013 Bali - English (tata kelola internet / internet governance)
Summary IGF 2013 Bali - English (tata kelola internet / internet governance)ICT Watch - Indonesia
 
TRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxTRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxAndrieCagasanAkio
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119APNIC
 
How to login to Router net ORBI LOGIN...
How to login to Router net ORBI LOGIN...How to login to Router net ORBI LOGIN...
How to login to Router net ORBI LOGIN...rrouter90
 
Company Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxCompany Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxMario
 
Unidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxUnidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxmibuzondetrabajo
 
Summary ID-IGF 2016 National Dialogue - English (tata kelola internet / int...
Summary  ID-IGF 2016 National Dialogue  - English (tata kelola internet / int...Summary  ID-IGF 2016 National Dialogue  - English (tata kelola internet / int...
Summary ID-IGF 2016 National Dialogue - English (tata kelola internet / int...ICT Watch - Indonesia
 
Cybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best PracticesCybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best PracticesLumiverse Solutions Pvt Ltd
 
办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...
办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...
办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...vmzoxnx5
 

Recently uploaded (9)

Summary IGF 2013 Bali - English (tata kelola internet / internet governance)
Summary  IGF 2013 Bali - English (tata kelola internet / internet governance)Summary  IGF 2013 Bali - English (tata kelola internet / internet governance)
Summary IGF 2013 Bali - English (tata kelola internet / internet governance)
 
TRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxTRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptx
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119
 
How to login to Router net ORBI LOGIN...
How to login to Router net ORBI LOGIN...How to login to Router net ORBI LOGIN...
How to login to Router net ORBI LOGIN...
 
Company Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxCompany Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptx
 
Unidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxUnidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptx
 
Summary ID-IGF 2016 National Dialogue - English (tata kelola internet / int...
Summary  ID-IGF 2016 National Dialogue  - English (tata kelola internet / int...Summary  ID-IGF 2016 National Dialogue  - English (tata kelola internet / int...
Summary ID-IGF 2016 National Dialogue - English (tata kelola internet / int...
 
Cybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best PracticesCybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best Practices
 
办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...
办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...
办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...
 

Data mining

  • 1. SUBMITETD BY: PROJECT GUIDE: ANKITA AGARWAL MR. JARNAIL SINGH DEEPIKA RAIPURIA FACULTY C-11 FET MODY INSTITUTE MODY INSTITUTE OF TECH. AND SCIENCE TECH. AND SCIENCE
  • 2. ACKNOWLEDGEMENT We wish to express our deepest gratitude for Mr.Jarnail Singh, for his utmost concern, guidance and encouragement during our major project. We would like to thank him for his pleasant and insightful interaction with us. We are extremely grateful for their constant guidance and encouragement and for their help in gaining access to the resources essential for the successful completion of this assignment. We would like to thank him for sharing their valuable knowledge and resources with us and showed utmost co-operation and understanding. ANKITA AGARWAL DEEPIKA RAIPURIA MITS,LAXMANGARH
  • 3. EXECUTIVE SUMMARY Data mining is the process of posing various queries and extracting useful information, patterns and trends often previously unknown from large quantities of the data possibly stored in databases. The goals of data mining include detecting abnormal patterns and predicting the future based on past experiences and current trends. Some of the data mining technique includes those based on rough sets, inductive losic programming, machine learning and neutral networks. The data mining problem includes Classification: finding rules to partition data into groups. Association: finding rules to make association between data. Sequencing: finding rules to order data. Data mining is also referred as Knowledge mining from databases, Knowledge extraction, data/pattern analysis, data archaeology and data dredging. Another popular term for it is Knowledge discovery in databases (KDD). it consist of an iterative sequence of following steps: Data integration Data selection Data transformation Data mining Pattern evaluation Knowledge presentation Data can be stored in many types of databases. One database architecture is data warehouse, a repository of multiple, heterogeneous data sources. It include data cleansing, data integration, and on-line analytical processing(OLAP). In addition user interfaces, optimized query processing, and transaction management. Efficient method for this is on-line transaction processing that is(OLTP)
  • 4. BRIEF CONTENTS S. No. DESCRIPTION PAGE NO. 1. DATA MINING 1 Knowledge discovery in databases 2. DATA WAREHOUSE 8 OLAP technology for data mining 3. MINING ASSOCIATION RULES 15 In large databases 4. CLASSIFICATION AND PREDICTION 18 5. REFERENCES 20
  • 5. 1.0 DATA MINING-Knowledge discovery in databases 1.1 Motivation: Why data mining? “Necessity Is the Mother of Invention” 1.1.1 Data explosion problem: The major reason that data mining has attracted a great deal of attention in the information industry in recent years is due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from business management , production control, and market analysis, to engineering design and science exploration. Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories. We are drowning in data, but starving for knowledge! Data mining can be viewed as a result of the natural evolution of information technology. 1.2Evolution of Database Technology  1960s:  Data collection, database creation, IMS and network DBMS  1970s:  Relational data model, relational DBMS implementation  1980s:  RDBMS, advanced data models (extended-relational, OO, deductive, etc.)  Application-oriented DBMS (spatial, scientific, engineering, etc.)  1990s:  Data mining, data warehousing, multimedia databases, and Web databases  2000s  Stream data management and mining  Data mining with a variety of applications  Web technology and global information systems 1.2.1Solution: Data warehousing and data mining  Data warehousing and on-line analytical processing(OLAP) Data warehouse is a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision making. It comprises data cleansing,
  • 6. data integration and OLAP, which is analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view information from different angles.  Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases 1.3 What is data mining? Data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a misnomer. Mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus data mining should have been appropriately named “knowledge mining from data,” which is unfortunately somewhat long. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material. Thus such a misnomer that carries both “data” and “mining” became a popular choice. There are many other terms carrying a similar or slightly different meaning to data mining such as knowledge mining from databases, knowledge extraction, data or pattern analysis, data archeology and data dredging. Many people treat data mining as a synonym for another popularly used term, “Knowledge discovery in databases” or KDD. It consist of an iterative sequence of the following steps: Fig.1.1  Data mining—core of knowledge discovery process Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation
  • 7. Main steps are: 1. Data cleaning: to remove noise and inconsistent data. 2. Data integration: here multiple data sources are combined. 3. Data selection :data relevant to the analysis task are retrieved from the database. 4. Data reduction and transformation: here data are transformed or consolidate into forms appropriate for mining by performing summary or aggregation operations. we find useful features dimensionality / variable reduction, invariant representation. 5. Data mining : an essential process where intelligent methods and several mining algorithms are applied in order to extract data patterns. 6. Pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interestingness measures. 7. Knowledge presentation: here visualization and knowledge representation techniques are used to present the mined knowledge to the user. 1.4 Architecture: Typical Data Mining System The architecture of a typical data mining system have the following major components as shown in fig1.2. fig. 1.2 Data Warehouse Data cleaning & data integration Filtering Databases Database or data warehouse server Data mining engine Pattern evaluation Graphical user interface Knowledge-base
  • 8. 1.4.1 Database, Data warehouse, or other information repository: This is one or a set of databases, data warehouses, spreadsheets or other kinds of information repositories. Data cleaning and data integration are performed on the data. 1.4.2 Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the users data mining request. 1.4.3 Knowledge base: This is the domain knowledge that is used to guide the search, or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different level of abstraction. 1.4.4 Data mining engine: This is essential to the data mining system and ideally consist of set of function modules for task such as characterization, association, classification, cluster analysis, and evolution and deviation analysis. 1.4.5 Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. 1.4.6 Graphical User Interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search and performing exploratory data mining based on the intermediate data mining results. 1.5 Data mining-on what kind of data? 1.5.1 Relational databases: A relational database is a collection of tables, each of which is assigned a unique name. each table consist of set of attributes and usually stores a large set of tuples. Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. 1.5.2 Data warehouses: A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and which usually resides at a single site. Data warehouses are constructed via a process of data cleaning, data transformation, data integration, data loading and periodic data refreshing.
1.5.3 Transactional databases: A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.

1.5.4 Advanced database systems and advanced database applications: With the advances of database technology, various kinds of advanced database systems have emerged and are undergoing development to address the requirements of new database applications. The new database applications include:
 Object-oriented databases
 Object-relational databases
 Spatial databases
 Temporal databases and time-series databases
 Text databases and multimedia databases
 Heterogeneous databases and legacy databases
 The World Wide Web

1.6 Data Mining Functionalities:
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. They can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.

1.6.1 Concept/Class Description: Characterization and Discrimination
Data can be associated with classes or concepts. It can be useful to describe individual classes or concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived via:
Data characterization: a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class are typically collected by a database query.
Data discrimination: a comparison of the general features of target-class data objects with the general features of objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the user, and the corresponding data objects retrieved through database queries.
1.6.2 Association analysis: Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. It is widely used for market basket or transaction data analysis. More formally, association rules are of the form X => Y, that is, “A1 ∧ … ∧ Am => B1 ∧ … ∧ Bn”, where Ai (for i in {1, …, m}) and Bj (for j in {1, …, n}) are attribute-value pairs. The association rule X => Y is interpreted as “database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y.”

1.6.3 Classification and Prediction: Classification is the process of finding a set of models that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data, i.e., data objects whose class label is known. Classification can be used for predicting the class label of data objects. Prediction refers to both data value prediction and class label prediction; it also encompasses the identification of distribution trends based on the available data.

1.6.4 Cluster Analysis: Unlike classification and prediction, clustering analyzes data objects without consulting a known class label; it can be used to generate such labels. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Each cluster that is formed can be viewed as a class of objects, from which rules can be derived.

1.6.5 Outlier Analysis: A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions. However, in some applications, such as fraud detection, the rare events can be more interesting than the more regularly occurring ones. The analysis of outlier data is referred to as outlier mining.
1.7 Are all of the patterns interesting?
A data mining system has the potential to generate thousands or even millions of patterns, or rules. A pattern is interesting if it is:
 easily understood by humans;
 valid on new or test data with some degree of certainty;
 potentially useful;
 novel.
Several objective measures of pattern interestingness exist. These are based on the structure of discovered patterns and the statistics underlying them. One objective measure for association rules of the form X => Y is rule support, representing the percentage of transactions from a transaction database that the given rule satisfies. This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y. Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association. This is taken to be the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y. Support and confidence are defined as:
Support(X => Y) = P(X ∪ Y)
Confidence(X => Y) = P(Y | X)
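As a worked illustration (the numbers here are hypothetical, not taken from the text): suppose a database contains 10,000 transactions, of which 600 contain both X and Y and 1,200 contain X. Then support(X => Y) = 600/10,000 = 6%, and confidence(X => Y) = 600/1,200 = 50%, meaning that half of the transactions containing X also contain Y.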
2.0 DATA WAREHOUSE - OLAP technology for data mining
Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. A large number of organizations have found that data warehouse systems are valuable tools in today's competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars in building enterprise-wide data warehouses.
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. These four keywords distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems.
 Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.
 Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
 Time-variant: Data are stored to provide information from a historical perspective. Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.
 Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, or concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.

2.1 Differences between Operational Database Systems and Data Warehouses:
The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users. These systems are known as on-line analytical processing (OLAP) systems. The major distinguishing features between OLTP and OLAP are summarized as follows:

Feature            OLTP                               OLAP
Characteristic     operational processing             informational processing
Orientation        transaction                        analysis
User               clerk, DBA, database professional  knowledge worker
Function           day-to-day operations              long-term informational requirements
DB design          ER-based, application-oriented     star/snowflake, subject-oriented
Data               current; guaranteed up to date     historical; accuracy maintained over time
Summarization      primitive, highly detailed         summarized, consolidated
View               detailed, flat relational          summarized, multidimensional
Unit of work       short, simple transaction          complex query
Access             read/write                         mostly read
Focus              data in                            information out
Number of users    thousands                          hundreds

2.2 OLAP Systems versus Statistical Databases
Many of the characteristics of OLAP systems, such as the use of a multidimensional data model and concept hierarchies, the association of measures with dimensions, and the notions of roll-up and drill-down, also exist in earlier work on statistical databases (SDBs). A statistical database is a database system that is designed to support statistical applications. Similarities between the two types of systems are rarely discussed, mainly due to differences in terminology and application domains. OLAP and SDB systems, however, have distinguishing differences. While SDBs tend to focus on socio-economic applications, OLAP has been targeted at business applications. Privacy issues regarding concept hierarchies are a major concern for SDBs; for example, given summarized socio-economic data, it is controversial to allow users to view the corresponding low-level data. Finally, unlike SDBs, OLAP systems are designed for handling huge amounts of data efficiently.
2.3 Three-Tier Data Warehouse Architecture
Data warehouses often adopt a three-tier architecture:
1. The bottom tier is a warehouse database server that is almost always a relational database system. Data from operational databases and external sources are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.
2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.
3. The top tier is a client, which contains query and reporting tools, analysis tools, and/or data mining tools.

[Fig. 2.1 Multi-tiered architecture: operational databases and other sources are extracted, transformed, loaded, and refreshed by a monitor and integrator into the data warehouse and data marts (data storage, with metadata); OLAP servers form the middle tier; the front end offers analysis, query/report, and data mining tools.]
From the architectural point of view, there are three data warehouse models:
 Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope.
 Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. Data marts are usually implemented on low-cost departmental servers that are UNIX or Windows/NT based.
 Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. Virtual warehouses are easy to build but require excess capacity on operational database servers.
[Fig. 2.2 Data warehouse development: starting from a high-level corporate data model, data marts and distributed data marts are built and refined, leading through a multi-tier data warehouse to an enterprise data warehouse, with ongoing model refinement.]

2.4 Metadata Repository
Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. They are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for time-stamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes. A metadata repository should contain the following:
 A description of the structure of the data warehouse, which includes the warehouse schema, views, dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
 Operational metadata, which include data lineage, currency of data, and monitoring information.
 The algorithms used for summarization, which include measure and dimension definition algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports.
 The mapping from the operational environment to the data warehouse, which includes source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules and defaults, data refresh and purging rules, and security.
 Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.
 Business metadata, which include business terms and definitions, data ownership information, and charging policies.

2.5 From Data Warehousing to Data Mining
Data warehouse usage: There are three kinds of data warehouse applications:
 Information processing: supports querying, basic statistical analysis, and reporting using cross-tabs, tables, charts, or graphs. A current trend in data warehouse information processing is to construct low-cost web-based accessing tools that are integrated with web browsers.
 Analytical processing: supports basic OLAP operations, including slice and dice, drill-down, roll-up, and pivoting. It generally operates on historical data in both summarized and detailed forms. The major strength of on-line analytical processing over information processing is the multidimensional analysis of data warehouse data.
 Data mining: supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

2.6 From On-line Analytical Processing to On-line Analytical Mining
Among the many different paradigms and architectures of data mining systems, on-line analytical mining (OLAM), which integrates on-line analytical processing (OLAP) with data mining and mines knowledge in multidimensional databases, is particularly important for the following reasons:
 High quality of data in data warehouses
 Available information processing infrastructure surrounding data warehouses
 OLAP-based exploratory data analysis
 On-line selection of data mining functions

2.6.1 Architecture for On-line Analytical Mining
An OLAM server performs analytical mining in data cubes in a similar manner as an OLAP server performs on-line analytical processing. An integrated OLAM and OLAP architecture is shown in Fig. 2.3, where the OLAM and OLAP servers both accept user on-line queries via a graphical user interface API and work with the data cube in the data analysis via a cube API. A metadata directory is used to guide the access of the data cube. The data cube can be constructed by accessing and/or integrating multiple databases via an MDDB API and/or by filtering a data warehouse via a database API that may support OLE DB or ODBC connections.

[Fig. 2.3 An integrated OLAM and OLAP architecture, in four layers: (1) data repository, where databases and the data warehouse are built via data cleaning, integration, and filtering; (2) MDDB, with a metadata directory; (3) the OLAP/OLAM engines; (4) the user interface (GUI API), through which mining queries are posed and mining results returned.]
3.0 Mining Association Rules in Large Databases
Association rule mining searches for interesting relationships among items in a given data set. The basic concepts of mining associations are as follows. Let J = {i1, i2, …, im} be a set of items. Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items such that T is a subset of J. Each transaction is associated with an identifier, called the TID. Let A be a set of items. A transaction T is said to contain A if and only if A is a subset of T. An association rule is an implication of the form A => B, where A is a subset of J, B is a subset of J, and A ∩ B is empty. The rule A => B holds in the transaction set D with support S, where S is the percentage of transactions in D that contain A ∪ B. This is taken to be the probability P(A ∪ B). The rule A => B has confidence C in the transaction set D if C is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B | A). That is,
Support(A => B) = P(A ∪ B)
Confidence(A => B) = P(B | A)
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. By convention, we write support and confidence values as percentages between 0% and 100% rather than as numbers between 0 and 1.0.
A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known, simply, as the frequency, support count, or count of the itemset. An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is therefore referred to as the minimum support count. If an itemset satisfies minimum support, then it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk.
Association rule mining is a two-step process:
 Find all frequent itemsets.
 Generate strong association rules from the frequent itemsets.
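To make the two measures concrete, the following minimal Python sketch computes the support and confidence of a candidate rule A => B over a toy transactional database. The transactions and item names are illustrative placeholders only, not data taken from the text.

# Minimal sketch: support and confidence of a candidate rule A => B
# over a toy transactional database (illustrative data only).
D = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    # P(B|A): among transactions containing A, the fraction also containing B.
    containing_A = [t for t in transactions if A <= t]
    return sum(B <= t for t in containing_A) / len(containing_A)

A, B = {"bread"}, {"milk"}
print(f"support(A => B)    = {support(A | B, D):.0%}")    # P(A U B) = 60%
print(f"confidence(A => B) = {confidence(A, B, D):.0%}")  # P(B|A)   = 75%

Under thresholds of, say, min_sup = 50% and min_conf = 70%, this rule would be called strong.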
3.1 The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation
Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that it uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found; this set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space: all nonempty subsets of a frequent itemset must also be frequent. This property is based on the following observation. By definition, if an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either, that is, P(I ∪ A) < min_sup. This property belongs to a special category of properties called anti-monotone, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It is called anti-monotone because the property is monotonic in the context of failing a test.
Let us look at how Lk-1 is used to find Lk. A two-step process is followed, consisting of join and prune actions.
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li. By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The join, Lk-1 ⋈ Lk-1, is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 of Lk-1 are joined if (l1[1] = l2[1]) and (l1[2] = l2[2]) and … and (l1[k-2] = l2[k-2]) and (l1[k-1] < l2[k-1]). The condition l1[k-1] < l2[k-1] simply ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2 is l1[1] l1[2] … l1[k-1] l2[k-1].
2. The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk. Ck, however, can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either, and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
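The following Python sketch illustrates the level-wise search with the join and prune steps, assuming a toy list-of-sets transaction database and an illustrative minimum support count. For brevity, the lexicographic join is simplified to pairwise unions of (k-1)-itemsets (which yields the same candidate sets, though less efficiently), and the hash-tree optimization mentioned above is omitted.

from itertools import combinations

def apriori(transactions, min_sup_count):
    # Support count of a candidate itemset c: one pass over the database.
    def count(c):
        return sum(c <= t for t in transactions)
    # L1: the frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) >= min_sup_count}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: candidates Ck from unions of (k-1)-itemsets.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: drop candidates having an infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One full database scan determines Lk from Ck.
        Lk = {c for c in Ck if count(c) >= min_sup_count}
        frequent |= Lk
        k += 1
    return frequent

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(D, min_sup_count=3))   # all frequent itemsets of D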
4.0 Classification and Prediction
4.1 What Is Classification?
Data classification is a two-step process. In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the context of classification, data tuples are also referred to as samples, examples, or objects. The data tuples analyzed to build the model collectively form the training data set. The individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population. Since the class label of each training sample is provided, this step is also known as supervised learning.
In the second step, the model is used for classification. First, the predictive accuracy of the model is estimated. The holdout method is a simple technique that uses a test set of class-labeled samples. These samples are randomly selected and are independent of the training samples. The accuracy of a model on a given test set is the percentage of test set samples that are correctly classified by the model. For each test sample, the known class label is compared with the learned model's class prediction for that sample. If the accuracy of the model were estimated based on the training data set, this estimate could be optimistic, since the learned model tends to overfit the data. Therefore, a test set is used.
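As a minimal sketch of the holdout estimate just described, the following Python fragment splits a toy labeled dataset into independent training and test samples and reports test-set accuracy. The data and the deliberately naive stand-in model (majority-class prediction) are illustrative assumptions only.

import random

# Toy labeled samples: (attribute dict, class label) pairs (illustrative only).
samples = [({"age": a}, "buys" if a < 40 else "no") for a in range(20, 60)]
random.shuffle(samples)

split = int(0.7 * len(samples))            # e.g. 70% training / 30% test
train, test = samples[:split], samples[split:]

# Stand-in "learned model": predict the most common class in the training set.
labels = [label for _, label in train]
model = max(set(labels), key=labels.count)

# Holdout accuracy: percentage of test samples whose known class label
# matches the model's prediction.
correct = sum(1 for _, label in test if model == label)
print(f"holdout accuracy = {correct / len(test):.0%}")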
4.2 What is prediction?
Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value range of an attribute that a given sample is likely to have.

4.3 Classification by decision tree induction
A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in the tree is the root node. In order to classify an unknown sample, the attribute values of the sample are tested against the decision tree. A path is traced from the root to a leaf node that holds the class prediction for that sample. Decision trees can easily be converted to classification rules. We describe a basic algorithm for learning decision trees.
Algorithm: Generate_decision_tree. Generate a decision tree from the given training data.
Input: the training samples, samples, represented by discrete-valued attributes; the set of candidate attributes, attribute_list.
Output: a decision tree.
Method:
(1) create a node N;
(2) if samples are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the most common class in samples; // majority voting
(6) select test_attribute, the attribute among attribute_list with the highest information gain;
(7) label node N with test_attribute;
(8) for each known value ai of test_attribute // partition the samples
(9) grow a branch from node N for the condition test_attribute = ai;
(10) let si be the set of samples in samples for which test_attribute = ai;
(11) if si is empty then
(12) attach a leaf labeled with the most common class in samples;
(13) else attach the node returned by Generate_decision_tree(si, attribute_list minus test_attribute);
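The following compact Python sketch follows the Generate_decision_tree pseudocode above, assuming samples are (attribute dictionary, class label) pairs and using information gain for step (6). The weather-style training data at the end are illustrative placeholders only.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(samples, attr):
    # Expected reduction in entropy after partitioning samples on `attr`.
    labels = [y for _, y in samples]
    remainder = 0.0
    for v in {x[attr] for x, _ in samples}:
        subset = [y for x, y in samples if x[attr] == v]
        remainder += len(subset) / len(samples) * entropy(subset)
    return entropy(labels) - remainder

def generate_decision_tree(samples, attribute_list):
    labels = [y for _, y in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:        # steps (2)-(3): all samples in one class
        return labels[0]
    if not attribute_list:           # steps (4)-(5): majority voting
        return majority
    best = max(attribute_list, key=lambda a: info_gain(samples, a))  # step (6)
    node = {}                        # step (7): node labeled by the test attribute
    for v in {x[best] for x, _ in samples}:   # steps (8)-(10): partition samples
        si = [(x, y) for x, y in samples if x[best] == v]
        node[(best, v)] = (majority if not si else   # steps (11)-(13)
                           generate_decision_tree(
                               si, [a for a in attribute_list if a != best]))
    return node

data = [({"outlook": "sunny", "windy": False}, "play"),
        ({"outlook": "sunny", "windy": True},  "stay"),
        ({"outlook": "rain",  "windy": True},  "stay"),
        ({"outlook": "rain",  "windy": False}, "play")]
print(generate_decision_tree(data, ["outlook", "windy"]))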
REFERENCES
Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber. Morgan Kaufmann Publishers.