Introduction
UNIT 1 - Chapter 1
Ranjit Reddy M M. Tech., (Ph. D)
Associate Professor
Department of Computer Science & Engineering
Contents/Topics
 What Is Data Mining?
 Motivating Challenges
 The Origins of Data Mining
 Data Mining Tasks
 Summary
January 31, 2016 Data Mining: Concepts and Techniques 3
What Is Data Mining?
 Data Mining: (knowledge discovery from data)
 Extracting or “Mining” knowledge from large amounts of data.
 Searching for knowledge in your data
 Alternative names:
 Knowledge discovery (mining) in databases (KDD)
 knowledge extraction
 data/pattern analysis
 data archeology
 data dredging
 information harvesting
 business intelligence, etc.
Knowledge Discovery (KDD) Process steps
 1. Data cleaning (to remove noise and inconsistent data)
 2. Data integration (where multiple data sources, such as flat files,
spreadsheets, and relational tables, may be combined)
 3. Data selection (where data relevant to the analysis task are retrieved from the
database)
 4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations, for
instance)
 5. Data mining (an essential process where intelligent methods are applied in
order to extract data patterns)
 6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge based on some interestingness measures)
 7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user)
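The seven steps above can be sketched as a chain of small functions. This is only an illustrative toy, not a method taken from the slides: the two data sources, the record layout (customer, category, amount), the function names, and the `min_total` threshold are all assumptions made for the example.

```python
from collections import defaultdict

# Two hypothetical data sources with (customer, category, amount) records.
store_sales = [("alice", "books", 30.0), ("bob", "books", None), ("carol", "toys", 12.0)]
web_sales = [("alice", "books", 25.0), ("dave", "books", -5.0), ("carol", "books", 8.0)]

def clean(rows):
    # 1. Data cleaning: drop records with a missing or negative amount.
    return [r for r in rows if r[2] is not None and r[2] >= 0]

def integrate(*sources):
    # 2. Data integration: combine several sources into one list.
    return [row for source in sources for row in source]

def select(rows, category):
    # 3. Data selection: keep only records relevant to the analysis task.
    return [r for r in rows if r[1] == category]

def transform(rows):
    # 4. Data transformation: aggregate amounts per customer.
    totals = defaultdict(float)
    for customer, _, amount in rows:
        totals[customer] += amount
    return dict(totals)

def mine(totals):
    # 5. Data mining: a toy "pattern", customers ranked by total spend.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(patterns, min_total):
    # 6. Pattern evaluation: keep patterns that pass an interestingness threshold.
    return [p for p in patterns if p[1] >= min_total]

def present(patterns):
    # 7. Knowledge presentation: format the result for the user.
    return [f"{customer}: {total:.2f}" for customer, total in patterns]

combined = integrate(clean(store_sales), clean(web_sales))
report = present(evaluate(mine(transform(select(combined, "books"))), min_total=20.0))
print(report)
```

Real systems replace each function with far heavier machinery, but the order of the calls mirrors the order of the steps.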
Architecture of typical data mining system
 Database, data warehouse, World Wide Web, or other information
repository: This is one or a set of databases, data warehouses, spreadsheets, or
other kinds of information repositories. Data cleaning and data integration
techniques may be performed on the data.
 Database or data warehouse server: The database or data warehouse server is
responsible for fetching the relevant data, based on the user’s data mining request.
 Knowledge base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. Such knowledge can include
concept hierarchies, used to organize attributes or attribute values into different
levels of abstraction. Knowledge such as user beliefs, which can be used to assess a
pattern’s interestingness based on its unexpectedness, may also be included. Other
examples of domain knowledge are additional interestingness constraints or
thresholds, and metadata (e.g., describing data from multiple heterogeneous
sources).
 Data mining engine: Consists of a set of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction,
cluster analysis, outlier analysis, and evolution analysis.
 Pattern evaluation module: This component typically employs interestingness
measures and interacts with the data mining modules so as to focus the search
toward interesting patterns. It may use interestingness thresholds to filter out
discovered patterns. Alternatively, the pattern evaluation module may be integrated
with the mining module.
 User interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining
query or task, providing information to help focus the search, and performing
exploratory data mining based on the intermediate data mining results. This
component allows the user to browse database and data warehouse schemas or
data structures, evaluate mined patterns, and visualize the patterns in different
forms.
Motivating Challenges
 Scalability:
 Datasets with sizes of gigabytes, terabytes or even petabytes
 Massive datasets cannot fit into main memory
 Need to develop scalable data mining algorithms to mine massive datasets
 Scalability can also be improved by using sampling or developing parallel and
distributed algorithms.
 High Dimensionality:
 Datasets with hundreds or thousands of attributes.
 Example: a dataset that contains measurements of temperature at various
locations.
 Traditional data analysis techniques were developed for low-dimensional
data and often do not work well on high-dimensional data.
 Need to develop data mining algorithms to handle high dimensionality.
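The sampling idea mentioned under Scalability can be illustrated with reservoir sampling, which draws a fixed-size uniform sample from a stream in one pass without loading the whole dataset into memory. The slides do not name a specific sampling technique, so the choice of reservoir sampling, the function name, and the parameters here are assumptions for illustration.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir

# The range object stands in for a dataset too large to hold in memory.
sample = reservoir_sample(range(1_000_000), k=5, seed=0)
print(sample)
```

Only `k` items are ever held at once, so memory use does not grow with the size of the stream.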
 Heterogeneous and Complex Data:
 Traditional data analysis methods deal with datasets whose attributes are all of
the same type (continuous or categorical).
 Complex datasets contain images, video, text, etc.
 Need to develop mining methods to handle complex datasets.
 Data Ownership and Distribution:
 Data is not stored in one location or owned by one organization.
 Data is geographically distributed among resources belonging to multiple
entities.
 Need to develop distributed data mining algorithms to handle distributed
datasets.
 Key challenges:
 How to reduce the amount of communication needed for distributed data.
 How to effectively consolidate the data mining results from multiple sources
 How to address data security issues.
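One simple way to picture the first two challenges is to let each site summarize its own data and ship only the summary. The sketch below is an assumption-laden toy: the two sites, their transactions, and the function names are hypothetical, and item counting stands in for whatever mining result would actually be exchanged.

```python
from collections import Counter

def local_counts(transactions):
    """Count, at one site, how many transactions contain each item."""
    counts = Counter()
    for items in transactions:
        counts.update(set(items))  # count each item once per transaction
    return counts

def consolidate(per_site_counts):
    """Merge the per-site summaries into global counts."""
    total = Counter()
    for counts in per_site_counts:
        total.update(counts)  # Counter.update adds counts together
    return total

# Hypothetical transactions held at two separate sites.
site_a = [["Bread", "Milk"], ["Bread", "Eggs"]]
site_b = [["Milk", "Eggs"], ["Bread", "Milk"]]

global_counts = consolidate([local_counts(site_a), local_counts(site_b)])
print(global_counts)
```

Each site sends one small table of counts instead of its raw transactions, which reduces communication and keeps the raw records local.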
 Non-Traditional Analysis:
 The traditional statistical approach is based on a hypothesize-and-test paradigm.
 A hypothesis is proposed, an experiment is designed to gather the data, and then
the data is analyzed with respect to the hypothesis.
 This process is extremely labor-intensive.
 Need to develop mining methods to automate the process of hypothesis
generation and evaluation.
The Origins of Data Mining
 Data mining draws on ideas such as:
 Sampling, estimation, and hypothesis testing from statistics.
 Search algorithms, modeling techniques, and learning theories from artificial
intelligence, machine learning, and pattern recognition.
 Database systems are needed to provide support for efficient storage, indexing,
and query processing.
 Techniques from parallel computing address the massive size of some datasets.
 Distributed computing techniques are used to gather information from different
locations.
Data Mining Tasks
 Data mining tasks are divided into two major categories:
 Predictive Tasks: Predict the value of a particular attribute based on the values
of other attributes. The predicted attribute is known as the target or dependent
variable, and the attributes used for prediction are known as explanatory or
independent variables.
 Descriptive Tasks: Characterize the general properties of the data in the
database (correlations, trends, clusters, trajectories, and anomalies).
 Four of the core data mining tasks:
 Classification & Regression
 Association Analysis
 Cluster Analysis
 Anomaly Detection
Data Mining Functionalities
 Predictive Modeling: Building a model for the target variable as a function of the
explanatory variables.
 Classification: Used for discrete target variables.
 Ex: Predicting whether a web user will make a purchase at an online
bookstore (the target variable is binary-valued).
 Regression: Used for continuous target variables.
 Ex: Forecasting the future price of a stock (price is a continuous-valued attribute).
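The difference between the two kinds of predictive model can be sketched with two very small learners. Everything here is assumed for illustration: the (day, price) and (minutes browsing, purchase) data are made up, and ordinary least squares and a nearest-centroid rule are simply convenient stand-ins, not methods prescribed by the slides.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def fit_centroids(xs, labels):
    """Nearest-centroid classifier: store the mean feature value of each class."""
    return {c: mean(x for x, lab in zip(xs, labels) if lab == c) for c in set(labels)}

def classify(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

# Regression: the target (price) is continuous. Hypothetical data.
days = [1, 2, 3, 4, 5]
prices = [10.0, 12.0, 14.0, 16.0, 18.0]
slope, intercept = fit_line(days, prices)
forecast = slope * 6 + intercept  # predicted price on day 6

# Classification: the target (bought or not) is binary. Hypothetical data.
minutes = [1, 2, 3, 10, 12, 15]
bought = [False, False, False, True, True, True]
centroids = fit_centroids(minutes, bought)

print(forecast, classify(centroids, 4), classify(centroids, 9))
```

The regression model returns a number on a continuous scale, while the classifier can only return one of the labels it was trained on.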
 Association Analysis:
 Used to discover patterns that describe strongly associated features in the data.
 The discovered patterns are typically represented in the form of implication rules or
feature subsets.
 The table below illustrates data collected at a supermarket.
 Association analysis can be applied to find items that are frequently bought together
by customers.
 A discovered association rule is {Diapers} → {Milk} (customers who buy diapers
also tend to buy milk).
Market Basket Analysis
Transaction ID   Items
1                {Bread, Butter, Diapers, Milk}
2                {Coffee, Sugar, Cookies, Salmon}
3                {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4                {Bread, Butter, Salmon, Chicken}
5                {Eggs, Bread, Butter}
6                {Salmon, Diapers, Milk}
7                {Bread, Tea, Sugar, Eggs}
8                {Coffee, Sugar, Chicken, Eggs}
9                {Bread, Diapers, Milk, Salt}
10               {Tea, Eggs, Cookies, Diapers, Milk}
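The strength of the rule {Diapers} → {Milk} can be checked against the ten market basket transactions with the two standard measures for association rules, support and confidence. The slides do not define these measures, so the definitions in the comments and the function names are supplied here as assumptions.

```python
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

print(support({"Diapers", "Milk"}, transactions))       # 0.5
print(confidence({"Diapers"}, {"Milk"}, transactions))  # 1.0
```

Five of the ten transactions contain both items, and every transaction that contains diapers also contains milk, which is why the rule stands out in this dataset.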
 Cluster Analysis:
 A cluster is a group of similar objects.
 Objects are clustered or grouped based on the principle of maximizing the
intra-class similarity (within a cluster) and minimizing the inter-class
similarity (cluster to cluster).
Document Clustering
 Each article is represented as a set of word-frequency pairs (w, c), where w is a
word and c is the number of times the word appears in the article.
 There are 2 natural clusters in the dataset below.
 The first cluster consists of the first 3 articles (news about the economy).
 The second cluster contains the last 3 articles (news about health care).
Article   Words
1         Dollar: 1, Industry: 4, Country: 2, Loan: 3, Deal: 2, Government: 2
2         Machinery: 2, Labor: 3, Market: 4, Industry: 2, Work: 3, Country: 1
3         Domestic: 4, Forecast: 2, Gain: 1, Market: 3, Country: 2, Index: 3
4         Patient: 4, Symptom: 2, Drug: 3, Health: 2, Clinic: 2, Doctor: 2
5         Death: 2, Cancer: 4, Drug: 3, Public: 4, Health: 3, Director: 2
6         Medical: 2, Cost: 3, Increase: 2, Patient: 2, Health: 3, Care: 1
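The two natural clusters in the six articles can be recovered with a very small procedure: measure how similar each pair of word-count vectors is, then link articles whose similarity is above a threshold. Cosine similarity, the zero threshold, and the linking rule are assumptions chosen for this sketch; the slides do not specify a similarity measure or a clustering algorithm.

```python
from math import sqrt

articles = {
    1: {"Dollar": 1, "Industry": 4, "Country": 2, "Loan": 3, "Deal": 2, "Government": 2},
    2: {"Machinery": 2, "Labor": 3, "Market": 4, "Industry": 2, "Work": 3, "Country": 1},
    3: {"Domestic": 4, "Forecast": 2, "Gain": 1, "Market": 3, "Country": 2, "Index": 3},
    4: {"Patient": 4, "Symptom": 2, "Drug": 3, "Health": 2, "Clinic": 2, "Doctor": 2},
    5: {"Death": 2, "Cancer": 4, "Drug": 3, "Public": 4, "Health": 3, "Director": 2},
    6: {"Medical": 2, "Cost": 3, "Increase": 2, "Patient": 2, "Health": 3, "Care": 1},
}

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def group(articles, threshold=0.0):
    """Link an article to every existing cluster that holds a sufficiently similar article."""
    clusters = []
    for article_id, words in articles.items():
        linked = [c for c in clusters
                  if any(cosine(words, articles[other]) > threshold for other in c)]
        merged = {article_id}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters

print(group(articles))
```

In this dataset no economy article shares a word with a health care article, so every cross-cluster similarity is exactly zero and even a zero threshold separates the groups; real text would need a higher threshold or a proper clustering algorithm.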
 Anomaly detection:
 The task of identifying observations whose characteristics are significantly different
from the rest of the data.
 Such observations are known as anomalies or Outliers.
 A good anomaly detector must have a high detection rate and a low false alarm rate.
 Applications: detection of fraud, network intrusions, etc.
 Ex: Credit Card Fraud Detection:
 A Credit Card Company records the transactions made by every credit card holder,
along with the personal information such as credit limit, age, annual income and
address.
 When a new transaction arrives, it is compared against the profile of the user.
 If the characteristics of the transaction are very different from the previously
created profile, then the transaction is flagged as potentially fraudulent.
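The profile comparison described above can be sketched with a simple statistical rule. This is a minimal illustration under stated assumptions: the profile is reduced to the mean and standard deviation of past transaction amounts, the history values are made up, and the z-score threshold of 3 is an arbitrary choice rather than anything given in the slides.

```python
from statistics import mean, stdev

def build_profile(amounts):
    """Summarize a cardholder's past transaction amounts."""
    return {"mean": mean(amounts), "stdev": stdev(amounts)}

def is_suspicious(profile, amount, z_threshold=3.0):
    """Flag a transaction whose amount is far from the cardholder's usual range."""
    z = abs(amount - profile["mean"]) / profile["stdev"]
    return z > z_threshold

# Hypothetical past transaction amounts for one cardholder.
history = [42.0, 55.5, 38.0, 61.0, 47.5, 52.0]
profile = build_profile(history)

print(is_suspicious(profile, 50.0))   # a typical amount
print(is_suspicious(profile, 900.0))  # far outside the usual range
```

Lowering the threshold raises the detection rate but also raises the number of false alarms, which is the trade-off a good detector has to balance.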