Comparison Between Data Mining, Web Mining and Text Mining
Outline
•Data Mining
•Text Mining
•Web Mining
•Differences
•Similarities
•Shared Techniques
•Conclusion
DATA MINING
OVERVIEW
Data Mining: The task of discovering interesting patterns from large amounts of data.
Why Data Mining
Data, data everywhere, yet...
I can’t find the data I need
I can’t get the data I need
I can’t remember the data I found
I can’t use the data I found
Data format
• Text file –
• Dataset is a text file containing a transaction per line.
• Table format –
• Dataset is stored in a two-column table.
• Custom format –
• Many data mining packages use a custom format for the input data; one example is ARFF (Attribute-Relation File Format).
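As a rough illustration of the text file format described above, the sketch below reads one transaction per line into a list of item sets. The whitespace separator, the function name `read_transactions`, and the sample items are assumptions made for this example; real datasets may use commas or another delimiter.

```python
from io import StringIO


def read_transactions(lines):
    """Parse one transaction per line; items are assumed to be whitespace-separated."""
    transactions = []
    for line in lines:
        items = line.split()
        if items:  # skip blank lines
            transactions.append(set(items))
    return transactions


# Hypothetical sample data standing in for a text file on disk.
sample = StringIO("bread milk\nbread butter jam\n\nmilk\n")
transactions = read_transactions(sample)
print(len(transactions), "transactions read")
```

The same function accepts an open file handle, since it only iterates over lines.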
Data storage
• Most of the data used in data mining solutions is stored in a data warehouse.
• A Data Warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision making
process.
Potential Application
Database analysis and decision support
– Market analysis and management
• target marketing, customer relation management, market basket
analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection and management
Process of Data Mining
• Data Integration –
• where multiple data sources may be combined.
• Data Selection –
• where data relevant to the analysis task are retrieved from the database.
• Data Cleaning –
• to remove noise and inconsistent data.
• Data Transformation
• where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations.
• Data Mining
• an essential process where intelligent methods are applied in order to extract data patterns.
• Pattern Evaluation
• to identify the truly interesting patterns representing knowledge, based on some interestingness measures.
• Knowledge Presentation
• where visualization and knowledge representation techniques are used to present the mined knowledge to the user.
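The preprocessing steps listed above (integration, selection, cleaning, transformation) can be sketched as a chain of small functions. This is only a minimal sketch: the record layout, the two sources, and the function names are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical records from two sources: (customer, region, amount).
store_sales = [("ann", "north", 20.0), ("bob", "north", None)]
web_sales = [("ann", "north", 5.0), ("cy", "north", 12.5), ("dee", "east", 7.0)]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [row for source in sources for row in source]


def select(rows, region):
    """Data selection: keep only rows relevant to the analysis task."""
    return [row for row in rows if row[1] == region]


def clean(rows):
    """Data cleaning: drop rows whose amount is missing."""
    return [row for row in rows if row[2] is not None]


def transform(rows):
    """Data transformation: aggregate amounts per customer."""
    totals = defaultdict(float)
    for customer, _region, amount in rows:
        totals[customer] += amount
    return dict(totals)


totals = transform(clean(select(integrate(store_sales, web_sales), "north")))
print(totals)
```

The aggregated totals would then be the input to the mining step itself.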
Techniques Involved in Data Mining
• Classification and prediction
• Finding a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. Typical algorithms include support vector machines (SVMs) and decision trees.
• Cluster Analysis
• Analyzes data objects without consulting a known class label; the k-means algorithm is a common choice.
• An example is market segmentation.
• Outlier Analysis
• Detects data objects that do not comply with the general behavior or model of the data.
• It is used in fraud detection.
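One simple way to illustrate outlier analysis is a z-score test, which flags values that lie far from the mean in units of standard deviation. This is only one of many possible methods, not necessarily the one the slides have in mind; the threshold of 2.0 and the transaction amounts are assumptions for the example.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 500]
print(zscore_outliers(amounts))
```

In a fraud detection setting, the flagged transactions would be passed on for closer inspection rather than rejected outright.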
• Association Analysis
• Discovery of interesting associations and correlations within data.
• Uses the Apriori and FP-growth algorithms.
• An example is market basket analysis.
• Evolution Analysis
• Describes and models regularities or trends for objects whose behavior changes over time.
• An example is forecasting a stock market index using time series analysis.
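The two basic measures behind association analysis, support and confidence, can be shown with a small market basket example. The sketch below counts item pairs by brute force; it is not an implementation of Apriori or FP-growth, which exist to avoid this exhaustive counting on large data. The baskets and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def pair_support(baskets):
    """Support of each item pair: the fraction of baskets containing both items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: count / len(baskets) for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(baskets)
print(support[("bread", "milk")], confidence(baskets, "bread", "milk"))
```

A rule such as bread -> milk is usually kept only if both its support and its confidence exceed chosen minimum thresholds.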
Data Mining Tools
• A variety of tools support the techniques and processes of data mining, some of which are:
• Oracle Data Mining
• Microsoft SQL Server
• DBMiner
Conclusion
TEXT MINING
Overview
• Text mining is the extraction of useful information from a collection of documents.
• Aside from multimedia, textual data makes up a huge share of the data found on the World Wide Web (WWW).
• The phrase “text mining” is used to denote any system that analyzes large quantities of natural language text by parsing it and then detecting lexical or linguistic usage patterns in an attempt to extract correct information (Sebastian, 2002).
Data format
• Textual data can be unstructured, i.e. free-form text with no prescribed format.
• It can also be weakly or semi-structured, i.e. it adheres to some pre-specified format. Examples are scientific papers and business reports.
Data storage
• In text mining, data are stored in a document warehouse or document collection.
• A document is a unit of discrete textual data within a collection, representing some real-world document such as an email, research paper, newspaper article, or business report.
• A document collection is a grouping or repository for business intelligence that holds a variety of document types from different sources, from which the salient features of documents are automatically extracted and stored.
• It can be static or dynamic (growing over time).
Techniques involved in Text mining
• Information Retrieval
• Users find documents that satisfy their information needs.
• Uses the vector space model: documents and the query are represented as vectors, and a distance or similarity between them is computed (Euclidean distance, or more commonly cosine similarity).
• Categorization
• A kind of supervised learning that sorts volumes of documents into “topics” or “themes”.
• Techniques such as the naïve Bayes classifier, nearest-neighbor classifier, decision trees, and support vector machines can be used to categorize text.
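The vector space model used in information retrieval can be sketched with raw term counts as vector components. The example below ranks documents against a query using cosine similarity and also provides the Euclidean distance for comparison. The documents, the whitespace tokenizer, and the function names are assumptions for illustration; practical systems usually add TF-IDF weighting and better tokenization.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean(a, b):
    """Euclidean distance between two term-count vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in terms))


# Hypothetical document collection and query.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "web mining studies links and usage",
    "d3": "cooking pasta at home",
}
query = vectorize("data mining")
ranked = sorted(docs, key=lambda name: cosine(query, vectorize(docs[name])), reverse=True)
print(ranked)
```

Cosine similarity ignores document length, which is why it is often preferred over Euclidean distance when documents vary a lot in size.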
• Clustering
• Used to find groups of documents with similar content.
• The k-means algorithm is commonly used here.
• NLP / Computational Linguistics
• Gives the computer the ability to process, analyze, and deduce patterns from language, enabling algorithms for problems such as part-of-speech tagging.
• The hidden Markov model (HMM) is one algorithm used here.
• Summarization
• Text summarization reduces the length and detail of a document while retaining its most important points and general meaning.
• Visualization
• Individual documents or groups of documents are represented visually: text flags show the document category and colors show density. Libraries such as scikit-learn, SciPy, NumPy, pandas, and Matplotlib can be used for this.
Text Mining Tools
• A variety of tools support the techniques and processes of text mining, ranging from commercial to open source.
• Commercial Tools
• SAS Text Miner, Poly vista.
• Open Source Tools
• RapidMiner text mining, GATE, NLTK, OpenNLP
• Comprehensive lists of tools can be found here:
• https://en.wikipedia.org/wiki/List_of_text_mining_software
• http://www.kdnuggets.com/software/text.html
Conclusion
Text mining is an important aspect of business intelligence that helps users and enterprises analyze stored text more effectively, so as to make better decisions, improve customer satisfaction, and gain competitive advantage. Because much business information exists only as text, it can provide insight that data mining on structured data alone does not, and so extracts more useful data for business intelligence.
WEB MINING
Web Data
• Web pages
• Usage Data
• Intra-page structures
• Inter-page structures.
• Supplemental Data
• Profiles
• Registration Information
• Cookies
Types of Web Mining
• Web Content Mining
• Web Usage Mining
• Web Structure Mining
Web Content Mining
• This covers web page content mining and search result mining.
• It extends the work of basic search engines.
• It helps in discovering useful information from web data:
• Text, multimedia, metadata and hyperlinks
• Its techniques include:
• Classification
• Clustering
• Association
Web Structure Mining
• This type mines the structure of the web, such as links and graphs.
• Techniques used include PageRank and CLEVER.
• It can be combined with content mining to retrieve important information and patterns.
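PageRank, named above as a web structure mining technique, can be approximated by repeatedly redistributing rank along links. The sketch below is a simplified power iteration on a tiny hypothetical link graph; the damping factor of 0.85, the fixed iteration count, and the handling of pages without out-links are common choices assumed for this example, not details given in the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph; every link target is also a key of the dictionary.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

Page "c" collects links from three other pages, so it ends up with the highest rank, while "d", which nothing links to, keeps only the baseline share.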
Web Usage Mining
• Web Usage Mining mines usage data from the web, for example social networks and online blogs, in order to understand and better serve the needs of web-based applications.
• Web usage mining can also be classified by the kind of data involved:
• Web server data
• Application server data
• Application level data
• The technique used in WUM is clustering, which produces user clusters and page clusters.
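Before any clustering, web server data usually has to be parsed into per-user and per-page records. The sketch below assumes log lines in the Common Log Format and uses a deliberately simplified regular expression; the sample lines, host addresses, and paths are invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Simplified pattern for Common Log Format lines (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

# Hypothetical sample log.
sample_log = """\
10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /home HTTP/1.1" 200 512
10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /products HTTP/1.1" 200 1024
10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET /home HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2024:10:01:09 +0000] "GET /missing HTTP/1.1" 404 128
"""


def parse(log_text):
    """Yield one dictionary per log line that matches the pattern."""
    for line in log_text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()


page_hits = Counter()               # successful requests per page
paths_by_host = defaultdict(list)   # click path per visitor
for entry in parse(sample_log):
    if entry["status"] == "200":
        page_hits[entry["path"]] += 1
    paths_by_host[entry["host"]].append(entry["path"])

print(page_hits.most_common(1), dict(paths_by_host))
```

The per-host paths are the raw material for user clusters, and the per-page counts for page clusters.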
DIFFERENCES
Data and Data Format
• Data Mining –
• The data are mostly categorical and numerical, but multimedia data are also included.
• The data are held in a text file, a tabular format, or a custom format.
• They are mostly structured data.
• Text Mining –
• The data are in text formats, for example MS Word or XML files.
• Textual data are mostly unstructured or weakly structured.
• Web Mining –
• Web logs, cookies, and web pages are the data used.
• The data come from the WWW and can be structured, unstructured, or semi-structured, depending on which type of web mining is being done.
Data Storage
• Data Mining –
• After integration, cleaning, and transformation, the data are stored in a data warehouse, and the mining process continues from there.
• Text Mining –
• Textual data held in documents are collected and stored in a document warehouse, which can also be called a corpus.
• Web Mining –
• The data are stored on the web, in web log files on a server.
Approaches in Mining
• Data Mining –
1. Classification according to the kinds of databases mined.
2. Classification according to the kinds of knowledge mined.
3. Classification according to the kinds of techniques utilized.
4. Classification according to the application adapted.
• Text Mining –
1. The keyword-based approach, where the input is a set of keywords or terms in the documents.
2. The tagging approach, where the input is a set of tags.
3. The information-extraction approach, where the input is semantic information, such as events, facts, or entities uncovered by information extraction.
• Web Mining –
1. It is based on Internet and agent technologies that utilize soft computing and fuzzy logic techniques.
2. After the agents find their targets, they are programmed to apply the mining technique.
3. The agents then analyse the gathered data, using data mining techniques, to understand the habits and patterns found on the WWW.
Peculiar Characteristics
• Data Mining –
• Incorporates many algorithms for both prediction and description.
• Text Mining –
• Relies on NLP processes, which are peculiar to this kind of mining.
• Web Mining –
• The data themselves are the distinguishing feature: they sometimes grow dynamically (change over time).
SIMILARITIES
Techniques
Data, text, and web mining all share the same mining techniques, such as classification, clustering, and association.
Clustering - Text mining uses clustering to find groups of documents with similar content. Data mining uses it to place data elements into related groups without advance knowledge of the group definitions. Web usage mining uses it to form user clusters or page clusters.
Classification (supervised technique) - Finding a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.
Association rule - Discovery of interesting associations and correlations within data, using the Apriori and FP-growth algorithms.
Data Processing
Data mining, text mining, and web mining all share the same data processing stages:
Data selection - where data relevant to the analysis task are retrieved from the database.
Data cleaning - to remove noise and inconsistent data.
Data integration - where multiple data sources may be combined.
Data transformation - where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
Data
Data mining, text mining, and web mining all accept large volumes of data and involve an integration of techniques, unlike machine learning systems that are not designed to handle large amounts of data.
Data mining, text mining, and web mining are all concerned with finding new data or knowledge previously unknown to the system.
Data mining, text mining, and web mining are also known to be viable for retrieving known data, documents, and web content effectively and efficiently.
SHARED TECHNIQUE
Technique: Clustering
• Data Mining –
• Clustering partitions a set of data (or objects) into a set of meaningful sub-classes, called clusters. It helps users understand the natural grouping or structure in a data set. It is used either as a stand-alone tool to get insight into data distribution or as a preprocessing step for other algorithms.
• Text Mining –
• Clustering in text mining is done by first converting text data to numeric data using a document-term matrix. K-means clustering is then applied.
• Web Mining –
• Clustering is used in web mining (web usage mining) to divide the data into user clusters and page clusters.
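The text mining case above, a document-term matrix followed by k-means, can be sketched in plain Python. The documents are hypothetical, and the initial centroids are fixed to two chosen documents so that the run is reproducible; library implementations normally pick starting centroids randomly or with a seeding heuristic.

```python
def document_term_matrix(docs):
    """Build a vocabulary and a matrix of raw term counts, one row per document."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    matrix = [[doc.lower().split().count(word) for word in vocab] for doc in docs]
    return vocab, matrix


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init_indices, iterations=10):
    """Basic k-means; the initial centroids are the points at init_indices."""
    centroids = [list(points[i]) for i in init_indices]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical documents: two about data mining, two about the web.
docs = [
    "data mining patterns",
    "mining data warehouse",
    "web links pages",
    "web pages usage",
]
vocab, matrix = document_term_matrix(docs)
labels = kmeans(matrix, init_indices=[0, 2])
print(labels)
```

The same `kmeans` function would work on user or page vectors derived from web logs, which is how the technique carries over to web usage mining.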
Technique: Classification
• Data Mining –
• Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data.
• Text Mining –
• Text classification makes use of Supervised Latent Dirichlet Allocation (SLDA) to classify texts into a specified set or class. Texts can be classified according to subject or other needed attributes.
• Web Mining –
• Web usage mining makes use of this technique to classify usage patterns on the web, for example by looking at the entry and exit points of the user.
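To make the idea of text classification concrete, the sketch below uses a multinomial naïve Bayes classifier with add-one smoothing. Note that this is a swapped-in technique: it is the naïve Bayes approach mentioned earlier in the deck, not SLDA, which is considerably more involved. The training sentences, labels, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labelled training documents.
training = [
    ("team wins the match", "sports"),
    ("player scores a goal", "sports"),
    ("stock market falls", "finance"),
    ("bank raises interest rates", "finance"),
]
model = train(training)
print(predict(model, "the team scores"), predict(model, "market interest rates"))
```

The same classifier could be applied to web usage data by treating each session's sequence of visited pages as the "document".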
  • 1. Comparison Between Data Mining, Web Mining and Text Mining
  • 2. Outline •Data Mining •Text Mining •Web Mining •Differences •Similarities •Shared Techniques •Conclusion
  • 4. OVERVIEW Data Mining: The task of discovering interesting patterns from large amounts of data.
  • 5.
  • 6. Why Data Mining Data, Data everywhere yet……. I can’t find the data I need I can’t get the data I need I can’t remember the data I found I can’t use the data I found
  • 7. Data format • Text file – • Dataset is a text file containing a transaction per line. • Table format – • Dataset is stored in a two columns table. • Custom Format – • Many data mining packages use a custom format for the input data and example of this is ARFF(Attribute relation File Format).
  • 8. Data storage • Most of our data’s used in Data Mining solutions are stored in a Data warehouse • A Data Warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.
  • 9. Potential Application Database analysis and decision support – Market analysis and management • target marketing, customer relation management, market basket analysis, cross selling, market segmentation – Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis – Fraud detection and management
  • 10. Process of Data Mining
  • 11. Process of Data Mining • Data Integration – • where multiple data source may be combined. • Data Selection – • where data relevant to the analysis task are retrieved from the database. • Data Cleaning- • to remove noise and inconsistent data
  • 12. Process of Data Mining • Data Transformation • where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance. • Data Mining • an essential process where intelligent methods are applied in order to extract data patterns. • Pattern Evaluation • (to identify the truly interesting patterns representing knowledge based on some interestingness measures • Knowledge Presentation • where visualization and knowledge representation techniques are used to present the mined knowledge to the user
  • 13. Techniques Involved in Data Mining • Classification and prediction • Finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown and it use SVM, trees, decision tree etc for algorithm • Cluster Analysis • Analyzes data objects without consulting a known class label and uses K mean algorithm. • An example is the Market segmentation. • Outlier Analysis • Detect data objects that do not comply with the general behavior or model of the data. • It has its usage in fraud detection.
  • 14. Techniques Involved in Data Mining • Association Analysis • Discovery of interesting associations and correlations within data • Uses Apriori and FP- growth algorithm • Example is the Market Basket Analysis • Evolution Analysis • Describes and models regularities or trends for objects whose behavior changes over time. • Forecasting stock market index using Time series analysis
  • 15. Data Mining Tools • There are variety of tools to all techniques or processes related to Data mining and some of which are; • Oracle Data Mining • Microsoft SQL SERVER • DBMiner
  • 18. Overview • Text miming is the extraction of useful information from a collection of documents. • Textual data makes up huge amounts of data found on World Wide Web WWW, aside from multimedia. • The phrase “text mining” is used to denote any system that analyzes large quantities of natural language text by parsing it and then detects lexical or linguistic usage patterns in an attempt to extract correct information (Sebastian, 2002).
  • 19. Data format • Textual data can be unstructured i.e. they are in a free-style text with no format just texts all through • It could be weakly/ semi structured i.e. adheres to some pre-specified format. Example are scientific papers, business reports.
  • 20. Data storage • In text mining, data are stored in Document warehouse or document collection. • Document is a unit of discrete textual data within a collection, representing some real word documents like emails, research paper, newspaper, business reports. • Document collection is a grouping or a repository for Business Intelligence keeping variety of document types from different sources to automatically extract and store the salient features of documents. • It could be static or dynamic (grows over time)
  • 21.
  • 22. Techniques involved in Text mining • Information Retrieval • Users find documents that satisfies their information needs. • Utilizes techniques of vector space model by calculating Euclidian distance between vectors representing documents and query. • Categorization • It is a kind of supervised learning that categorizes volumes of documents into “topics” or “themes” • Techniques like Naïve Bayesian classifier, Nearest Neighbor classifier, Decision Tree, and Support Vector Machines can be used to categorize text
  • 23. Techniques involved in Text mining • Clustering • The technique used in order to find groups of documents with similar content. • The K-means algorithm is used here • NLP/ computational Linguistics. • Technique of text mining in which computer gets the ability to process, analyze and deduce patterns from language enabling algorithms for problems like parts of speech tagging e.t.c. • The Hidden Markov Model algorithm is used here.
  • 24. Techniques involved in Text mining • Summarization. • Text summarization is to reduce the length and detail of a document while retaining most important points and general meaning • Visualization • To represent individual documents or groups of documents text flags are used to show document category and to show density colors are used. Like scikit-learn, scipy, numpy, pandas, matplotlib.
  • 25. Text Mining Tools • There are variety of tools to all techniques or processes related to Text mining and they range from Commercial tools to Open source tools • Commercial Tools • SAS text miner, Poly vista. • Open Source Tools • Rapid Miner text mining, GATE, NLTK, OpenNLP • Comprehensive list of tools can be found here • https://en.wikipedia.org/wiki/List_of_text_mining_software • http://www.kdnuggets.com/software/text.html
  • 26. Conclusion. Text Mining is an important aspect of Business Intelligence that helps users and enterprises in analyzing stored text in a better way so as to make better decisions, improve customer satisfaction and gain competitive advantage. It is better than data mining as it provides deeper insight into the expanding business domain and extracts more fruitful data for business intelligence.
  • 28. Web Data • Web pages • Usage Data • Intra-page structures • Inter-page structures • Supplemental Data • Profiles • Registration Information • Cookies
  • 29. Types of Web Mining • Web Content Mining • Web Usage Mining • Web Structure Mining
  • 30. Web Content Mining • Covers web page content mining and search result mining. • Extends the work of basic search engines. • Helps discover useful information from web data: text, multimedia, metadata and hyperlinks. • Its techniques include • Classification • Clustering • Association
  • 31. Web Structure Mining • Mines the structure of the web, such as links and graphs. • Techniques used include PageRank and CLEVER. • It can be combined with content mining to retrieve important information and patterns.
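As an illustration of link-based structure mining, the following sketch computes PageRank by power iteration on a tiny link graph. The damping factor of 0.85 is the conventional default; the graph, the function name and the handling of dangling pages are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outgoing links): spread its rank evenly
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
print(top_page)  # C
```

Page C comes out on top because three of the four pages link to it, while D, which nobody links to, keeps only the random-jump share.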
  • 32. Web Usage Mining • Web Usage Mining mines web data, such as social networks and online blogs, in order to understand and better serve the needs of Web-based applications. • It can be further classified by the kind of usage data involved: • Web server data • Application server data • Application level data • The technique used in WUM is clustering, which produces user clusters and page clusters.
  • 34. Data and Data Format • Data Mining: The data are mostly categorical and numerical, though multimedia data are also included. They are stored in a text file, tabular format or custom format, and are mostly structured. • Text Mining: The data are in text formats, for example MS Word or XML files. Textual data are mostly unstructured or weakly structured. • Web Mining: Web logs, cookies and web pages are the data used. The data come from the WWW and can be structured, unstructured or semi-structured, depending on which type of web mining is being done.
  • 35. Data Storage • Data Mining: After integration, cleaning and transformation, the data are stored in a data warehouse, and the mining process continues from there. • Text Mining: The textual data, held in documents, are collected and stored in a document warehouse, also called a corpus. • Web Mining: The data are stored on the web, in web log files on a server.
  • 36. Approaches in Mining • Data Mining: 1. Classification according to the kinds of databases mined. 2. Classification according to the kinds of knowledge mined. 3. Classification according to the kinds of techniques utilized. 4. Classification according to the applications adapted. • Text Mining: 1. The keyword-based approach, where the input is a set of keywords or terms in the documents. 2. The tagging approach, where the input is a set of tags. 3. The information-extraction approach, which takes as input semantic information, such as events, facts, or entities uncovered by information extraction. • Web Mining: 1. It is based on Internet and agent technologies that utilize soft computing and fuzzy logic techniques. 2. After the agents find their targets, they are programmed to apply the mining technique. 3. The agents then analyse the gathered data, using data mining techniques, to understand the habits and patterns found on the WWW.
  • 37. Peculiar Characteristics • Data Mining: Its peculiar characteristic is that it incorporates many algorithms for both prediction and description. • Text Mining: Its peculiar characteristic is the use of NLP processes, which are specific to this mining technique. • Web Mining: Its peculiar characteristic is the data itself, which sometimes grows dynamically (changes over time).
  • 39. Techniques • Data, text and web mining share the same mining techniques, such as classification, clustering and association. Clustering: text mining uses clustering to find groups of documents with similar content; data mining uses it to place data elements into related groups without advance knowledge of the group definitions; and web usage mining uses it to form user clusters or page clusters.
  • 40. Techniques • Classification (supervised technique): finding a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. • Association rule: discovery of interesting associations and correlations within data, using algorithms such as Apriori and FP-Growth.
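The Apriori idea mentioned above (grow itemsets level by level, keeping only those that meet a minimum support) can be sketched as follows. This is a minimal, unoptimized version that finds frequent itemsets only, without deriving rules; the basket data and the support threshold are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset in at least `min_support` transactions."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


# Hypothetical market-basket data
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=3)
print(frequent[frozenset({"diapers", "beer"})])  # 3
```

With a support threshold of 3, the pair {diapers, beer} is frequent, while {bread, milk, diapers} is generated as a candidate but dropped because it occurs in only two baskets.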
  • 41. Data Processing • Data mining, text mining and web mining all share the same data processing stages, such as: • Data selection • Data cleaning • Data integration • Data transformation
  • 42. Data Processing • Data selection: data relevant to the analysis task are retrieved from the database. • Data cleaning: noise and inconsistent data are removed. • Data integration: multiple data sources may be combined. • Data transformation: data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
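The four stages can be illustrated on a toy dataset. Everything in this sketch (the records, the field names, and the choice of filters) is a hypothetical example of what each stage might do, in the order listed on the slide.

```python
from collections import defaultdict

# Two hypothetical sources: sales records and customer profiles
sales = [
    {"customer": "c1", "region": "north", "amount": 120.0},
    {"customer": "c2", "region": "south", "amount": None},   # missing value
    {"customer": "c1", "region": "north", "amount": 80.0},
    {"customer": "c3", "region": "north", "amount": -5.0},   # noisy value
    {"customer": "c4", "region": "south", "amount": 60.0},
]
profiles = {"c1": "retail", "c2": "retail", "c3": "wholesale", "c4": "wholesale"}

# Data selection: keep only records relevant to the analysis (here, one region)
selected = [r for r in sales if r["region"] == "north"]

# Data cleaning: drop records with missing or invalid amounts
cleaned = [r for r in selected if r["amount"] is not None and r["amount"] >= 0]

# Data integration: combine with the second source
integrated = [dict(r, segment=profiles.get(r["customer"], "unknown")) for r in cleaned]

# Data transformation: aggregate into a form suitable for mining
totals = defaultdict(float)
for r in integrated:
    totals[r["segment"]] += r["amount"]

print(dict(totals))  # {'retail': 200.0}
```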
  • 43. Data • Data mining, text mining and web mining all accept large volumes of data and integrate several techniques, unlike machine learning systems that are not designed to handle large amounts of data. • All three aim to find new data or knowledge previously unknown to the system. • All three are also known to be viable for retrieving known data, documents and web content effectively and efficiently.
  • 44. Shared Technique: Clustering • Data Mining: Clustering partitions a set of data (or objects) into a set of meaningful sub-classes, called clusters. It helps users understand the natural grouping or structure in a data set, and is used either as a stand-alone tool to get insight into the data distribution or as a preprocessing step for other algorithms. • Text Mining: Clustering is done by first converting text data to numeric data using a document-term matrix. K-means clustering is used here. • Web Mining: Clustering is used in web usage mining, which divides it into user clusters and page clusters.
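The text mining cell above (document-term matrix first, then K-means) can be sketched as follows. The documents are hypothetical, and seeding the centroids with two fixed rows is an assumption made for reproducibility; K-means is normally initialized randomly or with k-means++.

```python
def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return vocab, [[d.lower().split().count(w) for w in vocab] for d in docs]


def kmeans(points, initial_centroids, iterations=10):
    """Plain K-means; returns the cluster index assigned to each point."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance
        labels = [
            min(range(len(centroids)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    "web server log files",
    "server log analysis",
    "football match result",
    "football league match",
]
vocab, matrix = document_term_matrix(docs)
labels = kmeans(matrix, initial_centroids=[matrix[0], matrix[2]])
print(labels)  # [0, 0, 1, 1]
```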
  • 45. Shared Technique: Classification • Data Mining: Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. • Text Mining: Text classification makes use of Supervised Latent Dirichlet Allocation (SLDA) to classify texts into a specified set or class. Texts can be classified according to subject or other required attributes. • Web Mining: Web usage mining uses this technique to classify usage patterns on the web, for example by looking at the entry and exit points of the user.
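As a small illustration of text classification by subject, the sketch below uses a multinomial Naive Bayes classifier (one of the classifiers listed on slide 22) rather than SLDA, which is considerably more involved. The training sentences and labels are hypothetical, and add-one smoothing is an assumed design choice.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Collect class and per-class word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled training set
training = [
    ("match goal team", "sports"),
    ("league team win", "sports"),
    ("server log database", "tech"),
    ("database query index", "tech"),
]
model = train_naive_bayes(training)
print(classify("team win", model))         # sports
print(classify("database server", model))  # tech
```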