UNIT 2
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING
DEFINITION
A data warehouse is a system that aggregates data
from different sources into a single, central,
consistent data store to support data analysis, data
mining, artificial intelligence (AI), and machine
learning. A data warehouse system enables an
organization to run powerful analytics on huge
volumes (petabytes and petabytes) of historical
data in ways that a standard database cannot.
TYPES OF DATA WAREHOUSES
• Enterprise Warehouse: covers all areas of interest for an
organization
• Data Mart: covers a subset of corporate-wide data that is
of interest for a specific user group (e.g., marketing).
• Virtual Warehouse: offers a set of views constructed on
demand on operational databases. Some of the views could
be materialized (precomputed)
CHARACTERISTICS OF DATA WAREHOUSING
1.Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.
2.Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.
3.Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even further back from a data warehouse. This contrasts
with a transaction system, where often only the most recent data is kept. For example, a
transaction system may hold the most recent address of a customer, whereas a data warehouse can
hold all addresses associated with a customer.
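As a rough illustration of the time-variant property, the following Python sketch keeps every address a customer has had, each with a start date, where a transaction system would overwrite the old value. The class and field names (`AddressHistory`, `valid_from`) and the sample data are assumptions made for this example only.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class AddressHistory:
    """Hypothetical warehouse-style store: keeps all addresses per customer."""
    rows: dict = field(default_factory=dict)  # customer_id -> list of (valid_from, address)

    def record(self, customer_id: str, valid_from: date, address: str) -> None:
        # Append instead of overwrite, so history is preserved.
        self.rows.setdefault(customer_id, []).append((valid_from, address))
        self.rows[customer_id].sort()

    def current(self, customer_id: str) -> str:
        # What a transaction system would typically hold: the latest address only.
        return self.rows[customer_id][-1][1]

    def as_of(self, customer_id: str, when: date) -> Optional[str]:
        # Address that was valid on a given date, or None if none was known yet.
        valid = [addr for start, addr in self.rows[customer_id] if start <= when]
        return valid[-1] if valid else None


history = AddressHistory()
history.record("C1", date(2022, 1, 1), "12 Oak Street")
history.record("C1", date(2023, 6, 1), "98 Pine Avenue")
```

Calling `history.as_of("C1", date(2022, 12, 31))` looks up the older address, while `history.current("C1")` returns only the latest one.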
DATA WAREHOUSE ARCHITECTURE
• A data warehouse architecture defines the overall arrangement of data communication, processing,
and presentation that exists for end-client computing within the enterprise. Each data
warehouse is different, but all are characterized by standard vital components.
• Production applications such as payroll, accounts payable, product purchasing, and inventory control are
designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day
operations.
• Data warehouse applications are designed to support the user's ad-hoc data requirements, an activity
recently dubbed online analytical processing (OLAP). These include applications such as forecasting,
profiling, summary reporting, and trend analysis.
• Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a
warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP
data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a
dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be
restructured: tables are de-normalized, data is cleansed of errors and redundancies, and new fields and keys
are added to reflect the needs of the user for sorting, combining, and summarizing data.
• Data warehouses and their architectures vary depending upon the elements of an organization's situation.
FRAMEWORK OF DATA WAREHOUSING
SINGLE TIER ARCHITECTURE
Single-tier architecture is not frequently used in practice. Its
purpose is to minimize the amount of data stored; to reach this
goal, it removes data redundancies.
The figure shows that the only layer physically available is the source
layer. In this method, data warehouses are virtual. This means that
the data warehouse is implemented as a multidimensional view of
operational data created by specific middleware, or an
intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the
requirement for separation between analytical and transactional
processing. Analysis queries are submitted to operational data after
the middleware interprets them. In this way, queries affect
transactional workloads.
TWO TIER ARCHITECTURE
The requirement for separation plays an essential role in defining the two-tier architecture for a
data warehouse system. It consists of four subsequent data flow stages:
Source layer: A data warehouse system uses heterogeneous sources of data. That data is
stored initially in corporate relational databases or legacy databases, or it may come from an
information system outside the corporate walls.
Data Staging: The data stored in the sources should be extracted, cleansed to remove
inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard
schema. The so-named Extraction, Transformation, and Loading (ETL) tools can combine
heterogeneous schemata, and extract, transform, cleanse, validate, filter, and load source data into a
data warehouse.
Data Warehouse layer: Information is saved to one logically centralized individual repository: a
data warehouse. The data warehouse can be directly accessed, but it can also be used as a
source for creating data marts, which partially replicate data warehouse contents and are
designed for specific enterprise departments. Metadata repositories store information on
sources, access procedures, data staging, users, data mart schema, and so on.
Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports,
dynamically analyze information, and simulate hypothetical business scenarios. It should feature
aggregate information navigators, complex query optimizers, and customer-friendly GUIs.
THREE TIER ARCHITECTURE
The three-tier architecture consists of the source layer (containing multiple
source systems), the reconciled layer, and the data warehouse layer
(containing both data warehouses and data marts). The reconciled layer sits
between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard
reference data model for a whole enterprise. At the same time, it separates
the problems of source data extraction and integration from those of data
warehouse population. In some cases, the reconciled layer is also used
directly to better accomplish some operational tasks, such as producing daily
reports that cannot be satisfactorily prepared using the corporate
applications, or generating data flows that feed external processes periodically
so that they benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide
systems. A disadvantage of this structure is the extra file storage space used
by the redundant reconciled layer. It also moves the analytical
tools a little further away from being real-time.
THREE COMMON ARCHITECTURES
Three common architectures are:
• Data Warehouse Architecture: Basic
• Data Warehouse Architecture: With Staging Area
• Data Warehouse Architecture: With Staging Area and Data Marts
DATA WAREHOUSE ARCHITECTURE:
BASIC
Operational System
• An operational system is a term used in data warehousing to refer to a
system that is used to process the day-to-day transactions of an
organization.
Flat Files
• A Flat file system is a system of files in which transactional data is stored,
and every file in the system must have a different name.
Meta Data
• Metadata is a set of data that defines and gives information about other
data. Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make
finding and working with particular instances of data easier. For
example, author, date created, date modified, and file size are examples
of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
DATA WAREHOUSE ARCHITECTURE:
BASIC
Lightly and highly summarized data
• This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated)
data generated by the warehouse manager. The goal of the summarized information is to speed up
query performance. The summarized records are updated continuously as new information is loaded
into the warehouse.
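One way to picture the two levels of summarization is the Python sketch below, which rolls the same detail rows up per day (lightly summarized) and per month (highly summarized). The sample rows and the `summarize` helper are hypothetical and only illustrate the idea of precomputed aggregates.

```python
from collections import defaultdict

# Hypothetical detail rows: (ISO date, product, amount)
sales = [
    ("2024-01-05", "widget", 10.0),
    ("2024-01-05", "widget", 5.0),
    ("2024-01-20", "gadget", 7.5),
    ("2024-02-02", "widget", 4.0),
]


def summarize(rows, key_len):
    """Sum amounts by a date prefix: 10 characters = per day, 7 = per month."""
    totals = defaultdict(float)
    for day, _product, amount in rows:
        totals[day[:key_len]] += amount
    return dict(totals)


daily_totals = summarize(sales, 10)   # lightly summarized
monthly_totals = summarize(sales, 7)  # highly summarized
```

A query for monthly revenue can then read `monthly_totals` directly instead of scanning every detail row.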
End‐User access Tools
• The principal purpose of a data warehouse is to provide information to business managers for
strategic decision-making. These users interact with the warehouse using end-user access tools.
Examples of end-user access tools include:
• Reporting and Query Tools
• Application Development Tools
• Executive Information Systems Tools
• Online Analytical Processing Tools
• Data Mining Tools
DATA WAREHOUSE ARCHITECTURE:
WITH STAGING AREA
We must clean and process our operational information
before putting it into the warehouse.
We can do this programmatically, although most data
warehouses use a staging area (a place where data is
processed before entering the warehouse).
A staging area simplifies data cleansing and consolidation
for operational data coming from multiple source
systems, especially for enterprise data warehouses where
all relevant data of an enterprise is consolidated.
The data warehouse staging area is a temporary location
to which records from source systems are copied.
DATA WAREHOUSE ARCHITECTURE:
WITH STAGING AREA AND DATA MARTS
We may want to customize our warehouse's architecture for multiple
groups within our organization. We can do this by adding data marts.
A data mart is a segment of a data warehouse that can provide
information for reporting and analysis on a section, unit, department,
or operation in the company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks
are separated. In this example, a financial analyst wants to analyze
historical data for purchases and sales or mine historical information
to make predictions about customer behavior.
PROPERTIES OF DATA WAREHOUSE
ARCHITECTURES
The following architecture properties are necessary for a data warehouse
system:
Separation: Analytical and transactional processing should be kept apart as
much as possible.
Scalability: Hardware and software architectures should be simple to
upgrade as the data volume that has to be managed and processed, and the
number of user requirements that have to be met, progressively
increase.
Extensibility: The architecture should be able to accommodate new operations
and technologies without redesigning the whole system.
Security: Monitoring access is necessary because of the strategic data
stored in the data warehouse.
Administerability: Data Warehouse management should not be
ETL PROCESS IN DATA WAREHOUSE
INTRODUCTION:
ETL stands for Extract, Transform, Load and it is a process used in data warehousing
to extract data from various sources, transform it into a format suitable for loading
into a data warehouse, and then load it into the warehouse. The process of ETL can
be broken down into the following three stages:
1.Extract: The first stage in the ETL process is to extract data from various sources
such as transactional systems, spreadsheets, and flat files. This step involves reading
data from the source systems and storing it in a staging area.
2.Transform: In this stage, the extracted data is transformed into a format that is
suitable for loading into the data warehouse. This may involve cleaning and
validating the data, converting data types, combining data from multiple sources,
and creating new data fields.
3.Load: After the data is transformed, it is loaded into the data warehouse. This step
involves creating the physical data structures and loading the data into the
warehouse.
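The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular ETL tool works: the CSV text, the `orders` table, and the in-memory SQLite database standing in for the warehouse are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would be a file or a source database.
SOURCE_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,not_a_number
3,carol,7.25
"""


def extract(text):
    """Extract: read raw rows from the source into a staging list."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean names, convert types, and drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # invalid amount: skipped in this sketch
        clean.append((int(row["order_id"]), row["customer"].strip().title(), amount))
    return clean


def load(conn, rows):
    """Load: create the target table if needed and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT customer, amount FROM orders ORDER BY order_id").fetchall()
```

Keeping each stage as its own function mirrors the separation described above: the row with an invalid amount is rejected during the transform stage and never reaches the warehouse table.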
ADVANTAGES OF ETL PROCESS
The ETL process is an iterative process that is repeated as new data is added to the
warehouse. The process is important because it ensures that the data in the data
warehouse is accurate, complete, and up-to-date. It also helps to ensure that the
data is in the format required for data mining and reporting.
Additionally, there are many different ETL tools and technologies available, such as
Informatica, Talend, DataStage, and others, that can automate and simplify the ETL
process.
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It
is a process in which an ETL tool extracts the data from various data source systems,
transforms it in the staging area, and then finally, loads it into the Data
Warehouse system.
ADVANTAGES OF ETL PROCESS IN DATA
WAREHOUSING
Improved data quality: ETL process ensures that the data in the data
warehouse is accurate, complete, and up-to-date.
Better data integration: ETL process helps to integrate data from multiple
sources and systems, making it more accessible and usable.
Increased data security: ETL process can help to improve data security by
controlling access to the data warehouse and ensuring that only authorized
users can access the data.
Improved scalability: ETL process can help to improve scalability by providing
a way to manage and analyze large amounts of data.
Increased automation: ETL tools and technologies can automate and simplify
the ETL process, reducing the time and effort required to load and update data
in the warehouse.
DISADVANTAGES OF ETL PROCESS IN DATA
WAREHOUSING:
High cost: ETL process can be expensive to implement and maintain,
especially for organizations with limited resources.
Complexity: ETL process can be complex and difficult to implement,
especially for organizations that lack the necessary expertise or resources.
Limited flexibility: ETL process can be limited in terms of flexibility, as it
may not be able to handle unstructured data or real-time data streams.
Limited scalability: ETL process can be limited in terms of scalability, as it
may not be able to handle very large amounts of data.
Data privacy concerns: ETL process can raise concerns about data privacy,
as large amounts of data are collected, stored, and analyzed.
DATA MARTS
As corporate-wide data warehouses came into use, it was discovered
that in many situations a full-blown data warehouse was overkill for
applications. Data marts evolved to solve this problem. A data mart
is a special type of a data warehouse. It is focused on a single subject
(or functional area), such as Sales, Finance, or Marketing. Whereas
data warehouses have an enterprisewide depth, the information in
data marts pertains to a single department. The primary use for a
data mart is Business Intelligence (BI) applications. Implementing a
data mart can be less expensive than implementing a data
warehouse, thus making it more practical for the small business. A
data mart can also be set up in much less time than
a data warehouse.
DATA MINING ENGINES
The ETL utilities make data collection from numerous diverse systems practical. Then, the
data needs to be converted into useful information. Some key points to remember:
• Data are simply facts, figures, and text that can be processed by a computer. Example: A
transaction at a retail point-of-sale is data.
• Information is processed data. For example, analysis of point-of-sale transactions yields
information on consumer buying behaviour.
• Knowledge represents a pattern that connects information and usually provides a high
degree of predictability as to what is described or what will happen next.
Data-mining engines evolved to support complex analysis and ad hoc queries
on a data warehouse's database. Data mining looks for patterns among hundreds of
seemingly unrelated fields in a large database, patterns that reveal previously unknown
trends. These trends play a key role in strategic decision making because they reveal
areas for process improvement.
REPORTING TOOLS
The knowledge created by a data-mining engine is not very useful unless it
is presented easily and clearly to those who need it. There are many
formats for reporting information and knowledge results. One of the
common techniques for displaying information is the digital dashboard.
It provides a business manager with the input necessary to push the
business towards success. It presents the user with a graphical view of
business processes. The user can then drill down into the data at will to get more
details on a specific process. Today, many versions of digital dashboards
are available from a variety of software vendors.
DATA MINING
DEFINITION:
Data mining is the computer-assisted process of extracting knowledge
from large amounts of data.
In other words, data mining derives its name from Data + Mining: in the same
way that mining is done in the ground to find valuable ore, data
mining is done to find valuable information in a dataset.
Data mining tools predict customer habits, patterns, and future
trends, allowing businesses to increase company revenues and make
proactive decisions.
CHARACTERISTICS OF DATA MINING
1.Prediction of likely outcomes.
2.Focus on large datasets and databases.
3.Automatic pattern predictions based on behavior analysis.
4.Calculation – To calculate a feature from other features, any SQL
expression can be used.
BENEFITS OF DATA MINING:
1.It helps companies gather reliable information.
2.It’s an efficient, cost-effective solution compared to other data applications.
3.It helps businesses make profitable production and operational adjustments.
4.Data mining uses both new and legacy systems.
5.It helps businesses make informed decisions.
6.It helps detect credit risks and fraud.
7.It helps data scientists easily analyze enormous amounts of data quickly.
8.Data scientists can use the information to detect fraud, build risk models, and
improve product safety.
9.It helps data scientists quickly initiate automated predictions of behaviors and
trends and discover hidden patterns.
DATA MINING APPLICATIONS:
• Banks: Data mining helps banks work with credit ratings and anti-fraud
systems, analyzing customer financial data, purchasing transactions, and
card transactions.
• Healthcare: Data mining helps doctors create more accurate diagnoses by
bringing together every patient’s medical history, physical examination
results, medications, and treatment patterns.
• Marketing: If there was ever an application that benefitted from data
mining, it’s marketing! After all, marketing’s heart and soul is all about
targeting customers effectively for maximum results. Of course, the best way
to target your audience is to know as much about them as possible.
• Retail: The worlds of retail and marketing go hand-in-hand, but the former
still warrants its separate listing. Data mining also pinpoints which campaigns
get the most response.
DATA MINING APPROACHES
• CLUSTER ANALYSIS
• CLASSIFICATION
• REGRESSION
CLUSTER ANALYSIS
Cluster analysis, also known as clustering, is a method
of data mining that groups similar data points
together. The goal of cluster analysis is to divide a
dataset into groups (or clusters) such that the data
points within each group are more similar to each
other than to data points in other groups. This process
is often used for exploratory data analysis and can help
identify patterns or relationships within the data that
may not be immediately obvious.
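As an illustration, the sketch below uses k-means, one common clustering technique (the slides do not name a specific algorithm). It is plain Python with no libraries; the sample points and the two starting centroids are hypothetical and chosen so the result is easy to check by eye.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

With this data the three points near (1, 1) end up in one cluster and the three near (8.5, 9) in the other. Results on real data depend on the starting centroids and on the number of clusters chosen.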
CLASSIFICATION
Classification is a different method than clustering.
Unlike clustering, a classification analysis requires
that the end-user/analyst understand ahead of time
how classes are characterised.
Example: Classes can be defined to represent the
likelihood that a customer defaults on a loan
(Yes/No).
A common approach for classifiers is to use
decision trees to partition and segment records.
New records can be classified by traversing the tree
from the root through branches and nodes, to a
leaf representing a class.
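The traversal can be sketched as follows. The tree is hand-built for the loan-default example and is purely hypothetical: the attributes (`income`, `existing_debt`) and thresholds are assumptions, and a real tree would be learned from labelled training records.

```python
# Hypothetical hand-built tree for the loan-default example.
# Internal nodes test one attribute against a threshold; leaves carry a class label.
TREE = {
    "attribute": "income",
    "threshold": 30000,
    "below": {
        "attribute": "existing_debt",
        "threshold": 5000,
        "below": {"label": "No"},
        "at_or_above": {"label": "Yes"},
    },
    "at_or_above": {"label": "No"},
}


def classify(record, node=TREE):
    """Walk from the root to a leaf and return the leaf's class label."""
    while "label" not in node:
        branch = "below" if record[node["attribute"]] < node["threshold"] else "at_or_above"
        node = node[branch]
    return node["label"]
```

For example, `classify({"income": 20000, "existing_debt": 8000})` follows the low-income branch, then the high-debt branch, and ends at the leaf labelled "Yes".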
REGRESSION ANALYSIS
Regression refers to a data mining
technique that is used to predict the
numeric values in a given data set. For
example, regression might be used to
predict the product or service cost or
other variables. It is also used in various
industries for business and marketing
behavior, trend analysis, and financial
forecasting.
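A minimal sketch of the idea, assuming the simplest case of one predictor and a straight-line relationship (ordinary least squares). The `units` and `cost` figures are made-up illustration data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Hypothetical data: units produced vs. total cost.
units = [1, 2, 3, 4, 5]
cost = [12.0, 14.0, 16.0, 18.0, 20.0]
slope, intercept = fit_line(units, cost)
predicted_cost = slope * 6 + intercept  # estimate for 6 units
```

Here the fitted line has a slope of 2 and an intercept of 10, so the predicted cost for 6 units is 22.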
TEXT MINING
DEFINITION:
Text mining (also known as text analysis) is the process of transforming
unstructured text into structured data for easy analysis. Text mining uses
natural language processing (NLP), allowing machines to understand
human language and process it automatically. For businesses, the large
amount of data generated every day represents both an opportunity and a
challenge. On the one side, data helps companies get smart insights on
people’s opinions about a product or service. Think about all the potential
ideas that you could get from analyzing emails, product reviews, social
media posts, customer feedback, support tickets, etc. On the other side,
there’s the dilemma of how to process all this data. And that’s where text
mining plays a major role.
TEXT MINING
• Like most things related to Natural Language Processing (NLP), text mining may sound like a
hard-to-grasp concept. But the truth is, it doesn’t need to be. This guide will go through the
basics of text mining, explain its different methods and techniques, and make it simple to
understand how it works. Text mining is an automatic process that uses natural language
processing to extract valuable insights from unstructured text. By transforming data into
information that machines can understand, text mining automates the process of classifying
texts by sentiment, topic, and intent.
• Thanks to text mining, businesses are able to analyze complex and large sets of data in
a simple, fast and effective way. At the same time, companies are taking advantage of this
powerful tool to reduce some of their manual and repetitive tasks, saving their teams
precious time and allowing customer support agents to focus on what they do best.
• Let’s say you need to examine tons of reviews in G2 Crowd to understand what customers
are praising or criticizing about your SaaS. A text mining algorithm could help you identify
the most popular topics that arise in customer comments, and the way that people feel
about them: are the comments positive, negative or neutral? You could also find out the
main keywords mentioned by customers regarding a given topic.
• In a nutshell, text mining helps companies make the most of their data, which leads to better
data-driven business decisions.
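To make the review example concrete, here is a deliberately simple sketch that labels each review as positive, negative, or neutral using small word lists and counts the most frequent keywords. The word lists and sample reviews are hypothetical, and real text mining systems typically rely on trained NLP models, not fixed lexicons.

```python
import re
from collections import Counter

# Tiny hypothetical lexicons; real systems use trained models or much larger lists.
POSITIVE = {"great", "love", "fast", "easy"}
NEGATIVE = {"slow", "bug", "confusing", "crash"}
STOPWORDS = {"the", "is", "a", "and", "but", "i", "it", "to", "of", "has"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


def top_keywords(texts, n=3):
    """Most frequent non-stopword tokens across all texts."""
    counts = Counter(t for text in texts for t in tokenize(text) if t not in STOPWORDS)
    return counts.most_common(n)


reviews = [
    "I love the dashboard, it is fast and easy",
    "The export is slow and the dashboard has a bug",
    "The dashboard is a dashboard",
]
labels = [sentiment(r) for r in reviews]
```

The first review is labelled positive, the second negative, and the third neutral; `top_keywords(reviews, 1)` reports "dashboard" as the most mentioned term.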
WEB MINING
DEFINITION
Web mining is the process of using data mining techniques and algorithms to
extract information directly from the Web: from Web
documents and services, Web content, hyperlinks, and server logs. The goal of
Web mining is to look for patterns in Web data by collecting and analyzing
information in order to gain insight into trends, the industry, and users in
general.
Web mining is a branch of data mining concentrating on the World Wide Web
as the primary data source, including all of its components from Web content,
server logs to everything in between. The contents of data mined from the
Web may be a collection of facts that Web pages are meant to contain, and
these may consist of text, structured data such as lists and tables, and even
images, video and audio.
CATEGORIES OF WEB MINING
Web content mining — This is the process of mining useful information from
the contents of Web pages and Web documents, which are mostly text, images
and audio/video files. Techniques used in this discipline have been heavily drawn
from natural language processing (NLP) and information retrieval.
Web structure mining — This is the process of analyzing the nodes and
connection structure of a website through the use of graph theory. There are two
things that can be obtained from this: the structure of a website in terms of how
it is connected to other sites and the document structure of the website itself, as
to how each page is connected.
Web usage mining — This is the process of extracting patterns and information
from server logs to gain insight on user activity including where the users are
from, how many clicked what item on the site and the types of activities being
done on the site.
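Web usage mining, for instance, can start from something as simple as counting page views in a server log. The sketch below assumes log lines in the Common Log Format; the sample lines, IP addresses, and paths are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical log lines in Common Log Format.
LOG = """\
203.0.113.5 - - [10/Mar/2024:10:00:01 +0000] "GET /products HTTP/1.1" 200 512
203.0.113.5 - - [10/Mar/2024:10:00:09 +0000] "GET /cart HTTP/1.1" 200 128
198.51.100.7 - - [10/Mar/2024:10:01:30 +0000] "GET /products HTTP/1.1" 200 512
198.51.100.7 - - [10/Mar/2024:10:02:00 +0000] "GET /missing HTTP/1.1" 404 0
"""

# Captures client address, HTTP method, requested path, and status code.
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) \S+$')


def parse(log_text):
    """Yield (client_ip, path, status) for each line that matches the pattern."""
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match:
            ip, _method, path, status = match.groups()
            yield ip, path, int(status)


hits = list(parse(LOG))
page_views = Counter(path for _ip, path, status in hits if status == 200)
visitors = {ip for ip, _path, _status in hits}
```

From here, `page_views.most_common()` ranks pages by successful requests and `visitors` gives the distinct client addresses, which are the kinds of usage patterns described above.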

[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

UNIT 2 DATA WAREHOUSING AND DATA MINING PRESENTATION.pptx

  • 1. UNIT 2 DATA WAREHOUSING AND DATA MINING
  • 3. DEFINITION A data warehouse is a system that aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, artificial intelligence (AI), and machine learning. A data warehouse system enables an organization to run powerful analytics on huge volumes (petabytes and petabytes) of historical data in ways that a standard database cannot.
  • 4. TYPES OF DATA WAREHOUSES • Enterprise Warehouse: covers all areas of interest for an organization • Data Mart: covers a subset of corporate-wide data that is of interest for a specific user group (e.g., marketing). • Virtual Warehouse: offers a set of views constructed on demand on operational databases. Some of the views could be materialized (precomputed)
  • 5. CHARACTERISTICS OF DATA WAREHOUSING 1. Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject. 2. Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product. 3. Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
  • 6. DATA WAREHOUSE ARCHITECTURE • A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components. • Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations. • Data warehouse applications are designed to support the user's ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis. • Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables are de-normalized, data is cleansed of errors and redundancies, and new fields and keys are added to reflect the needs of the user for sorting, combining, and summarizing data. • Data warehouses and their architectures vary depending upon the elements of an organization's situation.
  • 7. FRAMEWORK OF DATA WAREHOUSING
  • 8. SINGLE TIER ARCHITECTURE Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies. The figure shows that the only layer physically available is the source layer. In this method, data warehouses are virtual. This means that the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer. The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are submitted to operational data after the middleware interprets them. In this way, queries affect transactional workloads.
  • 9. TWO TIER ARCHITECTURE The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse system. It consists of four subsequent data flow stages: Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from an information system outside the corporate walls. Data Staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-named Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata, extract, transform, cleanse, validate, filter, and load source data into a data warehouse. Data Warehouse layer: Information is saved to one logically centralized repository: a data warehouse. The data warehouse can be directly accessed, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schema, and so on. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and user-friendly GUIs.
  • 10. THREE TIER ARCHITECTURE The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse. The main advantage of the reconciled layer is that it creates a standard reference data model for a whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so that they benefit from cleaning and integration. This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra file storage space used by the redundant reconciled layer. It also moves the analytical tools a little further away from being real-time.
  • 11. THREE COMMON ARCHITECTURES Three common architectures are: • Data Warehouse Architecture: Basic • Data Warehouse Architecture: With Staging Area • Data Warehouse Architecture: With Staging Area and Data Marts
  • 12. DATA WAREHOUSE ARCHITECTURE: BASIC Operational System • In data warehousing, an operational system is a system that is used to process the day-to-day transactions of an organization. Flat Files • A flat file system is a system of files in which transactional data is stored, and every file in the system must have a different name. Meta Data • A set of data that defines and gives information about other data. Meta Data is used in a data warehouse for a variety of purposes, including: Meta Data summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date changed, and file size are examples of very basic document metadata. Metadata is used to direct a query to the most appropriate data source.
  • 13. DATA WAREHOUSE ARCHITECTURE: BASIC Lightly and highly summarized data • This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. The goal of the summarized information is to speed up query performance. The summarized records are updated continuously as new information is loaded into the warehouse. End-User Access Tools • The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-user access tools. Examples of end-user access tools include: • Reporting and Query Tools • Application Development Tools • Executive Information Systems Tools • Online Analytical Processing Tools • Data Mining Tools
  • 14. DATA WAREHOUSE ARCHITECTURE: WITH STAGING AREA We must clean and process operational data before putting it into the warehouse. We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse). A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated. The data warehouse staging area is a temporary location where records from source systems are copied.
  • 15. DATA WAREHOUSE ARCHITECTURE: WITH STAGING AREA AND DATA MARTS We may want to customize our warehouse's architecture for multiple groups within our organization. We can do this by adding data marts. A data mart is a segment of a data warehouse that can provide information for reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, etc. The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a financial analyst wants to analyze historical data for purchases and sales or mine historical information to make predictions about customer behavior.
  • 16. PROPERTIES OF DATA WAREHOUSE ARCHITECTURES The following architecture properties are necessary for a data warehouse system: Separation: Analytical and transactional processing should be kept apart as much as possible. Scalability: Hardware and software architectures should be simple to upgrade as the data volume that has to be managed and processed, and the number of user requirements that have to be met, progressively increase. Extensibility: The architecture should be able to accommodate new operations and technologies without redesigning the whole system. Security: Monitoring access is necessary because of the strategic data stored in the data warehouse. Administerability: Data Warehouse management should not be
  • 17. ETL PROCESS IN DATA WAREHOUSE INTRODUCTION: ETL stands for Extract, Transform, Load and it is a process used in data warehousing to extract data from various sources, transform it into a format suitable for loading into a data warehouse, and then load it into the warehouse. The process of ETL can be broken down into the following three stages: 1.Extract: The first stage in the ETL process is to extract data from various sources such as transactional systems, spreadsheets, and flat files. This step involves reading data from the source systems and storing it in a staging area. 2.Transform: In this stage, the extracted data is transformed into a format that is suitable for loading into the data warehouse. This may involve cleaning and validating the data, converting data types, combining data from multiple sources, and creating new data fields. 3.Load: After the data is transformed, it is loaded into the data warehouse. This step involves creating the physical data structures and loading the data into the warehouse.
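The three ETL stages described above can be sketched in a few lines of Python. This is a minimal illustration only, not how any particular ETL tool works: the CSV extract, the `sales_fact` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract: a CSV export from a sales system (made-up data).
RAW_CSV = """order_id,product,amount,order_date
1, Widget ,19.99,2024-01-05
2,gadget,,2024-01-06
3,Widget,5.00,2024-01-06
"""


def extract(text):
    """Extract: read raw rows from the source into a staging list."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: validate rows, normalise product names, convert types."""
    clean = []
    for row in rows:
        if not row["amount"]:  # validation: skip rows with a missing amount
            continue
        clean.append((
            int(row["order_id"]),
            row["product"].strip().lower(),  # one consistent product name
            float(row["amount"]),
            row["order_date"],
        ))
    return clean


def load(conn, rows):
    """Load: create the physical table and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact ("
        "order_id INTEGER PRIMARY KEY, product TEXT, "
        "amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl(text=RAW_CSV):
    """Run all three stages against an in-memory stand-in warehouse."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(text)))
    return conn
```

Calling `run_etl()` returns a connection whose `sales_fact` table holds only the two valid orders, both under the normalised product name `widget`.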
  • 18. ADVANTAGES OF ETL PROCESS The ETL process is an iterative process that is repeated as new data is added to the warehouse. The process is important because it ensures that the data in the data warehouse is accurate, complete, and up-to-date. It also helps to ensure that the data is in the format required for data mining and reporting. Additionally, there are many different ETL tools and technologies available, such as Informatica, Talend, DataStage, and others, that can automate and simplify the ETL process. ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area, and then finally, loads it into the Data Warehouse system.
  • 19. ADVANTAGES OF ETL PROCESS IN DATA WAREHOUSING Improved data quality: ETL process ensures that the data in the data warehouse is accurate, complete, and up-to-date. Better data integration: ETL process helps to integrate data from multiple sources and systems, making it more accessible and usable. Increased data security: ETL process can help to improve data security by controlling access to the data warehouse and ensuring that only authorized users can access the data. Improved scalability: ETL process can help to improve scalability by providing a way to manage and analyze large amounts of data. Increased automation: ETL tools and technologies can automate and simplify the ETL process, reducing the time and effort required to load and update data in the warehouse.
  • 20. DISADVANTAGES OF ETL PROCESS IN DATA WAREHOUSING: High cost: ETL process can be expensive to implement and maintain, especially for organizations with limited resources. Complexity: ETL process can be complex and difficult to implement, especially for organizations that lack the necessary expertise or resources. Limited flexibility: ETL process can be limited in terms of flexibility, as it may not be able to handle unstructured data or real-time data streams. Limited scalability: ETL process can be limited in terms of scalability, as it may not be able to handle very large amounts of data. Data privacy concerns: ETL process can raise concerns about data privacy, as large amounts of data are collected, stored, and analyzed
  • 21. DATA MARTS As corporate-wide data warehouses came into use, it was discovered that in many situations a full-blown data warehouse was overkill for the application. Data marts evolved to solve this problem. A data mart is a special type of data warehouse. It is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Whereas data warehouses have an enterprise-wide scope, the information in data marts pertains to a single department. The primary use for a data mart is Business Intelligence (BI) applications. Implementing a data mart can be less expensive than implementing a data warehouse, thus making it more practical for the small business. A data mart can also be set up in much less time than a data warehouse.
  • 22. DATA MINING ENGINES The ETL utilities make data collection from numerous diverse systems practical. Then, the data needs to be converted into useful information. Some key points to remember: • Data are simply facts, figures, and text that can be processed by a computer. Example: A transaction at a retail point-of-sale is data. • Information is processed data. For example, analysis of point-of-sale transactions yields information on consumer buying behaviour. • Knowledge represents a pattern that connects information and usually provides a high degree of predictability as to what is described or what will happen next. Data-mining engines evolved to support complex analysis and ad hoc queries on a data warehouse's database. Data mining looks for patterns among hundreds of seemingly unrelated fields in a large database, patterns that reveal previously unknown trends. These trends play a key role in strategic decision making because they reveal areas for process improvement.
  • 23. REPORTING TOOLS The knowledge created by a data-mining engine is not very useful unless it is presented easily and clearly to those who need it. There are many formats for reporting information and knowledge results. One of the common techniques for displaying information is the digital dashboard. It provides a business manager with the input necessary to push the business towards success. It presents the client with a graphical view of business processes. The client can then drill down into the data at will to get more details on a specific process. Today, many versions of digital dashboards are available from a variety of software vendors.
  • 25. DATA MINING DEFINITION: Data Mining is the computer-assisted process of extracting knowledge from large amounts of data. In other words, data mining derives its name from Data + Mining: in the same way that mining is done in the ground to find a valuable ore, data mining is done to find valuable information in a dataset. Data mining tools predict customer habits, patterns, and future trends, allowing businesses to increase revenues and make proactive decisions.
  • 26. CHARACTERISTICS OF DATA MINING 1. Prediction of likely outcomes. 2. Focus on large datasets and databases. 3. Automatic pattern predictions based on behavior analysis. 4. Calculation: To calculate a feature from other features, any SQL expression can be used.
  • 27. BENEFITS OF DATA MINING: 1.It helps companies gather reliable information. 2.It’s an efficient, cost-effective solution compared to other data applications. 3.It helps businesses make profitable production and operational adjustments. 4.Data mining uses both new and legacy systems. 5.It helps businesses make informed decisions. 6.It helps detect credit risks and fraud. 7.It helps data scientists easily analyze enormous amounts of data quickly. 8.Data scientists can use the information to detect fraud, build risk models, and improve product safety. 9.It helps data scientists quickly initiate automated predictions of behaviors and trends and discover hidden patterns.
  • 28. DATA MINING APPLICATIONS: • Banks: Data mining helps banks work with credit ratings and anti-fraud systems, analyzing customer financial data, purchasing transactions, and card transactions. • Healthcare: Data mining helps doctors create more accurate diagnoses by bringing together every patient's medical history, physical examination results, medications, and treatment patterns. • Marketing: If there was ever an application that benefitted from data mining, it's marketing! After all, marketing's heart and soul is all about targeting customers effectively for maximum results. Of course, the best way to target your audience is to know as much about them as possible. • Retail: The worlds of retail and marketing go hand-in-hand, but the former still warrants its own listing. Data mining also pinpoints which campaigns get the most response.
  • 29. DATA MINING APPROACHES • CLUSTER ANALYSIS • CLASSIFICATION • REGRESSION
  • 30. CLUSTER ANALYSIS Cluster analysis, also known as clustering, is a method of data mining that groups similar data points together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points within each group are more similar to each other than to data points in other groups. This process is often used for exploratory data analysis and can help identify patterns or relationships within the data that may not be immediately obvious.
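To make the grouping idea above concrete, here is a minimal sketch of one common clustering method, k-means, restricted to one-dimensional data. The slides do not name a specific algorithm, so k-means is an assumption chosen for illustration; the function name, the starting centroids, and the monthly-spend figures are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny k-means for 1-D data; returns (centroids, clusters)."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical monthly spend per customer, with two rough starting centroids.
spend = [12, 15, 14, 90, 95, 88]
centres, groups = kmeans_1d(spend, centroids=[12, 90])
```

With this data, the low spenders and the high spenders end up in separate groups, because each value is closer to the other members of its own group than to the members of the other one.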
  • 31. CLASSIFICATION Classification is a different method than clustering. Unlike clustering, a classification analysis requires that the end-user/analyst understand ahead of time how classes are characterised. Example: Classes can be defined to represent the likelihood that a customer defaults on a loan (Yes/No). A common approach for classifiers is to use decision trees to partition and segment records. New records can be classified by traversing the tree from the root through branches and nodes, to a leaf representing a class.
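The tree traversal described above can be sketched as follows. The tree here is built by hand purely for illustration: the features (`income`, `existing_debt`), the thresholds, and the class labels are hypothetical, and in practice a tree would be learned from labelled historical records rather than written out.

```python
# Hypothetical hand-built decision tree for "will the customer default?".
# Internal nodes are dicts; leaves are class labels ("Yes" / "No").
TREE = {
    "feature": "income",
    "threshold": 30000,
    "below": {
        "feature": "existing_debt",
        "threshold": 5000,
        "below": "No",
        "above": "Yes",
    },
    "above": "No",
}


def classify(record, node=TREE):
    """Walk from the root through the branches until a leaf (class) is reached."""
    while isinstance(node, dict):
        value = record[node["feature"]]
        branch = "below" if value < node["threshold"] else "above"
        node = node[branch]
    return node


# A new record is classified by following the branches that match its values.
label = classify({"income": 20000, "existing_debt": 8000})
```

For this record the walk goes root, then the `existing_debt` node, then the `Yes` leaf.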
  • 32. REGRESSION ANALYSIS Regression refers to a data mining technique that is used to predict the numeric values in a given data set. For example, regression might be used to predict the product or service cost or other variables. It is also used in various industries for business and marketing behavior, trend analysis, and financial forecast.
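As a small illustration of predicting a numeric value, the sketch below fits a straight line by ordinary least squares. Simple linear regression is only one of many regression techniques and is an assumption here, as are the function names and the made-up data points.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the numeric value for a new input."""
    return slope * x + intercept


# Hypothetical data: units ordered (x) against total service cost (y).
units = [1, 2, 3, 4]
cost = [3, 5, 7, 9]
slope, intercept = fit_line(units, cost)
estimate = predict(5, slope, intercept)
```

Because the sample points lie exactly on the line y = 2x + 1, the fitted slope is 2, the intercept is 1, and the estimate for 5 units is 11.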
  • 33. TEXT MINING DEFINITION: Text mining (also known as text analysis) is the process of transforming unstructured text into structured data for easy analysis. Text mining uses natural language processing (NLP), allowing machines to understand human language and process it automatically. For businesses, the large amount of data generated every day represents both an opportunity and a challenge. On the one side, data helps companies get smart insights on people's opinions about a product or service. Think about all the potential ideas that you could get from analyzing emails, product reviews, social media posts, customer feedback, support tickets, etc. On the other side, there's the dilemma of how to process all this data. And that's where text mining plays a major role.
  • 34. TEXT MINING • Like most things related to Natural Language Processing (NLP), text mining may sound like a hard-to-grasp concept. But the truth is, it doesn't need to be. This guide will go through the basics of text mining, explain its different methods and techniques, and make it simple to understand how it works. Text mining is an automatic process that uses natural language processing to extract valuable insights from unstructured text. By transforming data into information that machines can understand, text mining automates the process of classifying texts by sentiment, topic, and intent. • Thanks to text mining, businesses are able to analyze complex and large sets of data in a simple, fast, and effective way. At the same time, companies are taking advantage of this powerful tool to reduce some of their manual and repetitive tasks, saving their teams precious time and allowing customer support agents to focus on what they do best. • Let's say you need to examine tons of reviews in G2 Crowd to understand what customers are praising or criticizing about your SaaS. A text mining algorithm could help you identify the most popular topics that arise in customer comments, and the way that people feel about them: are the comments positive, negative, or neutral? You could also find out the main keywords mentioned by customers regarding a given topic. • In a nutshell, text mining helps companies make the most of their data, which leads to better data-driven business decisions.
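The review example above (keywords plus positive, negative, or neutral sentiment) can be approximated with a very small word-list approach. This is a deliberately naive sketch, far simpler than real NLP-based text mining: the word lists, stop words, and sample reviews are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, tiny word lists; a real system would use a trained model
# or a much larger lexicon.
POSITIVE = {"great", "fast", "love", "easy"}
NEGATIVE = {"slow", "bug", "crash", "confusing"}
STOPWORDS = {"the", "is", "and", "a", "but", "it", "i", "to", "of"}


def tokenize(text):
    """Turn unstructured text into a list of lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment(text):
    """Label text by counting positive words minus negative words."""
    tokens = tokenize(text)
    score = (sum(t in POSITIVE for t in tokens)
             - sum(t in NEGATIVE for t in tokens))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def top_keywords(reviews, n=3):
    """Return the n most frequent non-stop-words across all reviews."""
    counts = Counter(t for review in reviews
                     for t in tokenize(review) if t not in STOPWORDS)
    return counts.most_common(n)


reviews = ["support is fast", "fast setup", "slow support"]
labels = [sentiment(r) for r in reviews]
keywords = top_keywords(reviews, n=2)
```

Here the structured output is a sentiment label per review and a keyword frequency list, which is the kind of information that can then be stored and analysed like any other data.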
  • 35. WEB MINING DEFINITION Web mining is the process of using data mining techniques and algorithms to extract information directly from the Web by extracting it from Web documents and services, Web content, hyperlinks and server logs. The goal of Web mining is to look for patterns in Web data by collecting and analyzing information in order to gain insight into trends, the industry and users in general. Web mining is a branch of data mining concentrating on the World Wide Web as the primary data source, including all of its components from Web content, server logs to everything in between. The contents of data mined from the Web may be a collection of facts that Web pages are meant to contain, and these may consist of text, structured data such as lists and tables, and even images, video and audio.
  • 36. CATEGORIES OF WEB MINING Web content mining — This is the process of mining useful information from the contents of Web pages and Web documents, which are mostly text, images and audio/video files. Techniques used in this discipline have been heavily drawn from natural language processing (NLP) and information retrieval. Web structure mining — This is the process of analyzing the nodes and connection structure of a website through the use of graph theory. There are two things that can be obtained from this: the structure of a website in terms of how it is connected to other sites and the document structure of the website itself, as to how each page is connected. Web usage mining — This is the process of extracting patterns and information from server logs to gain insight on user activity including where the users are from, how many clicked what item on the site and the types of activities being done on the site.
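Web usage mining, the last category above, can be sketched as parsing server log lines and counting activity per page. The log lines below are made up and assumed to follow the widely used Common Log Format; the function names and the regular expression are illustrative assumptions, not part of any specific tool.

```python
import re
from collections import Counter

# Hypothetical access-log lines (Common Log Format assumed).
LOG_LINES = [
    '203.0.113.5 - - [10/Apr/2024:10:00:01 +0000] "GET /home HTTP/1.1" 200 512',
    '203.0.113.5 - - [10/Apr/2024:10:00:09 +0000] "GET /pricing HTTP/1.1" 200 734',
    '198.51.100.7 - - [10/Apr/2024:10:01:15 +0000] "GET /home HTTP/1.1" 200 512',
    '198.51.100.7 - - [10/Apr/2024:10:01:40 +0000] "GET /missing HTTP/1.1" 404 0',
]

# Captures: client address, HTTP method, requested path, status code.
PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) \S+$'
)


def page_hits(lines):
    """Count successful (status 200) requests per page."""
    hits = Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match and match.group(4) == "200":
            hits[match.group(3)] += 1
    return hits


def unique_visitors(lines):
    """Count distinct client addresses seen in the log."""
    return len({m.group(1) for m in map(PATTERN.match, lines) if m})


hits = page_hits(LOG_LINES)
visitors = unique_visitors(LOG_LINES)
```

From these four lines, `/home` receives two successful hits, `/pricing` one, the failed `/missing` request is ignored, and two distinct visitors are seen.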