3. DEFINITION
A data warehouse is a system that aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, artificial intelligence (AI), and machine learning. A data warehouse enables an organization to run powerful analytics on huge volumes (petabytes) of historical data in ways that a standard database cannot.
4. TYPES OF DATA WAREHOUSES
• Enterprise Warehouse: covers all areas of interest for an organization.
• Data Mart: covers a subset of corporate-wide data that is of interest to a specific user group (e.g., marketing).
• Virtual Warehouse: offers a set of views constructed on demand on operational databases. Some of the views may be materialized (precomputed).
5. CHARACTERISTICS OF DATA WAREHOUSING
1. Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
2. Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse there will be only a single way of identifying a product.
3. Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with that customer (see the sketch after this list).
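To make the time-variant idea concrete, here is a minimal Python sketch (hypothetical tables and values): the operational system keeps only the current address, while the warehouse keeps every address together with its validity period.

    # Transaction system: one row per customer, only the current address.
    operational_customer = {"customer_id": 42, "address": "12 Oak St"}

    # Data warehouse: one row per address version, with validity dates.
    warehouse_addresses = [
        {"customer_id": 42, "address": "7 Elm Rd",
         "valid_from": "2021-01-01", "valid_to": "2023-06-30"},
        {"customer_id": 42, "address": "12 Oak St",
         "valid_from": "2023-07-01", "valid_to": None},  # current address
    ]

    # Retrieve the address that was current on a given date.
    # ISO date strings compare correctly as plain text.
    def address_on(history, date):
        for row in history:
            if row["valid_from"] <= date and (row["valid_to"] is None or date <= row["valid_to"]):
                return row["address"]
        return None

    print(address_on(warehouse_addresses, "2022-03-15"))  # -> 7 Elm Rd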
6. DATA WAREHOUSE ARCHITECTURE
• A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.
• Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.
• Data warehouse applications are designed to support users' ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.
• Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables are de-normalized, data is cleansed of errors and redundancies, and new fields and keys are added to reflect the users' needs for sorting, combining, and summarizing data.
• Data warehouses and their architectures vary depending upon the elements of an organization's situation.
8. SINGLE TIER ARCHITECTURE
Single-tier architecture is rarely used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.
The figure shows that the only layer physically available is the source layer. In this method, data warehouses are virtual. This means that the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are issued against operational data after the middleware interprets them. In this way, queries affect transactional workloads.
9. TWO TIER ARCHITECTURE
The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse system. It consists of four subsequent data flow stages:
Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from an information system outside the corporate walls.
Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata and extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
Data warehouse layer: Information is stored in one logically centralized repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Metadata repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and customer-friendly GUIs.
10. THREE TIER ARCHITECTURE
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra file storage space used by the redundant reconciled layer. It also moves the analytical tools a little further away from being real-time.
11. THREE COMMON ARCHITECTURES
Three common architectures are:
• Data Warehouse Architecture: Basic
• Data Warehouse Architecture: With Staging Area
• Data Warehouse Architecture: With Staging Area and Data Marts
12. DATA WAREHOUSE ARCHITECTURE:
BASIC
Operational System
• An operational system is a term used in data warehousing to refer to a system that processes the day-to-day transactions of an organization.
Flat Files
• A flat file system is a system of files in which transactional data is stored, and every file in the system must have a different name.
Meta Data
• A set of data that defines and gives information about other data. Metadata is used in a data warehouse for a variety of purposes: it summarizes necessary information about the data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata. Metadata is also used to direct a query to the most appropriate data source (see the sketch below).
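As a small illustration, here is a sketch (hypothetical catalog entries, not any particular product's format) of how metadata can both summarize datasets and route a query to the most appropriate source:

    # Each catalog entry summarizes a dataset and records where it lives.
    catalog = [
        {"table": "sales_history", "author": "etl_job", "created": "2024-01-05",
         "modified": "2024-02-01", "rows": 1_200_000, "source": "warehouse"},
        {"table": "sales_today", "author": "pos_system", "created": "2024-02-10",
         "modified": "2024-02-10", "rows": 4_300, "source": "operational_db"},
    ]

    def route_query(table_name):
        """Direct a query to the data source that holds the requested table."""
        for entry in catalog:
            if entry["table"] == table_name:
                return entry["source"]
        raise KeyError(table_name)

    print(route_query("sales_today"))  # -> operational_db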
13. DATA WAREHOUSE ARCHITECTURE:
BASIC
Lightly and highly summarized data
• This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. The goal of the summarized data is to speed up query performance. The summarized data is updated continuously as new information is loaded into the warehouse; a short sketch follows.
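A minimal sketch of summarization (hypothetical data; assumes the pandas library is available): detailed sales rows are pre-aggregated by month and product, so summary queries read the small table instead of scanning the detail.

    import pandas as pd

    detail = pd.DataFrame({
        "month":   ["2024-01", "2024-01", "2024-02", "2024-02"],
        "product": ["A", "B", "A", "A"],
        "amount":  [100.0, 250.0, 80.0, 120.0],
    })

    # Precomputed by the warehouse manager and refreshed as new data loads.
    monthly_summary = detail.groupby(["month", "product"], as_index=False)["amount"].sum()
    print(monthly_summary)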
End-User Access Tools
• The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These customers interact with the warehouse using end-user access tools. Examples of end-user access tools include:
• Reporting and Query Tools
• Application Development Tools
• Executive Information Systems Tools
• Online Analytical Processing Tools
• Data Mining Tools
14. DATA WAREHOUSE ARCHITECTURE:
WITH STAGING AREA
We must clean and process our operational data before putting it into the warehouse.
We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse) instead.
A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
A data warehouse staging area is a temporary location where records from source systems are copied, as in the sketch below.
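A minimal staging sketch (hypothetical in-memory sqlite databases): records are copied verbatim from a source system into a staging table; cleansing and consolidation happen there, before anything enters the warehouse proper.

    import sqlite3

    # A source system with a handful of order rows.
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 12.0)])

    # The staging area: a temporary copy of the source records.
    staging = sqlite3.connect(":memory:")
    staging.execute("CREATE TABLE stg_orders (id INTEGER, amount REAL)")
    rows = source.execute("SELECT id, amount FROM orders").fetchall()
    staging.executemany("INSERT INTO stg_orders VALUES (?, ?)", rows)

    print(staging.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0])  # -> 2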
15. DATA WAREHOUSE ARCHITECTURE:
WITH STAGING AREA AND DATA MARTS
We may want to customize our warehouse's architecture for multiple groups within our organization. We can do this by adding data marts.
A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a financial analyst wants to analyze historical data for purchases and sales, or mine historical information to make predictions about customer behavior.
16. PROPERTIES OF DATA WAREHOUSE
ARCHITECTURES
The following architecture properties are necessary for a data warehouse
system:
Separation: Analytical and transactional processing should be kept apart as much as possible.
Scalability: Hardware and software architectures should be simple to upgrade as the data volume that has to be managed and processed, and the number of users' requirements that have to be met, progressively increase.
Extensibility: The architecture should be able to host new applications and technologies without redesigning the whole system.
Security: Monitoring access is necessary because of the strategic data stored in the data warehouse.
Administerability: Data warehouse management should not be complicated.
17. ETL PROCESS IN DATA WAREHOUSE
INTRODUCTION:
ETL stands for Extract, Transform, Load. It is a process used in data warehousing to extract data from various sources, transform it into a format suitable for loading into a data warehouse, and then load it into the warehouse. The ETL process can be broken down into the following three stages:
1. Extract: The first stage in the ETL process is to extract data from various sources such as transactional systems, spreadsheets, and flat files. This step involves reading data from the source systems and storing it in a staging area.
2. Transform: In this stage, the extracted data is transformed into a format that is suitable for loading into the data warehouse. This may involve cleaning and validating the data, converting data types, combining data from multiple sources, and creating new data fields.
3. Load: After the data is transformed, it is loaded into the data warehouse. This step involves creating the physical data structures and loading the data into the warehouse. A minimal end-to-end sketch follows.
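The following is a minimal sketch of the three stages (hypothetical CSV source, with an in-memory sqlite database standing in for the warehouse), not a production pipeline:

    import csv, io, sqlite3

    raw = "customer,amount\nalice,100\nbob,\ncarol,250\n"  # flat-file source

    # 1. Extract: read the source rows into a staging list.
    staged = list(csv.DictReader(io.StringIO(raw)))

    # 2. Transform: drop incomplete rows, convert types, derive a new field.
    transformed = [
        {"customer": r["customer"].title(),
         "amount": float(r["amount"]),
         "high_value": float(r["amount"]) > 200}
        for r in staged if r["amount"]
    ]

    # 3. Load: create the physical structure and insert the rows.
    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE sales (customer TEXT, amount REAL, high_value INTEGER)")
    dw.executemany("INSERT INTO sales VALUES (:customer, :amount, :high_value)",
                   transformed)
    print(dw.execute("SELECT * FROM sales").fetchall())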
18. ADVANTAGES OF ETL PROCESS
The ETL process is an iterative process that is repeated as new data is added to the
warehouse. The process is important because it ensures that the data in the data
warehouse is accurate, complete, and up-to-date. It also helps to ensure that the
data is in the format required for data mining and reporting.
Additionally, there are many different ETL tools and technologies available, such as
Informatica, Talend, DataStage, and others, that can automate and simplify the ETL
process.
In short, an ETL tool extracts the data from the various data source systems, transforms it in the staging area, and then finally loads it into the data warehouse system.
19. ADVANTAGES OF ETL PROCESS IN DATA
WAREHOUSING
Improved data quality: The ETL process ensures that the data in the data warehouse is accurate, complete, and up-to-date.
Better data integration: The ETL process helps to integrate data from multiple sources and systems, making it more accessible and usable.
Increased data security: The ETL process can help to improve data security by controlling access to the data warehouse and ensuring that only authorized users can access the data.
Improved scalability: The ETL process can help to improve scalability by providing a way to manage and analyze large amounts of data.
Increased automation: ETL tools and technologies can automate and simplify the ETL process, reducing the time and effort required to load and update data in the warehouse.
20. DISADVANTAGES OF ETL PROCESS IN DATA
WAREHOUSING:
High cost: The ETL process can be expensive to implement and maintain, especially for organizations with limited resources.
Complexity: The ETL process can be complex and difficult to implement, especially for organizations that lack the necessary expertise or resources.
Limited flexibility: The ETL process can be limited in terms of flexibility, as it may not be able to handle unstructured data or real-time data streams.
Limited scalability: The ETL process can be limited in terms of scalability, as it may not be able to handle very large amounts of data.
Data privacy concerns: The ETL process can raise concerns about data privacy, as large amounts of data are collected, stored, and analyzed.
21. DATA MARTS
As corporate-wide data warehouses came into use, it was discovered that in many situations a full-blown data warehouse was overkill for applications. Data marts evolved to solve this problem. A data mart is a special type of data warehouse. It is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. The primary use for a data mart is Business Intelligence (BI) applications. Implementing a data mart can be less expensive than implementing a data warehouse, thus making it more practical for small businesses. A data mart can also be set up in much less time than a data warehouse.
22. DATA MINING ENGINES
The ETL utilities make data collection from numerous diverse systems practical. Then, the data needs to be converted into useful information. Some key points to remember:
• Data are simply facts, figures, and text that can be processed by a computer. Example: a transaction at a retail point-of-sale is data.
• Information is processed data. For example, analysis of point-of-sale transactions yields information about consumer buying behaviour.
• Knowledge represents a pattern that connects information and usually presents a high grade of predictability as to what is recounted or what will happen next.
Data-mining engines evolved to support complex analysis and ad hoc queries on a data warehouse's database. Data mining looks for patterns among hundreds of seemingly unrelated fields in a large database, patterns that reveal previously unknown trends. These trends play a key role in strategic decision making because they disclose areas for process enhancement.
23. REPORTING TOOLS
The knowledge created by a data-mining engine is not very useful unless it is presented easily and clearly to those who need it. There are many formats for reporting information and knowledge results. One of the common techniques for displaying information is the digital dashboard. It provides a business manager with the input necessary to drive the business towards success. It presents the client with a graphical view of business processes. The client can then drill down into the data at will to get more details on a specific process. Today, many versions of digital dashboards are available from a variety of software vendors.
25. DATA MINING
DEFINITION:
Data mining is the computer-assisted process of extracting knowledge from large amounts of data.
In other words, data mining derives its name from the analogy with mining: just as mining is done in the ground to find valuable ore, data mining is done to find valuable information in a dataset.
Data mining tools predict customer habits, patterns, and future trends, allowing businesses to increase revenues and make proactive decisions.
26. CHARACTERISTICS OF DATA MINING
1. Prediction of likely outcomes.
2. Focus on large datasets and databases.
3. Automatic pattern predictions based on behavior analysis.
4. Calculation: to calculate a feature from other features, any SQL expression can be used (see the sketch below).
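For item 4, a small sketch (hypothetical table, in-memory sqlite): a new feature is derived from existing columns with an ordinary SQL expression.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (quantity INTEGER, unit_price REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?)", [(3, 2.5), (1, 10.0)])

    # Any SQL expression can serve as the calculated feature.
    query = "SELECT quantity, unit_price, quantity * unit_price AS total FROM orders"
    for row in db.execute(query):
        print(row)  # -> (3, 2.5, 7.5) and (1, 10.0, 10.0)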
27. BENEFITS OF DATA MINING:
1. It helps companies gather reliable information.
2. It's an efficient, cost-effective solution compared to other data applications.
3. It helps businesses make profitable production and operational adjustments.
4. Data mining uses both new and legacy systems.
5. It helps businesses make informed decisions.
6. It helps detect credit risks and fraud.
7. It helps data scientists quickly analyze enormous amounts of data.
8. Data scientists can use the information to detect fraud, build risk models, and improve product safety.
9. It helps data scientists quickly initiate automated predictions of behaviors and trends and discover hidden patterns.
28. DATA MINING APPLICATIONS:
• Banks: Data mining helps banks work with credit ratings and anti-fraud systems, analyzing customer financial data, purchasing transactions, and card transactions.
• Healthcare: Data mining helps doctors create more accurate diagnoses by bringing together every patient's medical history, physical examination results, medications, and treatment patterns.
• Marketing: If there was ever an application that benefitted from data mining, it's marketing! After all, marketing's heart and soul is all about targeting customers effectively for maximum results. Of course, the best way to target your audience is to know as much about them as possible.
• Retail: The worlds of retail and marketing go hand-in-hand, but the former still warrants its separate listing. Data mining also pinpoints which campaigns get the most response.
30. CLUSTER ANALYSIS
Cluster analysis, also known as clustering, is a method
of data mining that groups similar data points
together. The goal of cluster analysis is to divide a
dataset into groups (or clusters) such that the data
points within each group are more similar to each
other than to data points in other groups. This process
is often used for exploratory data analysis and can help
identify patterns or relationships within the data that
may not be immediately obvious.
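A minimal clustering sketch (assumes the scikit-learn library; toy data): k-means groups two-dimensional points so that points within a cluster are more similar to one another than to points in other clusters.

    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1, 2], [1, 4], [1, 0],      # one natural group
                       [10, 2], [10, 4], [10, 0]])  # another natural group

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)           # cluster assignment for each point
    print(kmeans.cluster_centers_)  # the two discovered group centers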
31. CLASSIFICATION
Classification is a different method than clustering. Unlike clustering, a classification analysis requires that the end-user/analyst understand ahead of time how classes are characterised.
Example: classes can be defined to represent the likelihood that a customer defaults on a loan (Yes/No).
A common approach for classifiers is to use decision trees to partition and segment records. New records can be classified by traversing the tree from the root through branches and nodes to a leaf representing a class.
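A minimal classification sketch (assumes scikit-learn; hypothetical loan data): a decision tree learns the Yes/No default classes from labeled records, then classifies a new record by traversing the tree from root to leaf.

    from sklearn.tree import DecisionTreeClassifier

    # Features: [income in $1000s, existing debt in $1000s]; label: defaulted?
    X = [[30, 20], [80, 10], [25, 30], [90, 5], [40, 25], [70, 8]]
    y = ["Yes", "No", "Yes", "No", "Yes", "No"]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(tree.predict([[60, 12]]))  # predicted class for a new applicant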
32. REGRESSION ANALYSIS
Regression refers to a data mining technique that is used to predict numeric values in a given dataset. For example, regression might be used to predict the cost of a product or service, or other variables. It is also used in various industries for analyzing business and marketing behavior, trend analysis, and financial forecasting.
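A minimal regression sketch (assumes scikit-learn; hypothetical figures): a linear model is fitted to predict a numeric value, here the cost of a service job, from one input variable.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    hours = np.array([[1], [2], [3], [4]])  # input variable
    cost = np.array([55, 105, 155, 205])    # numeric value to predict

    model = LinearRegression().fit(hours, cost)
    print(model.predict([[5]]))  # predicted cost for a 5-hour job (~255)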
33. TEXT MINING
DEFINITION:
Text mining (also known as text analysis) is the process of transforming unstructured text into structured data for easy analysis. Text mining uses natural language processing (NLP), allowing machines to understand human language and process it automatically. For businesses, the large amount of data generated every day represents both an opportunity and a challenge. On the one hand, data helps companies get smart insights into people's opinions about a product or service. Think about all the potential ideas that you could get from analyzing emails, product reviews, social media posts, customer feedback, support tickets, etc. On the other hand, there's the dilemma of how to process all this data. And that's where text mining plays a major role.
34. TEXT MINING
• Like most things related to natural language processing (NLP), text mining may sound like a hard-to-grasp concept. But the truth is, it doesn't need to be. This guide will go through the basics of text mining, explain its different methods and techniques, and make it simple to understand how it works. Text mining is an automatic process that uses natural language processing to extract valuable insights from unstructured text. By transforming data into information that machines can understand, text mining automates the process of classifying texts by sentiment, topic, and intent.
• Thanks to text mining, businesses are able to analyze complex and large sets of data in a simple, fast, and effective way. At the same time, companies are taking advantage of this powerful tool to reduce some of their manual and repetitive tasks, saving their teams precious time and allowing customer support agents to focus on what they do best.
• Let's say you need to examine tons of reviews on G2 Crowd to understand what customers are praising or criticizing about your SaaS. A text mining algorithm could help you identify the most popular topics that arise in customer comments, and the way people feel about them: are the comments positive, negative, or neutral? You could also find out the main keywords mentioned by customers regarding a given topic (see the sketch after this list).
• In a nutshell, text mining helps companies make the most of their data, which leads to better data-driven business decisions.
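A minimal text-mining sketch (hypothetical reviews and a toy word list; real systems use trained NLP models, but the flow is the same: unstructured text in, structured records out):

    from collections import Counter

    reviews = ["Great dashboard, support was slow",
               "Love the reports", "Pricing is awful"]
    positive, negative = {"great", "love"}, {"slow", "awful"}

    structured, keywords = [], Counter()
    for text in reviews:
        words = {w.strip(",.").lower() for w in text.split()}
        score = len(words & positive) - len(words & negative)
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        structured.append({"text": text, "sentiment": label})
        keywords.update(words)

    print(structured)               # each review tagged with a sentiment
    print(keywords.most_common(3))  # most-mentioned terms across reviews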
35. WEB MINING
DEFINITION
Web mining is the process of using data mining techniques and algorithms to extract information directly from the Web: from Web documents and services, Web content, hyperlinks, and server logs. The goal of Web mining is to look for patterns in Web data by collecting and analyzing information in order to gain insight into trends, the industry, and users in general.
Web mining is a branch of data mining concentrating on the World Wide Web as the primary data source, including all of its components, from Web content and server logs to everything in between. The contents of data mined from the Web may be a collection of facts that Web pages are meant to contain; these may consist of text, structured data such as lists and tables, and even images, video, and audio.
36. CATEGORIES OF WEB MINING
Web content mining: This is the process of mining useful information from the contents of Web pages and Web documents, which are mostly text, images, and audio/video files. Techniques used in this discipline have been heavily drawn from natural language processing (NLP) and information retrieval.
Web structure mining: This is the process of analyzing the nodes and connection structure of a website through the use of graph theory. Two things can be obtained from this: the structure of a website in terms of how it is connected to other sites, and the document structure of the website itself, in terms of how each page is connected.
Web usage mining: This is the process of extracting patterns and information from server logs to gain insight into user activity, including where the users come from, how many users clicked which items on the site, and the types of activities being done on the site. A small sketch follows.
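A minimal web-usage-mining sketch (hypothetical log lines in a common-log-like format): structured fields are extracted from server logs, then simple counts show which pages were requested and which visitors were active.

    import re
    from collections import Counter

    log_lines = [
        '203.0.113.5 - - [10/Feb/2024:10:00:01] "GET /products HTTP/1.1" 200',
        '198.51.100.7 - - [10/Feb/2024:10:00:09] "GET /cart HTTP/1.1" 200',
        '198.51.100.7 - - [10/Feb/2024:10:01:30] "GET /products HTTP/1.1" 200',
    ]

    pattern = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "GET (?P<page>\S+)')
    hits = [m.groupdict() for m in map(pattern.match, log_lines) if m]

    print(Counter(h["page"] for h in hits))  # most requested pages
    print(Counter(h["ip"] for h in hits))    # requests per visitor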