1. Building the Data Warehouse by
Inmon
Chapter 12: The Really Large Data Warehouse
http://it-slideshares.blogspot.com/
2. Why the Rapid Growth?
The Impact of Large Volumes of Data
Disk Storage in the Face of Data Separation
Moving Data from One Environment to
Another
Inverting the Data Warehouse
Total Cost
Maximum Capacity
Summary
4. Why the Rapid Growth?
The data warehouse contains history.
Data warehouses collect data at the most
granular level
The need to bring lots of different kinds
of data together
5. The Impact of Large Volumes of Data
Basic Data-Management Activities
◦ As data volumes grow large, normal database
functions require increasingly larger amounts
of resources.
The Cost of Storage
◦ As the volume of data grows, the cost of the
data increases dramatically
6. The Impact of Large Volumes of Data
The Real Costs of Storage
◦ There are lots of components to disk storage
aside from the storage device itself
Disk controller
Communications lines
Processor
Software
7. The Impact of Large Volumes of Data
The Usage Pattern of Data in the Face of
Large Volumes
◦ Over time, as the volume of data grows, the
percentage of data actually used drops
8. The Impact of Large Volumes of Data
A Simple Calculation
Usage ratio = Actual bytes used / Total data warehouse bytes
◦ As the volume of data in your data
warehouse goes up, the percentage actually
used goes down
Two Classes of Data
◦ Infrequently used data is often called dormant
data or inactive data.
◦ Frequently used data is often called actively used
data.
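A minimal sketch of the usage-ratio calculation and the two classes of data, in Python. The table names, byte counts, and the 5% threshold are illustrative assumptions and do not come from the book.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str         # hypothetical table name
    total_bytes: int  # bytes stored for this table
    bytes_used: int   # bytes actually touched by queries in the period


def usage_ratio(tables):
    """Usage ratio = actual bytes used / total data warehouse bytes."""
    total = sum(t.total_bytes for t in tables)
    used = sum(t.bytes_used for t in tables)
    return used / total if total else 0.0


def classify(table, threshold=0.05):
    """Label one table 'active' or 'dormant' by its own usage ratio.

    The 5% threshold is an assumption made for this example only.
    """
    ratio = table.bytes_used / table.total_bytes if table.total_bytes else 0.0
    return "active" if ratio >= threshold else "dormant"


tables = [
    Table("orders_2024", 200, 120),  # recent data, heavily used
    Table("orders_2015", 800, 8),    # old data, rarely used
]
ratio = usage_ratio(tables)  # (120 + 8) / (200 + 800)
labels = {t.name: classify(t) for t in tables}
```

Adding more old, rarely touched tables raises the denominator faster than the numerator, which is why the ratio drops as the warehouse grows.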
9. The Impact of Large Volumes of Data
Implications of Separating Data into Two
Classes
10. Disk Storage
in the Face of Data Separation
Near-Line Storage
◦ Near-line storage (depending on the vendor) is
sequential storage
◦ Characteristics:
Robotically controlled
Inexpensive
Bulk amounts of data
Reliable over a long period of time
Seconds to access first record
11. Disk Storage
in the Face of Data Separation
Access Speed and Disk Storage
◦ Analogy: disk storage cleared of dormant data versus
disk storage clogged with it is like the difference
between freely flowing blood and blood with many
restricting components
12. Disk Storage
in the Face of Data Separation
Archival Storage
◦ Split storage is needed to manage large
amounts of data
◦ A further tier in addition to disk storage and
near-line (bulk) storage
◦ Differs from near-line storage
13. Disk Storage
in the Face of Data Separation
Implications of Transparency
◦ A record or row in the data warehouse is
identical to a record or row in near-line
storage.
14. Moving Data from
One Environment to Another
Many ways:
◦ have a database administrator manually move data
◦ hierarchical storage management (HSM)
◦ the cross-media storage management (CMSM) option
15. Moving Data from
One Environment to Another
The CMSM Approach
◦ The CMSM technology is fully
automated.
◦ The CMSM is software that makes
the physical location of the data
transparent
◦ The end user does not need to
know where data is—in the data
warehouse or on near-line
storage.
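The transparency idea can be sketched as a toy two-tier store in Python. The class name `TieredStore`, its methods, and the choice to stage a row back to disk on read are assumptions for illustration; this is not the interface of any real CMSM product.

```python
class TieredStore:
    """Toy two-tier store: 'disk' (fast) and 'near_line' (bulk)."""

    def __init__(self):
        self.disk = {}
        self.near_line = {}

    def put(self, key, row):
        """New rows land on disk storage."""
        self.disk[key] = row

    def demote(self, key):
        """Move a row from disk to near-line storage, unchanged."""
        self.near_line[key] = self.disk.pop(key)

    def get(self, key):
        """Return the row wherever it lives; the caller never names a tier."""
        if key in self.disk:
            return self.disk[key]
        if key in self.near_line:
            # Assumption: stage the row back to disk so later reads are fast.
            row = self.near_line.pop(key)
            self.disk[key] = row
            return row
        raise KeyError(key)


store = TieredStore()
store.put(1, {"customer": "A"})
store.demote(1)     # the row now sits on near-line storage
row = store.get(1)  # same call, same row, regardless of location
```

Because the row is stored identically in both tiers, `get` needs no conversion step, only a lookup in the right place.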
16. Moving Data from
One Environment to Another
A Data Warehouse Usage Monitor
◦ Streamline the operations of the CMSM
environment
◦ Two types:
those that are supplied by the DBMS vendor
those supplied by third-party vendors
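A usage monitor can be pictured as a counter of row accesses whose output tells the CMSM which rows are candidates for near-line storage. The sketch below is a simplified Python illustration; `UsageMonitor`, `record`, `dormant`, and the `min_hits` cutoff are hypothetical names and do not describe any vendor's monitor.

```python
from collections import Counter


class UsageMonitor:
    """Counts how often each row key is accessed."""

    def __init__(self):
        self.hits = Counter()

    def record(self, key):
        """Call once for every row access observed."""
        self.hits[key] += 1

    def dormant(self, all_keys, min_hits=1):
        """Keys accessed fewer than min_hits times: candidates to move off disk."""
        return sorted(k for k in all_keys if self.hits[k] < min_hits)


monitor = UsageMonitor()
for key in ["r1", "r2", "r1"]:
    monitor.record(key)

candidates = monitor.dormant(["r1", "r2", "r3"])
stricter = monitor.dormant(["r1", "r2", "r3"], min_hits=2)
```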
17. Inverting the Data Warehouse
Inverted data warehouse: first consider a
normal data warehouse.
To build a data warehouse:
◦ Normal way: data is first put into disk storage;
after the data ages, it moves to near-line or
archival storage
◦ Alternative (inverted) way: data is first entered
into near-line storage, not disk storage; it is
“staged” from the near-line environment to the
disk environment to be accessed and analyzed;
once the analysis is over, it is returned to
near-line storage
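The inverted flow (stage to disk, analyze, return to near-line) can be sketched with a Python context manager. The partition names and values are made up, and the dictionaries `near_line` and `disk` merely stand in for the two storage environments.

```python
from contextlib import contextmanager

# Data enters near-line storage first; disk starts empty.
near_line = {"sales_2019": [100, 250, 75], "sales_2020": [300, 125]}
disk = {}


@contextmanager
def staged(partition):
    """Stage a partition onto disk for analysis, then return it to near-line."""
    disk[partition] = near_line.pop(partition)
    try:
        yield disk[partition]
    finally:
        # Runs even if the analysis raises, so disk does not fill up.
        near_line[partition] = disk.pop(partition)


with staged("sales_2019") as rows:
    total = sum(rows)  # the analysis runs against disk-resident data
```

After the `with` block ends, disk storage is empty again and the partition is back on near-line storage.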
18. Total Cost
With the introduction of near-line and
archival storage, the growing costs of a
data warehouse can be mitigated
19. Maximum Capacity
“XYZ machine can handle up to nnn terabytes
of data.”
Parameters that measure a machine's capacity:
Volumes of data
Number of users
Workload complexity
The balanced case is where there is a fair
amount of data, a fair number of users, and a
reasonably complex workload
20. Summary
Data warehouses grow large explosively
The data inside the warehouse separates
into one of two classes—frequently used
data or infrequently used data
Without near-line and/or archival
storage, the costs of the data
warehouse skyrocket as the data
warehouse grows large
Editor's Notes
Historical data × Detailed data × Diverse data = Lots of data
Splitting data over multiple storage media based on frequency of usage
Archival storage is very similar to near-line storage, except that in archival storage the probability of access drops very low. To put the probability of access in perspective, consider the following simple chart:
  High-performance disk storage: access a unit of data once a month
  Near-line storage: access 0.5 units of data every year
  Archival storage: access 0.1 units of data every decade
Near-line storage can be thought of as a logical extension of the data warehouse. Archival storage cannot be thought of as a logical extension.
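To compare the three access figures in the note above, it helps to put them on one scale. The Python sketch below converts each to accesses per year; the figures are the ones quoted in the note, while the dictionary layout and names are assumptions for the example.

```python
# How many of each period fit in one year.
PERIODS_PER_YEAR = {"month": 12, "year": 1, "decade": 0.1}

# (units of data accessed, per period), as quoted in the note.
tiers = {
    "high-performance disk": (1.0, "month"),
    "near-line": (0.5, "year"),
    "archival": (0.1, "decade"),
}

per_year = {
    name: units * PERIODS_PER_YEAR[period]
    for name, (units, period) in tiers.items()
}

# Disk works out to 12 accesses per year, near-line to 0.5,
# and archival to roughly 0.01.
disk_vs_archival = per_year["high-performance disk"] / per_year["archival"]
```

On this scale disk data is accessed about 24 times as often as near-line data and about 1,200 times as often as archival data.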
Options for Moving Data:
  ADVANTAGES
    Manual: very simple; available immediately; operates at the row level
    HSM: relatively simple; not too expensive; fully automated
    CMSM: fully automated; operates at the row level
  DISADVANTAGES
    Manual: prone to error; requires human interaction
    HSM: operates at the data set level
    CMSM: expensive; complex to implement and operate
Third-party monitors are much better, because the monitors supplied by the DBMS vendors require far more resources than those supplied by third parties.
The Extension of the Data Warehouse across Different Storage Media: The data warehouse can grow to petabytes (a petabyte is a quadrillion bytes) of data and still be effective and still be managed.