Lecture 11 Unstructured Data and the Data Warehouse

Building the Data Warehouse
by Inmon

Chapter 11: Unstructured Data and the Data Warehouse

http://it-slideshares.blogspot.com/

Contents
Overview
Integrating the Two Worlds
A Themed Match
A Two-Tiered Data Warehouse
A Self-Organizing Map (SOM)
Fitting the Two Environments Together
Summary

Overview
Unstructured data
◦ Casual, informal activities such as those found
on the personal computer and the Internet
◦ Examples: emails, spreadsheets, text files,
documents, Portable Document Format
(.PDF) files, Microsoft PowerPoint (.PPT) files
Structured data
◦ Standard DBMSs, reports, indexes, databases,
fields, records, and the like

Overview (cont'd)
The primary differences between
structured data and unstructured data

Integrating the Two Worlds
Text — The Common Link

Plenty of problems arise:
• Misspelling
• Context
• Same name
• Nicknames
• Diminutives
• Incomplete names
• Word stems

Integrating the Two Worlds (cont'd)
A Fundamental Mismatch
◦ The unstructured environment represents
documents and communications.
◦ The structured environment represents
transactions.
Matching Text across the Environments
◦ Remove extraneous stop words
◦ Reduction of words back to their stem
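
The two text-matching steps above (removing stop words, then reducing words to their stems) can be sketched as follows. This is only an illustration: the stop-word list and the suffix-stripping rule are assumptions made for the example, not the method described in the book, and a production system would use a full stop-word list and a real stemming algorithm.

```python
import re

# Assumption: a tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "in", "for", "on"}

# Assumption: naive suffix stripping, not a full stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stop words, and reduce words to stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

After normalization, "The moving of the boxes" and "moved the box" reduce to the same list of stems, so text from the two environments can be compared directly.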

A Probabilistic Match
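
A rough sketch of matching names that are close but not identical (misspellings, nicknames) is given here. It uses a string-similarity score from Python's standard `difflib` module as a stand-in for a match probability; that substitution, the nickname table, and the 0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Assumption: a hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}


def canonical(name):
    """Lowercase a name and expand known nicknames token by token."""
    return " ".join(NICKNAMES.get(tok, tok) for tok in name.lower().split())


def match_score(name_a, name_b):
    """Return a similarity in [0, 1], used as a rough stand-in for match probability."""
    return SequenceMatcher(None, canonical(name_a), canonical(name_b)).ratio()


def best_match(name, candidates, threshold=0.8):
    """Return (candidate, score) for the best-scoring candidate, or None below threshold."""
    scored = [(c, match_score(name, c)) for c in candidates]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None
```

A name from an email such as "Bill Inmon" can then be compared against customer names held in the structured environment, and only matches above the threshold are accepted.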

Matching All the Information

A Themed Match
Industrially Recognized Themes
◦ The unstructured data is analyzed according
to the existence of words that relate to
industrially recognized themes.
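
One way to picture this analysis is to count how many words in a document belong to each theme's vocabulary. The sketch below assumes two made-up theme vocabularies; the theme names and word lists are hypothetical and exist only for the example.

```python
import re
from collections import Counter

# Assumption: hypothetical theme vocabularies invented for illustration only.
THEMES = {
    "accounting": {"invoice", "ledger", "receivable", "payable"},
    "insurance": {"claim", "policy", "premium", "deductible"},
}


def theme_hits(text):
    """Count how many words in the text belong to each theme's vocabulary."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = Counter()
    for theme, vocabulary in THEMES.items():
        hits[theme] = sum(1 for w in words if w in vocabulary)
    return hits
```

A document whose words mostly fall under one theme can then be tagged with that theme and linked to structured data on the same subject.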

A Themed Match
Naturally Occurring Themes
• fire—296 occurrences
• fireman—285 occurrences
• hose—277 occurrences
• firetruck—201 occurrences
• alarm—199 occurrences
• smoke—175 occurrences
• heat—128 occurrences

• fire—296 occurrences
• Rock Springs, WY—2
• alabaster—1
• angel—2
• Rio Grande river – 1
• beaver dam—1
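
Occurrence counts like those above can be produced by tallying words across a collection of documents and keeping only the frequent ones. A minimal sketch, assuming plain-text documents and an arbitrary cutoff (`min_count` is a name introduced here for illustration); in practice stop words would be removed first.

```python
import re
from collections import Counter


def natural_themes(documents, min_count=2):
    """Return (word, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return [(w, n) for w, n in counts.most_common() if n >= min_count]
```

Words that clear the cutoff suggest a theme of the collection, while words that occur only once or twice do not.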

A Themed Match
Linkage through Themes and Themed
Words

A Themed Match
Linkage through Abstraction and
Metadata
◦ Abstraction and metadata are another way to link the two environments.

A Two-Tiered Data Warehouse
Two-Tiered Data Warehouse
◦ One tier of the data warehouse is for
unstructured data and another tier of the data
warehouse is for structured data.

Dividing the Unstructured Data Warehouse
◦ Unstructured communications
◦ Documents and libraries

Documents in the Unstructured Data
Warehouse
Several factors determine whether the actual
document is stored in the data warehouse:
• How many documents are there?
• What is the size of the documents?
• How critical is the information in the document?
• Can the document be easily reached if it is not
stored in the warehouse?
• Can subsections of the document be captured?

Visualizing Unstructured Data
◦ Unstructured visualization is the counterpart
to structured visualization.
◦ Structured visualization is known as Business
Intelligence
◦ The essence of structured visualization is the
display of numbers

A Self-Organizing Map (SOM)
◦ Produces a display that appears to be a
topographical map
◦ Shows how words and documents are
clustered and displayed according to
themes
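
To give a feel for how such a map groups documents, here is a very small one-dimensional SOM in plain Python. It is a sketch under stated assumptions: documents are already represented as numeric vectors (for example, word counts), and the unit count, learning rate, and neighborhood radius are arbitrary choices for the example, not values from the book. Real SOM tools use a two-dimensional grid to produce the map-like display.

```python
import math
import random


def best_unit(weights, vec):
    """Index of the unit whose weights are closest to vec (squared Euclidean distance)."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], vec)))


def train_som(vectors, n_units=4, epochs=50, seed=0):
    """Train a one-dimensional SOM and return its list of unit weight vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs               # decays from 1 toward 0
        rate = 0.5 * frac                         # learning rate shrinks over time
        radius = max(1.0, (n_units / 2) * frac)   # neighborhood shrinks over time
        for vec in vectors:
            winner = best_unit(weights, vec)
            for i, unit in enumerate(weights):
                # Units near the winner on the map are pulled toward the input too.
                influence = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    unit[d] += rate * influence * (vec[d] - unit[d])
    return weights
```

After training, calling `best_unit(weights, vec)` for each document vector assigns it to a position on the map; documents with similar word usage tend to land on the same or neighboring units.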


The Unstructured Data Warehouse
◦ Is divided into two basic organizations—one part
for documents and another part for
communications


Volumes of Data and the Unstructured Data
Warehouse
◦ Volumes of data are an issue
◦ Steps should be taken to mitigate the volumes of data
that can collect in the unstructured data warehouse

Fitting the Two Environments Together
The unstructured environment may contain
data that is incompatible with data from the
structured environment.
However, there are ways that the two
environments can be related.

Fitting the Two Environments Together

Summary
The world of information technology is
divided into two worlds: structured data and
unstructured data.
The common bond between the two worlds is
text.
The structured environment and the
unstructured environment can be matched at:
◦ the identifier level
◦ the close identifier level using a probabilistic
match
◦ the keyword to metadata or repository level
