This chapter discusses integrating structured and unstructured data in a data warehouse. It presents methods like using common text to link the two environments, employing a two-tiered structure with separate warehouses for structured and unstructured data, and using techniques like self-organizing maps to visualize unstructured data. The goal is to find ways to relate the different data types while addressing issues like incompatible formats and large unstructured data volumes.
Lecture 11 Unstructured Data and the Data Warehouse
1. Building the Data Warehouse, by Inmon
Chapter 11: Unstructured Data and the Data Warehouse
http://it-slideshares.blogspot.com/
2. Contents
Overview
Integrating the Two Worlds
A Themed Match
A Two-Tiered Data Warehouse
A Self-Organizing Map (SOM)
Fitting the Two Environments Together
Summary
3. Overview
Unstructured data
◦ Casual, informal activities such as those found
on the personal computer and the Internet
◦ Ex: Emails, Spreadsheets, Text files,
Documents, Portable Document Format
(.PDF) files, Microsoft PowerPoint (.PPT) files
Structured data
◦ Standard DBMSs, reports, indexes, databases,
fields, records, and the like
4. Overview (cont’)
The primary differences between
structured data and unstructured data
5. Integrating the Two Worlds
Text — The Common Link
Plenty of problems arise:
• Misspelling
• Context
• Same name
• Nicknames
• Diminutives
• Incomplete names
• Word stems
6. Integrating the Two Worlds (cont’)
A Fundamental Mismatch
◦ The unstructured environment represents
documents and communications.
◦ The structured environment represents
transactions.
Matching Text across the Environments
◦ Remove extraneous stop words
◦ Reduction of words back to their stem
9. A Themed Match
Industrially Recognized Themes
◦ The unstructured data is analyzed according
to the existence of words that relate to
industrially recognized themes.
10. A Themed Match
Naturally Occurring Themes
• fire—296 occurrences
• fireman—285 occurrences
• hose—277 occurrences
• firetruck—201 occurrences
• alarm—199 occurrences
• smoke—175 occurrences
• heat—128 occurrences
• Rock Springs, WY—2
• alabaster—1
• angel—2
• Rio Grande river – 1
• beaver dam—1
13. A Two-Tiered Data Warehouse
Two-Tiered Data Warehouse
◦ One tier of the data warehouse is for
unstructured data and another tier of the data
warehouse is for structured data.
14. A Two-Tiered Data Warehouse
Dividing the Unstructured Data Warehouse
◦ Unstructured communications
◦ Documents and libraries
15. A Two-Tiered Data Warehouse
Documents in the Unstructured Data Warehouse
Several factors determine whether the actual document is stored in the data warehouse:
◦ How many documents are there?
◦ What is the size of the documents?
◦ How critical is the information in the document?
◦ Can the document be easily reached if it is not stored in the warehouse?
◦ Can subsections of the document be captured?
16. A Two-Tiered Data Warehouse
Visualizing Unstructured Data
◦ Unstructured visualization is the counterpart
to structured visualization.
◦ Structured visualization is known as Business
Intelligence
◦ The essence of structured visualization is the
display of numbers
17. A Two-Tiered Data Warehouse
A Self-Organizing Map (SOM)
◦ Produces a display that appears to be a
topographical map
◦ Shows how different words and documents
are clustered and displayed
according to themes
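The slide does not say how the map is computed. As a rough illustration only, the following is a minimal sketch of a generic self-organizing map in Python, assuming each document has already been reduced to a numeric vector (for example, word counts scaled to the range 0 to 1). The function names, grid size, learning rate, and neighborhood radius are all assumptions made for this sketch, not details from the chapter.

```python
import math
import random

def best_unit(grid, vec):
    """Return the grid cell whose weight vector is closest to vec."""
    return min(grid, key=lambda cell: math.dist(grid[cell], vec))

def train_som(vectors, rows=3, cols=3, epochs=40, seed=0):
    """Train a small map; all parameter defaults are illustrative assumptions."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One weight vector per grid cell, randomly initialized in [0, 1).
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs               # shrinks from 1 toward 0
        rate = 0.5 * decay                         # learning rate
        radius = 0.5 + decay * max(rows, cols) / 2 # neighborhood radius
        for vec in vectors:
            winner = best_unit(grid, vec)
            for cell, weights in grid.items():
                gap = math.dist(cell, winner)      # distance on the grid
                if gap > radius:
                    continue
                # Cells near the winner are pulled harder toward the input.
                pull = rate * math.exp(-gap * gap / (2 * radius * radius))
                for i in range(dim):
                    weights[i] += pull * (vec[i] - weights[i])
    return grid
```

After training, calling `best_unit(grid, vector)` for each document gives its position on the grid. Documents with similar vectors tend to land on the same or neighboring cells, which is what a topographical-style display would then draw.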
18. A Themed Match
The Unstructured Data Warehouse
◦ Is divided into two basic organizations—one part
for documents and another part for
communications
19. A Themed Match
Volumes of Data and the Unstructured Data
Warehouse
◦ Volumes of data are an issue
◦ Mitigate the volumes of data that can collect in the
unstructured data warehouse
20. Fitting the Two Environments Together
The unstructured environment may contain
data that is incompatible with data from the
structured environment
However, there are ways that the two
environments can be related
22. Summary
The world of information technology is
divided into two worlds: structured data and
unstructured data
The common bond between the two worlds is
text.
The structured environment and the
unstructured environment can be matched at:
◦ the identifier level
◦ the close identifier level using a probabilistic
match
◦ the keyword to metadata or repository level
Editor's Notes
Integrating the two worlds is like matching different formats of electricity: alternating current (AC) and direct current (DC). The unstructured world operates on AC and the structured world operates on DC.
Problems in integrating by text:
• Misspelling: What if two words are found in the two environments, Chernobyl and Chernobile? Should a match be made between them? Do they refer to the same thing or something different?
• Context: The term “bill” is found in the two worlds. Should they be matched? In one case, the reference is to a bird’s beak and in the other case, the reference is to how much money a person is owed.
• Same name: The same name, “Bob Smith,” appears in both worlds. Do they refer to the same person? Or do they refer to entirely different people who happen to have matching names?
• Nicknames: In one world, there appears the name “Bill Inmon.” In another world there appears the name “William Inmon.” Should a match be made? Do they refer to the same person?
• Diminutives: Is 1245 Sharps Ct the same as 1245 Sharps Court? Is NY, NY, the same as New York, New York?
• Incomplete names: Is Mrs. Inmon the same as Lynn Inmon?
• Word stems: Should the word “moving” be connected and matched with the word “moved”?
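Some of the problems above (misspelling, nicknames, diminutives) can be partly addressed by normalizing text before comparing it. The following is a minimal Python sketch, not a method given in the chapter: the lookup tables, the function names, and the 0.8 similarity threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical lookup tables; a real system would need far larger ones.
NICKNAMES = {"bill": "william", "bob": "robert"}
ABBREVIATIONS = {"ct": "court", "ny": "new york"}

def normalize(text, table):
    """Lowercase, drop commas and periods, and expand tokens found in the table."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(table.get(tok, tok) for tok in tokens)

def similar(a, b, threshold=0.8):
    """Flag two strings as a candidate match when their similarity ratio is high."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Nicknames: both spellings reduce to the same normalized form.
print(normalize("Bill Inmon", NICKNAMES) == normalize("William Inmon", NICKNAMES))
# Diminutives: abbreviations are expanded before comparison.
print(normalize("1245 Sharps Ct", ABBREVIATIONS))
# Misspelling: near-identical spellings become candidate matches.
print(similar("Chernobyl", "Chernobile"))
```

String similarity only produces candidates. It cannot settle the context problem (a bird's bill versus a bill for money) or the same-name problem, both of which need additional data about the record.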
A stop word is a word that occurs so frequently as to be meaningless to the document. Typical stop words include the following: a, an, the, for, to, by, from, when, which…
The second basic edit that must be done is the reduction of words back to their stem. For example, the following words all have the same grammatical stem, “move”: moving, moved, moves, mover, removing.
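The two edits described here can be sketched in a few lines of Python. This is an illustrative stand-in, not a real stemming algorithm: the suffix list and function names are assumptions, the stem it produces is a crude key ("mov") rather than the dictionary form "move", and it does not strip prefixes, so "removing" would not be grouped with the others.

```python
STOP_WORDS = {"a", "an", "the", "for", "to", "by", "from", "when", "which"}
SUFFIXES = ("ing", "ed", "es", "er", "e", "s")  # checked in this order

def crude_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def edit_text(text):
    """Drop stop words, then reduce the remaining words to a crude stem."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    return [crude_stem(w) for w in words if w and w not in STOP_WORDS]

print(edit_text("The mover moved the boxes to the truck"))
```

Because "moving", "moved", "moves", and "mover" all reduce to the same key, a word found in one environment can be matched with a different form of it in the other.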
In a probabilistic match, as much data as possible that might identify the “Bob Smith” you are looking for is gathered and used as the basis for a match against similar data found where other “Bob Smiths” are located. Then, all the data that intersects is used to determine whether a match on the name is valid.
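One simple way to picture this is to score each same-named record by how many of the intersecting attributes agree. The sketch below is an assumption-laden illustration in Python: the agreement ratio stands in for whatever probability model a real matcher would use, and the function names, record fields, and 0.5 threshold are hypothetical.

```python
def match_score(candidate, reference):
    """Fraction of intersecting attributes (other than name) on which both agree."""
    shared = [k for k in candidate if k in reference and k != "name"]
    if not shared:
        return 0.0
    agree = sum(1 for k in shared if candidate[k] == reference[k])
    return agree / len(shared)

def best_match(candidate, records, threshold=0.5):
    """Return the same-named record with the highest score at or above the threshold."""
    same_name = [r for r in records if r["name"] == candidate["name"]]
    scored = [(match_score(candidate, r), r) for r in same_name]
    scored = [(s, r) for s, r in scored if s >= threshold]
    return max(scored, key=lambda pair: pair[0])[1] if scored else None
```

With a candidate "Bob Smith" that carries a city and an employer, `best_match` returns the stored "Bob Smith" whose city and employer agree, and returns `None` when no record shares enough data to support a match.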
The accounting theme would contain words and phrases such as the following: receivable, payable, cash on hand, asset, debit, due date, account…
The finance theme would contain such information as the following: price, margin, discount, gross sale, net sale, interest rate, carrying loan, balance due.
There can be many industrially recognized themes for word collections. Some of the word themes might be the following: sales, marketing, finance, human resources, engineering, accounting, distribution…
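Analyzing a document against such themes amounts to checking its words against each theme's vocabulary. Here is a minimal Python sketch under stated assumptions: the vocabularies are a single-word subset of the lists above, multi-word phrases such as "cash on hand" are ignored, and the function name is hypothetical.

```python
# Single-word subsets of the theme vocabularies listed above.
THEMES = {
    "accounting": {"receivable", "payable", "asset", "debit", "account"},
    "finance": {"price", "margin", "discount", "interest", "balance"},
}

def theme_hits(text):
    """Count how many distinct words of each theme vocabulary appear in the text."""
    words = {w.strip(".,;:").lower() for w in text.split()}
    return {theme: len(words & vocab) for theme, vocab in THEMES.items()}

print(theme_hits("The debit was posted to the account as a receivable."))
```

A document would then be assigned to the theme, or themes, with the most hits.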
In an organization by “natural” themes, the unstructured data is collected on a document-by-document basis. Once the data is collected, the words and phrases inside each document are ranked by number of occurrences, and the highest-ranked words and phrases form the theme of the document.
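The ranking step can be sketched with Python's standard `collections.Counter`. The function name, the stop-word list, and the `top` cutoff are assumptions for illustration; the chapter does not say how many top-ranked words make up a theme.

```python
from collections import Counter

STOP_WORDS = frozenset({"a", "an", "the", "for", "to", "by", "from", "when", "which"})

def natural_theme(text, top=3):
    """Rank the words of one document by occurrences and keep the top few."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOP_WORDS)
    return counts.most_common(top)

print(natural_theme("Fire! The alarm rang, the hose was ready. Fire, fire, hose.", top=2))
```

Words that occur only once or twice, like "alabaster" or "beaver dam" in the slide's example, fall to the bottom of the ranking and do not contribute to the theme.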
Raw match of data: if a word is found anywhere in the structured environment and the word is part of the theme of a document, the unstructured document is linked to the structured record. But such a match is not very meaningful and may actually be misleading.
In Figure 11-11, data in the structured environment includes such people as Bill Jones, Mary Adams, Wayne Folmer, and Susan Young. All of these people exist in records of data that have a data element called “Name.” Put another way, data exists at two levels in the structured environment: the abstract level and the actual occurrence level. Figure 11-12 shows this relationship of data. In Figure 11-12, data exists at an abstract level, the metadata level. In addition, data exists at the occurrence level, where the actual occurrences of data reside.
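The difference between the two levels can be shown with a small Python sketch. The table layout, the "City" column and its values, and the function name are hypothetical; only the "Name" data element and the people's names come from the text.

```python
# Hypothetical structured table: metadata (column names) plus occurrences (rows).
TABLE = {
    "columns": ["Name", "City"],
    "rows": [("Bill Jones", "Dallas"), ("Mary Adams", "Tulsa")],
}

def match_level(term, table=TABLE):
    """Report whether a term matches at the metadata level, the occurrence level, or not at all."""
    wanted = term.lower()
    if wanted in (c.lower() for c in table["columns"]):
        return "metadata"
    if any(wanted == str(v).lower() for row in table["rows"] for v in row):
        return "occurrence"
    return None

print(match_level("name"))        # matches a data element, the abstract level
print(match_level("Mary Adams"))  # matches an actual occurrence
```

A keyword from an unstructured document can therefore be linked either to a data element as a whole or to one specific record.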
The data found in the unstructured data warehouse is in many ways similar to the data found in the structured data warehouse. Consider the following when looking at data in the unstructured environment:
• It exists at a low level of granularity.
• It has an element of time attached to the data.
• It is typically organized by subject area or “theme.”
The data that can be stored in each section includes the following:
• The first n bytes of the document
• The document itself (optional)
• The communication itself (optional)
• Context information
• Keyword information
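The list above suggests a record layout along these lines. This is a sketch in Python with assumed field names and types, plus a hypothetical helper that keeps the full document only on request; the chapter does not prescribe a schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnstructuredEntry:
    """One entry in the unstructured data warehouse (field names are assumptions)."""
    first_bytes: bytes                      # the first n bytes of the document
    context: dict                           # context information
    keywords: list                          # keyword information
    document: Optional[bytes] = None        # the document itself (optional)
    communication: Optional[bytes] = None   # the communication itself (optional)

def make_entry(raw, context, keywords, n=64, keep_document=False):
    """Build an entry, storing the whole document only when keep_document is set."""
    return UnstructuredEntry(
        first_bytes=raw[:n],
        context=context,
        keywords=keywords,
        document=raw if keep_document else None,
    )
```

Leaving `document` empty and keeping only the first n bytes, the context, and the keywords is one way to limit the volume of data that collects in the unstructured data warehouse.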
An identifier is an occurrence of data that serves to specifically identify a record. Close identifiers are identifiers where there is a good probability that a solid identification has been made.