Building the Data Warehouse by Inmon

Chapter 11: Unstructured Data and the Data Warehouse

http://it-slideshares.blogspot.com/
Contents
Overview
Integrating the Two Worlds
A Themed Match
A Two-Tiered Data Warehouse
A Self-Organizing Map (SOM)
Fitting the Two Environments Together
Summary
Overview
Unstructured data
 ◦ Produced by casual, informal activities such as those found on the personal computer and the Internet
 ◦ Examples: emails, spreadsheets, text files, documents, Portable Document Format (.PDF) files, Microsoft PowerPoint (.PPT) files
Structured data
 ◦ Found in standard DBMSs, reports, indexes, databases, fields, records, and the like
Overview (cont’d)
The primary differences between structured data and unstructured data
Integrating the Two Worlds
Text: The Common Link
Plenty of problems arise when matching on text:
 ◦ Misspelling
 ◦ Context
 ◦ Same name
 ◦ Nicknames
 ◦ Diminutives
 ◦ Incomplete names
 ◦ Word stems
Integrating the Two Worlds (cont’d)
A Fundamental Mismatch
 ◦ The unstructured environment represents documents and communications.
 ◦ The structured environment represents transactions.
Matching Text across the Environments
 ◦ Remove extraneous stop words
 ◦ Reduce words back to their stem
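The two edits above can be sketched in a few lines of Python. This is a minimal illustration, not the method from the book: the stop-word set is the short list quoted in the editor's notes, and `SUFFIXES` and `crude_stem` are hypothetical stand-ins for a real stemming algorithm. The toy stemmer reduces the "move" family to "mov" rather than "move", and it does not strip prefixes, so "removing" would not join that family.

```python
import re

# Stop words quoted in the editor's notes; a production list would be longer.
STOP_WORDS = {"a", "an", "the", "for", "to", "by", "from", "when", "which"}

# Hypothetical toy suffix list, checked in order; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "er", "s")


def crude_stem(word: str) -> str:
    """Strip one known suffix, keeping at least three leading letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(normalize("The mover moved the moving boxes"))
# ['mov', 'mov', 'mov', 'box']
```

After this normalization, "mover", "moved", and "moving" compare as equal, which is what makes a text match across the two environments possible.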
Integrating the Two Worlds (cont’d)
A Probabilistic Match
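A probabilistic match gathers as much data as possible about the person being sought and compares it with the data held on each same-named record. The sketch below is one simple way to express that idea, assuming records are plain dictionaries of strings. The field names (`city`, `employer`), the `agreement` ratio, and the 0.6 threshold are all hypothetical choices for illustration; a real system would weight attributes rather than count them equally.

```python
def agreement(candidate: dict, record: dict) -> float:
    """Fraction of shared non-name attributes whose values agree."""
    shared = [k for k in candidate if k != "name" and k in record]
    if not shared:
        return 0.0
    same = sum(
        1
        for k in shared
        if candidate[k].strip().lower() == record[k].strip().lower()
    )
    return same / len(shared)


def probable_matches(candidate: dict, records: list, threshold: float = 0.6):
    """Return (score, record) pairs for same-named records above the threshold."""
    scored = []
    for record in records:
        if record.get("name", "").lower() != candidate["name"].lower():
            continue  # only records carrying the same name are candidates
        score = agreement(candidate, record)
        if score >= threshold:
            scored.append((score, record))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical data: three different "Bob Smith" records.
candidate = {"name": "Bob Smith", "city": "Dallas", "employer": "Acme"}
records = [
    {"id": 1, "name": "Bob Smith", "city": "Dallas", "employer": "Acme"},
    {"id": 2, "name": "Bob Smith", "city": "Denver", "employer": "Acme"},
    {"id": 3, "name": "Bob Smith", "city": "Boston", "employer": "Initech"},
]
matches = probable_matches(candidate, records)
print([record["id"] for _, record in matches])
```

Only the record whose intersecting attributes all agree clears the threshold here; the record that agrees on employer alone scores 0.5 and is left out.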
Integrating the Two Worlds (cont’d)
Matching All the Information
A Themed Match
Industrially Recognized Themes
 ◦ The unstructured data is analyzed according to the existence of words that relate to industrially recognized themes.
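One way to picture this analysis is to count how often each theme's words and phrases occur in a document. The sketch below uses the accounting and finance word lists given in the editor's notes; the `theme_scores` function, the whole-word regular-expression match, and the sample memo are assumptions made for illustration.

```python
import re

# Word lists taken from the editor's notes for two themes.
THEMES = {
    "accounting": ["receivable", "payable", "cash on hand", "asset",
                   "debit", "due date", "account"],
    "finance": ["price", "margin", "discount", "gross sale", "net sale",
                "interest rate", "carrying loan", "balance due"],
}


def theme_scores(text: str, themes: dict = THEMES) -> dict:
    """Count whole-word occurrences of each theme's terms in the text."""
    lowered = text.lower()
    scores = {}
    for theme, terms in themes.items():
        scores[theme] = sum(
            len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
            for term in terms
        )
    return scores


# Hypothetical document text.
memo = "The receivable is past its due date, and the debit hit the wrong account."
scores = theme_scores(memo)
print(scores)                       # {'accounting': 4, 'finance': 0}
print(max(scores, key=scores.get))  # accounting
```

The word-boundary match keeps "account" from being counted inside longer words; whether that is the right choice depends on whether stemming has already been applied.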
A Themed Match
Naturally Occurring Themes
 ◦ fire: 296 occurrences
 ◦ fireman: 285 occurrences
 ◦ hose: 277 occurrences
 ◦ firetruck: 201 occurrences
 ◦ alarm: 199 occurrences
 ◦ smoke: 175 occurrences
 ◦ heat: 128 occurrences

 ◦ fire: 296 occurrences
 ◦ Rock Springs, WY: 2 occurrences
 ◦ alabaster: 1 occurrence
 ◦ angel: 2 occurrences
 ◦ Rio Grande river: 1 occurrence
 ◦ beaver dam: 1 occurrence
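Counts like those above come from ranking the words of a single document by number of occurrences. A minimal sketch using the standard library's `collections.Counter` follows; the `rank_terms` function, the small stop-word set, and the sample report are hypothetical.

```python
import re
from collections import Counter

# A small stop-word set for the example; a production list would be longer.
STOP_WORDS = {"a", "an", "and", "the", "for", "to", "by", "from", "when", "which"}


def rank_terms(text: str, top_n: int = 5) -> list:
    """Rank the words of one document by number of occurrences."""
    tokens = [
        t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS
    ]
    return Counter(tokens).most_common(top_n)


# Hypothetical document text.
report = (
    "The alarm rang. Fire and smoke; the fireman ran a hose to the fire. "
    "Smoke, fire."
)
print(rank_terms(report, top_n=2))
# [('fire', 3), ('smoke', 2)]
```

The highest-ranked words then serve as the document's natural theme. A document whose counts are all low and unrelated has no clear theme.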
A Themed Match
Linkage through Themes and Themed Words
A Themed Match
Linkage through Abstraction and Metadata
 ◦ Abstraction and metadata offer another way to link the two environments.
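The editor's notes describe structured data as existing at two levels: the abstract (metadata) level, such as a data element called "Name", and the occurrence level, such as "Mary Adams". The sketch below shows one possible way to use both levels to link a piece of unstructured text to the structured side. The functions, the `City` element and its values, and the sample email are hypothetical, and the plain substring test stands in for a proper text match.

```python
def build_occurrence_index(records: list) -> dict:
    """Map each occurrence value (lowercased) to the data elements it appears under."""
    index = {}
    for record in records:
        for element, value in record.items():
            index.setdefault(str(value).lower(), set()).add(element)
    return index


def link_by_metadata(text: str, index: dict) -> dict:
    """For each occurrence found in the text, report its metadata element(s)."""
    lowered = text.lower()
    return {
        value: sorted(elements)
        for value, elements in index.items()
        if value in lowered
    }


# Structured records: "Name" is the abstract level, the values are occurrences.
records = [
    {"Name": "Bill Jones", "City": "Austin"},
    {"Name": "Mary Adams", "City": "Denver"},
]
# Hypothetical unstructured text.
email = "Please forward the contract to Mary Adams before Friday."
links = link_by_metadata(email, build_occurrence_index(records))
print(links)
# {'mary adams': ['Name']}
```

The result records not just that a string matched, but that it matched as a "Name", which is more meaningful than a raw word match.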
A Two-Tiered Data Warehouse
Two-Tiered Data Warehouse
 ◦ One tier of the data warehouse is for unstructured data and another tier of the data warehouse is for structured data.
A Two-Tiered Data Warehouse
Dividing the Unstructured Data Warehouse
 ◦ Unstructured communications
 ◦ Documents and libraries
A Two-Tiered Data Warehouse
Documents in the Unstructured Data Warehouse
Several factors determine whether the actual document is stored in the data warehouse:
 ◦ How many documents are there?
 ◦ What is the size of the documents?
 ◦ How critical is the information in the document?
 ◦ Can the document be easily reached if it is not stored in the warehouse?
 ◦ Can subsections of the document be captured?
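The editor's notes list what each section of the unstructured data warehouse can hold: the first n bytes of the document, the document or communication itself (optional), context information, and keyword information. A minimal sketch of such an entry as a Python dataclass follows. The class name, field names, default of 64 bytes, and sample values are hypothetical; the decision to keep the full text is left to the caller, since it depends on the factors listed above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UnstructuredEntry:
    """One entry in the unstructured tier (hypothetical layout)."""
    first_bytes: bytes                             # the first n bytes of the document
    context: dict = field(default_factory=dict)    # context information
    keywords: list = field(default_factory=list)   # keyword information
    full_text: Optional[str] = None                # the document itself (optional)


def make_entry(text: str, context: dict, keywords: list,
               n: int = 64, keep_full_text: bool = False) -> UnstructuredEntry:
    """Build an entry, storing the full document only when asked to."""
    return UnstructuredEntry(
        first_bytes=text.encode("utf-8")[:n],
        context=dict(context),
        keywords=list(keywords),
        full_text=text if keep_full_text else None,
    )


# Hypothetical document and context.
entry = make_entry(
    "Quarterly fire-safety inspection report for the warehouse.",
    context={"source": "email", "received": "2024-01-15"},
    keywords=["fire", "inspection"],
    n=9,
)
print(entry.first_bytes)  # b'Quarterly'
print(entry.full_text)    # None
```

Storing only the leading bytes, context, and keywords keeps the entry small while still allowing the original document to be located when it is needed.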
A Two-Tiered Data Warehouse
Visualizing Unstructured Data
 ◦ Unstructured visualization is the counterpart to structured visualization.
 ◦ Structured visualization is known as Business Intelligence.
 ◦ The essence of structured visualization is the display of numbers.
A Two-Tiered Data Warehouse
A Self-Organizing Map (SOM)
 ◦ Produces a display that resembles a topographical map
 ◦ Shows how words and documents cluster, displayed according to themes
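To give a feel for how a SOM groups documents by theme, here is a deliberately tiny sketch in pure Python. It is an assumption-laden illustration rather than the tool described in the slides: the map is one-dimensional with four nodes instead of a two-dimensional topographic display, the vocabulary and word counts are hypothetical, and the nodes are seeded from the data rather than randomly so that the run is repeatable.

```python
import math

# Hypothetical vocabulary; each document is a vector of word counts over it.
VOCABULARY = ["fire", "smoke", "invoice", "payment"]


def unit(vector):
    """Scale a vector to unit length so document size does not dominate."""
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x."""
    def sq_dist(i):
        return sum((w - a) ** 2 for w, a in zip(weights[i], x))
    return min(range(len(weights)), key=sq_dist)


def train_som(data, n_nodes=4, epochs=50, lr0=0.5, sigma0=0.5):
    """Train a 1-D SOM; learning rate and neighborhood shrink over time."""
    # Deterministic start: seed the nodes from evenly spaced samples.
    weights = [list(data[(i * len(data)) // n_nodes]) for i in range(n_nodes)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay
        sigma = max(sigma0 * decay, 0.1)
        for x in data:
            bmu = best_matching_unit(weights, x)
            for i, w in enumerate(weights):
                # Nodes near the winner on the map are pulled along with it.
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Two documents about fire, two about accounting (hypothetical counts).
counts = [[3, 2, 0, 0], [4, 1, 0, 0], [0, 0, 3, 2], [0, 0, 2, 4]]
docs = [unit(c) for c in counts]
som = train_som(docs)
placement = [best_matching_unit(som, d) for d in docs]
print(placement)  # map position of each document
```

Documents with similar vocabulary land on the same or neighboring nodes, so the two fire documents sit at one end of the map and the two accounting documents at the other. A two-dimensional grid of nodes, rendered with density shading, gives the topographic appearance mentioned above.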
A Two-Tiered Data Warehouse
The Unstructured Data Warehouse
 ◦ Is divided into two basic organizations: one part for documents and another part for communications
A Two-Tiered Data Warehouse
Volumes of Data and the Unstructured Data Warehouse
 ◦ Volumes of data are an issue
 ◦ The volumes of data that can collect in the unstructured data warehouse need to be mitigated
Fitting the Two Environments Together
 ◦ The unstructured environment may contain data that is incompatible with data from the structured environment
 ◦ However, there are ways that the two environments can be related
Fitting the Two Environments Together (cont’d)
Summary
The world of information technology is divided into two worlds: structured data and unstructured data.
The common bond between the two worlds is text.
The structured environment and the unstructured environment can be matched at:
 ◦ the identifier level
 ◦ the close identifier level, using a probabilistic match
 ◦ the keyword-to-metadata or repository level


Editor's Notes

  1. Integrating the two environments is like matching different formats of electricity: alternating current (AC) and direct current (DC). The unstructured world operates on AC and the structured world operates on DC. Problems in integrating by text:
     ◦ Misspelling: What if two words are found in the two environments, Chernobyl and Chernobile? Should a match be made between them? Do they refer to the same thing or to something different?
     ◦ Context: The term “bill” is found in both worlds. Should the occurrences be matched? In one case the reference is to a bird’s beak; in the other, to how much money a person is owed.
     ◦ Same name: The same name, “Bob Smith,” appears in both worlds. Do the two refer to the same person, or to entirely different people who happen to have matching names?
     ◦ Nicknames: In one world there appears the name “Bill Inmon.” In the other there appears the name “William Inmon.” Should a match be made? Do they refer to the same person?
     ◦ Diminutives: Is 1245 Sharps Ct the same as 1245 Sharps Court? Is NY, NY, the same as New York, New York?
     ◦ Incomplete names: Is Mrs. Inmon the same as Lynn Inmon?
     ◦ Word stems: Should the word “moving” be connected and matched with the word “moved”?
  2. A stop word is a word that occurs so frequently as to be meaningless to the document. Typical stop words include the following: a, an, the, for, to, by, from, when, which… The second basic edit that must be done is the reduction of words back to their stem. For example, the following words all have the same grammatical stem, “move”: moving, moved, moves, mover, removing.
  3. In a probabilistic match, as much data as possible that might identify the “Bob Smith” you are looking for is gathered and used as the basis for a match against similar data found where other “Bob Smiths” are located. The data that intersects is then used to determine whether a match on the name is valid.
  5. The accounting theme would contain words and phrases such as the following: receivable, payable, cash on hand, asset, debit, due date, account… The finance theme would contain such information as the following: price, margin, discount, gross sale, net sale, interest rate, carrying loan, balance due. There can be many industrially recognized themes for word collections. Some of the word themes might be the following: sales, marketing, finance, human resources, engineering, accounting, distribution…
  6. In an organization by “natural” themes, the unstructured data is collected on a document-by-document basis. Once the data is collected, the words and phrases inside each document are ranked by number of occurrences. A theme for the document is then formed from the highest-ranked words and phrases.
  7. In a raw match of data, if a word found anywhere in the structured environment is also part of the theme of a document, the unstructured document is linked to the structured record. Such a match is not very meaningful and may actually be misleading.
  8. In Figure 11-11, data in the unstructured environment includes such people as Bill Jones, Mary Adams, Wayne Folmer, and Susan Young. All of these people exist in records of data that have a data element called “Name.” Put another way, data exists at two levels in the structured environment—the abstract level and the actual occurrence level. Figure 11-12 shows this relationship of data. In Figure 11-12, data exists at an abstract level—the metadata level. In addition, data exists at the occurrence level—where the actual occurrences of data reside.
  9. The data found in the unstructured data warehouse is in many ways similar to the data found in the structured data warehouse. Consider the following when looking at data in the unstructured environment:
     ◦ It exists at a low level of granularity.
     ◦ It has an element of time attached to the data.
     ◦ It is typically organized by subject area or “theme.”
  10. The data that can be stored in each section includes the following:
     ◦ The first n bytes of the document
     ◦ The document itself (optional)
     ◦ The communication itself (optional)
     ◦ Context information
     ◦ Keyword information
  11. An identifier is an occurrence of data that serves to specifically identify a record. Close identifiers are identifiers where there is a good probability that a solid identification has been made.