Data warehousing concepts

Published in: Technology

  1. 1. Data Warehousing Concepts and Design
  2. 2. Introduction & Ground Rules
  3. 3. Objectives <ul><li>Data Warehousing Concepts </li></ul><ul><li>What is Business Intelligence (BI)? </li></ul><ul><li>Evolution of BI </li></ul><ul><li>Characteristics of an OLTP system </li></ul><ul><li>Why OLTP is not suitable for complex analysis? </li></ul><ul><li>Characteristics of a Data Warehouse </li></ul><ul><li>Define DWH and its properties – </li></ul><ul><li>Subject Oriented, Integrated, Time variant, Non-Volatile </li></ul><ul><li>Define Grain/Granularity </li></ul><ul><li>Differentiate between OLTP and Data Warehouse </li></ul><ul><li>User expectations and User community </li></ul><ul><li>Enterprise Data Warehouse </li></ul><ul><li>Data Warehouse versus Data marts </li></ul><ul><li>Dependent Data marts </li></ul><ul><li>Independent Data marts </li></ul><ul><li>Data Warehouse components – </li></ul><ul><li>Source systems, Staging area, Presentation area, Access tools </li></ul>
  4. 4. Objectives <ul><li>Data Warehousing Concepts </li></ul><ul><li>Goals of a Data Warehouse </li></ul><ul><li>Data Warehouse development approaches - </li></ul><ul><li>Top-down, Bottom-up, Hybrid, Federated </li></ul><ul><li>Incremental approach to warehouse development </li></ul><ul><li>Dimensional Modeling </li></ul><ul><li>Star Schema – Fact and Dimension tables </li></ul><ul><li>Dimensions and Measure objects </li></ul><ul><li>Snowflake Schema </li></ul><ul><li>Types of Fact tables </li></ul><ul><li>Factless Fact table </li></ul><ul><li>OLAP storage modes – MOLAP, ROLAP, HOLAP, DOLAP </li></ul><ul><li>Slowly and Rapidly changing Dimensions - Type I, II, III </li></ul><ul><li>Degenerate Dimension </li></ul><ul><li>Junk Dimension </li></ul><ul><li>CASE-STUDIES </li></ul>
  5. 5. What is Business Intelligence (BI)? <ul><ul><li>“Business Intelligence (BI) is the process of transforming data into information, information into knowledge and through iterative discoveries turning knowledge into intelligence.” </li></ul></ul><ul><ul><ul><ul><li>Gartner Group </li></ul></ul></ul></ul>
  6. 6. Objective of Business Intelligence <ul><li>BI can be defined as taking ‘Decisions based on Data’. The objective of BI is to transform large volumes of data into useful information. </li></ul><ul><li>[Figure: Data, Information, Knowledge, Intelligence; axes: Value, Volume] </li></ul>
  7. 7. Evolution of BI <ul><ul><li>Executive information systems (EIS) </li></ul></ul><ul><ul><li>Management Information System (MIS) </li></ul></ul><ul><ul><li>Decision Support Systems (DSS) </li></ul></ul><ul><ul><li>Business Intelligence (BI) </li></ul></ul>EIS MIS DSS BI
  8. 8. Information <ul><li>Information in an organization can exist in two different types of systems: </li></ul><ul><ul><li>Online Transaction Processing (OLTP) systems </li></ul></ul><ul><ul><li>(Operational Systems) </li></ul></ul><ul><ul><li>Data Warehouse (DWH) systems </li></ul></ul><ul><li>OLTP and DWH systems have different purposes, business needs and users. </li></ul>
  9. 9. Features of OLTP Systems <ul><li>OLTP systems handle day-to-day transactions and operations of the business. They are high-performance, high-throughput systems that run mission-critical applications. </li></ul><ul><li>OLTP systems store, update and retrieve operational data. </li></ul><ul><li>Operational data is the data that runs the business. </li></ul><ul><li>Examples of operational systems that we interact with include net banking systems, tax accounting systems, payroll packages, order-processing systems, SAP, and airline reservation systems. </li></ul>
  10. 10. Why OLTP systems are not suitable for analysis? <ul><li>Comparison (OLTP | Analytical Reporting): </li></ul><ul><ul><li>Supports day-to-day operations | Historical information to analyze </li></ul></ul><ul><ul><li>Data stored at transaction level | Data required at summary level </li></ul></ul><ul><ul><li>Islands of operational systems | Data needs to be integrated </li></ul></ul><ul><ul><li>Database design: Normalized | Database design: Dimensional </li></ul></ul>
  11. 11. OLTP Versus Data Warehouse <ul><li>Comparison (Property | OLTP | Data Warehouse): </li></ul><ul><ul><li>Response Time | Sub-seconds to seconds | Seconds to hours </li></ul></ul><ul><ul><li>Operations | DML; data goes in | Primarily read-only; data goes out </li></ul></ul><ul><ul><li>Age of Data | 30 to 60 days or 1 to 2 years; current | Snapshots over time (quarter, month, etc.); historical </li></ul></ul><ul><ul><li>Data Organization | Application | Subject, time </li></ul></ul><ul><ul><li>Size | Small to large, few MB to GB | Large to very large, few GB to TB </li></ul></ul>
  12. 12. OLTP Versus Data Warehouse <ul><li>Comparison (Property | OLTP | Data Warehouse): </li></ul><ul><ul><li>Data Sources | Operational, internal | Operational, internal, external </li></ul></ul><ul><ul><li>Activities | Processes | Analysis </li></ul></ul><ul><ul><li>Grain | Atomic (detail), transactional level, highest granularity | Atomic and/or summarized (aggregate), less granularity </li></ul></ul><ul><ul><li>No. of records | One record at a time | Thousands to millions of records </li></ul></ul><ul><ul><li>Database Design | Normalized | De-normalized, star schema </li></ul></ul>
  13. 13. OLTP Versus Data Warehouse
  14. 14. Data Extract Processing <ul><ul><li>A logical progression towards a data warehouse – Data Extracts </li></ul></ul><ul><ul><li>End user computing offloaded from the operational environment </li></ul></ul><ul><ul><li>User’s own data </li></ul></ul>Decision makers Operational systems Extracts
  15. 15. Issues with Data Extract Programs Extracts Operational systems Decision makers Extract Explosion
  16. 16. Data Quality Issues with Extract Processing <ul><ul><li>No common time basis </li></ul></ul><ul><ul><li>Different calculation algorithms </li></ul></ul><ul><ul><li>Different levels of extraction </li></ul></ul><ul><ul><li>Different levels of granularity </li></ul></ul><ul><ul><li>Different data field names </li></ul></ul><ul><ul><li>Different data field meanings </li></ul></ul><ul><ul><li>Missing information </li></ul></ul><ul><ul><li>No data correction rules </li></ul></ul><ul><ul><li>No Metadata </li></ul></ul><ul><ul><li>No drill-down capability </li></ul></ul>
  17. 17. Data Warehousing and Business Intelligence
  18. 18. Advances Enabling Data Warehousing <ul><ul><li>Technology </li></ul></ul><ul><ul><li>Hardware </li></ul></ul><ul><ul><li>Operating system </li></ul></ul><ul><ul><li>Database </li></ul></ul><ul><ul><li>BI Tools & Applications </li></ul></ul><ul><ul><li>Business </li></ul></ul><ul><ul><li>Competition </li></ul></ul>
  19. 19. Definition of a Data Warehouse <ul><li>“A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data to support management decisions.” </li></ul><ul><li>Bill Inmon </li></ul>
  20. 20. Data Warehouse Properties Integrated Time-variant Nonvolatile Subject- oriented Data Warehouse
  21. 21. Subject-Oriented <ul><li>Data is categorized and stored by business subject rather than by application. </li></ul>OLTP Applications Equity Plans Shares Insurance Loans Savings Data Warehouse Subject Customer financial information
  22. 22. Integrated <ul><li>Data on a given subject is collected from various sources and stored once. </li></ul>Data Warehouse OLTP Applications Customer Savings Current Accounts Loans
  23. 23. Time-Variant <ul><li>Data is stored as a series of snapshots, each representing a period of time. </li></ul>Data Warehouse
  24. 24. Non-volatile <ul><li>Typically data in the data warehouse is not updated or deleted. </li></ul>Warehouse Read Load Operational Insert, Update, Delete, or Read
  25. 25. Changing Warehouse Data Operational Databases Warehouse Database First time load Refresh Refresh Refresh Purge or Archive
  26. 26. Goals of a Data Warehouse <ul><li>The Data Warehouse must assist in decision making process </li></ul><ul><li>The Data Warehouse must meet the requirements of the business community </li></ul><ul><li>The Data Warehouse must provide easy access to information </li></ul><ul><li>The Data Warehouse must present information consistently and accurately </li></ul><ul><li>The Data Warehouse must be adaptive and resilient to change </li></ul><ul><li>The Data Warehouse must provide a secured access to information </li></ul>
  27. 27. Usage Curves <ul><ul><li>Operational system is predictable </li></ul></ul><ul><ul><li>Data warehouse: </li></ul></ul><ul><ul><ul><li>Variable </li></ul></ul></ul><ul><ul><ul><li>Random </li></ul></ul></ul>
  28. 28. User Expectations <ul><ul><li>Control expectations </li></ul></ul><ul><ul><li>Set achievable targets for query response </li></ul></ul><ul><ul><li>Set SLAs </li></ul></ul><ul><ul><li>Educate business and end users </li></ul></ul><ul><ul><li>Growth and use is exponential </li></ul></ul>
  29. 29. Enterprisewide Data Warehouse <ul><li>Large scale implementation </li></ul><ul><li>Scopes the entire business </li></ul><ul><li>Data from all subject areas </li></ul><ul><li>Developed incrementally </li></ul><ul><li>Single source of enterprisewide data </li></ul><ul><li>Synchronized enterprisewide data </li></ul><ul><li>Single distribution point to dependent data marts </li></ul>
  30. 30. Data Warehouse Vocabulary <ul><li>Grain of Data - Granularity </li></ul><ul><li>Grain is the level of detail of the data captured in the data warehouse. The more detail, the higher the granularity, and vice versa. </li></ul><ul><li>Fact table </li></ul><ul><li>It is similar to the transaction table in an OLTP system. </li></ul><ul><li>It stores the facts or measures of the business. </li></ul><ul><li>E.g. SALES, ORDERS </li></ul><ul><li>Dimension table </li></ul><ul><li>It is similar to the master table in an OLTP system. </li></ul><ul><li>It stores the textual descriptors of the business. </li></ul><ul><li>E.g. CUSTOMER, PRODUCT </li></ul>
  31. 31. Data Marts <ul><li>A data mart is a subset of a data warehouse. </li></ul><ul><li>A data mart is designed for a single line of business (LOB) or functional area such as sales, finance, or marketing. </li></ul>
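  32. 32. Data Warehouses Versus Data Marts <ul><li>Comparison (Property | Data Warehouse | Data Mart): </li></ul><ul><ul><li>Scope | Enterprise | Department </li></ul></ul><ul><ul><li>Subjects | Multiple | Single-subject, LOB </li></ul></ul><ul><ul><li>Data Source | Many | Few </li></ul></ul><ul><ul><li>Implementation time | Months to years | Months </li></ul></ul><ul><ul><li>Size | 100 GB to > 1 TB | < 100 GB </li></ul></ul><ul><ul><li>Initial effort, cost, risk | Higher | Lower </li></ul></ul><ul><ul><li>Next level of migration | Data Mart | Data Warehouse </li></ul></ul><ul><ul><li>Approach | Top-down | Bottom-up </li></ul></ul>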
  33. 33. Data Warehouses Versus Data Marts
  34. 34. Dependent Data Mart Data Warehouse Data Marts Marketing Sales Finance HR Flat Files Marketing Sales Finance Operational Systems External Data Operations Data Legacy Data External Data
  35. 35. Independent Data Mart Sales or Marketing Flat Files Operational Systems External Data Operations Data Legacy Data External Data
  36. 36. Warehouse Development Approaches <ul><li>Top-down approach </li></ul><ul><li>(Big-Bang) </li></ul><ul><li>Bottom-up approach </li></ul><ul><li>Hybrid approach </li></ul><ul><li>(Combination) </li></ul><ul><li>Federated approach </li></ul>
  37. 37. Top-Down Approach Build the Data Warehouse Build the Data Marts
  38. 38. Top-Down Approach Data Warehouse Data Marts Marketing Sales Finance HR Flat Files Marketing Sales Finance Operational Systems External Data Operations Data Legacy Data External Data
  39. 39. Bottom-Up Approach Build Data Marts Build the Data Warehouse
  40. 40. Bottom-Up Approach Data Warehouse Data Marts Marketing Sales Finance Operational Systems External Data Operations Data Legacy Data
  41. 41. Hybrid Approach <ul><li>The hybrid approach tries to blend the best of both the “top-down” and “bottom-up” approaches: </li></ul><ul><ul><li>Start by designing the DW and DM models synchronously </li></ul></ul><ul><ul><li>Build out the first 2-3 DMs that are mutually exclusive and critical </li></ul></ul><ul><ul><li>Backfill a DW behind the DMs </li></ul></ul><ul><ul><li>Build the enterprise model and move atomic data to the DW </li></ul></ul>
  42. 42. Federated Approach This approach is referred to as “an architecture of architectures”. Emphasizes the need to integrate new and existing heterogeneous BI environments.
  43. 43. Data Warehouse Components Source Systems Staging Area Presentation Area Access Tools Operational External Legacy Metadata Repository Data Marts Data Warehouse ODS
  44. 44. Source Systems Staging Area Presentation Area Access Tools Operational External Legacy Metadata Repository Data Marts Data Warehouse Data Warehouse Components ODS
  45. 45. Examining Data Sources <ul><ul><li>Production </li></ul></ul><ul><ul><li>Archive </li></ul></ul><ul><ul><li>Internal </li></ul></ul><ul><ul><li>External </li></ul></ul>
  46. 46. Production Data <ul><li>Operating system platforms </li></ul><ul><li>File systems </li></ul><ul><li>Database systems </li></ul><ul><li>Vertical applications </li></ul>IMS DB2 Oracle Sybase Informix VSAM SAP Dun and Bradstreet Financials Oracle Financials Baan PeopleSoft
  47. 47. Archive Data <ul><ul><li>Historical data </li></ul></ul><ul><ul><li>Useful for analysis over long periods of time </li></ul></ul><ul><ul><li>Useful for first-time load </li></ul></ul>Operation databases Warehouse database
  48. 48. Internal Data <ul><ul><li>Planning, sales, and marketing organization data </li></ul></ul><ul><ul><li>Maintained in the form of: </li></ul></ul><ul><ul><ul><li>Spreadsheets (structured) </li></ul></ul></ul><ul><ul><ul><li>Documents (unstructured) </li></ul></ul></ul><ul><ul><li>Treated like any other source data </li></ul></ul>Warehouse database Planning Accounting Marketing
  49. 49. External Data <ul><ul><li>Information from outside the organization </li></ul></ul><ul><ul><li>Issues of frequency, format, and predictability </li></ul></ul><ul><ul><li>Described and tracked using metadata </li></ul></ul>A.C. Nielsen, IRI, IMRB, ORG-MARG Barron's Dun and Bradstreet Purchased databases Wall Street Journal Economic forecasts Competitive information Warehousing databases
  50. 50. Extraction, Transformation and Loading (ETL)
  51. 51. Extraction, Transformation and Loading (ETL) <ul><li>“ Effective data extract, transform and load (ETL) processes represent the number one success factor for your data warehouse project and can absorb up to 70 percent of the time spent on a typical data warehousing project.” </li></ul><ul><ul><li>DM Review, March 2001 </li></ul></ul>Source Target Staging Area
  52. 52. Staging Models <ul><li>Remote staging model </li></ul><ul><li>Onsite staging model </li></ul>
  53. 53. Remote Staging Model Load Load Data staging area within the warehouse environment Data staging area in its own independent environment Extract Extract Transform Staging area Transform Staging area Warehouse Warehouse Operational system Operational system
  54. 54. On-site Staging Model <ul><li>Data staging area within the operational environment, possibly affecting the operational system </li></ul>Extract Load Warehouse Operational system Transform Staging area
  55. 55. Extraction Methods <ul><ul><li>Logical Extraction methods: </li></ul></ul><ul><ul><ul><li>Full Extraction </li></ul></ul></ul><ul><ul><ul><li>Incremental Extraction </li></ul></ul></ul>
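The difference between full and incremental extraction can be sketched as follows. This is a minimal illustration, not the method of any particular ETL tool; the `orders` table and its `updated_at` change-tracking column are hypothetical, and an in-memory SQLite database stands in for the operational system.

```python
import sqlite3

# Hypothetical source table; `updated_at` is an assumed change-tracking column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-02"), (3, 30.0, "2024-01-03")],
)


def full_extract(conn):
    """Full extraction: read every row, regardless of when it changed."""
    return conn.execute("SELECT order_id FROM orders ORDER BY order_id").fetchall()


def incremental_extract(conn, last_extract):
    """Incremental extraction: read only rows changed after the last run."""
    return conn.execute(
        "SELECT order_id FROM orders WHERE updated_at > ? ORDER BY order_id",
        (last_extract,),
    ).fetchall()


full_rows = full_extract(conn)
delta_rows = incremental_extract(conn, "2024-01-01")
```

Incremental extraction depends on the source being able to tell which rows changed; a timestamp column is only one way to do that.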
  56. 56. Extraction Methods <ul><ul><li>Physical Extraction methods: </li></ul></ul><ul><ul><ul><li>Online Extraction </li></ul></ul></ul><ul><ul><ul><li>Offline Extraction </li></ul></ul></ul>
  57. 57. ETL Techniques <ul><ul><li>Programs: C, C++, COBOL, PL/SQL, Java </li></ul></ul><ul><ul><li>Gateways: Transparent Database Access </li></ul></ul><ul><ul><li>Tools: </li></ul></ul><ul><ul><ul><li>In-house developed tools </li></ul></ul></ul><ul><ul><ul><li>Vendor’s ETL tools (Ideal technique) </li></ul></ul></ul>
  58. 58. Mapping Data <ul><li>Mapping data defines: </li></ul><ul><ul><li>Which operational attributes to use </li></ul></ul><ul><ul><li>How to transform the attributes for the warehouse </li></ul></ul><ul><ul><li>Where the attributes exist in the warehouse </li></ul></ul><ul><li>Example metadata mapping from File A to Staging File One: F1 to Number, F2 to Name, F3 to DOB </li></ul><ul><ul><li>File A: F1 = 123, F2 = Bloggs, F3 = 10/12/56 </li></ul></ul><ul><ul><li>Staging File One: Number = USA123, Name = Mr. Bloggs, DOB = 10-Dec-56 </li></ul></ul>
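The File A example on this slide can be reproduced with a small mapping routine. This is a sketch under stated assumptions: the slide shows only the input and output values, so the `USA` prefix, the `Mr.` title, and reading `10/12/56` as DD/MM/YY are inferred rules, and `to_staging` is a hypothetical helper name.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Mapping metadata: which source field feeds which staging field.
FIELD_MAP = {"F1": "Number", "F2": "Name", "F3": "DOB"}


def to_staging(record, country_prefix="USA", title="Mr."):
    """Transform one File A record into the staging layout."""
    day, month, year = record["F3"].split("/")  # assumes DD/MM/YY input
    transformed = {
        "F1": f"{country_prefix}{record['F1']}",          # assumed prefix rule
        "F2": f"{title} {record['F2']}",                  # assumed title rule
        "F3": f"{day}-{MONTHS[int(month) - 1]}-{year}",   # DD-Mon-YY output
    }
    return {FIELD_MAP[src]: value for src, value in transformed.items()}


staged = to_staging({"F1": "123", "F2": "Bloggs", "F3": "10/12/56"})
```

Keeping the field mapping in data (`FIELD_MAP`) rather than in code mirrors the slide's point that mapping is metadata.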
  59. 59. Transformation Routines <ul><ul><li>Cleaning data </li></ul></ul><ul><ul><li>Eliminating inconsistencies </li></ul></ul><ul><ul><li>Adding elements </li></ul></ul><ul><ul><li>Merging data </li></ul></ul><ul><ul><li>Integrating data </li></ul></ul><ul><ul><li>Transforming data before load </li></ul></ul>
  60. 60. Transforming Data: Problems and Solutions <ul><ul><li>Data Anomalies </li></ul></ul><ul><ul><li>Multipart keys </li></ul></ul><ul><ul><li>Multiple local standards </li></ul></ul><ul><ul><li>Multiple files </li></ul></ul><ul><ul><li>Missing values </li></ul></ul><ul><ul><li>Duplicate values </li></ul></ul><ul><ul><li>Element names </li></ul></ul><ul><ul><li>Element meanings </li></ul></ul><ul><ul><li>Input formats </li></ul></ul><ul><ul><li>Referential Integrity constraints </li></ul></ul><ul><ul><li>Name and address </li></ul></ul>
  61. 61. Data Anomalies <ul><ul><li>No unique key </li></ul></ul><ul><ul><li>Data naming and coding anomalies </li></ul></ul><ul><ul><li>Data meaning anomalies between groups </li></ul></ul><ul><ul><li>Spelling and text inconsistencies </li></ul></ul><ul><li>Sample rows (CUSNUM | NAME | ADDRESS): </li></ul><ul><ul><li>90233479 | Oracle Limited | 100 N.E. 1st St. </li></ul></ul><ul><ul><li>90233489 | Oracle Computing | 15 Main Road, Ft. Lauderdale </li></ul></ul><ul><ul><li>90234889 | Oracle Corp. UK | 15 Main Road, Ft. Lauderdale, FLA </li></ul></ul><ul><ul><li>90345672 | Oracle Corp UK Ltd | 181 North Street, Key West, FLA </li></ul></ul>
  62. 62. Multipart Keys Problem <ul><li>Multipart keys </li></ul>Country code Sales territory Product number Salesperson code Product code = 12 M 65431 3 45
  63. 63. Multiple Local Standards Problem <ul><ul><li>Multiple local standards </li></ul></ul><ul><ul><li>Tools or filters to preprocess </li></ul></ul>cm inches cm USD 600 1,000 GBP FF 9,990 DD/MM/YY MM/DD/YY DD-Mon-YY
  64. 64. Multiple Source Files Problem <ul><ul><li>Added complexity of multiple source files </li></ul></ul>Transformed data Multiple source files Logic to detect correct source
  65. 65. Missing Values Problem <ul><li>Solution: </li></ul><ul><ul><li>Ignore </li></ul></ul><ul><ul><li>Wait </li></ul></ul><ul><ul><li>Mark rows </li></ul></ul><ul><ul><li>Extract when time-stamped </li></ul></ul>If NULL then field = ‘A’ A
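Two of the options listed here, marking rows and substituting a default for NULL, can be combined in a few lines. The sketch below assumes rows are plain dictionaries; `fill_missing` and the `was_missing` flag are hypothetical names, and the default `'A'` follows the slide's "If NULL then field = 'A'" rule.

```python
def fill_missing(rows, field, default="A"):
    """Return copies of rows with None in `field` replaced and the row marked."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy so the extracted data stays untouched
        fixed["was_missing"] = fixed[field] is None
        if fixed["was_missing"]:
            fixed[field] = default
        cleaned.append(fixed)
    return cleaned


rows = [{"id": 1, "grade": "B"}, {"id": 2, "grade": None}]
cleaned = fill_missing(rows, "grade")
```

Keeping the marker alongside the default lets later analysis distinguish a real `'A'` from a substituted one.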
  66. 66. Duplicate Values Problem <ul><li>Solution: </li></ul><ul><ul><li>SQL self-join techniques </li></ul></ul><ul><ul><li>RDBMS constraint utilities </li></ul></ul>Example duplicates: ACME Inc, ACME Inc, ACME Inc. SQL: SELECT ... FROM table_a, table_b WHERE table_a.key (+)= table_b.key UNION SELECT ... FROM table_a, table_b WHERE table_a.key = table_b.key (+);
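One way to apply a self-join to duplicate detection is sketched below. It is not the Oracle outer-join query shown on the slide: it uses a standard inner self-join on an in-memory SQLite database, and the `customer` table with its `cust_id` column is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "ACME Inc"), (2, "ACME Inc"), (3, "ACME Inc"), (4, "Bloggs Ltd")],
)

# Self-join: a row is a duplicate if an earlier row carries the same name.
duplicates = conn.execute(
    """
    SELECT DISTINCT b.cust_id, b.name
    FROM customer a
    JOIN customer b ON a.name = b.name AND a.cust_id < b.cust_id
    ORDER BY b.cust_id
    """
).fetchall()
```

The `a.cust_id < b.cust_id` condition keeps the first occurrence out of the result, so only the redundant rows are reported.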
  67. 67. Element Names Problem <ul><li>Solution: </li></ul><ul><ul><li>Common naming conventions </li></ul></ul>Customer Customer Client Contact Name
  68. 68. Element Meaning Problem <ul><ul><li>Avoid misinterpretation </li></ul></ul><ul><ul><li>Complex solution </li></ul></ul><ul><ul><li>Document meaning in metadata </li></ul></ul>Product number p_no Purchase order number Policy number
  69. 69. Input Format Problem <ul><li>Different character sets or data types </li></ul>Examples: ASCII and EBCDIC; 12373 and “123-73”; ACME Co.; ברוכים הבאים; Beer (Pack of 8)
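The character-set and data-type mismatches above can be illustrated with Python's built-in codecs. This is a sketch only: `cp037` is one common EBCDIC code page chosen as an assumption (real mainframe sources may use another), and `normalize_number` is a hypothetical helper.

```python
text = "ACME Co."

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # cp037: one EBCDIC code page (assumed)

# The same text has different byte values in each character set,
# so the loader must know the source encoding to decode it correctly.
decoded = ebcdic_bytes.decode("cp037")


def normalize_number(raw):
    """Strip quotes, spaces and hyphens so '123-73' compares equal to 12373."""
    return int(raw.strip(' "“”').replace("-", ""))
```

For example, `normalize_number("123-73")` and the integer `12373` then compare equal, which is what a join on that field would need.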
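  70. 70. Referential Integrity Problem <ul><li>Solution: </li></ul><ul><ul><li>SQL anti-join (outer join) </li></ul></ul><ul><ul><li>Server constraints </li></ul></ul><ul><ul><li>Dedicated tools </li></ul></ul><ul><li>Department table (Department): 10, 20, 30, 40 </li></ul><ul><li>Emp table (Emp | Name | Department): </li></ul><ul><ul><li>1099 | Smith | 10 </li></ul></ul><ul><ul><li>1289 | Jones | 20 </li></ul></ul><ul><ul><li>1234 | Doe | 50 </li></ul></ul><ul><ul><li>6786 | Harris | 60 </li></ul></ul>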
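The anti-join approach can be tried against the slide's own sample data. The sketch below uses an in-memory SQLite database; the table and column names (`department`, `emp`, `dept_id`, `emp_id`) are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE emp (emp_id INTEGER, name TEXT, dept_id INTEGER)")
conn.executemany("INSERT INTO department VALUES (?)", [(10,), (20,), (30,), (40,)])
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(1099, "Smith", 10), (1289, "Jones", 20), (1234, "Doe", 50), (6786, "Harris", 60)],
)

# Anti-join: outer-join to the parent table and keep rows with no match.
orphans = conn.execute(
    """
    SELECT e.emp_id, e.name, e.dept_id
    FROM emp e
    LEFT OUTER JOIN department d ON d.dept_id = e.dept_id
    WHERE d.dept_id IS NULL
    ORDER BY e.emp_id
    """
).fetchall()
```

Doe and Harris reference departments 50 and 60, which do not exist in the department table, so they are the rows that would violate referential integrity.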
  71. 71. Name and Address Problem <ul><ul><li>Single-field format: Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565 </li></ul></ul><ul><ul><li>Multiple-field format: Name = Mr. J. Smith, Street = 100 Main St., Town = Bigtown, Country = County Luth, Code = 23565 </li></ul></ul><ul><li>Database 1 (NAME | LOCATION): </li></ul><ul><ul><li>DIANNE ZIEFELD | N100 </li></ul></ul><ul><ul><li>HARRY H. ENFIELD | M300 </li></ul></ul><ul><li>Database 2 (NAME | LOCATION): </li></ul><ul><ul><li>ZIEFELD, DIANNE | 100 </li></ul></ul><ul><ul><li>ENFIELD, HARRY H | 300 </li></ul></ul>
  72. 72. Transformation Timing and Location <ul><ul><li>Transformation is performed: </li></ul></ul><ul><ul><ul><li>Before load </li></ul></ul></ul><ul><ul><ul><li>In parallel while loading </li></ul></ul></ul><ul><ul><li>Can be initiated at different points: </li></ul></ul><ul><ul><ul><li>On the operational platform </li></ul></ul></ul><ul><ul><ul><li>In a separate staging area </li></ul></ul></ul>
  73. 73. Adding a Date Stamp: Fact Tables and Dimensions <ul><li>Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units </li></ul><ul><li>Item Table: Item_id, Dept_id, Time_key </li></ul><ul><li>Store Table: Store_id, District_id, Time_key </li></ul><ul><li>Time Table: Week_id, Period_id, Year_id, Time_key </li></ul><ul><li>Product Table: Product_id, Time_key, Product_desc </li></ul>
  74. 74. Summarizing Data <ul><li>1. During extraction on staging area </li></ul><ul><li>2. After loading to the warehouse server </li></ul>Operational databases Warehouse database Staging area
  75. 75. Loading Data into the Warehouse <ul><ul><li>Loading moves the data into the warehouse </li></ul></ul><ul><ul><li>Loading can be time-consuming: </li></ul></ul><ul><ul><ul><li>Consider the load window </li></ul></ul></ul><ul><ul><ul><li>Schedule and automate the loading </li></ul></ul></ul><ul><ul><li>Initial load moves large volumes of data </li></ul></ul><ul><ul><li>Subsequent refresh moves smaller volumes of data </li></ul></ul>Operational databases Warehouse database Staging area Extract Transform Transport, Load
  76. 76. Load Window Requirements <ul><ul><li>Time available for entire ETL process </li></ul></ul><ul><ul><li>Plan </li></ul></ul><ul><ul><li>Test </li></ul></ul><ul><ul><li>Prove </li></ul></ul><ul><ul><li>Monitor </li></ul></ul>0 3 am 6 9 12 pm 3 6 9 12 User Access Period Load Window Load Window
  77. 77. Planning the Load Window <ul><ul><li>Plan and build processes according to a strategy. </li></ul></ul><ul><ul><li>Consider volumes of data. </li></ul></ul><ul><ul><li>Identify technical infrastructure. </li></ul></ul><ul><ul><li>Ensure currency of data. </li></ul></ul><ul><ul><li>Consider user access requirements first. </li></ul></ul><ul><ul><li>High availability requirements may mean a small load window. </li></ul></ul>0 3 am 6 9 12 pm 3 6 9 12 User Access Period
  78. 78. Initial Load and Refresh <ul><li>Initial Load: </li></ul><ul><ul><li>Single event that populates the database with historical data </li></ul></ul><ul><ul><li>Involves large volumes of data </li></ul></ul><ul><ul><li>Employs distinct ETL tasks </li></ul></ul><ul><ul><li>Involves large amounts of processing after load </li></ul></ul><ul><li>Refresh: </li></ul><ul><ul><li>Performed according to a business cycle </li></ul></ul><ul><ul><li>Less data to load than first-time load </li></ul></ul><ul><ul><li>Complex ETL tasks </li></ul></ul><ul><ul><li>Smaller amounts of post-load processing </li></ul></ul>
  79. 79. Data Refresh Models <ul><li>Extract Processing Environment </li></ul><ul><ul><li>After each time interval, build a new snapshot of the database. </li></ul></ul><ul><ul><li>Purge old snapshots. </li></ul></ul>T1 T2 T3 Operational databases
  80. 80. Data Refresh Models <ul><li>Warehouse Environment </li></ul><ul><ul><li>Build a new database the first time. </li></ul></ul><ul><ul><li>After each time interval, add delta changes to database. </li></ul></ul><ul><ul><li>Archive or purge oldest data. </li></ul></ul>T1 T2 T3 Operational databases
  81. 81. Post-Processing of Loaded Data Post-processing of loaded data Extract Transform Load Warehouse Staging area Create indexes Generate keys Summarize Filter
  82. 82. Unique Indexes <ul><ul><li>Disable constraints before load. </li></ul></ul><ul><ul><li>Enable constraints after load. </li></ul></ul><ul><ul><li>Re-create index if necessary. </li></ul></ul>Load data Disable constraints Enable constraints Create index Reprocess Catch errors
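The load, index, catch-errors, reprocess cycle can be sketched with SQLite, which has no "disable constraint" command, so the sketch simply loads without the unique index and creates it afterwards. The `sales` table, the index name, and the "keep the first row" reprocessing rule are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Load with no unique index in place (the equivalent of a disabled constraint).
conn.execute("CREATE TABLE sales (sale_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (2, 20.0), (2, 20.0)]
)


def build_unique_index(conn):
    """Try to enforce uniqueness after the load; report whether it succeeded."""
    try:
        conn.execute("CREATE UNIQUE INDEX sales_uk ON sales (sale_id)")
        return True
    except sqlite3.DatabaseError:
        return False  # duplicates were loaded: catch the error and reprocess


first_attempt = build_unique_index(conn)

# Reprocess: keep only the first physical row for each sale_id (assumed rule).
conn.execute(
    "DELETE FROM sales WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM sales GROUP BY sale_id)"
)
second_attempt = build_unique_index(conn)
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Building the index once after the load is typically cheaper than maintaining it row by row during the load, which is the motivation for the sequence on the slide.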
  83. 83. Creating Derived Keys <ul><li>The use of a derived key (sometimes referred to as a generalized, artificial, synthetic, surrogate, or warehouse key) is recommended to maintain the uniqueness of a row. </li></ul><ul><li>Method </li></ul><ul><ul><li>Concatenate key </li></ul></ul><ul><ul><li>Assign a number sequentially from a list </li></ul></ul>109908 01 109908 109908 100
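Both methods can be sketched in a few lines. The reading of the slide's figures is an assumption: `109908` is taken to be an operational key, `01` a suffix that is concatenated to it, and `100` the first sequentially assigned number. `concatenated_key` and `SurrogateKeyGenerator` are hypothetical names.

```python
from itertools import count


def concatenated_key(source_key, suffix):
    """Derive a key by concatenating the operational key with a suffix."""
    return f"{source_key}{suffix}"


class SurrogateKeyGenerator:
    """Assign sequential numbers, reusing the number for a key already seen."""

    def __init__(self, start=100):
        self._next = count(start)
        self._assigned = {}

    def key_for(self, natural_key):
        if natural_key not in self._assigned:
            self._assigned[natural_key] = next(self._next)
        return self._assigned[natural_key]


generator = SurrogateKeyGenerator()
first = generator.key_for("109908")
repeat = generator.key_for("109908")
other = generator.key_for("109909")
```

A sequential key carries no business meaning, so it stays stable even if the operational key format changes.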
  84. 85. Metadata Users Metadata repository End users Developers IT Professionals
  85. 86. Metadata Documentation Approaches <ul><ul><li>Automated </li></ul></ul><ul><ul><ul><li>Data modeling tools </li></ul></ul></ul><ul><ul><ul><li>ETL tools </li></ul></ul></ul><ul><ul><li>Manual </li></ul></ul>
  86. 87. Data Warehouse Design <ul><li>Dimensional Modeling </li></ul><ul><li>Identify the ‘ Business Process ’ </li></ul><ul><li>Determine the ‘ Grain ’ </li></ul><ul><li>Identify the ‘ Facts ’ </li></ul><ul><li>Identify the ‘ Dimensions ’ </li></ul>
  87. 88. Business Requirements Drive the Design Process <ul><ul><li>Primary input </li></ul></ul><ul><ul><li>Secondary input </li></ul></ul>Existing Metadata Production ERD Model Business Requirements Research
  88. 89. Perform Strategic Analysis <ul><ul><li>Identify crucial business processes </li></ul></ul><ul><ul><li>Understand business processes </li></ul></ul><ul><ul><li>Prioritize and select the business processes to implement </li></ul></ul>Business Benefit Low High Low High Feasibility
  89. 90. Using a Business Process Matrix DW Bus Architecture Promotion Channel Product Date Inventory Customer Returns Sales Business Processes Business Dimensions
  90. 91. Conformed Dimensions <ul><li>Dimensions are conformed when they are exactly the same including the keys or one is a perfect subset of the other. </li></ul><ul><li>DW bus architecture provides a standard set of conformed dimensions </li></ul>
  91. 92. Determine the Grain YEAR? QUARTER? MONTH? WEEK? DAY?
  92. 93. Documenting the Granularity <ul><li>Is an important design consideration </li></ul><ul><li>Determines the level of detail </li></ul><ul><li>Is determined by business needs </li></ul>Low-level grain (Transaction-level data) High-level grain (Summary data)
  93. 94. Defining Time Granularity Fiscal Time Hierarchy Current dimension grain Future dimension grain Fiscal Year Fiscal Quarter Fiscal Month Fiscal Week Day
  94. 95. Identify the Facts and Dimensions <ul><li>The attribute is perceived as constant or discrete: </li></ul><ul><ul><li>Product </li></ul></ul><ul><ul><li>Location </li></ul></ul><ul><ul><li>Time </li></ul></ul><ul><ul><li>Size </li></ul></ul><ul><li>The attribute varies continuously: </li></ul><ul><ul><li>Balance </li></ul></ul><ul><ul><li>Units Sold </li></ul></ul><ul><ul><li>Cost </li></ul></ul><ul><ul><li>Sales </li></ul></ul>Facts (Measures) Dimensions
  95. 96. Data Warehouse Environment Data Structures <ul><li>The data structures that are commonly found in a data warehouse environment: </li></ul><ul><ul><li>Third normal form (3NF) </li></ul></ul><ul><ul><li>Star schema </li></ul></ul><ul><ul><li>Snowflake schema </li></ul></ul>
  96. 97. Star Schema Customer Location Sales Supplier Product
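  97. 98. Star Schema Model <ul><li>Central fact table: Sales Fact Table (Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...) </li></ul><ul><li>Denormalized dimensions: </li></ul><ul><ul><li>Product Table (Product_id, Product_desc, ...) </li></ul></ul><ul><ul><li>Time Table (Day_id, Month_id, Year_id, ...) </li></ul></ul><ul><ul><li>Item Table (Item_id, Item_desc, ...) </li></ul></ul><ul><ul><li>Store Table (Store_id, District_id, ...) </li></ul></ul>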
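A cut-down version of this star schema can be built and queried in SQLite. The sketch keeps only the store and time dimensions for brevity; the table names `store_dim`, `time_dim`, and `sales_fact` and all sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, district_id INTEGER);
    CREATE TABLE time_dim (day_id INTEGER PRIMARY KEY, month_id INTEGER, year_id INTEGER);
    -- Fact table: foreign keys to each dimension plus numeric measures.
    CREATE TABLE sales_fact (
        store_id INTEGER REFERENCES store_dim (store_id),
        day_id INTEGER REFERENCES time_dim (day_id),
        sales_amount REAL,
        sales_units INTEGER
    );
    INSERT INTO store_dim VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO time_dim VALUES (20240101, 1, 2024), (20240201, 2, 2024);
    INSERT INTO sales_fact VALUES
        (1, 20240101, 100.0, 5), (2, 20240101, 50.0, 2),
        (3, 20240101, 70.0, 3), (1, 20240201, 30.0, 1);
    """
)

# One join per dimension, then aggregate the measure at the requested level.
by_district_month = conn.execute(
    """
    SELECT s.district_id, t.month_id, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN store_dim s ON s.store_id = f.store_id
    JOIN time_dim t ON t.day_id = f.day_id
    GROUP BY s.district_id, t.month_id
    ORDER BY s.district_id, t.month_id
    """
).fetchall()
```

Because each dimension is a single denormalized table, every query needs at most one join per dimension.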
  98. 99. Fact Table Characteristics <ul><ul><li>Contain numerical metrics of the business </li></ul></ul><ul><ul><li>Can hold large volumes of data </li></ul></ul><ul><ul><li>Can grow quickly </li></ul></ul><ul><ul><li>Can contain base, derived, and summarized data </li></ul></ul><ul><ul><li>Are typically additive </li></ul></ul><ul><ul><li>Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables </li></ul></ul><ul><li>Example: Sales Fact Table (Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...) </li></ul>
  99. 100. Dimension Table Characteristics <ul><ul><li>Contain descriptors of the business: textual information that represents the attributes of the business </li></ul></ul><ul><ul><li>Contain relatively static data </li></ul></ul><ul><ul><li>Are usually smaller than fact tables </li></ul></ul><ul><ul><li>Are joined to a fact table through a foreign key reference </li></ul></ul><ul><li>Example: Item Table (Item_id, Item_desc, ...) </li></ul>
  100. 101. Advantages of Using a Star Dimensional Model <ul><ul><li>Design improves performance by reducing table joins. </li></ul></ul><ul><ul><li>The model is easy for users to understand. </li></ul></ul><ul><ul><li>Supports multidimensional analysis. </li></ul></ul><ul><ul><li>Provides an extensible design </li></ul></ul><ul><ul><li>Primary keys represent a dimension. </li></ul></ul><ul><ul><li>Non-foreign key columns are values. </li></ul></ul><ul><ul><li>Facts are usually highly normalized. </li></ul></ul><ul><ul><li>Dimensions are completely de-normalized. </li></ul></ul><ul><ul><li>End users can express complex queries. </li></ul></ul>
  101. 102. Base and Derived Data. Payroll table (Salary and Comm are base data; Comp is derived data):
  Emp_FK | Month_FK | Salary | Comm | Comp
  101 | 05 | 1,000 | 0 | 1,000
  102 | 05 | 1,500 | 100 | 1,600
  103 | 05 | 1,000 | 200 | 1,200
  104 | 05 | 1,500 | 1,000 | 2,500
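The payroll rows on this slide can be reproduced with a short sketch. It assumes that Comp is simply Salary plus Comm, which matches all four rows shown but is not stated on the slide; the dictionary keys and the helper name `with_comp` are illustrative.

```python
# Base data as shown on the slide: employee key, month key, salary, commission.
payroll = [
    {"emp_fk": 101, "month_fk": 5, "salary": 1000, "comm": 0},
    {"emp_fk": 102, "month_fk": 5, "salary": 1500, "comm": 100},
    {"emp_fk": 103, "month_fk": 5, "salary": 1000, "comm": 200},
    {"emp_fk": 104, "month_fk": 5, "salary": 1500, "comm": 1000},
]


def with_comp(row):
    """Return a copy of the row with the derived measure added.

    Assumption: comp = salary + comm.
    """
    return {**row, "comp": row["salary"] + row["comm"]}


payroll_with_comp = [with_comp(row) for row in payroll]
comp_values = [row["comp"] for row in payroll_with_comp]
```

Storing the derived column in the fact table trades load-time work and storage for simpler, faster queries.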
  102. 103. Translating Business Measures into a Fact Table
  Business measure | Fact | Type
  Number of Items | Number of Items | Base
  Amount | Item Amount | Base
  Cost | Item Cost | Base
  Profit | Profit | Derived
  103. 104. Snowflake Schema Model: Sales Fact Table (Item_id, Store_id, Product_id, Week_id, Sales_amount, Sales_units); Time Table (Week_id, Period_id, Year_id); Product Table (Product_id, Product_desc); Item Table (Item_id, Item_desc, Dept_id); Dept Table (Dept_id, Dept_desc, Mgr_id); Mgr Table (Dept_id, Mgr_id, Mgr_name); Store Table (Store_id, Store_desc, District_id); District Table (District_id, District_desc)
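In a snowflake schema a dimension attribute may sit one or more joins away from the fact table. The sketch below, again using `sqlite3`, walks the Item to Dept path from the slide; the sample rows are hypothetical and the other tables are left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dept_desc TEXT);
-- The item dimension is normalized: it points to dept instead of storing dept_desc.
CREATE TABLE item (
    item_id   INTEGER PRIMARY KEY,
    item_desc TEXT,
    dept_id   INTEGER REFERENCES dept(dept_id)
);
CREATE TABLE sales_fact (
    item_id      INTEGER REFERENCES item(item_id),
    week_id      INTEGER,
    sales_amount REAL
);

-- Hypothetical sample data.
INSERT INTO dept VALUES (1, 'Grocery'), (2, 'Hardware');
INSERT INTO item VALUES (100, 'Milk', 1), (101, 'Bread', 1), (200, 'Hammer', 2);
INSERT INTO sales_fact VALUES (100, 1, 4.0), (101, 1, 3.0), (200, 1, 15.0), (100, 2, 6.0);
""")

# Sales by department needs two joins (fact -> item -> dept) instead of one.
sales_by_dept = conn.execute("""
    SELECT d.dept_desc, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN item i ON i.item_id = f.item_id
    JOIN dept d ON d.dept_id = i.dept_id
    GROUP BY d.dept_desc
    ORDER BY d.dept_desc
""").fetchall()
```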
  104. 105. Snowflake Model: Order (History_FK, Customer_FK, Product_FK, Channel_FK, Item_nbr, Item_desc, Quantity, Discnt_price, Unit-price, Order_amt, …); History (History_PK, . . . .); Customer (Customer_PK, . . . .); Product (Product_PK, . . . .); Channel (Channel_PK, Web_PK, Channel_desc); Web (Web_PK, Web_url)
  105. 106. Snowflake Schema Model <ul><li>Provides for speedier data loading</li><li>Can become large and unmanageable</li><li>Degrades query performance</li><li>More complex metadata</li><li>Facts are usually highly normalized</li><li>Dimensions are also normalized</li></ul>Example of a normalized dimension chain: Country, State, County, City
  106. 107. Constellation Configuration Atomic fact
  107. 108. Fact Table Measures <ul><li>Additive: can be added across all dimensions</li><li>Semiadditive: can be added along some dimensions but not others</li><li>Nonadditive: cannot be added along any dimension</li></ul>
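The three kinds of measure can be illustrated with a small sketch. The snapshot rows and column choices below are hypothetical: units sold stands in for an additive measure, stock on hand for a semiadditive one (it sums across stores but not across days), and unit price for a nonadditive one.

```python
from collections import defaultdict

# Hypothetical daily rows: (day, store, units_sold, stock_on_hand, unit_price)
rows = [
    (1, "A", 5, 100, 2.0),
    (1, "B", 3, 50, 2.5),
    (2, "A", 4, 96, 2.0),
    (2, "B", 2, 48, 2.5),
]

# Additive: units sold can be summed across every dimension.
total_units = sum(units for _day, _store, units, _stock, _price in rows)

# Semiadditive: stock on hand sums across stores within one day,
# but summing it across days would count the same inventory repeatedly.
stock_by_day = defaultdict(int)
for day, _store, _units, stock, _price in rows:
    stock_by_day[day] += stock
closing_stock = stock_by_day[max(stock_by_day)]  # take the latest day instead

# Nonadditive: unit prices cannot be summed along any dimension.
# Recompute the ratio from additive components instead.
revenue = sum(units * price for _day, _store, units, _stock, price in rows)
average_price = revenue / total_units
```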
  108. 109. More on Factless Fact Tables: the fact table holds only foreign keys (Emp_FK, Sal_FK, Age_FK, Ed_FK, Grade_FK), each referencing a dimension: Employee dimension (Emp_PK), Salary dimension (Sal_PK), Age dimension (Age_PK), Education dimension (Ed_PK), Grade dimension (Grade_PK). PK = primary key; FK = foreign key.
  109. 110. Factless Fact Tables <ul><li>Event tracking</li><li>Coverage</li></ul>
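A rough sketch of both uses, with hypothetical keys throughout. A factless fact table has no numeric measure: for event tracking the analysis counts rows, and for coverage it compares what could have happened against what did.

```python
from collections import Counter

# Event tracking: each row records that an event occurred (keys only, no measure).
# Hypothetical attendance events: (student_key, class_key, day_key)
attendance_fact = [
    ("student_1", "class_A", "2001-01-08"),
    ("student_2", "class_A", "2001-01-08"),
    ("student_1", "class_B", "2001-01-09"),
]
attendance_per_class = Counter(
    class_key for _student, class_key, _day in attendance_fact
)

# Coverage: each row records a combination that was possible, such as a
# product being on promotion in a store. Comparing it with the sales keys
# answers "what was promoted but did not sell?"
promotion_coverage = {("product_1", "store_1"), ("product_2", "store_1")}
sales_keys = {("product_1", "store_1")}
promoted_but_unsold = promotion_coverage - sales_keys
```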
  110. 111. Bracketed Dimensions <ul><li>Enhance performance and analytical capabilities</li><li>Create groups of values for attributes with many unique values, such as income ranges and age brackets</li><li>Minimize the need for full table scans by pre-aggregating data</li></ul>
  111. 112. Bracketing Dimensions. Diagram labels: Customer dimension, Bracket dimension, Income fact; key columns shown: Customer_PK, Bracket_FK, Bracket_PK. Bracket dimension rows:
  Bracket_PK | Income (10Ks) | Marital Status | Gender | Age
  1 | 60-90 | Single | Male | <21
  2 | 60-90 | Single | Male | 21-35
  3 | 60-90 | Single | Male | 35-55
  4 | 60-90 | Single | Male | >55
  5 | 60-90 | Single | Female | <21
  6 | 60-90 | Single | Female | 21-35
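Assigning a customer to a bracket is a lookup from raw attribute values to a bracket key. The sketch below uses the six bracket rows shown on the slide. The boundary handling is an assumption, since the slide's age ranges overlap at 35 (here 35 falls in "21-35"), and the function names are illustrative.

```python
# Bracket rows from the slide: (income, marital status, gender, age) -> Bracket_PK
BRACKETS = {
    ("60-90", "Single", "Male", "<21"): 1,
    ("60-90", "Single", "Male", "21-35"): 2,
    ("60-90", "Single", "Male", "35-55"): 3,
    ("60-90", "Single", "Male", ">55"): 4,
    ("60-90", "Single", "Female", "<21"): 5,
    ("60-90", "Single", "Female", "21-35"): 6,
}


def age_bracket(age):
    """Map an exact age to a bracket label (boundary choices are assumptions)."""
    if age < 21:
        return "<21"
    if age <= 35:
        return "21-35"
    if age <= 55:
        return "35-55"
    return ">55"


def bracket_key(income_bracket, marital_status, gender, age):
    """Return the Bracket_PK for a customer, or None if no bracket row exists."""
    return BRACKETS.get((income_bracket, marital_status, gender, age_bracket(age)))
```

For example, `bracket_key("60-90", "Single", "Male", 30)` resolves to bracket 2, so queries can group on a handful of bracket rows instead of many distinct ages and incomes.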
  112. 113. Identifying Analytical Hierarchies. Business hierarchies describe organizational structure and logical parent-child relationships within the data. Store dimension: Store ID, Store Desc, Location, Size, Type, District ID, District Desc, Region ID, Region Desc. Organization hierarchy: Region > District > Store
  113. 114. Multiple Hierarchies. Store dimension: Store ID, Store Desc, Location, Size, Type, District ID, District Desc, Region ID, Region Desc, City ID, City Desc, County ID, County Desc, State ID, State Desc. Organization hierarchy: Region > District > Store. Geography hierarchy: Region > District > Store
  114. 115. Multiple Time Hierarchies. Fiscal time hierarchy: Fiscal year > Fiscal quarter > Fiscal month > Fiscal week. Calendar time hierarchy: Calendar year > Calendar quarter > Calendar month > Calendar week
  115. 116. Drilling Up and Drilling Down. Diagram: Group Market Hierarchy with Region 1 and Region 2 at the top, District 1 to District 4 below them, and Store 1 to Store 6 at the lowest level
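Drilling up aggregates a measure to the parent level of a hierarchy, and drilling down expands a parent into its children. The sketch below uses the region, district, and store names from the slide, but the parent-child assignments and the sales figures are hypothetical, since the diagram's edges are not recoverable from the transcript.

```python
from collections import defaultdict

# Hypothetical parent-child assignments.
STORE_TO_DISTRICT = {
    "Store 1": "District 1", "Store 2": "District 1",
    "Store 3": "District 2", "Store 4": "District 3",
    "Store 5": "District 4", "Store 6": "District 4",
}
DISTRICT_TO_REGION = {
    "District 1": "Region 1", "District 2": "Region 1",
    "District 3": "Region 2", "District 4": "Region 2",
}

# Hypothetical sales per store.
store_sales = {
    "Store 1": 10.0, "Store 2": 20.0, "Store 3": 30.0,
    "Store 4": 40.0, "Store 5": 50.0, "Store 6": 60.0,
}


def drill_up(values, parent_of):
    """Aggregate child-level values to their parents."""
    totals = defaultdict(float)
    for child, value in values.items():
        totals[parent_of[child]] += value
    return dict(totals)


def drill_down(parent, values, parent_of):
    """Return the child-level values that belong to one parent."""
    return {child: v for child, v in values.items() if parent_of[child] == parent}


district_sales = drill_up(store_sales, STORE_TO_DISTRICT)
region_sales = drill_up(district_sales, DISTRICT_TO_REGION)
region_2_detail = drill_down("Region 2", district_sales, DISTRICT_TO_REGION)
```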
  116. 117. Drilling Across. Diagram: the Group Market hierarchy (Region, District, Store) alongside the City hierarchy (City, Store), with a path through Region and District to stores larger than 20,000 sq. ft.
  117. 118. Using Time in the Data Warehouse <ul><li>Defining standards for time is critical.</li><li>Aggregation based on time is complex.</li><li>Time is critical to the data warehouse. A consistent representation of time is required for extensibility.</li></ul>Where should the element of time be stored? (Diagram: Time dimension, Sales fact)
  118. 119. Date Dimension <ul><li>Should the date dimension be modeled?</li></ul>
  119. 120. Applying the Changes to Data <ul><li>You have a choice of techniques: <ul><li>Overwrite a record</li><li>Add a record</li><li>Add a field</li><li>Maintain history</li><li>Add version numbers</li></ul></li></ul>
  120. 121. OLAP Models <ul><li>Relational (ROLAP)</li><li>Multidimensional (MOLAP)</li><li>Hybrid (HOLAP)</li><li>Desktop (DOLAP)</li></ul>
  121. 122. Slowly Changing Dimensions (SCDs) <ul><li>What is an SCD?</li><li>It is a dimension whose attribute data changes slowly over time and therefore needs to be updated.</li><li>There are three standard ways, outlined by Kimball (and others), to handle this situation: <ul><li>Type I</li><li>Type II</li><li>Type III</li></ul></li></ul>
  122. 123. Type I - Overwriting a Record <ul><li>Easy to implement</li><li>Loses all history</li><li>Not recommended</li></ul>Example: the record (42135, John Doe, Single) is overwritten as (42135, John Doe, Married).
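A minimal sketch of a Type I change, using the John Doe record from the slide. The dictionary layout and the function name `scd_type1_update` are assumptions for illustration; the point is that the old value is gone after the update.

```python
# Customer dimension keyed by customer id; field names are illustrative.
customer_dim = {
    42135: {"name": "John Doe", "marital_status": "Single"},
}


def scd_type1_update(dim, key, **changes):
    """Type I: overwrite the attributes in place, keeping no history."""
    dim[key].update(changes)


scd_type1_update(customer_dim, 42135, marital_status="Married")
```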
  123. 124. Type II - Adding a New Record <ul><li>History is preserved; dimensions grow.</li><li>Generalized key is created.</li></ul>Example: the record (42135, John Doe, Single) is kept and a new record (42135_01, John Doe, Married) is added.
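A sketch of a Type II change that follows the slide's key pattern, where the new version gets a suffixed key (42135_01). The `is_current` flag, the field names, and the two-digit suffix format are assumptions; many implementations use an integer surrogate key with effective dates instead.

```python
# One row per version of a customer; field names are illustrative.
customer_rows = [
    {"dim_key": "42135", "customer_id": 42135, "name": "John Doe",
     "marital_status": "Single", "is_current": True},
]


def scd_type2_update(rows, customer_id, **changes):
    """Type II: keep the old row, mark it as not current, and add a new version."""
    versions = [r for r in rows if r["customer_id"] == customer_id]
    current = next(r for r in versions if r["is_current"])
    current["is_current"] = False
    new_row = {
        **current,
        **changes,
        # Generalized key: original id plus a version suffix (format assumed).
        "dim_key": f"{customer_id}_{len(versions):02d}",
        "is_current": True,
    }
    rows.append(new_row)
    return new_row


scd_type2_update(customer_rows, 42135, marital_status="Married")
```

New fact rows reference `42135_01`, while older fact rows still point at `42135`, so history stays attached to the attribute values that were true at the time.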
  124. 125. Type III - Adding a Current Field <ul><li>Maintains some history</li><li>Loses intermediate values</li><li>Is enhanced by adding an Effective Date field</li></ul>Example: the record (42135, John Doe, Single) becomes (42135, John Doe, Single, 1-Jan-01, Married).
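A sketch of one common Type III variant, which keeps the previous value next to the current one. The field names are assumptions, and the second change ("Divorced", "1-Jun-05") is hypothetical, added only to show how an intermediate value is lost.

```python
# One row per customer, with extra columns for the prior value and change date.
customer = {
    "customer_id": 42135,
    "name": "John Doe",
    "previous_marital_status": None,
    "current_marital_status": "Single",
    "effective_date": None,
}


def scd_type3_update(row, new_status, effective_date):
    """Type III: shift the current value into the 'previous' column, then overwrite."""
    row["previous_marital_status"] = row["current_marital_status"]
    row["current_marital_status"] = new_status
    row["effective_date"] = effective_date


scd_type3_update(customer, "Married", "1-Jan-01")
after_first_change = dict(customer)

# Hypothetical second change: "Single" no longer appears anywhere in the row.
scd_type3_update(customer, "Divorced", "1-Jun-05")
```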
  125. 126. Maintain History <ul><li>History tables: <ul><li>One-to-many relationships</li><li>One current record and many history records</li></ul></li></ul>Diagram labels: Product, Time, Sales, CUSTOMER, HIST_CUST
  126. 127. Versioning <ul><li>Avoid double counting</li><li>Facts hold version number</li></ul>Diagram labels: Time, Product, Customer, Sales.
  Customer.CustId | Version | Customer Name
  1234 | 1 | Comer
  1234 | 2 | Comer
  Sales.CustId | Version | Sales Facts
  1234 | 1 | $11,000
  1234 | 2 | $12,000
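The double counting that versioning avoids can be shown with the slide's own rows. This sketch compares a join on customer id alone with a join on customer id and version; the tuple layout and variable names are illustrative.

```python
# Rows from the slide: (cust_id, version, name) and (cust_id, version, amount).
customers = [(1234, 1, "Comer"), (1234, 2, "Comer")]
sales = [(1234, 1, 11000), (1234, 2, 12000)]

# Joining on cust_id only matches every sale to both customer versions,
# so each amount is counted twice.
naive_total = sum(
    amount
    for s_id, _s_ver, amount in sales
    for c_id, _c_ver, _name in customers
    if s_id == c_id
)

# Joining on cust_id and version matches each sale to exactly one customer row.
versioned_total = sum(
    amount
    for s_id, s_ver, amount in sales
    for c_id, c_ver, _name in customers
    if s_id == c_id and s_ver == c_ver
)
```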
  127. 128. Rapidly Changing Dimensions (RCDs) <ul><li>An RCD is a dimension whose attribute data changes quickly over time and therefore needs frequent updates.</li><li>It is also referred to as a rapidly changing monster dimension.</li><li>To handle it, create a separate dimension, referred to as a mini dimension, for the rapidly changing attributes.</li></ul>Mini Dimension:
  Demographics Key | Age | Children | Income
  1 | 20-24 | 0 | <20000
  2 | 20-24 | 1-2 | 20000-30000
  3 | 20-24 | >2 | >30000
  4 | 25-30 | 0 | <20000
  5 | 25-30 | 1-2 | 20000-30000
  ... | ... | ... | ...
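A sketch of how a mini dimension absorbs rapid change. It uses the five demographics rows shown on the slide; the band boundaries, function names, and the fact rows are assumptions. When a customer's demographics change, new fact rows simply carry a different demographics key, and the large customer dimension is not touched.

```python
# Mini dimension rows from the slide: (age band, children band, income band) -> key
MINI_DIM = {
    ("20-24", "0", "<20000"): 1,
    ("20-24", "1-2", "20000-30000"): 2,
    ("20-24", ">2", ">30000"): 3,
    ("25-30", "0", "<20000"): 4,
    ("25-30", "1-2", "20000-30000"): 5,
}


def age_band(age):
    if 20 <= age <= 24:
        return "20-24"
    if 25 <= age <= 30:
        return "25-30"
    return None  # bands outside the slide's visible rows are not modeled here


def children_band(children):
    if children == 0:
        return "0"
    return "1-2" if children <= 2 else ">2"


def income_band(income):
    if income < 20000:
        return "<20000"
    return "20000-30000" if income <= 30000 else ">30000"


def demographics_key(age, children, income):
    """Return the mini-dimension key, or None if the combination is not listed."""
    band = (age_band(age), children_band(children), income_band(income))
    return MINI_DIM.get(band)


# Hypothetical fact rows for one customer before and after a demographic change.
purchase_fact = [
    {"customer_key": 7, "demographics_key": demographics_key(22, 0, 15000), "amount": 40.0},
    {"customer_key": 7, "demographics_key": demographics_key(27, 2, 25000), "amount": 55.0},
]
```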
  128. 129. Junk Dimension <ul><li>A junk dimension is an abstract dimension that holds the decodes for a group of low-cardinality flags and indicators, thereby removing them from the fact table.</li></ul>Junk Dimension:
  Junk Key | Payment Type | Order Type | Order Mode
  1 | Cash | Normal | Web
  2 | Cash | Urgent | Web
  3 | Credit | Normal | Fax
  4 | Credit | Urgent | Fax
  ... | ... | ... | ...
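One way to populate a junk dimension is to assign a key to each distinct flag combination as it is encountered during loading. The sketch below takes that approach with the four combinations from the slide; the order rows, amounts, and function name are hypothetical. Building the full cross product of flag values up front is an equally common alternative.

```python
def junk_key(junk_dim, payment_type, order_type, order_mode):
    """Return the key for a flag combination, adding a new row if it is unseen."""
    combo = (payment_type, order_type, order_mode)
    if combo not in junk_dim:
        junk_dim[combo] = len(junk_dim) + 1
    return junk_dim[combo]


# Hypothetical incoming orders: (payment type, order type, order mode, amount)
orders = [
    ("Cash", "Normal", "Web", 25.0),
    ("Cash", "Urgent", "Web", 40.0),
    ("Credit", "Normal", "Fax", 15.0),
    ("Credit", "Urgent", "Fax", 60.0),
    ("Cash", "Normal", "Web", 10.0),  # repeats an existing combination
]

junk_dim = {}
order_fact = [
    {"junk_key": junk_key(junk_dim, payment, order_type, mode), "amount": amount}
    for payment, order_type, mode, amount in orders
]
```

The fact table now stores one small key per row instead of three separate flag columns.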
  129. 130. Secret of Success Think big, start small!
  130. 131. References <ul><li>Useful web sites: <ul><li>http://www.dmreview.com</li><li>http://www.rkimball.com</li><li>http://www.billinmon.com</li><li>http://www.dmforum.org</li><li>http://www.freedatawarehouse.com</li></ul></li></ul>
  131. 132. Thank-you
