Data Profiling
using CA ERwin Modeling to assure data and metadata
abstract

• This session explores the use of data profiling to increase the
  accuracy of critical data assets and their associated data
  models/metadata. This presentation will include examples of how
  clients have leveraged data profiling in combination with data
  modeling for master data management, data warehousing, data
  governance, and other data intensive initiatives.




biography

• Antonio C. Amorin
  President, Data Innovations, Inc.
  – Eighteen years of data modeling experience and fourteen years of
    experience using CA ERwin® Data Modeler
  – Ten years of data profiling experience and two years of experience using CA
    ERwin® Data Profiler
  – Data Innovations, Inc. – CA Partner since 2004
  – Presented at CA World’08, CBI’s Life Sciences Forum on “Customer Data
    Quality and Integrity”, ERwin User Groups, webcasts and at client sites
  – Graduated from Illinois State University with a BA in Computer Science and
    a minor in Economics



agenda

•   Data Profiling
•   Data and Metadata Quality
•   Data Governance and Data Warehousing
•   Real-life Examples
•   Summary




data profiling

• What is data profiling?
  – The analysis of data content to infer metadata
  – A component of data modeling
• What are the basic components of the CA ERwin® Data Profiler?
  –   Column analysis
  –   PF (primary-foreign) key analysis
  –   Data object analysis
  –   Overlap analysis




data profiling

• Column analysis
    – Inferred metadata provides
      intimate knowledge of the data
      content at the column level
           •   Cardinality
           •   Range
           •   Mode
           •   Sparse
           •   Null count
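
The column-level measures listed above can be approximated in a few lines of plain Python. This is only a minimal sketch, not how CA ERwin Data Profiler computes them: the function name `profile_column`, the sample data, and the rule that "sparse" means the share of null or empty values exceeds a threshold are all assumptions made for illustration.

```python
from collections import Counter

def profile_column(values, sparse_threshold=0.5):
    """Return basic inferred metadata for one column (a list of raw values)."""
    # Treat None and empty strings as missing (an assumption for this sketch).
    non_null = [v for v in values if v is not None and v != ""]
    null_count = len(values) - len(non_null)
    counts = Counter(non_null)
    return {
        "cardinality": len(counts),  # number of distinct non-null values
        "range": (min(non_null), max(non_null)) if non_null else None,
        "mode": counts.most_common(1)[0][0] if counts else None,
        "null_count": null_count,
        # Assumption: a column is "sparse" when its null share exceeds the threshold.
        "sparse": bool(values) and null_count / len(values) > sparse_threshold,
    }

# Hypothetical sample column
state_codes = ["IL", "IL", "WI", None, "IN", "", "IL"]
profile = profile_column(state_codes)
print(profile)
```

A column with low cardinality and a dominant mode often points to a code or flag attribute, while a high null count suggests the column may be optional or unused.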




data profiling

• Column analysis (continued)
            • Value frequencies
            • Pattern frequencies
            • Length frequencies
• Identify critical data elements
     – Allows the user to focus analysis on
       specific attributes
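
Value, pattern, and length frequencies can be illustrated with a short Python sketch. The pattern notation used here (digits become `9`, letters become `A`, everything else is kept) is a common profiling convention and is assumed for this example; the tool's own notation may differ. The function names and phone numbers are hypothetical.

```python
from collections import Counter

def pattern_of(value):
    """Map digits to 9 and letters to A; keep other characters unchanged."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def frequencies(values):
    """Count distinct values, patterns, and lengths for non-null strings."""
    values = [v for v in values if v is not None]
    return {
        "value": Counter(values),
        "pattern": Counter(pattern_of(v) for v in values),
        "length": Counter(len(v) for v in values),
    }

# Hypothetical sample column
phones = ["309-555-0101", "309-555-0102", "3095550103", "N/A", "309-555-0101"]
freq = frequencies(phones)
print(freq["pattern"].most_common())
```

Rare patterns and lengths are usually the first place to look for data quality issues such as placeholder values or inconsistent formatting.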




data profiling

• PF key analysis
    – Cross-table analysis of primary-
      foreign key relationships
           • Column matches
           • Classification
               – Parent-child
               – Reference
               – None




data profiling

• PF key analysis (continued)
    – Cross-table analysis of primary-
      foreign key relationships
            • Expressions
                 – table.column=table.column
            • Row hit rate
            • Value hit rate
            • Selectivity
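
The slides do not define the hit-rate measures, so the sketch below uses working definitions that are assumptions: row hit rate is the share of non-null child rows whose key value exists in the parent column, value hit rate is the share of distinct child values found in the parent, and selectivity is the ratio of distinct values to rows in the parent column. The expression being tested is the hypothetical `orders.customer_id=customers.customer_id`.

```python
def pf_key_metrics(parent_keys, child_keys):
    """Compare a candidate foreign key column against a candidate primary key column."""
    parent_set = set(parent_keys)
    child_non_null = [k for k in child_keys if k is not None]
    child_distinct = set(child_non_null)
    row_hits = sum(1 for k in child_non_null if k in parent_set)
    value_hits = len(child_distinct & parent_set)
    return {
        # Assumed definition: matching child rows / non-null child rows
        "row_hit_rate": row_hits / len(child_non_null) if child_non_null else 0.0,
        # Assumed definition: matching distinct child values / distinct child values
        "value_hit_rate": value_hits / len(child_distinct) if child_distinct else 0.0,
        # Assumed definition: distinct parent values / parent rows (1.0 means unique)
        "parent_selectivity": len(parent_set) / len(parent_keys) if parent_keys else 0.0,
        # Child values with no parent row
        "orphans": sorted(child_distinct - parent_set),
    }

# Hypothetical data: customers.customer_id and orders.customer_id
customer_ids = [1, 2, 3, 4]
order_customer_ids = [1, 1, 2, 5, 3, 1, 5, None]
metrics = pf_key_metrics(customer_ids, order_customer_ids)
print(metrics)
```

Under these definitions, a parent selectivity of 1.0 with hit rates near 1.0 supports a parent-child classification, and the orphan list shows exactly which values break the relationship.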




data profiling

• Data objects
    – Similar to subject areas
    – Groups objects together that
      contain the same data content
    – Based on the parent-child
      relationships
    – Creates an object view of related
      tables or files




data profiling

• Overlap analysis
    – Cross-system analysis that identifies
      data content overlap
    – Data Set Summary
            • Provides graphical overview
                 – Legend identifies color coded data
                   sources
                 – Allows modeler to visualize data
                   content overlap between data
                   sources




data profiling

• Overlap analysis (continued)
    – Data set overlaps
            • Table compares each data
              source to the other data
              sources
            • Allows comparison of two
              data sources at a time
            • Identifies the number of
              tables and columns that
              overlap between each data
              source




data profiling

• Overlap analysis (continued)
    – Column Summary
            • Identifies each column in
              the primary data source
            • Identifies value overlap
              between data sources
            • Allows modeler to use
              critical data elements to
              focus analysis
            • Allows modeler to drill into
              analysis to identify data
              content overlap




data profiling

• Overlap analysis (continued)
    – Matches data preview
            • Allows the modeler to view hits
              or misses
            • Identifies specific data content
              that overlaps or does not
              overlap between each data
              source
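
At the column level, the hits and misses described above amount to set comparisons between two data sources. The following is a minimal sketch under that assumption; the function name, the source names (CRM and ERP), and the country codes are hypothetical.

```python
def column_overlap(primary_values, other_values):
    """Report which distinct values in the primary source also appear in another source."""
    primary = {v for v in primary_values if v is not None}
    other = {v for v in other_values if v is not None}
    hits = primary & other    # values present in both sources
    misses = primary - other  # values present only in the primary source
    return {
        "hits": sorted(hits),
        "misses": sorted(misses),
        "overlap_pct": 100.0 * len(hits) / len(primary) if primary else 0.0,
    }

# Hypothetical country code columns from two systems
crm_countries = ["US", "CA", "MX", "UK", None]
erp_countries = ["US", "CA", "GB", "MX"]
overlap = column_overlap(crm_countries, erp_countries)
print(overlap)
```

Here the miss on `UK` versus `GB` is the kind of finding that feeds standardization rules when content from both sources is consolidated.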




data and metadata quality

• Data
  – Business data - information utilized to operate the business
• Metadata
  – Information generated during the development of IT solutions
  – Defines both the business and technical understanding of the data
  – Utilized to store, process, and report on business data




data and metadata quality

• Data quality
  – Accuracy of the business data
  – High/low quality
  – Mission critical
• Metadata quality
  – Properly represents data content
  – Properly represents parent-child relationships




data and metadata quality

• Leveraging data profiling
  – Use the cardinality, range, mode, and sparse indicators to identify attributes
    requiring detailed analysis
  – Identify data quality issues and validate data types using the value and
    pattern frequencies
  – Leverage the null count and length frequencies to validate column metadata
  – Validate parent-child relationships using the primary-foreign key analysis
  – Leverage the overlap analysis with reference tables containing valid values
    for data quality assessments
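
As one illustration of validating column metadata, the null count and length frequencies can be compared against what the model declares for a column. This is a sketch only: the function name, the declared properties passed in, and the ZIP code sample are assumptions, and a real check would read the declarations from the data model.

```python
from collections import Counter

def check_column_metadata(values, declared_nullable, declared_max_length):
    """Compare profiled nulls and lengths with the column's declared metadata."""
    null_count = sum(1 for v in values if v is None)
    length_freq = Counter(len(v) for v in values if v is not None)
    longest = max(length_freq) if length_freq else 0
    findings = []
    if null_count and not declared_nullable:
        findings.append(f"{null_count} null(s) found in a NOT NULL column")
    if longest > declared_max_length:
        findings.append(f"observed length {longest} exceeds declared {declared_max_length}")
    if length_freq and longest < declared_max_length:
        findings.append(f"declared length {declared_max_length} is wider than any value ({longest})")
    return findings

# Hypothetical column modeled as a NOT NULL CHAR(5)
zip_codes = ["61761", "61701", None, "61761-1234"]
issues = check_column_metadata(zip_codes, declared_nullable=False, declared_max_length=5)
for issue in issues:
    print(issue)
```

Each finding is either a data quality issue in the source or a sign that the model metadata needs to change, which is the decision the review process has to make.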




data governance and data warehousing

Leveraging data profiling for data governance
• Business Data
  – Standards
  – Master data management
  – Data quality assessments
• Metadata
  – Standards
  – Model validation




data governance and data warehousing

Leveraging data profiling for data governance (continued)
• Standards
  – Business data - valid values, data patterns, and standardized values for static
    data content
  – Metadata – validate that model metadata properly represents data content
    and validate parent-child relationships
  – Automate the analysis with profiling
  – Develop profiling reports for each standard
  – Define and implement a review process
  – Integrate standards and review process into SDLC




data governance and data warehousing

Leveraging data profiling for data governance (continued)
• Master data management (MDM)
  –   Locating reference data
  –   Data mapping
  –   Harmonizing reference data
  –   Establishing validations and syndication rules
  –   Identifying hub metadata
  –   Data quality assessments




data governance and data warehousing

Leveraging data profiling for data governance (continued)
• Data quality assessments
  –   Comprehensive review at the column level
  –   Validation of primary keys
  –   Validation of parent-child relationships
  –   Point-to-point content validation between systems
  –   Standardize analysis methodology
  –   Standardize problem notation
  –   Standardize reporting




data governance and data warehousing

Leveraging data profiling for data warehousing
• Data warehouse development
  – Leverage data models and data profiling results to locate and map business
    data to the data warehouse
  – Eliminate the code-load-explode development methodology for ETL
       • Profile each data source to validate data content
       • Identify accurate requirements for transformations to consolidate data content
         and correct data quality issues
  – Use profiling results to determine model metadata for target staging
    databases and the data warehouse
  – Profile the data warehouse regularly to ensure high quality data content



real-life examples

Public computer hardware and software manufacturer
• Introduced data profiling into ongoing data warehousing
  project
  – Profiled first data source
       • Found questionable data content in financial data within ten minutes of
         profiling data
       • Realized that six months were wasted mapping from the data source to
         the target data warehouse
       • All new data sources were profiled going forward to ensure validity




real-life examples

Large public food manufacturer
• Introduced data profiling into sales data warehouse project
  – Leveraged data profiling results to create accurate ETL specifications,
    reducing the overall development time
  – Developers utilized data profiling to validate ETL unit testing
  – Used cross-system analysis to integrate data content from disparate data
    sources into standardized values in data warehouse
  – Profiled data warehouse regularly to identify data content issues




real-life examples

Public healthcare insurance provider
• Introduced data profiling into ongoing master data management
  project
  – Performed data content mapping utilizing profiling results
  – Analyzed IMS extracts and flat files to determine where reference data lived
    within legacy mainframe data sources
  – Leveraged profiling results to create ETL specifications
  – Harmonized reference data using profiling results
  – Validated reference data loaded into MDM hub




real-life examples

Medium-sized accounting service organization
• Created data store for reporting purposes
  – Profiled disparate data sources to identify model metadata for new data
    store
  – Leveraged profiling results to identify data quality issues for each data
    source
  – Created ETL specifications to consolidate data content from the disparate
    data sources using the profiling results
  – Validated data content in the loaded data store




summary

• Data profiling
  – Increases accuracy of data content and metadata
  – Reduces project overrun
  – Increases value of deliverables to the business
  – Valuable for master data management, data warehousing, data
    governance, and other data intensive initiatives




questions and answers
thank you

Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Último (20)

How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Using CA ERwin Modeling to Assure Data 09162010

  • 1. Data Profiling using CA ERwin Modeling to assure data and metadata
  • 2. abstract • This session explores the use of data profiling to increase the accuracy of critical data assets and their associated data models/metadata. This presentation will include examples of how clients have leveraged data profiling in combination with data modeling for master data management, data warehousing, data governance, and other data intensive initiatives. PAGE 2
  • 3. biography • Antonio C. Amorin President, Data Innovations, Inc. – Eighteen years of data modeling experience and fourteen years of experience using CA ERwin® Data Modeler – Ten years of data profiling experience and two years of experience using CA ERwin® Data Profiler – Data Innovations, Inc. – CA Partner since 2004 – Presented at CA World’08, CBI’s Life Sciences Forum on “Customer Data Quality and Integrity”, ERwin User Groups, webcasts and at client sites – Graduated from Illinois State University with a BA in Computer Science and a minor in Economics PAGE 3
  • 4. agenda • Data Profiling • Data and Metadata Quality • Data Governance and Data Warehousing • Real-life Examples • Summary PAGE 4
  • 6. data profiling • What is data profiling? – The analysis of data content to infer metadata – A component of data modeling • What are the basic components of the CA ERwin® Data Profiler? – Column analysis – PF key analysis – Data object analysis – Overlap analysis PAGE 6
  • 7. data profiling • Column analysis – Inferred metadata provides intimate knowledge of the data content at the column level • Cardinality • Range • Mode • Sparse • Null count PAGE 7
  • 8. data profiling • Column analysis (continued) • Value frequencies • Pattern frequencies • Length frequencies • Identify critical data elements – Allows the user to focus analysis on specific attributes PAGE 8
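The column-level measures listed on slides 7 and 8 can be illustrated with a minimal Python sketch. It is not the CA ERwin Data Profiler's implementation: the function names, the pattern notation (digits shown as `9`, letters as `A`), and the `sparse_threshold` cut-off are all assumptions made for illustration.

```python
from collections import Counter


def value_pattern(value):
    """Map digits to '9' and letters to 'A'; keep other characters as-is."""
    out = []
    for ch in str(value):
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values, sparse_threshold=0.5):
    """Infer simple column metadata from a list of values (None means null).

    The 'sparse' flag is an assumed definition: the share of nulls is at
    least sparse_threshold.
    """
    non_null = [v for v in values if v is not None]
    value_freq = Counter(non_null)
    null_count = len(values) - len(non_null)
    return {
        "cardinality": len(value_freq),  # number of distinct non-null values
        "range": (min(non_null), max(non_null)) if non_null else None,
        "mode": value_freq.most_common(1)[0][0] if non_null else None,
        "null_count": null_count,
        "sparse": bool(values) and null_count / len(values) >= sparse_threshold,
        "value_frequencies": value_freq,
        "pattern_frequencies": Counter(value_pattern(v) for v in non_null),
        "length_frequencies": Counter(len(str(v)) for v in non_null),
    }


# Hypothetical sample data: one ZIP code contains a letter, one is null.
zip_codes = ["60601", "60601", "6060A", None, "61761"]
profile = profile_column(zip_codes)
print(profile["pattern_frequencies"])
```

In this sample the pattern frequencies separate the single `9999A` value from the all-digit values, which is the kind of outlier that points to a data quality issue or a wrong data type.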
  • 9. data profiling • PF key analysis – Cross-table analysis of primary-foreign key relationships • Column matches • Classification – Parent-child – Reference – None PAGE 9
  • 10. data profiling • PF key analysis (continued) – Cross-table analysis of primary-foreign key relationships • Expressions – table.column=table.column • Row hit rate • Value hit rate • Selectivity PAGE 10
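The slides name row hit rate, value hit rate, and selectivity without defining them. The Python sketch below uses working definitions that are assumptions, not necessarily those of the CA ERwin Data Profiler: row hit rate is the share of non-null child rows whose value exists in the parent column, value hit rate is the share of distinct child values found in the parent column, and selectivity is the ratio of distinct values to rows in the parent column. The function name and sample data are hypothetical.

```python
def pf_key_metrics(parent_values, child_values):
    """Compare a candidate primary key column with a candidate foreign key column."""
    parent_set = set(parent_values)
    child_non_null = [v for v in child_values if v is not None]
    child_distinct = set(child_non_null)

    # Child rows, and distinct child values, that find a match in the parent.
    row_hits = sum(1 for v in child_non_null if v in parent_set)
    value_hits = len(child_distinct & parent_set)

    return {
        "row_hit_rate": row_hits / len(child_non_null) if child_non_null else 0.0,
        "value_hit_rate": value_hits / len(child_distinct) if child_distinct else 0.0,
        # 1.0 means every parent value is unique, as expected of a primary key.
        "parent_selectivity": len(parent_set) / len(parent_values) if parent_values else 0.0,
    }


# Hypothetical check of the expression orders.customer_id = customers.id
customer_ids = [1, 2, 3, 4]
order_customer_ids = [1, 1, 2, 9, None]
metrics = pf_key_metrics(customer_ids, order_customer_ids)
print(metrics)
```

A row hit rate below 1.0, as with the orphaned value `9` here, suggests that a parent-child relationship in the model is not fully supported by the data.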
  • 11. data profiling • Data objects – Similar to subject areas – Groups objects together that contain the same data content – Based on the parent-child relationships – Creates an object view of related tables or files PAGE 11
  • 12. data profiling • Overlap analysis – Cross-system analysis that identifies data content overlap – Data Set Summary • Provides graphical overview – Legend identifies color coded data sources – Allows modeler to visualize data content overlap between data sources PAGE 12
  • 13. data profiling • Overlap analysis (continued) – Data set overlaps • Table compares each data source to the other data sources • Allows comparison of two data sources at a time • Identifies the number of tables and columns that overlap between each data source PAGE 13
  • 14. data profiling • Overlap analysis (continued) – Column Summary • Identifies each column in the primary data source • Identifies value overlap between data sources • Allows modeler to use critical data elements to focus analysis • Allows modeler to drill into analysis to identify data content overlap PAGE 14
  • 15. data profiling • Overlap analysis (continued) – Matches data preview • Allows the modeler to view hits or misses • Identifies specific data content that overlaps or does not overlap between each data source PAGE 15
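The hits and misses described in the overlap analysis slides can be sketched as a set comparison between a column in the primary data source and the matching column in another source. This is a simplified Python illustration under assumed names (`column_overlap`, `overlap_ratio`) and made-up sample data; the actual tool works across whole systems rather than two lists.

```python
def column_overlap(primary_values, other_values):
    """Report which distinct primary values also appear in the other source."""
    primary = {v for v in primary_values if v is not None}
    other = {v for v in other_values if v is not None}
    hits = primary & other    # values present in both sources
    misses = primary - other  # values found only in the primary source
    return {
        "hits": sorted(hits),
        "misses": sorted(misses),
        "overlap_ratio": len(hits) / len(primary) if primary else 0.0,
    }


# Hypothetical example: state codes in an operational system compared
# against a reference table of valid values.
crm_states = ["IL", "WI", "IN", "Ill."]
reference_states = ["IL", "IN", "WI", "MI"]
result = column_overlap(crm_states, reference_states)
print(result["misses"])
```

Comparing against a reference table of valid values in this way turns the misses into a list of candidate data quality issues.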
  • 16. data and metadata quality PAGE 16
  • 17. data and metadata quality • Data – Business data - information utilized to operate the business • Metadata – Information generated during the development of IT solutions – Defines both the business and technical understanding of the data – Utilized to store, process, and report on business data PAGE 17
  • 18. data and metadata quality • Data Quality – Accuracy of the business data – High/low quality – Mission critical • Metadata quality – Properly represents data content – Validate parent-child relationships PAGE 18
  • 19. data and metadata quality • Leveraging data profiling – Use the cardinality, range, mode, and sparse indicators to identify attributes requiring detailed analysis – Identify data quality issues and validate data types using the value and pattern frequencies – Leverage the null count and length frequencies to validate column metadata – Validate parent-child relationships using the primary-foreign key analysis – Leverage the overlap analysis with reference tables containing valid values for data quality assessments PAGE 19
  • 20. data governance and data warehousing PAGE 20
  • 21. data governance and data warehousing Leveraging data profiling for data governance • Business Data – Standards – Master data management – Data quality assessments • Metadata – Standards – Model validation PAGE 21
  • 22. data governance and data warehousing Leveraging data profiling for data governance (continued) • Standards – Business data - valid values, data patterns, and standardized values for static data content – Metadata – validate model metadata represents data content properly and validate parent-child relationships – Automate the analysis with profiling – Develop profiling reports for each standard – Define and implement a review process – Integrate standards and review process into SDLC PAGE 22
  • 23. data governance and data warehousing Leveraging data profiling for data governance (continued) • Master data management (MDM) – Locating reference data – Data mapping – Harmonizing reference data – Establishing validations and syndication rules – Identifying hub metadata – Data quality assessments PAGE 23
  • 24. data governance and data warehousing Leveraging data profiling for data governance (continued) • Data quality assessments – Comprehensive review at the column level – Validation of primary keys – Validation of parent-child relationships – Point-to-point content validation between systems – Standardize analysis methodology – Standardize problem notation – Standardize reporting PAGE 24
  • 25. data governance and data warehousing Leveraging data profiling for data warehousing • Data warehouse development – Leverage data models and data profiling results to locate and map business data to the data warehouse – Eliminate the code-load-explode development methodology for ETL • Profile each data source to validate data content • Identify accurate requirements for transformations to consolidate data content and correct data quality issues – Use profiling results to determine model metadata for target staging databases and the data warehouse – Profile the data warehouse regularly to ensure high quality data content PAGE 25
  • 27. real-life examples Public computer hardware and software manufacturer • Introduced data profiling into ongoing data warehousing project – Profiled first data source • Found questionable data content in financial data within ten minutes of profiling data • Realized that six months were wasted mapping from the data source to the target data warehouse • All new data sources were profiled going forward to ensure validity PAGE 27
  • 28. real-life examples Large public food manufacturer • Introduced data profiling into sales data warehouse project – Leveraged data profiling results to create accurate ETL specifications, reducing the overall development time – Developers utilized data profiling to validate ETL unit testing – Used cross-system analysis to integrate data content from disparate data sources into standardized values in data warehouse – Profiled data warehouse regularly to identify data content issues PAGE 28
  • 29. real-life examples Public healthcare insurance provider • Introduced data profiling into ongoing master data management project – Performed data content mapping utilizing profiling results – Analyzed IMS extracts and flat files to determine where reference data lived within legacy mainframe data sources – Leveraged profiling results to create ETL specifications – Harmonized reference data using profiling results – Validated reference data loaded into MDM hub PAGE 29
  • 30. real-life examples Medium-sized accounting service organization • Created data store for reporting purposes – Profiled disparate data sources to identify model metadata for new data store – Leveraged profiling results to identify data quality issues for each data source – Created ETL specifications to consolidate data content from the disparate data sources using the profiling results – Validated data content in the loaded data store PAGE 30
  • 31. summary • Data Profiling • Increases accuracy of data content and metadata • Reduces project overrun • Increases value of deliverables to the business • Valuable for master data management, data warehousing, data governance, and other data intensive initiatives PAGE 31