2. abstract
• This session explores how data profiling can increase the accuracy
of critical data assets and their associated data models and
metadata. It includes examples of how clients have combined data
profiling with data modeling for master data management, data
warehousing, data governance, and other data-intensive initiatives.
3. biography
• Antonio C. Amorin
President, Data Innovations, Inc.
– Eighteen years of data modeling experience and fourteen years of
experience using CA ERwin® Data Modeler
– Ten years of data profiling experience and two years of experience using CA
ERwin® Data Profiler
– Data Innovations, Inc. – CA Partner since 2004
– Presented at CA World’08, CBI’s Life Sciences Forum on “Customer Data
Quality and Integrity”, ERwin User Groups, webcasts, and client sites
– Graduated from Illinois State University with a BA in Computer Science and
a minor in Economics
4. agenda
• Data Profiling
• Data and Metadata Quality
• Data Governance and Data Warehousing
• Real-life Examples
• Summary
6. data profiling
• What is data profiling?
– The analysis of data content to infer metadata
– A component of data modeling
• What are the basic components of the CA ERwin® Data Profiler?
– Column analysis
– PF (primary-foreign) key analysis
– Data object analysis
– Overlap analysis
7. data profiling
• Column analysis
– Inferred metadata provides intimate knowledge of the data content at the column level
• Cardinality
• Range
• Mode
• Sparse
• Null count
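As a rough illustration of the column-level indicators listed above, the following Python sketch computes them for a single column. It is not how CA ERwin® Data Profiler computes them: `profile_column` and the sample values are hypothetical, empty strings are assumed to count as nulls, and "sparse" is assumed to mean the fraction of null or empty values.

```python
from collections import Counter

def profile_column(values):
    """Compute basic column-level profiling indicators for one column."""
    # Assumption: None and empty strings both count as null.
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    null_count = len(values) - len(non_null)
    return {
        "cardinality": len(counts),                       # distinct non-null values
        "range": (min(non_null), max(non_null)) if non_null else None,
        "mode": counts.most_common(1)[0][0] if counts else None,
        "null_count": null_count,
        "sparseness": null_count / len(values) if values else 0.0,
    }

# Hypothetical sample: a state-code column with one null and one empty value.
sample = ["IL", "IL", "WI", None, "IN", "", "IL"]
print(profile_column(sample))
```

A column with a high sparseness value or an unexpectedly low cardinality would be a natural candidate for closer review.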
8. data profiling
• Column analysis (continued)
• Value frequencies
• Pattern frequencies
• Length frequencies
• Identify critical data elements
– Allows the user to focus analysis on specific attributes
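The three frequency distributions above can be sketched in a few lines of Python. This is a minimal illustration, not the profiler's implementation: the pattern notation (`9` for a digit, `A` for a letter, other characters kept as they are) and the names `pattern_of` and `frequencies` are assumptions made for this example.

```python
from collections import Counter

def pattern_of(value):
    """Replace digits with '9' and letters with 'A'; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def frequencies(values):
    """Return value, pattern, and length frequencies for non-null strings."""
    values = [v for v in values if v is not None]
    return {
        "value": Counter(values),
        "pattern": Counter(pattern_of(v) for v in values),
        "length": Counter(len(v) for v in values),
    }

# Hypothetical phone-number column containing mixed formats.
phones = ["309-555-0100", "309-555-0101", "3095550102", "309-555-0100", "N/A"]
freq = frequencies(phones)
print(freq["pattern"].most_common())
print(freq["length"].most_common())
```

Rare patterns or lengths, such as the unformatted number and the `N/A` placeholder in this sample, tend to point at data quality issues or at a data type that does not fit the content.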
10. data profiling
• PF key analysis (continued)
– Cross-table analysis of primary-foreign key relationships
• Expressions
– table.column=table.column
• Row hit rate
• Value hit rate
• Selectivity
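The slide does not define these measures, so the sketch below works under stated assumptions: row hit rate is taken as the share of non-null child rows whose key value exists in the parent, value hit rate as the share of distinct child values found in the parent, and selectivity as distinct parent values divided by parent rows (1.0 meaning unique). The function name and the `customer` / `orders` tables are hypothetical.

```python
def pf_key_analysis(parent_keys, child_keys):
    """Evaluate a candidate relationship such as orders.customer_id = customer.customer_id."""
    parent_set = set(parent_keys)
    child_non_null = [k for k in child_keys if k is not None]
    child_set = set(child_non_null)
    row_hits = sum(1 for k in child_non_null if k in parent_set)
    return {
        # Assumed definitions; the profiler's own formulas may differ.
        "row_hit_rate": row_hits / len(child_non_null) if child_non_null else 0.0,
        "value_hit_rate": len(child_set & parent_set) / len(child_set) if child_set else 0.0,
        "selectivity": len(parent_set) / len(parent_keys) if parent_keys else 0.0,
        "orphans": sorted(child_set - parent_set),  # child values with no parent
    }

customer_ids = [1, 2, 3, 4]                 # customer.customer_id
order_customer_ids = [1, 1, 2, 5, 5, 5, None]  # orders.customer_id
print(pf_key_analysis(customer_ids, order_customer_ids))
```

Under these definitions, a selectivity below 1.0 suggests the parent column is not a true primary key, and a row hit rate below 1.0 indicates orphaned child rows.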
11. data profiling
• Data objects
– Similar to subject areas
– Groups together objects that contain the same data content
– Based on the parent-child relationships
– Creates an object view of related tables or files
12. data profiling
• Overlap analysis
– Cross-system analysis that identifies data content overlap
– Data Set Summary
• Provides graphical overview
– Legend identifies color-coded data sources
– Allows modeler to visualize data content overlap between data sources
13. data profiling
• Overlap analysis (continued)
– Data set overlaps
• Table compares each data source to the other data sources
• Allows comparison of two data sources at a time
• Identifies the number of tables and columns that overlap between each data source
14. data profiling
• Overlap analysis (continued)
– Column Summary
• Identifies each column in the primary data source
• Identifies value overlap between data sources
• Allows modeler to use critical data elements to focus analysis
• Allows modeler to drill into analysis to identify data content overlap
15. data profiling
• Overlap analysis (continued)
– Matches data preview
• Allows the modeler to view hits or misses
• Identifies specific data content that overlaps or does not overlap between each data source
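The idea of value overlap with hits and misses can be illustrated by comparing the distinct values of one column in two data sources. This is a simplified sketch, not the tool's algorithm; `value_overlap` and the CRM and ERP sample columns are hypothetical.

```python
def value_overlap(primary_values, other_values):
    """Compare the distinct values of one column across two data sources."""
    primary, other = set(primary_values), set(other_values)
    hits = primary & other      # values present in both sources
    misses = primary - other    # values only in the primary source
    pct = 100.0 * len(hits) / len(primary) if primary else 0.0
    return {"hits": sorted(hits), "misses": sorted(misses), "overlap_pct": pct}

# Hypothetical state columns from two systems.
crm_states = ["IL", "WI", "IN", "Ill."]
erp_states = ["IL", "IN", "MI"]
print(value_overlap(crm_states, erp_states))
```

The misses are usually the more interesting list: here they expose both a value missing from the second source and a nonstandard spelling that would need harmonizing.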
17. data and metadata quality
• Data
– Business data - information utilized to operate the business
• Metadata
– Information generated during the development of IT solutions
– Defines both the business and technical understanding of the data
– Utilized to store, process, and report on business data
18. data and metadata quality
• Data Quality
– Accuracy of the business data
– High/low quality
– Mission critical
• Metadata quality
– Properly represents data content
– Validate parent-child relationships
19. data and metadata quality
• Leveraging data profiling
– Use the cardinality, range, mode, and sparse indicators to identify attributes
requiring detailed analysis
– Identify data quality issues and validate data types using the value and
pattern frequencies
– Leverage the null count and length frequencies to validate column metadata
– Validate parent-child relationships using the primary-foreign key analysis
– Leverage the overlap analysis with reference tables containing valid values
for data quality assessments
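One of the points above, validating column metadata with the null count and length frequencies, can be sketched as a comparison between what the model declares and what the data contains. The function name, its parameters, and the wording of the findings are hypothetical choices for this example.

```python
def validate_column_metadata(values, declared_max_length, declared_nullable):
    """Compare declared column metadata with observed nulls and lengths."""
    null_count = sum(1 for v in values if v is None)
    lengths = [len(v) for v in values if v is not None]
    max_observed = max(lengths, default=0)
    findings = []
    if null_count and not declared_nullable:
        findings.append(f"{null_count} null value(s) in a NOT NULL column")
    if max_observed > declared_max_length:
        findings.append(
            f"observed length {max_observed} exceeds declared length {declared_max_length}"
        )
    if null_count == 0 and declared_nullable:
        findings.append("no nulls observed; column may be a NOT NULL candidate")
    return findings

# Hypothetical column declared as CHAR(2) NOT NULL in the model.
for finding in validate_column_metadata(["A1", "B22", None], 2, False):
    print(finding)
```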
21. data governance and data warehousing
Leveraging data profiling for data governance
• Business Data
– Standards
– Master data management
– Data quality assessments
• Metadata
– Standards
– Model validation
22. data governance and data warehousing
Leveraging data profiling for data governance (continued)
• Standards
– Business data: valid values, data patterns, and standardized values for static
data content
– Metadata: validate that model metadata properly represents data content, and
validate parent-child relationships
– Automate the analysis with profiling
– Develop profiling reports for each standard
– Define and implement a review process
– Integrate standards and review process into SDLC
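Automating a data pattern standard and reporting on it might look like the following sketch. The two standards shown (a US ZIP code pattern and a two-letter state code) and the names `STANDARDS` and `standards_report` are illustrative assumptions, not standards taken from the presentation.

```python
import re

# Hypothetical pattern standards, one per governed column.
STANDARDS = {
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "state_code": re.compile(r"[A-Z]{2}"),
}

def standards_report(column_name, values):
    """Report which non-null values violate the column's pattern standard."""
    pattern = STANDARDS[column_name]
    checked = [v for v in values if v is not None]
    violations = [v for v in checked if not pattern.fullmatch(v)]
    passed = len(checked) - len(violations)
    return {
        "column": column_name,
        "checked": len(checked),
        "violations": violations,
        "compliance_pct": 100.0 * passed / len(checked) if checked else 100.0,
    }

print(standards_report("zip_code", ["61701", "61701-1234", "6170", "ABCDE"]))
```

A report like this, produced for each standard on a schedule, gives the review process a consistent artifact to examine.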
23. data governance and data warehousing
Leveraging data profiling for data governance (continued)
• Master data management (MDM)
– Locating reference data
– Data mapping
– Harmonizing reference data
– Establishing validations and syndication rules
– Identifying hub metadata
– Data quality assessments
24. data governance and data warehousing
Leveraging data profiling for data governance (continued)
• Data quality assessments
– Comprehensive review at the column level
– Validation of primary keys
– Validation of parent-child relationships
– Point-to-point content validation between systems
– Standardize analysis methodology
– Standardize problem notation
– Standardize reporting
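Primary key validation, one of the checks listed above, comes down to confirming that a candidate key is never null and never repeated. The sketch below assumes rows are available as dictionaries; `validate_primary_key` and the sample rows are hypothetical.

```python
from collections import Counter

def validate_primary_key(rows, key_columns):
    """Check that a (possibly composite) candidate key is non-null and unique."""
    keys = [tuple(row.get(col) for col in key_columns) for row in rows]
    null_keys = [k for k in keys if any(part is None for part in k)]
    duplicates = {k: n for k, n in Counter(keys).items() if n > 1}
    return {
        "is_valid": not null_keys and not duplicates,
        "null_keys": len(null_keys),
        "duplicates": duplicates,
    }

# Hypothetical rows with one duplicated and one missing key.
rows = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": None}]
print(validate_primary_key(rows, ["id"]))
```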
25. data governance and data warehousing
Leveraging data profiling for data warehousing
• Data warehouse development
– Leverage data models and data profiling results to locate and map business
data to the data warehouse
– Eliminate the code-load-explode development methodology for ETL
• Profile each data source to validate data content
• Identify accurate requirements for transformations to consolidate data content
and correct data quality issues
– Use profiling results to determine model metadata for target staging
databases and the data warehouse
– Profile the data warehouse regularly to ensure high quality data content
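Using profiling results to propose metadata for a staging table can be sketched as simple type and length inference over raw string values, such as those read from a flat file. The inference rules and the type names are deliberately simplistic assumptions, and `infer_column_metadata` is a hypothetical helper.

```python
def infer_column_metadata(values):
    """Propose a data type and nullability from raw string values."""
    # Assumption: None and empty strings both count as null.
    non_null = [v for v in values if v is not None and v != ""]
    nullable = len(non_null) < len(values)

    def all_parse(cast):
        try:
            for v in non_null:
                cast(v)
            return True
        except ValueError:
            return False

    if non_null and all_parse(int):
        data_type = "INTEGER"
    elif non_null and all_parse(float):
        data_type = "DECIMAL"
    else:
        max_len = max((len(v) for v in non_null), default=1)
        data_type = f"VARCHAR({max_len})"
    return {"data_type": data_type, "nullable": nullable}

print(infer_column_metadata(["10", "20", ""]))
print(infer_column_metadata(["IL", "Wis"]))
```

In practice the proposed types would be reviewed against the source data model before being applied to the staging database or the warehouse.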
27. real-life examples
Public computer hardware and software manufacturer
• Introduced data profiling into ongoing data warehousing
project
– Profiled first data source
• Found questionable data content in financial data within ten minutes of
profiling data
• Realized that six months had been wasted mapping from the data source to
the target data warehouse
• All new data sources were profiled going forward to ensure validity
28. real-life examples
Large public food manufacturer
• Introduced data profiling into sales data warehouse project
– Leveraged data profiling results to create accurate ETL specifications,
reducing the overall development time
– Developers utilized data profiling to validate ETL unit testing
– Used cross-system analysis to integrate data content from disparate data
sources into standardized values in data warehouse
– Profiled data warehouse regularly to identify data content issues
29. real-life examples
Public healthcare insurance provider
• Introduced data profiling into ongoing master data management
project
– Performed data content mapping utilizing profiling results
– Analyzed IMS extracts and flat files to determine where reference data lived
within legacy mainframe data sources
– Leveraged profiling results to create ETL specifications
– Harmonized reference data using profiling results
– Validated reference data loaded into MDM hub
30. real-life examples
Medium-sized accounting service organization
• Created data store for reporting purposes
– Profiled disparate data sources to identify model metadata for new data
store
– Leveraged profiling results to identify data quality issues for each data
source
– Created ETL specifications to consolidate data content from the disparate
data sources using the profiling results
– Validated data content in the loaded data store
31. summary
• Data Profiling
• Increases accuracy of data content and metadata
• Reduces project overruns
• Increases value of deliverables to the business
• Valuable for master data management, data warehousing, data
governance, and other data-intensive initiatives