Big data offers the promise of a data-driven business model generating new revenue and competitive advantage fueled by new business insights, AI, and machine learning. Yet without high quality data that provides trust, confidence, and understanding, business leaders continue to rely on gut instinct to drive business decisions.
The critical foundation and first step toward delivering high-quality data in support of a truly data-driven view is data profiling: a proven capability to analyze actual data content and help you understand what's really there.
View this webinar on-demand to learn five core concepts to effectively apply data profiling to your big data, assess and communicate the quality issues, and take the first step to big data quality and a data-driven business.
2. Housekeeping
Webcast Audio
• Today’s webcast audio is streamed through your computer speakers.
• If you need technical assistance with the web interface or audio,
please reach out to us using the chat window.
Questions Welcome
• Submit your questions at any time during the presentation
using the chat window.
• Our team will reach out to you to answer them following the
presentation.
Recording and slides
• This webcast is being recorded. You will receive an
email following the webcast with a link to download
both the recording and the slides.
3. Speaker
Harald Smith
• Director of Product Marketing, Syncsort
• 20+ years in Information Management with a focus on
data quality, integration, and governance
• Co-author of Patterns of Information Management
• Author of two Redbooks on Information Governance
and Data Integration
• Blog author: “Data Democratized”
4. Only 35% of senior executives have a
high level of trust in the
accuracy of their Big Data
Analytics
KPMG 2016 Global CEO Outlook
92% of
executives are concerned
about the negative impact of
data and analytics on
corporate reputation
KPMG 2017 Global CEO Outlook
80% of AI/ML projects are stalling
due to poor data quality
Dimensional Research, 2019
Big Data Needs
Data Quality
“Societal trust in business is
arguably at an all-time low
and, in a world increasingly
driven by data and
technology,
reputations and brands are
ever harder to protect.”
EY “Trust in Data and Why it Matters”, 2017.
The importance of data
quality in the enterprise:
• Decision making
• Customer centricity
• Compliance
• Machine learning & AI
5. “The magic of machine learning is that you build a
statistical model based on the most valid dataset for
the domain of interest.
If the data is junk, then you’ll be building a junk
model that will not be able to do its job.”
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
6. Data Quality Challenges with Machine Learning
Incorrect, Incomplete, Mis-Formatted, and Sparse “Dirty Data” –
Mistakes and errors are almost never the patterns you’re looking for in
a data set. Sparse data generates other issues. Correcting and
standardizing will tend to boost the signal, but must account for bias.
Missing context – Many data sources lack context around location or
population segments. Unless enriched with other data sets (e.g.
geospatial, demographics, or firmographics data), some ML algorithms
will not be usable.
Multiple copies – If your data comes from many sources, as it often
does, it may contain multiple records of information about the same
person, company, product or other entity. Removing duplicates and
enhancing the overall depth and accuracy of knowledge about a single
entity can make a huge difference.
Spurious correlations – Just as missing context may hinder some ML
algorithms, inclusion of already correlated data (e.g. city and postal
code) may result in overfitting of ML algorithms.
Correcting data problems vastly increases a data set’s usefulness for machine learning.
But data analysts may not be aware of
specific data quality issues that must be
addressed to support machine learning.
Traditional data quality processes are
an effective method to identify defects.
7. Understanding Big Data Quality
Data Profiling
The set of analytical techniques that
evaluate actual data content (vs.
metadata) to provide a complete view
of each data element in a data source.
Provides summarized inferences, and
details of value and pattern frequencies
to quickly gain data insights.
Business Rules
The data quality or validation rules that
help ensure that data is “fit for use” in
its intended operational and decision-
making contexts.
Covers the accuracy, completeness,
consistency, relevance, timeliness and
validity of data.
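The value-and-pattern frequency analysis that defines profiling can be sketched in a few lines. This is a minimal illustration, not the API of any particular profiling tool; the `value_pattern` helper and the sample phone numbers are assumptions made for the example:

```python
from collections import Counter

def value_pattern(value: str) -> str:
    """Map each character to a pattern symbol: digit -> 9, letter -> A."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                   for c in value)

def profile_column(values):
    """Summarize one column: counts plus value and pattern frequencies."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "total": len(values),
        "null_count": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "value_freq": Counter(non_null),
        "pattern_freq": Counter(value_pattern(v) for v in non_null),
    }

# Hypothetical phone-number column with a missing value and a format drift
phones = ["555-1234", "555-9876", "5559876", None, "555-1234"]
report = profile_column(phones)
print(report["pattern_freq"])  # Counter({'999-9999': 3, '9999999': 1})
```

The pattern frequencies surface the inconsistently formatted value immediately, which is exactly the kind of quick insight the slide describes.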
8. Five Key Steps to effective Data Profiling
These are not new, but good to reiterate in the
context of Big Data:
1. How do you want to analyze the data?
2. What should you review? (there's a lot of stuff)
3. What should you look for? (based on data “type”)
4. When should you build rules? (laser-focus; CDE’s)
5. What needs to be communicated?
10. Universal DQ best practices:
Understand the End Goal
• How does the business intend to
use the data (i.e. what’s the use
case)?
• Empower users (“Who”) to gain
new clarity into the core problem
(“Why”)
• What will the data be used for?
• What defines the Fitness for your
Purpose?
Establish Scope
• Ask the “right questions” about the
use case and the data (not just
“what” and “how”)
• What data is relevant to the effort?
• Big Data or other, you need to set
boundaries for the work
Understand Context
• How does the business define the
data?
• What are the important
characteristics and context of the
data?
• What are the Critical Data
Elements?
• What qualities will you need to
address, or leave alone?
• “High-quality data” definition will
vary by business problem
“If you don’t know what you want to
get out of the data, how can you
know what data you need – and
what insight you’re looking for?”
Wolf Ruzicka, Chairman of the Board at EastBanc Technologies,
Blog post: June 1, 2017, “Grow A Data Tree Out Of The “Big Data” Swamp”
11. “Never lead with a data set;
lead with a question.”
Anthony Scriffignano, Chief Data Scientist, Dun & Bradstreet
Forbes Insights, May 31, 2017, “The Data Differentiator”
12. To Sample or not to Sample?
Sampling helps with:
• Data Integration
• Source-to-target mapping
• Data Modeling
• Discovering Correlations
When the focus is on the structure of the data
❖ REMEMBER: your target is a statistically
valid sample!
❖ ~16k records gives you 99% confidence
with a margin of error of 1% for 100B
records
❖ ~66k records gives you 99% confidence
with a margin of error of .5% for the same population
Full Volume needed with:
• Data Quality
• Data Governance
• Regulatory Compliance
• Finding Outliers and Issues
with Content
• “Needles in the haystack”
When the focus is on the quality of or risks
within the data
❖ Focus on critical data elements and
leverage tools that scale to data volume
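The sample sizes quoted above (~16k and ~66k records) follow from the standard sample-size formula with a finite-population correction; the sketch below reproduces them under the worst-case assumption p = 0.5. The function name is illustrative:

```python
import math

def sample_size(z: float, margin: float, population: int) -> int:
    """Cochran's sample-size formula with finite-population correction.

    Assumes worst-case proportion p = 0.5, which maximizes required n.
    """
    n0 = (z ** 2) * 0.25 / margin ** 2          # infinite-population estimate
    n = n0 / (1 + (n0 - 1) / population)         # finite-population correction
    return math.ceil(n)

# z = 2.576 for 99% confidence; population of 100 billion records
print(sample_size(2.576, 0.01, 100_000_000_000))   # ~16.6k
print(sample_size(2.576, 0.005, 100_000_000_000))  # ~66.4k
```

Note how little the population size matters: at these margins the correction term is negligible, which is why the same sample works for 100 billion records as for 10 million.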
13. Big Data at scale distributes data across many
nodes – not necessarily with other relevant data!
• Processing routines must apply same approach and logic each
time
• Implications for profiling, joining, sorting, and matching data,
whether for enrichment, verification against trusted sources, or a
consolidated single view
Data Quality functions must be performed in a consistent manner,
no matter where actual processing takes place, how the data is
segmented, and what the data volume is.
• Data quality cleansing and preparation routines have to be
reproduced at scale, both to get the data ready to train machine
learning models, and to comply with business regulations.
• Critical to establishing, building, and maintaining trust
Scaling Data Quality best practices:
Consistent processing at scale
Source: HP Analyst Briefing
15. Common Data Quality Measurements
What measures can we take advantage of?
1. Completeness – Are the relevant fields populated?
2. Integrity – Does the data maintain an internal structural
integrity or a relational integrity across sources?
3. Uniqueness – Are keys or records unique?
4. Validity – Does the data have the correct values?
• Code and reference values
• Valid ranges
• Valid value combinations
5. Consistency – Is the data at consistent levels of
aggregation or does it have consistent valid values
over time?
6. Timeliness – Did the data arrive in a time period
that makes it useful or usable?
16. New data, new data quality challenges
• 3rd Party and external data with unknown provenance or relevance
• Bias in the data – whether in collection, extraction, or other processing
• Data without standardized structure or formatting
• Continuously streaming data
• Disjointed data (e.g. gaps in receipt)
• Consistency and verification of data sources
• Changes and transformation applied to data (i.e. does it really
represent the original input)
New Data Quality Problems
“34 percent of bankers in our survey report that their organization
has been the target of adversarial AI at least once, and 78 percent
believe automated systems create new risks, such as fake data,
external data manipulation, and inherent bias.”
Accenture Banking Technology Vision 2018
17. • Contextual visualizations
• Value and pattern distributions
• Attribute summaries and metadata
• Sort and filter to quickly find data
of interest
• Detail drilldowns to any content
Let Data Profiling guide you
19. Common Data Types
What variations do you need to be aware of?
1. Identifiers – data that uniquely identifies something
2. Indicators – data that flags a specific condition
3. Dates – data that identifies a point in time
4. Quantities – data that identifies an amount or value of something
5. Codes – data that segments other data
6. Text – data that describes or names something
20. Identifiers
Use cases:
• Business Operations
• 360 View of Entity
• BI Reporting (incl. EDW)
• Analytics
• AI/ML
Examples:
• Customer ID
• National ID / Passport #
• Social Security # / Tax ID
• Product ID
What to look for:
• 100% Complete
• All Unique values
• Anomalous patterns
• Numeric vs. String
Notes:
• Needs full volume assessment
21. Indicators (aka Flags)
Use cases:
• Business Operations
• 360 View of Entity
• BI Reporting (incl. EDW)
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• True / False (or T/F)
• Yes / No (or Y/N)
• 1 / 0
What to look for:
• Binary Values only
• Consistent pattern
• No mixing of “Y” vs “YES”
• If NULL occurs, it must be resolved
to one of the binary values
• Skews in frequency
distributions
Notes:
• May need segmentation, filtering, or
grouping via business rules to resolve or
clarify discrepancies
• Often are triggers for other conditions –
look for use in business rules, but likely
occur downstream
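The checks in the “What to look for” list for indicators are straightforward to automate: count the value frequencies, then flag anything outside the expected binary pair. The flag column below is a made-up example:

```python
from collections import Counter

# Hypothetical indicator column mixing "Y"/"YES", case, and NULLs
flags = ["Y", "N", "YES", "y", None, "N", "Y"]

freq = Counter(flags)               # frequency distribution, including skew
allowed = {"Y", "N"}
violations = [v for v in flags if v not in allowed]

print(freq)
print(violations)  # ['YES', 'y', None] -> mixing and NULLs to resolve
```

The frequency distribution doubles as a skew check: a flag that is 99.9% “N” may be technically clean but carry little signal for a model.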
22. Codes
Use cases:
• Business Operations
• 360 View of Entity
• BI Reporting (incl. EDW)
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• Account Status
• Credit Rating
• Diagnosis/Procedure Codes
• Order Status
• Postal Code
What to look for:
• Expected values
• Consistent patterns
• No mixing of “A” vs “active”
• NULL values
• Skews in frequency
distributions
Notes:
• May need segmentation, filtering, or
grouping via business rules to resolve or
clarify discrepancies
• Often are triggers for or from other
conditions – look for use in business rules
• May correlate to other fields
23. Dates
Use cases:
• Business Operations
• BI Reporting (incl. EDW)
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• Birth Date
• Departure Date
• Order Date
• Shipping Date
• Timestamp
What to look for:
• Skews in frequency
distributions
• E.g. 01/01/2001
• Anomalous patterns
• Numeric vs. String
• Unusual values
• Missing values and gaps
Notes:
• May need segmentation, filtering, or
grouping via business rules to resolve or
clarify
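The frequency-skew check called out above (e.g. a suspicious spike on 01/01/2001) can be sketched as a most-common-value test; the dates and the 20% threshold are assumptions chosen for illustration:

```python
from collections import Counter

# Hypothetical date column where a placeholder default dominates
dates = (["2001-01-01"] * 40
         + ["2019-03-12", "2019-04-02", "2019-05-20"] * 5
         + ["2019-06-30"] * 5)

freq = Counter(dates)
top_value, top_count = freq.most_common(1)[0]
share = top_count / len(dates)

# Assumption: more than 20% of rows on a single date suggests a default
# value being substituted for missing data, not a real event date.
if share > 0.2:
    print(f"possible placeholder date: {top_value} ({share:.0%})")
```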
24. Quantities
Use cases:
• Business Operations
• BI Reporting (incl. EDW)
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• Amount (e.g. item count, amount due)
• Price
• Sales
• Total (e.g. order total)
What to look for:
• Skews in frequency
distributions
• Anomalous patterns
• Excessively high (or low)
values
Notes:
• May need segmentation, filtering, or
grouping via business rules to resolve or
clarify
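Excessively high or low quantities can be flagged with a standard interquartile-range fence; this is one common approach, not the only one, and the sample amounts are hypothetical:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# Hypothetical order amounts with one data-entry error
amounts = [19.99, 24.50, 21.00, 18.75, 23.10, 9_999_999.0]
print(iqr_outliers(amounts))  # [9999999.0]
```

On skewed business quantities (prices, totals), rank-based fences like this are usually safer than mean-and-standard-deviation rules, since a single bad value drags the mean.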
25. Text
Use cases:
• Business Operations
• Building blocks for other
identifiers!
• 360 View of Entity
• Governance and Compliance
• Analytics
• AI/ML
Examples:
• Name
• Address
• Product Description
• Claim Description
What to look for:
• Missing Values
• Frequency of patterns /
Anomalous patterns
• Existence of numerics
• Values <= 5 characters
• Compound values
• Unusual, recurring values
• “Do not use”
Notes:
• Look for correlations with Code values
that indicate specific conditions (e.g.
values used for testing purposes)
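Several of the text checks above (embedded numerics, very short values, recurring “do not use” placeholders) can be run as simple scans; the name values and placeholder list below are invented for the example:

```python
import re

# Hypothetical name field with common text-quality problems
names = ["Jane Smith", "DO NOT USE", "Test Test", "Acme Corp.", "John 123", "xx"]

issues = {
    "has_digits": [n for n in names if re.search(r"\d", n)],
    "too_short":  [n for n in names if len(n) <= 5],
    "suspect":    [n for n in names
                   if n.upper() in {"DO NOT USE", "TEST TEST"}],
}
print(issues)
```

The “suspect” list is the piece worth growing over time: every recurring placeholder you discover through profiling becomes a reusable lookup for the next data set.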
27. Focus on:
• Critical Data Elements (data quality dimensions)
• Policy-based conditions (e.g. regulatory
compliance)
• Correlated data conditions (e.g. If x, then y)
• Filtering and segmenting data (refining
evaluations; investigating root cause)
Build Rules for Defined Conditions
28. • Validate critical requirements within or
across data sources
• Build common rules that can be readily
tested and shared
• Evaluate and remediate issues
• Take action on incorrect data and defaults
• Create flags for subsequent use in marking
or remediating data
• Filter result sets and export for additional
use
Benefits of Business Rules
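A business rule of the kind described here, including the correlated “if x, then y” conditions and the flags created for remediation, can be sketched as follows. The rule names, fields, and regex are illustrative assumptions, not a specific product's rule syntax:

```python
import re

US_ZIP = re.compile(r"^\d{5}(-\d{4})?$")  # ZIP or ZIP+4

def apply_rules(row):
    """Evaluate correlated-data rules and attach flags for remediation."""
    flags = []
    # Rule: if country is US, postal code must match the ZIP pattern.
    if row.get("country") == "US" and not US_ZIP.match(row.get("postal", "")):
        flags.append("BAD_US_POSTAL")
    # Rule: shipped orders must carry a ship date.
    if row.get("status") == "shipped" and not row.get("ship_date"):
        flags.append("MISSING_SHIP_DATE")
    return {**row, "dq_flags": flags}

row = {"country": "US", "postal": "0213", "status": "shipped", "ship_date": None}
print(apply_rules(row)["dq_flags"])  # ['BAD_US_POSTAL', 'MISSING_SHIP_DATE']
```

Because the flags travel with the record, downstream steps can filter, export, or remediate flagged rows without re-running the rule logic.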
30. Culture of Data Literacy
• “Democratization of Data” requires cultural support
• Empowered to ask questions about the data
• Trained to understand and use data
• Trained to understand how to approach and evaluate data quality
• Traditional data, new data, machine learning requirements, …
• Understand the business context of the data
Program of Data Governance
• Provide the processes and practices necessary for success
• Measure, monitor, and improve
• Continuous iteration and development
Center of Excellence/Knowledge Base
• Where do you go to find answers?
• Who can help show you how?
Communicate!
31. • Annotate what you’ve found
• Identify the subject and add a description that is meaningful
• Utilize flags, tags, and other indicators to help others distinguish
types and severity of issues
• Integrate into data governance and BI tools for maximum visibility
Annotate Results with Findings
32. Summary
Evaluating Big Data
It is challenging to keep the end
goal in mind
• Data comes from multiple
disparate systems & sources
• The number of touchpoints for
policies and rules has grown
• There is a higher demand and
expectation for seeing data
quality in context.
• You need to assess and measure
the data content
5 Key Steps
• Remember the end goal – ask
questions, use best practices,
and establish scope & context
• Consider what criteria and
dimensions are needed
• Focus your attention based on
the type of data and the use case
• Build rules when necessary to
get laser-focused
• Determine what needs to be
communicated and delivered
Gaining insight and measurement of data quality is more critical than ever!