Cleanliness is next to Godliness
Deduplicating Your Customer Data
Parts of the talk
Talking about Data Quality
Techniques for Deduplication
Processing, Timing and Mindset
Part 1 Part 2 Part 3
Timeline
Once upon a time...
Age of information
Large amounts of data inputted by humans
Humans make mistakes...
Information is a significant raw material for businesses around the world.
Making data-based decisions
Wrong information leads to wrong decisions
Information as products
Bad and unimpressive products
Information for logistics
Company may shut down
Gathering Data from Humans
Paper forms
• Spelling mistakes
• Unclear questions
• Bare minimum information
• OCR
Web forms
• Bypassing filters
Tainting Existing Data
Changes in procedures
• Didn’t update older data
• Different data structures
• Different ways of handling data
Importing sources of (bad) data
Some Industry Jargon
Single View of Customer
• Marketing Campaigns
Single Version of the Truth
• Strategy
Getting Correct Reports
Consider this
You start a direct mail marketing campaign
And this happens...
Dear Mr ----- O’Brien,
We are delighted to inform you
that we have an amazing offer
specifically for you..
Avoiding Embarrassing Mistakes
• Marketing/PR
• Accounting
• Shipping
• Strategy
How much is it worth?
• 30% ROI (big consultancy)
• 10-25% loss of revenue due to bad data quality
• Competitive advantage
• Avoid going out of business
MFI Group
Founded 1964
Upgraded ERP systems in the early 2000s
Due to issues with data quality in 2004
• £46m in lost sales, £16m extra deliveries + technical costs and £20m for the actual system.
Administration 2008
(Comeback 2010)
Recap
Data Quality is a big subject
Avoid embarrassing mistakes
Keep company running efficiently
Good for reports
What Deduplication is used for
Increasing data quality
Compressing data
Pre-stage data cleansing needed
Matching
Techniques
• Address
• Name
• Fuzzy
• DOB
Business Rules
Quality Matching
Ask the Data
Address Matching
Databases
• Royal Mail (PAF)
• Council Address Data
• Do Your Own
Fill in missing parts
House Number, Building Number, House Name,
Flat Number, Company Name, Street, Locality,
Town, City, County, Country and Postcode
Name Matching
Name, Full name
Forename, Firstname,
Lastname, Surname
Initial
Middle name(s)
Title, Suffix
Qualification
Lord James Jonah William Smith 3rd
SQL example
SELECT c1.*, c2.*
FROM customers c1 INNER JOIN customers c2
ON c1.address_id = c2.address_id
WHERE c1.surname = c2.surname
AND c1.forename = c2.forename
-- middle names are equal, or exactly one of them is blank
AND (c1.middlename = c2.middlename
XOR (c1.middlename = '' XOR c2.middlename = ''));
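The same matching rule can be sketched outside the database. This is a minimal Python illustration, not the presenter's code: the `id` key and the `candidate_pairs` name are assumptions, and upper-casing stands in for a case-insensitive collation. Records are grouped by address, surname and forename, and a pair is kept when the middle names agree or at least one is blank.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(customers):
    """Yield (id, id) pairs of likely duplicates.

    `customers` is an iterable of dicts with the keys id, address_id,
    forename, middlename and surname (hypothetical column names).
    """
    # Group rows that share address, surname and forename.
    blocks = defaultdict(list)
    for row in customers:
        key = (row["address_id"], row["surname"].upper(), row["forename"].upper())
        blocks[key].append(row)

    # Within each group, compare every row with every other row once.
    for rows in blocks.values():
        for a, b in combinations(rows, 2):
            ma = a["middlename"].strip().upper()
            mb = b["middlename"].strip().upper()
            # Middle names must agree unless one of them is missing.
            if ma == mb or not ma or not mb:
                yield a["id"], b["id"]
```

Unlike the self-join, this never pairs a row with itself and reports each pair only once.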
Title  Forename  Middle  Surname   DOB
MR     MARK              MADANES   05/10/1963
MR     MARK              MADANES   04/10/1963
Title  Forename  Middle  Surname   DOB
MR     CIARAN    GERARD  O’NEILL   26/07/1971
MR     CIARAN    M       O’NEILL   26/07/1971
Title  Forename  Middle  Surname   DOB
MS     JAN               PHILMORE  15/10/1954
MR     JAN               PHILMORE  00/00/0000
Title  Forename  Middle  Surname   DOB
MR     ALBERTO           CARLOS    00/00/0000
MR     ALBERT    O       CARLOS    00/00/0000
Fuzzy Matching
Levenshtein
select levenshtein('jonathan','jonathon') -> 1
Download from: http://www.artfulsoftware.com/infotree/queries.php?&bw=1280#552
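For readers without the stored function to hand, the edit distance itself is short to write. The following is a plain Python sketch of the standard dynamic-programming algorithm, not the MySQL function linked above; the function name is chosen to mirror the slide.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character inserts, deletes or substitutions
    needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]
```

A small distance relative to the name length (for example 1 for 'jonathan' vs. 'jonathon') suggests a spelling variant rather than a different person, but the threshold is a business decision.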
Fuzzy Matching
Soundex
select soundex('jonathan') -> J535
Metaphone
echo metaphone('jonathan') -> JN0N
Title  Forename  Middle  Surname   DOB
       SAMUEL            JOHNSTONE 00/00/0000
MR     SAMUEL            JOHNSTON  00/00/0000
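To show why phonetic codes catch a pair like JOHNSTONE / JOHNSTON, here is a minimal Python sketch of the classic four-character American Soundex. It is an illustration only: MySQL's SOUNDEX() can return longer codes and treats some edge cases differently, so the two should not be mixed in one comparison.

```python
# Consonant groups that share a Soundex digit.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name: str) -> str:
    """Classic 4-character Soundex code, e.g. a letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]  # pad or truncate to four characters
```

Both surnames in the table above produce the same code, so they would be flagged as a candidate pair even though a strict string comparison fails.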
Business Rules
Certain Level of Correctness
Generic Rules and Source Specific Rules
Business Rules
Example
• Middle name: Adam Smith vs. Adam E Smith
• Title: Miss vs. Ms vs. Lady
• Initial: A Smith vs. Adam Smith (same address)
• Surnames: O`Brien vs. O'Brien vs. O’Brien
• More Surname: McDonald vs. Mc Donald vs. Mac Donald
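Rules like these are usually applied by normalising values before comparing them. The sketch below is one possible Python reading of the surname and middle-name examples; the function names and the exact rules (which apostrophe characters to fold, treating "Mac " and "Mc " as the same prefix) are assumptions for illustration, not rules taken from the talk.

```python
import re

# Characters commonly typed in place of a plain apostrophe (assumed list).
APOSTROPHE_VARIANTS = "`\u2018\u2019\u00b4"


def normalise_surname(surname: str) -> str:
    """Return a comparison key for a surname; not meant for display."""
    key = surname.strip().upper()
    for ch in APOSTROPHE_VARIANTS:
        key = key.replace(ch, "'")
    # "MC DONALD" and "MAC DONALD" become "MCDONALD". Only prefixes followed
    # by a space are touched, so a surname such as "MACE" is left alone.
    key = re.sub(r"^(MAC|MC)\s+", "MC", key)
    return key


def middle_names_compatible(a: str, b: str) -> bool:
    """Adam Smith vs. Adam E Smith: a missing middle name matches anything,
    an initial matches a full name starting with that letter."""
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return True
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b
```

Keeping the original value and storing the normalised key alongside it makes it possible to change the rules later without losing what the customer actually entered.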
Things to Watch Out for
Same father/son or mother/daughter names
Twins with same DOB
Initial for a forename
Mixing of forename with middle name
Changing surname after marriage
Quality Matching
Analyze data sources
How recent the data is
Ask the Data
Name popularity
Number of sources
• Example: 4 sources vs. 1 source say this spelling is right
Consider Using a Democratic System
Opposite of a hierarchical (if-then-else) system
Useful if the order of the rules is problematic
Business Rules + Asking the Data
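A "democratic" decision can be as simple as a majority vote across sources. The following Python sketch assumes each source contributes one value for a field and that every source has equal weight; the `vote` name and the tie-handling policy (return nothing and leave the decision to business rules) are illustrative choices.

```python
from collections import Counter


def vote(values):
    """Return the value most sources agree on, or None if there is
    no data or the top two candidates are tied."""
    counts = Counter(v.strip() for v in values if v and v.strip())
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: fall back to business rules or manual review
    return ranked[0][0]
```

With four sources saying "Jonathan" and one saying "Jonathon", the vote picks "Jonathan". Weighting by how recent or how trusted each source is would be a natural extension.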
Recap
Find address
Find duplicates
Try to make a decision for deduplication
• Business Rules
• Ask the Data
Processing
CPU/Disk/Memory bound
Sequential or parallel
Processing Data
Extra data
Result table
Temp data
Timing
On insert
A few minutes after insert (events)
Scheduled tasks
Pre-fetch
When user asks for it
New Data User Request
Points in Time
Using Your Team
DBAs
Database Developers/ETL experts
Data Analysts
Developers
Testers
Mindset
Never 100%
Best Effort
Pareto Principle
Continuous Improvement
Cost
Benefits
Final Recap
Continuous Improvements
Which duplicate is the correct one?
Combine business rules + ask the data
Questions & Answers
Contact Information:
MySQL-related questions about presentation?
Non-profit or Medical?
contact@jonathanlevin.co.uk
