Diagnosing
Dirty Data
Jaimi Dowdell, IRE/NICAR
Jennifer LaFleur, ProPublica
Get your data's history
• Know the source of the data
• Know how it's used
• Know what all the fields mean
• Know what other stories have
been done with it
What is dirty data?
• Missing records
• Incorrect information
• Duplicate information
• No standardization
Take your data's
temperature
• How many records should you have?
• Double-check totals or counts. Check them against
studies or summary reports.
• Check for duplicates. Make sure they are
real duplicates. Is it possible that there are
hidden duplicates?
• Consistency-check all fields. Are all
city/county names spelled the same? Are
all codes found within documentation?
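A minimal sketch of these temperature checks, using only Python's standard library. The sample rows, the field names (id, name, city) and the profile() helper are hypothetical, made up for illustration. "Hidden" duplicates are approximated here as rows that match once case and extra spaces are ignored, which is only one possible definition.

```python
import csv
from collections import Counter

# Hypothetical sample data; in practice, read a copy of the real file instead.
SAMPLE_LINES = [
    "id,name,city",
    "1,Ann Lee,Springfield",
    "2,Ann Lee,springfield ",
    "3,Bob Ray,Shelbyville",
    "1,Ann Lee,Springfield",
    "4,Cy Dow,Sprngfield",
]

def normalize(value):
    """Collapse whitespace and ignore case so near-identical values compare equal."""
    return " ".join(value.split()).casefold()

def profile(rows, key_fields, check_field):
    """Count records, exact duplicates, possible hidden duplicates, and spellings."""
    exact = Counter(tuple(r.values()) for r in rows)
    hidden = Counter(tuple(normalize(r[f]) for f in key_fields) for r in rows)
    return {
        "record_count": len(rows),
        "exact_duplicates": {k: n for k, n in exact.items() if n > 1},
        "possible_hidden_duplicates": {k: n for k, n in hidden.items() if n > 1},
        # Every distinct raw spelling of one field, with how often it appears.
        "spellings": Counter(r[check_field] for r in rows),
    }

rows = list(csv.DictReader(SAMPLE_LINES))
report = profile(rows, key_fields=["name", "city"], check_field="city")
for label, value in report.items():
    print(label, value)
```

Rows flagged as possible hidden duplicates still need a human look: two people can share a name and a city, so a match on normalized fields is a lead, not proof.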
Internal consistency
checks
• Is there more money going to sub-contractors than went to
the prime contractor?
• Are there more teachers than students?
• How about other important fields?
• Check the range of fields. (For example, check for DOBs
that would make people too old or too young.)
• Check for missing data or blank fields. Are they real values,
or did something happen with an import or append query?
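Checks like these can be scripted so they run the same way every time. The sketch below is one possible version, not a standard recipe: the records, the field names (dob, prime_amount, sub_amount), the 110-year age cutoff and the fixed reference date are all assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical records; field names and values are invented for this example.
records = [
    {"id": "A", "dob": "1980-05-17", "prime_amount": 1000.0, "sub_amount": 400.0},
    {"id": "B", "dob": "1880-01-01", "prime_amount": 500.0, "sub_amount": 900.0},
    {"id": "C", "dob": "", "prime_amount": 250.0, "sub_amount": 0.0},
    {"id": "D", "dob": "2030-02-02", "prime_amount": 800.0, "sub_amount": 100.0},
]

def check_record(rec, as_of, max_age=110):
    """Return a list of plain-language problems found in one record."""
    problems = []
    if not rec["dob"].strip():
        problems.append("blank dob")
    else:
        dob = date.fromisoformat(rec["dob"])
        # Age in whole years as of the reference date.
        age = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
        if age < 0:
            problems.append("dob in the future")
        elif age > max_age:
            problems.append("implausibly old")
    if rec["sub_amount"] > rec["prime_amount"]:
        problems.append("sub-contractors exceed prime")
    return problems

AS_OF = date(2013, 6, 1)  # arbitrary fixed reference date so results are repeatable
flagged = {r["id"]: check_record(r, AS_OF) for r in records}
flagged = {rec_id: probs for rec_id, probs in flagged.items() if probs}
print(flagged)
```

A flag is a question to ask the agency, not a correction: the blank date of birth in record C may be a real missing value or the product of a bad import.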
External Checks
• Compare to reports
• Data reported to other agencies
• On-the-ground reporting
• Verification from sources
Steps for cleaning data
• Assess the problem
• Identify your goal
• Find the right tool for the job
• Set aside time (double what you think)
• Make a backup copy
• Make a backup copy
• Never alter the original data. Make new
columns so you can compare and show
your work.
• Create an audit trail.
• Spot check as you go.
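One way to follow the "never alter the original" and "audit trail" steps in code is sketched below. The CITY_FIXES lookup, the city_clean column name and the clean_city() helper are hypothetical; the point is only that the raw value stays in place, the cleaned value goes in a new column, and every change is logged.

```python
import copy

# Hypothetical standardization map, keyed on lowercased, space-collapsed text.
CITY_FIXES = {
    "st. louis": "Saint Louis",
    "st louis": "Saint Louis",
    "saint louis": "Saint Louis",
}

def clean_city(rows, audit_log):
    """Return a cleaned copy of rows; the input list is left untouched."""
    cleaned = copy.deepcopy(rows)  # work on a copy, never on the original
    for i, row in enumerate(cleaned):
        raw = row["city"]
        key = " ".join(raw.split()).casefold()
        # New column next to the raw one, so the two can be compared.
        row["city_clean"] = CITY_FIXES.get(key, raw.strip())
        if row["city_clean"] != raw:
            audit_log.append(
                {"row": i, "field": "city", "from": raw, "to": row["city_clean"]}
            )
    return cleaned

original = [{"city": "St. Louis"}, {"city": " saint louis "}, {"city": "Columbia"}]
audit = []
working = clean_city(original, audit)
for entry in audit:
    print(entry)
```

Saving the audit list alongside the data notebook makes it possible to show, row by row, what was changed and why.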
Tips for success
• Keep a data notebook
• Duplicate your work
• Duplicate your work
• Bounce your results off folks who really know
the data
• Set up some standards for your
work/newsroom
Choose the right
tool
• You don't need to be fancy, just get the job done
• Work with what you're comfortable with
• Don't forget the power of Excel
• Text editors can be lifesavers
• Many tools exist: OpenRefine, programming, etc.
• Get training as needed
Focus is important
So get plenty
of food and rest
Get a data
buddy
Common ailments
Dates that aren't dates
Names, names, names...
Location matters
Leading and trailing spaces
"Pretty" reports
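Two of these ailments, dates that aren't dates and leading and trailing spaces, can be screened for with a short script. This is a sketch under stated assumptions: the list of accepted formats is invented, and treating 06/21/2013 as month/day/year is a guess that has to be confirmed with the data's documentation or its keeper.

```python
from datetime import datetime

# Assumed formats; adjust to what the documentation says the field contains.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")

def parse_date(text):
    """Return a date if text matches an assumed format, otherwise None."""
    candidate = text.strip()  # leading and trailing spaces break exact matches
    for fmt in FORMATS:
        try:
            return datetime.strptime(candidate, fmt).date()
        except ValueError:
            continue
    return None

# Hypothetical values as they might arrive in a text column.
values = [" 2013-06-20", "06/21/2013 ", "20130622", "13/40/2013", "N/A"]
parsed = [parse_date(v) for v in values]
unparsed = [v for v, d in zip(values, parsed) if d is None]
print(parsed)
print(unparsed)
```

Values that fail to parse are kept in their own list rather than dropped, so they can be reviewed by hand.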
Inoperable data: Pain management
• Explain caveats
• Choose your wording carefully
• Know when to leave out records
• Be transparent
• Know what questions can and can't be
answered with this dataset
• Know when to get more information
Continue learning about dirty data: Sat. 3:40 p.m.
Conference Room 11
BYOD (Bring your own data): Sat. 4:50 p.m.,
Conference Room 11
Get your hands dirty
Jennifer.lafleur@propublica.org (@j_la28)
jaimi@ire.org (@jaimidowdell)
Questions?

Diagnosing dirty data_ire2013
