DATA MUNGING
The Good, the Bad, and the Ugly

Presented by: Daniel D. Gutierrez
DATA MUNGING CAN TAKE WORK

• I kept getting burned by the data munging phase!
• The importance of data munging to the success of a data science project must be understood
• The level of difficulty depends on the quality of the data
• More work is required for dirty, inconsistent, malformed data
• Can often amount to 70% of overall project time & budget
• Need to work with the person delivering the data: the ETL engineer
GIVE DATA MUNGING SOME RESPECT

• The data munging phase is often trivialized
• New data scientists are not always informed about the complexity of data munging: Coursera
• Example: the amount of data munging work behind winning entries in the Kaggle Heritage Health Prize competition. Much of that data munging was done in SQL
USE CASE EXAMPLE

• I was given a data set by a client domain “expert”
• She clearly wanted me to read her mind!
• The data was awful: inconsistent data types, loads of missing values, poor structure, outliers
• Delivered in Excel
• Took many meetings with department staff to iron things out BEFORE the data munging could even commence
• Feature engineering can become “social engineering” – traveling up the corporate food chain to get answers
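The problems listed in this use case (inconsistent data types, missing values, outliers) can be surfaced with a quick audit before any meetings are scheduled. The following is a minimal sketch in plain Python, not the workflow actually used for the client. The column `raw_ages`, the helper names, the list of missing-value tokens, and the modified z-score cutoff of 3.5 are all illustrative assumptions.

```python
from statistics import median

# Hypothetical raw column as it might arrive from a spreadsheet export:
# numbers stored as text, blanks, placeholder strings, and one extreme value.
raw_ages = ["34", 41, " 29 ", "", None, "N/A", "38", 36.0, "410", "forty"]

# Assumed spellings of "no value"; a real data set needs its own list.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def coerce_number(value):
    """Return (number, status), where status is 'ok', 'missing', or 'unparseable'."""
    if value is None:
        return None, "missing"
    if isinstance(value, (int, float)):
        return float(value), "ok"
    text = str(value).strip()
    if text.lower() in MISSING_TOKENS:
        return None, "missing"
    try:
        return float(text), "ok"
    except ValueError:
        return None, "unparseable"


def audit_column(values, cutoff=3.5):
    """Count missing and unparseable entries and flag outliers.

    Outliers are flagged with a modified z-score based on the median
    absolute deviation (MAD), which is one common choice among many.
    """
    parsed = [coerce_number(v) for v in values]
    numbers = [n for n, status in parsed if status == "ok"]
    report = {
        "total": len(values),
        "missing": sum(1 for _, status in parsed if status == "missing"),
        "unparseable": [v for v, (_, s) in zip(values, parsed) if s == "unparseable"],
        "outliers": [],
    }
    if numbers:
        center = median(numbers)
        mad = median([abs(n - center) for n in numbers])
        if mad > 0:
            report["outliers"] = [
                n for n in numbers if 0.6745 * abs(n - center) / mad > cutoff
            ]
    return report


report = audit_column(raw_ages)
print(report)
```

A report like this gives the conversation with the data owner something concrete to start from: which cells are blank, which cannot be read as numbers, and which values look implausible.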
A DATA MUNGING RESOURCE

• Here is an outline from Hadley Wickham’s Ph.D. thesis
• “First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future”
• http://had.co.nz/thesis/practical-tools-hadley-wickham.pdf
• In Chapter 2 he talks a lot about data munging using the reshape package: melting and casting
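To give a feel for what melting and casting mean, here is a rough sketch of the two ideas in plain Python. It is not the reshape package's R API: the functions `melt` and `cast`, their parameters, and the small `wide` table are hypothetical stand-ins written only to illustrate the concept. Melting turns a wide table into long rows of (identifiers, variable, value); casting regroups those long rows into a new shape and aggregates each cell.

```python
from collections import defaultdict

# Hypothetical wide table: one row per patient visit, one column per measurement.
wide = [
    {"patient": "A", "visit": 1, "weight": 70.0, "pulse": 62},
    {"patient": "A", "visit": 2, "weight": 71.5, "pulse": 65},
    {"patient": "B", "visit": 1, "weight": 82.0, "pulse": 71},
]


def melt(rows, id_vars):
    """Wide -> long: emit one row per (id values, variable, value)."""
    long_rows = []
    for row in rows:
        ids = {key: row[key] for key in id_vars}
        for key, value in row.items():
            if key not in id_vars:
                long_rows.append({**ids, "variable": key, "value": value})
    return long_rows


def cast(long_rows, row_var, col_var, aggregate):
    """Long -> wide: group by row_var, spread col_var into columns,
    and reduce the values that land in each cell with aggregate."""
    cells = defaultdict(list)
    for row in long_rows:
        cells[(row[row_var], row[col_var])].append(row["value"])
    table = defaultdict(dict)
    for (row_key, col_key), values in cells.items():
        table[row_key][col_key] = aggregate(values)
    return dict(table)


def mean(values):
    return sum(values) / len(values)


molten = melt(wide, id_vars=("patient", "visit"))
per_patient = cast(molten, "patient", "variable", mean)

print(molten[0])
print(per_patient)
```

Once the data is molten, the same long table can be cast into different summaries (per patient, per visit, and so on) without reshaping the original by hand each time.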
THANK YOU!

• Web: www.amuletanalytics.com
• Twitter: @AMULETAnalytics
• Email: dan@amuletc.com
