Iterative Data Discovery and Transformation with OpenRefine
1. Martin Magdinier - @magdmartin 1
Iterative data discovery and transformation with OpenRefine
Martin Magdinier - @magdmartin
OpenRefine - @OpenRefine
http://openrefine.org
2.
80% of data analysis is spent on cleaning, transforming, and integrating data
3.
Data Quality & Integration Is Time Consuming
• Duplicate values and typos
• Multi-valued cells
• Data in the wrong field
• Missing or partial values
• Encoding errors
• Change format (text, number, date)
• Flat to relational data set
• Schema alignment
• Transpose rows and columns
• Join data sets
• Enrichment from other sources (MDM, API calls)
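Two of the issues listed above, duplicate values with typos and multi-valued cells, can be illustrated with a short sketch. The code below is not OpenRefine's implementation; it is a minimal Python approximation in the spirit of key-collision ("fingerprint") clustering, where values that normalize to the same key are proposed as one cluster. The function names and the `;` separator are assumptions made for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Normalize a string so that near-duplicate spellings share one key."""
    # Decompose accented characters and drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Trim, lowercase, and remove ASCII punctuation.
    cleaned = ascii_text.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate and sort the tokens, then rejoin.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values whose fingerprints collide; keep groups with 2+ spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(vs) for key, vs in groups.items() if len(vs) > 1}


def split_multi_valued(cell: str, separator: str = ";"):
    """Split a multi-valued cell into trimmed, non-empty values."""
    return [part.strip() for part in cell.split(separator) if part.strip()]
```

A reviewer would normally inspect each proposed cluster before merging, since a key collision only suggests that two values may be the same.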
4.
OpenRefine Bridges The Skill Gap
Data Preparation
Understand The Data (Business Skills)
Know How To Transform Data (Technical Skills)
User Base: DBA, ETL, Data Science, Spreadsheet User, Data Visualization / Interpretation
5.
• SaaS and on-premise solution for extra compute power, collaboration and lightweight ETL
• On-demand training
• Custom development
• Free & Open Source
• Community developed for 5 years
• Available on local machine only
• 5,000+ monthly downloads
• Strong user base in Open Data, Library, Semantic Web and Bio Science
6.
Agile Data Process
Data Engineer: Scale & Automate Processes, Data Quality, Manage Master Data
7.
Agile Data Process
Data Engineer: Scale & Automate Processes, Data Quality, Manage Master Data
Data Scientist: Develop Machine Learning & Data Analysis Model
8.
Agile Data Process: Discovery, Data Wrangling, Profiling, Preparation, Quality, Integration
IT: Support, Governance, Access To Data
Data Engineer: Scale & Automate Processes, Data Quality, Manage Master Data
Data Scientist: Develop Machine Learning & Data Analysis Model
Business Analyst: Sense Making, Data Exploration, Reporting, Analysis
Scale, Real-Time, Lightweight ETL, Migration
9.
Agile Data Process: Discovery, Data Wrangling, Profiling, Preparation, Quality, Integration
Business Analyst
IT: Support, Governance, Access To Data
Data Engineer: Scale & Automate Processes, Data Quality, Manage Master Data
Data Scientist: Develop Machine Learning & Data Analysis Model
ETL Tools
10.
Demo: 2014 Toronto Cleared Building Permits
http://ow.ly/Js8GD
Data Discovery
1. What types of permits are issued?
2. Explore Previous Usage, Application Date and Dwelling Units Created
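The discovery questions above amount to counting the distinct values in a column, which is what a text facet does in OpenRefine. The following is a minimal Python sketch of that idea, not OpenRefine code. The column name `PERMIT_TYPE` and the sample rows are invented for illustration and are not taken from the Toronto data set.

```python
from collections import Counter

# Hypothetical sample rows; column name and values are assumptions.
sample_rows = [
    {"PERMIT_TYPE": "Type A"},
    {"PERMIT_TYPE": "Type B"},
    {"PERMIT_TYPE": "Type A"},
    {"PERMIT_TYPE": ""},
    {"PERMIT_TYPE": "Type A "},  # trailing space, a common data quality issue
]


def text_facet(rows, column):
    """Count each distinct value in a column, most frequent first.

    Values are trimmed before counting, and empty cells are reported
    under the label "(blank)".
    """
    counts = Counter()
    for row in rows:
        value = (row.get(column) or "").strip()
        counts[value if value else "(blank)"] += 1
    return counts.most_common()


if __name__ == "__main__":
    for value, count in text_facet(sample_rows, "PERMIT_TYPE"):
        print(f"{value}: {count}")
```

Scanning such a facet is often the quickest way to spot unexpected categories, blanks, and inconsistent spellings before any transformation is applied.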
Data Preparation
1. Geocode with Google Maps API
2. Map construction projects with over 10 new Dwelling Units Created
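The two preparation steps above can be sketched outside OpenRefine as follows. This is an illustrative Python approximation, not the workflow shown in the demo. It assumes the Google Geocoding API's JSON endpoint and its `results[0].geometry.location` response layout; the column name `DWELLING_UNITS_CREATED`, the function names, and the API key placeholder are hypothetical. No network request is made here; the code only builds the request URL and parses an already-decoded response.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_url(address: str, api_key: str) -> str:
    """Build a Geocoding API request URL for one address."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def extract_location(payload: dict):
    """Return (lat, lng) from a decoded geocoding response, or None."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def large_projects(rows, column="DWELLING_UNITS_CREATED", threshold=10):
    """Keep rows whose numeric column value is strictly above the threshold.

    Rows with a missing or non-numeric value are skipped rather than
    treated as zero.
    """
    selected = []
    for row in rows:
        try:
            units = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue
        if units > threshold:
            selected.append(row)
    return selected
```

Filtering before geocoding keeps the number of API calls small, which matters because geocoding services are typically rate limited and billed per request.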