SlideShare uma empresa Scribd logo
1 de 11
Baixar para ler offline
Martin Magdinier - @magdmartin 1
Iterative data discovery and
transformation with
Martin Magdinier - @magdmartin
OpenRefine - @OpenRefine
http://openrefine.org
Martin Magdinier - @magdmartin 2
80% of data analysis
is spent on the process of
cleaning, transformation and integration
Martin Magdinier - @magdmartin 3
• Duplicate value & Typos
• Multi value cells
• Data in the wrong field
• Missing / Partial Values
• Encoding Errors
• Change format (text, number, date)
• Flat to relational data set
• Schema alignment
• Transpose rows and columns
• Join data-set
• Enrichment from other sources
(MDM, API calls)
Data Quality & Integration &
Is Time Consuming
Martin Magdinier - @magdmartin 4
OpenRefine
Bridges The Skill Gap
DBA
ETL
Data Science
Spreadsheet User
Data Visualization / Interpretation
Data
Preparation
Understand The Data
(Business Skills)
Know How To Transform
Data
(Technical Skills)
User Base
Martin Magdinier - @magdmartin 5
• SaaS and on-premise solution for extra
compute power, collaboration and
lightweight ETL
• On demand training
• Custom development
• Free & Open Source
• Community developed for 5 years
• Available on local machine only
• 5,000+ monthly download
• Strong user base with Open Data,
Library, Semantic web and Bio
Science
Semantic
WebLibraryBio
ScienceOpen
Data
Martin Magdinier - @magdmartin 6
Data Engineer
Scale & Automate
Processes
Data Quality
Manage
Master
Data
Agile Data Process
Martin Magdinier - @magdmartin 7
Data Engineer
Scale & Automate
Processes
Data Quality
Manage
Master
Data
Data Scientist
Develop
Machine Learning
&
Data Analysis
Model
Agile Data Process
Martin Magdinier - @magdmartin 8
Data Engineer
IT
Support
Governance
Access To Data
Scale & Automate
Processes
Data Quality
Manage
Master
Data
Data Scientist
Discovery
Data Wrangling
Profiling
Preparation
Quality
Integration
Agile Data Process
Business Analyst
Develop
Machine Learning
&
Data Analysis
Model
Sense Making
Data Exploration
Reporting
Analysis
Scale
Real -Time
Lightweight ETL
Migration
Martin Magdinier - @magdmartin 9
Business Analyst
Data Engineer
IT
Support
Governance
Access To Data
Scale & Automate
Processes
Data Quality
Manage
Master
Data
Data Scientist
Discovery
Data Wrangling
Profiling
Preparation
Quality
Integration
Agile Data Process
Develop
Machine Learning
&
Data Analysis
Model
ETL
Tools
Martin Magdinier - @magdmartin 10
Demo: 2014 Toronto
Cleared Building Permits
http://ow.ly/Js8GD
Data Discovery
1. What of Permit Type are
issued?
2. Explore Previous usage ;
Application Date &
Dwelling Units Created
Data Preparation
1. Geocode with Google
Maps API
2. Map Construction with
over 10 new Dwelling
Units Created
Martin Magdinier - @magdmartin 11
Iterative data discovery and
transformation with
Martin Magdinier - @magdmartin
OpenRefine - @OpenRefine
http://openrefine.org

Mais conteúdo relacionado

Mais procurados

When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesStitch Fix Algorithms
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataOntotext
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataTrieu Nguyen
 
Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?Cambridge Semantics
 
Enabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and ReuseEnabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and ReuseMarin Dimitrov
 
How to migrate to GraphDB in 10 easy to follow steps
How to migrate to GraphDB in 10 easy to follow steps How to migrate to GraphDB in 10 easy to follow steps
How to migrate to GraphDB in 10 easy to follow steps Ontotext
 
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...semanticsconference
 
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...semanticsconference
 
Fast Data processing with RFX
Fast Data processing with RFXFast Data processing with RFX
Fast Data processing with RFXTrieu Nguyen
 
High quality Linked Data generation for librarians
High quality Linked Data generation for librariansHigh quality Linked Data generation for librarians
High quality Linked Data generation for librariansandimou
 
On-Demand RDF Graph Databases in the Cloud
On-Demand RDF Graph Databases in the CloudOn-Demand RDF Graph Databases in the Cloud
On-Demand RDF Graph Databases in the CloudMarin Dimitrov
 
Lightning Talk: Get Even More Value from MongoDB Applications
Lightning Talk: Get Even More Value from MongoDB ApplicationsLightning Talk: Get Even More Value from MongoDB Applications
Lightning Talk: Get Even More Value from MongoDB ApplicationsMongoDB
 
AllegroGraph - Cognitive Probability Graph webcast
AllegroGraph - Cognitive Probability Graph webcastAllegroGraph - Cognitive Probability Graph webcast
AllegroGraph - Cognitive Probability Graph webcastFranz Inc. - AllegroGraph
 
iLastic: Linked Data Generation Workflow and User Interface for iMinds Schola...
iLastic: Linked Data Generation Workflow and User Interface for iMinds Schola...iLastic: Linked Data Generation Workflow and User Interface for iMinds Schola...
iLastic: Linked Data Generation Workflow and User Interface for iMinds Schola...andimou
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormRevolution Analytics
 

Mais procurados (20)

When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML Pipelines
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open Data
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Tracking data lineage at Stitch Fix
Tracking data lineage at Stitch FixTracking data lineage at Stitch Fix
Tracking data lineage at Stitch Fix
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big data
 
Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?
 
Enabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and ReuseEnabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and Reuse
 
How to migrate to GraphDB in 10 easy to follow steps
How to migrate to GraphDB in 10 easy to follow steps How to migrate to GraphDB in 10 easy to follow steps
How to migrate to GraphDB in 10 easy to follow steps
 
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
 
No sql databases
No sql databasesNo sql databases
No sql databases
 
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
 
NoSQL Type, Bigdata, and Analytics
NoSQL Type, Bigdata, and AnalyticsNoSQL Type, Bigdata, and Analytics
NoSQL Type, Bigdata, and Analytics
 
Fast Data processing with RFX
Fast Data processing with RFXFast Data processing with RFX
Fast Data processing with RFX
 
High quality Linked Data generation for librarians
High quality Linked Data generation for librariansHigh quality Linked Data generation for librarians
High quality Linked Data generation for librarians
 
Big data technologies with Case Study Finance and Healthcare
Big data technologies with Case Study Finance and HealthcareBig data technologies with Case Study Finance and Healthcare
Big data technologies with Case Study Finance and Healthcare
 
On-Demand RDF Graph Databases in the Cloud
On-Demand RDF Graph Databases in the CloudOn-Demand RDF Graph Databases in the Cloud
On-Demand RDF Graph Databases in the Cloud
 
Lightning Talk: Get Even More Value from MongoDB Applications
Lightning Talk: Get Even More Value from MongoDB ApplicationsLightning Talk: Get Even More Value from MongoDB Applications
Lightning Talk: Get Even More Value from MongoDB Applications
 
AllegroGraph - Cognitive Probability Graph webcast
AllegroGraph - Cognitive Probability Graph webcastAllegroGraph - Cognitive Probability Graph webcast
AllegroGraph - Cognitive Probability Graph webcast
 
iLastic: Linked Data Generation Workflow and User Interface for iMinds Schola...
iLastic: Linked Data Generation Workflow and User Interface for iMinds Schola...iLastic: Linked Data Generation Workflow and User Interface for iMinds Schola...
iLastic: Linked Data Generation Workflow and User Interface for iMinds Schola...
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and Storm
 

Semelhante a Iterative data discovery and transformation with open refine

Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...Precisely
 
Deliveinrg explainable AI
Deliveinrg explainable AIDeliveinrg explainable AI
Deliveinrg explainable AIGary Allemann
 
Metadata Strategies
Metadata StrategiesMetadata Strategies
Metadata StrategiesDATAVERSITY
 
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Precisely
 
Analyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop WebinarAnalyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop WebinarDatameer
 
Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingCaserta
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
 
Somuvadali 180712051740
Somuvadali 180712051740Somuvadali 180712051740
Somuvadali 180712051740somu-vadali
 
Qiagram
QiagramQiagram
Qiagramjwppz
 
Data Systems Integration & Business Value Pt. 1: Metadata
Data Systems Integration & Business Value Pt. 1: MetadataData Systems Integration & Business Value Pt. 1: Metadata
Data Systems Integration & Business Value Pt. 1: MetadataDATAVERSITY
 
Data-Ed: Data Systems Integration & Business Value PT. 1: Metadata
Data-Ed: Data Systems Integration & Business Value PT. 1: MetadataData-Ed: Data Systems Integration & Business Value PT. 1: Metadata
Data-Ed: Data Systems Integration & Business Value PT. 1: MetadataData Blueprint
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
IT6701 Information Management - Unit III
IT6701 Information Management - Unit IIIIT6701 Information Management - Unit III
IT6701 Information Management - Unit IIIpkaviya
 
Metadata Strategies - Data Squared
Metadata Strategies - Data SquaredMetadata Strategies - Data Squared
Metadata Strategies - Data SquaredDATAVERSITY
 
Master Your Data. Master Your Business
Master Your Data. Master Your BusinessMaster Your Data. Master Your Business
Master Your Data. Master Your BusinessDLT Solutions
 
Learn How to Turbocharge Your AI/ML Data Workflows with Data Enrichment
Learn How to Turbocharge Your AI/ML Data Workflows with Data EnrichmentLearn How to Turbocharge Your AI/ML Data Workflows with Data Enrichment
Learn How to Turbocharge Your AI/ML Data Workflows with Data EnrichmentPrecisely
 
Getting Meta at Mesa
Getting Meta at MesaGetting Meta at Mesa
Getting Meta at MesaSafe Software
 
Predictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupPredictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupCaserta
 
Data-Ed: Business Value From MDM
Data-Ed: Business Value From MDM Data-Ed: Business Value From MDM
Data-Ed: Business Value From MDM Data Blueprint
 
Data-Ed Online Webinar: Business Value from MDM
Data-Ed Online Webinar: Business Value from MDMData-Ed Online Webinar: Business Value from MDM
Data-Ed Online Webinar: Business Value from MDMDATAVERSITY
 

Semelhante a Iterative data discovery and transformation with open refine (20)

Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
 
Deliveinrg explainable AI
Deliveinrg explainable AIDeliveinrg explainable AI
Deliveinrg explainable AI
 
Metadata Strategies
Metadata StrategiesMetadata Strategies
Metadata Strategies
 
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
 
Analyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop WebinarAnalyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop Webinar
 
Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
Somuvadali 180712051740
Somuvadali 180712051740Somuvadali 180712051740
Somuvadali 180712051740
 
Qiagram
QiagramQiagram
Qiagram
 
Data Systems Integration & Business Value Pt. 1: Metadata
Data Systems Integration & Business Value Pt. 1: MetadataData Systems Integration & Business Value Pt. 1: Metadata
Data Systems Integration & Business Value Pt. 1: Metadata
 
Data-Ed: Data Systems Integration & Business Value PT. 1: Metadata
Data-Ed: Data Systems Integration & Business Value PT. 1: MetadataData-Ed: Data Systems Integration & Business Value PT. 1: Metadata
Data-Ed: Data Systems Integration & Business Value PT. 1: Metadata
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
IT6701 Information Management - Unit III
IT6701 Information Management - Unit IIIIT6701 Information Management - Unit III
IT6701 Information Management - Unit III
 
Metadata Strategies - Data Squared
Metadata Strategies - Data SquaredMetadata Strategies - Data Squared
Metadata Strategies - Data Squared
 
Master Your Data. Master Your Business
Master Your Data. Master Your BusinessMaster Your Data. Master Your Business
Master Your Data. Master Your Business
 
Learn How to Turbocharge Your AI/ML Data Workflows with Data Enrichment
Learn How to Turbocharge Your AI/ML Data Workflows with Data EnrichmentLearn How to Turbocharge Your AI/ML Data Workflows with Data Enrichment
Learn How to Turbocharge Your AI/ML Data Workflows with Data Enrichment
 
Getting Meta at Mesa
Getting Meta at MesaGetting Meta at Mesa
Getting Meta at Mesa
 
Predictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupPredictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing Meetup
 
Data-Ed: Business Value From MDM
Data-Ed: Business Value From MDM Data-Ed: Business Value From MDM
Data-Ed: Business Value From MDM
 
Data-Ed Online Webinar: Business Value from MDM
Data-Ed Online Webinar: Business Value from MDMData-Ed Online Webinar: Business Value from MDM
Data-Ed Online Webinar: Business Value from MDM
 

Último

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Último (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Iterative data discovery and transformation with open refine

  • 1. Martin Magdinier - @magdmartin 1 Iterative data discovery and transformation with Martin Magdinier - @magdmartin OpenRefine - @OpenRefine http://openrefine.org
  • 2. Martin Magdinier - @magdmartin 2 80% of data analysis is spent on the process of cleaning, transformation and integration
  • 3. Martin Magdinier - @magdmartin 3 • Duplicate value & Typos • Multi value cells • Data in the wrong field • Missing / Partial Values • Encoding Errors • Change format (text, number, date) • Flat to relational data set • Schema alignment • Transpose rows and columns • Join data-set • Enrichment from other sources (MDM, API calls) Data Quality & Integration & Is Time Consuming
  • 4. Martin Magdinier - @magdmartin 4 OpenRefine Bridges The Skill Gap DBA ETL Data Science Spreadsheet User Data Visualization / Interpretation Data Preparation Understand The Data (Business Skills) Know How To Transform Data (Technical Skills) User Base
  • 5. Martin Magdinier - @magdmartin 5 • SaaS and on-premise solution for extra compute power, collaboration and lightweight ETL • On demand training • Custom development • Free & Open Source • Community developed for 5 years • Available on local machine only • 5,000+ monthly download • Strong user base with Open Data, Library, Semantic web and Bio Science Semantic WebLibraryBio ScienceOpen Data
  • 6. Martin Magdinier - @magdmartin 6 Data Engineer Scale & Automate Processes Data Quality Manage Master Data Agile Data Process
  • 7. Martin Magdinier - @magdmartin 7 Data Engineer Scale & Automate Processes Data Quality Manage Master Data Data Scientist Develop Machine Learning & Data Analysis Model Agile Data Process
  • 8. Martin Magdinier - @magdmartin 8 Data Engineer IT Support Governance Access To Data Scale & Automate Processes Data Quality Manage Master Data Data Scientist Discovery Data Wrangling Profiling Preparation Quality Integration Agile Data Process Business Analyst Develop Machine Learning & Data Analysis Model Sense Making Data Exploration Reporting Analysis Scale Real -Time Lightweight ETL Migration
  • 9. Martin Magdinier - @magdmartin 9 Business Analyst Data Engineer IT Support Governance Access To Data Scale & Automate Processes Data Quality Manage Master Data Data Scientist Discovery Data Wrangling Profiling Preparation Quality Integration Agile Data Process Develop Machine Learning & Data Analysis Model ETL Tools
  • 10. Martin Magdinier - @magdmartin 10 Demo: 2014 Toronto Cleared Building Permits http://ow.ly/Js8GD Data Discovery 1. What of Permit Type are issued? 2. Explore Previous usage ; Application Date & Dwelling Units Created Data Preparation 1. Geocode with Google Maps API 2. Map Construction with over 10 new Dwelling Units Created
  • 11. Martin Magdinier - @magdmartin 11 Iterative data discovery and transformation with Martin Magdinier - @magdmartin OpenRefine - @OpenRefine http://openrefine.org