SlideShare uma empresa Scribd logo
1 de 10
2013-06-26
Knowledge Workers Toronto
Methods
1
We are surrounded by data
2013-06-26
Knowledge Workers Toronto
Methods
2
... by MESSY data
- Multiple standards and formats
Structured vs unstructured
Field nomination and format varies ...
- Human Error (misspellings, errors, etc)
- Non-normalized inputs (free-text entries)
- Incomplete data (laziness)
....
2013-06-26
Knowledge Workers Toronto
Methods
3
2013-06-26
Knowledge Workers Toronto
Methods
4
OpenRefine the
- Swiss army knife for data manipulation!
- glue step between your IT systems
2013-06-26
Knowledge Workers Toronto
Methods
5
What's OpenRefine
(former Google Refine, former Gridworks)
- A Cross platform Web Application that runs
locally
- A Community based project hosted on GitHub
- Which have two distributions and multiple
extensions
- Something between a spreadsheet and SQL
2013-06-26
Knowledge Workers Toronto
Methods
6
Four use cases
1. Data Profiling *
2. Data Cleaning *
3. Data extension *
4. ETL (Extract Transform Load) Prototyping
2013-06-26
Knowledge Workers Toronto
Methods
7
File 1: Data Profiling &
Cleaning
City Subject Thesaurus XML file with 5431
concept.
What we will do:
- Explore the file
- Fix inconsistencies
- Transpose / Merge fields
2013-06-26
Knowledge Workers Toronto
Methods
8
File 2: The Economist Best
City Contest 2012
Prepare Data for an application to the Economist
Intelligence Units (EIU) Best City Contest 2012
using G. Hofstede Values Survey Module 2008
What we will do:
- Clean Duplicate
- Create New Data
- Add data from a different project
- Use Project History
2013-06-26
Knowledge Workers Toronto
Methods
9
OpenRefine
http://openrefine.org
@OpenRefine
Martin Magdinier
martin.magdinier@gmail.com
@magdmartin
Thanks!
2013-06-26
Knowledge Workers Toronto
Methods
10
DESCRIPTOR The preferred term
FAC Facet
SC Subject category of the term
SN Scope note
SRC Source of term
UF Used for
BT Broader term
City Subject Thesaurus
Legend
NT Narrower term
RT Related term
STA Term status
INP Input date
APP Approval date
UPD Modified date
TNR Term number

Mais conteúdo relacionado

Mais procurados

Introduction to Data Structures
Introduction to Data StructuresIntroduction to Data Structures
Introduction to Data Structuresnayanbanik
 
Preparing for BIT – IT2301 Database Management Systems 2001e
Preparing for BIT – IT2301 Database Management Systems 2001ePreparing for BIT – IT2301 Database Management Systems 2001e
Preparing for BIT – IT2301 Database Management Systems 2001eGihan Wikramanayake
 
StaTIX - Statistical Type Inference on Linked Data
StaTIX - Statistical Type Inference on Linked DataStaTIX - Statistical Type Inference on Linked Data
StaTIX - Statistical Type Inference on Linked DataArtem Lutov
 
Healthcare Data Management using Domain Specific Languages for Metadata Manag...
Healthcare Data Management using Domain Specific Languages for Metadata Manag...Healthcare Data Management using Domain Specific Languages for Metadata Manag...
Healthcare Data Management using Domain Specific Languages for Metadata Manag...David Milward
 
Data structure (basics)
Data structure (basics)Data structure (basics)
Data structure (basics)ShrushtiGole
 
Data structure unitfirst part1
Data structure unitfirst part1Data structure unitfirst part1
Data structure unitfirst part1Amar Rawat
 
Importing data in Oasis Montaj
Importing data in Oasis MontajImporting data in Oasis Montaj
Importing data in Oasis MontajAmin khalil
 
self designed Linked List
self designed Linked Listself designed Linked List
self designed Linked ListDAYASAGAR KADAM
 
Making data typing efforts or automatically detecting data types for automat...
Making data typing efforts or automatically detecting data types  for automat...Making data typing efforts or automatically detecting data types  for automat...
Making data typing efforts or automatically detecting data types for automat...National Institute of Informatics
 
Introduction to data structures (ss)
Introduction to data structures (ss)Introduction to data structures (ss)
Introduction to data structures (ss)Madishetty Prathibha
 

Mais procurados (16)

Introduction to Data Structures
Introduction to Data StructuresIntroduction to Data Structures
Introduction to Data Structures
 
Preparing for BIT – IT2301 Database Management Systems 2001e
Preparing for BIT – IT2301 Database Management Systems 2001ePreparing for BIT – IT2301 Database Management Systems 2001e
Preparing for BIT – IT2301 Database Management Systems 2001e
 
StaTIX - Statistical Type Inference on Linked Data
StaTIX - Statistical Type Inference on Linked DataStaTIX - Statistical Type Inference on Linked Data
StaTIX - Statistical Type Inference on Linked Data
 
Healthcare Data Management using Domain Specific Languages for Metadata Manag...
Healthcare Data Management using Domain Specific Languages for Metadata Manag...Healthcare Data Management using Domain Specific Languages for Metadata Manag...
Healthcare Data Management using Domain Specific Languages for Metadata Manag...
 
Data structures
Data structuresData structures
Data structures
 
Linked lists
Linked listsLinked lists
Linked lists
 
Data structure (basics)
Data structure (basics)Data structure (basics)
Data structure (basics)
 
Data structure unitfirst part1
Data structure unitfirst part1Data structure unitfirst part1
Data structure unitfirst part1
 
Data mining
Data miningData mining
Data mining
 
Building intelligent systems with FAIR data
Building intelligent systems with FAIR dataBuilding intelligent systems with FAIR data
Building intelligent systems with FAIR data
 
Importing data in Oasis Montaj
Importing data in Oasis MontajImporting data in Oasis Montaj
Importing data in Oasis Montaj
 
PPL, OQL & oodbms
PPL, OQL & oodbmsPPL, OQL & oodbms
PPL, OQL & oodbms
 
self designed Linked List
self designed Linked Listself designed Linked List
self designed Linked List
 
Making data typing efforts or automatically detecting data types for automat...
Making data typing efforts or automatically detecting data types  for automat...Making data typing efforts or automatically detecting data types  for automat...
Making data typing efforts or automatically detecting data types for automat...
 
Introduction to data structures (ss)
Introduction to data structures (ss)Introduction to data structures (ss)
Introduction to data structures (ss)
 
Ef overview
Ef overviewEf overview
Ef overview
 

Destaque

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Administracion Empresarial 122116
Administracion Empresarial 122116Administracion Empresarial 122116
Administracion Empresarial 122116sena
 
Building the Global Open Knowledgebase (ER&L 2013)
Building the Global Open Knowledgebase (ER&L 2013)Building the Global Open Knowledgebase (ER&L 2013)
Building the Global Open Knowledgebase (ER&L 2013)GOKb Project
 

Destaque (6)

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
LOD2 Webinar Series: Zemanta / Open refine
LOD2 Webinar Series: Zemanta / Open refine LOD2 Webinar Series: Zemanta / Open refine
LOD2 Webinar Series: Zemanta / Open refine
 
Administracion Empresarial 122116
Administracion Empresarial 122116Administracion Empresarial 122116
Administracion Empresarial 122116
 
Administracion Empresarial 122116
Administracion Empresarial 122116Administracion Empresarial 122116
Administracion Empresarial 122116
 
Building the Global Open Knowledgebase (ER&L 2013)
Building the Global Open Knowledgebase (ER&L 2013)Building the Global Open Knowledgebase (ER&L 2013)
Building the Global Open Knowledgebase (ER&L 2013)
 

Semelhante a 20130626 OpenRefine Introduction

Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data ManagementeXascale Infolab
 
Data Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdfData Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdfRAKESHG79
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryMark Grover
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
 
Big Data Infrastructure and Analytics Solution on FITAT2013
Big Data Infrastructure and Analytics Solution on FITAT2013Big Data Infrastructure and Analytics Solution on FITAT2013
Big Data Infrastructure and Analytics Solution on FITAT2013Erdenebayar Erdenebileg
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachMihai Criveti
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Intelligent Data Processing for the Internet of Things
Intelligent Data Processing for the Internet of Things Intelligent Data Processing for the Internet of Things
Intelligent Data Processing for the Internet of Things PayamBarnaghi
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsGeoffrey Fox
 
Introduction to Smart Data Models
Introduction to Smart Data ModelsIntroduction to Smart Data Models
Introduction to Smart Data ModelsFIWARE
 
The need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formatsThe need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formatsMarkus Neteler
 
An On-line Collaborative Data Management System
An On-line Collaborative Data Management SystemAn On-line Collaborative Data Management System
An On-line Collaborative Data Management SystemCameron Kiddle
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data introvafopoulos
 

Semelhante a 20130626 OpenRefine Introduction (20)

20130206 open refine
20130206  open refine20130206  open refine
20130206 open refine
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
Big Data przt.pptx
Big Data przt.pptxBig Data przt.pptx
Big Data przt.pptx
 
Data Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdfData Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdf
 
DBMS.ppt
DBMS.pptDBMS.ppt
DBMS.ppt
 
Uc13.chapter.14
Uc13.chapter.14Uc13.chapter.14
Uc13.chapter.14
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 
Big Data Infrastructure and Analytics Solution on FITAT2013
Big Data Infrastructure and Analytics Solution on FITAT2013Big Data Infrastructure and Analytics Solution on FITAT2013
Big Data Infrastructure and Analytics Solution on FITAT2013
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Intelligent Data Processing for the Internet of Things
Intelligent Data Processing for the Internet of Things Intelligent Data Processing for the Internet of Things
Intelligent Data Processing for the Internet of Things
 
unit-4-notes.pdf
unit-4-notes.pdfunit-4-notes.pdf
unit-4-notes.pdf
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
Introduction to Smart Data Models
Introduction to Smart Data ModelsIntroduction to Smart Data Models
Introduction to Smart Data Models
 
The need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formatsThe need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formats
 
Emerging Technologies in IT
Emerging Technologies in ITEmerging Technologies in IT
Emerging Technologies in IT
 
An On-line Collaborative Data Management System
An On-line Collaborative Data Management SystemAn On-line Collaborative Data Management System
An On-line Collaborative Data Management System
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data intro
 

Último

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 

Último (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

20130626 OpenRefine Introduction