SlideShare uma empresa Scribd logo
1 de 10
2013-06-26
Knowledge Workers Toronto
Methods
1
We are surrounded by data
2013-06-26
Knowledge Workers Toronto
Methods
2
... by MESSY data
- Multiple standards and formats
Structured vs unstructured
Field nomination and format varies ...
- Human Error (misspellings, errors, etc)
- Non-normalized inputs (free-text entries)
- Incomplete data (laziness)
....
2013-06-26
Knowledge Workers Toronto
Methods
3
2013-06-26
Knowledge Workers Toronto
Methods
4
OpenRefine the
- Swiss army knife for data manipulation!
- glue step between your IT systems
2013-06-26
Knowledge Workers Toronto
Methods
5
What's OpenRefine
(former Google Refine, former Gridworks)
- A Cross platform Web Application that runs
locally
- A Community based project hosted on GitHub
- Which have two distributions and multiple
extensions
- Something between a spreadsheet and SQL
2013-06-26
Knowledge Workers Toronto
Methods
6
Four use cases
1. Data Profiling *
2. Data Cleaning *
3. Data extension *
4. ETL (Extract Transform Load) Prototyping
2013-06-26
Knowledge Workers Toronto
Methods
7
File 1: Data Profiling &
Cleaning
City Subject Thesaurus XML file with 5431
concept.
What we will do:
- Explore the file
- Fix inconsistencies
- Transpose / Merge fields
2013-06-26
Knowledge Workers Toronto
Methods
8
File 2: The Economist Best
City Contest 2012
Prepare Data for an application to the Economist
Intelligence Units (EIU) Best City Contest 2012
using G. Hofstede Values Survey Module 2008
What we will do:
- Clean Duplicate
- Create New Data
- Add data from a different project
- Use Project History
2013-06-26
Knowledge Workers Toronto
Methods
9
OpenRefine
http://openrefine.org
@OpenRefine
Martin Magdinier
martin.magdinier@gmail.com
@magdmartin
Thanks!
2013-06-26
Knowledge Workers Toronto
Methods
10
DESCRIPTOR The preferred term
FAC Facet
SC Subject category of the term
SN Scope note
SRC Source of term
UF Used for
BT Broader term
City Subject Thesaurus
Legend
NT Narrower term
RT Related term
STA Term status
INP Input date
APP Approval date
UPD Modified date
TNR Term number

Mais conteúdo relacionado

Mais procurados

Introduction to Data Structures
Introduction to Data StructuresIntroduction to Data Structures
Introduction to Data Structuresnayanbanik
 
Preparing for BIT – IT2301 Database Management Systems 2001e
Preparing for BIT – IT2301 Database Management Systems 2001ePreparing for BIT – IT2301 Database Management Systems 2001e
Preparing for BIT – IT2301 Database Management Systems 2001eGihan Wikramanayake
 
StaTIX - Statistical Type Inference on Linked Data
StaTIX - Statistical Type Inference on Linked DataStaTIX - Statistical Type Inference on Linked Data
StaTIX - Statistical Type Inference on Linked DataArtem Lutov
 
Healthcare Data Management using Domain Specific Languages for Metadata Manag...
Healthcare Data Management using Domain Specific Languages for Metadata Manag...Healthcare Data Management using Domain Specific Languages for Metadata Manag...
Healthcare Data Management using Domain Specific Languages for Metadata Manag...David Milward
 
Data structure (basics)
Data structure (basics)Data structure (basics)
Data structure (basics)ShrushtiGole
 
Data structure unitfirst part1
Data structure unitfirst part1Data structure unitfirst part1
Data structure unitfirst part1Amar Rawat
 
Importing data in Oasis Montaj
Importing data in Oasis MontajImporting data in Oasis Montaj
Importing data in Oasis MontajAmin khalil
 
self designed Linked List
self designed Linked Listself designed Linked List
self designed Linked ListDAYASAGAR KADAM
 
Making data typing efforts or automatically detecting data types for automat...
Making data typing efforts or automatically detecting data types  for automat...Making data typing efforts or automatically detecting data types  for automat...
Making data typing efforts or automatically detecting data types for automat...National Institute of Informatics
 
Introduction to data structures (ss)
Introduction to data structures (ss)Introduction to data structures (ss)
Introduction to data structures (ss)Madishetty Prathibha
 

Mais procurados (16)

Introduction to Data Structures
Introduction to Data StructuresIntroduction to Data Structures
Introduction to Data Structures
 
Preparing for BIT – IT2301 Database Management Systems 2001e
Preparing for BIT – IT2301 Database Management Systems 2001ePreparing for BIT – IT2301 Database Management Systems 2001e
Preparing for BIT – IT2301 Database Management Systems 2001e
 
StaTIX - Statistical Type Inference on Linked Data
StaTIX - Statistical Type Inference on Linked DataStaTIX - Statistical Type Inference on Linked Data
StaTIX - Statistical Type Inference on Linked Data
 
Healthcare Data Management using Domain Specific Languages for Metadata Manag...
Healthcare Data Management using Domain Specific Languages for Metadata Manag...Healthcare Data Management using Domain Specific Languages for Metadata Manag...
Healthcare Data Management using Domain Specific Languages for Metadata Manag...
 
Data structures
Data structuresData structures
Data structures
 
Linked lists
Linked listsLinked lists
Linked lists
 
Data structure (basics)
Data structure (basics)Data structure (basics)
Data structure (basics)
 
Data structure unitfirst part1
Data structure unitfirst part1Data structure unitfirst part1
Data structure unitfirst part1
 
Data mining
Data miningData mining
Data mining
 
Building intelligent systems with FAIR data
Building intelligent systems with FAIR dataBuilding intelligent systems with FAIR data
Building intelligent systems with FAIR data
 
Importing data in Oasis Montaj
Importing data in Oasis MontajImporting data in Oasis Montaj
Importing data in Oasis Montaj
 
PPL, OQL & oodbms
PPL, OQL & oodbmsPPL, OQL & oodbms
PPL, OQL & oodbms
 
self designed Linked List
self designed Linked Listself designed Linked List
self designed Linked List
 
Making data typing efforts or automatically detecting data types for automat...
Making data typing efforts or automatically detecting data types  for automat...Making data typing efforts or automatically detecting data types  for automat...
Making data typing efforts or automatically detecting data types for automat...
 
Introduction to data structures (ss)
Introduction to data structures (ss)Introduction to data structures (ss)
Introduction to data structures (ss)
 
Ef overview
Ef overviewEf overview
Ef overview
 

Destaque

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Administracion Empresarial 122116
Administracion Empresarial 122116Administracion Empresarial 122116
Administracion Empresarial 122116sena
 
Building the Global Open Knowledgebase (ER&L 2013)
Building the Global Open Knowledgebase (ER&L 2013)Building the Global Open Knowledgebase (ER&L 2013)
Building the Global Open Knowledgebase (ER&L 2013)GOKb Project
 

Destaque (6)

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
LOD2 Webinar Series: Zemanta / Open refine
LOD2 Webinar Series: Zemanta / Open refine LOD2 Webinar Series: Zemanta / Open refine
LOD2 Webinar Series: Zemanta / Open refine
 
Administracion Empresarial 122116
Administracion Empresarial 122116Administracion Empresarial 122116
Administracion Empresarial 122116
 
Administracion Empresarial 122116
Administracion Empresarial 122116Administracion Empresarial 122116
Administracion Empresarial 122116
 
Building the Global Open Knowledgebase (ER&L 2013)
Building the Global Open Knowledgebase (ER&L 2013)Building the Global Open Knowledgebase (ER&L 2013)
Building the Global Open Knowledgebase (ER&L 2013)
 

Semelhante a 20130626 OpenRefine Introduction

Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data ManagementeXascale Infolab
 
Data Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdfData Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdfRAKESHG79
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryMark Grover
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
 
Big Data Infrastructure and Analytics Solution on FITAT2013
Big Data Infrastructure and Analytics Solution on FITAT2013Big Data Infrastructure and Analytics Solution on FITAT2013
Big Data Infrastructure and Analytics Solution on FITAT2013Erdenebayar Erdenebileg
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachMihai Criveti
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Intelligent Data Processing for the Internet of Things
Intelligent Data Processing for the Internet of Things Intelligent Data Processing for the Internet of Things
Intelligent Data Processing for the Internet of Things PayamBarnaghi
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsGeoffrey Fox
 
Introduction to Smart Data Models
Introduction to Smart Data ModelsIntroduction to Smart Data Models
Introduction to Smart Data ModelsFIWARE
 
The need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formatsThe need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formatsMarkus Neteler
 
An On-line Collaborative Data Management System
An On-line Collaborative Data Management SystemAn On-line Collaborative Data Management System
An On-line Collaborative Data Management SystemCameron Kiddle
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data introvafopoulos
 

Semelhante a 20130626 OpenRefine Introduction (20)

20130206 open refine
20130206  open refine20130206  open refine
20130206 open refine
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
Big Data przt.pptx
Big Data przt.pptxBig Data przt.pptx
Big Data przt.pptx
 
Data Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdfData Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdf
 
DBMS.ppt
DBMS.pptDBMS.ppt
DBMS.ppt
 
Uc13.chapter.14
Uc13.chapter.14Uc13.chapter.14
Uc13.chapter.14
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 
Big Data Infrastructure and Analytics Solution on FITAT2013
Big Data Infrastructure and Analytics Solution on FITAT2013Big Data Infrastructure and Analytics Solution on FITAT2013
Big Data Infrastructure and Analytics Solution on FITAT2013
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Intelligent Data Processing for the Internet of Things
Intelligent Data Processing for the Internet of Things Intelligent Data Processing for the Internet of Things
Intelligent Data Processing for the Internet of Things
 
unit-4-notes.pdf
unit-4-notes.pdfunit-4-notes.pdf
unit-4-notes.pdf
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
Introduction to Smart Data Models
Introduction to Smart Data ModelsIntroduction to Smart Data Models
Introduction to Smart Data Models
 
The need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formatsThe need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formats
 
Emerging Technologies in IT
Emerging Technologies in ITEmerging Technologies in IT
Emerging Technologies in IT
 
An On-line Collaborative Data Management System
An On-line Collaborative Data Management SystemAn On-line Collaborative Data Management System
An On-line Collaborative Data Management System
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data intro
 

Último

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 

Último (20)

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

20130626 OpenRefine Introduction