OPEN REFINE TO PROFILE
AND CLEAN UP YOUR
MESSY DATA
THE WOES OF DATA
NOT ALL DATA ARE MADE EQUAL!
DATA ARE CREATED FROM A VARIETY OF SOURCES:
• AUTOMATICALLY GENERATED
• MANUALLY CREATED AND MANAGED
• SCRAPED FROM THE WEB
• TAKEN FROM ANOTHER SOURCE
• ETC.
EACH SOURCE TYPICALLY HAS ITS OWN METHOD OF HANDLING DATA.
THEREFORE DATA SOURCES ARE OFTEN HETEROGENEOUS, WHICH CREATES
INCONSISTENT AND INACCURATE DATA. FURTHER, THERE IS OFTEN NO INFORMATION
PROVIDED ABOUT THE DATA TO HELP YOU UNDERSTAND WHAT THE DATA ARE
ABOUT.
WHAT TO DO?
ESSENTIALLY, YOU ARE LOOKING FOR CONSISTENT AND ACCURATE
DATA. TO DO THIS, THERE ARE OPTIONS:
• CREATE A SCRIPT TO CLEAN DATA
• HAVE SOMEONE CLEAN YOUR DATA
• USE AN APPLICATION TO HELP YOU CLEAN YOUR DATA
TODAY, WE’RE GOING TO LOOK AT OPEN REFINE, AN APPLICATION
THAT CAN HELP YOU CLEAN YOUR DATA.
WHAT IS OPEN REFINE?
OPEN REFINE ORIGINATED WITH METAWEB TECHNOLOGIES AND THEN
GOOGLE. FORMERLY KNOWN AS FREEBASE GRIDWORKS, IT BECAME
KNOWN AS GOOGLE REFINE AND THEN OPEN REFINE.
IT IS A FULLY OPEN SOURCE APPLICATION.
YOU CAN DOWNLOAD IT AT: HTTP://OPENREFINE.ORG/DOWNLOAD.HTML
OPEN REFINE AND THE WOES OF DATA
WHETHER YOU HAVE HETEROGENEOUS DATA OR DATA YOU’RE
UNFAMILIAR WITH, OPEN REFINE CAN HELP WITH:
• DATA CLEANUP
• CORRECT INCONSISTENCIES AND INACCURACIES
• DATA PROFILING
• CREATE AN ANALYSIS OF UNFAMILIAR DATA BY LEARNING WHAT THE
DATA ARE
2 BASIC OPERATIONS OF OPEN REFINE
BOTH DATA CLEANUP AND DATA PROFILING RELY ON BEING ABLE TO
VISUALIZE AND MANIPULATE YOUR DATA. OPEN REFINE DOES THIS LOCALLY;
FREEBASE, THE DATABASE IT WAS ONCE TIED TO, HAS SINCE BEEN SHUT DOWN.
VISUALIZING YOUR DATA INCLUDES BEING ABLE TO SEE THE “TYPE”
OF DATA. IS IT A DATE, NUMBER, TEXT? ARE ALL THE ENTRIES THE
SAME? VISUALIZING ALSO INCLUDES BEING ABLE TO SEE TIMELINES
FOR ENTRIES THAT ARE DATES.
MANIPULATING DATA INCLUDES GLOBAL SEARCHING AND
REPLACING, UPDATING ENTRIES SINGLY OR BY BATCH, ADDING NEW
DATA, REMOVING DATA, CHANGING THE DATA TYPES, AND MORE!
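TO MAKE THE "TYPE" QUESTION CONCRETE, HERE IS A MINIMAL PYTHON SKETCH OF PROFILING ONE COLUMN. IT IS NOT HOW OPEN REFINE IS IMPLEMENTED; THE FUNCTION NAMES, THE YYYY-MM-DD DATE FORMAT AND THE SAMPLE VALUES ARE ASSUMPTIONS MADE UP FOR ILLUSTRATION.

```python
from collections import Counter
from datetime import datetime


def guess_type(cell):
    """Classify one raw cell as 'blank', 'number', 'date', or 'text'."""
    text = cell.strip()
    if not text:
        return "blank"
    try:
        float(text)
        return "number"
    except ValueError:
        pass
    try:
        # Assumption: dates are written as YYYY-MM-DD.
        datetime.strptime(text, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"


def profile(column):
    """Count how many cells of each guessed type a column holds."""
    return Counter(guess_type(cell) for cell in column)


# Hypothetical sample column, invented for this sketch.
sample = ["2010-05-31", "42", "n/a", "", "2011-01-15"]
print(profile(sample))
```

A COLUMN THAT REPORTS MORE THAN ONE TYPE IS A SIGN THAT THE ENTRIES ARE NOT ALL THE SAME AND NEED CLEANUP.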
LET’S DO THE TOUR
• CREATE A PROJECT
• DIFFERENT VIEWS
• SEEING YOUR DATA
• MANIPULATING ROWS/COLUMNS
• CORRECTING ERRORS
HOW TO MAKE CHANGES TO DATA
• CHANGES TO COLUMNS
• SPLIT
• ADD
• REMOVE
• RENAME
• MOVE
• CHANGES TO CELLS (DATA ANYWHERE)
• EDIT CELLS USING CUSTOM OR DEFAULT TRANSFORMS, FILL
DOWN/BLANK, SPLIT/JOIN CELLS, CLUSTER/EDIT
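THE IDEA BEHIND CLUSTER/EDIT CAN BE SKETCHED IN PYTHON AS FOLLOWS. THIS IS A SIMPLIFIED KEY-BASED GROUPING WRITTEN FOR ILLUSTRATION, NOT OPEN REFINE'S OWN CODE; THE FUNCTION NAMES AND THE SAMPLE NAMES ARE HYPOTHETICAL.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key so near-duplicate spellings collide."""
    cleaned = value.strip().lower()
    # Drop ASCII punctuation, then sort and de-duplicate the words.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values by key; keep only groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Made-up sample entries with inconsistent spellings.
names = ["Smith, John", "john smith", "John Smith.", "Jane Doe"]
print(cluster(names))
```

EACH GROUP RETURNED IS A SET OF SPELLINGS THAT PROBABLY MEAN THE SAME THING; YOU WOULD THEN PICK ONE SPELLING AND APPLY IT TO THE WHOLE GROUP.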
AVAILABLE TRANSFORMATIONS
TRIM OR COLLAPSE WHITESPACE
UNESCAPE HTML ENTITIES
CHANGE THE CASE (TITLE, LOWER OR UPPER CASE)
CHANGE THE DATA TO NUMERIC, DATE OR TEXT
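FOR COMPARISON, THE SAME KINDS OF TRANSFORMATIONS CAN BE SKETCHED WITH THE PYTHON STANDARD LIBRARY. THE HELPER NAMES ARE HYPOTHETICAL AND THE BEHAVIOR MAY DIFFER IN DETAIL FROM OPEN REFINE'S MENU ITEMS.

```python
import html


def trim(value):
    """Remove leading and trailing whitespace."""
    return value.strip()


def collapse_whitespace(value):
    """Replace each run of whitespace with one space (this also trims the ends)."""
    return " ".join(value.split())


def unescape_html(value):
    """Turn HTML entities such as &amp; back into plain characters."""
    return html.unescape(value)


def to_number(value):
    """Convert text to int when it is a whole number, otherwise to float."""
    number = float(value)
    return int(number) if number.is_integer() else number


print(collapse_whitespace("  open    refine "))
print(unescape_html("Tom &amp; Jerry"))
print("open refine".title(), "Open Refine".lower(), "Open Refine".upper())
print(to_number("42"), to_number("3.5"))
```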
CUSTOM TRANSFORMATIONS
• JYTHON
• HTTP://WWW.JYTHON.ORG/
• HTTPS://GITHUB.COM/OPENREFINE/OPENREFINE/WIKI/JYTHON
• GREL (GENERAL REFINE EXPRESSION LANGUAGE, ORIGINALLY GOOGLE REFINE EXPRESSION LANGUAGE)
• HTTPS://GITHUB.COM/OPENREFINE/OPENREFINE/WIKI/GREL-FUNCTIONS
• REGULAR EXPRESSIONS
• HTTP://EN.WIKIPEDIA.ORG/WIKI/REGULAR_EXPRESSION
• HTTP://WWW.REGULAR-EXPRESSIONS.INFO/QUICKSTART.HTML
GREL
• SLICE
• EXAMPLE: CHANGE “2010-05-31T01:10:0Z” TO “05/31/2010”
• value.slice(5,7) + '/' + value.slice(8,10) + '/' + value.slice(0,4)
• YOU CAN ALSO USE:
• ADD A PREFIX
• "PREFIX" + value
• SPLIT AND JOIN
• A:B:C:D:E -> B:C:D
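THE THREE GREL IDEAS ABOVE (SLICE, PREFIX, SPLIT AND JOIN) MAP CLOSELY ONTO PYTHON STRING OPERATIONS. THE SKETCH BELOW USES PLAIN PYTHON RATHER THAN GREL; THE FUNCTION NAMES ARE HYPOTHETICAL.

```python
def iso_to_us_date(value):
    """Rearrange 'YYYY-MM-DD...' into 'MM/DD/YYYY' by character position."""
    # Same positions as the slice example: month, day, then year.
    return value[5:7] + "/" + value[8:10] + "/" + value[0:4]


def add_prefix(value, prefix):
    """Put a fixed string in front of the cell value."""
    return prefix + value


def keep_middle(value, sep=":"):
    """Split on the separator, keep the 2nd to 4th pieces, and join them again."""
    parts = value.split(sep)
    return sep.join(parts[1:4])


print(iso_to_us_date("2010-05-31T01:10:0Z"))
print(add_prefix("value", "PREFIX"))
print(keep_middle("A:B:C:D:E"))
```

SLICING BY POSITION ONLY WORKS WHEN EVERY ENTRY HAS THE SAME LAYOUT, SO IT IS WORTH PROFILING THE COLUMN FIRST.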
REGULAR EXPRESSIONS
• TEXT PATTERN THAT ONE CAN USE WITH MANY MODERN
APPLICATIONS AND PROGRAMMING LANGUAGES
• REGULAR EXPRESSIONS COME IN FLAVORS
• .NET, JAVA, JAVASCRIPT, PERL, PYTHON, RUBY, …
• METACHARACTERS: 12 PUNCTUATION CHARACTERS THAT MAKE
REGULAR EXPRESSIONS WORK
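A SHORT PYTHON SKETCH OF WHY METACHARACTERS MATTER. IT ASSUMES THE 12 CHARACTERS ARE THE ONES COMMONLY LISTED IN REGEX TUTORIALS; THE CONSTANT AND FUNCTION NAMES ARE HYPOTHETICAL.

```python
import re

# Assumption: the 12 characters commonly listed as regex metacharacters.
METACHARACTERS = r"\^$.|?*+()[{"


def matches_literally(char):
    """True if the escaped form of char matches that exact character."""
    return re.fullmatch(re.escape(char), char) is not None


# Escaping turns every metacharacter back into an ordinary character.
print(all(matches_literally(c) for c in METACHARACTERS))

# An unescaped "." matches any character; an escaped one matches only a period.
print(re.fullmatch(".", "x") is not None)
print(re.fullmatch(r"\.", "x") is not None)
```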
REGULAR EXPRESSIONS
• CHARACTER CLASS ABBREVIATIONS
REGULAR EXPRESSIONS
• REMOVE THE “.” AT THE END OF A PHRASE
• value.replace(/[.]$/, "")
• WHAT HAPPENS IF YOU JUST PUT /.$/ WITHOUT THE BRACKETS?
• REMOVE A STRING AT THE BEGINNING OF THE PHRASE WHERE THE
FIRST LETTER OF THE STRING IS UPPER OR LOWER CASE
• value.replace(/^\w+\s/, "")
• TEST IT OUT
• HTTP://REGEXPAL.COM/
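YOU CAN ALSO TRY THESE PATTERNS IN PYTHON. THE SKETCH BELOW ASSUMES THE SECOND PATTERN IS MEANT AS ^\w+\s (A FIRST WORD FOLLOWED BY ONE WHITESPACE CHARACTER); THE FUNCTION NAMES AND TEST STRINGS ARE MADE UP, AND PYTHON'S REGEX FLAVOR MAY DIFFER SLIGHTLY FROM THE ONE OPEN REFINE USES.

```python
import re


def drop_final_period(value):
    """Remove a literal period at the very end of the text."""
    return re.sub(r"[.]$", "", value)


def drop_final_char(value):
    """Without the brackets, "." matches any character, so the last one is removed."""
    return re.sub(r".$", "", value)


def drop_first_word(value):
    """Remove the leading word and the whitespace character after it."""
    return re.sub(r"^\w+\s", "", value)


print(drop_final_period("A short phrase."))
print(drop_final_char("A short phrase"))
print(drop_first_word("The quick fox"))
print(drop_first_word("the quick fox"))
```

THE SECOND FUNCTION ANSWERS THE QUESTION ON THE SLIDE: WITHOUT THE BRACKETS, THE PATTERN REMOVES WHATEVER THE LAST CHARACTER IS, NOT JUST A PERIOD.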
QUESTIONS?
RESOURCES
• USING OPEN REFINE … THE BOOK, QA76.9.D343 V47 2013
• OPEN REFINE DOCUMENTATION: HTTP://OPENREFINE.ORG/DOCUMENTATION.HTML
• OPEN REFINE COMMUNITY: HTTP://OPENREFINE.ORG/COMMUNITY.HTML
• CODE PROJECT ON REGULAR EXPRESSIONS: HTTP://WWW.CODEPROJECT.COM/ARTICLES/9099/THE-MINUTE-REGEX-TUTORIAL
• REGULAR EXPRESSIONS CHEAT SHEET: HTTP://WWW.CHEATOGRAPHY.COM/DAVECHILD/CHEAT-SHEETS/REGULAR-EXPRESSIONS/
• REGULAR EXPRESSIONS TESTER: HTTP://REGEXPAL.COM/, HTTP://REGEX101.COM/
