DATA LIBERATION
Opening Up Data by Hook or by Crook - Data Scraping, Linkage and the Value of a Good Identifier
Tony Hirst
Department of Communication and Systems
The Open University
data NOT
information
              by Vick
[Disruptive
Innovation?]
“First” generation:
 data catalogues
Breathing life
 into data…
=importData("CSV_URL")
the spreadsheet becomes

A DATABASE
“Second” generation:
 data management
      systems
There’s lots more
data that’s locked
up in web pages…
Scraping…
“grabbing web content
in a machine readable
   format and then
 processing it for your
    own purposes”
Original HTML web page
Extract Information -> data
Accessible web page
Recreating the
database that was used
     to populate a
   (templated) page
…quick’n’dirty
Scrapers
Scraper, SQLite database
Views
SQLite database, Scraper
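
The scraper, SQLite database and view arrangement named on this slide might look something like the following. This is only a minimal sketch using the Python standard library: the HTML snippet, the `spending` table and its column names are invented for illustration, and it makes no claim to reflect how any particular scraping platform is implemented.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical templated page: every row follows the same pattern,
# which is what makes the underlying data recoverable.
PAGE = """
<table>
  <tr><th>Council</th><th>Spend</th></tr>
  <tr><td>Ambridge</td><td>1200</td></tr>
  <tr><td>Borchester</td><td>3400</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:  # header rows hold only <th> cells, so they stay empty
                self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


def scrape_to_db(html):
    """Scrape the table rows and store them in an in-memory SQLite database."""
    scraper = TableScraper()
    scraper.feed(html)
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE spending (council TEXT, spend INTEGER)")
    db.executemany("INSERT INTO spending VALUES (?, ?)", scraper.rows)
    return db


db = scrape_to_db(PAGE)
# A "view" is then just a query over the scraped data.
top = db.execute(
    "SELECT council, spend FROM spending ORDER BY spend DESC LIMIT 1"
).fetchone()
print(top)
```

Keeping the scraper and the view separate means the page only has to be parsed once; any number of queries can then be run against the stored rows.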
Sometimes the
 data is spread
across different
     files…
Row based
aggregation
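
As a rough illustration of row based aggregation, the sketch below stacks the rows of several CSV files that share the same columns. The file contents, institution names and column names are made up, and the approach assumes the headers really do match across files.

```python
import csv
import io

# Hypothetical files from two publishers, with the same columns.
FILE_A = "institution,grant\nOpen University,1000\n"
FILE_B = "institution,grant\nOpen University,2500\nExample University,400\n"


def aggregate_rows(*csv_texts):
    """Stack the rows of several CSV files that share a header."""
    rows = []
    for text in csv_texts:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


combined = aggregate_rows(FILE_A, FILE_B)
print(len(combined))
```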
Sometimes the
 data is spread
across different
  websites…
…   Normalisation…
Data
Enrichment
Column
Additions/Annotations
Sometimes the
  data is split
across different
     files…
Column
based merge
-> Data
cleansing
Clustering…
http://mashe.hawksey.info/2012/11/mining-and-openrefineing-jiscmail-a-look-at-oer-discuss/

/via Martin Hawksey/@mhawksey
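
One common way of clustering near-duplicate values is to reduce each one to a normalised "fingerprint" and group the values whose fingerprints collide. The sketch below is a simplified approximation of that idea, loosely modelled on the key collision clustering offered by tools such as OpenRefine; it is not that tool's actual implementation, and the sample names are invented.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Normalise a string so that near-duplicate spellings share one key."""
    value = value.strip().lower()
    value = re.sub(r"[^\w\s]", "", value)  # drop punctuation
    tokens = sorted(set(value.split()))    # unique tokens, sorted
    return " ".join(tokens)


def cluster(values):
    """Group values whose fingerprints collide; singletons are dropped."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]


names = [
    "The Open University",
    "open university, the",
    "Open  University (The)",
    "Example University",
]
print(cluster(names))
```

Each cluster is a candidate set of spellings to merge; a person still has to decide which spelling becomes the canonical one.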
“Finessing” a
  common
  identifier
Common identifiers
 (common KEYS) make
it MUCH easier to JOIN
   datasets by column
Book Title
-> ISBN
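
The idea of finessing a common identifier and then joining by column can be sketched as follows. Everything here is hypothetical: the two datasets, the `title_to_isbn` lookup table and the placeholder ISBN strings are invented purely to show the shape of the operation.

```python
# All values below are made up for illustration.
loans = [
    {"title": "Book One", "loans": 12},
    {"title": "Book Two", "loans": 7},
]
prices = [
    {"isbn": "ISBN-A", "price": 20.0},
    {"isbn": "ISBN-B", "price": 25.0},
]
# A lookup table that "finesses" a common identifier for the first dataset.
title_to_isbn = {
    "Book One": "ISBN-A",
    "Book Two": "ISBN-B",
}


def join_on_isbn(loans, prices, title_to_isbn):
    """Map each title to an ISBN, then merge in the matching price columns."""
    by_isbn = {row["isbn"]: row for row in prices}
    joined = []
    for row in loans:
        isbn = title_to_isbn.get(row["title"])
        match = by_isbn.get(isbn)
        if match is not None:  # inner join: keep only rows found in both
            joined.append({**row, **match})
    return joined


result = join_on_isbn(loans, prices, title_to_isbn)
print(result)
```

Once both datasets carry the same key, the join itself is a dictionary lookup; the hard part is building a trustworthy lookup table.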
I am “psychemedia” on
Twitter, delicious, slideshare, flickr, etc etc
Reconciliation…
Linked
Data™
So who speaks SPARQL?




     Diners - Journal Canteen
     by avlxyz
You DON’T have to….
Just think about how one piece of
 data might be related to another
   through a common means of
        addressing them…
http://ouseful.info

 @psychemedia

Editor's Notes

  1. Tony Hirst
     Twitter: @psychemedia
     Blog: http://blog.ouseful.info
     Presentation prepared for: Online Info 12/11/2012

     DATA LIBERATION: OPENING UP DATA BY HOOK OR BY CROOK - DATA SCRAPING, LINKAGE AND THE VALUE OF A GOOD IDENTIFIER

     The 1/9/90 rule is often used to characterise the way in which a small number of creators generate content that a larger number (but still a small percentage in the greater scheme of things) comment on or amplify, whilst the majority just passively consume. In this presentation, I will explore the extent to which a similar view applies to the world of "data liberation". After reviewing the idea of data scraping, and some of the techniques surrounding it, I will describe how online tools such as Scraperwiki provide a platform for concentrating data scraping activity and expertise, as well as supporting the publication of data /as data/ in a variety of formats, in addition to 'end user' views in the form of graphical charts and interactive visualisations.

     One of the major motivations for data scraping is the aggregation of data from a variety of data sources into a larger, integrated whole. For example, the aggregation of research council funding data from separate research councils allows us to view a large proportion of the publicly funded research grants received by a single institution; or the collection of local council spending data across all UK councils allows us to see how councils spend money with each other across a range of transaction areas. But how do we actually create such aggregations when the data is sourced from different areas? In order to do this, we need to know when different datasets are actually talking about the same thing, which is where common identifiers come in. For it is surely the case that when we have common identifiers, we can have linkage, and as a result start to realise some of the benefits of Linked Data (as well as developing a wider appreciation of what those benefits might actually be...)

     (As an aside, I'll describe how we might go about deriving such identifiers when they are missing from a data set that might otherwise, or more conveniently, be expected to publish them.)

     Throughout the presentation, I will draw on practical examples of how aggregated "liberated" data has been used as the basis of wider interest, and even status-quo-disrupting, services, as well as reflecting on what other sources of data we might see the data liberators turning their attention to next...

     Key learning points:
     1 - What is "data scraping", how can I do it and is my website at risk of it?
     2 - Why the secret to understanding "Linked Data" is the very idea of it, not just (or not even) the technology.
     3 - How has data scraping been used to "open up" data in actual practice?
  2. The focus of this presentation is not the release of “information”, but the release of data in raw form so that it can be interpreted and presented in informative ways by other parties.
  3. The London Datastore is an early example of a council-centric open data website. Early signs suggest it is natural to locate data websites at addresses of the form data.COUNCILNAME.gov.uk or www.COUNCILNAME.gov.uk/data
  4. Another example of how CSV can be used to help data flow is provided by Google Spreadsheets. The =importData formula allows a user to specify a source data URL, and pull the CSV data found at that location into the spreadsheet. Unlike Many Eyes Wikified, if the source data at the URL is updated, the update will (eventually) be pulled into the spreadsheet automatically.
  5. One of the really good reasons for getting data into a data processing environment such as a spreadsheet is that you can start to work it. In the case of Google Spreadsheets, the spreadsheet environment can also be used as a database environment. That is, we can treat one or more data-containing sheets in a spreadsheet as a database, and generate new views over the data, as well as running queries over that data.
  6. Another way of using a Google Spreadsheet as a database is via the Google Spreadsheets API. The Google Visualisation API (?) provides a way of passing queries written using the Google ???viz query language from an arbitrary web page or web application, and receiving the resulting data in a standard JSON based format, which also happens to play nicely with the Google Visualisation API???

     The Guardian Datastore explorer is a crude demonstration from 2009(??) showing how data from the Guardian datastore, data that is stored across a range of Google spreadsheets, can be explored, queried and visualised via these APIs. Users can select a dataset from a drop down menu, fed from a delicious account to which various datastore spreadsheets have been bookmarked using a particular set of tags, or by pasting in the URL of an arbitrary (public) Google spreadsheet. The first row/headings of the data can then be previewed (a simple spreadsheet is assumed, in which column headings appear in the first row of the spreadsheet).
  7. A series of list boxes are then populated with the column labels and their names, and provide a certain amount of help for the creation of a query over the spreadsheet data. A range of output formats can also be selected, from simple HTML data tables, to a range of charts. URLs are also generated for HTML and CSV representations of the data returned from the query.
  8. One of the nice things about the data table widget (a standard Google Visualisation API component in this case, though similar examples exist for YUI, the Yahoo User Interface Libraries, or frameworks such as JQuery), is that it supports things like row sorting by column (for free – no programming required!), allowing even further manipulation of the data, albeit at a simplistic level. (It’s probably worth pointing out here that it may be worth providing a preview of the column headings and first few rows (or a sample of random rows) of data when datasets are published, just so that users can see what sort of data is on offer without having to download the whole data set.)
  9. If you’re in the business of selling information as data, you are under threat where that information is published in an openly licensed way.
  10. Linked Data – the TM is something of a joke and refers to the particular style of publishing data according to a set of principles first outlined by the inventor of the World Wide Web, Sir Tim Berners-Lee – is one of the data formats that the Government’s data task force favour for the publication of data.
  11. There is a problem though – at the moment, there are barriers to entry to the Linked Data world from both the query side (not many people speak SPARQL, or know how to construct a SPARQL query to an endpoint) and the results side (data is returned as RDF).
  12. So – do you speak SPARQL?
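
To give a flavour of what "constructing a SPARQL query to an endpoint" in note 11 involves, the sketch below builds, but deliberately does not send, an HTTP request carrying a very simple query. The endpoint URL is a placeholder rather than a real service, the query is a generic example, and the Accept header assumes the endpoint can return results as JSON.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; substitute a real SPARQL endpoint to run a query.
ENDPOINT = "https://example.org/sparql"

# Ask for up to ten things that have a label.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thing ?label
WHERE { ?thing rdfs:label ?label }
LIMIT 10
"""


def build_request(endpoint, query):
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here because the endpoint is only a placeholder.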