Summer School
                    "Data journalism e visualizzazione
                    grafica dei dati"
                    29 July 2011 – Flavon (TN)




An introduction to ScraperWiki
for non-developers




                 Maurizio Napolitano <napo@fbk.eu>
Description in the name

SCRAPER
source: http://www.modot.org/central/major_projects/July2006photos.htm

WIKI
source: http://www.commoncraft.com/video/wikis
Wiki like Wikipedia
                          Scraper like ???




A scraper extracts data
from content
Legal aspect

Scraper sites may violate
copyright law.
Even taking content from an open content site can be a
copyright violation, if done in a way which does not respect
the license.
For instance, the GNU Free Documentation License (GFDL)
and Creative Commons ShareAlike (CC-BY-SA) licenses
require that a republisher inform readers of the license
conditions, and give credit to the original author.


 http://en.wikipedia.org/wiki/Scraper_site
... then ScraperWiki is ...




    https://scraperwiki.com/

A place to share scrapers … and data :)
ScraperWiki legal aspect
Use
6. You agree that, in using the ScraperWiki site and services, you will
not interfere with the legal rights
[...]
Intellectual Property
9. Subject to the following paragraphs, the source code of the
ScraperWiki site, and all other copyrightable materials that form a part
of it is released under the GNU Affero General Public License.
10. All scraping code hosted on the site is licensed under the GNU
General Public License. You hereby license all scraping code you
create using ScraperWiki under the same licence.
11. You agree to assert no additional intellectual property rights,
including copyright and database right, in any scraped data other than
those which subsisted in the relevant web sites before the running of
the relevant scraper and which were held by you at that time.
12. You grant us a non-exclusive, worldwide, licence to use any data
that you store on our site, for the purposes of administering the site.


                                https://scraperwiki.com/terms_and_conditions/
ScraperWiki legal aspect
USE
6. You agree [...] you will not interfere with
the legal rights
[...]

INTELLECTUAL PROPERTY
9. [...] the source code of the ScraperWiki [...] is released
under the GNU Affero General Public License.
10. All scraping code [...] is licensed under the GNU
General Public License.
11. You agree to assert no additional
intellectual property rights [...]
12. You grant us a non-exclusive, worldwide, licence to use any data
that you store on our site, for the purposes of administering the site.
HOW TO CREATE A
   SCRAPER?
The non-developers
The technical approach




http://unstats.un.org/unsd/demographic/products/socind/education.htm
Behind the page



         HTML
         code
Where are the data?




      There is a structure
      behind!!!
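
To make the idea of "a structure behind" concrete, here is a minimal Python sketch using only the standard library. The sample HTML and the names OutlineParser and outline are hypothetical and chosen for illustration; the real UN page is larger and laid out differently. The sketch lists each piece of visible text together with the tags that enclose it.

```python
from html.parser import HTMLParser

# Hypothetical sample page, not the real UN table.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Year</th></tr>
  <tr><td>Italy</td><td>2009</td></tr>
</table>
"""


class OutlineParser(HTMLParser):
    """Records each piece of text together with the tags that enclose it."""

    def __init__(self):
        super().__init__()
        self.stack = []    # tags that are currently open
        self.outline = []  # (path, text) pairs found so far

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag, dropping anything left open inside it.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.outline.append((" > ".join(self.stack), text))


def outline(html):
    """Return a list of (enclosing tags, text) pairs for the given HTML."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


if __name__ == "__main__":
    for path, text in outline(SAMPLE_HTML):
        print(path, "->", text)
```

Every value sits at a predictable place such as "table > tr > td", and that regularity is what a scraper relies on.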
The algorithm!!!

Download the web page
Read the information
Find the right position
Extract the data
Create a CSV file

data1;data2;data3
[...]
dataN1;dataN2;dataN3
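
The algorithm can be sketched in plain Python with the standard library only. This is an illustration under assumptions, not the code shown in the talk: the names TableParser, download, extract_rows and to_csv are hypothetical, and the sketch assumes the data sits in ordinary HTML table rows. On a real page, "find the right position" usually also means picking the right table.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None   # cells of the row being read, None outside a row
        self.cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row, self.cell = [], None
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            if self.row:
                self.rows.append(self.row)
            self.row, self.cell = None, None


def download(url):
    """Download the web page (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_rows(html):
    """Read the page, find the table rows and extract the cell values."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Create CSV text, with ';' as the separator as on the slide."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=";", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical inline page, so the sketch runs without a network connection.
    page = "<table><tr><td>data1</td><td>data2</td><td>data3</td></tr></table>"
    print(to_csv(extract_rows(page)), end="")
```

For a live page, the same functions chain as to_csv(extract_rows(download(url))), where url is the address of the page to scrape.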
Example: python code




https://scraperwiki.com/docs/python/python_intro_tutorial/
… and everything runs
       in the cloud!!!
The code in the cloud




https://scraperwiki.com/scrapers/mlb_rosters/
Sharing & ReUse
Enjoy!!!




https://scraperwiki.com/
Thanks!
 An introduction to ScraperWiki for non-developers by
 Maurizio Napolitano <napo@fbk.eu>
 is licensed under a
 Creative Commons Attribution 3.0 Unported License.




Created for
                   Summer School
                   "Data journalism e visualizzazione
                   grafica dei dati"
                   29 July 2011 – Flavon (TN)
