Internet Content as Research Data
Digital Humanities Australia
March 2012, Canberra
Monica Omodei & Gordon Mohr
Research Examples
•    Social networking
•    Lexicography
•    Linguistics
•    Network Science
•    Political Science
•    Media Studies
•    Contemporary history
Common Collection Strategies
•  Crawl Scope & Focus
    1)  Thematic/Topical (elections, events, global warming…)
    2)  Resource-specific (video, pdf, etc.)
    3)  Broad survey (domain wide for .com/.net/.org/.edu/.gov)
    4)  Exhaustive (end of life, closure crawls, national domains)
    5)  Frequency-Based
•  Key Inputs: nominations from subject matter experts,
   prior crawl data, registry data, trusted directories,
   Wikipedia
Existing web archives
•    Internet Archive
•    Common Crawl
•    Pandora Archive
•    Internet Memory Foundation Archive
•    Other national archives
•    Research, University Library archives
Internet Archive’s Web Archive

Positives
  –  Very broad – 175+ billion web instances
  –  Historic – started 1996
  –  Publicly accessible
  –  Time-based URL search
  –  API access
  –  Not constrained by legislation – covered by
     fair use and fast take-down response
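
The time-based URL search listed above can be sketched as a small helper. This is a minimal illustration, assuming the Wayback Machine's public address pattern of a replay prefix, a 14-digit UTC timestamp, and the original URL; the prefix constant, the function name, and the example address are not taken from the slides. The archive normally answers with the capture closest to the requested time rather than an exact match.

```python
from datetime import datetime, timezone

# Assumption: the Wayback Machine replay prefix; it is not stated in the slides.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a time-based replay URL for original_url near the time `when`."""
    # Wayback timestamps are expressed in UTC as YYYYMMDDhhmmss.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical lookup: a page as it stood around the date of this talk.
    print(wayback_url("http://example.com/", datetime(2012, 3, 1, tzinfo=timezone.utc)))
```

Only the address is constructed here; nothing is fetched, so the sketch can be run offline.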
Internet Archive’s Web Archive

Negatives
  –  Because of size, can’t search by keyword
  –  Because of size, fully automated - QA not
     possible
Common Use Cases for IA’s web archive
•  Content discovery
•  Nostalgia queries
•  Web site restoration and file recovery
•  Domain name valuation
•  Collaborative R&D
•  Prior art analysis and patent/copyright infringement
   research
•  Legal cases
•  Topic analysis, web trends analysis, popularity
   analysis
Common Crawl
•  Non-profit foundation building an open crawl
   of the web to seed research and innovation
•  Currently 5 billion pages
•  Stored on Amazon’s S3
•  Accessible via MapReduce processing in
   Amazon’s EC2 compute cloud
•  Wholesale extraction, transformation, and
   analysis of web data cheap and easy
•  commoncrawl.org/data/accessing-the-data/
Common Crawl
Negatives
•  Not designed for human browsing but for
   machine access
•  Objective is to support large-scale analysis and
   text mining/indexing – not long-term
   preservation
•  Some costs are involved for direct extraction
   of data from S3 storage using Requester-Pays
   API
Pandora Archive
•  Positives
   –  Quality checked
   –  Targeted Australian content with selection policy
   –  Historical – started 1996
   –  Bibliocentric approach – web sites/publications
      selected for archiving are catalogued (see Trove)
   –  Keyword search
   –  Publicly accessible
   –  You can nominate Australian web sites for
      inclusion - pandora.nla.gov.au/registration_form.html
Pandora Archive
•  Negatives
   –  labour intensive so small
   –  significant content missed because permission to
      copy refused
•  Situation will improve markedly if Legal
   Deposit provisions extended to digital
   publications
•  Broader coverage will be achieved when
   infrastructure is upgraded hence reducing
   labour costs for checking/fixing crawls
Pandora Archive Stats
•    Size – 6.32 TB
•    Number of Files > 140 million
•    Number of ‘titles’ > 30.5K
•    Number of title instances > 73.5K
.au Domain Annual Snapshots
•  Annual crawls since 2005 commissioned from
   Internet Archive
•  Includes sites on servers located in Australia
   as well as .au domain
•  Robots.txt respected except for inline images
   and stylesheets
•  No public access – researcher access protocols
   are being developed
•  Full text search – tailored to archive search
•  Separate .gov crawl publicly accessible soon
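
The robots.txt policy described above (respected, except for inline images and stylesheets) can be sketched with Python's standard `urllib.robotparser`. This is an illustration under stated assumptions, not the crawler configuration actually used: the robots.txt content, the user-agent name, and the idea of recognising embedded resources by file extension are all hypothetical simplifications.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Assumption: embedded resources are recognised by extension in this sketch.
EMBEDDED_SUFFIXES = (".css", ".png", ".jpg", ".jpeg", ".gif")


def may_fetch(parser: RobotFileParser, agent: str, url: str) -> bool:
    """Honour robots.txt, except for inline images and stylesheets."""
    path = url.split("?", 1)[0].lower()
    if path.endswith(EMBEDDED_SUFFIXES):
        return True  # embedded page resources are fetched regardless
    return parser.can_fetch(agent, url)


# Parse the rules from memory instead of downloading them.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

if __name__ == "__main__":
    for url in ("http://example.com.au/index.html",
                "http://example.com.au/private/report.html",
                "http://example.com.au/private/logo.png"):
        print(url, may_fetch(parser, "examplebot", url))
```

Fetching embedded resources despite an exclusion rule lets an archived page render as it originally looked, which is presumably why the exception exists.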
  
Australian web domain crawls

Year            2005         2006         2007         2008         2009         2011
Files           185 million  596 million  516 million  1 billion    765 million  660 million
Hosts crawled   811,523      1,046,038    1,247,614    3,038,658    1,074,645    1,346,549
Size (TBs)      6.69         19.04        18.47        34.55        24.29        30.71
Internet Memory Foundation Archive
•  internetmemory.org/en/
•  no keyword search yet – only URL
•  Number of European partners
Other National Archives
•  List of International Internet Preservation
   Consortium member archives –
   netpreserve.org/about/archiveList.php
•  Some are whole domain archives, some are
   selective archives, many are both
•  Some have public access, others you will need
   to negotiate access for research
•  Most archives have been collected using the
   Heritrix open-source crawler and thus use the
   standard format (WARC ISO format)
Research Archives
•  California Digital Library
•  Harvard University Libraries
•  Columbia University Libraries
•  University of North Texas
… and many more

•  WebCITE - webcitation.org (citation service
   archive)
Bringing Archives Together
•  Common standard and APIs
•  Memento project
Create your own Archive
•  Use a subscription service
•  Build your own archive using open-source
   crawler Heritrix and standard file format .warc
•  Use web citation services that create archive
   copies as you bookmark pages
Subscription Services
•  archive-it.org (service operated by non-profit
   Internet Archive since 2006)
•  archivethe.net (service operated by non-profit
   Internet Memory Foundation)
•  California Digital Library Web Archiving
   Service - cdlib.org/services/uc3/was.html
•  OCLC Harvester Service -
   oclc.org/webharvester/overview/default.htm
Install web archiving system locally
•  Easy-to-deploy web archiving toolkit not yet
   available (that meets web archive standards)
•  Institutional web archiving infrastructure is
   feasible and has been established at a number
   of universities for use by researchers – needs
   IT systems engineers to set up though
•  Archives can be deposited with the NLA for
   long-term preservation
'Memento': adding time to the web
Protocol and browser add-on (MementoFox)
•  Aids discovery, aggregation of page histories
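
As a rough sketch of how the Memento protocol adds time to a request: a client asks a "TimeGate" for a page as it existed near a given moment by sending an `Accept-Datetime` header carrying an HTTP-style date in GMT. The helper name below is hypothetical, and the sketch only builds the header; it does not contact any archive.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(when: datetime) -> dict:
    """Return request headers asking a TimeGate for a copy near `when`."""
    # Accept-Datetime carries an HTTP-style date expressed in GMT.
    utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc, usegmt=True)}


if __name__ == "__main__":
    print(memento_headers(datetime(2012, 3, 1, tzinfo=timezone.utc)))
```

The returned dictionary can be passed to any HTTP client; which TimeGate address to send it to depends on the archive or aggregator being queried.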
  


	
  
Web Data Mining & Analysis –
What is it? Why Do It?
Innovation is increasingly driven by large-scale
  data analysis
•  Need fast iteration to understand the right
   questions to ask
•  More minds able to contribute = more value
   (perceived and real) placed on the importance
   of the data
•  Increased demand for/value of the data = more
   funding to support it
•  Need to surface the information amongst all
   that data…
Platform & Toolkit: Overview

•  Software	

   –  Apache Hadoop	

   –  Apache Pig	

•  Data/File format	

   –  WARC	

   –  CDX	

   –  WAT (new!)
Apache Hadoop

•  HDFS	

   –  Distributed storage	

   –  Durable, default 3x replication	

   –  Scalable: Yahoo! 60+PB HDFS	

•  MapReduce	

   –  Distributed computation	

   –  You write Java functions	

   –  Hadoop distributes work across cluster	

   –  Tolerates failures
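
The MapReduce model above can be illustrated in a few lines. On Hadoop the map and reduce functions would be written in Java and the framework would spread the work across a cluster; the sketch below is only an in-process Python stand-in for that flow (map, group by key, reduce), and the record layout and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_record(record: dict) -> Iterator[tuple]:
    """Map step: emit (content type, 1) for one capture record."""
    yield record["content_type"], 1


def reduce_counts(key: str, values: Iterable[int]) -> tuple:
    """Reduce step: sum all counts emitted for one key."""
    return key, sum(values)


def run_job(records: Iterable[dict]) -> dict:
    """Stand-in for the framework: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())


# Hypothetical capture records, for illustration only.
SAMPLE = [
    {"url": "http://example.com/", "content_type": "text/html"},
    {"url": "http://example.com/a.pdf", "content_type": "application/pdf"},
    {"url": "http://example.com/b", "content_type": "text/html"},
]

if __name__ == "__main__":
    print(run_job(SAMPLE))
```

Because each map call sees one record and each reduce call sees one key, the same two functions could in principle run on many machines at once, which is the property the framework exploits.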
File formats and data: WARC
File formats and data: CDX
•  Index for Wayback Machine: used to browse
   WARC-based archive	

•  Space-delimited text file	

•  Only essential metadata needed by Wayback	

  –  URL	

  –  Content Digest	

  –  Capture Timestamp	

  –  Content-Type	

  –  HTTP response code	

  –  etc.
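
Reading such an index can be sketched as follows. The slide lists which fields a CDX line carries but not their order, so the five-column layout below is an assumption made for this example; real CDX files typically declare their own column layout, and the sample line and digest value are hypothetical.

```python
from typing import NamedTuple


class CdxEntry(NamedTuple):
    url: str
    timestamp: str      # capture time, YYYYMMDDhhmmss
    content_type: str
    status: int         # HTTP response code
    digest: str         # content digest


# Assumption: number and order of columns used by this sketch only.
FIELD_COUNT = 5


def parse_cdx_line(line: str) -> CdxEntry:
    """Split one space-delimited index line into named fields."""
    parts = line.split()
    if len(parts) < FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(parts)}")
    url, timestamp, content_type, status, digest = parts[:FIELD_COUNT]
    return CdxEntry(url, timestamp, content_type, int(status), digest)


# Hypothetical index line, for illustration only.
SAMPLE_LINE = "http://example.com/ 20120301000000 text/html 200 HYPOTHETICALDIGEST"

if __name__ == "__main__":
    print(parse_cdx_line(SAMPLE_LINE))
```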
File formats and data: WAT

•  Yet Another Metadata Format! ☺ ☹	

•  Not preservation format	

•  Data exchange and analysis	

•  Less than full WARC, more than CDX	

•  Essential metadata for many types of analysis	

•  Avoids barriers to data exchange: copyright,
   privacy	

•  Work-in-progress: we want your feedback
File formats and data: WAT
•  WAT is WARC ☺
  –  WAT records are WARC metadata records
  –  WARC-Refers-To header identifies original WARC
     record
•  WAT payload is JSON
  –  Compact
  –  Hierarchical
  –  Supported by every programming environment

File formats & data:
•  CDX: 53 MB
•  WAT: 443 MB
•  WARC: 8,651 MB
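
The WAT structure described on this slide (a WARC metadata record whose payload is JSON and whose WARC-Refers-To header points at the original record) can be sketched with a small parser. The sample record is simplified and hypothetical: a real record carries further required headers, and the JSON keys shown are placeholders, not the actual WAT schema.

```python
import json

# Hypothetical, simplified WAT-style record: WARC headers, a blank line, then JSON.
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: metadata\r\n"
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"url": "http://example.com/", "links": ["http://example.com/about"]}'
)


def parse_record(text: str):
    """Split one uncompressed record into (headers dict, decoded JSON payload)."""
    head, _, body = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; header values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers, json.loads(body)


if __name__ == "__main__":
    headers, payload = parse_record(SAMPLE_RECORD)
    print(headers["WARC-Refers-To"], payload["links"])
```

Because the payload is plain JSON, the metadata can be loaded with the standard library alone, without touching the full archived content.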
Some References
•  http://en.wikipedia.org/wiki/Web_archiving
•  http://netpreserve.org/about/archiveList.php
•  Web Archives: The Future(s) -
   http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
Contacts
•  Webarchive @ nla.gov.au
•  Secretariat @ internetmemory.org
•  Queries about the Internet Archive web archive
   http://iawebarchiving.wordpress.com/
•  Queries about Archive-It service
   http://www.archive-it.org/contact-us

•  momodei @ nla.gov.au
•  gojomo @ xavvy.com

Mais conteúdo relacionado

Mais procurados

Elephant in the Room: Scaling Storage for the HathiTrust Research Center
Elephant in the Room: Scaling Storage for the HathiTrust Research CenterElephant in the Room: Scaling Storage for the HathiTrust Research Center
Elephant in the Room: Scaling Storage for the HathiTrust Research CenterRobert H. McDonald
 
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...Amazon Web Services
 
2012 02 pre_hbs_grid_overview_ianstokesrees_pt2
2012 02 pre_hbs_grid_overview_ianstokesrees_pt22012 02 pre_hbs_grid_overview_ianstokesrees_pt2
2012 02 pre_hbs_grid_overview_ianstokesrees_pt2Boston Consulting Group
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
 
"Web Archive services framework for tighter integration between the past and ...
"Web Archive services framework for tighter integration between the past and ..."Web Archive services framework for tighter integration between the past and ...
"Web Archive services framework for tighter integration between the past and ...Ahmed AlSum
 
Illuminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data SupportIlluminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data SupportPascal-Nicolas Becker
 
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...Martin Klein
 
NISO access related projects (presented at the Charleston conference 2016)
NISO access related projects (presented at the Charleston conference 2016)NISO access related projects (presented at the Charleston conference 2016)
NISO access related projects (presented at the Charleston conference 2016)Christine Stohn
 
London HUG
London HUGLondon HUG
London HUGBoudicca
 
ERA CoBioTech Data Management Webinar
ERA CoBioTech Data Management WebinarERA CoBioTech Data Management Webinar
ERA CoBioTech Data Management WebinarFAIRDOM
 
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012Amazon Web Services
 
Research Data Management at Imperial College London
Research Data Management at Imperial College LondonResearch Data Management at Imperial College London
Research Data Management at Imperial College LondonSarah Anna Stewart
 
IWMW 2003: Semantic Web Technologies for UK HE and FE Institutions (Part 2)
IWMW 2003: Semantic Web Technologies for UK HE and FE Institutions (Part 2)IWMW 2003: Semantic Web Technologies for UK HE and FE Institutions (Part 2)
IWMW 2003: Semantic Web Technologies for UK HE and FE Institutions (Part 2)IWMW
 
Building EOL species pages
Building EOL species pagesBuilding EOL species pages
Building EOL species pagesCyndy Parr
 

Mais procurados (20)

Elephant in the Room: Scaling Storage for the HathiTrust Research Center
Elephant in the Room: Scaling Storage for the HathiTrust Research CenterElephant in the Room: Scaling Storage for the HathiTrust Research Center
Elephant in the Room: Scaling Storage for the HathiTrust Research Center
 
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
 
NISO Virtual Conference: The Semantic Web Coming of Age: Technologies and Imp...
NISO Virtual Conference: The Semantic Web Coming of Age: Technologies and Imp...NISO Virtual Conference: The Semantic Web Coming of Age: Technologies and Imp...
NISO Virtual Conference: The Semantic Web Coming of Age: Technologies and Imp...
 
Oct 15 NISO Webinar: 21st Century Resource Sharing: Which Inter-Library Loan ...
Oct 15 NISO Webinar: 21st Century Resource Sharing: Which Inter-Library Loan ...Oct 15 NISO Webinar: 21st Century Resource Sharing: Which Inter-Library Loan ...
Oct 15 NISO Webinar: 21st Century Resource Sharing: Which Inter-Library Loan ...
 
2012 02 pre_hbs_grid_overview_ianstokesrees_pt2
2012 02 pre_hbs_grid_overview_ianstokesrees_pt22012 02 pre_hbs_grid_overview_ianstokesrees_pt2
2012 02 pre_hbs_grid_overview_ianstokesrees_pt2
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
"Web Archive services framework for tighter integration between the past and ...
"Web Archive services framework for tighter integration between the past and ..."Web Archive services framework for tighter integration between the past and ...
"Web Archive services framework for tighter integration between the past and ...
 
Illuminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data SupportIlluminating DSpace's Linked Data Support
Illuminating DSpace's Linked Data Support
 
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
 
NISO access related projects (presented at the Charleston conference 2016)
NISO access related projects (presented at the Charleston conference 2016)NISO access related projects (presented at the Charleston conference 2016)
NISO access related projects (presented at the Charleston conference 2016)
 
London HUG
London HUGLondon HUG
London HUG
 
Update on HDF5 1.8
Update on HDF5 1.8Update on HDF5 1.8
Update on HDF5 1.8
 
ERA CoBioTech Data Management Webinar
ERA CoBioTech Data Management WebinarERA CoBioTech Data Management Webinar
ERA CoBioTech Data Management Webinar
 
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
 
Research Data Management at Imperial College London
Research Data Management at Imperial College LondonResearch Data Management at Imperial College London
Research Data Management at Imperial College London
 
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti... NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 
The WSTIERIA Project – A Web of Services
The  WSTIERIA Project – A Web of ServicesThe  WSTIERIA Project – A Web of Services
The WSTIERIA Project – A Web of Services
 
IWMW 2003: Semantic Web Technologies for UK HE and FE Institutions (Part 2)
IWMW 2003: Semantic Web Technologies for UK HE and FE Institutions (Part 2)IWMW 2003: Semantic Web Technologies for UK HE and FE Institutions (Part 2)
IWMW 2003: Semantic Web Technologies for UK HE and FE Institutions (Part 2)
 
HDF Update
HDF UpdateHDF Update
HDF Update
 
Building EOL species pages
Building EOL species pagesBuilding EOL species pages
Building EOL species pages
 

Semelhante a Internet content as research data

Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Roxanne Missingham
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypresNekoGato
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunitiesAhmed AlSum
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryBiblioteca Nacional de España
 
From Box to Hydra via Archivematica
From Box to Hydra via ArchivematicaFrom Box to Hydra via Archivematica
From Box to Hydra via ArchivematicaJisc RDM
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Petter Skodvin-Hvammen
 
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎Libcorpio
 
Internet and its applications
Internet and its applicationsInternet and its applications
Internet and its applicationsBurhan Ahmed
 
"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with ArchivematicaJenny Mitcham
 
10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides
10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides
10-31-13 “Researcher Perspectives of Data Curation” Presentation SlidesDuraSpace
 
Lisa Rogers
Lisa RogersLisa Rogers
Lisa RogersJisc
 
The development of web archiving 3
The development of web archiving 3The development of web archiving 3
The development of web archiving 3Essam Obaid
 
It takes a Village: Implementing a Homegrown Solution for Streaming Video Res...
It takes a Village: Implementing a Homegrown Solution for Streaming Video Res...It takes a Village: Implementing a Homegrown Solution for Streaming Video Res...
It takes a Village: Implementing a Homegrown Solution for Streaming Video Res...mharpasu
 
The workflows for the ingest of digital objects into a repository/digital l...
The workflows for the ingest of  digital objects into a repository/digital l...The workflows for the ingest of  digital objects into a repository/digital l...
The workflows for the ingest of digital objects into a repository/digital l...Hong (Jenny) Jing
 
Olympya web-tools 2011
Olympya web-tools 2011Olympya web-tools 2011
Olympya web-tools 2011Paulo Mattos
 
OpenStack Swift In the Enterprise
OpenStack Swift In the EnterpriseOpenStack Swift In the Enterprise
OpenStack Swift In the EnterpriseHostway|HOSTING
 

Semelhante a Internet content as research data (20)

Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
 
From Box to Hydra via Archivematica
From Box to Hydra via ArchivematicaFrom Box to Hydra via Archivematica
From Box to Hydra via Archivematica
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)
 
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
 
Internet and its applications
Internet and its applicationsInternet and its applications
Internet and its applications
 
Internet and Its Applications
Internet and Its ApplicationsInternet and Its Applications
Internet and Its Applications
 
"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica
 
IIPC GA 2014 Solr
IIPC GA 2014 SolrIIPC GA 2014 Solr
IIPC GA 2014 Solr
 
10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides
10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides
10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides
 
Lisa Rogers
Lisa RogersLisa Rogers
Lisa Rogers
 
The development of web archiving 3
The development of web archiving 3The development of web archiving 3
The development of web archiving 3
 
It takes a Village: Implementing a Homegrown Solution for Streaming Video Res...
It takes a Village: Implementing a Homegrown Solution for Streaming Video Res...It takes a Village: Implementing a Homegrown Solution for Streaming Video Res...
It takes a Village: Implementing a Homegrown Solution for Streaming Video Res...
 
The workflows for the ingest of digital objects into a repository/digital l...
The workflows for the ingest of  digital objects into a repository/digital l...The workflows for the ingest of  digital objects into a repository/digital l...
The workflows for the ingest of digital objects into a repository/digital l...
 
Olympya web-tools 2011
Olympya web-tools 2011Olympya web-tools 2011
Olympya web-tools 2011
 
Caplan and York, 'What It Takes To Make It Last: E-Resources Preservation"
Caplan and York, 'What It Takes To Make It Last:  E-Resources Preservation"Caplan and York, 'What It Takes To Make It Last:  E-Resources Preservation"
Caplan and York, 'What It Takes To Make It Last: E-Resources Preservation"
 
Bertenthal
BertenthalBertenthal
Bertenthal
 
OpenStack Swift In the Enterprise
OpenStack Swift In the EnterpriseOpenStack Swift In the Enterprise
OpenStack Swift In the Enterprise
 

Mais de National Library of Australia

Publicity and media - Anna Gressier & Sarah Kleven (Communications and Market...
Publicity and media - Anna Gressier & Sarah Kleven (Communications and Market...Publicity and media - Anna Gressier & Sarah Kleven (Communications and Market...
Publicity and media - Anna Gressier & Sarah Kleven (Communications and Market...National Library of Australia
 
CHG recipient case study - Julia Mant of the National Institute of Dramatic Art
CHG recipient case study - Julia Mant of the National Institute of Dramatic ArtCHG recipient case study - Julia Mant of the National Institute of Dramatic Art
CHG recipient case study - Julia Mant of the National Institute of Dramatic ArtNational Library of Australia
 
Just Digitise It - Daniel Wilksch of the Public Records Office Victoria
Just Digitise It - Daniel Wilksch of the Public Records Office VictoriaJust Digitise It - Daniel Wilksch of the Public Records Office Victoria
Just Digitise It - Daniel Wilksch of the Public Records Office VictoriaNational Library of Australia
 
Trove - a window to our community heritage - Hilary Berthon of Trove, NLA
Trove - a window to our community heritage - Hilary Berthon of Trove, NLATrove - a window to our community heritage - Hilary Berthon of Trove, NLA
Trove - a window to our community heritage - Hilary Berthon of Trove, NLANational Library of Australia
 
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...National Library of Australia
 
Assessing Significance and Significance 2.0: an introduction - Margaret Birt...
 Assessing Significance and Significance 2.0: an introduction - Margaret Birt... Assessing Significance and Significance 2.0: an introduction - Margaret Birt...
Assessing Significance and Significance 2.0: an introduction - Margaret Birt...National Library of Australia
 
Assessing the significance of cultural heritage - Tania Cleary
Assessing the significance of cultural heritage - Tania ClearyAssessing the significance of cultural heritage - Tania Cleary
Assessing the significance of cultural heritage - Tania ClearyNational Library of Australia
 
Publicity, Media & Completing your CHG project - 2017 - Fran D'Castro
Publicity, Media & Completing your CHG project - 2017 - Fran D'CastroPublicity, Media & Completing your CHG project - 2017 - Fran D'Castro
Publicity, Media & Completing your CHG project - 2017 - Fran D'CastroNational Library of Australia
 
Just Digitise It - Daniel Wilksch of the Public Records Office Victoria
Just Digitise It - Daniel Wilksch of the Public Records Office VictoriaJust Digitise It - Daniel Wilksch of the Public Records Office Victoria
Just Digitise It - Daniel Wilksch of the Public Records Office VictoriaNational Library of Australia
 
TROVE - a window to our community heritage - Hilary Berthon of Trove, NLA
TROVE - a window to our community heritage - Hilary Berthon of Trove, NLATROVE - a window to our community heritage - Hilary Berthon of Trove, NLA
TROVE - a window to our community heritage - Hilary Berthon of Trove, NLANational Library of Australia
 
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...
Disaster Prevention, Preparedness, Response and Recovery for Collections - Ki...National Library of Australia
 
CHG recipient case study - Donna Bailey of the Catholic Diocese of Sandhurst
CHG recipient case study - Donna Bailey of the Catholic Diocese of SandhurstCHG recipient case study - Donna Bailey of the Catholic Diocese of Sandhurst
CHG recipient case study - Donna Bailey of the Catholic Diocese of SandhurstNational Library of Australia
 
Assessing the significance of cultural heritage - Tania Cleary
Assessing the significance of cultural heritage - Tania ClearyAssessing the significance of cultural heritage - Tania Cleary
Assessing the significance of cultural heritage - Tania ClearyNational Library of Australia
 
Significance Assessment and Significance 2.0: an introduction - Veronica Bull...
Significance Assessment and Significance 2.0: an introduction - Veronica Bull...Significance Assessment and Significance 2.0: an introduction - Veronica Bull...
Significance Assessment and Significance 2.0: an introduction - Veronica Bull...National Library of Australia
 
Just digitise it - Daniel Wilksch of the Public Records Office Victoria
Just digitise it - Daniel Wilksch of the Public Records Office VictoriaJust digitise it - Daniel Wilksch of the Public Records Office Victoria
Just digitise it - Daniel Wilksch of the Public Records Office VictoriaNational Library of Australia
 

Internet content as research data

  • 1. Internet Content as Research Data Digital Humanities Australia March 2012, Canberra Monica Omodei & Gordon Mohr
  • 2. Research Examples •  Social networking •  Lexicography •  Linguistics •  Network Science •  Political Science •  Media Studies •  Contemporary history
  • 3. Common Collection Strategies • Crawl Scope & Focus 1) Thematic/Topical (elections, events, global warming…) 2) Resource-specific (video, PDF, etc.) 3) Broad survey (domain wide for .com/.net/.org/.edu/.gov) 4) Exhaustive (end of life, closure crawls, national domains) 5) Frequency-Based • Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia
  • 4. Existing web archives • Internet Archive • Common Crawl • Pandora Archive • Internet Memory Foundation Archive • Other national archives • Research, University Library archives
  • 5. Internet Archive’s Web Archive Positives –  Very broad – 175+ billion web instances –  Historic – started 1996 –  Publicly accessible –  Time-based URL search –  API access –  Not constrained by legislation – covered by fair use and fast take-down response
  • 6. Internet Archive’s Web Archive Negatives – Because of size, can’t search by keyword – Because of size, fully automated - QA not possible
  • 7. Common Use Cases for IA’s web archive • Content discovery • Nostalgia queries • Web site restoration and file recovery • Domain name valuation • Collaborative R&D • Prior art analysis and patent/copyright infringement research • Legal cases • Topic analysis, web trends analysis, popularity analysis
  • 8.
  • 9.
  • 10.
  • 11. Common Crawl • Non-profit foundation building an open crawl of the web to seed research and innovation • Currently 5 billion pages • Stored on Amazon’s S3 • Accessible via MapReduce processing in Amazon’s EC2 compute cloud • Wholesale extraction, transformation, and analysis of web data cheap and easy • commoncrawl.org/data/accessing-the-data/
  • 12. Common Crawl Negatives • Not designed for human browsing but for machine access • Objective is to support large-scale analysis and text mining/indexing – not long-term preservation • Some costs are involved for direct extraction of data from S3 storage using Requester-Pays API
  • 13. Pandora Archive • Positives – Quality checked – Targeted Australian content with selection policy – Historical – started 1996 – Bibliocentric approach – web sites/publications selected for archiving are catalogued (see Trove) – Keyword search – Publicly accessible – You can nominate Australian web sites for inclusion - pandora.nla.gov.au/registration_form.html
  • 14.
  • 15. Pandora Archive • Negatives – labour intensive so small – significant content missed because permission to copy refused • Situation will improve markedly if Legal Deposit provisions extended to digital publications • Broader coverage will be achieved when infrastructure is upgraded hence reducing labour costs for checking/fixing crawls
  • 16. Pandora Archive Stats • Size – 6.32 TB • Number of Files > 140 million • Number of ‘titles’ > 30.5K • Number of title instances > 73.5K
  • 17.
  • 18.
  • 19.
  • 20.
  • 21. .au Domain Annual Snapshots • Annual crawls since 2005 commissioned from Internet Archive • Includes sites on servers located in Australia as well as .au domain • Robots.txt respected except for inline images and stylesheets • No public access – researcher access protocols are being developed • Full text search – tailored to archive search • Separate .gov crawl publicly accessible soon
  • 22. Australian web domain crawls
        Year           2005         2006         2007         2008         2009         2011
        Files          185 million  596 million  516 million  1 billion    765 million  660 million
        Hosts crawled  811,523      1,046,038    1,247,614    3,038,658    1,074,645    1,346,549
        Size (TBs)     6.69         19.04        18.47        34.55        24.29        30.71
  • 23. Internet Memory Foundation Archive • internetmemory.org/en/ • no keyword search yet – only URL • Number of European partners
  • 24.
  • 25. Other National Archives • List of International Internet Preservation Consortium member archives – netpreserve.org/about/archiveList.php • Some are whole domain archives, some are selective archives, many are both • Some have public access, others you will need to negotiate access for research • Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (WARC ISO format)
  • 26. Research Archives • California Digital Library • Harvard University Libraries • Columbia University Libraries • University of North Texas …. and many more • WebCITE - webcitation.org (citation service archive)
  • 27. Bringing Archives Together • Common standard and APIs • Memento project
  • 28. Create your own Archive • Use a subscription service • Build your own archive using open-source crawler Heritrix and standard file format .warc • Use web citation services that create archive copies as you bookmark pages
  • 29. Subscription Services • archive-it.org (service operated by non-profit Internet Archive since 2006) • archivethe.net (service operated by non-profit Internet Memory Foundation) • California Digital Library Web Archiving Service - cdlib.org/services/uc3/was.html • OCLC Harvester Service - oclc.org/webharvester/overview/default.htm
  • 30.
  • 31. Install web archiving system locally • Easy-to-deploy web archiving toolkit not yet available (that meets web archive standards) • Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers – needs IT systems engineers to set up though • Archives can be deposited with the NLA for long-term preservation
  • 32. 'Memento': adding time to the web • Protocol and browser add-on (MementoFox) • Aids discovery, aggregation of page histories
  • 33. Web Data Mining & Analysis – What is it? Why Do It? • Innovation is increasingly driven from Large scale Data Analysis • Need fast iteration to understand the right questions to ask • More minds able to contribute = more value (perceived and real) placed on the importance of the data • Increased demand for/value of the data = more funding to support it • Need to surface the Information amongst all that data…
  • 34. Platform & Toolkit: Overview •  Software –  Apache Hadoop –  Apache Pig •  Data/File format –  WARC –  CDX –  WAT (new!)
  • 35. Apache Hadoop •  HDFS –  Distributed storage –  Durable, default 3x replication –  Scalable: Yahoo! 60+PB HDFS •  MapReduce –  Distributed computation –  You write Java functions –  Hadoop distributes work across cluster –  Tolerates failures
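
    The MapReduce pattern on slide 35 can be illustrated with a small single-process sketch. This is not Hadoop code (the slide notes that Hadoop jobs are written as Java functions); it is plain Python that mimics the map, group-by-key, and reduce phases on a hypothetical list of captured URLs. The function names and sample URLs are assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_record(url):
    """Map step: emit one (host, 1) pair for a captured URL."""
    host = urlsplit(url).hostname or ""
    yield host, 1


def reduce_counts(host, counts):
    """Reduce step: sum all counts emitted for one host."""
    return host, sum(counts)


def run_local(urls):
    """Stand-in for the framework: group mapped pairs by key, then reduce."""
    grouped = defaultdict(list)
    for url in urls:
        for key, value in map_record(url):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical input; a real job would read URLs from archive files.
SAMPLE_URLS = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.com/",
]
HOST_COUNTS = run_local(SAMPLE_URLS)
```

    In a real cluster the framework, not `run_local`, would distribute the map and reduce calls across machines and retry failed tasks.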
  • 36. File formats and data: WARC
  • 37. File formats and data: CDX •  Index for Wayback Machine: used to browse WARC-based archive •  Space-delimited text file •  Only essential metadata needed by Wayback –  URL –  Content Digest –  Capture Timestamp –  Content-Type –  HTTP response code –  etc.
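
    Because a CDX file is space-delimited text, one index line can be split into named fields with very little code. The sketch below assumes a common field order (URL key, capture timestamp, original URL, content type, HTTP response code, content digest, then further fields such as offset and file name); the actual order is declared in each CDX file's header line, so treat this layout and the sample line as assumptions.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    """First six fields of a CDX line, in the order assumed above."""
    urlkey: str
    timestamp: str      # 14-digit capture time, YYYYMMDDhhmmss
    original_url: str
    mime_type: str
    status: str         # kept as text because it may be "-" when unknown
    digest: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one space-delimited CDX line and keep the essential fields."""
    fields = line.strip().split(" ")
    if len(fields) < 6:
        raise ValueError("expected at least 6 space-delimited fields")
    return CdxRecord(*fields[:6])


# Hypothetical index line, not taken from a real archive.
SAMPLE_LINE = (
    "org,example)/ 20120301120000 http://example.org/ text/html 200 "
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 1234 5678 example.warc.gz"
)
RECORD = parse_cdx_line(SAMPLE_LINE)
```

    The trailing fields (here an offset and a WARC file name) are what let a replay tool such as Wayback jump straight to the matching record.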
  • 38. File formats and data: WAT •  Yet Another Metadata Format! ☺ ☹ •  Not preservation format •  Data exchange and analysis •  Less than full WARC, more than CDX •  Essential metadata for many types of analysis •  Avoids barriers to data exchange: copyright, privacy •  Work-in-progress: we want your feedback
  • 39. File formats and data: WAT • WAT is WARC ☺ – WAT records are WARC metadata records – WARC-Refers-To header identifies original WARC record • WAT payload is JSON – Compact – Hierarchical – Supported by every programming environment • File formats & data: – CDX: 53 MB – WAT: 443 MB – WARC: 8,651 MB
  • 40. Some References • http://en.wikipedia.org/wiki/Web_archiving • http://netpreserve.org/about/archiveList.php • Web Archives: The Future(s) - http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
  • 41. Contacts • Webarchive @ nla.gov.au • Secretariat @ internetmemory.org • Queries about the Internet Archive web archive http://iawebarchiving.wordpress.com/ • Queries about Archive-It service http://www.archive-it.org/contact-us • momodei @ nla.gov.au • gojomo @ xavvy.com
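
    Slide 39 describes a WAT record as a WARC metadata record whose WARC-Refers-To header points at the original capture and whose payload is JSON. A minimal sketch of reading such a record follows. The sample record is hypothetical and simplified (a real record also carries headers such as Content-Length), and the JSON keys `url` and `links` are invented for illustration rather than taken from the WAT specification.

```python
import json

# Hypothetical, simplified WAT-style record: WARC headers, blank line, JSON.
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: metadata\r\n"
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"url": "http://example.org/", "links": ["http://example.org/about"]}'
)


def parse_wat_record(text):
    """Return (version line, header dict, decoded JSON payload)."""
    head, _, body = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; header values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return lines[0], headers, json.loads(body)


VERSION, HEADERS, PAYLOAD = parse_wat_record(SAMPLE_RECORD)
```

    Because the payload is ordinary JSON, any language with a JSON parser can consume it without a WARC-specific library once the header block has been separated.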