SlideShare uma empresa Scribd logo
1 de 21
Baixar para ler offline
Using DISC to Reindex ~50Tb of URIs


                                   Robert Sanderson
                                              rsanderson@lanl.gov
                                              @azaroth42

                                   Herbert Van de Sompel
                                              herbertv@lanl.gov
                                              @hvdsomp

                                     http://www.mementoweb.org/



Towards Seamless Navigation of the Web of the Past
            Big Data Interest Group, LANL, Feb 21st 2013            1
                            [Unclassified]
Summary


Memento

Project Overview
Participants

The Plan …
… and the Reality

Technical Details

Next Run


             Big Data Interest Group, LANL, Feb 21st 2013   2
                             [Unclassified]
Memento Project



   “       Provide web citizens seamless access
                   to the web of the past                      ”
Some accolades:
•  $1M funding from the Library of Congress
•  Winner of 2010 International Digital Preservation Award
•  Accepted by International Internet Preservation Consortium
      as the way forwards for access to web archives
•  Nominated for Best Paper at JCDL 2010
•  Best Poster at JCDL 2012
•  Tim Berners-Lee @ LDOW10: “We absolutely need this”
•  Internet Draft 6 (final call) by May




                Big Data Interest Group, LANL, Feb 21st 2013       3
                                [Unclassified]
Old Copies have New Names
Everyone knows the URL for CNN is:               http://www.cnn.com/
Everyone knows the URL for CNN was:              http://www.cnn.com/




               Big Data Interest Group, LANL, Feb 21st 2013            4
                               [Unclassified]
Old Copies have New Names
Copies of the past versions of to http://www.cnn.com/ exist
… but you don’t go there to get them, they have new URLs
… and there’s no way to automatically discover those new URLs




          http://web.archive.org/web/20031013111028/http://www2.cnn.com/


                  Big Data Interest Group, LANL, Feb 21st 2013             5
                                  [Unclassified]
Discovery is Hard
People want to find, not search
… and especially not searching by hand!




               http://web.archive.org/web/*/http://www2.cnn.com/


                Big Data Interest Group, LANL, Feb 21st 2013       6
                                [Unclassified]
Navigating in the Past
Links to the real resource bring you back out of the past
… or trap you in a single incomplete archive.




                           Pentagon




     Dec 20 2001, 4:51:00 UTC                                          current
                   http://web.archive.org/web/*/http://www2.cnn.com/


                    Big Data Interest Group, LANL, Feb 21st 2013                 7
                                    [Unclassified]
Memento: Time Travel for the Web
  TimeGate: Resource that knows the locations of archived copies
  Memento: An archived copy of previous state of a web resource

Link header to TimeGate
                           302 redirect to Memento




  How does the TimeGate know where the appropriate Memento is?


                  Big Data Interest Group, LANL, Feb 21st 2013           8
                                  [Unclassified]
Project Overview

Goal:
•  To aggregate the metadata of the distributed archives
   of the IIPC, and
    •  To provide Memento based access to the holdings
       of open archives
    •  To provide knowledge of the holdings of restricted
       archives
    •  To provide knowledge to IIPC members of the
       holdings of totally closed archives




              Big Data Interest Group, LANL, Feb 21st 2013   9
                              [Unclassified]
Experiment Participants

•    Austrian National Library
•    Bibliothèque Nationale de France
•    British Library
•    Institut National de l'Audiovisuel
•    Internet Archive
•    Koninklijke Bibliotheek
•    Library of Congress
•    Netarchive.dk
•    Swiss National Library
•    University of North Texas

• Los Alamos National Laboratory
• Old Dominion University

                Big Data Interest Group, LANL, Feb 21st 2013   10
                                [Unclassified]
Data

Canonical URL     kosovakosovo.com/photo.php?id=5785
   Datestamp      20090608161553
 Request URL      http://www.kosovakosovo.com/photo.php?id=5785
   MIME type      text/html
  HTTP Status     200
   Checksum       M36MRHSBVPLKMUN6PFOIEV3AH5ADITAN
    Redirect?     -
        Bytes     563096

  Storage File    AIT-1068-20090608161511.warc.gz
                  

           Multiplied by 6Tb compressed, ~50Tb uncompressed



                   Big Data Interest Group, LANL, Feb 21st 2013   11
                                   [Unclassified]
The Plan…

•  To provide fast access to distributed archives, LANL
   would merge the indexes of the holdings of multiple
   archives and provide Memento based access

•  Step 1: Library of Congress gathers CDX files
   Step 2: LANL indexes (a.k.a. “…”)
   Step 3: Profit

•  Data: 6T of gzipped CDX files (mostly from IA)
    •  Shipped on hard drives
•  Computing: 210 node DISC cluster at LANL
    •  2x 2ghz processors, 2x 2T HDD, 8G RAM


              Big Data Interest Group, LANL, Feb 21st 2013   12
                              [Unclassified]
… and the Reality

•  Hardware failure killed one of the drives en route
    •  Transferred remaining files via BagIt from LoC

•  DISC has restricted access:
    •  Had to transfer data over intranet
    •  2 weeks to sync (5Mb/sec)
    •  And then 2 weeks to get the processed results off

•  Compute cluster has faulty switch, unreliable nodes:
    •  Ran original processing 15 times without success
       due to hardware failures



              Big Data Interest Group, LANL, Feb 21st 2013   13
                              [Unclassified]
Processing Design

•  For each CDX file,
    •  For each URI + timestamps,
        •  Map URI to an appropriate database slice
        •  Merge timestamps with those of previous CDXs

•  Possible because:
    •  No need to do truncated search
    •  No need to walk through URIs in order
    •  No need for time based access, only URI

•  Problem is “Embarrassingly Parallel”



              Big Data Interest Group, LANL, Feb 21st 2013   14
                              [Unclassified]
Approach 1: Online Messaging

•  25 read nodes, 1 control node, 150 write nodes
•  Messages (1000 URIs) sent via control node to write




•  Failed 15 times due to hardware issues
             Big Data Interest Group, LANL, Feb 21st 2013   15
                             [Unclassified]
Approach 1: Implementation

•  Python

•  MPI        L
•  PyMPI       LL
•  No failure detection, no auto restart, no zombie killing,
   no …        LLL
•  Several months of experiments to find issues, fine tune
   parameters most likely to complete…




              Big Data Interest Group, LANL, Feb 21st 2013     16
                              [Unclassified]
Approach 2: No Interaction

•  43 read/split nodes
•  Phase 1: Read nodes split CDX files to 3000 slices




              Big Data Interest Group, LANL, Feb 21st 2013   17
                              [Unclassified]
Approach 2: No Interaction

•  Phase 2a: Transfer CDX slices to Control node
•  Phase 2b: Transfer CDX slices to Write nodes




             Big Data Interest Group, LANL, Feb 21st 2013   18
                             [Unclassified]
Approach 2: No Interaction

•  50 write nodes (* 60 slices each = 3000 slices)
•  Phase 3: Merge slices from nodes to BerkeleyDBs




             Big Data Interest Group, LANL, Feb 21st 2013   19
                             [Unclassified]
Next Steps


•  Re-index, using non-interactive approach
    •  New data: 14 Tb (~120Tb uncompressed?)
    •  Use FileMap for better automation?

•  Two eSata 8T LaCie cubes, 2 3T internal drives to
   install locally

•  More intelligent data partitioning
•  Some sort of error detection and handling!




              Big Data Interest Group, LANL, Feb 21st 2013   20
                              [Unclassified]
Memento

                              http://mementoweb.org/

                                          Robert Sanderson
                                        rsanderson@lanl.gov
                                                @azaroth42

                                     Herbert Van de Sompel
                                         herbertv@lanl.gov
                                                @hvdsomp


Big Data Interest Group, LANL, Feb 21st 2013              21
                [Unclassified]

Mais conteúdo relacionado

Destaque

Analyzing the Persistence of Referenced Web Resources with Memento
Analyzing the Persistence of Referenced Web Resources with MementoAnalyzing the Persistence of Referenced Web Resources with Memento
Analyzing the Persistence of Referenced Web Resources with MementoRobert Sanderson
 
Linked Data: Building Standards and Communities
Linked Data: Building Standards and CommunitiesLinked Data: Building Standards and Communities
Linked Data: Building Standards and CommunitiesRobert Sanderson
 
Transcending Silos: Shared Canvas Data Model for Digital Facsimiles
Transcending Silos: Shared Canvas Data Model for Digital FacsimilesTranscending Silos: Shared Canvas Data Model for Digital Facsimiles
Transcending Silos: Shared Canvas Data Model for Digital FacsimilesRobert Sanderson
 
No uso de las TIC en el Aula
No uso de las TIC en el AulaNo uso de las TIC en el Aula
No uso de las TIC en el AulaAndrés Mov
 
Erika Pricyla Cerino HernáNdez
Erika Pricyla Cerino HernáNdezErika Pricyla Cerino HernáNdez
Erika Pricyla Cerino HernáNdezguest1cc234
 
SharedCanvas: A Collaborative Model for Medieval Manuscript Layout Dissemina...
SharedCanvas: A Collaborative Model for Medieval Manuscript Layout Dissemina...SharedCanvas: A Collaborative Model for Medieval Manuscript Layout Dissemina...
SharedCanvas: A Collaborative Model for Medieval Manuscript Layout Dissemina...Robert Sanderson
 
NLLC 2011: Memento, Open Annotation, SharedCanvas
NLLC 2011: Memento, Open Annotation, SharedCanvasNLLC 2011: Memento, Open Annotation, SharedCanvas
NLLC 2011: Memento, Open Annotation, SharedCanvasRobert Sanderson
 
Dit Heb Je Nog Nooit Gezien
Dit Heb Je Nog Nooit GezienDit Heb Je Nog Nooit Gezien
Dit Heb Je Nog Nooit Gezienguest6964ce
 
W3C Open Annotation: Status and Use Cases
W3C Open Annotation: Status and Use CasesW3C Open Annotation: Status and Use Cases
W3C Open Annotation: Status and Use CasesRobert Sanderson
 
Making Web Annotations Persistent over Time
Making Web Annotations Persistent over TimeMaking Web Annotations Persistent over Time
Making Web Annotations Persistent over TimeRobert Sanderson
 
NISO Annotation Meeting (San Francisco)
NISO Annotation Meeting (San Francisco)NISO Annotation Meeting (San Francisco)
NISO Annotation Meeting (San Francisco)Robert Sanderson
 
W3C Web Annotation WG Update (I Annotate 2016)
W3C Web Annotation WG Update (I Annotate 2016)W3C Web Annotation WG Update (I Annotate 2016)
W3C Web Annotation WG Update (I Annotate 2016)Robert Sanderson
 
IIIF Overview for Linked Data Exhibitions
IIIF Overview for Linked Data ExhibitionsIIIF Overview for Linked Data Exhibitions
IIIF Overview for Linked Data ExhibitionsRobert Sanderson
 
Annotating Scholarly Works - the W3C Open Annotation Model
Annotating Scholarly Works - the W3C Open Annotation ModelAnnotating Scholarly Works - the W3C Open Annotation Model
Annotating Scholarly Works - the W3C Open Annotation ModelRobert Sanderson
 

Destaque (20)

Analyzing the Persistence of Referenced Web Resources with Memento
Analyzing the Persistence of Referenced Web Resources with MementoAnalyzing the Persistence of Referenced Web Resources with Memento
Analyzing the Persistence of Referenced Web Resources with Memento
 
Linked Data: Building Standards and Communities
Linked Data: Building Standards and CommunitiesLinked Data: Building Standards and Communities
Linked Data: Building Standards and Communities
 
Transcending Silos: Shared Canvas Data Model for Digital Facsimiles
Transcending Silos: Shared Canvas Data Model for Digital FacsimilesTranscending Silos: Shared Canvas Data Model for Digital Facsimiles
Transcending Silos: Shared Canvas Data Model for Digital Facsimiles
 
No uso de las TIC en el Aula
No uso de las TIC en el AulaNo uso de las TIC en el Aula
No uso de las TIC en el Aula
 
Erika Pricyla Cerino HernáNdez
Erika Pricyla Cerino HernáNdezErika Pricyla Cerino HernáNdez
Erika Pricyla Cerino HernáNdez
 
SharedCanvas: A Collaborative Model for Medieval Manuscript Layout Dissemina...
SharedCanvas: A Collaborative Model for Medieval Manuscript Layout Dissemina...SharedCanvas: A Collaborative Model for Medieval Manuscript Layout Dissemina...
SharedCanvas: A Collaborative Model for Medieval Manuscript Layout Dissemina...
 
Niso Annotation Webinar
Niso Annotation WebinarNiso Annotation Webinar
Niso Annotation Webinar
 
Python Web Interaction
Python Web InteractionPython Web Interaction
Python Web Interaction
 
NLLC 2011: Memento, Open Annotation, SharedCanvas
NLLC 2011: Memento, Open Annotation, SharedCanvasNLLC 2011: Memento, Open Annotation, SharedCanvas
NLLC 2011: Memento, Open Annotation, SharedCanvas
 
Dit Heb Je Nog Nooit Gezien
Dit Heb Je Nog Nooit GezienDit Heb Je Nog Nooit Gezien
Dit Heb Je Nog Nooit Gezien
 
W3C Open Annotation: Status and Use Cases
W3C Open Annotation: Status and Use CasesW3C Open Annotation: Status and Use Cases
W3C Open Annotation: Status and Use Cases
 
Making Web Annotations Persistent over Time
Making Web Annotations Persistent over TimeMaking Web Annotations Persistent over Time
Making Web Annotations Persistent over Time
 
NISO Annotation Meeting (San Francisco)
NISO Annotation Meeting (San Francisco)NISO Annotation Meeting (San Francisco)
NISO Annotation Meeting (San Francisco)
 
W3C Web Annotation WG Update (I Annotate 2016)
W3C Web Annotation WG Update (I Annotate 2016)W3C Web Annotation WG Update (I Annotate 2016)
W3C Web Annotation WG Update (I Annotate 2016)
 
IIIF Presentation API
IIIF Presentation API IIIF Presentation API
IIIF Presentation API
 
IIIF Overview for Linked Data Exhibitions
IIIF Overview for Linked Data ExhibitionsIIIF Overview for Linked Data Exhibitions
IIIF Overview for Linked Data Exhibitions
 
Annotating Scholarly Works - the W3C Open Annotation Model
Annotating Scholarly Works - the W3C Open Annotation ModelAnnotating Scholarly Works - the W3C Open Annotation Model
Annotating Scholarly Works - the W3C Open Annotation Model
 
Hemoptysis jack
Hemoptysis jackHemoptysis jack
Hemoptysis jack
 
Lactate by jack.
Lactate by jack.Lactate by jack.
Lactate by jack.
 
Pneumothorax ..jack
Pneumothorax ..jackPneumothorax ..jack
Pneumothorax ..jack
 

Semelhante a Big Data: Indexing ~50Tb of URIs

Big Data - architectural concerns for the new age
Big Data - architectural concerns for the new ageBig Data - architectural concerns for the new age
Big Data - architectural concerns for the new ageDebasish Ghosh
 
Spectra Logic BlackPearl Developer Summit 2015
Spectra Logic BlackPearl Developer Summit 2015Spectra Logic BlackPearl Developer Summit 2015
Spectra Logic BlackPearl Developer Summit 2015spectralogic
 
IASSIT Kansa Presentation
IASSIT Kansa PresentationIASSIT Kansa Presentation
IASSIT Kansa Presentationekansa
 
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...Imperva Incapsula
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and HadoopSalil Navgire
 
What is New in W3C land?
What is New in W3C land?What is New in W3C land?
What is New in W3C land?Ivan Herman
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
 
Network Engineering for High Speed Data Sharing
Network Engineering for High Speed Data SharingNetwork Engineering for High Speed Data Sharing
Network Engineering for High Speed Data SharingGlobus
 
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...Future Perfect 2012
 
Labours of Love & Convenience - Open Repositories 2018
Labours of Love & Convenience - Open Repositories 2018Labours of Love & Convenience - Open Repositories 2018
Labours of Love & Convenience - Open Repositories 2018Stefano Cossu
 
Linked Data from a Digital Object Management System
Linked Data from a Digital Object Management SystemLinked Data from a Digital Object Management System
Linked Data from a Digital Object Management SystemUldis Bojars
 
Honey on the Wire KohaCon18
Honey on the Wire  KohaCon18Honey on the Wire  KohaCon18
Honey on the Wire KohaCon18Joy Nelson
 
Don't Be Scared. Data Don't Bite. Introduction to Big Data.
Don't Be Scared. Data Don't Bite. Introduction to Big Data.Don't Be Scared. Data Don't Bite. Introduction to Big Data.
Don't Be Scared. Data Don't Bite. Introduction to Big Data.KGMGROUP
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkScrapinghub
 
January 2022: Central Iowa Linux Users Group: Git
January 2022: Central Iowa Linux Users Group: GitJanuary 2022: Central Iowa Linux Users Group: Git
January 2022: Central Iowa Linux Users Group: GitAndrew Denner
 
Pg nordic-day-2014-2 tb-enough
Pg nordic-day-2014-2 tb-enoughPg nordic-day-2014-2 tb-enough
Pg nordic-day-2014-2 tb-enoughRenaud Bruyeron
 
DIGITAL LIBRARIES
DIGITAL LIBRARIESDIGITAL LIBRARIES
DIGITAL LIBRARIESviedma2
 
Case Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataCase Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataSchubert Zhang
 

Semelhante a Big Data: Indexing ~50Tb of URIs (20)

Big Data - architectural concerns for the new age
Big Data - architectural concerns for the new ageBig Data - architectural concerns for the new age
Big Data - architectural concerns for the new age
 
Spectra Logic BlackPearl Developer Summit 2015
Spectra Logic BlackPearl Developer Summit 2015Spectra Logic BlackPearl Developer Summit 2015
Spectra Logic BlackPearl Developer Summit 2015
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
 
IASSIT Kansa Presentation
IASSIT Kansa PresentationIASSIT Kansa Presentation
IASSIT Kansa Presentation
 
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
What is New in W3C land?
What is New in W3C land?What is New in W3C land?
What is New in W3C land?
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Network Engineering for High Speed Data Sharing
Network Engineering for High Speed Data SharingNetwork Engineering for High Speed Data Sharing
Network Engineering for High Speed Data Sharing
 
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
 
afternoon3.pdf
afternoon3.pdfafternoon3.pdf
afternoon3.pdf
 
Labours of Love & Convenience - Open Repositories 2018
Labours of Love & Convenience - Open Repositories 2018Labours of Love & Convenience - Open Repositories 2018
Labours of Love & Convenience - Open Repositories 2018
 
Linked Data from a Digital Object Management System
Linked Data from a Digital Object Management SystemLinked Data from a Digital Object Management System
Linked Data from a Digital Object Management System
 
Honey on the Wire KohaCon18
Honey on the Wire  KohaCon18Honey on the Wire  KohaCon18
Honey on the Wire KohaCon18
 
Don't Be Scared. Data Don't Bite. Introduction to Big Data.
Don't Be Scared. Data Don't Bite. Introduction to Big Data.Don't Be Scared. Data Don't Bite. Introduction to Big Data.
Don't Be Scared. Data Don't Bite. Introduction to Big Data.
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling framework
 
January 2022: Central Iowa Linux Users Group: Git
January 2022: Central Iowa Linux Users Group: GitJanuary 2022: Central Iowa Linux Users Group: Git
January 2022: Central Iowa Linux Users Group: Git
 
Pg nordic-day-2014-2 tb-enough
Pg nordic-day-2014-2 tb-enoughPg nordic-day-2014-2 tb-enough
Pg nordic-day-2014-2 tb-enough
 
DIGITAL LIBRARIES
DIGITAL LIBRARIESDIGITAL LIBRARIES
DIGITAL LIBRARIES
 
Case Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataCase Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of Data
 

Mais de Robert Sanderson

LUX - Cross Collections Cultural Heritage at Yale
LUX - Cross Collections Cultural Heritage at YaleLUX - Cross Collections Cultural Heritage at Yale
LUX - Cross Collections Cultural Heritage at YaleRobert Sanderson
 
Zoom as a Paradigm for Linked Open Usable Data
Zoom as a Paradigm for Linked Open Usable DataZoom as a Paradigm for Linked Open Usable Data
Zoom as a Paradigm for Linked Open Usable DataRobert Sanderson
 
Provenance and Uncertainty in Linked Art
Provenance and Uncertainty in Linked ArtProvenance and Uncertainty in Linked Art
Provenance and Uncertainty in Linked ArtRobert Sanderson
 
Data is our Product: Thoughts on LOD Sustainability
Data is our Product: Thoughts on LOD SustainabilityData is our Product: Thoughts on LOD Sustainability
Data is our Product: Thoughts on LOD SustainabilityRobert Sanderson
 
A Perspective on Wikidata: Ecosystems, Trust, and Usability
A Perspective on Wikidata: Ecosystems, Trust, and UsabilityA Perspective on Wikidata: Ecosystems, Trust, and Usability
A Perspective on Wikidata: Ecosystems, Trust, and UsabilityRobert Sanderson
 
Linked Art: Sustainable Cultural Knowledge through Linked Open Usable Data
Linked Art: Sustainable Cultural Knowledge through Linked Open Usable DataLinked Art: Sustainable Cultural Knowledge through Linked Open Usable Data
Linked Art: Sustainable Cultural Knowledge through Linked Open Usable DataRobert Sanderson
 
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open DataIllusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open DataRobert Sanderson
 
Structural Metadata in RDF (IS575)
Structural Metadata in RDF (IS575)Structural Metadata in RDF (IS575)
Structural Metadata in RDF (IS575)Robert Sanderson
 
Sanderson CNI 2020 Keynote - Cultural Heritage Research Data Ecosystem
Sanderson CNI 2020 Keynote - Cultural Heritage Research Data EcosystemSanderson CNI 2020 Keynote - Cultural Heritage Research Data Ecosystem
Sanderson CNI 2020 Keynote - Cultural Heritage Research Data EcosystemRobert Sanderson
 
Tiers of Abstraction and Audience in Cultural Heritage Data Modeling
Tiers of Abstraction and Audience in Cultural Heritage Data ModelingTiers of Abstraction and Audience in Cultural Heritage Data Modeling
Tiers of Abstraction and Audience in Cultural Heritage Data ModelingRobert Sanderson
 
The Importance of being LOUD
The Importance of being LOUDThe Importance of being LOUD
The Importance of being LOUDRobert Sanderson
 
Introduction to Linked Art Model
Introduction to Linked Art ModelIntroduction to Linked Art Model
Introduction to Linked Art ModelRobert Sanderson
 
Standards and Communities: Connected People, Consistent Data, Usable Applicat...
Standards and Communities: Connected People, Consistent Data, Usable Applicat...Standards and Communities: Connected People, Consistent Data, Usable Applicat...
Standards and Communities: Connected People, Consistent Data, Usable Applicat...Robert Sanderson
 
Strong Opinions, Weakly Held
Strong Opinions, Weakly HeldStrong Opinions, Weakly Held
Strong Opinions, Weakly HeldRobert Sanderson
 
IIIF Discovery Walkthrough
IIIF Discovery WalkthroughIIIF Discovery Walkthrough
IIIF Discovery WalkthroughRobert Sanderson
 
Linked Art: An Art Museum Profile for CIDOC-CRM
Linked Art: An Art Museum Profile for CIDOC-CRMLinked Art: An Art Museum Profile for CIDOC-CRM
Linked Art: An Art Museum Profile for CIDOC-CRMRobert Sanderson
 
Euromed2018 Keynote: Usability over Completeness, Community over Committee
Euromed2018 Keynote: Usability over Completeness, Community over CommitteeEuromed2018 Keynote: Usability over Completeness, Community over Committee
Euromed2018 Keynote: Usability over Completeness, Community over CommitteeRobert Sanderson
 
Linked Art - Our Linked Open Usable Data Model
Linked Art - Our Linked Open Usable Data ModelLinked Art - Our Linked Open Usable Data Model
Linked Art - Our Linked Open Usable Data ModelRobert Sanderson
 
EuropeanaTech Keynote: Shout it out LOUD
EuropeanaTech Keynote: Shout it out LOUDEuropeanaTech Keynote: Shout it out LOUD
EuropeanaTech Keynote: Shout it out LOUDRobert Sanderson
 

Mais de Robert Sanderson (20)

Understanding Linked Art
Understanding Linked ArtUnderstanding Linked Art
Understanding Linked Art
 
LUX - Cross Collections Cultural Heritage at Yale
LUX - Cross Collections Cultural Heritage at YaleLUX - Cross Collections Cultural Heritage at Yale
LUX - Cross Collections Cultural Heritage at Yale
 
Zoom as a Paradigm for Linked Open Usable Data
Zoom as a Paradigm for Linked Open Usable DataZoom as a Paradigm for Linked Open Usable Data
Zoom as a Paradigm for Linked Open Usable Data
 
Provenance and Uncertainty in Linked Art
Provenance and Uncertainty in Linked ArtProvenance and Uncertainty in Linked Art
Provenance and Uncertainty in Linked Art
 
Data is our Product: Thoughts on LOD Sustainability
Data is our Product: Thoughts on LOD SustainabilityData is our Product: Thoughts on LOD Sustainability
Data is our Product: Thoughts on LOD Sustainability
 
A Perspective on Wikidata: Ecosystems, Trust, and Usability
A Perspective on Wikidata: Ecosystems, Trust, and UsabilityA Perspective on Wikidata: Ecosystems, Trust, and Usability
A Perspective on Wikidata: Ecosystems, Trust, and Usability
 
Linked Art: Sustainable Cultural Knowledge through Linked Open Usable Data
Linked Art: Sustainable Cultural Knowledge through Linked Open Usable DataLinked Art: Sustainable Cultural Knowledge through Linked Open Usable Data
Linked Art: Sustainable Cultural Knowledge through Linked Open Usable Data
 
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open DataIllusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
 
Structural Metadata in RDF (IS575)
Structural Metadata in RDF (IS575)Structural Metadata in RDF (IS575)
Structural Metadata in RDF (IS575)
 
Sanderson CNI 2020 Keynote - Cultural Heritage Research Data Ecosystem
Sanderson CNI 2020 Keynote - Cultural Heritage Research Data EcosystemSanderson CNI 2020 Keynote - Cultural Heritage Research Data Ecosystem
Sanderson CNI 2020 Keynote - Cultural Heritage Research Data Ecosystem
 
Tiers of Abstraction and Audience in Cultural Heritage Data Modeling
Tiers of Abstraction and Audience in Cultural Heritage Data ModelingTiers of Abstraction and Audience in Cultural Heritage Data Modeling
Tiers of Abstraction and Audience in Cultural Heritage Data Modeling
 
The Importance of being LOUD
The Importance of being LOUDThe Importance of being LOUD
The Importance of being LOUD
 
Introduction to Linked Art Model
Introduction to Linked Art ModelIntroduction to Linked Art Model
Introduction to Linked Art Model
 
Standards and Communities: Connected People, Consistent Data, Usable Applicat...
Standards and Communities: Connected People, Consistent Data, Usable Applicat...Standards and Communities: Connected People, Consistent Data, Usable Applicat...
Standards and Communities: Connected People, Consistent Data, Usable Applicat...
 
Strong Opinions, Weakly Held
Strong Opinions, Weakly HeldStrong Opinions, Weakly Held
Strong Opinions, Weakly Held
 
IIIF Discovery Walkthrough
IIIF Discovery WalkthroughIIIF Discovery Walkthrough
IIIF Discovery Walkthrough
 
Linked Art: An Art Museum Profile for CIDOC-CRM
Linked Art: An Art Museum Profile for CIDOC-CRMLinked Art: An Art Museum Profile for CIDOC-CRM
Linked Art: An Art Museum Profile for CIDOC-CRM
 
Euromed2018 Keynote: Usability over Completeness, Community over Committee
Euromed2018 Keynote: Usability over Completeness, Community over CommitteeEuromed2018 Keynote: Usability over Completeness, Community over Committee
Euromed2018 Keynote: Usability over Completeness, Community over Committee
 
Linked Art - Our Linked Open Usable Data Model
Linked Art - Our Linked Open Usable Data ModelLinked Art - Our Linked Open Usable Data Model
Linked Art - Our Linked Open Usable Data Model
 
EuropeanaTech Keynote: Shout it out LOUD
EuropeanaTech Keynote: Shout it out LOUDEuropeanaTech Keynote: Shout it out LOUD
EuropeanaTech Keynote: Shout it out LOUD
 

Último

Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Último (20)

Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Big Data: Indexing ~50Tb of URIs

  • 1. Using DISC to Reindex ~50Tb of URIs Robert Sanderson rsanderson@lanl.gov @azaroth42 Herbert Van de Sompel herbertv@lanl.gov @hvdsomp http://www.mementoweb.org/ Towards Seamless Navigation of the Web of the Past Big Data Interest Group, LANL, Feb 21st 2013 1 [Unclassified]
  • 2. Summary Memento Project Overview Participants The Plan … … and the Reality Technical Details Next Run Big Data Interest Group, LANL, Feb 21st 2013 2 [Unclassified]
  • 3. Memento Project “ Provide web citizens seamless access to the web of the past ” Some accolades: •  $1M funding from the Library of Congress •  Winner of 2010 International Digital Preservation Award •  Accepted by International Internet Preservation Consortium as the way forwards for access to web archives •  Nominated for Best Paper at JCDL 2010 •  Best Poster at JCDL 2012 •  Tim Berners-Lee @ LDOW10: “We absolutely need this” •  Internet Draft 6 (final call) by May Big Data Interest Group, LANL, Feb 21st 2013 3 [Unclassified]
  • 4. Old Copies have New Names Everyone knows the URL for CNN is: http://www.cnn.com/ Everyone knows the URL for CNN was: http://www.cnn.com/ Big Data Interest Group, LANL, Feb 21st 2013 4 [Unclassified]
  • 5. Old Copies have New Names Copies of the past versions of to http://www.cnn.com/ exist … but you don’t go there to get them, they have new URLs … and there’s no way to automatically discover those new URLs http://web.archive.org/web/20031013111028/http://www2.cnn.com/ Big Data Interest Group, LANL, Feb 21st 2013 5 [Unclassified]
  • 6. Discovery is Hard People want to find, not search … and especially not searching by hand! http://web.archive.org/web/*/http://www2.cnn.com/ Big Data Interest Group, LANL, Feb 21st 2013 6 [Unclassified]
  • 7. Navigating in the Past Links to the real resource bring you back out of the past … or trap you in a single incomplete archive. Pentagon Dec 20 2001, 4:51:00 UTC current http://web.archive.org/web/*/http://www2.cnn.com/ Big Data Interest Group, LANL, Feb 21st 2013 7 [Unclassified]
  • 8. Memento: Time Travel for the Web TimeGate: Resource that knows the locations of archived copies Memento: An archived copy of previous state of a web resource Link header to TimeGate 302 redirect to Memento How does the TimeGate know where the appropriate Memento is? Big Data Interest Group, LANL, Feb 21st 2013 8 [Unclassified]
  • 9. Project Overview Goal: •  To aggregate the metadata of the distributed archives of the IIPC, and •  To provide Memento based access to the holdings of open archives •  To provide knowledge of the holdings of restricted archives •  To provide knowledge to IIPC members of the holdings of totally closed archives Big Data Interest Group, LANL, Feb 21st 2013 9 [Unclassified]
  • 10. Experiment Participants •  Austrian National Library •  Bibliothèque Nationale de France • British Library • Institut National de l'Audiovisuel • Internet Archive • Koninklijke Bibliotheek • Library of Congress •  Netarchive.dk •  Swiss National Library •  University of North Texas • Los Alamos National Laboratory • Old Dominion University Big Data Interest Group, LANL, Feb 21st 2013 10 [Unclassified]
  • 11. Data Canonical URL kosovakosovo.com/photo.php?id=5785 Datestamp 20090608161553 Request URL http://www.kosovakosovo.com/photo.php?id=5785 MIME type text/html HTTP Status 200 Checksum M36MRHSBVPLKMUN6PFOIEV3AH5ADITAN Redirect? - Bytes 563096 Storage File AIT-1068-20090608161511.warc.gz Multiplied by 6Tb compressed, ~50Tb uncompressed Big Data Interest Group, LANL, Feb 21st 2013 11 [Unclassified]
  • 12. The Plan… •  To provide fast access to distributed archives, LANL would merge the indexes of the holdings of multiple archives and provide Memento based access •  Step 1: Library of Congress gathers CDX files Step 2: LANL indexes (a.k.a. “…”) Step 3: Profit •  Data: 6T of gzipped CDX files (mostly from IA) •  Shipped on hard drives •  Computing: 210 node DISC cluster at LANL •  2x 2ghz processors, 2x 2T HDD, 8G RAM Big Data Interest Group, LANL, Feb 21st 2013 12 [Unclassified]
  • 13. … and the Reality •  Hardware failure killed one of the drives en route •  Transferred remaining files via BagIt from LoC •  DISC has restricted access: •  Had to transfer data over intranet •  2 weeks to sync (5Mb/sec) •  And then 2 weeks to get the processed results off •  Compute cluster has faulty switch, unreliable nodes: •  Ran original processing 15 times without success due to hardware failures Big Data Interest Group, LANL, Feb 21st 2013 13 [Unclassified]
  • 14. Processing Design •  For each CDX file, •  For each URI + timestamps, •  Map URI to an appropriate database slice •  Merge timestamps with those of previous CDXs •  Possible because: •  No need to do truncated search •  No need to walk through URIs in order •  No need for time based access, only URI •  Problem is “Embarrassingly Parallel” Big Data Interest Group, LANL, Feb 21st 2013 14 [Unclassified]
  • 15. Approach 1: Online Messaging •  25 read nodes, 1 control node, 150 write nodes •  Messages (1000 URIs) sent via control node to write •  Failed 15 times due to hardware issues Big Data Interest Group, LANL, Feb 21st 2013 15 [Unclassified]
  • 16. Approach 1: Implementation •  Python •  MPI L •  PyMPI LL •  No failure detection, no auto restart, no zombie killing, no … LLL •  Several months of experiments to find issues, fine tune parameters most likely to complete… Big Data Interest Group, LANL, Feb 21st 2013 16 [Unclassified]
  • 17. Approach 2: No Interaction •  43 read/split nodes •  Phase 1: Read nodes split CDX files to 3000 slices Big Data Interest Group, LANL, Feb 21st 2013 17 [Unclassified]
  • 18. Approach 2: No Interaction •  Phase 2a: Transfer CDX slices to Control node •  Phase 2b: Transfer CDX slices to Write nodes Big Data Interest Group, LANL, Feb 21st 2013 18 [Unclassified]
  • 19. Approach 2: No Interaction •  50 write nodes (* 60 slices each = 3000 slices) •  Phase 3: Merge slices from nodes to BerkeleyDBs Big Data Interest Group, LANL, Feb 21st 2013 19 [Unclassified]
  • 20. Next Steps •  Re-index, using non-interactive approach •  New data: 14 Tb (~120Tb uncompressed?) •  Use FileMap for better automation? •  Two eSata 8T LaCie cubes, 2 3T internal drives to install locally •  More intelligent data partitioning •  Some sort of error detection and handling! Big Data Interest Group, LANL, Feb 21st 2013 20 [Unclassified]
  • 21. Memento http://mementoweb.org/ Robert Sanderson rsanderson@lanl.gov @azaroth42 Herbert Van de Sompel herbertv@lanl.gov @hvdsomp Big Data Interest Group, LANL, Feb 21st 2013 21 [Unclassified]