Presentation at the Big Data interest group at Los Alamos National Laboratory, regarding work with the IIPC to merge the indexes of multiple web archives.
Big Data: Indexing ~50Tb of URIs
1. Using DISC to Reindex ~50Tb of URIs
Robert Sanderson
rsanderson@lanl.gov
@azaroth42
Herbert Van de Sompel
herbertv@lanl.gov
@hvdsomp
http://www.mementoweb.org/
Towards Seamless Navigation of the Web of the Past
Big Data Interest Group, LANL, Feb 21st 2013 1
[Unclassified]
3. Memento Project
“Provide web citizens seamless access
to the web of the past”
Some accolades:
• $1M funding from the Library of Congress
• Winner of 2010 International Digital Preservation Award
• Accepted by International Internet Preservation Consortium
as the way forwards for access to web archives
• Nominated for Best Paper at JCDL 2010
• Best Poster at JCDL 2012
• Tim Berners-Lee @ LDOW10: “We absolutely need this”
• Internet Draft 6 (final call) by May
4. Old Copies have New Names
Everyone knows the URL for CNN is: http://www.cnn.com/
Everyone knows the URL for CNN was: http://www.cnn.com/
5. Old Copies have New Names
Copies of past versions of http://www.cnn.com/ exist
… but you don’t go there to get them, they have new URLs
… and there’s no way to automatically discover those new URLs
http://web.archive.org/web/20031013111028/http://www2.cnn.com/
6. Discovery is Hard
People want to find, not search
… and especially not searching by hand!
http://web.archive.org/web/*/http://www2.cnn.com/
7. Navigating in the Past
Links to the real resource bring you back out of the past
… or trap you in a single incomplete archive.
[Figure labels: "Pentagon"; "Dec 20 2001, 4:51:00 UTC"; "current"]
http://web.archive.org/web/*/http://www2.cnn.com/
8. Memento: Time Travel for the Web
TimeGate: Resource that knows the locations of archived copies
Memento: An archived copy of previous state of a web resource
Link header to TimeGate
302 redirect to Memento
How does the TimeGate know where the appropriate Memento is?
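One way a TimeGate could answer that question is to hold a per-URI list of Memento datetimes and pick the one closest to the requested time. The sketch below assumes that design and is not the project's actual code: `INDEX` and `timegate` are hypothetical names, the second index entry is invented for illustration, and `Accept-Datetime` is the request header used by the Memento protocol.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical merged index: original URI -> list of (datetime, Memento URI).
# The first entry is the archived copy shown on an earlier slide;
# the second entry is made up for illustration.
INDEX = {
    "http://www2.cnn.com/": [
        (datetime(2003, 10, 13, 11, 10, 28, tzinfo=timezone.utc),
         "http://web.archive.org/web/20031013111028/http://www2.cnn.com/"),
        (datetime(2001, 12, 20, 4, 51, 0, tzinfo=timezone.utc),
         "http://archive.example.org/20011220045100/http://www2.cnn.com/"),
    ],
}


def timegate(uri, accept_datetime):
    """Return (status, headers) for a TimeGate request on `uri`."""
    mementos = INDEX.get(uri)
    if not mementos:
        return 404, {}
    # Accept-Datetime carries an HTTP date such as
    # "Thu, 20 Dec 2001 04:51:00 GMT".
    wanted = parsedate_to_datetime(accept_datetime)
    # Choose the Memento whose datetime is closest to the requested one.
    _, location = min(mementos, key=lambda m: abs(m[0] - wanted))
    return 302, {"Location": location}


status, headers = timegate("http://www2.cnn.com/",
                           "Thu, 20 Dec 2001 04:51:00 GMT")
```

The lookup needs only the URI as a key, which is why a merged URI-to-timestamps index is enough to drive the redirect.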
9. Project Overview
Goal:
• To aggregate the metadata of the distributed archives
of the IIPC, and
• To provide Memento based access to the holdings
of open archives
• To provide knowledge of the holdings of restricted
archives
• To provide knowledge to IIPC members of the
holdings of totally closed archives
10. Experiment Participants
• Austrian National Library
• Bibliothèque Nationale de France
• British Library
• Institut National de l'Audiovisuel
• Internet Archive
• Koninklijke Bibliotheek
• Library of Congress
• Netarchive.dk
• Swiss National Library
• University of North Texas
• Los Alamos National Laboratory
• Old Dominion University
11. Data
Field          Value
Canonical URL  kosovakosovo.com/photo.php?id=5785
Datestamp      20090608161553
Request URL    http://www.kosovakosovo.com/photo.php?id=5785
MIME type      text/html
HTTP Status    200
Checksum       M36MRHSBVPLKMUN6PFOIEV3AH5ADITAN
Redirect?      -
Bytes          563096
Storage File   AIT-1068-20090608161511.warc.gz
Records like this add up to 6Tb compressed, ~50Tb uncompressed
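A record like the one above can be read into a typed structure with a few lines of Python. This is a minimal sketch that assumes the nine fields appear on one line, separated by whitespace, in the order the slide lists them. The field names are adapted from the slide's labels, and `CdxRecord` and `parse_cdx_line` are hypothetical names.

```python
from collections import namedtuple
from datetime import datetime

# Field names follow the slide's labels; order and separator are assumed.
CdxRecord = namedtuple("CdxRecord", [
    "canonical_url", "datestamp", "request_url", "mime_type",
    "http_status", "checksum", "redirect", "num_bytes", "storage_file",
])


def parse_cdx_line(line):
    """Split one whitespace-separated CDX line into a CdxRecord."""
    fields = line.split()
    if len(fields) != len(CdxRecord._fields):
        raise ValueError("expected %d fields, got %d"
                         % (len(CdxRecord._fields), len(fields)))
    rec = CdxRecord(*fields)
    # Convert the 14-digit datestamp and the numeric fields.
    return rec._replace(
        datestamp=datetime.strptime(rec.datestamp, "%Y%m%d%H%M%S"),
        http_status=int(rec.http_status),
        num_bytes=int(rec.num_bytes),
    )


line = ("kosovakosovo.com/photo.php?id=5785 20090608161553 "
        "http://www.kosovakosovo.com/photo.php?id=5785 text/html 200 "
        "M36MRHSBVPLKMUN6PFOIEV3AH5ADITAN - 563096 "
        "AIT-1068-20090608161511.warc.gz")
record = parse_cdx_line(line)
```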
12. The Plan…
• To provide fast access to distributed archives, LANL
would merge the indexes of the holdings of multiple
archives and provide Memento based access
• Step 1: Library of Congress gathers CDX files
Step 2: LANL indexes (a.k.a. “…”)
Step 3: Profit
• Data: 6T of gzipped CDX files (mostly from IA)
• Shipped on hard drives
• Computing: 210 node DISC cluster at LANL
• 2x 2GHz processors, 2x 2T HDD, 8G RAM
13. … and the Reality
• Hardware failure killed one of the drives en route
• Transferred remaining files via BagIt from LoC
• DISC has restricted access:
• Had to transfer data over intranet
• 2 weeks to sync (5MB/sec)
• And then 2 weeks to get the processed results off
• Compute cluster has faulty switch, unreliable nodes:
• Ran original processing 15 times without success
due to hardware failures
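The two-week sync figure can be checked with simple arithmetic. The sketch below assumes decimal units and reads the quoted transfer rate as 5 megabytes per second; read as 5 megabits per second, the same calculation gives roughly 111 days, which does not match the two weeks reported. `transfer_days` is a hypothetical helper.

```python
TB = 10**12  # decimal terabyte (assumed)
MB = 10**6   # decimal megabyte (assumed)
SECONDS_PER_DAY = 86400


def transfer_days(total_bytes, bytes_per_second):
    """Days needed to move total_bytes at a steady rate."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


# 6T of gzipped CDX files at 5 megabytes per second.
days = transfer_days(6 * TB, 5 * MB)
print(round(days, 1))  # 13.9
```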
14. Processing Design
• For each CDX file,
• For each URI + timestamps,
• Map URI to an appropriate database slice
• Merge timestamps with those of previous CDXs
• Possible because:
• No need to do truncated search
• No need to walk through URIs in order
• No need for time based access, only URI
• Problem is “Embarrassingly Parallel”
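The design above can be sketched in a few lines. This is an illustration under stated assumptions, not the code that ran on the cluster: the hash function (MD5) and the in-memory dictionaries are choices made here, the slice count of 3000 is borrowed from the later "Approach 2" slides, and the second timestamp is invented to show a merge.

```python
import hashlib
from collections import defaultdict

N_SLICES = 3000  # slice count taken from the "Approach 2" slides


def slice_for(uri, n_slices=N_SLICES):
    """Map a URI to a slice number using a stable hash (MD5 is assumed)."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_slices


def merge(slices, records):
    """Add (uri, timestamp) pairs from one CDX file to the slices."""
    for uri, timestamp in records:
        slices[slice_for(uri)][uri].add(timestamp)


# slices[slice_number][uri] -> set of timestamps
slices = defaultdict(lambda: defaultdict(set))

uri = "kosovakosovo.com/photo.php?id=5785"
merge(slices, [(uri, "20090608161553")])   # from one CDX file
merge(slices, [(uri, "20100101000000")])   # hypothetical later capture
```

Because a URI always hashes to the same slice, each slice can be built and merged without looking at any other slice, which is what makes the problem embarrassingly parallel.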
15. Approach 1: Online Messaging
• 25 read nodes, 1 control node, 150 write nodes
• Messages (1000 URIs each) sent via control node to write nodes
• Failed 15 times due to hardware issues
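The batching step of this approach, grouping URIs into messages of 1000, might look like the following. It is a sketch only: it leaves out the MPI transport entirely, and `batches` is a hypothetical helper name.

```python
from itertools import islice


def batches(uris, size=1000):
    """Yield successive lists of up to `size` items from an iterable."""
    it = iter(uris)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# 2500 placeholder URIs become three messages.
messages = list(batches(("uri-%d" % i for i in range(2500)), size=1000))
```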
16. Approach 1: Implementation
• Python
• MPI ☹
• PyMPI ☹☹
• No failure detection, no auto restart, no zombie killing,
no … ☹☹☹
• Several months of experiments to find issues and fine-tune
the parameters most likely to let a run complete…
17. Approach 2: No Interaction
• 43 read/split nodes
• Phase 1: Read nodes split CDX files to 3000 slices
18. Approach 2: No Interaction
• Phase 2a: Transfer CDX slices to Control node
• Phase 2b: Transfer CDX slices to Write nodes
19. Approach 2: No Interaction
• 50 write nodes (* 60 slices each = 3000 slices)
• Phase 3: Merge slices from nodes to BerkeleyDBs
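The merge phase can be illustrated with a small key-value store. The sketch below uses Python's standard-library `dbm.dumb` as a stand-in for BerkeleyDB, and assumes each slice line has the form "uri timestamp". Both are choices made here for illustration, as are the names `node_for_slice` and `merge_slice` and the second timestamp.

```python
import dbm.dumb
import os
import tempfile

SLICES_PER_NODE = 60  # 50 write nodes * 60 slices each = 3000 slices


def node_for_slice(slice_id):
    """Assign consecutive blocks of 60 slices to each write node."""
    return slice_id // SLICES_PER_NODE


def merge_slice(db_path, lines):
    """Merge 'uri timestamp' lines into a store keyed by URI."""
    with dbm.dumb.open(db_path, "c") as db:
        for line in lines:
            uri, timestamp = line.split()
            key = uri.encode("utf-8")
            # Load timestamps already stored for this URI, if any.
            known = set(db[key].decode("ascii").split()) if key in db else set()
            known.add(timestamp)
            db[key] = " ".join(sorted(known)).encode("ascii")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "slice-0042")
    # Lines for the same slice arriving from two different read nodes.
    merge_slice(path, ["kosovakosovo.com/photo.php?id=5785 20090608161553"])
    merge_slice(path, ["kosovakosovo.com/photo.php?id=5785 20100101000000"])
    with dbm.dumb.open(path, "r") as db:
        merged = db[b"kosovakosovo.com/photo.php?id=5785"].decode("ascii")
```

Each write node only touches its own 60 databases, so no coordination between nodes is needed during the merge.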
20. Next Steps
• Re-index, using non-interactive approach
• New data: 14 Tb (~120Tb uncompressed?)
• Use FileMap for better automation?
• Two eSATA 8T LaCie cubes and two 3T internal drives to
install locally
• More intelligent data partitioning
• Some sort of error detection and handling!
21. Memento
http://mementoweb.org/
Robert Sanderson
rsanderson@lanl.gov
@azaroth42
Herbert Van de Sompel
herbertv@lanl.gov
@hvdsomp