Presentation at the Big Data interest group at Los Alamos National Laboratory, regarding work with the IIPC to merge the indexes of multiple web archives.
Big Data: Indexing ~50Tb of URIs
1. Using DISC to Reindex ~50Tb of URIs
Robert Sanderson
rsanderson@lanl.gov
@azaroth42
Herbert Van de Sompel
herbertv@lanl.gov
@hvdsomp
http://www.mementoweb.org/
Towards Seamless Navigation of the Web of the Past
Big Data Interest Group, LANL, Feb 21st 2013 1
[Unclassified]
3. Memento Project
“Provide web citizens seamless access
to the web of the past”
Some accolades:
• $1M funding from the Library of Congress
• Winner of 2010 International Digital Preservation Award
• Accepted by International Internet Preservation Consortium
as the way forwards for access to web archives
• Nominated for Best Paper at JCDL 2010
• Best Poster at JCDL 2012
• Tim Berners-Lee @ LDOW10: “We absolutely need this”
• Internet Draft 6 (final call) by May
4. Old Copies have New Names
Everyone knows the URL for CNN is: http://www.cnn.com/
Everyone knows the URL for CNN was: http://www.cnn.com/
5. Old Copies have New Names
Copies of past versions of http://www.cnn.com/ exist
… but you don’t go there to get them, they have new URLs
… and there’s no way to automatically discover those new URLs
http://web.archive.org/web/20031013111028/http://www2.cnn.com/
6. Discovery is Hard
People want to find, not search
… and especially not search by hand!
http://web.archive.org/web/*/http://www2.cnn.com/
7. Navigating in the Past
Links to the real resource bring you back out of the past
… or trap you in a single incomplete archive.
Pentagon
Dec 20 2001, 4:51:00 UTC current
http://web.archive.org/web/*/http://www2.cnn.com/
8. Memento: Time Travel for the Web
TimeGate: Resource that knows the locations of archived copies
Memento: An archived copy of previous state of a web resource
Link header to TimeGate
302 redirect to Memento
How does the TimeGate know where the appropriate Memento is?
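One way to picture the answer: if the TimeGate can look up every known capture of a URI in a merged index, it can pick the capture closest in time to the requested datetime and redirect to it. The following is a minimal sketch under that assumption. The function names, the "closest in time" rule, and the shape of the `captures` list are illustrative and not taken from the project's code.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit datestamp, as seen in archive URIs


def choose_memento(requested, captures):
    """Return the (timestamp, uri) capture closest in time to `requested`.

    `captures` is a list of (14-digit timestamp, memento URI) pairs, standing
    in for what a merged index might return for one original URI.
    """
    if not captures:
        raise LookupError("no archived copies known for this URI")
    target = datetime.strptime(requested, TS_FORMAT)
    return min(
        captures,
        key=lambda c: abs(datetime.strptime(c[0], TS_FORMAT) - target),
    )


def timegate_response(requested, captures):
    """Build the 302 redirect a TimeGate would send: (status, headers)."""
    _, uri = choose_memento(requested, captures)
    return 302, {"Location": uri}
```

The client would then follow the `Location` header to the Memento itself.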
9. Project Overview
Goal:
• To aggregate the metadata of the distributed archives
of the IIPC, and
• To provide Memento based access to the holdings
of open archives
• To provide knowledge of the holdings of restricted
archives
• To provide knowledge to IIPC members of the
holdings of totally closed archives
10. Experiment Participants
• Austrian National Library
• Bibliothèque Nationale de France
• British Library
• Institut National de l'Audiovisuel
• Internet Archive
• Koninklijke Bibliotheek
• Library of Congress
• Netarchive.dk
• Swiss National Library
• University of North Texas
• Los Alamos National Laboratory
• Old Dominion University
11. Data
Canonical URL kosovakosovo.com/photo.php?id=5785
Datestamp 20090608161553
Request URL http://www.kosovakosovo.com/photo.php?id=5785
MIME type text/html
HTTP Status 200
Checksum M36MRHSBVPLKMUN6PFOIEV3AH5ADITAN
Redirect? -
Bytes 563096
Storage File AIT-1068-20090608161511.warc.gz
Multiplied by 6Tb compressed, ~50Tb uncompressed
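Each index entry carries the nine fields listed above. As a sketch, assuming one record per line with space-separated fields in the order the slide shows them (real CDX files usually declare their field order in a header line, so this fixed order is an assumption), a record could be parsed like this. `CdxRecord` and `parse_cdx_line` are hypothetical names.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    canonical_url: str
    datestamp: str      # 14 digits: YYYYMMDDhhmmss
    request_url: str
    mime_type: str
    http_status: str
    checksum: str
    redirect: str       # "-" when there is no redirect
    num_bytes: int      # the slide's "Bytes" field
    storage_file: str


def parse_cdx_line(line):
    """Split one space-separated index line into a CdxRecord."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    fields[7] = int(fields[7])
    return CdxRecord(*fields)


# The example record from the slide, joined into a single line.
line = (
    "kosovakosovo.com/photo.php?id=5785 20090608161553 "
    "http://www.kosovakosovo.com/photo.php?id=5785 text/html 200 "
    "M36MRHSBVPLKMUN6PFOIEV3AH5ADITAN - 563096 "
    "AIT-1068-20090608161511.warc.gz"
)
record = parse_cdx_line(line)
```

For the merged index only `canonical_url` and `datestamp` are needed per archive; the remaining fields matter to the archive that holds the file.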
12. The Plan…
• To provide fast access to distributed archives, LANL
would merge the indexes of the holdings of multiple
archives and provide Memento based access
• Step 1: Library of Congress gathers CDX files
Step 2: LANL indexes (a.k.a. “…”)
Step 3: Profit
• Data: 6T of gzipped CDX files (mostly from IA)
• Shipped on hard drives
• Computing: 210 node DISC cluster at LANL
• 2x 2 GHz processors, 2x 2T HDD, 8G RAM
13. … and the Reality
• Hardware failure killed one of the drives en route
• Transferred remaining files via BagIt from LoC
• DISC has restricted access:
• Had to transfer data over intranet
• 2 weeks to sync (5Mb/sec)
• And then 2 weeks to get the processed results off
• Compute cluster has faulty switch, unreliable nodes:
• Ran original processing 15 times without success
due to hardware failures
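The two-week sync time is consistent with the data volume if "Mb" is read as megabytes. A back-of-the-envelope check, assuming decimal units, the 6T of compressed CDX files mentioned earlier, and a sustained 5 MB/s:

```python
def transfer_days(total_bytes, bytes_per_second):
    """Days needed to move total_bytes at a sustained rate."""
    seconds = total_bytes / bytes_per_second
    return seconds / 86_400  # seconds per day


# 6 TB of compressed CDX files at 5 MB/s (decimal units assumed)
days = transfer_days(6e12, 5e6)
print(f"{days:.1f} days")
```

That works out to just under 14 days, matching the slide's "2 weeks".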
14. Processing Design
• For each CDX file,
• For each URI + timestamps,
• Map URI to an appropriate database slice
• Merge timestamps with those of previous CDXs
• Possible because:
• No need to do truncated search
• No need to walk through URIs in order
• No need for time based access, only URI
• Problem is “Embarrassingly Parallel”
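The design above can be sketched in a few lines. The slides do not say how URIs were mapped to slices; this sketch assumes a checksum of the URI modulo the number of slices, and uses nested dicts in place of real database slices. All names are hypothetical.

```python
import zlib
from collections import defaultdict


def slice_for(uri, num_slices):
    """Map a URI to a slice number; CRC32 gives the same answer on every node."""
    return zlib.crc32(uri.encode("utf-8")) % num_slices


def index_entries(entries, num_slices, slices=None):
    """Merge (uri, timestamp) pairs into per-slice {uri: set of timestamps}.

    Pass the `slices` returned by an earlier call to merge a further CDX
    file into the same index.
    """
    if slices is None:
        slices = defaultdict(lambda: defaultdict(set))
    for uri, timestamp in entries:
        slices[slice_for(uri, num_slices)][uri].add(timestamp)
    return slices


# Two CDX files that both mention the same URI.
first = [("cnn.com/", "20031013111028")]
second = [("cnn.com/", "20011220045100"), ("cnn.com/", "20031013111028")]
index = index_entries(first, num_slices=8)
index = index_entries(second, num_slices=8, slices=index)
```

Because lookups are by exact URI only, any deterministic mapping works, and each slice can be built without reference to the others. Python's built-in `hash()` is avoided here because it is randomized per process for strings.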
15. Approach 1: Online Messaging
• 25 read nodes, 1 control node, 150 write nodes
• Messages (1000 URIs) sent via control node to write
• Failed 15 times due to hardware issues
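The message granularity can be illustrated with a small batching helper. This is only a sketch, assuming each message is a list of up to 1000 entries; the transport between read, control and write nodes is left out, and `batches` is a hypothetical name.

```python
from itertools import islice


def batches(entries, size=1000):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(entries)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# 2500 stand-in entries become two full messages and one partial one.
messages = list(batches(range(2500)))
```

Routing every message through a single control node means every batch crosses the network twice, which may help explain why Approach 2 removes the interaction altogether.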
16. Approach 1: Implementation
• Python
• MPI ☹
• PyMPI ☹☹
• No failure detection, no auto restart, no zombie killing,
no … ☹☹☹
• Several months of experiments to find issues and fine-tune
the parameters most likely to let a run complete…
17. Approach 2: No Interaction
• 43 read/split nodes
• Phase 1: Read nodes split CDX files to 3000 slices
18. Approach 2: No Interaction
• Phase 2a: Transfer CDX slices to Control node
• Phase 2b: Transfer CDX slices to Write nodes
19. Approach 2: No Interaction
• 50 write nodes (* 60 slices each = 3000 slices)
• Phase 3: Merge slices from nodes to BerkeleyDBs
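With 3000 slices spread over 50 write nodes, each node owns 60 slices and merges the fragments of those slices produced by the different read nodes. A minimal sketch, assuming a contiguous slice-to-node layout (the slides do not specify one) and a plain dict standing in for each per-slice BerkeleyDB; the function names are hypothetical.

```python
NUM_SLICES = 3000
NUM_WRITE_NODES = 50
SLICES_PER_NODE = NUM_SLICES // NUM_WRITE_NODES  # 60


def write_node_for(slice_id):
    """Assumed contiguous layout: slices 0-59 on node 0, 60-119 on node 1, ..."""
    if not 0 <= slice_id < NUM_SLICES:
        raise ValueError(f"slice id out of range: {slice_id}")
    return slice_id // SLICES_PER_NODE


def merge_fragments(fragments):
    """Merge the fragments of one slice produced by different read nodes.

    Each fragment is an iterable of (uri, timestamp) pairs. The result maps
    each URI to a sorted list of distinct timestamps; a dict stands in for
    the per-slice BerkeleyDB here.
    """
    merged = {}
    for fragment in fragments:
        for uri, timestamp in fragment:
            merged.setdefault(uri, set()).add(timestamp)
    return {uri: sorted(stamps) for uri, stamps in merged.items()}
```

Since no node needs to talk to another during the merge, a failed node only requires redoing its own 60 slices rather than restarting the whole run.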
20. Next Steps
• Re-index, using non-interactive approach
• New data: 14 Tb (~120Tb uncompressed?)
• Use FileMap for better automation?
• Two 8T eSATA LaCie cubes and two 3T internal drives to
install locally
• More intelligent data partitioning
• Some sort of error detection and handling!
21. Memento
http://mementoweb.org/
Robert Sanderson
rsanderson@lanl.gov
@azaroth42
Herbert Van de Sompel
herbertv@lanl.gov
@hvdsomp