Evaluating Memento Service Optimizations

•Transferir como PPTX, PDF•

0 gostou•1,482 visualizações

Martin Klein

Evaluating Memento Service Optimizations JCDL 2019 short paper presentation

Internet

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
Evaluating
Service Optimizations
Martin Klein
Lyudmila Balakireva
Harihar Shankar
Research Library
Los Alamos National Laboratory
https://arxiv.org/abs/1906.00058

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
Memento Time Travel
http://timetravel.mementoweb.org/
2

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
Memento Time Travel
https://arquivo.pt/wayback/20160515103313/http://www.jcdl.org/
4

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
How does this work?
Memento Aggregator (simplistic view)
5

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
Memento Aggregator
6

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
Memento Aggregator
7

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
• As the number of archives grows, sending requests to each archive
for every incoming request is not feasible
• Response times
• Memento infrastructure load
• Load on distributed archives
LANL Memento Aggregator - Problem
8

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
• We could predict whether or not to issue a request to a specific
archive?
• By merely looking at the requested URI-R
• A binary classifier per archive
• We could train the classifiers using cached data?
• That would be pretty neat, indeed:
• Retrain classifiers as web archive collections evolve
• Not dependent on external data
• Querying classifiers probably way faster (msec) than polling
archives (sec)
What if…
9

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
10
We can! Published @ JCDL 2016
https://doi.org/10.1145/2910896.2910899
• ML models based on simple URI features
• Character count, n-grams, domain
• Common ML algorithms used per archive
• Logistic Regression, Multinomial Bayes, SVM
• Optimized for
• Prediction time, not training time
• Reduction of false positive rate
Results:
• Requests per URI-R: 3.96 vs 11
• Response time:
2.16s vs 3.08s
• Recall:
0.847

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
11

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
In Production…
12

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
13

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
14

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
15

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
16

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
17

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
18

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
• How effective is the cache?
• What is the hit/miss ratio? Does it vary for different Memento
services?
• Is the cache freshness period appropriate?
• How effective is the ML process?
• What is the recall and the false positive rate?
• Do we need to retrain the models? How often?
Questions to Ask
19

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
• Memento Aggregator currently covers
• 23 web archives
• 17 with native memento support
• 6 with by-proxy memento support
• Analysis of log files
• recorded between July 4th 2017 and October 17th 2018
• > 11m requests in total
• Approx. 2.6m requests against machine learning process
• Results in 2.6m lookups to populate cache
• Used as “truth” to assess ML prediction
Evaluation
20

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
Cache Hit/Miss Rate
21

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
Cache Hit/Miss Rate
22
humanshumans machines machinesMostly driven by

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
Recall
23
0.847
0.727

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
False Positives
24

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
Dynamic Web Archives
25

Evaluating Memento Service Optimizations
@mart1nkle1n
JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA
• Memento Aggregator cache is very effective
• ~ 60% of requests served from cache
• Human-driven services benefit the most
• Machine learning process saves!
• Requests & time while at acceptable recall level
• Recall: 0.727
• Re-training seems necessary, frequency TBD
Optimization
• ML model trained on archival holdings, not usage logs/cache
• Beneficial for new archives
• Neural network classifier, based on simple URI features, show
promising results
Takeaways
26

Mais conteúdo relacionado

Mais de Martin Klein

An Institutional Perspective to Rescue Scholarly OrphansMartin Klein

A Vision of the Library’s Role in Archiving Scholarly ArtifactsMartin Klein

First Steps in Research Data Management Under Constraints of a National Secur...Martin Klein

Smart Routing of Memento RequestsMartin Klein

Building Event Collections from Crawling Web ArchivesMartin Klein

A Web-Centric Pipeline for Archiving Scholarly ArtifactsMartin Klein

Focused Crawl of Web Archives to Build Event CollectionsMartin Klein

Creating Topical Collections:Web Archives vs. Live WebMartin Klein

Robust Linking to Web ResourcesMartin Klein

Signposting for RepositoriesMartin Klein

Discovering Scholarly Orphans Using ORCIDMartin Klein

Using the Memento Framework to Assess Content Drift in Scholarly CommunicationMartin Klein

Uniform Access to Raw MementosMartin Klein

Robust Links - a proposed solution to reference rot in scholarly communicationMartin Klein

ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...Martin Klein

To the Rescue of the Orphans of Scholarly CommunicationMartin Klein

web_archive_interoperability_mementoMartin Klein

Reference Rot in Scholarly Communication: A Reliable Quantification and a P...Martin Klein

Comparing Published Scientific Journal Articles to Their Pre-print VersionsMartin Klein

Preserving Born-Digital News Panel JCDL 2016Martin Klein

Mais de Martin Klein (20)

An Institutional Perspective to Rescue Scholarly Orphans

A Vision of the Library’s Role in Archiving Scholarly Artifacts

First Steps in Research Data Management Under Constraints of a National Secur...

Smart Routing of Memento Requests

Building Event Collections from Crawling Web Archives

A Web-Centric Pipeline for Archiving Scholarly Artifacts

Focused Crawl of Web Archives to Build Event Collections

Creating Topical Collections:Web Archives vs. Live Web

Robust Linking to Web Resources

Signposting for Repositories

Discovering Scholarly Orphans Using ORCID

Using the Memento Framework to Assess Content Drift in Scholarly Communication

Uniform Access to Raw Mementos

Robust Links - a proposed solution to reference rot in scholarly communication

ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...

To the Rescue of the Orphans of Scholarly Communication

web_archive_interoperability_memento

Reference Rot in Scholarly Communication: A Reliable Quantification and a P...

Comparing Published Scientific Journal Articles to Their Pre-print Versions

Preserving Born-Digital News Panel JCDL 2016

Último

Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Delhi Call girls

Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.soniya singh

Yerawada ] Independent Escorts in Pune - Book 8005736733 Call Girls Available...SUHANI PANDEY

Russian Call girl in Ajman +971563133746 Ajman Call girl Servicegwenoracqe6

Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.soniya singh

valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure

Real Escorts in Al Nahda +971524965298 Dubai Escorts ServiceEscorts Call Girls

VVIP Pune Call Girls Mohammadwadi WhatSapp Number 8005736733 With Elite Staff...SUHANI PANDEY

All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445ruhi

6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...@Chandigarh #call #Girls 9053900678 @Call #Girls in @Punjab 9053900678

Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.soniya singh

VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698

Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...SUHANI PANDEY

Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubaikojalkojal131

Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort ServiceDelhi Call girls

VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure

Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service9953056974 Low Rate Call Girls In Saket, Delhi NCR

Real Men Wear Diapers T Shirts sweatshirtrahman018755

Al Barsha Night Partner +0567686026 Call Girls DubaiEscorts Call Girls

Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Call Girls in Nagpur High Profile

Evaluating Memento Service Optimizations

1. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA Evaluating Service Optimizations Martin Klein Lyudmila Balakireva Harihar Shankar Research Library Los Alamos National Laboratory https://arxiv.org/abs/1906.00058

2. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA Memento Time Travel http://timetravel.mementoweb.org/ 2

3. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA Memento Time Travel http://timetravel.mementoweb.org/list/20160214085934/http://jcdl.org 3

4. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA Memento Time Travel https://arquivo.pt/wayback/20160515103313/http://www.jcdl.org/ 4

5. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA How does this work? Memento Aggregator (simplistic view) 5

6. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA Memento Aggregator 6

7. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA Memento Aggregator 7

8. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA • As the number of archives grows, sending requests to each archive for every incoming request is not feasible • Response times • Memento infrastructure load • Load on distributed archives LANL Memento Aggregator - Problem 8

9. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA • We could predict whether or not to issue a request to a specific archive? • By merely looking at the requested URI-R • A binary classifier per archive • We could train the classifiers using cached data? • That would be pretty neat, indeed: • Retrain classifiers as web archive collections evolve • Not dependent on external data • Querying classifiers probably way faster (msec) than polling archives (sec) What if… 9

10. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA 10 We can! Published @ JCDL 2016 https://doi.org/10.1145/2910896.2910899 • ML models based on simple URI features • Character count, n-grams, domain • Common ML algorithms used per archive • Logistic Regression, Multinomial Bayes, SVM • Optimized for • Prediction time, not training time • Reduction of false positive rate Results: • Requests per URI-R: 3.96 vs 11 • Response time: 2.16s vs 3.08s • Recall: 0.847

11. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA 11

12. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA In Production… 12

13. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA 13

14. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA 14

15. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA 15

16. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA 16

17. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA 17

18. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA 18

19. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA • How effective is the cache? • What is the hit/miss ratio? Does it vary for different Memento services? • Is the cache freshness period appropriate? • How effective is the ML process? • What is the recall and the false positive rate? • Do we need to retrain the models? How often? Questions to Ask 19

20. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA • Memento Aggregator currently covers • 23 web archives • 17 with native memento support • 6 with by-proxy memento support • Analysis of log files • recorded between July 4th 2017 and October 17th 2018 • > 11m requests in total • Approx. 2.6m requests against machine learning process • Results in 2.6m lookups to populate cache • Used as “truth” to assess ML prediction Evaluation 20

21. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA Cache Hit/Miss Rate 21

22. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA Cache Hit/Miss Rate 22 humanshumans machines machinesMostly driven by

23. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA Recall 23 0.847 0.727

24. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA False Positives 24

25. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA Dynamic Web Archives 25

26. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA • Memento Aggregator cache is very effective • ~ 60% of requests served from cache • Human-driven services benefit the most • Machine learning process saves! • Requests & time while at acceptable recall level • Recall: 0.727 • Re-training seems necessary, frequency TBD Optimization • ML model trained on archival holdings, not usage logs/cache • Beneficial for new archives • Neural network classifier, based on simple URI features, show promising results Takeaways 26

27. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA Evaluating Service Optimizations Martin Klein Lyudmila Balakireva Harihar Shankar Research Library Los Alamos National Laboratory https://arxiv.org/abs/1906.00058

Evaluating Memento Service Optimizations

Recomendados

Recomendados

Mais conteúdo relacionado

Mais de Martin Klein

Mais de Martin Klein (20)

Último

Último (20)

Evaluating Memento Service Optimizations