SlideShare uma empresa Scribd logo
1 de 21
Baixar para ler offline
Deep Search or Harvest: how we decided 
Simon Moffatt Ilaria Corda 
Discovery & Access – The British Library
www.bl.uk 
2 
Overview 
•Focussing on operational and functional considerations 
– Harvesting data and using Deep Search. 
•Along the way… 
–Challenges of operating with a large data set 
–Demonstrate our new Web Archive search of 1.3bn web pages 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
3 
Introductions 
•We work in a team of six supporting and developing Discovery and Access at the British Library 
•The British Library has reading rooms and storage in London and in Yorkshire. (We are based in Yorkshire) 
•We are a UK Legal Deposit library 
–We collect everything 
–It can only be accessed in our reading rooms 
•Our Ex Libris products 
–Aleph v20 Primo v4.x SFX v4.x 
–All locally hosted 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
4 
Deep Search or Harvest: how we decided 
•Like most institutions, the remit of our search is expanding 
–New digital content 
–Migration from older technologies 
•If you have a large number of new records to add, often you have two options: 
–Harvest the records into your main index 
–Deep Search to an external index 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
5 
Harvest 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
6 
Deep Search 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
7 
Two case studies 
Recently we faced this Harvest vs Deep Search dilemma for two new sources of data 
•A replacement to our Content Management System 
–we decided to harvest 
•A search of our new Web Archive 
– we implemented a deep search 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
8 
Content Management System - Harvested 
The technology behind the BL website is being updated. 
•There are around 50,000 pages 
•It has its own index, updated daily 
Work involved 
•Creating Normalisation rules and Pipe 
•Setting up daily processes 
Not appropriate for deep search – more later…. 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
9 
Large datasets in Primo: Our experience 
Background: Our data 
•13m Aleph records (Books, Journals, Maps etc) 
•5m SirsiDynix Symphony records (Sound Archive) 
•48m Journal articles (growing by 2m per year) 
•1m other records from five other pipes 
Total: 67 Million records 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 10 
Our topology 
FE1 
Front End Server 
FE2 
Front End Server 
FE3 
Front End Server 
SE1 
Search 
Slice 
SE8 
Search 
Slice 
SE7 
Search 
Slice 
SE6 
Search 
Slice 
SE5 
Search 
Slice 
SE4 
Search 
Slice 
SE3 
Search 
Slice 
SE2 
Search 
Slice 
BO 
Back Office Server 
DB 
Database Server 
Search Results 
Indexes 
Config, PNXs etc 
Config, 
Web 2.0 
etc 
Public 
Private 
Harvests 
Load-balanced requests 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
11 
Service challenges with 67m records 
Our index is ~100GB 
•Indexing takes at least 3 hours; Hotswap takes 4 hours 
–Even if there is only one new record 
–Overnight schedules are tight 
–Fast data updates are impossible 
•System restart takes 5 hours 
–Re-sync Search Schema is a whole day 
–Failover system must be available at all times 
•Primo Service Packs and Hotfixes need caution 
•Standard documented advice must be read carefully 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
12 
Development challenges with 67m records 
•Full re-harvest takes 7 weeks 
–Major normalisation changes only 3 times per year 
–Smaller UI changes can be made more often 
•Primo Version upgrades affect 13 servers (or 56 in total) 
•Implementing Primo enhancements 
–We must consider index size 
–Sort, browse etc all have an impact 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
13 
Our development cycle (not to scale) 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
14 
But there are compensations… 
•Speed of the index 
•Control over the data 
•Consistency of rules 
•Common development skills 
•A single point of failure 
These are all important 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
15 
Web Archive - Deep Search 
Web Archiving collects, makes accessible and preserves web resources of scholarly and cultural importance from the UK domain 
Some Figures: 
 ~1.3 billion documents 
 Regular crawling processes - E.g. BBC News 
And the infrastructure? 
 index size is ~3.5TB spread across 24 Solr shards 
 80-node Hadoop cluster (where crawls are processed for submission) 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
16 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
17 
Service challenges with Deep Search 
 Additional point of failure within Primo 
–Newly introduced Troubleshooting procedure 
–Service Notices for planned Downtime/Maintenance 
–Changes to the Solr Schema that could break our search 
 Primo Upgrades & Hotfixes can affect the functioning of the Deep Search 
–Re-builds of the Client in case the Deep Search libraries (jars) change 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
18 
Development challenges with Deep Search 
Significant development work to implement Primo/Solr integration 
Accommodate new Primo features (versioning control) 
E.g. Multi-select faceting (Exclusion & Inclusion) 
Needs to ensure consistency across local and non-local collections 
Seemingly NO difference from a UI/UX point of view 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
19 
But there are compensations… 
Ideal for large indexes and frequent updates 
Independent indexing processes 
Maintenance of existing scheduled processes 
Leverage existing Primo-built in features 
Existing and Extendable Solr APIs / active community 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
21 
Harvest vs Deep Search: how we decided 
ADDITIONAL INFRASTRUCTURE COSTS 
DEV TIME / SPECIALIST SKILLS 
INDEX SIZE/NO. RECORDS 
FREQUENCY & VOLUME OF UPDATES 
IMPACT ON CURRENT PROCESSES 
Deep Search 
Harvest 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
www.bl.uk 
22 
Thank you

Mais conteúdo relacionado

Semelhante a Deep Search or Harvest, how we decided

Semelhante a Deep Search or Harvest, how we decided (20)

Using Archivemedia to preserve research data
Using Archivemedia to preserve research dataUsing Archivemedia to preserve research data
Using Archivemedia to preserve research data
 
EPrints Update, Les Carr, University of Southampton
EPrints  Update, Les Carr, University of SouthamptonEPrints  Update, Les Carr, University of Southampton
EPrints Update, Les Carr, University of Southampton
 
Conf2014_SplunkSearchOptimization
Conf2014_SplunkSearchOptimizationConf2014_SplunkSearchOptimization
Conf2014_SplunkSearchOptimization
 
SplunkGettingStartedWorkshop.pptx
SplunkGettingStartedWorkshop.pptxSplunkGettingStartedWorkshop.pptx
SplunkGettingStartedWorkshop.pptx
 
SplunkGettingStartedWorkshop.pptx
SplunkGettingStartedWorkshop.pptxSplunkGettingStartedWorkshop.pptx
SplunkGettingStartedWorkshop.pptx
 
The Good, The Bad and the Ugly
The Good, The Bad and the UglyThe Good, The Bad and the Ugly
The Good, The Bad and the Ugly
 
AD1545 - Extending the XPages Extension Library
AD1545 - Extending the XPages Extension LibraryAD1545 - Extending the XPages Extension Library
AD1545 - Extending the XPages Extension Library
 
Inn reach for IIUG (2005)
Inn reach for IIUG (2005)Inn reach for IIUG (2005)
Inn reach for IIUG (2005)
 
Role of-analytics-in-db as-life
Role of-analytics-in-db as-lifeRole of-analytics-in-db as-life
Role of-analytics-in-db as-life
 
Customer Presentation
Customer PresentationCustomer Presentation
Customer Presentation
 
GWAVACon - Vibe: Collaboration made easy
GWAVACon - Vibe: Collaboration made easyGWAVACon - Vibe: Collaboration made easy
GWAVACon - Vibe: Collaboration made easy
 
Building a Vibrant Search Ecosystem @ Bloomberg: Presented by Steven Bower & ...
Building a Vibrant Search Ecosystem @ Bloomberg: Presented by Steven Bower & ...Building a Vibrant Search Ecosystem @ Bloomberg: Presented by Steven Bower & ...
Building a Vibrant Search Ecosystem @ Bloomberg: Presented by Steven Bower & ...
 
Kscope11 recap
Kscope11 recapKscope11 recap
Kscope11 recap
 
ASTQB washington-sept-2015
ASTQB washington-sept-2015ASTQB washington-sept-2015
ASTQB washington-sept-2015
 
PEER End of Project Report
PEER End of Project ReportPEER End of Project Report
PEER End of Project Report
 
Brief on Migration to V7 Project
Brief on Migration to V7 ProjectBrief on Migration to V7 Project
Brief on Migration to V7 Project
 
US Sugar: How Mobilizing SAP ERP Delivered Sweet Results
US Sugar: How Mobilizing SAP ERP Delivered Sweet ResultsUS Sugar: How Mobilizing SAP ERP Delivered Sweet Results
US Sugar: How Mobilizing SAP ERP Delivered Sweet Results
 
EPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster AnalyticsEPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster Analytics
 
Taking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - ArchitectureTaking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - Architecture
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 

Último

Último (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Deep Search or Harvest, how we decided

  • 1. Deep Search or Harvest: how we decided Simon Moffatt Ilaria Corda Discovery & Access – The British Library
  • 2. www.bl.uk 2 Overview •Focussing on operational and functional considerations – Harvesting data and using Deep Search. •Along the way… –Challenges of operating with a large data set –Demonstrate our new Web Archive search of 1.3bn web pages IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 3. www.bl.uk 3 Introductions •We work in a team of six supporting and developing Discovery and Access at the British Library •The British Library has reading rooms and storage in London and in Yorkshire. (We are based in Yorkshire) •We are a UK Legal Deposit library –We collect everything –It can only be accessed in our reading rooms •Our Ex Libris products –Aleph v20 Primo v4.x SFX v4.x –All locally hosted IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 4. www.bl.uk 4 Deep Search or Harvest: how we decided •Like most institutions, the remit of our search is expanding –New digital content –Migration from older technologies •If you have a large number of new records to add, often you have two options: –Harvest the records into your main index –Deep Search to an external index IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 5. www.bl.uk 5 Harvest IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 6. www.bl.uk 6 Deep Search IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 7. www.bl.uk 7 Two case studies Recently we faced this Harvest vs Deep Search dilemma for two new sources of data •A replacement to our Content Management System –we decided to harvest •A search of our new Web Archive – we implemented a deep search IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 8. www.bl.uk 8 Content Management System - Harvested The technology behind the BL website is being updated. •There are around 50,000 pages •It has its own index, updated daily Work involved •Creating Normalisation rules and Pipe •Setting up daily processes Not appropriate for deep search – more later…. IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 9. www.bl.uk 9 Large datasets in Primo: Our experience Background: Our data •13m Aleph records (Books, Journals, Maps etc) •5m SirsiDynix Symphony records (Sound Archive) •48m Journal articles (growing by 2m per year) •1m other records from five other pipes Total: 67 Million records IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 10. www.bl.uk 10 Our topology FE1 Front End Server FE2 Front End Server FE3 Front End Server SE1 Search Slice SE8 Search Slice SE7 Search Slice SE6 Search Slice SE5 Search Slice SE4 Search Slice SE3 Search Slice SE2 Search Slice BO Back Office Server DB Database Server Search Results Indexes Config, PNXs etc Config, Web 2.0 etc Public Private Harvests Load-balanced requests IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 11. www.bl.uk 11 Service challenges with 67m records Our index is ~100GB •Indexing takes at least 3 hours; Hotswap takes 4 hours –Even if there is only one new record –Overnight schedules are tight –Fast data updates are impossible •System restart takes 5 hours –Re-sync Search Schema is a whole day –Failover system must be available at all times •Primo Service Packs and Hotfixes need caution •Standard documented advice must be read carefully IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 12. www.bl.uk 12 Development challenges with 67m records •Full re-harvest takes 7 weeks –Major normalisation changes only 3 times per year –Smaller UI changes can be made more often •Primo Version upgrades affect 13 servers (or 56 in total) •Implementing Primo enhancements –We must consider index size –Sort, browse etc all have an impact IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 13. www.bl.uk 13 Our development cycle (not to scale) IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 14. www.bl.uk 14 But there are compensations… •Speed of the index •Control over the data •Consistency of rules •Common development skills •A single point of failure These are all important IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 15. www.bl.uk 15 Web Archive - Deep Search Web Archiving collects, makes accessible and preserves web resources of scholarly and cultural importance from the UK domain Some Figures:  ~1.3 billion documents  Regular crawling processes - E.g. BBC News And the infrastructure?  index size is ~3.5TB spread across 24 Solr shards  80-node Hadoop cluster (where crawls are processed for submission) IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 16. www.bl.uk 16 IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 17. www.bl.uk 17 Service challenges with Deep Search  Additional point of failure within Primo –Newly introduced Troubleshooting procedure –Service Notices for planned Downtime/Maintenance –Changes to the Solr Schema that could break our search  Primo Upgrades & Hotfixes can affect the functioning of the Deep Search –Re-builds of the Client in case the Deep Search libraries (jars) change IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 18. www.bl.uk 18 Development challenges with Deep Search Significant development work to implement Primo/Solr integration Accommodate new Primo features (versioning control) E.g. Multi-select faceting (Exclusion & Inclusion) Needs to ensure consistency across local and non-local collections Seemingly NO difference from a UI/UX point of view IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 19. www.bl.uk 19 But there are compensations… Ideal for large indexes and frequent updates Independent indexing processes Maintenance of existing scheduled processes Leverage existing Primo-built in features Existing and Extendable Solr APIs / active community IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
  • 20. www.bl.uk 21 Harvest vs Deep Search: how we decided ADDITIONAL INFRASTRUCTURE COSTS DEV TIME / SPECIALIST SKILLS INDEX SIZE/NO. RECORDS FREQUENCY & VOLUME OF UPDATES IMPACT ON CURRENT PROCESSES Deep Search Harvest IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]