Searching more than 140 years of newspaper articles


How Bassilichi Group implemented the oldest Italian newspaper
historical archive, that of "La Stampa di Torino", covering
1867 to 2006
Nicola Provenzano, Bassilichi Group, Italy
Agenda

o      About Bassilichi Group

o      The Italian newspaper historical archive of
       "La Stampa di Torino" from 1867 to 2006

o      Our Search Challenges

o      Enhancing the findability
BASSILICHI S.p.A.

An Italian Business Process Outsourcing (BPO) company that serves
as a strategic partner for banks, businesses and the public
sector, with an offering that covers three areas:
Monetics, Security and Back Office

o Turnover: € 256M

o Employees: 1009 (at 31/12/2010)
The Italian newspaper La Stampa from Turin

o Founded on February 9, 1867 under the name “Gazzetta
   Piemontese”

o La Stampa is one of the best-known Italian newspapers,
   published in Turin and distributed in Italy and other
   European nations

o With daily sales of about 400,000 copies (2010) and
   9,000,000 site page views a month, La Stampa is the third
   best-selling news daily in the country
The project: digitize the entire historical
archive and publish the content on the web

2007 The project starts

o Digitization

o Layout analysis

o OCR

o Data entry

2010 The project goes online
Project workgroup
Committee for the Digital Library Information Journalism,
    members:
    o    San Paolo Company
    o    CRT Foundation
    o    La Stampa publishing company
    o    Regione Piemonte

Service Providers

o STI S.p.A., Bassilichi S.p.A., Microshop S.r.l., Bassnet S.r.l.

Hosting and infrastructure provider

o   CSI Piemonte
Project numbers
o Nearly 150 years of history

o 1,761,000 newspaper pages with varied page layouts

o More than 5 million newspaper articles

o 4.5 million images of photographs and negatives

o Nearly 100 TB of images (from 300 to 96 dpi), XML and TXT
   documents
Web project requirements

o Search in the articles: full-text search and search by
    masthead, date and page number

o Read an article either in a text-only interface or
    highlighted over the image of the source newspaper
    page

o Use open-source technologies
Web project input data

o XML with:
   o   Masthead, issue date,
       page number

   o   Title and article body

o METS and ALTO XML files with
   article, line and word
   positions on the page
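
The word positions in the ALTO files are what make highlighting over the page image possible. A minimal sketch of reading them, assuming ALTO `String` elements with `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes; the sample document and the helper names (`word_boxes`, `find_word`) are made up for illustration and are not taken from the project:

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO-style sample; real files carry far more structure.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1" HPOS="100" VPOS="200" WIDTH="400" HEIGHT="40">
            <String CONTENT="La" HPOS="100" VPOS="200" WIDTH="60" HEIGHT="40"/>
            <SP WIDTH="20"/>
            <String CONTENT="Stampa" HPOS="180" VPOS="200" WIDTH="220" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def word_boxes(alto_xml):
    """Return one dict per word with its text and bounding box on the page."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        boxes.append({
            "text": element.get("CONTENT", ""),
            "x": float(element.get("HPOS", 0)),
            "y": float(element.get("VPOS", 0)),
            "width": float(element.get("WIDTH", 0)),
            "height": float(element.get("HEIGHT", 0)),
        })
    return boxes


def find_word(boxes, term):
    """Case-insensitive lookup of the boxes to highlight for a search term."""
    return [box for box in boxes if box["text"].lower() == term.lower()]
```

The boxes returned by `find_word` are in the coordinate space of the scanned page, so a viewer would still have to scale them to the resolution of the image it displays.
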
January 17, 2007

“Solr has graduated from the Apache Incubator, and
is now a sub-project of Lucene”
Main Solr implementation tricks

o The Lucene document ID is the domain primary key

o The full text of long articles is indexed but not stored, to
    reduce index size

o An abstract of each article's text is stored, to reduce search
    result listing time

o Custom XmlUpdateRequestHandler to index the OCR text of long
    articles

o Robust message queuing system to handle system indexing
    commands
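
As an illustration of the first three points, here is a rough sketch of building a Solr XML `<add>` message in which the document `id` is the domain primary key and a short abstract travels alongside the full text. The field names, the ID format and the 300-character cut-off are assumptions, not the project's actual schema. Whether a field is indexed or stored is declared in the Solr schema, not in this message.

```python
import xml.etree.ElementTree as ET

ABSTRACT_CHARS = 300  # assumed cut-off; the slides do not give the real value


def build_add_message(article):
    """Serialize one article as a Solr XML <add> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        "id": article["id"],                # domain primary key reused as document ID
        "title": article["title"],
        "issue_date": article["issue_date"],
        # Short text meant to be stored, for fast result listings.
        "abstract": article["body"][:ABSTRACT_CHARS],
        # Full OCR text, meant to be indexed but not stored.
        "body": article["body"],
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical article record, for illustration only.
EXAMPLE_ARTICLE = {
    "id": "LS-1867-02-09-p1-a3",
    "title": "Gazzetta Piemontese",
    "issue_date": "1867-02-09T00:00:00Z",
    "body": "Testo OCR dell'articolo. " * 40,
}
```

Calling `build_add_message(EXAMPLE_ARTICLE)` yields a string that could be posted to a Solr update handler or placed on the indexing queue.
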
Web project main technologies
Web project challenges
     The search engine works well, but how can we ensure high
    performance under potentially very high traffic?

TO DO:

o Investigate load balancing possibilities and fault tolerance
    strategies

o Find how to decouple the index creation phase from the index
    release in production

o Use a read-only, optimized Lucene index in production
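Solr collection distribution

[Architecture diagram: a load balancer in front of three HTTPD
servers; a second load balancer and a JBOSS EAP cluster in front of
four Solr slaves, each with its own index; index replication to the
slaves from the index that receives management, updates and
administration commands]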
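
In the deployment above a load balancer spreads read traffic over the slaves. The same idea, rotation plus failover, can be sketched in application code. This is only an illustration: the `SlavePool` class, the URLs and the `fetch` callable are hypothetical and not part of the project.

```python
class SlavePool:
    """Rotates read-only queries over Solr slaves, skipping ones that fail."""

    def __init__(self, slave_urls):
        if not slave_urls:
            raise ValueError("at least one slave URL is required")
        self._urls = list(slave_urls)
        self._next = 0  # index of the slave to try first on the next query

    def query(self, fetch, params):
        """Try each slave at most once; return (url, result) of the first success.

        `fetch(url, params)` performs the actual request and is expected
        to raise OSError (or a subclass) when a slave is unreachable.
        """
        failures = []
        total = len(self._urls)
        for offset in range(total):
            index = (self._next + offset) % total
            url = self._urls[index]
            try:
                result = fetch(url, params)
            except OSError as exc:
                failures.append(f"{url}: {exc}")
                continue
            # Start the next query on the slave after the one that answered.
            self._next = (index + 1) % total
            return url, result
        raise ConnectionError("all slaves failed: " + "; ".join(failures))
```

A `fetch` built on `urllib.request.urlopen` would fit this contract, since `URLError` is a subclass of `OSError`. Updates never go through the pool: they are sent to the index that the slaves replicate from.
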
Online web project numbers

On the day the project was presented, the site handled very
high traffic without any problems

o The historical archive of “La Stampa di Torino” is one of the
    biggest freely available digital newspaper archives, alongside
    those of The Times and The New York Times

o 509,791 page views on November 1, 2010, with 21,352 user
    sessions

o Nearly 15,000,000 page views in the last year
Current development version challenges
   Browsing the archive by date, article title and text gives a good
         search experience, but how can findability be enhanced?

o Boosting articles with Named Entity Recognition, with the help of
    Celi s.r.l

o Enhancing user search capabilities with query autocomplete
    suggestions and advanced search over named entities:
    author, persons, locations, organizations

o Faceting content on all the new article attributes

o Enabling content tagging to collect useful user navigation
    suggestions
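
One way the named-entity attributes could feed both boosting and faceting is through standard Solr request parameters (`defType=dismax`, `qf`, `facet.field`). The sketch below only assembles those parameters; the field names and boost values are assumptions, not the project's configuration.

```python
from urllib.parse import urlencode

# Hypothetical index fields holding the recognized named entities.
ENTITY_FIELDS = ("author", "persons", "locations", "organizations")


def build_search_params(user_query, entity_boost=2.0):
    """Build Solr parameters that boost entity matches and facet on entities."""
    # qf lists the fields to search, each with an optional ^boost.
    query_fields = ["title^1.5", "body"]
    query_fields += [f"{field}^{entity_boost}" for field in ENTITY_FIELDS]
    params = [
        ("q", user_query),
        ("defType", "dismax"),
        ("qf", " ".join(query_fields)),
        ("facet", "true"),
        ("facet.mincount", "1"),
    ]
    # One facet.field parameter per entity attribute.
    params += [("facet.field", field) for field in ENTITY_FIELDS]
    return params


def to_query_string(params):
    """Encode the parameter list for use in a select URL."""
    return urlencode(params)
```

Keeping the parameters as a list of pairs, not a dict, matters here because `facet.field` is repeated once per field.
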
Current development version details
o   jQuery UI enriches the user interface

o   Date range filters drive the new timeline
    search widget

o   Multi-select faceting for user search refinement

o   MoreLikeThis with named entities for user
    search suggestions
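
Multi-select faceting and the date range filter map onto Solr filter queries. The sketch below uses Solr's tag/exclude local parameters (`{!tag=...}` on `fq`, `{!ex=...}` on `facet.field`) so that picking a value in one facet does not hide that facet's other values. Field names such as `issue_date` are assumptions, and value escaping is reduced to double quotes.

```python
# Hypothetical facet fields; the real attribute names are not in the slides.
FACET_FIELDS = ("persons", "locations", "organizations")


def quote(value):
    """Wrap a facet value in double quotes, escaping embedded quotes."""
    return '"' + value.replace('"', '\\"') + '"'


def build_refinement_params(user_query, selected, start_year, end_year):
    """Build Solr parameters for a timeline range plus multi-select facets.

    `selected` maps a facet field to the list of values the user ticked.
    """
    params = [("q", user_query), ("facet", "true")]
    # Timeline widget: restrict results to the chosen span of years.
    date_range = (
        f"issue_date:[{start_year}-01-01T00:00:00Z"
        f" TO {end_year}-12-31T23:59:59Z]"
    )
    params.append(("fq", date_range))
    # One tagged filter per facet with selections; values are OR-ed.
    for field, values in selected.items():
        if values:
            clause = " OR ".join(quote(value) for value in values)
            params.append(("fq", f"{{!tag={field}}}{field}:({clause})"))
    # Each facet excludes its own filter when counting, so the
    # alternatives the user did not tick remain visible.
    for field in FACET_FIELDS:
        params.append(("facet.field", f"{{!ex={field}}}{field}"))
    return params
```
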
Q&A
 nicola.provenzano@bdadoc.it

Bassilichi Group - Firenze - Italy
