A proposal for combining two different technologies, Solr and a triple store, in order to improve the (user) search experience by decoupling the “search” from the “view” perspective.
Linking library data for fast search and rich display
1. 31st ADLUG ANNUAL MEETING 2012
Sala Brunelleschi of the OPA – CESVOT - Firenze
19 – 21 September 2012
Linking Linked Data
Andrea Gazzarini
Software Architect
Copyright 2009-2010 @CULT. All rights reserved
2. Agenda
Goals
Information Retrieval
Triple store
Proof of concept
Q&A
4. Goals
1) Combine two different technologies in order to improve the (user) search
experience by decoupling the “search” from the “view” perspective.
2) Provide a fast, full-featured full-text search that is able to scale over billions
of records, providing typical search features like faceting, stemming,
autocompletion and so on...
3) Provide a system that is able to benefit from the extensibility
of Linked Data
5. Le avventure di Pinocchio
This is a record extracted from the recordset we will use during
this presentation.
000 00694nam a2200241 i 4500
008 971205s1997 it j 000 0 ita c
020 a 880921191X
082 1 a 853.8
100 1 a Collodi, Carlo.
245 13 a Le avventure di Pinocchio /
c C. Collodi ; illustrazioni di Attilio Mussino.
260 a Firenze :
b Giunti,
c 1997.
440 0 a Collana favolosa / [Giunti]
521 a Letteratura per ragazzi
700 1 a Mussino, Attilio.
7. Information Retrieval (1/2)
For our purposes we will (simplistically) define an Information Retrieval (IR) system as
a full-text search framework able to index textual data and manipulate
it in order to enable search features that are interesting for end users, like:
» Relevance computation and boosting
» Autocompletion
» Faceting
» Stemming
» Did you mean?
» Search by phoneme (i.e. Sounds Like)
» More like this
» ...and many many others...
But there's a price to pay for that...
8. Inverted index
In computer science, an inverted index (also referred to as postings file or
inverted file) is an index data structure storing a mapping from content, such
as words or numbers, to its locations in a database file, or in a document or a
set of documents. The purpose of an inverted index is to allow fast full text
searches, at a cost of increased processing when a document is added to the
database. The inverted file may be the database file itself, rather than its
index. It is the most popular data structure used in document retrieval systems
http://en.wikipedia.org/wiki/Inverted_index
An inverted index is an optimized structure that allows fast searches, but it is
supposed to be immutable: if you need to change something in
your data, you need to rebuild your index.
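As a rough illustration of the structure described above, the following Python sketch builds a tiny inverted index and searches it. The record ids, the second title and the whitespace tokenizer are assumptions made for the example; this is not the indexing code of the system presented here.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of record ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of the records containing every word of the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Hypothetical record ids and titles, used only for illustration.
docs = {
    "rec1": "Le avventure di Pinocchio",
    "rec2": "Le avventure di Tom Sawyer",
}
index = build_index(docs)
```

A search only looks up the postings of each query word, which is why it is fast; the cost is paid in build_index, which must process every record again when the data changes.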
9. Semantic destruction (1/3)
A search engine doesn't care about how much accuracy you put in and how
much time you spent cataloguing a bibliographic resource...once
indexed, it will lose any semantic meaning!
[Figure: the text “...ipsum dolor sit amet, consectetur adipiscing...” broken up into scattered, unrelated letters]
10. Semantic destruction (2/3)
Input: The adventures of Pinocchio
Tokenization: The adventures of Pinocchio
Stopwords: adventures Pinocchio
Lowercase: adventures pinocchio
Stemming (light): adventure pinocchio
Phoneme (!): ATFN PNX
These are the only tokens that will be indexed!
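The analysis steps listed on this slide can be sketched in Python as a chain of small functions. This is only an approximation under stated assumptions: the stopword list and the plural-stripping stemmer are illustrative stand-ins, not the analyzers used by the search engine, and the phonetic step is left out.

```python
import re

# Illustrative stopword list; a real analyzer would use a longer,
# language-specific one.
STOPWORDS = {"the", "of", "a", "an", "and"}

def tokenize(text):
    """Split the text into word tokens."""
    return re.findall(r"\w+", text)

def remove_stopwords(tokens):
    """Drop stopwords, comparing case-insensitively."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]

def light_stem(tokens):
    """Very light stemming: strip a plural 's' from longer tokens."""
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]

def analyze(text):
    """Apply the steps in the order shown on the slide (phonetic step omitted)."""
    tokens = tokenize(text)
    tokens = remove_stopwords(tokens)
    tokens = lowercase(tokens)
    return light_stem(tokens)

tokens = analyze("The adventures of Pinocchio")
```

Whatever the original field was (title, series, note), only the final tokens survive in the index, which is the "semantic destruction" this slide refers to.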
13. Triple store (1/2)
A triplestore is a purpose-built database for the storage and retrieval of triples,
a triple being a data entity composed of subject-predicate-object, like "Bob
is 35" or "Bob knows Fred".
http://en.wikipedia.org/wiki/Triplestore
Subject   Predicate      Object
book      hasTitle       The adventures of Pinocchio
book      hasAuthor      Collodi, Carlo
book      hasPublisher   Giunti
Of course it is more similar to a database and basically has nothing to do
with an inverted index.
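To make the subject-predicate-object model concrete, here is a minimal in-memory sketch in Python. It only illustrates the data model: a real triple store adds persistence, indexing and SPARQL, and the match function below is a hypothetical helper, not the API of any product.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for (ts, tp, to) in triples
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]

# The three statements from the table above.
triples = [
    ("book", "hasTitle", "The adventures of Pinocchio"),
    ("book", "hasAuthor", "Collodi, Carlo"),
    ("book", "hasPublisher", "Giunti"),
]

authors = match(triples, s="book", p="hasAuthor")
```

Unlike an inverted index, every statement keeps its predicate, so the store still knows that “Collodi, Carlo” is an author and not just a pair of words.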
14. Triple store (2/2)
Using a triple store you can have
1) a standard query language (SPARQL) to query the store;
2) a standard format for exchanging data (RDF);
3) a storage where you are free to change your data in real time
without doing any kind of reindex operation.
But, most important, you cannot have
any of the search features we described in the previous slides: for
some of them (e.g. faceting) it is practically impossible, for others
(e.g. autocompletion) the problem is mainly the response time.
16. Proof of Concept
Our system combines the two technologies described above, trying to keep
all their advantages and minimize their disadvantages.
Input formats: MARC (Binary), MARC XML, RDF / XML, N3, Turtle, NTriples
Search: Information Retrieval
View: Triple store
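The split between “search” and “view” can be sketched as follows. This is a toy model written for this explanation, assuming an in-memory index and an in-memory set of triples; the Catalog class and its method names are hypothetical and do not describe the real implementation.

```python
from collections import defaultdict

class Catalog:
    """Toy system: an inverted index for searching, triples for viewing."""

    def __init__(self):
        self.index = defaultdict(set)  # word -> record URIs (search side)
        self.triples = set()           # (subject, predicate, object) (view side)

    def add_record(self, uri, searchable_text, triples):
        """Index the searchable text and store the triples of a record."""
        for word in searchable_text.lower().split():
            self.index[word].add(uri)
        self.triples.update(triples)

    def search(self, query):
        """Search side: return only the URIs of the matching records."""
        postings = [self.index.get(w, set()) for w in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    def view(self, uri):
        """View side: collect everything the triple store knows about a URI."""
        description = defaultdict(list)
        for s, p, o in self.triples:
            if s == uri:
                description[p].append(o)
        return dict(description)

catalog = Catalog()
book = "http://www.cbt.trentinocultura.net/biblio/000002577949"
catalog.add_record(
    book,
    "Le avventure di Pinocchio",
    {
        (book, "dcterms:shortTitle", "Le avventure di Pinocchio"),
        (book, "dcterms:issued", "1997"),
    },
)
hits = catalog.search("pinocchio")
details = catalog.view(book)
```

The search returns nothing but identifiers; what the user sees is assembled afterwards from the triples, so the displayed data can grow without touching the index.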
18. Le avventure di Pinocchio (MARC)
000 00694nam a2200241 i 4500
008 971205s1997 it j 000 0 ita c
020 a 880921191X
082 1 a 853.8
100 1 a Collodi, Carlo.
245 13 a Le avventure di Pinocchio /
c C. Collodi ; illustrazioni di Attilio Mussino.
260 a Firenze :
b Giunti,
c 1997.
440 0 a Collana favolosa / [Giunti]
521 a Letteratura per ragazzi
700 1 a Mussino, Attilio.
19. Le avventure di Pinocchio (RDF / XML)
The book...
<bibo:Book rdf:about="http://www.cbt.trentinocultura.net/biblio/000002577949">
  <dcterms:identifier>000002577949</dcterms:identifier>
  <bibo:isbn10>880921191X</bibo:isbn10>
  <dcterms:shortTitle>Le avventure di Pinocchio</dcterms:shortTitle>
  <dcterms:title>Le avventure di Pinocchio / C. Collodi ; illustrazioni di Attilio Mussino</dcterms:title>
  <dc:creator rdf:resource="http://www.cbt.trentinocultura.net/person/collodi_carlo"/>
  <dcterms:language>ita</dcterms:language>
  <dcterms:audience rdf:resource="http://www.cbt.trentinocultura.net/subject/opera_per_bambini"/>
  <dcterms:isPartOf rdf:resource="http://www.cbt.trentinocultura.net/biblio/2378129373323" />
  <dcterms:extent>186 p.</dcterms:extent>
  <isbd:hasPlaceOfPublicationProductionDistribution>Firenze</isbd:hasPlaceOfPublicationProductionDistribution>
  <dcterms:issued>1997</dcterms:issued>
  <dcterms:publisher rdf:resource="http://www.cbt.trentinocultura.net/organisations/giunti"/>
</bibo:Book>
...the author...
<foaf:Person rdf:about="http://www.cbt.trentinocultura.net/person/collodi_carlo">
  <foaf:name>Collodi, Carlo</foaf:name>
</foaf:Person>
...and the publisher
<foaf:Organization rdf:about="http://www.cbt.trentinocultura.net/organisations/giunti">
  <foaf:name>Giunti</foaf:name>
</foaf:Organization>
20. Step 1: transform MARC into RDF
As a first step we need to transform MARC records into their corresponding RDF
representation.
This presentation is not focused on this advanced topic; we will just index ten
MARC records to demonstrate the capabilities of the system.
We chose the RDF / XML format for expressing the resulting triples. This
will be the input data of the system.
MARC 21 → RDF / XML
21. Step 2: submit RDF data
The RDF data created in the previous step needs to be submitted to the
system.
RDF / XML
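On the receiving side, the submitted RDF / XML has to be turned into triples. The Python sketch below shows one way this could look using only the standard library. It is an assumption-laden illustration, not the ingestion code of the system: it handles only the flat shape used in these slides (typed nodes with rdf:about whose children are literals or rdf:resource links), and the rdf:RDF wrapper with its namespace declarations is added here because the slides show only fragments.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def to_uri(tag):
    """Turn ElementTree's '{namespace}local' tag into a plain URI."""
    return tag[1:].replace("}", "", 1) if tag.startswith("{") else tag

def parse_rdf_xml(document):
    """Yield (subject, predicate, object) triples from simple RDF/XML."""
    root = ET.fromstring(document)
    for node in root:
        subject = node.get("{" + RDF + "}about")
        # The element name of a typed node states its rdf:type.
        yield (subject, RDF + "type", to_uri(node.tag))
        for prop in node:
            target = prop.get("{" + RDF + "}resource")
            value = target if target is not None else (prop.text or "").strip()
            yield (subject, to_uri(prop.tag), value)

# Wrapper element and namespace declarations are assumptions for this example.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Organization rdf:about="http://www.cbt.trentinocultura.net/organisations/giunti">
    <foaf:name>Giunti</foaf:name>
  </foaf:Organization>
</rdf:RDF>"""

triples = list(parse_rdf_xml(SAMPLE))
```

The resulting triples can then be stored for display, while the literal values chosen for searching are sent to the index.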
22. Step 3: make a search...
Autocompletion
Faceting
23. Step 4: more publisher data...
It would be great if my users could see
additional data in the search results.
For example, I could ask publishers for data
(logo, homepage and so on)...for them it
could be a kind of advertisement, while for my users it would be
additional information displayed in my catalog.
But
1) I don't want that data to be part of my search index;
2) I don't want to include that data in my bibliographic database;
3) I don't want to reindex my data when some publisher information changes;
4) I would like to manage and improve that data without affecting searches.
24. Step 6: Our sample publisher
Before...
<foaf:Organization rdf:about="http://www.cbt.trentinocultura.net/organisations/giunti">
<foaf:name>Giunti</foaf:name>
</foaf:Organization>
...and after
<foaf:Organization rdf:about="http://www.cbt.trentinocultura.net/organisations/giunti">
<foaf:name>Giunti</foaf:name>
<foaf:logo rdf:resource="http://www.giunti.it/custom/src/@css/images/logo_Giunti.jpg"/>
<rdfs:comment>Fondata nel pieno delle battaglie risorgimentali...</rdfs:comment>
<foaf:mbox rdf:resource="mailto:contactsus@domain.it"/>
<foaf:homepage rdf:resource="http://www.giunti.it"/>
</foaf:Organization>
As you can see, we added a logo, a brief description of the publisher, a mailbox and a
homepage. We got the data directly from the publisher's website.
This data will be submitted again to the search system, but without rebuilding the search index.
As a consequence, changes made to the publishers are immediately available.
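The effect of this step can be sketched in a few lines of Python: the new publisher statements go into the triple store only, and the search index is left exactly as it was. The data structures and the enrich helper are hypothetical simplifications, and the index content is toy data.

```python
GIUNTI = "http://www.cbt.trentinocultura.net/organisations/giunti"
BOOK = "http://www.cbt.trentinocultura.net/biblio/000002577949"

# Search side: built once from the bibliographic data (toy content).
search_index = {"pinocchio": {BOOK}}

# View side: the publisher as it was before the enrichment.
triple_store = {(GIUNTI, "foaf:name", "Giunti")}

def enrich(store, subject, properties):
    """Add (predicate, object) pairs for a subject; the index is not involved."""
    store.update((subject, p, o) for p, o in properties)

index_before = dict(search_index)
enrich(triple_store, GIUNTI, [
    ("foaf:homepage", "http://www.giunti.it"),
    ("foaf:mbox", "mailto:contactsus@domain.it"),
])
```

After the call, the next time a record published by Giunti is displayed, the view side already finds the homepage and the mailbox.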
25. Step 7: see additional data...
26. Step 7 bis: another publisher...
27. Step 8: still more (linked) data... (1/3)
Great! My users were enthusiastic!!
So I'd like more...and not only publishers...
but what else?
Sir, I think it would be very useful if we
showed author information beside each record.
Yes, it definitely would, but you have no idea how much
work it took to insert all the publisher data, and I don't
want to do the same for authors...too much work!
If I remember correctly, your system is
using Linked Data, isn't it?
Yes
So in this case the right question is not “How can I do it? I have no data”,
but “What kind of data would I like to show?”
???
28. Step 8: still more (linked) data...(2/3)
There are a lot of authoritative RDF endpoints that expose their data free of charge;
the main advantage is that you can link this information to your system and you
don't have to worry about its maintenance: it's not your data! See
http://viaf.org or http://dbpedia.org
By linking those resources, you can get data in a standardized way because the sources
share one or more (accepted) ontologies for describing authors, subjects,
things and so on...
So for the example above we need to gather additional information about people
(authors), and fortunately there's an ontology called Friend of a Friend (FOAF) that
fits our needs exactly. This ontology is widely used by RDF sources describing persons
(like VIAF and DBpedia).
In our example, instead of copying and storing in our triple store (as we did for
publishers) all information about Carlo Collodi, the author of “The adventures of
Pinocchio”, we will simply link our internal representation with the same resource
as defined in DBpedia.
29. Step 8: still more (linked) data...(3/3)
30. Step 9: Our sample author
Before...
<foaf:Person rdf:about="http://www.cbt.trentinocultura.net/person/collodi_carlo">
<foaf:name>Collodi, Carlo</foaf:name>
</foaf:Person>
...and after
<foaf:Person rdf:about="http://www.cbt.trentinocultura.net/person/collodi_carlo">
<foaf:name>Collodi, Carlo</foaf:name>
<owl:sameAs rdf:resource="http://dbpedia.org/resource/Carlo_Collodi"/>
</foaf:Person>
As you can see, we didn't add any information, just a “link” with the sameAs predicate.
The URI (http://dbpedia.org/resource/Carlo_Collodi) points to a web resource describing
Carlo Collodi, so we can gather this data and (for example) display it to the end user.
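Following the sameAs link at display time could look like the Python sketch below. It is a simplified illustration: the describe function is hypothetical, and the remote lookup is replaced by a stand-in function that returns placeholder data, since a real implementation would request the RDF description from the linked source over HTTP.

```python
OWL_SAME_AS = "owl:sameAs"

def describe(uri, local_triples, fetch_remote):
    """Merge local statements about a URI with those of its owl:sameAs targets."""
    description = {}
    for s, p, o in local_triples:
        if s != uri:
            continue
        if p == OWL_SAME_AS:
            # Follow the link: these statements come from the remote source.
            for remote_p, remote_o in fetch_remote(o):
                description.setdefault(remote_p, []).append(remote_o)
        else:
            description.setdefault(p, []).append(o)
    return description

COLLODI = "http://www.cbt.trentinocultura.net/person/collodi_carlo"
local_triples = [
    (COLLODI, "foaf:name", "Collodi, Carlo"),
    (COLLODI, OWL_SAME_AS, "http://dbpedia.org/resource/Carlo_Collodi"),
]

def fake_fetch(remote_uri):
    """Stand-in for a request to the remote source; the data is a placeholder."""
    return [("rdfs:comment", "placeholder description from " + remote_uri)]

author = describe(COLLODI, local_triples, fake_fetch)
```

Only the link is stored locally; everything else about the author stays with the remote source, which also maintains it.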
31. Step 10: again the same search...
32. Step 10 bis: another author...
33. Step 11: still more data??? yes!
Wow!! And now?
Is there some other content I could “link”?
Yes sir, subjects for example...are you using subjects
coming from the “Nuovo Soggettario”?
Yes
So in this case you can link those subjects directly
to concepts of the thesaurus, thereby providing
end users with information like scope notes,
history notes, term relationships and so on...
And, as another example, for places you can link “Geonames”
resources, which provide RDF descriptions of cities and countries.
34. Step 12: Linking the “Nuovo Soggettario“
35. Step 13: Linking Firenze with Geonames
37. 31st ADLUG ANNUAL MEETING 2012
Sala Brunelleschi of the OPA – Firenze
19 – 21 September 2012
Linking Linked Data
Thank You!