The document discusses indexing and searching cross-media content in the ECLAP social network. It describes ECLAP, its goals for developing an indexing/searching solution, and its data model. It then covers the indexing and searching approaches used, which are based on Apache Solr and allow for multilingual, faceted searching across different content types. System usage is also assessed based on log analysis of user searches and content access over a three-month period.
Indexing and Searching Cross Media Content in a Social Network
1. Indexing and Searching Cross Media Content in a Social Network
Pierfrancesco Bellini, Daniele Cenni, Paolo Nesi
University of Florence
Department of Systems and Informatics
Distributed Systems and Internet Technology Laboratory
ECLAP Conference, May 7-9, 2012
2. ECLAP Social Network
ECLAP is a Digital Library on Performing
Arts connected with Europeana
ECLAP is a Social Network (blogs,
forums, comments, tagging, voting, …)
3. Goals/Requirements
Develop an Indexing/Searching solution for ECLAP
Social Network allowing:
Indexing multilingual crossmedia content metadata and
data (e.g. documents)
Indexing portal blogs, forums, events, group pages,
comments, etc.
Efficient multilingual search (keyword search and
advanced search) supporting:
misspelled words (e.g. shespeare)
partial word search
Sorting and filtering search results
re-index the whole data without blocking the system
Log and monitor users activity
…
Evaluate the Indexing/Searching service
4. ECLAP Data Model
[Figure: class diagram of the ECLAP data model. Entities shown: Group/Channel, TaxonomyTerm, Content, Comment, Metadata (Performing Arts, Dublin Core, Technical), Blog, WebPage, Forum, Object, Playlist, Document, Collection, Annotation, AVObject, Image, Video, Audio. Associations carry multiplicities such as 0..n, 1, 1..n and 1..2.]
5. Indexing
Indexing & Search system
Based on Apache Solr
Multilingual aspects
Translate the metadata or translate the query?
We use metadata translation
Indexing schema
Dublin Core + DCTerms (multi language)
Performing Arts
Technical (provider, content type, GPS, IPR, duration, quality, …)
Groups associations (multi language)
Taxonomy associations (multi language)
Comments & multi language tags
FullText of the textual digital resources
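One way to picture the metadata-translation approach is to store every translated value in its own per-language field and to copy all values into a catch-all field. The sketch below is only an illustration: the helper `build_index_doc` and the field names (`title_en`, `text_all`) are assumptions, not the actual ECLAP schema.

```python
from typing import Dict, List


def build_index_doc(object_id: str,
                    metadata: Dict[str, Dict[str, str]]) -> Dict[str, object]:
    """Flatten {field: {lang: value}} into per-language index fields.

    Field names such as "title_en" and "text_all" are hypothetical.
    """
    doc: Dict[str, object] = {"id": object_id}
    catch_all: List[str] = []
    for field, translations in sorted(metadata.items()):
        for lang, value in sorted(translations.items()):
            doc[f"{field}_{lang}"] = value   # one field per language
            catch_all.append(value)          # also searchable in any language
    doc["text_all"] = catch_all
    return doc


example = build_index_doc(
    "obj-1",
    {"title": {"en": "The Tempest", "it": "La tempesta"},
     "description": {"en": "Stage recording"}},
)
```

With this layout a query can target one language (`title_it`) or all of them at once (`text_all`).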
6. Indexing
Index field groups (ML = multi language): DC (ML), Tech, Perf. Arts, Full Text, Taxonomy/Group (ML), Comment/Tags (ML), Votes
Media Type                                      Indexed field groups
Audio/Video/Image                               Y Y Y Y Y Y
Document (pdf, doc, …)                          Y Y Y Y Y Y Y
CrossMedia (html, MPEG21, …)                    Y Y Y Y Y Y Y
Aggregations (playlist, collection, …)          Y Y Y Y Y Y
Info text (blog, web pages, forum, events, …)   (Y) Y Y
8. Indexing
Re-indexing
In case of new indexing schema or index
corruption the search system should not be
blocked
The re-indexing is done on a separate indexing
machine while the production system keeps using the
current index
During re-indexing, newly uploaded/modified
content is marked to be re-indexed when the
new index is put in production
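The re-indexing procedure above can be sketched as a toy in-memory model. Everything here (the `SearchService` class, the dictionary standing in for an index, the method names) is an assumption made for illustration and not the ECLAP implementation.

```python
from typing import Dict, List, Set


class SearchService:
    """Toy model: queries hit the live index while a new one is built elsewhere."""

    def __init__(self, content: Dict[str, str]) -> None:
        self.content = dict(content)       # source of truth (the portal data)
        self.live_index = dict(content)    # index currently in production
        self.rebuilding = False
        self.pending: Set[str] = set()     # items changed during the rebuild

    def start_rebuild(self) -> Dict[str, str]:
        """Snapshot the content; the indexing machine would work on this copy."""
        self.rebuilding = True
        self.pending.clear()
        return dict(self.content)

    def save(self, object_id: str, text: str) -> None:
        """Upload or modify content; mark it if a rebuild is in progress."""
        self.content[object_id] = text
        self.live_index[object_id] = text
        if self.rebuilding:
            self.pending.add(object_id)

    def search(self, word: str) -> List[str]:
        return sorted(oid for oid, text in self.live_index.items() if word in text)

    def finish_rebuild(self, new_index: Dict[str, str]) -> None:
        """Re-index the marked items into the new index, then swap it in."""
        for object_id in self.pending:
            new_index[object_id] = self.content[object_id]
        self.live_index = new_index
        self.pending.clear()
        self.rebuilding = False


service = SearchService({"a": "hamlet video", "b": "opera audio"})
snapshot = service.start_rebuild()
service.save("c", "hamlet poster")     # arrives while the rebuild is running
during = service.search("hamlet")      # the production index still answers
service.finish_rebuild(snapshot)
after = service.search("hamlet")
```

The point of the pending set is that content saved during the rebuild is not lost when the new index replaces the old one.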
9. Searching
Full text search
Uses the catch all fields to search for
keywords in most important fields in all
languages (title, description, text, body,
subject,…)
Fuzzy search
Allows matching mistyped words
Deep search
Allows searching for partial words
Relevance & boosting of terms
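The three query features above map onto the Lucene query syntax supported by Solr: `~` for fuzzy matching, a trailing `*` for prefix (partial word) matching and `^` for boosting a clause. The helpers below only build query strings; the boost weights are invented for the example, and escaping of special characters is left out.

```python
from typing import Dict


def fuzzy(term: str) -> str:
    """Fuzzy match, e.g. to tolerate a mistyped word."""
    return f"{term}~"


def partial(fragment: str) -> str:
    """Prefix match: any word starting with the fragment."""
    return f"{fragment}*"


def boosted_query(term: str, boosts: Dict[str, float]) -> str:
    """Search the term in several fields, weighting each field differently."""
    clauses = [f"{field}:{term}^{weight:g}" for field, weight in boosts.items()]
    return " OR ".join(clauses)


fuzzy_q = fuzzy("shespeare")
partial_q = partial("shake")
boosted_q = boosted_query("hamlet", {"title": 3, "description": 1.5})
```

Here `boosted_q` makes a match in the title count more than a match in the description.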
15. Conclusions
Solution allows indexing multilingual
metadata and texts
Searching & filtering results
The search facility assessment shows that
search is a feature users actually use
16. Context & Assessment
Context
Social Network
User and content items
Content distribution portal
Video on demand portal
Archive, digital library, Performing Arts
http://www.eclap.eu
Assessment
User behavior
Log user actions on the Web portal
User happiness
Measure the level of user satisfaction about the exposed
services
17. Logging User Profile
User Profile
Registered or anonymous, uid (user id)
Timestamp YY-mm-dd hh:mm:ss
IP address, Proxy type etc.
Platform (OS, Browser)
GeoIP data (Country, Region, City)
Friends, connections
Betweenness, Eccentricity
Joined groups
User preferred contents
18. Understanding User behavior
Online survey
A simple module, in the right side of the portal
Presenting 3 - 4 questions per topic (depending on the
current portal section visited)
Stat Drupal Modules
Custom implemented modules
Log User Activity
Keep track and depict main figures about portal activity
Can be filtered by date, user, type of content, group,
type of activity (content enrichment, social promotion,
networking etc.)
Google Analytics
19. Understanding User behavior
Top Metrics
Avg # Visits/User
Avg # Queries/User
Avg # Clicks/User
Avg Visit duration
Avg Query length
Query refinement rate
Next Page Click Rate
Back Page Click Rate
Frequency of searching (once/day, week etc.)
Success of searching (assessment...)
…
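A few of these metrics can be computed directly from a query log. The sketch below assumes a log of (user id, query text) pairs in chronological order and, as a simplification, counts a query as a "refinement" when it shares at least one word with the same user's previous query. Both the log format and that definition are assumptions; the sample log is invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

LogEntry = Tuple[str, str]  # (user id, query text), in chronological order


def search_metrics(log: List[LogEntry]) -> Dict[str, float]:
    if not log:
        raise ValueError("empty log")
    per_user: Dict[str, List[str]] = defaultdict(list)
    for user, query in log:
        per_user[user].append(query)

    total_queries = len(log)
    total_words = sum(len(query.split()) for _, query in log)

    # Assumed definition: a refinement shares a word with the previous query.
    refinements = 0
    for queries in per_user.values():
        for prev, cur in zip(queries, queries[1:]):
            if set(prev.lower().split()) & set(cur.lower().split()):
                refinements += 1

    return {
        "avg_queries_per_user": total_queries / len(per_user),
        "avg_query_length": total_words / total_queries,
        "query_refinement_rate": refinements / total_queries,
    }


sample_log = [
    ("u1", "hamlet"),
    ("u1", "hamlet video"),
    ("u2", "dario fo"),
    ("u1", "opera"),
]
metrics = search_metrics(sample_log)
```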
20. Logging User Behavior
Logging user activities on the portal
Downloads/Views
Queries
Anonymous/Registered portal accesses
(login/logout)
Adding/Updating/Deleting digital contents
Menu clicks
Content Upload
Content Management
Social Promotion & Networking
22. Logging User Behavior
Queries (Simple, Faceted, Advanced)
Distribution of Queries per
User, Content type, Device, IP, User Agent, Query Type,
Country, Region, City, Locale, Filter (faceted)
Query Cloud
Keyword Cloud
IPR Wizard
Definition and usage of IPR Models
Metadata Editor
Access and usage
Add, Edit metadata
Video Annotations
Personal content
Other users content
23. Logging User Behavior
Social Promotion & Networking
Analysis of
Eccentricity
Betweenness
Connections
Creation, Access of Public/Private Web Pages
Activity on Forums, Blogs, Groups or between users
New Contents
Comments to Objects/Web Pages
Invited People
Featured Objects
Recommendations, suggested content
Export/Import of links to/from other SN
Private Messages
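For reference, eccentricity and betweenness can be computed on a small friendship graph as follows. This is a generic sketch (breadth-first search for eccentricity, Brandes' algorithm for unnormalized betweenness on an unweighted, undirected graph); the graph and user names are invented, and nothing here is taken from the ECLAP code.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[str, List[str]]  # adjacency lists, undirected


def bfs_distances(graph: Graph, source: str) -> Dict[str, int]:
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def eccentricity(graph: Graph, node: str) -> int:
    """Largest shortest-path distance from node to any reachable user."""
    return max(bfs_distances(graph, node).values())


def betweenness(graph: Graph) -> Dict[str, float]:
    """Brandes' algorithm: number of shortest paths passing through each node."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack: List[str] = []
        preds: Dict[str, List[str]] = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: value / 2 for v, value in score.items()}


friends: Graph = {
    "anna": ["bruno"],
    "bruno": ["anna", "carla"],
    "carla": ["bruno", "dino"],
    "dino": ["carla"],
}
ecc = {user: eccentricity(friends, user) for user in friends}
btw = betweenness(friends)
```

In this chain of four users the two middle users have the highest betweenness and the lowest eccentricity.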
24. Logging User Behavior
Menu Clicks
Distribution of clicks per
User, IP, Locale, Timestamp etc.
LAST POSTED, FEATURED, CALENDAR, ADVANCED SEARCH,
UPLOAD AND INGEST, POPULAR, MY CONTENT, MY GROUPS ,
MY COLLEAGUES, GET AFFILIATED, TERMS OF USE, PRIVACY
POLICY, TOP RATED, COURSES, LESS POPULAR, UPLOAD NEW
CONTENT, etc.
Ranking/Voting
# of ranked items
Distribution per
User, IP, Locale, Timestamp etc.
QR Code
Access from Mobile Devices
Workflow
Distribution of Workflow Type
Content Upload
Distribution of uploads per
User, Partner, Timestamp
25. Content Access
September 1st – November 30th 2011
Affiliation                        # View/Play   # Download
DSI                                46            0
Not partners/Affiliated            1292          14
Partners/Affiliated (except DSI)   6712          119
Public Users                       21418         947

Affiliation                        # View/Play   # Download
DSI                                3             0
Not partners/Affiliated            100           4
Partners/Affiliated (except DSI)   218           11
Public Users                       2225          869
26. Menu Clicks
September 1st – November 30th 2011
Menu                         # Clicks
ABOUT->ECLAP DESCRIPTION     671
EVENTS->PAST AND FUTURE      536
SEARCH->GROUPS               524
ABOUT->ECLAP NEWS BLOG       463
CONTENT->LAST POSTED         265
CONTENT->FEATURED            343
HOWTO->UPLOAD AND INGEST     330
SEARCH->ADVANCED SEARCH      314
EVENTS->CALENDAR             298
ABOUT->ECLAP PARTNERS        269
ABOUT->MAIN CONTACT          249
CONTENT->POPULAR             239
27. Search
September 1st – November 30th 2011
Affiliation                        # Simple Queries   # Faceted Queries
DSI                                13                 0
Not partners/Affiliated            323                24
Partners/Affiliated (except DSI)   1094               21
Public Users                       2634               147

Affiliation                        # Advanced Queries
DSI                                0
Not partners/Affiliated            18
Partners/Affiliated (except DSI)   4
28. Drupal Stat Metrics
September 1st – November 30th 2011
Content Access per nid
32. Google Analytics vs Drupal Stats
Service: Google Analytics
Pros: traffic source data; bounce rate; recency (since when); loyalty (how often); session times
Cons: IP approach, each IP is considered a unique visitor; can’t deal with specific actions on portal (e.g. downloads, queries)
Service: Drupal Stats
Pros: identity approach; actions (download, user access, queries); content type filtering
Cons: can’t deal with traffic source data and bounce rate; session time is a raw approximation
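The difference between the IP approach and the identity approach shows up as soon as several users share an address or one user connects from several addresses. A small sketch, with an invented access log and the assumption that anonymous visits fall back to the IP address:

```python
from typing import List, Optional, Set, Tuple

Access = Tuple[str, Optional[str]]  # (IP address, uid or None if anonymous)


def visitors_by_ip(accesses: List[Access]) -> int:
    """IP approach: every distinct address counts as one visitor."""
    return len({ip for ip, _ in accesses})


def visitors_by_identity(accesses: List[Access]) -> int:
    """Identity approach: registered users are counted by uid.

    Anonymous accesses are counted by IP (an assumption of this sketch).
    """
    keys: Set[Tuple[str, str]] = set()
    for ip, uid in accesses:
        keys.add(("uid", uid) if uid is not None else ("ip", ip))
    return len(keys)


sample_accesses: List[Access] = [
    ("192.0.2.1", "alice"),
    ("192.0.2.1", "bob"),     # same address, different user
    ("192.0.2.2", "alice"),   # same user, different address
    ("192.0.2.1", None),      # anonymous visit
]
by_ip = visitors_by_ip(sample_accesses)
by_identity = visitors_by_identity(sample_accesses)
```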
33. Sorting Results
Sorting by
Upload Time (first time doc uploading date)
Update Time (last time doc updating date)
Score (doc relevance to search query)
Combined with faceting and paging
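Sorting, faceting and paging can be combined in a single Solr request through the standard `sort`, `facet`, `facet.field`, `start` and `rows` parameters. The sketch below only builds the query string; the field names `upload_time`, `update_time` and `content_type` are assumptions, not the actual ECLAP schema.

```python
from typing import Sequence
from urllib.parse import urlencode

# Hypothetical index field names for the three sort criteria.
SORT_CLAUSES = {
    "upload": "upload_time desc",
    "update": "update_time desc",
    "score": "score desc",
}


def build_search_params(query: str,
                        sort_by: str = "score",
                        page: int = 0,
                        page_size: int = 10,
                        facet_fields: Sequence[str] = ("content_type",)) -> str:
    params = [
        ("q", query),
        ("sort", SORT_CLAUSES[sort_by]),
        ("start", page * page_size),   # offset of the first result
        ("rows", page_size),           # results per page
        ("facet", "true"),
    ]
    params += [("facet.field", name) for name in facet_fields]
    return urlencode(params)


query_string = build_search_params("hamlet", sort_by="upload", page=2)
```

The resulting string would be appended to the select URL of the Solr instance.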
34. Suggestions
Real time: while the user types a query, the system
suggests similar searches
ecl…
eclap
eclap-de-2-1-1-user
eclap-de-2-2-1-usergroup
…
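A minimal way to produce such suggestions is a prefix lookup over a sorted list of known terms. This is only a sketch of the idea using binary search; it does not describe how ECLAP or Solr implement suggestions, and the extra terms in the list are invented.

```python
import bisect
from typing import List


def suggest(prefix: str, sorted_terms: List[str], limit: int = 5) -> List[str]:
    """Return up to `limit` terms starting with `prefix`.

    `sorted_terms` must be lowercase and sorted.
    """
    prefix = prefix.lower()
    start = bisect.bisect_left(sorted_terms, prefix)
    matches: List[str] = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(matches) == limit:
            break
        matches.append(term)
    return matches


known_terms = sorted([
    "eclap",
    "eclap-de-2-1-1-user",
    "eclap-de-2-2-1-usergroup",
    "europeana",
    "dance",
])
suggestions = suggest("ecl", known_terms)
```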
36. Indexing/Searching Reqs
Enriching search experience
Results Sorting
Suggestions
Large # of contents (~ 10^4 to 10^6)
External Indexing Service
Hidden/Private contents management
Monitoring Exceptions
Email notifications
Search Engine Friendly (Google, Bing, Yahoo etc.)
content site crawling HTML dumping
37. External Indexing Service 1/3
Setup an external service to avoid server
overloading when building the index
Taxonomization
Indexing (with exceptions monitoring)
Index Synchronization
Old Index replacement with new one
Index updating
Old contents cleaning (optional)
38. External Indexing Service 2/3
Taxonomization
Has a cost: pre-computing
Digital content
Execution Rule (JS)
Indexed with object records
Taxonomy          Parent
Performing Arts   -
Cinema            Performing Arts
Music             Performing Arts
Documentary       Cinema
Historical        Cinema
Classical         Music
Pop               Music
[Figure: the taxonomy tree (Performing Arts; Cinema, Music; Documentary, Historical, Classical, Pop) and an example object whose taxonomy is Performing Arts; Cinema, Music; Documentary, Classical]
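The pre-computation behind taxonomization can be pictured as expanding each term assigned to an object with all of its ancestors, so that the full path is indexed with the object record. The sketch uses the taxonomy/parent pairs listed on this slide; the function names are assumptions.

```python
from typing import Dict, List, Optional

# Taxonomy term -> parent term, as listed on the slide.
PARENT: Dict[str, Optional[str]] = {
    "Performing Arts": None,
    "Cinema": "Performing Arts",
    "Music": "Performing Arts",
    "Documentary": "Cinema",
    "Historical": "Cinema",
    "Classical": "Music",
    "Pop": "Music",
}


def ancestors(term: str) -> List[str]:
    """Parents of a term, from the nearest one up to the root."""
    path: List[str] = []
    parent = PARENT[term]
    while parent is not None:
        path.append(parent)
        parent = PARENT[parent]
    return path


def taxonomize(terms: List[str]) -> List[str]:
    """Terms of an object plus all their ancestors, without duplicates."""
    expanded: List[str] = []
    for term in terms:
        for name in [term] + ancestors(term):
            if name not in expanded:
                expanded.append(name)
    return expanded


object_terms = taxonomize(["Documentary", "Classical"])
```

An object tagged only with Documentary and Classical then also matches searches filtered on Cinema, Music or Performing Arts.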
39. External Indexing Service 3/3
Indexing with exceptions monitoring
Real-time notifying system
Event time and type (add, update)
Full stacktrace info
Customizable recipients
[Figure: Object Indexing, Resource Parse Error, Recovery, Metadata Indexing]
Index synchronization
During external indexing, contents may be
updated/added/deleted on the original index
Need to update these contents
on the index (state flag)
Indexed   External Indexed
1         1
0         1
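The exception monitoring listed on this slide (event time and type, full stack trace, customizable recipients) could look roughly like the following. The function `index_with_monitoring`, the notification dictionary and the callbacks are all hypothetical; a real service would send an email instead of appending to a list.

```python
import traceback
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def index_with_monitoring(objects: Iterable[Tuple[str, str]],
                          index_one: Callable[[str], None],
                          notify: Callable[[Dict[str, object]], None],
                          recipients: Sequence[str]) -> List[str]:
    """Index (object id, event type) pairs; report failures and keep going."""
    indexed: List[str] = []
    for object_id, event_type in objects:
        try:
            index_one(object_id)
            indexed.append(object_id)
        except Exception:
            notify({
                "time": datetime.now(timezone.utc).isoformat(),
                "event": event_type,                     # "add" or "update"
                "object": object_id,
                "stacktrace": traceback.format_exc(),    # full stack trace
                "recipients": list(recipients),
            })
    return indexed


def fake_index(object_id: str) -> None:
    """Stand-in for the real indexing call; fails on one object."""
    if object_id == "bad":
        raise ValueError("resource parse error")


sent: List[Dict[str, object]] = []
done = index_with_monitoring(
    [("ok-1", "add"), ("bad", "update"), ("ok-2", "add")],
    fake_index,
    sent.append,
    ["admin@example.org"],
)
```

A failing object does not stop the run: it is reported and the remaining objects are still indexed.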
40. Search Engine Friendly
HTML dump service
Java external service
Periodically invoked by an AXCP rule
Full metadata exporting
Thumbnail
Resource link
Multilanguage
Paginated results
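As an illustration of what such a dump produces, the sketch below renders metadata records into paginated static HTML fragments that a crawler can follow. The actual service is a Java program; this Python sketch, its record keys (`title`, `link`, `thumbnail`) and the sample records are assumptions.

```python
from html import escape
from typing import Dict, List


def dump_pages(records: List[Dict[str, str]], page_size: int = 2) -> List[str]:
    """Render records as HTML lists, `page_size` records per page."""
    pages: List[str] = []
    for start in range(0, len(records), page_size):
        items = []
        for record in records[start:start + page_size]:
            items.append(
                '<li><a href="{}"><img src="{}" alt=""> {}</a></li>'.format(
                    escape(record["link"]),        # resource link
                    escape(record["thumbnail"]),   # thumbnail URL
                    escape(record["title"]),       # metadata text
                )
            )
        pages.append("<ul>\n" + "\n".join(items) + "\n</ul>")
    return pages


sample_records = [
    {"title": "Hamlet", "link": "/obj/1", "thumbnail": "/thumb/1.jpg"},
    {"title": "Romeo & Juliet", "link": "/obj/2", "thumbnail": "/thumb/2.jpg"},
    {"title": "Macbeth", "link": "/obj/3", "thumbnail": "/thumb/3.jpg"},
]
html_pages = dump_pages(sample_records)
```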
41. Conclusions
Drupal integrated solution for user behavior tracking
and analysis
Logging
Stat Data Graph
Online Survey
External Indexing Service
Avoids server overloading
HA of query service
Error recovering
Detailed event notifying system
Index Optimization
Dumping tool for portal contents (SEO)
Full metadata HTML exporting
Scheduled Service
42. Future Work
Keep collecting Data
Deeper Data Analysis
User Sessions
1st, 2nd..., nth click average user behavior
Depict a modular view of the system usage
Popularity/Usability for each feature &
functionality
Social Network Analysis (SNA)
Huge Population
User relationships, connections, friendships
43. References
P. Bellini, I. Bruno, D. Cenni, P. Nesi, "Micro grids for
scalable media computing and intelligence on
distributed scenarios", IEEE Multimedia, 2011
P. Bellini, I. Bruno, D. Cenni, P. Nesi, M. Paolucci, M.
Serena, "Semantic Model for Cultural Heritage Social
Network and Cross Media Content for Multiple
Devices", Conference of the Italian Association of
Artificial Intelligence, Workshop for Cultural Heritage,
15-17 September 2011, Palermo, Italy
47. Drupal
What is it?
Open source content management platform
Developed by Dries Buytaert in 2001
Written in PHP
Users: The Economist, Examiner.com, The
White House, data.gov.uk
Runs on a WEB server (e.g. Apache, IIS) and
a database (e.g. MySQL, PostgreSQL)
48. Apache Lucene
What is it?
High-performance, full-featured text
search engine library (indexing and
searching documents)
Developed by Doug Cutting, released on
SourceForge (2000), joined the Apache Software
Foundation in 2001
Written entirely in Java
Users: Wikipedia, Technorati, Nabble,
TheServerSide, Akamai, SourceForge
49. Apache Lucene
Features
Ranked searching (best results returned first)
Powerful query types: phrase queries, wildcard
queries, proximity queries, range queries and more
Fielded searching (e.g., title, author, contents)
Date-range searching
Sorting by any field
Multiple-index searching with merged results
Allows simultaneous update and searching
50. Apache Lucene
Features
Documents added via IndexWriter
Document = a collection of fields
No config files, dynamic field typing
Flexible text analysis: tokenizers, filters
Search for documents via IndexSearcher
Hits = search(Query,Filter,Sort,topN)
Scoring: tf * idf * lengthNorm
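The three scoring factors can be made concrete with a small calculation. The factor definitions below follow the classic Lucene TF-IDF similarity as commonly documented (square root of the term frequency, logarithmic inverse document frequency, inverse square root of the field length). They should be read as assumptions of this sketch; the complete Lucene formula contains further factors such as boosts and query normalization.

```python
import math


def tf(freq: int) -> float:
    """Term frequency factor: grows with occurrences, but sub-linearly."""
    return math.sqrt(freq)


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: rare terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms: int) -> float:
    """Length normalization: matches in short fields weigh more."""
    return 1.0 / math.sqrt(num_terms)


def score(freq: int, num_docs: int, doc_freq: int, num_terms: int) -> float:
    """Simplified single-term score: tf * idf * lengthNorm."""
    return tf(freq) * idf(num_docs, doc_freq) * length_norm(num_terms)


# A term occurring 4 times in a 16-term field, found in 9 of 10 documents.
example_score = score(freq=4, num_docs=10, doc_freq=9, num_terms=16)
```

With these inputs tf is 2, idf is 1 and lengthNorm is 0.25, so the product is 0.5. The same match in a shorter field or on a rarer term scores higher.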
51. Apache Solr
What is it?
A full text search server based on
Lucene (Lucene sub-project)
Developed by Yonik Seeley at CNET
Networks (2004), donated to the Apache
Software Foundation (2006)
Written in Java, deployable as a WAR
Users: CNET Reviews, CNET Channel,
shopper.com, news.com, nines.org,
krugle.com, oodle.com, booklooker.de
52. Apache Solr
Features
Advanced Full-Text Search Capabilities
Optimized for High Volume Web Traffic
Standards Based Open Interfaces (XML, JSON,
HTTP)
Web Administration Interface
Server statistics exposed over JMX for
monitoring
Scalability, efficient Replication to other Solr
Search Servers
Flexible and Adaptable with XML configuration
Extensible Plugin Architecture