Indexing and Searching Cross
  Media Content in a Social
          Network
    Pierfrancesco Bellini, Daniele Cenni, Paolo Nesi


                 University of Florence
         Department of Systems and Informatics
Distributed Systems and Internet Technology Laboratory


               ECLAP Conference, May 7-9, 2012
ECLAP Social Network

   ECLAP is a Digital Library on Performing
    Arts connected with Europeana

   ECLAP is a Social Network (blogs,
    forums, comments, tagging, voting, …)
Goals/Requirements
   Develop an Indexing/Searching solution for ECLAP
    Social Network allowing:
       Indexing multilingual crossmedia content metadata and
        data (e.g. documents)
       Indexing portal blogs, forums, events, group pages,
        comments, etc.
       Efficient multilingual search (keyword search and
        advanced search) supporting:
            misspelled words (e.g. shespeare)
            partial word search
       Sorting and filtering search results
       Re-indexing all data without blocking the system
       Logging and monitoring user activity
       …
   Evaluate the Indexing/Searching service
ECLAP Data Model
   [Figure: UML class diagram of the ECLAP data model. Classes shown:
    Group/Channel, TaxonomyTerm, Content, Comment, Metadata (Performing
    Arts, Dublin Core, Technical), Blog, WebPage, Forum, Object, Playlist,
    Document, Collection, Annotation, AVObject, Image, Video, Audio.
    Associations carry multiplicities such as 0..n, 1, 1..n and 1..2.]
Indexing
   Indexing & Search system
       Based on Apache Solr
   Multilingual aspects
       Translate the metadata or translate the query?
       We use metadata translation
   Indexing schema
       Dublin Core + DCTerms (multi language)
       Performing Arts
       Technical (provider, content type, GPS, IPR, duration, quality, …)
       Groups associations (multi language)
       Taxonomy associations (multi language)
       Comments & multi language tags
       FullText of the textual digital resources
Indexing
Media Type                                      DC (ML)  Tech  Perf. Arts  Full Text  Taxonomy, Group (ML)  Comments, Tags (ML)  Votes
Audio/Video/Image                               Y        Y     Y           -          Y                     Y                    Y
Document (pdf, doc, …)                          Y        Y     Y           Y          Y                     Y                    Y
CrossMedia (html, MPEG21, …)                    Y        Y     Y           Y          Y                     Y                    Y
Aggregations (playlist, collection, …)          Y        Y     Y           -          Y                     Y                    Y
Info text (blog, web pages, forum, events, …)   (Y)      -     -           Y          -                     Y                    -
Indexing
   Multilingual fields
       title_en, title_it, title_de, title_fr, title_ca, …
   Catch-all fields
                     Component fields                Boost Weight
    text             pdf_*, doc_*, ppt_*, htm_*, …   1.0
    body             body_*                          0.5
    title            title_*                         3.1
    description      description_*                   2.0
    contributor      contributor_*                   0.8
    subject          subject_*                       1.5
    taxonomy         taxonomy_*                      0.8
    PerformingArts   PerformingArtsMetadata.#        1.0
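
One way the boost weights above could reach the search engine is through the `qf` (query fields) parameter of Solr's Extended DisMax parser, which accepts `field^boost` pairs. The sketch below only builds that string. Whether ECLAP applies its boosts at query time in this way is an assumption, and `build_qf` is a hypothetical helper, not part of Solr.

```python
# Boost weights copied from the catch-all fields table.
CATCH_ALL_BOOSTS = {
    "text": 1.0,
    "body": 0.5,
    "title": 3.1,
    "description": 2.0,
    "contributor": 0.8,
    "subject": 1.5,
    "taxonomy": 0.8,
    "PerformingArts": 1.0,
}


def build_qf(boosts):
    """Return a `qf` value such as 'title^3.1 body^0.5' (hypothetical helper)."""
    return " ".join(f"{field}^{weight}" for field, weight in boosts.items())


qf = build_qf(CATCH_ALL_BOOSTS)
print(qf)
```

A match in `title` then weighs roughly six times as much as the same match in `body`.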
Indexing
   Re-indexing
       If the indexing schema changes or the index is
        corrupted, the search system should not be
        blocked
       Re-indexing is done on a separate indexing
        machine while the production system keeps using
        the current index
       During re-indexing, newly uploaded/modified
        content is marked to be re-indexed when the
        new index is put in production
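
The re-indexing procedure above can be sketched as a toy model. Plain dictionaries stand in for the real indexes, and `Portal`, `analyze_v2` and the method names are hypothetical; the actual ECLAP implementation is not shown in the slides.

```python
def analyze_v2(text):
    """Stand-in for indexing a document under a new schema."""
    return text.upper()


class Portal:
    def __init__(self, contents):
        self.contents = dict(contents)    # source of truth (e.g. the database)
        self.live_index = dict(contents)  # index used by production search
        self.rebuilding = False
        self.to_reindex = set()           # content touched during a rebuild

    def save(self, doc_id, text):
        self.contents[doc_id] = text
        self.live_index[doc_id] = text    # production search stays current
        if self.rebuilding:
            self.to_reindex.add(doc_id)   # mark for the new index

    def start_rebuild(self):
        self.rebuilding = True
        return dict(self.contents)        # snapshot for the indexing machine

    def finish_rebuild(self, new_index, index_fn):
        for doc_id in self.to_reindex:    # catch up on marked content
            new_index[doc_id] = index_fn(self.contents[doc_id])
        self.live_index = new_index       # put the new index in production
        self.to_reindex.clear()
        self.rebuilding = False


portal = Portal({"doc1": "hamlet"})
snapshot = portal.start_rebuild()
new_index = {doc_id: analyze_v2(text) for doc_id, text in snapshot.items()}
portal.save("doc2", "macbeth")            # uploaded while the rebuild runs
during_rebuild = dict(portal.live_index)  # search was never blocked
portal.finish_rebuild(new_index, analyze_v2)
print(portal.live_index)
```

The design point is that the swap is a single assignment, so queries are served by either the old or the new index and never by a half-built one.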
Searching
   Full text search
       Uses the catch-all fields to search for
        keywords in the most important fields in all
        languages (title, description, text, body,
        subject, …)
   Fuzzy search
       Allows matching mistyped words
   Deep search
       Allows searching for partial words
   Relevance & boosting of terms
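
Fuzzy matching is commonly based on edit distance: a term matches when it can be turned into the query with a small number of character edits. The sketch below uses plain Levenshtein distance and a substring test for partial words. It illustrates the idea only and is not the Solr implementation; the vocabulary and the `max_edits=2` default are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(query, vocabulary, max_edits=2):
    """Terms within `max_edits` edits of the (possibly misspelled) query."""
    return [term for term in vocabulary if edit_distance(query, term) <= max_edits]


def partial_match(fragment, vocabulary):
    """Terms that contain the fragment anywhere (partial word search)."""
    return [term for term in vocabulary if fragment in term]


terms = ["shakespeare", "theatre", "dance"]
fuzzy_hits = fuzzy_match("shespeare", terms)
partial_hits = partial_match("speare", terms)
print(fuzzy_hits, partial_hits)
```

The misspelling "shespeare" from the requirements slide is two deletions away from "shakespeare", so it is found with a budget of two edits.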
Searching
   Faceted search
Searching
   Advanced search
Search Facility Assessment
   Analysis performed over 3 months
   11294 visits (6032 unique visits)
   62768 page views (avg 5.76 pages per visit)
   7.29 minutes spent on the portal
   30502 content accesses (view, play and
    download)
Search Facility Assessment
Users                     # Full Text Query   # Faceted Query   # Last Posted List   # Featured List   # Popular List
simple registered         323                 24                4                    22                17
partners                  1094                21                27                   19                9
anonymous                 2634                147               234                  302               213
Total                     4051                192               265                  343               239
Clicks after query/list   1564                200               318                  2799              231
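
A quick way to read the table is to divide the clicks by the number of queries or list views in each column. The sketch below does that arithmetic on the totals reported above; the dictionary keys are just labels chosen for this example.

```python
# Totals and follow-up clicks taken from the assessment table.
requests = {"full text": 4051, "faceted": 192, "last posted": 265,
            "featured": 343, "popular": 239}
clicks = {"full text": 1564, "faceted": 200, "last posted": 318,
          "featured": 2799, "popular": 231}

clicks_per_request = {name: clicks[name] / requests[name] for name in requests}

for name, ratio in clicks_per_request.items():
    print(f"{name:12s} {ratio:.2f}")
```

On these figures a full text query is followed by about 0.39 clicks on average, while a view of the featured list is followed by about 8.16.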
Search Facility Assessment
   Click order distribution
   [Chart: click order distribution; label shown: First page]
Conclusions
   Solution allows indexing multilingual
    metadata and texts
   Searching & filtering results
   Search facility assessment shows that
    search is an actively used feature
Context & Assessment
   Context
       Social Network
            User and content items
       Content distribution portal
            Video on demand portal
       Archive, digital library, Performing Arts
            http://www.eclap.eu
   Assessment
       User behavior
            Log user actions on the Web portal
       User happiness
            Measure the level of user satisfaction about the exposed
             services
Logging User Profile
   User Profile
       Registered or anonymous, uid (user id)
       Timestamp YY-mm-dd hh:mm:ss
       IP address, Proxy type etc.
       Platform (OS, Browser)
       GeoIP data (Country, Region, City)
       Friends, connections
          Betweenness, Eccentricity
          Joined groups

          User preferred contents
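
Betweenness and eccentricity are standard graph measures on the friendship network. As a minimal sketch, the code below computes both for a small undirected graph using breadth-first search (Brandes' algorithm for betweenness). The user names and the graph are invented, and the slides do not say how ECLAP computes these values.

```python
from collections import deque


def eccentricity(graph, source):
    """Largest shortest-path distance from `source` to any reachable user."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return max(dist.values())


def betweenness(graph):
    """Unnormalised betweenness centrality for an undirected, unweighted graph."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        order = []                               # nodes in BFS visiting order
        preds = {node: [] for node in graph}     # predecessors on shortest paths
        paths = {node: 0 for node in graph}      # number of shortest paths
        paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):                # accumulate from the far end
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical friendship chain: anna - bruno - carla - dario
friends = {
    "anna": ["bruno"],
    "bruno": ["anna", "carla"],
    "carla": ["bruno", "dario"],
    "dario": ["carla"],
}
ecc = {user: eccentricity(friends, user) for user in friends}
btw = betweenness(friends)
print(ecc, btw)
```

In this chain the two middle users have the lower eccentricity and all of the betweenness, which is the usual sign of a broker in the network.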
Understanding User behavior
   Online survey
       A simple module, on the right side of the portal
       Presenting 3 - 4 questions per topic (depending on the
        current portal section visited)
   Stat Drupal Modules
       Custom implemented modules
       Log User Activity
       Keep track and depict main figures about portal activity
       Can be filtered by date, user, type of content, group,
        type of activity (content enrichment, social promotion,
        networking etc.)
   Google Analytics
Understanding User behavior
   Top Metrics
      Avg # Visits/User
      Avg # Queries/User
      Avg # Clicks/User
      Avg Visit duration
      Avg Query length
      Query refinement rate
      Next Page Click Rate
      Back Page Click Rate
      Frequency of searching (once/day, week etc.)
      Success of searching (assessment...)
      …
Logging User Behavior
   Logging user activities on the portal
      Downloads/Views

      Queries

      Anonymous/Registered portal accesses
       (login/logout)
      Adding/Updating/Deleting digital contents

      Menu clicks

      Content Upload

      Content Management

      Social Promotion & Networking
Logging User Behavior
   Content Accesses (Download/View)
       Axmedis Content
          Pdf, Document, Video, Playlist, Slide, Flash, Image,
           Excel, Archive, Audio, Tool, Collection
       Drupal Content
          Page, Blog, Event, Forum, Group, Comment

   Distribution of Content Access per
       Access Type, Portal, Platform, Section, Locale,
        Country, Region, City, Axoid, Nid, Content Type,
        Partner, User, Timestamp
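
A distribution "per dimension" amounts to counting log records grouped by one attribute. A minimal sketch with invented records follows; the key names mirror some of the dimensions listed above but are otherwise hypothetical.

```python
from collections import Counter

# Hypothetical content-access records
accesses = [
    {"access_type": "view", "content_type": "Video", "country": "IT"},
    {"access_type": "download", "content_type": "Pdf", "country": "IT"},
    {"access_type": "view", "content_type": "Video", "country": "FR"},
]


def distribution(records, dimension):
    """Count records per value of the chosen dimension."""
    return Counter(record[dimension] for record in records)


by_type = distribution(accesses, "content_type")
by_country = distribution(accesses, "country")
print(by_type.most_common(), by_country.most_common())
```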
Logging User Behavior
   Queries (Simple, Faceted, Advanced)
       Distribution of Queries per
            User, Content type, Device, IP, User Agent, Query Type,
             Country, Region, City, Locale, Filter (faceted)
   Query Cloud
   Keyword Cloud
   IPR Wizard
       Definition and usage of IPR Models
   Metadata Editor
       Access and usage
            Add, Edit metadata
   Video Annotations
       Personal content
       Other users content
Logging User Behavior
   Social Promotion & Networking
       Analysis of
            Eccentricity
            Betweenness
            Connections
       Creation, Access of Public/Private Web Pages
       Activity on Forums, Blogs, Groups or between users
            New Contents
            Comments to Objects/Web Pages
            Invited People
            Featured Objects
            Recommendations, suggested content
            Export/Import of links to/from other SN
            Private Messages
Logging User Behavior
   Menu Clicks
       Distribution of clicks per
            User, IP, Locale, Timestamp etc.
       LAST POSTED, FEATURED, CALENDAR, ADVANCED SEARCH,
        UPLOAD AND INGEST, POPULAR, MY CONTENT, MY GROUPS ,
        MY COLLEAGUES, GET AFFILIATED, TERMS OF USE, PRIVACY
        POLICY, TOP RATED, COURSES, LESS POPULAR, UPLOAD NEW
        CONTENT, etc.
   Ranking/Voting
       # of ranked items
       Distribution per
            User, IP, Locale, Timestamp etc.
   QR Code
       Access from Mobile Devices
   Workflow
       Distribution of Workflow Type
   Content Upload
       Distribution of uploads per
            User, Partner, Timestamp
Content Access
                      September 1st – November 30th 2011

Affiliation                        # View/Play   # Download
DSI                                46            0
Not partners/Affiliated            1292          14
Partners/Affiliated (except DSI)   6712          119
Public Users                       21418         947

Affiliation                        # View/Play   # Download
DSI                                3             0
Not partners/Affiliated            100           4
Partners/Affiliated (except DSI)   218           11
Public Users                       2225          869
Menu Clicks
                 September 1st – November 30th 2011

Menu                          # Clicks
ABOUT->ECLAP DESCRIPTION      671
EVENTS->PAST AND FUTURE       536
SEARCH->GROUPS                524
ABOUT->ECLAP NEWS BLOG        463
CONTENT->LAST POSTED          265
CONTENT->FEATURED             343
HOWTO->UPLOAD AND INGEST      330
SEARCH->ADVANCED SEARCH       314
EVENTS->CALENDAR              298
ABOUT->ECLAP PARTNERS         269
ABOUT->MAIN CONTACT           249
CONTENT->POPULAR              239
Search
                     September 1st – November 30th 2011

Affiliation                        # Simple Queries   # Faceted Queries
DSI                                13                 0
Not partners/Affiliated            323                24
Partners/Affiliated (except DSI)   1094               21
Public Users                       2634               147

Affiliation                        # Advanced Queries
DSI                                0
Not partners/Affiliated            18
Partners/Affiliated (except DSI)   4
Drupal Stat Metrics
                September 1st – November 30th 2011

   Content Access per nid
Drupal Stat Metrics
              September 1st – November 30th 2011

   Views by Query
Drupal Stat Metrics
               September 1st – November 30th 2011

   Content Access per Platform
Understanding User behavior
   Drupal Stats (collapsible menus on the right)
Google Analytics vs Drupal Stats
    Service            Pros                        Cons
    Google Analytics   Traffic source data         IP approach: each IP is considered a unique visitor
                       Bounce rate                 Can’t deal with specific actions on the portal
                       Recency (since when)         (e.g. downloads, queries)
                       Loyalty (how often)
                       Session times
    Drupal Stats       Identity approach           Can’t deal with traffic source data and bounce rate
                       Actions: download,          Session time is only a rough approximation
                        user access, queries
                       Content type filtering
Sorting Results
   Sorting by
       Upload Time (date the document was first uploaded)
       Update Time (date the document was last updated)
       Score (relevance of the document to the search query)
   Combined with faceting and paging
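
Sorting, faceting and paging map onto standard Solr request parameters (`sort`, `facet`, `facet.field`, `fq`, `start`, `rows`). The sketch below builds such a query string. The field names `upload_time` and `content_type` and the helper `search_params` are hypothetical; the real ECLAP schema names are not given in the slides.

```python
from urllib.parse import parse_qs, urlencode


def search_params(keywords, sort_field="score", descending=True,
                  page=0, page_size=10, filters=None):
    """Build a Solr-style query string (hypothetical helper and field names)."""
    params = [
        ("q", keywords),
        ("sort", f"{sort_field} {'desc' if descending else 'asc'}"),
        ("start", page * page_size),      # offset of the first result
        ("rows", page_size),              # results per page
        ("facet", "true"),
        ("facet.field", "content_type"),
    ]
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))  # one filter query per facet
    return urlencode(params)


query_string = search_params("hamlet", sort_field="upload_time", page=2,
                             filters={"content_type": "Video"})
decoded = parse_qs(query_string)
print(query_string)
```

Keeping each selected facet in its own `fq` lets the relevance score depend only on the keywords in `q`.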
Suggestions
   Real time: while the user types a query, the
    system suggests similar searches
       ecl…
           eclap
           eclap-de-2-1-1-user
           eclap-de-2-2-1-usergroup
           …
ECLAP Survey
Indexing/Searching Reqs
   Enriching search experience
       Results Sorting
       Suggestions
   Large # of contents (~ 10^4 to 10^6)
       External Indexing Service
   Hidden/Private contents management
   Monitoring Exceptions
       Email notifications
   Search Engine Friendly (Google, Bing, Yahoo etc.)
       content site crawling -> HTML dumping
External Indexing Service 1/3
   Set up an external service to avoid server
    overloading when building the index
       Taxonomization
       Indexing (with exceptions monitoring)
       Index Synchronization
       Old Index replacement with new one
       Index updating
       Old contents cleaning (optional)
External Indexing Service 2/3
   Taxonomization
       Has a cost -> pre-computing
       Digital content
       Execution Rule (JS)
       Indexed with object records

   Taxonomy          Parent
   Performing Arts   -
   Cinema            Performing Arts
   Music             Performing Arts
   Documentary       Cinema
   Historical        Cinema
   Classical         Music
   Pop               Music

   Taxonomy tree
       Performing Arts
            Cinema
                 Documentary
                 Historical
            Music
                 Classical
                 Pop

   Object Taxonomy
       Performing Arts
            Cinema
                 Documentary
            Music
                 Classical
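
The pre-computation on this slide can be read as expanding an object's taxonomy terms with all their ancestors before indexing, so that a search on a parent term also finds the object. The sketch below does this with the parent table from the slide; the function name `with_ancestors` is hypothetical.

```python
# Parent of each taxonomy term, as listed in the Taxonomy/Parent table.
PARENT = {
    "Performing Arts": None,
    "Cinema": "Performing Arts",
    "Music": "Performing Arts",
    "Documentary": "Cinema",
    "Historical": "Cinema",
    "Classical": "Music",
    "Pop": "Music",
}


def with_ancestors(terms, parent=PARENT):
    """Return the given terms plus every ancestor up to the root."""
    expanded = set()
    for term in terms:
        while term is not None and term not in expanded:
            expanded.add(term)
            term = parent[term]
    return expanded


object_terms = with_ancestors(["Documentary", "Classical"])
print(sorted(object_terms))
```

An object tagged Documentary and Classical ends up indexed with the five terms shown under "Object Taxonomy".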
External Indexing Service 3/3
   Indexing with exceptions monitoring
       Real-time notification system
       Event time and type (add, update)
       Full stacktrace info
       Customizable recipients
       Object Indexing Recovery
            Resource Parse Error -> Metadata Indexing
   Index synchronization
       During external indexing, contents may be
        updated/added/deleted on the original index
       Need to update these contents on the index (state flag)

            Indexed   External Indexed
            1         1
            0         1
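
The exception monitoring and recovery bullets can be illustrated as follows: when parsing a resource fails, a notification carrying the event time, type and full stack trace is prepared, and the object is still indexed by its metadata alone. This reading of "Resource Parse Error -> Metadata Indexing" is an assumption, and every function name and address in the sketch is hypothetical. The message is built but not sent.

```python
import traceback
from datetime import datetime, timezone
from email.message import EmailMessage


def build_notification(event_type, object_id, error, recipients):
    """Prepare an email describing an indexing failure."""
    message = EmailMessage()
    message["Subject"] = f"Indexing error on {event_type} of {object_id}"
    message["To"] = ", ".join(recipients)
    stack = "".join(
        traceback.format_exception(type(error), error, error.__traceback__))
    timestamp = datetime.now(timezone.utc).isoformat()
    message.set_content(
        f"Time: {timestamp}\nEvent: {event_type}\n\n{stack}")
    return message


def index_object(object_id, parse_resource, index_metadata, recipients):
    """Index the full resource, falling back to metadata only on parse errors."""
    try:
        parse_resource(object_id)
        return "full", None
    except Exception as error:
        note = build_notification("add", object_id, error, recipients)
        index_metadata(object_id)   # recovery: the object stays findable
        return "metadata-only", note


def failing_parser(object_id):
    raise ValueError("corrupt pdf")


metadata_indexed = []
status, note = index_object("obj-1", failing_parser,
                            metadata_indexed.append, ["admin@example.org"])
print(status)
print(note["Subject"])
```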
Search Engine Friendly
   HTML dump service
      Java external service

      Periodically invoked by an AXCP rule

      Full metadata exporting

      Thumbnail

      Resource link

      Multilanguage

      Paginated results
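
As a rough illustration of the dump idea, the sketch below renders metadata records as paginated, crawlable HTML lists with a link and a language attribute per item. The actual service is a Java program whose output format is not shown in the slides, so the record fields, URLs and markup here are all assumptions.

```python
from html import escape


def dump_pages(records, page_size=2):
    """Render records as a list of HTML fragments, `page_size` items per page."""
    pages = []
    for start in range(0, len(records), page_size):
        items = []
        for record in records[start:start + page_size]:
            items.append(
                f'<li lang="{escape(record["lang"])}">'
                f'<a href="{escape(record["url"])}">{escape(record["title"])}</a>'
                f'</li>'
            )
        pages.append("<ul>\n" + "\n".join(items) + "\n</ul>")
    return pages


# Hypothetical records; the URLs are placeholders.
records = [
    {"title": "Hamlet & Ophelia", "lang": "en", "url": "https://example.org/content/1"},
    {"title": "Mistero buffo", "lang": "it", "url": "https://example.org/content/2"},
    {"title": "La danse", "lang": "fr", "url": "https://example.org/content/3"},
]
pages = dump_pages(records)
print(pages[0])
```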
Conclusions
   Drupal integrated solution for user behavior tracking
    and analysis
       Logging
       Stat Data Graph
       Online Survey
   External Indexing Service
       Avoids server overloading
       HA of query service
       Error recovery
       Detailed event notification system
       Index Optimization
   Dumping tool for portal contents (SEO)
       Full metadata HTML exporting
       Scheduled Service
Future Work
   Keep collecting Data
   Deeper Data Analysis
       User Sessions
          1st, 2nd, ..., nth click -> average user behavior
       Depict a modular view of the system usage
          Popularity/Usability for each feature &
           functionality
       Social Network Analysis (SNA)
          Huge population
                 User relationships, connections, friendships
References

   P. Bellini, I. Bruno, D. Cenni, P. Nesi, "Micro grids for
    scalable media computing and intelligence on
    distributed scenarios", IEEE Multimedia, 2011
   P. Bellini, I. Bruno, D. Cenni, P. Nesi, M. Paolucci, M.
    Serena, "Semantic Model for Cultural Heritage Social
    Network and Cross Media Content for Multiple
    Devices", Conference of the Italian Association of
    Artificial Intelligence, Workshop for Cultural Heritage,
    15-17 September 2011, Palermo, Italy
Q&A
APPENDIX
Architecture (former)
   [Figure: former architecture. Components shown: Index Rebuilder,
    Indexing Rule JS, Rule JS, SolrJ Client, Grid Node, Rule Scheduler,
    AXCP, Solr Cell, XML/HTTP, JSP, Indexing Service, Indexing Module,
    Searching Module, Apache Solr, Apache Tomcat, Drupal, Apache HTTP.]
Drupal
What is it?
Open source content management platform

Developed by Dries Buytaert in 2001

Written in PHP

Users: The Economist, Examiner.com, The
White House, data.gov.uk
Runs on a WEB server (e.g. Apache, IIS) and
a database (e.g. MySQL, PostgreSQL)
Apache Lucene
What is it?
High-performance, full-featured text
search engine library (indexing and
searching documents)
Developed by Doug Cutting (2000, SourceForge);
joined the Apache Software Foundation in 2001
Written entirely in Java
Users: Wikipedia, Technorati, Nabble,
TheServerSide, Akamai, SourceForge
Apache Lucene
Features
Ranked searching (best results returned first)
Powerful query types: phrase queries, wildcard
queries, proximity queries, range queries and more
Fielded searching (e.g., title, author, contents)
Date-range searching
Sorting by any field
Multiple-index searching with merged results
Allows simultaneous update and searching
Apache Lucene
Features
Documents added via IndexWriter

Document = a collection of fields

No config files, dynamic field typing

Flexible text analysis tokenizers, filters

Search for documents via IndexSearcher
     Hits = search(Query,Filter,Sort,topN)

Scoring: tf * idf * lengthNorm
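
The scoring product above can be worked through numerically. The sketch below uses the factor definitions of Lucene's classic TF-IDF similarity as an assumption (square root of the term frequency, a logarithmic inverse document frequency, and one over the square root of the field length). The full Lucene formula contains further factors that are left out here, and the input numbers are invented.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of the occurrences."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Length normalisation: matches in shorter fields weigh more."""
    return 1.0 / math.sqrt(num_terms)


def score(freq, num_docs, doc_freq, num_terms):
    return tf(freq) * idf(num_docs, doc_freq) * length_norm(num_terms)


# Term occurs 4 times in a 100-term field; 9 of 1000 documents contain it.
long_field = score(freq=4, num_docs=1000, doc_freq=9, num_terms=100)
# Same term statistics, but the field has only 25 terms.
short_field = score(freq=4, num_docs=1000, doc_freq=9, num_terms=25)
print(round(long_field, 3), round(short_field, 3))
```

With everything else equal, the shorter field scores twice as high, which is one reason a match in a title tends to outrank the same match in a long body.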
Apache Solr
What is it?
A full text search server based on
Lucene (Lucene sub-project)
Developed by Yonik Seeley at CNET
Networks (2004), donated to the Apache
Software Foundation (2006)
Written in Java, deployable as a WAR
Users: CNET Reviews, CNET Channel,
shopper.com, news.com, nines.org,
krugle.com, oodle.com, booklooker.de
Apache Solr
Features
Advanced Full-Text Search Capabilities
Optimized for High Volume Web Traffic
Standards Based Open Interfaces (XML, JSON,
HTTP)
Web Administration Interface
Server statistics exposed over JMX for
monitoring
Scalability, efficient Replication to other Solr
Search Servers
Flexible and Adaptable with XML configuration
Extensible Plugin Architecture

 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Indexing and Searching Cross Media Content in a Social Network

  • 1. Indexing and Searching Cross Media Content in a Social Network Pierfrancesco Bellini, Daniele Cenni, Paolo Nesi University of Florence Department of Systems and Informatics Distributed Systems and Internet Technology Laboratory ECLAP Conference, May 7-9, 2012
  • 2. ECLAP Social Network
      - ECLAP is a Digital Library on Performing Arts connected with Europeana
      - ECLAP is a Social Network (blogs, forums, comments, tagging, voting, …)
  • 3. Goals/Requirements
      - Develop an Indexing/Searching solution for the ECLAP Social Network allowing:
          - Indexing multilingual crossmedia content metadata and data (e.g. documents)
          - Indexing portal blogs, forums, events, group pages, comments, etc.
          - Efficient multilingual search (keyword search and advanced search) supporting:
              - misspelled words (e.g. shespeare)
              - partial word search
          - Sorting and filtering search results
          - Re-indexing all data without blocking the system
          - Logging and monitoring user activity
          - …
      - Evaluate the Indexing/Searching service
  • 4. ECLAP Data Model
      [Class diagram. Entities: Group/Channel, TaxonomyTerm, Content, Comment, Metadata (Performing Arts, Dublin Core, Technical), Blog, WebPage, Forum, Object, Playlist, Document, Collection, Annotation, AVObject, Image, Video, Audio. Associations carry multiplicities such as 0..n, 1, 1..n and 1..2.]
  • 5. Indexing
      - Indexing & Search system
          - Based on Apache Solr
      - Multilingual aspects
          - Translate the metadata or translate the query?
          - We use metadata translation
      - Indexing schema
          - Dublin Core + DCTerms (multi language)
          - Performing Arts
          - Technical (provider, content type, GPS, IPR, duration, quality, …)
          - Groups associations (multi language)
          - Taxonomy associations (multi language)
          - Comments & multi language tags
          - FullText of the textual digital resources
  • 6. Indexing
      - Indexed information per media type (ML = multilingual)
      - Columns: DC (ML) | Tech | Perf. Arts | Full Text | Taxonomy, Group (ML) | Comment, Tags (ML) | Votes
      - Audio/Video/Image: Y Y Y Y Y Y
      - Document (pdf, doc, …): Y Y Y Y Y Y Y
      - CrossMedia (html, MPEG21, …): Y Y Y Y Y Y Y
      - Aggregations (playlist, collection, …): Y Y Y Y Y Y
      - Info text (blog, web pages, forum, events, …): (Y) Y Y
  • 7. Indexing
      - Multilingual fields
          - title_en, title_it, title_de, title_fr, title_ca, …
      - Catch-all fields

        Catch-all field   Component fields                 Boost weight
        text              pdf_*, doc_*, ppt_*, htm_*, …    1.0
        body              body_*                           0.5
        title             title_*                          3.1
        description       description_*                    2.0
        contributor       contributor_*                    0.8
        subject           subject_*                        1.5
        taxonomy          taxonomy_*                       0.8
        PerformingArts    PerformingArtsMetadata.#         1.0
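
The boost weights above could be applied at query time. The following is a minimal sketch, assuming Solr's extended DisMax parser and its `qf` parameter (which accepts `field^boost` pairs); the slides do not say whether ECLAP applies these weights at query time or at index time, and the function name `build_query_params` is hypothetical.

```python
from urllib.parse import urlencode

# Boost weights copied from the catch-all field table on this slide.
CATCH_ALL_BOOSTS = {
    "text": 1.0,
    "body": 0.5,
    "title": 3.1,
    "description": 2.0,
    "contributor": 0.8,
    "subject": 1.5,
    "taxonomy": 0.8,
    "PerformingArts": 1.0,
}


def build_query_params(keywords, boosts):
    """Build Solr request parameters that search all catch-all fields.

    Assumption: query-time boosting through the edismax `qf` parameter.
    """
    qf = " ".join(f"{field}^{weight}" for field, weight in boosts.items())
    return {"q": keywords, "defType": "edismax", "qf": qf}


params = build_query_params("shakespeare", CATCH_ALL_BOOSTS)
query_string = urlencode(params)  # ready to append to a /select URL
```

With these parameters a match in `title` weighs roughly six times more than the same match in `body` (3.1 versus 0.5), before the other scoring factors are applied.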
  • 8. Indexing
      - Re-indexing
          - In case of a new indexing schema or index corruption, the search system should not be blocked
          - Re-indexing is done on a separate indexing machine while the production system keeps using the current index
          - During re-indexing, newly uploaded or modified content is marked so that it is re-indexed when the new index is put in production
  • 9. Searching
      - Full text search
          - Uses the catch-all fields to search for keywords in the most important fields in all languages (title, description, text, body, subject, …)
      - Fuzzy search
          - Allows matching mistyped words
      - Deep search
          - Allows searching for partial words
      - Relevance & boosting of terms
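
Fuzzy matching of mistyped words is usually based on edit distance; in Lucene query syntax a trailing `~` on a term requests such a match. The sketch below illustrates the idea on the deck's own example ("shespeare"); it is not Lucene's implementation, and the vocabulary and the `max_edits` threshold are assumptions chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query, vocabulary, max_edits=2):
    """Return the indexed terms within max_edits of the (possibly mistyped) query."""
    return [term for term in vocabulary if edit_distance(query, term) <= max_edits]


# Hypothetical index vocabulary.
vocabulary = ["shakespeare", "theatre", "opera", "dance"]
matches = fuzzy_match("shespeare", vocabulary)
```

Here "shespeare" is two insertions ("a" and "k") away from "shakespeare", so it matches with a threshold of 2.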
  • 10. Searching
      - Faceted search
  • 11. Searching
      - Advanced search
  • 12. Search Facility Assessment
      - Analysis performed over 3 months
      - 11294 visits (6032 unique visits)
      - 62768 page views (avg 5.76 pages per visit)
      - 7.29 minutes spent on the portal
      - 30502 content accesses (view, play and download)
  • 13. Search Facility Assessment

        Users                     # Full Text Query   # Faceted Query   # Last Posted List   # Featured List   # Popular List
        simple registered         323                 24                4                    22                17
        partners                  1094                21                27                   19                9
        anonymous                 2634                147               234                  302               213
        Total                     4051                192               265                  343               239
        Clicks after query/list   1564                200               318                  2799              231
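
One way to read the table above is to relate the clicks to the number of queries or list requests that preceded them. The sketch below only re-uses the figures from the slide; the ratio "clicks per request" is an illustrative metric, not one the authors report.

```python
COLUMNS = ["Full Text Query", "Faceted Query", "Last Posted List",
           "Featured List", "Popular List"]

# Rows copied from the slide.
requests_by_user = {
    "simple registered": [323, 24, 4, 22, 17],
    "partners": [1094, 21, 27, 19, 9],
    "anonymous": [2634, 147, 234, 302, 213],
}
reported_totals = [4051, 192, 265, 343, 239]
clicks_after = [1564, 200, 318, 2799, 231]

# Re-compute the column totals to check them against the reported "Total" row.
computed_totals = [sum(column) for column in zip(*requests_by_user.values())]

# Illustrative ratio: clicks that followed each query or list request.
clicks_per_request = {
    name: round(clicks / total, 2)
    for name, clicks, total in zip(COLUMNS, clicks_after, reported_totals)
}

for name, ratio in clicks_per_request.items():
    print(f"{name}: {ratio} clicks per request")
```

The computed totals agree with the "Total" row, and the full-text ratio comes out below one click per query while the featured list draws several clicks per request.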
  • 14. Search Facility Assessment
      - Click order distribution (first page)
  • 15. Conclusions
      - The solution allows indexing multilingual metadata and texts
      - Searching & filtering results
      - The search facility assessment shows that the search feature is actually used
  • 16. Context & Assessment
      - Context
          - Social Network
              - User and content items
          - Content distribution portal
          - Video on demand portal
          - Archive, digital library, Performing Arts
          - http://www.eclap.eu
      - Assessment
          - User behavior
              - Log user actions on the Web portal
          - User happiness
              - Measure the level of user satisfaction about the exposed services
  • 17. Logging User Profile
      - User Profile
          - Registered or anonymous, uid (user id)
          - Timestamp YY-mm-dd hh:mm:ss
          - IP address, Proxy type etc.
          - Platform (OS, Browser)
          - GeoIP data (Country, Region, City)
          - Friends, connections
          - Betweenness, Eccentricity
          - Joined groups
          - User preferred contents
  • 18. Understanding User behavior
      - Online survey
          - A simple module, on the right side of the portal
          - Presenting 3 - 4 questions per topic (depending on the portal section currently visited)
      - Stat Drupal Modules
          - Custom implemented modules
          - Log User Activity
          - Keep track of and depict the main figures about portal activity
          - Can be filtered by date, user, type of content, group, type of activity (content enrichment, social promotion, networking etc.)
      - Google Analytics
  • 19. Understanding User behavior
      - Top Metrics
          - Avg # Visits/User
          - Avg # Queries/User
          - Avg # Clicks/User
          - Avg Visit duration
          - Avg Query length
          - Query refinement rate
          - Next Page Click Rate
          - Back Page Click Rate
          - Frequency of searching (once/day, week etc.)
          - Success of searching (assessment...)
          - …
  • 20. Logging User Behavior
      - Logging user activities on the portal
          - Downloads/Views
          - Queries
          - Anonymous/Registered portal accesses (login/logout)
          - Adding/Updating/Deleting digital contents
          - Menu clicks
          - Content Upload
          - Content Management
          - Social Promotion & Networking
  • 21. Logging User Behavior
      - Content Accesses (Download/View)
          - Axmedis Content
              - Pdf, Document, Video, Playlist, Slide, Flash, Image, Excel, Archive, Audio, Tool, Collection
          - Drupal Content
              - Page, Blog, Event, Forum, Group, Comment
          - Distribution of Content Access per
              - Access Type, Portal, Platform, Section, Locale, Country, Region, City, Axoid, Nid, Content Type, Partner, User, Timestamp
  • 22. Logging User Behavior
      - Queries (Simple, Faceted, Advanced)
          - Distribution of Queries per
              - User, Content type, Device, IP, User Agent, Query Type, Country, Region, City, Locale, Filter (faceted)
          - Query Cloud
          - Keyword Cloud
      - IPR Wizard
          - Definition and usage of IPR Models
      - Metadata Editor
          - Access and usage
          - Add, Edit metadata
      - Video Annotations
          - Personal content
          - Other users' content
  • 23. Logging User Behavior
      - Social Promotion & Networking
          - Analysis of
              - Eccentricity
              - Betweenness
              - Connections
          - Creation, Access of Public/Private Web Pages
          - Activity on Forums, Blogs, Groups or between users
          - New Contents
          - Comments to Objects/Web Pages
          - Invited People
          - Featured Objects
          - Recommendations, suggested content
          - Export/Import of links to/from other SN
          - Private Messages
  • 24. Logging User Behavior
      - Menu Clicks
          - Distribution of clicks per
              - User, IP, Locale, Timestamp etc.
          - LAST POSTED, FEATURED, CALENDAR, ADVANCED SEARCH, UPLOAD AND INGEST, POPULAR, MY CONTENT, MY GROUPS, MY COLLEAGUES, GET AFFILIATED, TERMS OF USE, PRIVACY POLICY, TOP RATED, COURSES, LESS POPULAR, UPLOAD NEW CONTENT, etc.
      - Ranking/Voting
          - # of ranked items
          - Distribution per
              - User, IP, Locale, Timestamp etc.
      - QR Code
          - Access from Mobile Devices
      - Workflow
          - Distribution of Workflow Type
      - Content Upload
          - Distribution of uploads per
              - User, Partner, Timestamp
  • 25. Content Access, September 1st – November 30th 2011

        Affiliation                         # View/Play   # Download
        DSI                                 46            0
        Not partners/Affiliated             1292          14
        Partners/Affiliated (except DSI)    6712          119
        Public Users                        21418         947

        Affiliation                         # View/Play   # Download
        DSI                                 3             0
        Not partners/Affiliated             100           4
        Partners/Affiliated (except DSI)    218           11
        Public Users                        2225          869
  • 26. Menu Clicks, September 1st – November 30th 2011

        Menu                          # Clicks
        ABOUT->ECLAP DESCRIPTION      671
        EVENTS->PAST AND FUTURE       536
        SEARCH->GROUPS                524
        ABOUT->ECLAP NEWS BLOG        463
        CONTENT->LAST POSTED          265
        CONTENT->FEATURED             343
        HOWTO->UPLOAD AND INGEST      330
        SEARCH->ADVANCED SEARCH       314
        EVENTS->CALENDAR              298
        ABOUT->ECLAP PARTNERS         269
        ABOUT->MAIN CONTACT           249
        CONTENT->POPULAR              239
  • 27. Search, September 1st – November 30th 2011

        Affiliation                         # Simple Queries   # Faceted Queries
        DSI                                 13                 0
        Not partners/Affiliated             323                24
        Partners/Affiliated (except DSI)    1094               21
        Public Users                        2634               147

        Affiliation                         # Advanced Queries
        DSI                                 0
        Not partners/Affiliated             18
        Partners/Affiliated (except DSI)    4
  • 28. Drupal Stat Metrics, September 1st – November 30th 2011
      - Content Access per nid
  • 29. Drupal Stat Metrics, September 1st – November 30th 2011
      - Views by Query
  • 30. Drupal Stat Metrics, September 1st – November 30th 2011
      - Content Access per Platform
  • 31. Understanding User behavior
      - Drupal Stats (collapsible menus on the right)
  • 32. Google Analytics vs Drupal Stats
      - Google Analytics
          - Pros
              - Traffic source data
              - Bounce rate
              - Recency (since when)
              - Loyalty (how often)
              - Session times
          - Cons
              - IP approach: each IP is considered a unique visitor
              - Can't deal with specific actions on the portal (e.g. downloads, queries)
      - Drupal Stats
          - Pros
              - Identity approach
              - Actions: Download, User Access, Queries
              - Content type filtering
          - Cons
              - Can't deal with traffic source data and bounce rate
              - Session time is only a rough approximation
  • 33. Sorting Results
      - Sorting by
          - Upload Time (date the document was first uploaded)
          - Update Time (date the document was last updated)
          - Score (document relevance to the search query)
      - Combined with faceting and paging
  • 34. Suggestions
      - In real time, while the user types a query, the system suggests similar searches
          - ecl…
              - eclap
              - eclap-de-2-1-1-user
              - eclap-de-2-2-1-usergroup
              - …
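
Type-ahead suggestions of this kind can be served from a sorted list of previously seen queries or indexed terms. The sketch below is one simple way to do it with a binary search on the prefix; it is not the mechanism used by ECLAP (which the slide does not describe), and the terms "europeana" and "dance" as well as the function name `suggest` are added for illustration.

```python
from bisect import bisect_left


def suggest(prefix, sorted_terms, limit=5):
    """Return up to `limit` terms starting with `prefix`, in sorted order."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    results = []
    for term in sorted_terms[start:]:
        if len(results) == limit or not term.startswith(prefix):
            break  # matching terms are contiguous in a sorted list
        results.append(term)
    return results


# The eclap* terms come from the slide; the other two are hypothetical.
terms = sorted([
    "eclap",
    "eclap-de-2-1-1-user",
    "eclap-de-2-2-1-usergroup",
    "europeana",
    "dance",
])
suggestions = suggest("ecl", terms)
```

Because all terms sharing a prefix are adjacent in the sorted list, the lookup costs one binary search plus a scan over the matches only.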
  • 36. Indexing/Searching Reqs
      - Enriching search experience
          - Results Sorting
          - Suggestions
      - Large # of contents (~10^4 to 10^6)
      - External Indexing Service
      - Hidden/Private contents management
      - Monitoring Exceptions
          - Email notifications
      - Search Engine Friendly (Google, Bing, Yahoo etc.)
          - Content site crawling
          - HTML dumping
  • 37. External Indexing Service 1/3
      - Set up an external service to avoid overloading the server while building the index
          - Taxonomization
          - Indexing (with exception monitoring)
          - Index Synchronization
              - Replacement of the old index with the new one
              - Index updating
              - Cleaning of old contents (optional)
  • 38. External Indexing Service 2/3
      - Taxonomization
          - Has a cost, hence pre-computing
          - Digital content
          - Execution Rule (JS)
          - Indexed with object records

        Taxonomy           Parent
        Performing Arts    -
        Cinema             Performing Arts
        Music              Performing Arts
        Documentary        Cinema
        Historical         Cinema
        Classical          Music
        Pop                Music

      - Example object record, Taxonomy field: Performing Arts, Cinema, Music, Documentary, Classical
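
The pre-computation step can be pictured as expanding each term assigned to an object with all of its ancestors, so that a search on "Cinema" also finds a documentary. This is a sketch in Python rather than the JS rule used in the project; it assumes the example object is tagged with Documentary and Classical, which is the reading most consistent with the taxonomy field shown on the slide.

```python
# Parent table copied from the slide; None marks the root.
PARENT = {
    "Performing Arts": None,
    "Cinema": "Performing Arts",
    "Music": "Performing Arts",
    "Documentary": "Cinema",
    "Historical": "Cinema",
    "Classical": "Music",
    "Pop": "Music",
}


def expand_taxonomy(assigned_terms, parent):
    """Return the assigned terms together with all of their ancestors."""
    expanded = set()
    for term in assigned_terms:
        node = term
        # Walk up until the root, stopping early on already visited nodes.
        while node is not None and node not in expanded:
            expanded.add(node)
            node = parent[node]
    return expanded


# Assumed tagging of the example object.
taxonomy_field = expand_taxonomy(["Documentary", "Classical"], PARENT)
```

Storing the expanded set with the object record moves the tree walk to indexing time, so queries need no taxonomy traversal.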
  • 39. External Indexing Service 3/3
      - Indexing with exception monitoring
          - Real-time notification system
          - Event time and type (add, update)
          - Full stack trace info
          - Customizable recipients
          - Object Indexing Recovery
              - Resource parse error: metadata indexing
      - Index synchronization
          - During external indexing, contents may be updated/added/deleted on the original index
          - These contents need to be updated on the new index (state flag)
          - State flags (Indexed, External Indexed): 1 1; 0 1
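
The synchronization step can be sketched as follows: the new index is built from a snapshot, every change that reaches the production system in the meantime is flagged, and the flagged items are replayed on the new index before it replaces the old one. This is a simplified in-memory model with hypothetical names (`rebuild_with_sync`, `analyze`); the real service works on Solr indexes and per-content state flags whose exact semantics the slide does not spell out.

```python
def rebuild_with_sync(production_docs, changes, analyze):
    """Rebuild an index externally, then replay changes made meanwhile.

    production_docs: dict mapping doc id to text (mutated as changes arrive)
    changes: list of (action, doc_id, text) with action in
             {"add", "update", "delete"}, arriving during the rebuild
    analyze: function turning a text into its indexed form
    """
    # 1. External rebuild from a snapshot; production keeps serving queries.
    snapshot = dict(production_docs)
    new_index = {doc_id: analyze(text) for doc_id, text in snapshot.items()}

    # 2. Changes hit production during the rebuild and are flagged.
    flagged = {}
    for action, doc_id, text in changes:
        if action == "delete":
            production_docs.pop(doc_id, None)
        else:
            production_docs[doc_id] = text
        flagged[doc_id] = action  # the last action on a document wins

    # 3. Replay only the flagged documents on the new index.
    for doc_id, action in flagged.items():
        if action == "delete":
            new_index.pop(doc_id, None)
        else:
            new_index[doc_id] = analyze(production_docs[doc_id])
    return new_index


docs = {1: "Hamlet", 2: "Otello"}
changes = [("update", 2, "Otello 1887"), ("add", 3, "Tosca"), ("delete", 1, None)]
index = rebuild_with_sync(docs, changes, str.lower)
```

Only the flagged documents are processed in step 3, which keeps the switch to the new index short compared with a second full rebuild.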
  • 40. Search Engine Friendly
      - HTML dump service
          - Java external service
          - Periodically invoked by an AXCP rule
          - Full metadata exporting
          - Thumbnail
          - Resource link
          - Multilanguage
          - Paginated results
  • 41. Conclusions
      - Drupal integrated solution for user behavior tracking and analysis
          - Logging
          - Stat Data Graph
          - Online Survey
      - External Indexing Service
          - Avoids server overloading
          - HA of query service
          - Error recovery
          - Detailed event notification system
          - Index Optimization
      - Dumping tool for portal contents (SEO)
          - Full metadata HTML exporting
          - Scheduled Service
  • 42. Future Work
      - Keep collecting Data
      - Deeper Data Analysis
          - User Sessions
          - 1st, 2nd..., nth click average user behavior
          - Depict a modular view of the system usage
          - Popularity/Usability for each feature & functionality
      - Social Network Analysis (SNA)
          - Huge Population
          - User relationships, connections, friendships
  • 43. References
      - P. Bellini, I. Bruno, D. Cenni, P. Nesi, "Micro grids for scalable media computing and intelligence on distributed scenarios", IEEE Multimedia, 2011
      - P. Bellini, I. Bruno, D. Cenni, P. Nesi, M. Paolucci, M. Serena, "Semantic Model for Cultural Heritage Social Network and Cross Media Content for Multiple Devices", Conference of the Italian Association of Artificial Intelligence, Workshop for Cultural Heritage, 15-17 September 2011, Palermo, Italy
  • 44. Q&A
  • 46. Architecture (former)
      [Architecture diagram. Labels shown: Index Rebuilder, Indexing Rule JS, Rule JS, SolrJ Client, Grid Node, Rule Scheduler, AXCP, Solr XML/HTTP, JSP, Indexing Module, Searching Module, Indexing Service, Apache Solr, Drupal, Apache Tomcat, Apache HTTP.]
  • 47. Drupal: What is it?
      - Open source content management platform
      - Developed by Dries Buytaert in 2001
      - Written in PHP
      - Users: The Economist, Examiner.com, The White House, data.gov.uk
      - Runs on a web server (e.g. Apache, IIS) and a database (e.g. MySQL, PostgreSQL)
  • 48. Apache Lucene: What is it?
      - High-performance, full-featured text search engine library (indexing and searching documents)
      - Developed by Doug Cutting; released on SourceForge (2000), joined the Apache Software Foundation in 2001
      - Written entirely in Java
      - Users: Wikipedia, Technorati, Nabble, TheServerSide, Akamai, SourceForge
  • 49. Apache Lucene Features
      - Ranked searching (best results returned first)
      - Powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
      - Fielded searching (e.g., title, author, contents)
      - Date-range searching
      - Sorting by any field
      - Multiple-index searching with merged results
      - Allows simultaneous update and searching
  • 50. Apache Lucene Features
      - Documents added via IndexWriter
      - Document = a collection of fields
      - No config files, dynamic field typing
      - Flexible text analysis: tokenizers, filters
      - Search for documents via IndexSearcher
          - Hits = search(Query, Filter, Sort, topN)
      - Scoring: tf * idf * lengthNorm
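
To make the scoring formula concrete, the sketch below evaluates tf * idf * lengthNorm for a single term, using the component definitions of Lucene's classic default similarity (square-root term frequency, logarithmic inverse document frequency, inverse square-root field length). It is a simplified illustration: the full Lucene score also involves query normalization, coordination and boost factors, and the numbers in the example are made up.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rare terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Field length normalization: matches in short fields weigh more."""
    return 1.0 / math.sqrt(num_terms)


def term_score(freq, doc_freq, num_docs, num_terms):
    """Simplified per-term score, as summarized on the slide."""
    return tf(freq) * idf(doc_freq, num_docs) * length_norm(num_terms)


# Made-up example: the term occurs 4 times in a 16-term field and appears
# in 9 of the 1000 indexed documents.
score = term_score(freq=4, doc_freq=9, num_docs=1000, num_terms=16)

# The same term occurring once in a 4-term field (e.g. a title).
title_score = term_score(freq=1, doc_freq=9, num_docs=1000, num_terms=4)
```

In the first case the three factors are 2, 1 + ln(100) and 0.25, so the score is about 2.80; the single occurrence in the much shorter field reaches the same score, which shows how length normalization favors short fields such as titles.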
  • 51. Apache Solr: What is it?
      - A full text search server based on Lucene (Lucene sub-project)
      - Developed by Yonik Seeley at CNET Networks (2004), donated to the Apache Software Foundation (2006)
      - Written in Java, deployable as a WAR
      - Users: CNET Reviews, CNET Channel, shopper.com, news.com, nines.org, krugle.com, oodle.com, booklooker.de
  • 52. Apache Solr Features
      - Advanced Full-Text Search Capabilities
      - Optimized for High Volume Web Traffic
      - Standards Based Open Interfaces (XML, JSON, HTTP)
      - Web Administration Interface
      - Server statistics exposed over JMX for monitoring
      - Scalability, efficient Replication to other Solr Search Servers
      - Flexible and Adaptable with XML configuration
      - Extensible Plugin Architecture