A demonstration of vivosearch.org and the open source tools that were developed to build the site.
Presented by Brian Caruso, Miles Worthington and Nick Cappadona on Thursday, August 25 at the 2011 VIVO Conference in National Harbor, MD, USA.
Search Across Multiple VIVO Instances
1. Search Across Multiple VIVO Instances
Brian Caruso, Miles Worthington, Nick Cappadona
Albert R. Mann Library
Cornell University
2. Building the foundation
• VIVO core ontology
• Linked Data
• Implementation & Adoption
• Ingest & Editing
3. VIVO core ontology
• A hierarchy of classes and properties
• Incorporates segments of established ontologies
– Bibontology
– FOAF
– eagle-i
• Provides structure for modeled data
http://vivoweb.org/ontology/core
4. Linked Data
“A set of best practices for publishing and connecting structured data on the Web”
• URIs
• RDF
• HTTP
http://linkeddata.org
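The three building blocks above can be made concrete with a tiny example: a person is identified by a URI, described in RDF as subject–predicate–object triples, and that description is retrievable over HTTP. The sketch below parses a couple of N-Triples lines into triples; the `vivo.example.edu` URIs are invented for illustration, not taken from a real VIVO instance, and the parser handles only this simple case.

```python
# A minimal sketch of the RDF triple model using N-Triples syntax.
# The individual URIs below are hypothetical; FOAF and RDFS are the
# real vocabularies the VIVO core ontology builds on.
NTRIPLES = """\
<http://vivo.example.edu/individual/n123> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://vivo.example.edu/individual/n123> <http://www.w3.org/2000/01/rdf-schema#label> "Jane Doe" .
"""

def parse_ntriples(text):
    """Parse simple N-Triples lines into (subject, predicate, object) tuples."""
    triples = []
    for line in text.strip().splitlines():
        line = line.rstrip(" .")          # drop the trailing " ."
        subj, pred, obj = line.split(None, 2)
        triples.append((subj.strip("<>"), pred.strip("<>"), obj.strip('<>"')))
    return triples

for s, p, o in parse_ntriples(NTRIPLES):
    print(s, p, o)
```

Both statements share the same subject URI, which is exactly what lets independent sites make linkable claims about the same resource.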
5. Implementation & Adoption
• VIVO implemented at 7 partner institutions
– Cornell University
– University of Florida
– Indiana University
– Washington University in St. Louis School of Medicine
– Ponce School of Medicine
– Weill Cornell Medical College
– The Scripps Research Institute
• Buy-in and support
6. Ingest & Editing
• Identify local systems of record
– HR
– Grants
– Faculty Activity
• Load data
– Harvester
– Ingest Tools
• Curation and self-editing
8. vivosearch.org
• An example of multi-institutional search
• Includes 7 partner institutions
– plus Harvard Catalyst Profiles
• Built using 2 tools developed on the grant
– Linked Data Index Builder
– VIVO Search Drupal module
• Both are open source and available today
– http://vivosearch.org/tools
9. Preparing Linked Data for search
Linked Data Index Builder
http://vivosearch.org/tools
10. Linked Data Index Builder
• A tool to create a Solr index from VIVO sites
• Linked Data principles
– URIs
– RDF
– HTTP
• Solr
– open source enterprise search platform
– http://lucene.apache.org/solr
11. LDIB input
• URL of VIVO instance
– or any site serving LD aligned with VIVO core ontology
– http://vivo.cornell.edu
• Method/service to retrieve list of URIs
– provided in VIVO through Index page
– http://vivo.cornell.edu/browse
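The two inputs above suggest the overall shape of the index-building loop: list the URIs a site exposes, fetch the RDF for each one, and flatten each description into a document for Solr. This is only a sketch of that shape, not LDIB's actual API; `fetch_uri_list` and `fetch_rdf` are stand-ins for the real HTTP requests, and the document field names are assumptions.

```python
# Sketch of an LDIB-style pipeline with stubbed network calls.

def fetch_uri_list(site_url):
    """Stand-in for the VIVO Index/browse service that lists individual URIs."""
    return [site_url + "/individual/n1", site_url + "/individual/n2"]

def fetch_rdf(uri):
    """Stand-in for an HTTP GET of the RDF description behind a URI."""
    return {"uri": uri, "label": "Person " + uri.rsplit("/", 1)[-1]}

def build_index_docs(site_url, site_name):
    """Turn each retrieved description into a flat Solr-style document."""
    docs = []
    for uri in fetch_uri_list(site_url):
        rdf = fetch_rdf(uri)
        # site_name / site_url are manually curated in LDIB (see slide 15).
        docs.append({"id": rdf["uri"], "label": rdf["label"],
                     "site_name": site_name, "site_url": site_url})
    return docs

docs = build_index_docs("http://vivo.example.edu", "Example University")
print(docs)
```

Attaching the curated site name to every document is what later powers the per-institution facet on the search site.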
15. LDIB to do
• Improve fault tolerance
• Automate update/sync
• Experiment with scaling
• Management tools
– need governance model to design tools
– site_name and site_url are manually curated
– no registration system in place
17. Why Drupal?
• Need a website as well
• Can tap into core search features
• Existing framework for connecting to Solr
18. Apache Solr search integration module
• Flexible, not limited to Drupal content
• Active community
• Commercially backed
19. VIVO search module
• Built for Drupal 7
• Works on top of the existing Drupal module
• Uses Drupal's core search system
• Packaged with 3 search facets:
classgroup, type, institution
• Written specifically for LDIB indexes
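The three packaged facets boil down to counting field values across the matching documents. In practice Solr computes facet counts server-side, but the idea can be sketched in a few lines; the field names below mirror the module's three facets, while the document values are invented for illustration.

```python
from collections import Counter

# Toy result set; fields mirror the module's facets (classgroup, type,
# institution), but the values are made up for this sketch.
results = [
    {"classgroup": "people",   "type": "Faculty Member", "institution": "Cornell University"},
    {"classgroup": "people",   "type": "Librarian",      "institution": "Cornell University"},
    {"classgroup": "research", "type": "Article",        "institution": "University of Florida"},
]

def facet_counts(docs, field):
    """Count how many documents carry each value of a facet field."""
    return Counter(d[field] for d in docs)

print(facet_counts(results, "institution"))
```

Each facet value shown in the UI is just one of these counts paired with a filter that narrows the result set to the documents behind it.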
24. Relevance
• Good result ranking
• Scannable results
• Clear context
• Result totals
• Handle empty results
25. Performance
• More critical than usual
• Don't interrupt user's train of thought
• Users will quickly abandon your site
26. Performance
“Web search engines typically show ten results, or “hits,” per page, with hyperlinks to additional pages of results .... a Google VP reported that despite the fact that users said they wanted more hits per page, an experiment in which the number of hits was increased to 30 hits per page showed a 20% reduction in traffic (Linden, 2006). The reason turned out to be that while the page with 10 results took 0.4 seconds to generate, the page with 30 results took 0.9 seconds on average.”
http://searchuserinterfaces.com/book/sui_ch5_retrieval_results.html
27. Performance enhancements
• Solr
• Apache mod_pagespeed
• Lots of caching
• Data URIs for CSS images
• CSS/JS aggregation and compression
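One of the techniques listed above, data URIs for CSS images, inlines small images directly into the stylesheet so each icon no longer costs its own HTTP round trip. A minimal sketch of the encoding step; the GIF bytes are a hardcoded 1x1 transparent placeholder so the example needs no image file.

```python
import base64

# 1x1 transparent GIF, hardcoded so the sketch is self-contained.
GIF_BYTES = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

def to_data_uri(data, mime="image/gif"):
    """Encode raw bytes as a data: URI usable inside a CSS url(...) value."""
    return "data:%s;base64,%s" % (mime, base64.b64encode(data).decode("ascii"))

css_rule = ".icon { background: url(%s); }" % to_data_uri(GIF_BYTES)
print(css_rule)
```

The trade-off is that base64 inflates the payload by roughly a third, so the technique pays off for small, frequently used images rather than large ones.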
28. Controls
• Strive for predictability and consistency
• Facets must be intuitive
• Offer an escape route
29. Usability testing
• 5 sessions
• Covered tasks for entire site
• Results overall positive
• Revealed issues with controls
32. Future enhancements
• Improved result ranking
• More informative text snippets
• Spelling and term suggestions
• Configuration for VIVO search module
33. Build a search site using the tools we developed
Roll Your Own
34. More than meets the eye
• vivosearch.org > LDIB + Drupal module
– theme
– additional utilities
36. Look Mom, no Drupal
• Solr is the key
• choose your weapon for integration
– http://wiki.apache.org/solr/IntegratingSolr
• Drupal is not a requirement
37. Brian Caruso
brian.caruso@cornell.edu
Miles Worthington
miles.worthington@cornell.edu
Nick Cappadona
nick.cappadona@cornell.edu
vivo-dev-all@lists.sourceforge.net
Questions?
Thank You
Editor’s Notes
Alternative: Search Across the Seven Partner VIVO Instances

Brief intro and background on how we got to this point
Ontologies
* Bibontology for publications
* FOAF for people and organizations
* eagle-i for scientific and research resources

* Defines the common thread across institutions (tap into this for search faceting/filtering)

* Uniform Resource Identifier - a string used to identify a resource on the web
* Resource Description Framework - a generic graph-based data model for describing things, including relationships to other things
* HyperText Transfer Protocol - a simple, universal mechanism for requesting and retrieving resources or descriptions of resources

* more than simply installing the VIVO app
* the efforts of the Implementation and Outreach teams are too often overlooked
* the buy-in and support from administration and faculty are critical

* VIVO implemented at institutions and organizations beyond the 7 on the grant
  - University of Colorado
  - Stony Brook
  - there are definitely others... ask Elly for the latest numbers?

* local systems of record are key - you could load all of the data manually, but that’s no fun
* Harvester is available as a subproject on SourceForge - a library of ETL (extract, transform, load) tools
  - initial integration of Harvester in VIVO 1.3
* even with automated ingest, you’ll still want to edit/add information on an individual basis

* so we’ve laid down this foundation and we now have the VIVO app running at the 7 partner institutions, but how do we tie all of this data together and start using it to help us discover new collaborations?
* make connections... the start of a network (probably too loaded a term)

* was thinking of listing out the URLs of the seven VIVO partner instances prior to this slide while I spoke about the points above, but felt it wasn’t necessary

* vivosearch.org
  - an example site that searches the VIVO instances at the 7 partner institutions on the grant
  - also includes Harvard Catalyst Profiles as evidence of interoperability with external apps
* go right into the search - start with a suggested term
* provide a scenario or 2 that makes use of faceting
  - need to come up with these
* follow a result to the source institution

* reiterate that this is an example site :)
* HCP has aligned itself with the VIVO core ontology, serves profiles data as RDF
* these 2 tools are works in progress and are free for you to download and use in building a similar search site
* we’d like to show you a closer look at each of these and provide some details on how you can build a search site of your own

* need to add the link for the sandbox project once it’s online at Drupal.org
Pass off to Brian Caruso to work his magic
* emphasize that the end result is a Solr index
* we use Solr because it’s proven and it’s fast
* revisit Linked Data
  - making HTTP requests to VIVO instances and retrieving RDF using URIs

* alternate title: LDIB Minimum Requirements
* alternate title: LDIB Ingredients

* HCP is an example of one such non-VIVO site (although it doesn’t serve Linked Data -- one URI for both HTML and RDF representations)
* this list of URIs defines what will be retrieved and indexed in subsequent requests
* all steps during index building share very little state, so the process should be very parallelizable
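The point in the note above, that per-URI work shares almost no state, is what makes it safe to fan the fetches out across workers. A sketch under those assumptions; `fetch_rdf` is a stub standing in for the real HTTP fetch.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub standing in for the per-URI HTTP fetch; each call is independent,
# which is what makes the parallel fan-out below safe.
def fetch_rdf(uri):
    return {"uri": uri, "label": uri.rsplit("/", 1)[-1]}

uris = ["http://vivo.example.edu/individual/n%d" % i for i in range(10)]

# Fan the independent fetches out across a small thread pool;
# pool.map preserves the input order of the URIs.
with ThreadPoolExecutor(max_workers=4) as pool:
    docs = list(pool.map(fetch_rdf, uris))

print(len(docs), "documents fetched")
```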
Provide a service to link individuals in one VIVO instance to individuals in another VIVO instance
* Solr is highly scalable
  - distributed indexes
  - used by Netflix, Monster.com, Digg

Should we introduce the Solr schema here or anywhere else, or is it just not worth getting into that level of detail in this presentation?

* the fact that we are manually curating the site information should reinforce that we currently have no registration or signup system beyond “email Brian Caruso...”

* should we mention scaling here? What do we want to say besides “scaling with Solr”?
  - are there any particular example projects/numbers we want to point to?
- Search is a goal-oriented activity. Users are typically not searching for fun. Get out of their way.
- Google and others have established UI patterns that users are comfortable with. The UI itself is not where we want to experiment.
- It seems so simple and familiar, but there are many subtleties in a search interface.
- Usability testing is not negotiable.
* we don’t want to give you the wrong impression that, using only these 2 open source tools, you can build this exact site, pixel for pixel
* additional utilities for facets, class group taxonomy, institution management
* show a barebones D7 site with the default theme and VIVO search module to illustrate the extra work
* place a screenshot here and then also quickly demo this live

* we will also need the Apache Solr Search Integration module as well (anything else?)
* I will work on this tonight/tomorrow
* focusing on the fact that it’s more than just these 2 tools is not the point
* instead, bring the focus to Solr
* explain why we chose it
* illustrate the flexibility it provides
* Drupal is not a requirement, just one example
* demo the AJAX Solr site connected to the Rollins index

I will work on the AJAX Solr site tonight/tomorrow as well.