Site search is one of the core functions of any website. This talk provides an overview of the internal workings of CQ5 search and its limitations for implementing site search functionality, and discusses design patterns and challenges for integrating various third-party search providers with CQ5/AEM.
2.
Session Outline
Importance of site search functionality
CQ5 internal search workings & limitations
Integrating CQ5 with external search engines & challenges
Indexing patterns for integrating with external search engines
Q&A
3.
Site search is one of the core functions of most websites
Browse vs. search: alternative methods of allowing visitors to find the information they need quickly and easily
“90 percent of companies report that search is the No. 1 means of navigation on their site”
-- Forrester Research
“82 percent of visitors use site search to find the information they need”
-- Juniper Research
Advances in search features allow site visitors to:
Auto-complete/auto-correct search terms
Build advanced queries
Filter results by facets
Refine search results by location, preferences, previous history, etc.
“Visitors who used site search were more likely to convert from browsers to buyers”
-- Juniper Research
4.
• Jackrabbit internally uses Lucene to index repository content
• Whenever any content is modified, the Lucene index is updated along with the content being stored in the repository
• Index location: <crx-quickstart>/repository:
  • repository/index
  • workspaces/crx.default/index
• Index configuration:
  • <SearchIndex> block in repository.xml & workspaces.xml
  • tika-config.xml in the workspaces folder
• Changes in newer versions of Jackrabbit (3.x / Oak)
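The write-through behavior described above (every content save also updates the full-text index) can be sketched with a plain observer pattern. This is a toy in-memory illustration of the idea, not Jackrabbit's actual internals; all class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: a repository that updates a full-text index on every
// save, mirroring how Jackrabbit keeps its Lucene index in sync with content.
// Names are hypothetical, not Jackrabbit's real classes.
public class WriteThroughRepository {
    public interface Indexer {
        void index(String path, String content);
        void remove(String path);
    }

    // Toy in-memory "index": path -> lowercased tokens
    public static class InMemoryIndexer implements Indexer {
        private final Map<String, List<String>> index = new HashMap<>();
        public void index(String path, String content) {
            index.put(path, List.of(content.toLowerCase().split("\\s+")));
        }
        public void remove(String path) { index.remove(path); }
        public List<String> search(String term) {
            List<String> hits = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : index.entrySet())
                if (e.getValue().contains(term.toLowerCase())) hits.add(e.getKey());
            return hits;
        }
    }

    private final Map<String, String> nodes = new HashMap<>();
    private final Indexer indexer;

    public WriteThroughRepository(Indexer indexer) { this.indexer = indexer; }

    // Saving content stores it AND updates the index in the same operation.
    public void save(String path, String content) {
        nodes.put(path, content);
        indexer.index(path, content);
    }

    // Deleting content removes it from the index as well.
    public void delete(String path) {
        nodes.remove(path);
        indexer.remove(path);
    }
}
```

The key point is that indexing is coupled to the write path: there is no separate crawl, which is why the index stays in sync but also why every write pays the indexing cost.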
5.
• Jackrabbit
  • JCR Spec 1.0: support for XPath & JCR SQL
  • JCR Spec 2.0: support for JCR-SQL2. XPath is deprecated in JCR 2.0, but Jackrabbit still supports it
  • Both SQL & XPath queries are translated to the same search tree
• Query Builder is an API to build queries for a query engine
• CQ provides several OOTB components & extensions which leverage the QueryBuilder API for full-text or predicate-based searches
• The OOTB search component provides support for full-text queries and enhanced search features: similar pages, facet support, pagination, etc.
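To illustrate the QueryBuilder idea, here is a minimal sketch that translates a flat map of predicates into an XPath query string. The real com.day.cq.search.QueryBuilder supports far more predicates and options; this toy version handles only path, type and fulltext, and its escaping is simplified for the example.

```java
import java.util.Map;

// Illustrative sketch of predicate-based query building: a flat map of
// predicates is translated into an XPath query string, similar in spirit to
// what CQ's QueryBuilder does. Handles only path, type and fulltext.
public class PredicateToXPath {
    public static String toXPath(Map<String, String> p) {
        String path = p.getOrDefault("path", "/content");
        String type = p.getOrDefault("type", "cq:Page");
        StringBuilder q = new StringBuilder("/jcr:root")
                .append(path)
                .append("//element(*, ").append(type).append(")");
        if (p.containsKey("fulltext")) {
            // Simplified quote escaping for the sketch
            q.append("[jcr:contains(., '")
             .append(p.get("fulltext").replace("'", "''"))
             .append("')]");
        }
        return q.toString();
    }
}
```

For example, predicates {path=/content/site, type=cq:Page, fulltext=search} yield /jcr:root/content/site//element(*, cq:Page)[jcr:contains(., 'search')].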
6.
Use Case: Non-CQ Content Sources
Larger sites with more than one source of content and assets
Difficult to index non-CQ content
Use Case: Author vs. Visitor Search Patterns
Author and visitor search patterns and requirements are typically different
Performance & Architecture Considerations
CQ generates one index per server
‘n’ number of queries and search variations – making it difficult to utilize the CQ caching architecture
The Jackrabbit layer on top of Lucene may slow down search and query performance
Scaling of the search architecture is dependent upon the CQ architecture
Customizations
Utilizing different content parsers, index tuning, etc. (mitigated in 5.6.1)
Can I use a newer version of Lucene?
How can I extend the Jackrabbit search implementation?
7.
External Search Platforms
Search Providers with Crawlers (examples):
▪ Google Search Appliance
▪ Microsoft FAST
Non-crawler Search Providers (examples):
▪ Endeca
▪ Lucene/Solr
Enables independent scaling of the search platform
Supports more than one content source
Configuration & customization of the search application is decoupled from the CQ5 application
May provide more advanced search features (faceted search, geospatial search, personalization, etc.)
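For the non-crawler providers above, querying from the CQ side is often just an HTTP GET against the provider's query endpoint. As a minimal sketch for Solr, the following builds a /select URL with a full-text term and facet filters; the host, core name and field names are assumptions for the example, not anything fixed by CQ or Solr.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Illustrative sketch: building a Solr /select query URL from a full-text
// term and facet filters. Base URL, core and field names are assumptions.
public class SolrQueryUrl {
    public static String build(String baseUrl, String fulltext,
                               Map<String, String> facetFilters) {
        StringBuilder url = new StringBuilder(baseUrl)
                .append("/select?wt=json&q=")
                .append(URLEncoder.encode(fulltext, StandardCharsets.UTF_8));
        // Each facet filter becomes an fq parameter, e.g. fq=tag:news
        for (Map.Entry<String, String> f : facetFilters.entrySet()) {
            url.append("&fq=")
               .append(URLEncoder.encode(f.getKey() + ":" + f.getValue(),
                                         StandardCharsets.UTF_8));
        }
        return url.toString();
    }
}
```

Because the query lives entirely on the search platform's side, search configuration (analyzers, facets, boosting) can change without touching the CQ5 application — which is exactly the decoupling point made above.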
8.
Challenges building & managing search indexes
Building the site index: crawl, or query & inject?
How often should the index be rebuilt?
How to ensure that content & metadata between content sources and the search index are always in sync?
In case of multiple data sources, how to manage duplicates, index structure and a common metadata model?
Challenges querying & building search results
Should the search results page be hosted on the provider’s platform or within CQ?
Does the search provider offer an extended API to query and build search results within the application?
9.
10.
Integration Notes:
GSA, FAST site crawler, Endeca’s plugin for CRX indexing, Solr via open-source crawlers (Nutch, etc.)
May require a custom service which returns data (for example for Solr, Endeca)
Pros:
Ease of implementation
Indexes the rendered version of the pages
Cons:
Lag between content publishing and the index update process may result in an out-of-sync search results experience. Also, what happens to deleted content?
Larger index crawl and build times
Search index doesn’t have the complete set of metadata
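One common mitigation for the "what happens to deleted content?" problem with crawler-based indexing is reconciliation: diff the set of URLs currently in the index against the URLs found by a fresh crawl (or listed in a sitemap) and purge the leftovers. A minimal sketch of that set difference, with illustrative names:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch: find URLs that are still in the search index but no
// longer reachable by the crawler, so they can be purged from the index.
public class StaleUrlReconciler {
    public static Set<String> findStale(Set<String> indexedUrls,
                                        Set<String> crawledUrls) {
        Set<String> stale = new LinkedHashSet<>(indexedUrls);
        stale.removeAll(crawledUrls); // whatever the crawl no longer sees is stale
        return stale;
    }
}
```

This still leaves the lag problem: deleted pages remain searchable until the next reconciliation run completes.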
11.
12.
Example – CQ / FAST connector (available via service pack)
Pros:
⁻ Search index always in sync with the content repository
⁻ Ability to send metadata with content
⁻ Customizable data formats; allows for partial indexing of a page
Cons:
⁻ Will require custom development effort
⁻ Indexes raw content instead of the rendered version of the pages
⁻ System performance / event handling
13.
14.
15.
Pros:
⁻ Search index (mostly) in sync with the content repository
⁻ Ability to send metadata with content
⁻ Customizable data formats; allows for partial indexing of a page
⁻ Minimal replication event processing
Cons:
⁻ Will require custom development effort
⁻ Search index may get out of sync with the content repository (but only for a short duration)
⁻ Indexes raw content instead of the rendered version of the pages
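The queued variant that yields these trade-offs can be sketched as follows: the event handler only records a path (cheap, no remote call), and a separate batch job periodically drains the queue and pushes to the external index. The index lags by at most one batch interval, matching the pros and cons listed above. Names are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of queued indexing: replication events only enqueue a
// path+action; a periodic batch job drains the queue and talks to the
// external index. Event handling on the publish path stays cheap.
public class QueuedIndexer {
    // path -> latest action; a later event for the same path supersedes
    // earlier ones, so a page edited twice is only indexed once per batch
    private final Map<String, String> pending = new LinkedHashMap<>();

    // Called from the event handler: O(1), no remote call.
    public synchronized void enqueue(String path, String action) {
        pending.put(path, action);
    }

    // Called periodically: drain everything queued so far as one batch.
    public synchronized List<String> drainBatch() {
        List<String> batch = new ArrayList<>();
        for (Map.Entry<String, String> e : pending.entrySet())
            batch.add(e.getValue() + " " + e.getKey());
        pending.clear();
        return batch;
    }
}
```

Collapsing repeated events per path is what makes replication event processing "minimal": a burst of edits to one page costs one index update per batch, not one per event.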
16.
Handling initial content load & index creation
In case of the content-push approach, how will the initial index be generated? May need to create an initial baseline via a site crawl or a custom service
In case of the content-pull approach, how will the index reflect deleted or moved site pages?
Permission-sensitive site pages & assets
Option 1: Export ACLs to the search provider (example: CQ/FAST connector)
Option 2: Check user permissions via CQ at run time (similar to how CQ handles delivery of content in case of closed user groups)
Referenced assets, content pages and promos
Option: Query referenced pages and index them. May cause performance (& recursive indexing) issues, though.
Option: Selective content indexing (index parts of a page instead of the entire page)
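Option 2 for permission-sensitive content can be sketched as a run-time filter: the external index returns candidate paths, and CQ filters them against the current user's permissions before rendering results. The permission check below is a toy stand-in for a real ACL lookup; all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Illustrative sketch of run-time permission filtering: candidate hits from
// the external index are checked against the current user's permissions
// before being shown. The BiPredicate stands in for a real ACL lookup.
public class PermissionFilteredResults {
    public static List<String> filter(List<String> hits, String userId,
                                      BiPredicate<String, String> canRead) {
        List<String> visible = new ArrayList<>();
        for (String path : hits)
            if (canRead.test(userId, path)) visible.add(path);
        return visible;
    }
}
```

The trade-off versus Option 1 (exporting ACLs): no permission data leaves CQ and results are always permission-accurate, but every search request pays for per-hit permission checks, and result counts/pagination from the provider may no longer match what the user actually sees.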