Challenges of Simple
Documents
Cassandra Targett
Director of Engineering, Lucidworks
@childerelda
#Activate18 #ActivateSearch
About Me
• Lucene/Solr committer and
member of PMC
• Director of Engineering at
Lucidworks
• Manage team of Solr
committers
• Live in the Florida
Panhandle
Agenda
• Looking at the Solr Reference Guide as a content source
• Structure of raw documents
• HTML format
• Indexing the Reference Guide with bin/post
• Indexing the Reference Guide with Site Search
Solr Reference Guide as a
Content Source
Brief History of the Ref Guide
• 2009: First version of the Guide created by Lucidworks
• 2013: Lucidworks donates Guide to Lucene/Solr community
• 2017: Guide integrated with Solr source
Moving from Confluence
• What we gained:
• Control over information design and presentation
• Versioned Guides
• Tighter integration with developers
• What we lost:
• Comments
• Managed infrastructure
• SEARCH
Challenges with Providing Ref Guide
Search as a Community Artifact
Server
None of this is the core
mission of the committer
community
Baseline Feature Set
• Full text search
• Auto-complete
• Suggestions/Spellcheck
• Highlighting
• Facets? …based on?
Reference Guide
Document Format
Asciidoc Format
Page title
Text
Image reference
with caption
Section title
Ref Guide Content Structure
• Asciidoc is relatively well-structured
• headings clearly separate from general text (==)
• code examples in blocks ([source,xml])
• Doesn’t include header/footer/nav “cruft”
• Challenges:
• Document links are to other .adoc files
• No URL for access via a search result list
• HTML metadata missing
• No established means of indexing .adoc files
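One way to approach the raw .adoc files is to split each page into fields before sending it to Solr. The sketch below is a minimal illustration, not an established tool: the function name `parse_adoc` and the field names `title`, `sectiontitles`, `code` and `body` are assumptions made for this example, and it only understands `=`/`==` heading markers, block attribute lines such as `[source,xml]`, and listing blocks delimited by `----`.

```python
import re


def parse_adoc(text):
    """Split Asciidoc text into a dict of hypothetical Solr fields."""
    doc = {"title": None, "sectiontitles": [], "code": [], "body": []}
    in_code = False
    code_lines = []
    for line in text.splitlines():
        if line.strip() == "----":
            # A line of four dashes opens or closes a listing block.
            if in_code:
                doc["code"].append("\n".join(code_lines))
                code_lines = []
            in_code = not in_code
            continue
        if in_code:
            code_lines.append(line)
            continue
        heading = re.match(r"^(=+)\s+(.*)$", line)
        if heading:
            level, title = len(heading.group(1)), heading.group(2).strip()
            if level == 1 and doc["title"] is None:
                doc["title"] = title
            else:
                doc["sectiontitles"].append(title)
            continue
        if line.startswith("[") and line.rstrip().endswith("]"):
            # Block attribute lines such as [source,xml] carry no text.
            continue
        if line.strip():
            doc["body"].append(line.strip())
    doc["body"] = " ".join(doc["body"])
    return doc


sample = """= Logging
Intro text.

== Temporary Logging Settings
[source,xml]
----
<int name="slowQueryThresholdMillis">1000</int>
----
More text.
"""

parsed = parse_adoc(sample)
print(parsed["title"], parsed["sectiontitles"])
```

This does not address the other challenges listed above: links between .adoc files and the missing URL would still need separate handling.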
Maybe We Should Use HTML Format?
• Lots of systems know how to read HTML
• URLs for access already exist and are correct
• Inter/intra-document links converted to correct HTML
references (anchors or other pages)
• Challenges:
• Includes “cruft” of navs and headers/footers
• HTML can be pretty unstructured
[Screenshot callouts: toc.js, Jekyll template]
Take One: bin/post
aka, Solr Cell and Tika
Solr’s bin/post
• Simple command line tool for POSTing content
• XML, JSON, CSV, HTML, PDF
• Includes a basic crawler
• Determines update handler to use based on file type
• JSON, CSV -> /update/json, /update/csv
• PDF, HTML, Word -> /update/extract
• Delegates to post.jar (in example/exampledocs)
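The routing rule above can be pictured as a lookup from file extension to update handler. This is an illustration of the idea only, not the code in SimplePostTool; the table contents and the function name `handler_for` are assumptions for this example.

```python
import os

# Hypothetical routing table based on the rules listed on the slide.
DIRECT_HANDLERS = {
    ".json": "/update/json",
    ".csv": "/update/csv",
}
EXTRACT_TYPES = {".pdf", ".html", ".htm", ".doc", ".docx"}


def handler_for(filename):
    """Return the update handler path a file of this type would be sent to."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in DIRECT_HANDLERS:
        return DIRECT_HANDLERS[ext]
    if ext in EXTRACT_TYPES:
        # Rich documents go through Solr Cell for extraction.
        return "/update/extract"
    raise ValueError(f"no handler known for {filename!r}")


for name in ["books.json", "books.csv", "logging.html", "guide.pdf"]:
    print(name, "->", handler_for(name))
```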
$ ./bin/post -c post-html -filetypes html example/refguide-html/
/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java -classpath /Applications/Solr/solr-7.5.0/dist/solr-core-7.5.0.jar -Dauto=yes -Dfiletypes=html -Dc=post-html -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool example/refguide-html/
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/post-html/update...
Entering auto mode. File endings considered are html
Entering recursive mode, max depth=999, delay=0s
Indexing directory example/refguide-html (245 files, depth=0)
POSTing file requestdispatcher-in-solrconfig.html (text/html) to [base]/extract
POSTing file client-api-lineup.html (text/html) to [base]/extract
…
250 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/post-html/update...
Time spent: 0:00:07.660
/update/extract aka Solr Cell
• ExtractingRequestHandler
• Uses configuration in solrconfig.xml if not defined with
runtime parameters
• Uses Apache Tika for content extraction & parsing
• Streams documents to Solr
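Because runtime parameters override the defaults in solrconfig.xml, a request to the handler is just the handler URL plus a query string. The sketch below only builds that URL and sends nothing; the helper name `extract_url` is an assumption, and the host, port and collection are taken from the examples in this deck.

```python
from urllib.parse import urlencode


def extract_url(solr_base, collection, params=None):
    """Build the /update/extract URL for a collection, with optional runtime parameters."""
    url = f"{solr_base.rstrip('/')}/solr/{collection}/update/extract"
    if params:
        # Runtime parameters are passed as ordinary query-string pairs.
        url += "?" + urlencode(params)
    return url


print(extract_url("http://localhost:8983", "post-html"))
print(extract_url("http://localhost:8983", "post-html",
                  {"fmap.content": "body", "commit": "true"}))
```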
Indexed document
{
  "id":"/Applications/Solr/solr-7.5.0/example/refguide-html/language-analysis.html",
  "stream_size":[222027],
  "x_ua_compatible":["IE=edge"],
  "x_parsed_by":["org.apache.tika.parser.DefaultParser",
    "org.apache.tika.parser.html.HtmlParser"],
  "stream_content_type":["text/html"],
  "keywords":[" "],
  "viewport":["width=device-width, initial-scale=1"],
  "dc_title":["Language Analysis | Apache Solr Reference Guide 7.5-DRAFT"],
  "content_encoding":["UTF-8"],
  "resourcename":["/Applications/Solr/solr-7.5.0/example/refguide-html/language-analysis.html"],
  "title":["Language Analysis | Apache Solr Reference Guide 7.5-DRAFT"],
  "content_type":["text/html; charset=UTF-8"],
  "_version_":1612232702110466048}
/update/extract in solrconfig.xml
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
</lst>
</requestHandler>
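The effect of `lowernames` and the `fmap.*` defaults on field names can be approximated in a few lines. This is a simplified, ASCII-only sketch of the renaming behavior, not Solr's implementation; the function name `map_field_name` is an assumption, and prefix handling for unknown fields is left out.

```python
import re


def map_field_name(name, lowernames=True, fmap=None):
    """Approximate how Solr Cell renames one extracted field."""
    fmap = fmap or {}
    if lowernames:
        # Lowercase, and turn any character that is not a letter or digit into "_".
        name = re.sub(r"[^a-z0-9]", "_", name.lower())
    # fmap.<source>=<target> renames a field after lowercasing.
    return fmap.get(name, name)


# Mirrors the fmap.content and fmap.meta defaults in the handler definition.
defaults = {"content": "_text_", "meta": "ignored_"}
for raw in ["X-UA-Compatible", "dc:title", "content"]:
    print(raw, "->", map_field_name(raw, fmap=defaults))
```

This is why metadata such as `X-UA-Compatible` shows up in the indexed document as `x_ua_compatible`.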
$ ./bin/post -c post-html -filetypes html -params "fmap.content=body" example/refguide-html/
/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java -classpath /Applications/Solr/solr-7.5.0/dist/solr-core-7.5.0.jar -Dauto=yes -Dfiletypes=html -Dparams=fmap.content=body -Dc=post-html -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool example/refguide-html/
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/post-html/update?fmap.content=body...
Entering auto mode. File endings considered are html
Entering recursive mode, max depth=999, delay=0s
Indexing directory example/refguide-html (245 files, depth=0)
POSTing file requestdispatcher-in-solrconfig.html (text/html) to [base]/extract
POSTing file client-api-lineup.html (text/html) to [base]/extract
…
250 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/post-html/update?fmap.content=body...
Time spent: 0:00:07.347
Indexed document with body
{
  "id":"/Applications/Solr/solr-7.5.0/example/refguide-html/logging.html",
  "stream_size":[46314],
  "x_ua_compatible":["IE=edge"],
  "x_parsed_by":["org.apache.tika.parser.DefaultParser",
    "org.apache.tika.parser.html.HtmlParser"],
  "stream_content_type":["text/html"],
  "keywords":[" "],
  "viewport":["width=device-width, initial-scale=1"],
  "dc_title":["Logging | Apache Solr Reference Guide 7.5-DRAFT"],
  "content_encoding":["UTF-8"],
  "resourcename":["/Applications/Solr/solr-7.5.0/example/refguide-html/logging.html"],
  "title":["Logging | Apache Solr Reference Guide 7.5-DRAFT"],
  "content_type":["text/html; charset=UTF-8"],
  "body":[" \n \n stylesheet text/css https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css \n stylesheet css/lavish-bootstrap.css \n stylesheet css/customstyles.css \n stylesheet css/theme-solr.css \n stylesheet css/ref-guide.css \n https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js \n https://cdnjs.cloudflare.com/ajax/libs/jquery-cookie/1.4.1/jquery.cookie.min.js \n js/jquery.navgoco.min.js \n https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/js/bootstrap.min.js \n https://cdnjs.cloudflare.com/ajax/libs/anchor-js/2.0.0/anchor.min.js \n js/toc.js \n
  ....],
  "_version_":1612248222923751424}
capture and captureAttr
• These parameters allow putting HTML tags (XHTML
elements) into separate fields in Solr
• capture=<element> puts a specific element into its
own field
• captureAttr=true puts all attributes of elements into
their own fields
Example Using captureAttr
• Captures attributes
of elements
• classnames
• ids
• Best used for
something like
getting href values
out of <a> tags
./bin/post -c post-html -filetypes html -params "fmap.content=body&captureAttr=true" example/refguide-html/
Example Using capture
• Map everything in
h2 tags to the
sectiontitles
field
• Great option for
parsing HTML
pages
./bin/post -c post-html -filetypes html -params "fmap.content=body&capture=h2&fmap.h2=sectiontitles" example/refguide-html/
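To see what capturing h2 elements into a sectiontitles field amounts to, the same idea can be emulated outside Solr with Python's standard-library HTML parser. This is only a sketch of the concept and not how Solr Cell or Tika implement `capture`; the class name `TagCapture` and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class TagCapture(HTMLParser):
    """Collect the text inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.captured = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.captured.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text counts only while we are inside the captured tag.
        if self.depth > 0:
            self.captured[-1] += data


html = "<body><h2>Log Levels</h2><p>Text</p><h2>Rotating <code>logs</code></h2></body>"
parser = TagCapture("h2")
parser.feed(html)
sectiontitles = [t.strip() for t in parser.captured]
print(sectiontitles)
```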
More We Could Explore
• Tika gives us a lot of power to parse our documents
• tika.config opens up all of Tika’s options
• parseContext.config gives control over parser options
• Haven’t looked at default field analysis:
• Are we storing the fields the way we want? Storing too
many fields?
• Should we do different analysis to support more search
use cases?
Remaining Challenges
• Because the files were indexed locally, the documents don’t have the
correct paths for URLs
• Add a field with this information?
• Crawler option doesn’t like our pages (why?)
• Still need a front-end UI
• Haven’t solved server & maintenance questions
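For the missing-URL problem, one option is to derive the published URL from the local file path and store it in an extra field. The sketch below assumes the HTML files keep the same relative layout when published; `BASE_URL` is a placeholder rather than the Guide's real address, and the function name `url_for` is hypothetical.

```python
from pathlib import PurePosixPath

# Assumed values for illustration; substitute the real publication root.
LOCAL_ROOT = "/Applications/Solr/solr-7.5.0/example/refguide-html"
BASE_URL = "https://example.org/solr/guide/7_5"


def url_for(doc_id, local_root=LOCAL_ROOT, base_url=BASE_URL):
    """Map a locally indexed file path to the URL where the page is published."""
    # relative_to raises ValueError if the file is not under local_root.
    relative = PurePosixPath(doc_id).relative_to(local_root)
    return f"{base_url.rstrip('/')}/{relative.as_posix()}"


print(url_for("/Applications/Solr/solr-7.5.0/example/refguide-html/logging.html"))
```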
Use /browse for Front-End?
• http://localhost:8983/solr/<collection>/browse
• Most config done via solrconfig.xml & query parameters
Take Two: Site Search
What is Site Search?
• Hosted service from Lucidworks, based on Fusion
• Designed to make basic search easy to configure,
manage, and embed into a site
Fields in crawled documents
Embed into an Existing Site
• Add JS snippet to <head>
element
• Add search.html for results
• Add elements to embed:
• <cloud-search-box>
• <cloud-results>
• <cloud-tabs>
• <cloud-facets>
Queries and results
A Few Challenges
• Uniform Information Model works best when content can
conform or adapt to it
• Poorly formed HTML presents problems for machines
• Which elements hold the content?
• Which elements are nav elements?
• Default extraction of a “description” field did not allow for
good highlighting experience
• Fallback to entire <body> brought in all the navigation
“cruft”
<html>
<head> .. </head>
<body>
  <div class="container">
    <div class="row">
      <div class="col-md-9">
        <div class="post-title-main"> .. </div>
        <div class="post-content">
          <div class="main-content">
            <div class="sect1">
              <div class="sectionbody">
                <div class="paragraph">
                  <p> .. </p>
                </div>
              </div>
            </div>
          </div>
        </div>
        <footer> .. </footer>
      </div>
    </div>
  </div>
</body>
</html>
Better HTML
structure might
help…
Before:
Reader View & Search Engines have a
hard time with this structure
<html>
<head> .. </head>
<body>
  <div class="container">
    <div class="row">
      <nav class="col-md-3"> .. </nav>
      <article class="col-md-9 post-content">
        <header class="header"> .. </header>
        <nav class="toc"> .. </nav>
        <section class="content">
          <section class="sect1">
            <h2> .. </h2>
            <p> .. </p>
          </section>
        </section>
      </article>
    </div>
    <footer> .. </footer>
  </div>
</body>
</html>
After
implementing
semantic
elements
(SOLR-12746)
Reader View Improvement
Do better tags help…
• Site Search?
• Yes!
• We can define elements to extract & map those to Site
Search information model
• bin/post (Solr Cell)?
• Not yet: TIKA-985 tracks adding support for HTML5 elements
• In the meantime, Tika ignores them
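With semantic elements in place, "define the elements to extract" becomes a simple rule: keep the text inside the article, drop anything inside nav, header or footer. The sketch below illustrates that rule with Python's standard-library parser; it is not how Site Search or Tika work internally, and the class name `ArticleText` and the sample page are assumptions modeled on the "after" structure.

```python
from html.parser import HTMLParser

SKIP = {"nav", "header", "footer", "script", "style"}


class ArticleText(HTMLParser):
    """Keep text inside <article>, ignoring nav/header/footer descendants."""

    def __init__(self):
        super().__init__()
        self.in_article = 0
        self.skipping = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article += 1
        elif tag in SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.in_article:
            self.in_article -= 1
        elif tag in SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        # Only text that is inside the article and outside skipped elements is kept.
        if self.in_article and not self.skipping and data.strip():
            self.chunks.append(data.strip())


page = """
<body>
  <nav class="col-md-3">Sidebar links</nav>
  <article class="col-md-9 post-content">
    <header class="header">Logging</header>
    <nav class="toc">On this page</nav>
    <section class="content">
      <section class="sect1"><h2>Log Levels</h2><p>Set levels at runtime.</p></section>
    </section>
  </article>
  <footer>Copyright</footer>
</body>
"""

extractor = ArticleText()
extractor.feed(page)
content = " ".join(extractor.chunks)
print(content)
```

Skipping the header also drops the page title from the content, so in practice the title would be mapped to its own field.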
Search with Semantic HTML
Is Site Search the Solution?
• Hosted and managed for us
• Easy to integrate with our existing site
• Basic search features with very short set up time
• Better than a title keyword lookup
• Challenges:
• Advanced features are obscured
• Are the basic features good enough (maybe just for
now)?
Takeaways
No matter what your data
looks like, you will face
challenges
No tools have perfected this yet.
Your stuff is unique! Learn how it’s structured!
The problem isn’t always
the tool you are trying to
use
Sometimes you need to try to fix your data
(Assuming you can!)
Site Search can be a
solution for Ref Guide
search
It’s not perfect, but it’s better than today!
Questions?
Thank you!
Cassandra Targett
Director of Engineering, Lucidworks
@childerelda
#Activate18 #ActivateSearch

Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise Search
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and Beyond
 
Challenges of Simple Documents: When Basic isn't so Basic - Cassandra Targett, Lucidworks

  • 1. Challenges of Simple Documents Cassandra Targett Director of Engineering, Lucidworks @childerelda #Activate18 #ActivateSearch
  • 2. About Me • Lucene/Solr committer and member of PMC • Director of Engineering at Lucidworks • Manage team of Solr committers • Live in the Florida Panhandle
  • 3. Agenda • Looking at the Solr Reference Guide as a content source • Structure of raw documents • HTML format • Indexing the Reference Guide with bin/post • Indexing the Reference Guide with Site Search
  • 4. Solr Reference Guide as a Content Source
  • 5. Brief History of the Ref Guide 2009 First version of the Guide created by Lucidworks Guide integrated with Solr source Lucidworks donates Guide to Lucene/Solr community 2013 2017
  • 6. Moving from Confluence • What we gained: • Control over information design and presentation • Versioned Guides • Tighter integration with developers • What we lost: • Comments • Managed infrastructure • SEARCH
  • 7. Challenges with Providing Ref Guide Search as a Community Artifact Server None of this is the core mission of the committer community
  • 8. Baseline Feature Set • Full text search • Auto-complete • Suggestions/Spellcheck • Highlighting • Facets? …based on?
  • 10. Asciidoc Format Page title Text Image reference with caption Section title
  • 11. Ref Guide Content Structure • Asciidoc is relatively well-structured • headings clearly separate from general text (==) • code examples in blocks ([source,xml]) • Doesn’t include header/footer/nav “cruft” • Challenges: • Document links are to other .adoc files • No URL for access via a search result list • HTML metadata missing • No established means of indexing .adoc files
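To illustrate why the slide calls Asciidoc "relatively well-structured", here is a minimal sketch of splitting a page into its title, sections, and code listings using only the `=`/`==` heading markers and `[source,...]` blocks mentioned above. The function name `split_asciidoc` and the sample page are hypothetical, and real Asciidoc has many more constructs than this handles.

```python
import re

def split_asciidoc(text):
    """Split Asciidoc source into a page title and a list of sections.

    Simplified sketch: only '= Title', '== Section' (and deeper) headings
    and [source,...] listings delimited by ---- are recognized. Text that
    appears before the first section heading is ignored.
    """
    title = None
    sections = []            # each: {"heading": str, "text": [...], "code": [...]}
    current = None
    in_code = False          # True while inside a ---- delimited listing
    pending_source = False   # True on the line right after [source,...]

    for line in text.splitlines():
        if in_code:
            if line.strip() == "----":
                in_code = False
            elif current is not None:
                current["code"].append(line)
            continue
        if pending_source and line.strip() == "----":
            in_code = True
            pending_source = False
            continue
        pending_source = False
        m = re.match(r"^(=+)\s+(.*)$", line)
        if m and len(m.group(1)) == 1 and title is None:
            title = m.group(2).strip()
        elif m and len(m.group(1)) >= 2:
            current = {"heading": m.group(2).strip(), "text": [], "code": []}
            sections.append(current)
        elif line.startswith("[source"):
            pending_source = True
        elif current is not None and line.strip():
            current["text"].append(line.strip())
    return title, sections

# Hypothetical sample page, not taken from the Reference Guide.
sample = """= Logging
Intro text.

== Temporary Logging Settings
You can control logging.

[source,xml]
----
<Logger name="org.apache.solr" level="warn"/>
----
"""

title, sections = split_asciidoc(sample)
print(title)
print(sections[0]["heading"])
```

Even with this much structure recovered, the challenges listed on the slide remain: nothing in the `.adoc` source says what URL the page will be published at.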
  • 12. Maybe We Should Use HTML Format? • Lots of systems know how to read HTML • URLs for access already exist and are correct • Inter/intra-document links converted to correct HTML references (anchors or other pages) • Challenges: • Includes “cruft” of navs and headers/footers • HTML can be pretty unstructured
  • 14. Take One: bin/post aka, Solr Cell and Tika
  • 15. Solr’s bin/post • Simple command line tool for POSTing content • XML, JSON, CSV, HTML, PDF • Includes a basic crawler • Determines update handler to use based on file type • JSON, CSV -> /update/json, /update/csv • PDF, HTML, Word -> /update/extract • Delegates to post.jar (in example/exampledocs)
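The file-type routing described on this slide can be sketched roughly as follows. This is an illustration of the mapping as the slide states it, not the actual `SimplePostTool` logic, which may differ in detail. Sending XML to plain `/update` is an assumption of this sketch, and `handler_for` and `update_url` are hypothetical helper names.

```python
from pathlib import PurePosixPath

# Mapping as stated on the slide. The "xml" entry is an assumption of this
# sketch; the real tool's routing may differ.
HANDLER_BY_EXTENSION = {
    "json": "/update/json",
    "csv": "/update/csv",
    "xml": "/update",
}
# Rich documents (PDF, HTML, Word, ...) go to Solr Cell.
EXTRACT_HANDLER = "/update/extract"

def handler_for(filename):
    """Pick an update handler path based on the file extension."""
    ext = PurePosixPath(filename).suffix.lstrip(".").lower()
    return HANDLER_BY_EXTENSION.get(ext, EXTRACT_HANDLER)

def update_url(base, collection, filename):
    """Build the full URL a file of this type would be POSTed to."""
    return f"{base}/{collection}{handler_for(filename)}"

print(update_url("http://localhost:8983/solr", "post-html", "logging.html"))
```

For an `.html` file this yields the `/update/extract` endpoint, which is why the talk goes on to look at Solr Cell and Tika.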
  • 16. $ ./bin/post -c post-html -filetypes html example/refguide-html/
    /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java -classpath /Applications/Solr/solr-7.5.0/dist/solr-core-7.5.0.jar -Dauto=yes -Dfiletypes=html -Dc=post-html -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool example/refguide-html/
    SimplePostTool version 5.0.0
    Posting files to [base] url http://localhost:8983/solr/post-html/update...
    Entering auto mode. File endings considered are html
    Entering recursive mode, max depth=999, delay=0s
    Indexing directory example/refguide-html (245 files, depth=0)
    POSTing file requestdispatcher-in-solrconfig.html (text/html) to [base]/extract
    POSTing file client-api-lineup.html (text/html) to [base]/extract
    …
    250 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/post-html/update...
    Time spent: 0:00:07.660
  • 17. /update/extract aka Solr Cell • ExtractingRequestHandler • Uses configuration in solrconfig.xml if not defined with runtime parameters • Uses Apache Tika for content extraction & parsing • Streams documents to Solr
  • 18. Indexed document
    {
      "id":"/Applications/Solr/solr-7.5.0/example/refguide-html/language-analysis.html",
      "stream_size":[222027],
      "x_ua_compatible":["IE=edge"],
      "x_parsed_by":["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.html.HtmlParser"],
      "stream_content_type":["text/html"],
      "keywords":[" "],
      "viewport":["width=device-width, initial-scale=1"],
      "dc_title":["Language Analysis | Apache Solr Reference Guide 7.5-DRAFT"],
      "content_encoding":["UTF-8"],
      "resourcename":["/Applications/Solr/solr-7.5.0/example/refguide-html/language-analysis.html"],
      "title":["Language Analysis | Apache Solr Reference Guide 7.5-DRAFT"],
      "content_type":["text/html; charset=UTF-8"],
      "_version_":1612232702110466048}
  • 19. /update/extract in solrconfig.xml <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" > <lst name="defaults"> <str name="lowernames">true</str> <str name="fmap.meta">ignored_</str> <str name="fmap.content">_text_</str> </lst> </requestHandler>
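The `lowernames` and `fmap.*` defaults explain the field names seen in the indexed document: metadata names are lowercased with non-alphanumeric characters turned into underscores, and then `fmap.<source>=<target>` renames are applied. A minimal sketch of that order of operations follows. The functions `lowername` and `map_fields` and the sample metadata are hypothetical, and this is not the `ExtractingRequestHandler` code itself.

```python
def lowername(name):
    """Lowercase a field name and replace non-alphanumerics with '_'."""
    return "".join(c.lower() if c.isalnum() else "_" for c in name)

def map_fields(raw_fields, fmap, lowernames=True):
    """Apply lowernames first, then fmap renames, as a rough approximation."""
    mapped = {}
    for name, values in raw_fields.items():
        if lowernames:
            name = lowername(name)
        name = fmap.get(name, name)       # fmap.<source>=<target>
        mapped.setdefault(name, []).extend(values)
    return mapped

# Hypothetical metadata roughly like what Tika reports for an HTML page.
raw = {
    "X-UA-Compatible": ["IE=edge"],
    "Content-Encoding": ["UTF-8"],
    "dc:title": ["Logging"],
    "content": ["page text"],
}

# Equivalent in spirit to passing fmap.content=body at request time.
result = map_fields(raw, {"content": "body"})
print(sorted(result))
```

Running this prints the field names `body`, `content_encoding`, `dc_title`, and `x_ua_compatible`, matching the naming pattern of the indexed documents shown in the talk.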
  • 20. $ ./bin/post -c post-html -filetypes html -params "fmap.content=body" example/refguide-html/
    /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java -classpath /Applications/Solr/solr-7.5.0/dist/solr-core-7.5.0.jar -Dauto=yes -Dfiletypes=html -Dparams=fmap.content=body -Dc=post-html -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool example/refguide-html/
    SimplePostTool version 5.0.0
    Posting files to [base] url http://localhost:8983/solr/post-html/update?fmap.content=body...
    Entering auto mode. File endings considered are html
    Entering recursive mode, max depth=999, delay=0s
    Indexing directory example/refguide-html (245 files, depth=0)
    POSTing file requestdispatcher-in-solrconfig.html (text/html) to [base]/extract
    POSTing file client-api-lineup.html (text/html) to [base]/extract
    …
    250 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/post-html/update?fmap.content=body...
    Time spent: 0:00:07.347
  • 21. Indexed document with body
    {
      "id":"/Applications/Solr/solr-7.5.0/example/refguide-html/logging.html",
      "stream_size":[46314],
      "x_ua_compatible":["IE=edge"],
      "x_parsed_by":["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.html.HtmlParser"],
      "stream_content_type":["text/html"],
      "keywords":[" "],
      "viewport":["width=device-width, initial-scale=1"],
      "dc_title":["Logging | Apache Solr Reference Guide 7.5-DRAFT"],
      "content_encoding":["UTF-8"],
      "resourcename":["/Applications/Solr/solr-7.5.0/example/refguide-html/logging.html"],
      "title":["Logging | Apache Solr Reference Guide 7.5-DRAFT"],
      "content_type":["text/html; charset=UTF-8"],
      "body":[" \n \n stylesheet text/css https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css \n stylesheet css/lavish-bootstrap.css \n stylesheet css/customstyles.css \n stylesheet css/theme-solr.css \n stylesheet css/ref-guide.css \n https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js \n https://cdnjs.cloudflare.com/ajax/libs/jquery-cookie/1.4.1/jquery.cookie.min.js \n js/jquery.navgoco.min.js \n https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/js/bootstrap.min.js \n https://cdnjs.cloudflare.com/ajax/libs/anchor-js/2.0.0/anchor.min.js \n js/toc.js \n ....],
      "_version_":1612248222923751424}
  • 22. capture and captureAttr • These parameters allow putting HTML tags (XHTML elements) into separate fields in Solr • capture=<element> puts a specific element into its own field • captureAttr=true puts all attributes of elements into their own fields
  • 23. Example Using captureAttr • Captures attributes of elements • classnames • ids • Best used for something like getting href values out of <a> tags ./bin/post -c post-html -filetypes html -params "fmap.content=body&captureAttr=true" example/refguide-html/
  • 24. Example Using capture • Map everything in h2 tags to the sectiontitles field • Great option for parsing HTML pages ./bin/post -c post-html -filetypes html -params "fmap.content=body&capture=h2&fmap.h2=sectiontitles" example/refguide-html/
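Because the `-params` value is just a URL query string, it can be easier to assemble programmatically than by hand once several `capture` and `fmap` pairs are involved. A small sketch using the Python standard library follows. The helper `extract_params` is hypothetical; the parameter names (`fmap.content`, `capture`, `captureAttr`) are the ones used on the slides.

```python
from urllib.parse import urlencode

def extract_params(content_field, capture=None, capture_attr=False):
    """Build a query string suitable for bin/post's -params option.

    capture maps an element name (e.g. "h2") to the Solr field that should
    receive its text (e.g. "sectiontitles").
    """
    params = [("fmap.content", content_field)]
    for element, field in (capture or {}).items():
        params.append(("capture", element))
        params.append((f"fmap.{element}", field))
    if capture_attr:
        params.append(("captureAttr", "true"))
    return urlencode(params)

print(extract_params("body", capture={"h2": "sectiontitles"}))
# fmap.content=body&capture=h2&fmap.h2=sectiontitles
```

The resulting string can be passed in quotes as the `-params` argument, as in the commands on the slides.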
  • 25. More We Could Explore • Tika gives us a lot of power to parse our documents • tika.config opens up all of Tika’s options • parseContext.config gives control over parser options • Haven’t looked at default field analysis: • Are we storing the fields the way we want? Storing too many fields? • Should we do different analysis to support more search use cases?
  • 26. Remaining Challenges • Indexed the files locally, I don’t have the correct paths for URLs • Add a field with this information? • Crawler option doesn’t like our pages (why?) • Still need a front-end UI • Haven’t solved server & maintenance questions
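One way to approach the first challenge (local file paths instead of URLs) is to derive the public URL from the file name and send it along as an extra field. The sketch below assumes the pages are published under a single base URL and that a field named `url` exists in the schema; both are assumptions for illustration, as are the helper names. `literal.<fieldname>` is the Solr Cell parameter for supplying a field value with the request.

```python
from pathlib import PurePosixPath
from urllib.parse import urlencode

# Assumption: where the 7.5 Guide is published; adjust for your own site.
PUBLIC_BASE = "https://lucene.apache.org/solr/guide/7_5/"

def public_url(local_path, base=PUBLIC_BASE):
    """Map a local HTML file path to the URL it is assumed to be served at."""
    return base + PurePosixPath(local_path).name

def literal_url_param(local_path):
    """Build a literal.url=... parameter (the 'url' field is hypothetical)."""
    return urlencode({"literal.url": public_url(local_path)})

local = "/Applications/Solr/solr-7.5.0/example/refguide-html/logging.html"
print(public_url(local))
print(literal_url_param(local))
```

Since the value differs for every file, this would likely mean posting files one at a time rather than handing `bin/post` a whole directory with a single `-params` string.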
  • 27. Use /browse for Front-End? • http://localhost:8983/solr/<collection>/browse • Most config done via solrconfig.xml & query parameters
  • 28. Take Two: Site Search
  • 29. What is Site Search? • Hosted service from Lucidworks, based on Fusion • Designed to make basic search easy to configure, manage, and embed into a site
  • 30.
  • 31. Fields in crawled documents
  • 32.
  • 33. Embed into an Existing Site • Add JS snippet to <head> element • Add search.html for results • Add elements to embed: • <cloud-search-box> • <cloud-results> • <cloud-tabs> • <cloud-facets>
  • 35. A Few Challenges • Uniform Information Model works best when content can conform or adapt to it • Poorly formed HTML presents problems for machines • Which elements hold the content? • Which elements are nav elements? • Default extraction of a “description” field did not allow for good highlighting experience • Fallback to entire <body> brought in all the navigation “cruft”
  • 36. Better HTML structure might help… Before:
    <html>
      <head> .. </head>
      <body>
        <div class="container">
          <div class="row">
            <div class="col-md-9">
              <div class="post-title-main"> .. </div>
              <div class="post-content">
                <div class="main-content">
                  <div class="sect1">
                    <div class="sectionbody">
                      <div class="paragraph"> <p> .. </p> </div>
                    </div>
                  </div>
                </div>
              </div>
              <footer> .. </footer>
            </div>
          </div>
        </div>
      </body>
    </html>
  • 37. Reader View & Search Engines have a hard time with this structure
  • 38. After implementing semantic elements (SOLR-12746):
    <html>
      <head> .. </head>
      <body>
        <div class="container">
          <div class="row">
            <nav class="col-md-3"> .. </nav>
            <article class="col-md-9 post-content">
              <header class="header"> .. </header>
              <nav class="toc"> .. </nav>
              <section class="content">
                <section class="sect1">
                  <h2> .. </h2>
                  <p> .. </p>
                </section>
              </section>
            </article>
          </div>
          <footer> .. </footer>
        </div>
      </body>
    </html>
  • 40. Do better tags help… • Site Search? • Yes! • We can define elements to extract & map those to Site Search information model • bin/post (Solr Cell)? • No, TIKA-985 is for supporting HTML5 elements • In the meantime, they are ignored
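To show why semantic elements make extraction easier, here is a minimal sketch that keeps only text inside `<article>` and drops anything within `<nav>`, `<header>`, or `<footer>`. It uses Python's standard `html.parser` and is not how Site Search or Tika work internally. The class name `ArticleText` and the sample page (modeled on the "after" structure above) are hypothetical.

```python
from html.parser import HTMLParser

# Elements treated as navigation "cruft" in this sketch.
SKIP = {"nav", "header", "footer", "script", "style"}

class ArticleText(HTMLParser):
    """Collect headings and body text found inside <article> only."""

    def __init__(self):
        super().__init__()
        self.in_article = 0    # nesting depth of <article>
        self.skip_depth = 0    # nesting depth of skipped elements
        self.in_h2 = False
        self.headings = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article += 1
        elif tag in SKIP:
            self.skip_depth += 1
        elif tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article -= 1
        elif tag in SKIP:
            self.skip_depth -= 1
        elif tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.in_article or self.skip_depth:
            return
        (self.headings if self.in_h2 else self.chunks).append(text)

# Hypothetical page following the semantic structure shown in the talk.
page = """<html><body><div class="container"><div class="row">
<nav class="col-md-3">Site navigation</nav>
<article class="col-md-9 post-content">
<header class="header">Logging</header>
<nav class="toc">On this page</nav>
<section class="content"><section class="sect1">
<h2>Temporary Logging Settings</h2>
<p>You can control logging.</p>
</section></section>
</article></div>
<footer>Copyright</footer></div></body></html>"""

parser = ArticleText()
parser.feed(page)
print(parser.headings)
print(parser.chunks)
```

With the earlier all-`<div>` markup, the same filtering would have to rely on class names such as `post-content`, which is more fragile. Note that this sketch also drops the `<header>` text, so a page title would need to come from elsewhere, such as the `<title>` element.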
  • 42. Is Site Search the Solution? • Hosted and managed for us • Easy to integrate with our existing site • Basic search features with very short set up time • Better than a title keyword lookup • Challenges: • Advanced features are obscured • Are the basic features good enough (maybe just for now)?
  • 44. No matter what your data looks like, you will face challenges No tools have perfected this yet. Your stuff is unique! Learn how it’s structured!
  • 45. The problem isn’t always the tool you are trying to use Sometimes you need to try to fix your data (Assuming you can!)
  • 46. Site Search can be a solution for Ref Guide search It’s not perfect, but it’s better than today!
  • 48. Thank you! Cassandra Targett Director of Engineering, Lucidworks @childerelda #Activate18 #ActivateSearch