The document discusses Lux, an open source project that allows querying XML documents stored in Lucene using XQuery. Lux indexes XML documents while maintaining their tag structure and context. It supports both precise XPath queries and more general searches over tagged text using contextual indexes. Lux can be used for complex queries over semi-structured data, content analysis, and building applications with an XQuery backend.
Querying rich text with XQuery
1. Querying Rich Text
with Lucene XQuery
Michael Sokolov
Senior Architect
Safari Books Online
2. ! Overview of Lux
! Why we want a rich(er) query language
! Implementation Stories
! Indexing tagged text
! Storing documents in Lucene
! Lazy searching
! Demo
The plan for this talk
3. ! XQuery in Solr
! Query optimizer
! Efficient XML document format
! XQuery function library
! as a Java library (Lucene only)
! as Solr plugins
! as a standalone App Server
What is Lux?
6. ! maybe it was once, 10 years ago?
! Legacy stuff: DTDs, namespaces, etc
! arcane Java programming interfaces
! Don’t we use JSON now?
! so why do we care about it?
XML is not cool
7. ! There’s a huge amount of it out there
! HTML is XML, or can be
! Lux is about making it easy (and free) to deal
with XML
But it still matters
8. ! We make content-rich sites:
! our own site: safaribooksonline.com
! our clients' sites: oed.com, degruyter.com,
oxfordreference.com, …
! Publishers provide us with content
! we debug content problems
! we add new features nimbly
! Piles of random data (XML, mostly)
Why did we make it?
9. ! Complex queries over semi-structured data, typically
documents
! You don’t need it for edismax-style “quick” search
! or highly-structured data
! XQuery comes with a rich function library;
! rich string, numeric and date functions
! extensions for HTTP, filesystem, zip
How can XQuery help?
10. [Architecture diagram. Components shown: DispatchFilter, UpdateProcessor, XML Indexer, XML text fields, Tagged TokenStream, XPath fields, Tinybin storage, External Field Codec, QueryComponent, QParserPlugin, Evaluator, Saxon XQuery/XSLT Processor, XQuery Function Library, Lazy Searcher, ResponseWriter, Compiler, Optimizer, Tagged Highlighter]
How does Lux work?
11. ! “hamlet”
! “hamlet” in //title
! “hamlet” in //scene/title, //speaker, etc…
! XQuery, but we need an index
! DIH XPathEntityProcessor
! But are XPath indexes enough?
XML is text with context
12. ! In which speeches does Hamlet talk about poison?
! +speaker:Hamlet +line:poison
! Works great if we indexed speaker and line for each
speech
! What if we only indexed at the scene level?
! What if we just indexed speech text as a field?
! XPath indexes are precise and fine-grained
! Great when you know exactly what you need
How do we index context?
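The `+speaker:Hamlet +line:poison` query above only works when each speech is indexed as its own unit. The following is a minimal Python sketch of that idea, not Lux or Lucene code: it treats every `<speech>` as a document, builds a tiny inverted index over `speaker` and `line`, and evaluates the two required clauses as a set intersection. The sample text and the helper names (`build_index`, `search_all`) are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy sample data (hypothetical; not the actual text of the play).
PLAY = """
<play>
  <speech><speaker>Hamlet</speaker><line>the poison is in the cup</line></speech>
  <speech><speaker>Horatio</speaker><line>here is yet some poison left</line></speech>
  <speech><speaker>Hamlet</speaker><line>to be or not to be</line></speech>
</play>
"""

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(xml_text):
    """Index each <speech> as its own document with 'speaker' and 'line' fields."""
    index = defaultdict(set)  # (field, term) -> set of speech ids
    speeches = ET.fromstring(xml_text).findall(".//speech")
    for doc_id, speech in enumerate(speeches):
        for field in ("speaker", "line"):
            for el in speech.findall(field):
                for term in tokenize(el.text or ""):
                    index[(field, term)].add(doc_id)
    return index

def search_all(index, *clauses):
    """Evaluate required clauses (+field:term ...) as a set intersection."""
    postings = [index.get(clause, set()) for clause in clauses]
    return set.intersection(*postings) if postings else set()

index = build_index(PLAY)
hits = search_all(index, ("speaker", "hamlet"), ("line", "poison"))
print(sorted(hits))
```

If the same text were indexed once per scene instead, both terms would land in the same document whenever any speaker in the scene mentions poison, which is the loss of context the slide asks about.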
13. <play>
<title>Hamlet</title>
<act act="1">
<scene act="1" scene="1">
<title>SCENE I. Elsinore ... </title>
Index        Values
Tags         title, act, @act
Tag Paths    /play, /play/title, /play/act, /play/act/@act
Text         hamlet, scene, elsinore
Tagged Text  play:hamlet, title:hamlet, @act:1
XPath        user-defined XPath 2.0 expression; e.g. count(//line), replace(//title, 'SCENE|ACT S+','')
Contextual Indexes
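As a rough illustration of how the index values on this slide could be derived from the sample document, here is a Python sketch using only the standard library. It is not how Lux computes them; the closing tags added to the sample, the tokenizer, and the function name `index_values` are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# The slide's fragment, closed off so that it parses (the closing tags are an assumption).
SAMPLE = """<play>
  <title>Hamlet</title>
  <act act="1">
    <scene act="1" scene="1">
      <title>SCENE I. Elsinore ... </title>
    </scene>
  </act>
</play>"""

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", (text or "").lower())

def index_values(xml_text):
    """Collect tag names, tag paths, text tokens and tag-prefixed tokens."""
    values = {"tags": set(), "paths": set(), "text": set(), "tagged": set()}

    def add_text(text, stack):
        # Each token is recorded once as plain text and once per enclosing tag.
        for tok in tokenize(text):
            values["text"].add(tok)
            for tag in stack:
                values["tagged"].add(tag + ":" + tok)

    def visit(el, ancestors):
        stack = ancestors + [el.tag]
        path = "/" + "/".join(stack)
        values["tags"].add(el.tag)
        values["paths"].add(path)
        for name, value in el.attrib.items():
            values["tags"].add("@" + name)
            values["paths"].add(path + "/@" + name)
            for tok in tokenize(value):
                values["tagged"].add("@" + name + ":" + tok)
        add_text(el.text, stack)
        for child in el:
            visit(child, stack)
            add_text(child.tail, stack)  # text after a child belongs to this element

    visit(ET.fromstring(xml_text), [])
    return values

values = index_values(SAMPLE)
for name in ("tags", "paths", "text", "tagged"):
    print(name, sorted(values[name]))
```

The user-defined XPath index is left out because it depends on an XPath 2.0 engine, which the Python standard library does not provide.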
14. ! Tagged Text, Path index
! Imprecise, generic indexes, but more context
than just full text
! XQuery post-processing to patch over the gaps
! Query optimizer applies indexes
! For when you don’t want to sweat the details:
ad hoc queries, content analysis and debugging
General purpose indexes
15. ! Wraps an existing Analyzer (for the text)
! Responds to XML events (start element, etc)
! Maintains a tag name stack
! Emits each token prefixed by enclosing tags
<scene><speech>
<speaker>Hamlet</speaker>
<line>To be or not to be, … </line>
…
Tag stack for "Hamlet": scene, speech, speaker, …
Tag stack for "To": scene, speech, line, …
Tag stack for "be": scene, speech, line
Tokens emitted:
ttext:scene:hamlet pos=1
ttext:speech:hamlet pos=1
ttext:speaker:hamlet pos=1
ttext:scene:to pos=2
ttext:speech:to pos=2
…
TaggedTokenStream
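The behaviour described for TaggedTokenStream (respond to XML events, keep a tag name stack, emit each token once per enclosing tag at the same position) can be sketched in a few lines of Python with `xml.sax`. This is only an illustration of the technique, not the Lux implementation: a regular expression stands in for the wrapped Lucene Analyzer, the field-name prefix shown on the slide is omitted, and the class and function names are hypothetical.

```python
import re
import xml.sax

class TaggedTokenHandler(xml.sax.ContentHandler):
    """Emits (tag:term, position) pairs, one per enclosing tag of each word."""

    def __init__(self):
        super().__init__()
        self.stack = []     # names of the currently open elements
        self.buffer = []    # text chunks seen since the last tag event
        self.position = 0   # position of the most recent word
        self.tokens = []    # emitted (term, position) pairs

    def _flush(self):
        # SAX may deliver text in several chunks, so tokenize only at tag boundaries.
        text = "".join(self.buffer)
        self.buffer = []
        for word in re.findall(r"\w+", text.lower()):
            self.position += 1
            for tag in self.stack:
                self.tokens.append(("%s:%s" % (tag, word), self.position))

    def startElement(self, name, attrs):
        self._flush()
        self.stack.append(name)

    def endElement(self, name):
        self._flush()
        self.stack.pop()

    def characters(self, content):
        self.buffer.append(content)

def tagged_tokens(xml_text):
    handler = TaggedTokenHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens

SPEECH = ("<scene><speech><speaker>Hamlet</speaker>"
          "<line>To be or not to be</line></speech></scene>")
tokens = tagged_tokens(SPEECH)
for term, pos in tokens[:5]:
    print(term, "pos=%d" % pos)
```

All variants of a word share one position, so positional queries still see the text as a single sequence regardless of how many tags enclose it.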
17. ! Generic JSON index
! Overlapping tags (part-of-speech, phrase-labeling, NLP)
! citation classification w/probabilistic labeling
! One stored field for all the text makes highlighting easier
! One Lucene field means you can use PhraseQuery, eg:
PhraseQuery(<speaker:hamlet <speech:to) finds all
speeches by Hamlet starting with “to”.
Tagged token examples
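To show why keeping all tagged tokens in one field makes phrase queries possible, here is a small Python sketch of positional phrase matching over tag-prefixed tokens. It does not use Lucene's PhraseQuery; the `<` prefix from the slide's example is dropped, and the sample speeches and the helpers `speech_tokens`, `build_positional_index` and `phrase_query` are hypothetical.

```python
from collections import defaultdict

def speech_tokens(speaker, line):
    """Tagged tokens for one speech: the speaker at position 1, line words after it."""
    tokens = [("speech:" + speaker, 1), ("speaker:" + speaker, 1)]
    for pos, word in enumerate(line.split(), start=2):
        tokens += [("speech:" + word, pos), ("line:" + word, pos)]
    return tokens

def build_positional_index(docs):
    """docs: {doc_id: [(term, position), ...]} -> {term: {doc_id: set(positions)}}"""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, tokens in docs.items():
        for term, pos in tokens:
            index[term][doc_id].add(pos)
    return index

def phrase_query(index, terms):
    """Return ids of documents in which the terms occur at consecutive positions."""
    if not terms or any(term not in index for term in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(start + i in index[term].get(doc_id, ())
                   for i, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

# Toy speeches (hypothetical sample data).
docs = {
    1: speech_tokens("hamlet", "to be or not to be"),
    2: speech_tokens("hamlet", "not to be"),
    3: speech_tokens("horatio", "to be"),
}
index = build_positional_index(docs)
print(phrase_query(index, ["speaker:hamlet", "speech:to"]))
```

Only speech 1 matches: speech 2 is by Hamlet but starts with a different word, and speech 3 starts with "to" but has a different speaker.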
18. ! stored document = 100%
! qnames = +1.3%
! paths = +2.4%
! text tokens = 18%
! tagged text (opaque) = 18%
! tagged text (all transparent) = 71%
What’s the cost?
19. subsequence(
  for $doc in collection()[.//SPEAKER="Hamlet"]
  order by $doc/lux:key("title")
  return $doc,
  1000, 20)
subsequence(
  lux:search("<SPEAKER:Hamlet", "title", 1000)[.//SPEAKER="Hamlet"],
  1, 20)
Query optimization
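The rewrite on this slide keeps the exact predicate but puts an index search in front of it, so that only candidate documents need to be loaded and checked. The Python sketch below illustrates that filter-then-verify pattern on a toy corpus; it is not the Lux optimizer, sorting and paging are left out, and the corpus and function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Toy corpus (hypothetical sample data).
DOCS = {
    "a": "<SPEECH><SPEAKER>Hamlet</SPEAKER><LINE>To be</LINE></SPEECH>",
    "b": "<SPEECH><SPEAKER>Ghost of Hamlet</SPEAKER><LINE>Mark me</LINE></SPEECH>",
    "c": "<SPEECH><SPEAKER>Horatio</SPEAKER><LINE>Hamlet, my lord</LINE></SPEECH>",
}

def exact_match(xml_text):
    """Exact check, in the spirit of [.//SPEAKER = "Hamlet"] for leaf elements."""
    root = ET.fromstring(xml_text)
    return any(el.text == "Hamlet" for el in root.iter("SPEAKER"))

def build_tag_index(docs):
    """Imprecise tagged-text index: (tag, lowercased word) -> set of document ids."""
    index = {}
    for doc_id, xml_text in docs.items():
        for el in ET.fromstring(xml_text).iter():
            for word in re.findall(r"\w+", (el.text or "").lower()):
                index.setdefault((el.tag, word), set()).add(doc_id)
    return index

def naive(docs):
    """Unoptimized plan: parse and test every document."""
    return sorted(d for d, xml_text in docs.items() if exact_match(xml_text))

def optimized(docs, index):
    """Optimized plan: index lookup for candidates, then the exact check."""
    candidates = index.get(("SPEAKER", "hamlet"), set())
    return sorted(d for d in candidates if exact_match(docs[d]))

INDEX = build_tag_index(DOCS)
candidates = INDEX.get(("SPEAKER", "hamlet"), set())
print(sorted(candidates), naive(DOCS), optimized(DOCS, INDEX))
```

The index lookup returns documents "a" and "b" because both contain the word in a SPEAKER element; the exact predicate then rejects "b", which is why the predicate stays in the rewritten query.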
20. ! Lux uses Lucene as its primary document store
! Lux tinybin (based on Saxon TinyTree) storage
format avoids XML parsing overhead
! Experimental new codec stores fields as files
Document storage
21. ! Problem: “big” stored fields
! Text documents get stored for highlighting
! Take time to copy when merging
! Can we do better by storing as files, but
managing w/Lucene?
“Big” binary stored fields
23. ! Real-time deletes
! Track deletions when merging
! Keep commits with IndexDeletionPolicy
! Delete unmerged (empty) segments
! Off-line deletes
! Cleanup tool traverses entire index
Deleting is complicated
24. ! 2-3x write speedup for unindexed stored fields
! a bit slower in the worst case
! But, text analysis can take most of the time
! Net: useful if you are storing large binaries
Codec Performance
(preliminary)
25. ! custom DispatchFilter provides:
! HTTP request/response handling in XQuery
! file uploads, redirects
! Ability to roll your own: cookies, authentication
! Rapid prototyping, testing query performance,
relevance, in an application setting
App Server
26. ! Yes, but did you remember to index all the
fields you need in advance?
! Yes, but did you want to format the result into a
nice report *using your query language*?
! Yes, but did you want access to a complete
XPath 2.0 implementation in your indexer?
Isn’t Solr enough?
27. ! Find some sample content with a new tag we need
to support
! Perform complex updates to patch broken content
! Troubleshoot content
! Explore unfamiliar content
! Write prototypes and admin tools entirely in HTML,
JS and XQuery
! Demo: http://localhost:8080
Example uses
28. ! Downloads and Documentation at
http://luxdb.org
! Source code at http://github.com/msokolov/lux
! Freely available under OSS license (MPL 2)
! Contributions welcome
! Thank you, Safari Books!
Thank You!