Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger

2019.04.25
Natural Language Search
with Knowledge Graphs
Trey Grainger
Chief Algorithms Officer, Lucidworks

Trey Grainger
Chief Algorithms Officer
• Previously: SVP of Engineering @ Lucidworks; Director of Engineering @ CareerBuilder
• Georgia Tech – MBA, Management of Technology
• Furman University – BA, Computer Science, Business, & Philosophy
• Stanford University – Information Retrieval & Web Search
Other fun projects:
• Co-author of Solr in Action, plus numerous research publications
• Advisor to Presearch, the decentralized search engine
• Lucene / Solr contributor
About Me

Agenda
• About Lucidworks
• What is a Knowledge Graph (and related Terminology)?
• What is Natural Language Search?
• Philosophy of Language (enough to get the approach…)
• Solr’s Semantic Knowledge Graph
• Knowledge Graph Goals for Natural Language Search
• Semantic Query Parsing
• Solr Text Tagger
• Solr Statistical Phrase Identifier
• Full Knowledge Graph Capabilities with Solr
• Automated Graph Generation
• Demos!

Relevance Engineering Sophistication
Basic Keyword Search
(inverted index, tf-idf, bm25, multilingual text analysis, query formulation, etc.)
Taxonomies / Entity Extraction
(entity recognition, basic ontologies, synonyms, etc.)
Query Intent
(query classification, semantic query parsing, knowledge graphs, concept expansion, rules, clustering, classification)
Relevancy Tuning
(signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)
Self-learning
Context for this Talk

The Search & AI Conference
COMPANY BEHIND
Who are we?
230 CUSTOMERS ACROSS THE
FORTUNE 1000
400+ EMPLOYEES
OFFICES IN
San Francisco, CA (HQ)
Raleigh-Durham, NC
Cambridge, UK
Bangalore, India
Hong Kong
Employ about 40% of the active committers on the Solr project
Contribute over 70% of Solr's open source codebase
DEVELOP & SUPPORT
Apache

Industry’s most powerful
Intelligent Search & Discovery Platform.

Let the most respected
analysts in the world
speak on our behalf
Dassault Systèmes
Mindbreeze
Coveo
Microsoft
Attivio
Expert System
Smartlogic
Sinequa
IBM
IHS Markit
Funnelback
Micro Focus
COMPLETENESS OF VISION
ABILITY TO EXECUTE
CHALLENGERS LEADERS
NICHE PLAYERS VISIONARIES
Source: June 2018 Gartner Magic Quadrant report on Insight Engines.
© Gartner, Inc.

Call for Speakers Open until May 8th, 2019!

What is a Knowledge Graph?
(vs. Ontology vs. Taxonomy vs. Synonyms, etc.)

Simplistic Definitions
Ontology: Defines relationships between types of things
[ animal eats food; human is animal ]
Knowledge Graph: Instantiation of an Ontology (contains the
things that are related)
[ john is human; john eats food ]
Taxonomy: Classifies things into Categories
[ john is Human; Human is Mammal; Mammal is Animal ]
Synonyms List: Provides substitute words that can be used to
represent the same or very similar things
[ human => homo sapien, mankind; food => sustenance, meal ]
Yes, there is overlap…
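The four definitions above can be made concrete with small data structures. This is a minimal sketch that reuses only the slide's own examples; the variable names and the `is_a` helper are illustrative assumptions, not part of any library.

```python
# Ontology: relationships between types of things.
ontology = {("animal", "eats", "food"), ("human", "is", "animal")}

# Knowledge graph: the ontology instantiated with concrete things.
knowledge_graph = {("john", "is", "human"), ("john", "eats", "food")}

# Taxonomy: each thing or category points to its parent category.
taxonomy = {"john": "human", "human": "mammal", "mammal": "animal"}

# Synonyms list: substitute words for the same or very similar things.
synonyms = {
    "human": ["homo sapien", "mankind"],
    "food": ["sustenance", "meal"],
}


def is_a(thing: str, category: str) -> bool:
    """Walk up the taxonomy to test whether `thing` falls under `category`."""
    current = thing
    while current in taxonomy:
        current = taxonomy[current]
        if current == category:
            return True
    return False


print(is_a("john", "animal"))  # walks john -> human -> mammal -> animal
```

The overlap the slide mentions shows up here too: the taxonomy is just the "is" edges of the graph stored as a parent map.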

For Solr, I strongly disagree…
back to that later with demos

What is
Natural Language Search?

What kind of Knowledge Graph
can help us with the
kinds of problems we encounter in
Search use cases?

Knowledge
Graph
Challenges of building a traditional knowledge graph
Because current knowledge bases / ontology learning systems typically
require explicitly modeling nodes and edges into a graph ahead of time, this
unfortunately presents several limitations to the use of such a knowledge graph:
• Entities not modeled explicitly as nodes have no known relationships to any other
entities.
• Edges exist between nodes, but not between arbitrary combinations of nodes, and therefore
such a graph is not ideal for representing nuanced meanings of an entity when appearing
within different contexts, as is common within natural language.
• Substantial meaning is encoded in the linguistic representation of the domain that is
lost when the underlying textual representation is not preserved: phrases, interaction of
concepts through actions (i.e. verbs), positional ordering of entities and the phrases containing
those entities, variations in spelling and other representations of entities, the use of adjectives
to modify entities to represent more complex concepts, and aggregate frequencies of
occurrence for different representations of entities relative to other representations.
• It can be an arduous process to create robust ontologies, map a domain into a graph
representing those ontologies, and ensure the generated graph is compact, accurate,
comprehensive, and kept up to date.
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A
compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

most often used in
reference to

My Three Philosophical Assertions
1) Unstructured data is actually “hyper-structured” data. It is a
graph that contains much more structure than typical “structured
data.”
2) That graph is very rich, but is a compression of meaning into a
lossy format (text). Much of data science is essentially the
decompression from this lossy format into a reconstituted form.
3) Most Important: Every instance of a word or phrase you ever
encounter has a unique meaning.

Assertion 1:
Unstructured data is actually
“hyper-structured” data. It is a
graph that contains much
more structure than typical
“structured data.”

Structured Data
Employees Table
id      name            company   start_date
lw100   Trey Grainger   1234      2016-02-01
dis2    Mickey Mouse    9123      1928-11-28
tsla1   Elon Musk       5678      2003-07-01
Companies Table
id      name         start_date
1234    Lucidworks   2016-02-01
5678    Tesla        1928-11-28
9123    Disney       2003-07-01
Discrete
Values
Continuous
Values
Foreign
Key

Unstructured Data
Trey Grainger works at Lucidworks.
He is speaking at Haystack 2019. #HaystackConf
(Haystack) is being held in Charlottesville April 22-25,
2019. Trey got his master’s from Georgia Tech.

Trey Grainger works for Lucidworks.
He is speaking at Haystack 2019.
#HaystackConf
(Haystack) is being held in
Charlottesville April 22-25, 2019.
Trey got his master’s degree from
Georgia Tech.
Trey’s Voicemail
Unstructured Data

Foreign Key?

Fuzzy Foreign Key? (Entity Resolution)

Fuzzier Foreign Key? (metadata, latent features)

Not so fast!

Giant Graph of Relationships...

Assertion 2:
That graph is very rich, but is a
compression of meaning into a lossy
format (text). Much of data science
is essentially the decompression
from this lossy format into a
reconstituted form.

Semantic Data Encoded into Free Text Content

How do we easily harness this
“semantic graph” of relationships
within unstructured information?

Search Engines are really good at
querying across character sequences,
term sequences, and documents
Example Queries:
c?o => CTO, CEO, CFO, …
"VP Engineering"~2 => “VP of Engineering”, “VP Engineering”, “Engineering VP”, “VP of Infrastructure Engineering”
(Microsoft OR MS) AND Word => “MS Word”, “Microsoft Word”

/solr/collection/select/?q=apache solr
Term     Documents
…        …
apache   doc1, doc3, doc4, doc5
…        …
hadoop   doc2, doc4, doc6
…        …
solr     doc1, doc3, doc4, doc7, doc8
…        …
[Venn diagram: apache only: doc5; solr only: doc7, doc8; apache solr (both terms): doc1, doc3, doc4]
Matching queries to documents
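Matching a query against an inverted index comes down to set operations on postings lists. A minimal sketch using the three terms from the slide; the function names are illustrative, and a real engine such as Lucene uses compressed, sorted postings and not Python sets.

```python
# Toy inverted index copied from the slide: term -> set of document ids.
inverted_index = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match_all(terms):
    """Return ids of documents containing every term (AND semantics)."""
    postings = [inverted_index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def match_any(terms):
    """Return ids of documents containing at least one term (OR semantics)."""
    return set().union(*(inverted_index.get(term, set()) for term in terms))


print(sorted(match_all(["apache", "solr"])))  # ['doc1', 'doc3', 'doc4']
print(sorted(match_any(["apache", "solr"])))
```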

Documents
id: 1
job_title: Software Engineer
desc: software engineer at a great company
skills: .Net, C#, java
id: 2
job_title: Registered Nurse
desc: a registered nurse at hospital doing hard work
skills: oncology, phlebotomy
id: 3
job_title: Java Developer
desc: a software engineer or a java engineer doing work
skills: java, scala, hibernate
Docs-Terms Forward Index
field       doc   terms
desc        1     a, at, company, engineer, great, software
desc        2     a, at, doing, hard, hospital, nurse, registered, work
desc        3     a, doing, engineer, java, or, software, work
job_title   1     Software Engineer
…           …     …
Terms-Docs Inverted Index
field       term             doc   pos
desc        a                1     4
                             2     1
                             3     1, 5
            at               1     3
                             2     4
            company          1     6
            doing            2     6
                             3     8
            engineer         1     2
                             3     3, 7
            great            1     5
            hard             2     7
            hospital         2     5
            java             3     6
            nurse            2     3
            or               3     4
            registered       2     2
            software         1     1
                             3     2
            work             2     8
                             3     9
job_title   java developer   3     1
…           …                …     …
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Knowledge
Graph
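Both index structures can be derived from the documents in a single pass. A minimal sketch over the three `desc` values; the lowercase whitespace tokenizer and 1-based positions are assumptions of this example, not a description of Solr's analysis chain.

```python
from collections import defaultdict

docs = {
    1: "software engineer at a great company",
    2: "a registered nurse at hospital doing hard work",
    3: "a software engineer or a java engineer doing work",
}

forward_index = {}                                        # doc id -> sorted unique terms
inverted_index = defaultdict(lambda: defaultdict(list))   # term -> doc id -> positions

for doc_id, text in docs.items():
    tokens = text.lower().split()                         # whitespace tokenizer (assumption)
    forward_index[doc_id] = sorted(set(tokens))
    for position, term in enumerate(tokens, start=1):
        inverted_index[term][doc_id].append(position)

print(forward_index[1])
print(dict(inverted_index["engineer"]))
```

The forward index answers "which terms are in this document?", the inverted index answers "which documents (and positions) contain this term?". Graph traversal alternates between the two.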

Serves as a “data science toolkit” API that allows dynamically navigating and pivoting through multiple
levels of relationships between items in a domain.
Semantic Knowledge Graph API
Core similarity engine, exposed via API
Any product can leverage the core relationship scoring
engine to score any list of entities against any other list
Full domain support
Keywords, categories, tags, based upon any field on your
documents. Graph is built automatically from the
content representing your domain.
Intersections, overlaps, & relationship
scoring, many levels deep
Users can either provide a list of items to score, or else
have the system dynamically discover the most related
items (or both).
Knowledge
Graph

DOI: 10.1109/DSAA.2016.51
Conference: 2016 IEEE International Conference on
Data Science and Advanced Analytics (DSAA)
Knowledge
Graph
Graph Traversal
[Figure, Data Structure View: documents (doc 1 through doc 6) linked to skill terms (Java, Scala, Hibernate, Oncology) and to job_title terms (Software Engineer, Data Scientist, Java Developer, …) through alternating Inverted Index Lookup and Forward Index Lookup steps.]
[Figure, Graph View: nodes Java, Java Developer, Hibernate, Scala, Software Engineer, Data Scientist; edge types has_related_skill and has_related_job_title.]

Knowledge
Graph
How the Graph Traversal Works
[Figure: the same traversal in three views. Data Structure View: skill terms (Java, Scala, Hibernate, Oncology) linked to doc 1 through doc 6. Set-theory View: document sets for Java, Scala, and Hibernate with regions labeled "docs 1, 2, 6" and "docs 3, 4", and an Oncology set labeled "doc 5". Graph View: skill nodes Java, Scala, Hibernate, and Oncology.]
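One hop of the traversal can be sketched as an inverted-index lookup followed by a forward-index lookup. The corpus below is a hypothetical toy example, not the data behind the figure, and raw co-occurrence counts stand in for the relatedness scores covered later in the talk.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: doc id -> skills (this doubles as the forward index).
doc_skills = {
    1: ["java", "scala"],
    2: ["java", "hibernate"],
    3: ["java", "scala", "hibernate"],
    4: ["oncology"],
}

# Inverted index: skill -> set of doc ids.
skill_docs = defaultdict(set)
for doc_id, skills in doc_skills.items():
    for skill in skills:
        skill_docs[skill].add(doc_id)


def related_skills(skill):
    """Traverse skill -> docs (inverted lookup) -> co-occurring skills (forward lookup)."""
    counts = Counter()
    for doc_id in skill_docs.get(skill, set()):
        for other in doc_skills[doc_id]:
            if other != skill:
                counts[other] += 1
    return counts


print(related_skills("java"))
```

No edge between "java" and "scala" was ever stored; it is materialized at query time from the documents the two terms share.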

Scoring of Node Relationships (Edge Weights)
Foreground vs. Background Analysis
Every term is scored against its context. The more
commonly the term appears within its foreground
context versus its background context, the more
relevant it is to the specified foreground context.
z = (countFG(x) - totalDocsFG * probBG(x)) / sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
{ "type":"keywords", "values":[
{ "value":"hive", "relatedness":0.9773, "popularity":369 },
{ "value":"java", "relatedness":0.9236, "popularity":15653 },
{ "value":".net", "relatedness":0.5294, "popularity":17683 },
{ "value":"bee", "relatedness":0.0, "popularity":0 },
{ "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
{ "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
+
-
Foreground Query:
"Hadoop"
Knowledge
Graph
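The z-score on the slide translates directly into code. A minimal sketch with hypothetical counts; the step that maps z into the -1 to 1 `relatedness` range shown in the JSON output is not given on the slide and is not reproduced here.

```python
import math


def z_score(count_fg: int, total_docs_fg: int, prob_bg: float) -> float:
    """z-score of a term's foreground count against its background probability."""
    expected = total_docs_fg * prob_bg
    variance = total_docs_fg * prob_bg * (1.0 - prob_bg)
    if variance <= 0:
        return 0.0  # assumption: treat degenerate cases as "no signal"
    return (count_fg - expected) / math.sqrt(variance)


# Hypothetical counts: the term appears in 50 of 100 foreground docs,
# but in only 10% of the background corpus.
print(z_score(count_fg=50, total_docs_fg=100, prob_bg=0.1))
```

A term that appears in the foreground exactly as often as the background predicts scores 0, which is why such terms are effectively ignored.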

Knowledge
Graph
Multi-level Graph Traversal with Scores
[Figure: Starting Node "software engineer*" (materialized node) => has_related_skill => Skill Nodes => has_related_skill => Skill Nodes => Job Title Nodes. Skill nodes shown: Java, C#, .NET, Hibernate, Scala, VB.NET. Job title nodes shown: .NET Developer, Java Developer, Software Engineer, Data Scientist. Each edge carries a score; the scores shown range from 0.34 to 0.93.]

Related term vector (for query concept expansion)
http://localhost:8983/solr/stack-exchange-health/skg

Content-based Recommendations (More Like This on Steroids)
http://localhost:8983/solr/job-postings/skg

Who’s in Love with Jean Grey?

Assertion 2 (Summary):
That graph is very rich, but is a
compression of meaning into a lossy
format. Much of data science is
essentially the decompression from
this lossy format into a reconstituted
form.

Assertion 3:
Every instance of a word or phrase you
ever encounter has a unique meaning.

Differentiating related terms
Misspellings: managr => manager
Synonyms: cpa => certified public accountant
rn => registered nurse
r.n. => registered nurse
Ambiguous Terms*: driver => driver (trucking) ~80% likelihood
driver => driver (software) ~20% likelihood
Related Terms: r.n. => nursing, bsn
hadoop => mapreduce, hive, pig
*differentiated based upon user and query context

Thought Exercise
What do you think of when I say the
word “driver”?
What about “architect”?

Use Case: Query Disambiguation
Example     Related Keywords (representing multiple meanings)
driver      truck driver, linux, windows, courier, embedded, cdl, delivery
architect   autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
…           …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.

A few methodologies:
1) Query Log Mining
2) Semantic Knowledge Graph
Knowledge Graph

Query Log Mining: Discovering ambiguous phrases
1) Classify users who ran each search in the search logs (i.e. by the job title
classifications of the jobs to which they applied)
2) Create a probabilistic graphical model of those classifications mapped
to each keyword phrase.
3) Segment the search term => related search terms list by classification,
to return a separate related terms list per classification
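The three steps can be approximated with plain frequency counts. This is a deliberately simplified sketch: the log rows are hypothetical, and raw conditional frequencies stand in for the probabilistic graphical model described in the cited paper.

```python
from collections import Counter, defaultdict

# Hypothetical search-log rows: (user classification, search term, co-searched term).
log = [
    ("software", "driver", "linux"),
    ("software", "driver", "embedded"),
    ("trucking", "driver", "cdl"),
    ("trucking", "driver", "courier"),
    ("trucking", "driver", "delivery"),
]

class_counts = defaultdict(Counter)                   # term -> classification -> count
related = defaultdict(lambda: defaultdict(Counter))   # term -> classification -> related terms

for classification, term, co_term in log:
    class_counts[term][classification] += 1
    related[term][classification][co_term] += 1


def sense_probabilities(term):
    """Estimate P(classification | term) from raw log frequencies."""
    counts = class_counts[term]
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


print(sense_probabilities("driver"))
print({c: sorted(terms) for c, terms in related["driver"].items()})
```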

Semantic Knowledge Graph: Discovering ambiguous phrases
1) Exact same concept, but use
a document classification
field (i.e. category) as the first
level of your graph, and the
related terms as the second
level to which you traverse.
2) Has the benefit that you don’t need query logs to mine, but it will be representative
of your data, as opposed to your users’ intent, so the quality depends on how clean
and representative your documents are.
Additional Benefit: Multi-dimensional disambiguation and dynamic materialization of
categories. Effectively a dynamically-materialized probabilistic graphical model.

Disambiguated meanings (represented as term vectors)
Example     Related Keywords (Disambiguated Meanings)
architect   1: enterprise architect, java architect, data architect, oracle, java, .net
            2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer
driver      1: linux, windows, embedded
            2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer    1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
            2: graphic, web designer, design, web design, graphic design, graphic designer
            3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit
…           …

Using the disambiguated meanings
In a situation where a user searches for an ambiguous phrase, what information can we
use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
• User is a software engineer
• User has previously run searches for “c++” and “linux”
2. Context within the query:
User searched for windows AND driver vs. courier OR driver
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
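The selection logic above can be sketched as a term-overlap check with a fallback. The sense vectors come from the slide; ordering them with the trucking sense first reflects the ~80% likelihood quoted earlier, and the `pick_sense` helper and its overlap scoring are assumptions of this sketch.

```python
# Disambiguated senses for "driver", most common sense first.
senses = {
    "driver": [
        {"truck driver", "cdl driver", "delivery driver", "class b driver", "cdl", "courier"},
        {"linux", "windows", "embedded"},
    ]
}


def pick_sense(term, context_terms):
    """Pick the sense sharing the most terms with the context; default to the first."""
    candidates = senses[term]
    overlaps = [len(sense & set(context_terms)) for sense in candidates]
    best = max(range(len(candidates)), key=lambda i: overlaps[i])
    return candidates[best] if overlaps[best] > 0 else candidates[0]


print(sorted(pick_sense("driver", ["windows"])))  # software sense
print(sorted(pick_sense("driver", [])))           # no context: most common sense
```

The `context_terms` list can hold other query keywords, terms from the user's previous searches, or terms from a user profile, matching points 1 and 2 on the slide.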

Thought Exercise
What do you think of when I say the
word “Facebook”?

Every term or phrase is a
Context-dependent cluster of
meaning with an ambiguous label

What does “love” mean?
http://localhost:8983/solr/thesaurus/skg

What does “love” mean in the context of “hug”?

What does “love” mean in the context of “child”?

My Three Assertions (Recap)
1) Unstructured data is actually “hyper-structured” data. It is a
graph that contains much more structure than typical “structured
data.”
2) That graph is very rich, but is a compression of meaning into a
lossy format (text). Much of data science is essentially the
decompression from this lossy format into a reconstituted form.
3) Most Important: Every instance of a word or phrase you ever
encounter has a unique meaning.

So why all the philosophy?
Because it’s much more important to intuitively understand the
kinds of problem we’re trying to solve in Natural Language Search
than to jump head-first into the Solution.
Because building the wrong thing can often be worse than not
doing anything.
And once you have an intuitive sense of the problems you need to
solve, you can confidently use the tools I’m about to describe to
build the right solution for your specific domain.

So what’s the end goal here?
User’s Query:
machine learning research and development Portland, OR software
engineer AND hadoop, java
Traditional Query Parsing:
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR"
AND "software engineer" AND hadoop AND java
Semantically Expanded Query:
("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d")
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
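Once phrases are identified and their related terms are known, assembling the expanded query is simple string building. A minimal sketch: the phrases and expansions are copied from the slide (the geo filter is left out), while the `expand` helper and its quoting rule are illustrative assumptions.

```python
def expand(phrase, related, boost=10):
    """Return a boosted OR-clause for one parsed phrase and its related phrases."""
    def quote(text):
        return f'"{text}"' if " " in text or "&" in text else text

    clause = [f"{quote(phrase)}^{boost}"] + [quote(r) for r in related]
    return "(" + " OR ".join(clause) + ")"


# Parsed phrases and expansions taken from the slide (geo filter omitted).
parsed = [
    ("machine learning", ["data scientist", "data mining", "artificial intelligence"]),
    ("research and development", ["r&d"]),
    ("software engineer", ["software developer"]),
    ("hadoop", ["big data", "hbase", "hive"]),
    ("java", ["j2ee"]),
]

print(" AND ".join(expand(phrase, related) for phrase, related in parsed))
```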

Semantic Search Components:
• Apache Solr
• Semantic Knowledge Graph
• Statistical Phrase Identifier
• Fusion Semantic Query Pipelines
• Fusion AI Synonyms Job
• Fusion AI Token & Phrase Spell Correction Job
• Fusion AI Head/Tail Analysis Job
• Fusion AI Phrase Identification Job
• Fusion Query Rules Engine

In the past year, Lucidworks added
the following capabilities to Solr:
• Semantic Knowledge Graph
• Statistical Phrase Identifier

So I’m going to talk about those here : )
See my Activate 2018 talk on
“How to Build a Semantic Search System”
For details on extended Lucidworks Fusion capabilities.

Semantic Query Parsing
Identification of phrases in queries using three steps:
1) Check a dictionary of known terms that is continuously built,
cleaned, and refined based upon common inputs from
interactions with real users of the system. We use the Solr Text
Tagger for this at query time.*
2) Also invoke a probabilistic query parser
(“statistical phrase identifier”) to dynamically identify unknown
phrases using statistics from a corpus of data (language model)
3) Final algorithm to choose the best merge when the two
approaches disagree.
*K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation
through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.

Statistical Phrase Identifier
Goal: given a query, predict which
combinations of keywords should be
combined together as phrases
Example:
senior java developer hadoop
Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop"
"senior java", "developer hadoop"
senior, "java developer", hadoop
senior, java, "developer hadoop"
Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
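The candidate parsings can be enumerated by deciding, for each gap between adjacent tokens, whether to split there. A minimal sketch of the enumeration only; scoring the candidates with corpus statistics, which is the actual job of the phrase identifier, is not shown, and the function name is illustrative.

```python
from itertools import product


def possible_parsings(query):
    """Yield every way to group adjacent query tokens into phrases."""
    tokens = query.split()
    if not tokens:
        return
    # One yes/no decision per gap between tokens: True means "split here".
    for splits in product([True, False], repeat=len(tokens) - 1):
        parsing, current = [], [tokens[0]]
        for token, split in zip(tokens[1:], splits):
            if split:
                parsing.append(" ".join(current))
                current = [token]
            else:
                current.append(token)
        parsing.append(" ".join(current))
        yield parsing


for parsing in possible_parsings("senior java developer hadoop"):
    print(parsing)
```

A query of n tokens has 2^(n-1) candidate groupings (8 for this four-token example), so a practical implementation prunes with statistics instead of scoring every candidate on long queries.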

…based on this presentation thus far, that’s a fair
conclusion to make.
All of the examples you’ve seen up to this point were
from the stand-alone plugin.
But when we committed it to Solr, the Semantic
Knowledge Graph gained full graph capabilities…

More verbose, but way more powerful…
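Assuming this slide refers to the `relatedness()` aggregation in Solr's JSON Facet API, a request body could be built roughly as follows. The field names and the `related_terms` facet label are hypothetical, and the body would be POSTed to a collection's `/query` handler.

```python
import json


def relatedness_request(foreground_query, field, limit=10):
    """Build a JSON Request API body that ranks terms in `field` by relatedness."""
    return {
        "query": "*:*",
        "limit": 0,  # only facet results are needed, not documents
        "params": {"fore": foreground_query, "back": "*:*"},
        "facet": {
            "related_terms": {
                "type": "terms",
                "field": field,
                "limit": limit,
                "sort": {"r": "desc"},
                "facet": {"r": "relatedness($fore,$back)"},
            }
        },
    }


# Hypothetical field names, for illustration only.
body = relatedness_request('desc:"hadoop"', field="skills")
print(json.dumps(body, indent=2))
```

Nesting another `terms` facet inside `related_terms` is what allows multi-level traversal in one request.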

Graph Query Parser
• Query-time, cycle-aware graph traversal is able to rank documents based on relationships
• Provides controls for depth, filtering of results and inclusion
of root and/or leaves
• Limitations: distributed queries only traverse intra-shard docs
Examples:
• http://localhost:8983/solr/graph/query?fl=id,score&
q={!graph from=in_edge to=out_edge}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge
traversalFilter='foo:[* TO 15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
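Because the `{!graph ...}` local-params syntax contains braces, spaces, and brackets, it is easiest to let a URL library do the escaping. A minimal sketch that rebuilds the third example; the `graph_query_url` helper is an illustrative assumption, while the `from`, `to`, and `maxDepth` parameters are the ones shown on the slide.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def graph_query_url(base, root_query, from_field, to_field, max_depth=None, fields="id"):
    """Build a Solr URL that uses the {!graph} query parser."""
    local_params = f"from={from_field} to={to_field}"
    if max_depth is not None:
        local_params += f" maxDepth={max_depth}"
    params = {"fl": fields, "q": f"{{!graph {local_params}}}{root_query}"}
    return f"{base}?{urlencode(params)}"


url = graph_query_url(
    "http://localhost:8983/solr/my_graph/query",
    root_query="foo:[* TO 10]",
    from_field="in_edge",
    to_field="out_edge",
    max_depth=1,
)
print(url)
print(parse_qs(urlsplit(url).query)["q"][0])  # decoded form, as written on the slide
```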

Find Location (Graph Query)
http://localhost:8983/solr/POI/select

Graph Traversal converted to Facet

For Remaining keywords, find doc type + related terms

Disambiguation by Category
Meaning 1: Restaurant => bbq, brisket, ribs, pork, …
Meaning 2: Outdoor Equipment => bbq, grill, charcoal, propane, …

Full Knowledge Graph Traversal in Single Request!

Tricks for
Automated Graph Generation

Named Entity Recognition (NER)
NER translates…
Barack Obama was the president of the United States of America. Before that, Obama was a senator.
into…
<person id="barack_obama">Barack Obama</person> was the <role>president</role> of the
<country id="usa">United States of America</country>. Before that, <person
id="barack_obama">Obama</person> was a <role>senator</role>.
In Solr, this would become:
text: Barack Obama was the president of the United States of America. Before that, Obama was a senator.
person: Barack Obama
country: United States of America
role: [ president, senator ]
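The step from tagged text to a Solr document can be sketched with a regular expression. This assumes the NER output uses exactly the inline tag format shown above; the `to_solr_doc` helper is illustrative, and it stores every entity field as a list for uniformity.

```python
import re
from collections import defaultdict

tagged = (
    '<person id="barack_obama">Barack Obama</person> was the <role>president</role> of the '
    '<country id="usa">United States of America</country>. Before that, '
    '<person id="barack_obama">Obama</person> was a <role>senator</role>.'
)

# <type [id="..."]>surface text</type>
TAG = re.compile(r'<(\w+)(?: id="([^"]*)")?>(.*?)</\1>')


def to_solr_doc(tagged_text):
    """Turn NER-tagged text into a flat document with one field per entity type."""
    fields = defaultdict(list)
    seen = set()
    for entity_type, entity_id, surface in TAG.findall(tagged_text):
        key = (entity_type, entity_id or surface)
        if key not in seen:              # keep one value per resolved entity
            seen.add(key)
            fields[entity_type].append(surface)
    doc = {"text": TAG.sub(r"\3", tagged_text)}
    doc.update(fields)
    return doc


print(to_solr_doc(tagged))
```

Both mentions of Barack Obama share the same `id`, so only the first surface form is kept in the `person` field.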

Open Information Extraction
(automatic RDF triple extraction / explicit knowledge graph learning)

popular barbeque near Haystack
(popular same as "good", "top", "best")
movie theaters near haystack
hotels near popular BBQ in Charlottesville
BBQ near airports near haystack
hotels near movie theaters in Charlottesville …
And that’s really just the beginning!

But it’s unfortunately also the end
of our time today : (

We operationalize AI for the
largest businesses on the planet.

Trey Grainger
trey@lucidworks.com
@treygrainger
http://solrinaction.com
Other presentations:
http://www.treygrainger.com
Discount code: 39grainger
Thank you!
