Searching with Vectors
Simon Hughes
Chief Data Scientist, Dice.com
Twitter: @hughes_meister
Who Am I?
• Chief Data Scientist at DHI (owns Dice.com)
• Key Projects:
• Search and Match
• Dice Recommender Systems
• Dice Job Search
• Dice Talent Search 3.0 and 4.0
• Dice Skill Center
• Dice Career Advisory Pages
• Dice Salary Predictor
• Dice Career Paths
• PhD Candidate DePaul University
• Subject Area – Machine Learning and NLP
• Thesis – Extracting Causal Relations from Scientific Essays
• Contact Info:
• Email: simon.hughes@dhigroupinc.com
• Twitter: https://twitter.com/hughes_meister
Motivation
• Dice.com - leading US technology professional job board
• Jobs marketplace
• We connect technology talent with employers
• High quality searching and matching are critical to our value proposition, for both our customers and our clients
• Need – high quality content-based recommender engine
• Automatically determine how well a job seeker matches a particular position, and vice versa
• Requirements:
• A semantic matching engine – goes beyond keyword search to extract semantic information from job postings and resumes
• Deployed at scale using existing search infrastructure (Solr and ElasticSearch)
• Github Repository for Talk:
• https://github.com/DiceTechJobs/VectorsInSearch
Agenda
• Why a Vector Representation?
• Learning Vector Representations
• Vector Based Search in an Inverted Index
Understanding Textual Data
Key Challenges:
• Synonymy – Multiple Words with the Same Meaning
• Related – typos, misspellings, acronyms, metonyms
• E.g. QA, Quality Assurance, Tester
• Polysemy – Ambiguity, a word has multiple meanings
• E.g. Bank, Book, Ape
• Hypernyms/Hyponyms – ‘type of’ relationships
• E.g. a dog (hyponym) is a type of animal (hypernym)
• Meronyms/Holonyms – ‘part of’ relationships
• E.g. finger (meronym) is a ‘part of’ a hand (holonym)
• What Words / Phrases are More Important?
• Named Entity Extraction (NER), Controlled Vocabularies
• Collocation (phrase) detection – e.g. “data scientist” vs “scientist who works with data”
• Stop words
• Term weighting schemes - e.g. tf.idf
How to Solve these Problems?
• Map documents and queries to a semantic space
• “From Strings to Things”?
• Google KG marketing
• Map words into concepts / semantics
• From strings to concepts
• How to represent?
[Figure labels: Java; Technologies; Big Data Tools; Javascript; Frameworks]
Representations
• Local representation
• Non distributed
• Sparse
• E.g. one-hot-vector
• One vector component per unique word
• Similar items have different representations
Representations
• Distributed Representation
• Dense vector
• Components of the vector represent learned concepts / latent variables
• Similar items have similar representations
• Most existing approaches produce dense vectors
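The contrast between local and distributed representations can be illustrated with a small sketch. The vocabulary and the dense vectors below are made up for illustration and are not taken from any trained model; the point is only that one-hot vectors of related words are orthogonal, while dense vectors can place them close together.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Local (one-hot) representation: one component per unique word.
vocab = ["java", "j2ee", "nurse"]  # hypothetical vocabulary
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Distributed representation: hand-written 3-d vectors (illustrative only).
dense = {
    "java":  [0.9, 0.8, 0.1],
    "j2ee":  [0.8, 0.9, 0.2],
    "nurse": [0.1, 0.0, 0.9],
}

print(cosine(one_hot["java"], one_hot["j2ee"]))  # 0.0: related words look unrelated
print(cosine(dense["java"], dense["j2ee"]))      # close to 1
print(cosine(dense["java"], dense["nurse"]))     # much lower
```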
Agenda
• Why a Vector Representation?
• Learning Vector Representations
• Vector Based Search in an Inverted Index
The Importance of Context
How do we learn the meaning (semantics) of words?
• Distributional Hypothesis
• Words occurring in similar contexts have similar meanings
• Harris 1954
• “a word is characterized by the company it keeps”
• Firth 1957
• Ignores word order, grammar and syntax
• Latent Relation Hypothesis
• Pairs of words occurring in similar patterns have similar semantic relations
• Turney et al, 2003
• Patterns – X cuts Y, X works with Y, etc
• Word order and grammatical relations matter
• Further reading - Distributional approaches to word meanings
Learning Meaning from Context
Bag of Words Approaches – ignore word order
• Latent Models
• Context - Documents
• LSA
• LDA
• Semantic Vector Space Model
• Word Embeddings
• Context – word window
• Word2vec
• Glove
• Simple linear language models
• History - http://blog.aylien.com/a-review-of-the-recent-history-of-natural-language-processing/
• For document embeddings
• Average or idf weighted average of word vectors
• Sentence / Document Embeddings
• Context – document + word window
• E.g. Doc2vec
• Context – surrounding sentences
• E.g. skip-thought vectors
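As a rough sketch of the "idf weighted average of word vectors" idea for document embeddings: the idf formula, the function names and the toy vectors below are assumptions chosen for illustration, not the method used in any particular library.

```python
import math
from collections import Counter

def idf_weights(tokenized_docs):
    """Smoothed idf per term: log(N / df) + 1 (one common variant, assumed here)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}

def embed(tokens, word_vectors, idf):
    """Idf-weighted average of the word vectors of the known tokens."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in word_vectors:
            continue  # skip out-of-vocabulary tokens
        w = idf.get(tok, 1.0)
        for i, x in enumerate(word_vectors[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known tokens: zero vector
    return [x / weight_sum for x in total]

# Toy data: "developer" occurs in every document, so it gets the lowest weight.
docs = [["java", "developer"], ["python", "developer"]]
vectors = {"java": [1.0, 0.0], "python": [0.5, 0.5], "developer": [0.0, 1.0]}
idf = idf_weights(docs)
print(embed(docs[0], vectors, idf))
```

With plain averaging both components of the first document would be 0.5; the idf weighting pulls the result toward the rarer term "java".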
Word2Vec
• By Aelu013 [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0) ], from Wikimedia Commons
Limitations of BOW Approaches:
• Shallow representation
• Word embeddings – limited to the word level
• Latent models – document level but doesn’t encode relational information
• Synonymy - learn relatedness, not true synonyms
• E.g. Antonyms have similar vectors
• Polysemy – cannot encode different meanings of same word
• Global model not a local model
Beyond BOW - Deep Language Models
• Deep Language Model Embeddings
• Derived from the internal state of a deep LM
• Learns deep representation of sequences of words in context
• Can adjust word vectors based on their current context
• “NLP’s imagenet moment”
• Achieved state of the art results on many NLP tasks
• Consistently out-perform word embedding models
• Example models - ELMo, BERT, ULMFiT, OpenAI Transformer
• Used for encoding sentences, not whole documents
• Hard to scale
Deep Language Models
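$p(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \dots, w_{i-1}) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, \dots, w_{n-1})$
[Diagram: stacked LSTM layers read the inputs Begin, w1, w2, w3 and output p(w1), p(w2|w1), p(w3|w1,w2), p(w4|w1,w2,w3), one prediction per step]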
Embedding Models for Search
• Word Embedding Approaches
• Cluster Word Embeddings
• “Representing Documents and Queries as Sets of Word Embedded Vectors for Information Retrieval”
• Clustered word2vec vectors using k-means
• Documents represented as clusters of word vectors
• Query - query vectors are mapped to their similarity to the cluster centroids
• Outperformed Jelinek-Mercer LM similarity using VSM
• Average Word Embeddings
• From Chapter 5 of Deep Learning for Search
• Author - Tommaso Teofili
• Query and document represented as average of word2vec vectors
• Computing a weighted average using idf worked best
• Outperformed BM25 using cosine similarity
• BM25 + word2vec – highest NDCG score
Embedding Models for Search
• Dual Embedding Space Model (DESM)
• Research from Microsoft
• Extends word2vec
• Learns a dual embedding for queries and documents
• Paper - https://arxiv.org/pdf/1602.01137.pdf
• Evaluation
• Compared BM25, LSA and DESM on Bing Query Log Data
• Metrics - NDCG@1, NDCG@3, NDCG@5
• Results
• LSA and DESM both out-performed BM25
• DESM out-performed LSA
• DESM + BM25 out-performed all other approaches
Agenda
• Why a Vector Representation?
• Learning Vector Representations
• Vector Based Search in an Inverted Index
Vectors in Search
• Dense Embedding Vector:
• Dense
• D dimensional
• D = 50-1000
• Inverted index:
• Sparse
• Pivoted by term
• V = Vocabulary
• |V| =100k+
• Fast because sparse
Example dense vector: [+0.12, -0.34, -0.12, +0.27, +0.63]
Example inverted index:
Term    Posting List
Java    1,5,100,102
.NET    2,4,600,605,1000
C#      2,88,105,800
SQL     130,433,648,899,1200
Html    1,2,10,30,55,202,252,30,598,
Searching with Word Embeddings
Approaches for using word embeddings:
• Top N terms
• Expand query using top n terms from model
• Boost expansions by cosine similarity
• Can use as a boost query, a re-rank query or a straight term expansion
• Q = “java developer”^10
OR ”java j2ee developer”^0.91 OR “java architect”^0.89
OR “lead java developer”^0.87 OR “j2ee developer”^0.86
OR “java engineer”^0.86
• Term Clustering
• Cluster embeddings using a clustering algorithm
• E.g. k-means
• Compute different sized clusters, k=100,1000,10000
• Map clusters to tokens and index
• Different fields for each k
• Larger k fields – bigger boost or rely on idf scoring
• Query expands to top clusters, boosted by similarity
• Q = “java developer”^10
OR cluster_k1000:5894^5
OR cluster_k100:23^2.5
OR cluster_k10:8^1.25
• See https://github.com/DiceTechJobs/ConceptualSearch
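The "top N terms" expansion above can be sketched as a small query builder. The function name and the neighbour list are hypothetical; in practice the (phrase, similarity) pairs would come from a nearest-neighbour lookup in the embedding model, which is not shown here.

```python
def expand_query(query, neighbours, original_boost=10.0, top_n=5):
    """Build a boosted OR query from (phrase, cosine_similarity) pairs.

    The original phrase gets a fixed high boost; each expansion is
    boosted by its cosine similarity to the original phrase.
    """
    top = sorted(neighbours, key=lambda pair: pair[1], reverse=True)[:top_n]
    clauses = [f'"{query}"^{original_boost:g}']
    clauses += [f'"{phrase}"^{sim:.2f}' for phrase, sim in top]
    return " OR ".join(clauses)

# Similarities copied from the example query on the slide.
neighbours = [
    ("java j2ee developer", 0.91),
    ("java architect", 0.89),
    ("lead java developer", 0.87),
    ("j2ee developer", 0.86),
    ("java engineer", 0.86),
]
print(expand_query("java developer", neighbours))
```

The same builder could be pointed at cluster tokens instead of phrases by passing field-qualified terms, at the cost of a slightly different clause format.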
Searching Vectors – k-NN Search
• K-NN search
• Find the k closest neighbors to query vector according to similarity metric
• Usually cosine similarity or Euclidean distance
• Definitions
• D = number of components in the vector
• N = number of documents
• Brute Force Search:
• O(ND) = linear
• What if N and/or D is very large?
• Vs. Inverted Index
• Sublinear - makes use of the sparsity of terms
• BTree or Distributed Hash Table lookup for terms, iterate posting list, re-rank matches - O(n log n) for n matching documents
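For reference, a minimal sketch of the brute force search: every one of the N document vectors is compared with the query across all D components, which is where the O(ND) cost comes from. The document ids and vectors are made up.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_brute_force(query, docs, k):
    """Score every document (N similarity computations of D components each)
    and return the k best as (doc_id, score), highest score first."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in docs.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]

docs = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.9, 0.1]}  # toy index
print(knn_brute_force([1.0, 0.0], docs, k=2))
```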
Optimal Vector Representation In An Inverted Index?
What properties would such a representation have?
• For Performance
• Sparse representation necessary to leverage inverted index
• For Relevancy
• Distributed representation
• Each document should be a collection of tokens
• Tokens represent some semantic feature of the space
• Similarity is preserved
• Similar vectors must also be similar under this new representation
• Zipfian distribution of tokens
• “We need a Zipfian Distribution” – John Berryman (Co-author of ‘Relevant Search’)
• Tokenizing Embedding Spaces
Zipf’s Law
• The frequency of terms in a corpus follows a power law distribution
• A small number of tokens are very common - filter out irrelevant docs
• A large number of tokens are very rare - discriminate between similar matches
• Distribution of last names - By Thekohser [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons
Approximate Nearest Neighbor Search
• Faster than full k-NN, with some loss in accuracy
• Approaches can be either:
• Data Dependent
• Learns and adjusts from the data
• Makes indexing new documents hard
• Data Independent
• Some Approaches:
• KD Tree
• LSH
• Heuristic Methods
• K-Means Tree
• Randomized KD Forest
• Paper: https://arxiv.org/abs/1603.09596
• HNSW (Hierarchical Navigable Small World Graphs) – top on http://ann-benchmarks.com/
• Paper: https://arxiv.org/pdf/1603.09320.pdf
• Vector Thresholding
• Choice of similarity metric is important in choosing an algorithm
KD Trees
• Construction
• Constructs a binary search tree by recursively partitioning the search space along one vector dimension at a time
• Each splitting hyperplane is orthogonal to the chosen dimension's axis
• The split point is usually the median
• Querying
• Described here - https://en.wikipedia.org/wiki/K-d_tree#Complexity
• Limitations
• How to implement efficiently in an inverted index?
• Lucene 6.0 dimensional points
• See also - https://www.elastic.co/blog/lucene-points-6.0
• Not exposed in Solr and Elastic Search AFAIK
• Tree needs rebalancing on each insertion
• Curse of dimensionality
• Need N >> 2^D for N points and D dimensions
• Complexity essentially linear for real world vectors (D >= 50)
• Approximate KNN Search
• Possible with KD tree – limit the number of searched nodes
• Typically out-performed by other ANN approaches
Locality Sensitive Hashing
• LSH hashes items to discrete buckets
• More buckets – slower but more accurate
• Locality Preserving
• Maximizes the probability that similar items occupy the same buckets
• Random Projection LSH (SimHash)
• LSH variant for cosine similarity
• Generate a random d-dimensional unit vector r, and for each vector v compute
• $\mathrm{hash}(v) = \mathrm{sign}(v \cdot r)$
• Produces a binary encoding, one bit for each hash function (random vector)
• Probability that two vectors’ hash bits match is 1 − θ/π, where θ is the angle between them, so it increases with cosine similarity
• Output of hash function can be indexed and searched using Hamming Distance
• Intuition - Van Durme and Lall - http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10-slides.pdf
• Data independent, although data dependent variations exist
• However, for real data, it is typically out-performed by heuristic methods like k-means trees, and randomized KD-trees
• https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf
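A minimal sketch of random projection LSH as described above, using only the standard library. The random vectors here are drawn from a Gaussian rather than normalised to unit length; only the sign of the dot product is used, so the length of r does not affect the hash. Function names and the seed are illustrative assumptions.

```python
import random

def random_hyperplanes(n_bits, dim, seed=0):
    """One random direction per hash bit (Gaussian components give a
    uniformly distributed direction)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_bits(vector, hyperplanes):
    """hash(v) = sign(v . r) for each random vector r, encoded as 1 or 0."""
    return [1 if sum(v * r for v, r in zip(vector, plane)) >= 0 else 0
            for plane in hyperplanes]

def hamming_distance(bits_a, bits_b):
    """Number of positions at which the two hashes differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))

planes = random_hyperplanes(n_bits=14, dim=8)
v = [0.08, -0.16, -0.12, 0.27, 0.63, -0.01, 0.16, -0.48]
scaled = [2.0 * x for x in v]     # same direction, cosine similarity 1
opposite = [-x for x in v]        # opposite direction, cosine similarity -1

print(hamming_distance(lsh_bits(v, planes), lsh_bits(scaled, planes)))
print(hamming_distance(lsh_bits(v, planes), lsh_bits(opposite, planes)))
```

Vectors pointing the same way share every bit and opposite vectors share none; vectors in between differ on a fraction of bits that grows with the angle between them.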
Encoding LSH Hash into the Index
• Hash into Bits
[+0.08, -0.16, -0.12, +0.27, +0.63, -0.01, +0.16, -0.48]
[1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
• Store hash fingerprint as a single token
["10110110100101"]
OR
• Store each bit as a token using its position and value
["00_1","01_0","02_1","03_1","04_0","05_1","06_1","07_0","08_1","09_0","10_0","11_1","12_0","13_1"]
• Use mm parameter to speed up search
• Or store shingles of the binary tokens
• This is not sparse!
Hamming Similarity Class
• Custom similarity class
• Computes the number of matching tokens
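The two token encodings and the "number of matching tokens" score can be sketched in a few lines. This mimics outside the search engine what the custom similarity class is described as computing; it is not the plugin's actual code, and the function names are hypothetical.

```python
def fingerprint_token(bits):
    """Whole hash as one token; only exact hash matches will be found."""
    return "".join(str(b) for b in bits)

def positional_tokens(bits):
    """One token per bit, encoding the bit's position and its value."""
    return [f"{i:02d}_{b}" for i, b in enumerate(bits)]

def matching_tokens(tokens_a, tokens_b):
    """Hamming similarity: the number of positions with the same bit."""
    return len(set(tokens_a) & set(tokens_b))

# The 14-bit example hash from the slide.
bits = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
flipped = [1 - bits[0]] + bits[1:]  # same hash with the first bit flipped

print(fingerprint_token(bits))
print(positional_tokens(bits))
print(matching_tokens(positional_tokens(bits), positional_tokens(flipped)))
```

Every document emits exactly one token per bit position, which is why this encoding is not sparse: each query token matches roughly half of the index.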
K-Means Tree
• Hierarchical Clustering Algorithm
• Recursively partitions vector space using k-Means clustering
• Fast - each iteration of Lloyd’s heuristic for k-means runs in time linear in the number of vectors
• Most other clustering algorithms run in quadratic time or worse
• Tree Construction
• For some branching factor b create b clusters
• Create b nodes, store centroid for each node
• For each new cluster, cluster its members into b smaller clusters
• These form child nodes of their parent clusters, forming a tree structure
• Continue until < b members per cluster
• Paper
• "Scalable Nearest Neighbor Algorithms for High Dimensional Data" - Marius Muja and David G. Lowe, 2014 – implemented in the FLANN library
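The tree construction steps above can be sketched as follows. This is a simplified, pure-Python illustration with a basic Lloyd's loop, not the FLANN implementation; the node layout (dicts with "id", "centroid", "children", "members") and all function names are assumptions.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, rng, iterations=10):
    """Basic Lloyd's heuristic; clusters that end up empty are dropped."""
    centroids = rng.sample(vectors, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(v, centroids[i]))
            clusters[nearest].append(v)
        clusters = [c for c in clusters if c]
        centroids = [mean(c) for c in clusters]
    return centroids, clusters

def build_tree(vectors, b, rng, node_id="0"):
    """Recursively split into b clusters until a node has fewer than b members."""
    node = {"id": node_id, "centroid": mean(vectors), "children": [], "members": []}
    if len(vectors) < b:
        node["members"] = vectors
        return node
    _, clusters = kmeans(vectors, b, rng)
    if len(clusters) < 2:  # could not split further (e.g. duplicate vectors)
        node["members"] = vectors
        return node
    for i, members in enumerate(clusters):
        node["children"].append(build_tree(members, b, rng, f"{node_id}.{i}"))
    return node

def leaf_path(tree, vector):
    """Ids of the nodes visited when greedily descending to the closest leaf."""
    path, node = [tree["id"]], tree
    while node["children"]:
        node = min(node["children"], key=lambda c: sq_dist(vector, c["centroid"]))
        path.append(node["id"])
    return path

def leaves(node):
    if not node["children"]:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]

data_rng = random.Random(0)
points = ([[data_rng.gauss(0, 0.1), data_rng.gauss(0, 0.1)] for _ in range(20)]
          + [[data_rng.gauss(5, 0.1), data_rng.gauss(5, 0.1)] for _ in range(20)])
tree = build_tree(points, b=2, rng=random.Random(1))
print(leaf_path(tree, [5.0, 5.0]))
```

The path returned by `leaf_path` is the list of node ids (leaf plus all ancestors) that a vector would be tagged with if the tree were used to generate index tokens.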
K-Means Tree
• Depth 3 K-Means Tree
[Diagram: root node, first layer, second layer (leaf nodes), documents]
Lucene Implementation Details
• Pre-train a k-means tree on a representative subset of the index
• Indexing:
• Convert all nodes from tree into unique tokens
• For each vector, find the closest matching leaf node
• Index vector with tokens for that leaf node, and all parent nodes
• Querying
• Find top n matching nodes from tree
• Convert nodes into a query, boosted by similarity to query vector
• 'q': 'clusters:("121"^0.9 "909"^0.88 "523"^0.91)'
• Create a re-rank query to brute force re-rank the top matching documents
• 'rq': '{!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=99}'
• 'rqq': '{!payloadEdismax v=$vq}'
• 'vq': 'vector:("0"^-0.0136 "1"^0.05387 "2"^0.070476 "3"^0.14529 …)'
• Uses a special payload query parser (payload_score is insufficient)
• See https://github.com/DiceTechJobs/VectorsInSearch
• *Better approach – use doc values field or Lucene dimensional points
• Trade speed for accuracy depending on depth of tree search, and how many vectors are re-ranked
• Tree nodes follow a Zipfian distribution
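The query parameters listed above can be assembled with a small helper. This sketch only builds the parameter dictionary; it does not contact Solr. The field names `clusters` and `vector` and the `payloadEdismax` parser are taken from the slide (the parser is a custom plugin, so it is assumed to be installed), while the function names are hypothetical.

```python
def cluster_query(scored_nodes, field="clusters"):
    """Boost each matching tree node by its similarity to the query vector."""
    clauses = " ".join(f'"{node}"^{sim:g}' for node, sim in scored_nodes)
    return f"{field}:({clauses})"

def vector_query(vector, field="vector"):
    """One clause per vector component: the token is the component index,
    the boost is the component value."""
    clauses = " ".join(f'"{i}"^{x:g}' for i, x in enumerate(vector))
    return f"{field}:({clauses})"

def solr_params(scored_nodes, query_vector, rerank_docs=1000, rerank_weight=99):
    """Cheap cluster-token query first, then a re-rank of the top matches
    against the full query vector."""
    return {
        "q": cluster_query(scored_nodes),
        "rq": f"{{!rerank reRankQuery=$rqq reRankDocs={rerank_docs} "
              f"reRankWeight={rerank_weight}}}",
        "rqq": "{!payloadEdismax v=$vq}",
        "vq": vector_query(query_vector),
    }

params = solr_params(
    scored_nodes=[("121", 0.9), ("909", 0.88), ("523", 0.91)],
    query_vector=[-0.0136, 0.05387, 0.070476, 0.14529],
)
for name, value in params.items():
    print(name, "=", value)
```

Raising `rerank_docs` re-ranks more candidates with the exact vector score, trading speed for accuracy.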
Lucene Implementation Details
• Cluster Field – stores cluster tokens
• Turn off all norms, tf and idf weighting, custom hamming similarity class
• Vector Field – stores vectors for re-ranking
• Stores components plus payloads, custom similarity class using payloads
• Similarity classes: https://github.com/DiceTechJobs/SolrPlugins
Lucene Implementation Details
[Figures: vector field analysis chain; cluster fields]
Other Heuristic Methods
• Randomized KD Forest
• Constructs a number of KD trees choosing axis to split on randomly
• Searches all trees in parallel to a fixed number of leaf nodes
• KD Trees are very deep
• How to implement efficiently in an inverted index?
• Hierarchical Navigable Small World Graphs
• Hierarchical graph based model - https://arxiv.org/pdf/1603.09320.pdf
• Consistently out-performs other ANN methods on the ANN benchmarks page - http://ann-benchmarks.com/
Distribution of Vector Components
• Distribution of components from our vectors is Gaussian
• Mean is 0
• This means that most vector components are very small
• These components will have minimal impact on cosine score
[Figure: histogram of components taken from 350k vectors, mean = 0.0]
Vector Thresholding with Tokenization
[+0.08, -0.16, -0.12, +0.27, +0.63, -0.01, +0.16, -0.48]
[ 0, 0, 0, 0, +0.63, 0, 0, -0.48]
• Drop all but the largest components
[“04i+0.6”, “07i-0.5”]
• Round weight to lower precision
• Encode position and weight as a single token
• Paper: “Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines”
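The thresholding and token encoding steps can be reproduced on the slide's example vector. The token format (two-digit position, the letter "i", signed weight rounded to one decimal) is inferred from the example tokens shown; the function names are hypothetical.

```python
def threshold(vector, keep):
    """Keep the `keep` components with the largest absolute value,
    returned as {position: weight} in position order."""
    top = sorted(range(len(vector)), key=lambda i: abs(vector[i]),
                 reverse=True)[:keep]
    return {i: vector[i] for i in sorted(top)}

def to_tokens(sparse, precision=1):
    """Encode position and rounded weight as a single token, e.g. '04i+0.6'."""
    return [f"{i:02d}i{w:+.{precision}f}" for i, w in sparse.items()]

vector = [+0.08, -0.16, -0.12, +0.27, +0.63, -0.01, +0.16, -0.48]
sparse = threshold(vector, keep=2)
print(sparse)             # {4: 0.63, 7: -0.48}
print(to_tokens(sparse))  # ['04i+0.6', '07i-0.5']
```

Lower precision means more documents share a token (better recall, less discrimination); higher precision has the opposite effect.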
Vector Thresholding with Payloads
[+0.08, -0.16, -0.12, +0.27, +0.63, -0.01, +0.16, -0.48]
[ 0, 0, 0, 0, +0.63, 0, 0, -0.48]
• Drop all but the largest components
• I modified the previous idea, using payload score queries
• Indexing: Store remaining (non zero) tokens in index with payloads
• Querying: Uses custom payload query parser + similarity class
• See Github repo, and Solr config in K-Means tree section
Q=vector:("3"^-0.0136 "14"^0.05387 "56"^-0.070476 "71"^0.14529 …)
&defType=payloadEdismax
Performance Comparison - Initial Results
• Hardware - MacBook Pro, 2.6 GHz i7 CPU, 16 GB RAM, SSD
• Search Engine:
• Solr 7.5, single shard
• Index: 700k documents
• 1000 sample vector queries, requests were single threaded
• Metric – precision @10 compared to brute force
• Updated results – check https://github.com/DiceTechJobs/VectorsInSearch
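One plausible reading of "precision @10 compared to brute force" is the fraction of an approximate method's top 10 results that also appear in the brute force top 10, averaged over the sample queries. The sketch below implements that reading; it is an assumption about the metric, not the benchmark code from the repository.

```python
def precision_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the approximate top k that is also in the exact top k."""
    exact = set(exact_ids[:k])
    hits = sum(1 for doc_id in approx_ids[:k] if doc_id in exact)
    return hits / k

def mean_precision_at_k(approx_results, exact_results, k=10):
    """Average precision@k over paired result lists, one pair per query."""
    scores = [precision_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)

exact = list(range(10))                 # brute force top 10 for one query
approx = list(range(9)) + [99]          # ANN result: one wrong document
print(precision_at_k(approx, exact))    # 0.9
print(mean_precision_at_k([approx, exact], [exact, exact]))
```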
Performance Comparison - Initial Results
• Each algorithm was run over a range of different parameter values, to show the recall–speed trade-off
Performance Comparison - Initial Results
Algorithm                                                      Precision@10   Queries Per Sec (Mean Qry Time)
LSH (Hamming Similarity)                                       0.69           1.3 qps (757 ms)
K-Means Tree (trained on index)                                0.88           9.2 qps (170 ms)
K-Means Tree (trained on sample)                               0.85           9.5 qps (105 ms)
Vector Thresholding with Tokenization (top 40% of components)  0.85           3.5 qps (312 ms)
Vector Thresholding with Payloads (top 40% of components)      0.94           1.8 qps (547 ms)
The Ultimate Solution - Sparse Coding?
• Also called ‘Dictionary Learning’
• Learns a sparse ‘overcomplete’ representation of a vector
• Example Algorithms:
• Sparse Auto-Encoder
• K-SVD
• Encoding needs to preserve the Metric Space
• Similar items need to remain similar after encoding
Other Relevant Approaches
• Word2bits - learns binary quantized word vectors
• https://github.com/agnusmaximus/Word2Bits
Block Max WAND
• https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
• ‘Weak AND’ algorithm to be integrated into Lucene 8.0 and ES 7.0
• Speeds up large OR queries by pruning clauses that won’t occur in top N matches
• Speed up can be 40% to 13x
• Can help address performance of these larger OR queries
Thank you!
Github Repository:
https://github.com/DiceTechJobs/VectorsInSearch
Simon Hughes
Chief Data Scientist, Dice.com
@hughes_meister

 
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonHaystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonOpenSource Connections
 
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...OpenSource Connections
 
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajOpenSource Connections
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...OpenSource Connections
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlOpenSource Connections
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerOpenSource Connections
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...OpenSource Connections
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...OpenSource Connections
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...OpenSource Connections
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...OpenSource Connections
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...OpenSource Connections
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...OpenSource Connections
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah ViaOpenSource Connections
 
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...OpenSource Connections
 

More from OpenSource Connections (20)

Encores
EncoresEncores
Encores
 
Test driven relevancy
Test driven relevancyTest driven relevancy
Test driven relevancy
 
How To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessHow To Structure Your Search Team for Success
How To Structure Your Search Team for Success
 
The right path to making search relevant - Taxonomy Bootcamp London 2019
The right path to making search relevant  - Taxonomy Bootcamp London 2019The right path to making search relevant  - Taxonomy Bootcamp London 2019
The right path to making search relevant - Taxonomy Bootcamp London 2019
 
Payloads and OCR with Solr
Payloads and OCR with SolrPayloads and OCR with Solr
Payloads and OCR with Solr
 
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullHaystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
 
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonHaystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
 
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
 
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
 
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
 

Recently uploaded

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 

Recently uploaded (20)

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 

Haystack 2019 - Search with Vectors - Simon Hughes

  • 1. Searching with Vectors Simon Hughes Chief Data Scientist, Dice.com Twitter: @hughes_meister
  • 2. Who Am I? • Chief Data Scientist at DHI (owns Dice.com) • Key Projects: • Search and Match • Dice Recommender Systems • Dice Job Search • Dice Talent Search 3.0 and 4.0 • Dice Skill Center • Dice Career Advisory Pages • Dice Salary Predictor • Dice Career Paths • PhD Candidate DePaul University • Subject Area – Machine Learning and NLP • Thesis – Extracting Causal Relations from Scientific Essays • Contact Info: • Email: simon.hughes@dhigroupinc.com • Twitter: https://twitter.com/hughes_meister
  • 3. Motivation • Dice.com - leading US technology professional job board • Jobs marketplace • We connect technology talent with employers • High quality searching and matching are critical to our value proposition, for both our customers and our clients • Need – high quality content-based recommender engine • Automatically determine how well a job seeker matches a particular position, and vice versa • Requirements: • A semantic matching engine – goes beyond keyword search, to extracting semantic information from job postings and resume • Deployed at scale using existing search infrastructure (Solr and ElasticSearch) • Github Repository for Talk: • https://github.com/DiceTechJobs/VectorsInSearch
  • 4. Agenda • Why a Vector Representation? • Learning Vector Representations • Vector Based Search in an Inverted Index
  • 5. Understanding Textual Data Key Challenges: • Synonymy – Multiple Words with the Same Meaning • Related – typos, misspellings, acronyms, metonyms • E.g. QA, Quality Assurance, Tester • Polysemy – Ambiguity, a word has multiple meanings • E.g. Bank, Book, Ape • Hypernyms/Hyponyms – ‘type of’ relationships • E.g. a dog (hyponym) is a type of animal (hypernym) • Meronyms/Holonyms – ‘part of’ relationships • E.g. finger (meronym) is a ‘part of’ a hand (holonym) • What Words / Phrases are More Important? • Named Entity Recognition (NER), Controlled Vocabularies • Collocation (phrase) detection – e.g. “data scientist” vs “scientist who works with data” • Stop words • Term weighting schemes - e.g. tf.idf
  • 6. How to Solve these Problems? • Map documents and queries to a semantic space • “From Strings to Things”? • Google KG marketing • Map words into concepts / semantics • From strings to concepts • How to represent? • Figure labels: Java Technologies, Big Data Tools, Javascript Frameworks
  • 7. Representations • Local representation (figure example: “Java”) • Non-distributed • Sparse • E.g. one-hot-vector • One vector component per unique word • Similar items have different representations
  • 8. Representations • Distributed Representation (figure example: “Java”) • Dense vector • Components of the vector represent learned concepts / latent variables • Similar items have similar representations • Most existing approaches produce dense vectors • Local representation (figure example: “Java”) • Non-distributed • Sparse • E.g. one-hot-vector • One vector component per unique word • Similar items have different representations
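
The contrast between local and distributed representations on slides 7 and 8 can be sketched in a few lines of Python. This is only an illustration: the three-word vocabulary and the two-dimensional dense vectors are invented for the example, not taken from any trained model.

```python
import math

def one_hot(word, vocab):
    """Local (sparse) representation: one component per unique word."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocab = ["java", "j2ee", "nurse"]

# Hypothetical dense vectors, made up for illustration only.
dense = {
    "java": [0.9, 0.1],
    "j2ee": [0.8, 0.2],
    "nurse": [0.1, 0.9],
}

# Distinct one-hot vectors are always orthogonal, however related the words are.
one_hot_sim = cosine(one_hot("java", vocab), one_hot("j2ee", vocab))

# Dense vectors can place related words close together.
dense_related = cosine(dense["java"], dense["j2ee"])
dense_unrelated = cosine(dense["java"], dense["nurse"])
```

With one-hot vectors, "java" and "j2ee" are no more similar than "java" and "nurse"; with the dense vectors the related pair scores much higher.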
  • 9. Agenda • Why a Vector Representation? • Learning Vector Representations • Vector Based Search in an Inverted Index
  • 10. The Importance of Context How do we learn the meaning (semantics) of words? • Distributional Hypothesis • Words occurring in similar contexts have similar meanings • Harris 1954 • “a word is characterized by the company it keeps” • Firth 1957 • Ignores word order, grammar and syntax • Latent Relation Hypothesis • Pairs of words occurring in similar patterns have similar semantic relations • Turney et al, 2003 • Patterns – X cuts Y, X works with Y, etc • Word order and grammatical relations matter • Further reading - Distributional approaches to word meanings
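
As a rough illustration of the distributional hypothesis, the sketch below counts the words that occur within a small window around each word. The function name, the window size and the toy sentences are assumptions made for this example; real systems learn from far larger corpora.

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

# Toy corpus, invented for illustration.
sentences = [
    ["senior", "java", "developer", "wanted"],
    ["senior", "python", "developer", "wanted"],
]
context = cooccurrence_vectors(sentences)
```

Here "java" and "python" never co-occur, yet they end up with identical context counts because they appear in the same surroundings, which is the signal the hypothesis relies on.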
  • 11. Learning Meaning from Context Bag of Words Approaches – ignore word order • Latent Models • Context - Documents • LSA • LDA • Semantic Vector Space Model • Word Embeddings • Context – word window • Word2vec • GloVe • Simple linear language models • History - http://blog.aylien.com/a-review-of-the-recent-history-of-natural-language-processing/ • For document embeddings • Average or idf weighted average of word vectors • Sentence / Document Embeddings • Context – document + word window • E.g. Doc2vec • Context – surrounding sentences • E.g. skip-thought vectors
  • 12. Word2Vec • By Aelu013 [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0) ], from Wikimedia Commons
  • 13. Limitations of BOW Approaches: • Shallow representation • Word embeddings – limited to the word level • Latent models – document level but do not encode relational information • Synonymy - learn relatedness, not true synonyms • E.g. Antonyms have similar vectors • Polysemy – cannot encode different meanings of the same word • Global model, not a local model
  • 14. Beyond BOW - Deep Language Models • Deep Language Model Embeddings • Derived from the internal state of a deep LM • Learns deep representation of sequences of words in context • Can adjust word vectors based on their current context • “NLP’s ImageNet moment” • Achieved state of the art results on many NLP tasks • Consistently out-perform word embedding models • Example models - ELMo, BERT, ULMFiT, OpenAI Transformer • Used for encoding sentences, not whole documents • Hard to scale
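  • 15. Deep Language Models • Chain rule factorization: $p(w_1, w_2, \ldots, w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1})$ • Diagram: a chain of LSTM cells reads Begin, w1, w2, w3 and outputs p(w1), p(w2|w1), p(w3|w1,w2), p(w4|w1,w2,w3)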
  • 16. Embedding Models for Search • Word Embedding Approaches • Cluster Word Embeddings • “Representing Documents and Queries as Sets of Word Embedded Vectors for Information Retrieval” • Clustered word2vec vectors using k-means • Documents represented as clusters of word vectors • Query - map query vectors as similarity to cluster centroids • Outperformed Jelinek-Mercer LM similarity using VSM • Average Word Embeddings • From Chapter 5 of Deep Learning for Search • Author - Tommaso Teofili • Query and document represented as average of word2vec vectors • Computing a weighted average using idf worked best • Outperformed BM25 using cosine similarity • BM25 + word2vec – highest NDCG score
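
The idf-weighted average of word vectors mentioned on slide 16 might look like the following minimal sketch. The smoothed idf formula, the function names and the toy vectors are assumptions chosen for illustration; the slides do not specify which idf variant was used.

```python
import math

def idf(term, docs):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0

def embed(tokens, vectors, docs):
    """idf-weighted average of the word vectors of the known tokens."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for t in tokens:
        if t not in vectors:
            continue  # out-of-vocabulary tokens are skipped
        w = idf(t, docs)
        weight_sum += w
        for i, x in enumerate(vectors[t]):
            total[i] += w * x
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]

# Hypothetical word vectors and a toy corpus of token sets.
vectors = {"java": [1.0, 0.0], "developer": [0.0, 1.0], "the": [0.5, 0.5]}
docs = [{"the", "java", "developer"}, {"the", "nurse"}, {"the", "developer"}]

query_vec = embed(["java", "developer"], vectors, docs)
```

Because "java" is rarer than "developer" in the toy corpus, it pulls the averaged vector further towards its own direction. The query and each document can be embedded the same way and compared with cosine similarity.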
  • 17. Embedding Models for Search • Dual Embedding Space Model (DESM) • Research from Microsoft • Extends word2vec • Learns a dual embedding for queries and documents • Paper - https://arxiv.org/pdf/1602.01137.pdf • Evaluation • Compared BM25, LSA and DESM on Bing Query Log Data • Metrics - NDCG@1, NDCG@3, NDCG@5 • Results • LSA and DESM both out-performed BM25 • DESM out-performed LSA • DESM + BM25 out-performed all other approaches
  • 18. Agenda • Why a Vector Representation? • Learning Vector Representations • Vector Based Search in an Inverted Index
  • 19. Vectors in Search • Dense Embedding Vector: • Dense • D dimensional • D = 50-1000 • E.g. [+0.12, -0.34, -0.12, +0.27, +0.63] • Inverted index: • Sparse • Pivoted by term • V = Vocabulary • |V| = 100k+ • Fast because sparse • E.g. term → posting list: • Java → 1,5,100,102 • .NET → 2,4,600,605,1000 • C# → 2,88,105,800 • SQL → 130,433,648,899,1200 • Html → 1,2,10,30,55,202,252,30,598,
  • 20. Searching with Word Embeddings Approaches for using word embeddings: • Top N terms • Expand query using top n terms from model • Boost expansions by cosine similarity • Can use as a boost query, a re-rank query or a straight term expansion • Q = “java developer”^10 OR “java j2ee developer”^0.91 OR “java architect”^0.89 OR “lead java developer”^0.87 OR “j2ee developer”^0.86 OR “java engineer”^0.86 • Term Clustering • Cluster embeddings using a clustering algorithm • E.g. k-means • Compute different sized clusters, k=100,1000,10000 • Map clusters to tokens and index • Different fields for each k • Larger k fields – bigger boost or rely on idf scoring • Query expands to top clusters, boosted by similarity • Q = “java developer”^10 OR cluster_k1000:5894^5 OR cluster_k100:23^2.5 OR cluster_k10:8^1.25 • See https://github.com/DiceTechJobs/ConceptualSearch
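
The "Top N terms" expansion on slide 20 can be sketched as follows. This is an assumption-laden illustration: `expand_query`, the boost of 10 for the original phrase, and the phrase vectors are all invented here, and the output merely mimics the query syntax shown on the slide.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def expand_query(phrase, vectors, top_n=3, original_boost=10):
    """Build a boosted OR query from the phrases nearest to `phrase`."""
    query_vec = vectors[phrase]
    scored = [(cosine(query_vec, vec), other)
              for other, vec in vectors.items() if other != phrase]
    scored.sort(reverse=True)  # most similar first
    clauses = ['"%s"^%d' % (phrase, original_boost)]
    for sim, other in scored[:top_n]:
        # Each expansion is boosted by its cosine similarity to the query.
        clauses.append('"%s"^%.2f' % (other, sim))
    return " OR ".join(clauses)

# Hypothetical phrase embeddings, made up for illustration.
vectors = {
    "java developer": [1.0, 0.0],
    "java engineer": [0.9, 0.1],
    "j2ee developer": [0.8, 0.3],
    "registered nurse": [0.0, 1.0],
}
query = expand_query("java developer", vectors, top_n=2)
```

The resulting string could be sent as the main query, a boost query or a re-rank query, as the slide suggests.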
  • 21. Searching Vectors – k-NN Search • K-NN search • Find the k closest neighbors to query vector according to similarity metric • Usually cosine similarity or Euclidean distance • Definitions • D = number of components in the vector • N = number of documents • Brute Force Search: • O(ND) = linear • What if N and/or D are very large? • Vs. Inverted Index • Sublinear - makes use of sparsity of terms • BTree or Distributed Hash Table lookup for terms, iterate posting list, re-rank matches - O(n log n)
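
For reference, a brute-force k-NN search of the kind described on slide 21 can be written in a few lines. The function names and the toy document vectors are assumptions for the example; Euclidean distance is used here, and cosine similarity would work the same way.

```python
import heapq
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_knn(query, doc_vectors, k):
    """Return the k (distance, doc_id) pairs closest to the query.

    Every one of the N documents is compared across all D components,
    so scoring costs O(N * D).
    """
    scored = ((euclidean(query, vec), doc_id)
              for doc_id, vec in doc_vectors.items())
    return heapq.nsmallest(k, scored)

# Toy document vectors, invented for illustration.
doc_vectors = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [5.0, 5.0]}
neighbours = brute_force_knn([0.9, 0.9], doc_vectors, k=2)
```

The scan touches every document for every query, which is what makes this approach impractical once N or D grows large.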
  • 22. Optimal Vector Representation In An Inverted Index? What properties would such a representation have? • For Performance • Sparse representation necessary to leverage inverted index • For Relevancy • Distributed representation • Each document should be a collection of tokens • Tokens represent some semantic feature of the space • Similarity is preserved • Similar vectors must also be similar under this new representation • Zipfian distribution of tokens • “We need a Zipfian Distribution” – John Berryman (Co-author of ‘Relevant Search’) • Tokenizing Embedding Spaces
  • 23. Zipf’s Law • The frequency of terms in a corpus follows a power law distribution • A small number of tokens are very common - filter out irrelevant docs • A large number of tokens are very rare - discriminate between similar matches • Distribution of last names - By Thekohser [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0 )], from Wikimedia Commons
  • 24. Approximate Nearest Neighbor Search • Faster than full k-NN, with some loss in accuracy • Approaches can be either: • Data Dependent • Learns and adjusts from the data • Makes indexing new documents hard • Data Independent • Some Approaches: • KD Tree • LSH • Heuristic Methods • K-Means Tree • Randomized KD Forest • Paper: https://arxiv.org/abs/1603.09596 • HNSW (Hierarchical Navigable Small World Graphs) – Top on http://ann-benchmarks.com/ • Paper: https://arxiv.org/pdf/1603.09320.pdf • Vector Thresholding • Choice of similarity metric is important in choosing an algorithm
  • 25. KD Trees • Construction • Constructs a binary search tree by recursively partitioning the search space along one vector dimension at a time • Splitting hyperplanes are orthogonal to the chosen dimension • Split point is usually the median • Querying • Described here - https://en.wikipedia.org/wiki/K-d_tree#Complexity • Limitations • How to implement efficiently in an inverted index? • Lucene 6.0 dimensional points • See also - https://www.elastic.co/blog/lucene-points-6.0 • Not exposed in Solr and Elasticsearch AFAIK • Tree needs rebalancing on each insertion • Curse of dimensionality • Efficient only when N >> 2^D - for N points and D dimensions • Complexity essentially linear for real world vectors (D >= 50) • Approximate KNN Search • Possible with KD tree – limit the number of searched nodes • Typically out-performed by other ANN approaches
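
The construction and query steps on slide 25 can be sketched as below. This is a textbook-style KD tree written for illustration, not the Lucene points implementation; the dictionary-based node layout and function names are assumptions.

```python
def build_kd_tree(points, depth=0):
    """Recursively split on one dimension at a time, at the median point."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Exact nearest neighbour search with branch pruning."""
    if node is None:
        return best
    if best is None or sq_dist(query, node["point"]) < sq_dist(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd_tree(points)
closest = nearest(tree, (9, 2))
```

In high dimensions the pruning test rarely succeeds, so most of the tree is visited anyway, which is the curse of dimensionality the slide refers to. Capping the number of visited nodes would turn this into an approximate search.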
  • 26. Locality Sensitive Hashing • LSH hashes items to discrete buckets • More buckets – slower but more accurate • Locality Preserving • Maximizes the probability that similar items occupy the same buckets • Random Projection LSH (SimHash) • LSH variant for cosine similarity • Generate a random d-dimensional unit vector r, and for each vector v • hash(v) = sign(v · r) • Produces a binary encoding, one bit for each hash function (random vector) • Probability that 2 vectors’ hashes match increases with their cosine similarity (per bit it is 1 − θ/π, where θ is the angle between the vectors) • Output of hash function can be indexed and searched using Hamming Distance • Intuition - Van Durme and Lall - http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10-slides.pdf • Data independent, although data dependent variations exist • However, for real data, it is typically out-performed by heuristic methods like k-means trees, and randomized KD-trees • https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf
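The random projection hash is small enough to sketch directly. The version below is a pure-Python illustration, not the code from the talk's repository: the function names, the seed, and the choice of 64 bits are all assumptions.

```python
import random

def random_unit_vectors(num_bits, dim, seed=0):
    """One random unit vector (hash function) per output bit."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(num_bits):
        r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = sum(x * x for x in r) ** 0.5
        vectors.append([x / norm for x in r])
    return vectors

def simhash(v, planes):
    """One bit per random vector r: 1 if v . r >= 0, else 0."""
    return [1 if sum(a * b for a, b in zip(v, r)) >= 0 else 0 for r in planes]

def hamming_distance(a, b):
    """Number of bit positions where the two hashes differ."""
    return sum(x != y for x, y in zip(a, b))

planes = random_unit_vectors(num_bits=64, dim=8)
v = [0.08, -0.16, -0.12, 0.27, 0.63, -0.01, 0.16, -0.48]
print(simhash(v, planes))
```

Because only the sign of each projection is kept, the hash ignores vector length. Two vectors pointing the same way get identical bits, and opposite vectors differ in every bit.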
  • 27. Encoding LSH Hash into the Index • Hash into Bits • Store hash fingerprint as a single token • Or store each bit as a token using its position and value • Use mm parameter to speed up search • Or store shingles of the binary tokens • This is not sparse! • Vector: [+0.08, -0.16, -0.12, +0.27, +0.63, -0.01, +0.16, -0.48] • Bits: [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1] • Single token: ["10110110100101"] • OR one token per bit: ["00_1","01_0","02_1","03_1","04_0","05_1","06_1","07_0","08_1","09_0","10_0","11_1","12_0","13_1"]
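The two token encodings can be produced with a couple of one-liners. This is a sketch only: the function names are made up, and the zero-padded `position_value` token format is inferred from the example tokens on the slide.

```python
def bits_to_fingerprint(bits):
    """Whole hash as a single token, e.g. [1, 0, 1] -> '101'."""
    return "".join(str(b) for b in bits)

def bits_to_tokens(bits):
    """One token per bit: zero-padded position, underscore, bit value."""
    return [f"{i:02d}_{b}" for i, b in enumerate(bits)]

bits = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
print(bits_to_fingerprint(bits))
print(bits_to_tokens(bits))
```

The single fingerprint token only supports exact matches on the whole hash. The per-bit tokens allow partial matches, at the cost of every document carrying one token per bit, which is why the slide notes that this encoding is not sparse.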
  • 28. Hamming Similarity Class • Custom similarity class • Computes the number of matching tokens
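The actual similarity class is a Lucene plugin and is not reproduced here. As a language-neutral sketch of what it computes, the scoring reduces to counting shared `position_value` tokens. The function names and the `min_match` argument (standing in for Solr's `mm`) are illustrative assumptions.

```python
def matching_tokens(query_tokens, doc_tokens):
    """Hamming similarity: how many position_value tokens the two hashes share."""
    return len(set(query_tokens) & set(doc_tokens))

def rank_by_hamming(query_tokens, docs, min_match=0):
    """docs maps doc id -> token list; min_match plays the role of Solr's mm."""
    scored = ((matching_tokens(query_tokens, toks), doc_id)
              for doc_id, toks in docs.items())
    return sorted(((s, d) for s, d in scored if s >= min_match), reverse=True)

query = ["00_1", "01_0", "02_1", "03_1"]
docs = {
    "a": ["00_1", "01_0", "02_1", "03_0"],   # differs from the query in one bit
    "b": ["00_0", "01_1", "02_0", "03_1"],   # differs from the query in three bits
}
print(rank_by_hamming(query, docs))
print(rank_by_hamming(query, docs, min_match=2))
```

Raising the minimum match threshold discards poor candidates early, trading some recall for speed.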
  • 29. K-Means Tree • Hierarchical Clustering Algorithm • Recursively partitions vector space using k-Means clustering • Fast - each iteration of Lloyd’s heuristic for k-means runs in linear time • Most other clustering algorithms run in quadratic time or worse • Tree Construction • For some branching factor b create b clusters • Create b nodes, store centroid for each node • For each new cluster, cluster its members into b smaller clusters • These form child nodes of their parent clusters, forming a tree structure • Continue until < b members per cluster • Paper • "Scalable Nearest Neighbor Algorithms for High Dimensional Data" - Marius Muja and David Lowe, 2014 – implemented in the FLANN library
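The tree construction steps can be sketched in pure Python. This is a toy illustration under stated assumptions (a fixed number of Lloyd iterations, random initial centroids, dict-based nodes, dotted node ids); it is not the FLANN implementation or the code from the talk's repository.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def assign(vectors, centroids):
    """Put each vector into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in vectors:
        best = min(range(len(centroids)), key=lambda i: sq_dist(v, centroids[i]))
        clusters[best].append(v)
    return clusters

def kmeans(vectors, k, iters=10, seed=0):
    """Lloyd's heuristic with a fixed iteration count; returns the clusters."""
    centroids = random.Random(seed).sample(vectors, k)
    for _ in range(iters):
        clusters = assign(vectors, centroids)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return assign(vectors, centroids)

def build_tree(vectors, b, node_id="0"):
    """Split recursively into b clusters until a node has fewer than b members."""
    node = {"id": node_id, "centroid": mean(vectors), "children": [], "members": []}
    clusters = [c for c in kmeans(vectors, b) if c] if len(vectors) >= b else []
    if len(clusters) < 2:            # too few members, or a degenerate split
        node["members"] = vectors
        return node
    for i, members in enumerate(clusters):
        node["children"].append(build_tree(members, b, f"{node_id}.{i}"))
    return node

def collect_leaves(node):
    if not node["children"]:
        return [node]
    return [leaf for child in node["children"] for leaf in collect_leaves(child)]

rng = random.Random(1)
points = [[rng.random(), rng.random()] for _ in range(30)]
tree = build_tree(points, b=3)
print([leaf["id"] for leaf in collect_leaves(tree)])
```

Each node id encodes the path from the root, which makes it straightforward to turn a leaf and all of its parents into index tokens later.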
  • 30. K-Means Tree • [Figure: depth 3 k-means tree; layers from top to bottom: root node, first layer, second layer (leaf nodes), documents]
  • 31. Lucene Implementation Details • Pre-train a k-means tree on a representative subset of the index • Indexing: • Convert all nodes from tree into unique tokens • For each vector, find the closest matching leaf node • Index vector with tokens for that leaf node, and all parent nodes • Querying • Find top n matching nodes from tree • Convert nodes into a query, boosted by similarity to query vector • 'q': 'clusters:("121"^0.9 "909"^0.88 "523"^0.91)' • Create a re-rank query to brute force re-rank the top matching documents • 'rq': '{!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=99}' • 'rqq': '{!payloadEdismax v=$vq}' • 'vq': 'vector:("0"^-0.0136 "1"^0.05387 "2"^0.070476 "3"^0.14529 …)' • Uses a special payload query parser (payload_score is insufficient) • See https://github.com/DiceTechJobs/VectorsInSearch • *Better approach – use doc values field or Lucene dimensional points • Trade speed for accuracy depending on depth of tree search, and how many vectors are re-ranked • Tree nodes follow a Zipfian distribution
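The indexing and querying steps amount to string building on the client side. The sketch below assembles the request parameters shown on the slide. The helper names and the dotted node-id scheme are assumptions for illustration, and `payloadEdismax` is the custom parser from the talk's repository, not a stock Solr parser.

```python
def path_tokens(leaf_id):
    """'0.2.1' -> ['0', '0.2', '0.2.1']: the leaf node plus all of its parents."""
    parts = leaf_id.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]

def boosted_clause(field, weights):
    """Build field:("term"^weight ...) from (term, weight) pairs."""
    terms = " ".join(f'"{term}"^{weight}' for term, weight in weights)
    return f"{field}:({terms})"

def build_params(node_scores, query_vector, rerank_docs=1000):
    """Cluster query for candidate retrieval plus a vector query for re-ranking."""
    return {
        "q": boosted_clause("clusters", node_scores),
        "rq": f"{{!rerank reRankQuery=$rqq reRankDocs={rerank_docs} reRankWeight=99}}",
        "rqq": "{!payloadEdismax v=$vq}",
        "vq": boosted_clause("vector", [(str(i), w) for i, w in enumerate(query_vector)]),
    }

params = build_params(
    node_scores=[("121", 0.9), ("909", 0.88), ("523", 0.91)],
    query_vector=[-0.0136, 0.05387],
)
for name, value in params.items():
    print(name, "=", value)
```

Lowering `rerank_docs` or querying fewer tree nodes makes the search faster and less accurate, which is the trade-off the slide describes.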
  • 32. Lucene Implementation Details • Cluster Field – stores cluster tokens • Turn off all norms, tf and idf weighting, custom hamming similarity class • Vector Field – stores vectors for re-ranking • Stores components plus payloads, custom similarity class using payloads • Similarity classes: https://github.com/DiceTechJobs/SolrPlugins
  • 33. Lucene Implementation Details • Vector field analysis chain: [configuration shown as an image on the slide] • Cluster fields: [configuration shown as an image on the slide]
  • 34. Other Heuristic Methods • Randomized KD Forest • Constructs a number of KD trees, choosing the axis to split on randomly • Searches all trees in parallel to a fixed number of leaf nodes • KD Trees are very deep • How to implement efficiently in an inverted index? • Hierarchical Navigable Small World Graphs • Hierarchical graph based model - https://arxiv.org/pdf/1603.09320.pdf • Consistently out-performs other ANN methods on the ANN benchmarks page - http://ann-benchmarks.com/
  • 35. Distribution of Vector Components • Distribution of components from our vectors is Gaussian • Mean is 0 • This means that most vector components are very small • These components will have minimal impact on cosine score • [Figure: histogram of components taken from 350k vectors, mean = 0.0]
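A small worked example shows how little the small components contribute. It reuses the 8-dimensional example vector from the talk, keeps only its two largest components, and compares the result with the original. The function names are illustrative, and a real embedding would have far more dimensions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def keep_largest(vector, keep):
    """Zero out all but the `keep` components with the largest absolute value."""
    top = set(sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)[:keep])
    return [x if i in top else 0.0 for i, x in enumerate(vector)]

v = [0.08, -0.16, -0.12, 0.27, 0.63, -0.01, 0.16, -0.48]
sparse = keep_largest(v, keep=2)     # only +0.63 and -0.48 survive
similarity = cosine(v, sparse)
print(round(similarity, 3))
```

Dropping six of the eight components still leaves a cosine similarity of about 0.90 with the original vector, because the two survivors carry most of the vector's squared length.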
  • 36. Vector Thresholding with Tokenization • Original vector: [+0.08, -0.16, -0.12, +0.27, +0.63, -0.01, +0.16, -0.48] • Drop all but the largest components: [ 0, 0, 0, 0, +0.63, 0, 0, -0.48] • Round weight to lower precision • Encode position and weight as a single token: ["04i+0.6", "07i-0.5"] • Paper: “Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines”
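The three encoding steps (threshold, round, tokenize) fit in a short sketch. The token format follows the example tokens on the slide; the function names and the `precision` parameter are assumptions for illustration.

```python
def threshold(vector, keep):
    """Zero out all but the `keep` components with the largest absolute value."""
    top = set(sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)[:keep])
    return [x if i in top else 0.0 for i, x in enumerate(vector)]

def to_tokens(vector, precision=1):
    """Encode each non-zero component as '<position>i<signed rounded weight>'."""
    return [f"{i:02d}i{x:+.{precision}f}" for i, x in enumerate(vector) if x != 0.0]

v = [0.08, -0.16, -0.12, 0.27, 0.63, -0.01, 0.16, -0.48]
tokens = to_tokens(threshold(v, keep=2))
print(tokens)
```

Lower precision means more documents share a token for a given position, which improves recall but makes each token less discriminating.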
  • 37. Vector Thresholding with Payloads • Original vector: [+0.08, -0.16, -0.12, +0.27, +0.63, -0.01, +0.16, -0.48] • Drop all but the largest components: [ 0, 0, 0, 0, +0.63, 0, 0, -0.48] • I modified the previous idea, using payload score queries • Indexing: Store remaining (non zero) tokens in index with payloads • Querying: Uses custom payload query parser + similarity class • See Github repo, and Solr config in k-means tree section • q=vector:("3"^-0.0136 "14"^0.05387 "56"^-0.070476 "71"^0.14529 …)&defType=payloadEdismax
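The custom parser and similarity class live in the talk's repository and are not reproduced here. As a rough mental model only, and an assumption rather than a description of that plugin, the score can be thought of as a sparse dot product: each query boost is multiplied by the payload stored for the matching position, and the products are summed. The function names below are hypothetical.

```python
def to_payload_terms(vector, keep):
    """Keep the `keep` largest components as {position: weight} for indexing."""
    top = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)[:keep]
    return {str(i): vector[i] for i in sorted(top)}

def payload_score(query_vector, doc_terms):
    """Assumed scoring: sum of query weight * stored payload over the kept terms."""
    return sum(query_vector[int(pos)] * weight for pos, weight in doc_terms.items())

v = [0.08, -0.16, -0.12, 0.27, 0.63, -0.01, 0.16, -0.48]
doc_terms = to_payload_terms(v, keep=2)
print(doc_terms)
print(payload_score(v, doc_terms))
```

Unlike the tokenized variant, the weights keep their full precision here, since the position is the token and the weight travels as the payload.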
  • 38. Performance Comparison - Initial Results • Hardware - MacBook Pro, 2.6 GHz i7 CPU, 16 GB RAM, SSD • Search Engine: • Solr 7.5, single shard • Index: 700k documents • 1000 sample vector queries, requests were single threaded • Metric – precision @10 compared to brute force • Updated results – check https://github.com/DiceTechJobs/VectorsInSearch
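The metric can be made concrete with a short sketch: compute the exact top k by brute force cosine similarity, then measure how much of an approximate top k overlaps with it. The function names and the toy data are illustrative assumptions, not the benchmark code from the repository.

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm if norm else 0.0

def brute_force_top_k(query, docs, k=10):
    """docs maps doc id -> vector; returns the ids of the k most similar docs."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)[:k]

def precision_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the approximate top k that also appears in the exact top k."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
exact = brute_force_top_k([1.0, 0.0], docs, k=2)
print(exact)
print(precision_at_k(["a", "c"], exact, k=2))
```

Averaging this value over all sample queries gives a single precision figure per algorithm and parameter setting.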
  • 39. Performance Comparison - Initial Results • Each algorithm was run over a range of different parameter values, to show the recall-speed trade-off
  • 40. Performance Comparison - Initial Results • Columns: Algorithm | Precision@10 | Queries Per Sec (Mean Qry Time) • LSH (Hamming Similarity) | 0.69 | 1.3 qps (757 ms) • Kmeans Tree (trained on index) | 0.88 | 9.2 qps (170 ms) • Kmeans Tree (trained on sample) | 0.85 | 9.5 qps (105 ms) • Vector Thresholding with Tokenization (top 40% of components) | 0.85 | 3.5 qps (312 ms) • Vector Thresholding with Payloads (top 40% of components) | 0.94 | 1.8 qps (547 ms)
  • 41. The Ultimate Solution - Sparse Coding? • Also called ‘Dictionary Learning’ • Learns a sparse ‘overcomplete’ representation of a vector • Example Algorithms: • Sparse Auto-Encoder • K-SVD • Encoding needs to preserve the Metric Space • Similar items need to remain similar after encoding • Other Relevant Approaches • Word2Bits - learns binary quantized word vectors • https://github.com/agnusmaximus/Word2Bits
  • 42. Block Max WAND • https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand • ‘Weak AND’ algorithm to be integrated into Lucene 8.0 and ES 7.0 • Speeds up large OR queries by pruning clauses that won’t occur in top N matches • Speed up can be 40% to 13x • Can help address performance of these larger OR queries
  • 43. Thank you! Github Repository: https://github.com/DiceTechJobs/VectorsInSearch Simon Hughes Chief Data Scientist, Dice.com @hughes_meister

Editor's Notes

  1. Metrics – recall often used for measuring synonymy and related problems, while precision and traditional IR metrics are better at measuring the efficacy at disambiguating a user’s intent
  2. Context – bag of words • Global – learn semantic representations of terms; address synonymy (word level); learn collocations (phrases) • Local – can be used to disambiguate ambiguous terms; address polysemy
  3. By Aelu013 [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons. For LSA illustration, and an excellent explanation, see here - http://iv.slis.indiana.edu/sw/lsa.html
  4. Word vectors - don’t learn true synonyms, so they don’t truly solve the synonymy problem, and they don’t handle polysemy, as the same vector is used for a word regardless of its context. Deep LMs capture the meaning of a sequence of words in context – not just individual words in isolation. Context – bag of words • Global – learn semantic representations of terms; address synonymy (word level); learn collocations (phrases) • Local – can be used to disambiguate ambiguous terms; address polysemy
  5. Word vectors - don’t learn true synonyms, so they don’t truly solve the synonymy problem, and they don’t handle polysemy, as the same vector is used for a word regardless of its context. Deep LMs capture the meaning of a sequence of words in context – not just individual words in isolation. Context – bag of words • Global – learn semantic representations of terms; address synonymy (word level); learn collocations (phrases) • Local – can be used to disambiguate ambiguous terms; address polysemy
  6. How do we represent dense vectors in a form that works inside an inverted index? Dense
  7. Note – it is important to do collocation (phrase) detection before building an embedding model. Embeddings work better when phrases are passed as single tokens.
  8. Excellent explanation of SimHash - Van Durme and Lall presentation, slide 15