Presented by Christoph Goller, Chief Scientist, IntraFind Software AG
If you want to search in a multilingual environment with high-quality, language-specific word normalization, if you want to handle mixed-language documents, if you want to add phonetic search for names, or if you need a semantic search that distinguishes between a search for the color "brown" and a person with the surname "brown": in all these cases you have to deal with different types of terms. I will show why it makes much more sense to attach types (prefixes) to Lucene terms instead of relying on different fields or even different indexes for the different kinds of terms. Furthermore, I will show what queries against such a typed index look like and why, e.g., SpanQueries are needed to correctly treat compound words and phrases or to realize a reasonable phonetic search. The Analyzers and the QueryParser described are available as plugins for Lucene, Solr, and Elasticsearch.
1. The Typed Index
2. THE TYPED INDEX
Christoph Goller
christoph.goller@intrafind.de
Chief Scientist at IntraFind Software AG
3. Outline
• IntraFind Software AG
• Analyzers, Inverted File Index
• Different Types of Terms
• Why do we need them in one field?
• The Typed Index
• Multilingual Search / Mixed-Language Documents
5. IntraFind Software AG
• Specialist for Information Retrieval and Enterprise Search
• Company founded: October 2000
• More than 850 customers, mainly in Germany, Austria, and Switzerland
• Employees: 30
• Lucene Committers: B. Messer, C. Goller
• Independent software vendor, entirely self-financed
• Products are a combination of open-source components and in-house development
• Support (up to 7x24), services, training
• Focus on quality / text analytics / SOA architecture
  – Linguistic analyzers for most European languages
  – Semantic search
  – Named entity recognition
  – Text classification
  – Clustering
11. Morphological Analyzer vs. Stemming
• Lemmatizer: maps words to their base forms
  English:
  going -> go (Verb)
  bought -> buy (Verb)
  bags -> bag (Noun)
  bacteria -> bacterium (Noun)
  German:
  lief -> laufen (Verb)
  rannte -> rennen (Verb)
  Bücher -> Buch (Noun)
  Taschen -> Tasche (Noun)
• Decomposer: decomposes words into their compounds
  Kinderbuch (children's book) -> Kind (Noun) | Buch (Noun)
  Versicherungsvertrag (insurance contract) -> Versicherung (Noun) | Vertrag (Noun)
• Stemmer: usually a simple algorithm (huge collection of stemmers available in the Lucene contributions)
  going -> go
  decoder, decoding, decodes -> decod
  Overstemming: Messer -> mess? king -> k? several, server -> server?
  Understemming: spoke and speak are not conflated
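The over- and understemming effects above can be reproduced with a deliberately naive suffix-stripping stemmer. This is an illustrative Python sketch, not Lucene's actual Porter stemmer; `naive_stem` and its suffix list are invented for the demonstration.

```python
# Toy suffix-stripping stemmer. Real stemmers (e.g. Lucene's PorterStemFilter)
# use much more elaborate rule sets, but show the same failure modes.
def naive_stem(word):
    """Strip the first matching suffix; no linguistic knowledge involved."""
    word = word.lower()
    for suffix in ("ing", "es", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

# Overstemming: unrelated words collapse onto the same stem.
print(naive_stem("Messer"))   # 'mess' (German "knife" mangled into English "mess")
print(naive_stem("king"))     # 'k'
# Understemming: related forms do NOT conflate.
print(naive_stem("spoke"), naive_stem("speak"))  # 'spoke speak'
```

A lemmatizer avoids both problems by looking words up in a language-specific lexicon instead of chopping suffixes.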
16. Why do we need other Normalizations?
• Stemmers / lemmatizers are language-specific
• MultiTermQueries: WildcardQuery, FuzzyQuery
  – no stemming, no lemmatization
  – should work on the original terms generated by the Tokenizer
  – only very simple normalizations such as: Citroën -> Citroen
  – in Solr: <analyzer type="multiterm">
• Case-sensitive search
  – stemmers / lemmatizers map everything to lowercase
  – sometimes case matters: MAN vs. man
• Phonetic search (Double Metaphone):
  – Mazlum -> MSLM; Muslim -> MSLM
  – book -> PK; books -> PKS
  – Kaouther Tabai -> K0R TP, Kouther Tapei -> K0R TP
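The idea behind a phonetic normal form can be sketched with a crude "consonant skeleton" key. This is not Double Metaphone (which Lucene provides via its phonetic analysis module); `toy_phonetic_key` is invented here just to show how differently spelled names can collapse onto one indexable key.

```python
def toy_phonetic_key(word):
    # Crude phonetic key: uppercase, fold Z onto S, drop vowels,
    # collapse runs of the same letter. NOT real Double Metaphone.
    word = word.upper().replace("Z", "S")
    key = []
    for c in word:
        if c.isalpha() and c not in "AEIOUY":
            if not key or key[-1] != c:
                key.append(c)
    return "".join(key)

print(toy_phonetic_key("Mazlum"))  # 'MSLM'
print(toy_phonetic_key("Muslim"))  # 'MSLM'
```

Both spellings map to the same key, so a phonetic term (the F_ type below) makes them findable with one lookup.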
17. Named Entity Recognition (NER)
Automated extraction of information from unstructured data:
• People's names
• Company names
• Brands from product lists
• Technical key figures from technical data (raw materials, product types, order IDs, process numbers, eClass categories)
• Names of streets and locations
• Currency and accounting values
• Dates
• Phone numbers, email addresses, hyperlinks
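For pattern-like entities (dates, email addresses, phone numbers), extraction can be sketched with regular expressions that emit typed terms. This is a hypothetical toy; the patterns, the `E_` type names, and `extract_entities` are invented for illustration, and real NER for names and brands needs lists or statistical models.

```python
import re

# Hypothetical patterns for pattern-like entity types.
PATTERNS = {
    "E_Email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "E_Date":  re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "E_Phone": re.compile(r"\+\d[\d /-]{6,}\d"),
}

def extract_entities(text):
    """Return typed entity terms of the form <type>_<surface form>."""
    found = []
    for etype, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(f"{etype}_{match.group()}")
    return found

print(extract_entities("Mail christoph.goller@intrafind.de by 2014-06-03"))
```

The typed terms produced here can be indexed alongside the ordinary word terms, which is exactly what the typed index below exploits.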
18. Why do we need these different types of terms in one field?
19. Why do we need them in one field?
• Query: "MAN sagt" ("MAN says"): PhraseQuery / NearQuery!
  Matching document: "MAN sagte", not "man sagte"
• Query: "book of Kouther Tapei": PhraseQuery / NearQuery!
  Matching document: "books of Kaouther Tabai"
  – For book to match books we need a stemmer or a lemmatizer
  – For the names to match we need phonetics
• Query: Mazlum
  – It leads to matches for the very frequent word Muslim
  – Users want: give me phonetic matches for Mazlum but not for Muslim
  – Mazlum=P AND NOT Muslim=E doesn't do the job!
    • No match for "Mazlum is a member of the Muslim society in Munich"
  – What works: spanNot(spanOr([body:V_mazlum, body:F_MSLM]), body:V_muslim)
  – New syntax: <Mazlum=P BUTNOT Muslim=E>
• Query: persons near synonyms of founding and Microsoft
  "E_Person found Microsoft": PhraseQuery / NearQuery
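The difference between document-level AND NOT and span-level spanNot can be simulated on token positions. A minimal Python sketch, reusing the toy phonetic key from above (`phonetic_key` and `span_not_phonetic` are invented names; the real query is Lucene's SpanNotQuery over typed terms):

```python
def phonetic_key(word):
    # Stand-in for Double Metaphone: Mazlum and Muslim share a key.
    word = word.upper().replace("Z", "S")
    return "".join(c for c in word if c.isalpha() and c not in "AEIOUY")

def span_not_phonetic(tokens, include_word, exclude_word):
    """Positions matching include_word phonetically, minus positions
    occupied by the exact exclude_word. A document-level AND NOT
    would instead reject the whole document."""
    key = phonetic_key(include_word)
    return [
        pos for pos, tok in enumerate(tokens)
        if phonetic_key(tok) == key and tok.lower() != exclude_word.lower()
    ]

doc = "Mazlum is a member of the Muslim society in Munich".split()
print(span_not_phonetic(doc, "Mazlum", "Muslim"))  # [0]
```

The document still matches, at the position of "Mazlum" only; the occurrence of "Muslim" at another position no longer suppresses it.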
23. The Typed Index
• We need different types of terms in one field
• Types are term properties: payloads are not a good option
• Use prefixes to distinguish them:
  – V_ for fullforms (case-sensitive)
  – N_ for diacritics normalizations
  – F_ for phonetic normal forms
  – E_ for entities
    • E_Person, E_Location, E_Organization
    • E_PersonName_Brown, E_Location_Munich
  – B_ for baseforms: B_Noun_book, B_Verb_fly, …
• Multilingual search is handled in the same way:
  B_EN_NOUN_book, B_DE_NOUN_buch
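A typed analyzer emits several prefixed variants of each token into the same field. A minimal sketch, assuming the prefix scheme above; `typed_terms` is an invented name, the B_ term here omits the part-of-speech tag a real lemmatizer would supply, and the actual plugins are Java TokenFilters.

```python
import unicodedata

def typed_terms(token):
    """Emit the typed variants of one token, all for a single field:
    V_ fullform (case-sensitive), N_ diacritics-normalized,
    B_ baseform (toy: lowercase, no POS tag)."""
    terms = [f"V_{token}"]
    ascii_form = (
        unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode()
    )
    if ascii_form != token:
        terms.append(f"N_{ascii_form}")
    terms.append(f"B_{token.lower()}")
    return terms

print(typed_terms("Citroën"))  # ['V_Citroën', 'N_Citroen', 'B_citroën']
```

Because all variants share the token's position, PhraseQueries and SpanQueries across different term types keep working unchanged.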
24. Multilingual Search: Standard Approach
Generate a language-specific copy of every content field:
– Configure language-specific analyzers for the language-specific fields
– Indexing: adapt the indexing chain to determine the document language and generate the new language-specific fields
– Search: use a MultiFieldQueryParser to expand the query to every language-specific field
– Highlighting: depending on the document language, call the Highlighter for the language-specific fields with the respective analyzer
– No solution for mixed-language documents
25. Multilingual Search and the Typed Index
Choose the analyzer depending on the language, but do not use different fields:
– Analyzers generate terms typed with the language: B_EN_NOUN_book, B_DE_NOUN_buch
– Indexing: choose the analyzer in the indexing chain based on the language
– Search: use a special MultiAnalyzerQueryParser to expand the query to every language
– Highlighting: choose the analyzer based on the language and apply it to the content field
– Advantage: you can implement a multi-language analyzer for handling mixed-language documents, one that switches language even within paragraphs
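Query-side expansion across languages can be sketched with per-language lemma tables standing in for the real lemmatizers. A toy illustration, assuming the B_&lt;LANG&gt;_&lt;POS&gt;_&lt;lemma&gt; naming from the previous slide; `LEMMAS` and `expand_query` are invented for the example.

```python
# Toy lemma tables standing in for real per-language lemmatizers.
LEMMAS = {
    "EN": {"books": ("NOUN", "book"), "book": ("NOUN", "book")},
    "DE": {"bücher": ("NOUN", "buch"), "buch": ("NOUN", "buch")},
}

def expand_query(word):
    """Expand one query word into typed baseform terms for every
    language, all targeting the same field (a disjunction at query time)."""
    terms = []
    for lang, table in LEMMAS.items():
        entry = table.get(word.lower())
        if entry:
            pos, lemma = entry
            terms.append(f"B_{lang}_{pos}_{lemma}")
    return terms

print(expand_query("books"))  # ['B_EN_NOUN_book']
print(expand_query("Buch"))   # ['B_DE_NOUN_buch']
```

Since every language writes into the same field, no MultiFieldQueryParser and no per-language field copies are needed.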
26. Summary: Advantages of the Typed Index over a Multi-Field Index
• Positions stay aligned across term types with less effort
• Content is tokenized only once: performance!
• Existing queries such as PhraseQuery and MultiPhraseQuery can be reused
• Mixed-language documents can be handled: use lemmatizer results to switch between languages
27. Thanks for listening
Questions?
By the way: our Analyzers are available as plugins for Lucene / Solr / Elasticsearch.

Dr. Christoph Goller
Phone: +49 89 3090446-0
Fax: +49 89 3090446-29
Email: christoph.goller@intrafind.de
Web: www.intrafind.de

IntraFind Software AG
Landsberger Straße 368
80687 München
Germany