Driving Behavioral Change for Information Management through Data-Driven Gree...
A New Concept In Search
1. weighting influenced by parent, child and sib-
A New Concept in Search ling can reduce taxonomy development and
on-going maintenance by 66%-80%.
Semantic Metadata Generation
By John Challis, CEO, CTO, Concept Searching
The metadata generation issue is increas-
ingly a growing concern in large enterprises.
With A comprehensive approach that requires more
the exponential increase in unstruc- those items that are relevant to the query. than syntactic metadata and that requires end
tured information, enterprises are seeking Recall is the retrieval of all items that are rel- users to add rich metadata is haphazard and
new ways to improve not only the search evant to the query. Yet most information subjective at best. Since the suggested
and retrieval process but to identify tools to retrieval technologies are less than 22% accu- approach is no longer restricted to keyword
rate for both precision and recall. The ideal
manage, capitalize on and leverage their in- identification, compound-term metadata can
goal is to have them balanced. Compound
formation assets to improve organizational be automatically generated either when the
term processing has the ability to increase
performance. Moving beyond keyword content is created or ingested. The generation
precision with no loss of recall.
identification and traditional taxonomy ap- of metadata based on concepts extracts com-
proaches, the use of compound term pro- pound terms and keywords from a document
cessing or identifying “concepts in context” or corpus of documents that are highly corre-
Managing Content
effectively addresses the issue of managing lated to a particular concept. By identifying
Taxonomy development and mainte-
unstructured content and enables organiza- the most significant patterns in any text, these
nance has traditionally been a laborious
tions to more effectively find, organize and compound terms can then be used to generate
and on-going challenge, not to mention
manage their information capital. non-subjective metadata based on an under-
costly. The most effective approach is to
Compound term processing automatically standing of conceptual meaning.
use rules-based categorization, providing
identifies the word patterns in unstructured Compound-term processing is a new
enterprises complete control of rules-based
text that convey the most meaning and uses approach to an old problem. Instead of
descriptors unique to their organization.
these higher order terms to improve precision identifying single keywords, compound-
Since all rules can be defined and man-
with no loss of recall. The algorithms adapt to term processing identifies multi-word
aged, error-prone results utilizing “train-
each customer’s content and they work in any terms that form a complex entity and iden-
ing” algorithms typically found in other
language regardless of vocabulary or linguis- tifies them as a concept. By forming these
approaches are eliminated.
tic style. The technology was originally compound terms and placing them in the
developed by Concept Searching in 2002 and search engine’s index, the search can be
is similar in many ways to the “phrase-based performed with a higher degree of accura-
“Precision and recall
indexing” techniques detailed in various U.S. cy because the ambiguity inherent in single
patents filed in 2004 and to which Google words is no longer a problem. As a result, a
are the two key
subsequently acquired the rights. search for “survival rates following a triple
heart bypass” will locate documents about
performance this topic even if this precise phrase is not
Keyword Search
contained in any document. A concept
versus Concept Search
measurements for search using compound-term processing
can extract the key concepts, in this case
Knowledge workers need to identify
“survival rates” and “triple heart bypass”
content in the context of what they are
information retrieval.” and use these concepts to select the most
seeking. The fundamental problem with
relevant documents.
most enterprise search solutions, and all
Compound-term processing can address
statistical search solutions, is that they are A concept-based automatic classification
many challenges facing large enterprises and
based on an index of single words. Yet process identifies, during indexing, the cate-
provide many benefits. Identification of con-
most queries are expressed in short pat- gories each document belongs to. Each cate-
cepts within a large corpus of information
terns of words and not single words in iso- gory is identified by a unique descriptor and
removes the ambiguity in search, eliminates
lation which are highly ambiguous. is associated with key descriptive words
inconsistent meta-tagging and automatic clas-
A concept search engine can isolate the and/or phrases held in the database. This
sification and taxonomy management based
key meaning that is normally expressed as approach enables a rapid implementation of
on concept identification, simplifies develop-
proper nouns, nouns phrases and verb a corporate taxonomy with all documents
ment and on-going maintenance. T
phrases. Although linguistic products can classified to multiple nodes at index time.
do this, their performance is highly vari- Ideally, the taxonomy can be used to browse John Challis has had success with several ventures
able depending upon the vocabulary and the document collection or as a filter when involving the management of unstructured data. In
language in use. A statistical-based, lan- running ad hoc searches. 1990, he founded Imagesolve International which
guage-independent concept search can An easy-to-use taxonomy and automatic became the UK’s leading supplier of document image
accept queries in natural language with the classification tool creates the framework to processing and workflow products. In 1995 he launched
user typing words, phrases or whole sen- ImageFirst Office for BancTec in the US. He was also
classify content based on concepts to one or
CTO at Smartlogik, the company behind the world’s first
tences. The system then analyzes the natu- more nodes in the taxonomy. Features that
probabilistic search engine.
ral language query to extract the keywords enable subject-matter experts to interact with
and phrases to identify the main concepts the taxonomy can simplify on-going mainte- Providing advanced search, auto-classification, taxono-
and retrieve content that is highly relevant. nance. For example: automatically generating my and semantic metadata-tagging solutions, Concept
Precision and recall are the two key per- compound term clues from the document Searching is the first and only statistical search and clas-
formance measurements for information corpus; dynamically showing the effect sification company that uses compound-term process-
of changes on the taxonomy; and class
retrieval. Precision is the retrieval of only ing to identify concepts within unstructured content.
KMWorld October 2008 S17