2. Plan
Findwise – who we are, what we do.
What is content?
Why content processing is important
Content processing and information retrieval
Technology for content processing
Methods for content processing
Examples of usage
3. Findwise – Search Driven Solutions
• Founded in 2005
• Offices in Sweden, Denmark, Norway, Poland and Australia
• 90 employees
Our objective is to be a leading provider of Findability solutions utilising the
full potential of search technology to create customer business value.
• Paweł Wróblewski & Marcin Goss
5. Content ≥ Information
From the business point of view INFORMATION is the key to
success.
“Information can only be an asset when it enables a task to be completed.”
“The value is in the outcome of the task, not in the information itself.”
Martin White
Employee productivity (The hidden cost… IDC April 2006):
“the cost for wasted time on the part of professional searching, but not
finding relevant information, amounts to $5.3 million annually for an enterprise
with 1000 knowledge workers.”
6. Information is hidden
Big Data is commonly described with 3V:
1. Variety
Human generated vs. Machine generated
Text & Multimedia
2. Volume
Up to Petabytes
3. Velocity
Stream of data
GBs per day, hour, minute, second
7. Information lives in the context
The right Information is hidden in text.
Text forms a context:
word -> sentence -> paragraph -> chapter -> document
Content processing is about extracting required
information from the context.
9. Why content processing is important
To get the right information in seconds
• Usage of faceted search
To tag large document sets consistently
• Usage of an automatic extractor
To build a semantic database
• Extraction of concepts with linkage to taxonomy/ontology
To perform document classification
• Extraction of entities with grouping / clustering
Examples from publicly available websites [live show]
10. Conclusion
Content processing is a set of techniques enabling text analytics.
Content processing leverages the value of data stored in companies,
improving data consumption.
Content processing used with search engines helps find information
in any context.
• Enterprise Findability
• E-commerce
13. Content Processing – the idea
Stages shown in the diagram, feeding the index:
• Format Conversion
• Language Detection
• Spell Checking
• Lemmas (tenses, forms)
• Synonyms
• Document Vectorizer
• Entities: Geography, Companies, People
• Taxonomy Classification
• Custom PLUG-IN
• Scopifier
PARIS (Reuters) - Venus Williams raced into the second round of the $11.25
million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65
minutes. The Wimbledon and U.S. Open champion, seeded second, breezed past the
German on a blustery center court to become the first seed to advance at Roland
Garros. "I love being here, I love the French Open and more than anything I'd
love to do well here," the American said.
Input: byte stream
Output: structured document ready to be indexed
14. Content Processing – the implementation
Hydra refines content before it reaches the index. Every document fetched
from a source runs through a targeted pipeline made up of a number of
stages. A stage can be thought of as an “app” in the App Store or the
Android Market. Findwise has created a large number of such stages, each
with one small purpose: to enhance the content of the item. Additional
stages can be created to serve customer-specific functionality.
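The pipeline idea can be illustrated with a minimal Python sketch. This is not Hydra's actual API: the `Document` and `Stage` types, `run_pipeline`, and the two example stages are hypothetical names chosen for illustration, and a document is assumed to be a plain dict of fields.

```python
from typing import Callable, Dict, List

# Assumption: a document is a plain dict of field name -> value.
Document = Dict[str, object]
# A stage is any function that takes a document and returns an enriched copy.
Stage = Callable[[Document], Document]


def lowercase_title(doc: Document) -> Document:
    """Hypothetical stage: add a normalised copy of the title."""
    out = dict(doc)
    out["title_normalized"] = str(doc.get("title", "")).lower()
    return out


def count_words(doc: Document) -> Document:
    """Hypothetical stage: add a word count for the body text."""
    out = dict(doc)
    out["word_count"] = len(str(doc.get("body", "")).split())
    return out


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Pass the document through every stage, in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


raw = {"title": "French Open", "body": "Venus Williams raced into the second round"}
processed = run_pipeline(raw, [lowercase_title, count_words])
```

Because every stage has the same shape, adding a customer-specific stage only means writing one more function and appending it to the list.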
15. Hydra - example
Select stages to use in the pipeline. The left column corresponds to the
“market”, and the right column lists the stages used.
16. Hydra - example
Modify the format of the date to include only the year.
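A stage of this kind might look roughly like the sketch below. The field names `date` and `year`, the input format `%Y-%m-%d`, and the function name `add_year` are assumptions made for the example. They are not taken from the Hydra configuration shown on the slide.

```python
from datetime import datetime


def add_year(doc: dict, date_field: str = "date", fmt: str = "%Y-%m-%d") -> dict:
    """Hypothetical stage: derive a 'year' field from a full date string."""
    out = dict(doc)
    try:
        out["year"] = datetime.strptime(str(doc[date_field]), fmt).year
    except (KeyError, ValueError):
        # Missing or unparsable date: leave the document unchanged.
        pass
    return out


doc = add_year({"title": "Report", "date": "2013-05-27"})
```

Documents without a usable date pass through untouched, so the stage is safe to run on mixed sources.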
17. Hydra - example
The new year metadata can be used as a facet.
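To show what a facet on such a field amounts to, here is a small sketch that counts documents per year and then filters on one value. A search engine computes this at query time. The in-memory list of dicts and the names `facet_counts` and `year_facet` are only illustrative assumptions.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


docs = [
    {"title": "A", "year": 2012},
    {"title": "B", "year": 2013},
    {"title": "C", "year": 2013},
    {"title": "D"},  # no year metadata: not counted in the facet
]

year_facet = facet_counts(docs, "year")
# Selecting a facet value narrows the result set to matching documents.
filtered = [d for d in docs if d.get("year") == 2013]
```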
18. Hydra - example
Map every author field to a metadata field called author.
Pipeline A
Pipeline B
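A field-mapping stage could be sketched as follows, assuming that different sources name the author field differently. The alias list and the function name `map_author` are hypothetical; real sources would define their own field names.

```python
# Hypothetical source field names that all mean "author".
AUTHOR_ALIASES = ("Author", "creator", "dc:creator", "written_by")


def map_author(doc: dict, aliases=AUTHOR_ALIASES, target: str = "author") -> dict:
    """Copy the first matching alias into a single 'author' field."""
    # Drop the source-specific field names from the output document.
    out = {k: v for k, v in doc.items() if k not in aliases}
    for alias in aliases:
        # Do not overwrite an author value that is already present.
        if alias in doc and target not in out:
            out[target] = doc[alias]
    return out


from_source_a = map_author({"creator": "Jan Nowak", "title": "T"})
from_source_b = map_author({"dc:creator": "Jan Nowak", "title": "T"})
```

After the mapping, both documents expose the same `author` field, so a single facet or query field covers every source.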
22. Named entity recognition – statistical classifiers
• OpenNLP (http://opennlp.apache.org/)
• Markov chains
• Mallet (http://mallet.cs.umass.edu/)
• Conditional random fields
Input:
Mark has been in London since Mary dumped him.
Output:
<person>Mark</person> has been in <place>London</place>
since <person>Mary</person> dumped him.
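The output format above can be produced from classifier results with a few lines of code. The sketch assumes the classifier returns non-overlapping character spans as `(start, end, label)` tuples. That representation and the function name `annotate` are assumptions, and the classifier itself (OpenNLP, Mallet) is outside the example.

```python
def annotate(text, entities):
    """Wrap each (start, end, label) character span in <label>...</label> tags."""
    parts = []
    cursor = 0
    for start, end, label in sorted(entities):
        parts.append(text[cursor:start])                       # text before the entity
        parts.append(f"<{label}>{text[start:end]}</{label}>")  # tagged entity
        cursor = end
    parts.append(text[cursor:])                                # remaining text
    return "".join(parts)


text = "Mark has been in London since Mary dumped him."
# Spans as an assumed classifier might return them (character offsets).
entities = [(0, 4, "person"), (17, 23, "place"), (30, 34, "person")]
tagged = annotate(text, entities)
```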
23. Classifiers - training
• Training set - language corpora
• (http://nkjp.pl/) for Polish
A set of manually tagged texts in a given language, preferably from various
sources and on various topics.
Tokens    PoS tags    Name tags
He        Pronoun     O
went      Verb        O
to        Prep.       O
United    Adjective   Place
States    Noun        Place
.         Interp      O
24. Classifiers – supervised training
• Training input
• Features extracted from each token
token: text, PoS tag, token class
prev token: text, PoS tag, token class
next token: text, PoS tag, token class
previous tags assigned
• Token classes examples
lowercase alphabetic, digits, contains a number and a letter, contains a number
and a hyphen, all caps, all caps with dots in between ...
• Training output
• <place> <location> <person>
• <B-place> <I-place> <L-place> <U-place>
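The training input and output listed above can be sketched in Python. The code is an illustrative assumption and not the feature set of OpenNLP or Mallet: `token_class` covers only some of the token classes, `token_features` builds the current/previous/next token features, and `to_bilou` turns plain per-token labels into the B-/I-/L-/U- scheme.

```python
import re
from typing import Dict, List


def token_class(token: str) -> str:
    """Coarse orthographic class of a token (a subset of the classes listed)."""
    if re.fullmatch(r"(?:[A-Z]\.)+[A-Z]?", token):
        return "all_caps_with_dots"          # e.g. U.S.
    if token.isalpha() and token.isupper():
        return "all_caps"
    if token.isalpha() and token.islower():
        return "lowercase_alphabetic"
    if token.isdigit():
        return "digits"
    has_digit = any(c.isdigit() for c in token)
    if has_digit and "-" in token:
        return "number_and_hyphen"
    if has_digit and any(c.isalpha() for c in token):
        return "number_and_letter"
    return "other"


def token_features(tokens: List[str], pos_tags: List[str], i: int,
                   prev_labels: List[str]) -> Dict[str, str]:
    """Features for token i: text, PoS tag and class of the token and its neighbours."""
    feats: Dict[str, str] = {}
    for prefix, j in (("cur", i), ("prev", i - 1), ("next", i + 1)):
        if 0 <= j < len(tokens):
            feats[f"{prefix}_text"] = tokens[j]
            feats[f"{prefix}_pos"] = pos_tags[j]
            feats[f"{prefix}_class"] = token_class(tokens[j])
        else:
            feats[f"{prefix}_text"] = "<pad>"  # sentence boundary marker
    feats["prev_labels"] = " ".join(prev_labels[-2:])  # tags assigned so far
    return feats


def to_bilou(labels: List[str]) -> List[str]:
    """Convert labels like ["O", "place", "place"] to B-/I-/L-/U- tags.

    Simplification: adjacent tokens with the same label are treated as one name.
    """
    out = []
    for i, label in enumerate(labels):
        if label == "O":
            out.append("O")
            continue
        starts = i == 0 or labels[i - 1] != label
        ends = i == len(labels) - 1 or labels[i + 1] != label
        prefix = "U" if starts and ends else "B" if starts else "L" if ends else "I"
        out.append(f"{prefix}-{label}")
    return out


tokens = ["He", "went", "to", "United", "States", "."]
pos_tags = ["Pronoun", "Verb", "Prep.", "Adjective", "Noun", "Interp"]
labels = ["O", "O", "O", "place", "place", "O"]
bilou = to_bilou(labels)
feats = token_features(tokens, pos_tags, 3, labels[:3])
```

The B-/I-/L-/U- tags mark the beginning, inside and last token of a multi-token name, or a single-token (unit) name, which lets a classifier learn name boundaries as well as name types.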
25. Classifiers – approaches
„Warszawskie Koło Brydżowe im. Jana Nowaka organizuje turniej w
Sheratonie”
(“The Jan Nowak Warsaw Bridge Club is organising a tournament at the Sheraton”)
Location? Organisation name? Person name?
• One classifier for all name-types
• faster
• automatically resolves conflicts
• One classifier per name-type
• slower, memory consuming
• provides more information
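With one classifier per name-type, overlapping candidates have to be reconciled in a separate step. The sketch below shows one possible way to do that, a greedy choice of the highest-scoring non-overlapping spans. The greedy strategy, the `Span` type and all score values are assumptions invented for illustration and are not results from any real classifier.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int    # index of first token, inclusive
    end: int      # index after last token, exclusive
    label: str
    score: float  # hypothetical classifier confidence


def resolve_conflicts(candidates: List[Span]) -> List[Span]:
    """Greedily keep the highest-scoring spans that do not overlap."""
    chosen: List[Span] = []
    for cand in sorted(candidates, key=lambda s: s.score, reverse=True):
        if all(cand.end <= kept.start or cand.start >= kept.end for kept in chosen):
            chosen.append(cand)
    return sorted(chosen, key=lambda s: s.start)


tokens = ["Warszawskie", "Koło", "Brydżowe", "im.", "Jana", "Nowaka",
          "organizuje", "turniej", "w", "Sheratonie"]
# Made-up outputs of three separate per-type classifiers.
candidates = [
    Span(0, 6, "organisation", 0.8),  # Warszawskie ... Nowaka
    Span(4, 6, "person", 0.7),        # Jana Nowaka
    Span(0, 1, "location", 0.4),      # Warszawskie
    Span(9, 10, "location", 0.6),     # Sheratonie
]
resolved = resolve_conflicts(candidates)
```

The unresolved candidate list is the "more information" the per-type approach provides: the person name inside the organisation name is still available to an application that wants it.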
27. Naive approach
People's names often overlap with location names:
- Kazimierz
- Washington
Company names may come from common language:
- Oracle
- Dialog
Conclusion: dictionaries are not enough
without contextual analysis
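The limitation is easy to reproduce with a plain dictionary lookup. In the sketch below the gazetteer holds only the names from this slide, and `dictionary_tags` is a hypothetical helper. It returns every label whose dictionary contains the token and makes no use of context.

```python
# Tiny gazetteer built from the names on this slide (illustrative only).
GAZETTEER = {
    "person": {"Kazimierz", "Washington"},
    "place": {"Kazimierz", "Washington"},
    "company": {"Oracle", "Dialog"},
}


def dictionary_tags(tokens):
    """Return, for each known token, all labels whose dictionary contains it."""
    result = {}
    for token in tokens:
        labels = sorted(label for label, names in GAZETTEER.items() if token in names)
        if labels:
            result[token] = labels
    return result


ambiguous = dictionary_tags("Washington visited Kazimierz".split())
false_hit = dictionary_tags("Dialog is important in every team".split())
```

In the first sentence the lookup cannot decide between person and place for either name. In the second it tags the ordinary word "Dialog" as a company. Both cases need the surrounding context to be resolved.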