Content Processing Architecture and Applications - Introduction to Text Mining

CONTENT PROCESSING ARCHITECTURE AND APPLICATIONS
Introduction to text mining – Warsaw University of Technology

Plan

Findwise – who we are, what we do.
What is content?
Why content processing is important
Content processing and information retrieval
Technology for content processing
Methods for content processing
Examples of usage

Findwise – Search Driven Solutions
•  Founded in 2005
•  Offices in Sweden, Denmark, Norway, Poland and Australia
•  90 employees

Our objective is to be a leading provider of Findability solutions
utilising the full potential of search technology to create customer
business value.

•  Paweł Wróblewski & Marcin Goss

Content ≥ Information
From the business point of view INFORMATION is the key to
success.

“Information can only be an asset when it enables a task to be completed.”

“The value is in the outcome of the task, not in the information itself.”

Martin White

Employee productivity (The hidden cost… IDC April 2006):
“the cost for wasted time on the part of professional searching, but not
finding relevant information, amounts to $5.3 million annually for an enterprise
with 1000 knowledge workers.”

Information is hidden
Big Data is commonly described with 3V:

1.  Variety
•  Human generated vs. machine generated
•  Text & multimedia

2.  Volume
•  Up to petabytes

3.  Velocity
•  Stream of data
•  GBs per day, hour, minute, second

Information lives in the context
The right information is hidden in text.

Text forms a context:
word -> sentence -> paragraph -> chapter -> document

Content processing is about extracting required
information from the context.
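
The chain word -> sentence -> paragraph -> document can be illustrated with a minimal sketch. It assumes plain text in which blank lines separate paragraphs and sentences end with ".", "!" or "?"; the function name split_context is hypothetical, and the chapter level is left out.

```python
import re

def split_context(document: str) -> list[list[list[str]]]:
    """Split a plain-text document into paragraphs -> sentences -> words."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    structure = []
    for paragraph in paragraphs:
        # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        structure.append([re.findall(r"\w+", s) for s in sentences])
    return structure

doc = "Content is text. Text forms a context.\n\nInformation lives in the context."
structure = split_context(doc)
```

Each word keeps its position inside a sentence and a paragraph, which is the context that later stages can use.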

WHY IS CONTENT PROCESSING IMPORTANT?

Why content processing is important
To get the right information in seconds
•  Use of faceted search

To tag large document sets consistently
•  Use of an automatic extractor

To build a semantic database
•  Extraction of concepts with linkage to taxonomy/ontology

To perform document classification
•  Extraction of entities with grouping / clustering

Examples from publicly available websites [live show]

Conclusion
Content processing is a set of techniques enabling text analytics.

Content processing leverages the value of data stored in companies,
improving data consumption.

Content processing used with search engines helps find information
in any context.
•  Enterprise Findability
•  E-commerce

TECHNOLOGY FOR CONTENT PROCESSING

General architecture of search engines

Content Processing – the idea

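[Diagram: a document passes through a chain of processing stages and ends
up in the index. Stage labels: Format Conversion, Language Detection,
Spell Checking, Lemmas (tenses, forms), Synonyms, Geography, Taxonomy,
Companies, People, Entities, Classification, Vectorizer, Scopifier,
Custom PLUG-IN.]

Sample document: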
PARIS (Reuters) - Venus Williams raced into the second round of the
$11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3,
6-3, in 65 minutes.

The Wimbledon and U.S. Open champion, seeded second, breezed past the
German on a blustery center court to become the first seed to advance at
Roland Garros. "I love being here, I love the French Open and more than
anything I'd love to do well here," the American said.

Input: byte stream

Output: structured document ready to be indexed
Content Processing – the implementation
Hydra is used to refine content before it hits the index. Every
document fetched from a source runs through a targeted pipeline
made up of a number of stages. A stage can be thought of as an
“app” in the App Store or the Android Market. Findwise has created
a large number of such stages, each with one small purpose: to
enhance the content of the item. It is possible to create
additional stages to provide customer-specific functionality.

Hydra - example

Select the stages to use in the pipeline. The left column corresponds to
the “market”, and the right column lists the stages in use.

Hydra - example

Modify the format of the date to include only the year.
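
A stage of this kind can be sketched in a few lines. This is not Hydra's actual stage API or configuration; the function add_year, the field names "date" and "year" and the date format are assumptions made for illustration.

```python
from datetime import datetime

def add_year(doc, source="date", target="year", date_format="%Y-%m-%d"):
    """Copy only the year of the `source` date field into the `target` field."""
    value = doc.get(source)
    if value is None:
        return doc
    try:
        doc[target] = datetime.strptime(value, date_format).year
    except ValueError:
        pass  # leave the document unchanged if the date cannot be parsed
    return doc

item = add_year({"title": "French Open report", "date": "2012-05-28"})
unparsed = add_year({"title": "Undated note", "date": "last Monday"})
```

The original date field is kept, so the full date stays available for sorting while the year field serves filtering.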

Hydra - example

The new year metadata can be used as a facet.
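
How a metadata field becomes a facet can be sketched as counting values and filtering on a selected one. A real search engine computes this inside the index; the helper names and the sample documents here are hypothetical.

```python
from collections import Counter

def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)

def filter_by_facet(docs, field, value):
    """Keep only the documents whose field equals the selected facet value."""
    return [doc for doc in docs if doc.get(field) == value]

docs = [
    {"title": "A", "year": 2011},
    {"title": "B", "year": 2012},
    {"title": "C", "year": 2012},
    {"title": "D"},  # no year metadata, so it is absent from the facet
]
counts = facet_counts(docs, "year")
selected = filter_by_facet(docs, "year", 2012)
```

The counts are what a user sees next to each facet value; clicking a value corresponds to the filter step.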

Hydra - example

Map every author field to a metadata field called author.

Pipeline A

Pipeline B
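
A field-mapping stage might look like the sketch below. It assumes that two sources name the author differently; the source field names "creator" and "written_by" and the function map_author are hypothetical and do not come from Hydra.

```python
def map_author(doc, source_fields):
    """Move the first matching source field into a common 'author' field."""
    for name in source_fields:
        if name in doc:
            doc["author"] = doc.pop(name)
            break
    return doc

# Pipeline A and pipeline B each configure the same stage for their own source.
doc_a = map_author({"title": "Report", "creator": "Author One"}, ["creator"])
doc_b = map_author({"title": "Memo", "written_by": "Author Two"}, ["written_by"])
```

After both pipelines run, every document exposes the same author field, so a single facet or query covers all sources.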

Hydra - example

In the search result…

Hydra is Open Source
http://findwise.github.com/Hydra/

METHODS FOR CONTENT PROCESSING

Named entity recognition – statistical classifiers

•  OpenNLP (http://opennlp.apache.org/)
•  Markov chains
•  Mallet (http://mallet.cs.umass.edu/)
•  Conditional random fields

Input:
Mark has been in London since Mary dumped him.

Output:
<person>Mark</person> has been in <place>London</place>
since <person>Mary</person> dumped him.
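
The output format above can be produced from character spans, as in this sketch. No classifier is involved: the spans are hand-labelled, and the helper names span and annotate are hypothetical.

```python
def span(text, surface, label):
    """Locate the first occurrence of `surface` and return (start, end, label)."""
    start = text.index(surface)
    return (start, start + len(surface), label)

def annotate(text, entities):
    """Wrap each (start, end, label) span in <label>...</label> markup."""
    out = text
    # Work from the end of the text so earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        out = f"{out[:start]}<{label}>{out[start:end]}</{label}>{out[end:]}"
    return out

sentence = "Mark has been in London since Mary dumped him."
entities = [
    span(sentence, "Mark", "person"),
    span(sentence, "London", "place"),
    span(sentence, "Mary", "person"),
]
tagged = annotate(sentence, entities)
```

A statistical recogniser would supply the spans; the rendering step stays the same.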

Classifiers - training

•  Training set - language corpora
•  (http://nkjp.pl/) for Polish

A set of manually tagged texts in a given language, preferably from
various sources and on various topics.

Tokens    PoS tags     Name tags
He        Pronoun      O
went      Verb         O
to        Prep.        O
United    Adjective    Place
States    Noun         Place
.         Interp       O
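
Training data of this shape is often stored one token per line. The sketch below assumes a whitespace-separated, three-column layout with blank lines between sentences; this layout and the function read_tagged are assumptions, not the format of any specific corpus or tool.

```python
SAMPLE = """\
He Pronoun O
went Verb O
to Prep. O
United Adjective Place
States Noun Place
. Interp O
"""

def read_tagged(text):
    """Parse 'token PoS name-tag' lines into a list of sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, name_tag = line.split()
        current.append((token, pos, name_tag))
    if current:
        sentences.append(current)
    return sentences

sentences = read_tagged(SAMPLE)
# Tokens that belong to a name are those whose tag is not "O" (outside).
names = [token for token, _, tag in sentences[0] if tag != "O"]
```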

Classifiers – supervised training

•  Training input
•  Features extracted from each token
token: text, PoS tag, token class
prev token: text, PoS tag, token class
next token: text, PoS tag, token class
previous tags assigned

•  Token classes examples
lowercase alphabetic, digits, contains number and letter, contains number and
a hyphen, all caps, all caps with dots in between ...
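
The features and token classes listed above can be sketched as follows. The class names, the "<pad>" marker for positions outside the sentence and the choice of keeping the last two assigned tags are assumptions made for illustration; real toolkits define their own feature templates.

```python
import re

def token_class(token):
    """Assign a coarse orthographic class to a token."""
    if token.isalpha() and token.islower():
        return "lowercase"
    if token.isdigit():
        return "digits"
    if re.fullmatch(r"(?:[A-Z]\.)+[A-Z]?", token):
        return "caps_with_dots"          # e.g. U.S.
    if token.isalpha() and token.isupper():
        return "all_caps"
    has_digit = any(c.isdigit() for c in token)
    if has_digit and "-" in token:
        return "number_and_hyphen"       # e.g. 6-3
    if has_digit and any(c.isalpha() for c in token):
        return "number_and_letter"
    return "other"

def token_features(tokens, pos_tags, i, previous_tags):
    """Features for token i: itself, its neighbours, and previously assigned tags."""
    def describe(j, prefix):
        if 0 <= j < len(tokens):
            return {
                f"{prefix}text": tokens[j],
                f"{prefix}pos": pos_tags[j],
                f"{prefix}class": token_class(tokens[j]),
            }
        return {f"{prefix}text": "<pad>"}  # position outside the sentence

    features = {}
    features.update(describe(i, ""))
    features.update(describe(i - 1, "prev_"))
    features.update(describe(i + 1, "next_"))
    features["prev_tags"] = tuple(previous_tags[-2:])
    return features

tokens = ["He", "went", "to", "United", "States", "."]
pos_tags = ["Pronoun", "Verb", "Prep.", "Adjective", "Noun", "Interp"]
features = token_features(tokens, pos_tags, 3, ["O", "O", "O"])
```

One such feature dictionary per token, paired with the manually assigned name tag, forms one training example.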

•  Training output
•  <place> <location> <person>
•  <B-place> <I-place> <L-place> <U-place>
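
The second output format marks the position of a token within a name. A minimal sketch, assuming B = first token, I = inside token, L = last token and U = single-token name, with "O" for tokens outside any name; the function name bilu_tags is hypothetical.

```python
def bilu_tags(labels):
    """Convert per-token entity labels (or None) into B/I/L/U position tags.

    Note: adjacent tokens with the same label are treated as one name, so two
    separate names of the same type standing side by side would be merged.
    """
    tags = []
    last = len(labels) - 1
    for i, label in enumerate(labels):
        if label is None:
            tags.append("O")
            continue
        starts = i == 0 or labels[i - 1] != label
        ends = i == last or labels[i + 1] != label
        if starts and ends:
            tags.append(f"U-{label}")
        elif starts:
            tags.append(f"B-{label}")
        elif ends:
            tags.append(f"L-{label}")
        else:
            tags.append(f"I-{label}")
    return tags

# "He went to United States ."
multi = bilu_tags([None, None, None, "place", "place", None])
# "Mark has been in London"
single = bilu_tags(["person", None, None, None, "place"])
```

Position tags let a classifier learn where names begin and end, at the cost of a larger tag set than plain <place> / <person> labels.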

Classifiers – approaches

„Warszawskie Koło Brydżowe im. Jana Nowaka organizuje turniej w
Sheratonie”
(“The Jan Nowak Warsaw Bridge Club is organising a tournament at the
Sheraton”)

Location? Organisation name? Person name?

•  One classifier for all name-types
•  faster
•  automatically resolves conflicts

•  One classifier per name-type
•  slower, memory consuming
•  provides more information

Naive approach

People's names often overlap with location names:

- Kazimierz

- Washington

Company names may come from common language:

- Oracle

- Dialog

Conclusion: dictionaries are not enough without contextual analysis
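
The limits of the dictionary approach show up in a small lookup sketch. The dictionaries below are tiny and made up for illustration; the function dictionary_lookup is hypothetical.

```python
# Illustrative dictionaries (gazetteers); real ones would be far larger.
GAZETTEERS = {
    "person": {"Kazimierz", "Washington", "Mark", "Mary"},
    "location": {"Kazimierz", "Washington", "London"},
    "company": {"Oracle", "Dialog"},
}

def dictionary_lookup(token):
    """Return every name type whose dictionary contains the token."""
    return sorted(label for label, names in GAZETTEERS.items() if token in names)

ambiguous = dictionary_lookup("Washington")   # matches two dictionaries
unambiguous = dictionary_lookup("London")

# A common word at the start of a sentence is matched as a company name,
# because the lookup ignores the surrounding words.
sentence = "Dialog between the two sides continued".split()
false_hit = dictionary_lookup(sentence[0])
```

The lookup cannot choose between "person" and "location" for Washington and wrongly reports a company for the ordinary word "Dialog", which is why the surrounding context has to be analysed.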

Paweł Wróblewski
pawel.wroblewski@findwise.com

Marcin Goss
marcin.goss@findwise.com
