In this talk we outline some of the key challenges in text analytics, describe some of Endeca's current research work in this area, examine the current state of the text analytics market and explore some of the prospects for the future.
14.
Simplest structure: salient terms
Many years later, as he faced the firing squad, Colonel
Aureliano Buendía was to remember that distant afternoon
when his father took him to discover ice.
– García Márquez (1967)
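One common way to pick out salient terms is to weight each term by how often it appears in the document and how rare it is elsewhere (TF-IDF). The sketch below is only an illustration of that idea, not the method used in the talk; the function names and the small background collection in the usage note are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def salient_terms(doc, background, top_n=3):
    """Rank the terms of `doc` by TF-IDF against a background collection.

    `background` is a list of other documents (an assumption of this
    sketch); terms that occur in most of them score close to zero.
    """
    background_sets = [set(tokenize(d)) for d in background]
    n_docs = len(background_sets) + 1  # background plus the target document
    term_counts = Counter(tokenize(doc))
    scores = {}
    for term, count in term_counts.items():
        # Document frequency: the target document plus every background hit.
        df = 1 + sum(1 for d in background_sets if term in d)
        scores[term] = count * math.log(n_docs / df)
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]
```

Called with a sentence about the colonel and a few unrelated background sentences, the function ranks "colonel" high and pushes words such as "the" to the bottom, because they appear everywhere.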
15.
Typed entities
People, places, organizations, etc.
Simple approach: word lists.
More difficult: trained extractors (including sentiment).
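The word-list approach can be sketched in a few lines. The lists below are hypothetical and tiny; a real gazetteer would be far larger, and a trained extractor would replace the lookup entirely.

```python
import re

# Hypothetical word lists, for illustration only.
GAZETTEER = {
    "PERSON": ["Aureliano Buendía", "Clinton"],
    "PLACE": ["Indonesia"],
    "ORGANIZATION": ["ACM", "Endeca"],
}


def tag_entities(text):
    """Return (surface form, entity type) pairs in order of appearance."""
    found = []
    for entity_type, names in GAZETTEER.items():
        for name in names:
            # Whole-word, case-insensitive match of the list entry.
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), match.group(0), entity_type))
    return [(surface, entity_type) for _, surface, entity_type in sorted(found)]
```

The lookup is easy to build but cannot tell "Java" the island from "Java" the language, which is why trained extractors are listed as the harder alternative.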
19.
Case study: ACM
Excellent corpus:
Research articles.
Written by humans.
Tagged by authors.
But:
Half the articles untagged.
Tags sparse (90% of tags used once!)
Synonyms abound.
tags → controlled tag vocabulary → high-scoring salient tags
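The first two steps of that pipeline can be sketched as follows: fold synonyms into a canonical tag, then keep only tags that recur across articles. The synonym table and the `min_docs` threshold are assumptions made for illustration; the salience scoring step is not shown.

```python
from collections import Counter

# Hypothetical synonym table mapping raw author tags to a canonical form.
SYNONYMS = {
    "ir": "information retrieval",
    "info retrieval": "information retrieval",
    "nlp": "natural language processing",
}


def controlled_vocabulary(tagged_docs, min_docs=2):
    """Build a controlled tag vocabulary from per-article tag lists.

    Tags are lowercased, mapped through SYNONYMS, and kept only if they
    appear in at least `min_docs` articles, which drops one-off tags.
    """
    counts = Counter()
    for tags in tagged_docs:
        canonical = set()
        for tag in tags:
            key = tag.strip().lower()
            canonical.add(SYNONYMS.get(key, key))
        counts.update(canonical)  # each article counts a tag at most once
    return {tag for tag, n in counts.items() if n >= min_docs}
```

With 90% of tags used only once, a frequency threshold like this removes most of the raw tags, so the quality of the synonym mapping decides how much of the signal survives.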
26.
Information Scent
The human brain is great at extracting information scent:
[word, word, word, …] → meaning
[island, Indonesia] [code, Sun] [coffee, beans, brew] → Java
27.
Vector model
– Salton (1983)
Similarity between documents = cosine of the angle between their vectors
Can also rotate the basis for a better representation: latent semantic indexing (LSI)
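A minimal sketch of the cosine measure, assuming each document is represented by raw term counts over whitespace-separated tokens (real systems usually add term weighting and better tokenization):

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-count vectors of two documents."""
    vec_a = Counter(doc_a.lower().split())
    vec_b = Counter(doc_b.lower().split())
    # Only terms shared by both documents contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction
    return dot / (norm_a * norm_b)
```

Identical documents score 1, documents with no terms in common score 0, and "java coffee beans" against "java code sun" scores 1/3 because only one of three equally weighted terms is shared.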
32.
Sentence structure parsing
It is said Mrs. Clinton promises new jobs will be created by her.
N V V N N V A N V V V N
part of speech tagging
noun / verb phrase extraction
sentence structure analysis
anaphora resolution
passive voice flipping
triple filtering
hierarchy generation
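Two of the later steps, anaphora resolution and passive flipping, can be sketched on their own if we assume an upstream parser has already produced subject / verb / object triples with lemmatized verbs. The `Triple` type and the antecedent table are hypothetical, introduced only for this illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A hypothetical parser output: who did what to whom."""
    subject: str
    verb: str
    obj: str
    passive: bool = False


def resolve_anaphora(triple, antecedents):
    """Replace pronouns with their antecedents, given a lookup table."""
    def swap(phrase):
        return antecedents.get(phrase.lower(), phrase)
    return triple._replace(subject=swap(triple.subject), obj=swap(triple.obj))


def flip_passive(triple):
    """Rewrite a passive triple so that the agent becomes the subject."""
    if not triple.passive:
        return triple
    return Triple(triple.obj, triple.verb, triple.subject, passive=False)
```

For the example sentence, the clause "new jobs will be created by her" would arrive as `Triple("new jobs", "create", "her", passive=True)`; resolving "her" to "Mrs. Clinton" and flipping gives the active triple (Mrs. Clinton, create, new jobs).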
33.
Hierarchy generation (also semantic network!)
Nouns by head noun:
[Mrs. + Hillary + Bill + President]
→ Clinton
Verbs by hypernyms (broadening synonyms):
[say + tell + propose + suggest + declare]
→ express
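Both groupings can be sketched with simple lookups. The sketch assumes that the head noun of an English noun phrase is its last token, and it uses a hand-written hypernym table taken from the example above; a real system would draw hypernyms from a lexical resource.

```python
from collections import defaultdict

# Hypothetical hypernym table built from the example verbs above.
HYPERNYMS = {
    "say": "express",
    "tell": "express",
    "propose": "express",
    "suggest": "express",
    "declare": "express",
}


def group_by_head_noun(phrases):
    """Group noun phrases under their head noun (assumed to be the last token)."""
    groups = defaultdict(list)
    for phrase in phrases:
        head = phrase.split()[-1]
        groups[head].append(phrase)
    return dict(groups)


def generalize_verb(verb):
    """Map a verb to its broader hypernym, or return it unchanged."""
    key = verb.lower()
    return HYPERNYMS.get(key, key)
```

Grouping "Mrs. Clinton", "Hillary Clinton", "Bill Clinton", and "President Clinton" puts all four under "Clinton", which gives one level of the hierarchy; the verb table gives the corresponding level for actions.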
41.
Conclusions
What do we expect in the future?
Extraction leads to generation
Summarization
Generalization
Narratives
Inference and conflict resolution
We are all interested in the future, for that is where you and I
are going to spend the rest of our lives. And remember, my
friend, future events such as these will affect you in the future.
– Edward Wood Jr. (1957)
42.
Text analytics: what does it mean?
Unstructured text isn't unstructured. There's always structure.
Find the information scent. Let the users follow it.
Don’t trust that one query is enough. Let the users interact.
Text does not matter. Meaning does.