Stefan Geißler | Hybrid semantic document enrichment using machine learning and linguistics - The Cogito Studio
1. Hybrid semantic document enrichment using
machine learning and linguistics
Stefan Geißler, SEMANTICS, Leipzig Sept 14 2016
Expert System
2. What is this?
A graph showing the distribution of large cities in the world
Axes: size of the city (population) against the city's rank
3. What is this?
A graph showing the richest people of the world
Axes: wealth of the person against the person's rank
4. What is this?
A graph showing the most frequent words from a large text corpus
Axes: frequency of the word against the word's rank
5. Empirical evidence: Many types of data from
physics, social sciences etc. follow such a
distribution
"Zipf's law":
The number of data points (cities, rich people,
words) with a value higher than S (on the y
axis) is proportional to 1/S.
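The rank/frequency relation behind the three graphs can be sketched in a few lines of Python. This is only an illustration: the toy corpus and the helper names `rank_frequency` and `zipf_prediction` are assumptions and do not come from the talk. The sketch uses the common rank form of Zipf's law (count roughly proportional to 1/rank), which expresses the same idea as the statement above.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def zipf_prediction(top_count, rank):
    """Count expected at a given rank under an idealized 1/rank law."""
    return top_count / rank


# Hypothetical toy corpus; a real corpus needs far more text to show the pattern.
text = "the cat and the dog and the bird saw the cat"
table = rank_frequency(text.split())
for rank, word, count in table:
    predicted = zipf_prediction(table[0][2], rank)
    print(rank, word, count, round(predicted, 2))
```

On a large corpus, plotting observed count against rank on log-log axes should give a roughly straight line if the data follows such a distribution.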
8. Problem #1:
How does that fit the requirement at the start of
many categorization projects that a category will
need a decent amount of data (>100 documents)
to be trained?
Larger categories can be trained (learned
automatically); smaller ones often can't.
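A rough way to see the size of the problem is to simulate category sizes that follow a 1/rank distribution and count how many categories clear the 100-document threshold. The corpus size, the number of categories and the function names below are hypothetical choices for illustration only.

```python
def zipf_category_sizes(total_docs, n_categories):
    """Spread documents over categories with sizes proportional to 1/rank."""
    weights = [1.0 / rank for rank in range(1, n_categories + 1)]
    norm = sum(weights)
    return [int(total_docs * weight / norm) for weight in weights]


def trainable_share(sizes, min_docs=100):
    """Fraction of categories with more than min_docs documents."""
    return sum(1 for size in sizes if size > min_docs) / len(sizes)


# Hypothetical corpus: 10,000 documents spread over 500 categories.
sizes = zipf_category_sizes(total_docs=10_000, n_categories=500)
share = trainable_share(sizes)
print(f"largest category: {sizes[0]} docs, smallest: {sizes[-1]} docs")
print(f"share of categories above the threshold: {share:.1%}")
```

Under these assumptions only a small fraction of the categories has enough documents to be trained, while the long tail falls below the threshold.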
9. Problem #2:
Even for the sufficiently frequent categories: Is a
training corpus really representative?
Is "Greece" always about "debt crisis"?
Is "Ansbach" always about "terror"?
The learning method may learn unwanted associations.
10. Solution?
More data? No, because:
- The graph here is scale-free
- More data is often not available or very costly
Axes: frequency of the category against the category's rank
11. Solution:
Let the human expert refine the automatically
created model
Human document
categorization:
If ("Etna" or "Vesuvius" or
"Pinatubo") AND ("lava" or
"eruption")
Then "Volcanism"
Machine document
categorization:
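The human-written rule above can be sketched as a small Python function. This is not the formal rule language shown in the talk; the names `tokens` and `categorize` are hypothetical, and the volcano names use their English spellings.

```python
import re

# Trigger terms from the rule on the slide (English spellings).
VOLCANOES = {"etna", "vesuvius", "pinatubo"}
EVENTS = {"lava", "eruption"}


def tokens(text):
    """Lower-cased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def categorize(text):
    """Assign "Volcanism" only if a volcano name AND an event term occur."""
    words = tokens(text)
    if words & VOLCANOES and words & EVENTS:
        return ["Volcanism"]
    return []


print(categorize("Lava flows down the slopes of Etna"))
print(categorize("A quiet holiday near Etna"))
```

The rule is easy for a person to read and check, which is the property the later slides rely on when automatically learnt rules are refined by hand.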
12. This is seldom a subject in scientific work on
document categorization.
Different classification
methods are most often
compared only on the
basis of their (automatic)
performance on an
evaluation corpus
13. … but this is often a requirement in real-world
document categorization projects.
• Training corpora alone are often not enough to
attain the expected levels of quality.
• Additional data is hard to find (manual preparation or
curation is very costly)
• Existing corpora may not always be representative.
14. Our suggestion
• Use available training data to train a
model
• Make the model available in a human
readable formal language
• Allow the user to inspect and refine the model
where needed in a dedicated
development and testing environment
15. • A rich formal language (strings,
lemmas, regexps, semantic
concepts, operators …) makes it possible to
express learnt associations for
bag-of-words models
• … as well as detailed
syntactic/semantic constraints
• … and to visualize and evaluate
the result in the same application
16. • For the reasons explained above,
the statistical learning approach
may erroneously learn a rule that
the words "Athens" or "Greece"
alone justify assigning the
document to "Banking Crisis"
• The user can refine the learnt
rule, adding the further
constraint that features like
"Debt", "Schäuble" or "Troika"
are required before the category
is assigned.
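The refinement step can be illustrated with two small Python predicates, one standing in for the learnt rule and one for the user-refined version. Both functions and the sample sentences are hypothetical; they only mirror the example on this slide and are not the product's rule language.

```python
import re

PLACES = {"athens", "greece"}
CRISIS_TERMS = {"debt", "schäuble", "troika"}


def tokens(text):
    """Lower-cased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def learnt_rule(text):
    """Stand-in for the learnt rule: a place name alone triggers the category."""
    return bool(tokens(text) & PLACES)


def refined_rule(text):
    """Refined rule: a place name AND at least one crisis term are required."""
    words = tokens(text)
    return bool(words & PLACES) and bool(words & CRISIS_TERMS)


sports = "Greece wins the football match in Athens"
finance = "The Troika and Greece agree on new debt terms"
for doc in (sports, finance):
    print(doc, "| learnt:", learnt_rule(doc), "| refined:", refined_rule(doc))
```

The sports sentence shows the unwanted association: the stand-in learnt rule assigns "Banking Crisis", while the refined rule does not.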
17. … Sample projects
• <US Media company>
• Large category schema for news articles
• Task: set up solution that allows combining
automatically created rule sets with manual refinement
• <Insurance company>
• Categorize medical reports using ICD category scheme
• Go beyond quality that can be attained by using only
the manually coded training set
18. Conclusion
• Requirements in categorization projects in the industry are
sometimes not identical to the scenarios in academic
categorization benchmarks
• Available training data is sometimes limited even in the age of
big data
• Allow the seamless (one language, one development
environment) application of both learnt and manually
crafted rules
20. Expert System: Largest European provider of pure
semantic technologies
• 7 Geographies
• 250+ team members
• Listed on the AIM exchange
• Recommended by Gartner,
Forrester, IDC ...
• Experiences from hundreds
of projects
• Award winning technology:
Taxonomy / Ontology
Management, NLP,
Information extraction,
Question Answering,
Cognitive Computing
21. Global Positioning – Selected Clients
ENERGY, OIL & GAS
GOVERNMENT
FEDERAL
AGENCIES
MEDIA & PUBLISHING
Life Sciences
FINANCE