This document provides an overview of leveraging taxonomy management with machine learning. It discusses how semantic technologies and machine learning can complement each other to build cognitive applications. It also discusses how PoolParty, a semantic suite, can be used to perform tasks like corpus analysis, concept and shadow concept extraction, text classification, and improving recommender systems by utilizing knowledge graphs and machine learning algorithms. Real-world use cases are also presented, such as how The Knot uses these techniques for content recommendation.
Leveraging Taxonomy Management with Machine Learning
1. Andreas Blumauer
CEO & Managing Partner
Semantic Web Company /
PoolParty Semantic Suite
Taxonomy Boot Camp 2017
Washington, DC
2. INTRODUCTION
Diagram (knowledge graph about the speaker). Nodes and facts shown: Andreas Blumauer (founder & CEO of Semantic Web Company); Semantic Web Company (founded 2004, located in Vienna, developer and vendor of PoolParty Semantic Suite); current version 6.0; serves >200 customers; Enterprise Knowledge Graphs; Taxonomies; Ontologies; Text Mining.
3. Agenda
▸ Cognitive Computing:
Semantic Technologies & Machine Learning
▸ Terms, Concepts, Shadow Concepts
▸ Corpus Analysis & (Shadow) Concept Extraction with PoolParty
▸ A comparison with LSA and Word2Vec
▸ Use Cases
▹ Document Annotation & Indexing
▹ Text Classification (incl. Benchmarks)
▹ Recommender Systems (incl. Use Case)
5. A key assumption of this talk
People do not search for documents only; they seek facts about things and smaller chunks of information. Machines shall help to create links across data silos to give answers to questions.
Converging A.I.
Technologies
6. A quick question at the beginning
Will Artificial Intelligence make Subject Matter Experts obsolete?
Imagine you want to build an application that helps to identify patient-treatment pairings. Which would you prefer: applications based solely on machine learning, those based only on doctors' knowledge, or a combination of both?
10. Towards a Digital Twin
Proposal for a Cognitive Computing Platform Architecture
Architecture diagram, components shown: Unstructured Data, Structured Data, Knowledge Graphs, Machine Learning, Semantic Layer, IoT & Cognitive Applications.
12. Terms and co-occurrence models
Document Corpus
- Websites
- PDF, Word, …
- Abstracts from DBpedia
- RSS Feeds
Extracted from the corpus
- Relevant terms and phrases
- Relevancy of terms
- Co-occurrence between terms and terms
Diagram: network of co-occurring terms (Term 1 to Term 10).
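The corpus analysis outlined on this slide can be illustrated with a few lines of Python. This is a minimal sketch, not PoolParty's implementation: the three sample documents, the stop-word list and the function names are assumptions made up for illustration, relevancy is approximated by plain document frequency, and co-occurrence is counted at document level.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical mini corpus; a real corpus would come from websites, PDFs, RSS feeds, etc.
CORPUS = [
    "The stove is on. The stove is hot!",
    "A hot stove is a dangerous device.",
    "The device is switched on.",
]
STOP_WORDS = {"the", "is", "a", "on"}  # assumption: tiny hand-made stop-word list


def tokenize(text):
    """Lower-case the text and return the set of non-stop-word tokens."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS}


def analyze(corpus):
    """Return (document frequency per term, co-occurrence count per term pair)."""
    doc_freq = Counter()
    cooc = Counter()
    for doc in corpus:
        terms = sorted(tokenize(doc))
        doc_freq.update(terms)
        # Count each unordered pair of terms once per document.
        cooc.update(combinations(terms, 2))
    return doc_freq, cooc


doc_freq, cooc = analyze(CORPUS)
print(doc_freq.most_common(3))   # most frequent terms as a crude relevancy ranking
print(cooc[("hot", "stove")])    # prints 2: both terms share two documents
```

Pairs are stored in alphabetical order, so a lookup such as `cooc[("hot", "stove")]` has to use the same order.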
13. ‘Things’ but not Strings: Using a ‘Semantic Knowledge Graph’
Diagram, concepts shown:
- http://www.my.com/taxonomy/62346723: prefLabel Retina; image http://www.my.com/images/90546089
- http://www.my.com/taxonomy/97345854: prefLabel Funduscope; altLabel Ophthalmoscope
- http://www.mycom.com/taxonomy/4543567: prefLabel Diagnostic Equipment
- Relation shown between concepts: has broader
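The idea of ‘things, not strings’ can be sketched as a small set of triples in which several labels resolve to one concept URI. The URIs and labels below are taken from the slide; the triple representation, the helper names and the assumption that Funduscope has the broader concept Diagnostic Equipment are illustrative only.

```python
# Minimal in-memory knowledge graph: (subject, predicate, object) triples.
RETINA = "http://www.my.com/taxonomy/62346723"
FUNDUSCOPE = "http://www.my.com/taxonomy/97345854"
EQUIPMENT = "http://www.mycom.com/taxonomy/4543567"

TRIPLES = [
    (RETINA, "prefLabel", "Retina"),
    (RETINA, "image", "http://www.my.com/images/90546089"),
    (FUNDUSCOPE, "prefLabel", "Funduscope"),
    (FUNDUSCOPE, "altLabel", "Ophthalmoscope"),
    (EQUIPMENT, "prefLabel", "Diagnostic Equipment"),
    (FUNDUSCOPE, "broader", EQUIPMENT),  # assumption: direction is not readable from the slide
]


def find_concept(label):
    """Resolve a string to a concept URI via its preferred or alternative label."""
    wanted = label.lower()
    for subject, predicate, obj in TRIPLES:
        if predicate in ("prefLabel", "altLabel") and obj.lower() == wanted:
            return subject
    return None


def pref_label(uri):
    """Return the preferred label of a concept, or None if it has none."""
    for subject, predicate, obj in TRIPLES:
        if subject == uri and predicate == "prefLabel":
            return obj
    return None


def broader(uri):
    """Return the URIs of all directly broader concepts."""
    return [o for s, p, o in TRIPLES if s == uri and p == "broader"]


# Two different strings resolve to the same thing.
print(find_concept("Ophthalmoscope") == find_concept("funduscope"))
print([pref_label(u) for u in broader(FUNDUSCOPE)])
```

Because lookups go through the URI, a search for either label can be expanded to the broader concept without any string matching on "Diagnostic Equipment".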
14. Shadow Concepts
Use co-occurrences between concepts and terms to extract ‘shadow concepts’
Example:
This site is a 15th-century Inca site located 2,430 metres above sea level. It is located in Cusco, Peru. It is situated on a mountain ridge above the Sacred Valley through which the Urubamba River flows. Most archaeologists believe that it was built as an estate for the Inca emperor Pachacuti. Often mistakenly referred to as the "Lost City of the Incas", it is the most familiar icon of Inca civilization. The Incas built the estate around 1450, but abandoned it a century later at the time of the Spanish Conquest.
Diagram, concepts and terms linked to the article: Inca site, Machu Picchu, Cusco, Inca empire, Inca emperor, Peru, Spanish Conquest, Sacred Valley, Chankas, Lost City, Pachacuti.
In addition to explicitly used concepts and terms, Machu Picchu is extracted from the article as a Shadow Concept. As a prerequisite, one has to provide and analyze a representative text corpus first.
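One way to picture shadow concept extraction is to score every concept that is not named in a text by the terms it usually co-occurs with. The following is a minimal sketch under stated assumptions, not the algorithm PoolParty uses: the association weights are invented (in practice they would be learned from the reference corpus), the threshold is arbitrary, and the article text is an abridged version of the example on this slide.

```python
# Hypothetical association strengths between a concept and terms,
# as they might be learned from a reference corpus (all values are made up).
ASSOCIATIONS = {
    "Machu Picchu": {"inca": 0.9, "cusco": 0.8, "pachacuti": 0.7, "urubamba": 0.6},
    "Chankas": {"inca": 0.5, "pachacuti": 0.6},
    "Amazon River": {"peru": 0.4, "river": 0.5},
}


def shadow_concepts(text, associations, threshold=2.0):
    """Return concepts that are not named in the text but whose associated terms are."""
    lowered = text.lower()
    scores = {}
    for concept, weights in associations.items():
        if concept.lower() in lowered:
            continue  # explicitly mentioned, so not a shadow concept
        # Naive substring matching; a real system would match tokens or labels.
        score = sum(w for term, w in weights.items() if term in lowered)
        if score >= threshold:
            scores[concept] = round(score, 2)
    return scores


ARTICLE = (
    "This site is a 15th-century Inca site located in Cusco, Peru. "
    "It is situated above the Sacred Valley through which the Urubamba River flows. "
    "Most archaeologists believe that it was built for the Inca emperor Pachacuti."
)
print(shadow_concepts(ARTICLE, ASSOCIATIONS))  # {'Machu Picchu': 3.0}
```

With these made-up weights, Chankas and Amazon River also match some terms but stay below the threshold, which shows why the quality of the reference corpus matters.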
16. Bionics
How do we learn from a lot of text?
Input text: "The stove is on. The stove is hot!"
- Statistical model/co-occurrences → is related: "stove" and "hot" stand out from the surrounding text ("Bla stove bla bla. Bla bla bla hot").
- Taxonomical model → is-a abstractions: Switched on devices are dangerous devices.
- Ontological model → reasoning: Switched on devices are dangerous, only if the operating temperature is above 100 degrees and the automatic shutdown mechanism is broken.
17. Graphs + Machine Learning
PoolParty as a supervised learning system
Diagram, components and roles shown: Content Manager, Integrator, Taxonomist/Ontologist, Thesaurus Server, Extractor, PowerTagging, Index, Corpus Learning/Semantic Analysis, CMS.
Edge labels: uses API, is user of, is basis of, annotates, enriches, extends, analyzes, proposes extensions.
18. Knowledge graphs as a result of human-machine cooperation
Diagram labels:
- Manually created parts of graph
- Supervised learning
- Automatically created parts of graph (corpus analysis, RDF transformation, machine learning, …)
19. PoolParty Corpus Analysis
How taxonomists can extend taxonomies with some help from machine learning algorithms
- Candidate Concepts derived from sample documents can be easily integrated into the taxonomy.
- A list of possible Candidate Concepts is shown per document or as a list of most relevant candidates per corpus.
- The context of a given taxonomy concept can be visualised with a few mouse clicks.
- Terms, concepts and shadow concepts can be highlighted per document.
20. Network-based Knowledge Graph Assessment
Thesaurus Harmonizer
▸ Find missing relationships between concepts, which are of high semantic relevance
▸ Point out structural flaws in existing thesauri
▸ Identify corpora that only reflect a fraction of a thesaurus
▹ Or, vice versa: identify thesauri that are far too big for their domain applications, and possibly missing details
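Two of the checks listed above can be sketched in a few lines: finding concept pairs that co-occur strongly but are not related in the thesaurus, and measuring how much of a thesaurus a corpus actually covers. This is a minimal sketch with invented data; the thesaurus, the co-occurrence strengths, the threshold and the function names are assumptions and do not reflect how the Thesaurus Harmonizer is implemented.

```python
# Hypothetical thesaurus: concept -> directly related concepts (broader, narrower, related).
THESAURUS = {
    "solar energy": {"renewable energy"},
    "wind energy": {"renewable energy"},
    "renewable energy": {"solar energy", "wind energy"},
    "photovoltaics": set(),
    "district heating": set(),
}

# Hypothetical co-occurrence strengths between concepts, measured on a corpus.
COOCCURRENCE = {
    ("photovoltaics", "solar energy"): 0.9,
    ("renewable energy", "solar energy"): 0.8,
    ("solar energy", "wind energy"): 0.3,
}


def missing_relations(thesaurus, cooccurrence, min_strength=0.5):
    """Return strongly co-occurring concept pairs that have no direct relation."""
    missing = []
    for (a, b), strength in sorted(cooccurrence.items()):
        linked = b in thesaurus.get(a, set()) or a in thesaurus.get(b, set())
        if strength >= min_strength and not linked:
            missing.append((a, b))
    return missing


def corpus_coverage(thesaurus, cooccurrence):
    """Return the fraction of thesaurus concepts that occur in the corpus statistics."""
    seen = {concept for pair in cooccurrence for concept in pair}
    return len(seen & set(thesaurus)) / len(thesaurus)


print(missing_relations(THESAURUS, COOCCURRENCE))  # [('photovoltaics', 'solar energy')]
print(corpus_coverage(THESAURUS, COOCCURRENCE))    # 0.8
```

A low coverage value would hint at the last bullet: either the corpus reflects only a fraction of the thesaurus, or the thesaurus is larger than the domain application needs.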
22. PoolParty Extractor
Extract concepts from text even if not used explicitly
Some domains use text that doesn’t always call a spade a spade. With ‘shadow concept extraction’ those ‘masked’ concepts can still be surfaced.
Since these technologies would have become conventional
technologies that are made into products and introduced into market
at the time of their introduction, it would be difficult to differentiate
them as innovative environmental and energy technologies from other
global warming prevention technologies that have already been put to
practical use in the industrial, commercial, residential, and energy
conversion sectors.
- The Innovative Global Warming Prevention Technology Working
Group under the Research and Development Subcommittee
- Council assessed that innovative global warming prevention
technologies would bring about a reduction effect of 7.49 million t-CO2
case of average emissions factor for all power sources of carbon
dioxide in 2010. In view of the difficulty in putting innovative carbon
dioxide sequestration technology into practical use by 2010, the
Working Group reassigned it as an issue of global warming prevention
technology to be tackled by 2030.
The Central Environment Council, however, has not had the
opportunity to examine the contents of these technologies in detail.
(Promotion of climate change prevention activities by every social
actor)
- The Programme encourages every social actor to take actions to
prevent global warming. The actions include measures undertaken by
the public sector.
Extracted concept: Climate Change
23. PoolParty Semantic Classifier
Text Classification based on Machine Learning and Semantic Knowledge Models
PoolParty Semantic Classifier combines machine learning algorithms
(SVM, Deep Learning, Naive Bayes, etc.) with Semantic Knowledge Graphs.
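Why concept features can help a classifier is easiest to see on a toy example in which a test document shares no terms with the training data but does share a concept. The sketch below is not the PoolParty Semantic Classifier: a simple nearest-centroid classifier with cosine similarity stands in for SVM and the other algorithms, and the label-to-concept lookup, the training texts and all names are assumptions.

```python
from collections import Counter
import math
import re

# Hypothetical label -> concept lookup, standing in for a knowledge graph.
LABEL_TO_CONCEPT = {
    "photovoltaics": "concept:solar_energy",
    "solar": "concept:solar_energy",
    "turbine": "concept:wind_energy",
    "wind": "concept:wind_energy",
}


def features(text):
    """Bag of term features plus one concept feature per matched label."""
    terms = re.findall(r"[a-z]+", text.lower())
    bag = Counter("term:" + t for t in terms)
    bag.update(LABEL_TO_CONCEPT[t] for t in terms if t in LABEL_TO_CONCEPT)
    return bag


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """Sum the feature vectors of each class into one centroid per class."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(features(text))
    return centroids


def classify(text, centroids):
    """Return the class whose centroid is most similar to the text."""
    vector = features(text)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


TRAINING = [
    ("solar panels on the roof", "Solar"),
    ("wind farm near the coast", "Wind"),
]
centroids = train(TRAINING)
# "photovoltaics" never occurs in the training texts; only the shared concept links it to "Solar".
print(classify("photovoltaics subsidies", centroids))  # Solar
```

With term features alone, both classes would score zero for this document and the choice would be arbitrary; the concept feature is what separates them.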
24. Benchmarking the PoolParty Semantic Classifier
Improvement of 5.2% compared to traditional (term-based) SVM
Features used | Classifier | F1 (5 folds) | Variance
Terms | LinearSVC | 0.83175 | 0.0008
Concepts from REEGLE + Shadow Concepts | LinearSVC | 0.84451 | 0.0011
Concepts from REEGLE | LinearSVC | 0.84647 | 0.0009
Terms + Concepts from REEGLE + Shadow Concepts | LinearSVC | 0.87474 | 0.0009
Reegle thesaurus
A comprehensive SKOS taxonomy for the clean energy sector (http://data.reeep.org/thesaurus/guide)
● 3,420 concepts
● 7,280 labels (English version)
● 9,183 relations (broader/narrower + related)
Document Training Set
1,800 documents in 7 classes: Renewable Energy, District Heating Systems, Cogeneration, Energy Efficiency, Energy (general), Climate Protection, Rural Electrification
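The 5.2% figure matches the relative gain in F1 of the combined feature set over the term-only baseline (the absolute difference is about 0.043). A quick check, assuming the improvement is meant as a relative gain; the function name is illustrative and the scores are copied from the benchmark table on this slide:

```python
# F1 scores copied from the benchmark table on this slide.
F1_SCORES = {
    "Terms": 0.83175,
    "Concepts from REEGLE + Shadow Concepts": 0.84451,
    "Concepts from REEGLE": 0.84647,
    "Terms + Concepts from REEGLE + Shadow Concepts": 0.87474,
}


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100


baseline = F1_SCORES["Terms"]
for name, f1 in F1_SCORES.items():
    # Assumption: the slide's 5.2% is relative to the term-only LinearSVC baseline.
    print(f"{name}: {relative_improvement(baseline, f1):+.1f}%")
```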
25. Sample Calculation
Based on an improvement of 5.2%
Diagram: Inbound Documents, PoolParty Semantic Classifier, Experienced Agent
● 100,000 documents (emails, tickets, etc.) per month
● €5 extra cost per misrouted document
● Cost savings per year:
○ 1,200,000 × €5.00 × 0.052 = €312,000 per annum
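The calculation can be reproduced and varied with a small helper. It assumes, as the slide does, that the 5.2% improvement translates directly into 5.2% of all documents no longer being misrouted; the function and parameter names are illustrative.

```python
def annual_savings(docs_per_month, cost_per_misrouted_doc, improvement):
    """Yearly savings if `improvement` is the share of documents no longer misrouted."""
    docs_per_year = docs_per_month * 12  # 100,000 per month -> 1,200,000 per year
    return docs_per_year * cost_per_misrouted_doc * improvement


savings = annual_savings(docs_per_month=100_000, cost_per_misrouted_doc=5.0, improvement=0.052)
print(f"€{savings:,.0f} per annum")  # €312,000 per annum
```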
26. Use Shadow Concepts to improve Recommender Systems
Mini Countryman
And it’s probably more of a crossover than ever, with the design to match. Being a Mini, the Countryman is clearly meant to be the driver’s car among small crossovers. The suspension is sophisticated, and there are lots of chassis options (a stiffer sports setup, variable damping, the electronically controlled ALL4 all-wheel-drive).
But it’s also the crossover for people who’ve bags of cash to blow on
personalisation and luxury.
There’s been a lot of effort on ramping up the cabin quality, but then the
outgoing Countryman was a sad let-down in that department.
On the outside, plastic wheel-arch extensions, with eyebrow creases in the
metalwork above, as well as roof bars and sill protectors all add to the visual
crossover-ness. This remains the only Mini with angular rather than oval
headlamps, and there’s a load of visual posturing going on in the lower face.
There are eight versions at launch, and they’re exactly what you’d expect. It’s
Cooper or Cooper S, each fuelled by petrol or diesel, each of them with front
drive or ALL4. Oh and an eight-speed auto, too, if you count that as a
separate choice. The Cooper petrol is a three-cylinder, the rest fours.
You get extra kit as standard versus the old car, including navigation,
Bluetooth, emergency call and park sensors. Upgrades include a bigger
touch-screen nav with high-definition traffic, various posher seats, a HUD,
and driver aids. Oh and a cushion thingy that folds out from the boot so you
can sit on the rear bumper without getting your clothes mucky.
In June 2017 a Cooper E will launch, which has the Cooper three-cylinder
petrol driving the front wheels, and an electric motor for the rears, with a
capacity to do a claimed 25 miles of gentle all-electric running. So it has the
performance of a Cooper S ALL4 with the tax-busting advantages of a plug-in
hybrid. And you wouldn’t use any fuel if you commuted a short distance.
The platform is BMW’s contemporary transverse-engined hardware, in the
bigger of its two sizes. That means it shares a lot with the BMW X1. The
4WD system is more sophisticated than the previous Countryman’s. The
proportion of drive to the rear is computed by a controller that takes into
account parameters including grip, steering angle and throttle position, as
well as whether you’ve got the sports mode and sports traction systems
selected.
27. Use a Knowledge Graph + Co-occurrences for precise Content Recommendation
Example: Find similar episodes
Diagram: the episodes "Raving" and "De-Void" are marked as similar episodes. Terms and concepts shown: Scott, attack, Stilinski, friend, shame, O’Brien, woman, married, girl, attractive, love.
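A content recommender of this kind can be approximated by comparing the concepts and terms attached to each item. The sketch below ranks episodes by the Jaccard overlap of their annotations. It is an illustration only: the two episode titles and the terms come from the slide, while "Episode X", the assignment of terms to episodes and the function names are invented, and a production system would presumably also weight co-occurrences and graph relations.

```python
# Hypothetical annotations: episode -> concepts and terms extracted from its description.
# "Episode X" and all term assignments are invented for illustration.
EPISODES = {
    "Raving": {"Scott", "Stilinski", "attack", "friend"},
    "De-Void": {"Scott", "Stilinski", "attack", "shame"},
    "Episode X": {"woman", "married", "love"},
}


def jaccard(a, b):
    """Overlap of two annotation sets: |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_episodes(title, episodes, top_n=1):
    """Rank all other episodes by annotation overlap with `title`."""
    scores = [
        (jaccard(episodes[title], annotations), other)
        for other, annotations in episodes.items()
        if other != title
    ]
    return [other for score, other in sorted(scores, reverse=True)[:top_n]]


print(similar_episodes("Raving", EPISODES))  # ['De-Void']
```

Adding shadow concepts to the annotation sets would let two episodes match even when their descriptions share no explicit terms.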
29. Why ‘The Knot’ uses Machine Learning and Semantic Models
▹ XO Group has run ‘The Knot’ since 1996
▹ NYSE: XOXO (S&P 600 Component)
▹ 1.5 million active members
▹ The Knot has helped marry 25 million couples
▹ Partnering with 300,000 wedding vendors
▹ Millions of vendor reviews