AI for information management: why and how

Anna Divoli, Ph.D. @annadivoli
Head of R&D at @PingarHQ
20 Nov 2018, Auckland AI and Machine Learning

How did I get here?
• Biomedical Sciences
• Bioinformatics
• Biomedical Text Mining
• Biomedical User Search Interfaces
• Biomedical Knowledge Acquisition
• Organizations’ Document Management
All are applied science: translating human intelligence/knowledge to machine/systems input.

Organizations’ Information Management Problem(s)
Too much data already (Office documents, PDFs… duplicates too!)
Enterprise data volume increases 50 times year-over-year between 2015 and 2020.
Dispersed into several locations (file shares, email, content management systems…)
→ Expensive to store and migrate
→ Difficult to find stuff
→ Cannot follow regulatory compliance

Translating these problems into Data Science
Document Classification

It’s all about Metadata

Traditional Metadata Capture

Humans for Metadata
Reasoning
• Subjective
• Inconsistent
• Inter-annotator agreement (after training)
• Finds the task boring
• Finds shortcuts when forced
• Can be put to better use
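Inter-annotator agreement is often quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The slide does not say which measure was used, so the choice of kappa, the function name, and the example labels below are assumptions made purely for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(labels_a) | set(labels_b)) / (n * n)
    # Undefined (division by zero) if both annotators always use one label.
    return (observed - expected) / (1 - expected)

# Hypothetical labels for four documents.
annotator_1 = ["HR", "HR", "Legal", "Legal"]
annotator_2 = ["HR", "Legal", "Legal", "Legal"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A value near 1 indicates strong agreement; a value near 0 indicates agreement no better than chance.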

AI for Metadata
• Needs training
Objective
Consistent
Fast
Cheap
It doesn’t find the task boring
Perfect use
AI

The necessity of auto-classification
[Chart: effort (man-years) to classify 300,000 records; axis 0 to 10]
Manual classification, 3 minutes per record: 9.375
Manual classification, 2 minutes per record: 6.25
Manual classification, 1 minute per record: 3.125
Automatic solution: 0.25 (most of this is set-up time)
Working year: 52 weeks/year × 40 hours/week × 60 min/hour = 124,800 min/year
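The effort arithmetic on this slide can be reproduced with a few lines. This is a minimal sketch using only the figures stated on the slide (300,000 records, a 124,800-minute working year); the function name is made up for illustration. With these inputs the results come out proportional to, but smaller than, the chart's labelled values (9.375, 6.25, 3.125), so the chart presumably assumed a different record count or working year.

```python
# Working minutes in one person-year, as given on the slide.
MINUTES_PER_YEAR = 52 * 40 * 60  # weeks/year * hours/week * min/hour

def man_years(records, minutes_per_record):
    """Person-years needed to classify `records` by hand."""
    return records * minutes_per_record / MINUTES_PER_YEAR

# Effort for the three manual rates shown on the slide.
efforts = {rate: man_years(300_000, rate) for rate in (3, 2, 1)}
for rate, years in efforts.items():
    print(f"{rate} min/record: {years:.2f} man-years")
```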

Metadata for auto-classification
Extracted / generated with NLP algorithms
• Named Entities
• Taxonomy/Ontology Terms
• KeyPhrases
• Excerpts/Summaries
• Patterns
• Events
• Relationships
• Sentiment
• Trends
• …

NLP ∈ AI
[Diagram labels: Machine Learning; NLP; Computational Linguistics; Applied Text Analytics; Connectors; Storage; Memory; Security; Friendly UIs; Visualizations; Managed Metadata & Ontologies]

Named Entities Algorithm
• Machine Learning
• Involves: Tagged examples, good feature selection
• Advantages: It recognizes names it has never seen before.
• Disadvantages: It needs context (like humans!)
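To make "good feature selection" and "it needs context" concrete, here is a minimal sketch of the kind of per-token features a machine-learned entity tagger might be given. The feature set and the function name are assumptions for illustration, not the features used in any particular product.

```python
def token_features(tokens, i):
    """Features for tokens[i], including its immediate context."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        # Context is what lets a model tag a name it has never seen.
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

tokens = "Dr. Divoli works at Pingar".split()
features = token_features(tokens, 1)  # features for "Divoli"
```

Features like these, paired with tagged examples, would be fed to a sequence classifier. The preceding "dr." is the sort of contextual cue that signals a person name even when the surname itself is new.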

KeyPhrases Algorithm
• Uses world knowledge (large amount of data) to identify important concepts
• Uses a good scoring algorithm to determine the most important KeyPhrases within a document
• It is document specific
• It needs updating to learn new concepts
• It behaves a bit like folksonomy (but with more canonical use of terms)
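The slide does not specify the scoring algorithm. As one simple stand-in, the sketch below uses TF-IDF: a term scores highly when it is frequent in the document but rare in a large background collection, which plays the role of "world knowledge" here. The background counts and all names are hypothetical.

```python
import math
from collections import Counter

def keyphrase_scores(doc_tokens, background_doc_freq, n_background_docs):
    """Rank the terms of one document by a smoothed TF-IDF score."""
    term_counts = Counter(doc_tokens)
    scores = {}
    for term, count in term_counts.items():
        doc_freq = background_doc_freq.get(term, 0)
        # Smoothed IDF: terms rare in the background get a larger weight.
        idf = math.log((1 + n_background_docs) / (1 + doc_freq)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical background statistics: documents containing each term.
background = {"the": 1000, "of": 990, "covers": 200,
              "policy": 50, "records": 40, "retention": 5}
doc = "the retention policy covers the retention of records".split()
ranked = keyphrase_scores(doc, background, n_background_docs=1000)
```

Because the background statistics are fixed at build time, a scheme like this has to be refreshed to learn new concepts, in line with the bullet above.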

Metadata for auto-classification
• Named Entities
• Taxonomy/Ontology Terms
• KeyPhrases
• Excerpts/Summaries
• Patterns
• Events
• Relationships
• Sentiment
• Trends
• …

Taxonomies / Ontologies: Backbone of AI for NLP and Document Management
• So important for many fields.
Example: Biomedicine: http://www.obofoundry.org
• Variations
Google Knowledge Graph: uses less “formal semantics” than a “regular” ontology

Basics of Document Classification
• Match specific terms and/or patterns (rules)
• Supervised Machine Learning (including Deep Learning)
• Hybrid Systems
• Unsupervised Machine Learning, i.e., Clustering, e.g., Topic Modelling
In enterprise, these are nice in theory!
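The first approach in the list, matching specific terms and patterns, can be sketched in a few lines. The categories and regular expressions below are invented for illustration; a real rule set would come from a taxonomy and subject matter experts.

```python
import re

# Hypothetical rules: a document gets a category if any pattern matches.
RULES = {
    "Invoice": [r"\binvoice\s+(no|number)\b", r"\bamount due\b"],
    "Contract": [r"\bthis agreement\b", r"\bterm of the contract\b"],
}

def classify(text):
    """Return every category with at least one matching pattern."""
    lowered = text.lower()
    return [category
            for category, patterns in RULES.items()
            if any(re.search(pattern, lowered) for pattern in patterns)]

labels = classify("Invoice number 1234. Amount due: $50")
```

Rules like these need no training data, which is why they remain attractive when labelled examples are scarce.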

Typical Reality
Document Classification and Taxonomies/Ontologies
• No training data (or incomplete and/or very inconsistent)
• Large number of specialized categories
• “Topic” vs. “reference” taxonomies
Example of a customer data set with “lots of training data”:
10,000 summaries of court judgments tagged against a taxonomy of 890 categories.
Assuming a roughly equal distribution, that would be ~11 judgments per category for training. Even if we consider just the leaf nodes, that would be ~20 judgments per category.
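The per-category figures follow from simple division. In the sketch below, the number of leaf categories is not stated in the source; it is back-calculated from the "~20 per category" figure and should be read as an inference.

```python
summaries = 10_000
categories = 890

# Even spread over all categories: roughly 11 examples each.
per_category = summaries / categories

# About 20 examples per leaf implies roughly this many leaf categories.
implied_leaf_categories = summaries / 20

print(f"{per_category:.1f} judgments per category")
print(f"{implied_leaf_categories:.0f} leaf categories implied")
```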

What is AI?
• Many definitions
• Goal: make the machines “smart”
• Typically for tasks at which humans tend to be better, or used to be better
• Methods: from expert systems (rules) to machine learning (ML)
• Rules are instructions (so not intelligence) but a means to a result
• Most famous example of rule-based AI: Deep Blue (played chess, 1996-97)
• ML: the program learns, adapts…
• Most AI (so far): task-specific intelligence

Human Rules or Machine Learning?
• What kind of categories?
• What type of training data?
• How much effort?
Humans need to provide training data or heuristics.

Human Rules Examples
* Example from Pingar’s DiscoveryOne

More Intelligence in the Human Rules
Semantic similarity
* Example from Pingar’s DiscoveryOne

Machine Learning Case: Common Content Types

Methodology
Step 1: Interview/discuss with the company’s Information Manager / Subject Matter Expert. Find out needs and resources.
Step 2: Use different resources to understand the language used and the search needs.
• Look at existing taxonomies/vocabularies.
• Go through documents.
• Interview end users.
• Send questionnaires to end users.
Step 3: Based on information from steps 1 and 2, build ontologies.
Step 4: Based on information from steps 1, 2 and 3, adapt rules and integrate them in the search system using DiscoveryOne.
Step 5: Conduct impact studies with the first version of taxonomy/ies from step 3. Get user feedback for improvements. Steps 1, 2, 3 and 4 might need revisiting after this.
Human, domain-specific knowledge/intelligence needs to be captured and entered in the system.

Demo?
Keen to see it?

How is the metadata used?
→ Expensive to store and migrate
→ Difficult to find stuff
→ Cannot follow regulatory compliance
• Rules & workflows for migration and retention & disposal
• Facets / Search filters
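Both uses can be illustrated on a toy document set. This is a minimal sketch: the metadata fields, the retention periods, and the function names are all hypothetical, chosen only to show how extracted metadata can drive search facets and a disposal rule.

```python
from collections import Counter
from datetime import date

# Hypothetical documents with automatically assigned metadata.
DOCS = [
    {"id": 1, "type": "Contract", "created": date(2010, 3, 1)},
    {"id": 2, "type": "Invoice", "created": date(2017, 6, 15)},
    {"id": 3, "type": "Invoice", "created": date(2012, 1, 9)},
]

# Hypothetical retention periods in years, per document type.
RETENTION_YEARS = {"Contract": 7, "Invoice": 5}

def facet_counts(docs, field):
    """Count documents per metadata value, as shown beside search results."""
    return Counter(doc[field] for doc in docs)

def filter_by(docs, field, value):
    """Apply one facet as a search filter."""
    return [doc for doc in docs if doc[field] == value]

def due_for_disposal(doc, today):
    """True if the document has outlived its type's retention period."""
    years = RETENTION_YEARS.get(doc["type"])
    if years is None:
        return False  # no rule for this type: keep the document
    return (today - doc["created"]).days / 365.25 > years

today = date(2018, 11, 20)
facets = facet_counts(DOCS, "type")
to_dispose = [doc["id"] for doc in DOCS if due_for_disposal(doc, today)]
```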

Summary: New Era, Expectations, and Education
• AI is now accepted and expected!
• But still major lack of understanding of how it works.
• Different algorithms: KeyPhrases, Named Entities, Taxonomy/Ontology based with rules, ML.
• Taxonomy/ontology with rules uses several NLP aspects: stemming, pluralization, stopwords, semantic similarity suggestions, patterns.
• “Traditional”* ML rarely works for our customers.
* “Traditional” = Expecting a few categories with a good amount of training data for each.
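Some of the NLP aspects named above (stopwords, pluralization) can be sketched as a normalization step applied to both taxonomy terms and document text before matching. The stopword list and the plural-folding rule below are deliberately naive assumptions; a production system would use a proper stemmer or lemmatizer.

```python
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "of", "the", "for", "in", "to"}

def normalize(text):
    """Lowercase, drop stopwords, and fold simple English plurals."""
    tokens = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOPWORDS:
            continue
        if token.endswith("ies") and len(token) > 4:
            token = token[:-3] + "y"   # policies -> policy
        elif token.endswith("s") and not token.endswith("ss") and len(token) > 3:
            token = token[:-1]         # contracts -> contract
        tokens.append(token)
    return tokens

def term_matches(term, document):
    """True if the normalized term occurs as a run of normalized tokens."""
    term_tokens = normalize(term)
    doc_tokens = normalize(document)
    n = len(term_tokens)
    return n > 0 and any(doc_tokens[i:i + n] == term_tokens
                         for i in range(len(doc_tokens) - n + 1))

found = term_matches("Retention policy", "The retention policies of the company")
```

Because stopwords are removed before matching, "retention of the policy" would also match the term "retention policy", which may or may not be desirable for a given taxonomy.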

And we are done!
Thank you all!
Questions?
@annadivoli
