Slides from presentation at Auckland AI & ML meetup in Nov 2018 by Anna Divoli.
Title: AI for information management: why and how
Synopsis:
All organizations have a large number of files they need to manage in order to keep storage costs down, find relevant content when they need it, be able to utilize data within those files, and be compliant with regulatory requirements. In this talk, we will cover how AI offers a pragmatic solution to this problem, while reviewing the challenges data and information scientists face in this area.
Brief bio:
Anna Divoli has 10 years of research experience in Academia and 7 years in industry. As a PhD student and then postdoc at the Universities of Manchester, UC Berkeley, and Chicago, she worked in the areas of biomedical text mining, NLP, user search interfaces, and knowledge acquisition. Today, as the Head of R&D at Pingar, she helps organizations tackle their document management problem.
1. AI for information management: Why and How
Anna Divoli, Ph.D. @annadivoli
Head of R&D at @PingarHQ
20 Nov 2018 Auckland AI and Machine Learning
2. How did I get here?
Biomedical Sciences
Bioinformatics
Biomedical Text Mining
Biomedical User Search Interfaces
Biomedical Knowledge Acquisition
Organizations’ Document Management
Anna Divoli
All are applied science
Translating human intelligence/knowledge to machine/systems input.
3. Organizations’ Information Management Problem(s)
Too much data already (Office documents, PDFs… duplicates too!)
Enterprise data volume increases 50 times year-over-year between 2015 and 2020.
Dispersed into several locations (file shares, email, content management systems…)
→ Expensive to store and migrate
→ Difficult to find stuff
→ Cannot follow regulatory compliance
7. Humans for Metadata
Reasoning
Subjective
Inconsistent
Inter-annotator agreement (after training)
Finds the task boring
Finds shortcuts when forced
Can be put to better use
8. AI for Metadata
Needs training
Objective
Consistent
Fast
Cheap
It doesn’t find the task boring
Perfect use
AI
9. The necessity of auto-classification
[Chart: Effort (Man Years) to classify 300,000 records]
Manual Classification, 3 minutes per record
Manual Classification, 2 minutes per record
Manual Classification, 1 minute per record
Automatic Solution: 0.25 (most of this is set up time)
Working year: 52 weeks/year × 40 hours/week × 60 min/hour = 124,800 min/year
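The working-year arithmetic on this slide can be replayed as a short calculation. The sketch below is illustrative only: the function name `manual_effort_years` is hypothetical, and it assumes the slide's own figures (300,000 records, 124,800 working minutes per year).

```python
# Working minutes in one year, using the slide's figures:
# 52 weeks/year * 40 hours/week * 60 min/hour = 124,800
MINUTES_PER_WORK_YEAR = 52 * 40 * 60


def manual_effort_years(records: int, minutes_per_record: float) -> float:
    """Person-years needed to classify `records` by hand (hypothetical helper)."""
    return records * minutes_per_record / MINUTES_PER_WORK_YEAR


if __name__ == "__main__":
    for minutes in (3, 2, 1):
        years = manual_effort_years(300_000, minutes)
        print(f"{minutes} min/record: {years:.2f} man-years")
    # 3 min/record: 7.21 man-years
    # 2 min/record: 4.81 man-years
    # 1 min/record: 2.40 man-years
```

Even at one minute per record, manual classification costs more than two person-years under these assumptions.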
12. Named Entities Algorithm
• Machine Learning
• Involves: Tagged examples, good feature selection
• Advantages: It recognizes names it has never seen before.
• Disadvantages: It needs context (like humans!)
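To make "feature selection" and "context" concrete, here is a minimal sketch of the kind of per-token features a supervised tagger might be trained on. It is an assumption for illustration, not the feature set used in the talk or by any Pingar product; the function name `token_features` and the feature names are hypothetical.

```python
def token_features(tokens: list[str], i: int) -> dict[str, object]:
    """Features for tokens[i], drawn from the token itself and its neighbours."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        # Context features: the neighbouring words are what would let a model
        # label a name it has never seen before.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_lower": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
    }


if __name__ == "__main__":
    sentence = "Today Anna Divoli works at Pingar".split()
    print(token_features(sentence, 2))
```

A tagger would pair such feature dictionaries with the human-tagged labels (person, organization, and so on) during training.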
13. KeyPhrases Algorithm
• Uses world knowledge (large amount of data) to identify important concepts
• Uses a good scoring algorithm to determine the most important KeyPhrases within a document
• It is document specific
• It needs updating to learn new concepts
• It behaves a bit like folksonomy (but with more canonical use of terms)
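The slide does not say which scoring algorithm is used. One common way to combine background "world knowledge" with a document-specific score is TF-IDF, sketched below for single words only. This is a stand-in technique, not the KeyPhrases algorithm itself; `score_terms` and the background counts in the example are hypothetical.

```python
import math
import re
from collections import Counter


def score_terms(
    document: str, background_df: dict[str, int], n_background_docs: int
) -> list[tuple[str, float]]:
    """Rank the words of one document by term frequency x inverse document frequency."""
    words = re.findall(r"[a-z]+", document.lower())
    tf = Counter(words)
    scores = {}
    for word, count in tf.items():
        # Words absent from the background data get a document frequency of 0.
        df = background_df.get(word, 0)
        idf = math.log((1 + n_background_docs) / (1 + df))
        scores[word] = count * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Made-up document frequencies standing in for a large background corpus.
    background = {"the": 1000, "of": 990, "is": 950, "document": 300, "taxonomy": 12}
    text = (
        "The taxonomy is the backbone of document classification. "
        "The taxonomy needs updating."
    )
    for word, score in score_terms(text, background, 1000)[:3]:
        print(f"{word}: {score:.2f}")
```

The score depends on both the document (term counts) and the background data, which matches the slide's points that the result is document specific and that the background knowledge needs refreshing over time.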
15. Taxonomies / Ontologies: Backbone of AI for NLP and Document Management
• So important for many fields.
Example: Biomedicine: http://www.obofoundry.org
• Variations
Google Knowledge Graph: uses less “formal semantics” than a “regular” ontology
16. Basics of Document Classification
• Match specific terms and/or patterns (rules)
• Supervised Machine Learning (including Deep Learning)
• Hybrid Systems
• Unsupervised Machine Learning, i.e., Clustering, e.g., Topic Modelling
In enterprise, these are nice in theory!
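The first approach on this slide, matching specific terms and/or patterns, can be sketched in a few lines. The categories and patterns below are hypothetical and chosen only for illustration; a production rule set would be far larger and curated by subject matter experts.

```python
import re

# Hypothetical categories and trigger patterns, for illustration only.
RULES: dict[str, list[str]] = {
    "Contracts": [r"\bagreement\b", r"\bparties\b", r"\bterminat\w*"],
    "Invoices": [r"\binvoice\b", r"\bamount due\b", r"\bGST\b"],
}


def classify(text: str, rules: dict[str, list[str]] = RULES) -> list[str]:
    """Return every category with at least one matching pattern, best match first."""
    hits = {}
    for category, patterns in rules.items():
        matched = sum(
            1 for pattern in patterns if re.search(pattern, text, flags=re.IGNORECASE)
        )
        if matched:
            hits[category] = matched
    return sorted(hits, key=hits.get, reverse=True)


if __name__ == "__main__":
    print(classify("Invoice 1042: amount due 30 days after the agreement date"))
    # ['Invoices', 'Contracts']
```

No training data is needed for this approach, but every category needs rules that a human has written.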
17. Typical Reality: Document Classification and Taxonomies/Ontologies
• No training data (or incomplete and/or very inconsistent)
• Large number of specialized categories
• “Topic” vs. “reference” taxonomies
Example of a customer data set with “lots of training data”:
10 000 summaries of court judgments tagged against a taxonomy of 890 categories.
Assuming some sort of equal distribution, that would be ~11 judgments per category for training. Even if we consider just the leaf nodes, that would be ~20 judgments per category.
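The per-category estimate is a simple division, sketched here. The helper `examples_per_category` is hypothetical. The leaf-node count is not given on the slide; the figure of about 500 is only what the "~20 judgments per category" estimate would imply.

```python
def examples_per_category(n_examples: int, n_categories: int) -> float:
    """Average number of training examples per category, assuming an even spread."""
    return n_examples / n_categories


if __name__ == "__main__":
    # 10,000 summaries spread over all 890 categories.
    print(round(examples_per_category(10_000, 890), 1))  # 11.2
    # Assumption: "~20 per leaf" implies roughly this many leaf nodes.
    print(10_000 / 20)  # 500.0
```

Real category distributions are rarely even, so many categories would have far fewer examples than this average.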
18. What is AI?
• Many definitions
• Goal: make the machines “smart”
• Typically for tasks humans tend to be better at. Or, used to be better at.
• Methods: expert systems (rules) to machine learning (ML)
• Rules are instructions (so not intelligence) but a means to a result
• Rule-based AI, most famous example: Deep Blue (played chess 1996-97)
• ML: the program learns, adapts…
• Most AI (so far): task-specific intelligence
19. Human Rules or Machine Learning?
What kind of categories?
What type of training data?
How much effort?
Humans need to provide training data or heuristics.
21. More Intelligence in the Human Rules
Semantic similarity
* Example from Pingar’s DiscoveryOne
22. More Intelligence in the Human Rules
Patterns including named entities
Regex
(([0-9]*\.?[0-9]+))((\s*?)|(-)|(\s*?-\s*?))((metre\b)|(metres\b)|(meter\b)|(meters\b)|(kilometre\b)|(kilometres\b)|(kilometer\b)|(kilometers\b)|(km\b)|(kms\b)|(m\b))
* Examples from Pingar’s DiscoveryOne
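The distance pattern on this slide can be tried out with Python's `re` module. The sketch below assumes the backslashes lost in the slide text were `\.`, `\s` and `\b`, and rewrites the pattern in a more compact form with named groups. It is an illustration only, not DiscoveryOne's rule syntax, and `find_distances` is a hypothetical helper.

```python
import re

# Compact rewrite of the slide's pattern (backslashes are an assumption):
# a number, an optional separator (spaces and/or a hyphen), then a unit.
DISTANCE = re.compile(
    r"(?P<value>[0-9]*\.?[0-9]+)"  # e.g. 200, 3.5, .5
    r"(?:\s*-\s*|\s*)"  # "", " ", "-", " - "
    r"(?P<unit>kilometres?|kilometers?|metres?|meters?|kms?|m)\b"
)


def find_distances(text: str) -> list[tuple[str, str]]:
    """Return (value, unit) pairs for every distance mention in `text`."""
    return [(m.group("value"), m.group("unit")) for m in DISTANCE.finditer(text)]


if __name__ == "__main__":
    sample = (
        "The track is 3.5km long, then 200 metres of stairs "
        "and a 10-kilometre ride; 5 miles is not matched."
    )
    print(find_distances(sample))
    # [('3.5', 'km'), ('200', 'metres'), ('10', 'kilometre')]
```

The trailing `\b` is what stops "5 miles" from matching on the bare unit `m`.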
24. Methodology
Step 1: Interview/discuss with Company’s Information Manager / Subject Matter Expert. Find out needs and resources.
Step 2: Use different resources to understand the language used and the search needs.
• Look at existing taxonomies/vocabularies.
• Go through documents.
• Interview end users.
• Send questionnaires to end users.
Step 3: Based on information from steps 1 and 2, build ontologies.
Step 4: Based on information from steps 1, 2 and 3, adapt rules and integrate them in the search system using DiscoveryOne.
Step 5: Conduct impact studies with the first version of taxonomy/ies from step 3. Get user feedback for improvements. Steps 1, 2, 3 and 4 might need revisiting after this.
Human, domain-specific knowledge/intelligence needs to be captured and entered in the system.
26. How is the metadata used?
→ Expensive to store and migrate
→ Difficult to find stuff
→ Cannot follow regulatory compliance
• Rules & workflows for migration and retention & disposal
• Facets / Search filters
27. Summary: New Era, Expectations, and Education
• AI is now accepted and expected!
• But still major lack of understanding of how it works.
• Different algorithms: KeyPhrases, Named Entities, Taxonomy/Ontology based with rules, ML.
• Taxonomy/ontology with rules uses several NLP aspects: stemming, pluralization, stopwords, semantic similarity suggestions, patterns.
• “Traditional”* ML rarely works for our customers.
* “Traditional” = Expecting a few categories with a good amount of training data for each.
28. And we are done!
Thank you all!
Questions?
@annadivoli