AI for information management: Why and How
Anna Divoli, Ph.D. @annadivoli
Head of R&D at @PingarHQ
20 Nov 2018, Auckland AI and Machine Learning
How did I get here?
• Biomedical Sciences
• Bioinformatics
• Biomedical Text Mining
• Biomedical User Search Interfaces
• Biomedical Knowledge Acquisition
• Organizations’ Document Management
Anna Divoli
All are applied science
Translating human intelligence/knowledge to machine/systems input.
Organizations’ Information Management Problem(s)
Too much data already (Office documents, PDFs… duplicates too!)
Enterprise data volume increases 50 times year-over-year between 2015 and 2020.
Dispersed into several locations (file shares, email, content management systems…)
→ Expensive to store and migrate
→ Difficult to find stuff
→ Cannot ensure regulatory compliance
Translating these problems into Data Science
Document Classification
It’s all about Metadata
Traditional Metadata Capture
Humans for Metadata
Reasoning
• Subjective
• Inconsistent
• Inter-annotator agreement (after training)
• Finds the task boring
• Finds shortcuts when forced
• Can be put to better use
AI for Metadata
• Needs training
Objective
Consistent
Fast
Cheap
It doesn’t find the task boring
Perfect use
AI
The necessity of auto-classification
[Bar chart: Effort (man-years) to classify 300,000 records]
• Manual classification, 3 minutes per record: 9.375
• Manual classification, 2 minutes per record: 6.25
• Manual classification, 1 minute per record: 3.125
• Automatic solution: 0.25 (most of this is set-up time)
Working year: 52 weeks/year × 40 hours/week × 60 min/hour = 124,800 min/year
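The effort figures on this slide can be recomputed with a few lines of arithmetic. The sketch below is only an illustration; the function name and parameters are assumptions, not part of the talk. With the 124,800 working minutes per year stated on the slide, manual classification at 3 minutes per record comes to roughly 7.2 man-years. The chart labels (3.125, 6.25, 9.375) correspond instead to a working year of 96,000 minutes, so the chart was presumably drawn with a shorter working year than the one written out.

```python
RECORDS = 300_000
MINUTES_PER_YEAR = 52 * 40 * 60  # 124,800 min/year, as written on the slide


def man_years(minutes_per_record, records=RECORDS, minutes_per_year=MINUTES_PER_YEAR):
    """Effort in man-years to classify `records` items by hand."""
    return records * minutes_per_record / minutes_per_year


# Effort under the slide's stated working year.
stated_basis = {m: round(man_years(m), 2) for m in (1, 2, 3)}

# Effort under the working year implied by the chart labels (an inference).
chart_basis = {m: man_years(m, minutes_per_year=96_000) for m in (1, 2, 3)}
```

Either way, the gap to the 0.25 man-years quoted for the automatic solution is more than an order of magnitude at 3 minutes per record.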
Metadata for auto-classification
Extracted / generated with NLP algorithms
• Named Entities
• Taxonomy/Ontology Terms
• KeyPhrases
• Excerpts/Summaries
• Patterns
• Events
• Relationships
• Sentiment
• Trends
• …
NLP ∈ AI
[Diagram labels: Machine Learning, NLP, Computational Linguistics; Applied Text Analytics; Connectors, Storage, Memory, Security, Friendly UIs, Visualizations, Managed Metadata & Ontologies]
Named Entities Algorithm
• Machine Learning
• Involves: Tagged examples, good feature selection
• Advantages: It recognizes names it has never seen before.
• Disadvantages: It needs context (like humans!)
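To make "good feature selection" and "needs context" concrete, here is a minimal sketch of the kind of per-token features a learned named-entity tagger might use. The feature set and the function name are assumptions chosen for illustration, not the features used in any particular product. A sequence classifier trained on tagged examples would consume one such dictionary per token.

```python
def token_features(tokens, i):
    """Context features for the token at position i (hypothetical feature set)."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        # Neighbouring words supply the context the model needs.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


tokens = "Dr Smith joined Pingar in Auckland".split()
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Because the features describe shape and surroundings rather than a fixed list of names, a model trained on them can tag a name it has never seen, provided the context resembles the training examples.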
KeyPhrases Algorithm
• Uses world knowledge (large amount of data) to identify important concepts
• Uses a good scoring algorithm to determine the most important KeyPhrases within a document
• It is document specific
• It needs updating to learn new concepts
• It behaves a bit like a folksonomy (but with more canonical use of terms)
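The talk does not say which scoring algorithm is used. As a stand-in, the sketch below uses a TF-IDF style score: a word scores highly when it is frequent in the document but rare in a background corpus, which plays the role of "world knowledge". It scores single words rather than phrases to stay short, and the corpus and names are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def keyphrase_scores(document, background_docs):
    """Score each word in `document` by term frequency times smoothed IDF."""
    doc_counts = Counter(tokenize(document))
    n_docs = len(background_docs)
    doc_freq = Counter()
    for bg in background_docs:
        doc_freq.update(set(tokenize(bg)))
    scores = {}
    for word, count in doc_counts.items():
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        scores[word] = count * idf
    return scores


background = [
    "the board approved the budget",
    "the team reviewed the contract",
    "the budget was sent to the board",
]
doc = "the retention policy covers the contract and the retention schedule"
scores = keyphrase_scores(doc, background)
best = max(scores, key=scores.get)
```

The background corpus is what "needs updating to learn new concepts": a term that is new to the world but absent from stale background data would be scored as if it were merely rare.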
Metadata for auto-classification
• Named Entities
• Taxonomy/Ontology Terms
• KeyPhrases
• Excerpts/Summaries
• Patterns
• Events
• Relationships
• Sentiment
• Trends
• …
Taxonomies / Ontologies: Backbone of AI for NLP and Document Management
• So important for many fields.
Example: Biomedicine: http://www.obofoundry.org
• Variations
Google Knowledge Graph: uses less “formal semantics” than a “regular” ontology
Basics of Document Classification
• Match specific terms and/or patterns (rules)
• Supervised Machine Learning (including Deep Learning)
• Hybrid Systems
• Unsupervised Machine Learning, i.e., Clustering, e.g., Topic Modelling
In enterprise, these are nice in theory!
Typical Reality
Document Classification and Taxonomies/Ontologies
• No training data (or incomplete and/or very inconsistent)
• Large number of specialized categories
• “Topic” vs. “reference” taxonomies
Example of a customer data set with “lots of training data”:
10 000 summaries of court judgments tagged against a taxonomy of 890 categories.
Assuming some sort of equal distribution, that would be ~11 judgments per category for training. Even if we consider just the leaf nodes, that would be ~20 judgments per category.
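The per-category figures follow from simple division. The short check below uses the two numbers given in the example; the leaf-node count is not stated in the talk and is only inferred here from the "~20 per category" figure.

```python
judgments = 10_000
categories = 890

# Even spread across all categories.
per_category = judgments / categories

# Inferred, not stated in the talk: ~20 judgments per leaf implies about this many leaves.
implied_leaf_nodes = judgments / 20
```

A dozen or so examples per class is far below what a supervised classifier usually needs, which is the point of the example.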
What is AI?
• Many definitions
• Goal: make the machines “smart”
• Typically for tasks at which humans tend to be better (or used to be better)
• Methods: from expert systems (rules) to machine learning (ML)
• Rules are instructions (so not intelligence) but a means to a result
• Most famous example of rule-based AI: Deep Blue (played chess, 1996-97)
• ML: the program learns, adapts…
• Most AI (so far): task-specific intelligence
Human Rules or Machine Learning?
• What kind of categories?
• What type of training data?
• How much effort?
Humans need to provide training data or heuristics.
Human Rules Examples
* Example from Pingar’s DiscoveryOne
More Intelligence in the Human Rules
Semantic similarity
* Example from Pingar’s DiscoveryOne
More Intelligence in the Human Rules
Patterns including named entities
Regex
(([0-9]*\.?[0-9]+))((\s*?)|(-)|(\s*?-\s*?))((metre\b)|(metres\b)|(meter\b)|(meters\b)|(kilometre\b)|(kilometres\b)|(kilometer\b)|(kilometers\b)|(km\b)|(kms\b)|(m\b))
* Examples from Pingar’s DiscoveryOne
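The distance pattern on this slide can be tried out directly. The sketch below assumes that the backslashes were lost when the slide text was extracted, so that `s*?` stands for `\s*?`, the trailing `b` after each unit for `\b`, and `.?` for `\.?`. The helper name is made up for the example.

```python
import re

# The slide's pattern with the backslashes restored (an assumption).
DISTANCE = re.compile(
    r"(([0-9]*\.?[0-9]+))((\s*?)|(-)|(\s*?-\s*?))"
    r"((metre\b)|(metres\b)|(meter\b)|(meters\b)|(kilometre\b)|(kilometres\b)|"
    r"(kilometer\b)|(kilometers\b)|(km\b)|(kms\b)|(m\b))"
)


def find_distances(text):
    """Return (number, unit) pairs; group 1 is the number, group 7 the unit."""
    return [(m.group(1), m.group(7)) for m in DISTANCE.finditer(text)]


spaced = find_distances("The pipeline runs 2.5 km north, then 300 metres east.")
hyphenated = find_distances("a 12-meter mast")
not_a_distance = find_distances("5 minutes")
```

The `\b` after each unit is what stops "5 minutes" from being read as "5 m".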
Machine Learning Case: Common Content Types
Methodology
Step 1: Interview/discuss with the company’s Information Manager / Subject Matter Expert. Find out needs and resources.
Step 2: Use different resources to understand the language used and the search needs:
• Look at existing taxonomies/vocabularies.
• Go through documents.
• Interview end users.
• Send questionnaires to end users.
Step 3: Based on information from steps 1 and 2, build ontologies.
Step 4: Based on information from steps 1, 2 and 3, adapt rules and integrate them in the search system using DiscoveryOne.
Step 5: Conduct impact studies with the first version of taxonomy/ies from step 3. Get user feedback for improvements. Steps 1, 2, 3 and 4 might need revisiting after this.
Human, domain-specific knowledge/intelligence needs to be captured and entered in the system.
Demo?
Keen to see it?
How is the metadata used?
→ Expensive to store and migrate
→ Difficult to find stuff
→ Cannot ensure regulatory compliance
• Rules & workflows for migration and retention & disposal
• Facets / Search filters
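As an illustration of the second bullet, the sketch below shows how extracted metadata could drive facets and search filters. The documents, field names, and helper functions are all hypothetical; a real deployment would do this inside the search engine or content management system rather than in application code.

```python
from collections import Counter

# Hypothetical documents carrying extracted metadata.
DOCS = [
    {"id": 1, "category": "Contracts", "entities": ["Pingar", "Auckland"]},
    {"id": 2, "category": "Invoices", "entities": ["Pingar"]},
    {"id": 3, "category": "Contracts", "entities": ["Auckland"]},
]


def field_values(doc, field):
    """Return a document's values for a field as a list, whatever its shape."""
    values = doc.get(field, [])
    return values if isinstance(values, list) else [values]


def facet_counts(docs, field):
    """Count how many documents carry each value of a metadata field."""
    counts = Counter()
    for doc in docs:
        counts.update(set(field_values(doc, field)))
    return counts


def filter_docs(docs, field, value):
    """Keep the documents whose metadata field equals or contains the value."""
    return [doc for doc in docs if value in field_values(doc, field)]


by_category = dict(facet_counts(DOCS, "category"))
in_auckland = [doc["id"] for doc in filter_docs(DOCS, "entities", "Auckland")]
```

The same metadata fields could feed retention and disposal rules, for example by selecting every document in a category once it passes a given age.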
Summary: New Era, Expectations, and Education
• AI is now accepted and expected!
• But there is still a major lack of understanding of how it works.
• Different algorithms: KeyPhrases, Named Entities, Taxonomy/Ontology based with rules, ML.
• Taxonomy/ontology with rules uses several NLP aspects: stemming, pluralization, stopwords, semantic similarity suggestions, patterns.
• “Traditional”* ML rarely works for our customers.
* “Traditional” = Expecting a few categories with a good amount of training data for each.
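Several of the NLP aspects named in the summary (pluralization, stopwords) amount to normalizing text before a taxonomy term is matched against it. The sketch below is a deliberately naive version, with a tiny stopword list and a crude plural rule that are assumptions for illustration; a real system would use a proper stemmer or lemmatizer.

```python
import re

STOPWORDS = {"a", "an", "and", "of", "the", "for", "in", "to"}


def singular(word):
    """Very naive plural stripping (illustrative only)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def normalize(phrase):
    """Lowercase, drop stopwords, and reduce plurals to singular forms."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    return tuple(singular(w) for w in words if w not in STOPWORDS)


def matches_term(text, term):
    """True if the normalized term occurs as a contiguous run in the normalized text."""
    t, n = normalize(text), normalize(term)
    if not n:
        return False
    return any(t[i:i + len(n)] == n for i in range(len(t) - len(n) + 1))


hit = matches_term("The retention policy of the company", "Retention Policies")
miss = matches_term("Annual leave requests", "retention policy")
```

With this normalization, one taxonomy entry covers singular, plural, and differently capitalized mentions without a separate rule for each.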
And we are done!
Thank you all!
Questions?
@annadivoli
