Identifying Security Risks Using Auto-Tagging & Text Analytics
Text Analytics Forum 2022
Joe Hilger and Sara Duane
ENTERPRISE KNOWLEDGE
Outline
⬢ EK at a Glance
⬢ The Problem
⬢ Our Approach
⬢ Our Methodology and Best Practices
What You Will Learn
⬢ How to identify confidential information across an enterprise
⬢ Best practices for leveraging and tuning auto-tagging
⬢ How to design a taxonomy for auto-tagging
JOE HILGER
COO AND COFOUNDER, ENTERPRISE KNOWLEDGE
⬢ 33 Years of Consulting Experience
⬢ Expert in Knowledge Management and Knowledge Graph Technologies
⬢ Coauthor of Making KM Clickable (2022)
SARA DUANE
SENIOR TECHNICAL ANALYST, ENTERPRISE KNOWLEDGE
⬢ Serves as project manager for technical implementation and strategy projects
⬢ Conducted complex auto-tagging projects for clients in both the commercial and federal space
EK At A Glance
10 AREAS OF EXPERTISE
⬢ KM STRATEGY & DESIGN
⬢ TAXONOMY & ONTOLOGY DESIGN
⬢ TECHNOLOGY SOLUTIONS
⬢ AGILE, DESIGN THINKING, & FACILITATION
⬢ CONTENT & BRAND STRATEGY
⬢ KNOWLEDGE GRAPHS, DATA MODELING, & AI
⬢ ENTERPRISE SEARCH
⬢ INTEGRATED CHANGE MANAGEMENT
⬢ ENTERPRISE LEARNING
⬢ CONTENT MANAGEMENT
80+ EXPERT CONSULTANTS
HEADQUARTERED IN WASHINGTON, DC, USA
PRESENCE IN BRUSSELS, BELGIUM
ESTABLISHED 2013 – OUR FOUNDERS AND PRINCIPALS HAVE BEEN PROVIDING KNOWLEDGE MANAGEMENT CONSULTING TO GLOBAL CLIENTS FOR OVER 20 YEARS.
STABLE CLIENT BASE
AWARD-WINNING CONSULTANCY
KMWORLD’S
⬢ 100 COMPANIES THAT MATTER IN KM (2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022)
⬢ TOP 50 TRAILBLAZERS IN AI (2020, 2021, 2022)
CIO REVIEW’S
⬢ 20 MOST PROMISING KM SOLUTION PROVIDERS (2016)
INC MAGAZINE
⬢ #2,343 OF THE 5000 FASTEST GROWING COMPANIES (2021)
⬢ #2,574 OF THE 5000 FASTEST GROWING COMPANIES (2020)
⬢ #2,411 OF THE 5000 FASTEST GROWING COMPANIES (2019)
⬢ #1,289 OF THE 5000 FASTEST GROWING COMPANIES (2018)
⬢ BEST WORKPLACES (2018, 2019, 2021, 2022)
WASHINGTONIAN MAGAZINE’S
⬢ TOP 50 GREAT PLACES TO WORK (2017)
WASHINGTON BUSINESS JOURNAL’S
⬢ BEST PLACES TO WORK (2017, 2018, 2019, 2020)
ARLINGTON ECONOMIC DEVELOPMENT’S
⬢ FAST FOUR AWARD – FASTEST GROWING COMPANY (2016)
VIRGINIA CHAMBER OF COMMERCE’S
⬢ FANTASTIC 50 AWARD – FASTEST GROWING COMPANY (2019, 2020)
The Problem
Problem Statement
At this federal research organization, researchers, proposal authors, project managers, etc. all leverage project content, data, and documentation on their shared drives.
They need to have a way to:
▪ Identify content that is controlled, Controlled Unclassified Information (CUI), or otherwise sensitive
So that they can…
▪ Move the relevant documents to a secure location
▪ Prevent data loss and compliance issues
▪ Ensure all documents have a classification
How Common Tools Solve the Problem
Many tools and solutions approach this by looking for personally identifiable information (PII) through pattern recognition, including:
⬢ Using regex to identify the patterns behind PII, such as a phone number.
⬢ Identifying specific sensitivity labels within the content itself, such as “top secret.”
These products and solutions don’t look for terms or categories of information that reflect sensitive content. What if a piece of information within a document is sensitive, but contains neither the term “top secret” nor any PII identifiable through pattern recognition?
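The gap in pattern-based detection can be illustrated with a minimal sketch. The two regular expressions and both sample sentences below are hypothetical and are not taken from any specific product; they only show that a topically sensitive sentence can pass both checks unflagged.

```python
import re

# Illustrative patterns only (assumptions): a US-style phone number and a
# literal sensitivity label. Real tools use far broader rule sets.
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
LABEL = re.compile(r"\btop secret\b", re.IGNORECASE)

def pattern_flags(text: str) -> dict:
    """Report which pattern-based checks fire on the text."""
    return {
        "phone_number": bool(PHONE.search(text)),
        "sensitivity_label": bool(LABEL.search(text)),
    }

# Hypothetical sample sentences.
with_pii = "Contact the site lead at 555-867-5309 for access."
topical = "Groundwater sampling coordinates for the restricted test site."

print(pattern_flags(with_pii))  # the phone pattern fires
print(pattern_flags(topical))   # nothing fires, although the topic may be sensitive
```

The second sentence is the case the slide describes: no pattern matches, so only knowledge of the subject matter could mark it as sensitive.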
Our Solution
Teaching Technology
1. Identify the terms, words, and categories of information that suggest secure information.
2. Develop a subject-oriented topic taxonomy of secure terms.
3. Conduct auto-tagging on documents with this subject-oriented taxonomy to identify the secure content.
4. Leverage these tags and labels to begin the migration process.
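The four steps can be sketched end to end with a toy example. This is a simplified illustration under stated assumptions, not EK's implementation: the `Concept` class, the two concepts, their synonyms, and the sample documents are all hypothetical, and matching is plain whole-phrase lookup rather than a real extraction engine.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A taxonomy concept: one preferred label plus synonyms (hypothetical model)."""
    pref_label: str
    synonyms: list = field(default_factory=list)

# Steps 1-2: a tiny, made-up "secure terms" topic taxonomy.
TAXONOMY = [
    Concept("Groundwater Contamination", ["aquifer contamination", "groundwater pollution"]),
    Concept("Test Site", ["testing facility"]),
]

def tag(text, taxonomy):
    """Step 3: return the preferred labels whose label or synonym appears in the text."""
    lowered = text.lower()
    tags = []
    for concept in taxonomy:
        for label in [concept.pref_label, *concept.synonyms]:
            if re.search(r"\b" + re.escape(label.lower()) + r"\b", lowered):
                tags.append(concept.pref_label)
                break  # one hit per concept is enough
    return tags

def needs_migration(text, taxonomy):
    """Step 4: any secure-topic tag marks the document for review and migration."""
    return bool(tag(text, taxonomy))

report = "Report on aquifer contamination near the testing facility."
menu = "Cafeteria menu for the week."
print(tag(report, TAXONOMY), needs_migration(report, TAXONOMY))
print(tag(menu, TAXONOMY), needs_migration(menu, TAXONOMY))
```

Note that the report is tagged through synonyms alone; neither preferred label appears in its text.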
What is a Taxonomy?
A taxonomy is a controlled vocabulary used to describe or characterize explicit concepts of information for the purpose of capturing, managing, and presenting that information.
Taxonomies are often driven by:
● Type of Content
● Medium
● Organization
● Purpose
● Topic (most relevant for our approach)
Our Approach
Building Our Understanding
For this engagement, EK conducted a thorough discovery phase:
Focus Groups
Conduct focus groups with staff who are creators, holders, or consumers of content to ensure a complete understanding of the content they work with and what constitutes secure information for them.
Document Review
Analyze documentation, content, and data that suggest secure information, as well as documentation without secure information, to identify key topics.
Corpus Analysis
Conduct a semantic analysis of content that uses a machine learning algorithm to identify significant terms, which can validate and enhance the designed taxonomy.
Focus Groups with Core Team & SMEs: 33+
Documents Evaluated: 287k
Building the Taxonomy
⬢ Study Area
⬢ Geography
⬢ Method of Measure
⬢ Environment
⬢ Application
⬢ Content Type
EK used the field of environmental research to model what could be identified as secure information within a specific domain.
The terms that made up these taxonomies were identified through focus groups with environmental research SMEs, as well as four corpus analyses on subsets of relevant content.
The corpus analysis identified and added 37% of the taxonomy terms (i.e., terms and synonyms), thus enriching the final proof-of-concept (POC) taxonomy.
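One simple way a corpus analysis can surface candidate terms is to compare how often words appear in sensitive versus non-sensitive documents. The slides do not say which algorithm was used, so the document-frequency comparison below is a stand-in, and the function names, threshold, and sample texts are all assumptions.

```python
import re
from collections import Counter

def doc_freq(docs):
    """Count how many documents each lowercase word appears in."""
    counts = Counter()
    for doc in docs:
        counts.update(set(re.findall(r"[a-z]+", doc.lower())))
    return counts

def candidate_terms(sensitive_docs, other_docs, known_terms, min_docs=2):
    """Words seen in at least `min_docs` sensitive documents, never in the
    non-sensitive set, and not yet in the taxonomy."""
    sens = doc_freq(sensitive_docs)
    other = doc_freq(other_docs)
    return sorted(
        term for term, n in sens.items()
        if n >= min_docs and other[term] == 0 and term not in known_terms
    )

# Hypothetical mini-corpora and taxonomy.
sensitive = [
    "Isotope readings from the aquifer survey",
    "Aquifer isotope levels exceeded limits",
    "Survey of aquifer wells",
]
other = [
    "Survey of staff parking",
    "Readings from the cafeteria suggestion box",
]
known = {"aquifer"}

print(candidate_terms(sensitive, other, known))
```

Candidates produced this way would still go to SMEs for review before being added as terms or synonyms.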
Solution Architecture
Project Solution Architecture
EK leveraged two main tools for this POC:
⬢ PoolParty: Hosted the taxonomy and ontology, and via API, auto-tagged the provided documents.
⬢ GraphDB: Stored the documents and their applied tags from the taxonomy and ontology.
To successfully complete this approach, EK created data pipelines between the document storage account, PoolParty, and GraphDB using UnifiedViews, an ETL tool. These pipelines facilitated the necessary data transformation and integration to power GraphSearch.
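The shape of the data flowing through such a pipeline can be sketched without any of the real tools. Everything below is an assumption made for illustration: the document and concept URIs are invented, `fake_tagger` stands in for the call to the tagging service, and the choice of the Dublin Core `dcterms:subject` predicate is only one plausible way to link a document to a concept. The output is a list of N-Triples lines of the kind a graph database can load.

```python
# Dublin Core "subject" predicate; its use here is an assumption, not the
# schema described in the slides.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"

def tags_to_ntriples(doc_uri, concept_uris):
    """Serialize one document's tags as N-Triples lines."""
    return [f"<{doc_uri}> <{DCTERMS_SUBJECT}> <{c}> ." for c in concept_uris]

def run_pipeline(documents, tagger):
    """documents: mapping of document URI to text.
    tagger: any callable that maps text to a list of concept URIs."""
    triples = []
    for uri, text in documents.items():
        triples.extend(tags_to_ntriples(uri, tagger(text)))
    return triples

def fake_tagger(text):
    """Hypothetical stand-in for the tagging service call."""
    if "aquifer" in text.lower():
        return ["http://example.org/taxonomy/groundwater"]
    return []

docs = {
    "http://example.org/docs/1": "Aquifer survey results",
    "http://example.org/docs/2": "Cafeteria menu",
}
for line in run_pipeline(docs, fake_tagger):
    print(line)
```

Keeping the tagger as a plain callable means the same loop works whether tags come from an API call, a local model, or a test stub.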
Visualizing Tags
⬢ EK leveraged PoolParty’s GraphSearch server to allow the organization to visualize the results of the auto-tagging process.
⬢ Users could filter and search for documents based on the identified tags.
⬢ During this phase, we could visualize and analyze the accuracy of the tags.
[Figure: View of PoolParty’s GraphSearch]
Our Methodology and Best Practices
Designing a Taxonomy for Auto-Tagging
Design Best Practices
Remember Your End User: A Machine
Design requirements for a taxonomy consumed by a machine differ from those for a taxonomy leveraged by a human for navigation, search, etc.
Granularity Is Important
The taxonomy should reflect the granularity of the content and get into the details of what is presented in the content.
Synonyms at the Correct Level Are Your Friends
With relevant and accurate synonyms used correctly, auto-tagging can better parse through the text and recognize what the content is about.
Ensure Taxonomy Terms Are Reflective of the Content
The topics of your content items should help form the basis of your taxonomy.
Choosing a Method
WHAT IS AUTO-TAGGING?
Auto-tagging: An advanced application of taxonomy in which terms are automatically applied to content as tags through text recognition, inheritance, or other automated means.
Basic level: Searching the text for taxonomy terms to apply, relying solely on the term appearing in the content itself.
More complex level: Using context and machine learning to tag additional terms that may not be in the content itself.
What Type of Auto-tagging Works for Your Needs?
1. Metadata Inheritance
2. Migration Logic
3. NLP Extractor
4. ML Classification
5. Custom NER Models
AUTO-TAGGING WITH POOLPARTY EXTRACTION
Auto-tagging is text extraction with natural language processing (NLP) and light machine learning (corpus scoring) that scores extracted concepts by a mix of frequency, location in the document, etc.
It’s important to understand both the taxonomy and the content it will be used to tag.
Auto-tagging works well only for the fields of the taxonomy that are topical and well matched to the text of the content items.
Core Components Necessary for Auto-tagging:
● Synonym-rich taxonomy that is aligned with the target content
● Taxonomy management tool
● “Learning” corpus capabilities
● Content management system with target content
● Middle layer that can send content to be tagged and then store the suggested tags
Concept Extraction
Lemmatization and Stemming
Lemmatization reduces words to their common base forms:
● am, are, is => be
● car, cars, car’s, cars’ => car
Stemming reduces words to their root by stripping suffixes:
● accounts, accounting, accountant => account
Concept extraction does not require that the exact term from the taxonomy be present in the text. Techniques like stemming and lemmatization can help increase matches.
Important Note!
Stemming and lemmatization can be risky as they may obscure real differences in meaning.
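A toy sketch makes both the benefit and the risk concrete. The lookup table and suffix list below are deliberately tiny and hypothetical; production systems use full lemmatizers and established stemming algorithms rather than anything this crude.

```python
# Hypothetical lemma table covering only the slide's examples.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}

def lemmatize(word):
    """Dictionary lookup: map an inflected form to its base form."""
    w = word.lower()
    return LEMMAS.get(w, w)

# Toy suffix list (assumption), checked in order.
SUFFIXES = ("ing", "ant", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

print([lemmatize(w) for w in ["am", "are", "is"]])
print([stem(w) for w in ["accounts", "accounting", "accountant"]])
# The risk: unrelated senses collapse to the same stem.
print(stem("meaning"), stem("means"))
```

The last line shows the hazard the note warns about: "meaning" and "means" share a stem here even though they rarely share a sense, so a taxonomy term could match text that is about something else.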
AUTO-TAGGING WITH POOLPARTY EXTRACTION
Scoring/Ranking Extraction
Scoring methods:
● Frequency - the more often a term appears in a document, the higher it scores.
● Location boosting - terms found in certain locations in a document (for example, the title) will have their score “boosted,” or weighted higher.
● Term Frequency-Inverse Document Frequency (TF-IDF) - this scoring method penalizes overly frequent terms and boosts rare terms. The frequency of a term in a document is balanced against the frequency of that term across a representative corpus of documents. For example, the most frequently used word in many English documents is “the”; using TF-IDF scoring, this term will have a low score.
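The three scoring ideas can be combined in a few lines. This is a textbook-style sketch, not PoolParty's actual scoring formula, which the slides do not specify; the `title_weight` value, the smoothed IDF variant, and the three-document corpus are all assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def score_terms(title, body, corpus, title_weight=3.0):
    """Score terms by weighted frequency times inverse document frequency."""
    weighted_tf = Counter(tokenize(body))      # frequency in the body
    for term in tokenize(title):               # location boosting for the title
        weighted_tf[term] += title_weight
    doc_vocab = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus)
    scores = {}
    for term, tf in weighted_tf.items():
        df = sum(term in vocab for vocab in doc_vocab)
        idf = math.log((1 + n_docs) / (1 + df))  # 0 when a term is in every document
        scores[term] = tf * idf
    return scores

# Hypothetical representative corpus; the first entry is the document being scored.
corpus = [
    "the aquifer survey measured the isotope levels in the aquifer",
    "the staff meeting covered the parking policy",
    "the cafeteria menu for the week",
]
scores = score_terms("Aquifer Survey", corpus[0], corpus)
for term, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {value:.2f}")
```

In this corpus "the" appears in every document and therefore scores zero despite being the most frequent word, while "aquifer" ranks first because it is frequent, rare across the corpus, and present in the title.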
Fine-tuning
FINE-TUNING ITERATIVELY
[Diagram: iterative cycle of Auto-tag, Evaluate Accuracy, Initial Fine-tuning, and Long-term Fine-tuning]
Initial Fine-tuning
● Blacklist
● Exact match
● Synonyms
● Adjust taxonomy
● Prioritize content segments (e.g., Title)
● Corpus scoring
Long-term Fine-tuning
● Blacklist
● Exact match
● Disambiguation
● Ontology
● Shadow concepts
● Corpus adjustment
● TF-IDF scoring
● F-score
Iterative Fine-tuning
You will need to conduct multiple rounds, tweaking the taxonomy and rules to best fit the content you are working with, and evaluating the accuracy for each round.
How to Assess Accuracy
⬢ GOLD STANDARD
⬢ ANECDOTAL ACCURACY
⬢ F-SCORES AND IAA (INTER-ANNOTATOR AGREEMENT)
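An F-score compares the tags a system applied against a gold-standard set chosen by human reviewers. The sketch below uses the standard precision, recall, and F-beta definitions; the tag names are hypothetical, and inter-annotator agreement is not covered here.

```python
def f_score(gold, predicted, beta=1.0):
    """Return (precision, recall, F-beta) for one document's tag sets."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Hypothetical gold-standard tags versus auto-tagger output.
gold = {"Groundwater Contamination", "Test Site", "Isotope Analysis"}
predicted = {"Groundwater Contamination", "Test Site", "Cafeteria"}

p, r, f1 = f_score(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

For a security use case, a missed sensitive document usually costs more than a false alarm, so a `beta` above 1, which weights recall more heavily, may be worth considering when comparing fine-tuning rounds.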
Q&A
Thank you for listening.
Questions?
JOE HILGER,
COO and Co-Founder of Enterprise
Knowledge
JHILGER@ENTERPRISE-KNOWLEDGE.COM
WWW.LINKEDIN.COM/IN/JOSEPH-HILGER/
SARA DUANE,
Senior Technical Analyst
SDUANE@ENTERPRISE-KNOWLEDGE.COM
WWW.LINKEDIN.COM/IN/SARA-DUANE/

08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Último (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Identifying Security Risks Using Auto-Tagging and Text Analytics

  • 1. Identifying Security Risks Using Auto-Tagging & Text Analytics. Text Analytics Forum 2022. Joe Hilger and Sara Duane.
  • 2. ENTERPRISE KNOWLEDGE. Outline: EK at a Glance; The Problem; Our Approach; Our Methodology and Best Practices. What You Will Learn: ⬢ How to identify confidential information across an enterprise ⬢ Best practices for leveraging and tuning auto-tagging ⬢ How to design a taxonomy for auto-tagging
  • 3. JOE HILGER, CTO AND COFOUNDER, ENTERPRISE KNOWLEDGE: ⬢ 33 Years of Consulting Experience ⬢ Expert in Knowledge Management and Knowledge Graph Technologies ⬢ Coauthor of Making KM Clickable (2022). SARA DUANE, SENIOR TECHNICAL ANALYST, ENTERPRISE KNOWLEDGE: ⬢ Serves as project manager for technical implementation and strategy projects ⬢ Conducted complex auto-tagging projects for clients in both the commercial and federal space
  • 4. EK At A Glance. 10 AREAS OF EXPERTISE: KM STRATEGY & DESIGN; TAXONOMY & ONTOLOGY DESIGN; TECHNOLOGY SOLUTIONS; AGILE, DESIGN THINKING, & FACILITATION; CONTENT & BRAND STRATEGY; KNOWLEDGE GRAPHS, DATA MODELING, & AI; ENTERPRISE SEARCH; INTEGRATED CHANGE MANAGEMENT; ENTERPRISE LEARNING; CONTENT MANAGEMENT. 80+ EXPERT CONSULTANTS. HEADQUARTERED IN WASHINGTON, DC, USA. PRESENCE IN BRUSSELS, BELGIUM. ESTABLISHED 2013 – OUR FOUNDERS AND PRINCIPALS HAVE BEEN PROVIDING KNOWLEDGE MANAGEMENT CONSULTING TO GLOBAL CLIENTS FOR OVER 20 YEARS. STABLE CLIENT BASE. AWARD-WINNING CONSULTANCY: KMWORLD’S 100 COMPANIES THAT MATTER IN KM (2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022); KMWORLD’S TOP 50 TRAILBLAZERS IN AI (2020, 2021, 2022); CIO REVIEW’S 20 MOST PROMISING KM SOLUTION PROVIDERS (2016); INC MAGAZINE #2,343 OF THE 5000 FASTEST GROWING COMPANIES (2021), #2,574 (2020), #2,411 (2019), #1,289 (2018); INC MAGAZINE BEST WORKPLACES (2018, 2019, 2021, 2022); WASHINGTONIAN MAGAZINE’S TOP 50 GREAT PLACES TO WORK (2017); WASHINGTON BUSINESS JOURNAL’S BEST PLACES TO WORK (2017, 2018, 2019, 2020); ARLINGTON ECONOMIC DEVELOPMENT’S FAST FOUR AWARD – FASTEST GROWING COMPANY (2016); VIRGINIA CHAMBER OF COMMERCE’S FANTASTIC 50 AWARD – FASTEST GROWING COMPANY (2019, 2020).
  • 6. Problem Statement. At this federal research organization, researchers, proposal authors, project managers, etc. all leverage project content, data, and documentation on their shared drives. They need a way to identify content that is controlled, CUI, or otherwise sensitive so that they can: ▪ Move the relevant documents to a secure location ▪ Prevent data loss and compliance issues ▪ Ensure all documents have a classification
  • 7. How Common Tools Solve the Problem. Many tools or solutions would solve this by looking for PII through pattern recognition, including: ⬢ Using regex to identify the patterns behind PII, such as a phone number. ⬢ Identifying specific sensitivity labels within the content itself, such as “top secret.” These products and solutions don’t look for terms or categories of information that reflect sensitive content. What if a piece of information within a document is sensitive, but contains neither the term “top secret” nor any PII identifiable through pattern recognition?
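The limitation described on this slide can be illustrated with a minimal Python sketch. The two patterns, the function name `pattern_flags`, and the sample sentences are assumptions made for illustration; they are not taken from the talk or from any particular product.

```python
import re

# Hypothetical patterns for illustration; real PII rules would be much broader.
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
LABEL_PATTERN = re.compile(r"\btop secret\b", re.IGNORECASE)


def pattern_flags(text):
    """Return the names of the pattern-based checks that fire on the text."""
    flags = []
    if PHONE_PATTERN.search(text):
        flags.append("phone_number")
    if LABEL_PATTERN.search(text):
        flags.append("sensitivity_label")
    return flags


with_phone = "Call the project lead at 555-867-5309 before the review."
with_label = "TOP SECRET: draft findings for internal use."
# Sensitive by topic, but contains no PII pattern and no explicit label.
sensitive_topic = "Sampling coordinates and contaminant levels for the restricted site."

for sample in (with_phone, with_label, sensitive_topic):
    print(pattern_flags(sample), "<-", sample)
```

The third sample produces no flags, which is the gap a subject-oriented taxonomy is meant to close.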
  • 8. Our Solution: Teaching Technology. 1. Identify the terms, words, and categories of information that suggest secure information. 2. Develop a subject-oriented topic taxonomy of secure terms. 3. Conduct auto-tagging on documents with this subject-oriented taxonomy to identify the secure content. 4. Leverage these tags and labels to begin the migration process.
  • 9. What is a Taxonomy? A taxonomy is a controlled vocabulary used to describe or characterize explicit concepts of information for the purpose of capturing, managing, and presenting. Taxonomies are often driven by: ● Type of Content ● Medium ● Organization ● Purpose ● Topic (most relevant for our approach)
  • 11. Building Our Understanding. For this engagement, EK conducted a thorough discovery phase: ● Focus Groups: Conduct focus groups with staff who are creators, holders, or consumers of content to ensure a complete understanding of the content they work with and what constitutes secure information for them. ● Document Review: Analyze documentation, content, and data that suggests secure information as well as documentation without secure information to identify key topics. ● Corpus Analysis: Conduct a semantic analysis of content that identifies significant terms through a machine learning algorithm and can validate and enhance the designed taxonomy. Figures: Focus Groups with Core Team & SMEs 33+ Documents Evaluated 287k
  • 12. Building the Taxonomy Study Area Geography Method of Measure Environment Application Content Type EK used the field of environmental research to model what could be identified as secure information within a specific domain. The terms that made up these taxonomies were identified through focus groups with environmental research SMEs, as well as four corpus analyses on subsets of relevant content. The corpus analysis identified and added 37% of the taxonomy terms (i.e., terms and synonyms), thus enriching the final POC taxonomy.
  • 13. Solution Architecture. EK leveraged two main tools for this POC: o PoolParty: Hosted the taxonomy and ontology and, via API, auto-tagged the provided documents. o GraphDB: Stored the documents and their applied tags from the taxonomy and ontology. To successfully complete this approach, EK created data pipelines between the document storage account, PoolParty, and GraphDB using UnifiedViews, an ETL tool. These pipelines facilitated the necessary data transformation and integration to power GraphSearch.
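The talk does not show the data model used in GraphDB, so the following is only a sketch of what the "store documents and their applied tags" step could look like. It assumes tags are recorded as RDF statements in N-Triples syntax using the Dublin Core `dcterms:subject` predicate; the function name `tags_to_ntriples` and the `example.org` URIs are hypothetical, and no PoolParty, UnifiedViews, or GraphDB API is called.

```python
# Dublin Core predicate commonly used to link a resource to its topic.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def tags_to_ntriples(doc_uri, concept_uris):
    """Serialize one document's tags as N-Triples lines (one triple per tag)."""
    return [
        f"<{doc_uri}> <{DCTERMS_SUBJECT}> <{concept_uri}> ."
        for concept_uri in concept_uris
    ]


# Hypothetical identifiers, for illustration only.
doc = "http://example.org/docs/field-report-001"
concepts = [
    "http://example.org/taxonomy/groundwater",
    "http://example.org/taxonomy/soil-sampling",
]

for line in tags_to_ntriples(doc, concepts):
    print(line)
```

Lines in this form could then be loaded into a triple store by whatever ETL step sits between the tagger and the graph database.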
  • 14. Visualizing Tags ⬢ EK leveraged PoolParty’s GraphSearch server to allow the organization to visualize the results of the auto-tagging process. ⬢ Users could filter and search for documents based on the identified tags. ⬢ During this phase, we could visualize and analyze the accuracy of the tags. View of PoolParty’s GraphSearch
  • 17. Design Best Practices. ● Remember Your End User: A Machine. Design requirements for a taxonomy used by a machine are different from those for a taxonomy leveraged by a human for navigation, search, etc. ● Granularity Is Important. The taxonomy should reflect the granularity of the content and get into the details of what is presented in the content. ● Synonyms at the Correct Level Are Your Friends. With relevant and accurate synonyms used correctly, auto-tagging can better parse through the text and recognize what the content is about. ● Ensure Taxonomy Terms Are Reflective of the Content. The topics of your content items should help form the basis of your taxonomy.
  • 19. WHAT IS AUTO-TAGGING? Auto-tagging: An advanced application of taxonomy in which terms are automatically applied to content as tags through text recognition, inheritance, or other automated means. Basic level: Searching the text for taxonomy terms to apply, relying solely on the term appearing in the content itself. More complex level: Using context and machine learning to tag additional terms that may not be in the content itself. What Type of Auto-tagging Works for Your Needs? 1. Metadata Inheritance 2. Migration Logic 3. NLP Extractor 4. ML Classification 5. Custom NER Models
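The "basic level" described on this slide, applying a tag only when a taxonomy term appears in the text, can be sketched in a few lines of Python. The mini-taxonomy, the function name `basic_auto_tag`, and the sample sentence are invented for illustration and are not from the talk.

```python
import re

# Hypothetical mini-taxonomy: preferred label -> synonyms.
TAXONOMY = {
    "Groundwater": ["ground water", "aquifer"],
    "Soil Sampling": ["soil sample", "soil samples"],
}


def basic_auto_tag(text, taxonomy):
    """Return preferred labels whose label or any synonym appears in the text."""
    lowered = text.lower()
    tags = []
    for label, synonyms in taxonomy.items():
        for term in [label, *synonyms]:
            # Word boundaries avoid matching a term inside a longer word.
            if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered):
                tags.append(label)
                break  # one hit is enough to apply this label
    return tags


doc_text = "Soil samples were collected above the aquifer."
print(basic_auto_tag(doc_text, TAXONOMY))
```

Neither preferred label appears verbatim in the sentence; both tags come from synonyms, which is why a synonym-rich taxonomy matters even at this basic level.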
  • 20. AUTO-TAGGING WITH POOLPARTY EXTRACTION: Concept Extraction. Auto-tagging is text extraction with natural language processing (NLP) and light machine learning (corpus scoring) to score extracted concepts by a mix of frequency, location in the document, etc. It’s important to understand both the taxonomy and the content it will be used to tag. Auto-tagging only works well for the fields of the taxonomy that are topical and well matched to the text of the content items. Core Components Necessary for Auto-tagging: ● Synonym-rich taxonomy that is aligned with the target content ● Taxonomy management tool ● “Learning” corpus capabilities ● Content management system with target content ● Middle layer that can send content to be tagged and then store the suggested tags
  • 21. Lemmatization and Stemming. Concept extraction does not require that the exact term from the taxonomy be present in the text. Techniques like stemming and lemmatization can help increase matches. Lemmatization reduces words to their common base forms: ● am, are, is => be ● car, cars, car’s, cars’ => car. Stemming looks at the root of a word: ● accounts, accounting, accountant => account. Important Note! Stemming and lemmatization can be risky as they may obscure real differences in meaning.
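To make the slide's examples concrete, here is a deliberately tiny Python sketch. It is a toy: the lookup table and suffix list are assumptions chosen to reproduce the examples above, not the algorithm PoolParty or any NLP library uses.

```python
# Toy rules chosen to reproduce the slide's examples only.
LEMMA_TABLE = {"am": "be", "are": "be", "is": "be"}
SUFFIXES = ("ing", "ant", "s")


def lemmatize(word):
    """Map irregular forms via a lookup table; strip possessive and plural marks."""
    w = word.lower()
    if w in LEMMA_TABLE:
        return LEMMA_TABLE[w]
    w = w.replace("'", "").replace("’", "")  # car's -> cars, cars' -> cars
    if w.endswith("s") and len(w) > 3:
        w = w[:-1]
    return w


def stem(word):
    """Strip the first matching suffix, keeping at least four characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 4:
            return w[: -len(suffix)]
    return w


print([lemmatize(w) for w in ["am", "are", "is"]])
print([lemmatize(w) for w in ["car", "cars", "car's", "cars'"]])
print([stem(w) for w in ["accounts", "accounting", "accountant"]])
print(stem("building"))  # the noun "building" collapses to the verb "build"
```

The last two lines show the risk flagged on the slide: "accountant" and "accounting" become the same token, as do "building" and "build", even though a taxonomy may need to treat them as different concepts.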
  • 22. AUTO-TAGGING WITH POOLPARTY EXTRACTION: Scoring/Ranking. Scoring methods: ● Frequency: the more often a term appears in a document, the higher it scores. ● Location boosting: terms found in some locations in a document (for example, the title) will have their score “boosted,” or weighted higher. ● Term Frequency-Inverse Document Frequency (TF-IDF): this scoring method penalizes overly frequent terms and boosts rare terms. The frequency of a term in a document is balanced against the frequency of that term across a representative corpus of documents. For example, the most frequently used word in many English documents is “the”; using TF-IDF scoring, this term will have a low score.
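The TF-IDF idea can be shown with a short Python sketch. It uses one common textbook variant, term frequency times `log(N / df)`; the exact formula used by PoolParty is not given in the talk, and the three-document corpus and the function name `tf_idf` are made up for illustration.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Term frequency in one document times log(N / document frequency)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0  # term never occurs in the corpus
    return tf * math.log(len(corpus) / df)


# Hypothetical three-document corpus, already lowercased and tokenized.
CORPUS = [
    "the sample site near the river showed the contaminant".split(),
    "the budget for the project".split(),
    "the meeting notes for the team".split(),
]

doc = CORPUS[0]
print("the:", tf_idf("the", doc, CORPUS))
print("contaminant:", tf_idf("contaminant", doc, CORPUS))
```

"the" is the most frequent token in the first document, but because it occurs in every document its inverse document frequency is log(1) = 0, so it scores zero, while the rarer "contaminant" receives a positive score.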
  • 24. FINE-TUNING ITERATIVELY. Cycle: Auto-tag, Evaluate Accuracy, Iterative Fine-tuning. Techniques shown under Initial Fine-tuning and Long-term Fine-tuning: ● Blacklist ● Exact match ● Disambiguation ● Ontology ● Shadow concepts ● Corpus adjustment ● TF-IDF scoring ● F-score; and ● Blacklist ● Exact match ● Synonyms ● Adjust taxonomy ● Prioritize content segments (e.g., Title) ● Corpus scoring. You will need to conduct multiple rounds, tweaking the taxonomy and rules to best fit the content you are working with, and evaluating the accuracy for each round.
  • 25. HOW TO ASSESS ACCURACY: ● GOLD STANDARD ● ANECDOTAL ACCURACY ● F-SCORES AND IAA (INTER-ANNOTATOR AGREEMENT)
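As a sketch of how an F-score against a gold standard can be computed for one document, the Python below compares auto-applied tags with human-assigned tags. The tag sets and the function name `precision_recall_f1` are hypothetical; the talk does not say which F-score variant was used, so the balanced F1 is assumed here.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted tags with gold-standard tags for one document."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical tags for a single document.
gold_tags = {"Groundwater", "Soil Sampling", "Contaminant"}
auto_tags = {"Groundwater", "Soil Sampling", "Budget"}

precision, recall, f1 = precision_recall_f1(auto_tags, gold_tags)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Tracking these numbers after each fine-tuning round gives a consistent way to tell whether a taxonomy or rule change actually improved tagging.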
  • 26. Q&A Thank you for listening. Questions? JOE HILGER, COO and Co-Founder of Enterprise Knowledge JHILGER@ENTERPRISE-KNOWLEDGE.COM WWW.LINKEDIN.COM/IN/JOSEPH-HILGER/ SARA DUANE, Senior Technical Analyst SDUANE@ENTERPRISE-KNOWLEDGE.COM WWW.LINKEDIN.COM/IN/SARA-DUANE/