Automatic and rapid generation of massive knowledge repositories from data
1. IF4IT
AUTOMATIC AND RAPID
GENERATION OF MASSIVE
KNOWLEDGE REPOSITORIES,
DIRECTLY FROM DATA
Author/Presenter: Frank Guerino
Chairman for The International Foundation for Information Technology (IF4IT)
Email: Frank.Guerino@if4it.com
LinkedIn: https://www.linkedin.com/in/frankguerino/
Follow Us on Twitter: @IF4IT
Co-Author: Dr. Joel Kline, PhD.
Board of Advisors, The International Foundation for Information Technology (IF4IT)
Professor, Lebanon Valley College, PA-USA
2. IF4IT
The Future is Automated Synthesis of Knowledge Repositories
Read More: https://www.if4it.com/knowledge-management-automated-content-generation-and-curation/
Meet Bob.
Bob is very competent.
Bob outperforms other people by generating one great knowledge article per hour.
Meet Bob’s replacement: Automated Content Generation Software.
Bob’s replacement generates millions of higher quality, highly curated, and semantically inter-linked knowledge articles, in the time it takes Bob to create just one… at a fraction of the cost.
ACTOR → ACTIONS → RESULTS
✖ Bob: Few knowledge repositories, limited content, poor curation, lots of dead links, and no semantic relationships.
✔ Bob’s replacement: More knowledge repositories, far more content, greater curation, almost no dead links, and semantic relationships.
3. IF4IT
The Wikipedia Problem
• The Wikipedia Community is NOT like an
Enterprise Work Community
- About 17 years to develop,
- Over 130M voluntary editors (i.e. free labor),
- Over 6M content articles
• People believe they can build internal
knowledge repositories (like libraries and intranets) using the same
manual content development paradigm as Wikipedia
• The end result is almost always the same… “Relatively empty and
low value Knowledge/Content Repositories”
People often can’t find the answers they need.
Read More: https://www.if4it.com/wikipedia-problem-understanding-enterprise-knowledge-repositories-fail/
4. IF4IT
The Problem is Manual Labor
Quantity: Low quantities of artifact delivery.
Quality: Higher levels of human-introduced errors.
Time: Longer artifact delivery times.
Money: High costs for delivery of artifacts.
Trend: Knowledge Repository Automation is very important because, more often than not, the teams that build repositories have very limited resources (people & finances).
Trend: With the move to “Digital”, expectations of Knowledge Repositories are even higher.
5. IF4IT
The Solution = Automation via Compilation
• The process is called Synthesis (a.k.a. Compilation)
• Compilation is the word used by software developers
• Synthesis is the word used by non-software developers
• Specifically, we use and recommend Data Driven
Synthesis (DDS)
• We use Compiler-based DDS to generate content, curate
content, interlink content, and automatically build and
provision Knowledge/Content repositories
Read More: https://www.if4it.com/understanding-data-driven-synthesis/
6. IF4IT
Many Decades of Successful Synthesis
Synthesis/Compilation of Software (Since 1970s)
Synthesis of Integrated Circuit Schematics (Since 1992)
- Inputs are Hardware Description Languages (HDLs) like VHDL and Verilog.
- Outputs are used for Simulation, Acceleration, Emulation, and Fabrication
Synthesis of APIs and software code (i.e. Scaffolding for Software
Developers, such as for Java Spring and Ruby on Rails)
Synthesis of large volumes of test data to exercise complex systems
Synthesis of chemical Compounds for Drug Discovery
Synthesis of Health Care Pathways (Diagnosis + Treatments)
Synthesis of (computer generated) Music and Art
Synthesis of Electronic Documentation
(i.e. data driven content)
Synthesis of Digital Libraries (massive web sites)
Synthesis of Semantic Data Graphs (SDGs)
7. IF4IT
Who cares about DDS-based automation?
• Internet and Intranet Web Content Managers & Developers
• Technical Writers / Technical Communicators
• Architects (Enterprise/Solutions/Business/Applications/Data/etc.)
• Enterprise Models
• Software Developers (Using Compilation for about 5 Decades)
• API Documentation
• Software Configuration Documentation
• Engineers (Using Synthesis for about 3 Decades)
• Hardware, Network, Communications, & Semiconductor Documentation
• Anyone who documents topics, curates, and who publishes results to
web pages in some Content/Knowledge Repository
8. IF4IT
Common Use Cases Driving DDS
• Strategic Planning – Enterprise Portfolio Impact Analysis
• Faster Domain Documentation - More inter-linked documentation, with interactive data and with fewer errors, at far lower costs
• Better Customer Support – Rapid and more accurate Incident Impact
Analysis
• Better Operational Work - Faster Knowledge Discovery = faster &
better work decisions
• Lower Development Costs – Synthesis helps eliminate a significant amount of Software Development work
• Better Search & Discovery – Synthesis helps yield better & more
accurate Search Results
Higher Levels of Customer / End-User Satisfaction
9. IF4IT
Synthesis is Compiler-based
Data Compilation/Synthesis:
Baseline Input Data + Processing Rules → Data Compiler/Synthesizer → Synthesized Output(s)
• Baseline Input Data: Flat files like *.csv sourced from spreadsheets and systems.
• Processing Rules: Control ontologies, formatting, view controls, report generation, semantic relationship harvesting, etc.
• Synthesized Output(s): Outputs are used for machines like computers AND for humans.
Software Compilation/Synthesis:
Source Code Files → Software Compiler/Synthesizer → Compiled Software
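To make the data compilation flow concrete, here is a minimal sketch in Python. It is not the NOUNZ compiler or its rule format; the CSV columns, the `RULES` dictionary, and the `compile_pages` function are all hypothetical names chosen for illustration. It only shows the general idea that flat input data plus a small set of processing rules can be turned into one output page per record.

```python
import csv
import html
import io

# Hypothetical baseline input data: one flat CSV "file" for a single data type.
APPLICATIONS_CSV = """id,name,owner
app-1,Payroll,Finance
app-2,Billing,Finance
"""

# Hypothetical processing rules: which column titles a page, which columns to show.
RULES = {"title_column": "name", "visible_columns": ["id", "owner"]}


def compile_pages(csv_text, rules):
    """Return a dict mapping an output file name to a synthesized HTML page."""
    pages = {}
    for record in csv.DictReader(io.StringIO(csv_text)):
        title = html.escape(record[rules["title_column"]])
        rows = "".join(
            f"<tr><th>{html.escape(col)}</th><td>{html.escape(record[col])}</td></tr>"
            for col in rules["visible_columns"]
        )
        pages[f"{record['id']}.html"] = (
            f"<html><body><h1>{title}</h1><table>{rows}</table></body></html>"
        )
    return pages


pages = compile_pages(APPLICATIONS_CSV, RULES)
```

Changing the look of every page means editing the rules and recompiling, rather than editing each page by hand.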
10. IF4IT
Benefits of DDS
Agile: Changes can be made iteratively and in
seconds/minutes
• Simple CSV flat files can be compiled
• No long software development cycles
Scalable: Hundreds of Thousands or Millions of content
pages can be generated in minutes
Stable: Elimination of human errors, like dead links, leads
to far higher levels of quality.
Affordable: The cost per content page (including both
Quantity and Quality) is a small fraction of manually
generated content
11. IF4IT
The Synthesis Sequence of Events
1. Synthesizer Inputs (from spreadsheets and systems), e.g. .CSV files:
• Application Data
• Capability Data
• Human Resource Data
• Product Data
• Service Data
• Facility Data
• Organization Data
• Etc. Data
2. Processing Rules for:
• Relationship Discovery
• Data Formatting
• View Generation
• Report Calculations
• Etc.
3. Data Synthesizer / Data Compiler
4. Intranet / Digital Library, containing:
• Node Views: CI (x), CI (y), CI (z)
• Data Graph/Network Relationships
• Business Intelligence: Inventories, Reports, Graphs & Charts, Glossaries, Dashboards, Visualizations, Abbreviations, Acronyms
• Data Indexes
• Catalogs
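The "Relationship Discovery" rule in this sequence can be sketched as follows. This is an illustrative Python example under stated assumptions, not the author's implementation: the two CSV inputs, the `server_id` foreign-key column, the "runs on" predicate, and the function names are all hypothetical. It shows how a rule can harvest (subject, predicate, object) relationships across data types, and how references to missing records can be caught at compile time instead of becoming dead links.

```python
import csv
import io

# Hypothetical inputs: one CSV per data type, keyed by an "id" column.
INPUTS = {
    "Application": "id,name,server_id\napp-1,Payroll,srv-1\napp-2,Billing,srv-9\n",
    "Server": "id,name\nsrv-1,Rack A Host\n",
}

# Hypothetical rule: (source type, foreign-key column, predicate, target type).
RELATIONSHIP_RULES = [("Application", "server_id", "runs on", "Server")]


def load(inputs):
    """Parse each CSV into {type: {id: record}}."""
    return {
        noun_type: {row["id"]: row for row in csv.DictReader(io.StringIO(text))}
        for noun_type, text in inputs.items()
    }


def discover_relationships(nodes, rules):
    """Return (triples, unresolved); unresolved targets would be dead links."""
    triples, unresolved = [], []
    for source_type, column, predicate, target_type in rules:
        for source_id, record in nodes[source_type].items():
            target_id = record[column]
            if target_id in nodes[target_type]:
                triples.append((source_id, predicate, target_id))
            else:
                unresolved.append((source_id, column, target_id))
    return triples, unresolved


nodes = load(INPUTS)
triples, unresolved = discover_relationships(nodes, RELATIONSHIP_RULES)
```

Here `app-2` points at a server that is not in the input data, so it is reported in `unresolved` rather than linked.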
12. IF4IT
Real Business Impacts
✖ The Traditional Way = $$$$$$$$$$$$$$$$$$$ (Complexity: Many Years & Countless Resources)
(Too many complex, expensive, difficult to deliver & operate systems and tools… just to get to a comprehensive view of your enterprise!)
• Intranets / Content Management Systems (Confluence, Jive, Drupal, MediaWiki, etc.)
• Architecture Modeling Tools (AMTs) (Troux, Mega, Adaptive, System Architect, etc.)
• Configuration Management Databases (CMDBs) (HP, BMC, ServiceNow, etc.)
• Stand-Alone Knowledge Management Systems (Madcap, KPS, Bitrix, SalesForce, ServiceNow, etc.)
• Library Management Systems (LMSs) (Koha, Soft Link, NGL, LibSys, Folet, etc.)
• Semantic Data Systems (Cambridge Semantics, Protégé, Swoop, LDIF, etc.)
• Expensive Integration
• Expensive Business Intelligence & Reporting
• Expensive People with Specific Skills
✔ DDS Results = $ (Simplicity: Minutes/Hours & Small # of Resources)
(A very simple, very quick, and very affordable “Compiler Based Approach”)
1. Your Data + Your Rules
2. Your Compiler (Data Synthesizer / Data Compiler)
3. Your Data
4. Your Branded Digital Libraries (Complete with Catalogs, Indexes, Relationships, Data Views, Reports, Dashboards, Visualizations, etc.)
13. IF4IT
Compiler-based DDS helps generate
“Knowledge Structures”
1. Content – High quantities, richly formatted, highly
structured, and strongly inter-linked
2. Interactive Data Visualizations - for Interactive
Analytics, Data Science, and Visual Discovery
3. Knowledge Repositories – fully curated structures
like advanced Intranets and Digital Libraries
Read More: https://www.if4it.com/knowledge-management-understanding-knowledge-structures/
14. IF4IT
1. Content: SFN over LFN
✖ Raw and unstructured human narrative in the form of “content” (not “data”).
✔ Highly structured data, based on Name/Value pair paradigms (e.g. CSV, JSON, etc.).
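As a small illustration of the Name/Value pair idea, the sketch below expresses the same facts once as free narrative and once as a structured record, then serializes the record to JSON and CSV with Python's standard library. The sentence, field names, and values are invented for the example.

```python
import csv
import io
import json

# Hypothetical narrative sentence and the same facts as name/value pairs.
narrative = "Payroll is an application owned by Finance that runs on srv-1."
record = {"type": "Application", "name": "Payroll", "owner": "Finance", "server": "srv-1"}

# JSON form of the record.
as_json = json.dumps(record, sort_keys=True)

# CSV form of the record: a header row of names and one row of values.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
writer.writeheader()
writer.writerow(record)
as_csv = buffer.getvalue()

# Structured data can be queried by name; the narrative has to be read and interpreted.
owner = json.loads(as_json)["owner"]
```

The structured forms are what a data compiler can consume directly, because each value is addressable by its name.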
15. IF4IT
2. Interactive Data Visualizations
VisualComplexity.com D3js.org
• Data Science and Data Scientists are VERY expensive.
• DDS creates a common set of fully integrated Data Visualizations
• DDS automatically creates many more out-of-the-box and ready-to-use Data Visualizations, faster and at far lower costs.
16. IF4IT
Interactive Data Visualization Examples…
• Geographic Maps
• Force Directed Graphs
• Bubbles
• Condegram Spirals
• Bars, Pies, Lines
• Sankey Flows
• Chords
• Multivariate Grids
See many interactive examples in the gallery at: http://www.d3js.org
18. IF4IT
The Spectrum of Synthesizable Knowledge Structures
Range of Synthesizable Knowledge Structures (Simplest → Most Complex)
Simplest Knowledge Structures
• Bits and Bytes
• Built-In Types and Constants
• Lists, Arrays, and Hash Tables
• Stacks and Heaps
• For Loops, Do Loops, and While Loops
• Formulas and Algorithms
• Buffers, Streams and Files
• Classes and Objects
Simple Knowledge Structures
• Data Records/Nodes
• Tables & Inventories
• Charts (Pie, Bar, Area, Bubble, etc.)
• Graphs (Line, Multi-Line, etc.)
• Web Pages
• Catalogs
• Indexes
• Reports
• Semantic Relationships
• Semantic Predicates
Moderately Complex Knowledge Structures
• Dashboards
• Data Visualizations (many different visualizations)
• Semantic Data Graphs (SDGs) / Semantic Data Networks (SDNs)
• HTML Link Networks
• Navigation Taxonomies
• Classification Taxonomies
Complex Knowledge Structures
• General Web Sites
• Intranets
• Architecture Models
• Architecture Repositories
• Configuration Management Databases (CMDBs)
• Domain-specific Knowledge Repositories
Super Complex Knowledge Structures
• Multi-Context/Multi-Domain Digital Libraries that include all other structures in the spectrum (all of the categories above)
• Industry Specific Determinations…
- Automatic Claim Processing
- New Viable Drugs
- Health Care Pathways
- High Frequency Auto-Investing
- Etc.
Example Formats = TXT, CSV, TSV, JSON, XML, HTML, SVG, PDF, Etc.
Read More: https://www.if4it.com/knowledge-management-understanding-knowledge-structures/
19. IF4IT
DDS Solves the Wikipedia Problem for Enterprises...
Quantity: Much higher quantities of artifact delivery.
Quality: Much higher levels of quality.
Time: Much shorter times for artifact delivery (i.e.
much higher quantities with higher quality).
Money: Much lower costs to deliver artifacts
(especially for Data Science & Data Visualizations).
FASTER & BETTER
KNOWLEDGE DISCOVERY
AND DECISION MAKING
20. IF4IT
The Benefits of DDS
• More and Better Knowledge Repositories
- Far higher quantities of more advanced content
- More advanced features and capabilities
- Dynamic integration of data with content
- Higher quality of content (e.g. far fewer dead links)
- Far less investment of time and funds
• Higher stakeholder satisfaction and engagement
21. IF4IT
Getting Started with DDS
1. Acquire a Data Compiler/Synthesizer
• Contact IF4IT for a free NOUNZ Lite compiler https://www.if4it.com/contact-us/
2. Start with simple Spreadsheet-based Inventories (and SharePoint List Structure extracts)
3. Incrementally customize small data sets to meet your needs and your
desired look-and-feel
4. Slowly progress to more complicated Data Extracts (from proprietary
systems)
5. Keep in mind that Time-To-Learn is “incremental” [you don’t have to
start with big projects]
Crawl Walk Run
22. IF4IT
Questions and Discussion
Frank Guerino
CEO & Chairman
The International Foundation for
Information Technology (IF4IT)
Email: Frank.Guerino@if4it.com
Twitter: @IF4IT
23. IF4IT
Read More:
• Automated Content Generation & Curation: https://www.if4it.com/knowledge-management-automated-content-generation-and-curation/
• The Wikipedia Problem: https://www.if4it.com/wikipedia-problem-understanding-enterprise-knowledge-repositories-fail/
• Understanding Data Driven Synthesis: https://www.if4it.com/understanding-data-driven-synthesis/
• Understanding Knowledge Structures: https://www.if4it.com/knowledge-management-understanding-knowledge-structures/
• Learn about D3 and Interactive Visualizations: http://www.d3js.org
• Learn about the IF4IT NOUNZ Data Compilation Platform:
https://www.if4it.com/nounz/
• See Interactive Example of DDS-generated Generic Digital Library:
http://nounz.if4it.com (Less than 3 minutes to generate.)
• See Interactive Example of DDS-generated KM Body of Knowledge:
http://km.if4it.com (Only seconds to generate.)
25. IF4IT
Global Biopharmaceutical
-- TOTAL Administration Category Noun Instances = 5: Time = Wednesday June 15, 2016 at 10:04:08
-- TOTAL Assay Noun Instances = 749: Time = Wednesday June 15, 2016 at 10:04:08
-- TOTAL Biological Matrix Category Noun Instances = 42: Time = Wednesday June 15, 2016 at 10:04:08
-- TOTAL Biomarker Noun Instances = 42: Time = Wednesday June 15, 2016 at 10:04:08
-- TOTAL Company Noun Instances = 18: Time = Wednesday June 15, 2016 at 10:04:08
-- TOTAL Disease Mechanism Noun Instances = 17: Time = Wednesday June 15, 2016 at 10:04:08
-- TOTAL Facility Noun Instances = 3: Time = Wednesday June 15, 2016 at 10:04:08
-- TOTAL Immunoassay Platform Noun Instances = 6: Time = Wednesday June 15, 2016 at 10:04:08
-- TOTAL Instrument Category Noun Instances = 5: Time = Wednesday June 15, 2016 at 10:04:08
-- TOTAL Instrument Noun Instances = 37: Time = Wednesday June 15, 2016 at 10:04:08
-- TOTAL Offering Noun Instances = 516: Time = Wednesday June 15, 2016 at 10:04:09
-- TOTAL Program Category Noun Instances = 5: Time = Wednesday June 15, 2016 at 10:04:09
-- TOTAL Study Type Noun Instances = 17: Time = Wednesday June 15, 2016 at 10:04:09
-- TOTAL White Paper Noun Instances = 28: Time = Wednesday June 15, 2016 at 10:04:09
-- TOTAL Application Noun Instances = 1000: Time = Wednesday June 15, 2016 at 10:04:09
-- TOTAL Business Domain Noun Instances = 9: Time = Wednesday June 15, 2016 at 10:04:09
-- TOTAL Capability Noun Instances = 32: Time = Wednesday June 15, 2016 at 10:04:09
-- TOTAL Computing Server Noun Instances = 100: Time = Wednesday June 15, 2016 at 10:04:09
-- TOTAL Contract Noun Instances = 1166: Time = Wednesday June 15, 2016 at 10:04:09
-- TOTAL Country Noun Instances = 251: Time = Wednesday June 15, 2016 at 10:04:09
-- TOTAL Customer Noun Instances = 150: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Database Noun Instances = 100: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Data Transport Technology Noun Instances = 4: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Environment Noun Instances = 8: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Frequently Asked Question Noun Instances = 32: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Information Category Noun Instances = 16: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Interface Noun Instances = 99: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Language Code Noun Instances = 504: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Letter Noun Instances = 26: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Location Noun Instances = 50: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Market Sector Noun Instances = 2: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Market Segment Noun Instances = 2: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL News Article Noun Instances = 6: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Number Noun Instances = 9: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Organization Noun Instances = 29: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Policy Noun Instances = 100: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Process Noun Instances = 26: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Product Noun Instances = 25: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Project Noun Instances = 1000: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Resource Noun Instances = 14: Time = Wednesday June 15, 2016 at 10:04:10
-- TOTAL Sales Transaction Noun Instances = 886: Time = Wednesday June 15, 2016 at 10:04:11
-- TOTAL SDLC Activity Noun Instances = 353: Time = Wednesday June 15, 2016 at 10:04:11
-- TOTAL SDLC Phase Noun Instances = 14: Time = Wednesday June 15, 2016 at 10:04:11
-- TOTAL Service Noun Instances = 561: Time = Wednesday June 15, 2016 at 10:04:11
-- TOTAL Software Noun Instances = 100: Time = Wednesday June 15, 2016 at 10:04:11
-- TOTAL Glossary Term Noun Instances = 235: Time = Wednesday June 15, 2016 at 10:04:11
-- TOTAL Vendor Noun Instances = 100: Time = Wednesday June 15, 2016 at 10:04:11
-- TOTAL Undefined Noun Type Noun Instances = 1: Time = Wednesday June 15, 2016 at 10:04:11
TOTAL Number of Unique Noun Types = 48: Time = Wednesday June 15, 2016 at 10:04:11
TOTAL Noun Instances registered = 8500: Time = Wednesday June 15, 2016 at 10:04:11
TOTAL Number of Unique Abbreviations or Acronyms = 655: Time = Wednesday June 15, 2016 at 10:04:11
TOTAL Number of Unique Semantic Relationships = 30767: Time = Wednesday June 15, 2016 at 10:04:15
TOTAL Number of Unique Semantic Relationship Predicates = 97: Time = Wednesday June 15, 2016 at 10:04:15
TOTAL Minimum Number of HTML Links = 113536: Time = Wednesday June 15, 2016 at 10:07:27
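A log in this shape is easy to check mechanically. The sketch below is a hypothetical Python helper (the regular expression and the `noun_counts` function are assumptions based only on the line format shown above) that extracts the per-type counts so they can be compared with the reported totals. Only three lines of the log are embedded here to keep the example short. Summing the 48 per-type lines above by hand gives 8500, which agrees with the "TOTAL Noun Instances registered" line.

```python
import re

# A few lines copied from the log above; a real check would read the whole log.
LOG = """\
-- TOTAL Administration Category Noun Instances = 5: Time = Wednesday June 15, 2016 at 10:04:08
-- TOTAL Assay Noun Instances = 749: Time = Wednesday June 15, 2016 at 10:04:08
-- TOTAL Facility Noun Instances = 3: Time = Wednesday June 15, 2016 at 10:04:08
"""

# Matches per-type lines only; summary lines do not start with "-- TOTAL".
PER_TYPE = re.compile(r"^-- TOTAL (?P<noun_type>.+?) Noun Instances = (?P<count>\d+):")


def noun_counts(log_text):
    """Return {noun type: instance count} for every per-type line in the log."""
    counts = {}
    for line in log_text.splitlines():
        match = PER_TYPE.match(line)
        if match:
            counts[match.group("noun_type")] = int(match.group("count"))
    return counts


counts = noun_counts(LOG)
total_types = len(counts)
total_instances = sum(counts.values())
```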
Spreadsheets were used to easily and quickly collect, organize, and supply data to the NOUNZ Compiler in First Normal Form (1NF) CSV format.
Vertical industry and business data was collected from a public Biopharma web site, then organized and cleansed in about 5 hours.
Generic IT data was intentionally commingled with Biopharma vertical industry and business data, in order to show the effects of mixing different data types.
TOTALS:
Total unique Noun Types (Data Types) = 48
Total Catalogs = 50
Total Noun Instances (across all Noun Types) = 8500
Total Semantic Relationships = 30767
Total Semantic Predicates = 97
Total Abbreviations and Acronyms = 655
Total “minimum” # of HTML links = 113536
Total Compile Time = 3 Minutes and 27 Seconds
26. IF4IT
Regional Health Care Payer/Insurer
• 47 defined Noun Types (a.k.a. Data Types),
• Almost 49,000 Noun Instances (a.k.a. Data Instances or Records) that are sourced from the different Noun Types,
• Almost 294,000 automatically synthesized web pages with different views of data
and information,
• Over 300K automatically discovered and harvested Semantic Relationships that
translate directly to over 1,100,000 contextual and meaningful HTML links.
• 46 total Catalogs, including a Master Catalog, 47 Noun Domain Specific Catalogs (one for each Noun Type), an Abbreviations/Acronyms Catalog, and a Relationship Predicates Catalog
• 288 unique Indexing Categories with 2582 unique Data Indexes
• 869 harvested and curated Abbreviations and Acronyms
• Over 1,600 unique semantic relationship descriptors (i.e. Predicates)
• 47 Domain Specific Dashboards (one for each Noun Type).
Total Compile Time = Approximately 15 minutes