This document discusses building knowledge graphs by extracting, aligning, and linking data from various sources. It describes crawling websites to acquire raw data, extracting features from both structured and unstructured content, aligning the extracted features to a common schema, and resolving entities so that records referring to the same real-world entity are merged. It also discusses techniques for collectively resolving entities in large datasets, summarizing graphs by grouping similar nodes into super-nodes, and using the summarized graph to predict links in the original graph. The overall goal is to clean, organize, and link disconnected data into a knowledge graph that is easier to query, analyze, and visualize.
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
1. Extracting, Aligning, and Linking Data to Build Knowledge Graphs
Craig Knoblock
University of Southern California
Thanks to my collaborators: Pedro Szekely, Linhong Zhu, Majid
Ghasemi-Gol, Mohsen Taheriyan, Minh Pham, and Steve Minton
2. Goal
USC Information Sciences Institute CC-By 2.0 2
From: raw, messy, disconnected data (hard to query, analyze & visualize)
To: clean, organized, linked data (easy to query, analyze & visualize)
3. Use Case: Human Trafficking
From: raw, messy, disconnected data (hard to query, analyze & visualize)
To: clean, organized, linked data (easy to query, analyze & visualize)
4. Use Case: Human Trafficking
100 million pages
~ 100 Web sites
help victims
prosecute traffickers
5. Example: Investigating a Reported Victim
San Diego, where else?
6. DIG Interface: Find the locations where a potential victim was advertised
7. Steps To Build a KG
Pipeline: Data Acquisition -> Feature Extraction -> Feature Alignment -> Entity Resolution -> Graph Construction -> User Interface
Diagram labels: Crawling, Extraction, Data Acquisition, Mapping To Ontology, Entity Linking & Similarity, Knowledge Graph Deployment, Query & Visualization, ElasticSearch, Graph DB, schema.org, geonames
8. Data Acquisition
downloading relevant data
- modes: batch, real-time
- sources: Web pages, Web services, databases
- formats: CSV, Excel, XML, JSON
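The formats listed above can be normalized into one record shape before any extraction runs. The following is a minimal sketch using only the Python standard library. The function name `load_records`, the sample payloads, and the assumption that each child of the XML root is one record are illustrative and not taken from the slides. Excel is left out because it needs a third-party reader.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def load_records(payload, fmt):
    """Parse one downloaded payload into a list of flat dict records."""
    if fmt == "csv":
        return [dict(row) for row in csv.DictReader(io.StringIO(payload))]
    if fmt == "json":
        data = json.loads(payload)
        return data if isinstance(data, list) else [data]
    if fmt == "xml":
        root = ET.fromstring(payload)
        # Assumption: each child of the root element is one record.
        return [
            {field.tag: (field.text or "").strip() for field in item}
            for item in root
        ]
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical payloads describing the same ad in three formats.
csv_ads = "title,price\nswap stocks,0\n"
json_ads = '[{"title": "swap stocks", "price": "0"}]'
xml_ads = "<ads><ad><title>swap stocks</title><price>0</price></ad></ads>"

from_csv = load_records(csv_ads, "csv")
from_json = load_records(json_ads, "json")
from_xml = load_records(xml_ads, "xml")
```

All three calls return the same list of dicts, so later steps do not need to know which format a record came from.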
11. Steps To Build a KG
Pipeline: Data Acquisition -> Feature Extraction -> Feature Alignment -> Entity Resolution -> Graph Construction -> User Interface
12. Feature Extraction
from raw sources to structured data
• extraction from text
• extraction from structured Web pages
• extraction of image features
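As a small illustration of extraction from text, the sketch below pulls phone numbers and long-form dates out of free text with regular expressions. The patterns, the function name `extract_features`, and the sample sentence are assumptions for illustration; the slides do not say which text extractors the project uses.

```python
import re

# US-style phone numbers such as 601-813-7280 (assumed format).
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

# Long-form dates such as "October 19, 2015".
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE = re.compile(r"\b(?:" + MONTHS + r")\s+\d{1,2},\s+\d{4}\b")


def extract_features(text):
    """Return the phone numbers and dates found in a block of text."""
    return {"phones": PHONE.findall(text), "dates": DATE.findall(text)}


sample = "Blued barrel. Text 601-813-7280. Posted October 19, 2015 12:43 pm"
features = extract_features(sample)
```

Hand-written patterns like these are brittle on noisy ads, which is one reason the evaluation later in the deck distinguishes perfect extractions from merely usable ones.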
15. Automated Extraction
[Minton et al., Inferlink]
• Title
• Description
• Seller
• Post Date
• Expiry Date
• Price
• Location
• Category
• Member Since
• Num Views
• Post ID
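To make the field list above concrete, here is a minimal sketch of pulling such fields out of a templated page with the standard-library `html.parser`. It is not the automated approach cited on the slide: the mapping from CSS class to field is written by hand here, and the class names and sample page are hypothetical.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the text of elements whose class maps to a known field."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field
        self.current = None   # field being filled, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.current = self.class_to_field.get(css_class)

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            previous = self.fields.get(self.current, "")
            self.fields[self.current] = previous + data.strip()


# Hypothetical template: class names are assumptions, not from the slides.
CLASS_TO_FIELD = {"ad-title": "Title", "ad-price": "Price", "ad-loc": "Location"}
page = (
    '<div class="ad"><h1 class="ad-title">swap stocks</h1>'
    '<span class="ad-price">$700</span>'
    '<span class="ad-loc">San Diego</span></div>'
)

extractor = FieldExtractor(CLASS_TO_FIELD)
extractor.feed(page)
fields = extractor.fields
```

An automated system would have to discover the equivalent of `CLASS_TO_FIELD` from the pages themselves; this sketch only shows what the output of that step looks like.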
20. Pretty Good Extractions
Case | Want | Extracted
Extra | Jan. 23, 2015 | Jan. 23, 2015 expires Feb
Partial | Jan. 23, 2015 | Jan. 23
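The two cases above can be told apart mechanically. The sketch below is one possible reading, assuming that "extra" means the wanted value is contained in the extraction and "partial" means the extraction is contained in the wanted value. The slides do not spell out the exact criteria, so the substring rule and the function name `classify_extraction` are assumptions.

```python
def classify_extraction(want, extracted):
    """Label an extraction as perfect, extra, partial, or wrong."""
    want, extracted = want.strip(), extracted.strip()
    if extracted == want:
        return "perfect"
    if want and want in extracted:
        return "extra"      # wanted value plus surrounding text
    if extracted and extracted in want:
        return "partial"    # only a piece of the wanted value
    return "wrong"


extra_case = classify_extraction("Jan. 23, 2015", "Jan. 23, 2015 expires Feb")
partial_case = classify_extraction("Jan. 23, 2015", "Jan. 23")
```

Under this reading, a "pretty good" score would count perfect, extra, and partial extractions together.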
21. Extraction Evaluation
10 websites, 5 pages each; fraction of correct extractions per field (correct/total)
Field | Perfect | Pretty Good
Title | 1.0 (50/50) | 1.0 (50/50)
Desc | .76 (37/49) | .98 (48/49)
Seller | .95 (40/42) | .95 (40/42)
Date | .83 (40/48) | .83 (40/48)
Price | .87 (39/45) | .98 (44/45)
Loc | .51 (23/45) | .84 (38/45)
Cat | .68 (34/50) | .88 (44/50)
Member Since | 1.0 (35/35) | 1.0 (35/35)
Expires | .52 (15/29) | .55 (16/29)
Views | .76 (19/25) | 1.0 (25/25)
ID | .97 (35/36) | 1.0 (36/36)
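Each score in the evaluation is the ratio of correct extractions to the per-field total, rounded to two decimals. The short check below recomputes three of the fields from the counts on the slide; the dictionary layout and the function name `ratio` are illustrative.

```python
# (correct, total) pairs copied from the evaluation slide.
COUNTS = {
    "Desc": {"perfect": (37, 49), "pretty_good": (48, 49)},
    "Loc": {"perfect": (23, 45), "pretty_good": (38, 45)},
    "Expires": {"perfect": (15, 29), "pretty_good": (16, 29)},
}


def ratio(correct, total):
    """Fraction of correct extractions, rounded as on the slide."""
    return round(correct / total, 2)


scores = {
    field: {kind: ratio(*pair) for kind, pair in row.items()}
    for field, row in COUNTS.items()
}
```

The gap between the two rows is largest for Loc, where accepting extra or partial values moves the score from .51 to .84.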
22. Steps To Build a KG
Pipeline: Data Acquisition -> Feature Extraction -> Feature Alignment -> Entity Resolution -> Graph Construction -> User Interface
23. Feature Alignment
from multiple schemas to a common domain schema
- CSV, Excel
- Database tables
- Web services
- Extractors
- Nomenclature
- Spelling
Multiple Schemas
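Once each source column has been matched to a property in the common schema, alignment itself reduces to renaming. The sketch below assumes two hypothetical sources whose column names differ in nomenclature; the source names, column names, and the function `align` are illustrative, while the target property names follow schema.org.

```python
# Per-source mapping from local column names to common schema properties.
# Source names and column names are hypothetical.
SOURCE_MAPPINGS = {
    "site_a": {"Title": "name", "Cost": "price", "Posted": "datePosted"},
    "site_b": {"headline": "name", "asking_price": "price", "post_date": "datePosted"},
}


def align(record, source):
    """Rename a record's fields to the common schema, dropping unmapped ones."""
    mapping = SOURCE_MAPPINGS[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


a = align({"Title": "swap stocks", "Cost": "0", "Posted": "September 22, 2015"}, "site_a")
b = align({"headline": "swap stocks", "asking_price": "0", "internal_id": "x9"}, "site_b")
```

After alignment, records from both sources share the same keys, so they can be compared and merged in later steps. Building the mapping is the hard part, which the semantic labeling slides address.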
24. Karma: Mapping Data to Ontologies
Inputs: services, relational sources, hierarchical sources
Karma maps the inputs to the Schema.org ontology
Output: JSON-LD
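For orientation, this is roughly what a JSON-LD document for one aligned record looks like. It is a hand-built sketch with the standard `json` module, not output produced by Karma; the `@id` URI and the function name `to_jsonld` are hypothetical, and the field values are taken from an example row elsewhere in the deck.

```python
import json


def to_jsonld(record, schema_type, identifier):
    """Wrap an aligned record as a JSON-LD node typed with a schema.org class."""
    doc = {"@context": "https://schema.org", "@type": schema_type, "@id": identifier}
    doc.update(record)
    return doc


offer = to_jsonld(
    {
        "name": "Cabelas Millenium Revolver in .45 colt",
        "price": "700",
        "identifier": "12155a1a2938bc1e",
    },
    "Offer",
    "http://example.org/offer/12155a1a2938bc1e",  # hypothetical URI
)
serialized = json.dumps(offer, indent=2)
```

Because JSON-LD is ordinary JSON, the same document can be indexed directly by a document store.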
25. Semantic Labeling
[Pham et al., ISWC’16]
Offer Place Person
name price idname
Offer
Column-1 | Column-2 | Column-3 | Column-4
British Lee-Enfield No 4 MK 2 still … | 1,000 | | 68155c13de2f2532
Cabelas Millenium Revolver in .45 colt | 700 | 1711 Anderson Rd | 12155a1a2938bc1e
26. Learning Semantic Types
Requirements:
• Learn from a small number of examples
• Distinguish both string and numeric values
• Can be learned quickly and is highly scalable to large numbers of semantic types
Domain Ontology classes: Person, Organization, City, State; properties: name, birthdate, name, name, name
Source table:
# | name | date | city | state | workplace
1 | Fred Collins | Oct 1959 | Seattle | WA | Microsoft
2 | Tina Peterson | May 1980 | New York | NY | Google
29. Features for Semantic Labeling
• Features
– KS = Kolmogorov-Smirnov
– MW = Mann-Whitney
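The Kolmogorov-Smirnov statistic is the largest gap between the empirical distribution functions of two samples, which makes it a natural measure of how alike two numeric columns are. The sketch below computes it in pure Python. How the statistic is turned into a feature is not shown on the slide, so comparing an unlabeled column against already labeled columns is an assumption here, and the sample values are made up.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max distance between the empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    distance = 0.0
    for x in sorted(set(a_sorted) | set(b_sorted)):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


# Made-up values: an unlabeled column and two labeled columns.
unlabeled = [700, 1000, 450, 1200]
labeled_price = [650, 900, 1100, 500]
labeled_year = [2014, 2015, 2015, 2016]

to_price = ks_statistic(unlabeled, labeled_price)
to_year = ks_statistic(unlabeled, labeled_year)
```

A smaller statistic means the distributions are closer, so in this toy case the unlabeled column looks more like a price than a year.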
30. Combining the Features for Semantic Labeling
31. Automatically Assigned Semantic Labels
Assigned labels (class / property), one per column: Offer / name | CreativeWork / fragment | Offer / description | Offer / identifier | Offer / datePosted | CreativeWork / Fragment
35 Whelen Handi-Rifle | No Tags | 35 Whelen Handi-rifle. Black synthetic stock/forearm, blued barrel. Text 601-813-7280 …. | 245625390711756 | October 19, 2015 12:43 pm |
Cabelas Millenium Revolver in .45 colt | No Tags | This single action is built to shoot and is a great way for any level of shooter to get involved with a single action. … | 12155a1a2938bc1e | July 11, 2015 5:17 pm | 1711 Anderson Rd
swap stocks | No Tags | want to trade butler creek folding stock for black stock ruger mini stock folder by butler creek will swap even for full rifle stock …. | 5815600fd181fe3b | September 22, 2015 1:05 am | white
streetAddress does not appear in training data -> more similar to noisy data
33. Results on Gun Sites
Evaluation Dataset
Average number of attributes 18
Total number of attributes 176
Correct prediction (Accuracy) 56%
Correct label is in the top 4 predictions 89%
MRR 70%
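The three metrics above are all computed from the rank of the correct label in each column's ranked list of predictions. The sketch below shows the computation on a made-up set of four columns; these examples are not the evaluation dataset, and the function name `evaluate` is illustrative.

```python
def evaluate(examples, k=4):
    """Compute accuracy, top-k hit rate, and mean reciprocal rank (MRR).

    examples: list of (correct_label, ranked_predictions) pairs.
    """
    top1 = topk = reciprocal_sum = 0.0
    for correct, ranked in examples:
        if correct in ranked:
            rank = ranked.index(correct) + 1
            top1 += rank == 1
            topk += rank <= k
            reciprocal_sum += 1.0 / rank
        # A missing correct label contributes 0 to every metric.
    n = len(examples)
    return {"accuracy": top1 / n, "top_k": topk / n, "mrr": reciprocal_sum / n}


# Made-up predictions for four columns.
examples = [
    ("name", ["name", "description"]),                                        # rank 1
    ("price", ["identifier", "price"]),                                       # rank 2
    ("datePosted", ["name", "description", "identifier", "price", "datePosted"]),  # rank 5
    ("streetAddress", ["name", "fragment"]),                                  # not found
]
metrics = evaluate(examples, k=4)
```

MRR sits between accuracy and the top-k rate because it gives partial credit that shrinks with rank, which matches the ordering of the reported figures (56%, 70%, 89%).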
34. Steps To Build a KG
Pipeline: Data Acquisition -> Feature Extraction -> Feature Alignment -> Entity Resolution -> Graph Construction -> User Interface
35. Entity Resolution
merging records that refer to the same entity
techniques to address:
- missing data
- incorrect data
- scale (~100 million records)
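A very small sketch of the merging step: records are grouped by shared normalized keys, and groups that overlap are joined with union-find, so a record with a missing phone number can still be linked through its email address. This is a deliberately simple stand-in and not the resolution method used in the talk; the field names `phone` and `email`, the normalization rules, and the sample records are assumptions. It tolerates missing values but does nothing about incorrect ones.

```python
import re
from collections import defaultdict


def normalize_phone(raw):
    """Keep the last 10 digits of a phone number, or None if too short."""
    digits = re.sub(r"\D", "", raw or "")
    return digits[-10:] if len(digits) >= 10 else None


def resolve(records):
    """Cluster record indices that share a normalized phone or email."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        phone = normalize_phone(rec.get("phone"))
        email = (rec.get("email") or "").strip().lower()
        if phone:
            blocks[("phone", phone)].append(idx)
        if email:
            blocks[("email", email)].append(idx)

    # Only records inside the same block are compared, which keeps the
    # work far below all-pairs comparison on large inputs.
    for members in blocks.values():
        for other in members[1:]:
            parent[find(members[0])] = find(other)

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


records = [
    {"phone": "601-813-7280"},
    {"phone": "(601) 813 7280", "email": "a@example.com"},
    {"phone": None, "email": "A@example.com"},
    {"phone": "555-000-1111"},
]
clusters = resolve(records)
```

Records 0 and 2 share no field directly but end up in one cluster through record 1, which is the transitive behavior that makes linking useful for investigations.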
58. Steps To Build a KG
Pipeline: Data Acquisition -> Feature Extraction -> Feature Alignment -> Entity Resolution -> Graph Construction -> User Interface
59. Graph Construction
assembling the data for efficient query & analysis
- ElasticSearch: scalable, efficient query
- graph databases: network analytics
- NoSQL: scalable analytics
- bulk loading: massive data imports
- real-time updates: live, changing data
60. elasticsearch
• Cloud-based search engine
• Based on Apache Lucene
• Horizontal scaling, replication, load balancing
• Blazingly fast!
• Everything is a document
– Documents are JSON objects
– Index what you want to find
– Fields can contain strings, numbers, booleans, etc.
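Since every document is a JSON object, bulk loading amounts to sending many of them in one request. The sketch below builds a request body in the newline-delimited format expected by the Elasticsearch `_bulk` endpoint, where each document is preceded by an action line. It only constructs the body and makes no network call; the index name `offers` and the sample documents are hypothetical.

```python
import json


def bulk_body(index_name, docs):
    """Build an NDJSON body for the _bulk endpoint from (id, document) pairs."""
    lines = []
    for doc_id, doc in docs:
        # Action line: tells the server which index and id to write.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the body must end with a newline


docs = [
    ("12155a1a2938bc1e", {"name": "Cabelas Millenium Revolver in .45 colt", "price": 700}),
    ("5815600fd181fe3b", {"name": "swap stocks", "for_trade": True}),
]
body = bulk_body("offers", docs)
```

The body would be sent with an NDJSON content type. Mixed field types such as the number and boolean above are stored as given.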
65. Indexing for High Performance Knowledge Graph Queries
Chart: average query times in milliseconds under a single-user query load on 1.2 billion triples, comparing a state-of-the-art graph database (RDF) with DIG indexing deployed in ElasticSearch
66. Steps To Build a KG
Pipeline: Data Acquisition -> Feature Extraction -> Feature Alignment -> Entity Resolution -> Graph Construction -> User Interface
68. DIG Deployment for Human Trafficking
- 100 million Web pages
- Live updates (~5,000 pages/hour)
- ElasticSearch database (7 nodes)
- Hadoop workflows (20 nodes)
- District Attorney
- Law Enforcement
- NGOs
69. DIG Applications
- Human Trafficking: large, real users
- Material Science Research: 70,000 paper abstracts (built in 1 week)
- Arms Trafficking: identify illegal sales
- Patent Trolls: identifies patent trolls
- Predicting Cyber Attacks: combines diverse sources about vulnerabilities, exploits, etc.
70. Conclusions
• Presented the end-to-end tool-chain to build domain-specific knowledge graphs
• Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc.
• Approach scales to millions of pages and billions of facts
• Has been used to build real-world deployed applications
Editor's Notes
Karma offers suggestions on how to do the mapping
Tokenize the values in a given labeled column into purely alphabetic, numeric, and symbol tokens
Extract features from the tokens and the column name and associate them with the column’s semantic type
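The tokenization described in the two notes above can be sketched with a single regular expression. The exact token classes and features used by the authors are not given, so the pattern, the ASCII-only letter class, and the simple type-fraction feature below are assumptions for illustration.

```python
import re

# One token is a run of letters, a run of digits, or a single symbol.
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")


def tokenize(value):
    """Split a cell value into alphabetic, numeric, and symbol tokens."""
    return TOKEN.findall(value)


def token_type_fractions(values):
    """Fraction of alphabetic, numeric, and symbol tokens in a column."""
    counts = {"alpha": 0, "numeric": 0, "symbol": 0}
    for value in values:
        for token in tokenize(value):
            if token.isalpha():
                counts["alpha"] += 1
            elif token.isdigit():
                counts["numeric"] += 1
            else:
                counts["symbol"] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty columns
    return {kind: count / total for kind, count in counts.items()}


address_tokens = tokenize("1711 Anderson Rd")
title_tokens = tokenize("Revolver in .45 colt")
price_profile = token_type_fractions(["1,000", "700"])
```

A column of prices is dominated by numeric tokens while a column of titles is mostly alphabetic, which is the kind of signal a semantic type learner can pick up from few examples.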
Why is linking significant in this domain? Slide shows why.