Extracting, Aligning, and Linking Data to Build Knowledge Graphs

Craig Knoblock
University of Southern California
Thanks to my collaborators: Pedro Szekely, Linhong Zhu, Majid
Ghasemi-Gol, Mohsen Taheriyan, Minh Pham, and Steve Minton

Goal
USC Information Sciences Institute, CC-By 2.0
raw, messy, disconnected: hard to query, analyze & visualize
clean, organized, linked: easy to query, analyze & visualize

Use Case: Human Trafficking
100 million pages
~ 100 Web sites
help victims
prosecute traffickers

Example: Investigating a Reported Victim
San Diego, where else?

DIG Interface: Find the locations where a
potential victim was advertised

Steps To Build a KG
- Data Acquisition (crawling)
- Feature Extraction (extraction)
- Feature Alignment (mapping to ontology: schema.org, geonames)
- Entity Resolution (entity linking & similarity)
- Graph Construction (knowledge graph deployment: ElasticSearch, graph DB)
- User Interface (query & visualization)

Data Acquisition
downloading relevant data
- batch, real-time
- Web pages, Web services, databases
- CSV, Excel, XML, JSON

Traditional Web Crawler
(e.g., Nutch, Scrapy)

Web Crawling
- 24/7
- 5,000 pages/hour
- ~100,000,000 pages total

Feature Extraction
from raw sources to structured data
• extraction from text
• extraction from structured Web pages
• extraction of image features

Extraction

Structured Extraction

Automated Extraction
[Minton et al., Inferlink]
• Title
• Description
• Seller
• Post Date
• Expiry Date
• Price
• Location
• Category
• Member Since
• Num Views
• Post ID

Input: A Pile of Pages

- input: a pile of pages
- Classify by Templates: pages clustered by template
- Infer Extractor (once per template cluster): extractor
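One simple way to approximate the "classify by templates" step is to group pages by a structural signature. The sketch below is an assumption for illustration, not the Inferlink algorithm: it summarizes each page as the set of its tag paths and greedily clusters pages whose path sets have a high Jaccard overlap. The function names, the sample pages, and the 0.7 threshold are hypothetical.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in a page."""

    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0


def cluster_by_template(pages, threshold=0.7):
    """Greedy clustering: a page joins the first cluster whose
    representative signature is similar enough, else starts a new one."""
    clusters = []  # list of (signature, [page indices])
    for i, html in enumerate(pages):
        sig = tag_paths(html)
        for rep, members in clusters:
            if jaccard(sig, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((sig, [i]))
    return [members for _, members in clusters]


# Hypothetical pages: two share a layout, the third uses a table layout.
pages = [
    "<html><body><h1>Rifle</h1><div><span>$700</span></div></body></html>",
    "<html><body><h1>Revolver</h1><div><span>$1,000</span></div></body></html>",
    "<html><body><table><tr><td>Seller</td></tr></table></body></html>",
]
clusters = cluster_by_template(pages)
```

Each resulting cluster would then be handed to a separate extractor-inference step, one extractor per template.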

Unsupervised Extraction Tool

Pretty Good Extractions
         Want           Extracted
Extra    Jan. 23, 2015  Jan. 23, 2015 expires Feb
Partial  Jan. 23, 2015  Jan. 23
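The distinction between perfect, extra, and partial extractions can be expressed as a small comparison. This is a minimal sketch under the assumption that string containment defines the categories; the function names are hypothetical and the slides do not specify the exact criteria used in the evaluation.

```python
def grade_extraction(wanted, extracted):
    """Grade an extracted value against the wanted value.

    perfect: exact match
    extra:   the wanted value is present plus surrounding text
    partial: only part of the wanted value was extracted
    wrong:   anything else
    """
    wanted, extracted = wanted.strip(), extracted.strip()
    if extracted == wanted:
        return "perfect"
    if extracted and wanted in extracted:
        return "extra"
    if extracted and extracted in wanted:
        return "partial"
    return "wrong"


def is_pretty_good(wanted, extracted):
    # Assumption: "pretty good" accepts perfect, extra, and partial results.
    return grade_extraction(wanted, extracted) != "wrong"
```

Applied to the two rows above, the first pair grades as "extra" and the second as "partial", and both count as pretty good under this assumption.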

Extraction Evaluation
10 websites, 5 pages each
Field         Perfect      Pretty Good
Title         1.0 (50/50)  1.0 (50/50)
Desc          .76 (37/49)  .98 (48/49)
Seller        .95 (40/42)  .95 (40/42)
Date          .83 (40/48)  .83 (40/48)
Price         .87 (39/45)  .98 (44/45)
Loc           .51 (23/45)  .84 (38/45)
Cat           .68 (34/50)  .88 (44/50)
Member Since  1.0 (35/35)  1.0 (35/35)
Expires       .52 (15/29)  .55 (16/29)
Views         .76 (19/25)  1.0 (25/25)
ID            .97 (35/36)  1.0 (36/36)

Feature Alignment
from multiple schemas to a common domain schema
- CSV, Excel
- Database tables
- Web services
- Extractors
- Nomenclature
- Spelling
Multiple Schemas

Karma: Mapping Data to Ontologies
- Inputs: services, relational sources, hierarchical sources
- Ontology: Schema.org
- Output: { JSON-LD }
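To make the mapping output concrete, the sketch below turns one source row into a JSON-LD object that uses Schema.org terms. It is only an illustration and not Karma's API: the column names, the mapping table, and the `row_to_jsonld` helper are hypothetical, and the sample values are taken from the gun-listing example used elsewhere in this deck.

```python
import json

# Hypothetical column-to-property mapping, of the kind a user would
# specify when modeling a source against schema.org.
COLUMN_TO_PROPERTY = {
    "title": "name",
    "cost": "price",
    "post_id": "identifier",
}


def row_to_jsonld(row):
    """Turn one source row into a JSON-LD object typed as schema.org Offer."""
    obj = {"@context": "http://schema.org", "@type": "Offer"}
    for column, prop in COLUMN_TO_PROPERTY.items():
        # Skip missing or empty cells rather than emitting empty properties.
        if row.get(column) not in (None, ""):
            obj[prop] = row[column]
    return obj


row = {
    "title": "Cabelas Millenium Revolver in .45 colt",
    "cost": "700",
    "post_id": "12155a1a2938bc1e",
}
print(json.dumps(row_to_jsonld(row), indent=2))
```

Because every source is mapped to the same ontology terms, records from different sites end up with the same property names regardless of how each site labeled its fields.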

Semantic Labeling
[Pham et al., ISWC’16]
Ontology classes: Offer, Place, Person
Properties: name, price, id, name
Column-1                                Column-2  Column-3          Column-4
British Lee-Enfield No 4 MK 2 still …   1,000                       68155c13de2f2532
Cabelas Millenium Revolver in .45 colt  700       1711 Anderson Rd  12155a1a2938bc1e

Learning Semantic Types
Requirements:
- Learn from a small number of examples
- Distinguish both string and numeric values
- Can be learned quickly and is highly scalable to large numbers of semantic types
Domain Ontology: Person (name, birthdate), Organization (name), City (name), State (name)
Source:
   name           date      city      state  workplace
1  Fred Collins   Oct 1959  Seattle   WA     Microsoft
2  Tina Peterson  May 1980  New York  NY     Google

Textual Data
- Treat each column of data as a document
- Apply TF-IDF cosine similarity
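The column-as-document idea can be sketched with a few lines of standard-library Python. This is a minimal illustration, not the implementation from Pham et al.: the smoothed IDF formula, the whitespace tokenizer, the function names, and the toy training columns (partly drawn from the example table in this deck, partly made up) are all assumptions.

```python
import math
from collections import Counter


def tokenize(values):
    """Flatten a column's cell values into lowercase tokens."""
    return [tok for v in values for tok in str(v).lower().split()]


def tfidf_vectors(columns):
    """columns: dict mapping a semantic label to its list of cell values.
    Each labeled column is treated as one document."""
    docs = {label: Counter(tokenize(vals)) for label, vals in columns.items()}
    n = len(docs)
    df = Counter(tok for counts in docs.values() for tok in counts)
    # Smoothed IDF so a token seen in every column still gets a small weight.
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    vectors = {
        label: {tok: tf * idf[tok] for tok, tf in counts.items()}
        for label, counts in docs.items()
    }
    return vectors, idf


def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def label_column(values, vectors, idf):
    """Return known labels ranked by similarity to the new column."""
    counts = Counter(tokenize(values))
    query = {tok: tf * idf.get(tok, 1.0) for tok, tf in counts.items()}
    scores = {label: cosine(query, vec) for label, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy training data (hypothetical).
training = {
    "city": ["Seattle", "New York", "Los Angeles"],
    "workplace": ["Microsoft", "Google", "Microsoft Research"],
}
vectors, idf = tfidf_vectors(training)
ranking = label_column(["New York", "Seattle", "Boston"], vectors, idf)
```

The ranked list, rather than a single answer, is what makes top-k metrics such as MRR applicable to the labeling results.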

Numeric Data
- Apply statistical hypothesis testing to determine which distribution fits best
- Apply the Kolmogorov-Smirnov test
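For numeric columns, the comparison is between distributions of values rather than between tokens. The sketch below computes the two-sample Kolmogorov-Smirnov statistic by hand and picks the labeled column with the smallest statistic. It is a simplification under stated assumptions: it uses only the statistic (not a p-value), and the function names and sample values are hypothetical. In practice a library routine such as `scipy.stats.ks_2samp` could be used instead.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def closest_numeric_type(values, labeled_columns):
    """Pick the semantic type whose training values are distributed most
    like the new column (smallest KS statistic)."""
    return min(
        labeled_columns,
        key=lambda label: ks_statistic(values, labeled_columns[label]),
    )


# Hypothetical training values for two numeric semantic types.
labeled = {
    "price": [700, 1000, 450, 1200, 299],
    "birth_year": [1959, 1980, 1975, 1990, 1966],
}
guess = closest_numeric_type([650, 800, 350], labeled)
```

A statistic of 0 means the two empirical distributions coincide, while 1 means the samples do not overlap at all.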

Features for Semantic Labeling
• Features
– KS = Kolmogorov-Smirnov
– MW = Mann-Whitney

Combining the Features for Semantic Labeling

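Automatically Assigned Semantic Labels
Labels assigned to the six columns: Offer name; CreativeWork fragment; Offer description; Offer identifier; Offer datePosted; CreativeWork fragment
Row 1:
- Offer name: 35 Whelen Handi-Rifle
- CreativeWork fragment: No Tags
- Offer description: 35 Whelen Handi-rifle. Black synthetic stock/forearm, blued barrel. Text 601-813-7280 ….
- Offer identifier: 245625390711756
- Offer datePosted: October 19, 2015 12:43 pm
Row 2:
- Offer name: Cabelas Millenium Revolver in .45 colt
- CreativeWork fragment: No Tags
- Offer description: This single action is built to shoot and is a great way for any level of shooter to get involved with a single action. …
- Offer identifier: 12155a1a2938bc1e
- Offer datePosted: July 11, 2015 5:17 pm
- CreativeWork fragment (last column): 1711 Anderson Rd
Row 3:
- Offer name: swap stocks
- CreativeWork fragment: No Tags
- Offer description: want to trade butler creek folding stock for black stock ruger mini stock folder by butler creek will swap even for full rifle stock ….
- Offer identifier: 5815600fd181fe3b
- Offer datePosted: September 22, 2015 1:05 am
- CreativeWork fragment (last column): white
streetAddress does not appear in training data -> more similar to noisy data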

Results on www.msguntrader.com
- Number of attributes: 19
- Correct predictions: 16
- Correct label is in the top 4 predictions: 18
- Accuracy: 84%
- MRR: 89%
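Accuracy here is the fraction of attributes whose top-ranked label is correct (16/19 ≈ 84%), while MRR (mean reciprocal rank) also gives partial credit when the correct label appears lower in the ranked list. The sketch below shows both computations on made-up rankings; the function name and the sample predictions are hypothetical and are not the evaluation data from the slides.

```python
def accuracy_and_mrr(ranked_predictions, gold_labels):
    """ranked_predictions: one ranked list of candidate labels per attribute.
    gold_labels: the correct label for each attribute."""
    hits = 0
    reciprocal_ranks = []
    for ranking, gold in zip(ranked_predictions, gold_labels):
        if ranking and ranking[0] == gold:
            hits += 1
        # Reciprocal rank is 0 when the correct label is not in the ranking.
        rr = 1.0 / (ranking.index(gold) + 1) if gold in ranking else 0.0
        reciprocal_ranks.append(rr)
    n = len(gold_labels)
    return hits / n, sum(reciprocal_ranks) / n


# Hypothetical rankings for three attributes.
predictions = [
    ["name", "description"],
    ["description", "price", "identifier"],
    ["datePosted", "identifier"],
]
gold = ["name", "price", "identifier"]
acc, mrr = accuracy_and_mrr(predictions, gold)
```

In this toy case only one of three top predictions is correct, but the other two correct labels sit at rank 2, so MRR is noticeably higher than accuracy, the same pattern as in the reported results.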

Results on Gun Sites
Evaluation Dataset
- Average number of attributes: 18
- Total number of attributes: 176
- Correct prediction (Accuracy): 56%
- Correct label is in the top 4 predictions: 89%
- MRR: 70%

Entity Resolution
merging records that refer to the same entity
techniques to address:
- missing data
- incorrect data
- scale (~100 million records)

Unsupervised Collective Entity Resolution
- same victim
- same trafficker

Collective Entity Resolution
[Zhu et al., ISWC’16]
Identifying and linking instances of the same real world entity
[Figure: Multi-Type Graph with node types product, manufacturer, price, and description. Product nodes: Product 1 to Product 5. Manufacturer nodes: Bose Electronic, Bose, Bosch, Sony. Price nodes: 229, 292, 299, 599. Description nodes: Quiet Comfort 25 Noise Cancelling Headphone; Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Bose Noise Cancelling Headphones.]

Common Approach: Pairwise Comparisons
Product    Manufacturer     Title                                        Price
Product 1  Sony             Noise Cancelling Headphones
Product 2  Sony             Premium Noise Cancelling Headphones          292
Product 3  Bosch            Dish Washer                                  599
Product 4  Bose             Bose Noise Cancelling Headphones             299, 229
Product 5  Bose Electronic  Quiet Comfort 25 Noise Cancelling Headphone  299
Jaro: 0.5, distance: 0.2, Jaccard: 0.3
Acceptance Threshold: 0.8
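A pairwise matcher of this kind can be sketched as a weighted sum of per-field similarities compared against the acceptance threshold. Several things here are assumptions rather than facts from the slide: that 0.5, 0.2, and 0.3 are weights, that they apply to manufacturer, price, and title respectively, and that each record has a single price. `difflib.SequenceMatcher` is used as a stand-in for Jaro similarity, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def string_sim(a, b):
    # Stand-in for Jaro similarity (difflib has no Jaro implementation).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def price_sim(a, b):
    # Relative distance turned into a similarity in [0, 1].
    return 1.0 - abs(a - b) / max(a, b) if max(a, b) > 0 else 1.0


WEIGHTS = {"manufacturer": 0.5, "price": 0.2, "title": 0.3}  # assumed mapping
THRESHOLD = 0.8


def match_score(r1, r2):
    return (
        WEIGHTS["manufacturer"] * string_sim(r1["manufacturer"], r2["manufacturer"])
        + WEIGHTS["price"] * price_sim(r1["price"], r2["price"])
        + WEIGHTS["title"] * jaccard(r1["title"], r2["title"])
    )


def is_match(r1, r2):
    return match_score(r1, r2) >= THRESHOLD


# Records from the table, simplified to one price each.
p3 = {"manufacturer": "Bosch", "title": "Dish Washer", "price": 599}
p4 = {"manufacturer": "Bose", "title": "Bose Noise Cancelling Headphones",
      "price": 299}
p5 = {"manufacturer": "Bose Electronic",
      "title": "Quiet Comfort 25 Noise Cancelling Headphone", "price": 299}
```

Under these assumed weights, Product 4 and Product 5 score below the threshold despite the shared price and overlapping title, which hints at why fixed pairwise rules struggle with noisy, incomplete records.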

Issues with pairwise comparisons (one slide each, repeating the Product 5 example with Jaro 0.5, distance 0.2, Jaccard 0.3):
- Missing Values
- Multiple Values
- Weights
- Unidirectional

Graph Summarization: Original Graph
[Figure: the multi-type graph of product, manufacturer, price, and description nodes for Product 1 to Product 5]

Similar Nodes sim_t(x, y)
[Figure: the same multi-type graph, with similar nodes of the same type marked]

Graph Summarization: Super-Nodes
[Figure: the same multi-type graph, with nodes grouped into super-nodes]

Super-nodes C_t(x)
probability that a node x belongs to each super-node
one matrix C_t for each type
Description nodes: Quiet Comfort 25 Noise Cancelling Headphone; Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Bose Noise Cancelling Headphones
C_t (rows: nodes, columns: super-nodes):
0.7  0.2  0.1
0.7  0.2  0.1
0.2  0.7  0.1
0.2  0.7  0.1
0.1  0.1  0.8
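The membership matrix can be read as a soft clustering: each row is a probability distribution over super-nodes. The sketch below checks that property and derives a hard assignment by taking the most probable super-node per row. The matrix values are copied from the slide, but the row labels (`node_1` to `node_5`) and function names are hypothetical, since the row-to-node correspondence is not stated in the text.

```python
# Membership matrix C_t for one node type: rows are nodes, columns are
# super-nodes. Row labels are placeholders, not the slide's node names.
C_t = {
    "node_1": [0.7, 0.2, 0.1],
    "node_2": [0.7, 0.2, 0.1],
    "node_3": [0.2, 0.7, 0.1],
    "node_4": [0.2, 0.7, 0.1],
    "node_5": [0.1, 0.1, 0.8],
}


def rows_are_distributions(matrix, tol=1e-9):
    """Each row should sum to 1, since it is a probability distribution."""
    return all(abs(sum(row) - 1.0) <= tol for row in matrix.values())


def hard_assignment(matrix):
    """Group nodes by their most probable super-node."""
    groups = {}
    for node, row in matrix.items():
        best = max(range(len(row)), key=row.__getitem__)
        groups.setdefault(best, []).append(node)
    return groups


groups = hard_assignment(C_t)
```

With these values, the five nodes collapse into three super-nodes of sizes two, two, and one.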

Similar Nodes Should Be In The Same Super-Node
[Figure: description nodes Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Quiet Comfort 25 Noise Cancelling Headphone; Bose Noise Cancelling Headphones]

Super-Links
[Figure: the multi-type graph with links between super-nodes]

Predict Links In Original Graph
[Figure: manufacturer nodes Bose Electronic, Bosch, and Bose with product nodes Product 3, Product 4, and Product 5]

Re-Clustering Improves Reconstruction Quality
[Figure: manufacturer nodes Bose Electronic, Bosch, and Bose with product nodes Product 3, Product 4, and Product 5, before and after re-clustering]

Comparable Approaches
Pairwise Clustering Unsupervised Supervised
Limes, Ngomo’11 ✔ ✔
SILK, Isele’10 ✔ ✔ ✔
Serf, Benjelloun’10 ✔ ✔
*Commercial, Köpcke’10 ✔ ✔
GraphSum, Riondato’14 ✔ ✔
*AuthorLDA, Bhattacharya’07 ✔ ✔
CoSum (proposed) ✔ ✔

Quality Comparison
          Precision               Recall                  F-measure
Method    Author  Paper  Product  Author  Paper  Product  Author  Paper  Product
Limes-F   0.958   0.827  0.446    0.864   0.761  0.16     0.909   0.792  0.236
Silk-F    0.846   0.877  0.459    0.986   0.756  0.348    0.91    0.812  0.395
Gsum      0.727   0.668  0.01     0.569   0.624  0.587    0.638   0.645  0.02
CoSum-B   0.993   0.871  0.58     0.94    0.611  0.477    0.966   0.718  0.524
Limes-MO  0.912   0.827  0.446    0.944   0.761  0.16     0.928   0.792  0.236
Silk-MO   0.932   0.877  0.459    0.958   0.756  0.348    0.945   0.812  0.395
Serf      0.985   0.837  0.436    0.687   0.808  0.186    0.809   0.822  0.261
CoSum-P   0.999   0.771  0.639    0.997   0.997  0.695    0.998   0.87   0.666
Commercial 0.615 0.63 0.622
AuthorLDA 0.995

Graph Construction
assembling the data for efficient query & analysis
- ElasticSearch: scalable, efficient query
- graph databases: network analytics
- NoSQL: scalable analytics
- bulk loading: massive data imports
- real-time updates: live, changing data

elasticsearch
• Cloud-based search engine
• Based on Apache Lucene
• Horizontal scaling, replication, load balancing
• Blazingly fast!
• Everything is a document
– Documents are JSON objects
– Index what you want to find
– Fields can contain strings, numbers, booleans,
etc.

ElasticSearch Data Model
Efficient indexing and query
Document types: AdultService, Offer, Person, Phone, WebPage

Products (AdultService) As Roots
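A document rooted at an AdultService might look like the sketch below, with the related objects nested inside the root so that one document lookup returns the connected data. This is an illustration only: the property names, nesting, and placeholder values are hypothetical and do not reflect the actual DIG index mapping.

```python
import json

# Hypothetical document: one AdultService root with its related Offer,
# Person, Phone, and WebPage objects nested inside it.
doc = {
    "@type": "AdultService",
    "offers": {
        "@type": "Offer",
        "seller": {
            "@type": "Person",
            "telephone": {"@type": "Phone", "name": "placeholder-number"},
        },
        "mainEntityOfPage": {
            "@type": "WebPage",
            "url": "http://example.com/ad/1",
        },
    },
}

# Elasticsearch stores JSON documents, so the body to index is simply
# the serialized dictionary.
body = json.dumps(doc)


def find_types(obj):
    """Collect every @type reachable from the root document."""
    found = []
    if isinstance(obj, dict):
        if "@type" in obj:
            found.append(obj["@type"])
        for value in obj.values():
            found.extend(find_types(value))
    return found
```

Nesting related objects under the root trades storage for speed: data is duplicated across root documents, but a query about a root needs no joins.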

Indexing for High Performance Knowledge Graph Queries
[Chart: average query times in milliseconds under a single-user query load on 1.2 billion triples, comparing a state-of-the-art graph database (RDF) with DIG indexing deployed in ElasticSearch]

DIG Deployment for Human Trafficking
- 100 million Web pages
- Live updates (~5,000 pages/hour)
- ElasticSearch database (7 nodes)
- Hadoop workflows (20 nodes)
- District Attorney
- Law Enforcement
- NGOs

DIG Applications
- Human Trafficking: large, real users
- Material Science Research: 70,000 paper abstracts (built in 1 week)
- Arms Trafficking: identify illegal sales
- Patent Trolls: identifies patent trolls
- Predicting Cyber Attacks: combines diverse sources about vulnerabilities, exploits, etc.

Conclusions
• Presented the end-to-end tool-chain to build domain-specific knowledge graphs
• Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc.
• Approach scales to millions of pages and billions of facts
• Has been used to build real-world deployed applications
