Alignment and Analytics of Large Scale, Disparate Data from IARPA's Knowledge Discovery and Dissemination (KDD) Program

Advantage Through Technology

Actionable Intelligence Retrieval System (AIRS) Overview
27 November 2012

CUBRC KDD AIRS System 1

Alignment of Data Models
- Single representation for all data sources
- Easily plug in new data sources
- Remove Perspective
- Data Model Perspective Confounds Data Integration

[Figure: three source documents aligned to one Event]
Report: Observer ID: 556AS4; Date: 26 Apr 08; Event: Crop Failure Extent Detection
Transcript: Date: 15 May 08; Event: Situation in certain areas dire as lack of rain …
Newsletter: Date: 10 Apr 08; Description: Crop outlook for early summer …
Aligned Event: Type: Crop Failure; OccursOn: Apr – Jul 2008; OccursAt: Western Afghanistan; RecordedBy: Report, Transcript, Newsletter; Observer

Advanced Analytics Algorithms

Quantitative
“Easy” Analyst Questions
- Identify All Event Information

Timeline

“Harder” Analyst Questions
- Identify Similar Events

“Hardest” Analyst Question
- Identify Predictor Events

Qualitative

Probe Tasks

• Fully automated tasks
• Test system plumbing
• Ex: Find all associates of Jim Johnson and list the person’s
affiliation to Jim. Use only data sets A, E, M.
• 20 questions like these

Analyst Tasks

• Manual tasks executed by actual analysts
• Test usability and applicability of developed algorithms
to realistic tasks
• Ex: Find all information that may have predicted an
attack was imminent in Khost, Afghanistan on 3 June,
2008.
• 10 questions like these


Many Sources, Many Records, Many Types

[Figure: eight data sources (DS 1 to DS 8) plotted against record counts of 1K, 100K, and 1M]
Data types shown: Reports, Articles, Blogs, Transcripts, Structured, DOMEX, Semi-Structured, Social Media

Three Essential Components

• Architecture
• Research Tasks
• Integrated Prototype

9 High Level Research Areas

ALIGNMENT
1. Ontology Development
2. Structured Data Alignment
3. Unstructured Data Alignment
4. Alignment Reasoner
5. Alignment Optimization

ADVANCED ANALYTICS
6. Workflow Optimization
7. Application of Analyst Context
8. Data Association for Entity Resolution
9. Distributed Graph Matching

30 Research Tasks in Phase 2

• Task 1.1.3 (CUBRC): April PreProto
• Task 1.1.4 (CUBRC): Aug Lab
• Task 1.1.5 (CUBRC): Aug Lab
• Task 1.2.2 (CUBRC): April Lab
• Task 2.1.2 (ISS): April Lab
• Task 2.1.3.a (ISS): April Lab
• Task 2.1.3.c (ISS): Aug Lab
• Task 3.1.2.a (GDIT): April Lab | Aug PreProto
• Task 3.1.3 (GDIT): April PreProto
• Task 3.1.4 (GDIT): April Lab | Aug PreProto
• Task 3.2.1.a (GDIT): April Lab | Aug PreProto
• Task 3.2.1.b (GDIT): April Lab | Aug PreProto
• Task 3.2.1.c (GDIT): April Lab | Aug PreProto
• Task 3.2.1.d (GDIT): April Lab | Aug PreProto
• Task 3.2.3 (GDIT): April PreProto
• Task 4.2.1 (Securboration): Aug Lab
• Task 5.1.1 (CUBRC): Aug Lab
• Task 6.1.2 (CUBRC/UB): April Lab | Aug PreProto
• Task 6.1.4 (CUBRC): April Lab | Aug PreProto
• Task 6.1.5 (CUBRC)
• Task 6.3.1 (CUBRC): April Lab
• Task 7.3.1 (Securboration): April Lab
• Task 7.4.1 (UB): Aug PreProto
• Task 8.1.1 (UB): Aug PreProto
• Task 8.3.1 (UB): April PreProto
• Task 8.3.2 (UB): April Lab | Aug PreProto
• Task 9.1.1 (UB): Aug PreProto
• Task 9.1.2 (CUBRC): Aug Lab
• Task 9.2.1 (UB): Aug Lab
• Task 9.3.1 (UB): Aug Lab


Analytics Data Flow
[Figure: layered data flow, top to bottom]
Visualization: Answers
Analytics (single threaded): Query: Query Expansion; Invoke Algorithm; Invoked Algorithm; Query: Evaluate; Query Execution; Analyze Results; Ranked Queries; Requery: Abduction; Use Aligned Models (KDD RDF)
Data Services: Search; Graph Creation: Structured (parallelized); Graph Creation: Unstructured (parallelized); Association: Entities & Events (parallelized); Query: Sparql
Storage labels: Read & Write; Write; Raw Data; Data Sources; Global Model

Backbone of Project
[Figure: ontology hierarchy]
Upper level: Basic Formal Ontology – Relation Ontology
AIRS Mid-Level Ontology: Extended Relation Ontology, Agent Ontology, Event Ontology, Geospatial Ontology, Quality Ontology, Information Technology Ontology, Artifact Ontology, Time Ontology
Defines input and output format of most processes
Domain level: Counterterrorism Ontology


Information Entity Ontology
• 76 local classes
• 21 equivalence class axioms
• 1 superclass axiom
• 28 local object properties
• 7 datatype properties
[Figure: Sample Document annotated with #Note, #Paragraph, #SectionOfText, #Person, #Place]

Agent Ontology
• 787 local classes
• 231 equivalence class axioms (mostly persons with roles, e.g. Physician, Lawyer)
• 70 local object properties (mostly familial relationships)
• SPARQL Inferencing Notation (SPIN) rules that infer familial relationships from the primitive relationships child_of and parent_of and the qualities of male and female gender
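The SPIN rules themselves are not shown on the slide. As an illustration only, the following Python sketch derives a few familial relationships from parent_of facts and gender. The rule set, the function name `infer_family`, and the sample people are all hypothetical.

```python
from itertools import permutations

def infer_family(parent_of, gender):
    """Derive familial relations from primitive parent_of pairs and gender.

    parent_of: set of (parent, child) pairs.
    gender: dict mapping a name to "male" or "female".
    Returns a set of (subject, relation, object) triples.
    """
    inferred = set()
    children = {}
    for parent, child in parent_of:
        # child_of is the inverse of parent_of
        inferred.add((child, "child_of", parent))
        if gender.get(parent) == "female":
            inferred.add((parent, "mother_of", child))
        elif gender.get(parent) == "male":
            inferred.add((parent, "father_of", child))
        children.setdefault(parent, set()).add(child)
    # Two distinct people who share a parent are treated as siblings
    for kids in children.values():
        for a, b in permutations(sorted(kids), 2):
            relation = {"female": "sister_of", "male": "brother_of"}.get(
                gender.get(a), "sibling_of")
            inferred.add((a, relation, b))
    return inferred

# Hypothetical sample facts
facts = {("Ann", "Bob"), ("Ann", "Cat")}
genders = {"Ann": "female", "Bob": "male", "Cat": "female"}
triples = infer_family(facts, genders)
```

In an RDF setting each derived triple would be added back to the graph, which is what lets a small set of primitive properties support a much larger vocabulary of relationships.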


Analytics Query ‘Soup-to-Nuts’
“Documents where Smyth is a Person && has Associates && footnote contains ‘XY’ && from data set 4 or 5”
[Figure labels: Ontology; SPARQL Query; Graph; Raw; data sets 4 and 5]


Architecture Implementation
[Figure: a Column Alignment Request is handled by a set of learners whose results are combined into a Column Alignment Prediction. Labels recoverable from the figure: Data Value Characterization Learner; Context Learner; Column Categorical Based Alignment; Data Cube Learner; Lucene Base Alignment; Alignment Mega-Learner]
* Spring Framework
Data Value Characterization
• Used metadata, data values, regular expressions, and neural networks to classify columns
• Combined with a collection of heuristics:
– Date Time
– Person’s Name, Alias, and Birth Date
– Recognizing unstructured data within structured
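A minimal sketch of how metadata, data values, and regular expressions might be combined to label a column. It leaves out the neural network component, and the date formats, thresholds, and label names are assumptions made for illustration, not the AIRS implementation.

```python
import re
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d %b %y", "%m/%d/%Y")  # assumed formats
NAME_RE = re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+)+$")  # e.g. "Jim Johnson"

def looks_like_date(value):
    """True if the value parses under any of the assumed date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value.strip(), fmt)
            return True
        except ValueError:
            continue
    return False

def classify_column(header, values, threshold=0.8):
    """Label a column from its header (metadata) and a sample of its values."""
    values = [v for v in values if v and v.strip()]
    if not values:
        return "unknown"

    def share(predicate):
        return sum(1 for v in values if predicate(v)) / len(values)

    if share(looks_like_date) >= threshold:
        # Header metadata separates birth dates from other dates
        return "birth_date" if re.search(r"birth|dob", header, re.I) else "date_time"
    if share(lambda v: NAME_RE.match(v.strip()) is not None) >= threshold:
        return "alias" if re.search(r"alias|aka", header, re.I) else "person_name"
    # Long multi-word values suggest free text inside a structured column
    if share(lambda v: len(v.split()) >= 8) >= threshold:
        return "unstructured_text"
    return "unknown"
```

For example, `classify_column("Date", ["15 May 08", "10 Apr 08"])` would be labeled `date_time` under these assumptions, while the same values under a header such as "DOB" would be labeled `birth_date`.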

D2RQ Mapping File
• Enables dynamic RDF generation
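D2RQ mapping files are written in Turtle and none is reproduced on the slide. The sketch below only illustrates the general idea of generating RDF on demand from a declarative column-to-predicate mapping. It does not use the D2RQ API, and the mapping structure and the `ex:` names are hypothetical.

```python
def row_to_triples(row, mapping):
    """Generate (subject, predicate, object) triples for one relational row.

    mapping is a hypothetical stand-in for what a mapping file declares:
    {"subject": URI template, "type": class name,
     "columns": {column name: predicate}}
    """
    subject = mapping["subject"].format(**row)
    triples = [(subject, "rdf:type", mapping["type"])]
    for column, predicate in mapping["columns"].items():
        value = row.get(column)
        if value is not None:  # skip NULL columns rather than emit empty literals
            triples.append((subject, predicate, value))
    return triples

person_mapping = {
    "subject": "ex:person/{id}",
    "type": "agent:Person",
    "columns": {"name": "ex:has_name", "birth_date": "ex:has_birth_date"},
}
row = {"id": 7, "name": "Jim Johnson", "birth_date": None}
triples = row_to_triples(row, person_mapping)
```

Because triples are produced when a row is read, the relational source does not have to be exported to RDF ahead of time.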


Method
1. Document Type Identification:
• Determine document type with pattern-based configurations
2. Passage & Metadata Retrieval:
• Given the document type, identify and extract data using:
a. Template / Grammar Process
b. Generic Heuristic Process
3. Document Genre Association:
• Link associated document genres
[Figure: three-stage pipeline. Document Type Identification Process (input: Document, Identification Configuration; output: Document Type Annotations) → Passage & Metadata Retrieval via (a) Template / Grammar Process using Template Grammars, or (b) Generic Heuristic Process (output: Passages, Metadata) → Document Genre Association Process (output: Document Genre links)]
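The first two steps of this method can be sketched as follows. The patterns, templates, and field names are hypothetical examples of a "pattern-based configuration", not the project's actual configuration, and genre association is not covered.

```python
import re

# Hypothetical pattern-based configuration: document type -> identifying regex
TYPE_PATTERNS = {
    "report": re.compile(r"^Observer ID:", re.M),
    "transcript": re.compile(r"^Event:", re.M),
    "newsletter": re.compile(r"^Description:", re.M),
}

# Hypothetical templates: field name -> regex with one capture group
TEMPLATES = {
    "report": {"observer": r"^Observer ID:\s*(\S+)", "date": r"^Date:\s*(.+)$"},
    "transcript": {"date": r"^Date:\s*(.+)$", "event": r"^Event:\s*(.+)$"},
}

def identify_type(text):
    """Step 1: return the first document type whose pattern matches."""
    for doc_type, pattern in TYPE_PATTERNS.items():
        if pattern.search(text):
            return doc_type
    return "generic"

def extract_metadata(text):
    """Step 2: use a template when one exists, else a generic heuristic."""
    doc_type = identify_type(text)
    template = TEMPLATES.get(doc_type)
    if template:  # (a) template process
        fields = {}
        for name, regex in template.items():
            match = re.search(regex, text, re.M)
            if match:
                fields[name] = match.group(1).strip()
    else:  # (b) generic heuristic: treat any "Key: value" line as metadata
        pairs = re.findall(r"^(\w+):\s*(.+)$", text, re.M)
        fields = {key.lower(): value.strip() for key, value in pairs}
    return doc_type, fields
```

Keeping the patterns and templates in data rather than code is what allows a new document type to be added by configuration alone.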


Methods
• Extraction of Entity types (People, Place, Location, Facility, etc.)
• Extraction of Events and Relationships - Uses an external file of
patterns to extract attributes, relationships, and events.
• Speed is 100 - 250K per second for information extraction

Pattern Language: Quickly Define
[Figure: example pattern with Purchaser and Seller roles]
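The pattern language itself is not shown on the slide. As a rough analogue, the sketch below keeps extraction patterns in external lines of the form `EventType := regex` and uses named groups for roles such as Purchaser and Seller. The line syntax, the regular expression, and the sample sentence are assumptions.

```python
import re

# Hypothetical contents of an external pattern file, one pattern per line
PATTERN_LINES = [
    r"Purchase := (?P<Purchaser>[A-Z]\w+(?: [A-Z]\w+)*) (?:bought|purchased) "
    r".+? from (?P<Seller>[A-Z]\w+(?: [A-Z]\w+)*)",
]

def load_patterns(lines):
    """Parse 'EventType := regex' lines into (event type, compiled regex) pairs."""
    patterns = []
    for line in lines:
        event_type, regex = line.split(" := ", 1)
        patterns.append((event_type, re.compile(regex)))
    return patterns

def extract_events(text, patterns):
    """Return one dict per match: the event type plus every named role."""
    events = []
    for event_type, pattern in patterns:
        for match in pattern.finditer(text):
            events.append({"type": event_type, **match.groupdict()})
    return events

events = extract_events(
    "Jim Johnson purchased fertilizer from Acme Supply last week.",
    load_patterns(PATTERN_LINES),
)
```

Adding a new relationship or event type then means adding a line to the pattern file, with no change to the extraction code.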


Developed Tools
Create Corpora Tool
1. Pulls down documents from data sources (uses samples)
2. Performs document analysis
3. Generates Core Types
~20 minutes for full markup of 1200 documents


Developed Tools
Corner Case Coverage
Text to RDF tool



• Many Data Sources
• Keyword-based Query
• Fast Core Analytics
• Dynamic Graph Generation

[Figure labels: Structured Data Processing; Natural Language Processing; Keyword Index; Custom Analytics; Data Service]

• Consistent Running Time
• 5 Minute Goal
• Realist Ontology
• Scalable (Hadoop)


Purpose
To create a component that selects
the workflow definition that
satisfies a set of QoS requirements,
maximizing the expected outcome
of the workflow.

Method
Solve Composite Service Problem
• The problem is decomposed into a sequence of functionalities.
• Functionalities (service classes) can be executed by many candidate
services.
• Candidates have associated benefits/costs (QoS Parameters).
• Candidates are substitute and complementary within a service class.
• Given QoS requirements, e.g., algorithm runtime ≤ 5 minutes
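A small sketch of the composite service problem as described above: one candidate is chosen per service class so that total benefit is maximized while total runtime stays within the QoS limit. It uses exhaustive search, which is only practical for small instances and is not claimed to be the solver used in AIRS. The candidate names, runtimes, and benefit values are invented for illustration.

```python
from itertools import product

def select_workflow(service_classes, max_runtime_min):
    """Pick one candidate per service class, maximizing total benefit
    subject to total runtime <= max_runtime_min (exhaustive search)."""
    names = list(service_classes)
    best, best_benefit = None, float("-inf")
    for combo in product(*(service_classes[name] for name in names)):
        runtime = sum(c["runtime_min"] for c in combo)
        benefit = sum(c["benefit"] for c in combo)
        if runtime <= max_runtime_min and benefit > best_benefit:
            best = dict(zip(names, (c["name"] for c in combo)))
            best_benefit = benefit
    return best, best_benefit

# Hypothetical service classes and QoS parameters
classes = {
    "entity_resolution": [
        {"name": "fastest", "runtime_min": 1.0, "benefit": 0.6},
        {"name": "lagrangian", "runtime_min": 4.0, "benefit": 0.9},
    ],
    "event_similarity": [
        {"name": "tfidf", "runtime_min": 0.5, "benefit": 0.8},
        {"name": "semantic", "runtime_min": 2.0, "benefit": 0.85},
    ],
}
choice, benefit = select_workflow(classes, max_runtime_min=5.0)
```

With these made-up numbers the slower entity resolution candidate is still selected, because pairing it with the cheaper similarity measure keeps the workflow inside the 5 minute limit.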


• Implemented in prototype system as runtime QoS

[Figure: Search → Structured Processing and Unstructured Processing → Write Model → SPARQL Query → Write to VIZ, all within 5 Minutes]

• Developers must adhere to QoS parameters
• Phenomenal feedback loop developed with analysts; analysts
understood and diagnosed system
• Choose two additional QoS metrics for Phase 3 (memory)


Method
Representation: an Event is represented by Location, Time, and Description.
Attribute similarity measures:
• Location: String; Spatial/Hierarchical
• Description: TFIDF (0.80); Semantic (0.64)
Event similarity algorithms:
• Euclidean with Dynamic Weighting (0.80) or Static Weighting
• Neural Network (0.77)
• Logistic Regression (0.75)
• SVM (0.75)
Scores are Max F.

Major Research Tasks:
• Identified succinct, easily extractable event representation
• Tested Location and Description similarity measures
• Tested Event Similarity Algorithms
• Tested performance on natural language and structured data sources
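To make the description similarity concrete, here is a minimal pure-Python sketch of TF-IDF cosine similarity, plus a statically weighted combination of per-attribute similarities. The smoothed IDF formula, the equal weights, and the sample texts are assumptions, and the code is not meant to reproduce the reported scores.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per document (smoothed idf)."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def event_similarity(location_sim, time_sim, description_sim,
                     weights=(1 / 3, 1 / 3, 1 / 3)):
    """Statically weighted combination of per-attribute similarities in [0, 1]."""
    return sum(w * s for w, s in zip(weights, (location_sim, time_sim, description_sim)))

docs = [
    "gunmen set up a fake checkpoint and intercepted two college buses in Mosul",
    "armed assailants stopped two school buses at a fake checkpoint",
    "crop outlook for early summer",
]
vecs = tfidf_vectors(docs)
```

Here `cosine(vecs[0], vecs[1])` is higher than `cosine(vecs[0], vecs[2])` because the first two descriptions share terms such as "fake", "checkpoint", and "buses". A dynamic weighting scheme would replace the fixed `weights` tuple with values chosen per comparison.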

GTD: 200804060007
04/06/2008: On Sunday, unknown gunmen set up a fake checkpoint and intercepted two college buses, one carrying male students and one carrying female students, in Mosul, Nineveh province, Iraq. The bus carrying the female students managed to escape but the gunmen held the 42 male college students…

WITS: 200804509
On 6 April 2008, in the morning, in Jurn, Ninawa, Iraq, armed assailants stopped two school buses carrying students to Mosul University at a fake checkpoint. The assailants then fired upon one of the busses as it managed to escape, wounding three students and damaging the bus. Assailants kidnapped all 42 students on board the second bus…

Jurn ≈ Mosul
Gaza ≠ Sderot
[Map: Mosul and Jurn, 25 km apart]
Close Distance ≠ Similarity

Processing Pipelines for Speed vs. Quality Decision

[Figure: entity resolution pipeline]
Input: text files (Ont Model 1 to Ont Model 4) in <RDF INPUT DIRECTORY>
Subproblem construction: EntityResolutionSubproblemConstruction.java writes subproblems, e.g. Subproblem (1,2), Subproblem (3,4), …, to <SUBPROBLEM DIRECTORY>
Associated entity types: Person, Location, Organization, Date, Artifact
Local solvers: FastestEntityResolutionSolverLocal.java, LREntityResolutionSolverLocal.java
MapReduce solvers: FastestEntityResolutionSolverMR.java, LREntityResolutionSolverMR.java
Output: New Ont Model text file in <NEW-RDF OUTPUT DIRECTORY>
Legend: Implements JavaJobRunner; Implements JavaJobRunner, but runs MR Jobs; Implements MapReduceJobRunner

Method
Lagrangian relaxation of an integer programming formulation of the clustering problem. This algorithm iteratively adjusts scores to resolve inconsistencies, and also provides a performance guarantee (optimality gap) on the solutions.

[Figure: three-node example with nodes P1, P2, P3 and pairwise scores 55, 65, and -85]
[Plot: Objective Value (150 to 310) vs. Iteration Number (1 to 46)]
[Plot: Run Time per Iteration (minutes, 0 to 45) vs. # Processors (0 to 16)]
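The three-node figure can be read as a tiny instance of the clustering problem. The sketch below solves it by brute-force enumeration of all partitions, not by Lagrangian relaxation, simply to show what an "inconsistency" looks like. The assignment of the scores to pairs (P1 and P2: 55, P1 and P3: 65, P2 and P3: -85) is an assumption read off the figure layout.

```python
def partitions(items):
    """Yield every partition of a list into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in partitions(rest):
        # Put `first` into each existing cluster in turn ...
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        # ... or into a new cluster of its own
        yield [[first]] + partial

def objective(clusters, scores):
    """Sum the pairwise scores of all pairs placed in the same cluster."""
    total = 0
    for cluster in clusters:
        for i, a in enumerate(cluster):
            for b in cluster[i + 1:]:
                total += scores[frozenset((a, b))]
    return total

# Assumed reading of the figure's edge labels
scores = {
    frozenset(("P1", "P2")): 55,
    frozenset(("P1", "P3")): 65,
    frozenset(("P2", "P3")): -85,
}
best = max(partitions(["P1", "P2", "P3"]), key=lambda c: objective(c, scores))
```

Under this reading, P1 scores as a match with both P2 and P3, yet P2 and P3 score strongly as a non-match, so no clustering can honor all three pairwise decisions. Enumeration is exponential in the number of entities, which is why a relaxation method with an optimality gap is attractive at scale.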

Results Cluster
[Figure: an AIRS Search returns 300 Distinct Results; Cluster Similar Content groups the information into clusters of similar content, e.g. Arrest and Trial]


• Analyst Context and Current State
– Analyst may come to the system with some information
• “There was a Terrorist Act at time X”
• “I am interested in this suspected Insurgent”
• “I want to know about a relationship between groups A and B”
– Initial queries may produce statements aligned with CTO
• Abductive Requery is applied
– Select weighted fragments whose bound variables match CTO elements used in
Context/State
– Select rules those fragments correspond to, weighting by selected fragments
– Combine rule statements with known Context/State
– Produce subsequent query with known values ‘filled in’
Context:
  “Jane Doe” wife “John Doe”
Fragment 1:
  { ?p1 wife ?p2 . }
Rule:
  CONSTRUCT {
    ?p1 wife ?p2 .
    ?p2 husband ?p1 .
  }
  WHERE {
    ?p1 bride ?w1 .
    ?p2 groom ?w1 .
    ?w1 rdf:type Wedding .
  }
Subsequent query:
  SELECT ?w1
  WHERE {
    “Jane Doe” bride ?w1 .
    “John Doe” groom ?w1 .
    ?w1 rdf:type Wedding .
  }
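The "fill in known values" step of abductive requery can be sketched in a few lines. This is a simplified illustration using the wedding example from the slide: the fragment weighting is omitted, and the function names `unify` and `requery` are hypothetical.

```python
def unify(fragment, context):
    """Bind the fragment's variables by matching it against a known context triple."""
    bindings = {}
    for pattern_term, known_term in zip(fragment, context):
        if pattern_term.startswith("?"):
            bindings[pattern_term] = known_term
        elif pattern_term != known_term:
            return None  # constants disagree, so the fragment does not apply
    return bindings

def requery(rule_where, bindings, select_var):
    """Fill known values into the rule's WHERE patterns and build a SELECT query."""
    lines = []
    for triple in rule_where:
        terms = [f'"{bindings[t]}"' if t in bindings else t for t in triple]
        lines.append("  " + " ".join(terms) + " .")
    return f"SELECT {select_var} WHERE {{\n" + "\n".join(lines) + "\n}"

context = ("Jane Doe", "wife", "John Doe")
fragment = ("?p1", "wife", "?p2")
rule_where = [
    ("?p1", "bride", "?w1"),
    ("?p2", "groom", "?w1"),
    ("?w1", "rdf:type", "Wedding"),
]
query = requery(rule_where, unify(fragment, context), "?w1")
```

The generated query asks for the wedding that would explain the known "wife" statement, which is the abductive step: reasoning from a rule's conclusion back to the conditions that could have produced it.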


• Developed on the Hadoop/MapReduce framework
• Distributed services used in AIRS
– Algorithms are written within the MapReduce and HDFS (file-system) environment; a single-threaded algorithm runs as a single-“slot” algorithm
– Oozie is the workflow coordination service; all jobs are monitored, dispatched, and logged
– HBase and HDFS are used as distributed data stores for document metadata and RDF graphs

AIRS Software

HBase Database Oozie Workflow Coordination Service

MySQL Database Map Reduce Processing Framework

Hadoop Distributed File System (HDFS)

Server / Cluster Hardware


SELECT DISTINCT ?PersonNameText
WHERE
{
?act rdf:type event:Act .
?act ro:has_participant ?person .
?person rdf:type agent:Person .
?person ero:designated_by ?personName .
?personName ero:bearer_of ?personNameBearer .
?personNameBearer info:has_text_value ?PersonNameText .
}

Query execution steps:
1. Initial Query: ?act rdf:type event:Act
2. Merging Query: ?act ro:has_participant ?person
3. Merging Query: ?person rdf:type agent:Person
4. Merging Query: ?person ero:designated_by ?personName
5. Merging Query: ?personName ero:bearer_of ?personNameBearer
6. Merging Query: ?personNameBearer info:has_text_value ?PersonNameText
7. Distinct Query Step: filter on distinct ?PersonNameText’s
8. Save a result iterator and return results to the user
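The step-by-step evaluation above amounts to joining one triple pattern at a time onto the bindings found so far. The sketch below shows that idea over an in-memory list of triples. It is a simplified single-process illustration, not the distributed implementation, and the sample triples are invented.

```python
def match_pattern(pattern, triples, binding):
    """Extend one variable binding with every triple that matches the pattern."""
    for triple in triples:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if new.setdefault(term, value) != value:
                    break  # variable already bound to a different value
            elif term != value:
                break  # constant does not match
        else:
            yield new

def run_steps(patterns, triples, select_var):
    bindings = [{}]  # the initial query starts from a single empty binding
    for pattern in patterns:  # each merging query joins one more pattern
        bindings = [b2 for b in bindings
                    for b2 in match_pattern(pattern, triples, b)]
    # Distinct step: keep each selected value once, preserving order
    return list(dict.fromkeys(b[select_var] for b in bindings))

patterns = [
    ("?act", "rdf:type", "event:Act"),
    ("?act", "ro:has_participant", "?person"),
    ("?person", "rdf:type", "agent:Person"),
    ("?person", "ero:designated_by", "?personName"),
    ("?personName", "ero:bearer_of", "?personNameBearer"),
    ("?personNameBearer", "info:has_text_value", "?PersonNameText"),
]
# Hypothetical sample data: two acts with the same participant
triples = [
    ("act1", "rdf:type", "event:Act"),
    ("act2", "rdf:type", "event:Act"),
    ("act1", "ro:has_participant", "p1"),
    ("act2", "ro:has_participant", "p1"),
    ("p1", "rdf:type", "agent:Person"),
    ("p1", "ero:designated_by", "n1"),
    ("n1", "ero:bearer_of", "nb1"),
    ("nb1", "info:has_text_value", "Jim Johnson"),
]
names = run_steps(patterns, triples, "?PersonNameText")
```

Both acts lead to the same person, so the distinct step reduces two result rows to one name.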


“Raw” Algorithms
• Accept Model
• Data Association
• Data Association Only
• Ingestion
• Ingestion Only
• Sparql Translation
• Query
• Query Ingestion
• Query Inprocess
• Query Structured Translation
• Data Association Ingestion Structured

“Secondary” Algorithms
• Airs Query
• Cluster Results
• Extract All Organizations
• Extract All Persons
• Filter By Date
• Find Events
• Topic Filters (32 variants)
– Leadership
– Corruption
– Dirty bombs
– Drugs, etc.


Probe Task - Wrapper Algorithms
[Bar chart: Total Wrapper Lines of Code (0 to 1400) per Probe Task (1 to 20)]

• Total Lines: 13,958*
– Wrapper Code: 6,778 (49%)
– Implementation Code: 3,186* (23%)
– Validation Code: 3,994 (29%)
* Less code developed before Test & Evaluation


Task: Find Life Events of an Individual
[Timeline: Day 0 to Day 5]
• Tune Life Event Extraction (NLP & SDA)
• Develop Algorithm (glue code) to Align Events
New Analytic Capabilities in Days


• Over 1200 workflows were issued by analysts over a 3-day period


Cluster Monitoring (Ganglia)

• System Load
• CPU Usage
• Memory Usage
• Network Bandwidth


• Fast translation technologies for structured and unstructured data

• Many analytics successes - more to come in Phase 3

• All open source software, written entirely in Java
• Full Government Purpose Rights

• Installation manual and user manual ready to go


Justin Del Vecchio
delvecchio@cubrc.org
716-204-5139

