Advantage Through Technology

Actionable Intelligence Retrieval System (AIRS) Overview
                       27 November 2012




                                          CUBRC KDD AIRS System   1
Alignment of Data Models
- Single representation for all data sources
- Easily plug in new data sources

Observer data model perspective confounds data integration:
    Report:     ID: 556AS4; Date: 26 Apr 08; Event: Crop Failure Extent Detection
    Transcript: Date: 15 May 08; Event: Situation in certain areas dire as lack of rain …
    Newsletter: Date: 10 Apr 08; Description: Crop outlook for early summer …

Remove perspective:
[Figure: a single Event node with Type: Crop Failure; OccursOn: Apr – Jul 2008; OccursAt: Western Afghanistan; RecordedBy: Report, Transcript, Newsletter]
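A minimal sketch of the event-centered representation in the figure, assuming its labels can be read as subject-predicate-object triples. The identifier `event1` and the helper `describe` are hypothetical, and treating the three documents as records of one event is an assumption drawn from the figure, not a documented AIRS data structure.

```python
# Hypothetical triples read off the figure; "event1" is an illustrative identifier.
TRIPLES = [
    ("event1", "Type", "Crop Failure"),
    ("event1", "OccursOn", "Apr - Jul 2008"),
    ("event1", "OccursAt", "Western Afghanistan"),
    ("event1", "RecordedBy", "Report"),
    ("event1", "RecordedBy", "Transcript"),
    ("event1", "RecordedBy", "Newsletter"),
]


def describe(subject, triples):
    """Group every (predicate, object) pair of one subject into a dict of lists."""
    grouped = {}
    for s, p, o in triples:
        if s == subject:
            grouped.setdefault(p, []).append(o)
    return grouped


event = describe("event1", TRIPLES)
print(event["RecordedBy"])
```

With one representation, a new data source only has to add further `RecordedBy` triples instead of introducing its own record layout.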




Advanced Analytics Algorithms

[Figure: analyst questions about an Event placed along a Timeline, on a scale running from Quantitative to Qualitative]

“Easy” Analyst Questions
- Identify All Event Information

“Harder” Analyst Questions
- Identify Similar Events

“Hardest” Analyst Question
- Identify Predictor Events
Probe Tasks

    • Fully automated tasks
    • Test system plumbing
    • Ex: Find all associates of Jim Johnson and list the person’s
      affiliation to Jim. Use only data sets A, E, M.
    • 20 questions like these

Analyst Tasks

    • Manual task executed by actual analysts
    • Test usability and applicability of developed algorithms
      to realistic tasks
    • Ex: Find all information that may have predicted an
      attack was imminent in Khost, Afghanistan on 3 June,
      2008.
    • 10 questions like these

Many Sources, Many Records, Many Types

Many Sources: DS 1, DS 2, DS 3, DS 4, DS 5, DS 6, DS 7, DS 8
Many Records: 1K, 100K, 1M
Many Types:   Reports, Articles, Blogs, Transcripts, Structured, DOMEX, Semi-Structured, Social Media
Three Essential Components

    • Architecture
    • Research Tasks
    • Integrated Prototype
9 High Level Research Areas

ALIGNMENT
1. Ontology Development
2. Structured Data Alignment
3. Unstructured Data Alignment
4. Alignment Reasoner
5. Alignment Optimization

ADVANCED ANALYTICS
6. Workflow Optimization
7. Application of Analyst Context
8. Data Association for Entity Resolution
9. Distributed Graph Matching

30 Research Tasks in Phase 2

•Task 1.1.3 (CUBRC)            April PreProto
•Task 1.1.4 (CUBRC)            Aug Lab
•Task 1.1.5 (CUBRC)            Aug Lab
•Task 1.2.2 (CUBRC)            April Lab
•Task 2.1.2 (ISS)              April Lab
•Task 2.1.3.a (ISS)            April Lab
•Task 2.1.3.c (ISS)            Aug Lab
•Task 3.1.2.a (GDIT)           April Lab | Aug PreProto
•Task 3.1.3 (GDIT)             April PreProto
•Task 3.1.4 (GDIT)             April Lab | Aug PreProto
•Task 3.2.1.a (GDIT)           April Lab | Aug PreProto
•Task 3.2.1.b (GDIT)           April Lab | Aug PreProto
•Task 3.2.1.c (GDIT)           April Lab | Aug PreProto
•Task 3.2.1.d (GDIT)           April Lab | Aug PreProto
•Task 3.2.3 (GDIT)             April PreProto
•Task 4.2.1 (Securboration)    Aug Lab
•Task 5.1.1 (CUBRC)            Aug Lab
•Task 6.1.2 (CUBRC/UB)         April Lab | Aug PreProto
•Task 6.1.4 (CUBRC)            April Lab | Aug PreProto
•Task 6.1.5 (CUBRC)
•Task 6.3.1 (CUBRC)            April Lab
•Task 7.3.1 (Securboration)    April Lab
•Task 7.4.1 (UB)               Aug PreProto
•Task 8.1.1 (UB)               Aug PreProto
•Task 8.3.1 (UB)               April PreProto
•Task 8.3.2 (UB)               April Lab | Aug PreProto
•Task 9.1.1 (UB)               Aug PreProto
•Task 9.1.2 (CUBRC)            Aug Lab
•Task 9.2.1 (UB)               Aug Lab
•Task 9.3.1 (UB)               Aug Lab
Analytics Data Flow

[Figure: Analytics Data Flow. Labels recovered from the diagram:]
Visualization:  Query; Answers; Invoke Algorithm
Analytics:      Query: Expansion; Ranked Queries; Query: Execution (Use Aligned Models); Results; Evaluate; Analyze; Requery: Abduction; Invoked Algorithm (Single Threaded)
Data Services:  Search; Graph Creation: Structured; Graph Creation: Unstructured; Association: Entities & Events; Query: SPARQL (marked Parallelized)
Data:           Raw Data; Data Sources; Global Model; KDD RDF (Write; Read & Write)
Backbone of Project

[Figure: ontology hierarchy]
Top:        Basic Formal Ontology – Relation Ontology
Middle row: Artifact Ontology; Agent Ontology; Event Ontology; Extended Relation Ontology; Geospatial Ontology; Information Technology Ontology; Quality Ontology; Time Ontology
Below:      AIRS Mid-Level Ontology
Bottom:     Counterterrorism Ontology
Callout:    Defines Input & Output Format for Most Processes
Information Entity Ontology
     • 76 local classes
     • 21 equivalence class axioms
     • 1 superclass axiom
     • 28 local object properties
     • 7 datatype properties

Agent Ontology
    • 787 local classes
    • 231 equivalence class axioms (mostly persons with roles, e.g. Physician, Lawyer)
    • 70 local object properties (mostly familial relationships)
    • SPARQL Inferencing Notation (SPIN) rules that infer familial relationships from the primitive relationships child_of and parent_of and the qualities of male and female gender

Sample Document
[Figure: sample document annotated with #Note, #Paragraph, #SectionOfText, #Person, #Place]
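The kind of inference the SPIN rules perform can be sketched in plain Python. This is not SPIN and not the AIRS rule set: the derived relation names (`mother_of`, `father_of`, `brother_of`, `sister_of`) and the sample people are assumptions for illustration; only `child_of`, `parent_of`, and the male/female qualities come from the slide.

```python
def infer_family(child_of, gender):
    """Derive familial relations from child_of pairs and gender qualities.

    child_of: set of (child, parent) pairs.
    gender:   dict mapping a person to "male" or "female".
    Returns a set of (subject, relation, object) triples.
    """
    inferred = set()
    parents = {}
    for child, parent in child_of:
        parents.setdefault(child, set()).add(parent)
        # parent_of is the inverse of child_of
        inferred.add((parent, "parent_of", child))
        role = {"male": "father_of", "female": "mother_of"}.get(gender.get(parent))
        if role:
            inferred.add((parent, role, child))
    # Two different people who share at least one parent are siblings.
    for a in parents:
        for b in parents:
            if a != b and parents[a] & parents[b]:
                role = {"male": "brother_of", "female": "sister_of"}.get(gender.get(a))
                if role:
                    inferred.add((a, role, b))
    return inferred


facts = infer_family(
    {("Ann", "Mary"), ("Bob", "Mary"), ("Ann", "Tom")},
    {"Ann": "female", "Bob": "male", "Mary": "female", "Tom": "male"},
)
print(sorted(facts))
```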

Analytics Query
“Documents where
 Smyth is a Person &&
 has Associates &&
 footnote contains ‘XY’ &&
 from data set 4 or 5”

‘Soup-to-Nuts’ Graph
[Figure labels: SPARQL Query; Ontology; Raw data sets 4 and 5]
Architecture (Spring Framework)
[Figure: several Learners feed a Mega-Learner; supporting elements are the Learner Context and the Alignment Data Cube]

Implementation
[Figure: a Column Alignment Request goes to the Column Alignment Mega-Learner, which draws on Data Value Characterization, Categorical Based Data Value Characterization, and Lucene Base Alignment to produce a Column Alignment Prediction]

Data Value Characterization
•   Used metadata, data values, regular expressions, and neural networks to classify columns
•   Combined with a collection of heuristics
     • Date Time
     • Person’s Name, Alias, and Birth Date
     • Recognizing unstructured data within structured data
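A minimal sketch of data value characterization using only metadata (the column header) and regular expressions. The patterns, the 0.8 threshold, the eight-word cutoff, and the label names are all assumptions; the neural network component mentioned above is not modeled here.

```python
import re

# Assumed patterns; real column data would need broader coverage.
DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$|^\d{4}-\d{2}-\d{2}$")
NAME_RE = re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+)+$")


def characterize_column(header, values, threshold=0.8):
    """Return a coarse label for a column from its header and sample values."""
    values = [v.strip() for v in values if v and v.strip()]
    if not values:
        return "unknown"

    def share(predicate):
        # Fraction of sample values that satisfy the predicate.
        return sum(1 for v in values if predicate(v)) / len(values)

    name = header.lower()
    if share(lambda v: DATE_RE.match(v) is not None) >= threshold:
        # Metadata hint separates a birth date from a generic date/time.
        return "birth_date" if "birth" in name or "dob" in name else "date_time"
    if share(lambda v: NAME_RE.match(v) is not None) >= threshold:
        return "alias" if "alias" in name else "person_name"
    if share(lambda v: len(v.split()) >= 8) >= threshold:
        # Long free text inside a structured column.
        return "unstructured_text"
    return "unknown"


print(characterize_column("DOB", ["04/06/2008", "1999-12-31"]))
print(characterize_column("full_name", ["Jane Doe", "John Doe"]))
```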
D2RQ Mapping File
• Enables dynamic RDF generation




Method
1.   Document Type Identification:
         • Determine document type with pattern-based configurations
2.   Passage & Metadata Retrieval:
         • Given the document type, identify and extract data using:
             a. Template / Grammar Process
             b. Generic Heuristic Process
3.   Document Genre Association:
         • Link associated document genres
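Steps 1 and 2 of the method above can be sketched as follows. The configuration format, the two document types, and the field patterns are hypothetical stand-ins for the pattern-based configurations and template grammars; step 3 (genre association) is not covered.

```python
import re

# Hypothetical identification configuration: document type -> identifying pattern.
TYPE_PATTERNS = {
    "report": re.compile(r"^ID:\s*\w+", re.MULTILINE),
    "transcript": re.compile(r"^Event:", re.MULTILINE),
}

# Hypothetical templates: document type -> named field patterns.
TEMPLATES = {
    "report": {
        "id": re.compile(r"^ID:\s*(\w+)", re.MULTILINE),
        "date": re.compile(r"^Date:\s*(.+)$", re.MULTILINE),
    },
}


def identify_type(text):
    """Step 1: return the first document type whose pattern matches."""
    for doc_type, pattern in TYPE_PATTERNS.items():
        if pattern.search(text):
            return doc_type
    return "unknown"


def extract(text):
    """Step 2: use a template when one exists, else a generic heuristic."""
    doc_type = identify_type(text)
    template = TEMPLATES.get(doc_type)
    if template:  # (a) template / grammar process
        fields = {}
        for name, pattern in template.items():
            match = pattern.search(text)
            if match:
                fields[name] = match.group(1).strip()
        return doc_type, fields
    # (b) generic heuristic process: treat "Key: value" lines as metadata
    pairs = re.findall(r"^(\w+):\s*(.+)$", text, re.MULTILINE)
    return doc_type, {key.lower(): value.strip() for key, value in pairs}


print(extract("ID: 556AS4\nDate: 26 Apr 08\nEvent: Crop Failure Extent Detection"))
print(extract("Date: 15 May 08\nEvent: Situation in certain areas"))
```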
[Figure: a Document enters the Document Type Identification Process (driven by an Identification Configuration), which yields the Document Type. Passage & Metadata Retrieval then runs (a) the Template / Grammar Process (driven by Template Grammars) or (b) the Generic Heuristic Process, yielding Passage & Metadata Annotations. The Document Genre Association Process outputs Passages, Metadata, and Genre links.]
Methods
• Extraction of Entity types (People, Place, Location, Facility, etc.)
• Extraction of Events and Relationships - Uses an external file of
  patterns to extract attributes, relationships, and events.
• Speed is 100 - 250K per second for information extraction
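A minimal sketch of event extraction driven by an external pattern file, assuming a hypothetical one-rule-per-line format (`EventType :: regex`) in which named groups carry the roles. The actual AIRS pattern language is not shown in the slides; `PATTERN_TEXT` stands in for the file contents.

```python
import re

# Stand-in for the external pattern file (format is an assumption).
PATTERN_TEXT = r"""
Purchase :: (?P<Purchaser>[A-Z]\w+(?: [A-Z]\w+)*) (?:bought|purchased) .+? from (?P<Seller>[A-Z]\w+(?: [A-Z]\w+)*)
"""


def load_patterns(text):
    """Parse 'EventType :: regex' lines into (event_type, compiled_regex) rules."""
    rules = []
    for line in text.strip().splitlines():
        event_type, regex = line.split(" :: ", 1)
        rules.append((event_type.strip(), re.compile(regex.strip())))
    return rules


def extract_events(sentence, rules):
    """Apply every rule and return one dict per match, with roles as keys."""
    events = []
    for event_type, pattern in rules:
        for match in pattern.finditer(sentence):
            events.append({"event": event_type, **match.groupdict()})
    return events


rules = load_patterns(PATTERN_TEXT)
print(extract_events("Acme Corp purchased three trucks from Globex Ltd last week.", rules))
```

Keeping the patterns outside the code means a new event type can be defined by adding a line to the file, without touching the extractor.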



[Figure: Pattern Language example marking the Purchaser and Seller roles; caption: Quickly Define]
Developed Tools
  Create Corpora Tool
  1. Pulls down documents from data sources (uses samples)
  2. Performs document analysis
  3. Generates Core Types
  ~20 minutes for full markup of 1200 documents
Developed Tools
   Corner Case Coverage
   Text to RDF tool




[Figure: prototype overview]
Stages:     Many Data Sources; Keyword-based Query; Fast Core Analytics; Dynamic Graph Generation
Components: Keyword Index; Data Service; Structured Data Processing; Natural Language Processing; Custom Analytics
Properties: Consistent Running Time; 5 Minute Goal; Realist Ontology; Scalable (Hadoop)
Purpose
To create a component that selects the workflow definition that satisfies a set of QoS requirements while maximizing the expected outcome of the workflow.

Method
Solve Composite Service Problem
      • The problem is decomposed into a sequence of functionalities.
      • Each functionality (service class) can be executed by many candidate services.
      • Candidates have associated benefits/costs (QoS parameters).
      • Candidates within a service class are substitutes for, or complements to, one another.
      • QoS requirements are given as constraints, e.g., algorithm runtime ≤ 5 minutes.
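A minimal sketch of the composite service problem, assuming one candidate is chosen per service class, benefits add up, and runtime is the only QoS constraint. The candidate names, benefit scores, and runtimes are invented for illustration, and exhaustive search is used only because the example is tiny; the slides do not say how AIRS solves the problem.

```python
from itertools import product

# Hypothetical candidates: service class -> (name, benefit score, runtime in minutes).
CANDIDATES = {
    "search": [("keyword", 6, 0.5), ("semantic", 8, 1.5)],
    "graph_creation": [("fast_nlp", 5, 1.0), ("deep_nlp", 9, 3.5)],
    "association": [("fastest", 4, 0.5), ("lagrangian", 8, 2.0)],
}


def select_workflow(candidates, max_runtime=5.0):
    """Pick one candidate per class, maximizing benefit within the runtime limit."""
    classes = list(candidates)
    best, best_benefit = None, float("-inf")
    for combo in product(*(candidates[c] for c in classes)):
        runtime = sum(service[2] for service in combo)
        benefit = sum(service[1] for service in combo)
        if runtime <= max_runtime and benefit > best_benefit:
            best = dict(zip(classes, (service[0] for service in combo)))
            best_benefit = benefit
    return best, best_benefit


workflow, benefit = select_workflow(CANDIDATES)
print(workflow, benefit)
```

Tightening `max_runtime` forces cheaper substitutes into the workflow, which is the speed versus quality trade-off the QoS requirement expresses.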

• Implemented in prototype system as runtime QoS

[Figure: prototype workflow held to 5 Minutes: Search; Structured Processing and Unstructured Processing; Write Model; SPARQL Query; Write to VIZ]

• Developers must adhere to QoS parameters
• Phenomenal feedback loop developed with analysts; analysts
  understood and diagnosed system
• Choose two additional QoS metrics for Phase 3 (memory)


Method
Representation: Event = Location, Time, Description
    Location measures:    Euclidean; String; Spatial/Hierarchical
    Description measures: TFIDF (0.80); Semantic (0.64)
Similarity (Max F):
    Dynamic Weighting (0.80)
    Static Weighting
    Logistic Regression (0.75)
    Neural Network (0.77)
    SVM (0.75)

Major Research Tasks:
• Identified succinct, easily extractable event representation
• Tested Location and Description similarity measures
• Tested Event Similarity Algorithms
• Tested performance on natural language and structured data sources
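The TFIDF description measure named above can be illustrated with a small pure-Python sketch. The tokenizer, the smoothed IDF formula, and the sample descriptions are assumptions; the scores in the slide come from the authors' experiments and are not reproduced by this code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF keeps terms that occur in every document at a nonzero weight.
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1.0 for term in df}
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "unknown gunmen set up a fake checkpoint and intercepted two college buses in Mosul",
    "armed assailants stopped two school buses at a fake checkpoint",
    "crop outlook for early summer",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```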
GTD: 200804060007
04/06/2008: On Sunday, unknown gunmen set up a fake checkpoint and intercepted two college buses, one carrying male students and one carrying female students, in Mosul, Nineveh province, Iraq. The bus carrying the female students managed to escape but the gunmen held the 42 male college students…

WITS: 200804509
On 6 April 2008, in the morning, in Jurn, Ninawa, Iraq, armed assailants stopped two school buses carrying students to Mosul University at a fake checkpoint. The assailants then fired upon one of the busses as it managed to escape, wounding three students and damaging the bus. Assailants kidnapped all 42 students on board the second bus…

Jurn ≈ Mosul
[Figure: map showing Jurn 25 km from Mosul]

Gaza ≠ Sderot
Close Distance ≠ Similarity
Processing Pipelines for Speed vs. Quality Decision

Input: <RDF INPUT DIRECTORY> (Text Files; Ont Model 1, Ont Model 2, Ont Model 3, Ont Model 4)
Associate: Person, Location, Organization, Date, Artifact

Local solvers:
    FastestEntityResolutionSolverLocal.java
    LREntityResolutionSolverLocal.java
Subproblem pipeline:
    EntityResolutionSubproblemConstruction.java
    <SUBPROBLEM DIRECTORY>: Subproblem (1,2), …, Subproblem (3,4)
    FastestEntityResolutionSolverMR.java
    LREntityResolutionSolverMR.java
Output: <NEW-RDF OUTPUT DIRECTORY> (Text File; New Ont Model)

Legend: Implements JavaJobRunner; Implements JavaJobRunner, but runs MR Jobs; Implements MapReduceJobRunner
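One possible reading of the subproblem construction step is that each subproblem pairs two ontology models and collects same-type entity pairs to compare. That reading is an assumption based on the labels "Subproblem (1,2)" and "Subproblem (3,4)"; the sketch below is Python, not the Java classes named in the figure, and the model contents are invented.

```python
from itertools import combinations

ENTITY_TYPES = {"Person", "Location", "Organization", "Date", "Artifact"}


def build_subproblems(models):
    """Pair up models and list same-type entity pairs to be compared.

    models: dict mapping a model id to a list of (entity_type, label) tuples.
    Returns a dict keyed by (id_a, id_b) with the candidate pairs of that subproblem.
    """
    subproblems = {}
    for a, b in combinations(sorted(models), 2):
        pairs = [
            (left, right)
            for left in models[a]
            for right in models[b]
            if left[0] == right[0] and left[0] in ENTITY_TYPES
        ]
        if pairs:  # skip model pairs with nothing to associate
            subproblems[(a, b)] = pairs
    return subproblems


models = {
    1: [("Person", "Jim Johnson")],
    2: [("Person", "J. Johnson"), ("Location", "Mosul")],
    3: [("Location", "Mosul")],
}
subproblems = build_subproblems(models)
print(subproblems)
```

Because each subproblem is independent, the pairs can be handed to separate workers, which fits a split between local and MapReduce solvers.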
Method
Lagrangian relaxation of an integer programming formulation of the clustering problem. This algorithm iteratively adjusts scores to resolve inconsistencies, and also provides a performance guarantee (optimality gap) on the solutions.

[Figure: three entities P1, P2, P3 with pairwise scores 55, 65, and -85]
[Plot: Objective Value (150 to 310) vs. Iteration Number (1 to 46)]
[Plot: Run Time per Iteration (minutes, 0 to 45) vs. # Processors (0 to 16)]
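A minimal sketch of Lagrangian relaxation on the three-entity example, assuming the scores attach to the pairs as P1-P2 = 55, P1-P3 = 65, P2-P3 = -85 and that the relaxed constraints are the transitivity constraints x_ab + x_bc - x_ac ≤ 1. The fixed subgradient step and the merge-based repair are choices made for this illustration, not the authors' implementation.

```python
from itertools import combinations


def lagrangian_clustering(nodes, score, step=10.0, iterations=20):
    """Return (best feasible value, best upper bound) for pairwise clustering."""
    edges = [frozenset(e) for e in combinations(nodes, 2)]
    # One transitivity constraint per triple and per choice of implied edge c.
    constraints = []
    for trio in combinations(nodes, 3):
        trio_edges = [frozenset(e) for e in combinations(trio, 2)]
        for c in trio_edges:
            a, b = [e for e in trio_edges if e != c]
            constraints.append((a, b, c))
    lam = [0.0] * len(constraints)
    best_ub, best_lb = float("inf"), float("-inf")
    for _ in range(iterations):
        # Adjust the scores with the current multipliers.
        adj = {e: float(score[e]) for e in edges}
        for multiplier, (a, b, c) in zip(lam, constraints):
            adj[a] -= multiplier
            adj[b] -= multiplier
            adj[c] += multiplier
        x = {e: 1 if adj[e] > 0 else 0 for e in edges}
        # The relaxed optimum is an upper bound on the true optimum.
        best_ub = min(best_ub, sum(lam) + sum(adj[e] for e in edges if x[e]))
        # Repair: merge linked nodes so the clustering is consistent.
        cluster = {n: n for n in nodes}
        for e in edges:
            if x[e]:
                u, v = tuple(e)
                old, new = cluster[u], cluster[v]
                for n in nodes:
                    if cluster[n] == old:
                        cluster[n] = new
        lb = sum(score[e] for e in edges if len({cluster[n] for n in e}) == 1)
        best_lb = max(best_lb, lb)
        # Subgradient step: raise multipliers of violated constraints.
        lam = [max(0.0, m + step * (x[a] + x[b] - x[c] - 1))
               for m, (a, b, c) in zip(lam, constraints)]
    return best_lb, best_ub


SCORES = {
    frozenset(("P1", "P2")): 55,
    frozenset(("P1", "P3")): 65,
    frozenset(("P2", "P3")): -85,
}
lower, upper = lagrangian_clustering(["P1", "P2", "P3"], SCORES)
print(lower, upper, upper - lower)
```

The difference between the upper bound and the best feasible value is the optimality gap; under these assumed scores the bounds meet, so grouping P1 with P3 and leaving P2 apart is optimal.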
Results Cluster: AIRS Search
[Figure: AIRS search results grouped into clusters of Similar Content, e.g. Arrest and Trial]
    • Cluster Similar Content
    • Group Information
    • 300 Distinct Results
•   Analyst Context and Current State
      – Analyst may come to the system with some information
           • “There was a Terrorist Act at time X”
           • “I am interested in this suspected Insurgent”
           • “I want to know about a relationship between groups A and B”
      – Initial queries may produce statements aligned with CTO
•   Abductive Requery is applied
      – Select weighted fragments whose bound variables match CTO elements used in Context/State
      – Select the rules those fragments correspond to, weighting by selected fragments
      – Combine rule statements with known Context/State
      – Produce subsequent query with known values ‘filled in’

Context:
“Jane Doe” wife “John Doe”

Fragment 1 {
  ?p1 wife ?p2 . }

Rule:
CONSTRUCT {
  ?p1 wife ?p2 .
  ?p2 husband ?p1 .
}
WHERE {
  ?p1 bride ?w1 .
  ?p2 groom ?w1 .
  ?w1 rdf:type Wedding .
}

Subsequent query:
SELECT ?w1
WHERE {
  “Jane Doe” bride ?w1 .
  “John Doe” groom ?w1 .
  ?w1 rdf:type Wedding .
}
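The wedding example can be traced with a minimal sketch of the requery step: match a rule's CONSTRUCT fragment against the known context, then fill the bindings into the rule's WHERE clause. Fragment and rule weighting are omitted, and the tuple-based rule format and function names are assumptions made for this illustration.

```python
def unify(pattern, triple):
    """Bind ?variables in pattern to the terms of triple; None on mismatch."""
    bindings = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if bindings.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return bindings


def requery(context, rule):
    """Build a SELECT query from a rule whose CONSTRUCT fragment matches context."""
    for fragment in rule["construct"]:
        bindings = unify(fragment, context)
        if bindings is None:
            continue

        def fill(term):
            # Replace a bound variable with its quoted known value.
            return f'"{bindings[term]}"' if term in bindings else term

        body = [tuple(fill(t) for t in pattern) for pattern in rule["where"]]
        free = sorted({t for pattern in body for t in pattern if t.startswith("?")})
        lines = "\n".join(f"  {s} {p} {o} ." for s, p, o in body)
        return f"SELECT {' '.join(free)}\nWHERE {{\n{lines}\n}}"
    return None


WEDDING_RULE = {
    "construct": [("?p1", "wife", "?p2"), ("?p2", "husband", "?p1")],
    "where": [
        ("?p1", "bride", "?w1"),
        ("?p2", "groom", "?w1"),
        ("?w1", "rdf:type", "Wedding"),
    ],
}
query = requery(("Jane Doe", "wife", "John Doe"), WEDDING_RULE)
print(query)
```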
• Developed on the Hadoop/MapReduce framework
• Distributed services used in AIRS
     – Algorithms are written within the MapReduce and HDFS (file system)
       environment; single-threaded algorithms run as single-“slot” algorithms
     – Oozie is the workflow coordination service; all jobs are monitored,
       dispatched, and logged
     – HBase and HDFS are used as distributed data stores for document
       metadata and RDF graphs
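
The MapReduce model named above can be illustrated with a minimal single-process sketch. This is not AIRS code (AIRS is written in Java on Hadoop). The function names and the sample triples are hypothetical, and the sketch only mimics the map, shuffle, and reduce phases in memory by counting how often each entity appears as the subject of an RDF-like triple.

```python
from collections import defaultdict

# Hypothetical sample input: (subject, predicate, object) triples.
TRIPLES = [
    ("person:1", "rdf:type", "agent:Person"),
    ("person:1", "ero:designated_by", "name:1"),
    ("person:2", "rdf:type", "agent:Person"),
    ("act:1", "ro:has_participant", "person:1"),
]


def map_phase(triple):
    """Emit (key, 1) for the subject of one triple."""
    subject, _predicate, _obj = triple
    yield subject, 1


def shuffle(pairs):
    """Group mapper output by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence inside one process."""
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(TRIPLES))
```

On a real cluster the framework distributes the map and reduce calls across task slots and performs the shuffle itself; only the two user-supplied functions resemble what is shown here.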

                                AIRS Software

                     HBase Database         Oozie Workflow Coordination Service

MySQL Database                  Map Reduce Processing Framework

                              Hadoop Distributed File System (HDFS)

                          Server / Cluster Hardware

SELECT DISTINCT ?personNameText
WHERE
{
    ?act rdf:type event:Act .
    ?act ro:has_participant ?person .
    ?person rdf:type agent:Person .
    ?person ero:designated_by ?personName .
    ?personName ero:bearer_of ?personNameBearer .
    ?personNameBearer info:has_text_value ?personNameText .
}



 1. Initial Query: ?act rdf:type event:Act
 2. Merging Query: ?act ro:has_participant ?person
 3. Merging Query: ?person rdf:type agent:Person
 4. Merging Query: ?person ero:designated_by ?personName
 5. Merging Query: ?personName ero:bearer_of ?personNameBearer
 6. Merging Query: ?personNameBearer info:has_text_value ?personNameText
 7. Distinct Query Step: filter on distinct ?personNameText values
 8. Save a result iterator and return results to the user
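
The decomposition above (one initial triple pattern, a chain of merging queries, a distinct step, and a result iterator) can be sketched as a sequence of joins over partial variable bindings. This is a minimal sketch under stated assumptions, not the AIRS implementation: the in-memory triple list, the sample identifiers, and all function names are hypothetical, and the join is a naive nested loop rather than a distributed one.

```python
# Hypothetical in-memory triples standing in for an RDF graph.
TRIPLES = [
    ("act:1", "rdf:type", "event:Act"),
    ("act:1", "ro:has_participant", "person:1"),
    ("act:1", "ro:has_participant", "person:2"),
    ("person:1", "rdf:type", "agent:Person"),
    ("person:2", "rdf:type", "agent:Person"),
    ("person:1", "ero:designated_by", "name:1"),
    ("person:2", "ero:designated_by", "name:2"),
    ("name:1", "ero:bearer_of", "bearer:1"),
    ("name:2", "ero:bearer_of", "bearer:2"),
    ("bearer:1", "info:has_text_value", "J. Smith"),
    ("bearer:2", "info:has_text_value", "J. Smith"),
]

# The six triple patterns of the query, in execution order.
PATTERNS = [
    ("?act", "rdf:type", "event:Act"),
    ("?act", "ro:has_participant", "?person"),
    ("?person", "rdf:type", "agent:Person"),
    ("?person", "ero:designated_by", "?personName"),
    ("?personName", "ero:bearer_of", "?personNameBearer"),
    ("?personNameBearer", "info:has_text_value", "?personNameText"),
]


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable must keep any value it was bound to earlier.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def merge_step(bindings, pattern, triples):
    """Join the current partial solutions with one more triple pattern."""
    for binding in bindings:
        for triple in triples:
            extended = match(pattern, triple, binding)
            if extended is not None:
                yield extended


def distinct(bindings, variable):
    """Yield each value of `variable` once, in first-seen order."""
    seen = set()
    for binding in bindings:
        value = binding[variable]
        if value not in seen:
            seen.add(value)
            yield value


def run_query(patterns, triples, variable):
    """Chain the merge steps lazily and return a result iterator."""
    bindings = iter([{}])
    for pattern in patterns:
        bindings = merge_step(bindings, pattern, triples)
    return distinct(bindings, variable)


if __name__ == "__main__":
    for name in run_query(PATTERNS, TRIPLES, "?personNameText"):
        print(name)
```

Because every step is a generator, nothing is evaluated until the caller consumes the returned iterator, which loosely mirrors the "save a result iterator" step. The two sample persons share one name text, so the distinct step collapses them into a single result.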
“Raw” Algorithms

• Accept Model
• Data Association
• Data Association Only
• Ingestion
• Ingestion Only
• Sparql
• Structured
• Query
• Query Ingestion
• Query Inprocess
• Query Structured
• Translation Data Association
• Translation Ingestion

“Secondary” Algorithms

• Airs Query
• Cluster Results
• Extract All Organizations
• Extract All Persons
• Filter By Date
• Find Events
• Topic Filters (32 variants)
     – Leadership
     – Corruption
     – Dirty bombs
     – Drugs, etc.
Probe Task - Wrapper Algorithms

[Bar chart: Total Wrapper Lines of Code (axis 0 to 1,400) for each Probe Task (1 to 20)]

[Pie chart: Wrapper Code 49%, Implementation Code 23%, Validation Code 29%]

• Total Lines: 13,958*
     – Wrapper Code: 6,778
     – Implementation Code: 3,186*
     – Validation Code: 3,994

* Less code developed before Test & Evaluation
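
The pie-chart percentages follow from the three line counts on this slide. The short check below recomputes them; the dictionary and function names are illustrative only. Because each share is rounded independently, the three rounded percentages add up to 101 rather than 100.

```python
# Line counts as given on the slide.
LINE_COUNTS = {
    "Wrapper Code": 6778,
    "Implementation Code": 3186,
    "Validation Code": 3994,
}


def shares(counts):
    """Return the total and each category's share rounded to a whole percent."""
    total = sum(counts.values())
    percentages = {name: round(100 * n / total) for name, n in counts.items()}
    return total, percentages


if __name__ == "__main__":
    total, percentages = shares(LINE_COUNTS)
    print(total, percentages)
```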
Task: Find Life Events of an Individual

[Timeline: Day 0 to 5]

• Tune Life Event Extraction (NLP & SDA)
• Develop Algorithm (glue code) to Align Events

New Analytic Capabilities in Days
• Over 1,200 workflows were issued by analysts over a 3-day period
Cluster Monitoring (Ganglia)



         •   System Load
         •   CPU Usage
         •   Memory Usage
         •   Network Bandwidth




• Fast translation technologies for structured and unstructured data

• Many analytics successes - more to come in Phase 3

• All open source software, written entirely in Java
   • Full Government Purpose Rights

• Installation manual and user manual ready to go




Justin Del Vecchio
delvecchio@cubrc.org
716-204-5139






Alignment and Analytics of Large Scale, Disparate Data from IARPA's Knowledge Discovery and Dissemination (KDD) Program

  • 1. Advantage Through Technology Actionable Intelligence Retrieval System (AIRS) Overview 27 November 2012 CUBRC KDD AIRS System 1
  • 2. Alignment of Data Models Apr – Jul OccursOn Crop 2008 Type Failure - Single representation for all data sources Event - Easily plug-in new data sources Report OccursAt Transcript Western RecordedBy Afghanistan Newsletter Remove Perspective Report Transcript Newsletter Observer ID:556AS4 Date: 15 May 08 Date: 10 Apr 08 Date: 26 Apr 08 Event: Situation Description: Data Model Event: Crop in certain areas Crop outlook Perspective Failure Extent dire as lack of for early Detection rain … summer … Confounds Data Integration Event Observer CUBRC KDD AIRS System 2
  • 3. Event Advanced Analytics Algorithms Quantitative “Easy” Analyst Questions - Identify All Event Information Timeline “Harder” Analyst Questions -Identify Similar Events “Hardest” Analyst Question - Identify Predictor Events Qualitative CUBRC KDD AIRS System 3
  • 4. Probe Tasks • Fully automated tasks • Test system plumbing • Ex: Find all associates of Jim Johnson and list the person’s affiliation to Jim. Use only data sets A, E, M. • 20 questions like these Analyst Tasks • Manual task executed by actual analysts • Test usability and applicability of developed algorithms to realistic tasks • Ex: Find all information that may have predicted an attack was imminent in Khost, Afghanistan on 3 June, 2008. • 10 questions like these CUBRC KDD AIRS System 4
  • 5. Many Sources Many Records Many Types 1K 100K 1M DS 1 Reports Articles DS 2 DS 3 Blogs Transcripts DS 4 Structured DS 5 DS 6 DOMEX DS 7 Semi-Structured DS 8 Social Media CUBRC KDD AIRS System 5
  • 6. Three Essential Components Architecture Research Integrated Tasks Prototype CUBRC KDD AIRS System 6
  • 7. 9 High Level Research Areas 30 Research Tasks in Phase 2 •Task 1.1.3 (CUBRC) April - PreProto •Task 1.1.4 (CUBRC) Aug - Lab •Task 1.1.5 (CUBRC) Aug - Lab ALIGNMENT •Task 1.2.2 (CUBRC) April - Lab •Task 2.1.2 (ISS) April - Lab 1. Ontology Development •Task 2.1.3.a (ISS) April - Lab •Task 2.1.3.c (ISS) August - Lab 2. Structured Data Alignment •Task 3.1.2.a (GDIT) April Lab | Aug PreProto •Task 3.1.3 (GDIT) 3. Unstructured Data Alignment •Task 3.1.4 (GDIT) April PreProto April Lab | Aug Preproto 4. Alignment Reasoner •Task 3.2.1.a (GDIT) •Task 3.2.1.b (GDIT) April Lab | Aug PreProto April Lab | Aug PreProto 5. Alignment Optimization •Task 3.2.1.c (GDIT) April Lab | Aug PreProto •Task 3.2.1.d (GDIT) April Lab | Aug Preproto •Task 3.2.3 (GDIT) April Proproto •Task 4.2.1 (Securboration) Aug Lab ADVANCED ANALYTICS •Task 5.1.1 (CUBRC) Aug Lab •Task 6.1.2 (CUBRC/UB) April Lab | Aug Preproto 6. Workflow Optimization •Task 6.1.4 (CUBRC) April Lab | Aug Preproto •Task 6.1.5 (CUBRC) 7. Application of Analyst Context •Task 6.3.1 (CUBRC) April Lab 8. Data Association for Entity Resolution •Task 7.3.1 (Securboration) •Task 7.4.1 (UB) April Lab Aug PreProto 9. Distributed Graph Matching •Task 8.1.1 (UB) Aug PreProto •Task 8.3.1 (UB) April PreProto •Task 8.3.2 (UB) April Lab | Aug PreProto •Task 9.1.1 (UB) Aug Preprotp •Task 9.1.2 (CUBRC) Aug Lab •Task 9.2.1 (UB) Aug Lab •Task 9.3.1 (UB) Aug Lab CUBRC KDD AIRS System 7
  • 8. Visualization Answers Analytics Data Flow Invoked Invoke Algorithm Algorithm Query: Query Expansion Single Threaded KDD RDF Ranked Requery: Queries Abduction Analyze Query Use Query: Evaluate Aligned Execution Models Results Data Services Search Graph Graph Association: Query: Creation: Creation: Entities & Events Sparql Structured Unstructured Parallelized Parallelized Parallelized Read & Write Raw Data Write Data Global Sources Model CUBRC KDD AIRS System 8
  • 9. Backbone of Project Basic Formal Ontology – Relation Ontology Artifact Time Ontology Ontology Extended Information Agent Event Geospatial Quality Relation Technology Ontology Ontology Ontology Ontology Ontology Ontology AIRS Mid- Level Ontology Defines Input & Output Format Most Counterterrorism Processes Ontology CUBRC KDD AIRS System 9
  • 10. Information Entity Ontology Sample Document • 76 local classes • 21 equivalence class axioms • 1 superclass axioms • 28 local object properties • 7 datatype properties Agent Ontology • 787 local classes • 231 equivalence class axioms (mostly persons with roles, e.g. Physician, Lawyer) • 70 local object properties (mostly familial relationships) • SPARQL Inferencing Notation (SPIN) rules that infer familial relationships from the primitive relationships of the child_of #Note #Paragraph #SectionOfText and parent_of and the qualities of male and female gender. #Person #Place CUBRC KDD AIRS System 10
  • 11. Analytics Query ‘Soup-to-Nuts’ Graph “Documents where Smyth is a Person && has Associates && Ontology footnote contains ‘XY’ && from data set 4 or 5” 4 5 SPARQL Query Raw CUBRC KDD AIRS System 11
  • 12. Visualization Answers Analytics Data Flow Invoked Invoke Algorithm Algorithm Query: Query Expansion Single Threaded KDD RDF Ranked Requery: Queries Abduction Analyze Query Use Query: Evaluate Aligned Execution Models Results Data Services Search Graph Graph Association: Query: Creation: Creation: Entities & Events Sparql Structured Unstructured Parallelized Parallelized Parallelized Read & Write Raw Data Write Data Global Sources Model CUBRC KDD AIRS System 12
  • 13. Architecture Implementation Column Alignment Request Data Value Learner Characterization Learner Learner Learner Context Column Categorical Based Alignment Mega- Data Value Alignment Data Cube Learner Characterization Mega- Learner Lucene Base Alignment * Spring Framework Column Alignment Prediction Data Value Characterization • Used metadata, data values, regular expressions, and neural networks to classify columns • Combined with a collection of heuristics • Date Time • Person’s Name, Alias, and Birth Date • Recognizing unstructured data within structured 13 CUBRC KDD AIRS System
  • 14. D2RQ Mapping File • Enable dynamic RDF generation CUBRC KDD AIRS System 14
  • 15. Method 1. Document Type Identification: • Determine document type with pattern-based configurations 2. Passage & Metadata Retrieval: • With Document Type, Identify & extract data using: a. Template / Grammar Process b. Generic Heuristic Process 3. Document Genre Association: • Link associated document genres Document Type Passage & Metadata Document Genre Identification Retrieval Association Identification Template Passage & Configuration Grammars Metadata Document Type Annotations Passages, Document Metadata, Document Type (a) Template / Document Genre Identification Genre links Grammar Process Association Process Process (b) Generic Heuristic Process CUBRC KDD AIRS System 15
  • 16. Methods • Extraction of Entity types (People, Place, Location, Facility, etc.) • Extraction of Events and Relationships - Uses an external file of patterns to extract attributes, relationships, and events. • Speed is 100 - 250K per second for information extraction Purchaser Pattern Language Seller Quickly Define 16 CUBRC KDD AIRS System
  • 17. Developed Tools Create Corpora Tool 1. Pulls down documents from data sources (uses samples) 2. Performs document analysis 3. Generates Core Types ~20 minutes for full markup of 1200 documents CUBRC KDD AIRS System 17
  • 18. Developed Tools Corner Case Coverage Text to RDF tool CUBRC KDD AIRS System 18
  • 19. Visualization Answers Analytics Data Flow Invoked Invoke Algorithm Algorithm Query: Query Expansion Single Threaded KDD RDF Ranked Requery: Queries Abduction Analyze Query Use Query: Evaluate Aligned Execution Models Results Data Services Search Graph Graph Association: Query: Creation: Creation: Entities & Events Sparql Structured Unstructured Parallelized Parallelized Parallelized Read & Write Raw Data Write Data Global Sources Model CUBRC KDD AIRS System 19
  • 20. Many Data Keyword- Fast Core Dynamic Graph Sources based Analytics Generation Query Structured Data Processing Keyword Natural Language Index Processing Custom Analytics Data Service Consistent 5 Minute Realist Scalable Running Time Goal Ontology (Hadoop) CUBRC KDD AIRS System 20
  • 21. Purpose To create a component that selects the workflow definition that satisfies a set of QoS requirements, maximizing the expected outcome of the workflow. Method Solve Composite Service Problem • The problem is decomposed into a sequence of functionalities. • Functionalities (service classes) can be executed by many candidate services. • Candidates have associated benefits/costs (QoS Parameters). • Candidates are substitute and complementary within a service class. • Given QoS requirements, e.g., algorithm runtime ≤ 5 minutes CUBRC KDD AIRS System 21
  • 22. • Implemented in prototype system as runtime QoS Structured Processing Write SPARQL Write to Search Model Query VIZ Unstructured Processing 5 Minutes • Developers must adhere to QoS parameters • Phenomenal feedback loop developed with analysts; analysts understood and diagnosed system • Choose two additional QoS metrics for Phase 3 (memory) CUBRC KDD AIRS System 22
  • 23. Method Representation Similarity Euclidean Dynamic Weighting (.80) Location String Static Weighting Spatial/Hierarchical Logistic Regression (.75) Event Time Neural Network (.77) TFIDF (0.80) SVM (0.75) Description Semantic (0.64) (Max F) Major Research Tasks: • Identified succinct easily extractable event representation • Tested Location and Description similarity measures • Tested Event Similarity Algorithms • Tested performance on natural language and structured data sources CUBRC KDD AIRS System 23
  • 24. GTD: 200804060007 WITS: 200804509 04/06/2008: On Sunday, unknown gunmen set On 6 April 2008, in the morning, in Jurn, Ninawa, up a fake checkpoint and intercepted two Iraq, armed assailants stopped two school buses college buses, one carrying male students and carrying students to Mosul University at a fake one carrying female students, in Mosul, checkpoint. The assailants then fired upon one of the Nineveh province, Iraq. The bus carrying the busses as it managed to escape, wounding three female students managed to escape but the students and damaging the bus. Assailants kidnapped gunmen held the 42 male college students… all 42 students on board the second bus… Jurn ≈ Mosul Gaza ≠ Sderot Mosul 25 km Jurn Close Distance ≠ Similarity 24 CUBRC KDD AIRS System 24
  • 25. Processing Pipelines for Speed vs. Quality Decision <RDF INPUT DIRECTORY> FastestEntityResolutionSolverLocal.java Text Files LREntityResolutionSolverLocal.java <NEW-RDF OUTPUT DIRECTORY> Ont Model 1 Text File Ont Model 2 EntityResolutionSubproblemConstruction.java New Ont Model Ont Model 3 Ont Model 4 Subproblems FastestEntityResolutionSolverMR.java Subproblem (1,2) LREntityResolutionSolverMR.java Associate: … Person Subproblem (3,4) Location <SUBPROBLEM DIRECTORY> Implements JavaJobRunner Organization Implements JavaJobRunner, but runs MR Jobs Date Implements MapReduceJobRunner Artifact CUBRC KDD AIRS System 25
  • 26. Method P1 Lagrangian relaxation of an integer programming formulation of the clustering problem. This 55 65 algorithm iteratively adjusts scores to resolve inconsistencies, and also provides a performance P2 P3 guarantee (optimality gap) on the solutions. -85 310 45 290 40 Run Time per Iteration (minutes) 35 270 30 Objective Value 250 25 230 20 210 15 190 10 170 5 150 0 1 6 11 16 21 26 31 36 41 46 0 4 8 12 16 Iteration Number # Processors CUBRC KDD AIRS System 26
  • 27. Results Cluster AIRS Search Arrest Similar Content Trial Cluster Similar Group 300 Distinct Content Information Results CUBRC KDD AIRS System 27
  • 28. Analyst Context and Current State – Analyst may come to the system with some information • “There was a Terrorist Act at time X” • “I am interested in this suspected Insurgent” • “I want to know about a relationship between groups A and B” – Initial queries may produce statements aligned with CTO • Abductive Requery is applied – Select weighted fragments whose bound variables match CTO elements used in Context/State – Select rules those fragments correspond to, weighting by selected fragments – Combine rule statements with known Context/State – Produce subsequent query with known values ‘filled in’ SELECT ?w1 { Context: CONSTRUCT { } “Jane Doe” wife “John Doe” ?p1 wife ?p2 . WHERE { ?p2 husband ?p1 . “Jane Doe” bride ?w1 . } “John Doe” groom ?w1 . WHERE { ?w1 rdf:type Wedding . ?p1 bride ?w1 . } Fragment 1 { ?p2 groom ?w1 . ?p1 wife ?p2 . } ?w1 rdf:type Wedding . } CUBRC KDD AIRS System 28
  • 29. Visualization Answers Analytics Data Flow Invoked Invoke Algorithm Algorithm Query: Query Expansion Single Threaded KDD RDF Ranked Requery: Queries Abduction Analyze Query Use Query: Evaluate Aligned Execution Models Results Data Services Search Graph Graph Association: Query: Creation: Creation: Entities & Events Sparql Structured Unstructured Parallelized Parallelized Parallelized Read & Write Raw Data Write Data Global Sources Model CUBRC KDD AIRS System 29
  • 30. • Developed on the Hadoop/ MapReduce framework • Distributed services used in AIRS – Algorithms are written within the MapReduce and HDFS (file-system) environment – single threaded algorithms are a single “slot” algorithm – Oozie is the workflow coordination service; all jobs are monitored, dispatched, and logged – HBase and HDFS are used as distributed data stores for document metadata, and RDF graphs AIRS Software HBase Database Oozie Workflow Coordination Service MySQL Database Map Reduce Processing Framework Hadoop Distributed File System (HDFS) Server / Cluster Hardware CUBRC KDD AIRS System 30
  • 31. SELECT DISTINCT ?personNameText WHERE { ?act rdf:type event:Act . ?act ro:has_participant ?person . ?person rdf:type agent:Person . ?person ero:designated_by ?personName . ?personName ero:bearer_of ?personNameBearer . ?personNameBearer info:has_text_value ?PersonNameText . } Initial Query Merging Query Merging Query Merging Query • ?act rdf:type event:Act • ?act ro:has_participant • ?person rdf:type • ?person ?person agent:Person ero:designated_by ?personName Merging Query Merging Query Distinct Query Save a result • ?personName • ?personNameBearer Step iterator and return ero:bearer_of info:has_text_value • Filter on distinct ?personNameBearer ?PersonNameText results to the user ?PersonNameText’s CUBRC KDD AIRS System 31
  • 32. [Diagram: “Raw” Algorithms and “Secondary” Algorithms. Labels include Query, Ingestion, Data Association, Translation, Cluster Results, Extract All Organizations, Extract All Persons, Filter By Date, Find Events] • Topic Filters (32 variants) • Leadership • Corruption • Dirty bombs • Drugs, etc.
  • 33. Probe Task - Wrapper Algorithms [Bar chart: Total Wrapper Lines of Code (0 to 1400) by Probe Task (1 to 20)] • Total Lines: 13,958* – Wrapper Code: 6,778 (49%) – Implementation Code: 3,186* (23%) – Validation Code: 3,994 (29%) * Less code developed before Test & Evaluation
  • 34. Task: Find Life Events of an Individual [Timeline: Day 0 to Day 5] • Tune Life Event Extraction (NLP & SDA) • Develop Algorithm (glue code) to Align Events • New Analytic Capabilities in Days
  • 35. • Over 1200 workflows were issued by analysts over a 3 day period
  • 36. Cluster Monitoring (Ganglia) • System Load • CPU Usage • Memory Usage • Network Bandwidth
  • 37. • Fast translation technologies for structured and unstructured data • Many analytics successes - more to come in Phase 3 • All open source software, written entirely in Java • Full Government Purpose Rights • Installation manual and user manual ready to go

Editor's Notes

  1. Test notes
  2. Speak about Big Data in terms
  3. - Left side shows KDD developed resources. - Right side shows CUBRC KDD architecture. - Data flows through system as follows: - Visualization: Analyst issues query through visualization tool. - Query Expansion: Expands query using semantic index developed in alignment. Also, queries ranked based on relevance to original question. - Query Execution: Query is executed against data services API. Aligned data models negotiate what sources query executed against. - Graph Creation: Query results passed to unstructured and structured graph creation modules. Highly parallelized process where Hadoop framework is leveraged to create and assemble RDF graphs quickly. - Global Model: Graphs written as part of a global model. Model connects all graphs for all analyst queries yet leaves them easily segregable by query, analyst, date of creation, data sources used, etc. - Association: Takes RDF graph and associates all people, places, events, etc. that are the same. Again, highly parallelized process where Hadoop framework is leveraged to reduce overall comparison time between entities to determine similarity. - Task Answering: Dirty graph matching techniques use templates to find specific answers for analyst questions. Employs fuzzy computations to determine answers when underlying graph structure is missing information. - Re-query: Finds likely queries analyst would want to issue next. For example, if graph indicated that someone worked as a software engineer but not the company the person worked for, abductive re-query would suggest a query to search for the name of the company. - Visualization: Analyst presented with answers in visualization tool. Also, presented with re-query options.
  4. 1. ontology (backbone of this project) -- Why is an ontology important; It speaks the language. -- Here are our ontologies -- Here is data that we have developed. -- Maybe some statistics on the explosion of data -- How overlaying a model to truly network information together is the best approach -- Show the exotic queries from Phase 1; very very powerful -- Query can go from the raw data to the extracted types
  6. - This architecture will allow us to integrate more machine learning algorithms and create a hybrid system for producing predictions for alignment. - Supports weighting of alignment learners. - A learner can be a Mega-Learner, therefore it supports multiple levels of prediction. - All learners can utilize the data contained within the Learner Context. - Each learner will post its alignment result and score to the Alignment Data Cube for other learners to access if needed. - Alignment Data Cube is similar in architecture to a Data Cube used within Data Mining. - All scores are normalized between 0 and 1. - Data Value Characterization: regex to determine overall categorization of the data in the column. - Lucene Based Alignment: TF/IDF based learner; utilizes WordNet to expand the search terms.
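The learner architecture in the note above (learners posting normalized scores to a shared cube, with a weighted Mega-Learner on top) might look roughly like this. It is a sketch under assumptions: the class and function names, the column and ontology term names, and the weighted-average combination rule are all hypothetical and are not taken from the AIRS code.

```python
from collections import defaultdict


class AlignmentDataCube:
    """Shared store of alignment scores, keyed by (column, candidate term)."""

    def __init__(self):
        self._scores = defaultdict(dict)  # (column, term) -> {learner: score}

    def post(self, learner, column, term, score):
        # The notes state that all scores are normalized between 0 and 1.
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be normalized to [0, 1]")
        self._scores[(column, term)][learner] = score

    def scores_for(self, column):
        return {
            term: dict(by_learner)
            for (col, term), by_learner in self._scores.items()
            if col == column
        }


def mega_learner(cube, column, weights):
    """Combine posted scores with a weighted average (an assumed rule).

    A learner that posted nothing for a candidate term contributes 0.
    Returns the best candidate term and its combined score.
    """
    total = sum(weights.values())
    combined = {}
    for term, by_learner in cube.scores_for(column).items():
        combined[term] = (
            sum(w * by_learner.get(name, 0.0) for name, w in weights.items()) / total
        )
    best = max(combined, key=combined.get)
    return best, combined[best]


cube = AlignmentDataCube()
cube.post("regex", "dob", "BirthDate", 0.5)    # data value characterization
cube.post("lucene", "dob", "BirthDate", 0.75)  # TF/IDF based learner
cube.post("lucene", "dob", "DeathDate", 0.25)

print(mega_learner(cube, "dob", {"regex": 1.0, "lucene": 3.0}))
```

Since the combined score is itself in the range 0 to 1, a Mega-Learner could post its result back to the cube, which is one way to read the "multiple levels of prediction" remark.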